All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] Implement close-on-fork
@ 2020-05-15 15:23 ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao


Series of 4 patches to implement close-on-fork. Tests have been
published to https://github.com/nkarstens/ltp/tree/close-on-fork
and cover close-on-fork functionality in the following syscalls:

 * accept(4)
 * dup3(2)
 * fcntl(2)
 * open(2)
 * socket(2)
 * socketpair(2)
 * unshare(2)

Addresses underlying issue in that there is no way to prevent
a fork() from duplicating a file descriptor. The existing
close-on-exec flag partially-addresses this by allowing the
parent process to mark a file descriptor as exclusive to itself,
but there is still a period of time the failure can occur
because the auto-close only occurs during the exec().

One manifestation of this is a race conditions in system(), which
(depending on the implementation) is non-atomic in that it first
calls a fork() and then an exec().

This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).

---

This is v2 of the change. See https://lkml.org/lkml/2020/4/20/113
for the original work.

Thanks to everyone who provided comments on the first series of
patches. Here are replies to specific comments:

> I suggest we group the two bits of a file (close_on_exec, close_on_fork)
> together, so that we do not have to dirty two separate cache lines.

I could be mistaken, but I don't think this would improve efficiency.
The close-on-fork and close-on-exec flags are read at different
times. If you assume separate syscalls for fork and exec then
there are several switches between when the two flags are read.
In addition, the close-on-fork flags in the new process must be
cleared, which will be much harder if the flags are interleaved.

> Also the F_GETFD/F_SETFD implementation must use a single function call,
> to not acquire the spinlock twice.

Good point, done.

> How about only allocating the 'close on fork' bitmap the first time
> a process sets a bit in it?

I looked into it and there are side effects I dont't think we want.
For example, if fcntl is used to set the close-on-fork flag, then
there is a chance that it cannot allocate memory, and so we'd have
to return ENOMEM. Seems cleaner to allocate memory up front so that
we know the file has all of the memory it needs.

> You should be able to use the same 'close the fds in this bitmap'
> function for both cases.

I looked into this and I think it is more efficient to prevent the
new process from having a reference to the open file than it is to
temporarily give the new process a reference and then close it later.

> I'm not sure dup_fd() is the best place to check the close-on-fork flag.
> For example, the ksys_unshare() > unshare_fd() > dup_fd() execution path
> seems suspect.

I have a better understanding of clone(2)/unshare(2) now and believe
that dup_fd() is the appropriate place to handle this. clone(2) with
CLONE_FILES set intentionally shares the file descriptor table, so
close-on-fork should not impact that. However, if unshare(2) is later
used to unshare the file descriptor table then the process calling
unshare(2) should automatically close its copy of any file descriptor
with close-on-fork set.

> If the close-on-fork flag is set, then __clear_open_fd() should be
> called instead of just __clear_bit(). This will ensure that
> fdt->full_fds_bits() is updated.

Done. It falls through to the case where the file had not finished
opening yet and leverages its call to __clear_open_fd().

> Need to investigate if the close-on-fork (or close-on-exec) flags
> need to be cleared when the file is closed as part of the
> close-on-fork execution path.

Done. The new file descriptor table starts with all close-on-fork
flags being cleared and dup_fd() gets the close-on-fork flag from
the old file descriptor table.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2] Implement close-on-fork
@ 2020-05-15 15:23 ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao


Series of 4 patches to implement close-on-fork. Tests have been
published to https://github.com/nkarstens/ltp/tree/close-on-fork
and cover close-on-fork functionality in the following syscalls:

 * accept(4)
 * dup3(2)
 * fcntl(2)
 * open(2)
 * socket(2)
 * socketpair(2)
 * unshare(2)

Addresses underlying issue in that there is no way to prevent
a fork() from duplicating a file descriptor. The existing
close-on-exec flag partially-addresses this by allowing the
parent process to mark a file descriptor as exclusive to itself,
but there is still a period of time the failure can occur
because the auto-close only occurs during the exec().

One manifestation of this is a race conditions in system(), which
(depending on the implementation) is non-atomic in that it first
calls a fork() and then an exec().

This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).

---

This is v2 of the change. See https://lkml.org/lkml/2020/4/20/113
for the original work.

Thanks to everyone who provided comments on the first series of
patches. Here are replies to specific comments:

> I suggest we group the two bits of a file (close_on_exec, close_on_fork)
> together, so that we do not have to dirty two separate cache lines.

I could be mistaken, but I don't think this would improve efficiency.
The close-on-fork and close-on-exec flags are read at different
times. If you assume separate syscalls for fork and exec then
there are several switches between when the two flags are read.
In addition, the close-on-fork flags in the new process must be
cleared, which will be much harder if the flags are interleaved.

> Also the F_GETFD/F_SETFD implementation must use a single function call,
> to not acquire the spinlock twice.

Good point, done.

> How about only allocating the 'close on fork' bitmap the first time
> a process sets a bit in it?

I looked into it and there are side effects I dont't think we want.
For example, if fcntl is used to set the close-on-fork flag, then
there is a chance that it cannot allocate memory, and so we'd have
to return ENOMEM. Seems cleaner to allocate memory up front so that
we know the file has all of the memory it needs.

> You should be able to use the same 'close the fds in this bitmap'
> function for both cases.

I looked into this and I think it is more efficient to prevent the
new process from having a reference to the open file than it is to
temporarily give the new process a reference and then close it later.

> I'm not sure dup_fd() is the best place to check the close-on-fork flag.
> For example, the ksys_unshare() > unshare_fd() > dup_fd() execution path
> seems suspect.

I have a better understanding of clone(2)/unshare(2) now and believe
that dup_fd() is the appropriate place to handle this. clone(2) with
CLONE_FILES set intentionally shares the file descriptor table, so
close-on-fork should not impact that. However, if unshare(2) is later
used to unshare the file descriptor table then the process calling
unshare(2) should automatically close its copy of any file descriptor
with close-on-fork set.

> If the close-on-fork flag is set, then __clear_open_fd() should be
> called instead of just __clear_bit(). This will ensure that
> fdt->full_fds_bits() is updated.

Done. It falls through to the case where the file had not finished
opening yet and leverages its call to __clear_open_fd().

> Need to investigate if the close-on-fork (or close-on-exec) flags
> need to be cleared when the file is closed as part of the
> close-on-fork execution path.

Done. The new file descriptor table starts with all close-on-fork
flags being cleared and dup_fd() gets the close-on-fork flag from
the old file descriptor table.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2] Implement close-on-fork
@ 2020-05-15 15:23 ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao


Series of 4 patches to implement close-on-fork. Tests have been
published to https://github.com/nkarstens/ltp/tree/close-on-fork
and cover close-on-fork functionality in the following syscalls:

 * accept(4)
 * dup3(2)
 * fcntl(2)
 * open(2)
 * socket(2)
 * socketpair(2)
 * unshare(2)

Addresses underlying issue in that there is no way to prevent
a fork() from duplicating a file descriptor. The existing
close-on-exec flag partially-addresses this by allowing the
parent process to mark a file descriptor as exclusive to itself,
but there is still a period of time the failure can occur
because the auto-close only occurs during the exec().

One manifestation of this is a race conditions in system(), which
(depending on the implementation) is non-atomic in that it first
calls a fork() and then an exec().

This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).

---

This is v2 of the change. See https://lkml.org/lkml/2020/4/20/113
for the original work.

Thanks to everyone who provided comments on the first series of
patches. Here are replies to specific comments:

> I suggest we group the two bits of a file (close_on_exec, close_on_fork)
> together, so that we do not have to dirty two separate cache lines.

I could be mistaken, but I don't think this would improve efficiency.
The close-on-fork and close-on-exec flags are read at different
times. If you assume separate syscalls for fork and exec then
there are several switches between when the two flags are read.
In addition, the close-on-fork flags in the new process must be
cleared, which will be much harder if the flags are interleaved.

> Also the F_GETFD/F_SETFD implementation must use a single function call,
> to not acquire the spinlock twice.

Good point, done.

> How about only allocating the 'close on fork' bitmap the first time
> a process sets a bit in it?

I looked into it and there are side effects I dont't think we want.
For example, if fcntl is used to set the close-on-fork flag, then
there is a chance that it cannot allocate memory, and so we'd have
to return ENOMEM. Seems cleaner to allocate memory up front so that
we know the file has all of the memory it needs.

> You should be able to use the same 'close the fds in this bitmap'
> function for both cases.

I looked into this and I think it is more efficient to prevent the
new process from having a reference to the open file than it is to
temporarily give the new process a reference and then close it later.

> I'm not sure dup_fd() is the best place to check the close-on-fork flag.
> For example, the ksys_unshare() > unshare_fd() > dup_fd() execution path
> seems suspect.

I have a better understanding of clone(2)/unshare(2) now and believe
that dup_fd() is the appropriate place to handle this. clone(2) with
CLONE_FILES set intentionally shares the file descriptor table, so
close-on-fork should not impact that. However, if unshare(2) is later
used to unshare the file descriptor table then the process calling
unshare(2) should automatically close its copy of any file descriptor
with close-on-fork set.

> If the close-on-fork flag is set, then __clear_open_fd() should be
> called instead of just __clear_bit(). This will ensure that
> fdt->full_fds_bits() is updated.

Done. It falls through to the case where the file had not finished
opening yet and leverages its call to __clear_open_fd().

> Need to investigate if the close-on-fork (or close-on-exec) flags
> need to be cleared when the file is closed as part of the
> close-on-fork execution path.

Done. The new file descriptor table starts with all close-on-fork
flags being cleared and dup_fd() gets the close-on-fork flag from
the old file descriptor table.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 1/4] fs: Implement close-on-fork
  2020-05-15 15:23 ` Nate Karstens
  (?)
@ 2020-05-15 15:23   ` Nate Karstens
  -1 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

The close-on-fork flag causes the file descriptor to be closed
atomically in the child process before the child process returns
from fork(). Implement this feature and provide a method to
get/set the close-on-fork flag using fcntl(2).

This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).

If clone(2) is used to create a child process and the CLONE_FILES
flag is set, then both processes will share the table of file
descriptors and the state of the close-on-fork flag for any
individual file descriptor. If unshare(2) is later used to stop
sharing the file descriptor table, then any file descriptor with
the close-on-fork flag set will be closed in the process that
calls unshare(2).

execve(2) also causes the file descriptor table to be unshared,
so any file descriptor with the close-on-fork flag set will be
closed in the process that calls execve(2).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                             |  4 +-
 fs/file.c                              | 64 ++++++++++++++++++++++++--
 include/linux/fdtable.h                |  7 +++
 include/linux/file.h                   |  2 +
 include/uapi/asm-generic/fcntl.h       |  5 +-
 tools/include/uapi/asm-generic/fcntl.h |  5 +-
 6 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..913b0cb70804 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -334,11 +334,11 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		err = f_dupfd(arg, filp, O_CLOEXEC);
 		break;
 	case F_GETFD:
-		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+		err = f_getfd(fd);
 		break;
 	case F_SETFD:
 		err = 0;
-		set_close_on_exec(fd, arg & FD_CLOEXEC);
+		f_setfd(fd, arg);
 		break;
 	case F_GETFL:
 		err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..81194349e980 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -47,7 +47,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  * spinlock held for write.
  */
 static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
-			    unsigned int count)
+			    unsigned int count, bool copy_cof)
 {
 	unsigned int cpy, set;
 
@@ -58,6 +58,13 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)nfdt->close_on_exec + cpy, 0, set);
 
+	if (copy_cof) {
+		memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+		memset((char *)nfdt->close_on_fork + cpy, 0, set);
+	} else {
+		memset((char *)nfdt->close_on_fork, 0, cpy + set);
+	}
+
 	cpy = BITBIT_SIZE(count);
 	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
 	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
@@ -79,7 +86,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)nfdt->fd + cpy, 0, set);
 
-	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds, true);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -118,7 +125,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = kvmalloc(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+				 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
 				 GFP_KERNEL_ACCOUNT);
 	if (!data)
 		goto out_arr;
@@ -126,6 +133,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 	data += nr / BITS_PER_BYTE;
+	fdt->close_on_fork = data;
+	data += nr / BITS_PER_BYTE;
 	fdt->full_fds_bits = data;
 
 	return fdt;
@@ -236,6 +245,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
 		__clear_bit(fd, fdt->close_on_exec);
 }
 
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	__set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	if (test_bit(fd, fdt->close_on_fork))
+		__clear_bit(fd, fdt->close_on_fork);
+}
+
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
@@ -290,6 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
+	new_fdt->close_on_fork = newf->close_on_fork_init;
 	new_fdt->open_fds = newf->open_fds_init;
 	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
@@ -330,13 +351,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		open_files = count_open_files(old_fdt);
 	}
 
-	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	copy_fd_bitmaps(new_fdt, old_fdt, open_files, false);
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
+
+		if (close_on_fork(open_files - i, old_fdt))
+			f = NULL;
+
 		if (f) {
 			get_file(f);
 		} else {
@@ -453,6 +478,7 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
+		.close_on_fork	= init_files.close_on_fork_init,
 		.open_fds	= init_files.open_fds_init,
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
@@ -840,6 +866,36 @@ void __f_unlock_pos(struct file *f)
  * file count (done either by fdget() or by fork()).
  */
 
+void f_setfd(unsigned int fd, int flags)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (flags & FD_CLOEXEC)
+		__set_close_on_exec(fd, fdt);
+	else
+		__clear_close_on_exec(fd, fdt);
+	if (flags & FD_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
+	spin_unlock(&files->file_lock);
+}
+
+int f_getfd(unsigned int fd)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	int flags;
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	flags = (close_on_exec(fd, fdt) ? FD_CLOEXEC : 0) |
+	        (close_on_fork(fd, fdt) ? FD_CLOFORK : 0);
+	rcu_read_unlock();
+	return flags;
+}
+
 void set_close_on_exec(unsigned int fd, int flag)
 {
 	struct files_struct *files = current->files;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
+	unsigned long *close_on_fork;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->close_on_exec);
 }
 
+static inline bool close_on_fork(unsigned int fd, const struct fdtable *fdt)
+{
+	return test_bit(fd, fdt->close_on_fork);
+}
+
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
 {
 	return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
+	unsigned long close_on_fork_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..0ee15ee24010 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -83,6 +83,8 @@ static inline void fdput_pos(struct fd f)
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
+extern int f_getfd(unsigned int fd);
+extern void f_setfd(unsigned int fd, int flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 1/4] fs: Implement close-on-fork
@ 2020-05-15 15:23   ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

The close-on-fork flag causes the file descriptor to be closed
atomically in the child process before the child process returns
from fork(). Implement this feature and provide a method to
get/set the close-on-fork flag using fcntl(2).

This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).

If clone(2) is used to create a child process and the CLONE_FILES
flag is set, then both processes will share the table of file
descriptors and the state of the close-on-fork flag for any
individual file descriptor. If unshare(2) is later used to stop
sharing the file descriptor table, then any file descriptor with
the close-on-fork flag set will be closed in the process that
calls unshare(2).

execve(2) also causes the file descriptor table to be unshared,
so any file descriptor with the close-on-fork flag set will be
closed in the process that calls execve(2).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                             |  4 +-
 fs/file.c                              | 64 ++++++++++++++++++++++++--
 include/linux/fdtable.h                |  7 +++
 include/linux/file.h                   |  2 +
 include/uapi/asm-generic/fcntl.h       |  5 +-
 tools/include/uapi/asm-generic/fcntl.h |  5 +-
 6 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..913b0cb70804 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -334,11 +334,11 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		err = f_dupfd(arg, filp, O_CLOEXEC);
 		break;
 	case F_GETFD:
-		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+		err = f_getfd(fd);
 		break;
 	case F_SETFD:
 		err = 0;
-		set_close_on_exec(fd, arg & FD_CLOEXEC);
+		f_setfd(fd, arg);
 		break;
 	case F_GETFL:
 		err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..81194349e980 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -47,7 +47,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  * spinlock held for write.
  */
 static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
-			    unsigned int count)
+			    unsigned int count, bool copy_cof)
 {
 	unsigned int cpy, set;
 
@@ -58,6 +58,13 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)nfdt->close_on_exec + cpy, 0, set);
 
+	if (copy_cof) {
+		memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+		memset((char *)nfdt->close_on_fork + cpy, 0, set);
+	} else {
+		memset((char *)nfdt->close_on_fork, 0, cpy + set);
+	}
+
 	cpy = BITBIT_SIZE(count);
 	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
 	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
@@ -79,7 +86,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)nfdt->fd + cpy, 0, set);
 
-	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds, true);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -118,7 +125,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = kvmalloc(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+				 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
 				 GFP_KERNEL_ACCOUNT);
 	if (!data)
 		goto out_arr;
@@ -126,6 +133,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 	data += nr / BITS_PER_BYTE;
+	fdt->close_on_fork = data;
+	data += nr / BITS_PER_BYTE;
 	fdt->full_fds_bits = data;
 
 	return fdt;
@@ -236,6 +245,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
 		__clear_bit(fd, fdt->close_on_exec);
 }
 
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	__set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	if (test_bit(fd, fdt->close_on_fork))
+		__clear_bit(fd, fdt->close_on_fork);
+}
+
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
@@ -290,6 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
+	new_fdt->close_on_fork = newf->close_on_fork_init;
 	new_fdt->open_fds = newf->open_fds_init;
 	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
@@ -330,13 +351,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		open_files = count_open_files(old_fdt);
 	}
 
-	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	copy_fd_bitmaps(new_fdt, old_fdt, open_files, false);
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
+
+		if (close_on_fork(open_files - i, old_fdt))
+			f = NULL;
+
 		if (f) {
 			get_file(f);
 		} else {
@@ -453,6 +478,7 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
+		.close_on_fork	= init_files.close_on_fork_init,
 		.open_fds	= init_files.open_fds_init,
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
@@ -840,6 +866,36 @@ void __f_unlock_pos(struct file *f)
  * file count (done either by fdget() or by fork()).
  */
 
+void f_setfd(unsigned int fd, int flags)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (flags & FD_CLOEXEC)
+		__set_close_on_exec(fd, fdt);
+	else
+		__clear_close_on_exec(fd, fdt);
+	if (flags & FD_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
+	spin_unlock(&files->file_lock);
+}
+
+int f_getfd(unsigned int fd)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	int flags;
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	flags = (close_on_exec(fd, fdt) ? FD_CLOEXEC : 0) |
+	        (close_on_fork(fd, fdt) ? FD_CLOFORK : 0);
+	rcu_read_unlock();
+	return flags;
+}
+
 void set_close_on_exec(unsigned int fd, int flag)
 {
 	struct files_struct *files = current->files;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
+	unsigned long *close_on_fork;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->close_on_exec);
 }
 
+static inline bool close_on_fork(unsigned int fd, const struct fdtable *fdt)
+{
+	return test_bit(fd, fdt->close_on_fork);
+}
+
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
 {
 	return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
+	unsigned long close_on_fork_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..0ee15ee24010 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -83,6 +83,8 @@ static inline void fdput_pos(struct fd f)
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
+extern int f_getfd(unsigned int fd);
+extern void f_setfd(unsigned int fd, int flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
-- 
2.26.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 1/4] fs: Implement close-on-fork
@ 2020-05-15 15:23   ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

The close-on-fork flag causes the file descriptor to be closed
atomically in the child process before the child process returns
from fork(). Implement this feature and provide a method to
get/set the close-on-fork flag using fcntl(2).

This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).

If clone(2) is used to create a child process and the CLONE_FILES
flag is set, then both processes will share the table of file
descriptors and the state of the close-on-fork flag for any
individual file descriptor. If unshare(2) is later used to stop
sharing the file descriptor table, then any file descriptor with
the close-on-fork flag set will be closed in the process that
calls unshare(2).

execve(2) also causes the file descriptor table to be unshared,
so any file descriptor with the close-on-fork flag set will be
closed in the process that calls execve(2).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                             |  4 +-
 fs/file.c                              | 64 ++++++++++++++++++++++++--
 include/linux/fdtable.h                |  7 +++
 include/linux/file.h                   |  2 +
 include/uapi/asm-generic/fcntl.h       |  5 +-
 tools/include/uapi/asm-generic/fcntl.h |  5 +-
 6 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..913b0cb70804 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -334,11 +334,11 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		err = f_dupfd(arg, filp, O_CLOEXEC);
 		break;
 	case F_GETFD:
-		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+		err = f_getfd(fd);
 		break;
 	case F_SETFD:
 		err = 0;
-		set_close_on_exec(fd, arg & FD_CLOEXEC);
+		f_setfd(fd, arg);
 		break;
 	case F_GETFL:
 		err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..81194349e980 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -47,7 +47,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  * spinlock held for write.
  */
 static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
-			    unsigned int count)
+			    unsigned int count, bool copy_cof)
 {
 	unsigned int cpy, set;
 
@@ -58,6 +58,13 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)nfdt->close_on_exec + cpy, 0, set);
 
+	if (copy_cof) {
+		memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+		memset((char *)nfdt->close_on_fork + cpy, 0, set);
+	} else {
+		memset((char *)nfdt->close_on_fork, 0, cpy + set);
+	}
+
 	cpy = BITBIT_SIZE(count);
 	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
 	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
@@ -79,7 +86,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)nfdt->fd + cpy, 0, set);
 
-	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds, true);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -118,7 +125,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = kvmalloc(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+				 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
 				 GFP_KERNEL_ACCOUNT);
 	if (!data)
 		goto out_arr;
@@ -126,6 +133,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 	data += nr / BITS_PER_BYTE;
+	fdt->close_on_fork = data;
+	data += nr / BITS_PER_BYTE;
 	fdt->full_fds_bits = data;
 
 	return fdt;
@@ -236,6 +245,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
 		__clear_bit(fd, fdt->close_on_exec);
 }
 
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	__set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	if (test_bit(fd, fdt->close_on_fork))
+		__clear_bit(fd, fdt->close_on_fork);
+}
+
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
@@ -290,6 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
+	new_fdt->close_on_fork = newf->close_on_fork_init;
 	new_fdt->open_fds = newf->open_fds_init;
 	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
@@ -330,13 +351,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		open_files = count_open_files(old_fdt);
 	}
 
-	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	copy_fd_bitmaps(new_fdt, old_fdt, open_files, false);
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
+
+		if (close_on_fork(open_files - i, old_fdt))
+			f = NULL;
+
 		if (f) {
 			get_file(f);
 		} else {
@@ -453,6 +478,7 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
+		.close_on_fork	= init_files.close_on_fork_init,
 		.open_fds	= init_files.open_fds_init,
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
@@ -840,6 +866,36 @@ void __f_unlock_pos(struct file *f)
  * file count (done either by fdget() or by fork()).
  */
 
+void f_setfd(unsigned int fd, int flags)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (flags & FD_CLOEXEC)
+		__set_close_on_exec(fd, fdt);
+	else
+		__clear_close_on_exec(fd, fdt);
+	if (flags & FD_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
+	spin_unlock(&files->file_lock);
+}
+
+int f_getfd(unsigned int fd)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	int flags;
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	flags = (close_on_exec(fd, fdt) ? FD_CLOEXEC : 0) |
+	        (close_on_fork(fd, fdt) ? FD_CLOFORK : 0);
+	rcu_read_unlock();
+	return flags;
+}
+
 void set_close_on_exec(unsigned int fd, int flag)
 {
 	struct files_struct *files = current->files;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
+	unsigned long *close_on_fork;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->close_on_exec);
 }
 
+static inline bool close_on_fork(unsigned int fd, const struct fdtable *fdt)
+{
+	return test_bit(fd, fdt->close_on_fork);
+}
+
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
 {
 	return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
+	unsigned long close_on_fork_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..0ee15ee24010 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -83,6 +83,8 @@ static inline void fdput_pos(struct fd f)
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
+extern int f_getfd(unsigned int fd);
+extern void f_setfd(unsigned int fd, int flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
-- 
2.26.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 2/4] fs: Add O_CLOFORK flag for open(2) and dup3(2)
  2020-05-15 15:23 ` Nate Karstens
  (?)
@ 2020-05-15 15:23   ` Nate Karstens
  -1 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Add the O_CLOFORK flag to open(2) and dup3(2) to automatically
set the close-on-fork flag in the new file descriptor, saving
a separate call to fcntl(2).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 arch/alpha/include/uapi/asm/fcntl.h    |  2 ++
 arch/parisc/include/uapi/asm/fcntl.h   | 39 +++++++++++++-------------
 arch/sparc/include/uapi/asm/fcntl.h    |  1 +
 fs/fcntl.c                             |  2 +-
 fs/file.c                              | 10 ++++++-
 include/linux/fcntl.h                  |  2 +-
 include/uapi/asm-generic/fcntl.h       |  4 +++
 tools/include/uapi/asm-generic/fcntl.h |  4 +++
 8 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..fbab69b15f7f 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -35,6 +35,8 @@
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
 
+#define O_CLOFORK	0200000000 /* set close_on_fork */
+
 #define F_GETLK		7
 #define F_SETLK		8
 #define F_SETLKW	9
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..8f5989e75b05 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
 #ifndef _PARISC_FCNTL_H
 #define _PARISC_FCNTL_H
 
-#define O_APPEND	000000010
-#define O_BLKSEEK	000000100 /* HPUX only */
-#define O_CREAT		000000400 /* not fcntl */
-#define O_EXCL		000002000 /* not fcntl */
-#define O_LARGEFILE	000004000
-#define __O_SYNC	000100000
+#define O_APPEND	0000000010
+#define O_BLKSEEK	0000000100 /* HPUX only */
+#define O_CREAT		0000000400 /* not fcntl */
+#define O_EXCL		0000002000 /* not fcntl */
+#define O_LARGEFILE	0000004000
+#define __O_SYNC	0000100000
 #define O_SYNC		(__O_SYNC|O_DSYNC)
-#define O_NONBLOCK	000200004 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY	000400000 /* not fcntl */
-#define O_DSYNC		001000000 /* HPUX only */
-#define O_RSYNC		002000000 /* HPUX only */
-#define O_NOATIME	004000000
-#define O_CLOEXEC	010000000 /* set close_on_exec */
-
-#define O_DIRECTORY	000010000 /* must be a directory */
-#define O_NOFOLLOW	000000200 /* don't follow links */
-#define O_INVISIBLE	004000000 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH		020000000
-#define __O_TMPFILE	040000000
+#define O_NONBLOCK	0000200004 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY	0000400000 /* not fcntl */
+#define O_DSYNC		0001000000 /* HPUX only */
+#define O_RSYNC		0002000000 /* HPUX only */
+#define O_NOATIME	0004000000
+#define O_CLOEXEC	0010000000 /* set close_on_exec */
+
+#define O_DIRECTORY	0000010000 /* must be a directory */
+#define O_NOFOLLOW	0000000200 /* don't follow links */
+#define O_INVISIBLE	0004000000 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH		0020000000
+#define __O_TMPFILE	0040000000
+#define O_CLOFORK	0100000000
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..d631ea13bac3 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define O_CLOFORK	0x4000000
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 913b0cb70804..40af2a48702b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index 81194349e980..4a1b84eecec6 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -547,6 +547,10 @@ int __alloc_fd(struct files_struct *files,
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
+	if (flags & O_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
 	error = fd;
 #if 1
 	/* Sanity check */
@@ -953,6 +957,10 @@ __releases(&files->file_lock)
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
+	if (flags & O_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
 	spin_unlock(&files->file_lock);
 
 	if (tofree)
@@ -993,7 +1001,7 @@ static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	if ((flags & ~O_CLOEXEC) != 0)
+	if ((flags & ~(O_CLOEXEC | O_CLOFORK)) != 0)
 		return -EINVAL;
 
 	if (unlikely(oldfd == newfd))
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..cd4c625647db 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_CLOFORK)
 
 /* List of all valid flags for the how->upgrade_mask argument: */
 #define VALID_UPGRADE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 0cb7199a7743..165a0736a3aa 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLOFORK
+#define O_CLOFORK	040000000	/* set close_on_fork */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index e04a00fecb4a..69d8a000ec65 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLOFORK
+#define O_CLOFORK	040000000	/* set close_on_fork */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 2/4] fs: Add O_CLOFORK flag for open(2) and dup3(2)
@ 2020-05-15 15:23   ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Add the O_CLOFORK flag to open(2) and dup3(2) to automatically
set the close-on-fork flag in the new file descriptor, saving
a separate call to fcntl(2).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 arch/alpha/include/uapi/asm/fcntl.h    |  2 ++
 arch/parisc/include/uapi/asm/fcntl.h   | 39 +++++++++++++-------------
 arch/sparc/include/uapi/asm/fcntl.h    |  1 +
 fs/fcntl.c                             |  2 +-
 fs/file.c                              | 10 ++++++-
 include/linux/fcntl.h                  |  2 +-
 include/uapi/asm-generic/fcntl.h       |  4 +++
 tools/include/uapi/asm-generic/fcntl.h |  4 +++
 8 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..fbab69b15f7f 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -35,6 +35,8 @@
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
 
+#define O_CLOFORK	0200000000 /* set close_on_fork */
+
 #define F_GETLK		7
 #define F_SETLK		8
 #define F_SETLKW	9
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..8f5989e75b05 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
 #ifndef _PARISC_FCNTL_H
 #define _PARISC_FCNTL_H
 
-#define O_APPEND	000000010
-#define O_BLKSEEK	000000100 /* HPUX only */
-#define O_CREAT		000000400 /* not fcntl */
-#define O_EXCL		000002000 /* not fcntl */
-#define O_LARGEFILE	000004000
-#define __O_SYNC	000100000
+#define O_APPEND	0000000010
+#define O_BLKSEEK	0000000100 /* HPUX only */
+#define O_CREAT		0000000400 /* not fcntl */
+#define O_EXCL		0000002000 /* not fcntl */
+#define O_LARGEFILE	0000004000
+#define __O_SYNC	0000100000
 #define O_SYNC		(__O_SYNC|O_DSYNC)
-#define O_NONBLOCK	000200004 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY	000400000 /* not fcntl */
-#define O_DSYNC		001000000 /* HPUX only */
-#define O_RSYNC		002000000 /* HPUX only */
-#define O_NOATIME	004000000
-#define O_CLOEXEC	010000000 /* set close_on_exec */
-
-#define O_DIRECTORY	000010000 /* must be a directory */
-#define O_NOFOLLOW	000000200 /* don't follow links */
-#define O_INVISIBLE	004000000 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH		020000000
-#define __O_TMPFILE	040000000
+#define O_NONBLOCK	0000200004 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY	0000400000 /* not fcntl */
+#define O_DSYNC		0001000000 /* HPUX only */
+#define O_RSYNC		0002000000 /* HPUX only */
+#define O_NOATIME	0004000000
+#define O_CLOEXEC	0010000000 /* set close_on_exec */
+
+#define O_DIRECTORY	0000010000 /* must be a directory */
+#define O_NOFOLLOW	0000000200 /* don't follow links */
+#define O_INVISIBLE	0004000000 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH		0020000000
+#define __O_TMPFILE	0040000000
+#define O_CLOFORK	0100000000
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..d631ea13bac3 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define O_CLOFORK	0x4000000
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 913b0cb70804..40af2a48702b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index 81194349e980..4a1b84eecec6 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -547,6 +547,10 @@ int __alloc_fd(struct files_struct *files,
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
+	if (flags & O_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
 	error = fd;
 #if 1
 	/* Sanity check */
@@ -953,6 +957,10 @@ __releases(&files->file_lock)
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
+	if (flags & O_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
 	spin_unlock(&files->file_lock);
 
 	if (tofree)
@@ -993,7 +1001,7 @@ static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	if ((flags & ~O_CLOEXEC) != 0)
+	if ((flags & ~(O_CLOEXEC | O_CLOFORK)) != 0)
 		return -EINVAL;
 
 	if (unlikely(oldfd == newfd))
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..cd4c625647db 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_CLOFORK)
 
 /* List of all valid flags for the how->upgrade_mask argument: */
 #define VALID_UPGRADE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 0cb7199a7743..165a0736a3aa 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLOFORK
+#define O_CLOFORK	040000000	/* set close_on_fork */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index e04a00fecb4a..69d8a000ec65 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLOFORK
+#define O_CLOFORK	040000000	/* set close_on_fork */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
-- 
2.26.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 2/4] fs: Add O_CLOFORK flag for open(2) and dup3(2)
@ 2020-05-15 15:23   ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Add the O_CLOFORK flag to open(2) and dup3(2) to automatically
set the close-on-fork flag in the new file descriptor, saving
a separate call to fcntl(2).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 arch/alpha/include/uapi/asm/fcntl.h    |  2 ++
 arch/parisc/include/uapi/asm/fcntl.h   | 39 +++++++++++++-------------
 arch/sparc/include/uapi/asm/fcntl.h    |  1 +
 fs/fcntl.c                             |  2 +-
 fs/file.c                              | 10 ++++++-
 include/linux/fcntl.h                  |  2 +-
 include/uapi/asm-generic/fcntl.h       |  4 +++
 tools/include/uapi/asm-generic/fcntl.h |  4 +++
 8 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..fbab69b15f7f 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -35,6 +35,8 @@
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
 
+#define O_CLOFORK	0200000000 /* set close_on_fork */
+
 #define F_GETLK		7
 #define F_SETLK		8
 #define F_SETLKW	9
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..8f5989e75b05 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
 #ifndef _PARISC_FCNTL_H
 #define _PARISC_FCNTL_H
 
-#define O_APPEND	000000010
-#define O_BLKSEEK	000000100 /* HPUX only */
-#define O_CREAT		000000400 /* not fcntl */
-#define O_EXCL		000002000 /* not fcntl */
-#define O_LARGEFILE	000004000
-#define __O_SYNC	000100000
+#define O_APPEND	0000000010
+#define O_BLKSEEK	0000000100 /* HPUX only */
+#define O_CREAT		0000000400 /* not fcntl */
+#define O_EXCL		0000002000 /* not fcntl */
+#define O_LARGEFILE	0000004000
+#define __O_SYNC	0000100000
 #define O_SYNC		(__O_SYNC|O_DSYNC)
-#define O_NONBLOCK	000200004 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY	000400000 /* not fcntl */
-#define O_DSYNC		001000000 /* HPUX only */
-#define O_RSYNC		002000000 /* HPUX only */
-#define O_NOATIME	004000000
-#define O_CLOEXEC	010000000 /* set close_on_exec */
-
-#define O_DIRECTORY	000010000 /* must be a directory */
-#define O_NOFOLLOW	000000200 /* don't follow links */
-#define O_INVISIBLE	004000000 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH		020000000
-#define __O_TMPFILE	040000000
+#define O_NONBLOCK	0000200004 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY	0000400000 /* not fcntl */
+#define O_DSYNC		0001000000 /* HPUX only */
+#define O_RSYNC		0002000000 /* HPUX only */
+#define O_NOATIME	0004000000
+#define O_CLOEXEC	0010000000 /* set close_on_exec */
+
+#define O_DIRECTORY	0000010000 /* must be a directory */
+#define O_NOFOLLOW	0000000200 /* don't follow links */
+#define O_INVISIBLE	0004000000 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH		0020000000
+#define __O_TMPFILE	0040000000
+#define O_CLOFORK	0100000000
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..d631ea13bac3 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define O_CLOFORK	0x4000000
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 913b0cb70804..40af2a48702b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ ! 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index 81194349e980..4a1b84eecec6 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -547,6 +547,10 @@ int __alloc_fd(struct files_struct *files,
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
+	if (flags & O_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
 	error = fd;
 #if 1
 	/* Sanity check */
@@ -953,6 +957,10 @@ __releases(&files->file_lock)
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
+	if (flags & O_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
 	spin_unlock(&files->file_lock);
 
 	if (tofree)
@@ -993,7 +1001,7 @@ static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	if ((flags & ~O_CLOEXEC) != 0)
+	if ((flags & ~(O_CLOEXEC | O_CLOFORK)) != 0)
 		return -EINVAL;
 
 	if (unlikely(oldfd = newfd))
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..cd4c625647db 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_CLOFORK)
 
 /* List of all valid flags for the how->upgrade_mask argument: */
 #define VALID_UPGRADE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 0cb7199a7743..165a0736a3aa 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLOFORK
+#define O_CLOFORK	040000000	/* set close_on_fork */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index e04a00fecb4a..69d8a000ec65 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLOFORK
+#define O_CLOFORK	040000000	/* set close_on_fork */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
-- 
2.26.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 3/4] fs: Add F_DUPFD_CLOFORK to fcntl(2)
  2020-05-15 15:23 ` Nate Karstens
  (?)
@ 2020-05-15 15:23   ` Nate Karstens
  -1 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Implement functionality for duplicating a file descriptor
and having the close-on-fork flag automatically set in the
new file descriptor.

Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                       | 4 ++++
 include/uapi/linux/fcntl.h       | 3 +++
 tools/include/uapi/linux/fcntl.h | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 40af2a48702b..e15cdd77b1df 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -333,6 +333,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_DUPFD_CLOEXEC:
 		err = f_dupfd(arg, filp, O_CLOEXEC);
 		break;
+	case F_DUPFD_CLOFORK:
+		err = f_dupfd(arg, filp, O_CLOFORK);
+		break;
 	case F_GETFD:
 		err = f_getfd(fd);
 		break;
@@ -437,6 +440,7 @@ static int check_fcntl_cmd(unsigned cmd)
 	switch (cmd) {
 	case F_DUPFD:
 	case F_DUPFD_CLOEXEC:
+	case F_DUPFD_CLOFORK:
 	case F_GETFD:
 	case F_SETFD:
 	case F_GETFL:
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK	(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
diff --git a/tools/include/uapi/linux/fcntl.h b/tools/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/tools/include/uapi/linux/fcntl.h
+++ b/tools/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK	(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 3/4] fs: Add F_DUPFD_CLOFORK to fcntl(2)
@ 2020-05-15 15:23   ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Implement functionality for duplicating a file descriptor
and having the close-on-fork flag automatically set in the
new file descriptor.

Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                       | 4 ++++
 include/uapi/linux/fcntl.h       | 3 +++
 tools/include/uapi/linux/fcntl.h | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 40af2a48702b..e15cdd77b1df 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -333,6 +333,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_DUPFD_CLOEXEC:
 		err = f_dupfd(arg, filp, O_CLOEXEC);
 		break;
+	case F_DUPFD_CLOFORK:
+		err = f_dupfd(arg, filp, O_CLOFORK);
+		break;
 	case F_GETFD:
 		err = f_getfd(fd);
 		break;
@@ -437,6 +440,7 @@ static int check_fcntl_cmd(unsigned cmd)
 	switch (cmd) {
 	case F_DUPFD:
 	case F_DUPFD_CLOEXEC:
+	case F_DUPFD_CLOFORK:
 	case F_GETFD:
 	case F_SETFD:
 	case F_GETFL:
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK	(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
diff --git a/tools/include/uapi/linux/fcntl.h b/tools/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/tools/include/uapi/linux/fcntl.h
+++ b/tools/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK	(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
-- 
2.26.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 3/4] fs: Add F_DUPFD_CLOFORK to fcntl(2)
@ 2020-05-15 15:23   ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Implement functionality for duplicating a file descriptor
and having the close-on-fork flag automatically set in the
new file descriptor.

Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                       | 4 ++++
 include/uapi/linux/fcntl.h       | 3 +++
 tools/include/uapi/linux/fcntl.h | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 40af2a48702b..e15cdd77b1df 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -333,6 +333,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_DUPFD_CLOEXEC:
 		err = f_dupfd(arg, filp, O_CLOEXEC);
 		break;
+	case F_DUPFD_CLOFORK:
+		err = f_dupfd(arg, filp, O_CLOFORK);
+		break;
 	case F_GETFD:
 		err = f_getfd(fd);
 		break;
@@ -437,6 +440,7 @@ static int check_fcntl_cmd(unsigned cmd)
 	switch (cmd) {
 	case F_DUPFD:
 	case F_DUPFD_CLOEXEC:
+	case F_DUPFD_CLOFORK:
 	case F_GETFD:
 	case F_SETFD:
 	case F_GETFL:
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK	(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
diff --git a/tools/include/uapi/linux/fcntl.h b/tools/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/tools/include/uapi/linux/fcntl.h
+++ b/tools/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK	(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
-- 
2.26.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 4/4] net: Add SOCK_CLOFORK
  2020-05-15 15:23 ` Nate Karstens
  (?)
@ 2020-05-15 15:23   ` Nate Karstens
  -1 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Implements a new socket flag that automatically sets the
close-on-fork flag for sockets created using socket(2),
socketpair(2), and accept4(2).

Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 include/linux/net.h |  3 ++-
 net/socket.c        | 14 ++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 6451425e828f..57663c9dc8c4 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -17,7 +17,7 @@
 #include <linux/stringify.h>
 #include <linux/random.h>
 #include <linux/wait.h>
-#include <linux/fcntl.h>	/* For O_CLOEXEC and O_NONBLOCK */
+#include <linux/fcntl.h>	/* For O_CLOEXEC, O_CLOFORK, and O_NONBLOCK */
 #include <linux/rcupdate.h>
 #include <linux/once.h>
 #include <linux/fs.h>
@@ -73,6 +73,7 @@ enum sock_type {
 
 /* Flags for socket, socketpair, accept4 */
 #define SOCK_CLOEXEC	O_CLOEXEC
+#define SOCK_CLOFORK	O_CLOFORK
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
diff --git a/net/socket.c b/net/socket.c
index 2eecf1517f76..ba6e971c7e78 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1511,12 +1511,14 @@ int __sys_socket(int family, int type, int protocol)
 
 	/* Check the SOCK_* constants for consistency.  */
 	BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(SOCK_CLOFORK != O_CLOFORK);
 	BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
+	BUILD_BUG_ON(SOCK_CLOFORK & SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1527,7 +1529,7 @@ int __sys_socket(int family, int type, int protocol)
 	if (retval < 0)
 		return retval;
 
-	return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+	return sock_map_fd(sock, flags & (O_CLOEXEC | O_CLOFORK | O_NONBLOCK));
 }
 
 SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
@@ -1547,7 +1549,7 @@ int __sys_socketpair(int family, int type, int protocol, int __user *usockvec)
 	int flags;
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1715,7 +1717,7 @@ int __sys_accept4_file(struct file *file, unsigned file_flags,
 	int err, len, newfd;
 	struct sockaddr_storage address;
 
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
@@ -3628,8 +3630,8 @@ EXPORT_SYMBOL(kernel_listen);
  *	@newsock: new connected socket
  *	@flags: flags
  *
- *	@flags must be SOCK_CLOEXEC, SOCK_NONBLOCK or 0.
- *	If it fails, @newsock is guaranteed to be %NULL.
+ *	@flags must be SOCK_CLOEXEC, SOCK_CLOFORK, SOCK_NONBLOCK,
+ *	or 0. If it fails, @newsock is guaranteed to be %NULL.
  *	Returns 0 or an error.
  */
 
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 4/4] net: Add SOCK_CLOFORK
@ 2020-05-15 15:23   ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Implements a new socket flag that automatically sets the
close-on-fork flag for sockets created using socket(2),
socketpair(2), and accept4(2).

Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 include/linux/net.h |  3 ++-
 net/socket.c        | 14 ++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 6451425e828f..57663c9dc8c4 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -17,7 +17,7 @@
 #include <linux/stringify.h>
 #include <linux/random.h>
 #include <linux/wait.h>
-#include <linux/fcntl.h>	/* For O_CLOEXEC and O_NONBLOCK */
+#include <linux/fcntl.h>	/* For O_CLOEXEC, O_CLOFORK, and O_NONBLOCK */
 #include <linux/rcupdate.h>
 #include <linux/once.h>
 #include <linux/fs.h>
@@ -73,6 +73,7 @@ enum sock_type {
 
 /* Flags for socket, socketpair, accept4 */
 #define SOCK_CLOEXEC	O_CLOEXEC
+#define SOCK_CLOFORK	O_CLOFORK
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
diff --git a/net/socket.c b/net/socket.c
index 2eecf1517f76..ba6e971c7e78 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1511,12 +1511,14 @@ int __sys_socket(int family, int type, int protocol)
 
 	/* Check the SOCK_* constants for consistency.  */
 	BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(SOCK_CLOFORK != O_CLOFORK);
 	BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
+	BUILD_BUG_ON(SOCK_CLOFORK & SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1527,7 +1529,7 @@ int __sys_socket(int family, int type, int protocol)
 	if (retval < 0)
 		return retval;
 
-	return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+	return sock_map_fd(sock, flags & (O_CLOEXEC | O_CLOFORK | O_NONBLOCK));
 }
 
 SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
@@ -1547,7 +1549,7 @@ int __sys_socketpair(int family, int type, int protocol, int __user *usockvec)
 	int flags;
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1715,7 +1717,7 @@ int __sys_accept4_file(struct file *file, unsigned file_flags,
 	int err, len, newfd;
 	struct sockaddr_storage address;
 
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
@@ -3628,8 +3630,8 @@ EXPORT_SYMBOL(kernel_listen);
  *	@newsock: new connected socket
  *	@flags: flags
  *
- *	@flags must be SOCK_CLOEXEC, SOCK_NONBLOCK or 0.
- *	If it fails, @newsock is guaranteed to be %NULL.
+ *	@flags must be SOCK_CLOEXEC, SOCK_CLOFORK, SOCK_NONBLOCK,
+ *	or 0. If it fails, @newsock is guaranteed to be %NULL.
  *	Returns 0 or an error.
  */
 
-- 
2.26.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 4/4] net: Add SOCK_CLOFORK
@ 2020-05-15 15:23   ` Nate Karstens
  0 siblings, 0 replies; 61+ messages in thread
From: Nate Karstens @ 2020-05-15 15:23 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao, Nate Karstens

Implements a new socket flag that automatically sets the
close-on-fork flag for sockets created using socket(2),
socketpair(2), and accept4(2).

Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 include/linux/net.h |  3 ++-
 net/socket.c        | 14 ++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 6451425e828f..57663c9dc8c4 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -17,7 +17,7 @@
 #include <linux/stringify.h>
 #include <linux/random.h>
 #include <linux/wait.h>
-#include <linux/fcntl.h>	/* For O_CLOEXEC and O_NONBLOCK */
+#include <linux/fcntl.h>	/* For O_CLOEXEC, O_CLOFORK, and O_NONBLOCK */
 #include <linux/rcupdate.h>
 #include <linux/once.h>
 #include <linux/fs.h>
@@ -73,6 +73,7 @@ enum sock_type {
 
 /* Flags for socket, socketpair, accept4 */
 #define SOCK_CLOEXEC	O_CLOEXEC
+#define SOCK_CLOFORK	O_CLOFORK
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
diff --git a/net/socket.c b/net/socket.c
index 2eecf1517f76..ba6e971c7e78 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1511,12 +1511,14 @@ int __sys_socket(int family, int type, int protocol)
 
 	/* Check the SOCK_* constants for consistency.  */
 	BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(SOCK_CLOFORK != O_CLOFORK);
 	BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
+	BUILD_BUG_ON(SOCK_CLOFORK & SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1527,7 +1529,7 @@ int __sys_socket(int family, int type, int protocol)
 	if (retval < 0)
 		return retval;
 
-	return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+	return sock_map_fd(sock, flags & (O_CLOEXEC | O_CLOFORK | O_NONBLOCK));
 }
 
 SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
@@ -1547,7 +1549,7 @@ int __sys_socketpair(int family, int type, int protocol, int __user *usockvec)
 	int flags;
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1715,7 +1717,7 @@ int __sys_accept4_file(struct file *file, unsigned file_flags,
 	int err, len, newfd;
 	struct sockaddr_storage address;
 
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
@@ -3628,8 +3630,8 @@ EXPORT_SYMBOL(kernel_listen);
  *	@newsock: new connected socket
  *	@flags: flags
  *
- *	@flags must be SOCK_CLOEXEC, SOCK_NONBLOCK or 0.
- *	If it fails, @newsock is guaranteed to be %NULL.
+ *	@flags must be SOCK_CLOEXEC, SOCK_CLOFORK, SOCK_NONBLOCK,
+ *	or 0. If it fails, @newsock is guaranteed to be %NULL.
  *	Returns 0 or an error.
  */
 
-- 
2.26.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 15:23 ` Nate Karstens
@ 2020-05-15 15:30   ` Eric Dumazet
  -1 siblings, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2020-05-15 15:30 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, David Laight, linux-fsdevel, linux-arch,
	linux-alpha, linux-parisc, sparclinux, netdev, LKML, Changli Gao

On Fri, May 15, 2020 at 8:23 AM Nate Karstens <nate.karstens@garmin.com> wrote:
>
>
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:
>
>  * accept(4)
>  * dup3(2)
>  * fcntl(2)
>  * open(2)
>  * socket(2)
>  * socketpair(2)
>  * unshare(2)
>
> Addresses underlying issue in that there is no way to prevent
> a fork() from duplicating a file descriptor. The existing
> close-on-exec flag partially-addresses this by allowing the
> parent process to mark a file descriptor as exclusive to itself,
> but there is still a period of time the failure can occur
> because the auto-close only occurs during the exec().
>
> One manifestation of this is a race conditions in system(), which
> (depending on the implementation) is non-atomic in that it first
> calls a fork() and then an exec().
>
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).
>
> ---
>
> This is v2 of the change. See https://lkml.org/lkml/2020/4/20/113
> for the original work.
>
> Thanks to everyone who provided comments on the first series of
> patches. Here are replies to specific comments:
>
> > I suggest we group the two bits of a file (close_on_exec, close_on_fork)
> > together, so that we do not have to dirty two separate cache lines.
>
> I could be mistaken, but I don't think this would improve efficiency.
> The close-on-fork and close-on-exec flags are read at different
> times. If you assume separate syscalls for fork and exec then
> there are several switches between when the two flags are read.
> In addition, the close-on-fork flags in the new process must be
> cleared, which will be much harder if the flags are interleaved.

:/

Fast path in big and performance sensitive applications is not fork()
and/or exec().

This is open()/close() and others (socket(), accept(), ...)

We do not want them to access extra cache lines for this new feature.

Sorry, I will say no to these patches in their current form.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 15:30   ` Eric Dumazet
  0 siblings, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2020-05-15 15:30 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, David Laight, linux-fsdevel, linux-arch,
	linux-alpha, linux-parisc, sparclinux, netdev, LKML, Changli Gao

On Fri, May 15, 2020 at 8:23 AM Nate Karstens <nate.karstens@garmin.com> wrote:
>
>
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:
>
>  * accept(4)
>  * dup3(2)
>  * fcntl(2)
>  * open(2)
>  * socket(2)
>  * socketpair(2)
>  * unshare(2)
>
> Addresses underlying issue in that there is no way to prevent
> a fork() from duplicating a file descriptor. The existing
> close-on-exec flag partially-addresses this by allowing the
> parent process to mark a file descriptor as exclusive to itself,
> but there is still a period of time the failure can occur
> because the auto-close only occurs during the exec().
>
> One manifestation of this is a race conditions in system(), which
> (depending on the implementation) is non-atomic in that it first
> calls a fork() and then an exec().
>
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).
>
> ---
>
> This is v2 of the change. See https://lkml.org/lkml/2020/4/20/113
> for the original work.
>
> Thanks to everyone who provided comments on the first series of
> patches. Here are replies to specific comments:
>
> > I suggest we group the two bits of a file (close_on_exec, close_on_fork)
> > together, so that we do not have to dirty two separate cache lines.
>
> I could be mistaken, but I don't think this would improve efficiency.
> The close-on-fork and close-on-exec flags are read at different
> times. If you assume separate syscalls for fork and exec then
> there are several switches between when the two flags are read.
> In addition, the close-on-fork flags in the new process must be
> cleared, which will be much harder if the flags are interleaved.

:/

Fast path in big and performance sensitive applications is not fork()
and/or exec().

This is open()/close() and others (socket(), accept(), ...)

We do not want them to access extra cache lines for this new feature.

Sorry, I will say no to these patches in their current form.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 15:23 ` Nate Karstens
  (?)
@ 2020-05-15 15:57   ` Matthew Wilcox
  -1 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 15:57 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao, a.josey

On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:

[...]

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

NAK to this patch series, and the entire concept.

Is there a way to persuade POSIX that they made a bad decision by
standardising this mess?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 15:57   ` Matthew Wilcox
  0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 15:57 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao, a.josey

On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:

[...]

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

NAK to this patch series, and the entire concept.

Is there a way to persuade POSIX that they made a bad decision by
standardising this mess?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 15:57   ` Matthew Wilcox
  0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 15:57 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao, a.josey

On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:

[...]

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

NAK to this patch series, and the entire concept.

Is there a way to persuade POSIX that they made a bad decision by
standardising this mess?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
  2020-05-15 15:30   ` Eric Dumazet
  (?)
@ 2020-05-15 15:59     ` David Laight
  -1 siblings, 0 replies; 61+ messages in thread
From: David Laight @ 2020-05-15 15:59 UTC (permalink / raw)
  To: 'Eric Dumazet', Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, LKML, Changli Gao

From: Eric Dumazet
> Sent: 15 May 2020 16:31
...
> Fast path in big and performance sensitive applications is not fork()
> and/or exec().
> 
> This is open()/close() and others (socket(), accept(), ...)
> 
> We do not want them to access extra cache lines for this new feature.
> 
> Sorry, I will say no to these patches in their current form.

Is it worth completely removing the bitmaps and just remembering
the lowest fd that has had each bit set (don't worry about clears).

Then leverage the close_all() code that closes all fd above
a specified number to close only those with the 'close on exec'
or 'close on fork' flag set.

After all an application is currently very likely to have set
'close on exec' on all open fd above 2.

So the number of fd that don't need closing is small.

This puts all the expensive code in the already slow fork/exec
paths.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
@ 2020-05-15 15:59     ` David Laight
  0 siblings, 0 replies; 61+ messages in thread
From: David Laight @ 2020-05-15 15:59 UTC (permalink / raw)
  To: 'Eric Dumazet', Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, LKML

From: Eric Dumazet
> Sent: 15 May 2020 16:31
...
> Fast path in big and performance sensitive applications is not fork()
> and/or exec().
> 
> This is open()/close() and others (socket(), accept(), ...)
> 
> We do not want them to access extra cache lines for this new feature.
> 
> Sorry, I will say no to these patches in their current form.

Is it worth completely removing the bitmaps and just remembering
the lowest fd that has had each bit set (don't worry about clears).

Then leverage the close_all() code that closes all fd above
a specified number to close only those with the 'close on exec'
or 'close on fork' flag set.

After all an application is currently very likely to have set
'close on exec' on all open fd above 2.

So the number of fd that don't need closing is small.

This puts all the expensive code in the already slow fork/exec
paths.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
@ 2020-05-15 15:59     ` David Laight
  0 siblings, 0 replies; 61+ messages in thread
From: David Laight @ 2020-05-15 15:59 UTC (permalink / raw)
  To: 'Eric Dumazet', Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, LKML

RnJvbTogRXJpYyBEdW1hemV0DQo+IFNlbnQ6IDE1IE1heSAyMDIwIDE2OjMxDQouLi4NCj4gRmFz
dCBwYXRoIGluIGJpZyBhbmQgcGVyZm9ybWFuY2Ugc2Vuc2l0aXZlIGFwcGxpY2F0aW9ucyBpcyBu
b3QgZm9yaygpDQo+IGFuZC9vciBleGVjKCkuDQo+IA0KPiBUaGlzIGlzIG9wZW4oKS9jbG9zZSgp
IGFuZCBvdGhlcnMgKHNvY2tldCgpLCBhY2NlcHQoKSwgLi4uKQ0KPiANCj4gV2UgZG8gbm90IHdh
bnQgdGhlbSB0byBhY2Nlc3MgZXh0cmEgY2FjaGUgbGluZXMgZm9yIHRoaXMgbmV3IGZlYXR1cmUu
DQo+IA0KPiBTb3JyeSwgSSB3aWxsIHNheSBubyB0byB0aGVzZSBwYXRjaGVzIGluIHRoZWlyIGN1
cnJlbnQgZm9ybS4NCg0KSXMgaXQgd29ydGggY29tcGxldGVseSByZW1vdmluZyB0aGUgYml0bWFw
cyBhbmQganVzdCByZW1lbWJlcmluZw0KdGhlIGxvd2VzdCBmZCB0aGF0IGhhcyBoYWQgZWFjaCBi
aXQgc2V0IChkb24ndCB3b3JyeSBhYm91dCBjbGVhcnMpLg0KDQpUaGVuIGxldmVyYWdlIHRoZSBj
bG9zZV9hbGwoKSBjb2RlIHRoYXQgY2xvc2VzIGFsbCBmZCBhYm92ZQ0KYSBzcGVjaWZpZWQgbnVt
YmVyIHRvIGNsb3NlIG9ubHkgdGhvc2Ugd2l0aCB0aGUgJ2Nsb3NlIG9uIGV4ZWMnDQpvciAnY2xv
c2Ugb24gZm9yaycgZmxhZyBzZXQuDQoNCkFmdGVyIGFsbCBhbiBhcHBsaWNhdGlvbiBpcyBjdXJy
ZW50bHkgdmVyeSBsaWtlbHkgdG8gaGF2ZSBzZXQNCidjbG9zZSBvbiBleGVjJyBvbiBhbGwgb3Bl
biBmZCBhYm92ZSAyLg0KDQpTbyB0aGUgbnVtYmVyIG9mIGZkIHRoYXQgZG9uJ3QgbmVlZCBjbG9z
aW5nIGlzIHNtYWxsLg0KDQpUaGlzIHB1dHMgYWxsIHRoZSBleHBlbnNpdmUgY29kZSBpbiB0aGUg
YWxyZWFkeSBzbG93IGZvcmsvZXhlYw0KcGF0aHMuDQoNCglEYXZpZA0KDQotDQpSZWdpc3RlcmVk
IEFkZHJlc3MgTGFrZXNpZGUsIEJyYW1sZXkgUm9hZCwgTW91bnQgRmFybSwgTWlsdG9uIEtleW5l
cywgTUsxIDFQVCwgVUsNClJlZ2lzdHJhdGlvbiBObzogMTM5NzM4NiAoV2FsZXMpDQo

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 15:23 ` Nate Karstens
@ 2020-05-15 16:03   ` Al Viro
  -1 siblings, 0 replies; 61+ messages in thread
From: Al Viro @ 2020-05-15 16:03 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev, linux-kernel, Changli Gao

On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

It penalizes every call of fork() in the system (as well as adds
an extra dirtied cacheline on each socket()/open()/etc.), adds
memory footprint and complicates the API.  All of that - to deal
with rather uncommon problem that already has a portable solution.

As for the Austin Group, the only authority it has ever had derives
from consensus between existing Unices.  "Solaris does it, Linux and
*BSD do not" translates into "Austin Group is welcome to take a hike".
BTW, contrary to the lovely bit of misrepresentation in that
thread of theirs ("<LWN URL> suggests that" != "someone's comment
under LWN article says it _appears_ that"), none of *BSD do it.

IMO it's a bad idea.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:03   ` Al Viro
  0 siblings, 0 replies; 61+ messages in thread
From: Al Viro @ 2020-05-15 16:03 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev, linux-kernel, Changli Gao

On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

It penalizes every call of fork() in the system (as well as adds
an extra dirtied cacheline on each socket()/open()/etc.), adds
memory footprint and complicates the API.  All of that - to deal
with rather uncommon problem that already has a portable solution.

As for the Austin Group, the only authority it has ever had derives
from consensus between existing Unices.  "Solaris does it, Linux and
*BSD do not" translates into "Austin Group is welcome to take a hike".
BTW, contrary to the lovely bit of misrepresentation in that
thread of theirs ("<LWN URL> suggests that" != "someone's comment
under LWN article says it _appears_ that"), none of *BSD do it.

IMO it's a bad idea.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
  2020-05-15 15:57   ` Matthew Wilcox
@ 2020-05-15 16:07     ` Karstens, Nate
  -1 siblings, 0 replies; 61+ messages in thread
From: Karstens, Nate @ 2020-05-15 16:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao, a.josey

Matthew,

What alternative would you suggest?

From an earlier email:

> ...nothing else addresses the underlying issue: there is no way to
> prevent a fork() from duplicating the resource. The close-on-exec
> flag partially-addresses this by allowing the parent process to
> mark a file descriptor as exclusive to itself, but there is still
> a period of time the failure can occur because the auto-close only
> occurs during the exec(). Perhaps this would not be an issue with
> a different process/threading model, but that is another discussion
> entirely.

Do you disagree there is an issue?

-----Original Message-----
From: Matthew Wilcox <willy@infradead.org>
Sent: Friday, May 15, 2020 10:58
To: Karstens, Nate <Nate.Karstens@garmin.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@hansenpartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Eric Dumazet <edumazet@google.com>; David Laight <David.Laight@aculab.com>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Changli Gao <xiaosuo@gmail.com>; a.josey@opengroup.org
Subject: Re: [PATCH v2] Implement close-on-fork

CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.


On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:

[...]

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

NAK to this patch series, and the entire concept.

Is there a way to persuade POSIX that they made a bad decision by standardising this mess?

________________________________

CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:07     ` Karstens, Nate
  0 siblings, 0 replies; 61+ messages in thread
From: Karstens, Nate @ 2020-05-15 16:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sp

Matthew,

What alternative would you suggest?

From an earlier email:

> ...nothing else addresses the underlying issue: there is no way to
> prevent a fork() from duplicating the resource. The close-on-exec
> flag partially-addresses this by allowing the parent process to
> mark a file descriptor as exclusive to itself, but there is still
> a period of time the failure can occur because the auto-close only
> occurs during the exec(). Perhaps this would not be an issue with
> a different process/threading model, but that is another discussion
> entirely.

Do you disagree there is an issue?

-----Original Message-----
From: Matthew Wilcox <willy@infradead.org>
Sent: Friday, May 15, 2020 10:58
To: Karstens, Nate <Nate.Karstens@garmin.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@hansenpartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Eric Dumazet <edumazet@google.com>; David Laight <David.Laight@aculab.com>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Changli Gao <xiaosuo@gmail.com>; a.josey@opengroup.org
Subject: Re: [PATCH v2] Implement close-on-fork

CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.


On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:

[...]

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

NAK to this patch series, and the entire concept.

Is there a way to persuade POSIX that they made a bad decision by standardising this mess?

________________________________

CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 16:07     ` Karstens, Nate
  (?)
@ 2020-05-15 16:25       ` James Bottomley
  -1 siblings, 0 replies; 61+ messages in thread
From: James Bottomley @ 2020-05-15 16:25 UTC (permalink / raw)
  To: Karstens, Nate, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev, linux-kernel, Changli Gao, a.josey

On Fri, 2020-05-15 at 16:07 +0000, Karstens, Nate wrote:
> Matthew,
> 
> What alternative would you suggest?
> 
> From an earlier email:
> 
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to
> > mark a file descriptor as exclusive to itself, but there is still
> > a period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with
> > a different process/threading model, but that is another discussion
> > entirely.
> 
> Do you disagree there is an issue?

Oh good grief that's a leading question: When I write bad code and it
crashes, most people would agree there is an issue; very few would
agree the kernel should be changed to fix it. Several of us have
already said the problem seems to be with the way your application is
written.  You didn't even answer emails like this speculating about the
cause being the way your application counts resources:

https://lore.kernel.org/linux-fsdevel/1587569663.3485.18.camel@HansenPartnership.com/

The bottom line is that we think you could rewrite this one application
not to have the problem you're complaining about rather than introduce
a new kernel API to "fix" it.

James




^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:25       ` James Bottomley
  0 siblings, 0 replies; 61+ messages in thread
From: James Bottomley @ 2020-05-15 16:25 UTC (permalink / raw)
  To: Karstens, Nate, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev

On Fri, 2020-05-15 at 16:07 +0000, Karstens, Nate wrote:
> Matthew,
> 
> What alternative would you suggest?
> 
> From an earlier email:
> 
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to
> > mark a file descriptor as exclusive to itself, but there is still
> > a period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with
> > a different process/threading model, but that is another discussion
> > entirely.
> 
> Do you disagree there is an issue?

Oh good grief that's a leading question: When I write bad code and it
crashes, most people would agree there is an issue; very few would
agree the kernel should be changed to fix it. Several of us have
already said the problem seems to be with the way your application is
written.  You didn't even answer emails like this speculating about the
cause being the way your application counts resources:

https://lore.kernel.org/linux-fsdevel/1587569663.3485.18.camel@HansenPartnership.com/

The bottom line is that we think you could rewrite this one application
not to have the problem you're complaining about rather than introduce
a new kernel API to "fix" it.

James

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:25       ` James Bottomley
  0 siblings, 0 replies; 61+ messages in thread
From: James Bottomley @ 2020-05-15 16:25 UTC (permalink / raw)
  To: Karstens, Nate, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev

On Fri, 2020-05-15 at 16:07 +0000, Karstens, Nate wrote:
> Matthew,
> 
> What alternative would you suggest?
> 
> From an earlier email:
> 
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to
> > mark a file descriptor as exclusive to itself, but there is still
> > a period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with
> > a different process/threading model, but that is another discussion
> > entirely.
> 
> Do you disagree there is an issue?

Oh good grief that's a leading question: When I write bad code and it
crashes, most people would agree there is an issue; very few would
agree the kernel should be changed to fix it. Several of us have
already said the problem seems to be with the way your application is
written.  You didn't even answer emails like this speculating about the
cause being the way your application counts resources:

https://lore.kernel.org/linux-fsdevel/1587569663.3485.18.camel@HansenPartnership.com/

The bottom line is that we think you could rewrite this one application
not to have the problem you're complaining about rather than introduce
a new kernel API to "fix" it.

James

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 16:07     ` Karstens, Nate
  (?)
@ 2020-05-15 16:26       ` Matthew Wilcox
  -1 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 16:26 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao, a.josey

On Fri, May 15, 2020 at 04:07:33PM +0000, Karstens, Nate wrote:
> Matthew,
> 
> What alternative would you suggest?
> 
> >From an earlier email:
> 
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to
> > mark a file descriptor as exclusive to itself, but there is still
> > a period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with
> > a different process/threading model, but that is another discussion
> > entirely.
> 
> Do you disagree there is an issue?

Yes.  system() is defined as being unsafe for a threaded application
to call.  I pointed this out in the last thread.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:26       ` Matthew Wilcox
  0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 16:26 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sp

On Fri, May 15, 2020 at 04:07:33PM +0000, Karstens, Nate wrote:
> Matthew,
> 
> What alternative would you suggest?
> 
> >From an earlier email:
> 
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to
> > mark a file descriptor as exclusive to itself, but there is still
> > a period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with
> > a different process/threading model, but that is another discussion
> > entirely.
> 
> Do you disagree there is an issue?

Yes.  system() is defined as being unsafe for a threaded application
to call.  I pointed this out in the last thread.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:26       ` Matthew Wilcox
  0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 16:26 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sp

On Fri, May 15, 2020 at 04:07:33PM +0000, Karstens, Nate wrote:
> Matthew,
> 
> What alternative would you suggest?
> 
> >From an earlier email:
> 
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to
> > mark a file descriptor as exclusive to itself, but there is still
> > a period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with
> > a different process/threading model, but that is another discussion
> > entirely.
> 
> Do you disagree there is an issue?

Yes.  system() is defined as being unsafe for a threaded application
to call.  I pointed this out in the last thread.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
  2020-05-15 16:03   ` Al Viro
@ 2020-05-15 16:26     ` Karstens, Nate
  -1 siblings, 0 replies; 61+ messages in thread
From: Karstens, Nate @ 2020-05-15 16:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev, linux-kernel, Changli Gao

Al,

The reference to the POSIX change was included more for a reference to that discussion, not to say "POSIX did this and so Linux must do it to". In any case, the documentation presented to the Austin Group was focused more around the issue we ran into and some alternative solutions. In reviewing the notes from the meeting I didn't get the impression that they added this to POSIX simply because Solaris and *BSD already had it (reference https://austingroupbugs.net/view.php?id=1317, the first solution we suggested).

> It penalizes every call of fork() in the system

Is the performance hit really that drastic? fork() does a lot of stuff and this really seems like a drop in the bucket...

> adds an extra dirtied cacheline on each socket()/open()/etc.

It sounds like we can work to improve that, though.

> already has a portable solution

What is the solution?

Thanks,

Nate

-----Original Message-----
From: Al Viro <viro@ftp.linux.org.uk> On Behalf Of Al Viro
Sent: Friday, May 15, 2020 11:04
To: Karstens, Nate <Nate.Karstens@garmin.com>
Cc: Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@hansenpartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Eric Dumazet <edumazet@google.com>; David Laight <David.Laight@aculab.com>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Changli Gao <xiaosuo@gmail.com>
Subject: Re: [PATCH v2] Implement close-on-fork

CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.


On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

It penalizes every call of fork() in the system (as well as adds an extra dirtied cacheline on each socket()/open()/etc.), adds memory footprint and complicates the API.  All of that - to deal with rather uncommon problem that already has a portable solution.

As for the Austin Group, the only authority it has ever had derives from consensus between existing Unices.  "Solaris does it, Linux and *BSD do not" translates into "Austin Group is welcome to take a hike".
BTW, contrary to the lovely bit of misrepresentation in that thread of theirs ("<LWN URL> suggests that" != "someone's comment under LWN article says it _appears_ that"), none of *BSD do it.

IMO it's a bad idea.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

________________________________

CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:26     ` Karstens, Nate
  0 siblings, 0 replies; 61+ messages in thread
From: Karstens, Nate @ 2020-05-15 16:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc,
	sparclinux@vger.kernel.org

Al,

The reference to the POSIX change was included more for a reference to that discussion, not to say "POSIX did this and so Linux must do it to". In any case, the documentation presented to the Austin Group was focused more around the issue we ran into and some alternative solutions. In reviewing the notes from the meeting I didn't get the impression that they added this to POSIX simply because Solaris and *BSD already had it (reference https://austingroupbugs.net/view.php?id=1317, the first solution we suggested).

> It penalizes every call of fork() in the system

Is the performance hit really that drastic? fork() does a lot of stuff and this really seems like a drop in the bucket...

> adds an extra dirtied cacheline on each socket()/open()/etc.

It sounds like we can work to improve that, though.

> already has a portable solution

What is the solution?

Thanks,

Nate

-----Original Message-----
From: Al Viro <viro@ftp.linux.org.uk> On Behalf Of Al Viro
Sent: Friday, May 15, 2020 11:04
To: Karstens, Nate <Nate.Karstens@garmin.com>
Cc: Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@hansenpartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Eric Dumazet <edumazet@google.com>; David Laight <David.Laight@aculab.com>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Changli Gao <xiaosuo@gmail.com>
Subject: Re: [PATCH v2] Implement close-on-fork

CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.


On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

It penalizes every call of fork() in the system (as well as adds an extra dirtied cacheline on each socket()/open()/etc.), adds memory footprint and complicates the API.  All of that - to deal with rather uncommon problem that already has a portable solution.

As for the Austin Group, the only authority it has ever had derives from consensus between existing Unices.  "Solaris does it, Linux and *BSD do not" translates into "Austin Group is welcome to take a hike".
BTW, contrary to the lovely bit of misrepresentation in that thread of theirs ("<LWN URL> suggests that" != "someone's comment under LWN article says it _appears_ that"), none of *BSD do it.

IMO it's a bad idea.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

________________________________

CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 16:03   ` Al Viro
  (?)
@ 2020-05-15 16:53     ` David Howells
  -1 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2020-05-15 16:53 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: dhowells, Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao

Karstens, Nate <Nate.Karstens@garmin.com> wrote:

> > already has a portable solution
> 
> What is the solution?

sys_spawn(const char *path, const char **argv, const char **envv,
	  unsigned long clone_flags, unsigned int nfds, int *fds);

maybe?

David


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:53     ` David Howells
  0 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2020-05-15 16:53 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: dhowells, Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc@vger.kernel.org

Karstens, Nate <Nate.Karstens@garmin.com> wrote:

> > already has a portable solution
> 
> What is the solution?

sys_spawn(const char *path, const char **argv, const char **envv,
	  unsigned long clone_flags, unsigned int nfds, int *fds);

maybe?

David

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 16:53     ` David Howells
  0 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2020-05-15 16:53 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: dhowells, Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc@vger.kernel.org

Karstens, Nate <Nate.Karstens@garmin.com> wrote:

> > already has a portable solution
> 
> What is the solution?

sys_spawn(const char *path, const char **argv, const char **envv,
	  unsigned long clone_flags, unsigned int nfds, int *fds);

maybe?

David

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
  2020-05-15 16:25       ` James Bottomley
  (?)
@ 2020-05-15 18:28         ` Karstens, Nate
  -1 siblings, 0 replies; 61+ messages in thread
From: Karstens, Nate @ 2020-05-15 18:28 UTC (permalink / raw)
  To: James Bottomley, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev, linux-kernel, Changli Gao, a.josey

James,

Sorry, perhaps I was indirect, but I thought I had responded to that in https://lore.kernel.org/linux-fsdevel/de6adce76b534310975e4d3c4a4facb2@garmin.com/.

I really hope I do not come off as complaining about this issue. We identified what seemed to be something that was overlooked with the various APIs around creating child processes. Rather than fixing it ourselves and moving on we chose to invest more time and effort into it by engaging the community (first POSIX, and now this one) in a discussion. I humbly and sincerely ask if you would help me understand, if we could turn back the clock, how our application could have been written to avoid this issue:

*A parent process forks a child. Another thread in the parent process closes and attempts to reopen a socket, file, or other resource it needs exclusive access to. This fails because the -operating system- still has a reference to that resource that it is keeping on behalf of the child. The child eventually calls exec and the resource is closed because the close-on-exec flag is set.*

Our first attempt, which was to use the pthread_atfork() handlers, failed because system() is not required to call the handlers.

Most of the feedback we're getting on this seems to say "don't use system(), it is unsafe for threaded applications". Is that documented anywhere? The man page says it is "MT-Safe".

Aside from that, even if we remove all uses of system() from our application (which we already have), then our application, like many other applications, needs to use third-party shared libraries. There is nothing that prevents those libraries from using system(). We can audit those libraries and go back with the vendor with a request to replace system() with a standard fork/exec, but they will also want documentation supporting that.

We can also take steps to change or remove system() from our standard library. It fixes our issue, but still leaves the community with an API that is broken/flawed/poorly-documented (depending on how one looks at it).

If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?

Thanks for your help with this.

Nate

-----Original Message-----
From: James Bottomley <James.Bottomley@HansenPartnership.com>
Sent: Friday, May 15, 2020 11:26
To: Karstens, Nate <Nate.Karstens@garmin.com>; Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Eric Dumazet <edumazet@google.com>; David Laight <David.Laight@aculab.com>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Changli Gao <xiaosuo@gmail.com>; a.josey@opengroup.org
Subject: Re: [PATCH v2] Implement close-on-fork

CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.


On Fri, 2020-05-15 at 16:07 +0000, Karstens, Nate wrote:
> Matthew,
>
> What alternative would you suggest?
>
> From an earlier email:
>
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to mark
> > a file descriptor as exclusive to itself, but there is still a
> > period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with a
> > different process/threading model, but that is another discussion
> > entirely.
>
> Do you disagree there is an issue?

Oh good grief that's a leading question: When I write bad code and it crashes, most people would agree there is an issue; very few would agree the kernel should be changed to fix it. Several of us have already said the problem seems to be with the way your application is written.  You didn't even answer emails like this speculating about the cause being the way your application counts resources:

https://lore.kernel.org/linux-fsdevel/1587569663.3485.18.camel@HansenPartnership.com/

The bottom line is that we think you could rewrite this one application not to have the problem you're complaining about rather than introduce a new kernel API to "fix" it.

James




________________________________

CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
@ 2020-05-15 18:28         ` Karstens, Nate
  0 siblings, 0 replies; 61+ messages in thread
From: Karstens, Nate @ 2020-05-15 18:28 UTC (permalink / raw)
  To: James Bottomley, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev

James,

Sorry, perhaps I was indirect, but I thought I had responded to that in https://lore.kernel.org/linux-fsdevel/de6adce76b534310975e4d3c4a4facb2@garmin.com/.

I really hope I do not come off as complaining about this issue. We identified what seemed to be something that was overlooked with the various APIs around creating child processes. Rather than fixing it ourselves and moving on we chose to invest more time and effort into it by engaging the community (first POSIX, and now this one) in a discussion. I humbly and sincerely ask if you would help me understand, if we could turn back the clock, how our application could have been written to avoid this issue:

*A parent process forks a child. Another thread in the parent process closes and attempts to reopen a socket, file, or other resource it needs exclusive access to. This fails because the -operating system- still has a reference to that resource that it is keeping on behalf of the child. The child eventually calls exec and the resource is closed because the close-on-exec flag is set.*

Our first attempt, which was to use the pthread_atfork() handlers, failed because system() is not required to call the handlers.

Most of the feedback we're getting on this seems to say "don't use system(), it is unsafe for threaded applications". Is that documented anywhere? The man page says it is "MT-Safe".

Aside from that, even if we remove all uses of system() from our application (which we already have), then our application, like many other applications, needs to use third-party shared libraries. There is nothing that prevents those libraries from using system(). We can audit those libraries and go back with the vendor with a request to replace system() with a standard fork/exec, but they will also want documentation supporting that.

We can also take steps to change or remove system() from our standard library. It fixes our issue, but still leaves the community with an API that is broken/flawed/poorly-documented (depending on how one looks at it).

If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?

Thanks for your help with this.

Nate

-----Original Message-----
From: James Bottomley <James.Bottomley@HansenPartnership.com>
Sent: Friday, May 15, 2020 11:26
To: Karstens, Nate <Nate.Karstens@garmin.com>; Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Eric Dumazet <edumazet@google.com>; David Laight <David.Laight@aculab.com>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Changli Gao <xiaosuo@gmail.com>; a.josey@opengroup.org
Subject: Re: [PATCH v2] Implement close-on-fork

CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.


On Fri, 2020-05-15 at 16:07 +0000, Karstens, Nate wrote:
> Matthew,
>
> What alternative would you suggest?
>
> From an earlier email:
>
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to mark
> > a file descriptor as exclusive to itself, but there is still a
> > period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with a
> > different process/threading model, but that is another discussion
> > entirely.
>
> Do you disagree there is an issue?

Oh good grief that's a leading question: When I write bad code and it crashes, most people would agree there is an issue; very few would agree the kernel should be changed to fix it. Several of us have already said the problem seems to be with the way your application is written.  You didn't even answer emails like this speculating about the cause being the way your application counts resources:

https://lore.kernel.org/linux-fsdevel/1587569663.3485.18.camel@HansenPartnership.com/

The bottom line is that we think you could rewrite this one application not to have the problem you're complaining about rather than introduce a new kernel API to "fix" it.

James




________________________________

CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
@ 2020-05-15 18:28         ` Karstens, Nate
  0 siblings, 0 replies; 61+ messages in thread
From: Karstens, Nate @ 2020-05-15 18:28 UTC (permalink / raw)
  To: James Bottomley, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev

SmFtZXMsDQoNClNvcnJ5LCBwZXJoYXBzIEkgd2FzIGluZGlyZWN0LCBidXQgSSB0aG91Z2h0IEkg
aGFkIHJlc3BvbmRlZCB0byB0aGF0IGluIGh0dHBzOi8vbG9yZS5rZXJuZWwub3JnL2xpbnV4LWZz
ZGV2ZWwvZGU2YWRjZTc2YjUzNDMxMDk3NWU0ZDNjNGE0ZmFjYjJAZ2FybWluLmNvbS8uDQoNCkkg
cmVhbGx5IGhvcGUgSSBkbyBub3QgY29tZSBvZmYgYXMgY29tcGxhaW5pbmcgYWJvdXQgdGhpcyBp
c3N1ZS4gV2UgaWRlbnRpZmllZCB3aGF0IHNlZW1lZCB0byBiZSBzb21ldGhpbmcgdGhhdCB3YXMg
b3Zlcmxvb2tlZCB3aXRoIHRoZSB2YXJpb3VzIEFQSXMgYXJvdW5kIGNyZWF0aW5nIGNoaWxkIHBy
b2Nlc3Nlcy4gUmF0aGVyIHRoYW4gZml4aW5nIGl0IG91cnNlbHZlcyBhbmQgbW92aW5nIG9uIHdl
IGNob3NlIHRvIGludmVzdCBtb3JlIHRpbWUgYW5kIGVmZm9ydCBpbnRvIGl0IGJ5IGVuZ2FnaW5n
IHRoZSBjb21tdW5pdHkgKGZpcnN0IFBPU0lYLCBhbmQgbm93IHRoaXMgb25lKSBpbiBhIGRpc2N1
c3Npb24uIEkgaHVtYmx5IGFuZCBzaW5jZXJlbHkgYXNrIGlmIHlvdSB3b3VsZCBoZWxwIG1lIHVu
ZGVyc3RhbmQsIGlmIHdlIGNvdWxkIHR1cm4gYmFjayB0aGUgY2xvY2ssIGhvdyBvdXIgYXBwbGlj
YXRpb24gY291bGQgaGF2ZSBiZWVuIHdyaXR0ZW4gdG8gYXZvaWQgdGhpcyBpc3N1ZToNCg0KKkEg
cGFyZW50IHByb2Nlc3MgZm9ya3MgYSBjaGlsZC4gQW5vdGhlciB0aHJlYWQgaW4gdGhlIHBhcmVu
dCBwcm9jZXNzIGNsb3NlcyBhbmQgYXR0ZW1wdHMgdG8gcmVvcGVuIGEgc29ja2V0LCBmaWxlLCBv
ciBvdGhlciByZXNvdXJjZSBpdCBuZWVkcyBleGNsdXNpdmUgYWNjZXNzIHRvLiBUaGlzIGZhaWxz
IGJlY2F1c2UgdGhlIC1vcGVyYXRpbmcgc3lzdGVtLSBzdGlsbCBoYXMgYSByZWZlcmVuY2UgdG8g
dGhhdCByZXNvdXJjZSB0aGF0IGl0IGlzIGtlZXBpbmcgb24gYmVoYWxmIG9mIHRoZSBjaGlsZC4g
VGhlIGNoaWxkIGV2ZW50dWFsbHkgY2FsbHMgZXhlYyBhbmQgdGhlIHJlc291cmNlIGlzIGNsb3Nl
ZCBiZWNhdXNlIHRoZSBjbG9zZS1vbi1leGVjIGZsYWcgaXMgc2V0LioNCg0KT3VyIGZpcnN0IGF0
dGVtcHQsIHdoaWNoIHdhcyB0byB1c2UgdGhlIHB0aHJlYWRfYXRmb3JrKCkgaGFuZGxlcnMsIGZh
aWxlZCBiZWNhdXNlIHN5c3RlbSgpIGlzIG5vdCByZXF1aXJlZCB0byBjYWxsIHRoZSBoYW5kbGVy
cy4NCg0KTW9zdCBvZiB0aGUgZmVlZGJhY2sgd2UncmUgZ2V0dGluZyBvbiB0aGlzIHNlZW1zIHRv
IHNheSAiZG9uJ3QgdXNlIHN5c3RlbSgpLCBpdCBpcyB1bnNhZmUgZm9yIHRocmVhZGVkIGFwcGxp
Y2F0aW9ucyIuIElzIHRoYXQgZG9jdW1lbnRlZCBhbnl3aGVyZT8gVGhlIG1hbiBwYWdlIHNheXMg
aXQgaXMgIk1ULVNhZmUiLg0KDQpBc2lkZSBmcm9tIHRoYXQsIGV2ZW4gaWYgd2UgcmVtb3ZlIGFs
bCB1c2VzIG9mIHN5c3RlbSgpIGZyb20gb3VyIGFwcGxpY2F0aW9uICh3aGljaCB3ZSBhbHJlYWR5
IGhhdmUpLCB0aGVuIG91ciBhcHBsaWNhdGlvbiwgbGlrZSBtYW55IG90aGVyIGFwcGxpY2F0aW9u
cywgbmVlZHMgdG8gdXNlIHRoaXJkLXBhcnR5IHNoYXJlZCBsaWJyYXJpZXMuIFRoZXJlIGlzIG5v
dGhpbmcgdGhhdCBwcmV2ZW50cyB0aG9zZSBsaWJyYXJpZXMgZnJvbSB1c2luZyBzeXN0ZW0oKS4g
V2UgY2FuIGF1ZGl0IHRob3NlIGxpYnJhcmllcyBhbmQgZ28gYmFjayB3aXRoIHRoZSB2ZW5kb3Ig
d2l0aCBhIHJlcXVlc3QgdG8gcmVwbGFjZSBzeXN0ZW0oKSB3aXRoIGEgc3RhbmRhcmQgZm9yay9l
eGVjLCBidXQgdGhleSB3aWxsIGFsc28gd2FudCBkb2N1bWVudGF0aW9uIHN1cHBvcnRpbmcgdGhh
dC4NCg0KV2UgY2FuIGFsc28gdGFrZSBzdGVwcyB0byBjaGFuZ2Ugb3IgcmVtb3ZlIHN5c3RlbSgp
IGZyb20gb3VyIHN0YW5kYXJkIGxpYnJhcnkuIEl0IGZpeGVzIG91ciBpc3N1ZSwgYnV0IHN0aWxs
IGxlYXZlcyB0aGUgY29tbXVuaXR5IHdpdGggYW4gQVBJIHRoYXQgaXMgYnJva2VuL2ZsYXdlZC9w
b29ybHktZG9jdW1lbnRlZCAoZGVwZW5kaW5nIG9uIGhvdyBvbmUgbG9va3MgYXQgaXQpLg0KDQpJ
ZiB0aGUgZmVlZGJhY2sgZnJvbSB0aGUgY29tbXVuaXR5IGlzIHRydWx5IGFuZCBmaW5hbGx5IHRo
YXQgc3lzdGVtKCkgc2hvdWxkIG5vdCBiZSB1c2VkIGluIHRoZXNlIGFwcGxpY2F0aW9ucywgdGhl
biBpcyB0aGVyZSBzdXBwb3J0IGZvciB1cGRhdGluZyB0aGUgbWFuIHBhZ2UgdG8gYmV0dGVyIGNv
bW11bmljYXRlIHRoYXQ/DQoNClRoYW5rcyBmb3IgeW91ciBoZWxwIHdpdGggdGhpcy4NCg0KTmF0
ZQ0KDQotLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KRnJvbTogSmFtZXMgQm90dG9tbGV5IDxK
YW1lcy5Cb3R0b21sZXlASGFuc2VuUGFydG5lcnNoaXAuY29tPg0KU2VudDogRnJpZGF5LCBNYXkg
MTUsIDIwMjAgMTE6MjYNClRvOiBLYXJzdGVucywgTmF0ZSA8TmF0ZS5LYXJzdGVuc0BnYXJtaW4u
Y29tPjsgTWF0dGhldyBXaWxjb3ggPHdpbGx5QGluZnJhZGVhZC5vcmc+DQpDYzogQWxleGFuZGVy
IFZpcm8gPHZpcm9AemVuaXYubGludXgub3JnLnVrPjsgSmVmZiBMYXl0b24gPGpsYXl0b25Aa2Vy
bmVsLm9yZz47IEouIEJydWNlIEZpZWxkcyA8YmZpZWxkc0BmaWVsZHNlcy5vcmc+OyBBcm5kIEJl
cmdtYW5uIDxhcm5kQGFybmRiLmRlPjsgUmljaGFyZCBIZW5kZXJzb24gPHJ0aEB0d2lkZGxlLm5l
dD47IEl2YW4gS29rc2hheXNreSA8aW5rQGp1cmFzc2ljLnBhcmsubXN1LnJ1PjsgTWF0dCBUdXJu
ZXIgPG1hdHRzdDg4QGdtYWlsLmNvbT47IEhlbGdlIERlbGxlciA8ZGVsbGVyQGdteC5kZT47IERh
dmlkIFMuIE1pbGxlciA8ZGF2ZW1AZGF2ZW1sb2Z0Lm5ldD47IEpha3ViIEtpY2luc2tpIDxrdWJh
QGtlcm5lbC5vcmc+OyBFcmljIER1bWF6ZXQgPGVkdW1hemV0QGdvb2dsZS5jb20+OyBEYXZpZCBM
YWlnaHQgPERhdmlkLkxhaWdodEBhY3VsYWIuY29tPjsgbGludXgtZnNkZXZlbEB2Z2VyLmtlcm5l
bC5vcmc7IGxpbnV4LWFyY2hAdmdlci5rZXJuZWwub3JnOyBsaW51eC1hbHBoYUB2Z2VyLmtlcm5l
bC5vcmc7IGxpbnV4LXBhcmlzY0B2Z2VyLmtlcm5lbC5vcmc7IHNwYXJjbGludXhAdmdlci5rZXJu
ZWwub3JnOyBuZXRkZXZAdmdlci5rZXJuZWwub3JnOyBsaW51eC1rZXJuZWxAdmdlci5rZXJuZWwu
b3JnOyBDaGFuZ2xpIEdhbyA8eGlhb3N1b0BnbWFpbC5jb20+OyBhLmpvc2V5QG9wZW5ncm91cC5v
cmcNClN1YmplY3Q6IFJlOiBbUEFUQ0ggdjJdIEltcGxlbWVudCBjbG9zZS1vbi1mb3JrDQoNCkNB
VVRJT04gLSBFWFRFUk5BTCBFTUFJTDogRG8gbm90IGNsaWNrIGFueSBsaW5rcyBvciBvcGVuIGFu
eSBhdHRhY2htZW50cyB1bmxlc3MgeW91IHRydXN0IHRoZSBzZW5kZXIgYW5kIGtub3cgdGhlIGNv
bnRlbnQgaXMgc2FmZS4NCg0KDQpPbiBGcmksIDIwMjAtMDUtMTUgYXQgMTY6MDcgKzAwMDAsIEth
cnN0ZW5zLCBOYXRlIHdyb3RlOg0KPiBNYXR0aGV3LA0KPg0KPiBXaGF0IGFsdGVybmF0aXZlIHdv
dWxkIHlvdSBzdWdnZXN0Pw0KPg0KPiBGcm9tIGFuIGVhcmxpZXIgZW1haWw6DQo+DQo+ID4gLi4u
bm90aGluZyBlbHNlIGFkZHJlc3NlcyB0aGUgdW5kZXJseWluZyBpc3N1ZTogdGhlcmUgaXMgbm8g
d2F5IHRvDQo+ID4gcHJldmVudCBhIGZvcmsoKSBmcm9tIGR1cGxpY2F0aW5nIHRoZSByZXNvdXJj
ZS4gVGhlIGNsb3NlLW9uLWV4ZWMNCj4gPiBmbGFnIHBhcnRpYWxseS1hZGRyZXNzZXMgdGhpcyBi
eSBhbGxvd2luZyB0aGUgcGFyZW50IHByb2Nlc3MgdG8gbWFyaw0KPiA+IGEgZmlsZSBkZXNjcmlw
dG9yIGFzIGV4Y2x1c2l2ZSB0byBpdHNlbGYsIGJ1dCB0aGVyZSBpcyBzdGlsbCBhDQo+ID4gcGVy
aW9kIG9mIHRpbWUgdGhlIGZhaWx1cmUgY2FuIG9jY3VyIGJlY2F1c2UgdGhlIGF1dG8tY2xvc2Ug
b25seQ0KPiA+IG9jY3VycyBkdXJpbmcgdGhlIGV4ZWMoKS4gUGVyaGFwcyB0aGlzIHdvdWxkIG5v
dCBiZSBhbiBpc3N1ZSB3aXRoIGENCj4gPiBkaWZmZXJlbnQgcHJvY2Vzcy90aHJlYWRpbmcgbW9k
ZWwsIGJ1dCB0aGF0IGlzIGFub3RoZXIgZGlzY3Vzc2lvbg0KPiA+IGVudGlyZWx5Lg0KPg0KPiBE
byB5b3UgZGlzYWdyZWUgdGhlcmUgaXMgYW4gaXNzdWU/DQoNCk9oIGdvb2QgZ3JpZWYgdGhhdCdz
IGEgbGVhZGluZyBxdWVzdGlvbjogV2hlbiBJIHdyaXRlIGJhZCBjb2RlIGFuZCBpdCBjcmFzaGVz
LCBtb3N0IHBlb3BsZSB3b3VsZCBhZ3JlZSB0aGVyZSBpcyBhbiBpc3N1ZTsgdmVyeSBmZXcgd291
bGQgYWdyZWUgdGhlIGtlcm5lbCBzaG91bGQgYmUgY2hhbmdlZCB0byBmaXggaXQuIFNldmVyYWwg
b2YgdXMgaGF2ZSBhbHJlYWR5IHNhaWQgdGhlIHByb2JsZW0gc2VlbXMgdG8gYmUgd2l0aCB0aGUg
d2F5IHlvdXIgYXBwbGljYXRpb24gaXMgd3JpdHRlbi4gIFlvdSBkaWRuJ3QgZXZlbiBhbnN3ZXIg
ZW1haWxzIGxpa2UgdGhpcyBzcGVjdWxhdGluZyBhYm91dCB0aGUgY2F1c2UgYmVpbmcgdGhlIHdh
eSB5b3VyIGFwcGxpY2F0aW9uIGNvdW50cyByZXNvdXJjZXM6DQoNCmh0dHBzOi8vbG9yZS5rZXJu
ZWwub3JnL2xpbnV4LWZzZGV2ZWwvMTU4NzU2OTY2My4zNDg1LjE4LmNhbWVsQEhhbnNlblBhcnRu
ZXJzaGlwLmNvbS8NCg0KVGhlIGJvdHRvbSBsaW5lIGlzIHRoYXQgd2UgdGhpbmsgeW91IGNvdWxk
IHJld3JpdGUgdGhpcyBvbmUgYXBwbGljYXRpb24gbm90IHRvIGhhdmUgdGhlIHByb2JsZW0geW91
J3JlIGNvbXBsYWluaW5nIGFib3V0IHJhdGhlciB0aGFuIGludHJvZHVjZSBhIG5ldyBrZXJuZWwg
QVBJIHRvICJmaXgiIGl0Lg0KDQpKYW1lcw0KDQoNCg0KDQpfX19fX19fX19fX19fX19fX19fX19f
X19fX19fX19fXw0KDQpDT05GSURFTlRJQUxJVFkgTk9USUNFOiBUaGlzIGVtYWlsIGFuZCBhbnkg
YXR0YWNobWVudHMgYXJlIGZvciB0aGUgc29sZSB1c2Ugb2YgdGhlIGludGVuZGVkIHJlY2lwaWVu
dChzKSBhbmQgY29udGFpbiBpbmZvcm1hdGlvbiB0aGF0IG1heSBiZSBHYXJtaW4gY29uZmlkZW50
aWFsIGFuZC9vciBHYXJtaW4gbGVnYWxseSBwcml2aWxlZ2VkLiBJZiB5b3UgaGF2ZSByZWNlaXZl
ZCB0aGlzIGVtYWlsIGluIGVycm9yLCBwbGVhc2Ugbm90aWZ5IHRoZSBzZW5kZXIgYnkgcmVwbHkg
ZW1haWwgYW5kIGRlbGV0ZSB0aGUgbWVzc2FnZS4gQW55IGRpc2Nsb3N1cmUsIGNvcHlpbmcsIGRp
c3RyaWJ1dGlvbiBvciB1c2Ugb2YgdGhpcyBjb21tdW5pY2F0aW9uIChpbmNsdWRpbmcgYXR0YWNo
bWVudHMpIGJ5IHNvbWVvbmUgb3RoZXIgdGhhbiB0aGUgaW50ZW5kZWQgcmVjaXBpZW50IGlzIHBy
b2hpYml0ZWQuIFRoYW5rIHlvdS4NCg=

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 18:28         ` Karstens, Nate
  (?)
@ 2020-05-15 18:43           ` Matthew Wilcox
  -1 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 18:43 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: James Bottomley, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Helge Deller, David S. Miller, Jakub Kicinski, Eric Dumazet,
	David Laight, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, Changli Gao,
	a.josey

On Fri, May 15, 2020 at 06:28:20PM +0000, Karstens, Nate wrote:
> Our first attempt, which was to use the pthread_atfork() handlers, failed because system() is not required to call the handlers.
> 
> Most of the feedback we're getting on this seems to say "don't use system(), it is unsafe for threaded applications". Is that documented anywhere? The man page says it is "MT-Safe".

https://pubs.opengroup.org/onlinepubs/9699919799/functions/system.html

> Aside from that, even if we remove all uses of system() from our application (which we already have), then our application, like many other applications, needs to use third-party shared libraries. There is nothing that prevents those libraries from using system(). We can audit those libraries and go back with the vendor with a request to replace system() with a standard fork/exec, but they will also want documentation supporting that.

They might also be using any number of other interfaces which aren't
thread-safe.

> If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?

Yes, absolutely.

I think you're mischaracterising our feedback.  It's not "We don't want
to fix this interface".  It's "We don't want to slow down everything else
to fix this interface (and by the way there are lots of other problems
with this interface that you aren't even trying to address)".

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 18:43           ` Matthew Wilcox
  0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 18:43 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: James Bottomley, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Helge Deller, David S. Miller, Jakub Kicinski, Eric Dumazet,
	David Laight, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux

On Fri, May 15, 2020 at 06:28:20PM +0000, Karstens, Nate wrote:
> Our first attempt, which was to use the pthread_atfork() handlers, failed because system() is not required to call the handlers.
> 
> Most of the feedback we're getting on this seems to say "don't use system(), it is unsafe for threaded applications". Is that documented anywhere? The man page says it is "MT-Safe".

https://pubs.opengroup.org/onlinepubs/9699919799/functions/system.html

> Aside from that, even if we remove all uses of system() from our application (which we already have), then our application, like many other applications, needs to use third-party shared libraries. There is nothing that prevents those libraries from using system(). We can audit those libraries and go back with the vendor with a request to replace system() with a standard fork/exec, but they will also want documentation supporting that.

They might also be using any number of other interfaces which aren't
thread-safe.

> If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?

Yes, absolutely.

I think you're mischaracterising our feedback.  It's not "We don't want
to fix this interface".  It's "We don't want to slow down everything else
to fix this interface (and by the way there are lots of other problems
with this interface that you aren't even trying to address)".

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-15 18:43           ` Matthew Wilcox
  0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2020-05-15 18:43 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: James Bottomley, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Helge Deller, David S. Miller, Jakub Kicinski, Eric Dumazet,
	David Laight, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux

On Fri, May 15, 2020 at 06:28:20PM +0000, Karstens, Nate wrote:
> Our first attempt, which was to use the pthread_atfork() handlers, failed because system() is not required to call the handlers.
> 
> Most of the feedback we're getting on this seems to say "don't use system(), it is unsafe for threaded applications". Is that documented anywhere? The man page says it is "MT-Safe".

https://pubs.opengroup.org/onlinepubs/9699919799/functions/system.html

> Aside from that, even if we remove all uses of system() from our application (which we already have), then our application, like many other applications, needs to use third-party shared libraries. There is nothing that prevents those libraries from using system(). We can audit those libraries and go back with the vendor with a request to replace system() with a standard fork/exec, but they will also want documentation supporting that.

They might also be using any number of other interfaces which aren't
thread-safe.

> If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?

Yes, absolutely.

I think you're mischaracterising our feedback.  It's not "We don't want
to fix this interface".  It's "We don't want to slow down everything else
to fix this interface (and by the way there are lots of other problems
with this interface that you aren't even trying to address)".

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 15:57   ` Matthew Wilcox
  (?)
@ 2020-05-16 13:29     ` Christian Brauner
  -1 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2020-05-16 13:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao, a.josey

On Fri, May 15, 2020 at 08:57:30AM -0700, Matthew Wilcox wrote:
> On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> > Series of 4 patches to implement close-on-fork. Tests have been
> > published to https://github.com/nkarstens/ltp/tree/close-on-fork
> > and cover close-on-fork functionality in the following syscalls:
> 
> [...]
> 
> > This functionality was approved by the Austin Common Standards
> > Revision Group for inclusion in the next revision of the POSIX
> > standard (see issue 1318 in the Austin Group Defect Tracker).
> 
> NAK to this patch series, and the entire concept.

Yeah.
But also, stuff like this should really be on linux-api.

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-16 13:29     ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2020-05-16 13:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli

On Fri, May 15, 2020 at 08:57:30AM -0700, Matthew Wilcox wrote:
> On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> > Series of 4 patches to implement close-on-fork. Tests have been
> > published to https://github.com/nkarstens/ltp/tree/close-on-fork
> > and cover close-on-fork functionality in the following syscalls:
> 
> [...]
> 
> > This functionality was approved by the Austin Common Standards
> > Revision Group for inclusion in the next revision of the POSIX
> > standard (see issue 1318 in the Austin Group Defect Tracker).
> 
> NAK to this patch series, and the entire concept.

Yeah.
But also, stuff like this should really be on linux-api.

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-16 13:29     ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2020-05-16 13:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli

On Fri, May 15, 2020 at 08:57:30AM -0700, Matthew Wilcox wrote:
> On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> > Series of 4 patches to implement close-on-fork. Tests have been
> > published to https://github.com/nkarstens/ltp/tree/close-on-fork
> > and cover close-on-fork functionality in the following syscalls:
> 
> [...]
> 
> > This functionality was approved by the Austin Common Standards
> > Revision Group for inclusion in the next revision of the POSIX
> > standard (see issue 1318 in the Austin Group Defect Tracker).
> 
> NAK to this patch series, and the entire concept.

Yeah.
But also, stuff like this should really be on linux-api.

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 18:28         ` Karstens, Nate
  (?)
@ 2020-05-25  8:16           ` Pavel Machek
  -1 siblings, 0 replies; 61+ messages in thread
From: Pavel Machek @ 2020-05-25  8:16 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: James Bottomley, Matthew Wilcox, Alexander Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao, a.josey

[-- Attachment #1: Type: text/plain, Size: 552 bytes --]

Hi!

> 
> If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?
> 

Clarifying documenation might be the best way forward. Note you'd have
to do that anyway, since people would not know about O_CLOFORK without
pointers in documentation.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-25  8:16           ` Pavel Machek
  0 siblings, 0 replies; 61+ messages in thread
From: Pavel Machek @ 2020-05-25  8:16 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: James Bottomley, Matthew Wilcox, Alexander Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 552 bytes --]

Hi!

> 
> If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?
> 

Clarifying documenation might be the best way forward. Note you'd have
to do that anyway, since people would not know about O_CLOFORK without
pointers in documentation.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2020-05-25  8:16           ` Pavel Machek
  0 siblings, 0 replies; 61+ messages in thread
From: Pavel Machek @ 2020-05-25  8:16 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: James Bottomley, Matthew Wilcox, Alexander Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 552 bytes --]

Hi!

> 
> If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?
> 

Clarifying documenation might be the best way forward. Note you'd have
to do that anyway, since people would not know about O_CLOFORK without
pointers in documentation.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2020-05-15 15:23 ` Nate Karstens
                   ` (8 preceding siblings ...)
  (?)
@ 2022-06-18 11:41 ` Ralph Corderoy
  2022-06-18 19:40     ` Matthew Wilcox
  -1 siblings, 1 reply; 61+ messages in thread
From: Ralph Corderoy @ 2022-06-18 11:41 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao

Hi Nate,

> One manifestation of this is a race conditions in system(), which
> (depending on the implementation) is non-atomic in that it first calls
> a fork() and then an exec().

The need for O_CLOFORK might be made more clear by looking at a
long-standing Go issue, i.e. unrelated to system(3), which was started
in 2017 by Russ Cox when he summed up the current race-condition
behaviour of trying to execve(2) a newly created file:
https://github.com/golang/go/issues/22315.  I raised it on linux-kernel
in 2017, https://marc.info/?l=linux-kernel&m=150834137201488, and linked
to a proposed patch from 2011, ‘[PATCH] fs: add FD_CLOFORK and
O_CLOFORK’ by Changli Gao.  As I said, long-standing.

The Go issue is worth a read.  Russ wondered ‘What would Java do’ only
to find that Java already had an issue open for the same problem since
2014.

I think the kernel is the place to fix the problem, just as with
FD_CLOEXEC/O_CLOEXEC.  Ian Lance Taylor says on the Go issue that it
looks like ‘Solaris and macOS and OpenBSD have O_CLOFORK already.
Hopefully it will catch on further’.

-- 
Cheers, Ralph.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2022-06-18 11:41 ` Ralph Corderoy
@ 2022-06-18 19:40     ` Matthew Wilcox
  0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2022-06-18 19:40 UTC (permalink / raw)
  To: Ralph Corderoy
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao

On Sat, Jun 18, 2022 at 12:41:11PM +0100, Ralph Corderoy wrote:
> Hi Nate,
> 
> > One manifestation of this is a race conditions in system(), which
> > (depending on the implementation) is non-atomic in that it first calls
> > a fork() and then an exec().
> 
> The need for O_CLOFORK might be made more clear by looking at a
> long-standing Go issue, i.e. unrelated to system(3), which was started
> in 2017 by Russ Cox when he summed up the current race-condition
> behaviour of trying to execve(2) a newly created file:
> https://github.com/golang/go/issues/22315.  I raised it on linux-kernel
> in 2017, https://marc.info/?l=linux-kernel&m=150834137201488, and linked
> to a proposed patch from 2011, ‘[PATCH] fs: add FD_CLOFORK and
> O_CLOFORK’ by Changli Gao.  As I said, long-standing.

The problem is that people advocating for O_CLOFORK understand its
value, but not its cost.  Other google employees have a system which has
literally millions of file descriptors in a single process.  Having to
maintain this extra state per-fd is a cost they don't want to pay
(and have been quite vocal about earlier in this thread).

Fundamentally, fork()+exec() is a terrible model.  Mind you, so is
spawn().  I haven't seen a good model yet.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2022-06-18 19:40     ` Matthew Wilcox
  0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2022-06-18 19:40 UTC (permalink / raw)
  To: Ralph Corderoy
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli

On Sat, Jun 18, 2022 at 12:41:11PM +0100, Ralph Corderoy wrote:
> Hi Nate,
> 
> > One manifestation of this is a race conditions in system(), which
> > (depending on the implementation) is non-atomic in that it first calls
> > a fork() and then an exec().
> 
> The need for O_CLOFORK might be made more clear by looking at a
> long-standing Go issue, i.e. unrelated to system(3), which was started
> in 2017 by Russ Cox when he summed up the current race-condition
> behaviour of trying to execve(2) a newly created file:
> https://github.com/golang/go/issues/22315.  I raised it on linux-kernel
> in 2017, https://marc.info/?l=linux-kernel&m=150834137201488, and linked
> to a proposed patch from 2011, ‘[PATCH] fs: add FD_CLOFORK and
> O_CLOFORK’ by Changli Gao.  As I said, long-standing.

The problem is that people advocating for O_CLOFORK understand its
value, but not its cost.  Other google employees have a system which has
literally millions of file descriptors in a single process.  Having to
maintain this extra state per-fd is a cost they don't want to pay
(and have been quite vocal about earlier in this thread).

Fundamentally, fork()+exec() is a terrible model.  Mind you, so is
spawn().  I haven't seen a good model yet.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2022-06-18 19:40     ` Matthew Wilcox
@ 2022-06-19 10:42       ` Ralph Corderoy
  -1 siblings, 0 replies; 61+ messages in thread
From: Ralph Corderoy @ 2022-06-19 10:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao

Hi Matthew, thanks for replying.

> > The need for O_CLOFORK might be made more clear by looking at a
> > long-standing Go issue, i.e. unrelated to system(3), which was started
> > in 2017 by Russ Cox when he summed up the current race-condition
> > behaviour of trying to execve(2) a newly created file:
> > https://github.com/golang/go/issues/22315.
>
> The problem is that people advocating for O_CLOFORK understand its
> value, but not its cost.  Other google employees have a system which
> has literally millions of file descriptors in a single process.
> Having to maintain this extra state per-fd is a cost they don't want
> to pay (and have been quite vocal about earlier in this thread).

So do you agree the userspace issue is best solved by *_CLOFORK and the
problem is how to implement *_CLOFORK at an acceptable cost?

OTOH David Laight was making suggestions on moving the load to the
fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
‘portable solution’, though that could have been to a specific issue
rather than the more general case.

How would you recommend approaching an acceptable cost is progressed?
Iterate on patch versions?  Open a bugzilla.kernel.org for central
tracking and linking from the other projects?  ..?

-- 
Cheers, Ralph.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2022-06-19 10:42       ` Ralph Corderoy
  0 siblings, 0 replies; 61+ messages in thread
From: Ralph Corderoy @ 2022-06-19 10:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, Eric Dumazet, David Laight, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli

Hi Matthew, thanks for replying.

> > The need for O_CLOFORK might be made more clear by looking at a
> > long-standing Go issue, i.e. unrelated to system(3), which was started
> > in 2017 by Russ Cox when he summed up the current race-condition
> > behaviour of trying to execve(2) a newly created file:
> > https://github.com/golang/go/issues/22315.
>
> The problem is that people advocating for O_CLOFORK understand its
> value, but not its cost.  Other google employees have a system which
> has literally millions of file descriptors in a single process.
> Having to maintain this extra state per-fd is a cost they don't want
> to pay (and have been quite vocal about earlier in this thread).

So do you agree the userspace issue is best solved by *_CLOFORK and the
problem is how to implement *_CLOFORK at an acceptable cost?

OTOH David Laight was making suggestions on moving the load to the
fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
‘portable solution’, though that could have been to a specific issue
rather than the more general case.

How would you recommend approaching an acceptable cost is progressed?
Iterate on patch versions?  Open a bugzilla.kernel.org for central
tracking and linking from the other projects?  ..?

-- 
Cheers, Ralph.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2022-06-19 10:42       ` Ralph Corderoy
@ 2022-06-28 13:13         ` Christian Brauner
  -1 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2022-06-28 13:13 UTC (permalink / raw)
  To: Ralph Corderoy
  Cc: Matthew Wilcox, Nate Karstens, Alexander Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev, linux-kernel, Changli Gao

On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> Hi Matthew, thanks for replying.
> 
> > > The need for O_CLOFORK might be made more clear by looking at a
> > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > in 2017 by Russ Cox when he summed up the current race-condition
> > > behaviour of trying to execve(2) a newly created file:
> > > https://github.com/golang/go/issues/22315.
> >
> > The problem is that people advocating for O_CLOFORK understand its
> > value, but not its cost.  Other google employees have a system which
> > has literally millions of file descriptors in a single process.
> > Having to maintain this extra state per-fd is a cost they don't want
> > to pay (and have been quite vocal about earlier in this thread).
> 
> So do you agree the userspace issue is best solved by *_CLOFORK and the
> problem is how to implement *_CLOFORK at an acceptable cost?
> 
> OTOH David Laight was making suggestions on moving the load to the
> fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> ‘portable solution’, though that could have been to a specific issue
> rather than the more general case.
> 
> How would you recommend approaching an acceptable cost is progressed?
> Iterate on patch versions?  Open a bugzilla.kernel.org for central
> tracking and linking from the other projects?  ..?

Quoting from that go thread

"If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure would help.)"

So why can't this be solved with:
close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
e.g.
close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?

https://man7.org/linux/man-pages/man2/close_range.2.html

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2022-06-28 13:13         ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2022-06-28 13:13 UTC (permalink / raw)
  To: Ralph Corderoy
  Cc: Matthew Wilcox, Nate Karstens, Alexander Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, David Laight,
	linux-fsdevel, linux-arch, linux-alpha, linux-parisc, sparclinux,
	netdev

On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> Hi Matthew, thanks for replying.
> 
> > > The need for O_CLOFORK might be made more clear by looking at a
> > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > in 2017 by Russ Cox when he summed up the current race-condition
> > > behaviour of trying to execve(2) a newly created file:
> > > https://github.com/golang/go/issues/22315.
> >
> > The problem is that people advocating for O_CLOFORK understand its
> > value, but not its cost.  Other google employees have a system which
> > has literally millions of file descriptors in a single process.
> > Having to maintain this extra state per-fd is a cost they don't want
> > to pay (and have been quite vocal about earlier in this thread).
> 
> So do you agree the userspace issue is best solved by *_CLOFORK and the
> problem is how to implement *_CLOFORK at an acceptable cost?
> 
> OTOH David Laight was making suggestions on moving the load to the
> fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> ‘portable solution’, though that could have been to a specific issue
> rather than the more general case.
> 
> How would you recommend approaching an acceptable cost is progressed?
> Iterate on patch versions?  Open a bugzilla.kernel.org for central
> tracking and linking from the other projects?  ..?

Quoting from that go thread

"If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure would help.)"

So why can't this be solved with:
close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
e.g.
close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?

https://man7.org/linux/man-pages/man2/close_range.2.html

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
  2022-06-28 13:13         ` Christian Brauner
@ 2022-06-28 13:38           ` David Laight
  -1 siblings, 0 replies; 61+ messages in thread
From: David Laight @ 2022-06-28 13:38 UTC (permalink / raw)
  To: 'Christian Brauner', Ralph Corderoy
  Cc: Matthew Wilcox, Nate Karstens, Alexander Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao

From: Christian Brauner
> Sent: 28 June 2022 14:13
> 
> On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> > Hi Matthew, thanks for replying.
> >
> > > > The need for O_CLOFORK might be made more clear by looking at a
> > > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > > in 2017 by Russ Cox when he summed up the current race-condition
> > > > behaviour of trying to execve(2) a newly created file:
> > > > https://github.com/golang/go/issues/22315.
> > >
> > > The problem is that people advocating for O_CLOFORK understand its
> > > value, but not its cost.  Other google employees have a system which
> > > has literally millions of file descriptors in a single process.
> > > Having to maintain this extra state per-fd is a cost they don't want
> > > to pay (and have been quite vocal about earlier in this thread).
> >
> > So do you agree the userspace issue is best solved by *_CLOFORK and the
> > problem is how to implement *_CLOFORK at an acceptable cost?
> >
> > OTOH David Laight was making suggestions on moving the load to the
> > fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> > ‘portable solution’, though that could have been to a specific issue
> > rather than the more general case.
> >
> > How would you recommend approaching an acceptable cost is progressed?
> > Iterate on patch versions?  Open a bugzilla.kernel.org for central
> > tracking and linking from the other projects?  ..?
> 
> Quoting from that go thread
> 
> "If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure
> would help.)"
> 
> So why can't this be solved with:
> close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
> e.g.
> close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?

That is a relatively recent linux system call.
Although it can be (mostly) emulated by reading /proc/fd
- but that may not be mounted.

In any case another thread can open an fd between the close_range()
and fork() calls.

(I can't remember what I said before :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v2] Implement close-on-fork
@ 2022-06-28 13:38           ` David Laight
  0 siblings, 0 replies; 61+ messages in thread
From: David Laight @ 2022-06-28 13:38 UTC (permalink / raw)
  To: 'Christian Brauner', Ralph Corderoy
  Cc: Matthew Wilcox, Nate Karstens, Alexander Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc

From: Christian Brauner
> Sent: 28 June 2022 14:13
> 
> On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> > Hi Matthew, thanks for replying.
> >
> > > > The need for O_CLOFORK might be made more clear by looking at a
> > > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > > in 2017 by Russ Cox when he summed up the current race-condition
> > > > behaviour of trying to execve(2) a newly created file:
> > > > https://github.com/golang/go/issues/22315.
> > >
> > > The problem is that people advocating for O_CLOFORK understand its
> > > value, but not its cost.  Other google employees have a system which
> > > has literally millions of file descriptors in a single process.
> > > Having to maintain this extra state per-fd is a cost they don't want
> > > to pay (and have been quite vocal about earlier in this thread).
> >
> > So do you agree the userspace issue is best solved by *_CLOFORK and the
> > problem is how to implement *_CLOFORK at an acceptable cost?
> >
> > OTOH David Laight was making suggestions on moving the load to the
> > fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> > ‘portable solution’, though that could have been to a specific issue
> > rather than the more general case.
> >
> > How would you recommend approaching an acceptable cost is progressed?
> > Iterate on patch versions?  Open a bugzilla.kernel.org for central
> > tracking and linking from the other projects?  ..?
> 
> Quoting from that go thread
> 
> "If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure
> would help.)"
> 
> So why can't this be solved with:
> close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
> e.g.
> close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?

That is a relatively recent linux system call.
Although it can be (mostly) emulated by reading /proc/fd
- but that may not be mounted.

In any case another thread can open an fd between the close_range()
and fork() calls.

(I can't remember what I said before :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
  2022-06-28 13:38           ` David Laight
@ 2022-06-28 13:43             ` Christian Brauner
  -1 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2022-06-28 13:43 UTC (permalink / raw)
  To: David Laight
  Cc: Ralph Corderoy, Matthew Wilcox, Nate Karstens, Alexander Viro,
	Jeff Layton, J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel, Changli Gao

On Tue, Jun 28, 2022 at 01:38:07PM +0000, David Laight wrote:
> From: Christian Brauner
> > Sent: 28 June 2022 14:13
> > 
> > On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> > > Hi Matthew, thanks for replying.
> > >
> > > > > The need for O_CLOFORK might be made more clear by looking at a
> > > > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > > > in 2017 by Russ Cox when he summed up the current race-condition
> > > > > behaviour of trying to execve(2) a newly created file:
> > > > > https://github.com/golang/go/issues/22315.
> > > >
> > > > The problem is that people advocating for O_CLOFORK understand its
> > > > value, but not its cost.  Other google employees have a system which
> > > > has literally millions of file descriptors in a single process.
> > > > Having to maintain this extra state per-fd is a cost they don't want
> > > > to pay (and have been quite vocal about earlier in this thread).
> > >
> > > So do you agree the userspace issue is best solved by *_CLOFORK and the
> > > problem is how to implement *_CLOFORK at an acceptable cost?
> > >
> > > OTOH David Laight was making suggestions on moving the load to the
> > > fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> > > ‘portable solution’, though that could have been to a specific issue
> > > rather than the more general case.
> > >
> > > How would you recommend approaching an acceptable cost is progressed?
> > > Iterate on patch versions?  Open a bugzilla.kernel.org for central
> > > tracking and linking from the other projects?  ..?
> > 
> > Quoting from that go thread
> > 
> > "If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure
> > would help.)"
> > 
> > So why can't this be solved with:
> > close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
> > e.g.
> > close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
> 
> That is a relatively recent linux system call.
> Although it can be (mostly) emulated by reading /proc/fd
> - but that may not be mounted.
> 
> In any case another thread can open an fd between the close_range()
> and fork() calls.

The CLOSE_RANGE_UNSHARE gives the calling thread a private file
descriptor table before marking fs close-on-exec.

close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2] Implement close-on-fork
@ 2022-06-28 13:43             ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2022-06-28 13:43 UTC (permalink / raw)
  To: David Laight
  Cc: Ralph Corderoy, Matthew Wilcox, Nate Karstens, Alexander Viro,
	Jeff Layton, J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, Eric Dumazet, linux-fsdevel,
	linux-arch, linux-alpha@vger.kernel.org

On Tue, Jun 28, 2022 at 01:38:07PM +0000, David Laight wrote:
> From: Christian Brauner
> > Sent: 28 June 2022 14:13
> > 
> > On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> > > Hi Matthew, thanks for replying.
> > >
> > > > > The need for O_CLOFORK might be made more clear by looking at a
> > > > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > > > in 2017 by Russ Cox when he summed up the current race-condition
> > > > > behaviour of trying to execve(2) a newly created file:
> > > > > https://github.com/golang/go/issues/22315.
> > > >
> > > > The problem is that people advocating for O_CLOFORK understand its
> > > > value, but not its cost.  Other google employees have a system which
> > > > has literally millions of file descriptors in a single process.
> > > > Having to maintain this extra state per-fd is a cost they don't want
> > > > to pay (and have been quite vocal about earlier in this thread).
> > >
> > > So do you agree the userspace issue is best solved by *_CLOFORK and the
> > > problem is how to implement *_CLOFORK at an acceptable cost?
> > >
> > > OTOH David Laight was making suggestions on moving the load to the
> > > fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> > > ‘portable solution’, though that could have been to a specific issue
> > > rather than the more general case.
> > >
> > > How would you recommend approaching an acceptable cost is progressed?
> > > Iterate on patch versions?  Open a bugzilla.kernel.org for central
> > > tracking and linking from the other projects?  ..?
> > 
> > Quoting from that go thread
> > 
> > "If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure
> > would help.)"
> > 
> > So why can't this be solved with:
> > close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
> > e.g.
> > close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
> 
> That is a relatively recent linux system call.
> Although it can be (mostly) emulated by reading /proc/fd
> - but that may not be mounted.
> 
> In any case another thread can open an fd between the close_range()
> and fork() calls.

The CLOSE_RANGE_UNSHARE gives the calling thread a private file
descriptor table before marking fs close-on-exec.

close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2022-06-28 13:43 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-15 15:23 [PATCH v2] Implement close-on-fork Nate Karstens
2020-05-15 15:23 ` Nate Karstens
2020-05-15 15:23 ` Nate Karstens
2020-05-15 15:23 ` [PATCH v2 1/4] fs: " Nate Karstens
2020-05-15 15:23   ` Nate Karstens
2020-05-15 15:23   ` Nate Karstens
2020-05-15 15:23 ` [PATCH v2 2/4] fs: Add O_CLOFORK flag for open(2) and dup3(2) Nate Karstens
2020-05-15 15:23   ` Nate Karstens
2020-05-15 15:23   ` Nate Karstens
2020-05-15 15:23 ` [PATCH v2 3/4] fs: Add F_DUPFD_CLOFORK to fcntl(2) Nate Karstens
2020-05-15 15:23   ` Nate Karstens
2020-05-15 15:23   ` Nate Karstens
2020-05-15 15:23 ` [PATCH v2 4/4] net: Add SOCK_CLOFORK Nate Karstens
2020-05-15 15:23   ` Nate Karstens
2020-05-15 15:23   ` Nate Karstens
2020-05-15 15:30 ` [PATCH v2] Implement close-on-fork Eric Dumazet
2020-05-15 15:30   ` Eric Dumazet
2020-05-15 15:59   ` David Laight
2020-05-15 15:59     ` David Laight
2020-05-15 15:59     ` David Laight
2020-05-15 15:57 ` Matthew Wilcox
2020-05-15 15:57   ` Matthew Wilcox
2020-05-15 15:57   ` Matthew Wilcox
2020-05-15 16:07   ` Karstens, Nate
2020-05-15 16:07     ` Karstens, Nate
2020-05-15 16:25     ` James Bottomley
2020-05-15 16:25       ` James Bottomley
2020-05-15 16:25       ` James Bottomley
2020-05-15 18:28       ` Karstens, Nate
2020-05-15 18:28         ` Karstens, Nate
2020-05-15 18:28         ` Karstens, Nate
2020-05-15 18:43         ` Matthew Wilcox
2020-05-15 18:43           ` Matthew Wilcox
2020-05-15 18:43           ` Matthew Wilcox
2020-05-25  8:16         ` Pavel Machek
2020-05-25  8:16           ` Pavel Machek
2020-05-25  8:16           ` Pavel Machek
2020-05-15 16:26     ` Matthew Wilcox
2020-05-15 16:26       ` Matthew Wilcox
2020-05-15 16:26       ` Matthew Wilcox
2020-05-16 13:29   ` Christian Brauner
2020-05-16 13:29     ` Christian Brauner
2020-05-16 13:29     ` Christian Brauner
2020-05-15 16:03 ` Al Viro
2020-05-15 16:03   ` Al Viro
2020-05-15 16:26   ` Karstens, Nate
2020-05-15 16:26     ` Karstens, Nate
2020-05-15 16:53   ` David Howells
2020-05-15 16:53     ` David Howells
2020-05-15 16:53     ` David Howells
2022-06-18 11:41 ` Ralph Corderoy
2022-06-18 19:40   ` Matthew Wilcox
2022-06-18 19:40     ` Matthew Wilcox
2022-06-19 10:42     ` Ralph Corderoy
2022-06-19 10:42       ` Ralph Corderoy
2022-06-28 13:13       ` Christian Brauner
2022-06-28 13:13         ` Christian Brauner
2022-06-28 13:38         ` David Laight
2022-06-28 13:38           ` David Laight
2022-06-28 13:43           ` Christian Brauner
2022-06-28 13:43             ` Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.