Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: "Karstens, Nate" <Nate.Karstens@garmin.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>,
	Jeff Layton <jlayton@kernel.org>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Arnd Bergmann <arnd@arndb.de>,
	Richard Henderson <rth@twiddle.net>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	Matt Turner <mattst88@gmail.com>,
	"James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>,
	Helge Deller <deller@gmx.de>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"linux-alpha@vger.kernel.org" <linux-alpha@vger.kernel.org>,
	"linux-parisc@vger.kernel.org" <linux-parisc@vger.kernel.org>,
	"sparclinux@vger.kernel.org" <sparclinux@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	David Laight <David.Laight@ACULAB.COM>,
	Matthew Wilcox <willy@infradead.org>
Cc: Changli Gao <xiaosuo@gmail.com>
Subject: RE: [PATCH 1/4] fs: Implement close-on-fork
Date: Wed, 22 Apr 2020 15:36:09 +0000
Message-ID: <fa6c5c9c7c434f878c94a7c984cd43ba@garmin.com> (raw)
In-Reply-To: <20200420071548.62112-2-nate.karstens@garmin.com>

Thanks for everyone's feedback so far. There have been a few questions on why this feature is necessary/desired, so I'll describe that here.

We are running Linux on an embedded system. The platform can change the IP address either according to a proprietary negotiation scheme or a manual setting. The application uses netlink to listen for IP address changes; when this occurs the application closes all of its sockets and re-opens them using the new address. A problem can occur if the application is simultaneously fork/exec-ing a new process. The parent process attempts to bind a new socket to a port that it had previously bound to (before the IP address change), only to fail because the child process continues to hold a socket bound to that port.

Our initial solution was to use pthread_atfork() handlers to lock a mutex and wait for the child process to close all of its sockets (as signaled through an eventfd) before the parent attempts to create them again. This doesn't work if the application uses system() anywhere. glibc does not invoke pthread_atfork() handlers; older versions did but this was removed, and the Linux manpage for system(2) notes that "According  to  POSIX.1,  it is unspecified whether handlers registered using pthread_atfork(3) are called during the execution of system().  In the glibc implementation, such handlers are not called. "

This issue was discussed in the Austin Group mailing list; the root message is here:

https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html

There was some skepticism about whether our practice of closing/reopening sockets was advisable. Regardless, it does expose what I believe to be something that was overlooked in the forking process model. We posted two solutions to the Austin Group defect tracker:

http://austingroupbugs.net/view.php?id=1317
http://austingroupbugs.net/view.php?id=1318

Ultimately the Austin Group felt that close-on-fork was the preferred approach. I think it's also worth pointing that out Solaris reportedly has this feature (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05359.html).

Cheers,

Nate

-----Original Message-----
From: Karstens, Nate <Nate.Karstens@garmin.com> 
Sent: Monday, April 20, 2020 02:16
To: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@HansenPartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Changli Gao <xiaosuo@gmail.com>; Karstens, Nate <Nate.Karstens@garmin.com>
Subject: [PATCH 1/4] fs: Implement close-on-fork

The close-on-fork flag causes the file descriptor to be closed atomically in the child process before the child process returns from fork(). Implement this feature and provide a method to get/set the close-on-fork flag using fcntl(2).

This functionality was approved by the Austin Common Standards Revision Group for inclusion in the next revision of the POSIX standard (see issue 1318 in the Austin Group Defect Tracker).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                             |  2 ++
 fs/file.c                              | 50 +++++++++++++++++++++++++-
 include/linux/fdtable.h                |  7 ++++
 include/linux/file.h                   |  2 ++
 include/uapi/asm-generic/fcntl.h       |  5 +--
 tools/include/uapi/asm-generic/fcntl.h |  5 +--
 6 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..23964abf4a1a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -335,10 +335,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		break;
 	case F_GETFD:
 		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+		err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
 		break;
 	case F_SETFD:
 		err = 0;
 		set_close_on_exec(fd, arg & FD_CLOEXEC);
+		set_close_on_fork(fd, arg & FD_CLOFORK);
 		break;
 	case F_GETFL:
 		err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..de7260ba718d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -57,6 +57,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 	memset((char *)nfdt->open_fds + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)nfdt->close_on_exec + cpy, 0, set);
+	memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+	memset((char *)nfdt->close_on_fork + cpy, 0, set);
 
 	cpy = BITBIT_SIZE(count);
 	set = BITBIT_SIZE(nfdt->max_fds) - cpy; @@ -118,7 +120,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = kvmalloc(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+				 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
 				 GFP_KERNEL_ACCOUNT);
 	if (!data)
 		goto out_arr;
@@ -126,6 +128,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 	data += nr / BITS_PER_BYTE;
+	fdt->close_on_fork = data;
+	data += nr / BITS_PER_BYTE;
 	fdt->full_fds_bits = data;
 
 	return fdt;
@@ -236,6 +240,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
 		__clear_bit(fd, fdt->close_on_exec);
 }
 
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable 
+*fdt) {
+	__set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct 
+fdtable *fdt) {
+	if (test_bit(fd, fdt->close_on_fork))
+		__clear_bit(fd, fdt->close_on_fork);
+}
+
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)  {
 	__set_bit(fd, fdt->open_fds);
@@ -290,6 +305,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
+	new_fdt->close_on_fork = newf->close_on_fork_init;
 	new_fdt->open_fds = newf->open_fds_init;
 	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
@@ -337,6 +353,12 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
+
+		if (test_bit(open_files - i, new_fdt->close_on_fork)) {
+			__clear_bit(open_files - i, new_fdt->open_fds);
+			f = NULL;
+		}
+
 		if (f) {
 			get_file(f);
 		} else {
@@ -453,6 +475,7 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
+		.close_on_fork	= init_files.close_on_fork_init,
 		.open_fds	= init_files.open_fds_init,
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
@@ -865,6 +888,31 @@ bool get_close_on_exec(unsigned int fd)
 	return res;
 }
 
+void set_close_on_fork(unsigned int fd, int flag) {
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (flag)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
+	spin_unlock(&files->file_lock);
+}
+
+bool get_close_on_fork(unsigned int fd) {
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	bool res;
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	res = close_on_fork(fd, fdt);
+	rcu_read_unlock();
+	return res;
+}
+
 static int do_dup2(struct files_struct *files,
 	struct file *file, unsigned fd, unsigned flags)
 __releases(&files->file_lock)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
+	unsigned long *close_on_fork;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->close_on_exec);  }
 
+static inline bool close_on_fork(unsigned int fd, const struct fdtable 
+*fdt) {
+	return test_bit(fd, fdt->close_on_fork); }
+
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)  {
 	return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
+	unsigned long close_on_fork_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT]; diff --git a/include/linux/file.h b/include/linux/file.h index 142d102f285e..86fbb36b438b 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -85,6 +85,8 @@ extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);  extern int replace_fd(unsigned fd, struct file *file, unsigned flags);  extern void set_close_on_exec(unsigned int fd, int flag);  extern bool get_close_on_exec(unsigned int fd);
+extern void set_close_on_fork(unsigned int fd, int flag); extern bool 
+get_close_on_fork(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);  extern int get_unused_fd_flags(unsigned flags);  extern void put_unused_fd(unsigned int fd); diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
--
2.26.1


  parent reply index

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-20  7:15 Nate Karstens
2020-04-20  7:15 ` [PATCH 1/4] fs: " Nate Karstens
2020-04-20 10:25   ` Eric Dumazet
2020-04-22  3:38     ` Changli Gao
2020-04-22  3:41       ` Changli Gao
2020-04-22  8:35     ` David Laight
2020-05-01 14:45     ` Karstens, Nate
2020-05-01 15:23       ` Matthew Wilcox
2020-05-03 13:52       ` David Laight
2020-04-22 15:36   ` Karstens, Nate [this message]
2020-04-22 15:43     ` Matthew Wilcox
2020-04-22 16:02       ` Karstens, Nate
2020-04-22 16:31         ` Bernd Petrovitsch
2020-04-22 16:55           ` David Laight
2020-04-23 12:34             ` Bernd Petrovitsch
2020-04-20  7:15 ` [PATCH 2/4] fs: Add O_CLOFORK flag for open(2) and dup3(2) Nate Karstens
2020-04-20  7:15 ` [PATCH 3/4] fs: Add F_DUPFD_CLOFORK to fcntl(2) Nate Karstens
2020-04-20  7:15 ` [PATCH 4/4] net: Add SOCK_CLOFORK Nate Karstens
2020-04-22 14:32 ` Implement close-on-fork James Bottomley
2020-04-22 15:01 ` Al Viro
2020-04-22 15:18   ` Matthew Wilcox
2020-04-22 15:34     ` James Bottomley
2020-04-22 16:00     ` Al Viro
2020-04-22 16:13       ` Al Viro
2020-05-04 13:46       ` Karstens, Nate

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fa6c5c9c7c434f878c94a7c984cd43ba@garmin.com \
    --to=nate.karstens@garmin.com \
    --cc=David.Laight@ACULAB.COM \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=arnd@arndb.de \
    --cc=bfields@fieldses.org \
    --cc=davem@davemloft.net \
    --cc=deller@gmx.de \
    --cc=ink@jurassic.park.msu.ru \
    --cc=jlayton@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-alpha@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=mattst88@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=rth@twiddle.net \
    --cc=sparclinux@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=xiaosuo@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git