All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] nextfd(2)
@ 2012-04-01 12:57 Alexey Dobriyan
  2012-04-01 13:58 ` Konstantin Khlebnikov
                   ` (7 more replies)
  0 siblings, 8 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 12:57 UTC (permalink / raw)
  To: akpm, viro; +Cc: torvalds, drepper, linux-kernel, linux-fsdevel

Currently there is no reliable way to close all opened file descriptors
(which daemons need and like to do):

* dumb close(fd) loop is slow, upper bound is unknown and
  can be arbitrary large,

* /proc/self/fd is unreliable:
  proc may be unconfigured or not mounted at expected place.
  Looking at /proc/self/fd requires opening directory
  which may not be available due to malicious rlimit drop or ENOMEM situations.
  Not opening directory is equivalent to dumb close(2) loop except slower.

BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
on the bigger scale. closefrom(2) does only close(2) (obviously :-)
closefrom(2) siletly ignores errors from close(2) which in theory is not OK
for userspace.

So, don't add closefrom(2), add nextfd(2).

	int nextfd(int fd)

returns next opened file descriptor which is >= than fd or -1/ESRCH
if there aren't any descriptors >= than fd.

Thus closefrom(3) can be rewritten through it in userspace:

	void closefrom(int fd)
	{
		while (1) {
			fd = nextfd(fd);
			if (fd == -1 && errno == ESRCH)
				break;
			(void)close(fd);
			fd++;
		}
	}

Maybe it will grow other smart uses.

nextfd(2) doesn't change kernel state and thus can't fail
which is why it should go in. Other means may fail or
may not be available or require linear time with only guessed
upper boundaries (1024, getrlimit(RLIM_NOFILE), sysconf(_SC_OPEN_MAX).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 arch/x86/syscalls/syscall_32.tbl |    1 +
 arch/x86/syscalls/syscall_64.tbl |    1 +
 fs/Makefile                      |    1 +
 fs/nextfd.c                      |   27 +++++++++++++++++++++++++++
 include/linux/syscalls.h         |    1 +
 5 files changed, 31 insertions(+)

--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	nextfd			sys_nextfd
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -318,6 +318,7 @@
 309	common	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	nextfd			sys_nextfd
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
 # for native 64-bit operation.
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -12,6 +12,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
 		stack.o fs_struct.o statfs.o
+obj-y += nextfd.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
--- /dev/null
+++ b/fs/nextfd.c
@@ -0,0 +1,27 @@
+#include <linux/errno.h>
+#include <linux/fdtable.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+
+/* Return first opened file descriptor which is >= than the argument. */
+SYSCALL_DEFINE1(nextfd, unsigned int, fd)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	while (fd < fdt->max_fds) {
+		struct file *file;
+
+		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
+		if (file) {
+			rcu_read_unlock();
+			return fd;
+		}
+		fd++;
+	}
+	rcu_read_unlock();
+	return -ESRCH;
+}
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -857,5 +857,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 				      const struct iovec __user *rvec,
 				      unsigned long riovcnt,
 				      unsigned long flags);
+asmlinkage long sys_nextfd(unsigned int fd);
 
 #endif

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
@ 2012-04-01 13:58 ` Konstantin Khlebnikov
  2012-04-01 21:30   ` Alexey Dobriyan
  2012-04-02  0:09   ` Alan Cox
  2012-04-01 15:43 ` Eric Dumazet
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 78+ messages in thread
From: Konstantin Khlebnikov @ 2012-04-01 13:58 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

Alexey Dobriyan wrote:
> Currently there is no reliable way to close all opened file descriptors
> (which daemons need and like to do):
>
> * dumb close(fd) loop is slow, upper bound is unknown and
>    can be arbitrary large,
>
> * /proc/self/fd is unreliable:
>    proc may be unconfigured or not mounted at expected place.
>    Looking at /proc/self/fd requires opening directory
>    which may not be available due to malicious rlimit drop or ENOMEM situations.
>    Not opening directory is equivalent to dumb close(2) loop except slower.
>
> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
> for userspace.
>
> So, don't add closefrom(2), add nextfd(2).
>
> 	int nextfd(int fd)

Can we add "pid" argument to be able to search next fd in other task?
Together with sys_kcmp() this will be very useful for checkpoint/restore.

>
> returns next opened file descriptor which is>= than fd or -1/ESRCH
> if there aren't any descriptors>= than fd.
>
> Thus closefrom(3) can be rewritten through it in userspace:
>
> 	void closefrom(int fd)
> 	{
> 		while (1) {
> 			fd = nextfd(fd);
> 			if (fd == -1&&  errno == ESRCH)
> 				break;
> 			(void)close(fd);
> 			fd++;
> 		}
> 	}
>
> Maybe it will grow other smart uses.
>
> nextfd(2) doesn't change kernel state and thus can't fail
> which is why it should go in. Other means may fail or
> may not be available or require linear time with only guessed
> upper boundaries (1024, getrlimit(RLIM_NOFILE), sysconf(_SC_OPEN_MAX).
>
> Signed-off-by: Alexey Dobriyan<adobriyan@gmail.com>
> ---
>
>   arch/x86/syscalls/syscall_32.tbl |    1 +
>   arch/x86/syscalls/syscall_64.tbl |    1 +
>   fs/Makefile                      |    1 +
>   fs/nextfd.c                      |   27 +++++++++++++++++++++++++++
>   include/linux/syscalls.h         |    1 +
>   5 files changed, 31 insertions(+)
>
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -355,3 +355,4 @@
>   346	i386	setns			sys_setns
>   347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
>   348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
> +349	i386	nextfd			sys_nextfd
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -318,6 +318,7 @@
>   309	common	getcpu			sys_getcpu
>   310	64	process_vm_readv	sys_process_vm_readv
>   311	64	process_vm_writev	sys_process_vm_writev
> +312	64	nextfd			sys_nextfd
>   #
>   # x32-specific system call numbers start at 512 to avoid cache impact
>   # for native 64-bit operation.
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -12,6 +12,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
>   		seq_file.o xattr.o libfs.o fs-writeback.o \
>   		pnode.o drop_caches.o splice.o sync.o utimes.o \
>   		stack.o fs_struct.o statfs.o
> +obj-y += nextfd.o
>
>   ifeq ($(CONFIG_BLOCK),y)
>   obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
> --- /dev/null
> +++ b/fs/nextfd.c
> @@ -0,0 +1,27 @@
> +#include<linux/errno.h>
> +#include<linux/fdtable.h>
> +#include<linux/rcupdate.h>
> +#include<linux/sched.h>
> +#include<linux/syscalls.h>
> +
> +/* Return first opened file descriptor which is>= than the argument. */
> +SYSCALL_DEFINE1(nextfd, unsigned int, fd)
> +{
> +	struct files_struct *files = current->files;
> +	struct fdtable *fdt;
> +
> +	rcu_read_lock();
> +	fdt = files_fdtable(files);
> +	while (fd<  fdt->max_fds) {
> +		struct file *file;
> +
> +		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
> +		if (file) {
> +			rcu_read_unlock();
> +			return fd;
> +		}
> +		fd++;
> +	}
> +	rcu_read_unlock();
> +	return -ESRCH;
> +}
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -857,5 +857,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>   				      const struct iovec __user *rvec,
>   				      unsigned long riovcnt,
>   				      unsigned long flags);
> +asmlinkage long sys_nextfd(unsigned int fd);
>
>   #endif
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
  2012-04-01 13:58 ` Konstantin Khlebnikov
@ 2012-04-01 15:43 ` Eric Dumazet
  2012-04-01 21:31   ` Alexey Dobriyan
  2012-04-01 21:36   ` Alan Cox
  2012-04-01 17:20 ` Linus Torvalds
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2012-04-01 15:43 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Sun, 2012-04-01 at 15:57 +0300, Alexey Dobriyan wrote:

> +
> +/* Return first opened file descriptor which is >= than the argument. */
> +SYSCALL_DEFINE1(nextfd, unsigned int, fd)
> +{
> +	struct files_struct *files = current->files;
> +	struct fdtable *fdt;
> +
> +	rcu_read_lock();
> +	fdt = files_fdtable(files);
> +	while (fd < fdt->max_fds) {
> +		struct file *file;
> +
> +		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
> +		if (file) {
> +			rcu_read_unlock();
> +			return fd;
> +		}
> +		fd++;
> +	}
> +	rcu_read_unlock();
> +	return -ESRCH;
> +}

Interesting idea but what about using fdt->open_fds bitmap to have a
fast search and less cache pollution ?

alloc_fd(start, flags) uses find_next_zero_bit(), you could use
find_next_bit().





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
  2012-04-01 13:58 ` Konstantin Khlebnikov
  2012-04-01 15:43 ` Eric Dumazet
@ 2012-04-01 17:20 ` Linus Torvalds
  2012-04-01 18:28 ` Valentin Nechayev
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2012-04-01 17:20 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, viro, drepper, linux-kernel, linux-fsdevel

On Sun, Apr 1, 2012 at 5:57 AM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
> for userspace.
>
> So, don't add closefrom(2), add nextfd(2).

I don't disagree with this, but I don't think it's worth a new file of
its own - I think it should go next to dup() and friends (currently in
fs/fcntl.h).

Also, I think you should:

 (a) at a minimum use the bitmap.  It's easy:

    struct fdtable *fdt;

    rcu_read_lock();
    fdt = files_fdtable(current->files);
    .. maximum fd in fdt->max_fds ...
    .. you have the bitmap in fdt->open_fds ..

 (b) I'd also suggest you add a "flag" argument and have at least one
bit option for "search for next *free* file table entry". I'm not sure
anybody wants it, but it is kind of the same operation, and having a
"flags" field means that it's extensible in the future if people want
to find the next cloexec fd or whatever. Or if they just want to find
the *last* fd, or something.

But I think the concept makes sense.

                               Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
                   ` (2 preceding siblings ...)
  2012-04-01 17:20 ` Linus Torvalds
@ 2012-04-01 18:28 ` Valentin Nechayev
  2012-04-01 21:33   ` Alexey Dobriyan
  2012-04-01 19:21 ` Arnd Bergmann
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 78+ messages in thread
From: Valentin Nechayev @ 2012-04-01 18:28 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

 Sun, Apr 01, 2012 at 15:57:42, adobriyan wrote about "[PATCH] nextfd(2)": 

> So, don't add closefrom(2), add nextfd(2).
> 
> 	int nextfd(int fd)

Is it really needed here to create syscall? One shall update all arch
lists for it. Instead, adding a new option for fcntl() (named e.g.
F_NEXTFD) solves for all platforms simultaneously.


-netch-

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
                   ` (3 preceding siblings ...)
  2012-04-01 18:28 ` Valentin Nechayev
@ 2012-04-01 19:21 ` Arnd Bergmann
  2012-04-01 21:35   ` Alexey Dobriyan
  2012-04-01 22:05   ` H. Peter Anvin
  2012-04-01 22:03 ` H. Peter Anvin
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 78+ messages in thread
From: Arnd Bergmann @ 2012-04-01 19:21 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Sunday 01 April 2012, Alexey Dobriyan wrote:
>  arch/x86/syscalls/syscall_32.tbl |    1 +
>  arch/x86/syscalls/syscall_64.tbl |    1 +
>  fs/Makefile                      |    1 +
>  fs/nextfd.c                      |   27 +++++++++++++++++++++++++++
>  include/linux/syscalls.h         |    1 +
>  5 files changed, 31 insertions(+)

I don't have any comments on the syscall itself, but when you add one, please
also make the change to include/asm-generic/unistd.h so it appears in the
various architectures using the generic syscall table.

	Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 13:58 ` Konstantin Khlebnikov
@ 2012-04-01 21:30   ` Alexey Dobriyan
  2012-04-02  0:09   ` Alan Cox
  1 sibling, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 21:30 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Sun, Apr 01, 2012 at 05:58:53PM +0400, Konstantin Khlebnikov wrote:
> Alexey Dobriyan wrote:

> > So, don't add closefrom(2), add nextfd(2).
> >
> > 	int nextfd(int fd)
> 
> Can we add "pid" argument to be able to search next fd in other task?
> Together with sys_kcmp() this will be very useful for checkpoint/restore.

You've added /proc/*/map_files for C/R making proc mandatory,
so you do have /proc/*/fd .

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 15:43 ` Eric Dumazet
@ 2012-04-01 21:31   ` Alexey Dobriyan
  2012-04-01 21:36   ` Alan Cox
  1 sibling, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 21:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Sun, Apr 01, 2012 at 05:43:25PM +0200, Eric Dumazet wrote:
> On Sun, 2012-04-01 at 15:57 +0300, Alexey Dobriyan wrote:
> 
> > +
> > +/* Return first opened file descriptor which is >= than the argument. */
> > +SYSCALL_DEFINE1(nextfd, unsigned int, fd)
> > +{
> > +	struct files_struct *files = current->files;
> > +	struct fdtable *fdt;
> > +
> > +	rcu_read_lock();
> > +	fdt = files_fdtable(files);
> > +	while (fd < fdt->max_fds) {
> > +		struct file *file;
> > +
> > +		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
> > +		if (file) {
> > +			rcu_read_unlock();
> > +			return fd;
> > +		}
> > +		fd++;
> > +	}
> > +	rcu_read_unlock();
> > +	return -ESRCH;
> > +}
> 
> Interesting idea but what about using fdt->open_fds bitmap to have a
> fast search and less cache pollution ?
> 
> alloc_fd(start, flags) uses find_next_zero_bit(), you could use
> find_next_bit().

Indeed.
I've copied code from /proc/*/fd implementation which does loop.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 18:28 ` Valentin Nechayev
@ 2012-04-01 21:33   ` Alexey Dobriyan
  0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 21:33 UTC (permalink / raw)
  To: Valentin Nechayev
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Sun, Apr 01, 2012 at 09:28:26PM +0300, Valentin Nechayev wrote:
>  Sun, Apr 01, 2012 at 15:57:42, adobriyan wrote about "[PATCH] nextfd(2)": 
> 
> > So, don't add closefrom(2), add nextfd(2).
> > 
> > 	int nextfd(int fd)
> 
> Is it really needed here to create syscall? One shall update all arch
> lists for it. Instead, adding a new option for fcntl() (named e.g.
> F_NEXTFD) solves for all platforms simultaneously.

This is not a problem, build system even checks for not yet added syscalls.
Explicit syscall is better than multiplexer.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 19:21 ` Arnd Bergmann
@ 2012-04-01 21:35   ` Alexey Dobriyan
  2012-04-01 22:05   ` H. Peter Anvin
  1 sibling, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 21:35 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Sun, Apr 01, 2012 at 07:21:59PM +0000, Arnd Bergmann wrote:
> On Sunday 01 April 2012, Alexey Dobriyan wrote:
> >  arch/x86/syscalls/syscall_32.tbl |    1 +
> >  arch/x86/syscalls/syscall_64.tbl |    1 +
> >  fs/Makefile                      |    1 +
> >  fs/nextfd.c                      |   27 +++++++++++++++++++++++++++
> >  include/linux/syscalls.h         |    1 +
> >  5 files changed, 31 insertions(+)
> 
> I don't have any comments on the syscall itself, but when you add one, please
> also make the change to include/asm-generic/unistd.h so it appears in the
> various architectures using the generic syscall table.

I will.
The amount of places to plug new syscall is suprising. :-(

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 15:43 ` Eric Dumazet
  2012-04-01 21:31   ` Alexey Dobriyan
@ 2012-04-01 21:36   ` Alan Cox
  1 sibling, 0 replies; 78+ messages in thread
From: Alan Cox @ 2012-04-01 21:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

> > +	return -ESRCH;
> > +}
> 
> Interesting idea but what about using fdt->open_fds bitmap to have a
> fast search and less cache pollution 

I do wonder who needs this on a hot path versus just looking
in /proc/self ?

Alan

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
                   ` (4 preceding siblings ...)
  2012-04-01 19:21 ` Arnd Bergmann
@ 2012-04-01 22:03 ` H. Peter Anvin
  2012-04-01 22:13   ` H. Peter Anvin
                     ` (3 more replies)
  2012-04-02 23:17 ` KOSAKI Motohiro
  2012-04-04  3:01 ` Al Viro
  7 siblings, 4 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-01 22:03 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
> 
> * /proc/self/fd is unreliable:
>   proc may be unconfigured or not mounted at expected place.
>   Looking at /proc/self/fd requires opening directory
>   which may not be available due to malicious rlimit drop or ENOMEM situations.
>   Not opening directory is equivalent to dumb close(2) loop except slower.
> 

This is really the motivation for this... the real question is how much
functionality is actually available in the system without /proc mounted,
and in particular if this particular subcase is worth optimizing ...
after all, if someone is maliciously setting rlimit, we can just abort
(if someone can set an rlimit they can also force an abort) or revert to
the slow path.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 19:21 ` Arnd Bergmann
  2012-04-01 21:35   ` Alexey Dobriyan
@ 2012-04-01 22:05   ` H. Peter Anvin
  2012-04-04 12:13     ` Arnd Bergmann
  1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-01 22:05 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On 04/01/2012 12:21 PM, Arnd Bergmann wrote:
> On Sunday 01 April 2012, Alexey Dobriyan wrote:
>>  arch/x86/syscalls/syscall_32.tbl |    1 +
>>  arch/x86/syscalls/syscall_64.tbl |    1 +
>>  fs/Makefile                      |    1 +
>>  fs/nextfd.c                      |   27 +++++++++++++++++++++++++++
>>  include/linux/syscalls.h         |    1 +
>>  5 files changed, 31 insertions(+)
> 
> I don't have any comments on the syscall itself, but when you add one, please
> also make the change to include/asm-generic/unistd.h so it appears in the
> various architectures using the generic syscall table.
> 
> 	Arnd

Arnd: do you have any interest in leveraging the syscall scripts I did
for x86?  I have tried to make them as generic as possible, with the
hope of getting more and more of syscall information into more easily
processed form.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 22:03 ` H. Peter Anvin
@ 2012-04-01 22:13   ` H. Peter Anvin
  2012-04-02  0:08   ` Alan Cox
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-01 22:13 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On 04/01/2012 03:03 PM, H. Peter Anvin wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>>
>> * /proc/self/fd is unreliable:
>>   proc may be unconfigured or not mounted at expected place.
>>   Looking at /proc/self/fd requires opening directory
>>   which may not be available due to malicious rlimit drop or ENOMEM situations.
>>   Not opening directory is equivalent to dumb close(2) loop except slower.
>>
> 
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...
> after all, if someone is maliciously setting rlimit, we can just abort
> (if someone can set an rlimit they can also force an abort) or revert to
> the slow path.
> 

A few more observations:

- There is a huge backwards compatibility problem with this for a
substantial transition period; using /proc/self/fd has worked for a very
long time already.

- Your nextfd() system call will require more system calls that the
typical case for reading /proc/self/fd, because each getdents() system
call handles multiple readdir() invocations.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 22:03 ` H. Peter Anvin
  2012-04-01 22:13   ` H. Peter Anvin
@ 2012-04-02  0:08   ` Alan Cox
  2012-04-30  9:58     ` Valentin Nechayev
  2012-04-02  1:19     ` Kyle Moffett
  2012-04-06  9:54     ` Alexey Dobriyan
  3 siblings, 1 reply; 78+ messages in thread
From: Alan Cox @ 2012-04-02  0:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Sun, 01 Apr 2012 15:03:38 -0700
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
> > 
> > * /proc/self/fd is unreliable:
> >   proc may be unconfigured or not mounted at expected place.

Equally someone could be ptracing your process and changing the answers.
Both are "do stupid stuff, get stupid answers" cases. Why do we care ???
Lots of other stuff breaks in this case so nobody will be doing it for
good reasons

> >   Looking at /proc/self/fd requires opening directory

Which is very efficient

> >   which may not be available due to malicious rlimit drop or ENOMEM situations.
> >   Not opening directory is equivalent to dumb close(2) loop except slower.

If I've rlimited you to no file handles then you already lost whichever
approach you use. Just abort. Trying to do clever stuff in security
sensitive situations via codepaths that inevitably don't get tested when
you've apparently got a malicious executor is dumb.

> (if someone can set an rlimit they can also force an abort) or revert to
> the slow path.

Or have set a CPU limit or memory limit or more.

Alan

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 13:58 ` Konstantin Khlebnikov
  2012-04-01 21:30   ` Alexey Dobriyan
@ 2012-04-02  0:09   ` Alan Cox
  2012-04-02  8:38     ` Konstantin Khlebnikov
  1 sibling, 1 reply; 78+ messages in thread
From: Alan Cox @ 2012-04-02  0:09 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

> Can we add "pid" argument to be able to search next fd in other task?
> Together with sys_kcmp() this will be very useful for checkpoint/restore.

That would raise all sorts of security questions unless protected, plus
its got races. If you want to do that and you have the rights you can do
it via procfs anyway.

Alan

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 22:03 ` H. Peter Anvin
@ 2012-04-02  1:19     ` Kyle Moffett
  2012-04-02  0:08   ` Alan Cox
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 78+ messages in thread
From: Kyle Moffett @ 2012-04-02  1:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Sun, Apr 1, 2012 at 15:03, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>> * /proc/self/fd is unreliable:
>>   proc may be unconfigured or not mounted at expected place.
>>   Looking at /proc/self/fd requires opening directory
>>   which may not be available due to malicious rlimit drop or ENOMEM situations.
>>   Not opening directory is equivalent to dumb close(2) loop except slower.
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...
> after all, if someone is maliciously setting rlimit, we can just abort
> (if someone can set an rlimit they can also force an abort) or revert to
> the slow path.

Well, I imagine one typical usecase for closing all FDs is for
security isolation purposes (EG: chroot()+etc), and in a great deal of
chroot environments you don't have /proc available.  In particular
/proc has been a source of a lot of privilege escalations in the past,
so avoiding mounting it in a chroot is good security policy if
possible.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
@ 2012-04-02  1:19     ` Kyle Moffett
  0 siblings, 0 replies; 78+ messages in thread
From: Kyle Moffett @ 2012-04-02  1:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Sun, Apr 1, 2012 at 15:03, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>> * /proc/self/fd is unreliable:
>>   proc may be unconfigured or not mounted at expected place.
>>   Looking at /proc/self/fd requires opening directory
>>   which may not be available due to malicious rlimit drop or ENOMEM situations.
>>   Not opening directory is equivalent to dumb close(2) loop except slower.
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...
> after all, if someone is maliciously setting rlimit, we can just abort
> (if someone can set an rlimit they can also force an abort) or revert to
> the slow path.

Well, I imagine one typical usecase for closing all FDs is for
security isolation purposes (EG: chroot()+etc), and in a great deal of
chroot environments you don't have /proc available.  In particular
/proc has been a source of a lot of privilege escalations in the past,
so avoiding mounting it in a chroot is good security policy if
possible.

Cheers,
Kyle Moffett
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02  1:19     ` Kyle Moffett
  (?)
@ 2012-04-02  1:37     ` H. Peter Anvin
  -1 siblings, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-02  1:37 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

Are those use cases heavyweight enough to motivate a new interface?

Kyle Moffett <kyle@moffetthome.net> wrote:

>On Sun, Apr 1, 2012 at 15:03, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>>> * /proc/self/fd is unreliable:
>>>   proc may be unconfigured or not mounted at expected place.
>>>   Looking at /proc/self/fd requires opening directory
>>>   which may not be available due to malicious rlimit drop or ENOMEM
>situations.
>>>   Not opening directory is equivalent to dumb close(2) loop except
>slower.
>>
>> This is really the motivation for this... the real question is how
>much
>> functionality is actually available in the system without /proc
>mounted,
>> and in particular if this particular subcase is worth optimizing ...
>> after all, if someone is maliciously setting rlimit, we can just
>abort
>> (if someone can set an rlimit they can also force an abort) or revert
>to
>> the slow path.
>
>Well, I imagine one typical usecase for closing all FDs is for
>security isolation purposes (EG: chroot()+etc), and in a great deal of
>chroot environments you don't have /proc available.  In particular
>/proc has been a source of a lot of privilege escalations in the past,
>so avoiding mounting it in a chroot is good security policy if
>possible.
>
>Cheers,
>Kyle Moffett

-- 
Sent from my mobile phone. Please excuse my brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02  0:09   ` Alan Cox
@ 2012-04-02  8:38     ` Konstantin Khlebnikov
  2012-04-02  9:26       ` Cyrill Gorcunov
  0 siblings, 1 reply; 78+ messages in thread
From: Konstantin Khlebnikov @ 2012-04-02  8:38 UTC (permalink / raw)
  To: Alan Cox
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel, Cyrill Gorcunov

Alan Cox wrote:
>> Can we add "pid" argument to be able to search next fd in other task?
>> Together with sys_kcmp() this will be very useful for checkpoint/restore.
>
> That would raise all sorts of security questions unless protected, plus
> its got races. If you want to do that and you have the rights you can do
> it via procfs anyway.

Don't worry, I just asked. =) Seems like crtools dumps fd in the target-task context,
so the current version is suitable.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02  8:38     ` Konstantin Khlebnikov
@ 2012-04-02  9:26       ` Cyrill Gorcunov
  0 siblings, 0 replies; 78+ messages in thread
From: Cyrill Gorcunov @ 2012-04-02  9:26 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Alan Cox, Alexey Dobriyan, akpm, viro, torvalds, drepper,
	linux-kernel, linux-fsdevel

On Mon, Apr 02, 2012 at 12:38:25PM +0400, Konstantin Khlebnikov wrote:
> Alan Cox wrote:
> >>Can we add "pid" argument to be able to search next fd in other task?
> >>Together with sys_kcmp() this will be very useful for checkpoint/restore.
> >
> >That would raise all sorts of security questions unless protected, plus
> >its got races. If you want to do that and you have the rights you can do
> >it via procfs anyway.
> 
> Don't worry, I just asked. =) Seems like crtools dumps fd in the target-task context,
> so the current version is suitable.

Kind of ;) At moment we drain fds from dumpee task context to
our tool context and then dump them.

	Cyrill

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02  1:19     ` Kyle Moffett
  (?)
  (?)
@ 2012-04-02 11:37     ` Ulrich Drepper
  -1 siblings, 0 replies; 78+ messages in thread
From: Ulrich Drepper @ 2012-04-02 11:37 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds,
	linux-kernel, linux-fsdevel

On Sun, Apr 1, 2012 at 21:19, Kyle Moffett <kyle@moffetthome.net> wrote:
> Well, I imagine one typical usecase for closing all FDs is for
> security isolation purposes (EG: chroot()+etc),

chroot and security in the same sentence...?


> and in a great deal of
> chroot environments you don't have /proc available.  In particular
> /proc has been a source of a lot of privilege escalations in the past,
> so avoiding mounting it in a chroot is good security policy if
> possible.

The problem is that the kernel exports quite a bit of information only
through the /proc and /sys filesystems.  I might try to finish my
comprehensive list of functionality depending on /proc sometime soon.
The list is quite long.

Not mounting /proc is inconvenient at best, it renders the environment
unusable quite often and in some cases is outright insecure..  I don't
think you can use not mounting /proc as an argument.  And, as Peter
said, the loop over the directory content is quite efficient.

If you want to avoid /proc I suggest you first work on removing the
dependencies.  Of just secure /proc itself.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
                   ` (5 preceding siblings ...)
  2012-04-01 22:03 ` H. Peter Anvin
@ 2012-04-02 23:17 ` KOSAKI Motohiro
  2012-04-02 23:56   ` H. Peter Anvin
  2012-04-03 19:21   ` Colin Walters
  2012-04-04  3:01 ` Al Viro
  7 siblings, 2 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-02 23:17 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

Hi,

2012/4/1 Alexey Dobriyan <adobriyan@gmail.com>:
> Currently there is no reliable way to close all opened file descriptors
> (which daemons need and like to do):
>
> * dumb close(fd) loop is slow, upper bound is unknown and
>  can be arbitrary large,
>
> * /proc/self/fd is unreliable:
>  proc may be unconfigured or not mounted at expected place.
>  Looking at /proc/self/fd requires opening directory
>  which may not be available due to malicious rlimit drop or ENOMEM situations.
>  Not opening directory is equivalent to dumb close(2) loop except slower.

Sorry for the long delay comment. I realized this thread now. I think
/proc no mount
case is not good explanation for the worth of this patch. The problem
is, we can't
use opendir() after fork() if an app has multi threads.

SUS clearly say so,
 http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html

we can only call async-signal-safe functions after fork() when multi threads and
opendir() call malloc() internally.

As far as I know, OpenJDK has a such fork-readdir-exec code and it can
make deadlock
when spawnning a new process. Unfortunately Java language perfeter to
make a lot of threads rather than other language.

This patch can solve such multi threaded case.

offtopic, glibc malloc is a slightly clever. It reinitialize its
internal lock when fork by
using thread_atfork() hook. It mean glibc malloc can be used after
fork() and the
technique can avoid this issue. But, glibc malloc still has several
performance problem and
many people prefer to use jemalloc or google malloc instead. Then,
they hit an old issue, bah.


> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
> for userspace.

Solaris don't only have a closefrom(), but also has fdwalk().

  http://docs.oracle.com/cd/E19082-01/819-2243/6n4i098vd/index.html

and I've received a request that linux aim fdwalk() several times. Example,

Ruby uses fcntl(FD_CLOEXEC) instead of close() because their community uses
valgrind for daily stress test and it don't support for(f<maxfd){
close() } pattern.
(I guess it is valgrind's bug, but I'm not sure.)

I have several other similar bug report of our proprietary software.
but i'm sorry.
I cant show it here. But clearly, fdwalk (or nextfd) is powerful than
closefrom().

Honestly, I had no idea to aim fdwalk cleanly. but your patch is much
cleaner to me.
So, I like this.

> So, don't add closefrom(2), add nextfd(2).
>
>        int nextfd(int fd)
>
> returns next opened file descriptor which is >= than fd or -1/ESRCH
> if there aren't any descriptors >= than fd.
>
> Thus closefrom(3) can be rewritten through it in userspace:
>
>        void closefrom(int fd)
>        {
>                while (1) {
>                        fd = nextfd(fd);
>                        if (fd == -1 && errno == ESRCH)
>                                break;
>                        (void)close(fd);
>                        fd++;
>                }
>        }
>
> Maybe it will grow other smart uses.
>
> nextfd(2) doesn't change kernel state and thus can't fail
> which is why it should go in. Other means may fail or
> may not be available or require linear time with only guessed
> upper boundaries (1024, getrlimit(RLIM_NOFILE), sysconf(_SC_OPEN_MAX).

This is not enough explanation. The problem is, RLIM_NOFILE can be changed
at runtime. then, a process can have a larger fd than RLIM_NOFILE.

Therefore, if you accept other developers opinion (especially linus's
flags argument suggestion), I'll ack this patch.

Thank you.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02 23:17 ` KOSAKI Motohiro
@ 2012-04-02 23:56   ` H. Peter Anvin
  2012-04-04 11:51     ` Ulrich Drepper
  2012-04-04 16:31     ` KOSAKI Motohiro
  2012-04-03 19:21   ` Colin Walters
  1 sibling, 2 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-02 23:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On 04/02/2012 04:17 PM, KOSAKI Motohiro wrote:
> 
> Sorry for the long delay comment. I realized this thread now. I think
> /proc no mount case is not good explanation for the worth of this patch. The problem
> is, we can't use opendir() after fork() if an app has multi threads.
> 
> SUS clearly say so,
>  http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html
> 
> we can only call async-signal-safe functions after fork() when multi threads and
> opendir() call malloc() internally.
> 
> As far as I know, OpenJDK has a such fork-readdir-exec code and it can
> make deadlock
> when spawnning a new process. Unfortunately Java language perfeter to
> make a lot of threads rather than other language.
> 
> This patch can solve such multi threaded case.
> 
> offtopic, glibc malloc is a slightly clever. It reinitialize its
> internal lock when fork by using thread_atfork() hook. It mean glibc malloc can be used after
> fork() and the technique can avoid this issue. But, glibc malloc still has several
> performance problem and many people prefer to use jemalloc or google malloc instead. Then,
> they hit an old issue, bah.
> 

OK, so what you're saying here is:

Linux doesn't actually have a problem unless:
1. You use the library implementation of opendir/readdir/closedir;
2. You use a nonstandard malloc for the platform which doesn't
   correctly set up fork hooks (which I would consider a bug);

You can deal with this in one of two ways:

2. Fix your malloc().
1. Use the low level open()/getdents()/close() functions instead of
   opendir()/readdir()/closedir().

> and I've received a request that linux aim fdwalk() several times. Example,

It doesn't sound very hard to implement fdwalk() in terms of
open/getdents/close without using malloc; since the fdwalk() interface
lets you use the stack for storage.  You can then implement closefrom()
in terms of fdwalk().  Something like this (untested):

int fdwalk(int (*func)(void *, int), void *cd)
{
	char buf[4096];	/* ... could be less... */
	const char *p, *q;
	const struct linux_dirent *dp
	int dfd, fd;
	unsigned char c;
	int rv = 0;
	int sz;

	dfd = open("/proc/self/fd", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
	if (dfd < 0)
		return -1;

	/*** XXX: may want to check for procfs magic here ***/

	while ((sz = getdents(dfd, buf, sizeof buf)) > 0) {
		p = buf;

		while (sz > offsetof(struct linux_dirent, d_name)) {
			dp = (const struct linux_dirent *)p;

			if (sz < dp->d_reclen)
				break;

			q   = dp->d_name;
			p  += dp->d_reclen;
			sz -= dp->d_reclen;

			fd = 0;
			while (q < p && (c = *q++)) {
				c -= '0';
				if (c >= 10)
					goto skip;
				fd = fd*10 + c;
			}

			if (fd != dfd)
				rv = func(cd, fd);
		skip:
			;
		}
	}

	if (close(dfd))
		return -1;

	return rv;
}

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02 23:17 ` KOSAKI Motohiro
  2012-04-02 23:56   ` H. Peter Anvin
@ 2012-04-03 19:21   ` Colin Walters
  1 sibling, 0 replies; 78+ messages in thread
From: Colin Walters @ 2012-04-03 19:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Mon, 2012-04-02 at 16:17 -0700, KOSAKI Motohiro wrote:

> we can only call async-signal-safe functions after fork() when multi threads and
> opendir() call malloc() internally.
> 
> As far as I know, OpenJDK has a such fork-readdir-exec code and it can
> make deadlock
> when spawnning a new process. Unfortunately Java language perfeter to
> make a lot of threads rather than other language.
> 
> This patch can solve such multi threaded case.
> 
> offtopic, glibc malloc is a slightly clever. It reinitialize its
> internal lock when fork by
> using thread_atfork() hook. It mean glibc malloc can be used after
> fork() and the
> technique can avoid this issue. 

Yeah, a *lot* of code out there depends on this glibc malloc
implementation detail.  We uncovered our dependency on this in
GNOME a while back while investigating getenv/setenv/fork and threads:
https://bugzilla.gnome.org/show_bug.cgi?id=659326



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
                   ` (6 preceding siblings ...)
  2012-04-02 23:17 ` KOSAKI Motohiro
@ 2012-04-04  3:01 ` Al Viro
  2012-04-04 17:10   ` KOSAKI Motohiro
  7 siblings, 1 reply; 78+ messages in thread
From: Al Viro @ 2012-04-04  3:01 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, torvalds, drepper, linux-kernel, linux-fsdevel

On Sun, Apr 01, 2012 at 03:57:42PM +0300, Alexey Dobriyan wrote:
> Currently there is no reliable way to close all opened file descriptors
> (which daemons need and like to do):
> 
> * dumb close(fd) loop is slow, upper bound is unknown and
>   can be arbitrary large,
> 
> * /proc/self/fd is unreliable:
>   proc may be unconfigured or not mounted at expected place.
>   Looking at /proc/self/fd requires opening directory
>   which may not be available due to malicious rlimit drop or ENOMEM situations.
>   Not opening directory is equivalent to dumb close(2) loop except slower.
> 
> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
> for userspace.
> 
> So, don't add closefrom(2), add nextfd(2).

Or unshare(CLONE_FILES_EMPTY) to steal an idea from rfork(2) (Plan 9 one,
that is - I don't remember if its *BSD analog has that).  Basically, they
allow 3 kinds of behaviour on clone(2) analog (and unshare(2) is part of
the same thing there):
	1) share descriptor table with parent (default for rfork(2))
	2) copy descriptor table from parent (RFFDG is set in flags)
	3) give child an empty descriptor table (RFCFDG is set in flags)
They have something similar for namespace, BTW - the same share/copy/clean
triple.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02 23:56   ` H. Peter Anvin
@ 2012-04-04 11:51     ` Ulrich Drepper
  2012-04-04 16:38       ` KOSAKI Motohiro
  2012-04-04 16:31     ` KOSAKI Motohiro
  1 sibling, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2012-04-04 11:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, torvalds,
	linux-kernel, linux-fsdevel

On Mon, Apr 2, 2012 at 19:56, H. Peter Anvin <hpa@zytor.com> wrote:
>
> You can deal with this in one of two ways:
>
> 2. Fix your malloc().
> 1. Use the low level open()/getdents()/close() functions instead of
>   opendir()/readdir()/closedir().

And if their is concern about using opendir() then we can add
something to posix_spawn.  And attribute like POSIX_SPAWN_CLOSEALL and
you then would have to explicitly add dup and open requests for the
descriptors you want to have open.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 22:05   ` H. Peter Anvin
@ 2012-04-04 12:13     ` Arnd Bergmann
  0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2012-04-04 12:13 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Sunday 01 April 2012, H. Peter Anvin wrote:
> On 04/01/2012 12:21 PM, Arnd Bergmann wrote:
> > On Sunday 01 April 2012, Alexey Dobriyan wrote:
> >>  arch/x86/syscalls/syscall_32.tbl |    1 +
> >>  arch/x86/syscalls/syscall_64.tbl |    1 +
> >>  fs/Makefile                      |    1 +
> >>  fs/nextfd.c                      |   27 +++++++++++++++++++++++++++
> >>  include/linux/syscalls.h         |    1 +
> >>  5 files changed, 31 insertions(+)
> > 
> > I don't have any comments on the syscall itself, but when you add one, please
> > also make the change to include/asm-generic/unistd.h so it appears in the
> > various architectures using the generic syscall table.
> 
> Arnd: do you have any interest in leveraging the syscall scripts I did
> for x86?  I have tried to make them as generic as possible, with the
> hope of getting more and more of syscall information into more easily
> processed form.

Sounds interesting, but I'm not planning to do the changes myself,
especially since I have no machine that actually uses the generic
syscall table.

Maybe I should ask the next person who submits a new architecture to
do that work, that's usually how progress in asm-generic happens
these days ;-)

	Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02 23:56   ` H. Peter Anvin
  2012-04-04 11:51     ` Ulrich Drepper
@ 2012-04-04 16:31     ` KOSAKI Motohiro
  2012-04-04 17:10       ` Colin Walters
  2012-04-04 18:44       ` H. Peter Anvin
  1 sibling, 2 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 16:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, torvalds, drepper,
	linux-kernel, linux-fsdevel

(4/2/12 4:56 PM), H. Peter Anvin wrote:
> On 04/02/2012 04:17 PM, KOSAKI Motohiro wrote:
>>
>> Sorry for the long delay comment. I realized this thread now. I think
>> /proc no mount case is not good explanation for the worth of this patch. The problem
>> is, we can't use opendir() after fork() if an app has multi threads.
>>
>> SUS clearly say so,
>>   http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html
>>
>> we can only call async-signal-safe functions after fork() when multi threads and
>> opendir() call malloc() internally.
>>
>> As far as I know, OpenJDK has a such fork-readdir-exec code and it can
>> make deadlock
>> when spawnning a new process. Unfortunately Java language perfeter to
>> make a lot of threads rather than other language.
>>
>> This patch can solve such multi threaded case.
>>
>> offtopic, glibc malloc is a slightly clever. It reinitialize its
>> internal lock when fork by using thread_atfork() hook. It mean glibc malloc can be used after
>> fork() and the technique can avoid this issue. But, glibc malloc still has several
>> performance problem and many people prefer to use jemalloc or google malloc instead. Then,
>> they hit an old issue, bah.
>>
>
> OK, so what you're saying here is:
>
> Linux doesn't actually have a problem unless:
> 1. You use the library implementation of opendir/readdir/closedir;
> 2. You use a nonstandard malloc for the platform which doesn't
>     correctly set up fork hooks (which I would consider a bug);

Right. but I'm argue "correctly set up" term because SUS/POSIX don't require it.
It is only a workaround of buggy userland in glibc. SUS still says you can't
use opendir and typical userland people don't want ignore SUS as far as possible.


> You can deal with this in one of two ways:
>
> 2. Fix your malloc().
> 1. Use the low level open()/getdents()/close() functions instead of
>     opendir()/readdir()/closedir().

Ideally possible. but practically impossible. 2) people don't use a their
own malloc. they only uses open sources alternative malloc. And, I think
you have too narrowing concern. Even though malloc people adds a workaround,
the standard inhibit to use it and people may continue to use more dangerous
RLIM_NOFILE loop. 1) I haven't seen _practical_ userland software uses such
linux internal hacking. Almost all major software can run on multiple OSs.


>> and I've received a request that linux aim fdwalk() several times. Example,
>
> It doesn't sound very hard to implement fdwalk() in terms of
> open/getdents/close without using malloc; since the fdwalk() interface
> lets you use the stack for storage.  You can then implement closefrom()
> in terms of fdwalk().  Something like this (untested):
>
> int fdwalk(int (*func)(void *, int), void *cd)
> {
> 	char buf[4096];	/* ... could be less... */
> 	const char *p, *q;
> 	const struct linux_dirent *dp
> 	int dfd, fd;
> 	unsigned char c;
> 	int rv = 0;
> 	int sz;
>
> 	dfd = open("/proc/self/fd", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
> 	if (dfd<  0)
> 		return -1;
>
> 	/*** XXX: may want to check for procfs magic here ***/
>
> 	while ((sz = getdents(dfd, buf, sizeof buf))>  0) {
> 		p = buf;
>
> 		while (sz>  offsetof(struct linux_dirent, d_name)) {
> 			dp = (const struct linux_dirent *)p;
>
> 			if (sz<  dp->d_reclen)
> 				break;
>
> 			q   = dp->d_name;
> 			p  += dp->d_reclen;
> 			sz -= dp->d_reclen;
>
> 			fd = 0;
> 			while (q<  p&&  (c = *q++)) {
> 				c -= '0';
> 				if (c>= 10)
> 					goto skip;
> 				fd = fd*10 + c;
> 			}
>
> 			if (fd != dfd)
> 				rv = func(cd, fd);
> 		skip:
> 			;
> 		}
> 	}
>
> 	if (close(dfd))
> 		return -1;
>
> 	return rv;
> }

It can. but more ugly. no?


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 11:51     ` Ulrich Drepper
@ 2012-04-04 16:38       ` KOSAKI Motohiro
  2012-04-04 16:43         ` Ulrich Drepper
  0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 16:38 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: H. Peter Anvin, KOSAKI Motohiro, Alexey Dobriyan, akpm, viro,
	torvalds, linux-kernel, linux-fsdevel

(4/4/12 4:51 AM), Ulrich Drepper wrote:
> On Mon, Apr 2, 2012 at 19:56, H. Peter Anvin<hpa@zytor.com>  wrote:
>>
>> You can deal with this in one of two ways:
>>
>> 2. Fix your malloc().
>> 1. Use the low level open()/getdents()/close() functions instead of
>>    opendir()/readdir()/closedir().
>
> And if their is concern about using opendir() then we can add
> something to posix_spawn.  And attribute like POSIX_SPAWN_CLOSEALL and
> you then would have to explicitly add dup and open requests for the
> descriptors you want to have open.

As far as I understand, any major open source project don't use posix_spawn().
Please remind, I'm talking about real world issue.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 16:38       ` KOSAKI Motohiro
@ 2012-04-04 16:43         ` Ulrich Drepper
  2012-04-04 17:07           ` KOSAKI Motohiro
  0 siblings, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2012-04-04 16:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds,
	linux-kernel, linux-fsdevel

On Wed, Apr 4, 2012 at 12:38, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
> As far as I understand, any major open source project don't use
> posix_spawn().
> Please remind, I'm talking about real world issue.

This doesn't mean they shouldn't.  If you require code to be changed
anyway let them change to something which doesn't require more cruft
in the kernel.  The limitations you cited are irrelevant for
posix_spawn.  And perhaps there will be actually spawn support in the
kernel which would make dealing with OOM situations and non-overcommit
much easier.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 16:43         ` Ulrich Drepper
@ 2012-04-04 17:07           ` KOSAKI Motohiro
  2012-04-04 17:49             ` Ulrich Drepper
  0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 17:07 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: KOSAKI Motohiro, H. Peter Anvin, Alexey Dobriyan, akpm, viro,
	torvalds, linux-kernel, linux-fsdevel

(4/4/12 9:43 AM), Ulrich Drepper wrote:
> On Wed, Apr 4, 2012 at 12:38, KOSAKI Motohiro<kosaki.motohiro@gmail.com>  wrote:
>> As far as I understand, any major open source project don't use
>> posix_spawn().
>> Please remind, I'm talking about real world issue.
>
> This doesn't mean they shouldn't.  If you require code to be changed
> anyway let them change to something which doesn't require more cruft
> in the kernel.  The limitations you cited are irrelevant for
> posix_spawn.  And perhaps there will be actually spawn support in the
> kernel which would make dealing with OOM situations and non-overcommit
> much easier.

Umm... I'm sorry. I haven't catch why OOM is related topic. Could you please
elaborate more?



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 16:31     ` KOSAKI Motohiro
@ 2012-04-04 17:10       ` Colin Walters
  2012-04-04 17:25         ` Colin Walters
  2012-04-04 23:35         ` KOSAKI Motohiro
  2012-04-04 18:44       ` H. Peter Anvin
  1 sibling, 2 replies; 78+ messages in thread
From: Colin Walters @ 2012-04-04 17:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds, drepper,
	linux-kernel, linux-fsdevel

On Wed, 2012-04-04 at 09:31 -0700, KOSAKI Motohiro wrote:

> Ideally possible. but practically impossible. 2) people don't use a their
> own malloc. they only uses open sources alternative malloc. And, I think
> you have too narrowing concern. Even though malloc people adds a workaround,
> the standard inhibit to use it

What do you mean?  If as hpa says, the maintainer of e.g. google
tcmalloc added a call to pthread_atfork(), then code which uses
opendir() would start working.

>  and people may continue to use more dangerous
> RLIM_NOFILE loop. 1) I haven't seen _practical_ userland software uses such
> linux internal hacking. Almost all major software can run on multiple OSs.

Except that if you're using /proc/self/fd, you're already relying on
Linux-specific functionality.  So it's not burdensome to use "struct
linux_dirent" and O_DIRECTORY either.

In GLib we're presently doing the regular /proc+opendir() under
#ifdef __linux__:
http://git.gnome.org/browse/glib/tree/glib/gspawn.c#n932

Now I'd happily switch to hpa's fdwalk() implementation if I was aware
of someone using glib in combination with an alternative malloc hitting
this problem.

Basically I think hpa is right here, and it's not really worth adding
a new system call.

The thing is, even if it were added today, since we need to run on old
kernels, we'd have to carry the code to use the /proc trick
approximately forever.  And in the end all nextfd() would accomplish
would be a *third* case in the already messy ifdefs/fallbacks
in the various implementations of process spawning.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04  3:01 ` Al Viro
@ 2012-04-04 17:10   ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 17:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Alexey Dobriyan, akpm, torvalds, drepper, linux-kernel,
	linux-fsdevel, kosaki.motohiro

(4/3/12 8:01 PM), Al Viro wrote:
> On Sun, Apr 01, 2012 at 03:57:42PM +0300, Alexey Dobriyan wrote:
>> Currently there is no reliable way to close all opened file descriptors
>> (which daemons need and like to do):
>>
>> * dumb close(fd) loop is slow, upper bound is unknown and
>>    can be arbitrary large,
>>
>> * /proc/self/fd is unreliable:
>>    proc may be unconfigured or not mounted at expected place.
>>    Looking at /proc/self/fd requires opening directory
>>    which may not be available due to malicious rlimit drop or ENOMEM situations.
>>    Not opening directory is equivalent to dumb close(2) loop except slower.
>>
>> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
>> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
>> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
>> for userspace.
>>
>> So, don't add closefrom(2), add nextfd(2).
>
> Or unshare(CLONE_FILES_EMPTY) to steal an idea from rfork(2) (Plan 9 one,
> that is - I don't remember if its *BSD analog has that).  Basically, they
> allow 3 kinds of behaviour on clone(2) analog (and unshare(2) is part of
> the same thing there):
> 	1) share descriptor table with parent (default for rfork(2))
> 	2) copy descriptor table from parent (RFFDG is set in flags)
> 	3) give child an empty descriptor table (RFCFDG is set in flags)
> They have something similar for namespace, BTW - the same share/copy/clean
> triple.

Please remember why closefrom() have "from" argument. Almost all case, people
don't cloase fd 0,1,2 (rarely and 3).

If we add 2nd argument into unshare(CLONE_FILES_EMPTY), It become more ugly than
current nextfd proposal.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 17:10       ` Colin Walters
@ 2012-04-04 17:25         ` Colin Walters
  2012-04-04 23:35         ` KOSAKI Motohiro
  1 sibling, 0 replies; 78+ messages in thread
From: Colin Walters @ 2012-04-04 17:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds, drepper,
	linux-kernel, linux-fsdevel

On Wed, 2012-04-04 at 13:10 -0400, Colin Walters wrote:

> The thing is, even if it were added today, since we need to run on old
> kernels, we'd have to carry the code to use the /proc trick
> approximately forever.  And in the end all nextfd() would accomplish
> would be a *third* case in the already messy ifdefs/fallbacks
> in the various implementations of process spawning.

The only really counterargument I see to this is if nextfd() were
*significantly* more efficient.  I happily switched to using eventfd()
instead of pipe(), while still carrying the burden of fallback code,
because dropping from two open descriptors to one solved actual
problems with maxing out 1024 fds.

As far as efficiency goes, I could imagine that as Linus said, what
you really want is nextfd(O_NOT_CLOEXEC), because in a well-behaved
program that marks its internal descriptors as O_CLOEXEC, the effort the
forked process spends closing them is just wasted over having it
all happen more efficiently in the kernel as part of the exec().

But if claiming efficiency, one needs to post benchmarks...






^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 17:07           ` KOSAKI Motohiro
@ 2012-04-04 17:49             ` Ulrich Drepper
  2012-04-04 18:08               ` KOSAKI Motohiro
  0 siblings, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2012-04-04 17:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds,
	linux-kernel, linux-fsdevel

On Wed, Apr 4, 2012 at 13:07, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
> Umm... I'm sorry. I haven't catch why OOM is related topic. Could you please
> elaborate more?

With fork you always have some copy-on-write (and worse for
overcommit) just to then execute exec.  With a real spawn
implementation you wouldn't have that.  A big problem if you, for
instance, have to spawn a small helper from a gigantic process.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 17:49             ` Ulrich Drepper
@ 2012-04-04 18:08               ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 18:08 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: KOSAKI Motohiro, H. Peter Anvin, Alexey Dobriyan, akpm, viro,
	torvalds, linux-kernel, linux-fsdevel

(4/4/12 10:49 AM), Ulrich Drepper wrote:
> On Wed, Apr 4, 2012 at 13:07, KOSAKI Motohiro<kosaki.motohiro@gmail.com>  wrote:
>> Umm... I'm sorry. I haven't catch why OOM is related topic. Could you please
>> elaborate more?
>
> With fork you always have some copy-on-write (and worse for
> overcommit) just to then execute exec.  With a real spawn
> implementation you wouldn't have that.  A big problem if you, for
> instance, have to spawn a small helper from a gigantic process.

Ah, ok. I agree posix_spawn() has a chance to aim more momemory efficiency
than fork-exec. But in this purpose, vfork may be enough useful and be widely
accepted from userland folks.

Example, some daemon has a following patten,
  1. fork
  2. change /proc/<pid>/oom_adj
  3. exec

That's said, when adding linux specific knob, we need to add new posix_spawn flags
if we really need (or want) to replaces all userland. this seems very hard and doubtful
worth to me.

Ahh, note. I'm not against to implement posix_spawn() into the kernel. I only argue spawn()
can solve closefrom issue.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 16:31     ` KOSAKI Motohiro
  2012-04-04 17:10       ` Colin Walters
@ 2012-04-04 18:44       ` H. Peter Anvin
  1 sibling, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-04 18:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On 04/04/2012 09:31 AM, KOSAKI Motohiro wrote:
>>
>> OK, so what you're saying here is:
>>
>> Linux doesn't actually have a problem unless:
>> 1. You use the library implementation of opendir/readdir/closedir;
>> 2. You use a nonstandard malloc for the platform which doesn't
>>     correctly set up fork hooks (which I would consider a bug);
> 
> Right. but I'm argue "correctly set up" term because SUS/POSIX don't
> require it.
> It is only a workaround of buggy userland in glibc. SUS still says you
> can't
> use opendir and typical userland people don't want ignore SUS as far as
> possible.

Since you are comparing with a Linux-only system call, any suggestion
that depends on SuS requirements as opposed to Linux requirements is
irrelevant.

> 
> It can. but more ugly. no?
> 

And a new system call isn't?  What planet are you on?

	-hpa

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-04 17:10       ` Colin Walters
  2012-04-04 17:25         ` Colin Walters
@ 2012-04-04 23:35         ` KOSAKI Motohiro
  1 sibling, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 23:35 UTC (permalink / raw)
  To: Colin Walters
  Cc: KOSAKI Motohiro, H. Peter Anvin, Alexey Dobriyan, akpm, viro,
	torvalds, drepper, linux-kernel, linux-fsdevel

(4/4/12 10:10 AM), Colin Walters wrote:
> On Wed, 2012-04-04 at 09:31 -0700, KOSAKI Motohiro wrote:
>
>> Ideally possible. but practically impossible. 2) people don't use a their
>> own malloc. they only uses open sources alternative malloc. And, I think
>> you have too narrowing concern. Even though malloc people adds a workaround,
>> the standard inhibit to use it
>
> What do you mean?  If as hpa says, the maintainer of e.g. google
> tcmalloc added a call to pthread_atfork(), then code which uses
> opendir() would start working.

 From point of application programmer view, they have no right to choose
malloc implementation. user can easily override it by using LD_PRELOAD.
So, even though one or two malloc implementation will be fixed, it doesn't
mean user land folks pain will gone. they are living more painful world.
At least, they told me so.

But, I discussed about this issue with hpa today and we agreed the best way is
to merge hpa fdwalk() into libc. It makes maximum happy, example, glib can drop
its own fdwalk implementation. so, I'd like to discuss libc folks.




>>   and people may continue to use more dangerous
>> RLIM_NOFILE loop. 1) I haven't seen _practical_ userland software uses such
>> linux internal hacking. Almost all major software can run on multiple OSs.
>
> Except that if you're using /proc/self/fd, you're already relying on
> Linux-specific functionality.  So it's not burdensome to use "struct
> linux_dirent" and O_DIRECTORY either.
>
> In GLib we're presently doing the regular /proc+opendir() under
> #ifdef __linux__:
> http://git.gnome.org/browse/glib/tree/glib/gspawn.c#n932
>
> Now I'd happily switch to hpa's fdwalk() implementation if I was aware
> of someone using glib in combination with an alternative malloc hitting
> this problem.
>
> Basically I think hpa is right here, and it's not really worth adding
> a new system call.
>
> The thing is, even if it were added today, since we need to run on old
> kernels, we'd have to carry the code to use the /proc trick
> approximately forever.  And in the end all nextfd() would accomplish
> would be a *third* case in the already messy ifdefs/fallbacks
> in the various implementations of process spawning.

Nope. some OSs only have a silly rlimit(RLIM_NOFILE), then userland continue
to keep rlimit(NOFILE) for fallback. it's no linux specific.

I mean, OpenJDK and glib already has a linux specific trick, but many software
don't.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-01 22:03 ` H. Peter Anvin
@ 2012-04-06  9:54     ` Alexey Dobriyan
  2012-04-02  0:08   ` Alan Cox
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-06  9:54 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Mon, Apr 2, 2012 at 1:03 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>>
>> * /proc/self/fd is unreliable:
>>   proc may be unconfigured or not mounted at expected place.
>>   Looking at /proc/self/fd requires opening directory
>>   which may not be available due to malicious rlimit drop or ENOMEM situations.
>>   Not opening directory is equivalent to dumb close(2) loop except slower.
>>
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...

I agree, this particular changelog may be somewhat out of line.

But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
fix compilation problems associated with it, do not mount proc by default,
do not mark it unmountable somehow and
then say procless setups aren't worth it.

I haven't seen personally procless environments
but several people mentioned them including on this very list.

Without proc knowledge about fdtable is gathered linearly and still unreliable.
With nextfd(2), even procful environments could lose several failure branches.
And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
@ 2012-04-06  9:54     ` Alexey Dobriyan
  0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-06  9:54 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Mon, Apr 2, 2012 at 1:03 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>>
>> * /proc/self/fd is unreliable:
>>   proc may be unconfigured or not mounted at expected place.
>>   Looking at /proc/self/fd requires opening directory
>>   which may not be available due to malicious rlimit drop or ENOMEM situations.
>>   Not opening directory is equivalent to dumb close(2) loop except slower.
>>
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...

I agree, this particular changelog may be somewhat out of line.

But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
fix compilation problems associated with it, do not mount proc by default,
do not mark it unmountable somehow and
then say procless setups aren't worth it.

I haven't seen personally procless environments
but several people mentioned them including on this very list.

Without proc knowledge about fdtable is gathered linearly and still unreliable.
With nextfd(2), even procful environments could lose several failure branches.
And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06  9:54     ` Alexey Dobriyan
  (?)
@ 2012-04-06 15:27     ` Colin Walters
  -1 siblings, 0 replies; 78+ messages in thread
From: Colin Walters @ 2012-04-06 15:27 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: H. Peter Anvin, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Fri, 2012-04-06 at 12:54 +0300, Alexey Dobriyan wrote:

> But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
> fix compilation problems associated with it, do not mount proc by default,
> do not mark it unmountable somehow and
> then say procless setups aren't worth it.
> 
> I haven't seen personally procless environments
> but several people mentioned them including on this very list.

Now that the kernel has CLONE_NEWNS, it's possible to mount proc
"privately" just for a specific process tree.  It meshes nicely with
CLONE_NEWPID.  Previously if you mounted proc in a chroot, it cluttered
the mount list and leaked information about outside the root.
With modern clone/unshare, that's no longer a concern, so there's
much less reason to use "bare" chroots.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06  9:54     ` Alexey Dobriyan
  (?)
  (?)
@ 2012-04-06 16:14     ` H. Peter Anvin
  2012-04-06 20:16       ` Alexey Dobriyan
  -1 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-06 16:14 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
> 
> I agree, this particular changelog may be somewhat out of line.
> 
> But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
> fix compilation problems associated with it, do not mount proc by default,
> do not mark it unmountable somehow and
> then say procless setups aren't worth it.
> 

Aren't worth *optimizing for*.  But yes, CONFIG_PROC_FS is pretty much a
historic relic at this point, and probably should just be dropped.

> Without proc knowledge about fdtable is gathered linearly and still unreliable.
> With nextfd(2), even procful environments could lose several failure branches.

What?  Please explain how on Earth this would "lose several failure
branches."

New system calls are not something that should be added lightly...

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06  9:54     ` Alexey Dobriyan
                       ` (2 preceding siblings ...)
  (?)
@ 2012-04-06 16:23     ` H. Peter Anvin
  2012-04-07 21:21       ` Ben Pfaff
  2012-04-11  0:09       ` KOSAKI Motohiro
  -1 siblings, 2 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-06 16:23 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
> 
> Without proc knowledge about fdtable is gathered linearly and still unreliable.
> With nextfd(2), even procful environments could lose several failure branches.
> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
> 

Incidentally, if we were to create a system call for this -- which I so
far see no reason for -- I would make it return a select-style bitmask
of file descriptors in use, not a "next fd" which would require a system
call per iteration.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06 16:14     ` H. Peter Anvin
@ 2012-04-06 20:16       ` Alexey Dobriyan
  2012-04-06 20:33         ` H. Peter Anvin
  2012-04-06 21:02         ` H. Peter Anvin
  0 siblings, 2 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-06 20:16 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Fri, Apr 06, 2012 at 09:14:17AM -0700, H. Peter Anvin wrote:
> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
> > 
> > I agree, this particular changelog may be somewhat out of line.
> > 
> > But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
> > fix compilation problems associated with it, do not mount proc by default,
> > do not mark it unmountable somehow and
> > then say procless setups aren't worth it.
> > 
> 
> Aren't worth *optimizing for*.  But yes, CONFIG_PROC_FS is pretty much a
> historic relic at this point, and probably should just be dropped.

What to do with automounting /proc so it availablility would match
syscall availability?

> > Without proc knowledge about fdtable is gathered linearly and still unreliable.
> > With nextfd(2), even procful environments could lose several failure branches.
> 
> What?  Please explain how on Earth this would "lose several failure
> branches."

closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
closefrom(3) written via close(fd++) is unreliable.

If programmer adds nextfd(2) loop before any closefrom(3) code
he currently uses, there will be less failures.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06 20:16       ` Alexey Dobriyan
@ 2012-04-06 20:33         ` H. Peter Anvin
  2012-04-06 21:02         ` H. Peter Anvin
  1 sibling, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-06 20:33 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

What is unreliable about it?  This is starting to be ridiculous.  As far as automounting /proc is concerned, this does not seem to be a significant problem with current userspace.

Alexey Dobriyan <adobriyan@gmail.com> wrote:

>On Fri, Apr 06, 2012 at 09:14:17AM -0700, H. Peter Anvin wrote:
>> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>> > 
>> > I agree, this particular changelog may be somewhat out of line.
>> > 
>> > But I find it little hypocritical that kernel developers add
>CONFIG_PROC_FS,
>> > fix compilation problems associated with it, do not mount proc by
>default,
>> > do not mark it unmountable somehow and
>> > then say procless setups aren't worth it.
>> > 
>> 
>> Aren't worth *optimizing for*.  But yes, CONFIG_PROC_FS is pretty
>much a
>> historic relic at this point, and probably should just be dropped.
>
>What to do with automounting /proc so it availablility would match
>syscall availability?
>
>> > Without proc knowledge about fdtable is gathered linearly and still
>unreliable.
>> > With nextfd(2), even procful environments could lose several
>failure branches.
>> 
>> What?  Please explain how on Earth this would "lose several failure
>> branches."
>
>closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
>closefrom(3) written via /proc/self/fd is reliable and can fail
>(including ENOMEM).
>closefrom(3) written via close(fd++) is unreliable.
>
>If programmer adds nextfd(2) loop before any closefrom(3) code
>he currently uses, there will be less failures.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06 20:16       ` Alexey Dobriyan
  2012-04-06 20:33         ` H. Peter Anvin
@ 2012-04-06 21:02         ` H. Peter Anvin
  2012-04-12 10:54             ` Alexey Dobriyan
  1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-06 21:02 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
> 
> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
> closefrom(3) written via close(fd++) is unreliable.
> 

I call shenanigans on this.  There is no reason to ENOMEM on the second
written using the fdwalk() implementation I already posted, for example.

	-hpa



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06 16:23     ` H. Peter Anvin
@ 2012-04-07 21:21       ` Ben Pfaff
  2012-04-11  0:12           ` KOSAKI Motohiro
  2012-04-11  0:09       ` KOSAKI Motohiro
  1 sibling, 1 reply; 78+ messages in thread
From: Ben Pfaff @ 2012-04-07 21:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>> 
>> Without proc knowledge about fdtable is gathered linearly and still unreliable.
>> With nextfd(2), even procful environments could lose several failure branches.
>> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>> 
>
> Incidentally, if we were to create a system call for this -- which I so
> far see no reason for -- I would make it return a select-style bitmask
> of file descriptors in use, not a "next fd" which would require a system
> call per iteration.

It's already possible to do something a little like that with the
existing "poll" system call:

#include <stdio.h>
#include <sys/poll.h>

int
main(void)
{
  enum { N_FDS = 1024 };
  struct pollfd fds[N_FDS];
  int error;
  int i;

  for (i = 0; i < N_FDS; i++)
    {
      fds[i].fd = i;
      fds[i].events = 0;
    }

  error = poll(fds, N_FDS, 0);
  if (error < 0)
    perror("poll");
  else
    {
      printf("valid fds:");
      for (i = 0; i < N_FDS; i++)
        if (!(fds[i].revents & POLLNVAL))
          printf(" %d", fds[i].fd);
      printf("\n");
    }

  return 0;
}

blp@blp:~/tmp(0)$ gcc tmp.c -Wall
blp@blp:~/tmp(0)$ ./a.out
valid fds: 0 1 2
blp@blp:~/tmp(0)$ ./a.out 5>/dev/null
valid fds: 0 1 2 5
blp@blp:~/tmp(0)$ ./a.out 5>/dev/null 3</dev/stdin
valid fds: 0 1 2 3 5
blp@blp:~/tmp(0)$ 

-- 
Ben Pfaff 
http://benpfaff.org

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06 16:23     ` H. Peter Anvin
  2012-04-07 21:21       ` Ben Pfaff
@ 2012-04-11  0:09       ` KOSAKI Motohiro
  2012-04-11 17:58         ` H. Peter Anvin
  2012-04-11 18:00         ` H. Peter Anvin
  1 sibling, 2 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11  0:09 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Fri, Apr 6, 2012 at 12:23 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>>
>> Without proc knowledge about fdtable is gathered linearly and still unreliable.
>> With nextfd(2), even procful environments could lose several failure branches.
>> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>>
>
> Incidentally, if we were to create a system call for this -- which I so
> far see no reason for -- I would make it return a select-style bitmask
> of file descriptors in use, not a "next fd" which would require a system
> call per iteration.

I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.

  http://austingroupbugs.net/view.php?id=149

nextfd() has a semantics of F_NEXT.

Next, why shoundn't we implement fcntl(F_NEXT) in our kernel? I think
we have two reason.

1) As linus pointed out, linux specific "flags" argument may be useful.
2) The name of F_NEXT is not fixed yet. another url of the austin says
it is FD_NEXT.
     So, we can't choose right name yet. Moreover, A meanings of 3rd
argument of F_NEXT
     haven't been fixed.


I dont think following #ifdef is insane. but glibc also can provide
correct F_NEXT when next SUS is published.

#ifdef FOO
#define NEXTFD(fd) nextfd(fd, flags)
#else
#define NEXTFD(fd) fcntl(fd, F_NEXT, O_FDWR)
#endif


In short, kernel developer don't care any standard at all. but
application programmer usually care it.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-07 21:21       ` Ben Pfaff
@ 2012-04-11  0:12           ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11  0:12 UTC (permalink / raw)
  To: Ben Pfaff
  Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds, drepper,
	linux-kernel, linux-fsdevel

On Sat, Apr 7, 2012 at 5:21 PM, Ben Pfaff <blp@cs.stanford.edu> wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>
>> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>>>
>>> Without proc knowledge about fdtable is gathered linearly and still unreliable.
>>> With nextfd(2), even procful environments could lose several failure branches.
>>> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>>>
>>
>> Incidentally, if we were to create a system call for this -- which I so
>> far see no reason for -- I would make it return a select-style bitmask
>> of file descriptors in use, not a "next fd" which would require a system
>> call per iteration.
>
> It's already possible to do something a little like that with the
> existing "poll" system call:
>
> #include <stdio.h>
> #include <sys/poll.h>
>
> int
> main(void)
> {
>  enum { N_FDS = 1024 };
>  struct pollfd fds[N_FDS];

Your code has a muximum fd assumption here. that is one of that we
really want to avoid.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
@ 2012-04-11  0:12           ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11  0:12 UTC (permalink / raw)
  To: Ben Pfaff
  Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds, drepper,
	linux-kernel, linux-fsdevel

On Sat, Apr 7, 2012 at 5:21 PM, Ben Pfaff <blp@cs.stanford.edu> wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>
>> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>>>
>>> Without proc knowledge about fdtable is gathered linearly and still unreliable.
>>> With nextfd(2), even procful environments could lose several failure branches.
>>> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>>>
>>
>> Incidentally, if we were to create a system call for this -- which I so
>> far see no reason for -- I would make it return a select-style bitmask
>> of file descriptors in use, not a "next fd" which would require a system
>> call per iteration.
>
> It's already possible to do something a little like that with the
> existing "poll" system call:
>
> #include <stdio.h>
> #include <sys/poll.h>
>
> int
> main(void)
> {
>  enum { N_FDS = 1024 };
>  struct pollfd fds[N_FDS];

Your code has a muximum fd assumption here. that is one of that we
really want to avoid.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11  0:09       ` KOSAKI Motohiro
@ 2012-04-11 17:58         ` H. Peter Anvin
  2012-04-11 18:04             ` Linus Torvalds
  2012-04-11 18:00         ` H. Peter Anvin
  1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 17:58 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On 04/10/2012 05:09 PM, KOSAKI Motohiro wrote:
> 
> I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
> 
>   http://austingroupbugs.net/view.php?id=149
> 
> nextfd() has a semantics of F_NEXT.
> 
> Next, why shoundn't we implement fcntl(F_NEXT) in our kernel? I think
> we have two reason.
> 
> 1) As linus pointed out, linux specific "flags" argument may be useful.
> 2) The name of F_NEXT is not fixed yet. another url of the austin says
> it is FD_NEXT.
>      So, we can't choose right name yet. Moreover, A meanings of 3rd
> argument of F_NEXT
>      haven't been fixed.
> 

But it still has the same braindamage: one system call per loop
invocation, and we can do better.  I would much rather see fdwalk() in SUS.

	-hpa


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11  0:09       ` KOSAKI Motohiro
  2012-04-11 17:58         ` H. Peter Anvin
@ 2012-04-11 18:00         ` H. Peter Anvin
  2012-04-11 19:20             ` KOSAKI Motohiro
  1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 18:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On 04/10/2012 05:09 PM, KOSAKI Motohiro wrote:
> 
> I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
> 
>   http://austingroupbugs.net/view.php?id=149
> 
> nextfd() has a semantics of F_NEXT.
> 

Looking at that bug report, it sounds like that proposal was rejected.

	-hpa


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 17:58         ` H. Peter Anvin
@ 2012-04-11 18:04             ` Linus Torvalds
  0 siblings, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2012-04-11 18:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, drepper,
	linux-kernel, linux-fsdevel

On Wed, Apr 11, 2012 at 10:58 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> But it still has the same braindamage: one system call per loop
> invocation, and we can do better.  I would much rather see fdwalk() in SUS.

Why would we bother to do better?

System calls are cheap, and usually you actually do want to do
something about the fd, so you actually want to iterate over them.

I'd much rather have simple cheap interfaces than anything else. If
SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
doing something more complex and then just having to emulate the
simple thing in user space anyway.

If a standard interface exists, we should just use it.

                  Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
@ 2012-04-11 18:04             ` Linus Torvalds
  0 siblings, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2012-04-11 18:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, drepper,
	linux-kernel, linux-fsdevel

On Wed, Apr 11, 2012 at 10:58 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> But it still has the same braindamage: one system call per loop
> invocation, and we can do better.  I would much rather see fdwalk() in SUS.

Why would we bother to do better?

System calls are cheap, and usually you actually do want to do
something about the fd, so you actually want to iterate over them.

I'd much rather have simple cheap interfaces than anything else. If
SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
doing something more complex and then just having to emulate the
simple thing in user space anyway.

If a standard interface exists, we should just use it.

                  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 18:04             ` Linus Torvalds
  (?)
@ 2012-04-11 18:11             ` H. Peter Anvin
  2012-04-11 19:46                 ` KOSAKI Motohiro
  -1 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 18:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, drepper,
	linux-kernel, linux-fsdevel

On 04/11/2012 11:04 AM, Linus Torvalds wrote:
> On Wed, Apr 11, 2012 at 10:58 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> But it still has the same braindamage: one system call per loop
>> invocation, and we can do better.  I would much rather see fdwalk() in SUS.
> 
> Why would we bother to do better?
> 
> System calls are cheap, and usually you actually do want to do
> something about the fd, so you actually want to iterate over them.
> 
> I'd much rather have simple cheap interfaces than anything else. If
> SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
> doing something more complex and then just having to emulate the
> simple thing in user space anyway.
> 
> If a standard interface exists, we should just use it.
> 

I went back and looked at the post, and also the discussion on the SUS
mailing list.

The proposal for FD_NEXT was rejected with some serious vitriol.

fdwalk() was considered just more palatable since there is an existing
implementation (in Solaris) and since it might be possible to provide a
way to hide specific fds from fdwalk(), but a much bigger issue raised
is that *ALL* of these interfaces are inherently broken.  Closing random
file descriptors is:

a) inherently racy in a multithreaded environment;
b) unsafe because there might be file descriptors used by libc itself.

Instead, from the resolution text:

> Therefore, the rest of this proposal seeks to document the problem
> with closing arbitrary file descriptors, and a new bugid will be
> opened to propose standardizing some recent interfaces and interface
> extensions first appearing in Linux (new interfaces such as pipe2( ),
> accept4( ), mkostemps( ), ..., and extensions like fopen(,"we")) to
> guarantee the atomic creation of file descriptors with the cloexec
> bit already set, as was already done in the standard with O_CLOEXEC
> in open( ) and F_DUPFD_CLOEXEC in fcntl( ). See also 0000368 for
> a related proposal to require CLOEXEC on hidden descriptors.

I say we ask the new glibc people to provide fdwalk() since it already
has an implementation history (and because it can be implemented without
new system calls, thereby working on old kernels), but the *big*
takeaway from this is that if there is way to create a file descriptor
so that it doesn't have O_CLOEXEC set from the very beginning, *that* is
what we need to fix.

	-hpa

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 18:00         ` H. Peter Anvin
@ 2012-04-11 19:20             ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Wed, Apr 11, 2012 at 2:00 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/10/2012 05:09 PM, KOSAKI Motohiro wrote:
>>
>> I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
>>
>>   http://austingroupbugs.net/view.php?id=149
>>
>> nextfd() has a semantics of F_NEXT.
>>
>
> Looking at that bug report, it sounds like that proposal was rejected.

?

If I understand correctly, above URL says, latest satatus is

  Resolution: Accepted As Marked

Please see toppest issue summary. Or if I am missing something, can you please
point out which line are you seeing?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
@ 2012-04-11 19:20             ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Wed, Apr 11, 2012 at 2:00 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/10/2012 05:09 PM, KOSAKI Motohiro wrote:
>>
>> I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
>>
>>   http://austingroupbugs.net/view.php?id=149
>>
>> nextfd() has a semantics of F_NEXT.
>>
>
> Looking at that bug report, it sounds like that proposal was rejected.

?

If I understand correctly, above URL says, latest satatus is

  Resolution: Accepted As Marked

Please see toppest issue summary. Or if I am missing something, can you please
point out which line are you seeing?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 19:20             ` KOSAKI Motohiro
  (?)
@ 2012-04-11 19:22             ` H. Peter Anvin
  2012-04-11 19:26               ` KOSAKI Motohiro
  -1 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 19:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On 04/11/2012 12:20 PM, KOSAKI Motohiro wrote:
> 
> If I understand correctly, above URL says, latest satatus is
> 
>   Resolution: Accepted As Marked
> 
> Please see toppest issue summary. Or if I am missing something, can you please
> point out which line are you seeing?

You need to read through the entire set of notes for it to make sense.

	-hpa


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 19:22             ` H. Peter Anvin
@ 2012-04-11 19:26               ` KOSAKI Motohiro
  2012-04-11 19:28                 ` H. Peter Anvin
  0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Wed, Apr 11, 2012 at 3:22 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/11/2012 12:20 PM, KOSAKI Motohiro wrote:
>>
>> If I understand correctly, above URL says, latest satatus is
>>
>>   Resolution: Accepted As Marked
>>
>> Please see toppest issue summary. Or if I am missing something, can you please
>> point out which line are you seeing?
>
> You need to read through the entire set of notes for it to make sense.

The note described several downsides. but I have no seen a rejection.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 19:26               ` KOSAKI Motohiro
@ 2012-04-11 19:28                 ` H. Peter Anvin
  2012-04-11 19:31                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 19:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On 04/11/2012 12:26 PM, KOSAKI Motohiro wrote:
> The note described several downsides. but I have no seen a rejection.

The Aardvark resolution "Accept as Marked" means "make the changes in
the markup."  The markup doesn't include any new interfaces, but just a
set of warnings.

	-hpa


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 19:28                 ` H. Peter Anvin
@ 2012-04-11 19:31                   ` KOSAKI Motohiro
  2012-04-11 19:32                     ` H. Peter Anvin
  0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Wed, Apr 11, 2012 at 3:28 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/11/2012 12:26 PM, KOSAKI Motohiro wrote:
>> The note described several downsides. but I have no seen a rejection.
>
> The Aardvark resolution "Accept as Marked" means "make the changes in
> the markup."  The markup doesn't include any new interfaces, but just a
> set of warnings.

Hm. Now are you saying "make the changes in the markup" mean rejection?
I'm sorry I haven't caught your point.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 19:31                   ` KOSAKI Motohiro
@ 2012-04-11 19:32                     ` H. Peter Anvin
  0 siblings, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 19:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On 04/11/2012 12:31 PM, KOSAKI Motohiro wrote:
> On Wed, Apr 11, 2012 at 3:28 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/11/2012 12:26 PM, KOSAKI Motohiro wrote:
>>> The note described several downsides. but I have no seen a rejection.
>>
>> The Aardvark resolution "Accept as Marked" means "make the changes in
>> the markup."  The markup doesn't include any new interfaces, but just a
>> set of warnings.
> 
> Hm. Now are you saying "make the changes in the markup" mean rejection?
> I'm sorry I haven't caught your point.
> 

You have to look at what the markup is.

	-hpa


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 18:11             ` H. Peter Anvin
@ 2012-04-11 19:46                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
	linux-kernel, linux-fsdevel

>>> But it still has the same braindamage: one system call per loop
>>> invocation, and we can do better.  I would much rather see fdwalk() in SUS.
>>
>> Why would we bother to do better?
>>
>> System calls are cheap, and usually you actually do want to do
>> something about the fd, so you actually want to iterate over them.
>>
>> I'd much rather have simple cheap interfaces than anything else. If
>> SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
>> doing something more complex and then just having to emulate the
>> simple thing in user space anyway.
>>
>> If a standard interface exists, we should just use it.
>>
>
> I went back and looked at the post, and also the discussion on the SUS
> mailing list.
>
> The proposal for FD_NEXT was rejected with some serious vitriol.
>
> fdwalk() was considered just more palatable since there is an existing
> implementation (in Solaris) and since it might be possible to provide a
> way to hide specific fds from fdwalk(), but a much bigger issue raised
> is that *ALL* of these interfaces are inherently broken.  Closing random
> file descriptors is:
>
> a) inherently racy in a multithreaded environment;

I would say two things. 1) I know and I agree we _can_ misuse the interface.
but many already existed interface also can be misused. 2) As I
already explained
this can be used correctly.

So, I have a question. Why do you bother a possibility of misuse? Of
if you didn't point
out misuse, can you please point out a real world use case of multi
threads + fd interation?


> b) unsafe because there might be file descriptors used by libc itself.

I agree this. Even though almost developer don't use libc message catalogue and
we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).


> Instead, from the resolution text:
>
>> Therefore, the rest of this proposal seeks to document the problem
>> with closing arbitrary file descriptors, and a new bugid will be
>> opened to propose standardizing some recent interfaces and interface
>> extensions first appearing in Linux (new interfaces such as pipe2( ),
>> accept4( ), mkostemps( ), ..., and extensions like fopen(,"we")) to
>> guarantee the atomic creation of file descriptors with the cloexec
>> bit already set, as was already done in the standard with O_CLOEXEC
>> in open( ) and F_DUPFD_CLOEXEC in fcntl( ). See also 0000368 for
>> a related proposal to require CLOEXEC on hidden descriptors.
>
> I say we ask the new glibc people to provide fdwalk() since it already
> has an implementation history (and because it can be implemented without
> new system calls, thereby working on old kernels), but the *big*
> takeaway from this is that if there is way to create a file descriptor
> so that it doesn't have O_CLOEXEC set from the very beginning, *that* is
> what we need to fix.

Yeah, I don't think fdwalk() is problematic. It's an option if I
understand Alexey's mail
correctly. but I disagree almost all developers should fix a design
and rewrite their
applications. In theory, they can avoid glibc or they can rewrite all
of their code or
avoid linux. but there is one problem. unrealistic.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
@ 2012-04-11 19:46                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
	linux-kernel, linux-fsdevel

>>> But it still has the same braindamage: one system call per loop
>>> invocation, and we can do better.  I would much rather see fdwalk() in SUS.
>>
>> Why would we bother to do better?
>>
>> System calls are cheap, and usually you actually do want to do
>> something about the fd, so you actually want to iterate over them.
>>
>> I'd much rather have simple cheap interfaces than anything else. If
>> SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
>> doing something more complex and then just having to emulate the
>> simple thing in user space anyway.
>>
>> If a standard interface exists, we should just use it.
>>
>
> I went back and looked at the post, and also the discussion on the SUS
> mailing list.
>
> The proposal for FD_NEXT was rejected with some serious vitriol.
>
> fdwalk() was considered just more palatable since there is an existing
> implementation (in Solaris) and since it might be possible to provide a
> way to hide specific fds from fdwalk(), but a much bigger issue raised
> is that *ALL* of these interfaces are inherently broken.  Closing random
> file descriptors is:
>
> a) inherently racy in a multithreaded environment;

I would say two things. 1) I know and I agree we _can_ misuse the interface.
but many already existed interface also can be misused. 2) As I
already explained
this can be used correctly.

So, I have a question. Why do you bother a possibility of misuse? Of
if you didn't point
out misuse, can you please point out a real world use case of multi
threads + fd interation?


> b) unsafe because there might be file descriptors used by libc itself.

I agree this. Even though almost developer don't use libc message catalogue and
we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).


> Instead, from the resolution text:
>
>> Therefore, the rest of this proposal seeks to document the problem
>> with closing arbitrary file descriptors, and a new bugid will be
>> opened to propose standardizing some recent interfaces and interface
>> extensions first appearing in Linux (new interfaces such as pipe2( ),
>> accept4( ), mkostemps( ), ..., and extensions like fopen(,"we")) to
>> guarantee the atomic creation of file descriptors with the cloexec
>> bit already set, as was already done in the standard with O_CLOEXEC
>> in open( ) and F_DUPFD_CLOEXEC in fcntl( ). See also 0000368 for
>> a related proposal to require CLOEXEC on hidden descriptors.
>
> I say we ask the new glibc people to provide fdwalk() since it already
> has an implementation history (and because it can be implemented without
> new system calls, thereby working on old kernels), but the *big*
> takeaway from this is that if there is way to create a file descriptor
> so that it doesn't have O_CLOEXEC set from the very beginning, *that* is
> what we need to fix.

Yeah, I don't think fdwalk() is problematic. It's an option if I
understand Alexey's mail
correctly. but I disagree almost all developers should fix a design
and rewrite their
applications. In theory, they can avoid glibc or they can rewrite all
of their code or
avoid linux. but there is one problem. unrealistic.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 19:46                 ` KOSAKI Motohiro
  (?)
@ 2012-04-11 19:49                 ` H. Peter Anvin
  2012-04-11 20:23                   ` KOSAKI Motohiro
  -1 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 19:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
	linux-kernel, linux-fsdevel

On 04/11/2012 12:46 PM, KOSAKI Motohiro wrote:
>>
>> a) inherently racy in a multithreaded environment;
> 
> I would say two things. 1) I know and I agree we _can_ misuse the interface.
> but many already existed interface also can be misused. 2) As I
> already explained
> this can be used correctly.
> 
> So, I have a question. Why do you bother a possibility of misuse? Of
> if you didn't point out misuse, can you please point out a real world
> use case of multi threads + fd interation?
> 

This were brought up in the POSIX discussion as part of why these
interfaces were considered undesirable.

> 
>> b) unsafe because there might be file descriptors used by libc itself.
> 
> I agree this. Even though almost developer don't use libc message catalogue and
> we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
> 

No, that's exactly the point that we cannot.

> 
> Yeah, I don't think fdwalk() is problematic. It's an option if I
> understand Alexey's mail
> correctly. but I disagree almost all developers should fix a design
> and rewrite their
> applications. In theory, they can avoid glibc or they can rewrite all
> of their code or
> avoid linux. but there is one problem. unrealistic.
> 

The problem -- as was brought up in the POSIX discussion -- is that you
actually end up breaking *properly functioning programs*.

	-hpa

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 19:49                 ` H. Peter Anvin
@ 2012-04-11 20:23                   ` KOSAKI Motohiro
  2012-04-11 20:32                     ` H. Peter Anvin
  0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 20:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
	linux-kernel, linux-fsdevel

>>> a) inherently racy in a multithreaded environment;
>>
>> I would say two things. 1) I know and I agree we _can_ misuse the interface.
>> but many already existed interface also can be misused. 2) As I
>> already explained
>> this can be used correctly.
>>
>> So, I have a question. Why do you bother a possibility of misuse? Of
>> if you didn't point out misuse, can you please point out a real world
>> use case of multi threads + fd interation?
>
> This were brought up in the POSIX discussion as part of why these
> interfaces were considered undesirable.

Hmmm.... I'm sorry I don't find "considered undesirable". Maybe because
my English is not very good. can you please help me clarify?

This text says,


> so a future revision of the standard may indeed add fdwalk( ), although no
> one in the meeting was willing to draft a proposal for fdwalk( ) at this time

and, later says after noting F_NEXT and O_CLOEXEC,

>Therefore, the rest of this proposal seeks to document the problem
>with closing arbitrary file descriptors, and a new bugid will be
>opened to propose standardizing some recent interfaces and interface
>extensions first appearing in Linux

Do you think latter override former?



>>> b) unsafe because there might be file descriptors used by libc itself.
>>
>> I agree this. Even though almost developer don't use libc message catalogue and
>> we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
>
> No, that's exactly the point that we cannot.

I thknk we are talking different aspect. I'm talking practical issue.
say, ruby hit the exact same issue
because valgrind uses internal fds and they don't think their exec()
case don't need fd
inheritance. Even though it close libc internal fds, invoked new
executable may open them
again at process strtup code. Therefore, they are using O_CLOEXEC. In
the other hands,
you seems talking about it is corner case. If so, I agree. I was not
argue it. I only say, I
haven't seen real world application require it.

Personally, I'm only interesting real world issue.


>> Yeah, I don't think fdwalk() is problematic. It's an option if I
>> understand Alexey's mail
>> correctly. but I disagree almost all developers should fix a design
>> and rewrite their
>> applications. In theory, they can avoid glibc or they can rewrite all
>> of their code or
>> avoid linux. but there is one problem. unrealistic.
>>
>
> The problem -- as was brought up in the POSIX discussion -- is that you
> actually end up breaking *properly functioning programs*.

But the url only talk about a possibility of misuse.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 20:23                   ` KOSAKI Motohiro
@ 2012-04-11 20:32                     ` H. Peter Anvin
  2012-04-17 18:12                       ` KOSAKI Motohiro
  0 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 20:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
	linux-kernel, linux-fsdevel

On 04/11/2012 01:23 PM, KOSAKI Motohiro wrote:
> 
> Hmmm.... I'm sorry I don't find "considered undesirable". Maybe because
> my English is not very good. can you please help me clarify?
> 

I also went and read the mailing list discussion on the topic.

Ulrich, for example (in his usual mild-mannered style), commented:

> And all these programs and systems are wrong.
> 
> There is no guarantee that one of the fds isn't used behind the
> scenes for something important which is still running as part of the
> fork/exec code.  It's completely unacceptable to build into the
> interfaces the assumption that the programmer knows all the file
> descriptors.
> 
> This is why using CLOEXEC is the only correct way to deal with this
> and now there is no exceuse anymore whatsoever. Every fd-creating
> interface can use CLOEXEC.

> This text says,
> 
>> so a future revision of the standard may indeed add fdwalk( ), although no
>> one in the meeting was willing to draft a proposal for fdwalk( ) at this time
> 
> and, later says after noting F_NEXT and O_CLOEXEC,
> 
>> Therefore, the rest of this proposal seeks to document the problem
>> with closing arbitrary file descriptors, and a new bugid will be
>> opened to propose standardizing some recent interfaces and interface
>> extensions first appearing in Linux
> 
> Do you think latter override former?

Yes.

>>>> b) unsafe because there might be file descriptors used by libc itself.
>>>
>>> I agree this. Even though almost developer don't use libc message catalogue and
>>> we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
>>
>> No, that's exactly the point that we cannot.
> 
> I thknk we are talking different aspect. I'm talking practical issue.
> say, ruby hit the exact same issue
> because valgrind uses internal fds and they don't think their exec()
> case don't need fd
> inheritance. Even though it close libc internal fds, invoked new
> executable may open them
> again at process strtup code. Therefore, they are using O_CLOEXEC. In
> the other hands,
> you seems talking about it is corner case. If so, I agree. I was not
> argue it. I only say, I
> haven't seen real world application require it.
> 
> Personally, I'm only interesting real world issue.

These are real-world issues.

>> The problem -- as was brought up in the POSIX discussion -- is that you
>> actually end up breaking *properly functioning programs*.
> 
> But the url only talk about a possibility of misuse.

There are concrete examples on the mailing list.

Anyway, fdwalk() at least exists as an interface.  There is absolutely
no momentum for FD_NEXT that I can see.

	-hpa


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-06 21:02         ` H. Peter Anvin
@ 2012-04-12 10:54             ` Alexey Dobriyan
  0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-12 10:54 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
>>
>> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
>> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
>> closefrom(3) written via close(fd++) is unreliable.
>>
>
> I call shenanigans on this.  There is no reason to ENOMEM on the second
> written using the fdwalk() implementation I already posted, for example.

open("/proc/self/fd") can fail with ENOMEM.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
@ 2012-04-12 10:54             ` Alexey Dobriyan
  0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-12 10:54 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
>>
>> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
>> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
>> closefrom(3) written via close(fd++) is unreliable.
>>
>
> I call shenanigans on this.  There is no reason to ENOMEM on the second
> written using the fdwalk() implementation I already posted, for example.

open("/proc/self/fd") can fail with ENOMEM.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-12 10:54             ` Alexey Dobriyan
@ 2012-04-12 11:11               ` Alan Cox
  -1 siblings, 0 replies; 78+ messages in thread
From: Alan Cox @ 2012-04-12 11:11 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: H. Peter Anvin, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Thu, 12 Apr 2012 13:54:25 +0300
Alexey Dobriyan <adobriyan@gmail.com> wrote:

> On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
> >>
> >> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
> >> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
> >> closefrom(3) written via close(fd++) is unreliable.
> >>
> >
> > I call shenanigans on this.  There is no reason to ENOMEM on the second
> > written using the fdwalk() implementation I already posted, for example.
> 
> open("/proc/self/fd") can fail with ENOMEM.

Any syscall can fail with a process kill due to a stack extend on out of
memory in most configurations. That includes your nextfd stuff

This whole thing is getting stupid. "Perfection is the enemy of
success".

Your code will also fail when the cat pees on the computer, when the
power fails and when disk dies. I suspect that other than the cat these
are all more likely real world cases than your ENOMEM.

Alan

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
@ 2012-04-12 11:11               ` Alan Cox
  0 siblings, 0 replies; 78+ messages in thread
From: Alan Cox @ 2012-04-12 11:11 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: H. Peter Anvin, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Thu, 12 Apr 2012 13:54:25 +0300
Alexey Dobriyan <adobriyan@gmail.com> wrote:

> On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
> >>
> >> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
> >> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
> >> closefrom(3) written via close(fd++) is unreliable.
> >>
> >
> > I call shenanigans on this.  There is no reason to ENOMEM on the second
> > written using the fdwalk() implementation I already posted, for example.
> 
> open("/proc/self/fd") can fail with ENOMEM.

Any syscall can fail with a process kill due to a stack extend on out of
memory in most configurations. That includes your nextfd stuff

This whole thing is getting stupid. "Perfection is the enemy of
success".

Your code will also fail when the cat pees on the computer, when the
power fails and when disk dies. I suspect that other than the cat these
are all more likely real world cases than your ENOMEM.

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-12 11:11               ` Alan Cox
  (?)
@ 2012-04-12 13:35               ` Alexey Dobriyan
  2012-04-12 13:51                 ` H. Peter Anvin
  -1 siblings, 1 reply; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-12 13:35 UTC (permalink / raw)
  To: Alan Cox
  Cc: H. Peter Anvin, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

On Thu, Apr 12, 2012 at 2:11 PM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> On Thu, 12 Apr 2012 13:54:25 +0300
> Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
>> On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> > On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
>> >>
>> >> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
>> >> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
>> >> closefrom(3) written via close(fd++) is unreliable.
>> >>
>> >
>> > I call shenanigans on this.  There is no reason to ENOMEM on the second
>> > written using the fdwalk() implementation I already posted, for example.
>>
>> open("/proc/self/fd") can fail with ENOMEM.
>
> Any syscall can fail with a process kill due to a stack extend on out of
> memory in most configurations. That includes your nextfd stuff
>
> This whole thing is getting stupid. "Perfection is the enemy of
> success".
>
> Your code will also fail when the cat pees on the computer, when the
> power fails and when disk dies. I suspect that other than the cat these
> are all more likely real world cases than your ENOMEM.

You can't see I'm talking about difference only, can you?
Have you read what hpa wrote exactly and what exactly I was answering to?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-12 13:35               ` Alexey Dobriyan
@ 2012-04-12 13:51                 ` H. Peter Anvin
  2012-04-12 19:21                   ` Alexey Dobriyan
  0 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-12 13:51 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Alan Cox, akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On 04/12/2012 06:35 AM, Alexey Dobriyan wrote:
> 
> You can't see I'm talking about difference only, can you?
> Have you read what hpa wrote exactly and what exactly I was answering to?
> 

He has and he is right.  You're trying to catch pointless corner cases.

	-hpa


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-12 11:11               ` Alan Cox
  (?)
  (?)
@ 2012-04-12 14:09               ` Eric Dumazet
  -1 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2012-04-12 14:09 UTC (permalink / raw)
  To: Alan Cox
  Cc: Alexey Dobriyan, H. Peter Anvin, akpm, viro, torvalds, drepper,
	linux-kernel, linux-fsdevel

On Thu, 2012-04-12 at 12:11 +0100, Alan Cox wrote:

> This whole thing is getting stupid. "Perfection is the enemy of
> success".
> 
> Your code will also fail when the cat pees on the computer, when the
> power fails and when disk dies. I suspect that other than the cat these
> are all more likely real world cases than your ENOMEM.

Time to open a thread on G+ and close this one I guess.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-12 13:51                 ` H. Peter Anvin
@ 2012-04-12 19:21                   ` Alexey Dobriyan
  0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-12 19:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alan Cox, akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel

On Thu, Apr 12, 2012 at 06:51:37AM -0700, H. Peter Anvin wrote:
> He has and he is right.  You're trying to catch pointless corner cases.

For the record, the only objection I find real is about
sequential vs bulk nature of the system call.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-11 20:32                     ` H. Peter Anvin
@ 2012-04-17 18:12                       ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-17 18:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: KOSAKI Motohiro, Linus Torvalds, Alexey Dobriyan, akpm, viro,
	drepper, linux-kernel, linux-fsdevel

(4/11/12 4:32 PM), H. Peter Anvin wrote:
> On 04/11/2012 01:23 PM, KOSAKI Motohiro wrote:
>>
>> Hmmm.... I'm sorry I don't find "considered undesirable". Maybe because
>> my English is not very good. can you please help me clarify?
>>
>
> I also went and read the mailing list discussion on the topic.
>
> Ulrich, for example (in his usual mild-mannered style), commented:
>
>> And all these programs and systems are wrong.
>>
>> There is no guarantee that one of the fds isn't used behind the
>> scenes for something important which is still running as part of the
>> fork/exec code.  It's completely unacceptable to build into the
>> interfaces the assumption that the programmer knows all the file
>> descriptors.
>>
>> This is why using CLOEXEC is the only correct way to deal with this
>> and now there is no exceuse anymore whatsoever. Every fd-creating
>> interface can use CLOEXEC.
>
>> This text says,
>>
>>> so a future revision of the standard may indeed add fdwalk( ), although no
>>> one in the meeting was willing to draft a proposal for fdwalk( ) at this time
>>
>> and, later says after noting F_NEXT and O_CLOEXEC,
>>
>>> Therefore, the rest of this proposal seeks to document the problem
>>> with closing arbitrary file descriptors, and a new bugid will be
>>> opened to propose standardizing some recent interfaces and interface
>>> extensions first appearing in Linux
>>
>> Do you think latter override former?
>
> Yes.
>
>>>>> b) unsafe because there might be file descriptors used by libc itself.
>>>>
>>>> I agree this. Even though almost developer don't use libc message catalogue and
>>>> we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
>>>
>>> No, that's exactly the point that we cannot.
>>
>> I thknk we are talking different aspect. I'm talking practical issue.
>> say, ruby hit the exact same issue
>> because valgrind uses internal fds and they don't think their exec()
>> case don't need fd
>> inheritance. Even though it close libc internal fds, invoked new
>> executable may open them
>> again at process strtup code. Therefore, they are using O_CLOEXEC. In
>> the other hands,
>> you seems talking about it is corner case. If so, I agree. I was not
>> argue it. I only say, I
>> haven't seen real world application require it.
>>
>> Personally, I'm only interesting real world issue.
>
> These are real-world issues.
>
>>> The problem -- as was brought up in the POSIX discussion -- is that you
>>> actually end up breaking *properly functioning programs*.
>>
>> But the url only talk about a possibility of misuse.
>
> There are concrete examples on the mailing list.
>
> Anyway, fdwalk() at least exists as an interface.  There is absolutely
> no momentum for FD_NEXT that I can see.

Thanks Peter, I guess I now understand what you said. Again, thanks for the
patience. _Personally_ I can't agree Ulrich's opinion because I've only seen
fork-closeall-exec pattern. but, I also can't say there is no other use case.

And, as I already wrote, I don't think fdwalk() is bad taste. I only want to
explain the background of nextfd interface bacause you said you have no seen
the reason and I think I can explain the background and motivation because this
is famous issue in user land folks. (oops, this should be noted, I'm not original
patch author and I talked only about my ovserved issue. Alexey might know another
use cause, I dunnno)

Unfortnately, I'll be offlined full of this and next week and then I have to leave
this thread. But I believe I'm not needed this thread any more. :)


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] nextfd(2)
  2012-04-02  0:08   ` Alan Cox
@ 2012-04-30  9:58     ` Valentin Nechayev
  0 siblings, 0 replies; 78+ messages in thread
From: Valentin Nechayev @ 2012-04-30  9:58 UTC (permalink / raw)
  To: Alan Cox, H. Peter Anvin
  Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
	linux-fsdevel

 Sun, Apr 01, 2012 at 15:13:31, hpa wrote about "Re: [PATCH] nextfd(2)": 

> - There is a huge backwards compatibility problem with this for a
> substantial transition period; using /proc/self/fd has worked for a very
> long time already.
> 
> - Your nextfd() system call will require more system calls that the
> typical case for reading /proc/self/fd, because each getdents() system
> call handles multiple readdir() invocations.

First, your forecast is very doubtfully correct in typical case when a
process checks extra descriptors to close; seeing tools like sendmail
shows that typically the all this long cycle does no work. So, it will
reduce to one nextfd() which returns final error, instead of
open()+getdents()+close().

Second, the principal advantage of nextfd() approach that it doesn't
consumer additional resources for its work, compared with open() which
allocates yet another descriptor. So, it could fail unpredictably.

Typical Unix-like system has at least four kinds of critical resources;
"critical" means here that they could be requested even in case of
lack of such resource, to provide a kind of graceful shutdown. They
are virtual memory, file descriptors, threads and disk space, all
possibly limited per-system, per-user or per-process. The current lack
of a measure to provide work when a resource is exhausted gives only
one possible reaction - to die immediately; this is now typical
reaction to memory allocation failure. I'm already sick of seeing
a language runtime which crash with messages like "cannot allocate
500M" when doing GC; this isn't single example but common approach of
such a devil-may-care attitude to application stability.

It isn't hard to expand the current approach in two ways: 1) allowing
to allocate reserve pool which will be then used by compating or
shutdown actions; 2) making actions which are typical on such
compacting using minimum of resources, even with some loss of
uniformity, API orthogonality and simplicity. Yep, this way is long
and can spend decades - more long than /proc existence - but it is
good to achieve.

Seems all this was in mind of Alexey Dobriyan when he proposed
nextfd(); maybe not with the same level of details, but in general.

 Mon, Apr 02, 2012 at 01:08:19, alan wrote about "Re: [PATCH] nextfd(2)": 

> If I've rlimited you to no file handles then you already lost whichever
> approach you use. Just abort.

Well, this is the good illustration to my words. In real work I don't
want to use the world when abort() is not simply the only case for
complicated state, but the suggestion. My software isn't
samurai-inspired kamikaze.


-netch-

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2012-04-30  9:59 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
2012-04-01 13:58 ` Konstantin Khlebnikov
2012-04-01 21:30   ` Alexey Dobriyan
2012-04-02  0:09   ` Alan Cox
2012-04-02  8:38     ` Konstantin Khlebnikov
2012-04-02  9:26       ` Cyrill Gorcunov
2012-04-01 15:43 ` Eric Dumazet
2012-04-01 21:31   ` Alexey Dobriyan
2012-04-01 21:36   ` Alan Cox
2012-04-01 17:20 ` Linus Torvalds
2012-04-01 18:28 ` Valentin Nechayev
2012-04-01 21:33   ` Alexey Dobriyan
2012-04-01 19:21 ` Arnd Bergmann
2012-04-01 21:35   ` Alexey Dobriyan
2012-04-01 22:05   ` H. Peter Anvin
2012-04-04 12:13     ` Arnd Bergmann
2012-04-01 22:03 ` H. Peter Anvin
2012-04-01 22:13   ` H. Peter Anvin
2012-04-02  0:08   ` Alan Cox
2012-04-30  9:58     ` Valentin Nechayev
2012-04-02  1:19   ` Kyle Moffett
2012-04-02  1:19     ` Kyle Moffett
2012-04-02  1:37     ` H. Peter Anvin
2012-04-02 11:37     ` Ulrich Drepper
2012-04-06  9:54   ` Alexey Dobriyan
2012-04-06  9:54     ` Alexey Dobriyan
2012-04-06 15:27     ` Colin Walters
2012-04-06 16:14     ` H. Peter Anvin
2012-04-06 20:16       ` Alexey Dobriyan
2012-04-06 20:33         ` H. Peter Anvin
2012-04-06 21:02         ` H. Peter Anvin
2012-04-12 10:54           ` Alexey Dobriyan
2012-04-12 10:54             ` Alexey Dobriyan
2012-04-12 11:11             ` Alan Cox
2012-04-12 11:11               ` Alan Cox
2012-04-12 13:35               ` Alexey Dobriyan
2012-04-12 13:51                 ` H. Peter Anvin
2012-04-12 19:21                   ` Alexey Dobriyan
2012-04-12 14:09               ` Eric Dumazet
2012-04-06 16:23     ` H. Peter Anvin
2012-04-07 21:21       ` Ben Pfaff
2012-04-11  0:12         ` KOSAKI Motohiro
2012-04-11  0:12           ` KOSAKI Motohiro
2012-04-11  0:09       ` KOSAKI Motohiro
2012-04-11 17:58         ` H. Peter Anvin
2012-04-11 18:04           ` Linus Torvalds
2012-04-11 18:04             ` Linus Torvalds
2012-04-11 18:11             ` H. Peter Anvin
2012-04-11 19:46               ` KOSAKI Motohiro
2012-04-11 19:46                 ` KOSAKI Motohiro
2012-04-11 19:49                 ` H. Peter Anvin
2012-04-11 20:23                   ` KOSAKI Motohiro
2012-04-11 20:32                     ` H. Peter Anvin
2012-04-17 18:12                       ` KOSAKI Motohiro
2012-04-11 18:00         ` H. Peter Anvin
2012-04-11 19:20           ` KOSAKI Motohiro
2012-04-11 19:20             ` KOSAKI Motohiro
2012-04-11 19:22             ` H. Peter Anvin
2012-04-11 19:26               ` KOSAKI Motohiro
2012-04-11 19:28                 ` H. Peter Anvin
2012-04-11 19:31                   ` KOSAKI Motohiro
2012-04-11 19:32                     ` H. Peter Anvin
2012-04-02 23:17 ` KOSAKI Motohiro
2012-04-02 23:56   ` H. Peter Anvin
2012-04-04 11:51     ` Ulrich Drepper
2012-04-04 16:38       ` KOSAKI Motohiro
2012-04-04 16:43         ` Ulrich Drepper
2012-04-04 17:07           ` KOSAKI Motohiro
2012-04-04 17:49             ` Ulrich Drepper
2012-04-04 18:08               ` KOSAKI Motohiro
2012-04-04 16:31     ` KOSAKI Motohiro
2012-04-04 17:10       ` Colin Walters
2012-04-04 17:25         ` Colin Walters
2012-04-04 23:35         ` KOSAKI Motohiro
2012-04-04 18:44       ` H. Peter Anvin
2012-04-03 19:21   ` Colin Walters
2012-04-04  3:01 ` Al Viro
2012-04-04 17:10   ` KOSAKI Motohiro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.