* Re: [PATCH] nextfd(2)
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
@ 2012-04-01 13:58 ` Konstantin Khlebnikov
2012-04-01 21:30 ` Alexey Dobriyan
2012-04-02 0:09 ` Alan Cox
2012-04-01 15:43 ` Eric Dumazet
` (6 subsequent siblings)
7 siblings, 2 replies; 78+ messages in thread
From: Konstantin Khlebnikov @ 2012-04-01 13:58 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
Alexey Dobriyan wrote:
> Currently there is no reliable way to close all opened file descriptors
> (which daemons need and like to do):
>
> * dumb close(fd) loop is slow, upper bound is unknown and
> can be arbitrary large,
>
> * /proc/self/fd is unreliable:
> proc may be unconfigured or not mounted at expected place.
> Looking at /proc/self/fd requires opening directory
> which may not be available due to malicious rlimit drop or ENOMEM situations.
> Not opening directory is equivalent to dumb close(2) loop except slower.
>
> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
> for userspace.
>
> So, don't add closefrom(2), add nextfd(2).
>
> int nextfd(int fd)
Can we add "pid" argument to be able to search next fd in other task?
Together with sys_kcmp() this will be very useful for checkpoint/restore.
>
> returns next opened file descriptor which is>= than fd or -1/ESRCH
> if there aren't any descriptors>= than fd.
>
> Thus closefrom(3) can be rewritten through it in userspace:
>
> void closefrom(int fd)
> {
> while (1) {
> fd = nextfd(fd);
> if (fd == -1&& errno == ESRCH)
> break;
> (void)close(fd);
> fd++;
> }
> }
>
> Maybe it will grow other smart uses.
>
> nextfd(2) doesn't change kernel state and thus can't fail
> which is why it should go in. Other means may fail or
> may not be available or require linear time with only guessed
> upper boundaries (1024, getrlimit(RLIM_NOFILE), sysconf(_SC_OPEN_MAX).
>
> Signed-off-by: Alexey Dobriyan<adobriyan@gmail.com>
> ---
>
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/Makefile | 1 +
> fs/nextfd.c | 27 +++++++++++++++++++++++++++
> include/linux/syscalls.h | 1 +
> 5 files changed, 31 insertions(+)
>
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -355,3 +355,4 @@
> 346 i386 setns sys_setns
> 347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv
> 348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev
> +349 i386 nextfd sys_nextfd
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -318,6 +318,7 @@
> 309 common getcpu sys_getcpu
> 310 64 process_vm_readv sys_process_vm_readv
> 311 64 process_vm_writev sys_process_vm_writev
> +312 64 nextfd sys_nextfd
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> # for native 64-bit operation.
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -12,6 +12,7 @@ obj-y := open.o read_write.o file_table.o super.o \
> seq_file.o xattr.o libfs.o fs-writeback.o \
> pnode.o drop_caches.o splice.o sync.o utimes.o \
> stack.o fs_struct.o statfs.o
> +obj-y += nextfd.o
>
> ifeq ($(CONFIG_BLOCK),y)
> obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
> --- /dev/null
> +++ b/fs/nextfd.c
> @@ -0,0 +1,27 @@
> +#include<linux/errno.h>
> +#include<linux/fdtable.h>
> +#include<linux/rcupdate.h>
> +#include<linux/sched.h>
> +#include<linux/syscalls.h>
> +
> +/* Return first opened file descriptor which is>= than the argument. */
> +SYSCALL_DEFINE1(nextfd, unsigned int, fd)
> +{
> + struct files_struct *files = current->files;
> + struct fdtable *fdt;
> +
> + rcu_read_lock();
> + fdt = files_fdtable(files);
> + while (fd< fdt->max_fds) {
> + struct file *file;
> +
> + file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
> + if (file) {
> + rcu_read_unlock();
> + return fd;
> + }
> + fd++;
> + }
> + rcu_read_unlock();
> + return -ESRCH;
> +}
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -857,5 +857,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
> const struct iovec __user *rvec,
> unsigned long riovcnt,
> unsigned long flags);
> +asmlinkage long sys_nextfd(unsigned int fd);
>
> #endif
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 13:58 ` Konstantin Khlebnikov
@ 2012-04-01 21:30 ` Alexey Dobriyan
2012-04-02 0:09 ` Alan Cox
1 sibling, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 21:30 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Sun, Apr 01, 2012 at 05:58:53PM +0400, Konstantin Khlebnikov wrote:
> Alexey Dobriyan wrote:
> > So, don't add closefrom(2), add nextfd(2).
> >
> > int nextfd(int fd)
>
> Can we add "pid" argument to be able to search next fd in other task?
> Together with sys_kcmp() this will be very useful for checkpoint/restore.
You've added /proc/*/map_files for C/R making proc mandatory,
so you do have /proc/*/fd .
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 13:58 ` Konstantin Khlebnikov
2012-04-01 21:30 ` Alexey Dobriyan
@ 2012-04-02 0:09 ` Alan Cox
2012-04-02 8:38 ` Konstantin Khlebnikov
1 sibling, 1 reply; 78+ messages in thread
From: Alan Cox @ 2012-04-02 0:09 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
> Can we add "pid" argument to be able to search next fd in other task?
> Together with sys_kcmp() this will be very useful for checkpoint/restore.
That would raise all sorts of security questions unless protected, plus
its got races. If you want to do that and you have the rights you can do
it via procfs anyway.
Alan
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 0:09 ` Alan Cox
@ 2012-04-02 8:38 ` Konstantin Khlebnikov
2012-04-02 9:26 ` Cyrill Gorcunov
0 siblings, 1 reply; 78+ messages in thread
From: Konstantin Khlebnikov @ 2012-04-02 8:38 UTC (permalink / raw)
To: Alan Cox
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel, Cyrill Gorcunov
Alan Cox wrote:
>> Can we add "pid" argument to be able to search next fd in other task?
>> Together with sys_kcmp() this will be very useful for checkpoint/restore.
>
> That would raise all sorts of security questions unless protected, plus
> its got races. If you want to do that and you have the rights you can do
> it via procfs anyway.
Don't worry, I just asked. =) Seems like crtools dumps fd in the target-task context,
so the current version is suitable.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 8:38 ` Konstantin Khlebnikov
@ 2012-04-02 9:26 ` Cyrill Gorcunov
0 siblings, 0 replies; 78+ messages in thread
From: Cyrill Gorcunov @ 2012-04-02 9:26 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: Alan Cox, Alexey Dobriyan, akpm, viro, torvalds, drepper,
linux-kernel, linux-fsdevel
On Mon, Apr 02, 2012 at 12:38:25PM +0400, Konstantin Khlebnikov wrote:
> Alan Cox wrote:
> >>Can we add "pid" argument to be able to search next fd in other task?
> >>Together with sys_kcmp() this will be very useful for checkpoint/restore.
> >
> >That would raise all sorts of security questions unless protected, plus
> >its got races. If you want to do that and you have the rights you can do
> >it via procfs anyway.
>
> Don't worry, I just asked. =) Seems like crtools dumps fd in the target-task context,
> so the current version is suitable.
Kind of ;) At moment we drain fds from dumpee task context to
our tool context and then dump them.
Cyrill
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
2012-04-01 13:58 ` Konstantin Khlebnikov
@ 2012-04-01 15:43 ` Eric Dumazet
2012-04-01 21:31 ` Alexey Dobriyan
2012-04-01 21:36 ` Alan Cox
2012-04-01 17:20 ` Linus Torvalds
` (5 subsequent siblings)
7 siblings, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2012-04-01 15:43 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Sun, 2012-04-01 at 15:57 +0300, Alexey Dobriyan wrote:
> +
> +/* Return first opened file descriptor which is >= than the argument. */
> +SYSCALL_DEFINE1(nextfd, unsigned int, fd)
> +{
> + struct files_struct *files = current->files;
> + struct fdtable *fdt;
> +
> + rcu_read_lock();
> + fdt = files_fdtable(files);
> + while (fd < fdt->max_fds) {
> + struct file *file;
> +
> + file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
> + if (file) {
> + rcu_read_unlock();
> + return fd;
> + }
> + fd++;
> + }
> + rcu_read_unlock();
> + return -ESRCH;
> +}
Interesting idea but what about using fdt->open_fds bitmap to have a
fast search and less cache pollution ?
alloc_fd(start, flags) uses find_next_zero_bit(), you could use
find_next_bit().
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 15:43 ` Eric Dumazet
@ 2012-04-01 21:31 ` Alexey Dobriyan
2012-04-01 21:36 ` Alan Cox
1 sibling, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 21:31 UTC (permalink / raw)
To: Eric Dumazet; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Sun, Apr 01, 2012 at 05:43:25PM +0200, Eric Dumazet wrote:
> On Sun, 2012-04-01 at 15:57 +0300, Alexey Dobriyan wrote:
>
> > +
> > +/* Return first opened file descriptor which is >= than the argument. */
> > +SYSCALL_DEFINE1(nextfd, unsigned int, fd)
> > +{
> > + struct files_struct *files = current->files;
> > + struct fdtable *fdt;
> > +
> > + rcu_read_lock();
> > + fdt = files_fdtable(files);
> > + while (fd < fdt->max_fds) {
> > + struct file *file;
> > +
> > + file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
> > + if (file) {
> > + rcu_read_unlock();
> > + return fd;
> > + }
> > + fd++;
> > + }
> > + rcu_read_unlock();
> > + return -ESRCH;
> > +}
>
> Interesting idea but what about using fdt->open_fds bitmap to have a
> fast search and less cache pollution ?
>
> alloc_fd(start, flags) uses find_next_zero_bit(), you could use
> find_next_bit().
Indeed.
I've copied code from /proc/*/fd implementation which does loop.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 15:43 ` Eric Dumazet
2012-04-01 21:31 ` Alexey Dobriyan
@ 2012-04-01 21:36 ` Alan Cox
1 sibling, 0 replies; 78+ messages in thread
From: Alan Cox @ 2012-04-01 21:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
> > + return -ESRCH;
> > +}
>
> Interesting idea but what about using fdt->open_fds bitmap to have a
> fast search and less cache pollution
I do wonder who needs this on a hot path versus just looking
in /proc/self ?
Alan
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
2012-04-01 13:58 ` Konstantin Khlebnikov
2012-04-01 15:43 ` Eric Dumazet
@ 2012-04-01 17:20 ` Linus Torvalds
2012-04-01 18:28 ` Valentin Nechayev
` (4 subsequent siblings)
7 siblings, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2012-04-01 17:20 UTC (permalink / raw)
To: Alexey Dobriyan; +Cc: akpm, viro, drepper, linux-kernel, linux-fsdevel
On Sun, Apr 1, 2012 at 5:57 AM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
> for userspace.
>
> So, don't add closefrom(2), add nextfd(2).
I don't disagree with this, but I don't think it's worth a new file of
its own - I think it should go next to dup() and friends (currently in
fs/fcntl.h).
Also, I think you should:
(a) at a minimum use the bitmap. It's easy:
struct fdtable *fdt;
rcu_read_lock();
fdt = files_fdtable(current->files);
.. maximum fd in fdt->max_fds ...
.. you have the bitmap in fdt->open_fds ..
(b) I'd also suggest you add a "flag" argument and have at least one
bit option for "search for next *free* file table entry". I'm not sure
anybody wants it, but it is kind of the same operation, and having a
"flags" field means that it's extensible in the future if people want
to find the next cloexec fd or whatever. Or if they just want to find
the *last* fd, or something.
But I think the concept makes sense.
Linus
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
` (2 preceding siblings ...)
2012-04-01 17:20 ` Linus Torvalds
@ 2012-04-01 18:28 ` Valentin Nechayev
2012-04-01 21:33 ` Alexey Dobriyan
2012-04-01 19:21 ` Arnd Bergmann
` (3 subsequent siblings)
7 siblings, 1 reply; 78+ messages in thread
From: Valentin Nechayev @ 2012-04-01 18:28 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
Sun, Apr 01, 2012 at 15:57:42, adobriyan wrote about "[PATCH] nextfd(2)":
> So, don't add closefrom(2), add nextfd(2).
>
> int nextfd(int fd)
Is it really needed here to create syscall? One shall update all arch
lists for it. Instead, adding a new option for fcntl() (named e.g.
F_NEXTFD) solves for all platforms simultaneously.
-netch-
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 18:28 ` Valentin Nechayev
@ 2012-04-01 21:33 ` Alexey Dobriyan
0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 21:33 UTC (permalink / raw)
To: Valentin Nechayev
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Sun, Apr 01, 2012 at 09:28:26PM +0300, Valentin Nechayev wrote:
> Sun, Apr 01, 2012 at 15:57:42, adobriyan wrote about "[PATCH] nextfd(2)":
>
> > So, don't add closefrom(2), add nextfd(2).
> >
> > int nextfd(int fd)
>
> Is it really needed here to create syscall? One shall update all arch
> lists for it. Instead, adding a new option for fcntl() (named e.g.
> F_NEXTFD) solves for all platforms simultaneously.
This is not a problem, build system even checks for not yet added syscalls.
Explicit syscall is better than multiplexer.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
` (3 preceding siblings ...)
2012-04-01 18:28 ` Valentin Nechayev
@ 2012-04-01 19:21 ` Arnd Bergmann
2012-04-01 21:35 ` Alexey Dobriyan
2012-04-01 22:05 ` H. Peter Anvin
2012-04-01 22:03 ` H. Peter Anvin
` (2 subsequent siblings)
7 siblings, 2 replies; 78+ messages in thread
From: Arnd Bergmann @ 2012-04-01 19:21 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Sunday 01 April 2012, Alexey Dobriyan wrote:
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/Makefile | 1 +
> fs/nextfd.c | 27 +++++++++++++++++++++++++++
> include/linux/syscalls.h | 1 +
> 5 files changed, 31 insertions(+)
I don't have any comments on the syscall itself, but when you add one, please
also make the change to include/asm-generic/unistd.h so it appears in the
various architectures using the generic syscall table.
Arnd
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 19:21 ` Arnd Bergmann
@ 2012-04-01 21:35 ` Alexey Dobriyan
2012-04-01 22:05 ` H. Peter Anvin
1 sibling, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-01 21:35 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Sun, Apr 01, 2012 at 07:21:59PM +0000, Arnd Bergmann wrote:
> On Sunday 01 April 2012, Alexey Dobriyan wrote:
> > arch/x86/syscalls/syscall_32.tbl | 1 +
> > arch/x86/syscalls/syscall_64.tbl | 1 +
> > fs/Makefile | 1 +
> > fs/nextfd.c | 27 +++++++++++++++++++++++++++
> > include/linux/syscalls.h | 1 +
> > 5 files changed, 31 insertions(+)
>
> I don't have any comments on the syscall itself, but when you add one, please
> also make the change to include/asm-generic/unistd.h so it appears in the
> various architectures using the generic syscall table.
I will.
The amount of places to plug new syscall is suprising. :-(
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 19:21 ` Arnd Bergmann
2012-04-01 21:35 ` Alexey Dobriyan
@ 2012-04-01 22:05 ` H. Peter Anvin
2012-04-04 12:13 ` Arnd Bergmann
1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-01 22:05 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On 04/01/2012 12:21 PM, Arnd Bergmann wrote:
> On Sunday 01 April 2012, Alexey Dobriyan wrote:
>> arch/x86/syscalls/syscall_32.tbl | 1 +
>> arch/x86/syscalls/syscall_64.tbl | 1 +
>> fs/Makefile | 1 +
>> fs/nextfd.c | 27 +++++++++++++++++++++++++++
>> include/linux/syscalls.h | 1 +
>> 5 files changed, 31 insertions(+)
>
> I don't have any comments on the syscall itself, but when you add one, please
> also make the change to include/asm-generic/unistd.h so it appears in the
> various architectures using the generic syscall table.
>
> Arnd
Arnd: do you have any interest in leveraging the syscall scripts I did
for x86? I have tried to make them as generic as possible, with the
hope of getting more and more of syscall information into more easily
processed form.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 22:05 ` H. Peter Anvin
@ 2012-04-04 12:13 ` Arnd Bergmann
0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2012-04-04 12:13 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Sunday 01 April 2012, H. Peter Anvin wrote:
> On 04/01/2012 12:21 PM, Arnd Bergmann wrote:
> > On Sunday 01 April 2012, Alexey Dobriyan wrote:
> >> arch/x86/syscalls/syscall_32.tbl | 1 +
> >> arch/x86/syscalls/syscall_64.tbl | 1 +
> >> fs/Makefile | 1 +
> >> fs/nextfd.c | 27 +++++++++++++++++++++++++++
> >> include/linux/syscalls.h | 1 +
> >> 5 files changed, 31 insertions(+)
> >
> > I don't have any comments on the syscall itself, but when you add one, please
> > also make the change to include/asm-generic/unistd.h so it appears in the
> > various architectures using the generic syscall table.
>
> Arnd: do you have any interest in leveraging the syscall scripts I did
> for x86? I have tried to make them as generic as possible, with the
> hope of getting more and more of syscall information into more easily
> processed form.
Sounds interesting, but I'm not planning to do the changes myself,
especially since I have no machine that actually uses the generic
syscall table.
Maybe I should ask the next person who submits a new architecture to
do that work, that's usually how progress in asm-generic happens
these days ;-)
Arnd
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
` (4 preceding siblings ...)
2012-04-01 19:21 ` Arnd Bergmann
@ 2012-04-01 22:03 ` H. Peter Anvin
2012-04-01 22:13 ` H. Peter Anvin
` (3 more replies)
2012-04-02 23:17 ` KOSAKI Motohiro
2012-04-04 3:01 ` Al Viro
7 siblings, 4 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-01 22:03 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>
> * /proc/self/fd is unreliable:
> proc may be unconfigured or not mounted at expected place.
> Looking at /proc/self/fd requires opening directory
> which may not be available due to malicious rlimit drop or ENOMEM situations.
> Not opening directory is equivalent to dumb close(2) loop except slower.
>
This is really the motivation for this... the real question is how much
functionality is actually available in the system without /proc mounted,
and in particular if this particular subcase is worth optimizing ...
after all, if someone is maliciously setting rlimit, we can just abort
(if someone can set an rlimit they can also force an abort) or revert to
the slow path.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 22:03 ` H. Peter Anvin
@ 2012-04-01 22:13 ` H. Peter Anvin
2012-04-02 0:08 ` Alan Cox
` (2 subsequent siblings)
3 siblings, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-01 22:13 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On 04/01/2012 03:03 PM, H. Peter Anvin wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>>
>> * /proc/self/fd is unreliable:
>> proc may be unconfigured or not mounted at expected place.
>> Looking at /proc/self/fd requires opening directory
>> which may not be available due to malicious rlimit drop or ENOMEM situations.
>> Not opening directory is equivalent to dumb close(2) loop except slower.
>>
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...
> after all, if someone is maliciously setting rlimit, we can just abort
> (if someone can set an rlimit they can also force an abort) or revert to
> the slow path.
>
A few more observations:
- There is a huge backwards compatibility problem with this for a
substantial transition period; using /proc/self/fd has worked for a very
long time already.
- Your nextfd() system call will require more system calls that the
typical case for reading /proc/self/fd, because each getdents() system
call handles multiple readdir() invocations.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 22:03 ` H. Peter Anvin
2012-04-01 22:13 ` H. Peter Anvin
@ 2012-04-02 0:08 ` Alan Cox
2012-04-30 9:58 ` Valentin Nechayev
2012-04-02 1:19 ` Kyle Moffett
2012-04-06 9:54 ` Alexey Dobriyan
3 siblings, 1 reply; 78+ messages in thread
From: Alan Cox @ 2012-04-02 0:08 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Sun, 01 Apr 2012 15:03:38 -0700
"H. Peter Anvin" <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
> >
> > * /proc/self/fd is unreliable:
> > proc may be unconfigured or not mounted at expected place.
Equally someone could be ptracing your process and changing the answers.
Both are "do stupid stuff, get stupid answers" cases. Why do we care ???
Lots of other stuff breaks in this case so nobody will be doing it for
good reasons
> > Looking at /proc/self/fd requires opening directory
Which is very efficient
> > which may not be available due to malicious rlimit drop or ENOMEM situations.
> > Not opening directory is equivalent to dumb close(2) loop except slower.
If I've rlimited you to no file handles then you already lost whichever
approach you use. Just abort. Trying to do clever stuff in security
sensitive situations via codepaths that inevitably don't get tested when
you've apparently got a malicious executor is dumb.
> (if someone can set an rlimit they can also force an abort) or revert to
> the slow path.
Or have set a CPU limit or memory limit or more.
Alan
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 0:08 ` Alan Cox
@ 2012-04-30 9:58 ` Valentin Nechayev
0 siblings, 0 replies; 78+ messages in thread
From: Valentin Nechayev @ 2012-04-30 9:58 UTC (permalink / raw)
To: Alan Cox, H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
Sun, Apr 01, 2012 at 15:13:31, hpa wrote about "Re: [PATCH] nextfd(2)":
> - There is a huge backwards compatibility problem with this for a
> substantial transition period; using /proc/self/fd has worked for a very
> long time already.
>
> - Your nextfd() system call will require more system calls that the
> typical case for reading /proc/self/fd, because each getdents() system
> call handles multiple readdir() invocations.
First, your forecast is very doubtfully correct in typical case when a
process checks extra descriptors to close; seeing tools like sendmail
shows that typically the all this long cycle does no work. So, it will
reduce to one nextfd() which returns final error, instead of
open()+getdents()+close().
Second, the principal advantage of nextfd() approach that it doesn't
consumer additional resources for its work, compared with open() which
allocates yet another descriptor. So, it could fail unpredictably.
Typical Unix-like system has at least four kinds of critical resources;
"critical" means here that they could be requested even in case of
lack of such resource, to provide a kind of graceful shutdown. They
are virtual memory, file descriptors, threads and disk space, all
possibly limited per-system, per-user or per-process. The current lack
of a measure to provide work when a resource is exhausted gives only
one possible reaction - to die immediately; this is now typical
reaction to memory allocation failure. I'm already sick of seeing
a language runtime which crash with messages like "cannot allocate
500M" when doing GC; this isn't single example but common approach of
such a devil-may-care attitude to application stability.
It isn't hard to expand the current approach in two ways: 1) allowing
to allocate reserve pool which will be then used by compating or
shutdown actions; 2) making actions which are typical on such
compacting using minimum of resources, even with some loss of
uniformity, API orthogonality and simplicity. Yep, this way is long
and can spend decades - more long than /proc existence - but it is
good to achieve.
Seems all this was in mind of Alexey Dobriyan when he proposed
nextfd(); maybe not with the same level of details, but in general.
Mon, Apr 02, 2012 at 01:08:19, alan wrote about "Re: [PATCH] nextfd(2)":
> If I've rlimited you to no file handles then you already lost whichever
> approach you use. Just abort.
Well, this is the good illustration to my words. In real work I don't
want to use the world when abort() is not simply the only case for
complicated state, but the suggestion. My software isn't
samurai-inspired kamikaze.
-netch-
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 22:03 ` H. Peter Anvin
@ 2012-04-02 1:19 ` Kyle Moffett
2012-04-02 0:08 ` Alan Cox
` (2 subsequent siblings)
3 siblings, 0 replies; 78+ messages in thread
From: Kyle Moffett @ 2012-04-02 1:19 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Sun, Apr 1, 2012 at 15:03, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>> * /proc/self/fd is unreliable:
>> proc may be unconfigured or not mounted at expected place.
>> Looking at /proc/self/fd requires opening directory
>> which may not be available due to malicious rlimit drop or ENOMEM situations.
>> Not opening directory is equivalent to dumb close(2) loop except slower.
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...
> after all, if someone is maliciously setting rlimit, we can just abort
> (if someone can set an rlimit they can also force an abort) or revert to
> the slow path.
Well, I imagine one typical usecase for closing all FDs is for
security isolation purposes (EG: chroot()+etc), and in a great deal of
chroot environments you don't have /proc available. In particular
/proc has been a source of a lot of privilege escalations in the past,
so avoiding mounting it in a chroot is good security policy if
possible.
Cheers,
Kyle Moffett
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
@ 2012-04-02 1:19 ` Kyle Moffett
0 siblings, 0 replies; 78+ messages in thread
From: Kyle Moffett @ 2012-04-02 1:19 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Sun, Apr 1, 2012 at 15:03, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>> * /proc/self/fd is unreliable:
>> proc may be unconfigured or not mounted at expected place.
>> Looking at /proc/self/fd requires opening directory
>> which may not be available due to malicious rlimit drop or ENOMEM situations.
>> Not opening directory is equivalent to dumb close(2) loop except slower.
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...
> after all, if someone is maliciously setting rlimit, we can just abort
> (if someone can set an rlimit they can also force an abort) or revert to
> the slow path.
Well, I imagine one typical usecase for closing all FDs is for
security isolation purposes (EG: chroot()+etc), and in a great deal of
chroot environments you don't have /proc available. In particular
/proc has been a source of a lot of privilege escalations in the past,
so avoiding mounting it in a chroot is good security policy if
possible.
Cheers,
Kyle Moffett
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 1:19 ` Kyle Moffett
(?)
@ 2012-04-02 1:37 ` H. Peter Anvin
-1 siblings, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-02 1:37 UTC (permalink / raw)
To: Kyle Moffett
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
Are those use cases heavyweight enough to motivate a new interface?
Kyle Moffett <kyle@moffetthome.net> wrote:
>On Sun, Apr 1, 2012 at 15:03, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>>> * /proc/self/fd is unreliable:
>>> proc may be unconfigured or not mounted at expected place.
>>> Looking at /proc/self/fd requires opening directory
>>> which may not be available due to malicious rlimit drop or ENOMEM
>situations.
>>> Not opening directory is equivalent to dumb close(2) loop except
>slower.
>>
>> This is really the motivation for this... the real question is how
>much
>> functionality is actually available in the system without /proc
>mounted,
>> and in particular if this particular subcase is worth optimizing ...
>> after all, if someone is maliciously setting rlimit, we can just
>abort
>> (if someone can set an rlimit they can also force an abort) or revert
>to
>> the slow path.
>
>Well, I imagine one typical usecase for closing all FDs is for
>security isolation purposes (EG: chroot()+etc), and in a great deal of
>chroot environments you don't have /proc available. In particular
>/proc has been a source of a lot of privilege escalations in the past,
>so avoiding mounting it in a chroot is good security policy if
>possible.
>
>Cheers,
>Kyle Moffett
--
Sent from my mobile phone. Please excuse my brevity and lack of formatting.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 1:19 ` Kyle Moffett
(?)
(?)
@ 2012-04-02 11:37 ` Ulrich Drepper
-1 siblings, 0 replies; 78+ messages in thread
From: Ulrich Drepper @ 2012-04-02 11:37 UTC (permalink / raw)
To: Kyle Moffett
Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds,
linux-kernel, linux-fsdevel
On Sun, Apr 1, 2012 at 21:19, Kyle Moffett <kyle@moffetthome.net> wrote:
> Well, I imagine one typical usecase for closing all FDs is for
> security isolation purposes (EG: chroot()+etc),
chroot and security in the same sentence...?
> and in a great deal of
> chroot environments you don't have /proc available. In particular
> /proc has been a source of a lot of privilege escalations in the past,
> so avoiding mounting it in a chroot is good security policy if
> possible.
The problem is that the kernel exports quite a bit of information only
through the /proc and /sys filesystems. I might try to finish my
comprehensive list of functionality depending on /proc sometime soon.
The list is quite long.
Not mounting /proc is inconvenient at best, it renders the environment
unusable quite often and in some cases is outright insecure.. I don't
think you can use not mounting /proc as an argument. And, as Peter
said, the loop over the directory content is quite efficient.
If you want to avoid /proc I suggest you first work on removing the
dependencies. Of just secure /proc itself.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 22:03 ` H. Peter Anvin
@ 2012-04-06 9:54 ` Alexey Dobriyan
2012-04-02 0:08 ` Alan Cox
` (2 subsequent siblings)
3 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-06 9:54 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Mon, Apr 2, 2012 at 1:03 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>>
>> * /proc/self/fd is unreliable:
>> proc may be unconfigured or not mounted at expected place.
>> Looking at /proc/self/fd requires opening directory
>> which may not be available due to malicious rlimit drop or ENOMEM situations.
>> Not opening directory is equivalent to dumb close(2) loop except slower.
>>
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...
I agree, this particular changelog may be somewhat out of line.
But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
fix compilation problems associated with it, do not mount proc by default,
do not mark it unmountable somehow and
then say procless setups aren't worth it.
I haven't seen personally procless environments
but several people mentioned them including on this very list.
Without proc knowledge about fdtable is gathered linearly and still unreliable.
With nextfd(2), even procful environments could lose several failure branches.
And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
@ 2012-04-06 9:54 ` Alexey Dobriyan
0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-06 9:54 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Mon, Apr 2, 2012 at 1:03 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/01/2012 05:57 AM, Alexey Dobriyan wrote:
>>
>> * /proc/self/fd is unreliable:
>> proc may be unconfigured or not mounted at expected place.
>> Looking at /proc/self/fd requires opening directory
>> which may not be available due to malicious rlimit drop or ENOMEM situations.
>> Not opening directory is equivalent to dumb close(2) loop except slower.
>>
>
> This is really the motivation for this... the real question is how much
> functionality is actually available in the system without /proc mounted,
> and in particular if this particular subcase is worth optimizing ...
I agree, this particular changelog may be somewhat out of line.
But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
fix compilation problems associated with it, do not mount proc by default,
do not mark it unmountable somehow and
then say procless setups aren't worth it.
I haven't seen personally procless environments
but several people mentioned them including on this very list.
Without proc knowledge about fdtable is gathered linearly and still unreliable.
With nextfd(2), even procful environments could lose several failure branches.
And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 9:54 ` Alexey Dobriyan
(?)
@ 2012-04-06 15:27 ` Colin Walters
-1 siblings, 0 replies; 78+ messages in thread
From: Colin Walters @ 2012-04-06 15:27 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: H. Peter Anvin, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Fri, 2012-04-06 at 12:54 +0300, Alexey Dobriyan wrote:
> But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
> fix compilation problems associated with it, do not mount proc by default,
> do not mark it unmountable somehow and
> then say procless setups aren't worth it.
>
> I haven't seen personally procless environments
> but several people mentioned them including on this very list.
Now that the kernel has CLONE_NEWNS, it's possible to mount proc
"privately" just for a specific process tree. It meshes nicely with
CLONE_NEWPID. Previously if you mounted proc in a chroot, it cluttered
the mount list and leaked information about outside the root.
With modern clone/unshare, that's no longer a concern, so there's
much less reason to use "bare" chroots.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 9:54 ` Alexey Dobriyan
(?)
(?)
@ 2012-04-06 16:14 ` H. Peter Anvin
2012-04-06 20:16 ` Alexey Dobriyan
-1 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-06 16:14 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>
> I agree, this particular changelog may be somewhat out of line.
>
> But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
> fix compilation problems associated with it, do not mount proc by default,
> do not mark it unmountable somehow and
> then say procless setups aren't worth it.
>
Aren't worth *optimizing for*. But yes, CONFIG_PROC_FS is pretty much a
historic relic at this point, and probably should just be dropped.
> Without proc knowledge about fdtable is gathered linearly and still unreliable.
> With nextfd(2), even procful environments could lose several failure branches.
What? Please explain how on Earth this would "lose several failure
branches."
New system calls are not something that should be added lightly...
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 16:14 ` H. Peter Anvin
@ 2012-04-06 20:16 ` Alexey Dobriyan
2012-04-06 20:33 ` H. Peter Anvin
2012-04-06 21:02 ` H. Peter Anvin
0 siblings, 2 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-06 20:16 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Fri, Apr 06, 2012 at 09:14:17AM -0700, H. Peter Anvin wrote:
> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
> >
> > I agree, this particular changelog may be somewhat out of line.
> >
> > But I find it little hypocritical that kernel developers add CONFIG_PROC_FS,
> > fix compilation problems associated with it, do not mount proc by default,
> > do not mark it unmountable somehow and
> > then say procless setups aren't worth it.
> >
>
> Aren't worth *optimizing for*. But yes, CONFIG_PROC_FS is pretty much a
> historic relic at this point, and probably should just be dropped.
What to do with automounting /proc so it availablility would match
syscall availability?
> > Without proc knowledge about fdtable is gathered linearly and still unreliable.
> > With nextfd(2), even procful environments could lose several failure branches.
>
> What? Please explain how on Earth this would "lose several failure
> branches."
closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
closefrom(3) written via close(fd++) is unreliable.
If programmer adds nextfd(2) loop before any closefrom(3) code
he currently uses, there will be less failures.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 20:16 ` Alexey Dobriyan
@ 2012-04-06 20:33 ` H. Peter Anvin
2012-04-06 21:02 ` H. Peter Anvin
1 sibling, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-06 20:33 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
What is unreliable about it? This is starting to be ridiculous. As far as automounting /proc is concerned, this does not seem to be a significant problem with current userspace.
Alexey Dobriyan <adobriyan@gmail.com> wrote:
>On Fri, Apr 06, 2012 at 09:14:17AM -0700, H. Peter Anvin wrote:
>> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>> >
>> > I agree, this particular changelog may be somewhat out of line.
>> >
>> > But I find it little hypocritical that kernel developers add
>CONFIG_PROC_FS,
>> > fix compilation problems associated with it, do not mount proc by
>default,
>> > do not mark it unmountable somehow and
>> > then say procless setups aren't worth it.
>> >
>>
>> Aren't worth *optimizing for*. But yes, CONFIG_PROC_FS is pretty
>much a
>> historic relic at this point, and probably should just be dropped.
>
>What to do with automounting /proc so it availablility would match
>syscall availability?
>
>> > Without proc knowledge about fdtable is gathered linearly and still
>unreliable.
>> > With nextfd(2), even procful environments could lose several
>failure branches.
>>
>> What? Please explain how on Earth this would "lose several failure
>> branches."
>
>closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
>closefrom(3) written via /proc/self/fd is reliable and can fail
>(including ENOMEM).
>closefrom(3) written via close(fd++) is unreliable.
>
>If programmer adds nextfd(2) loop before any closefrom(3) code
>he currently uses, there will be less failures.
--
Sent from my mobile phone. Please excuse brevity and lack of formatting.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 20:16 ` Alexey Dobriyan
2012-04-06 20:33 ` H. Peter Anvin
@ 2012-04-06 21:02 ` H. Peter Anvin
2012-04-12 10:54 ` Alexey Dobriyan
1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-06 21:02 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
>
> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
> closefrom(3) written via close(fd++) is unreliable.
>
I call shenanigans on this. There is no reason to ENOMEM on the second
written using the fdwalk() implementation I already posted, for example.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 21:02 ` H. Peter Anvin
@ 2012-04-12 10:54 ` Alexey Dobriyan
0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-12 10:54 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
>>
>> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
>> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
>> closefrom(3) written via close(fd++) is unreliable.
>>
>
> I call shenanigans on this. There is no reason to ENOMEM on the second
> written using the fdwalk() implementation I already posted, for example.
open("/proc/self/fd") can fail with ENOMEM.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
@ 2012-04-12 10:54 ` Alexey Dobriyan
0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-12 10:54 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
>>
>> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
>> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
>> closefrom(3) written via close(fd++) is unreliable.
>>
>
> I call shenanigans on this. There is no reason to ENOMEM on the second
> written using the fdwalk() implementation I already posted, for example.
open("/proc/self/fd") can fail with ENOMEM.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-12 10:54 ` Alexey Dobriyan
@ 2012-04-12 11:11 ` Alan Cox
-1 siblings, 0 replies; 78+ messages in thread
From: Alan Cox @ 2012-04-12 11:11 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: H. Peter Anvin, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Thu, 12 Apr 2012 13:54:25 +0300
Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
> >>
> >> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
> >> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
> >> closefrom(3) written via close(fd++) is unreliable.
> >>
> >
> > I call shenanigans on this. There is no reason to ENOMEM on the second
> > written using the fdwalk() implementation I already posted, for example.
>
> open("/proc/self/fd") can fail with ENOMEM.
Any syscall can fail with a process kill due to a stack extend on out of
memory in most configurations. That includes your nextfd stuff
This whole thing is getting stupid. "Perfection is the enemy of
success".
Your code will also fail when the cat pees on the computer, when the
power fails and when disk dies. I suspect that other than the cat these
are all more likely real world cases than your ENOMEM.
Alan
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
@ 2012-04-12 11:11 ` Alan Cox
0 siblings, 0 replies; 78+ messages in thread
From: Alan Cox @ 2012-04-12 11:11 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: H. Peter Anvin, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Thu, 12 Apr 2012 13:54:25 +0300
Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
> >>
> >> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
> >> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
> >> closefrom(3) written via close(fd++) is unreliable.
> >>
> >
> > I call shenanigans on this. There is no reason to ENOMEM on the second
> > written using the fdwalk() implementation I already posted, for example.
>
> open("/proc/self/fd") can fail with ENOMEM.
Any syscall can fail with a process kill due to a stack extend on out of
memory in most configurations. That includes your nextfd stuff
This whole thing is getting stupid. "Perfection is the enemy of
success".
Your code will also fail when the cat pees on the computer, when the
power fails and when disk dies. I suspect that other than the cat these
are all more likely real world cases than your ENOMEM.
Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-12 11:11 ` Alan Cox
(?)
@ 2012-04-12 13:35 ` Alexey Dobriyan
2012-04-12 13:51 ` H. Peter Anvin
-1 siblings, 1 reply; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-12 13:35 UTC (permalink / raw)
To: Alan Cox
Cc: H. Peter Anvin, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Thu, Apr 12, 2012 at 2:11 PM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> On Thu, 12 Apr 2012 13:54:25 +0300
> Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
>> On Sat, Apr 7, 2012 at 12:02 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> > On 04/06/2012 01:16 PM, Alexey Dobriyan wrote:
>> >>
>> >> closefrom(3) written via nextfd(2) loop is reliable and doesn't fail.
>> >> closefrom(3) written via /proc/self/fd is reliable and can fail (including ENOMEM).
>> >> closefrom(3) written via close(fd++) is unreliable.
>> >>
>> >
>> > I call shenanigans on this. There is no reason to ENOMEM on the second
>> > written using the fdwalk() implementation I already posted, for example.
>>
>> open("/proc/self/fd") can fail with ENOMEM.
>
> Any syscall can fail with a process kill due to a stack extend on out of
> memory in most configurations. That includes your nextfd stuff
>
> This whole thing is getting stupid. "Perfection is the enemy of
> success".
>
> Your code will also fail when the cat pees on the computer, when the
> power fails and when disk dies. I suspect that other than the cat these
> are all more likely real world cases than your ENOMEM.
You can't see I'm talking about difference only, can you?
Have you read what hpa wrote exactly and what exactly I was answering to?
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-12 13:35 ` Alexey Dobriyan
@ 2012-04-12 13:51 ` H. Peter Anvin
2012-04-12 19:21 ` Alexey Dobriyan
0 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-12 13:51 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Alan Cox, akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On 04/12/2012 06:35 AM, Alexey Dobriyan wrote:
>
> You can't see I'm talking about difference only, can you?
> Have you read what hpa wrote exactly and what exactly I was answering to?
>
He has and he is right. You're trying to catch pointless corner cases.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-12 13:51 ` H. Peter Anvin
@ 2012-04-12 19:21 ` Alexey Dobriyan
0 siblings, 0 replies; 78+ messages in thread
From: Alexey Dobriyan @ 2012-04-12 19:21 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alan Cox, akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On Thu, Apr 12, 2012 at 06:51:37AM -0700, H. Peter Anvin wrote:
> He has and he is right. You're trying to catch pointless corner cases.
For the record, the only objection I find real is about
sequential vs bulk nature of the system call.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-12 11:11 ` Alan Cox
(?)
(?)
@ 2012-04-12 14:09 ` Eric Dumazet
-1 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2012-04-12 14:09 UTC (permalink / raw)
To: Alan Cox
Cc: Alexey Dobriyan, H. Peter Anvin, akpm, viro, torvalds, drepper,
linux-kernel, linux-fsdevel
On Thu, 2012-04-12 at 12:11 +0100, Alan Cox wrote:
> This whole thing is getting stupid. "Perfection is the enemy of
> success".
>
> Your code will also fail when the cat pees on the computer, when the
> power fails and when disk dies. I suspect that other than the cat these
> are all more likely real world cases than your ENOMEM.
Time to open a thread on G+ and close this one I guess.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 9:54 ` Alexey Dobriyan
` (2 preceding siblings ...)
(?)
@ 2012-04-06 16:23 ` H. Peter Anvin
2012-04-07 21:21 ` Ben Pfaff
2012-04-11 0:09 ` KOSAKI Motohiro
-1 siblings, 2 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-06 16:23 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>
> Without proc knowledge about fdtable is gathered linearly and still unreliable.
> With nextfd(2), even procful environments could lose several failure branches.
> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>
Incidentally, if we were to create a system call for this -- which I so
far see no reason for -- I would make it return a select-style bitmask
of file descriptors in use, not a "next fd" which would require a system
call per iteration.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 16:23 ` H. Peter Anvin
@ 2012-04-07 21:21 ` Ben Pfaff
2012-04-11 0:12 ` KOSAKI Motohiro
2012-04-11 0:09 ` KOSAKI Motohiro
1 sibling, 1 reply; 78+ messages in thread
From: Ben Pfaff @ 2012-04-07 21:21 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
"H. Peter Anvin" <hpa@zytor.com> writes:
> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>>
>> Without proc knowledge about fdtable is gathered linearly and still unreliable.
>> With nextfd(2), even procful environments could lose several failure branches.
>> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>>
>
> Incidentally, if we were to create a system call for this -- which I so
> far see no reason for -- I would make it return a select-style bitmask
> of file descriptors in use, not a "next fd" which would require a system
> call per iteration.
It's already possible to do something a little like that with the
existing "poll" system call:
#include <stdio.h>
#include <sys/poll.h>
int
main(void)
{
enum { N_FDS = 1024 };
struct pollfd fds[N_FDS];
int error;
int i;
for (i = 0; i < N_FDS; i++)
{
fds[i].fd = i;
fds[i].events = 0;
}
error = poll(fds, N_FDS, 0);
if (error < 0)
perror("poll");
else
{
printf("valid fds:");
for (i = 0; i < N_FDS; i++)
if (!(fds[i].revents & POLLNVAL))
printf(" %d", fds[i].fd);
printf("\n");
}
return 0;
}
blp@blp:~/tmp(0)$ gcc tmp.c -Wall
blp@blp:~/tmp(0)$ ./a.out
valid fds: 0 1 2
blp@blp:~/tmp(0)$ ./a.out 5>/dev/null
valid fds: 0 1 2 5
blp@blp:~/tmp(0)$ ./a.out 5>/dev/null 3</dev/stdin
valid fds: 0 1 2 3 5
blp@blp:~/tmp(0)$
--
Ben Pfaff
http://benpfaff.org
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-07 21:21 ` Ben Pfaff
@ 2012-04-11 0:12 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 0:12 UTC (permalink / raw)
To: Ben Pfaff
Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds, drepper,
linux-kernel, linux-fsdevel
On Sat, Apr 7, 2012 at 5:21 PM, Ben Pfaff <blp@cs.stanford.edu> wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>
>> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>>>
>>> Without proc knowledge about fdtable is gathered linearly and still unreliable.
>>> With nextfd(2), even procful environments could lose several failure branches.
>>> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>>>
>>
>> Incidentally, if we were to create a system call for this -- which I so
>> far see no reason for -- I would make it return a select-style bitmask
>> of file descriptors in use, not a "next fd" which would require a system
>> call per iteration.
>
> It's already possible to do something a little like that with the
> existing "poll" system call:
>
> #include <stdio.h>
> #include <sys/poll.h>
>
> int
> main(void)
> {
> enum { N_FDS = 1024 };
> struct pollfd fds[N_FDS];
Your code has a muximum fd assumption here. that is one of that we
really want to avoid.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
@ 2012-04-11 0:12 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 0:12 UTC (permalink / raw)
To: Ben Pfaff
Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds, drepper,
linux-kernel, linux-fsdevel
On Sat, Apr 7, 2012 at 5:21 PM, Ben Pfaff <blp@cs.stanford.edu> wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>
>> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>>>
>>> Without proc knowledge about fdtable is gathered linearly and still unreliable.
>>> With nextfd(2), even procful environments could lose several failure branches.
>>> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>>>
>>
>> Incidentally, if we were to create a system call for this -- which I so
>> far see no reason for -- I would make it return a select-style bitmask
>> of file descriptors in use, not a "next fd" which would require a system
>> call per iteration.
>
> It's already possible to do something a little like that with the
> existing "poll" system call:
>
> #include <stdio.h>
> #include <sys/poll.h>
>
> int
> main(void)
> {
> enum { N_FDS = 1024 };
> struct pollfd fds[N_FDS];
Your code has a muximum fd assumption here. that is one of that we
really want to avoid.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-06 16:23 ` H. Peter Anvin
2012-04-07 21:21 ` Ben Pfaff
@ 2012-04-11 0:09 ` KOSAKI Motohiro
2012-04-11 17:58 ` H. Peter Anvin
2012-04-11 18:00 ` H. Peter Anvin
1 sibling, 2 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 0:09 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Fri, Apr 6, 2012 at 12:23 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/06/2012 02:54 AM, Alexey Dobriyan wrote:
>>
>> Without proc knowledge about fdtable is gathered linearly and still unreliable.
>> With nextfd(2), even procful environments could lose several failure branches.
>> And they can keep old dumb fd++ or smart /proc/self/fd loops for a change.
>>
>
> Incidentally, if we were to create a system call for this -- which I so
> far see no reason for -- I would make it return a select-style bitmask
> of file descriptors in use, not a "next fd" which would require a system
> call per iteration.
I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
http://austingroupbugs.net/view.php?id=149
nextfd() has a semantics of F_NEXT.
Next, why shoundn't we implement fcntl(F_NEXT) in our kernel? I think
we have two reason.
1) As linus pointed out, linux specific "flags" argument may be useful.
2) The name of F_NEXT is not fixed yet. another url of the austin says
it is FD_NEXT.
So, we can't choose right name yet. Moreover, A meanings of 3rd
argument of F_NEXT
haven't been fixed.
I dont think following #ifdef is insane. but glibc also can provide
correct F_NEXT when next SUS is published.
#ifdef FOO
#define NEXTFD(fd) nextfd(fd, flags)
#else
#define NEXTFD(fd) fcntl(fd, F_NEXT, O_FDWR)
#endif
In short, kernel developer don't care any standard at all. but
application programmer usually care it.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 0:09 ` KOSAKI Motohiro
@ 2012-04-11 17:58 ` H. Peter Anvin
2012-04-11 18:04 ` Linus Torvalds
2012-04-11 18:00 ` H. Peter Anvin
1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 17:58 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On 04/10/2012 05:09 PM, KOSAKI Motohiro wrote:
>
> I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
>
> http://austingroupbugs.net/view.php?id=149
>
> nextfd() has a semantics of F_NEXT.
>
> Next, why shoundn't we implement fcntl(F_NEXT) in our kernel? I think
> we have two reason.
>
> 1) As linus pointed out, linux specific "flags" argument may be useful.
> 2) The name of F_NEXT is not fixed yet. another url of the austin says
> it is FD_NEXT.
> So, we can't choose right name yet. Moreover, A meanings of 3rd
> argument of F_NEXT
> haven't been fixed.
>
But it still has the same braindamage: one system call per loop
invocation, and we can do better. I would much rather see fdwalk() in SUS.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 17:58 ` H. Peter Anvin
@ 2012-04-11 18:04 ` Linus Torvalds
0 siblings, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2012-04-11 18:04 UTC (permalink / raw)
To: H. Peter Anvin
Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, drepper,
linux-kernel, linux-fsdevel
On Wed, Apr 11, 2012 at 10:58 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> But it still has the same braindamage: one system call per loop
> invocation, and we can do better. I would much rather see fdwalk() in SUS.
Why would we bother to do better?
System calls are cheap, and usually you actually do want to do
something about the fd, so you actually want to iterate over them.
I'd much rather have simple cheap interfaces than anything else. If
SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
doing something more complex and then just having to emulate the
simple thing in user space anyway.
If a standard interface exists, we should just use it.
Linus
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
@ 2012-04-11 18:04 ` Linus Torvalds
0 siblings, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2012-04-11 18:04 UTC (permalink / raw)
To: H. Peter Anvin
Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, drepper,
linux-kernel, linux-fsdevel
On Wed, Apr 11, 2012 at 10:58 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> But it still has the same braindamage: one system call per loop
> invocation, and we can do better. I would much rather see fdwalk() in SUS.
Why would we bother to do better?
System calls are cheap, and usually you actually do want to do
something about the fd, so you actually want to iterate over them.
I'd much rather have simple cheap interfaces than anything else. If
SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
doing something more complex and then just having to emulate the
simple thing in user space anyway.
If a standard interface exists, we should just use it.
Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 18:04 ` Linus Torvalds
(?)
@ 2012-04-11 18:11 ` H. Peter Anvin
2012-04-11 19:46 ` KOSAKI Motohiro
-1 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 18:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, drepper,
linux-kernel, linux-fsdevel
On 04/11/2012 11:04 AM, Linus Torvalds wrote:
> On Wed, Apr 11, 2012 at 10:58 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> But it still has the same braindamage: one system call per loop
>> invocation, and we can do better. I would much rather see fdwalk() in SUS.
>
> Why would we bother to do better?
>
> System calls are cheap, and usually you actually do want to do
> something about the fd, so you actually want to iterate over them.
>
> I'd much rather have simple cheap interfaces than anything else. If
> SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
> doing something more complex and then just having to emulate the
> simple thing in user space anyway.
>
> If a standard interface exists, we should just use it.
>
I went back and looked at the post, and also the discussion on the SUS
mailing list.
The proposal for FD_NEXT was rejected with some serious vitriol.
fdwalk() was considered just more palatable since there is an existing
implementation (in Solaris) and since it might be possible to provide a
way to hide specific fds from fdwalk(), but a much bigger issue raised
is that *ALL* of these interfaces are inherently broken. Closing random
file descriptors is:
a) inherently racy in a multithreaded environment;
b) unsafe because there might be file descriptors used by libc itself.
Instead, from the resolution text:
> Therefore, the rest of this proposal seeks to document the problem
> with closing arbitrary file descriptors, and a new bugid will be
> opened to propose standardizing some recent interfaces and interface
> extensions first appearing in Linux (new interfaces such as pipe2( ),
> accept4( ), mkostemps( ), ..., and extensions like fopen(,"we")) to
> guarantee the atomic creation of file descriptors with the cloexec
> bit already set, as was already done in the standard with O_CLOEXEC
> in open( ) and F_DUPFD_CLOEXEC in fcntl( ). See also 0000368 for
> a related proposal to require CLOEXEC on hidden descriptors.
I say we ask the new glibc people to provide fdwalk() since it already
has an implementation history (and because it can be implemented without
new system calls, thereby working on old kernels), but the *big*
takeaway from this is that if there is way to create a file descriptor
so that it doesn't have O_CLOEXEC set from the very beginning, *that* is
what we need to fix.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 18:11 ` H. Peter Anvin
@ 2012-04-11 19:46 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:46 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
linux-kernel, linux-fsdevel
>>> But it still has the same braindamage: one system call per loop
>>> invocation, and we can do better. I would much rather see fdwalk() in SUS.
>>
>> Why would we bother to do better?
>>
>> System calls are cheap, and usually you actually do want to do
>> something about the fd, so you actually want to iterate over them.
>>
>> I'd much rather have simple cheap interfaces than anything else. If
>> SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
>> doing something more complex and then just having to emulate the
>> simple thing in user space anyway.
>>
>> If a standard interface exists, we should just use it.
>>
>
> I went back and looked at the post, and also the discussion on the SUS
> mailing list.
>
> The proposal for FD_NEXT was rejected with some serious vitriol.
>
> fdwalk() was considered just more palatable since there is an existing
> implementation (in Solaris) and since it might be possible to provide a
> way to hide specific fds from fdwalk(), but a much bigger issue raised
> is that *ALL* of these interfaces are inherently broken. Closing random
> file descriptors is:
>
> a) inherently racy in a multithreaded environment;
I would say two things. 1) I know and I agree we _can_ misuse the interface.
but many already existed interface also can be misused. 2) As I
already explained
this can be used correctly.
So, I have a question. Why do you bother a possibility of misuse? Of
if you didn't point
out misuse, can you please point out a real world use case of multi
threads + fd interation?
> b) unsafe because there might be file descriptors used by libc itself.
I agree this. Even though almost developer don't use libc message catalogue and
we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
> Instead, from the resolution text:
>
>> Therefore, the rest of this proposal seeks to document the problem
>> with closing arbitrary file descriptors, and a new bugid will be
>> opened to propose standardizing some recent interfaces and interface
>> extensions first appearing in Linux (new interfaces such as pipe2( ),
>> accept4( ), mkostemps( ), ..., and extensions like fopen(,"we")) to
>> guarantee the atomic creation of file descriptors with the cloexec
>> bit already set, as was already done in the standard with O_CLOEXEC
>> in open( ) and F_DUPFD_CLOEXEC in fcntl( ). See also 0000368 for
>> a related proposal to require CLOEXEC on hidden descriptors.
>
> I say we ask the new glibc people to provide fdwalk() since it already
> has an implementation history (and because it can be implemented without
> new system calls, thereby working on old kernels), but the *big*
> takeaway from this is that if there is way to create a file descriptor
> so that it doesn't have O_CLOEXEC set from the very beginning, *that* is
> what we need to fix.
Yeah, I don't think fdwalk() is problematic. It's an option if I
understand Alexey's mail
correctly. but I disagree almost all developers should fix a design
and rewrite their
applications. In theory, they can avoid glibc or they can rewrite all
of their code or
avoid linux. but there is one problem. unrealistic.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
@ 2012-04-11 19:46 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:46 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
linux-kernel, linux-fsdevel
>>> But it still has the same braindamage: one system call per loop
>>> invocation, and we can do better. I would much rather see fdwalk() in SUS.
>>
>> Why would we bother to do better?
>>
>> System calls are cheap, and usually you actually do want to do
>> something about the fd, so you actually want to iterate over them.
>>
>> I'd much rather have simple cheap interfaces than anything else. If
>> SuS has a F_NEXT fcntl, let's just do that thing. Much simpler than
>> doing something more complex and then just having to emulate the
>> simple thing in user space anyway.
>>
>> If a standard interface exists, we should just use it.
>>
>
> I went back and looked at the post, and also the discussion on the SUS
> mailing list.
>
> The proposal for FD_NEXT was rejected with some serious vitriol.
>
> fdwalk() was considered just more palatable since there is an existing
> implementation (in Solaris) and since it might be possible to provide a
> way to hide specific fds from fdwalk(), but a much bigger issue raised
> is that *ALL* of these interfaces are inherently broken. Closing random
> file descriptors is:
>
> a) inherently racy in a multithreaded environment;
I would say two things. 1) I know and I agree we _can_ misuse the interface.
but many already existed interface also can be misused. 2) As I
already explained
this can be used correctly.
So, I have a question. Why do you bother a possibility of misuse? Of
if you didn't point
out misuse, can you please point out a real world use case of multi
threads + fd interation?
> b) unsafe because there might be file descriptors used by libc itself.
I agree this. Even though almost developer don't use libc message catalogue and
we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
> Instead, from the resolution text:
>
>> Therefore, the rest of this proposal seeks to document the problem
>> with closing arbitrary file descriptors, and a new bugid will be
>> opened to propose standardizing some recent interfaces and interface
>> extensions first appearing in Linux (new interfaces such as pipe2( ),
>> accept4( ), mkostemps( ), ..., and extensions like fopen(,"we")) to
>> guarantee the atomic creation of file descriptors with the cloexec
>> bit already set, as was already done in the standard with O_CLOEXEC
>> in open( ) and F_DUPFD_CLOEXEC in fcntl( ). See also 0000368 for
>> a related proposal to require CLOEXEC on hidden descriptors.
>
> I say we ask the new glibc people to provide fdwalk() since it already
> has an implementation history (and because it can be implemented without
> new system calls, thereby working on old kernels), but the *big*
> takeaway from this is that if there is way to create a file descriptor
> so that it doesn't have O_CLOEXEC set from the very beginning, *that* is
> what we need to fix.
Yeah, I don't think fdwalk() is problematic. It's an option if I
understand Alexey's mail
correctly. but I disagree almost all developers should fix a design
and rewrite their
applications. In theory, they can avoid glibc or they can rewrite all
of their code or
avoid linux. but there is one problem. unrealistic.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 19:46 ` KOSAKI Motohiro
(?)
@ 2012-04-11 19:49 ` H. Peter Anvin
2012-04-11 20:23 ` KOSAKI Motohiro
-1 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 19:49 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
linux-kernel, linux-fsdevel
On 04/11/2012 12:46 PM, KOSAKI Motohiro wrote:
>>
>> a) inherently racy in a multithreaded environment;
>
> I would say two things. 1) I know and I agree we _can_ misuse the interface.
> but many already existed interface also can be misused. 2) As I
> already explained
> this can be used correctly.
>
> So, I have a question. Why do you bother a possibility of misuse? Of
> if you didn't point out misuse, can you please point out a real world
> use case of multi threads + fd interation?
>
This were brought up in the POSIX discussion as part of why these
interfaces were considered undesirable.
>
>> b) unsafe because there might be file descriptors used by libc itself.
>
> I agree this. Even though almost developer don't use libc message catalogue and
> we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
>
No, that's exactly the point that we cannot.
>
> Yeah, I don't think fdwalk() is problematic. It's an option if I
> understand Alexey's mail
> correctly. but I disagree almost all developers should fix a design
> and rewrite their
> applications. In theory, they can avoid glibc or they can rewrite all
> of their code or
> avoid linux. but there is one problem. unrealistic.
>
The problem -- as was brought up in the POSIX discussion -- is that you
actually end up breaking *properly functioning programs*.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 19:49 ` H. Peter Anvin
@ 2012-04-11 20:23 ` KOSAKI Motohiro
2012-04-11 20:32 ` H. Peter Anvin
0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 20:23 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
linux-kernel, linux-fsdevel
>>> a) inherently racy in a multithreaded environment;
>>
>> I would say two things. 1) I know and I agree we _can_ misuse the interface.
>> but many already existed interface also can be misused. 2) As I
>> already explained
>> this can be used correctly.
>>
>> So, I have a question. Why do you bother a possibility of misuse? Of
>> if you didn't point out misuse, can you please point out a real world
>> use case of multi threads + fd interation?
>
> This were brought up in the POSIX discussion as part of why these
> interfaces were considered undesirable.
Hmmm.... I'm sorry I don't find "considered undesirable". Maybe because
my English is not very good. can you please help me clarify?
This text says,
> so a future revision of the standard may indeed add fdwalk( ), although no
> one in the meeting was willing to draft a proposal for fdwalk( ) at this time
and, later says after noting F_NEXT and O_CLOEXEC,
>Therefore, the rest of this proposal seeks to document the problem
>with closing arbitrary file descriptors, and a new bugid will be
>opened to propose standardizing some recent interfaces and interface
>extensions first appearing in Linux
Do you think latter override former?
>>> b) unsafe because there might be file descriptors used by libc itself.
>>
>> I agree this. Even though almost developer don't use libc message catalogue and
>> we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
>
> No, that's exactly the point that we cannot.
I thknk we are talking different aspect. I'm talking practical issue.
say, ruby hit the exact same issue
because valgrind uses internal fds and they don't think their exec()
case don't need fd
inheritance. Even though it close libc internal fds, invoked new
executable may open them
again at process strtup code. Therefore, they are using O_CLOEXEC. In
the other hands,
you seems talking about it is corner case. If so, I agree. I was not
argue it. I only say, I
haven't seen real world application require it.
Personally, I'm only interesting real world issue.
>> Yeah, I don't think fdwalk() is problematic. It's an option if I
>> understand Alexey's mail
>> correctly. but I disagree almost all developers should fix a design
>> and rewrite their
>> applications. In theory, they can avoid glibc or they can rewrite all
>> of their code or
>> avoid linux. but there is one problem. unrealistic.
>>
>
> The problem -- as was brought up in the POSIX discussion -- is that you
> actually end up breaking *properly functioning programs*.
But the url only talk about a possibility of misuse.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 20:23 ` KOSAKI Motohiro
@ 2012-04-11 20:32 ` H. Peter Anvin
2012-04-17 18:12 ` KOSAKI Motohiro
0 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 20:32 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Linus Torvalds, Alexey Dobriyan, akpm, viro, drepper,
linux-kernel, linux-fsdevel
On 04/11/2012 01:23 PM, KOSAKI Motohiro wrote:
>
> Hmmm.... I'm sorry I don't find "considered undesirable". Maybe because
> my English is not very good. can you please help me clarify?
>
I also went and read the mailing list discussion on the topic.
Ulrich, for example (in his usual mild-mannered style), commented:
> And all these programs and systems are wrong.
>
> There is no guarantee that one of the fds isn't used behind the
> scenes for something important which is still running as part of the
> fork/exec code. It's completely unacceptable to build into the
> interfaces the assumption that the programmer knows all the file
> descriptors.
>
> This is why using CLOEXEC is the only correct way to deal with this
> and now there is no exceuse anymore whatsoever. Every fd-creating
> interface can use CLOEXEC.
> This text says,
>
>> so a future revision of the standard may indeed add fdwalk( ), although no
>> one in the meeting was willing to draft a proposal for fdwalk( ) at this time
>
> and, later says after noting F_NEXT and O_CLOEXEC,
>
>> Therefore, the rest of this proposal seeks to document the problem
>> with closing arbitrary file descriptors, and a new bugid will be
>> opened to propose standardizing some recent interfaces and interface
>> extensions first appearing in Linux
>
> Do you think latter override former?
Yes.
>>>> b) unsafe because there might be file descriptors used by libc itself.
>>>
>>> I agree this. Even though almost developer don't use libc message catalogue and
>>> we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
>>
>> No, that's exactly the point that we cannot.
>
> I thknk we are talking different aspect. I'm talking practical issue.
> say, ruby hit the exact same issue
> because valgrind uses internal fds and they don't think their exec()
> case don't need fd
> inheritance. Even though it close libc internal fds, invoked new
> executable may open them
> again at process strtup code. Therefore, they are using O_CLOEXEC. In
> the other hands,
> you seems talking about it is corner case. If so, I agree. I was not
> argue it. I only say, I
> haven't seen real world application require it.
>
> Personally, I'm only interesting real world issue.
These are real-world issues.
>> The problem -- as was brought up in the POSIX discussion -- is that you
>> actually end up breaking *properly functioning programs*.
>
> But the url only talk about a possibility of misuse.
There are concrete examples on the mailing list.
Anyway, fdwalk() at least exists as an interface. There is absolutely
no momentum for FD_NEXT that I can see.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 20:32 ` H. Peter Anvin
@ 2012-04-17 18:12 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-17 18:12 UTC (permalink / raw)
To: H. Peter Anvin
Cc: KOSAKI Motohiro, Linus Torvalds, Alexey Dobriyan, akpm, viro,
drepper, linux-kernel, linux-fsdevel
(4/11/12 4:32 PM), H. Peter Anvin wrote:
> On 04/11/2012 01:23 PM, KOSAKI Motohiro wrote:
>>
>> Hmmm.... I'm sorry I don't find "considered undesirable". Maybe because
>> my English is not very good. can you please help me clarify?
>>
>
> I also went and read the mailing list discussion on the topic.
>
> Ulrich, for example (in his usual mild-mannered style), commented:
>
>> And all these programs and systems are wrong.
>>
>> There is no guarantee that one of the fds isn't used behind the
>> scenes for something important which is still running as part of the
>> fork/exec code. It's completely unacceptable to build into the
>> interfaces the assumption that the programmer knows all the file
>> descriptors.
>>
>> This is why using CLOEXEC is the only correct way to deal with this
>> and now there is no exceuse anymore whatsoever. Every fd-creating
>> interface can use CLOEXEC.
>
>> This text says,
>>
>>> so a future revision of the standard may indeed add fdwalk( ), although no
>>> one in the meeting was willing to draft a proposal for fdwalk( ) at this time
>>
>> and, later says after noting F_NEXT and O_CLOEXEC,
>>
>>> Therefore, the rest of this proposal seeks to document the problem
>>> with closing arbitrary file descriptors, and a new bugid will be
>>> opened to propose standardizing some recent interfaces and interface
>>> extensions first appearing in Linux
>>
>> Do you think latter override former?
>
> Yes.
>
>>>>> b) unsafe because there might be file descriptors used by libc itself.
>>>>
>>>> I agree this. Even though almost developer don't use libc message catalogue and
>>>> we can avoid such issue by using nextfd() + fcntl(O_CLOEXEC).
>>>
>>> No, that's exactly the point that we cannot.
>>
>> I thknk we are talking different aspect. I'm talking practical issue.
>> say, ruby hit the exact same issue
>> because valgrind uses internal fds and they don't think their exec()
>> case don't need fd
>> inheritance. Even though it close libc internal fds, invoked new
>> executable may open them
>> again at process strtup code. Therefore, they are using O_CLOEXEC. In
>> the other hands,
>> you seems talking about it is corner case. If so, I agree. I was not
>> argue it. I only say, I
>> haven't seen real world application require it.
>>
>> Personally, I'm only interesting real world issue.
>
> These are real-world issues.
>
>>> The problem -- as was brought up in the POSIX discussion -- is that you
>>> actually end up breaking *properly functioning programs*.
>>
>> But the url only talk about a possibility of misuse.
>
> There are concrete examples on the mailing list.
>
> Anyway, fdwalk() at least exists as an interface. There is absolutely
> no momentum for FD_NEXT that I can see.
Thanks Peter, I guess I now understand what you said. Again, thanks for the
patience. _Personally_ I can't agree Ulrich's opinion because I've only seen
fork-closeall-exec pattern. but, I also can't say there is no other use case.
And, as I already wrote, I don't think fdwalk() is bad taste. I only want to
explain the background of nextfd interface bacause you said you have no seen
the reason and I think I can explain the background and motivation because this
is famous issue in user land folks. (oops, this should be noted, I'm not original
patch author and I talked only about my ovserved issue. Alexey might know another
use cause, I dunnno)
Unfortnately, I'll be offlined full of this and next week and then I have to leave
this thread. But I believe I'm not needed this thread any more. :)
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 0:09 ` KOSAKI Motohiro
2012-04-11 17:58 ` H. Peter Anvin
@ 2012-04-11 18:00 ` H. Peter Anvin
2012-04-11 19:20 ` KOSAKI Motohiro
1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 18:00 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On 04/10/2012 05:09 PM, KOSAKI Motohiro wrote:
>
> I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
>
> http://austingroupbugs.net/view.php?id=149
>
> nextfd() has a semantics of F_NEXT.
>
Looking at that bug report, it sounds like that proposal was rejected.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 18:00 ` H. Peter Anvin
@ 2012-04-11 19:20 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:20 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Wed, Apr 11, 2012 at 2:00 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/10/2012 05:09 PM, KOSAKI Motohiro wrote:
>>
>> I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
>>
>> http://austingroupbugs.net/view.php?id=149
>>
>> nextfd() has a semantics of F_NEXT.
>>
>
> Looking at that bug report, it sounds like that proposal was rejected.
?
If I understand correctly, above URL says, latest satatus is
Resolution: Accepted As Marked
Please see toppest issue summary. Or if I am missing something, can you please
point out which line are you seeing?
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
@ 2012-04-11 19:20 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:20 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Wed, Apr 11, 2012 at 2:00 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/10/2012 05:09 PM, KOSAKI Motohiro wrote:
>>
>> I know the reason. fcntl(F_NEXT) is one of a proposal of next SUS enhancement.
>>
>> http://austingroupbugs.net/view.php?id=149
>>
>> nextfd() has a semantics of F_NEXT.
>>
>
> Looking at that bug report, it sounds like that proposal was rejected.
?
If I understand correctly, above URL says, latest satatus is
Resolution: Accepted As Marked
Please see toppest issue summary. Or if I am missing something, can you please
point out which line are you seeing?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 19:20 ` KOSAKI Motohiro
(?)
@ 2012-04-11 19:22 ` H. Peter Anvin
2012-04-11 19:26 ` KOSAKI Motohiro
-1 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 19:22 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On 04/11/2012 12:20 PM, KOSAKI Motohiro wrote:
>
> If I understand correctly, above URL says, latest satatus is
>
> Resolution: Accepted As Marked
>
> Please see toppest issue summary. Or if I am missing something, can you please
> point out which line are you seeing?
You need to read through the entire set of notes for it to make sense.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 19:22 ` H. Peter Anvin
@ 2012-04-11 19:26 ` KOSAKI Motohiro
2012-04-11 19:28 ` H. Peter Anvin
0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:26 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Wed, Apr 11, 2012 at 3:22 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/11/2012 12:20 PM, KOSAKI Motohiro wrote:
>>
>> If I understand correctly, above URL says, latest satatus is
>>
>> Resolution: Accepted As Marked
>>
>> Please see toppest issue summary. Or if I am missing something, can you please
>> point out which line are you seeing?
>
> You need to read through the entire set of notes for it to make sense.
The note described several downsides. but I have no seen a rejection.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 19:26 ` KOSAKI Motohiro
@ 2012-04-11 19:28 ` H. Peter Anvin
2012-04-11 19:31 ` KOSAKI Motohiro
0 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 19:28 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On 04/11/2012 12:26 PM, KOSAKI Motohiro wrote:
> The note described several downsides. but I have no seen a rejection.
The Aardvark resolution "Accept as Marked" means "make the changes in
the markup." The markup doesn't include any new interfaces, but just a
set of warnings.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 19:28 ` H. Peter Anvin
@ 2012-04-11 19:31 ` KOSAKI Motohiro
2012-04-11 19:32 ` H. Peter Anvin
0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-11 19:31 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Wed, Apr 11, 2012 at 3:28 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/11/2012 12:26 PM, KOSAKI Motohiro wrote:
>> The note described several downsides. but I have no seen a rejection.
>
> The Aardvark resolution "Accept as Marked" means "make the changes in
> the markup." The markup doesn't include any new interfaces, but just a
> set of warnings.
Hm. Now are you saying "make the changes in the markup" mean rejection?
I'm sorry I haven't caught your point.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-11 19:31 ` KOSAKI Motohiro
@ 2012-04-11 19:32 ` H. Peter Anvin
0 siblings, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-11 19:32 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On 04/11/2012 12:31 PM, KOSAKI Motohiro wrote:
> On Wed, Apr 11, 2012 at 3:28 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/11/2012 12:26 PM, KOSAKI Motohiro wrote:
>>> The note described several downsides. but I have no seen a rejection.
>>
>> The Aardvark resolution "Accept as Marked" means "make the changes in
>> the markup." The markup doesn't include any new interfaces, but just a
>> set of warnings.
>
> Hm. Now are you saying "make the changes in the markup" mean rejection?
> I'm sorry I haven't caught your point.
>
You have to look at what the markup is.
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
` (5 preceding siblings ...)
2012-04-01 22:03 ` H. Peter Anvin
@ 2012-04-02 23:17 ` KOSAKI Motohiro
2012-04-02 23:56 ` H. Peter Anvin
2012-04-03 19:21 ` Colin Walters
2012-04-04 3:01 ` Al Viro
7 siblings, 2 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-02 23:17 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: akpm, viro, torvalds, drepper, linux-kernel, linux-fsdevel
Hi,
2012/4/1 Alexey Dobriyan <adobriyan@gmail.com>:
> Currently there is no reliable way to close all opened file descriptors
> (which daemons need and like to do):
>
> * dumb close(fd) loop is slow, upper bound is unknown and
> can be arbitrary large,
>
> * /proc/self/fd is unreliable:
> proc may be unconfigured or not mounted at expected place.
> Looking at /proc/self/fd requires opening directory
> which may not be available due to malicious rlimit drop or ENOMEM situations.
> Not opening directory is equivalent to dumb close(2) loop except slower.
Sorry for the long delay comment. I realized this thread now. I think
/proc no mount
case is not good explanation for the worth of this patch. The problem
is, we can't
use opendir() after fork() if an app has multi threads.
SUS clearly say so,
http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html
we can only call async-signal-safe functions after fork() when multi threads and
opendir() call malloc() internally.
As far as I know, OpenJDK has a such fork-readdir-exec code and it can
make deadlock
when spawnning a new process. Unfortunately Java language perfeter to
make a lot of threads rather than other language.
This patch can solve such multi threaded case.
offtopic, glibc malloc is a slightly clever. It reinitialize its
internal lock when fork by
using thread_atfork() hook. It mean glibc malloc can be used after
fork() and the
technique can avoid this issue. But, glibc malloc still has several
performance problem and
many people prefer to use jemalloc or google malloc instead. Then,
they hit an old issue, bah.
> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
> for userspace.
Solaris don't only have a closefrom(), but also has fdwalk().
http://docs.oracle.com/cd/E19082-01/819-2243/6n4i098vd/index.html
and I've received a request that linux aim fdwalk() several times. Example,
Ruby uses fcntl(FD_CLOEXEC) instead of close() because their community uses
valgrind for daily stress test and it don't support for(f<maxfd){
close() } pattern.
(I guess it is valgrind's bug, but I'm not sure.)
I have several other similar bug report of our proprietary software.
but i'm sorry.
I cant show it here. But clearly, fdwalk (or nextfd) is powerful than
closefrom().
Honestly, I had no idea to aim fdwalk cleanly. but your patch is much
cleaner to me.
So, I like this.
> So, don't add closefrom(2), add nextfd(2).
>
> int nextfd(int fd)
>
> returns next opened file descriptor which is >= than fd or -1/ESRCH
> if there aren't any descriptors >= than fd.
>
> Thus closefrom(3) can be rewritten through it in userspace:
>
> void closefrom(int fd)
> {
> while (1) {
> fd = nextfd(fd);
> if (fd == -1 && errno == ESRCH)
> break;
> (void)close(fd);
> fd++;
> }
> }
>
> Maybe it will grow other smart uses.
>
> nextfd(2) doesn't change kernel state and thus can't fail
> which is why it should go in. Other means may fail or
> may not be available or require linear time with only guessed
> upper boundaries (1024, getrlimit(RLIM_NOFILE), sysconf(_SC_OPEN_MAX).
This is not enough explanation. The problem is, RLIM_NOFILE can be changed
at runtime. then, a process can have a larger fd than RLIM_NOFILE.
Therefore, if you accept other developers opinion (especially linus's
flags argument suggestion), I'll ack this patch.
Thank you.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 23:17 ` KOSAKI Motohiro
@ 2012-04-02 23:56 ` H. Peter Anvin
2012-04-04 11:51 ` Ulrich Drepper
2012-04-04 16:31 ` KOSAKI Motohiro
2012-04-03 19:21 ` Colin Walters
1 sibling, 2 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-02 23:56 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On 04/02/2012 04:17 PM, KOSAKI Motohiro wrote:
>
> Sorry for the long delay comment. I realized this thread now. I think
> /proc no mount case is not good explanation for the worth of this patch. The problem
> is, we can't use opendir() after fork() if an app has multi threads.
>
> SUS clearly say so,
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html
>
> we can only call async-signal-safe functions after fork() when multi threads and
> opendir() call malloc() internally.
>
> As far as I know, OpenJDK has a such fork-readdir-exec code and it can
> make deadlock
> when spawnning a new process. Unfortunately Java language perfeter to
> make a lot of threads rather than other language.
>
> This patch can solve such multi threaded case.
>
> offtopic, glibc malloc is a slightly clever. It reinitialize its
> internal lock when fork by using thread_atfork() hook. It mean glibc malloc can be used after
> fork() and the technique can avoid this issue. But, glibc malloc still has several
> performance problem and many people prefer to use jemalloc or google malloc instead. Then,
> they hit an old issue, bah.
>
OK, so what you're saying here is:
Linux doesn't actually have a problem unless:
1. You use the library implementation of opendir/readdir/closedir;
2. You use a nonstandard malloc for the platform which doesn't
correctly set up fork hooks (which I would consider a bug);
You can deal with this in one of two ways:
2. Fix your malloc().
1. Use the low level open()/getdents()/close() functions instead of
opendir()/readdir()/closedir().
> and I've received a request that linux aim fdwalk() several times. Example,
It doesn't sound very hard to implement fdwalk() in terms of
open/getdents/close without using malloc; since the fdwalk() interface
lets you use the stack for storage. You can then implement closefrom()
in terms of fdwalk(). Something like this (untested):
int fdwalk(int (*func)(void *, int), void *cd)
{
char buf[4096]; /* ... could be less... */
const char *p, *q;
const struct linux_dirent *dp
int dfd, fd;
unsigned char c;
int rv = 0;
int sz;
dfd = open("/proc/self/fd", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
if (dfd < 0)
return -1;
/*** XXX: may want to check for procfs magic here ***/
while ((sz = getdents(dfd, buf, sizeof buf)) > 0) {
p = buf;
while (sz > offsetof(struct linux_dirent, d_name)) {
dp = (const struct linux_dirent *)p;
if (sz < dp->d_reclen)
break;
q = dp->d_name;
p += dp->d_reclen;
sz -= dp->d_reclen;
fd = 0;
while (q < p && (c = *q++)) {
c -= '0';
if (c >= 10)
goto skip;
fd = fd*10 + c;
}
if (fd != dfd)
rv = func(cd, fd);
skip:
;
}
}
if (close(dfd))
return -1;
return rv;
}
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 23:56 ` H. Peter Anvin
@ 2012-04-04 11:51 ` Ulrich Drepper
2012-04-04 16:38 ` KOSAKI Motohiro
2012-04-04 16:31 ` KOSAKI Motohiro
1 sibling, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2012-04-04 11:51 UTC (permalink / raw)
To: H. Peter Anvin
Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, torvalds,
linux-kernel, linux-fsdevel
On Mon, Apr 2, 2012 at 19:56, H. Peter Anvin <hpa@zytor.com> wrote:
>
> You can deal with this in one of two ways:
>
> 2. Fix your malloc().
> 1. Use the low level open()/getdents()/close() functions instead of
> opendir()/readdir()/closedir().
And if their is concern about using opendir() then we can add
something to posix_spawn. And attribute like POSIX_SPAWN_CLOSEALL and
you then would have to explicitly add dup and open requests for the
descriptors you want to have open.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 11:51 ` Ulrich Drepper
@ 2012-04-04 16:38 ` KOSAKI Motohiro
2012-04-04 16:43 ` Ulrich Drepper
0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 16:38 UTC (permalink / raw)
To: Ulrich Drepper
Cc: H. Peter Anvin, KOSAKI Motohiro, Alexey Dobriyan, akpm, viro,
torvalds, linux-kernel, linux-fsdevel
(4/4/12 4:51 AM), Ulrich Drepper wrote:
> On Mon, Apr 2, 2012 at 19:56, H. Peter Anvin<hpa@zytor.com> wrote:
>>
>> You can deal with this in one of two ways:
>>
>> 2. Fix your malloc().
>> 1. Use the low level open()/getdents()/close() functions instead of
>> opendir()/readdir()/closedir().
>
> And if their is concern about using opendir() then we can add
> something to posix_spawn. And attribute like POSIX_SPAWN_CLOSEALL and
> you then would have to explicitly add dup and open requests for the
> descriptors you want to have open.
As far as I understand, any major open source project don't use posix_spawn().
Please remind, I'm talking about real world issue.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 16:38 ` KOSAKI Motohiro
@ 2012-04-04 16:43 ` Ulrich Drepper
2012-04-04 17:07 ` KOSAKI Motohiro
0 siblings, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2012-04-04 16:43 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds,
linux-kernel, linux-fsdevel
On Wed, Apr 4, 2012 at 12:38, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
> As far as I understand, any major open source project don't use
> posix_spawn().
> Please remind, I'm talking about real world issue.
This doesn't mean they shouldn't. If you require code to be changed
anyway let them change to something which doesn't require more cruft
in the kernel. The limitations you cited are irrelevant for
posix_spawn. And perhaps there will be actually spawn support in the
kernel which would make dealing with OOM situations and non-overcommit
much easier.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 16:43 ` Ulrich Drepper
@ 2012-04-04 17:07 ` KOSAKI Motohiro
2012-04-04 17:49 ` Ulrich Drepper
0 siblings, 1 reply; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 17:07 UTC (permalink / raw)
To: Ulrich Drepper
Cc: KOSAKI Motohiro, H. Peter Anvin, Alexey Dobriyan, akpm, viro,
torvalds, linux-kernel, linux-fsdevel
(4/4/12 9:43 AM), Ulrich Drepper wrote:
> On Wed, Apr 4, 2012 at 12:38, KOSAKI Motohiro<kosaki.motohiro@gmail.com> wrote:
>> As far as I understand, any major open source project don't use
>> posix_spawn().
>> Please remind, I'm talking about real world issue.
>
> This doesn't mean they shouldn't. If you require code to be changed
> anyway let them change to something which doesn't require more cruft
> in the kernel. The limitations you cited are irrelevant for
> posix_spawn. And perhaps there will be actually spawn support in the
> kernel which would make dealing with OOM situations and non-overcommit
> much easier.
Umm... I'm sorry. I haven't catch why OOM is related topic. Could you please
elaborate more?
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 17:07 ` KOSAKI Motohiro
@ 2012-04-04 17:49 ` Ulrich Drepper
2012-04-04 18:08 ` KOSAKI Motohiro
0 siblings, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2012-04-04 17:49 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds,
linux-kernel, linux-fsdevel
On Wed, Apr 4, 2012 at 13:07, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
> Umm... I'm sorry. I haven't catch why OOM is related topic. Could you please
> elaborate more?
With fork you always have some copy-on-write (and worse for
overcommit) just to then execute exec. With a real spawn
implementation you wouldn't have that. A big problem if you, for
instance, have to spawn a small helper from a gigantic process.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 17:49 ` Ulrich Drepper
@ 2012-04-04 18:08 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 18:08 UTC (permalink / raw)
To: Ulrich Drepper
Cc: KOSAKI Motohiro, H. Peter Anvin, Alexey Dobriyan, akpm, viro,
torvalds, linux-kernel, linux-fsdevel
(4/4/12 10:49 AM), Ulrich Drepper wrote:
> On Wed, Apr 4, 2012 at 13:07, KOSAKI Motohiro<kosaki.motohiro@gmail.com> wrote:
>> Umm... I'm sorry. I haven't catch why OOM is related topic. Could you please
>> elaborate more?
>
> With fork you always have some copy-on-write (and worse for
> overcommit) just to then execute exec. With a real spawn
> implementation you wouldn't have that. A big problem if you, for
> instance, have to spawn a small helper from a gigantic process.
Ah, ok. I agree posix_spawn() has a chance to aim more momemory efficiency
than fork-exec. But in this purpose, vfork may be enough useful and be widely
accepted from userland folks.
Example, some daemon has a following patten,
1. fork
2. change /proc/<pid>/oom_adj
3. exec
That's said, when adding linux specific knob, we need to add new posix_spawn flags
if we really need (or want) to replaces all userland. this seems very hard and doubtful
worth to me.
Ahh, note. I'm not against to implement posix_spawn() into the kernel. I only argue spawn()
can solve closefrom issue.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 23:56 ` H. Peter Anvin
2012-04-04 11:51 ` Ulrich Drepper
@ 2012-04-04 16:31 ` KOSAKI Motohiro
2012-04-04 17:10 ` Colin Walters
2012-04-04 18:44 ` H. Peter Anvin
1 sibling, 2 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 16:31 UTC (permalink / raw)
To: H. Peter Anvin
Cc: KOSAKI Motohiro, Alexey Dobriyan, akpm, viro, torvalds, drepper,
linux-kernel, linux-fsdevel
(4/2/12 4:56 PM), H. Peter Anvin wrote:
> On 04/02/2012 04:17 PM, KOSAKI Motohiro wrote:
>>
>> Sorry for the long delay comment. I realized this thread now. I think
>> /proc no mount case is not good explanation for the worth of this patch. The problem
>> is, we can't use opendir() after fork() if an app has multi threads.
>>
>> SUS clearly say so,
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html
>>
>> we can only call async-signal-safe functions after fork() when multi threads and
>> opendir() call malloc() internally.
>>
>> As far as I know, OpenJDK has a such fork-readdir-exec code and it can
>> make deadlock
>> when spawnning a new process. Unfortunately Java language perfeter to
>> make a lot of threads rather than other language.
>>
>> This patch can solve such multi threaded case.
>>
>> offtopic, glibc malloc is a slightly clever. It reinitialize its
>> internal lock when fork by using thread_atfork() hook. It mean glibc malloc can be used after
>> fork() and the technique can avoid this issue. But, glibc malloc still has several
>> performance problem and many people prefer to use jemalloc or google malloc instead. Then,
>> they hit an old issue, bah.
>>
>
> OK, so what you're saying here is:
>
> Linux doesn't actually have a problem unless:
> 1. You use the library implementation of opendir/readdir/closedir;
> 2. You use a nonstandard malloc for the platform which doesn't
> correctly set up fork hooks (which I would consider a bug);
Right. but I'm argue "correctly set up" term because SUS/POSIX don't require it.
It is only a workaround of buggy userland in glibc. SUS still says you can't
use opendir and typical userland people don't want ignore SUS as far as possible.
> You can deal with this in one of two ways:
>
> 2. Fix your malloc().
> 1. Use the low level open()/getdents()/close() functions instead of
> opendir()/readdir()/closedir().
Ideally possible. but practically impossible. 2) people don't use a their
own malloc. they only uses open sources alternative malloc. And, I think
you have too narrowing concern. Even though malloc people adds a workaround,
the standard inhibit to use it and people may continue to use more dangerous
RLIM_NOFILE loop. 1) I haven't seen _practical_ userland software uses such
linux internal hacking. Almost all major software can run on multiple OSs.
>> and I've received a request that linux aim fdwalk() several times. Example,
>
> It doesn't sound very hard to implement fdwalk() in terms of
> open/getdents/close without using malloc; since the fdwalk() interface
> lets you use the stack for storage. You can then implement closefrom()
> in terms of fdwalk(). Something like this (untested):
>
> int fdwalk(int (*func)(void *, int), void *cd)
> {
> char buf[4096]; /* ... could be less... */
> const char *p, *q;
> const struct linux_dirent *dp
> int dfd, fd;
> unsigned char c;
> int rv = 0;
> int sz;
>
> dfd = open("/proc/self/fd", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
> if (dfd< 0)
> return -1;
>
> /*** XXX: may want to check for procfs magic here ***/
>
> while ((sz = getdents(dfd, buf, sizeof buf))> 0) {
> p = buf;
>
> while (sz> offsetof(struct linux_dirent, d_name)) {
> dp = (const struct linux_dirent *)p;
>
> if (sz< dp->d_reclen)
> break;
>
> q = dp->d_name;
> p += dp->d_reclen;
> sz -= dp->d_reclen;
>
> fd = 0;
> while (q< p&& (c = *q++)) {
> c -= '0';
> if (c>= 10)
> goto skip;
> fd = fd*10 + c;
> }
>
> if (fd != dfd)
> rv = func(cd, fd);
> skip:
> ;
> }
> }
>
> if (close(dfd))
> return -1;
>
> return rv;
> }
It can. but more ugly. no?
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 16:31 ` KOSAKI Motohiro
@ 2012-04-04 17:10 ` Colin Walters
2012-04-04 17:25 ` Colin Walters
2012-04-04 23:35 ` KOSAKI Motohiro
2012-04-04 18:44 ` H. Peter Anvin
1 sibling, 2 replies; 78+ messages in thread
From: Colin Walters @ 2012-04-04 17:10 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds, drepper,
linux-kernel, linux-fsdevel
On Wed, 2012-04-04 at 09:31 -0700, KOSAKI Motohiro wrote:
> Ideally possible. but practically impossible. 2) people don't use a their
> own malloc. they only uses open sources alternative malloc. And, I think
> you have too narrowing concern. Even though malloc people adds a workaround,
> the standard inhibit to use it
What do you mean? If as hpa says, the maintainer of e.g. google
tcmalloc added a call to pthread_atfork(), then code which uses
opendir() would start working.
> and people may continue to use more dangerous
> RLIM_NOFILE loop. 1) I haven't seen _practical_ userland software uses such
> linux internal hacking. Almost all major software can run on multiple OSs.
Except that if you're using /proc/self/fd, you're already relying on
Linux-specific functionality. So it's not burdensome to use "struct
linux_dirent" and O_DIRECTORY either.
In GLib we're presently doing the regular /proc+opendir() under
#ifdef __linux__:
http://git.gnome.org/browse/glib/tree/glib/gspawn.c#n932
Now I'd happily switch to hpa's fdwalk() implementation if I was aware
of someone using glib in combination with an alternative malloc hitting
this problem.
Basically I think hpa is right here, and it's not really worth adding
a new system call.
The thing is, even if it were added today, since we need to run on old
kernels, we'd have to carry the code to use the /proc trick
approximately forever. And in the end all nextfd() would accomplish
would be a *third* case in the already messy ifdefs/fallbacks
in the various implementations of process spawning.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 17:10 ` Colin Walters
@ 2012-04-04 17:25 ` Colin Walters
2012-04-04 23:35 ` KOSAKI Motohiro
1 sibling, 0 replies; 78+ messages in thread
From: Colin Walters @ 2012-04-04 17:25 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: H. Peter Anvin, Alexey Dobriyan, akpm, viro, torvalds, drepper,
linux-kernel, linux-fsdevel
On Wed, 2012-04-04 at 13:10 -0400, Colin Walters wrote:
> The thing is, even if it were added today, since we need to run on old
> kernels, we'd have to carry the code to use the /proc trick
> approximately forever. And in the end all nextfd() would accomplish
> would be a *third* case in the already messy ifdefs/fallbacks
> in the various implementations of process spawning.
The only really counterargument I see to this is if nextfd() were
*significantly* more efficient. I happily switched to using eventfd()
instead of pipe(), while still carrying the burden of fallback code,
because dropping from two open descriptors to one solved actual
problems with maxing out 1024 fds.
As far as efficiency goes, I could imagine that as Linus said, what
you really want is nextfd(O_NOT_CLOEXEC), because in a well-behaved
program that marks its internal descriptors as O_CLOEXEC, the effort the
forked process spends closing them is just wasted over having it
all happen more efficiently in the kernel as part of the exec().
But if claiming efficiency, one needs to post benchmarks...
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 17:10 ` Colin Walters
2012-04-04 17:25 ` Colin Walters
@ 2012-04-04 23:35 ` KOSAKI Motohiro
1 sibling, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 23:35 UTC (permalink / raw)
To: Colin Walters
Cc: KOSAKI Motohiro, H. Peter Anvin, Alexey Dobriyan, akpm, viro,
torvalds, drepper, linux-kernel, linux-fsdevel
(4/4/12 10:10 AM), Colin Walters wrote:
> On Wed, 2012-04-04 at 09:31 -0700, KOSAKI Motohiro wrote:
>
>> Ideally possible. but practically impossible. 2) people don't use a their
>> own malloc. they only uses open sources alternative malloc. And, I think
>> you have too narrowing concern. Even though malloc people adds a workaround,
>> the standard inhibit to use it
>
> What do you mean? If as hpa says, the maintainer of e.g. google
> tcmalloc added a call to pthread_atfork(), then code which uses
> opendir() would start working.
From point of application programmer view, they have no right to choose
malloc implementation. user can easily override it by using LD_PRELOAD.
So, even though one or two malloc implementation will be fixed, it doesn't
mean user land folks pain will gone. they are living more painful world.
At least, they told me so.
But, I discussed about this issue with hpa today and we agreed the best way is
to merge hpa fdwalk() into libc. It makes maximum happy, example, glib can drop
its own fdwalk implementation. so, I'd like to discuss libc folks.
>> and people may continue to use more dangerous
>> RLIM_NOFILE loop. 1) I haven't seen _practical_ userland software uses such
>> linux internal hacking. Almost all major software can run on multiple OSs.
>
> Except that if you're using /proc/self/fd, you're already relying on
> Linux-specific functionality. So it's not burdensome to use "struct
> linux_dirent" and O_DIRECTORY either.
>
> In GLib we're presently doing the regular /proc+opendir() under
> #ifdef __linux__:
> http://git.gnome.org/browse/glib/tree/glib/gspawn.c#n932
>
> Now I'd happily switch to hpa's fdwalk() implementation if I was aware
> of someone using glib in combination with an alternative malloc hitting
> this problem.
>
> Basically I think hpa is right here, and it's not really worth adding
> a new system call.
>
> The thing is, even if it were added today, since we need to run on old
> kernels, we'd have to carry the code to use the /proc trick
> approximately forever. And in the end all nextfd() would accomplish
> would be a *third* case in the already messy ifdefs/fallbacks
> in the various implementations of process spawning.
Nope. some OSs only have a silly rlimit(RLIM_NOFILE), then userland continue
to keep rlimit(NOFILE) for fallback. it's no linux specific.
I mean, OpenJDK and glib already has a linux specific trick, but many software
don't.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 16:31 ` KOSAKI Motohiro
2012-04-04 17:10 ` Colin Walters
@ 2012-04-04 18:44 ` H. Peter Anvin
1 sibling, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2012-04-04 18:44 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On 04/04/2012 09:31 AM, KOSAKI Motohiro wrote:
>>
>> OK, so what you're saying here is:
>>
>> Linux doesn't actually have a problem unless:
>> 1. You use the library implementation of opendir/readdir/closedir;
>> 2. You use a nonstandard malloc for the platform which doesn't
>> correctly set up fork hooks (which I would consider a bug);
>
> Right. but I'm argue "correctly set up" term because SUS/POSIX don't
> require it.
> It is only a workaround of buggy userland in glibc. SUS still says you
> can't
> use opendir and typical userland people don't want ignore SUS as far as
> possible.
Since you are comparing with a Linux-only system call, any suggestion
that depends on SuS requirements as opposed to Linux requirements is
irrelevant.
>
> It can. but more ugly. no?
>
And a new system call isn't? What planet are you on?
-hpa
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-02 23:17 ` KOSAKI Motohiro
2012-04-02 23:56 ` H. Peter Anvin
@ 2012-04-03 19:21 ` Colin Walters
1 sibling, 0 replies; 78+ messages in thread
From: Colin Walters @ 2012-04-03 19:21 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Alexey Dobriyan, akpm, viro, torvalds, drepper, linux-kernel,
linux-fsdevel
On Mon, 2012-04-02 at 16:17 -0700, KOSAKI Motohiro wrote:
> we can only call async-signal-safe functions after fork() when multi threads and
> opendir() call malloc() internally.
>
> As far as I know, OpenJDK has a such fork-readdir-exec code and it can
> make deadlock
> when spawnning a new process. Unfortunately Java language perfeter to
> make a lot of threads rather than other language.
>
> This patch can solve such multi threaded case.
>
> offtopic, glibc malloc is a slightly clever. It reinitialize its
> internal lock when fork by
> using thread_atfork() hook. It mean glibc malloc can be used after
> fork() and the
> technique can avoid this issue.
Yeah, a *lot* of code out there depends on this glibc malloc
implementation detail. We uncovered our dependency on this in
GNOME a while back while investigating getenv/setenv/fork and threads:
https://bugzilla.gnome.org/show_bug.cgi?id=659326
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-01 12:57 [PATCH] nextfd(2) Alexey Dobriyan
` (6 preceding siblings ...)
2012-04-02 23:17 ` KOSAKI Motohiro
@ 2012-04-04 3:01 ` Al Viro
2012-04-04 17:10 ` KOSAKI Motohiro
7 siblings, 1 reply; 78+ messages in thread
From: Al Viro @ 2012-04-04 3:01 UTC (permalink / raw)
To: Alexey Dobriyan; +Cc: akpm, torvalds, drepper, linux-kernel, linux-fsdevel
On Sun, Apr 01, 2012 at 03:57:42PM +0300, Alexey Dobriyan wrote:
> Currently there is no reliable way to close all opened file descriptors
> (which daemons need and like to do):
>
> * dumb close(fd) loop is slow, upper bound is unknown and
> can be arbitrary large,
>
> * /proc/self/fd is unreliable:
> proc may be unconfigured or not mounted at expected place.
> Looking at /proc/self/fd requires opening directory
> which may not be available due to malicious rlimit drop or ENOMEM situations.
> Not opening directory is equivalent to dumb close(2) loop except slower.
>
> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
> for userspace.
>
> So, don't add closefrom(2), add nextfd(2).
Or unshare(CLONE_FILES_EMPTY) to steal an idea from rfork(2) (Plan 9 one,
that is - I don't remember if its *BSD analog has that). Basically, they
allow 3 kinds of behaviour on clone(2) analog (and unshare(2) is part of
the same thing there):
1) share descriptor table with parent (default for rfork(2))
2) copy descriptor table from parent (RFFDG is set in flags)
3) give child an empty descriptor table (RFCFDG is set in flags)
They have something similar for namespace, BTW - the same share/copy/clean
triple.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] nextfd(2)
2012-04-04 3:01 ` Al Viro
@ 2012-04-04 17:10 ` KOSAKI Motohiro
0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2012-04-04 17:10 UTC (permalink / raw)
To: Al Viro
Cc: Alexey Dobriyan, akpm, torvalds, drepper, linux-kernel,
linux-fsdevel, kosaki.motohiro
(4/3/12 8:01 PM), Al Viro wrote:
> On Sun, Apr 01, 2012 at 03:57:42PM +0300, Alexey Dobriyan wrote:
>> Currently there is no reliable way to close all opened file descriptors
>> (which daemons need and like to do):
>>
>> * dumb close(fd) loop is slow, upper bound is unknown and
>> can be arbitrary large,
>>
>> * /proc/self/fd is unreliable:
>> proc may be unconfigured or not mounted at expected place.
>> Looking at /proc/self/fd requires opening directory
>> which may not be available due to malicious rlimit drop or ENOMEM situations.
>> Not opening directory is equivalent to dumb close(2) loop except slower.
>>
>> BSD added closefrom(fd) which is OK for this exact purpose but suboptimal
>> on the bigger scale. closefrom(2) does only close(2) (obviously :-)
>> closefrom(2) siletly ignores errors from close(2) which in theory is not OK
>> for userspace.
>>
>> So, don't add closefrom(2), add nextfd(2).
>
> Or unshare(CLONE_FILES_EMPTY) to steal an idea from rfork(2) (Plan 9 one,
> that is - I don't remember if its *BSD analog has that). Basically, they
> allow 3 kinds of behaviour on clone(2) analog (and unshare(2) is part of
> the same thing there):
> 1) share descriptor table with parent (default for rfork(2))
> 2) copy descriptor table from parent (RFFDG is set in flags)
> 3) give child an empty descriptor table (RFCFDG is set in flags)
> They have something similar for namespace, BTW - the same share/copy/clean
> triple.
Please remember why closefrom() have "from" argument. Almost all case, people
don't cloase fd 0,1,2 (rarely and 3).
If we add 2nd argument into unshare(CLONE_FILES_EMPTY), It become more ugly than
current nextfd proposal.
^ permalink raw reply [flat|nested] 78+ messages in thread