From: David Drysdale <drysdale@google.com>
To: Josh Triplett <josh@joshtriplett.org>
Cc: Thiago Macieira <thiago.macieira@intel.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Ingo Molnar <mingo@redhat.com>, Kees Cook <keescook@chromium.org>,
	Oleg Nesterov <oleg@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Rik van Riel <riel@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	X86 ML <x86@kernel.org>
Subject: Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
Date: Sun, 15 Mar 2015 10:18:05 +0000
Message-ID: <CAHse=S9OLvyXCpbNSzA-qxYOm8VscFkKV0d2oyexM9gUjomN3g@mail.gmail.com>
In-Reply-To: <20150314192940.GD22130@thin>

On Sat, Mar 14, 2015 at 7:29 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> On Sat, Mar 14, 2015 at 12:03:12PM -0700, Thiago Macieira wrote:
>> On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
>> > On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
>> > > In any event, we should find out what FreeBSD does in response to
>> > > read(2) on the fd.
>> >
>> > I've just successfully installed FreeBSD and compiled qtbase (main package
>> > of Qt 5) on it.
>> >
>> > I'll test pdfork during the weekend and report its behaviour.
>>
>> Here are my findings about pdfork.
>>
>> Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
>> Qt adaptations: https://codereview.qt-project.org/108561
>>
>> Processes created with pdfork() are normal processes that still send SIGCHLD
>> to their parents. The only difference is that you get the extra file descriptor
>> that can be passed to the pdgetpid() system call and works on select()/poll().
>> Trying to read from that file descriptor will result in EOPNOTSUPP.
>
> OK, since read() doesn't work on a pdfork() file descriptor, we don't
> have to worry about compatibility with pdfork()'s read result.
>
> However, if the expectation is that pdfork()ed child processes still
> send SIGCHLD, then I don't see how we can be compatible there, nor do I
> think we want to; as you mention below, that breaks the ability to
> encapsulate management of the created process entirely within a library.

I didn't think that was the case -- my understanding was that pdfork()ed
children would not generate SIGCHLD (and that does seem to be the
case with a quick test program).

As an aside, I do think there are some aspects of FreeBSD's process
descriptors that aren't quite right yet, particularly their interaction with
waitpid(-1, ...) -- IIRC pdfork()ed children are visible to it, but I'd expect
them not to be (to allow libraries to use sub-processes invisibly to the
programs using them). There's a thread at:
https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2014-March/thread.html
but I'm not sure that anything came of that discussion.

As it happens, I'm meeting Robert Watson (one of the progenitors
of Capsicum/process descriptors) tomorrow, so I'll chase further.

>> Since they've never implemented pdwait4() (it's not even declared in the
>> headers), the only way to reap a child if you only have the file descriptor is
>> to first pdgetpid() and then call wait4() or wait6().
>
> Which suggests that we shouldn't try to implement pdwait4() in glibc
> until FreeBSD implements it in their kernel, since we won't know the
> exact semantics they expect.

By the way, I should point out one part of the FreeBSD design
which might help explain some of the semantics.

Process descriptors are specifically designed to be used with
Capsicum, which is a security framework where file descriptors
get extra rights associated with them, and the kernel polices
the use of those rights (e.g. you need CAP_READ for read(2)
operations; normal file descriptors implicitly have all of the
rights, for backward compatibility).
  https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4

Capsicum also includes 'capability mode', where system calls
that access global namespaces are disabled -- including the
pid namespace.

So process descriptors are the only way to manipulate child
processes when a program is in capability mode -- and this
means that pdkill() is then genuinely needed over and above
kill(pdgetpid(),...).

>> If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when
>> the file closes.
>
> OK, that makes sense.  We could certainly implement a
> CLONE_FD_KILL_ON_CLOSE flag with those semantics, if we want one in the
> future.
>
>> Conclusion:
>> Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD mess.
>> As long as all child process activations use this feature, the problem is
>> solved.
>>
>> Cons: it requires cooperation from all child starters. If some other library
>> or the application installs a global SIGCHLD handler that waits on all child
>> processes, like libvlc used to do and Glib and Ecore still do, you won't be
>> able to get the child exit status.
>>
>> I have not tested what happens if you try to pass the file descriptor to other
>> processes (can you even do that on FreeBSD?). But even if you could and got
>> notifications, you couldn't wait on the child to get its exit status -- unless
>> they implement pdwait4.
>
> Even if they do implement pdwait4, they might not bypass the "must be
> the parent process" restriction.  Let's wait to see what semantics they
> go with.

Hmm, interesting point.  FreeBSD certainly allows FD passing, but
I'm not sure what the interactions are when it's a process descriptor
that's passed.

Given the object-capability background to Capsicum, I'd assume that a
holder of the process descriptor should be able to do whatever operations
are allowed by the rights associated with the descriptor (CAP_PDGETPID,
CAP_PDKILL and CAP_PDWAIT exist as specific rights allowing those
operations, and a non-restricted descriptor will have all of them by default).

But I'll add some test cases for this to the Capsicum test suite to check
whether theory matches practice...
  https://github.com/google/capsicum-test/blob/dev/procdesc.cc
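
To illustrate the model I have in mind, here is a minimal sketch of
rights-gated operations on a descriptor.  It is a toy, not the Capsicum
API: the TOY_* names, the struct and the helpers are all made up for
this example, and only the three right names are borrowed from the
paragraph above.  The one property it relies on is that rights on a
descriptor can be narrowed but never widened.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy rights bits, loosely named after CAP_PDGETPID, CAP_PDKILL and
 * CAP_PDWAIT.  The values are invented for this sketch. */
enum {
    TOY_PDGETPID = 1u << 0,
    TOY_PDKILL   = 1u << 1,
    TOY_PDWAIT   = 1u << 2,
    TOY_PD_ALL   = TOY_PDGETPID | TOY_PDKILL | TOY_PDWAIT,
};

/* Hypothetical stand-in for a process descriptor. */
struct toy_procdesc {
    int      pid;     /* process the descriptor refers to */
    uint32_t rights;  /* operations the holder may perform */
};

/* A non-restricted descriptor starts out with every right. */
static inline struct toy_procdesc toy_pd_new(int pid)
{
    struct toy_procdesc pd = { pid, TOY_PD_ALL };
    return pd;
}

/* Narrow the rights.  Bits that are already gone stay gone, so a
 * later call cannot widen the set again. */
static inline void toy_pd_limit(struct toy_procdesc *pd, uint32_t keep)
{
    pd->rights &= keep;
}

/* True if every right in 'need' is still present on the descriptor. */
static inline bool toy_pd_allows(const struct toy_procdesc *pd,
                                 uint32_t need)
{
    return (pd->rights & need) == need;
}
```

In this model the check depends only on the rights carried by the
descriptor, not on who holds it, which is the behaviour I would expect
for a process descriptor received over a socket.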

>>  - pdfork: can be emulated with clone4 + CLONE_FD (+ CLONEFD_KILL_ON_CLOSE)
>>  - pdwait4: can be emulated with read()
>>  - pdgetpid: needs an ioctl
>>  - pdkill: needs an ioctl [or just write()]
>
> I think that should be a dedicated syscall, not an ioctl.
>
> It's unfortunate that rt_sigqueueinfo doesn't take a flags argument.
> However, I just realized that it takes a 32-bit "int" for the signal
> number, yet signal numbers fit in 8 bits.  So we could just add flags in
> the high 24 bits of that argument, and in particular add a flag
> indicating that the first argument is a file descriptor rather than a
> PID.
>
> - Josh Triplett
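
For concreteness, the flags-in-the-high-bits idea quoted above could
look roughly like the sketch below.  Everything here is an assumption
for illustration: the SIGQ_* names, the bit chosen for the flag and the
helper functions do not exist in any kernel or libc.  The sketch uses
an unsigned 32-bit value to keep the bit operations well defined,
whereas the real argument is a signed int.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the low 8 bits carry the signal number and the
 * upper 24 bits carry flags. */
#define SIGQ_SIGNAL_MASK  0x000000ffu
#define SIGQ_FLAG_FD      0x00000100u  /* first argument is an fd, not a PID */

/* Combine a signal number and flags into the single 32-bit argument.
 * Stray bits are masked off so neither field can spill into the other. */
static inline uint32_t sigq_pack(unsigned int sig, uint32_t flags)
{
    return (sig & SIGQ_SIGNAL_MASK) | (flags & ~SIGQ_SIGNAL_MASK);
}

/* Recover the signal number from a packed argument. */
static inline unsigned int sigq_signal(uint32_t arg)
{
    return arg & SIGQ_SIGNAL_MASK;
}

/* True if the caller asked for the target to be treated as an fd. */
static inline bool sigq_target_is_fd(uint32_t arg)
{
    return (arg & SIGQ_FLAG_FD) != 0;
}
```

Existing callers that pass a plain signal number have all the flag bits
clear, so under this layout they would keep their current meaning.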
