From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753424AbcCLSFu (ORCPT <rfc822;w@1wt.eu>);
	Sat, 12 Mar 2016 13:05:50 -0500
Received: from 216-12-86-13.cv.mvl.ntelos.net ([216.12.86.13]:51714 "EHLO
	brightrain.aerifal.cx" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751921AbcCLSFk (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sat, 12 Mar 2016 13:05:40 -0500
Date: Sat, 12 Mar 2016 13:05:31 -0500
From: Rich Felker <dalias@libc.org>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>,
        the arch/x86 maintainers <x86@kernel.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Borislav Petkov <bp@alien8.de>,
        "musl@lists.openwall.com" <musl@lists.openwall.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [musl] Re: [RFC PATCH] x86/vdso/32: Add AT_SYSINFO cancellation
 helpers
Message-ID: <20160312180531.GD9349@brightrain.aerifal.cx>
References: <CA+55aFzOcxbhXCm01+NMgY9=THYgjojvDGeYsnxe-vWfiX4X0g@mail.gmail.com>
 <20160310033446.GL9349@brightrain.aerifal.cx>
 <20160310111646.GA13102@gmail.com>
 <20160310164104.GM9349@brightrain.aerifal.cx>
 <20160310180331.GB15940@gmail.com>
 <20160310232819.GR9349@brightrain.aerifal.cx>
 <20160311093347.GA17749@gmail.com>
 <20160311113914.GD29662@port70.net>
 <CA+55aFxvMM3j1aWjN-kr5Hn8CUC_RSNw5hc+X8zFXMaMv+mGww@mail.gmail.com>
 <20160312170040.GA1108@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160312170040.GA1108@gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Mar 12, 2016 at 06:00:40PM +0100, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > [...]
> > 
> > Because if that's the case, I wonder if what you really want is not "sticky 
> > signals" as much as "synchronous signals" - ie the ability to say that a signal 
> > shouldn't ever interrupt in random places, but only at well-defined points 
> > (where a system call would be one such point - are there others?)
> 
> Yes, I had similar 'deferred signal delivery' thoughts after having written up the 
> sticky signals approach, I just couldn't map all details of the semantics: see the 
> 'internal libc functions' problem below.
> 
> If we can do this approach then there's another advantage as well: this way the C 
> library does not even have to poll for cancellation at syscall boundaries: i.e. 
> the regular system call fast path gets faster by 2-3 instructions as well.

That is not a measurable benefit. You're talking about 2-3 cycles out
of 10k or more cycles (these are heavy blocking syscalls, not light
things like SYS_time or SYS_getpid).

> > So then you could make "pthread_setcanceltype()" just set that flag for the 
> > cancellation signal, and just know that the signal itself will always be 
> > deferred to such a synchronous point (ie system call entry).
> >
> > We already have the ability to catch things at system call entry (ptrace needs 
> > it, for example), so we could possibly make our signal delivery have a mode 
> > where a signal does *not* cause user space execution to be interrupted by a 
> > signal handler, but instead just sets a bit in the thread info state that then 
> > causes the next system call to take the signal.
> 
> Yes, so this would need a bit of work, to handle the problem mentioned by Rich 
> Felker: "internal" libc APIs (such as name server lookups) may consist of a series 
> of complex system calls - some of which might be blocking. It should still be 
> possible to execute such 'internal' system calls undisturbed, even if a 'deferred' 
> signal is sent.

That's equivalent to setcancelstate(disabled), and is actually the
mechanism we use for most "complex" functions, since it's a lot simpler
and more maintainable to build these complex functions on top of public
APIs than direct inline syscalls or internal APIs that may change. In
musl, direct non-cancellable syscall variants are mainly used in
places where either it's just a single simple syscall (like close) or
where calling the public API is already impossible for namespace
reasons (e.g. inside stdio, which can't use the POSIX namespace because
it's implementing ISO C, not POSIX).
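
To make the setcancelstate(disabled) pattern concrete, here is a rough
sketch in C. The function read_config_nocancel() and its behavior are
hypothetical, invented for illustration; this is not musl's actual
code, only the general shape of a "complex" function built on public,
cancellable APIs with cancellation disabled for its duration.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical "complex" library function (illustrative only, not
 * taken from musl). It is built from public APIs that are cancellation
 * points (open/read/close), so it disables cancellation on entry and
 * restores the caller's previous state on exit. */
static int read_config_nocancel(const char *path, char *buf, size_t len)
{
    int old_state, ret = -1;

    if (len == 0)
        return -1;

    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old_state);

    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        ssize_t n = read(fd, buf, len - 1);
        if (n >= 0) {
            buf[n] = '\0';
            ret = 0;
        }
        close(fd);
    }

    /* Restore whatever state the caller had. If cancellation was
     * enabled, a request that arrived meanwhile stays pending and is
     * acted on at the caller's next cancellation point. */
    pthread_setcancelstate(old_state, 0);
    return ret;
}
```

Saving and restoring the old state, rather than unconditionally
re-enabling, keeps the function safe to call from code that has
already disabled cancellation itself.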

> One workable solution I think would be to prepare the internal functions for 
> eventual interruption by the cancellation signal. They have to be restartable 
> anyway, because the application can send other signals. As long as the 
> interruption is only transient it should be fine.

No, that does not work. EINTR from a non-restarting signal is a
specified, reportable error (despite being rather useless in practice
due to race conditions; of course you can solve those with repeated
signals and exponential backoff). We cannot just loop and retry on
spurious EINTR except in a few cases where EINTR is optional or not
used (like sem_wait).
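
As a small illustration of why EINTR can't simply be swallowed, the
sketch below installs a handler without SA_RESTART and shows a
blocking read() returning -1 with errno set to EINTR, which is the
result the caller is entitled to see. The helper names on_alarm() and
read_reports_eintr() are made up for this example, and the repeating
10ms timer is just one way (assumed here) to sidestep the race where a
single signal arrives before the thread is actually blocked.

```c
#define _XOPEN_SOURCE 700   /* for sigaction() and setitimer() */
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Empty handler: its only purpose is to interrupt a blocking call. */
static void on_alarm(int sig) { (void)sig; }

/* Hypothetical demo helper. Returns 1 if a blocking read() failed
 * with EINTR, 0 if it did not, and -1 if setup failed. */
static int read_reports_eintr(void)
{
    struct sigaction sa, old;
    /* Repeating timer: the signal is re-sent every 10ms, so it cannot
     * be lost by firing once before read() has started blocking. */
    struct itimerval tick = { .it_interval = { .tv_usec = 10000 },
                              .it_value    = { .tv_usec = 10000 } };
    struct itimerval off;
    int p[2], saved_errno;
    ssize_t n;
    char c;

    memset(&sa, 0, sizeof sa);
    memset(&off, 0, sizeof off);
    sa.sa_handler = on_alarm;   /* sa_flags == 0: no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;
    if (pipe(p) < 0) {
        sigaction(SIGALRM, &old, 0);
        return -1;
    }

    setitimer(ITIMER_REAL, &tick, 0);
    n = read(p[0], &c, 1);      /* empty pipe: blocks until a signal */
    saved_errno = errno;

    setitimer(ITIMER_REAL, &off, 0);   /* disarm before restoring */
    sigaction(SIGALRM, &old, 0);
    close(p[0]);
    close(p[1]);
    return n < 0 && saved_errno == EINTR;
}
```

A libc that looped on this EINTR internally would hide from the
application an error it is specified to receive.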

> And note that this approach would also be pretty fast on the libc side: none of 
> the 'fast' cancellation APIs would have to do anything complex like per call 
> signal blocking/unblocking or other complex signal operations. They would just 
> activate a straightforward new SA_ flag and rely on its semantics.

It's already fast, aside from not being able to use sysenter/syscall
instructions. I'm really frustrated that, again and again, we have
kernel folks with no experience with libc implementation trying to
redesign something that already has a simple zero-cost design that
works on all existing systems, and proposing things that have a mix of
immediately-obvious flaws and potential future problems we haven't
even thought of yet.

Even if your designs were ideal, we would end up with libc
implementing two good designs and switching them at runtime based on
kernel version, instead of just one good design. As it stands, every
alternative proposed so far is _more_ complex on the libc side, _more_
complex on the kernel side, _and_ on top of that, requires having two
implementations.

Rich