From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S935007AbcCJDe4 (ORCPT <rfc822;w@1wt.eu>);
	Wed, 9 Mar 2016 22:34:56 -0500
Received: from 216-12-86-13.cv.mvl.ntelos.net ([216.12.86.13]:51649 "EHLO
	brightrain.aerifal.cx" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S933173AbcCJDey (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 9 Mar 2016 22:34:54 -0500
Date: Wed, 9 Mar 2016 22:34:46 -0500
From: Rich Felker <dalias@libc.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>, Andy Lutomirski <luto@kernel.org>,
        the arch/x86 maintainers <x86@kernel.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Borislav Petkov <bp@alien8.de>,
        "musl@lists.openwall.com" <musl@lists.openwall.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [musl] Re: [RFC PATCH] x86/vdso/32: Add AT_SYSINFO cancellation
 helpers
Message-ID: <20160310033446.GL9349@brightrain.aerifal.cx>
References: <06079088639eddd756e2092b735ce4a682081308.1457486598.git.luto@kernel.org>
 <20160309085631.GA3247@gmail.com>
 <20160309113449.GZ29662@port70.net>
 <CA+55aFzOcxbhXCm01+NMgY9=THYgjojvDGeYsnxe-vWfiX4X0g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CA+55aFzOcxbhXCm01+NMgY9=THYgjojvDGeYsnxe-vWfiX4X0g@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 09, 2016 at 11:47:30AM -0800, Linus Torvalds wrote:
> On Wed, Mar 9, 2016 at 3:34 AM, Szabolcs Nagy <nsz@port70.net> wrote:
> >>
> >> Could someone remind me why cancellation points matter to user-space?
> >
> > because of standards.
> 
> So quite frankly, if we have to do kernel support for this, then let's
> do it right, instead of just perpetuating a hack that was done in user
> space in a new way.
> 
> We already have support for cancelling blocking system calls early: we
> do it for fatal signals (exactly because we know that it's ok to
> return -EINTR without failing POSIX semantics - the dying thread will
> never actually *see* the -EINTR because it's dying).
> 
> I suspect that what you guys want is the same semantics as a fatal
> signal (return early with -EINTR), but without the actual fatality
> (you want to do cleanup in the cancelled thread).

No, the semantics need to be identical to EINTR -- you can't cancel an
operation where some work has already been done. This is both a POSIX
requirement and a conceptual requirement. When a thread is cancelled,
the process is not terminating abnormally; it's continuing. It needs
to be able to know whether some work was completed, because that
changes what the cleanup code needs to do in order for a consistent
state to be maintained. This is most critical with syscalls that
allocate or free resources -- open, close, recvmsg accepting file
descriptors, etc. -- but it can even matter for reads and writes.
This is the whole reason we need a race-free cancellation rather than
the buggy implementation glibc historically used (which they are in
the process of fixing too).

Anyway, in the case where some but not all work was completed already
at the time the cancellation request was made, the function needs to
return and report whatever was successful.
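
To make that concrete, here is a minimal sketch of the decision a
cancellable syscall wrapper faces once the syscall has returned. The
names (cp_outcome, after_syscall, demo) are hypothetical and the raw
-errno return convention is assumed; this is an illustration, not any
libc's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* What a cancellable wrapper does once the raw syscall has returned. */
enum cp_outcome {
    CP_RETURN_RESULT,  /* hand the result (success, partial, or error) back */
    CP_ACT_ON_CANCEL   /* nothing was done; safe to run cancellation cleanup */
};

/*
 * ret is the raw kernel return value (-errno on failure, assumed).
 * Only an -EINTR return, meaning no work was completed, may be turned
 * into cancellation. Any other result must be reported to the caller;
 * the request stays pending for the next cancellation point.
 */
static enum cp_outcome after_syscall(long ret, bool cancel_requested)
{
    if (cancel_requested && ret == -EINTR)
        return CP_ACT_ON_CANCEL;
    return CP_RETURN_RESULT;
}

/* Usage sketch covering the cases described above. */
static void demo(void)
{
    /* A read got 100 bytes before the request arrived: report the 100. */
    assert(after_syscall(100, true) == CP_RETURN_RESULT);
    /* Interrupted before doing anything, request pending: cancel. */
    assert(after_syscall(-EINTR, true) == CP_ACT_ON_CANCEL);
    /* Plain EINTR with no request pending: ordinary error return. */
    assert(after_syscall(-EINTR, false) == CP_RETURN_RESULT);
}
```

The point of the middle case being the only one that cancels is that
cleanup code can then assume no resource was allocated or freed by the
interrupted call.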

> I suspect that we could fairly easily give those kinds of semantics.
> We could add a new flag to the sigaction (sa_flags) that says "this
> signal interrupts even uninterruptible system calls".

This would not help, because whether the system call should be
cancellable is a function of the caller, not the system call; some
syscalls are cancellable when used in one place but not in others.

Also, it does not solve the race condition; it's possible that the
signal is delivered _after_ userspace checks the cancellation flag,
but _before_ the syscall is made. Thus we need a way to probe whether
the program counter is in a range between the userspace flag check and
the syscall instruction.
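
A minimal sketch of that probe, with the window modelled as plain
addresses. All names here (pc_in_cancel_window, resume_pc,
cancel_entry) are hypothetical, and I'm assuming a half-open window
that ends just past the syscall instruction. A real handler would read
and rewrite the saved PC in the signal context, which is
arch-specific and omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * begin: address of the userspace cancellation-flag check.
 * end:   address just past the syscall instruction (assumed).
 * A PC inside [begin, end) means the flag was (or is being) checked
 * but the syscall has not completed, so the request would otherwise
 * be missed.
 */
static bool pc_in_cancel_window(uintptr_t pc, uintptr_t begin, uintptr_t end)
{
    assert(begin <= end);
    return pc >= begin && pc < end;
}

/*
 * Decide where the interrupted thread should resume after the
 * cancellation signal handler returns: at a (hypothetical) cancel
 * entry point if it was caught inside the window, otherwise exactly
 * where it was interrupted.
 */
static uintptr_t resume_pc(uintptr_t pc, uintptr_t begin, uintptr_t end,
                           uintptr_t cancel_entry)
{
    return pc_in_cancel_window(pc, begin, end) ? cancel_entry : pc;
}
```

For example, with a window of [0x1000, 0x1010) and a cancel entry at
0x2000, a signal arriving at PC 0x1004 redirects to 0x2000, while one
arriving at 0x1010 (syscall already returned) leaves the PC alone so
the result gets reported.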

I believe a new kernel cancellation API with a sticky cancellation
flag (rather than a signal), plus a flag ORed onto the syscall number
to make it cancellable at the call point, could work. But then
userspace would need to support fairly different old and new kernel
APIs in order to run on old kernels while also taking advantage of
new ones, and it's not clear to me that doing so would actually be
worthwhile. I could see doing it for a completely new syscall API, but
as a second syscall API for a system that already has one it seems
gratuitous. From my perspective the existing approach (checking the
program counter from the signal handler) is very clean and simple.
After all, it made enough sense that I was able to convince the glibc
folks to adopt it.
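
For reference, the sticky-flag alternative mentioned above could be
modelled roughly as follows. Nothing like this exists in the kernel;
the flag bit and every name (HYP_SYSCALL_CANCELLABLE,
hyp_mark_cancellable, hyp_real_nr, hyp_should_cancel) are made up
purely to illustrate how the call site, not the syscall, would select
cancellability.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit ORed onto the syscall number; the value is arbitrary. */
#define HYP_SYSCALL_CANCELLABLE (1L << 29)

/* Userspace side: mark this particular call as a cancellation point. */
static long hyp_mark_cancellable(long nr)
{
    assert(nr >= 0 && !(nr & HYP_SYSCALL_CANCELLABLE));
    return nr | HYP_SYSCALL_CANCELLABLE;
}

/* Kernel side (modelled): recover the real syscall number. */
static long hyp_real_nr(long nr_with_flag)
{
    return nr_with_flag & ~HYP_SYSCALL_CANCELLABLE;
}

/*
 * Kernel side (modelled): a call is cancelled only if the caller asked
 * for it at this call site AND the thread's sticky flag is set. Since
 * the flag is sticky rather than a one-shot signal, there is no window
 * in which a request can be lost before the syscall is entered.
 */
static bool hyp_should_cancel(long nr_with_flag, bool sticky_cancel_pending)
{
    return (nr_with_flag & HYP_SYSCALL_CANCELLABLE) && sticky_cancel_pending;
}
```

The same syscall number issued without the bit is never cancelled,
which is what lets one syscall be cancellable in one place and not in
another.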

Rich