Date: Sat, 27 Feb 2016 14:15:01 +0000 (UTC)
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner, Peter Zijlstra, Andrew Morton, Russell King,
 Ingo Molnar, linux-kernel@vger.kernel.org, linux-api, Paul Turner,
 Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
 Ben Maurer, rostedt, "Paul E. McKenney", Josh Triplett,
 Catalin Marinas, Will Deacon, Michael Kerrisk, Linus Torvalds
Subject: Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread

----- On Feb 27, 2016, at 1:24 AM, H.
Peter Anvin <hpa@zytor.com> wrote:

> On 02/26/16 16:40, Mathieu Desnoyers wrote:
>>>
>>> I think it would be a good idea to make this a general pointer for
>>> the kernel to be able to write per-thread state to user space, which
>>> obviously can't be done with the vDSO.
>>>
>>> This means the libc per-thread startup should query the kernel for
>>> the size of this structure and allocate thread-local data
>>> accordingly. We can then grow this structure if needed without
>>> making the ABI even more complex.
>>>
>>> This is more than a system call: this is an entirely new way for
>>> userspace to interact with the kernel. Therefore we should make it a
>>> general facility.
>>
>> I'm really glad to see I'm not the only one seeing potential for
>> genericity here. :-) This is exactly what I had in mind last year
>> when proposing the thread_local_abi() system call: a generic way to
>> register an extensible per-thread data structure so the kernel can
>> communicate with user-space and vice versa.
>>
>> Rather than having the libc query the kernel for the size of the
>> structure, I would recommend that libc tell the kernel the size of
>> the thread-local ABI structure it supports. The idea here is that
>> both the kernel and libc need to know about the fields in that
>> structure to allow a two-way interaction. Fields known only by either
>> the kernel or userspace are useless for a given thread anyway. This
>> way, libc could statically define the structure.
>
> Big fat NOPE there. Why? Because it means that EVERY interaction with
> this memory, no matter how critical, needs to be conditionalized.
> Furthermore, userspace != libc. Applications or higher-layer libraries
> might have more information than the running libc about additional
> fields, but with your proposal libc would gate them.

Good point!

> As far as the kernel providing the size in the structure (alone) -- I
> *really* hope you can see what is wrong with that!!
> That doesn't mean we can't provide it in the structure as well, and
> that too might avoid the skipped libc problem.

Indeed, libc would need to query the size before it can allocate the
structure.

>> I would be tempted to also add "features" flags, so both user-space
>> and the kernel could tell each other what they support: user-space
>> would announce the set of features it supports, and it could also
>> query the kernel for the set of supported features. One simple
>> approach would be to use a uint64_t as the type for those feature
>> flags, and reserve the last bit for extending to future flags if we
>> ever have more than 64.
>>
>> Thoughts ?
>
> It doesn't seem like it would hurt, although the size of the flags
> field could end up being an issue.

I'm concerned that this thread-local ABI structure may become messy.
Let's just imagine how we would first introduce a "cpu_id" field
(int32_t), and eventually add a "seqnum" field for rseq in the future
(unsigned long). Both fields need to be read with single-copy semantics
as volatile reads, and both need to be naturally aligned. However, I'm
tempted to use the "packed" attribute on the structure, since it's an
ABI between kernel and user-space. A pretty bad example of what this
could become, due to alignment constraints, looks like:

	/* This structure needs to be aligned on pointer size. */
	struct thread_local_abi {
		int32_t cpu_id;
		int32_t __unused1;
		unsigned long seqnum;
		/* Add new fields at the end. */
	} __attribute__((packed));

And this is just a start. It may become messier as we append new fields
in the future.

The main argument I currently see in favor of having this meta system
call for all per-thread features is to only maintain a single pointer in
the kernel task_struct rather than one per thread-local feature.
If the goal is really to keep the burden on the task_struct small, we
could use kmalloc()/kfree() to allocate and free an array of pointers
to the various per-thread features, rather than putting them directly
in the task_struct. We could keep a mask of the enabled features in the
task_struct too (which we will likely have to do even if we go with the
thread-local ABI meta system call).

Having this per-task allocated pointer array at kernel level would
allow us to have one system call per feature, with clear semantics,
without evolving a messy thread-local ABI structure due to all sorts of
alignment constraints.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com