From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1760440AbcHYRIm (ORCPT <rfc822;w@1wt.eu>);
        Thu, 25 Aug 2016 13:08:42 -0400
Received: from mail.efficios.com ([78.47.125.74]:40869 "EHLO mail.efficios.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1752794AbcHYRIh (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 25 Aug 2016 13:08:37 -0400
Date: Thu, 25 Aug 2016 17:08:32 +0000 (UTC)
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
        Dave Watson <davejwatson@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Boqun Feng <boqun.feng@gmail.com>,
        Andy Lutomirski <luto@amacapital.net>,
        linux-kernel <linux-kernel@vger.kernel.org>,
        linux-api <linux-api@vger.kernel.org>, Paul Turner <pjt@google.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Russell King <linux@arm.linux.org.uk>,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>, Andrew Hunter <ahh@google.com>,
        Andi Kleen <andi@firstfloor.org>, Chris Lameter <cl@linux.com>,
        Ben Maurer <bmaurer@fb.com>, rostedt <rostedt@goodmis.org>,
        Josh Triplett <josh@joshtriplett.org>,
        Catalin Marinas <catalin.marinas@arm.com>,
        Will Deacon <will.deacon@arm.com>,
        Michael Kerrisk <mtk.manpages@gmail.com>
Message-ID: <545371402.19191.1472144912215.JavaMail.zimbra@efficios.com>
In-Reply-To: <CA+55aFz+Q33m1+ju3ANaznBwYCcWo9D9WDr2=p0YLEF4gJF12g@mail.gmail.com>
References: <1471637274-13583-1-git-send-email-mathieu.desnoyers@efficios.com> <1471637274-13583-2-git-send-email-mathieu.desnoyers@efficios.com> <CA+55aFz+Q33m1+ju3ANaznBwYCcWo9D9WDr2=p0YLEF4gJF12g@mail.gmail.com>
Subject: Re: [RFC PATCH v8 1/9] Restartable sequences system call
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [78.47.125.74]
X-Mailer: Zimbra 8.7.0_GA_1659 (ZimbraWebClient - FF45 (Linux)/8.7.0_GA_1659)
Thread-Topic: Restartable sequences system call
Thread-Index: ATiS3BDdeOu0zrm90KY7u/P88yklug==
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

----- On Aug 19, 2016, at 4:23 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Fri, Aug 19, 2016 at 1:07 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Benchmarking various approaches for reading the current CPU number:
> 
> So I'd like to see the benchmarks of something that actually *does* something.
> 
> IOW, what's the bigger-picture "this is what it actually is useful
> for, and how it speeds things up".
> 
> Nobody gets a cpu number just to get a cpu number - it's not a useful
> thing to benchmark. What does getcpu() so much that we care?
> 
> We've had tons of clever features that nobody actually uses, because
> they aren't really portable enough. I'd like to be convinced that this
> is actually going to be used by real applications.

I completely agree with your request for real-life application numbers.

The most appealing application we have so far is Dave Watson's Facebook
services using jemalloc as a memory allocator. It would be nice if he
could re-run those benchmarks with my rseq implementation. The trade-offs
here are about speed and memory usage:

1) single process-wide pool:
   - speed: does not scale well to many cores,
   + efficient use of memory.
2) per-thread pools:
   + speed: scales really well to many cores,
   - inefficient use of memory.
3) per-cpu pools without rseq:
   - speed: requires atomic instructions due to migration and preemption,
   + efficient use of memory.
4) per-cpu pools with rseq:
   + speed: no atomic instructions required,
   + efficient use of memory.

His benchmarks should confirm that we get the best of both speed and
memory use with (4).
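
To make the cost in (3) concrete, here is a minimal sketch of a per-cpu
counter updated without rseq. It is only an illustration, not code from
jemalloc or from the patchset: MAX_CPUS, struct percpu_counter,
percpu_inc() and percpu_sum() are hypothetical names, and the 64-byte
cache line size is an assumption. The thread can be preempted or migrated
between reading the CPU number and updating the slot, so two threads may
touch the same slot concurrently, and the update has to be atomic.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_CPUS 1024	/* hypothetical upper bound for this sketch */

/* One counter per CPU, aligned to an assumed 64-byte cache line. */
struct percpu_counter {
	_Alignas(64) atomic_long count;
};

static_assert(sizeof(struct percpu_counter) == 64,
	      "each counter is expected to fill one cache line");

static struct percpu_counter counters[MAX_CPUS];

/*
 * Approach (3): index by the current CPU number. The CPU number may be
 * stale by the time the slot is updated (preemption or migration), so
 * the increment uses an atomic read-modify-write.
 */
static void percpu_inc(void)
{
	int cpu = sched_getcpu();

	if (cpu < 0 || cpu >= MAX_CPUS)
		cpu = 0;	/* fall back to slot 0 on error or overflow */
	atomic_fetch_add_explicit(&counters[cpu].count, 1,
				  memory_order_relaxed);
}

/* Sum all per-cpu slots to get the total. */
static long percpu_sum(void)
{
	long sum = 0;

	for (int i = 0; i < MAX_CPUS; i++)
		sum += atomic_load_explicit(&counters[i].count,
					    memory_order_relaxed);
	return sum;
}
```

With (4), the atomic read-modify-write in percpu_inc() would become a
plain load/add/store placed inside a restartable sequence, which is
restarted if the thread is preempted or migrated before the final store.
That variant is not shown here because it depends on the rseq ABI still
under discussion in this RFC.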

I plan to personally start working on integrating rseq with
the lttng-ust user-space tracer per-CPU ring buffer, but
I expect to mainly publish microbenchmarks, as most of
our heavy tracing users are proprietary applications, for
which it's tricky to publish numbers. I suspect that
microbenchmarks are not what you are looking for here.

Boqun Feng expressed interest in working on a
userspace RCU flavor that would implement per-CPU
(rather than per-thread) grace period tracking. I suspect
this will be a rather large undertaking. The benefits
should be visible in grace period overhead and speed
for applications that have many more threads than cores.

Paul Turner from Google probably has interesting numbers too,
but I suspect he is busy on other projects at the moment.

Let's see if we can get Dave Watson to provide those numbers.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com