From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1760440AbcHYRIm (ORCPT ); Thu, 25 Aug 2016 13:08:42 -0400
Received: from mail.efficios.com ([78.47.125.74]:40869 "EHLO mail.efficios.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752794AbcHYRIh (ORCPT ); Thu, 25 Aug 2016 13:08:37 -0400
Date: Thu, 25 Aug 2016 17:08:32 +0000 (UTC)
From: Mathieu Desnoyers
To: Linus Torvalds, Dave Watson
Cc: Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Andy Lutomirski,
	linux-kernel, linux-api, Paul Turner, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk
Message-ID: <545371402.19191.1472144912215.JavaMail.zimbra@efficios.com>
In-Reply-To:
References: <1471637274-13583-1-git-send-email-mathieu.desnoyers@efficios.com>
	<1471637274-13583-2-git-send-email-mathieu.desnoyers@efficios.com>
Subject: Re: [RFC PATCH v8 1/9] Restartable sequences system call
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [78.47.125.74]
X-Mailer: Zimbra 8.7.0_GA_1659 (ZimbraWebClient - FF45 (Linux)/8.7.0_GA_1659)
Thread-Topic: Restartable sequences system call
Thread-Index: ATiS3BDdeOu0zrm90KY7u/P88yklug==
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

----- On Aug 19, 2016, at 4:23 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Fri, Aug 19, 2016 at 1:07 PM, Mathieu Desnoyers
> wrote:
>>
>> Benchmarking various approaches for reading the current CPU number:
>
> So I'd like to see the benchmarks of something that actually *does* something.
>
> IOW, what's the bigger-picture "this is what it actually is useful
> for, and how it speeds things up".
>
> Nobody gets a cpu number just to get a cpu number - it's not a useful
> thing to benchmark. What does getcpu() so much that we care?
>
> We've had tons of clever features that nobody actually uses, because
> they aren't really portable enough. I'd like to be convinced that this
> is actually going to be used by real applications.

I completely agree with your request for real-life application numbers.

The most appealing application we have so far is Dave Watson's Facebook
services using jemalloc as a memory allocator. It would be nice if he
could re-run those benchmarks with my rseq implementation. The
trade-offs here are about speed and memory usage:

1) single process-wide pool:
   - speed: does not scale well to many cores,
   + efficient use of memory.
2) per-thread pools:
   + speed: scales really well to many cores,
   - inefficient use of memory.
3) per-cpu pools without rseq:
   - speed: requires atomic instructions due to migration and
     preemption,
   + efficient use of memory.
4) per-cpu pools with rseq:
   + speed: no atomic instructions required,
   + efficient use of memory.

His benchmarks should confirm that we get the best of both speed and
memory use with (4).

I plan to personally start working on integrating rseq with the
lttng-ust user-space tracer per-CPU ring buffer, but I expect to
mainly publish microbenchmarks, as most of our heavy tracing users
are proprietary applications, for which it's tricky to publish
numbers. I suspect that microbenchmarks are not what you are looking
for here.

Boqun Feng expressed interest in working on a userspace RCU flavor
that would implement per-CPU (rather than per-thread) grace period
tracking. I suspect this will be a rather large undertaking. The
benefits should show up as lower grace period overhead and better
speed in applications that have many more threads than cores.

Paul Turner from Google probably has interesting numbers too, but I
suspect he is busy on other projects at the moment.
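
For illustration only, here is a minimal sketch of approach (3) from
the trade-off list above, using a per-CPU counter as a stand-in for a
per-CPU pool. It assumes Linux with glibc's sched_getcpu() and C11
atomics; the names MAX_CPUS, percpu_counter, percpu_inc and percpu_sum
are hypothetical and are not taken from jemalloc or from the rseq patch
set. The thread can be preempted or migrated between reading the CPU
number and updating the slot, so the update has to be atomic. With
approach (4), that update would instead run inside a restartable
sequence and could be a plain increment.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for sched_getcpu() in glibc */
#endif
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical upper bound on CPU numbers, chosen for this sketch. */
#define MAX_CPUS 1024

/* One slot per CPU, aligned to 64 bytes (assumed cache-line size)
 * to limit false sharing between CPUs. */
struct percpu_counter {
	_Alignas(64) atomic_long count;
};

static struct percpu_counter counters[MAX_CPUS];

/*
 * Approach (3): per-CPU data without rseq.
 * The CPU number may be stale by the time the slot is updated, and
 * another thread may be preempted in the middle of updating the same
 * slot, so an atomic read-modify-write is required for correctness.
 */
static void percpu_inc(void)
{
	int cpu = sched_getcpu();

	if (cpu < 0)
		cpu = 0;        /* fall back to slot 0 on error */
	atomic_fetch_add_explicit(&counters[cpu % MAX_CPUS].count, 1,
				  memory_order_relaxed);
}

/* Sum all per-CPU slots to get the global value. */
static long percpu_sum(void)
{
	long total = 0;

	for (size_t i = 0; i < MAX_CPUS; i++)
		total += atomic_load_explicit(&counters[i].count,
					      memory_order_relaxed);
	return total;
}
```

Memory use grows with the number of CPUs rather than the number of
threads, which is the memory advantage shared by (3) and (4). The
atomic add on the fast path is the cost that (4) aims to remove.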

Let's see if we can get Dave Watson to provide those numbers.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com