From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752031AbcAEWyZ (ORCPT ); Tue, 5 Jan 2016 17:54:25 -0500 Received: from e38.co.us.ibm.com ([32.97.110.159]:58044 "EHLO e38.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751663AbcAEWyW (ORCPT ); Tue, 5 Jan 2016 17:54:22 -0500 X-IBM-Helo: d03dlp03.boulder.ibm.com X-IBM-MailFrom: paulmck@linux.vnet.ibm.com X-IBM-RcptTo: linux-api@vger.kernel.org;linux-kernel@vger.kernel.org Date: Tue, 5 Jan 2016 14:54:20 -0800 From: "Paul E. McKenney" To: Mathieu Desnoyers Cc: Russell King - ARM Linux , Will Deacon , Thomas Gleixner , Paul Turner , Andrew Hunter , Peter Zijlstra , linux-kernel@vger.kernel.org, linux-api , Andy Lutomirski , Andi Kleen , Dave Watson , Chris Lameter , Ingo Molnar , Ben Maurer , rostedt , Josh Triplett , Linus Torvalds , Andrew Morton , Catalin Marinas , Michael Kerrisk Subject: Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread Message-ID: <20160105225420.GF3818@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <1451977320-4886-1-git-send-email-mathieu.desnoyers@efficios.com> <1451977320-4886-2-git-send-email-mathieu.desnoyers@efficios.com> <20160105120400.GD10705@arm.com> <1079064730.338115.1452015105259.JavaMail.zimbra@efficios.com> <20160105174017.GY19062@n2100.arm.linux.org.uk> <20160105214717.GE3818@linux.vnet.ibm.com> <1777488643.338535.1452033244991.JavaMail.zimbra@efficios.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1777488643.338535.1452033244991.JavaMail.zimbra@efficios.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 16010522-0029-0000-0000-00000F5AB304 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 05, 2016 at 10:34:04PM +0000, Mathieu Desnoyers wrote: > ----- On Jan 5, 2016, at 4:47 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote: > > > On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote: > >> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote: > >> > For instance, an application could create a linked list or hash map > >> > of thread control structures, which could contain the current CPU > >> > number of each thread. A dispatch thread could then traverse or > >> > lookup this structure to see on which CPU each thread is running and > >> > do work queue dispatch or scheduling decisions accordingly. > >> > >> So, what happens if the linked list is walked from thread X, and we > >> discover that thread Y is allegedly running on CPU1. We decide that > >> we want to dispatch some work on that thread due to it being on CPU1, > >> so we send an event to thread Y. > >> > >> Thread Y becomes runnable, and the scheduler decides to schedule the > >> thread on CPU3 instead of CPU1. > >> > >> My point is that the above idea is inherently racy. The only case > >> where it isn't racy is when thread Y is bound to CPU1, and so can't > >> move - but then you'd know that thread Y is on CPU1 and there > >> wouldn't be a need for the inherent complexity suggested above. > >> > >> The behaviour I've seen on ARM from the scheduler (on a quad CPU > >> platform, observing the system activity with top reporting the last > >> CPU number used by each thread) is that threads often migrate > >> between CPUs - especially in the case of (eg) one or two threads > >> running in a quad-CPU system. > >> > >> Given that, I'm really not sure what the use of reading and making > >> decisions on the current CPU number would be within a program - > >> unless the thread is bound to a particular CPU or group of CPUs, > >> it seems that you can't rely on being on the reported CPU by the > >> time the system call returns. > > > > As I understand it, the idea is -not- to eliminate synchronization > > like we do with per-CPU variables in the kernel, but rather to > > reduce the average cost of synchronization. For example, there > > might be a separate data structure per CPU, each structure guarded > > by its own lock. A thread could sample the current running CPU, > > acquire that CPU's corresponding lock, and operate on that CPU's > > structure. This would work correctly even if there was an arbitrarily > > high number of preemptions/migrations, but would have improved > > performance (compared to a single global lock) in the common case > > where there were no preemptions/migrations. > > > > This approach can also be used in conjunction with Paul Turner's > > per-CPU atomics. > > > > Make sense, or am I missing your point? > > Russell's point is more about accessing a given thread's cpu_cache > variable from other threads/cores, which is beyond what is needed > for restartable critical sections. Fair enough! > Independently of the usefulness of reading other thread's cpu_cache > to see their current CPU, I would advocate for checking the cpu_cache > natural alignment, and return EINVAL if it is not aligned. Even for > thread-local reads, we care about ensuring there is no load tearing > when reading this variable. The behavior of the kernel updating this > variable read by a user-space thread is very similar to having a > variable updated by a signal handler nested on top of a thread. This > makes it simpler and reduces the testing state space. Makes sense to me! Thanx, Paul From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Paul E. McKenney" Subject: Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread Date: Tue, 5 Jan 2016 14:54:20 -0800 Message-ID: <20160105225420.GF3818@linux.vnet.ibm.com> References: <1451977320-4886-1-git-send-email-mathieu.desnoyers@efficios.com> <1451977320-4886-2-git-send-email-mathieu.desnoyers@efficios.com> <20160105120400.GD10705@arm.com> <1079064730.338115.1452015105259.JavaMail.zimbra@efficios.com> <20160105174017.GY19062@n2100.arm.linux.org.uk> <20160105214717.GE3818@linux.vnet.ibm.com> <1777488643.338535.1452033244991.JavaMail.zimbra@efficios.com> Reply-To: paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <1777488643.338535.1452033244991.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Mathieu Desnoyers Cc: Russell King - ARM Linux , Will Deacon , Thomas Gleixner , Paul Turner , Andrew Hunter , Peter Zijlstra , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api , Andy Lutomirski , Andi Kleen , Dave Watson , Chris Lameter , Ingo Molnar , Ben Maurer , rostedt , Josh Triplett , Linus Torvalds , Andrew Morton , Catalin Marinas , Michael Kerrisk List-Id: linux-api@vger.kernel.org On Tue, Jan 05, 2016 at 10:34:04PM +0000, Mathieu Desnoyers wrote: > ----- On Jan 5, 2016, at 4:47 PM, Paul E. McKenney paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org wrote: > > > On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote: > >> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote: > >> > For instance, an application could create a linked list or hash map > >> > of thread control structures, which could contain the current CPU > >> > number of each thread. A dispatch thread could then traverse or > >> > lookup this structure to see on which CPU each thread is running and > >> > do work queue dispatch or scheduling decisions accordingly. > >> > >> So, what happens if the linked list is walked from thread X, and we > >> discover that thread Y is allegedly running on CPU1. We decide that > >> we want to dispatch some work on that thread due to it being on CPU1, > >> so we send an event to thread Y. > >> > >> Thread Y becomes runnable, and the scheduler decides to schedule the > >> thread on CPU3 instead of CPU1. > >> > >> My point is that the above idea is inherently racy. The only case > >> where it isn't racy is when thread Y is bound to CPU1, and so can't > >> move - but then you'd know that thread Y is on CPU1 and there > >> wouldn't be a need for the inherent complexity suggested above. > >> > >> The behaviour I've seen on ARM from the scheduler (on a quad CPU > >> platform, observing the system activity with top reporting the last > >> CPU number used by each thread) is that threads often migrate > >> between CPUs - especially in the case of (eg) one or two threads > >> running in a quad-CPU system. > >> > >> Given that, I'm really not sure what the use of reading and making > >> decisions on the current CPU number would be within a program - > >> unless the thread is bound to a particular CPU or group of CPUs, > >> it seems that you can't rely on being on the reported CPU by the > >> time the system call returns. > > > > As I understand it, the idea is -not- to eliminate synchronization > > like we do with per-CPU variables in the kernel, but rather to > > reduce the average cost of synchronization. For example, there > > might be a separate data structure per CPU, each structure guarded > > by its own lock. A thread could sample the current running CPU, > > acquire that CPU's corresponding lock, and operate on that CPU's > > structure. This would work correctly even if there was an arbitrarily > > high number of preemptions/migrations, but would have improved > > performance (compared to a single global lock) in the common case > > where there were no preemptions/migrations. > > > > This approach can also be used in conjunction with Paul Turner's > > per-CPU atomics. > > > > Make sense, or am I missing your point? > > Russell's point is more about accessing a given thread's cpu_cache > variable from other threads/cores, which is beyond what is needed > for restartable critical sections. Fair enough! > Independently of the usefulness of reading other thread's cpu_cache > to see their current CPU, I would advocate for checking the cpu_cache > natural alignment, and return EINVAL if it is not aligned. Even for > thread-local reads, we care about ensuring there is no load tearing > when reading this variable. The behavior of the kernel updating this > variable read by a user-space thread is very similar to having a > variable updated by a signal handler nested on top of a thread. This > makes it simpler and reduces the testing state space. Makes sense to me! Thanx, Paul