Date: Tue, 14 Apr 2015 12:25:06 +0200
From: Ingo Molnar
To: "Paul E. McKenney"
Cc: Linus Torvalds, Mathieu Desnoyers, Peter Zijlstra, Rusty Russell,
    Oleg Nesterov, Linux Kernel Mailing List, Andi Kleen, Steven Rostedt,
    Thomas Gleixner, Lai Jiangshan, George Spelvin, Andrea Arcangeli,
    David Woodhouse, Rik van Riel, Michel Lespinasse
Subject: Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
Message-ID: <20150414102505.GA13015@gmail.com>
References: <20150413141126.756350256@infradead.org>
 <20150413141213.492831596@infradead.org>
 <20150413163201.GC6040@gmail.com>
 <1979415164.29724.1428944899771.JavaMail.zimbra@efficios.com>
 <20150413174323.GY23685@linux.vnet.ibm.com>
 <20150413184253.GZ23685@linux.vnet.ibm.com>
In-Reply-To: <20150413184253.GZ23685@linux.vnet.ibm.com>

* Paul E. McKenney wrote:

> On Mon, Apr 13, 2015 at 11:21:46AM -0700, Linus Torvalds wrote:
> > On Mon, Apr 13, 2015 at 10:43 AM, Paul E. McKenney wrote:
> > >
> > > A shorthand for READ_ONCE + smp_read_barrier_depends() is the shiny
> > > new lockless_dereference()
> >
> > Related side note - I think people should get used to seeing
> > "smp_load_acquire()". It has well-defined memory ordering properties
> > and should generally perform well on most architectures. It's (much)
> > stronger than lockless_dereference(), and together with
> > smp_store_release() you can make rather clear guarantees about passing
> > data locklessly from one CPU to another.
> >
> > I'd like to see us use more of the pattern of
> >
> >  - one thread does:
> >
> >      .. allocate/create some data
> >      smp_store_release() to "expose it"
> >
> >  - another thread does:
> >
> >      smp_load_acquire() to read index/pointer/flag/whatever
> >      .. use the data any damn way you want ..
> >
> > and we should probably aim to prefer that pattern over a lot of our
> > traditional memory barriers.
>
> I couldn't agree more!

/me too!

> RCU made a similar move from open-coding smp_read_barrier_depends()
> to using rcu_dereference() many years ago, and that change made RCU
> code -much- easier to read and understand.  I believe that moving
> from smp_mb(), smp_rmb(), and smp_wmb() to smp_store_release() and
> smp_load_acquire() will provide similar maintainability benefits.
> Furthermore, when the current code uses smp_mb(),
> smp_store_release() and smp_load_acquire() generate faster code on
> most architectures.

A similar maintainability argument can be made for locking:
spin_lock(x) was a big step forward compared to lock_kernel(),
primarily not because it improves scalability (it often does), but
because the '(x)' very clearly documents the data structure that is
being accessed and makes locking and data access bugs a lot more
visible in the review phase already.
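To make that publication pattern concrete, here is a minimal sketch of
how the two sides could pair up - 'struct foo', 'global_foo' and the
helper functions are made up for the example, only smp_store_release()
and smp_load_acquire() are the real primitives:

struct foo {
	int a;
	int b;
};

static struct foo *global_foo;

/* Producer: fully initialize the object, then publish the pointer. */
static void publish_foo(void)
{
	struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return;
	f->a = 1;
	f->b = 2;
	/* Orders the initialization above before the pointer store: */
	smp_store_release(&global_foo, f);
}

/* Consumer: acquire the pointer, then use the object freely. */
static void consume_foo(void)
{
	struct foo *f = smp_load_acquire(&global_foo);

	/* Seeing a non-NULL f guarantees we also see f->a and f->b. */
	if (f)
		pr_info("foo: %d %d\n", f->a, f->b);
}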
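The documentation value of an explicit lock argument - lock_kernel()
versus spin_lock(x) - can likewise be seen in a contrived before/after
fragment ('foo' and its fields are made up; lock_kernel() is of course
long gone from the tree):

	/* Before: which data does the BKL protect here?  No way to tell. */
	lock_kernel();
	foo->refcount++;
	unlock_kernel();

	/* After: the lock argument documents the structure being protected. */
	spin_lock(&foo->lock);
	foo->refcount++;
	spin_unlock(&foo->lock);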
I wish rcu_read_lock() had a data argument, for similar reasons - even
if it just pointed to a pre-existing lock or an RCU head it never
touches ;-)

As an example I picked a random file out of the kernel that uses RCU,
kernel/cpuset.c::validate_change():

static int validate_change(struct cpuset *cur, struct cpuset *trial)
{
	struct cgroup_subsys_state *css;
	struct cpuset *c, *par;
	int ret;

	rcu_read_lock();

	/* Each of our child cpusets must be a subset of us */
	ret = -EBUSY;
	cpuset_for_each_child(c, css, cur)
		if (!is_cpuset_subset(c, trial))
			goto out;

	/* Remaining checks don't apply to root cpuset */
	ret = 0;
	if (cur == &top_cpuset)
		goto out;

	par = parent_cs(cur);

	/* On legacy hierarchy, we must be a subset of our parent cpuset. */
	ret = -EACCES;
	if (!cgroup_on_dfl(cur->css.cgroup) && !is_cpuset_subset(trial, par))
		goto out;

	/*
	 * If either I or some sibling (!= me) is exclusive, we can't
	 * overlap
	 */
	ret = -EINVAL;
	cpuset_for_each_child(c, css, par) {
		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
		    c != cur &&
		    cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
			goto out;
		if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
		    c != cur &&
		    nodes_intersects(trial->mems_allowed, c->mems_allowed))
			goto out;
	}

	/*
	 * Cpusets with tasks - existing or newly being attached - can't
	 * be changed to have empty cpus_allowed or mems_allowed.
	 */
	ret = -ENOSPC;
	if ((cgroup_has_tasks(cur->css.cgroup) || cur->attach_in_progress)) {
		if (!cpumask_empty(cur->cpus_allowed) &&
		    cpumask_empty(trial->cpus_allowed))
			goto out;
		if (!nodes_empty(cur->mems_allowed) &&
		    nodes_empty(trial->mems_allowed))
			goto out;
	}

	/*
	 * We can't shrink if we won't have enough room for SCHED_DEADLINE
	 * tasks.
	 */
	ret = -EBUSY;
	if (is_cpu_exclusive(cur) &&
	    !cpuset_cpumask_can_shrink(cur->cpus_allowed, trial->cpus_allowed))
		goto out;

	ret = 0;
out:
	rcu_read_unlock();
	return ret;
}

So, just from taking a glance at that function, can you tell me what is
being RCU protected here?  I cannot - I can only guess that it must
either be cpuset_for_each_child(), or maybe the cpumasks or other
internals.

And if I search the file for call_rcu() it shows me nothing.  Only if I
know that cpusets are integrated with cgroups, and I search
kernel/cgroup.c for call_rcu(), do I find:

	call_rcu(&css->rcu_head, css_free_rcu_fn);

aha!

... or if I drill down 3 levels into cpuset_for_each_child() ->
css_for_each_child() -> css_next_child() do I see the RCU iteration.

It would have been a lot clearer from the outset if I had a syntactic
hint:

	rcu_read_lock(&css->rcu_head);
	...
	rcu_read_unlock(&css->rcu_head);

Beyond the reviewer bonus, I bet this would allow some extra debugging
as well (only enabled in debug kernels):

 - for example to make sure we only access a field if _that field_ is
   RCU locked (reducing the chance of having the right locking for the
   wrong reason)

 - we could possibly also build lockdep dependencies out of such
   annotated RCU locking patterns

 - RCU-aware list walking primitives could auto-check that this
   particular list is properly RCU locked

This could be introduced gradually by using a different API name (a
rough sketch of such a wrapper is appended below):

	rcu_lock(&css->rcu_head);
	...
	rcu_unlock(&css->rcu_head);

(the 'read' is implied in RCU locking anyway.)

... and if you think this approach has any merit, I volunteer the perf
and sched subsystems as guinea pigs! :-)

What do you think?

Thanks,

	Ingo
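For reference, a rough sketch of what such a debug-enabled rcu_lock() /
rcu_unlock() wrapper could look like - purely hypothetical code, not an
existing kernel interface.  The CONFIG_DEBUG_RCU_LOCK_OBJECT symbol and
the current->rcu_locked_obj field are invented for the illustration,
and a real implementation would also have to handle nested read-side
critical sections:

#ifdef CONFIG_DEBUG_RCU_LOCK_OBJECT	/* hypothetical config option */

/*
 * Remember which object this read-side critical section is protecting.
 * (current->rcu_locked_obj is an invented task_struct field; a real
 * implementation would need a small stack of them to allow nesting.)
 */
#define rcu_lock(rhp)						\
do {								\
	rcu_read_lock();					\
	current->rcu_locked_obj = (rhp);			\
} while (0)

#define rcu_unlock(rhp)						\
do {								\
	WARN_ON_ONCE(current->rcu_locked_obj != (rhp));		\
	current->rcu_locked_obj = NULL;				\
	rcu_read_unlock();					\
} while (0)

/* Accessors and RCU-aware list walkers could then assert: */
#define rcu_assert_locked(rhp)					\
	WARN_ON_ONCE(current->rcu_locked_obj != (rhp))

#else /* !CONFIG_DEBUG_RCU_LOCK_OBJECT */

#define rcu_lock(rhp)		do { (void)(rhp); rcu_read_lock(); } while (0)
#define rcu_unlock(rhp)		do { (void)(rhp); rcu_read_unlock(); } while (0)
#define rcu_assert_locked(rhp)	do { (void)(rhp); } while (0)

#endif

Usage would then look like the rcu_lock(&css->rcu_head) /
rcu_unlock(&css->rcu_head) pair shown in the mail above, with the
debug checks compiled out entirely on production kernels.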