From: Peter Zijlstra <peterz@infradead.org>
To: paulmck@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org, mingo@elte.hu,
	laijs@cn.fujitsu.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@polymtl.ca,
	josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de,
	rostedt@goodmis.org, Valdis.Kletnieks@vt.edu,
	dhowells@redhat.com, eric.dumazet@gmail.com, darren@dvhart.com,
	fweisbec@gmail.com, patches@linaro.org
Subject: Re: [PATCH RFC tip/core/rcu 6/6] rcu: Reduce cache-miss initialization latencies for large systems
Date: Fri, 27 Apr 2012 00:01:25 +0200
Message-ID: <1335477685.2463.128.camel@laptop>
In-Reply-To: <20120426202828.GK2407@linux.vnet.ibm.com>

On Thu, 2012-04-26 at 13:28 -0700, Paul E. McKenney wrote:
> Interesting!  A few probably clueless questions and comments below.

Thanks, shows me where to put comments in :-)

> > +static void sched_init_numa(void)
> > +{
> > +     int next_distance, curr_distance = node_distance(0, 0);
> > +     struct sched_domain_topology_level *tl;
> > +     int level = 0;
> > +     int i, j, k;
> > +
> > +     sched_domains_numa_scale = curr_distance;
> > +     sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL);
> > +     if (!sched_domains_numa_distance)
> > +             return;
> > +
> > +     next_distance = curr_distance;
> > +     for (i = 0; i < nr_node_ids; i++) {
> > +             for (j = 0; j < nr_node_ids; j++) {
> > +                     int distance = node_distance(0, j);
> 
> Shouldn't this be "node_distance(i, j)"?
> 
> Actually, I don't see any use of "i" in this loop anywhere, which seems
> strange.

It assumes that every distance occurring in node_distance(i,j) also
occurs in node_distance(0,j), i.e. that row 0 of the distance table
contains all distinct distance values. This is to avoid a cubic loop.
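
Something like the following (a rough, untested sketch; the
sched_numa_check_row0() name is made up) could warn at boot when that
assumption doesn't hold:

	/*
	 * Hypothetical sanity check, not part of the patch: warn if some
	 * node_distance(i,j) value never shows up in row 0 of the table.
	 */
	static void sched_numa_check_row0(void)
	{
		int i, j, k;

		for (i = 0; i < nr_node_ids; i++) {
			for (j = 0; j < nr_node_ids; j++) {
				int d = node_distance(i, j);
				bool found = false;

				for (k = 0; k < nr_node_ids; k++)
					if (node_distance(0, k) == d)
						found = true;

				if (!found)
					pr_warn("sched: NUMA distance %d (%d->%d) not in row 0\n",
						d, i, j);
			}
		}
	}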
             
> > +                     if (distance > curr_distance &&
> > +                                     (distance < next_distance ||
> > +                                      next_distance == curr_distance))
> > +                             next_distance = distance;
> > +             }
> > +             if (next_distance != curr_distance) {
> > +                     sched_domains_numa_distance[level++] = next_distance;
> > +                     sched_domains_numa_levels = level;
> 
> I don't understand the intent here.  One approach would be to have
> one sched_domains_numa_distance[] array per node, with distances
> in each array sorted by increasing distance from the array's node.
> 
> As it is, I believe you get N copies of the sorted distances from
> node 0 in an array that is only guaranteed to be big enough for 1 copy.
> Unless you know something about the node_distance() matrix that limits
> the number of distinct values.

Note this is in the 'i' loop, not in the nested i,j loop. The nested
loop finds the next smallest distance, the outer loop records it.

> > +                     curr_distance = next_distance;
> > +             } else break;
> > +     }
> 
> The above loop seems to be doing a selection sort eliminating duplicates.

Correct, and it also leaves out the identity distance; see below.

> Would it make sense to put a real sort implementation into lib/?

There's a heapsort in lib/sort.c. And I guess we could write the whole
lot in 3 stages: 1) dump node_distance(0,i) into an array, 2) sort the
array, 3) remove duplicates from it. But I was lazy and did the O(n^2)
thing.

So 'level' ends up being the number of unique distances larger than the
identity distance node_distance(i,i) -- specifically node_distance(0,0)
-- and sched_domains_numa_distance[0..level-1] is the array recording
those distance values.
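
Rough (untested) sketch of that 3-stage variant, for comparison; the
build_numa_distances() and cmp_int() names are made up, sort() is the
heapsort from lib/sort.c (<linux/sort.h>):

	static int cmp_int(const void *a, const void *b)
	{
		return *(const int *)a - *(const int *)b;
	}

	/* Fills dist[] (nr_node_ids entries) and returns 'level'. */
	static int build_numa_distances(int *dist)
	{
		int i, n = 0;

		/* 1) dump row 0 of the distance table */
		for (i = 0; i < nr_node_ids; i++)
			dist[i] = node_distance(0, i);

		/* 2) sort it */
		sort(dist, nr_node_ids, sizeof(int), cmp_int, NULL);

		/* 3) drop duplicates and the identity distance */
		for (i = 0; i < nr_node_ids; i++) {
			if (dist[i] == node_distance(0, 0))
				continue;
			if (n && dist[n - 1] == dist[i])
				continue;
			dist[n++] = dist[i];
		}

		return n;
	}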

> > +
> > +     sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
> > +     if (!sched_domains_numa_masks)
> > +             return;
> > +
> > +     for (i = 0; i < level; i++) {
> > +             sched_domains_numa_masks[i] = alloc_percpu(cpumask_t);
> > +             if (!sched_domains_numa_masks[i])
> > +                     return;
> > +
> > +             for_each_possible_cpu(j) {
> > +                     struct cpumask *mask =
> > +                             per_cpu_ptr(sched_domains_numa_masks[i], j);
> > +
> > +                     for (k = 0; k < nr_node_ids; k++) {
> > +                             if (node_distance(cpu_to_node(j), k) >
> > +                                             sched_domains_numa_distance[i])
> > +                                     continue;
> > +
> > +                             cpumask_or(mask, mask, cpumask_of_node(k));
> > +                     }
> > +             }
> > +     }
> 
> OK, if we can assume that the node distances are symmetric, so that
> node_distance(i, j) == node_distance(j, i), 

Yeah, I do assume that; I'm not quite sure it's true, but we'll see.
I'll have to ask someone at SGI/HP/etc. for node_distance() tables for
'interesting' hardware.
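
If it turns out not to be symmetric, something like this trivial
(untested) check should catch it:

	for (i = 0; i < nr_node_ids; i++)
		for (j = 0; j < nr_node_ids; j++)
			WARN_ONCE(node_distance(i, j) != node_distance(j, i),
				  "asymmetric node_distance(%d,%d)\n", i, j);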

> I think I see what you
> are getting at, though I am having a hard time seeing how to pack
> it into a linear array.

Yeah, I'm not sure you can either. Hence me building a tree ;-) But you
have a tree too, it's tree-rcu after all.

> The idea seems to be to compute a per-CPU list of CPU masks, with the first
> entry having bits set for the CPUs closest to the CPU corresponding to
> the list, and subsequent entries adding more-distant CPUs.  The last
> CPU mask would presumably have bits set for all CPUs.

Indeed. So the scheduler already knows about nodes (they're included in
the default_topology thing); here we're constructing masks spanning
nodes based on distance.

So the first level is all nodes that are directly connected, the second
level is all nodes that have one intermediate hop, etc., with the last
level indeed being the entire machine.
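
As a made-up example, take a 4-node ring whose SLIT table reads:

	     0   1   2   3
	 0  10  20  30  20
	 1  20  10  20  30
	 2  30  20  10  20
	 3  20  30  20  10

Then level == 2 and sched_domains_numa_distance == { 20, 30 }; the
level-0 mask of a CPU on node 0 spans nodes {0,1,3}, and the level-1
mask spans the whole machine.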

> I take it that there is no data structure listing per-node CPU masks,
> indicating which CPUs are members of a given node?  Or is something else
> going on here?

There is, it's cpumask_of_node(); you'll find it used in the above
code :-) We do the for_each_possible_cpu() loop because we need the mask
per node, but there's no such thing as per-node memory, so we fudge it
using per-cpu memory.

This could be optimized to reduce overhead if this all turns out to work
well.

So in short: for every 'i < level', for every cpu, we build a mask of
which cpus are '<= i' hops away from our current node.

> > +
> > +     tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
> > +                     sizeof(struct sched_domain_topology_level), GFP_KERNEL);
> > +     if (!tl)
> > +             return;
> > +
> > +     for (i = 0; default_topology[i].init; i++)
> > +             tl[i] = default_topology[i];
> > +
> > +     for (j = 0; j < level; i++, j++) {
> > +             tl[i] = (struct sched_domain_topology_level){
> 
> tl[j]?

No, [i]. See how we allocate an array of ARRAY_SIZE(default_topology) +
level entries, copy the default topology array into it, and then let i
continue past it for the j additional NUMA levels.
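
For illustration (a made-up config with SMT and MC enabled and
level == 2), the resulting array would look like:

	tl[0..2] = the SMT, MC and CPU levels copied from default_topology
	tl[3]    = NUMA, .numa_level = 0
	tl[4]    = NUMA, .numa_level = 1
	tl[5]    = all zeroes, the terminating entry (tl is kzalloc'd and
	           the copy loop stops at default_topology's NULL entry)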

> > +                     .init = sd_init_NUMA,
> > +                     .mask = sd_numa_mask,
> > +                     .flags = SDTL_OVERLAP,
> > +                     .numa_level = j,
> > +             };
> > +     }
> > +
> > +     sched_domain_topology = tl;
> > +} 

