* [PATCH] sched/numa: Stagger NUMA balancing scan periods for new threads
From: Mel Gorman @ 2018-04-26 10:48 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, LKML, Mel Gorman

Threads share an address space and each can change the protections of the
same address space to trap NUMA faults. This is redundant and potentially
counter-productive as any one thread doing the update will suffice. In
principle only one thread is required, but that thread may be idle, or it
may have no locality concerns and pick an unsuitable scan rate.

This patch keeps the scan periods independent per thread but staggers them
based on the number of address space users when the thread is created. The
intent is that the threads will avoid scanning at the same time and will
have a chance to adapt their scan rates later if necessary. This reduces the
total scan activity early in the lifetime of the threads.
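
For illustration only (not part of the patch), the following standalone
userspace sketch mimics the stagger arithmetic with assumed stand-ins for
the defaults -- a 1000ms scan delay, a 60000ms clamp in place of
task_scan_max() and HZ=1000 -- to show how the initial node_stamp offset
grows with the number of existing address space users:

/*
 * Illustrative userspace sketch, not kernel code: the constants below are
 * assumed stand-ins for sysctl_numa_balancing_scan_delay, task_scan_max()
 * and TICK_NSEC.
 */
#include <stdio.h>

#define NSEC_PER_MSEC	1000000ULL
#define TICK_NSEC	1000000ULL			/* assumes HZ=1000 */

int main(void)
{
	unsigned long long scan_delay_ms = 1000;	/* assumed scan delay default */
	unsigned long long scan_max_ns = 60000ULL * NSEC_PER_MSEC;

	/* Each additional address space user pushes the first scan further out */
	for (int mm_users = 2; mm_users <= 8; mm_users++) {
		unsigned long long delay_ns = scan_delay_ms * mm_users * NSEC_PER_MSEC;

		if (delay_ns > scan_max_ns)
			delay_ns = scan_max_ns;
		delay_ns += 2 * TICK_NSEC;
		printf("mm_users=%d: initial node_stamp offset ~= %llu ms\n",
		       mm_users, delay_ns / NSEC_PER_MSEC);
	}

	return 0;
}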

The difference in headline performance across a range of machines and
workloads is marginal but the system CPU usage is reduced due to less scan
activity.  The following is the time reported by NAS Parallel Benchmark
using unbound openmp threads and a D size class.

                      4.17.0-rc1             4.17.0-rc1
                         vanilla           stagger-v1r1
Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)

Note that it's not a universal win, but we have no prior knowledge of which
thread matters, and the number of threads created often exceeds the size of
the node when the threads are not bound. On balance, the set of workloads
completes faster and there is a reduction in overall system CPU usage.

                            4.17.0-rc1             4.17.0-rc1
                               vanilla           stagger-v1r1
sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)

NUMA scan activity is reduced as well as other balancing activity.

                               vanilla stagger-v1r1
NUMA alloc local               1042828     1342670
NUMA base PTE updates        140481138    93577468
NUMA huge PMD updates           272171      180766
NUMA page range updates      279832690   186129660
NUMA hint faults               1395972     1193897
NUMA hint local faults          877925      855053
NUMA hint local percent             62          71
NUMA pages migrated           12057909     9158023

Similar observations are made for other thread-intensive workloads. System
CPU usage is lower even though the headline gains in performance tend to be
small. For example, specjbb 2005 shows almost no difference in performance
but scan activity is reduced by a third on a 4-socket box. I didn't find
a workload (thread intensive or otherwise) that suffered badly.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..9f47d6c3e386 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2174,27 +2174,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
 
-#ifdef CONFIG_NUMA_BALANCING
-	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
-		p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-		p->mm->numa_scan_seq = 0;
-	}
-
-	if (clone_flags & CLONE_VM)
-		p->numa_preferred_nid = current->numa_preferred_nid;
-	else
-		p->numa_preferred_nid = -1;
-
-	p->node_stamp = 0ULL;
-	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
-	p->numa_work.next = &p->numa_work;
-	p->numa_faults = NULL;
-	p->last_task_numa_placement = 0;
-	p->last_sum_exec_runtime = 0;
-
-	p->numa_group = NULL;
-#endif /* CONFIG_NUMA_BALANCING */
+	init_numa_balancing(clone_flags, p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..7c5d510aec6b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1139,6 +1139,44 @@ static unsigned int task_scan_max(struct task_struct *p)
 	return max(smin, smax);
 }
 
+void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
+{
+	int mm_users = 0;
+
+	if (p->mm) {
+		mm_users = atomic_read(&p->mm->mm_users);
+		if (mm_users == 1) {
+			p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+			p->mm->numa_scan_seq = 0;
+		}
+	}
+	p->node_stamp = 0ULL;
+	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
+	p->last_task_numa_placement = 0;
+	p->last_sum_exec_runtime = 0;
+	p->numa_group = NULL;
+
+	/* New address space */
+	if (!(clone_flags & CLONE_VM)) {
+		p->numa_preferred_nid = -1;
+		return;
+	}
+
+	/* New thread, use existing preferred nid but stagger scans */
+	if (p->mm) {
+		unsigned int delay;
+
+		delay = min_t(unsigned int, task_scan_max(current),
+			current->numa_scan_period * mm_users * NSEC_PER_MSEC);
+		delay += 2 * TICK_NSEC;
+		p->numa_preferred_nid = current->numa_preferred_nid;
+		p->node_stamp = delay;
+	}
+}
+
 static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
 {
 	rq->nr_numa_running += (p->numa_preferred_nid != -1);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..c9895d35c5f7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1069,6 +1069,12 @@ enum numa_faults_stats {
 extern void sched_setnuma(struct task_struct *p, int node);
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
+extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p);
+#else
+static inline void
+init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_SMP


* Re: [PATCH] sched/numa: Stagger NUMA balancing scan periods for new threads
From: Ingo Molnar @ 2018-04-27  6:50 UTC
  To: Mel Gorman; +Cc: Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML


* Mel Gorman <mgorman@techsingularity.net> wrote:

> The difference in headline performance across a range of machines and
> workloads is marginal but the system CPU usage is reduced due to less scan
> activity.  The following is the time reported by NAS Parallel Benchmark
> using unbound openmp threads and a D size class.
> 
>                       4.17.0-rc1             4.17.0-rc1
>                          vanilla           stagger-v1r1
> Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
> Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
> Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
> Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
> Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
> Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
> Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
> Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)
> 
> Note that it's not a universal win, but we have no prior knowledge of which
> thread matters, and the number of threads created often exceeds the size of
> the node when the threads are not bound. On balance, the set of workloads
> completes faster and there is a reduction in overall system CPU usage.
> 
>                             4.17.0-rc1             4.17.0-rc1
>                                vanilla           stagger-v1r1
> sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
> sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
> sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
> sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
> sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
> sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
> sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
> sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
> 
> NUMA scan activity is reduced as well as other balancing activity.
> 
> NUMA alloc local               1042828     1342670
> NUMA base PTE updates        140481138    93577468
> NUMA huge PMD updates           272171      180766
> NUMA page range updates      279832690   186129660
> NUMA hint faults               1395972     1193897
> NUMA hint local faults          877925      855053
> NUMA hint local percent             62          71
> NUMA pages migrated           12057909     9158023

Looks like a nice reduction in scanning overhead - which was always the main worry 
with the fault based active NUMA balancing technique.

I have a couple of minor code cleanliness nits:

> +void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> +{
> +	int mm_users = 0;
> +
> +	if (p->mm) {
> +		mm_users = atomic_read(&p->mm->mm_users);
> +		if (mm_users == 1) {
> +			p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> +			p->mm->numa_scan_seq = 0;
> +		}
> +	}
> +	p->node_stamp = 0ULL;
> +	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> +	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> +	p->numa_work.next = &p->numa_work;
> +	p->numa_faults = NULL;
> +	p->last_task_numa_placement = 0;
> +	p->last_sum_exec_runtime = 0;
> +	p->numa_group = NULL;

While this is pre-existing code that you moved, could we please use a bit more 
organization to make this more readable:

	p->node_stamp			= 0ULL;
	p->numa_scan_seq		= p->mm ? p->mm->numa_scan_seq : 0;
	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
	p->numa_work.next		= &p->numa_work;
	p->numa_faults			= NULL;
	p->last_task_numa_placement	= 0;
	p->last_sum_exec_runtime	= 0;
	p->numa_group			= NULL;

?

This form made me notice a detail: the 0ULL asymmetry looks weird, this integer 
literal type specification is entirely superfluous here, we can just write '0'.

> +	/* New address space */
> +	if (!(clone_flags & CLONE_VM)) {
> +		p->numa_preferred_nid = -1;
> +		return;
> +	}
> +
> +	/* New thread, use existing preferred nid but stagger scans */
> +	if (p->mm) {
> +		unsigned int delay;
> +
> +		delay = min_t(unsigned int, task_scan_max(current),
> +			current->numa_scan_period * mm_users * NSEC_PER_MSEC);
> +		delay += 2 * TICK_NSEC;
> +		p->numa_preferred_nid = current->numa_preferred_nid;
> +		p->node_stamp = delay;
> +	}

So this is a fork time function, shouldn't p->numa_preferred_nid be equal to 
current->numa_preferred_nid already?

This is what happens in the !p->mm && CLONE_VM case anyway, right?

So we could leave out the superfluous assignment and make it clear in a comment 
that we inherit the parent's ->numa_preferred_nid intentionally.

Also, there's a lot of p->mm use, could we add this helper local variable to 
simplify the code some more:

	struct mm_struct *mm = p->mm;

?
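
I.e. the top of the function could then read something like this (completely
untested, just to illustrate the idea -- the names are the ones from the
patch above):

	struct mm_struct *mm = p->mm;
	int mm_users = 0;

	if (mm) {
		mm_users = atomic_read(&mm->mm_users);
		if (mm_users == 1) {
			mm->numa_next_scan = jiffies +
				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
			mm->numa_scan_seq = 0;
		}
	}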

Thanks,

	Ingo


* Re: [PATCH] sched/numa: Stagger NUMA balancing scan periods for new threads
From: Mel Gorman @ 2018-04-27  7:30 UTC
  To: Ingo Molnar; +Cc: Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML

On Fri, Apr 27, 2018 at 08:50:57AM +0200, Ingo Molnar wrote:
> >                             4.17.0-rc1             4.17.0-rc1
> >                                vanilla           stagger-v1r1
> > sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
> > sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
> > sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
> > sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
> > sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
> > sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
> > sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
> > sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
> > 
> > NUMA scan activity is reduced as well as other balancing activity.
> > 
> > NUMA alloc local               1042828     1342670
> > NUMA base PTE updates        140481138    93577468
> > NUMA huge PMD updates           272171      180766
> > NUMA page range updates      279832690   186129660
> > NUMA hint faults               1395972     1193897
> > NUMA hint local faults          877925      855053
> > NUMA hint local percent             62          71
> > NUMA pages migrated           12057909     9158023
> 
> Looks like a nice reduction in scanning overhead - which was always the main worry 
> with the fault based active NUMA balancing technique.
> 
> I have a couple of minor code cleanliness nits:
> 

No problem.

> > +void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> > +{
> > +	int mm_users = 0;
> > +
> > +	if (p->mm) {
> > +		mm_users = atomic_read(&p->mm->mm_users);
> > +		if (mm_users == 1) {
> > +			p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> > +			p->mm->numa_scan_seq = 0;
> > +		}
> > +	}
> > +	p->node_stamp = 0ULL;
> > +	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> > +	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > +	p->numa_work.next = &p->numa_work;
> > +	p->numa_faults = NULL;
> > +	p->last_task_numa_placement = 0;
> > +	p->last_sum_exec_runtime = 0;
> > +	p->numa_group = NULL;
> 
> While this is pre-existing code that you moved, could we please use a bit more 
> organization to make this more readable:
> 
> 	p->node_stamp			= 0ULL;
> 	p->numa_scan_seq		= p->mm ? p->mm->numa_scan_seq : 0;
> 	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
> 	p->numa_work.next		= &p->numa_work;
> 	p->numa_faults			= NULL;
> 	p->last_task_numa_placement	= 0;
> 	p->last_sum_exec_runtime	= 0;
> 	p->numa_group			= NULL;
> 
> ?
> 
> This form made me notice a detail: the 0ULL asymmetry looks weird, this integer 
> literal type specification is entirely superfluous here, we can just write '0'.
> 

Can do. I'll revise it and should send a new version on Monday. You're
right that the type suffix is superfluous; this was simply a code movement
so I kept the existing structure.

> > +	/* New address space */
> > +	if (!(clone_flags & CLONE_VM)) {
> > +		p->numa_preferred_nid = -1;
> > +		return;
> > +	}
> > +
> > +	/* New thread, use existing preferred nid but stagger scans */
> > +	if (p->mm) {
> > +		unsigned int delay;
> > +
> > +		delay = min_t(unsigned int, task_scan_max(current),
> > +			current->numa_scan_period * mm_users * NSEC_PER_MSEC);
> > +		delay += 2 * TICK_NSEC;
> > +		p->numa_preferred_nid = current->numa_preferred_nid;
> > +		p->node_stamp = delay;
> > +	}
> 
> So this is a fork time function, shouldn't p->numa_preferred_nid be equal to 
> current->numa_preferred_nid already?
> 

It should but I see no harm in being explicit.

> This is what happens in the !p->mm && CLONE_VM case anyway, right?
> 

A task without p->mm should never have changed numa_preferred_nid, and the
information is useless for it anyway. Kernel threads do not have a user
address space to sample and migrate.

> So we could leave out the superfluous assignment and make it clear in a comment 
> that we inherit the parent's ->numa_preferred_nid intentionally.
> 

I can do that.

> Also, there's a lot of p->mm use, could we add this helper local variable to 
> simplify the code some more:
> 
> 	struct mm_struct *mm = p->mm;
> 
> ?

That should be harmless.
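
So the tail of init_numa_balancing() would become something along these
lines (untested sketch only, not necessarily what the final version will
look like):

	/* New address space, reset the preferred nid */
	if (!(clone_flags & CLONE_VM)) {
		p->numa_preferred_nid = -1;
		return;
	}

	/*
	 * New thread: keep the numa_preferred_nid already inherited from
	 * the parent at fork, but stagger when the scans start.
	 */
	if (mm) {
		unsigned int delay;

		delay = min_t(unsigned int, task_scan_max(current),
			current->numa_scan_period * mm_users * NSEC_PER_MSEC);
		delay += 2 * TICK_NSEC;
		p->node_stamp = delay;
	}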

Thanks!

-- 
Mel Gorman
SUSE Labs
