* [PATCH 0/2] Faster migration for automatic NUMA balancing
@ 2018-10-01 10:05 Mel Gorman
  2018-10-01 10:05 ` [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa balancing migration Mel Gorman
  2018-10-01 10:05 ` [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task Mel Gorman
  0 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2018-10-01 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Srikar Dronamraju, Jirka Hladky, Rik van Riel, LKML,
	Linux-MM, Mel Gorman

These two patches are based on top of Srikar Dronamraju's recent work
on automatic NUMA balancing and are motivated by a bug report from Jirka
Hladky that STREAM performance has regressed.

The STREAM workload is mildly interesting in that it only works as a valid
benchmark if tasks are pinned to memory channels. Otherwise it is very
sensitive to the starting conditions of the benchmark. Recent scheduler
changes prevent prematurely spreading a workload across multiple sockets,
which benefits many workloads but not STREAM. This series restores STREAM
performance without reintroducing other regressions.

The first patch removes migration rate limiting as it's expected that
automatic NUMA balancing decisions are mature enough that we do not
need the safety net. The second patch migrates pages faster early in the
lifetime of the process, which has an impact if the load balancer spreads
a workload to remote nodes.

 include/linux/mmzone.h         |  6 ----
 include/trace/events/migrate.h | 27 ------------------
 kernel/sched/fair.c            | 12 +++++++-
 mm/migrate.c                   | 65 ------------------------------------------
 mm/page_alloc.c                |  2 --
 5 files changed, 11 insertions(+), 101 deletions(-)

-- 
2.16.4


* [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa balancing migration
  2018-10-01 10:05 [PATCH 0/2] Faster migration for automatic NUMA balancing Mel Gorman
@ 2018-10-01 10:05 ` Mel Gorman
  2018-10-01 15:39   ` Rik van Riel
                     ` (2 more replies)
  2018-10-01 10:05 ` [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task Mel Gorman
  1 sibling, 3 replies; 16+ messages in thread
From: Mel Gorman @ 2018-10-01 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Srikar Dronamraju, Jirka Hladky, Rik van Riel, LKML,
	Linux-MM, Mel Gorman

Rate limiting of page migrations due to automatic NUMA balancing was
introduced to mitigate the worst-case scenario of migrating at high
frequency due to false sharing or slowly ping-ponging between nodes.
Since then, a lot of effort was spent on correctly identifying these
pages and avoiding unnecessary migrations and the safety net may no longer
be required.

Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
avoids spreading STREAM tasks wide prematurely. However, once the task
was properly placed, it delayed migrating the memory due to rate limiting.
Increasing the limit fixed the problem for him.

Currently, the limit is hard-coded and does not account for the real
capabilities of the hardware. Even if an estimate was attempted, it would
not properly account for the number of memory controllers and it could
not account for the amount of bandwidth used for normal accesses. Rather
than fudging, this patch simply eliminates the rate limiting.
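
For reference, the removed defaults (visible in the mm/migrate.c hunk below)
work out as follows; this is just the arithmetic behind the old hard-coded
limit rather than any new behaviour:

	/* Removed defaults from mm/migrate.c */
	migrate_interval_millisecs = 100;		/* length of one window */
	ratelimit_pages = 128 << (20 - PAGE_SHIFT);	/* pages equivalent to 128MB */
	/* i.e. at most 128MB migrated per 100ms window per node, about 1280MB/s */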

However, Jirka reports that a STREAM configuration using multiple
processes achieved similar performance to 4.16. In local tests, this patch
improved performance of STREAM relative to the baseline but it is somewhat
machine-dependent. Most workloads show little or not performance difference
implying that there is not a heavily reliance on the throttling mechanism
and it is safe to remove.

STREAM on 2-socket machine
                         4.19.0-rc5             4.19.0-rc5
                         numab-v1r1       noratelimit-v1r1
MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h         |  6 ----
 include/trace/events/migrate.h | 27 ------------------
 mm/migrate.c                   | 65 ------------------------------------------
 mm/page_alloc.c                |  2 --
 4 files changed, 100 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1e22d96734e0..3f4c0b167333 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -671,12 +671,6 @@ typedef struct pglist_data {
 #ifdef CONFIG_NUMA_BALANCING
 	/* Lock serializing the migrate rate limiting window */
 	spinlock_t numabalancing_migrate_lock;
-
-	/* Rate limiting time interval */
-	unsigned long numabalancing_migrate_next_window;
-
-	/* Number of pages migrated during the rate limiting time interval */
-	unsigned long numabalancing_migrate_nr_pages;
 #endif
 	/*
 	 * This is a per-node reserve of pages that are not available
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 711372845945..705b33d1e395 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -70,33 +70,6 @@ TRACE_EVENT(mm_migrate_pages,
 		__print_symbolic(__entry->mode, MIGRATE_MODE),
 		__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
-
-TRACE_EVENT(mm_numa_migrate_ratelimit,
-
-	TP_PROTO(struct task_struct *p, int dst_nid, unsigned long nr_pages),
-
-	TP_ARGS(p, dst_nid, nr_pages),
-
-	TP_STRUCT__entry(
-		__array(	char,		comm,	TASK_COMM_LEN)
-		__field(	pid_t,		pid)
-		__field(	int,		dst_nid)
-		__field(	unsigned long,	nr_pages)
-	),
-
-	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
-		__entry->pid		= p->pid;
-		__entry->dst_nid	= dst_nid;
-		__entry->nr_pages	= nr_pages;
-	),
-
-	TP_printk("comm=%s pid=%d dst_nid=%d nr_pages=%lu",
-		__entry->comm,
-		__entry->pid,
-		__entry->dst_nid,
-		__entry->nr_pages)
-);
 #endif /* _TRACE_MIGRATE_H */
 
 /* This part must be outside protection */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4f1d894835b5..5e285c1249a0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1855,54 +1855,6 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 	return newpage;
 }
 
-/*
- * page migration rate limiting control.
- * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
- * window of time. Default here says do not migrate more than 1280M per second.
- */
-static unsigned int migrate_interval_millisecs __read_mostly = 100;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
-
-/* Returns true if the node is migrate rate-limited after the update */
-static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
-					unsigned long nr_pages)
-{
-	unsigned long next_window, interval;
-
-	next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
-	interval = msecs_to_jiffies(migrate_interval_millisecs);
-
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (time_after(jiffies, next_window) &&
-			spin_trylock(&pgdat->numabalancing_migrate_lock)) {
-		pgdat->numabalancing_migrate_nr_pages = 0;
-		do {
-			next_window += interval;
-		} while (unlikely(time_after(jiffies, next_window)));
-
-		WRITE_ONCE(pgdat->numabalancing_migrate_next_window, next_window);
-		spin_unlock(&pgdat->numabalancing_migrate_lock);
-	}
-	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
-		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
-								nr_pages);
-		return true;
-	}
-
-	/*
-	 * This is an unlocked non-atomic update so errors are possible.
-	 * The consequences are failing to migrate when we potentiall should
-	 * have which is not severe enough to warrant locking. If it is ever
-	 * a problem, it can be converted to a per-cpu counter.
-	 */
-	pgdat->numabalancing_migrate_nr_pages += nr_pages;
-	return false;
-}
-
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
@@ -1975,14 +1927,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (page_is_file_cache(page) && PageDirty(page))
 		goto out;
 
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (numamigrate_update_ratelimit(pgdat, 1))
-		goto out;
-
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated)
 		goto out;
@@ -2029,14 +1973,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
 
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
-		goto out_dropref;
-
 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
 		HPAGE_PMD_ORDER);
@@ -2133,7 +2069,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..706a738c0aee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6197,8 +6197,6 @@ static unsigned long __init calc_memmap_size(unsigned long spanned_pages,
 static void pgdat_init_numabalancing(struct pglist_data *pgdat)
 {
 	spin_lock_init(&pgdat->numabalancing_migrate_lock);
-	pgdat->numabalancing_migrate_nr_pages = 0;
-	pgdat->numabalancing_migrate_next_window = jiffies;
 }
 #else
 static void pgdat_init_numabalancing(struct pglist_data *pgdat) {}
-- 
2.16.4


* [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-01 10:05 [PATCH 0/2] Faster migration for automatic NUMA balancing Mel Gorman
  2018-10-01 10:05 ` [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa balancing migration Mel Gorman
@ 2018-10-01 10:05 ` Mel Gorman
  2018-10-01 15:41   ` Rik van Riel
                     ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Mel Gorman @ 2018-10-01 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Srikar Dronamraju, Jirka Hladky, Rik van Riel, LKML,
	Linux-MM, Mel Gorman

Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
should migrate to a local node. This filter avoids excessive ping-ponging
if a page is shared or used by threads that migrate cross-node frequently.

Threads inherit both page tables and the preferred node ID from the
parent. This means that threads can trigger hinting faults earlier than
a new task, which delays scanning for a number of seconds. As a thread can be
load balanced very early in its lifetime, there can be an unnecessary delay
before it starts migrating thread-local data. This patch migrates private
pages faster early in the lifetime of a thread using the sequence counter
as an identifier of new tasks.

With this patch applied, STREAM performance is the same as 4.17 even though
processes are not spread cross-node prematurely. Other workloads showed
a mix of minor gains and losses. This is somewhat expected as most workloads
are not very sensitive to the starting conditions of a process.

                         4.19.0-rc5             4.19.0-rc5                 4.17.0
                         numab-v1r1       fastmigrate-v1r1                vanilla
MB/sec copy     43298.52 (   0.00%)    47335.46 (   9.32%)    47219.24 (   9.06%)
MB/sec scale    30115.06 (   0.00%)    32568.12 (   8.15%)    32527.56 (   8.01%)
MB/sec add      32825.12 (   0.00%)    36078.94 (   9.91%)    35928.02 (   9.45%)
MB/sec triad    32549.52 (   0.00%)    35935.94 (  10.40%)    35969.88 (  10.51%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25c7c7e09cbd..7fc4a371bdd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1392,6 +1392,17 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	int last_cpupid, this_cpupid;
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
+	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+
+	/*
+	 * Allow first faults or private faults to migrate immediately early in
+	 * the lifetime of a task. The magic number 4 is based on waiting for
+	 * two full passes of the "multi-stage node selection" test that is
+	 * executed below.
+	 */
+	if ((p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) &&
+	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
+		return true;
 
 	/*
 	 * Multi-stage node selection is used in conjunction with a periodic
@@ -1410,7 +1421,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	 * This quadric squishes small probabilities, making it less likely we
 	 * act on an unlikely task<->page relation.
 	 */
-	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 	if (!cpupid_pid_unset(last_cpupid) &&
 				cpupid_to_nid(last_cpupid) != dst_nid)
 		return false;
-- 
2.16.4


* Re: [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa balancing migration
  2018-10-01 10:05 ` [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa balancing migration Mel Gorman
@ 2018-10-01 15:39   ` Rik van Riel
  2018-10-02 10:17   ` [tip:sched/urgent] mm, sched/numa: Remove rate-limiting of automatic NUMA " tip-bot for Mel Gorman
  2018-10-02 11:54   ` [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa " Srikar Dronamraju
  2 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2018-10-01 15:39 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Srikar Dronamraju, Jirka Hladky, LKML, Linux-MM

On Mon, 2018-10-01 at 11:05 +0100, Mel Gorman wrote:
> 
> STREAM on 2-socket machine
>                          4.19.0-rc5             4.19.0-rc5
>                          numab-v1r1       noratelimit-v1r1
> MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
> MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
> MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
> MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-01 10:05 ` [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task Mel Gorman
@ 2018-10-01 15:41   ` Rik van Riel
  2018-10-02 10:17   ` [tip:sched/urgent] sched/numa: " tip-bot for Mel Gorman
  2018-10-02 12:41   ` [PATCH 2/2] mm, numa: " Srikar Dronamraju
  2 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2018-10-01 15:41 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Srikar Dronamraju, Jirka Hladky, LKML, Linux-MM

On Mon, 2018-10-01 at 11:05 +0100, Mel Gorman wrote:
> With this patch applied, STREAM performance is the same as 4.17 even
> though
> processes are not spread cross-node prematurely. Other workloads
> showed
> a mix of minor gains and losses. This is somewhat expected most
> workloads
> are not very sensitive to the starting conditions of a process.
> 
>                          4.19.0-rc5             4.19.0-
> rc5                 4.17.0
>                          numab-v1r1       fastmigrate-
> v1r1                vanilla
> MB/sec copy     43298.52 (   0.00%)    47335.46
> (   9.32%)    47219.24 (   9.06%)
> MB/sec scale    30115.06 (   0.00%)    32568.12
> (   8.15%)    32527.56 (   8.01%)
> MB/sec add      32825.12 (   0.00%)    36078.94
> (   9.91%)    35928.02 (   9.45%)
> MB/sec triad    32549.52 (   0.00%)    35935.94
> (  10.40%)    35969.88 (  10.51%)
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Rik van Riel <riel@surriel.com>
-- 
All Rights Reversed.

* [tip:sched/urgent] mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
  2018-10-01 10:05 ` [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa balancing migration Mel Gorman
  2018-10-01 15:39   ` Rik van Riel
@ 2018-10-02 10:17   ` tip-bot for Mel Gorman
  2018-10-02 11:54   ` [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa " Srikar Dronamraju
  2 siblings, 0 replies; 16+ messages in thread
From: tip-bot for Mel Gorman @ 2018-10-02 10:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, linux-kernel, jhladky, mingo, a.p.zijlstra, mgorman,
	riel, tglx, hpa, srikar, linux-mm

Commit-ID:  efaffc5e40aeced0bcb497ed7a0a5b8c14abfcdf
Gitweb:     https://git.kernel.org/tip/efaffc5e40aeced0bcb497ed7a0a5b8c14abfcdf
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Mon, 1 Oct 2018 11:05:24 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 2 Oct 2018 11:31:14 +0200

mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration

Rate limiting of page migrations due to automatic NUMA balancing was
introduced to mitigate the worst-case scenario of migrating at high
frequency due to false sharing or slowly ping-ponging between nodes.
Since then, a lot of effort was spent on correctly identifying these
pages and avoiding unnecessary migrations and the safety net may no longer
be required.

Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
avoids spreading STREAM tasks wide prematurely. However, once the task
was properly placed, it delayed migrating the memory due to rate limiting.
Increasing the limit fixed the problem for him.

Currently, the limit is hard-coded and does not account for the real
capabilities of the hardware. Even if an estimate was attempted, it would
not properly account for the number of memory controllers and it could
not account for the amount of bandwidth used for normal accesses. Rather
than fudging, this patch simply eliminates the rate limiting.

However, Jirka reports that a STREAM configuration using multiple
processes achieved similar performance to 4.16. In local tests, this patch
improved performance of STREAM relative to the baseline but it is somewhat
machine-dependent. Most workloads show little or no performance difference,
implying that there is not a heavy reliance on the throttling mechanism
and it is safe to remove.

STREAM on 2-socket machine
                         4.19.0-rc5             4.19.0-rc5
                         numab-v1r1       noratelimit-v1r1
MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mmzone.h         |  6 ----
 include/trace/events/migrate.h | 27 ------------------
 mm/migrate.c                   | 65 ------------------------------------------
 mm/page_alloc.c                |  2 --
 4 files changed, 100 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1e22d96734e0..3f4c0b167333 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -671,12 +671,6 @@ typedef struct pglist_data {
 #ifdef CONFIG_NUMA_BALANCING
 	/* Lock serializing the migrate rate limiting window */
 	spinlock_t numabalancing_migrate_lock;
-
-	/* Rate limiting time interval */
-	unsigned long numabalancing_migrate_next_window;
-
-	/* Number of pages migrated during the rate limiting time interval */
-	unsigned long numabalancing_migrate_nr_pages;
 #endif
 	/*
 	 * This is a per-node reserve of pages that are not available
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 711372845945..705b33d1e395 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -70,33 +70,6 @@ TRACE_EVENT(mm_migrate_pages,
 		__print_symbolic(__entry->mode, MIGRATE_MODE),
 		__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
-
-TRACE_EVENT(mm_numa_migrate_ratelimit,
-
-	TP_PROTO(struct task_struct *p, int dst_nid, unsigned long nr_pages),
-
-	TP_ARGS(p, dst_nid, nr_pages),
-
-	TP_STRUCT__entry(
-		__array(	char,		comm,	TASK_COMM_LEN)
-		__field(	pid_t,		pid)
-		__field(	int,		dst_nid)
-		__field(	unsigned long,	nr_pages)
-	),
-
-	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
-		__entry->pid		= p->pid;
-		__entry->dst_nid	= dst_nid;
-		__entry->nr_pages	= nr_pages;
-	),
-
-	TP_printk("comm=%s pid=%d dst_nid=%d nr_pages=%lu",
-		__entry->comm,
-		__entry->pid,
-		__entry->dst_nid,
-		__entry->nr_pages)
-);
 #endif /* _TRACE_MIGRATE_H */
 
 /* This part must be outside protection */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4f1d894835b5..5e285c1249a0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1855,54 +1855,6 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 	return newpage;
 }
 
-/*
- * page migration rate limiting control.
- * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
- * window of time. Default here says do not migrate more than 1280M per second.
- */
-static unsigned int migrate_interval_millisecs __read_mostly = 100;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
-
-/* Returns true if the node is migrate rate-limited after the update */
-static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
-					unsigned long nr_pages)
-{
-	unsigned long next_window, interval;
-
-	next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
-	interval = msecs_to_jiffies(migrate_interval_millisecs);
-
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (time_after(jiffies, next_window) &&
-			spin_trylock(&pgdat->numabalancing_migrate_lock)) {
-		pgdat->numabalancing_migrate_nr_pages = 0;
-		do {
-			next_window += interval;
-		} while (unlikely(time_after(jiffies, next_window)));
-
-		WRITE_ONCE(pgdat->numabalancing_migrate_next_window, next_window);
-		spin_unlock(&pgdat->numabalancing_migrate_lock);
-	}
-	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
-		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
-								nr_pages);
-		return true;
-	}
-
-	/*
-	 * This is an unlocked non-atomic update so errors are possible.
-	 * The consequences are failing to migrate when we potentiall should
-	 * have which is not severe enough to warrant locking. If it is ever
-	 * a problem, it can be converted to a per-cpu counter.
-	 */
-	pgdat->numabalancing_migrate_nr_pages += nr_pages;
-	return false;
-}
-
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
@@ -1975,14 +1927,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (page_is_file_cache(page) && PageDirty(page))
 		goto out;
 
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (numamigrate_update_ratelimit(pgdat, 1))
-		goto out;
-
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated)
 		goto out;
@@ -2029,14 +1973,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
 
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
-		goto out_dropref;
-
 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
 		HPAGE_PMD_ORDER);
@@ -2133,7 +2069,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..706a738c0aee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6197,8 +6197,6 @@ static unsigned long __init calc_memmap_size(unsigned long spanned_pages,
 static void pgdat_init_numabalancing(struct pglist_data *pgdat)
 {
 	spin_lock_init(&pgdat->numabalancing_migrate_lock);
-	pgdat->numabalancing_migrate_nr_pages = 0;
-	pgdat->numabalancing_migrate_next_window = jiffies;
 }
 #else
 static void pgdat_init_numabalancing(struct pglist_data *pgdat) {}


* [tip:sched/urgent] sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-01 10:05 ` [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task Mel Gorman
  2018-10-01 15:41   ` Rik van Riel
@ 2018-10-02 10:17   ` tip-bot for Mel Gorman
  2018-10-02 12:41   ` [PATCH 2/2] mm, numa: " Srikar Dronamraju
  2 siblings, 0 replies; 16+ messages in thread
From: tip-bot for Mel Gorman @ 2018-10-02 10:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, srikar, mgorman, riel, linux-mm, linux-kernel, hpa,
	a.p.zijlstra, mingo, tglx, jhladky

Commit-ID:  37355bdc5a129899f6b245900a8eb944a092f7fd
Gitweb:     https://git.kernel.org/tip/37355bdc5a129899f6b245900a8eb944a092f7fd
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Mon, 1 Oct 2018 11:05:25 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 2 Oct 2018 11:31:33 +0200

sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task

Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
should migrate to a local node. This filter avoids excessive ping-ponging
if a page is shared or used by threads that migrate cross-node frequently.

Threads inherit both page tables and the preferred node ID from the
parent. This means that threads can trigger hinting faults earlier than
a new task, which delays scanning for a number of seconds. As a thread can be
load balanced very early in its lifetime, there can be an unnecessary delay
before it starts migrating thread-local data. This patch migrates private
pages faster early in the lifetime of a thread using the sequence counter
as an identifier of new tasks.

With this patch applied, STREAM performance is the same as 4.17 even though
processes are not spread cross-node prematurely. Other workloads showed
a mix of minor gains and losses. This is somewhat expected as most workloads
are not very sensitive to the starting conditions of a process.

                         4.19.0-rc5             4.19.0-rc5                 4.17.0
                         numab-v1r1       fastmigrate-v1r1                vanilla
MB/sec copy     43298.52 (   0.00%)    47335.46 (   9.32%)    47219.24 (   9.06%)
MB/sec scale    30115.06 (   0.00%)    32568.12 (   8.15%)    32527.56 (   8.01%)
MB/sec add      32825.12 (   0.00%)    36078.94 (   9.91%)    35928.02 (   9.45%)
MB/sec triad    32549.52 (   0.00%)    35935.94 (  10.40%)    35969.88 (  10.51%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25c7c7e09cbd..7fc4a371bdd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1392,6 +1392,17 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	int last_cpupid, this_cpupid;
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
+	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+
+	/*
+	 * Allow first faults or private faults to migrate immediately early in
+	 * the lifetime of a task. The magic number 4 is based on waiting for
+	 * two full passes of the "multi-stage node selection" test that is
+	 * executed below.
+	 */
+	if ((p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) &&
+	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
+		return true;
 
 	/*
 	 * Multi-stage node selection is used in conjunction with a periodic
@@ -1410,7 +1421,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	 * This quadric squishes small probabilities, making it less likely we
 	 * act on an unlikely task<->page relation.
 	 */
-	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 	if (!cpupid_pid_unset(last_cpupid) &&
 				cpupid_to_nid(last_cpupid) != dst_nid)
 		return false;

* Re: [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa balancing migration
  2018-10-01 10:05 ` [PATCH 1/2] mm, numa: Remove rate-limiting of automatic numa balancing migration Mel Gorman
  2018-10-01 15:39   ` Rik van Riel
  2018-10-02 10:17   ` [tip:sched/urgent] mm, sched/numa: Remove rate-limiting of automatic NUMA " tip-bot for Mel Gorman
@ 2018-10-02 11:54   ` Srikar Dronamraju
  2 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-10-02 11:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

* Mel Gorman <mgorman@techsingularity.net> [2018-10-01 11:05:24]:

> Rate limiting of page migrations due to automatic NUMA balancing was
> introduced to mitigate the worst-case scenario of migrating at high
> frequency due to false sharing or slowly ping-ponging between nodes.
> Since then, a lot of effort was spent on correctly identifying these
> pages and avoiding unnecessary migrations and the safety net may no longer
> be required.
> 
> Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
> avoids spreading STREAM tasks wide prematurely. However, once the task
> was properly placed, it delayed migrating the memory due to rate limiting.
> Increasing the limit fixed the problem for him.
> 
> Currently, the limit is hard-coded and does not account for the real
> capabilities of the hardware. Even if an estimate was attempted, it would
> not properly account for the number of memory controllers and it could
> not account for the amount of bandwidth used for normal accesses. Rather
> than fudging, this patch simply eliminates the rate limiting.
> 
> However, Jirka reports that a STREAM configuration using multiple
> processes achieved similar performance to 4.16. In local tests, this patch
> improved performance of STREAM relative to the baseline but it is somewhat
> machine-dependent. Most workloads show little or not performance difference
> implying that there is not a heavily reliance on the throttling mechanism
> and it is safe to remove.
> 
> STREAM on 2-socket machine
>                          4.19.0-rc5             4.19.0-rc5
>                          numab-v1r1       noratelimit-v1r1
> MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
> MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
> MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
> MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-01 10:05 ` [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task Mel Gorman
  2018-10-01 15:41   ` Rik van Riel
  2018-10-02 10:17   ` [tip:sched/urgent] sched/numa: " tip-bot for Mel Gorman
@ 2018-10-02 12:41   ` Srikar Dronamraju
  2018-10-02 13:54     ` Mel Gorman
  2 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-10-02 12:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 25c7c7e09cbd..7fc4a371bdd2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1392,6 +1392,17 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
>  	int last_cpupid, this_cpupid;
>
>  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
> +	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
> +
> +	/*
> +	 * Allow first faults or private faults to migrate immediately early in
> +	 * the lifetime of a task. The magic number 4 is based on waiting for
> +	 * two full passes of the "multi-stage node selection" test that is
> +	 * executed below.
> +	 */
> +	if ((p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) &&
> +	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
> +		return true;
>

This does have issues when used with workloads that incur more shared faults
than private faults.

In such workloads, this change would spread the memory, causing a regression
in behaviour.

5 runs on a 2-socket / 4-node POWER8 box


Without this patch
./numa01.sh      Real:  382.82    454.29    422.31    29.72
./numa01.sh      Sys:   40.12     74.53     58.50     13.37
./numa01.sh      User:  34230.22  46398.84  40292.62  4915.93

With this patch
./numa01.sh      Real:  415.56    555.04    473.45    51.17    -10.8016%
./numa01.sh      Sys:   43.42     94.22     73.59     17.31    -20.5055%
./numa01.sh      User:  35271.95  56644.19  45615.72  7165.01  -11.6694%

Since we are looking at time, smaller numbers are better.

----------------------------------------
# cat numa01.sh
#! /bin/bash
# numa01.sh corresponds to 2 perf bench processes each having ncpus/2 threads
# 50 loops of 3G process memory.

THREADS=${THREADS:-$(($(getconf _NPROCESSORS_ONLN)/2))}
perf bench numa mem --no-data_rand_walk -p 2 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 2000 $@
----------------------------------------

I know this is a synthetic benchmark, but I wonder if benchmarks run in a VM
guest would show similar behaviour when observed from the host.

SPECJbb did show some small losses and gains.

Our NUMA grouping is not fast enough. It can sometimes take several
iterations before all the tasks belonging to the same group end up being
part of the group. With the current check we end up spreading memory faster
than we should, hence hurting the chance of early consolidation.

Can we restrict it to something like this?

if (p->numa_scan_seq >=MIN && p->numa_scan_seq <= MIN+4 &&
    (cpupid_match_pid(p, last_cpupid)))
	return true;

meaning we have run at least MIN scans, and we find the task to be the most
likely task using this page.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-02 12:41   ` [PATCH 2/2] mm, numa: " Srikar Dronamraju
@ 2018-10-02 13:54     ` Mel Gorman
  2018-10-02 17:30       ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2018-10-02 13:54 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

On Tue, Oct 02, 2018 at 06:11:49PM +0530, Srikar Dronamraju wrote:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 25c7c7e09cbd..7fc4a371bdd2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1392,6 +1392,17 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> >  	int last_cpupid, this_cpupid;
> >
> >  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
> > +	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
> > +
> > +	/*
> > +	 * Allow first faults or private faults to migrate immediately early in
> > +	 * the lifetime of a task. The magic number 4 is based on waiting for
> > +	 * two full passes of the "multi-stage node selection" test that is
> > +	 * executed below.
> > +	 */
> > +	if ((p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) &&
> > +	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
> > +		return true;
> >
> 
> This does have issues when using with workloads that access more shared faults
> than private faults.
> 

Not as such. It can have issues on workloads where memory is initialised
by one thread, then additional threads are created and access the same
memory. They are not necessarily shared once buffers are handed over. In
such a case, migrating quickly is the right thing to do. If the pages are truly
shared then there may be some unnecessary migrations early in the lifetime of
the task, but it'll settle down quickly enough.
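
As an illustration only (a minimal, hypothetical example rather than anything
from the patch), the hand-over pattern in question looks something like the
following: every page is first touched by the thread that initialises the
buffer, and each chunk is then only ever accessed by the worker it is handed
to, so the pages are effectively private despite being in shared memory.

----------------------------------------
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS	4
#define CHUNK		(256UL << 20)	/* 256MB handed to each worker */

static char *buf;

static void *worker(void *arg)
{
	long id = (long)arg;
	char *chunk = buf + id * CHUNK;
	unsigned long sum = 0, i;

	/*
	 * If the load balancer places this thread on a remote node, NUMA
	 * hinting faults on these pages should migrate the chunk to follow
	 * the thread; nothing else touches the chunk after the hand-over.
	 */
	for (i = 0; i < CHUNK; i += 4096)
		sum += chunk[i];
	return (void *)sum;
}

int main(void)
{
	pthread_t t[NTHREADS];
	long i;

	/* All pages are allocated and first faulted by this thread */
	buf = malloc(NTHREADS * CHUNK);
	memset(buf, 0, NTHREADS * CHUNK);

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	free(buf);
	return 0;
}
----------------------------------------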

> In such workloads, this change would spread the memory causing regression in
> behaviour.
> 
> 5 runs of on 2 socket/ 4 node power 8 box
> 
> 
> Without this patch
> ./numa01.sh      Real:  382.82    454.29    422.31    29.72
> ./numa01.sh      Sys:   40.12     74.53     58.50     13.37
> ./numa01.sh      User:  34230.22  46398.84  40292.62  4915.93
> 
> With this patch
> ./numa01.sh      Real:  415.56    555.04    473.45    51.17    -10.8016%
> ./numa01.sh      Sys:   43.42     94.22     73.59     17.31    -20.5055%
> ./numa01.sh      User:  35271.95  56644.19  45615.72  7165.01  -11.6694%
> 
> Since we are looking at time, smaller numbers are better.
> 

Is it just numa01 that was affected for you? I ask because that particular
workload is an adverse workload on any machine with more than two nodes, and
your machine description says it has 4 nodes. What it is testing is quite
specific to 2-node machines.

> SPECJbb did show some small loss and gains.
> 

That almost always shows small gains and losses so that's not too
surprising.

> Our numa grouping is not fast enough. It can take sometimes several
> iterations before all the tasks belonging to the same group end up being
> part of the group. With the current check we end up spreading memory faster
> than we should hence hurting the chance of early consolidation.
> 
> Can we restrict to something like this?
> 
> if (p->numa_scan_seq >=MIN && p->numa_scan_seq <= MIN+4 &&
>     (cpupid_match_pid(p, last_cpupid)))
> 	return true;
> 
> meaning, we ran atleast MIN number of scans, and we find the task to be most likely
> task using this page.
> 

What's MIN? Assuming it's any type of delay, note that this will regress
STREAM again because it's very sensitive to the starting state.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-02 13:54     ` Mel Gorman
@ 2018-10-02 17:30       ` Srikar Dronamraju
  2018-10-02 18:22         ` Mel Gorman
  2018-10-03 13:07         ` Srikar Dronamraju
  0 siblings, 2 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-10-02 17:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

> > 
> > This does have issues when using with workloads that access more shared faults
> > than private faults.
> > 
> 
> Not as such. It can have issues on workloads where memory is initialised
> by one thread, then additional threads are created and access the same
> memory. They are not necessarily shared once buffers are handed over. In
> such a case, migrating quickly is the right thing to do. If it's truely
> shared pages then there may be some unnecessary migrations early in the
> lifetime of the task but it'll settle down quickly enough.
> 

Do you have a workload recommendation to try for shared fault accesses?
I will try to get a DayTrader run in a day or two. There, JVM and DB threads
act on the same memory, so I presume it might show some insights.

> Is it just numa01 that was affected for you? I ask because that particular
> workload is an averse workload on any machine with more than sockets and
> your machine description says it has 4 nodes. What it is testing is quite
> specific to 2-node machines.
> 

Agreed.

Some variations of numa01.sh where I have one process with threads equal
to the number of CPUs do regress, but not as much as numa01.

./numa03.sh      Real:  484.84    555.51    518.59    22.91    -5.84277%
./numa03.sh      Sys:   44.41     64.40     53.24     6.65     -11.3824%
./numa03.sh      User:  51328.77  59429.39  55366.62  2744.39  -9.47912%


> > SPECJbb did show some small loss and gains.
> > 
> 
> That almost always shows small gains and losses so that's not too
> surprising.
> 

Okay.

> > Our numa grouping is not fast enough. It can take sometimes several
> > iterations before all the tasks belonging to the same group end up being
> > part of the group. With the current check we end up spreading memory faster
> > than we should hence hurting the chance of early consolidation.
> > 
> > Can we restrict to something like this?
> > 
> > if (p->numa_scan_seq >=MIN && p->numa_scan_seq <= MIN+4 &&
> >     (cpupid_match_pid(p, last_cpupid)))
> > 	return true;
> > 
> > meaning, we ran atleast MIN number of scans, and we find the task to be most likely
> > task using this page.
> > 
> 


> What's MIN? Assuming it's any type of delay, note that this will regress
> STREAM again because it's very sensitive to the starting state.
> 

I was thinking of MIN as 3 to give things a chance to settle,
but that might not help STREAM, as you pointed out.

Do you have a hint on which commit made STREAM regress?

If we want to prioritize STREAM-like workloads (i.e. private faults), one simpler
fix could be to change the quadratic equation

from:
	if (!cpupid_pid_unset(last_cpupid) &&
				cpupid_to_nid(last_cpupid) != dst_nid)
		return false;
to:
	if (!cpupid_pid_unset(last_cpupid) &&
				cpupid_to_nid(last_cpupid) == dst_nid)
		return true;

i.e. to say if the group's tasks have likely consolidated to a node, or the task
was moved to a different node but the accesses were private, just move the memory.

The drawback, though, is that we keep pulling memory every time the task moves
across nodes (which is probably restricted for long-running tasks to some
extent by your fix).

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-02 17:30       ` Srikar Dronamraju
@ 2018-10-02 18:22         ` Mel Gorman
  2018-10-03 13:15           ` Srikar Dronamraju
  2018-10-03 13:07         ` Srikar Dronamraju
  1 sibling, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2018-10-02 18:22 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

On Tue, Oct 02, 2018 at 11:00:05PM +0530, Srikar Dronamraju wrote:
> > > 
> > > This does have issues when using with workloads that access more shared faults
> > > than private faults.
> > > 
> > 
> > Not as such. It can have issues on workloads where memory is initialised
> > by one thread, then additional threads are created and access the same
> > memory. They are not necessarily shared once buffers are handed over. In
> > such a case, migrating quickly is the right thing to do. If it's truely
> > shared pages then there may be some unnecessary migrations early in the
> > lifetime of the task but it'll settle down quickly enough.
> > 
> 
> Do you have a workload recommendation to try for shared fault accesses.

NAS parallelised with OMP tends to be OK but I haven't quantified whether it's
perfect or a good example. I don't have an example of a workload that
is good at targeting the specific case where pages are shared between
tasks that tend to run on separate nodes. It would be somewhat of an
anti-pattern for any workload regardless of automatic NUMA balancing.

> > > <SNIP>
> > >
> > > Our numa grouping is not fast enough. It can take sometimes several
> > > iterations before all the tasks belonging to the same group end up being
> > > part of the group. With the current check we end up spreading memory faster
> > > than we should hence hurting the chance of early consolidation.
> > > 
> > > Can we restrict to something like this?
> > > 
> > > if (p->numa_scan_seq >=MIN && p->numa_scan_seq <= MIN+4 &&
> > >     (cpupid_match_pid(p, last_cpupid)))
> > > 	return true;
> > > 
> > > meaning, we ran atleast MIN number of scans, and we find the task to be most likely
> > > task using this page.
> > > 
> > 
> 
> 
> > What's MIN? Assuming it's any type of delay, note that this will regress
> > STREAM again because it's very sensitive to the starting state.
> > 
> 
> I was thinking of MIN as 3 to give a chance for things to settle.
> but that might not help STREAM as you pointed out.
> 

Probably not.

> Do you have a hint on which commit made STREAM regress?
> 

2c83362734da ("sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on")

Reverting it hurts workloads that communicate immediately with new processes
or threads as workloads spread prematurely and then get pulled back just
after clone.

> if we want to prioritize STREAM like workloads (i.e private faults) one simpler
> fix could be to change the quadtraic equation
> 
> from:
> 	if (!cpupid_pid_unset(last_cpupid) &&
> 				cpupid_to_nid(last_cpupid) != dst_nid)
> 		return false;
> to:
> 	if (!cpupid_pid_unset(last_cpupid) &&
> 				cpupid_to_nid(last_cpupid) == dst_nid)
> 		return true;
> 
> i.e to say if the group tasks likely consolidated to a node or the task was
> moved to a different node but access were private, just move the memory.
> 
> The drawback though is we keep pulling memory everytime the task moves
> across nodes. (which is probably restricted for long running tasks to some
> extent by your fix)
> 

This has way more consequences as it changes the behaviour for the entire
lifetime of the workload. It could cause excessive migrations in the case
where a machine is almost fully utilised and getting load balanced or in
cases where tasks are pulled frequently cross-node (e.g. worker thread
model or a pipelined computation).

I'm only looking to address the case where the load balancer spreads a
workload early and the memory should move to the new node quickly. If it
turns out there are cases where that decision is wrong, it gets remedied
quickly but if your proposal is ever wrong, the system doesn't recover.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-02 17:30       ` Srikar Dronamraju
  2018-10-02 18:22         ` Mel Gorman
@ 2018-10-03 13:07         ` Srikar Dronamraju
  2018-10-03 13:21           ` Mel Gorman
  1 sibling, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-10-03 13:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-10-02 23:00:05]:

> I will try to get a DayTrader run in a day or two. There JVM and db threads
> act on the same memory, I presume it might show some insights.

I ran 2 runs of DayTrader 7 with and without the patch on a 2-node POWER9
PowerNV box:
https://github.com/WASdev/sample.daytrader7
Each run has 8 JVMs.

Throughputs (higher is better)
Without patch 19216.8 18900.7 Average: 19058.75
With patch    18644.5 18480.9 Average: 18562.70

The difference is a -2.6% regression.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-02 18:22         ` Mel Gorman
@ 2018-10-03 13:15           ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-10-03 13:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

> > if we want to prioritize STREAM like workloads (i.e private faults) one simpler
> > fix could be to change the quadtraic equation
> > 
> > from:
> > 	if (!cpupid_pid_unset(last_cpupid) &&
> > 				cpupid_to_nid(last_cpupid) != dst_nid)
> > 		return false;
> > to:
> > 	if (!cpupid_pid_unset(last_cpupid) &&
> > 				cpupid_to_nid(last_cpupid) == dst_nid)
> > 		return true;
> > 
> > i.e to say if the group tasks likely consolidated to a node or the task was
> > moved to a different node but access were private, just move the memory.
> > 
> > The drawback though is we keep pulling memory everytime the task moves
> > across nodes. (which is probably restricted for long running tasks to some
> > extent by your fix)
> > 
> 
> This has way more consequences as it changes the behaviour for the entire
> lifetime of the workload. It could cause excessive migrations in the case
> where a machine is almost fully utilised and getting load balanced or in
> cases where tasks are pulled frequently cross-node (e.g. worker thread
> model or a pipelined computation).
> 
> I'm only looking to address the case where the load balancer spreads a
> workload early and the memory should move to the new node quickly. If it
> turns out there are cases where that decision is wrong, it gets remedied
> quickly but if your proposal is ever wrong, the system doesn't recover.
> 

Agree.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-03 13:07         ` Srikar Dronamraju
@ 2018-10-03 13:21           ` Mel Gorman
  2018-10-03 14:08             ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2018-10-03 13:21 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

On Wed, Oct 03, 2018 at 06:37:41PM +0530, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-10-02 23:00:05]:
> 
> > I will try to get a DayTrader run in a day or two. There JVM and db threads
> > act on the same memory, I presume it might show some insights.
> 
> I ran 2 runs of daytrader 7 with and without patch on a 2 node power9
> PowerNv box.
> https://github.com/WASdev/sample.daytrader7
> In each run, has 8 JVMs.
> 
> Throughputs (Higher are better)
> Without patch 19216.8 18900.7 Average: 19058.75
> With patch    18644.5 18480.9 Average: 18562.70
> 
> Difference being -2.6% regression
> 

That's unfortunate.

How much does this workload normally vary between runs? If you monitor
migrations over time, is there an increased spike in migrations early in
the lifetime of the workload?

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker early in the lifetime of a task
  2018-10-03 13:21           ` Mel Gorman
@ 2018-10-03 14:08             ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-10-03 14:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM

* Mel Gorman <mgorman@techsingularity.net> [2018-10-03 14:21:55]:

> On Wed, Oct 03, 2018 at 06:37:41PM +0530, Srikar Dronamraju wrote:
> > * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-10-02 23:00:05]:
> > 
> 

> That's unfortunate.
> 
> How much does this workload normally vary between runs? If you monitor
> migrations over time, is there an increase spike in migration early in
> the lifetime of the workload?
> 

The run-to-run variation has always been less than 1%.
I haven't monitored migrations over time. I will try to include that in my next
run. It's a shared setup so I may not get the box immediately.


-- 
Thanks and Regards
Srikar Dronamraju

