linux-kernel.vger.kernel.org archive mirror
* [RFC] autonuma: Support to scan page table asynchronously
@ 2020-04-14  8:19 Huang Ying
  2020-04-14 12:06 ` Mel Gorman
  0 siblings, 1 reply; 16+ messages in thread
From: Huang Ying @ 2020-04-14  8:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Ingo Molnar, Mel Gorman,
	Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen,
	Aubrey Li

In the current AutoNUMA implementation, the page tables of the
processes are scanned periodically to trigger NUMA hint page faults.
The scanning runs in the context of the processes, so it delays the
execution of those processes.  In a test running the pmbench memory
accessing benchmark with 64 threads on a 2-socket server machine with
104 logical CPUs and 256 GB memory, there are more than 20000 latency
outliers that are > 1 ms in a 3600s run.  These latency outliers are
almost all caused by the AutoNUMA page table scanning: almost all of
them disappear after applying this patch to scan the page tables
asynchronously.

When there are idle CPUs in the system, the asynchronous page table
scanning can run on these idle CPUs instead of the CPUs the workload
is running on.

So on a system with enough idle CPU time, it's better to scan the
page tables asynchronously to take full advantage of that idle CPU
time.  Another scenario which can benefit from this is to scan the
page tables on some service CPUs of the socket, so that the real
workload can run on the isolated CPUs without the latency outliers
caused by the page table scanning.

But scanning the page tables asynchronously isn't perfect either.
For example, on a system without enough idle CPU time, the CPU time
isn't scheduled fairly, because the page table scanning is charged to
the workqueue thread instead of the process/thread it works for.  And
although the page tables are scanned for the target process, the
scanning may run on a CPU that is not in the cpuset of the target
process.

One possible solution is to let the system administrator choose the
better behavior for the system via a sysctl knob (implemented in this
patch).  But that isn't perfect either, because every user space knob
adds maintenance overhead.

A better solution may be to back-charge the CPU time spent scanning
the page tables to the process/thread, and to find a way to run the
work on the proper cpuset.  After some googling, I found there's some
discussion about this in the following thread,

https://lkml.org/lkml/2019/6/13/1321

So this patch may not be ready to be merged upstream yet.  It
quantifies the latency outliers caused by the page table scanning in
AutoNUMA, and it provides a possible way to resolve the issue for
users who care about it.  It is also a potential customer of the work
related to the cgroup-aware workqueue or other asynchronous execution
mechanisms.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Aubrey Li <aubrey.li@intel.com>
---
 include/linux/sched.h        |  1 +
 include/linux/sched/sysctl.h |  1 +
 kernel/exit.c                |  1 +
 kernel/sched/fair.c          | 55 +++++++++++++++++++++++++++---------
 kernel/sysctl.c              |  9 ++++++
 5 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..433f77436ed4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@ struct task_struct {
 	u64				last_task_numa_placement;
 	u64				last_sum_exec_runtime;
 	struct callback_head		numa_work;
+	struct work_struct		numa_async_work;
 
 	/*
 	 * This pointer is only modified for current in syscall and
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..ea595ae9e573 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -37,6 +37,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_numa_balancing_scan_async;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/exit.c b/kernel/exit.c
index 0b81b26a872a..f5e84ff4b788 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -798,6 +798,7 @@ void __noreturn do_exit(long code)
 	if (group_dead)
 		disassociate_ctty(1);
 	exit_task_namespaces(tsk);
+	cancel_work_sync(&tsk->numa_async_work);
 	exit_task_work(tsk);
 	exit_thread(tsk);
 	exit_umh(tsk);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1217bfe5e81..7d82c6826708 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,9 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/* Scan asynchronously via work queue */
+unsigned int sysctl_numa_balancing_scan_async;
+
 struct numa_group {
 	refcount_t refcount;
 
@@ -2481,24 +2484,17 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
-/*
- * The expensive part of numa migration is done from task_work context.
- * Triggered from task_tick_numa().
- */
-static void task_numa_work(struct callback_head *work)
+static void numa_scan(struct task_struct *p)
 {
 	unsigned long migrate, next_scan, now = jiffies;
-	struct task_struct *p = current;
+	struct task_struct *pc = current;
 	struct mm_struct *mm = p->mm;
-	u64 runtime = p->se.sum_exec_runtime;
+	u64 runtime = pc->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 
-	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
-
-	work->next = work;
 	/*
 	 * Who cares about NUMA placement when they're dying.
 	 *
@@ -2621,12 +2617,36 @@ static void task_numa_work(struct callback_head *work)
 	 * Usually update_task_scan_period slows down scanning enough; on an
 	 * overloaded system we need to limit overhead on a per task basis.
 	 */
-	if (unlikely(p->se.sum_exec_runtime != runtime)) {
-		u64 diff = p->se.sum_exec_runtime - runtime;
+	if (unlikely(pc->se.sum_exec_runtime != runtime)) {
+		u64 diff = pc->se.sum_exec_runtime - runtime;
 		p->node_stamp += 32 * diff;
 	}
 }
 
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+static void task_numa_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+
+	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work;
+
+	numa_scan(p);
+}
+
+static void numa_async_work(struct work_struct *work)
+{
+	struct task_struct *p;
+
+	p = container_of(work, struct task_struct, numa_async_work);
+
+	numa_scan(p);
+}
+
 void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
 	int mm_users = 0;
@@ -2650,6 +2670,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->last_sum_exec_runtime	= 0;
 
 	init_task_work(&p->numa_work, task_numa_work);
+	INIT_WORK(&p->numa_async_work, numa_async_work);
 
 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
@@ -2699,8 +2720,14 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 			curr->numa_scan_period = task_scan_start(curr);
 		curr->node_stamp += period;
 
-		if (!time_before(jiffies, curr->mm->numa_next_scan))
-			task_work_add(curr, work, true);
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			if (sysctl_numa_balancing_scan_async)
+				queue_work_node(numa_node_id(),
+						system_unbound_wq,
+						&curr->numa_async_work);
+			else
+				task_work_add(curr, work, true);
+		}
 	}
 }
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ad5b88a53c5a..31a987230c58 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -418,6 +418,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_scan_async",
+		.data		= &sysctl_numa_balancing_scan_async,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */
-- 
2.25.1



* Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-14  8:19 [RFC] autonuma: Support to scan page table asynchronously Huang Ying
@ 2020-04-14 12:06 ` Mel Gorman
  2020-04-15  8:14   ` Huang, Ying
  2020-04-15 11:32   ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2020-04-14 12:06 UTC (permalink / raw)
  To: Huang Ying
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Ingo Molnar, Mel Gorman,
	Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen,
	Aubrey Li

On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> In the current AutoNUMA implementation, the page tables of the
> processes are scanned periodically to trigger NUMA hint page faults.
> The scanning runs in the context of the processes, so it delays the
> execution of those processes.  In a test running the pmbench memory
> accessing benchmark with 64 threads on a 2-socket server machine with
> 104 logical CPUs and 256 GB memory, there are more than 20000 latency
> outliers that are > 1 ms in a 3600s run.  These latency outliers are
> almost all caused by the AutoNUMA page table scanning: almost all of
> them disappear after applying this patch to scan the page tables
> asynchronously.
> 
> When there are idle CPUs in the system, the asynchronous page table
> scanning can run on these idle CPUs instead of the CPUs the workload
> is running on.
> 
> So on a system with enough idle CPU time, it's better to scan the
> page tables asynchronously to take full advantage of that idle CPU
> time.  Another scenario which can benefit from this is to scan the
> page tables on some service CPUs of the socket, so that the real
> workload can run on the isolated CPUs without the latency outliers
> caused by the page table scanning.
> 
> But scanning the page tables asynchronously isn't perfect either.
> For example, on a system without enough idle CPU time, the CPU time
> isn't scheduled fairly, because the page table scanning is charged to
> the workqueue thread instead of the process/thread it works for.  And
> although the page tables are scanned for the target process, the
> scanning may run on a CPU that is not in the cpuset of the target
> process.
> 
> One possible solution is to let the system administrator choose the
> better behavior for the system via a sysctl knob (implemented in this
> patch).  But that isn't perfect either, because every user space knob
> adds maintenance overhead.
> 
> A better solution may be to back-charge the CPU time spent scanning
> the page tables to the process/thread, and to find a way to run the
> work on the proper cpuset.  After some googling, I found there's some
> discussion about this in the following thread,
> 
> https://lkml.org/lkml/2019/6/13/1321
> 
> So this patch may not be ready to be merged upstream yet.  It
> quantifies the latency outliers caused by the page table scanning in
> AutoNUMA, and it provides a possible way to resolve the issue for
> users who care about it.  It is also a potential customer of the work
> related to the cgroup-aware workqueue or other asynchronous execution
> mechanisms.
> 

The caveats you list are the important ones and the reason why it was
not done asynchronously. In an earlier implementation all the work was
done by a dedicated thread and ultimately abandoned.

There is no guarantee there is an idle CPU available and one that is
local to the thread that should be doing the scanning. Even if there is,
it potentially prevents another task from scheduling on an idle CPU and
similarly other workqueue tasks may be delayed waiting on the scanner. The
hiding of the cost is also problematic because the CPU cost is hidden
and mixed with other unrelated workqueues. It also has the potential
to mask bugs. Let's say, for example, there is a bug whereby a task is
scanning excessively; that can be easily missed when the work is done by
a workqueue.

While it's just an opinion, my preference would be to focus on reducing
the cost and amount of scanning done -- particularly for threads. For
example, all threads operate on the same address space but there can be
significant overlap where all threads are potentially scanning the same
areas or regions that the thread has no interest in. One option would be
to track the highest and lowest pages accessed and only scan within
those regions for example. The tricky part is that library pages may
create very wide windows that render the tracking useless but it could
at least be investigated. 
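
A very rough sketch of that idea follows.  It is illustrative only:
numa_scan_low and numa_scan_high are hypothetical fields, not existing
mm_struct members, and nothing below is part of this patch.

#include <linux/kernel.h>
#include <linux/mm_types.h>

/* Sketch: remember the window of addresses that took NUMA hinting faults. */
static void numa_record_fault_addr(struct mm_struct *mm, unsigned long addr)
{
        if (addr < mm->numa_scan_low)
                mm->numa_scan_low = addr;
        if (addr > mm->numa_scan_high)
                mm->numa_scan_high = addr;
}

/* Sketch: clamp the next scan pass to the recorded window. */
static void numa_clamp_scan_window(struct mm_struct *mm,
                                   unsigned long *start, unsigned long *end)
{
        *start = max(*start, mm->numa_scan_low);
        *end = min(*end, mm->numa_scan_high);
}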

-- 
Mel Gorman
SUSE Labs


* Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-14 12:06 ` Mel Gorman
@ 2020-04-15  8:14   ` Huang, Ying
  2020-04-17  7:05     ` SeongJae Park
  2020-04-15 11:32   ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2020-04-15  8:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Ingo Molnar, Mel Gorman,
	Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen,
	Aubrey Li

Mel Gorman <mgorman@techsingularity.net> writes:

> On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
>> In the current AutoNUMA implementation, the page tables of the
>> processes are scanned periodically to trigger NUMA hint page faults.
>> The scanning runs in the context of the processes, so it delays the
>> execution of those processes.  In a test running the pmbench memory
>> accessing benchmark with 64 threads on a 2-socket server machine with
>> 104 logical CPUs and 256 GB memory, there are more than 20000 latency
>> outliers that are > 1 ms in a 3600s run.  These latency outliers are
>> almost all caused by the AutoNUMA page table scanning: almost all of
>> them disappear after applying this patch to scan the page tables
>> asynchronously.
>> 
>> When there are idle CPUs in the system, the asynchronous page table
>> scanning can run on these idle CPUs instead of the CPUs the workload
>> is running on.
>> 
>> So on a system with enough idle CPU time, it's better to scan the
>> page tables asynchronously to take full advantage of that idle CPU
>> time.  Another scenario which can benefit from this is to scan the
>> page tables on some service CPUs of the socket, so that the real
>> workload can run on the isolated CPUs without the latency outliers
>> caused by the page table scanning.
>> 
>> But scanning the page tables asynchronously isn't perfect either.
>> For example, on a system without enough idle CPU time, the CPU time
>> isn't scheduled fairly, because the page table scanning is charged to
>> the workqueue thread instead of the process/thread it works for.  And
>> although the page tables are scanned for the target process, the
>> scanning may run on a CPU that is not in the cpuset of the target
>> process.
>> 
>> One possible solution is to let the system administrator choose the
>> better behavior for the system via a sysctl knob (implemented in this
>> patch).  But that isn't perfect either, because every user space knob
>> adds maintenance overhead.
>> 
>> A better solution may be to back-charge the CPU time spent scanning
>> the page tables to the process/thread, and to find a way to run the
>> work on the proper cpuset.  After some googling, I found there's some
>> discussion about this in the following thread,
>> 
>> https://lkml.org/lkml/2019/6/13/1321
>> 
>> So this patch may not be ready to be merged upstream yet.  It
>> quantifies the latency outliers caused by the page table scanning in
>> AutoNUMA, and it provides a possible way to resolve the issue for
>> users who care about it.  It is also a potential customer of the work
>> related to the cgroup-aware workqueue or other asynchronous execution
>> mechanisms.
>> 
>
> The caveats you list are the important ones and the reason why it was
> not done asynchronously. In an earlier implementation all the work was
> done by a dedicated thread and ultimately abandoned.
>
> There is no guarantee there is an idle CPU available and one that is
> local to the thread that should be doing the scanning. Even if there is,
> it potentially prevents another task from scheduling on an idle CPU and
> similarly other workqueue tasks may be delayed waiting on the scanner. The
> hiding of the cost is also problematic because the CPU cost is hidden
> and mixed with other unrelated workqueues. It also has the potential
> to mask bugs. Let's say, for example, there is a bug whereby a task is
> scanning excessively; that can be easily missed when the work is done by
> a workqueue.

Do you think something like a cgroup-aware workqueue is a solution that
deserves to be tried when it's available?  It will not hide the scanning
cost, because the CPU time will be charged to the original cgroup or
task.  Although the other tasks may be disturbed, cgroups can provide
some kind of management via cpusets.

> While it's just an opinion, my preference would be to focus on reducing
> the cost and amount of scanning done -- particularly for threads. For
> example, all threads operate on the same address space but there can be
> significant overlap where all threads are potentially scanning the same
> areas or regions that the thread has no interest in. One option would be
> to track the highest and lowest pages accessed and only scan within
> those regions for example. The tricky part is that library pages may
> create very wide windows that render the tracking useless but it could
> at least be investigated. 

In general, I think it's good to reduce the scanning cost.

Why do you think there will be overlap between the threads of a process?
If my understanding is correct, the threads will scan one by one
instead of simultaneously.  And how would we determine whether a vma
needs to be scanned or not?  For example, there may be only a small
portion of the pages in a vma being accessed, but they may be accessed
remotely and consume quite some inter-node bandwidth, so they need to
be migrated.

Best Regards,
Huang, Ying


* Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-14 12:06 ` Mel Gorman
  2020-04-15  8:14   ` Huang, Ying
@ 2020-04-15 11:32   ` Peter Zijlstra
  2020-04-16  1:24     ` Huang, Ying
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2020-04-15 11:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Huang Ying, linux-mm, linux-kernel, Ingo Molnar, Mel Gorman,
	Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen,
	Aubrey Li

On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
> While it's just an opinion, my preference would be to focus on reducing
> the cost and amount of scanning done -- particularly for threads.

This; I really don't believe in those back-charging things, esp. since
not having cgroups or having multiple applications in a single cgroup is
a valid setup.

Another way to reduce latency spikes is to decrease both
sysctl_numa_balancing_scan_delay and sysctl_numa_balancing_scan_size.
Then you do more smaller scans. By scanning more often you reduce the
contrast, by reducing the size you lower the max latency.
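
For illustration only, those knobs can be tuned from user space as in the
sketch below; the paths are the existing NUMA balancing sysctls, while the
values are arbitrary examples rather than recommendations.

#include <stdio.h>

/* write a value to one of the NUMA balancing sysctl files */
static void set_sysctl(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return;
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        /* default 256 MB; a smaller size shortens each scan pass */
        set_sysctl("/proc/sys/kernel/numa_balancing_scan_size_mb", "64");
        /* default 1000 ms; start scanning new tasks sooner */
        set_sysctl("/proc/sys/kernel/numa_balancing_scan_delay_ms", "250");
        return 0;
}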

And this is all assuming you actually want numa balancing for this
process.


* Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-15 11:32   ` Peter Zijlstra
@ 2020-04-16  1:24     ` Huang, Ying
  2020-04-17 10:06       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2020-04-16  1:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, linux-mm, linux-kernel, Ingo Molnar, Mel Gorman,
	Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen,
	Aubrey Li

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
>> While it's just an opinion, my preference would be to focus on reducing
>> the cost and amount of scanning done -- particularly for threads.
>
> This; I really don't believe in those back-charging things, esp. since
> not having cgroups or having multiple applications in a single cgroup is
> a valid setup.

Technically, it appears possible to back-charge the CPU time to the
process/thread directly (not the cgroup).

> Another way to reduce latency spikes is to decrease both
> sysctl_numa_balancing_scan_delay and sysctl_numa_balancing_scan_size.
> Then you do more smaller scans. By scanning more often you reduce the
> contrast, by reducing the size you lower the max latency.

Yes.  This can reduce latency spikes.

Best Regards,
Huang, Ying

> And this is all assuming you actually want numa balancing for this
> process.


* Re: Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-15  8:14   ` Huang, Ying
@ 2020-04-17  7:05     ` SeongJae Park
  2020-04-17 10:04       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: SeongJae Park @ 2020-04-17  7:05 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mel Gorman, Peter Zijlstra, linux-mm, linux-kernel, Ingo Molnar,
	Mel Gorman, Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen,
	Tim Chen, Aubrey Li

On Wed, 15 Apr 2020 16:14:38 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> Mel Gorman <mgorman@techsingularity.net> writes:
> 
> > On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> >> In the current AutoNUMA implementation, the page tables of the
> >> processes are scanned periodically to trigger NUMA hint page faults.
> >> The scanning runs in the context of the processes, so it delays the
> >> execution of those processes.  In a test running the pmbench memory
> >> accessing benchmark with 64 threads on a 2-socket server machine with
> >> 104 logical CPUs and 256 GB memory, there are more than 20000 latency
> >> outliers that are > 1 ms in a 3600s run.  These latency outliers are
> >> almost all caused by the AutoNUMA page table scanning: almost all of
> >> them disappear after applying this patch to scan the page tables
> >> asynchronously.
> >> 
> >> When there are idle CPUs in the system, the asynchronous page table
> >> scanning can run on these idle CPUs instead of the CPUs the workload
> >> is running on.
> >> 
> >> So on a system with enough idle CPU time, it's better to scan the
> >> page tables asynchronously to take full advantage of that idle CPU
> >> time.  Another scenario which can benefit from this is to scan the
> >> page tables on some service CPUs of the socket, so that the real
> >> workload can run on the isolated CPUs without the latency outliers
> >> caused by the page table scanning.
> >> 
> >> But scanning the page tables asynchronously isn't perfect either.
> >> For example, on a system without enough idle CPU time, the CPU time
> >> isn't scheduled fairly, because the page table scanning is charged to
> >> the workqueue thread instead of the process/thread it works for.  And
> >> although the page tables are scanned for the target process, the
> >> scanning may run on a CPU that is not in the cpuset of the target
> >> process.
> >> 
> >> One possible solution is to let the system administrator choose the
> >> better behavior for the system via a sysctl knob (implemented in this
> >> patch).  But that isn't perfect either, because every user space knob
> >> adds maintenance overhead.
> >> 
> >> A better solution may be to back-charge the CPU time spent scanning
> >> the page tables to the process/thread, and to find a way to run the
> >> work on the proper cpuset.  After some googling, I found there's some
> >> discussion about this in the following thread,
> >> 
> >> https://lkml.org/lkml/2019/6/13/1321
> >> 
> >> So this patch may not be ready to be merged upstream yet.  It
> >> quantifies the latency outliers caused by the page table scanning in
> >> AutoNUMA, and it provides a possible way to resolve the issue for
> >> users who care about it.  It is also a potential customer of the work
> >> related to the cgroup-aware workqueue or other asynchronous execution
> >> mechanisms.
> >> 
> >
> > The caveats you list are the important ones and the reason why it was
> > not done asynchronously. In an earlier implementation all the work was
> > done by a dedicated thread and ultimately abandoned.
> >
> > There is no guarantee there is an idle CPU available and one that is
> > local to the thread that should be doing the scanning. Even if there is,
> > it potentially prevents another task from scheduling on an idle CPU and
> > similarly other workqueue tasks may be delayed waiting on the scanner. The
> > hiding of the cost is also problematic because the CPU cost is hidden
> > and mixed with other unrelated workqueues. It also has the potential
> > to mask bugs. Let's say, for example, there is a bug whereby a task is
> > scanning excessively; that can be easily missed when the work is done by
> > a workqueue.
> 
> Do you think something like a cgroup-aware workqueue is a solution that
> deserves to be tried when it's available?  It will not hide the scanning
> cost, because the CPU time will be charged to the original cgroup or
> task.  Although the other tasks may be disturbed, cgroups can provide
> some kind of management via cpusets.
> 
> > While it's just an opinion, my preference would be to focus on reducing
> > the cost and amount of scanning done -- particularly for threads. For
> > example, all threads operate on the same address space but there can be
> > significant overlap where all threads are potentially scanning the same
> > areas or regions that the thread has no interest in. One option would be
> > to track the highest and lowest pages accessed and only scan within
> > those regions for example. The tricky part is that library pages may
> > create very wide windows that render the tracking useless but it could
> > at least be investigated. 
> 
> In general, I think it's good to reduce the scanning cost.

I think the main idea of DAMON[1] might be applicable here.  Have you
considered it?

[1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/


Thanks,
SeongJae Park

> 
> Why do you think there will be overlap between the threads of a process?
> If my understanding is correct, the threads will scan one by one
> instead of simultaneously.  And how would we determine whether a vma
> needs to be scanned or not?  For example, there may be only a small
> portion of the pages in a vma being accessed, but they may be accessed
> remotely and consume quite some inter-node bandwidth, so they need to
> be migrated.
> 
> Best Regards,
> Huang, Ying
> 


* Re: Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-17  7:05     ` SeongJae Park
@ 2020-04-17 10:04       ` Peter Zijlstra
  2020-04-17 10:21         ` SeongJae Park
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2020-04-17 10:04 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Huang, Ying, Mel Gorman, linux-mm, linux-kernel, Ingo Molnar,
	Mel Gorman, Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen,
	Tim Chen, Aubrey Li

On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> I think the main idea of DAMON[1] might be applicable here.  Have you
> considered it?
> 
> [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/

I've ignored that entire thing after you said the information it
provides was already available through the PMU.




* Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-16  1:24     ` Huang, Ying
@ 2020-04-17 10:06       ` Peter Zijlstra
  2020-04-20  3:26         ` Huang, Ying
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2020-04-17 10:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mel Gorman, linux-mm, linux-kernel, Ingo Molnar, Mel Gorman,
	Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen,
	Aubrey Li

On Thu, Apr 16, 2020 at 09:24:35AM +0800, Huang, Ying wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
> >> While it's just an opinion, my preference would be to focus on reducing
> >> the cost and amount of scanning done -- particularly for threads.
> >
> > This; I really don't believe in those back-charging things, esp. since
> > not having cgroups or having multiple applications in a single cgroup is
> > a valid setup.
> 
> Technically, it appears possible to back-charge the CPU time to the
> process/thread directly (not the cgroup).

I've yet to see a sane proposal there. What we're not going to do is
make regular task accounting more expensive than it already is.


* Re: Re: Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-17 10:04       ` Peter Zijlstra
@ 2020-04-17 10:21         ` SeongJae Park
  2020-04-17 12:16           ` Mel Gorman
  0 siblings, 1 reply; 16+ messages in thread
From: SeongJae Park @ 2020-04-17 10:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: SeongJae Park, Huang, Ying, Mel Gorman, linux-mm, linux-kernel,
	Ingo Molnar, Mel Gorman, Rik van Riel, Daniel Jordan, Tejun Heo,
	Dave Hansen, Tim Chen, Aubrey Li

On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > I think the main idea of DAMON[1] might be applicable here.  Have you
> > considered it?
> > 
> > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> 
> I've ignored that entire thing after you said the information it
> provides was already available through the PMU.

Sorry if my answer made you confused.  What I wanted to say was that the
fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
for now, but it could be modified to use PMU or other features instead.  In
other words, PMU on some architectures could provide the fundamental, low level
information for DAMON, as the PTE Accessed bit does.  What DAMON does is
efficiently control the fundamental access checking features and produce
the elaborated final information, which the PMU itself doesn't provide.


Thanks,
SeongJae Park


* Re: Re: Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-17 10:21         ` SeongJae Park
@ 2020-04-17 12:16           ` Mel Gorman
  2020-04-17 12:21             ` Peter Zijlstra
  2020-04-17 12:44             ` SeongJae Park
  0 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2020-04-17 12:16 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Peter Zijlstra, Huang, Ying, linux-mm, linux-kernel, Ingo Molnar,
	Mel Gorman, Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen,
	Tim Chen, Aubrey Li

On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > I think the main idea of DAMON[1] might be applicable here.  Have you
> > > considered it?
> > > 
> > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > 
> > I've ignored that entire thing after you said the information it
> > provides was already available through the PMU.
> 
> Sorry if my answer made you confused.  What I wanted to say was that the
> fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> for now, but it could be modified to use PMU or other features instead. 

I would not be inclined to lean towards either approach for NUMA
balancing. Fiddling with the accessed bit can have consequences for page
aging and residency -- fine for debugging a problem, not so fine for
normal usage. I would expect the PMU approach would have high overhead
as well as taking over a PMU counter that userspace debugging may expect
to be available.

-- 
Mel Gorman
SUSE Labs


* Re: Re: Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-17 12:16           ` Mel Gorman
@ 2020-04-17 12:21             ` Peter Zijlstra
  2020-04-17 12:44             ` SeongJae Park
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2020-04-17 12:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: SeongJae Park, Huang, Ying, linux-mm, linux-kernel, Ingo Molnar,
	Mel Gorman, Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen,
	Tim Chen, Aubrey Li

On Fri, Apr 17, 2020 at 01:16:29PM +0100, Mel Gorman wrote:
> On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > I think the main idea of DAMON[1] might be applicable here.  Have you
> > > > considered it?
> > > > 
> > > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > > 
> > > I've ignored that entire thing after you said the information it
> > > provides was already available through the PMU.
> > 
> > Sorry if my answer made you confused.  What I wanted to say was that the
> > fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> > for now, but it could be modified to use PMU or other features instead. 
> 
> I would not be inclined to lean towards either approach for NUMA
> balancing. Fiddling with the accessed bit can have consequences for page
> aging and residency -- fine for debugging a problem, not so fine for
> normal usage. I would expect the PMU approach would have high overhead
> as well as taking over a PMU counter that userspace debugging may expect
> to be available.

Oh, quite agreed; I was just saying I never saw the use of that whole
DAMON thing. AFAICT it's not actually solving a problem, just making
more.


* Re: Re: Re: Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-17 12:16           ` Mel Gorman
  2020-04-17 12:21             ` Peter Zijlstra
@ 2020-04-17 12:44             ` SeongJae Park
  2020-04-17 14:46               ` Mel Gorman
  1 sibling, 1 reply; 16+ messages in thread
From: SeongJae Park @ 2020-04-17 12:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: SeongJae Park, Peter Zijlstra, Huang, Ying, linux-mm,
	linux-kernel, Ingo Molnar, Mel Gorman, Rik van Riel,
	Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen, Aubrey Li

On Fri, 17 Apr 2020 13:16:29 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > > I think the main idea of DAMON[1] might be applicable here.  Have you
> > > > considered it?
> > > > 
> > > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > > 
> > > I've ignored that entire thing after you said the information it
> > > provides was already available through the PMU.
> > 
> > Sorry if my answer made you confused.  What I wanted to say was that the
> > fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> > for now, but it could be modified to use PMU or other features instead. 
> 
> I would not be inclined to lean towards either approach for NUMA
> balancing. Fiddling with the accessed bit can have consequences for page
> aging and residency -- fine for debugging a problem, not so fine for
> normal usage. I would expect the PMU approach would have high overhead
> as well as taking over a PMU counter that userspace debugging may expect
> to be available.

Good point.  But isn't it ok to use the Accessed bit as long as PG_idle
and PG_young are adjusted properly?  The current DAMON implementation
does so, as idle_page_tracking also does.

That said, the core logic of DAMON and the underlying access check primitive
are logically separated.  I am planning[1] to separate those entirely and let
users use the right access check primitive for their needs.

If I'm missing something, please let me know.

[1] https://lore.kernel.org/linux-mm/20200409094232.29680-1-sjpark@amazon.com/


Thanks,
SeongJae Park

> 
> -- 
> Mel Gorman
> SUSE Labs


* Re: Re: Re: Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-17 12:44             ` SeongJae Park
@ 2020-04-17 14:46               ` Mel Gorman
  2020-04-18  9:48                 ` SeongJae Park
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2020-04-17 14:46 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Mel Gorman, Peter Zijlstra, Huang, Ying, linux-mm, linux-kernel,
	Ingo Molnar, Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen,
	Tim Chen, Aubrey Li

On Fri, Apr 17, 2020 at 02:44:37PM +0200, SeongJae Park wrote:
> On Fri, 17 Apr 2020 13:16:29 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > > I think the main idea of DAMON[1] might be applicable here.  Have you
> > > > > considered it?
> > > > > 
> > > > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > > > 
> > > > I've ignored that entire thing after you said the information it
> > > > provides was already available through the PMU.
> > > 
> > > Sorry if my answer made you confused.  What I wanted to say was that the
> > > fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> > > for now, but it could be modified to use PMU or other features instead. 
> > 
> > I would not be inclined to lean towards either approach for NUMA
> > balancing. Fiddling with the accessed bit can have consequences for page
> > aging and residency -- fine for debugging a problem, not so fine for
> > normal usage. I would expect the PMU approach would have high overhead
> > as well as taking over a PMU counter that userspace debugging may expect
> > to be available.
> 
> Good point.  But isn't it ok to use the Accessed bit as long as PG_idle
> and PG_young are adjusted properly?  The current DAMON implementation
> does so, as idle_page_tracking also does.
> 

PG_idle and PG_young are page flags that are used by idle tracking. The
accessed bit I'm thinking of is in the PTE, and it's the PTE that
is changed by page_referenced() during reclaim. So it's not clear how
adjusting the page bits helps avoid page aging problems during reclaim.

Maybe your suggestion was to move NUMA balancing to use the PG_idle and
PG_young bits from idle tracking but I'm not sure what that gains us.
This may be because I did not look too closely at DAMON as for debugging
and analysis, the PMU sounded more suitable.

It's not clear to me how looking at the page or pte bit handling of
DAMON helps us reduce the scanning cost of numa balancing. There may
be some benefit in splitting the address space and rescanning sections
that were referenced but it has similar pitfalls to simply tracking the
highest/lowest address that trapped a NUMA hinting fault.

-- 
Mel Gorman
SUSE Labs


* Re: Re: Re: Re: Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-17 14:46               ` Mel Gorman
@ 2020-04-18  9:48                 ` SeongJae Park
  2020-04-20  2:32                   ` Huang, Ying
  0 siblings, 1 reply; 16+ messages in thread
From: SeongJae Park @ 2020-04-18  9:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: SeongJae Park, Mel Gorman, Peter Zijlstra, Huang, Ying, linux-mm,
	linux-kernel, Ingo Molnar, Rik van Riel, Daniel Jordan,
	Tejun Heo, Dave Hansen, Tim Chen, Aubrey Li

On Fri, 17 Apr 2020 15:46:26 +0100 Mel Gorman <mgorman@suse.de> wrote:

> On Fri, Apr 17, 2020 at 02:44:37PM +0200, SeongJae Park wrote:
> > On Fri, 17 Apr 2020 13:16:29 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > > On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > > > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > > > 
> > > > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > > > I think the main idea of DAMON[1] might be applicable here.  Have you
> > > > > > considered it?
> > > > > > 
> > > > > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > > > > 
> > > > > I've ignored that entire thing after you said the information it
> > > > > provides was already available through the PMU.
> > > > 
> > > > Sorry if my answer made you confused.  What I wanted to say was that the
> > > > fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> > > > for now, but it could be modified to use PMU or other features instead. 
> > > 
> > > I would not be inclined to lean towards either approach for NUMA
> > > balancing. Fiddling with the accessed bit can have consequences for page
> > > aging and residency -- fine for debugging a problem, not so fine for
> > > normal usage. I would expect the PMU approach would have high overhead
> > > as well as taking over a PMU counter that userspace debugging may expect
> > > to be available.
> > 
> > Good point.  But isn't it ok to use the Accessed bit as long as PG_idle
> > and PG_young are adjusted properly?  The current DAMON implementation
> > does so, as idle_page_tracking also does.
> > 
> 
> PG_idle and PG_young are page flags that are used by idle tracking. The
> accessed bit I'm thinking of is in the PTE, and it's the PTE that
> is changed by page_referenced() during reclaim. So it's not clear how
> adjusting the page bits helps avoid page aging problems during reclaim.

The idle tracking reads and updates not only PG_idle and PG_young, but also the
accessed bit.  So idle tracking also needed to deal with the interference
of the reclaimer logic, and thus it introduced PG_young.  The commit
33c3fc71c8cf ("mm: introduce idle page tracking") explains this as below:

    The Young page flag is used to avoid interference with the memory
    reclaimer.  A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

DAMON adjusts PG_young in a similar way to avoid the interference.  DAMON also
adjusts PG_idle so that it does not interfere with the page idle tracking
mechanism.  I think I confused you by unnecessarily mentioning PG_idle, sorry.
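
For concreteness, a minimal sketch of that adjustment is below.  It is not
taken from DAMON or from this patch; it assumes CONFIG_IDLE_PAGE_TRACKING
(so set_page_young()/set_page_idle() are available), a present PTE mapping
a normal page, and the usual arch page table headers.

#include <linux/mm.h>
#include <linux/page_idle.h>

/*
 * Sketch only: clear the Accessed bit of a mapped PTE for access sampling,
 * but record the reference in PG_young so that a later page_referenced()
 * still counts it (cf. commit 33c3fc71c8cf).
 */
static void sample_pte_accessed(struct vm_area_struct *vma,
                                unsigned long addr, pte_t *pte)
{
        struct page *page = pte_page(*pte);

        if (ptep_test_and_clear_young(vma, addr, pte))
                set_page_young(page);

        /* mark the page idle for the next sampling interval */
        set_page_idle(page);
}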

> 
> Maybe your suggestion was to move NUMA balancing to use the PG_idle and
> PG_young bits from idle tracking but I'm not sure what that gains us.
> This may be because I did not look too closely at DAMON as for debugging
> and analysis, the PMU sounded more suitable.
> 
> It's not clear to me how looking at the page or pte bit handling of
> DAMON helps us reduce the scanning cost of numa balancing. There may
> be some benefit in splitting the address space and rescanning sections
> that were referenced but it has similar pitfalls to simply tracking the
> highest/lowest address that trapped a NUMA hinting fault.

DAMON allows users to know which address ranges have which access frequency.
Thus, I think DAMON could be used for detection of hot pages, which will be
good migration candidates, instead of the NUMA hinting fault based referenced
pages detection.

The benefits I expect from this change are better accuracy and less overhead.
As we can know not only referenced pages but hot pages, migration will be more
effective.

In terms of the overhead, DAMON allows users to set an upper bound on the
monitoring overhead regardless of the size of the target address space, unlike
the page scanning based mechanisms.  Thus, the overhead could be reduced,
especially if the target process has a large memory footprint.
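
As a toy illustration of that bounded-overhead idea (this is not DAMON code;
check_one_page_accessed() is a hypothetical primitive standing in for the
Accessed-bit or PMU based check):

#include <stddef.h>

#define NR_REGIONS 100  /* fixed amount of work per sampling interval */

struct region {
        unsigned long start, end;       /* [start, end) of the region */
        unsigned int nr_accesses;       /* accumulated access count */
};

/* hypothetical primitive: was a sampled page in the region accessed? */
int check_one_page_accessed(unsigned long start, unsigned long end);

/* O(NR_REGIONS) per interval, independent of the address space size */
void sample_regions(struct region *regions)
{
        size_t i;

        for (i = 0; i < NR_REGIONS; i++)
                if (check_one_page_accessed(regions[i].start, regions[i].end))
                        regions[i].nr_accesses++;
}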

As I don't understand AutoNUMA deeply, I might be saying something unrealistic
or missing some important points.


Thanks,
SeongJae Park

> 
> -- 
> Mel Gorman
> SUSE Labs


* Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-18  9:48                 ` SeongJae Park
@ 2020-04-20  2:32                   ` Huang, Ying
  0 siblings, 0 replies; 16+ messages in thread
From: Huang, Ying @ 2020-04-20  2:32 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Mel Gorman, SeongJae Park, Mel Gorman, Peter Zijlstra, linux-mm,
	linux-kernel, Ingo Molnar, Rik van Riel, Daniel Jordan,
	Tejun Heo, Dave Hansen, Tim Chen, Aubrey Li

SeongJae Park <sj38.park@gmail.com> writes:
>
> DAMON allows users to know which address ranges have which access frequency.
> Thus, I think DAMON could be used for detection of hot pages, which will be
> good migration candidates, instead of the NUMA hinting fault based referenced
> pages detection.
>
> The benefits I expect from this change are better accuracy and less overhead.
> As we can know not only referenced pages but hot pages, migration will be more
> effective.

The purpose of AutoNUMA scanning isn't to find the hot pages, but the
pages that are accessed across NUMA nodes.  The PTE Accessed bit cannot
provide information about the accessing CPU, so it's not enough for
AutoNUMA.
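
For illustration, the NUMA hinting fault handler runs on the CPU that
touched the page, so it can compare the accessing node with the node the
page currently lives on; an Accessed-bit scan alone cannot tell us that.
A minimal sketch (not the actual fault-handling code):

#include <linux/mm.h>
#include <linux/topology.h>

/* sketch: is this access remote relative to the page's current node? */
static bool numa_access_is_remote(struct page *page)
{
        int page_nid = page_to_nid(page);       /* node the page resides on */
        int this_nid = numa_node_id();          /* node of the accessing CPU */

        return page_nid != this_nid;
}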

Best Regards,
Huang, Ying


* Re: [RFC] autonuma: Support to scan page table asynchronously
  2020-04-17 10:06       ` Peter Zijlstra
@ 2020-04-20  3:26         ` Huang, Ying
  0 siblings, 0 replies; 16+ messages in thread
From: Huang, Ying @ 2020-04-20  3:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, linux-mm, linux-kernel, Ingo Molnar, Mel Gorman,
	Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen,
	Aubrey Li

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Apr 16, 2020 at 09:24:35AM +0800, Huang, Ying wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
>> >> While it's just an opinion, my preference would be to focus on reducing
>> >> the cost and amount of scanning done -- particularly for threads.
>> >
>> > This; I really don't believe in those back-charging things, esp. since
>> > not having cgroups or having multiple applications in a single cgroup is
>> > a valid setup.
>> 
>> Technically, it appears possible to back-charge the CPU time to the
>> process/thread directly (not the cgroup).
>
> I've yet to see a sane proposal there. What we're not going to do is
> make regular task accounting more expensive than it already is.

Yes, there's overhead in back-charging.  To reduce the overhead, instead
of back-charging immediately, we can

- Add one field to task_struct, say backcharge_time, to track the
  delayed back-charged CPU time.

- When the work item completes its work, add the CPU time it spent to
  task_struct->backcharge_time atomically.

- When the task's CPU time is accounted regularly, e.g. in
  scheduler_tick(), task_struct->backcharge_time is considered too.

Although this cannot eliminate the overhead, it can reduce it.  Do you
think this is acceptable or not?
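
A rough sketch of that delayed back-charge is below.  The field and helper
names are hypothetical; nothing here exists in the kernel or in this patch.

#include <linux/atomic.h>
#include <linux/sched.h>

/*
 * Sketch only: assumes task_struct gains
 *      atomic64_t numa_backcharge_ns;
 * to accumulate CPU time spent by the work item on the task's behalf.
 */
static void numa_backcharge_add(struct task_struct *p, u64 spent_ns)
{
        /* work item side: record the time spent scanning for @p */
        atomic64_add(spent_ns, &p->numa_backcharge_ns);
}

static u64 numa_backcharge_collect(struct task_struct *p)
{
        /* accounting side, e.g. from the tick: take the debt and clear it */
        return atomic64_xchg(&p->numa_backcharge_ns, 0);
}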

Best Regards,
Huang, Ying
