From: SeongJae Park <sjpark@amazon.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
Peter Zijlstra <peterz@infradead.org>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@redhat.com>,
Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@surriel.com>,
Daniel Jordan <daniel.m.jordan@oracle.com>,
Tejun Heo <tj@kernel.org>, Dave Hansen <dave.hansen@intel.com>,
Tim Chen <tim.c.chen@intel.com>, Aubrey Li <aubrey.li@intel.com>
Subject: Re: Re: [RFC] autonuma: Support to scan page table asynchronously
Date: Fri, 17 Apr 2020 09:05:08 +0200
Message-ID: <20200417070508.32243-1-sjpark@amazon.com>
In-Reply-To: <87eespyxld.fsf@yhuang-dev.intel.com>
On Wed, 15 Apr 2020 16:14:38 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
>
> > On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> >> In current AutoNUMA implementation, the page tables of the processes
> >> are scanned periodically to trigger the NUMA hint page faults. The
> >> scanning runs in the context of the processes, so will delay the
> >> running of the processes. In a test with 64 threads pmbench memory
> >> accessing benchmark on a 2-socket server machine with 104 logical CPUs
> >> and 256 GB memory, there are more than 20000 latency outliers that are
> >> > 1 ms in 3600s run time. These latency outliers are almost all
> >> caused by the AutoNUMA page table scanning. Because they almost all
> >> disappear after applying this patch to scan the page tables
> >> asynchronously.
> >>
> >> Because there are idle CPUs in system, the asynchronous running page
> >> table scanning code can run on these idle CPUs instead of the CPUs the
> >> workload is running on.
> >>
> >> So on system with enough idle CPU time, it's better to scan the page
> >> tables asynchronously to take full advantages of these idle CPU time.
> >> Another scenario which can benefit from this is to scan the page
> >> tables on some service CPUs of the socket, so that the real workload
> >> can run on the isolated CPUs without the latency outliers caused by
> >> the page table scanning.
> >>
> >> But it's not perfect to scan page tables asynchronously too. For
> >> example, on system without enough idle CPU time, the CPU time isn't
> >> scheduled fairly because the page table scanning is charged to the
> >> workqueue thread instead of the process/thread it works for. And
> >> although the page tables are scanned for the target process, it may
> >> run on a CPU that is not in the cpuset of the target process.
> >>
> >> One possible solution is to let the system administrator to choose the
> >> better behavior for the system via a sysctl knob (implemented in the
> >> patch). But it's not perfect too. Because every user space knob adds
> >> maintenance overhead.
> >>
> >> A better solution may be to back-charge the CPU time to scan the page
> >> tables to the process/thread, and find a way to run the work on the
> >> proper cpuset. After some googling, I found there's some discussion
> >> about this as in the following thread,
> >>
> >> https://lkml.org/lkml/2019/6/13/1321
> >>
> >> So this patch may be not ready to be merged by upstream yet. It
> >> quantizes the latency outliers caused by the page table scanning in
> >> AutoNUMA. And it provides a possible way to resolve the issue for
> >> users who cares about it. And it is a potential customer of the work
> >> related to the cgroup-aware workqueue or other asynchronous execution
> >> mechanisms.
> >>
> >
> > The caveats you list are the important ones and the reason why it was
> > not done asynchronously. In an earlier implementation all the work was
> > done by a dedicated thread and ultimately abandoned.
> >
> > There is no guarantee there is an idle CPU available and one that is
> > local to the thread that should be doing the scanning. Even if there is,
> > it potentially prevents another task from scheduling on an idle CPU and
> > similarly other workqueue tasks may be delayed waiting on the scanner. The
> > hiding of the cost is also problematic because the CPU cost is hidden
> > and mixed with other unrelated workqueues. It also has the potential
> > to mask bugs. Lets say for example there is a bug whereby a task is
> > scanning excessively, that can be easily missed when the work is done by
> > a workqueue.
>
> Do you think something like cgroup-aware workqueue is a solution deserve
> to be tried when it's available? It will not hide the scanning cost,
> because the CPU time will be charged to the original cgroup or task.
> Although the other tasks may be disturbed, cgroup can provide some kind
> of management via cpusets.
>
> > While it's just an opinion, my preference would be to focus on reducing
> > the cost and amount of scanning done -- particularly for threads. For
> > example, all threads operate on the same address space but there can be
> > significant overlap where all threads are potentially scanning the same
> > areas or regions that the thread has no interest in. One option would be
> > to track the highest and lowest pages accessed and only scan within
> > those regions for example. The tricky part is that library pages may
> > create very wide windows that render the tracking useless but it could
> > at least be investigated.
>
> In general, I think it's good to reduce the scanning cost.
I think the main idea of DAMON[1] might be applicable here. Have you
considered it?
[1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
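
For illustration only, the access-window idea quoted above (track the lowest
and highest pages accessed, and scan only within that range) could be sketched
roughly as below. This is a plain userspace sketch, not kernel code and not
how AutoNUMA or DAMON is implemented. `struct access_window`,
`window_note_access()`, `window_clamp()` and the fixed 4 KiB `PAGE_SZ` are all
hypothetical names and assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for the sketch; a real system would query it. */
#define PAGE_SZ 4096UL

/* Hypothetical per-thread access window; not an existing kernel structure. */
struct access_window {
	uintptr_t lo;	/* start of the lowest page seen accessed */
	uintptr_t hi;	/* one past the end of the highest page seen accessed */
	bool valid;	/* false until the first access is recorded */
};

/* Widen the window so that it covers the page containing @addr. */
static void window_note_access(struct access_window *w, uintptr_t addr)
{
	uintptr_t start = addr & ~(PAGE_SZ - 1);
	uintptr_t end = start + PAGE_SZ;

	if (!w->valid) {
		w->lo = start;
		w->hi = end;
		w->valid = true;
		return;
	}
	if (start < w->lo)
		w->lo = start;
	if (end > w->hi)
		w->hi = end;
}

/*
 * Clamp the scan range [*start, *end) to the window.
 * Returns false if nothing is left to scan. With no recorded access yet,
 * the range is left untouched so the whole range is still scanned.
 */
static bool window_clamp(const struct access_window *w,
			 uintptr_t *start, uintptr_t *end)
{
	if (!w->valid)
		return true;
	if (*start < w->lo)
		*start = w->lo;
	if (*end > w->hi)
		*end = w->hi;
	return *start < *end;
}
```

A single [lo, hi) pair is the simplest form of the idea. As the quoted text
notes, accesses to library pages far from the heap could make such a window
very wide, so this sketch only shows the bookkeeping, not whether it helps.
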
Thanks,
SeongJae Park
>
> Why do you think there will be overlap between the threads of a process?
> If my understanding were correctly, the threads will scan one by one
> instead of simultaneously. And how to determine whether a vma need to
> be scanned or not? For example, there may be only a small portion of
> pages been accessed in a vma, but they may be accessed remotely and
> consumes quite some inter-node bandwidth, so need to be migrated.
>
> Best Regards,
> Huang, Ying
>