From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: Dave Hansen <dave.hansen@intel.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page
Date: Mon, 17 Jul 2017 21:49:51 -0400 [thread overview]
Message-ID: <c73eabca-4c8c-8fb1-36c7-1f887f56689a@oracle.com> (raw)
In-Reply-To: <398e9887-6d6e-e1d3-abcf-43a6d7496bc8@intel.com>
On 07/17/2017 12:02 PM, Dave Hansen wrote:
> On 07/14/2017 03:16 PM, daniel.m.jordan@oracle.com wrote:
>> Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
>> Test: Clear a range of gigantic pages
>> nthread speedup size (GiB) min time (s) stdev
>> 1 100 41.13 0.03
>> 2 2.03x 100 20.26 0.14
>> 4 4.28x 100 9.62 0.09
>> 8 8.39x 100 4.90 0.05
>> 16 10.44x 100 3.94 0.03
> ...
>> 1 800 434.91 1.81
>> 2 2.54x 800 170.97 1.46
>> 4 4.98x 800 87.38 1.91
>> 8 10.15x 800 42.86 2.59
>> 16 12.99x 800 33.48 0.83
> What was the actual test here? Did you just use sysfs to allocate 800GB
> of 1GB huge pages?
I used fallocate(1) on a hugetlbfs, so this test is similar to the 6th
patch in the series, but here we parallelize only the page clearing
function since gigantic pages are large enough to benefit from multiple
threads, whereas we parallelize at the level of hugetlbfs_fallocate in
patch 6 for smaller page sizes (e.g. 2M on x86).
> This test should be entirely memory-bandwidth-limited, right?
That's right, the page clearing function dominates the test, so it's
memory-bandwidth-limited.
> Are you
> contending here that a single core can only use 1/10th of the memory
> bandwidth when clearing a page?
Yes, this is the biggest factor here. More threads can use more memory
bandwidth.
And yes, in the page clearing loop exercised in the test, a single
thread can use only a fraction of the chip's theoretical memory
bandwidth. This is the page clearing loop I'm stressing:
ENTRY(clear_page_erms)
movl $4096,%ecx
xorl %eax,%eax
rep stosb
ret
On my test machine, it tops out at around 2550 MiB/s with 1 thread, and
I get that same rate for each of 2, 4, or 8 threads when running on the
same chip (i.e. group of 18 cores for this machine). It's only at 16
threads on the same chip that it starts to drop, falling to something
around 1420 MiB/s.
> Or, does all the gain here come because we are round-robin-allocating
> the pages across all 8 NUMA nodes' memory controllers and the speedup
> here is because we're not doing the clearing across the interconnect?
The default NUMA policy was used for all results shown, so there was no
round-robin'ing at small sizes. For example, in the 100 GiB case, all
pages were allocated from the same node. But when it gets up to 800
GiB, obviously we're allocating from many nodes so that we get a sort of
round-robin effect and NUMA starts to matter. For instance, the
1-thread case does better on 100 GiB with all local accesses than on 800
GiB with mostly remote accesses.
ktask's ability to run NUMA-aware threads helps out here so that we're
not clearing across the interconnect, which is why the speedups get
better as the sizes get larger.
Thanks for your questions.
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-07-18 1:49 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-07-14 22:16 [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 1/6] ktask: add documentation daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 2/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 3/6] ktask: add /proc/sys/debug/ktask_max_threads daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 4/6] mm: enlarge type of offset argument in mem_map_offset and mem_map_next daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page daniel.m.jordan
2017-07-17 16:02 ` Dave Hansen
2017-07-18 1:49 ` Daniel Jordan [this message]
2017-07-14 22:16 ` [RFC PATCH v1 6/6] hugetlbfs: parallelize hugetlbfs_fallocate with ktask daniel.m.jordan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c73eabca-4c8c-8fb1-36c7-1f887f56689a@oracle.com \
--to=daniel.m.jordan@oracle.com \
--cc=dave.hansen@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).