* [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE
@ 2019-01-29 23:40 Andrea Arcangeli
  2019-01-30  7:17 ` Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Andrea Arcangeli @ 2019-01-29 23:40 UTC (permalink / raw)
  To: lsf-pc, linux-mm, linux-kernel
  Cc: Peter Xu, Blake Caldwell, Mike Rapoport, Mike Kravetz,
	Michal Hocko, Mel Gorman, Vlastimil Babka, David Rientjes

Hello,

I'd like to attend the LSF/MM Summit 2019. I'm interested in most MM
topics and it's enlightening to listen to the common non-MM topics
too.

One current topic that could be of interest is the THP / NUMA
tradeoff in the subject line.

One issue with a change in MADV_HUGEPAGE behavior made ~3 years ago
has kept floating around for the last 6 months (~12 months since it
was initially reported as a regression through an enterprise-like
workload). It was hot-fixed in commit
ac5b2c18911ffe95c08d69273917f90212cf5659, but that fix got quickly
reverted for various reasons.

I posted some benchmark results showing that for tasks without strong
NUMA locality the __GFP_THISNODE logic is not guaranteed to be optimal
(and here of course I mean even if we ignore the large slowdown with
swap storms at allocation time that might be caused by
__GFP_THISNODE). The results also show NUMA remote THPs help
intrasocket as well as intersocket.

https://lkml.kernel.org/r/20181210044916.GC24097@redhat.com
https://lkml.kernel.org/r/20181212104418.GE1130@redhat.com

The following seems to be the interim conclusion, on which I happen
to be in agreement with Michal and Mel:

https://lkml.kernel.org/r/20181212095051.GO1286@dhcp22.suse.cz
https://lkml.kernel.org/r/20181212170016.GG1130@redhat.com

Hopefully this specific issue will be hot-fixed before April (we
already had to hot-fix it in the enterprise kernels to prevent the
3-year-old regression from breaking large workloads that can't fit
in a single NUMA node, and I assume other enterprise distributions
will follow suit), but whatever the hot-fix turns out to be, it will
likely leave ample margin for discussion on what we can do better to
optimize the decision between local non-THP and remote THP under
MADV_HUGEPAGE.

It is clear that the __GFP_THISNODE forced in the current code
provides some minor advantage to apps using MADV_HUGEPAGE that can fit
in a single NUMA node, but we should try to achieve it without major
disadvantages to apps that can't fit in a single NUMA node.

For example, it was mentioned that we could allocate readily
available, already-free local 4k pages if local compaction fails and
the watermarks still allow local 4k allocations without invoking
reclaim, before invoking compaction on remote nodes. The same can be
repeated at a second level with intra-socket non-THP memory before
invoking compaction inter-socket. However, we can't do things like
that with the current page allocator workflow. It's possible that a
larger change is required than just sending a single gfp bitflag
down to the page allocator that creates an implicit MPOL_LOCAL
binding to make it behave like the obsolete numa/zone-reclaim
behavior, but weirdly applied only to THP allocations.
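
To make the intended order concrete, here is a pseudo-C sketch of
the fallback sequence described above. Every helper name here is
invented for illustration (none of these functions exist in the page
allocator today), and real code would have to thread this through
the gfp/zonelist machinery instead:

struct page *thp_fault_alloc(int local_nid)
{
	struct page *page;
	int nid;

	/* 1) Local THP: compaction allowed, but no reclaim (and
	 *    therefore no swap storm). */
	page = alloc_thp_on_node(local_nid);
	if (page)
		return page;

	/* 2) Local compaction failed: take an already-free local 4k
	 *    page, but only if the watermarks allow it without
	 *    invoking reclaim. */
	page = alloc_4k_nowait(local_nid);
	if (page)
		return page;

	/* 3) Repeat both steps on intra-socket nodes first, then
	 *    inter-socket, in order of increasing NUMA distance. */
	for_each_node_by_distance(nid, local_nid) {
		page = alloc_thp_on_node(nid);
		if (page)
			return page;
		page = alloc_4k_nowait(nid);
		if (page)
			return page;
	}
	return NULL;
}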

--

In addition to the above "NUMA remote THP vs NUMA local non-THP
tradeoff" topic, there are other developments in "userfaultfd" land that
are approaching merge readiness and that would be possible to provide a
short overview about:

- Peter Xu made significant progress in finalizing the userfaultfd-WP
  support over the last few months. That feature was planned from the
  start and it will allow userland to do some new things that weren't
  possible to achieve before. In addition to synchronously blocking
  write faults so they can be resolved by a userland manager, it also
  has the ability to obsolete the softdirty feature, because it can
  provide the same information, but with O(1) complexity (as opposed
  to the current softdirty O(N) complexity), similarly to what Page
  Modification Logging (PML) does in hardware for EPT write accesses
  (a sketch of the registration sequence follows after this list).

- Blake Caldwell maintained the UFFDIO_REMAP support to atomically
  remove memory from a mapping with userfaultfd (which can't be done
  with a copy as in UFFDIO_COPY, and which requires a slow TLB flush
  to be safe) as an alternative to host swapping (which of course
  also requires a TLB flush for similar reasons). Notably
  UFFDIO_REMAP was rightfully NAKed early on and quickly replaced by
  UFFDIO_COPY, which is more optimal for adding memory to a mapping
  in small chunks, but we can't remove memory with UFFDIO_COPY, and
  UFFDIO_REMAP should be as efficient as it gets when it comes to
  removing memory from a mapping.
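
As a reference for the uffd-wp part of the discussion, here is a
minimal sketch of how a manager would arm write-protect tracking on
a range. The UFFDIO_WRITEPROTECT ioctl, its struct and the *_MODE_WP
flags are taken from Peter's patchset and may still change before
merge; error handling is elided:

#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Arm uffd-wp on [addr, addr + len); returns the uffd to poll. */
static int wp_arm(void *addr, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	/* From here on, writes in the range generate
	 * UFFD_EVENT_PAGEFAULT messages with UFFD_PAGEFAULT_FLAG_WP
	 * set, until the manager un-protects the page. */
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	return uffd;
}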

Thank you,
Andrea


* Re: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE
  2019-01-29 23:40 [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE Andrea Arcangeli
@ 2019-01-30  7:17 ` Michal Hocko
  2019-01-30  8:13 ` [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE) Mike Rapoport
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2019-01-30  7:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: lsf-pc, linux-mm, linux-kernel, Peter Xu, Blake Caldwell,
	Mike Rapoport, Mike Kravetz, Mel Gorman, Vlastimil Babka,
	David Rientjes

On Tue 29-01-19 18:40:58, Andrea Arcangeli wrote:
> Hello,
> 
> I'd like to attend the LSF/MM Summit 2019. I'm interested in most MM
> topics and it's enlightening to listen to the common non-MM topics
> too.
> 
> One current topic that could be of interest is the THP / NUMA
> tradeoff in the subject line.
> 
> One issue with a change in MADV_HUGEPAGE behavior made ~3 years ago
> has kept floating around for the last 6 months (~12 months since it
> was initially reported as a regression through an enterprise-like
> workload). It was hot-fixed in commit
> ac5b2c18911ffe95c08d69273917f90212cf5659, but that fix got quickly
> reverted for various reasons.
> 
> I posted some benchmark results showing that for tasks without strong
> NUMA locality the __GFP_THISNODE logic is not guaranteed to be optimal
> (and here of course I mean even if we ignore the large slowdown with
> swap storms at allocation time that might be caused by
> __GFP_THISNODE). The results also show NUMA remote THPs help
> intrasocket as well as intersocket.
> 
> https://lkml.kernel.org/r/20181210044916.GC24097@redhat.com
> https://lkml.kernel.org/r/20181212104418.GE1130@redhat.com
> 
> The following seems to be the interim conclusion, on which I happen
> to be in agreement with Michal and Mel:
> 
> https://lkml.kernel.org/r/20181212095051.GO1286@dhcp22.suse.cz
> https://lkml.kernel.org/r/20181212170016.GG1130@redhat.com

I am definitely interested in discussing this topic and actually wanted
to propose it myself. I would add that part of the discussion was
proposing a new memory policy that would effectively enable per-VMA
node-reclaim-like behavior.
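
Purely as a strawman of what such a policy could look like from
userland (the MPOL_LOCAL_RECLAIM name is invented here; no such mode
exists):

#include <numaif.h>

/* Hypothetical: on this VMA only, prefer the local node and
 * reclaim/compact locally before falling back to remote nodes,
 * i.e. the old node-reclaim behavior scoped to one mapping. */
static long set_local_reclaim(void *addr, unsigned long len)
{
	return mbind(addr, len, MPOL_LOCAL_RECLAIM, NULL, 0, 0);
}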
-- 
Michal Hocko
SUSE Labs


* [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)
  2019-01-29 23:40 [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE Andrea Arcangeli
  2019-01-30  7:17 ` Michal Hocko
@ 2019-01-30  8:13 ` Mike Rapoport
  2019-01-30  9:23   ` Peter Xu
  2019-01-30 14:43   ` Andrea Arcangeli
  2019-01-30 23:14 ` [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE Mike Kravetz
  2019-02-01 14:17 ` Mel Gorman
  3 siblings, 2 replies; 8+ messages in thread
From: Mike Rapoport @ 2019-01-30  8:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: lsf-pc, linux-mm, linux-kernel, Peter Xu, Blake Caldwell,
	Mike Rapoport, Mike Kravetz, Michal Hocko, Mel Gorman,
	Vlastimil Babka, David Rientjes, Andrei Vagin, Pavel Emelyanov

Hi,

(changed the subject and added CRIU folks)

On Tue, Jan 29, 2019 at 06:40:58PM -0500, Andrea Arcangeli wrote:
> Hello,
> 
> --
> 
> In addition to the above "NUMA remote THP vs NUMA local non-THP
> tradeoff" topic, there are other developments in "userfaultfd" land that
> are approaching merge readiness and that would be possible to provide a
> short overview about:
> 
> - Peter Xu made significant progress in finalizing the userfaultfd-WP
>   support over the last few months. That feature was planned from the
>   start and it will allow userland to do some new things that weren't
>   possible to achieve before. In addition to synchronously blocking
>   write faults so they can be resolved by a userland manager, it also
>   has the ability to obsolete the softdirty feature, because it can
>   provide the same information, but with O(1) complexity (as opposed
>   to the current softdirty O(N) complexity), similarly to what Page
>   Modification Logging (PML) does in hardware for EPT write accesses.
 
We (CRIU) have some concerns about obsoleting soft-dirty in favor of
uffd-wp. If there are other soft-dirty users these concerns would be
relevant to them as well.

With soft-dirty we collect the information about the changed memory
every pre-dump iteration in the following manner (a sketch of the
scan loop follows the list):
* freeze the tasks
* find entries in /proc/pid/pagemap with SOFT_DIRTY set
* unfreeze the tasks
* dump the modified pages to disk/remote host
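
A minimal sketch of the middle two steps, assuming 4k pages (bit 55
of a pagemap entry is the soft-dirty bit, and writing "4" to
clear_refs resets it, as documented in
Documentation/admin-guide/mm/soft-dirty.rst); error handling is
elided and the tasks are assumed frozen by the caller:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PM_SOFT_DIRTY	(1ULL << 55)	/* pagemap soft-dirty bit */

static void scan_soft_dirty(pid_t pid, unsigned long start,
			    unsigned long nr_pages)
{
	char path[64];
	uint64_t pme;
	unsigned long i;
	FILE *f;

	/* Find the entries with SOFT_DIRTY set for this range. */
	snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
	f = fopen(path, "r");
	fseek(f, (start / 4096) * sizeof(pme), SEEK_SET);
	for (i = 0; i < nr_pages; i++) {
		fread(&pme, sizeof(pme), 1, f);
		if (pme & PM_SOFT_DIRTY)
			printf("dirty: %#lx\n", start + i * 4096);
	}
	fclose(f);

	/* Reset the soft-dirty bits before unfreezing the tasks. */
	snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
	f = fopen(path, "w");
	fputs("4", f);
	fclose(f);
}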

While we do need to traverse the /proc/pid/pagemap to identify dirty pages,
in between the pre-dump iterations and during the actual memory dump the
tasks are running freely.

If we are to switch to uffd-wp, every write by the snapshotted/migrated
task will incur latency of uffd-wp processing by the monitor.

We'd need to see how this affects overall slowdown of the workload under
migration before moving forward with obsoleting soft-dirty.

> - Blake Caldwell maintained the UFFDIO_REMAP support to atomically
>   remove memory from a mapping with userfaultfd (which can't be done
>   with a copy as in UFFDIO_COPY, and which requires a slow TLB flush
>   to be safe) as an alternative to host swapping (which of course
>   also requires a TLB flush for similar reasons). Notably
>   UFFDIO_REMAP was rightfully NAKed early on and quickly replaced by
>   UFFDIO_COPY, which is more optimal for adding memory to a mapping
>   in small chunks, but we can't remove memory with UFFDIO_COPY, and
>   UFFDIO_REMAP should be as efficient as it gets when it comes to
>   removing memory from a mapping.

If we are to discuss userfaultfd, I'd also like to bring up the
subject of COW mappings.
The pages populated with UFFDIO_COPY cannot be COW-shared between
related processes, which unnecessarily increases the memory footprint
of a migrated process tree.
I posted a patch [1] a (real) while ago, but nobody reacted and I've
put this aside.
Maybe it's time to discuss it again :)
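
Purely to illustrate the idea (the mode flag below is a name
invented here; for the actual proposed semantics see the patch [1]):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical: resolve the fault with a page that stays COW-shared
 * across the related (e.g. forked) processes, instead of a private
 * copy per process. This flag does not exist in the mainline uapi. */
#define UFFDIO_COPY_MODE_COW	(1ULL << 2)

static int uffd_copy_cow(int uffd, void *dst, void *src,
			 unsigned long len)
{
	struct uffdio_copy copy = {
		.dst = (unsigned long)dst,
		.src = (unsigned long)src,
		.len = len,
		.mode = UFFDIO_COPY_MODE_COW,	/* hypothetical */
	};
	return ioctl(uffd, UFFDIO_COPY, &copy);
}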

> Thank you,
> Andrea
> 

[1] https://lwn.net/ml/linux-api/20180328101729.GB1743%40rapoport-lnx/

-- 
Sincerely yours,
Mike.



* Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)
  2019-01-30  8:13 ` [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE) Mike Rapoport
@ 2019-01-30  9:23   ` Peter Xu
  2019-01-31  9:54     ` Mike Rapoport
  2019-01-30 14:43   ` Andrea Arcangeli
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Xu @ 2019-01-30  9:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrea Arcangeli, lsf-pc, linux-mm, linux-kernel, Blake Caldwell,
	Mike Rapoport, Mike Kravetz, Michal Hocko, Mel Gorman,
	Vlastimil Babka, David Rientjes, Andrei Vagin, Pavel Emelyanov

On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> Hi,
> 
> (changed the subject and added CRIU folks)
> 
> On Tue, Jan 29, 2019 at 06:40:58PM -0500, Andrea Arcangeli wrote:
> > Hello,
> > 
> > --
> > 
> > In addition to the above "NUMA remote THP vs NUMA local non-THP
> > tradeoff" topic, there are other developments in "userfaultfd" land that
> > are approaching merge readiness and that would be possible to provide a
> > short overview about:
> > 
> > - Peter Xu made significant progress in finalizing the userfaultfd-WP
> >   support over the last few months. That feature was planned from the
> >   start and it will allow userland to do some new things that weren't
> >   possible to achieve before. In addition to synchronously blocking
> >   write faults so they can be resolved by a userland manager, it also
> >   has the ability to obsolete the softdirty feature, because it can
> >   provide the same information, but with O(1) complexity (as opposed
> >   to the current softdirty O(N) complexity), similarly to what Page
> >   Modification Logging (PML) does in hardware for EPT write accesses.
>  
> We (CRIU) have some concerns about obsoleting soft-dirty in favor of
> uffd-wp. If there are other soft-dirty users these concerns would be
> relevant to them as well.
> 
> With soft-dirty we collect the information about the changed memory every
> pre-dump iteration in the following manner:
> * freeze the tasks
> * find entries in /proc/pid/pagemap with SOFT_DIRTY set
> * unfreeze the tasks
> * dump the modified pages to disk/remote host
> 
> While we do need to traverse the /proc/pid/pagemap to identify dirty pages,
> in between the pre-dump iterations and during the actual memory dump the
> tasks are running freely.
> 
> If we are to switch to uffd-wp, every write by the snapshotted/migrated
> task will incur latency of uffd-wp processing by the monitor.
> 
> We'd need to see how this affects overall slowdown of the workload under
> migration before moving forward with obsoleting soft-dirty.
> 
> > - Blake Caldwell maintained the UFFDIO_REMAP support to atomically
> >   remove memory from a mapping with userfaultfd (which can't be done
> >   with a copy as in UFFDIO_COPY, and which requires a slow TLB flush
> >   to be safe) as an alternative to host swapping (which of course
> >   also requires a TLB flush for similar reasons). Notably
> >   UFFDIO_REMAP was rightfully NAKed early on and quickly replaced by
> >   UFFDIO_COPY, which is more optimal for adding memory to a mapping
> >   in small chunks, but we can't remove memory with UFFDIO_COPY, and
> >   UFFDIO_REMAP should be as efficient as it gets when it comes to
> >   removing memory from a mapping.
> 
> > If we are to discuss userfaultfd, I'd also like to bring up the
> > subject of COW mappings.
> > The pages populated with UFFDIO_COPY cannot be COW-shared between
> > related processes, which unnecessarily increases the memory footprint
> > of a migrated process tree.
> > I posted a patch [1] a (real) while ago, but nobody reacted and I've
> > put this aside.
> > Maybe it's time to discuss it again :)

Hi, Mike,

It's interesting to learn about this work...

I really don't have much context on this, so sorry if I'm about to
ask a silly question... but when reading this I'm thinking of KSM.
I think KSM does not suit this case, since UFFDIO_COPY_COW carries
hinting information while KSM only scans over the pages between
processes, which seems to be O(N*N) assuming there are two
processes.  However, would it make any sense to provide a general
interface to scan for identical pages between any two processes
within a specific range and merge them if found (rather than an
interface specific to userfaultfd)?  Then it might even be used by
KSM admins (just as an example) when the admin knows that the memory
range (addr1, len) of process A very probably has much of the same
content as the memory range (addr2, len) of process B?  A purely
hypothetical sketch of such an interface follows.
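
Nothing like this exists today; the struct, the ioctl number and the
idea of driving it through a KSM device node are all invented here,
just to make the suggestion concrete:

#include <stdint.h>
#include <linux/ioctl.h>

/* Hypothetical uapi: scan (pid1, addr1, len) against
 * (pid2, addr2, len) and KSM-merge any pages whose contents match. */
struct ksm_merge_ranges {
	int32_t  pid1, pid2;
	uint64_t addr1, addr2;
	uint64_t len;
	uint64_t merged;	/* out: number of pages merged */
};

#define KSM_IOC_MERGE_RANGES	_IOWR('K', 0x01, struct ksm_merge_ranges)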

Thanks,

-- 
Peter Xu


* Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)
  2019-01-30  8:13 ` [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE) Mike Rapoport
  2019-01-30  9:23   ` Peter Xu
@ 2019-01-30 14:43   ` Andrea Arcangeli
  1 sibling, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 2019-01-30 14:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: lsf-pc, linux-mm, linux-kernel, Peter Xu, Blake Caldwell,
	Mike Rapoport, Mike Kravetz, Michal Hocko, Mel Gorman,
	Vlastimil Babka, David Rientjes, Andrei Vagin, Pavel Emelyanov

Hello Mike,

On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> We (CRIU) have some concerns about obsoleting soft-dirty in favor of
> uffd-wp. If there are other soft-dirty users these concerns would be
> relevant to them as well.
> 
> With soft-dirty we collect the information about the changed memory every
> pre-dump iteration in the following manner:
> * freeze the tasks
> * find entries in /proc/pid/pagemap with SOFT_DIRTY set
> * unfreeze the tasks
> * dump the modified pages to disk/remote host
> 
> While we do need to traverse the /proc/pid/pagemap to identify dirty pages,
> in between the pre-dump iterations and during the actual memory dump the
> tasks are running freely.
> 
> If we are to switch to uffd-wp, every write by the snapshotted/migrated
> task will incur latency of uffd-wp processing by the monitor.

That's a valid concern indeed.

I didn't go into the details of what is needed in addition to what
is already present in Peter's current patchset, but you're correct
that in order to perform well at the softdirty equivalent, we'll
also need to add an async event model.

The async event model would be selected during UFFD registration.
It'd work like async signals: the kernel queues up uffd events,
allocating each from a slab cache (not on the kernel stack of the
faulting process). Only if the monitor doesn't read() them fast
enough will the write-protect fault eventually block and release the
mmap_sem, but even in that case the page fault would always be
resolved by the kernel. The monitor just sees a stream of uffd_msg
structures to read, in multiples of the uffd_msg structure size,
with a single syscall per wakeup. Conceptually it'd work the same
way PML works for EPT. A sketch of the monitor's read loop follows
below.
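
A minimal sketch of that monitor loop, using the existing uffd
read() semantics (the fd is assumed O_NONBLOCK and registered with
the not-yet-existing async-WP mode; record_dirty() stands in for the
application's logging and error handling is elided):

#include <poll.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static void record_dirty(unsigned long long addr)
{
	/* application-specific dirty logging */
	(void)addr;
}

static void drain_wp_log(int uffd)
{
	struct uffd_msg msgs[64];
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	ssize_t n;
	int i;

	for (;;) {
		/* One wakeup, then one read() per batch of events. */
		poll(&pfd, 1, -1);
		n = read(uffd, msgs, sizeof(msgs));
		if (n <= 0)
			continue;
		/* read() returns a multiple of sizeof(uffd_msg). */
		for (i = 0; i < (int)(n / sizeof(msgs[0])); i++)
			if (msgs[i].event == UFFD_EVENT_PAGEFAULT)
				record_dirty(msgs[i].arg.pagefault.address);
	}
}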

The main downside will be an allocation per fault (soft dirty doesn't
need any such allocation), but no userland round-trip latency will
be added to the wrprotect fault that needs to be logged.

We need the synchronous/blocking uffd-wp for other things that aren't
related to soft dirty and can't be achieved with an async model like
softdirty's. Adding an async model later would be a self-contained
feature inside uffd.

So the idea would be to ignore any comparison with softdirty until
uffd-wp is finalized, and then evaluate the possibility of adding an
async model, which would be a simple thing to add in comparison to
the uffd-wp feature itself.

The theoretical expectation would be that softdirty would perform
better for small processes (but for those the overall logging
overhead is small anyway), while in the hundreds-of-gigabytes to
terabytes range, async uffd-wp should perform much better.

Thanks,
Andrea


* Re: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE
  2019-01-29 23:40 [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE Andrea Arcangeli
  2019-01-30  7:17 ` Michal Hocko
  2019-01-30  8:13 ` [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE) Mike Rapoport
@ 2019-01-30 23:14 ` Mike Kravetz
  2019-02-01 14:17 ` Mel Gorman
  3 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2019-01-30 23:14 UTC (permalink / raw)
  To: Andrea Arcangeli, lsf-pc, linux-mm, linux-kernel
  Cc: Peter Xu, Blake Caldwell, Mike Rapoport, Michal Hocko,
	Mel Gorman, Vlastimil Babka, David Rientjes

On 1/29/19 3:40 PM, Andrea Arcangeli wrote:
> In addition to the above "NUMA remote THP vs NUMA local non-THP
> tradeoff" topic, there are other developments in "userfaultfd" land that
> are approaching merge readiness and that would be possible to provide a
> short overview about:
> 
> - Peter Xu made significant progress in finalizing the userfaultfd-WP
>   support over the last few months. That feature was planned from the
>   start and it will allow userland to do some new things that weren't
>   possible to achieve before. In addition to synchronously blocking
>   write faults so they can be resolved by a userland manager, it also
>   has the ability to obsolete the softdirty feature, because it can
>   provide the same information, but with O(1) complexity (as opposed
>   to the current softdirty O(N) complexity), similarly to what Page
>   Modification Logging (PML) does in hardware for EPT write accesses.

I would be interested in this topic as well.  IIRC, Peter's patches do
not address hugetlbfs support.  I put together patches for this some
time back.  At the time, they worked as well as userfaultfd-WP support
for normal base pages: not too well :).  Once base page support is
finalized, I suspect I will be involved in hugetlbfs support.

-- 
Mike Kravetz


* Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)
  2019-01-30  9:23   ` Peter Xu
@ 2019-01-31  9:54     ` Mike Rapoport
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2019-01-31  9:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrea Arcangeli, lsf-pc, linux-mm, linux-kernel, Blake Caldwell,
	Mike Rapoport, Mike Kravetz, Michal Hocko, Mel Gorman,
	Vlastimil Babka, David Rientjes, Andrei Vagin, Pavel Emelyanov

Hi Peter,

On Wed, Jan 30, 2019 at 05:23:02PM +0800, Peter Xu wrote:
> On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> > 
> > If we are to discuss userfaultfd, I'd also like to bring up the
> > subject of COW mappings.
> > The pages populated with UFFDIO_COPY cannot be COW-shared between
> > related processes, which unnecessarily increases the memory footprint
> > of a migrated process tree.
> > I posted a patch [1] a (real) while ago, but nobody reacted and I've
> > put this aside.
> > Maybe it's time to discuss it again :)
> 
> Hi, Mike,
> 
> It's interesting to learn about this work...
> 
> I really don't have much context on this, so sorry if I'm about to
> ask a silly question... but when reading this I'm thinking of KSM.
> I think KSM does not suit this case, since UFFDIO_COPY_COW carries
> hinting information while KSM only scans over the pages between
> processes, which seems to be O(N*N) assuming there are two
> processes.  However, would it make any sense to provide a general
> interface to scan for identical pages between any two processes
> within a specific range and merge them if found (rather than an
> interface specific to userfaultfd)?  Then it might even be used by
> KSM admins (just as an example) when the admin knows that the memory
> range (addr1, len) of process A very probably has much of the same
> content as the memory range (addr2, len) of process B?

I haven't really thought about using KSM in our case. Our goal was
to make the VM layout of the migrated processes as close as possible
to the original, including the COW sharing between a parent process
and its descendants. For that, UFFDIO_COPY_COW seems to be a more
natural fit than KSM.

> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.



* Re: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE
  2019-01-29 23:40 [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2019-01-30 23:14 ` [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE Mike Kravetz
@ 2019-02-01 14:17 ` Mel Gorman
  3 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2019-02-01 14:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: lsf-pc, linux-mm, linux-kernel, Peter Xu, Blake Caldwell,
	Mike Rapoport, Mike Kravetz, Michal Hocko, Vlastimil Babka,
	David Rientjes

On Tue, Jan 29, 2019 at 06:40:58PM -0500, Andrea Arcangeli wrote:
> I posted some benchmark results showing that for tasks without strong
> NUMA locality the __GFP_THISNODE logic is not guaranteed to be optimal
> (and here of course I mean even if we ignore the large slowdown with
> swap storms at allocation time that might be caused by
> __GFP_THISNODE). The results also show NUMA remote THPs help
> intrasocket as well as intersocket.
> 
> https://lkml.kernel.org/r/20181210044916.GC24097@redhat.com
> https://lkml.kernel.org/r/20181212104418.GE1130@redhat.com
> 
> The following seems to be the interim conclusion, on which I happen
> to be in agreement with Michal and Mel:
> 
> https://lkml.kernel.org/r/20181212095051.GO1286@dhcp22.suse.cz
> https://lkml.kernel.org/r/20181212170016.GG1130@redhat.com
> 
> Hopefully this specific issue will be hot-fixed before April (we
> already had to hot-fix it in the enterprise kernels to prevent the
> 3-year-old regression from breaking large workloads that can't fit
> in a single NUMA node, and I assume other enterprise distributions
> will follow suit), but whatever the hot-fix turns out to be, it will
> likely leave ample margin for discussion on what we can do better to
> optimize the decision between local non-THP and remote THP under
> MADV_HUGEPAGE.
> 
> It is clear that the __GFP_THISNODE forced in the current code
> provides some minor advantage to apps using MADV_HUGEPAGE that can fit
> in a single NUMA node, but we should try to achieve it without major
> disadvantages to apps that can't fit in a single NUMA node.
> 
> For example, it was mentioned that we could allocate readily
> available, already-free local 4k pages if local compaction fails and
> the watermarks still allow local 4k allocations without invoking
> reclaim, before invoking compaction on remote nodes. The same can be
> repeated at a second level with intra-socket non-THP memory before
> invoking compaction inter-socket. However, we can't do things like
> that with the current page allocator workflow. It's possible that a
> larger change is required than just sending a single gfp bitflag
> down to the page allocator that creates an implicit MPOL_LOCAL
> binding to make it behave like the obsolete numa/zone-reclaim
> behavior, but weirdly applied only to THP allocations.
> 

I would also be interested in discussing this topic. My activity is
mostly compaction-related but I believe it will evolve into something
that returns more sane data to the page allocator. That should make it a
bit easier to detect when local compaction fails and make it easier to
improve the page allocator workflow without throwing another workload
under a bus.

-- 
Mel Gorman
SUSE Labs
