* [PATCH 0/2] fix for "pathological THP behavior"
@ 2018-08-20  3:22 Andrea Arcangeli
  2018-08-20  3:22 ` [PATCH 1/2] mm: thp: consolidate policy_nodemask call Andrea Arcangeli
                   ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-20  3:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Alex Williamson, David Rientjes, Vlastimil Babka

Hello,

we detected a regression compared to older kernels, only happening
with defrag=always or by using MADV_HUGEPAGE (and QEMU uses it).

I haven't bisected, but I suppose this started with commit
5265047ac30191ea24b16503165000c225f54feb combined with the earlier
commits that introduced the logic to not invoke reclaim for THP
allocations in the remote nodes.

Once I looked into it the problem was pretty obvious, and there are
two possible simple fixes: one is to not invoke reclaim and to stick
to compaction in the local node only (still the __GFP_THISNODE model).

This approach keeps the logic the same and prioritizes NUMA locality
over THP generation.

Then I'll send an alternative that drops the __GFP_THISNODE logic if
__GFP_DIRECT_RECLAIM is set. That however changes the behavior for
MADV_HUGEPAGE and prioritizes THP generation over NUMA locality.

A possible incremental improvement for this __GFP_COMPACT_ONLY
solution would be to remove __GFP_THISNODE (and in turn
__GFP_COMPACT_ONLY) after checking the watermarks, if there's no free
PAGE_SIZEd memory in the local node. However, checking the watermarks
in mempolicy.c is not ideal, so it would be a messier change, and it'd
still need to use __GFP_COMPACT_ONLY as implemented here for when
there's still PAGE_SIZEd free memory in the local node. That further
improvement wouldn't be necessary if there's agreement to prioritize
THP generation over NUMA locality (the alternative solution I'll send
in a separate post).

Andrea Arcangeli (2):
  mm: thp: consolidate policy_nodemask call
  mm: thp: fix transparent_hugepage/defrag = madvise || always

 include/linux/gfp.h | 18 ++++++++++++++++++
 mm/mempolicy.c      | 16 +++++++++++++---
 mm/page_alloc.c     |  4 ++++
 3 files changed, 35 insertions(+), 3 deletions(-)

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 1/2] mm: thp: consolidate policy_nodemask call
  2018-08-20  3:22 [PATCH 0/2] fix for "pathological THP behavior" Andrea Arcangeli
@ 2018-08-20  3:22 ` Andrea Arcangeli
  2018-08-20  3:22 ` [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
  2018-08-20 11:58 ` [PATCH 0/2] fix for "pathological THP behavior" Kirill A. Shutemov
  2 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-20  3:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Alex Williamson, David Rientjes, Vlastimil Babka

Just a minor cleanup.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mempolicy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01f1a14facc4..d6512ef28cde 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2026,6 +2026,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	nmask = policy_nodemask(gfp, pol);
+
 	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
 		int hpage_node = node;
 
@@ -2043,7 +2045,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 						!(pol->flags & MPOL_F_LOCAL))
 			hpage_node = pol->v.preferred_node;
 
-		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
 			page = __alloc_pages_node(hpage_node,
@@ -2052,7 +2053,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		}
 	}
 
-	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
 	mpol_cond_put(pol);


* [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-20  3:22 [PATCH 0/2] fix for "pathological THP behavior" Andrea Arcangeli
  2018-08-20  3:22 ` [PATCH 1/2] mm: thp: consolidate policy_nodemask call Andrea Arcangeli
@ 2018-08-20  3:22 ` Andrea Arcangeli
  2018-08-20  3:26   ` [PATCH 0/1] fix for "pathological THP behavior" v2 Andrea Arcangeli
                     ` (2 more replies)
  2018-08-20 11:58 ` [PATCH 0/2] fix for "pathological THP behavior" Kirill A. Shutemov
  2 siblings, 3 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-20  3:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Alex Williamson, David Rientjes, Vlastimil Babka

qemu uses MADV_HUGEPAGE which allows direct compaction (i.e.
__GFP_DIRECT_RECLAIM is set).

The problem is that direct compaction, combined with the NUMA
__GFP_THISNODE logic in mempolicy.c, tells reclaim to swap the local
node very hard instead of failing the allocation if there's no THP
available in the local node.

Such logic was ok until __GFP_THISNODE was added to the THP allocation
path even with MPOL_DEFAULT.

The idea behind the __GFP_THISNODE addition is that it is better to
provide local memory in PAGE_SIZE units than to use remote NUMA
THP-backed memory. That largely depends on the remote latency though;
on Threadrippers, for example, the overhead is relatively low in my
experience.

The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
extremely slow qemu startup with vfio, if the VM is larger than the
size of one host NUMA node. This is because, with __GFP_THISNODE set,
it will try very hard (and unsuccessfully) to swap out
get_user_pages-pinned pages, instead of falling back to PAGE_SIZE
allocations or trying to allocate THP on other nodes (it would be
even worse without the vfio type1 GUP pins of course, except it'd be
swapping heavily instead).

It's very easy to reproduce this by setting
transparent_hugepage/defrag to "always", even with a simple memhog.

1) This can be fixed by retaining the __GFP_THISNODE logic also for
   __GFP_DIRECT_RECLAIM by allowing only one compaction run. Not even
   COMPACT_SKIPPED (i.e. compaction failing because there is not
   enough free memory in the zone) should be allowed to invoke
   reclaim.

2) An alternative is to not use __GFP_THISNODE if __GFP_DIRECT_RECLAIM
   has been set by the caller (i.e. MADV_HUGEPAGE or
   defrag="always"). That would keep the NUMA locality restriction
   only when __GFP_DIRECT_RECLAIM is not set by the caller. So THP
   will be provided from remote nodes if available before falling back
   to PAGE_SIZE units in the local node, but an app using defrag =
   always (or madvise with MADV_HUGEPAGE) supposedly prefers that.

These are the results of 1) (higher GB/s is better).

Finished: 30 GB mapped, 10.188535s elapsed, 2.94GB/s
Finished: 34 GB mapped, 12.274777s elapsed, 2.77GB/s
Finished: 38 GB mapped, 13.847840s elapsed, 2.74GB/s
Finished: 42 GB mapped, 14.288587s elapsed, 2.94GB/s

Finished: 30 GB mapped, 8.907367s elapsed, 3.37GB/s
Finished: 34 GB mapped, 10.724797s elapsed, 3.17GB/s
Finished: 38 GB mapped, 14.272882s elapsed, 2.66GB/s
Finished: 42 GB mapped, 13.929525s elapsed, 3.02GB/s

These are the results of 2) (higher GB/s is better).

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

Finished: 30 GB mapped, 8.251396s elapsed, 3.64GB/s
Finished: 34 GB mapped, 12.093030s elapsed, 2.81GB/s
Finished: 38 GB mapped, 11.824903s elapsed, 3.21GB/s
Finished: 42 GB mapped, 15.950661s elapsed, 2.63GB/s

This is current upstream (higher GB/s is better).

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

vfio is a good test because by pinning all memory it avoids the
swapping, so reclaim only wastes CPU; a memhog-based test would create
swapout storms and supposedly show a bigger stddev.

What is better between 1) and 2) depends on the hardware and on the
software. Virtualization with EPT/NPT gets a bigger boost from THP
than host applications do.

This commit implements 1).

Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/gfp.h | 18 ++++++++++++++++++
 mm/mempolicy.c      | 12 +++++++++++-
 mm/page_alloc.c     |  4 ++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index a6afcec53795..3c04d5d90e6d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -44,6 +44,7 @@ struct vm_area_struct;
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
+#define ___GFP_ONLY_COMPACT	0x1000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -178,6 +179,21 @@ struct vm_area_struct;
  *   definitely preferable to use the flag rather than opencode endless
  *   loop around allocator.
  *   Using this flag for costly allocations is _highly_ discouraged.
+ *
+ * __GFP_ONLY_COMPACT: Only invoke compaction. Do not try to succeed
+ * the allocation by freeing memory. Never risk freeing any
+ * "PAGE_SIZE" memory unit even if compaction failed specifically
+ * because of not enough free pages in the zone. This only makes sense
+ * in combination with __GFP_THISNODE (enforced with a
+ * VM_WARN_ON), to restrict the THP allocation in the local node that
+ * triggered the page fault and fallback into PAGE_SIZE allocations in
+ * the same node. We don't want to invoke reclaim because there may be
+ * plenty of free memory already in the local node. More importantly
+ * there may be even plenty of free THP available in remote nodes so
+ * we should allocate those instead of reclaiming any
+ * memory in the local node. Implementation detail: set ___GFP_NORETRY
+ * too so that ___GFP_ONLY_COMPACT only needs to be checked in a slow
+ * path.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
@@ -187,6 +203,8 @@ struct vm_area_struct;
 #define __GFP_RETRY_MAYFAIL	((__force gfp_t)___GFP_RETRY_MAYFAIL)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
+#define __GFP_ONLY_COMPACT	((__force gfp_t)(___GFP_NORETRY | \
+						 ___GFP_ONLY_COMPACT))
 
 /*
  * Action modifiers
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d6512ef28cde..6bf839f20dcc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2047,8 +2047,18 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+			/*
+			 * We restricted the allocation to the
+			 * hpage_node so we must use
+			 * __GFP_ONLY_COMPACT to allow at most a
+			 * compaction attempt and not ever get into
+			 * reclaim or it'll swap heavily with
+			 * transparent_hugepage/defrag = always (or
+			 * madvise under MADV_HUGEPAGE).
+			 */
 			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+						  gfp | __GFP_THISNODE |
+						  __GFP_ONLY_COMPACT, order);
 			goto out;
 		}
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4be74e..01a5c2bd0860 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4144,6 +4144,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			 */
 			if (compact_result == COMPACT_DEFERRED)
 				goto nopage;
+			if (gfp_mask & __GFP_ONLY_COMPACT) {
+				VM_WARN_ON(!(gfp_mask & __GFP_THISNODE));
+				goto nopage;
+			}
 
 			/*
 			 * Looks like reclaim/compaction is worth trying, but


* [PATCH 0/1] fix for "pathological THP behavior" v2
  2018-08-20  3:22 ` [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
@ 2018-08-20  3:26   ` Andrea Arcangeli
  2018-08-20  3:26     ` [PATCH 1/1] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
  2018-08-20 12:35   ` [PATCH 2/2] " Zi Yan
  2018-08-21 11:50   ` Michal Hocko
  2 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-20  3:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Alex Williamson, David Rientjes, Vlastimil Babka

Hello,

This would be the other option which also works well, but changes the
behavior of MADV_HUGEPAGE and defrag="always" to prioritize THP
generation over NUMA locality.

Thanks,
Andrea

Andrea Arcangeli (1):
  mm: thp: fix transparent_hugepage/defrag = madvise || always

 mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)


* [PATCH 1/1] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-20  3:26   ` [PATCH 0/1] fix for "pathological THP behavior" v2 Andrea Arcangeli
@ 2018-08-20  3:26     ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-20  3:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Alex Williamson, David Rientjes, Vlastimil Babka

qemu uses MADV_HUGEPAGE which allows direct compaction (i.e.
__GFP_DIRECT_RECLAIM is set).

The problem is that direct compaction, combined with the NUMA
__GFP_THISNODE logic in mempolicy.c, tells reclaim to swap the local
node very hard instead of failing the allocation if there's no THP
available in the local node.

Such logic was ok until __GFP_THISNODE was added to the THP allocation
path even with MPOL_DEFAULT.

The idea behind the __GFP_THISNODE addition is that it is better to
provide local memory in PAGE_SIZE units than to use remote NUMA
THP-backed memory. That largely depends on the remote latency though;
on Threadrippers, for example, the overhead is relatively low in my
experience.

The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
extremely slow qemu startup with vfio, if the VM is larger than the
size of one host NUMA node. This is because, with __GFP_THISNODE set,
it will try very hard (and unsuccessfully) to swap out
get_user_pages-pinned pages, instead of falling back to PAGE_SIZE
allocations or trying to allocate THP on other nodes (it would be
even worse without the vfio type1 GUP pins of course, except it'd be
swapping heavily instead).

It's very easy to reproduce this by setting
transparent_hugepage/defrag to "always", even with a simple memhog.

1) This can be fixed by retaining the __GFP_THISNODE logic also for
   __GFP_DIRECT_RECLAIM by allowing only one compaction run. Not even
   COMPACT_SKIPPED (i.e. compaction failing because there is not
   enough free memory in the zone) should be allowed to invoke
   reclaim.

2) An alternative is to not use __GFP_THISNODE if __GFP_DIRECT_RECLAIM
   has been set by the caller (i.e. MADV_HUGEPAGE or
   defrag="always"). That would keep the NUMA locality restriction
   only when __GFP_DIRECT_RECLAIM is not set by the caller. So THP
   will be provided from remote nodes if available before falling back
   to PAGE_SIZE units in the local node, but an app using defrag =
   always (or madvise with MADV_HUGEPAGE) supposedly prefers that.

These are the results of 1) (higher GB/s is better).

Finished: 30 GB mapped, 10.188535s elapsed, 2.94GB/s
Finished: 34 GB mapped, 12.274777s elapsed, 2.77GB/s
Finished: 38 GB mapped, 13.847840s elapsed, 2.74GB/s
Finished: 42 GB mapped, 14.288587s elapsed, 2.94GB/s

Finished: 30 GB mapped, 8.907367s elapsed, 3.37GB/s
Finished: 34 GB mapped, 10.724797s elapsed, 3.17GB/s
Finished: 38 GB mapped, 14.272882s elapsed, 2.66GB/s
Finished: 42 GB mapped, 13.929525s elapsed, 3.02GB/s

These are the results of 2) (higher GB/s is better).

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

Finished: 30 GB mapped, 8.251396s elapsed, 3.64GB/s
Finished: 34 GB mapped, 12.093030s elapsed, 2.81GB/s
Finished: 38 GB mapped, 11.824903s elapsed, 3.21GB/s
Finished: 42 GB mapped, 15.950661s elapsed, 2.63GB/s

This is current upstream (higher GB/s is better).

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

vfio is a good test because by pinning all memory it avoids the
swapping, so reclaim only wastes CPU; a memhog-based test would create
swapout storms and supposedly show a bigger stddev.

What is better between 1) and 2) depends on the hardware and on the
software. Virtualization with EPT/NPT gets a bigger boost from THP
than host applications do.

This commit implements 2).

Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d6512ef28cde..fb7f9581a835 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2047,8 +2047,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+			/*
+			 * We cannot invoke reclaim if __GFP_THISNODE
+			 * is set. Invoking reclaim with
+			 * __GFP_THISNODE set, would cause THP
+			 * allocations to trigger heavy swapping
+			 * even though there may be tons of free memory
+			 * (including potentially plenty of THP
+			 * already available in the buddy) on all the
+			 * other NUMA nodes.
+			 *
+			 * At most we could invoke compaction when
+			 * __GFP_THISNODE is set (but we would need to
+			 * refrain from invoking reclaim even if
+			 * compaction returned COMPACT_SKIPPED because
+			 * there wasn't enough memory to succeed
+			 * compaction). For now just avoid
+			 * __GFP_THISNODE instead of limiting the
+			 * allocation path to a strict and single
+			 * compaction invocation.
+			 *
+			 * Supposedly if direct reclaim was enabled by
+			 * the caller, the app prefers THP regardless
+			 * of the node it comes from so this would be
+			 * more desirable behavior than only
+			 * providing THP originated from the local
+			 * node in such case.
+			 */
+			if (!(gfp & __GFP_DIRECT_RECLAIM))
+				gfp |= __GFP_THISNODE;
+			page = __alloc_pages_node(hpage_node, gfp, order);
 			goto out;
 		}
 	}


* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-20  3:22 [PATCH 0/2] fix for "pathological THP behavior" Andrea Arcangeli
  2018-08-20  3:22 ` [PATCH 1/2] mm: thp: consolidate policy_nodemask call Andrea Arcangeli
  2018-08-20  3:22 ` [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
@ 2018-08-20 11:58 ` Kirill A. Shutemov
  2018-08-20 15:19   ` Andrea Arcangeli
  2018-08-20 19:06   ` Yang Shi
  2 siblings, 2 replies; 77+ messages in thread
From: Kirill A. Shutemov @ 2018-08-20 11:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Sun, Aug 19, 2018 at 11:22:02PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> we detected a regression compared to older kernels, only happening
> with defrag=always or by using MADV_HUGEPAGE (and QEMU uses it).
> 
> I haven't bisected, but I suppose this started with commit
> 5265047ac30191ea24b16503165000c225f54feb combined with the earlier
> commits that introduced the logic to not invoke reclaim for THP
> allocations in the remote nodes.
> 
> Once I looked into it the problem was pretty obvious, and there are
> two possible simple fixes: one is to not invoke reclaim and to stick
> to compaction in the local node only (still the __GFP_THISNODE model).
> 
> This approach keeps the logic the same and prioritizes NUMA locality
> over THP generation.
> 
> Then I'll send an alternative that drops the __GFP_THISNODE logic if
> __GFP_DIRECT_RECLAIM is set. That however changes the behavior for
> MADV_HUGEPAGE and prioritizes THP generation over NUMA locality.
> 
> A possible incremental improvement for this __GFP_COMPACT_ONLY
> solution would be to remove __GFP_THISNODE (and in turn
> __GFP_COMPACT_ONLY) after checking the watermarks, if there's no free
> PAGE_SIZEd memory in the local node. However, checking the watermarks
> in mempolicy.c is not ideal, so it would be a messier change, and it'd
> still need to use __GFP_COMPACT_ONLY as implemented here for when
> there's still PAGE_SIZEd free memory in the local node. That further
> improvement wouldn't be necessary if there's agreement to prioritize
> THP generation over NUMA locality (the alternative solution I'll send
> in a separate post).

I personally prefer to prioritize NUMA locality over THP (the
__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction
well enough to Ack it.

-- 
 Kirill A. Shutemov


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-20  3:22 ` [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
  2018-08-20  3:26   ` [PATCH 0/1] fix for "pathological THP behavior" v2 Andrea Arcangeli
@ 2018-08-20 12:35   ` Zi Yan
  2018-08-20 15:32     ` Andrea Arcangeli
  2018-08-21 11:50   ` Michal Hocko
  2 siblings, 1 reply; 77+ messages in thread
From: Zi Yan @ 2018-08-20 12:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka


On 19 Aug 2018, at 23:22, Andrea Arcangeli wrote:

<snip>
>
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/gfp.h | 18 ++++++++++++++++++
>  mm/mempolicy.c      | 12 +++++++++++-
>  mm/page_alloc.c     |  4 ++++
>  3 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index a6afcec53795..3c04d5d90e6d 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -44,6 +44,7 @@ struct vm_area_struct;
>  #else
>  #define ___GFP_NOLOCKDEP	0
>  #endif
> +#define ___GFP_ONLY_COMPACT	0x1000000u
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
>  /*
> @@ -178,6 +179,21 @@ struct vm_area_struct;
>   *   definitely preferable to use the flag rather than opencode endless
>   *   loop around allocator.
>   *   Using this flag for costly allocations is _highly_ discouraged.
> + *
> + * __GFP_ONLY_COMPACT: Only invoke compaction. Do not try to succeed
> + * the allocation by freeing memory. Never risk freeing any
> + * "PAGE_SIZE" memory unit even if compaction failed specifically
> + * because of not enough free pages in the zone. This only makes sense
> + * in combination with __GFP_THISNODE (enforced with a
> + * VM_WARN_ON), to restrict the THP allocation in the local node that
> + * triggered the page fault and fallback into PAGE_SIZE allocations in
> + * the same node. We don't want to invoke reclaim because there may be
> + * plenty of free memory already in the local node. More importantly
> + * there may be even plenty of free THP available in remote nodes so
> + * we should allocate those instead of reclaiming any
> + * memory in the local node. Implementation detail: set ___GFP_NORETRY
> + * too so that ___GFP_ONLY_COMPACT only needs to be checked in a slow
> + * path.
>   */
>  #define __GFP_IO	((__force gfp_t)___GFP_IO)
>  #define __GFP_FS	((__force gfp_t)___GFP_FS)
> @@ -187,6 +203,8 @@ struct vm_area_struct;
>  #define __GFP_RETRY_MAYFAIL	((__force gfp_t)___GFP_RETRY_MAYFAIL)
>  #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
>  #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
> +#define __GFP_ONLY_COMPACT	((__force gfp_t)(___GFP_NORETRY | \
> +						 ___GFP_ONLY_COMPACT))
>
>  /*
>   * Action modifiers
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d6512ef28cde..6bf839f20dcc 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2047,8 +2047,18 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> +			/*
> +			 * We restricted the allocation to the
> +			 * hpage_node so we must use
> +			 * __GFP_ONLY_COMPACT to allow at most a
> +			 * compaction attempt and not ever get into
> +			 * reclaim or it'll swap heavily with
> +			 * transparent_hugepage/defrag = always (or
> +			 * madvise under MADV_HUGEPAGE).
> +			 */
>  			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> +						  gfp | __GFP_THISNODE |
> +						  __GFP_ONLY_COMPACT, order);
>  			goto out;
>  		}
>  	}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a790ef4be74e..01a5c2bd0860 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4144,6 +4144,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			 */
>  			if (compact_result == COMPACT_DEFERRED)
>  				goto nopage;
> +			if (gfp_mask & __GFP_ONLY_COMPACT) {
> +				VM_WARN_ON(!(gfp_mask & __GFP_THISNODE));
> +				goto nopage;
> +			}
>
>  			/*
>  			 * Looks like reclaim/compaction is worth trying, but


I think this can also be triggered in khugepaged. In collapse_huge_page(), khugepaged_alloc_page()
would also cause DIRECT_RECLAIM if defrag==always, since GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM.

But is it an expected behavior of khugepaged?


—
Best Regards,
Yan Zi



* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-20 11:58 ` [PATCH 0/2] fix for "pathological THP behavior" Kirill A. Shutemov
@ 2018-08-20 15:19   ` Andrea Arcangeli
  2018-08-21 15:30     ` Vlastimil Babka
  2018-08-20 19:06   ` Yang Shi
  1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-20 15:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

Hi Kirill,

On Mon, Aug 20, 2018 at 02:58:18PM +0300, Kirill A. Shutemov wrote:
> I personally prefer to prioritize NUMA locality over THP (the
> __GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction
> well enough to Ack it.

If we go in this direction, it'd be nice, after fixing the
showstopper bug, to proceed with an orthogonal optimization: check
the watermarks, and if they show there are no PAGE_SIZEd pages
available in the local node, remove both __GFP_THISNODE and
__GFP_COMPACT_ONLY.

If, on the other hand, there's still PAGE_SIZEd free memory in the
local node (not possible to compact for whatever reason), we should
stick to __GFP_THISNODE | __GFP_COMPACT_ONLY.

It's orthogonal because the above addition would make sense also in
the current (buggy) code.

The main implementation issue is that the watermark checking is not
well done in mempolicy.c but the place to clear __GFP_THISNODE and
__GFP_COMPACT_ONLY currently is there.

The case where the local node gets completely full, without even 4k
pages available, should be totally common: if you keep allocating,
and you allocate more than the size of a NUMA node, eventually you
fill the local node with THP and then consume all its 4k pages. At
that point you get into the case where the current code is totally
unable to allocate THP from the other nodes, which would be totally
possible to fix by removing __GFP_THISNODE | __GFP_COMPACT_ONLY after
the PAGE_SIZE watermark check.

I'm mentioning this optimization in this context, even if it's
orthogonal, because the alternative patch that prioritizes THP over
NUMA locality for MADV_HUGEPAGE and defrag=always would solve that
too with a one-liner, and there would be no need of watermark
checking and flipping gfp bits whatsoever. Once the local node is
full, THPs keep being provided as expected.

Thanks,
Andrea


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-20 12:35   ` [PATCH 2/2] " Zi Yan
@ 2018-08-20 15:32     ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-20 15:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

Hello,

On Mon, Aug 20, 2018 at 08:35:17AM -0400, Zi Yan wrote:
> I think this can also be triggered in khugepaged. In collapse_huge_page(), khugepaged_alloc_page()
> would also cause DIRECT_RECLAIM if defrag==always, since GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM.
> 
> But is it an expected behavior of khugepaged?

That's a good point, and the answer is not obvious. It's not an
apples-to-apples comparison, because khugepaged is not increasing the
overall memory footprint of the app. The pages that get compacted get
freed later. However, not all memory of the node may be 100%
compactable, so it's not perfectly a 1:1 exchange, and it could
require some swap to succeed compaction.

So we may want to look also at khugepaged later, but it's not obvious
it needs fixing too. Adding __GFP_COMPACT_ONLY would weaken
khugepaged a bit, especially if compaction returns COMPACT_SKIPPED.

As opposed, in the main allocation path (where the memory footprint
definitely increases) I even tried to still allow reclaim only for
COMPACT_SKIPPED, and it still caused swapout storms, because new THP
kept being added to the local node as the old memory was swapped out
to make more free memory to compact more 4k pages. Otherwise I could
have gotten away with using __GFP_NORETRY instead of
__GFP_COMPACT_ONLY, but it wasn't nearly enough.

Similarly to khugepaged, NUMA balancing also uses __GFP_THISNODE, but
if it decided to migrate a page to such a node, supposedly there's a
good reason to call reclaim to allocate the page there if needed.
That also frees memory on one node and adds memory to another node,
and it's not increasing the memory footprint overall (although unlike
khugepaged, it increases the cross-node footprint).

Thanks,
Andrea


* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-20 11:58 ` [PATCH 0/2] fix for "pathological THP behavior" Kirill A. Shutemov
  2018-08-20 15:19   ` Andrea Arcangeli
@ 2018-08-20 19:06   ` Yang Shi
  2018-08-20 23:24     ` Andrea Arcangeli
  1 sibling, 1 reply; 77+ messages in thread
From: Yang Shi @ 2018-08-20 19:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Linux MM, Alex Williamson,
	David Rientjes, Vlastimil Babka

On Mon, Aug 20, 2018 at 4:58 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Sun, Aug 19, 2018 at 11:22:02PM -0400, Andrea Arcangeli wrote:
>> Hello,
>>
>> we detected a regression compared to older kernels, only happening
>> with defrag=always or by using MADV_HUGEPAGE (and QEMU uses it).
>>
>> I haven't bisected but I suppose this started since commit
>> 5265047ac30191ea24b16503165000c225f54feb combined with previous
>> commits that introduced the logic to not try to invoke reclaim for THP
>> allocations in the remote nodes.
>>
>> Once I looked into it the problem was pretty obvious and there are two
>> possible simple fixes, one is not to invoke reclaim and stick to
>> compaction in the local node only (still __GFP_THISNODE model).
>>
>> This approach keeps the logic the same and prioritizes for NUMA
>> locality over THP generation.
>>
>> Then I'll send an alternative that drops the __GFP_THISNODE logic
>> if __GFP_DIRECT_RECLAIM is set. That however changes the behavior for
>> MADV_HUGEPAGE and prioritizes THP generation over NUMA locality.
>>
>> A possible incremental improvement for this __GFP_COMPACT_ONLY
>> solution would be to remove __GFP_THISNODE (and in turn
>> __GFP_COMPACT_ONLY) after checking the watermarks if there's no free
>> PAGE_SIZEd memory in the local node. However checking the watermarks
>> in mempolicy.c is not ideal so it would be a more messy change and
>> it'd still need to use __GFP_COMPACT_ONLY as implemented here for when
>> there's no PAGE_SIZEd free memory in the local node. That further
>> improvement wouldn't be necessary if there's agreement to prioritize
>> THP generation over NUMA locality (the alternative solution I'll send
>> in a separate post).
>
> I personally prefer to prioritize NUMA locality over THP
> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction good
> enough to Ack it.


Might approach #1 break the zone_reclaim_mode setting? Or might it
behave as though zone_reclaim_mode were set even when the knob is cleared?

Thanks,
Yang

>
> --
>  Kirill A. Shutemov
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-20 19:06   ` Yang Shi
@ 2018-08-20 23:24     ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-20 23:24 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill A. Shutemov, Andrew Morton, Linux MM, Alex Williamson,
	David Rientjes, Vlastimil Babka

Hello,

On Mon, Aug 20, 2018 at 12:06:11PM -0700, Yang Shi wrote:
> Might approach #1 break the zone_reclaim_mode setting? Or might it
> behave as though zone_reclaim_mode were set even when the knob is cleared?

Yes, the current MADV_HUGEPAGE THP default behavior is similar to
zone/node_reclaim_mode, and approach #1 won't change that.

The problem is that it behaved like the hardest kind of
zone/node_reclaim_mode: it wouldn't even refrain from unmap/writeback,
which zone/node_reclaim_mode can at least be told to avoid.

Approach #1 simply reduces the aggressiveness level from the hardest
kind of zone/node_reclaim_mode to something lighter than any reclaim
would be (i.e. no reclaim and only compaction, which of course only
makes sense for order > 0 allocations).

If THP fails then the PAGE_SIZE allocation fallback kicks in and it'll
spread to all nodes and it will invoke reclaim if needed. If it
invokes reclaim, it'll behave according to node_reclaim_mode if
set. There's no change to that part.
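The decision flow just described can be modeled in a few lines of
plain C. This is a userspace sketch only: the flag values, the helper
names (thp_fault_alloc, fault_in_page) and the two booleans standing
in for allocator state are all made up for illustration; only the
control flow mirrors approach #1.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits -- not the kernel's real values. */
#define __GFP_THISNODE     0x1u
#define __GFP_COMPACT_ONLY 0x2u

/* Stubbed allocator state for the model. */
static bool compact_succeeds;   /* does one compaction run produce a THP? */
static bool small_page_free;    /* is a PAGE_SIZE page available somewhere? */

/* Approach #1: for a THP fault, run compaction once in the local node
 * and fail instead of reclaiming, even on COMPACT_SKIPPED. */
static bool thp_fault_alloc(unsigned int gfp)
{
	if (gfp & __GFP_COMPACT_ONLY)
		return compact_succeeds;	/* never fall into reclaim */
	/* without the flag the slow path could keep reclaiming (the bug) */
	return compact_succeeds || small_page_free;
}

/* If the THP attempt fails, the fault handler falls back to a
 * PAGE_SIZE allocation that spans all nodes and may reclaim as usual. */
static bool fault_in_page(unsigned int gfp)
{
	if (thp_fault_alloc(gfp | __GFP_THISNODE | __GFP_COMPACT_ONLY))
		return true;			/* got a local THP */
	return small_page_free;			/* 4k fallback, any node */
}
```

The point the sketch makes is that a failed local compaction run
degrades straight to the 4k fallback instead of swapping out the local
node.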

When MADV_HUGEPAGE wasn't used or defrag wasn't set to "always", the
current code didn't even invoke compaction, but the whole point of
MADV_HUGEPAGE is to try to provide THP from the very first page fault,
so it's ok to pay the cost of compaction there because userland told
us those are long lived performance sensitive allocations.

What MADV_HUGEPAGE must not do is trigger a heavy swapout of the
memory in the local node when there may be plenty of free memory in
all the other nodes (even THP pages) and in the local node in
PAGE_SIZE fragments.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-20  3:22 ` [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
  2018-08-20  3:26   ` [PATCH 0/1] fix for "pathological THP behavior" v2 Andrea Arcangeli
  2018-08-20 12:35   ` [PATCH 2/2] " Zi Yan
@ 2018-08-21 11:50   ` Michal Hocko
  2018-08-21 21:40     ` Andrea Arcangeli
  2 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-21 11:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Sun 19-08-18 23:22:04, Andrea Arcangeli wrote:
> qemu uses MADV_HUGEPAGE which allows direct compaction (i.e.
> __GFP_DIRECT_RECLAIM is set).
> 
> The problem is that direct compaction combined with the NUMA
> __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap the
> local node very hard, instead of failing the allocation if there's no
> THP available in the local node.
> 
> Such logic was ok until __GFP_THISNODE was added to the THP allocation
> path even with MPOL_DEFAULT.
> 
> The idea behind the __GFP_THISNODE addition, is that it is better to
> provide local memory in PAGE_SIZE units than to use remote NUMA THP
> backed memory. That largely depends on the remote latency though, on
> threadrippers for example the overhead is relatively low in my
> experience.
> 
> The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> extremely slow qemu startup with vfio, if the VM is larger than the
> size of one host NUMA node. This is because it will try very hard to
> unsuccessfully swapout get_user_pages pinned pages as result of the
> __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> allocations and instead of trying to allocate THP on other nodes (it
> would be even worse without vfio type1 GUP pins of course, except it'd
> be swapping heavily instead).
> 
> It's very easy to reproduce this by setting
> transparent_hugepage/defrag to "always", even with a simple memhog.
> 
> 1) This can be fixed by retaining the __GFP_THISNODE logic also for
>    __GFP_DIRECT_RECLAIM by allowing only one compaction run. Not even
>    COMPACT_SKIPPED (i.e. compaction failing because not enough free
>    memory in the zone) should be allowed to invoke reclaim.
> 
> 2) An alternative is to not use __GFP_THISNODE if __GFP_DIRECT_RECLAIM
>    has been set by the caller (i.e. MADV_HUGEPAGE or
>    defrag="always"). That would keep the NUMA locality restriction
>    only when __GFP_DIRECT_RECLAIM is not set by the caller. So THP
>    will be provided from remote nodes if available before falling back
>    to PAGE_SIZE units in the local node, but an app using defrag =
>    always (or madvise with MADV_HUGEPAGE) supposedly prefers that.

So does reverting 5265047ac301 ("mm, thp: really limit transparent
hugepage allocation to local node") help?

I really detest a new gfp flag for one time semantic that is muddy as
hell.

> + * __GFP_ONLY_COMPACT: Only invoke compaction. Do not try to succeed
> + * the allocation by freeing memory. Never risk to free any
> + * "PAGE_SIZE" memory unit even if compaction failed specifically
> + * because of not enough free pages in the zone. This makes sense
> + * only in combination with __GFP_THISNODE (enforced with a
> + * VM_WARN_ON), to restrict the THP allocation in the local node that
> + * triggered the page fault and fallback into PAGE_SIZE allocations in
> + * the same node. We don't want to invoke reclaim because there may be
> + * plenty of free memory already in the local node. More importantly
> + * there may be even plenty of free THP available in remote nodes so
> + * we should allocate those, if anything, instead of reclaiming any
> + * memory in the local node. Implementation detail: set ___GFP_NORETRY
> + * too so that ___GFP_ONLY_COMPACT only needs to be checked in a slow
> + * path.

This is simply incomprehensible. How can anybody who is not deeply
familiar with the allocator/reclaim internals know when to use it.

If this is really a regression then we should start by pinpointing the
real culprit and go from there. If this is really 5265047ac301 then just
start by reverting it. I strongly suspect there is some mismatch in
expectations here. What some consider acceptable seems to be a problem
for others. I believe that was one of the reasons why we have changed
the default THP direct compaction behavior, no?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-20 15:19   ` Andrea Arcangeli
@ 2018-08-21 15:30     ` Vlastimil Babka
  2018-08-21 17:26       ` David Rientjes
                         ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Vlastimil Babka @ 2018-08-21 15:30 UTC (permalink / raw)
  To: Andrea Arcangeli, Kirill A. Shutemov
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes

On 8/20/18 5:19 PM, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Mon, Aug 20, 2018 at 02:58:18PM +0300, Kirill A. Shutemov wrote:
>> I personally prefer to prioritize NUMA locality over THP
>> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction good
>> enough to Ack it.
> 
> If we go in this direction it'd be nice after fixing the showstopper
> bug, if we could then proceed with an orthogonal optimization by
> checking the watermarks and if the watermarks shows there are no
> PAGE_SIZEd pages available in the local node we should remove both
> __GFP_THISNODE and __GFP_COMPACT_ONLY.
> 
> If as opposed there's still PAGE_SIZEd free memory in the local node
> (not possible to compact for whatever reason), we should stick to
> __GFP_THISNODE | __GFP_COMPACT_ONLY.

If it's "not possible to compact" then the expected outcome of this is
to fail?

> It's orthogonal because the above addition would make sense also in
> the current (buggy) code.
> 
> The main implementation issue is that the watermark checking is not
> well done in mempolicy.c but the place to clear __GFP_THISNODE and
> __GFP_COMPACT_ONLY currently is there.

You could do that without calling watermark checking explicitly, but
it's rather complicated:

> 1. try allocating with __GFP_THISNODE and ~__GFP_DIRECT_RECLAIM
2. if that fails, try PAGE_SIZE with same flags
3. if that fails, try THP size without __GFP_THISNODE
4. PAGE_SIZE without __GFP_THISNODE

Yeah, not possible in current alloc_pages_vma() which should return the
requested order. But the advantage is that it's not prone to races
between watermark checking and actual attempt.
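The four-step ladder could be sketched like this. It is a userspace
model, not a patch: alloc_attempt() is a hypothetical stand-in for the
real allocator, the flag bits are illustrative, and the attempt_ok
array simulates which attempt would succeed.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits -- not the kernel's real values. */
#define __GFP_THISNODE       0x1u
#define __GFP_DIRECT_RECLAIM 0x2u

#define THP_ORDER  9
#define PAGE_ORDER 0

static bool attempt_ok[5];		/* which step would succeed */

/* Stand-in for the real page allocator. */
static bool alloc_attempt(int step, int order, unsigned int gfp)
{
	(void)order; (void)gfp;		/* model: outcome is predetermined */
	return attempt_ok[step];
}

/* The proposed ladder: local and cheap first, remote and heavy last.
 * Returns which step satisfied the fault (0 = total failure). */
static int alloc_fault_ladder(void)
{
	if (alloc_attempt(1, THP_ORDER,  __GFP_THISNODE))
		return 1;	/* local THP, no direct reclaim */
	if (alloc_attempt(2, PAGE_ORDER, __GFP_THISNODE))
		return 2;	/* local 4k page, no direct reclaim */
	if (alloc_attempt(3, THP_ORDER,  __GFP_DIRECT_RECLAIM))
		return 3;	/* THP from any node, reclaim allowed */
	if (alloc_attempt(4, PAGE_ORDER, __GFP_DIRECT_RECLAIM))
		return 4;	/* last resort: 4k anywhere */
	return 0;
}
```

As noted above, the real alloc_pages_vma() must return the requested
order, so steps of different orders cannot live inside a single call
today; the sketch only shows the ordering of the attempts.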

> The case where the local node gets completely full, without even 4k
> pages available, should be totally common: if you keep allocating,
> and you allocate more than the size of a NUMA node, eventually you
> fill the local node with THP, then consume all its 4k pages, and then
> you get into the case where the current code is totally unable to
> allocate THP from the other nodes. That would be totally possible to
> fix by removing __GFP_THISNODE | __GFP_COMPACT_ONLY after the
> PAGE_SIZE watermark check.
> 
> I'm mentioning this optimization in this context, even if it's
> orthogonal, because the alternative patch that prioritizes THP over
> NUMA locality for MADV_HUGEPAGE and defer=always would solve that too
> with a one liner and there would be no need of watermark checking and
> flipping gfp bits whatsoever. Once the local node is full, THPs keeps
> being provided as expected.

Frankly, I would rather go with this option and assume that if someone
explicitly wants THP's, they don't care about NUMA locality that much.
(Note: I hate __GFP_THISNODE, it's an endless source of issues.)
Trying to be clever about "is there still PAGE_SIZEd free memory in the
local node" is imperfect anyway. If there isn't, is it because there's
clean page cache that we can easily reclaim (so it would be worth
staying local) or is it really exhausted? Watermark check won't tell...

Vlastimil

> Thanks,
> Andrea
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-21 15:30     ` Vlastimil Babka
@ 2018-08-21 17:26       ` David Rientjes
  2018-08-21 22:18         ` Andrea Arcangeli
  2018-08-21 22:05       ` Andrea Arcangeli
  2018-08-22  9:24       ` Michal Hocko
  2 siblings, 1 reply; 77+ messages in thread
From: David Rientjes @ 2018-08-21 17:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Alex Williamson

On Tue, 21 Aug 2018, Vlastimil Babka wrote:

> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THP's, they don't care about NUMA locality that much.
> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)
> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...
> 

MADV_HUGEPAGE (or defrag == "always") would now become a combination of 
"try to compact locally" and "allocate remotely if necessary" without the 
ability to avoid the latter absent a mempolicy that affects all memory 
allocations.  I think the complete solution would be a MPOL_F_HUGEPAGE 
flag that defines mempolicies for hugepage allocations.  In my experience 
thp falling back to remote nodes for intrasocket latency is a win but 
intersocket or two-hop intersocket latency is a no go.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-21 11:50   ` Michal Hocko
@ 2018-08-21 21:40     ` Andrea Arcangeli
  2018-08-22  9:02       ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-21 21:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> So does reverting 5265047ac301 ("mm, thp: really limit transparent
> hugepage allocation to local node") help?

That won't revert cleanly; to be sure you'd need to bisect around it,
which I haven't tried because I don't think there was doubt around it
(and only the part in mempolicy.c is relevant to it).

It's not so important to focus on that commit; the code changed again
later. I'm focusing on the code as a whole.

It's not just that commit that is the problem here: the problem starts
in the previous commits that added the NUMA locality logic, and the
addition of __GFP_THISNODE is the icing on the cake that makes things
fall apart.

> I really detest a new gfp flag for one time semantic that is muddy as
> hell.

Well, there's no way to fix this other than to prevent reclaim from
running, if you still want to give page faults a chance to obtain THP
under MADV_HUGEPAGE without waiting minutes or hours for khugepaged to
catch up.

> This is simply incomprehensible. How can anybody who is not deeply
> familiar with the allocator/reclaim internals know when to use it.

Nobody should use this in drivers, it's a __GFP flag.

Note:

	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {

#define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)

Only THP is ever affected by the BUG so nothing else will ever need to
call __GFP_COMPACT_ONLY. It is a VM-internal flag; I wish there were a
way to make the build fail if a driver used it, but there isn't
right now.

All other order 9 or 10 allocations from drivers won't call
alloc_hugepage_vma. Only mm/huge_memory.c ever calls that which is why
the regression only happens with MADV_HUGEPAGE (i.e. qemu is being
bitten badly on NUMA hosts).

> If this is really a regression then we should start by pinpointing the

You can check for yourself: create a 2-node vNUMA guest or pick any
host with more than one node. Set defrag=always and run "memhog
-r11111111 18g" if the host has 16g per node. Add some swap and notice
the swap storm while all RAM is left free in the other node.

> real culprit and go from there. If this is really 5265047ac301 then just

In my view there's no single culprit, but it was easy to identify the
last drop that made the MADV_HUGEPAGE glass overflow, and it's that
commit that adds __GFP_THISNODE. The combination of the previous code
that prioritized NUMA over THP, and the MADV_HUGEPAGE logic that still
uses compaction (and in turn reclaim, if compaction fails with
COMPACT_SKIPPED because there's no 4k page in the local node), just
falls apart with __GFP_THISNODE set on top of it. It doesn't do the
expected thing without that flag either (i.e. THP then gets priority
over NUMA locality).

__GFP_THISNODE and the logic there, only works ok when
__GFP_DIRECT_RECLAIM is not set, i.e. MADV_HUGEPAGE not set.
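The failure mode described here reduces to a tiny predicate. This is a
userspace model with made-up flag values; it only encodes the
observation that each flag is harmless alone and harmful in
combination.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits -- not the kernel's real values. */
#define __GFP_THISNODE       0x1u
#define __GFP_DIRECT_RECLAIM 0x2u

/* Per the discussion: __GFP_THISNODE alone merely prefers the local
 * node and fails gracefully; __GFP_DIRECT_RECLAIM alone reclaims
 * across the whole zonelist; the combination confines reclaim to a
 * single node, which is what produces the local swap storm. */
static bool risks_local_swap_storm(unsigned int gfp)
{
	return (gfp & __GFP_THISNODE) && (gfp & __GFP_DIRECT_RECLAIM);
}
```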

We don't want to wait hours for khugepaged to catch up in qemu to get
THP. Compaction is certainly worth it to run if the userland
explicitly gives the hint to the kernel the allocations are long
lived.

> start by reverting it. I strongly suspect there is some mismatch in
> expectations here. What others consider acceptable seems to be a problem
> for others. I believe that was one of the reasons why we have changed
> the default THP direct compaction behavior, no?

This is not about the "default" though; I'm not changing the default
either. This is about MADV_HUGEPAGE behavior and about changing the
defrag "default" value to "always" (which is equivalent to having all
vmas set with MADV_HUGEPAGE).

QEMU is being optimal in setting MADV_HUGEPAGE, and rightfully so, but
it's getting punished badly because of this kernel bug.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-21 15:30     ` Vlastimil Babka
  2018-08-21 17:26       ` David Rientjes
@ 2018-08-21 22:05       ` Andrea Arcangeli
  2018-08-22  9:24       ` Michal Hocko
  2 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-21 22:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes

On Tue, Aug 21, 2018 at 05:30:11PM +0200, Vlastimil Babka wrote:
> If it's "not possible to compact" then the expected outcome of this is
> to fail?

It'll just call __alloc_pages_direct_compact once and fail if that
fails.

> You could do that without calling watermark checking explicitly, but
> it's rather complicated:
> 
> 1. try alocating with __GFP_THISNODE and ~GFP_DIRECT_RECLAIM
> 2. if that fails, try PAGE_SIZE with same flags
> 3. if that fails, try THP size without __GFP_THISNODE
> 4. PAGE_SIZE without __GFP_THISNODE

It's not that it's complicated, it's slow: why call into the allocator
4 times just to skip 1 watermark check?

> Yeah, not possible in current alloc_pages_vma() which should return the
> requested order. But the advantage is that it's not prone to races
> between watermark checking and actual attempt.

Watermark checking is always racy, zone_watermark_fast doesn't take
any lock before invoking rmqueue.

The race in this case is the least of the issues because it doesn't
need to be perfect: if once in a while a THP is allocated from a
remote node despite there being a bit more PAGE_SIZEd memory free than
expected in the local node, it wouldn't be a problem. If it's wrong
the other way around, it'll just behave non-optimally (like today)
once in a while.

> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THP's, they don't care about NUMA locality that much.

I'm fine either way, both will work: large NUMA systems will prefer
the __GFP_COMPACT_ONLY option 1), small NUMA systems will likely
prefer option 2), and making this configurable at runtime with a
different default is possible later too, though I'm not sure it's
worth it.

The main benefit of 1) really is to cause the least possible
interference to NUMA balancing (and the further optimization possible
with the watermark check would not cause interference either).

> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)
> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...

I'm not sure it's worth reclaiming cache to stay local; it totally
depends on the running app, and whether reclaiming cache to stay local
pays off is even harder to tell. This is why, especially on small
NUMA, option 2) that removes __GFP_THISNODE is likely to perform best.

However, the tradeoff between reclaiming clean cache or not, combined
with whether to do the watermark check on the PAGE_SIZEd free memory,
only concerns the MADV_HUGEPAGE case.

The default case (not even calling compaction in the page fault) would
definitely benefit from the (imperfect racy) heuristic "is there still
PAGE_SIZEd free memory in the local node" and that remains true even
if we go with option 2) to solve the bug (option 2 only changes the
behavior of MADV_HUGEPAGE, not the default).

The imperfect watermark check is a net gain for the default case
without __GFP_DIRECT_RECLAIM set, because currently the code has zero
chance to get THP even if they're already free and available in other
nodes; it only tries to get them from the local node. It costs only a
watermark fast check to get THP from all nodes if already available
(or if made available by kswapd). It costs one cacheline just for the
pcp test, but then we're in the page allocator anyway, so all
cachelines involving watermarks may be activated regardless.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-21 17:26       ` David Rientjes
@ 2018-08-21 22:18         ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-21 22:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Alex Williamson

On Tue, Aug 21, 2018 at 10:26:54AM -0700, David Rientjes wrote:
> MADV_HUGEPAGE (or defrag == "always") would now become a combination of 
> "try to compact locally" and "allocate remotely if necesary" without the 
> ability to avoid the latter absent a mempolicy that affects all memory 

I don't follow why compaction should run only on the local node in
such case (i.e. __GFP_THISNODE removed when __GFP_DIRECT_RECLAIM is
set).

The zonelist will simply span all nodes so compaction & reclaim should
both run on all for MADV_HUGEPAGE with option 2).

The only mess there is in the allocator right now is that compaction
runs per zone and reclaim runs per node but that's another issue and
won't hurt for this case.

> allocations.  I think the complete solution would be a MPOL_F_HUGEPAGE 
> flag that defines mempolicies for hugepage allocations.  In my experience 
> thp falling back to remote nodes for intrasocket latency is a win but 
> intersocket or two-hop intersocket latency is a no go.

Yes, that's my expectation too.

So what you suggest is to add a new hard binding that allows altering
the default behavior for THP; that sure sounds fine.

We've still to pick the actual default and decide if a single default
is ok or it should be tunable or even change the default depending on
the NUMA topology.

I suspect it's a bit overkill to have different defaults depending on
NUMA topology. There have been defaults for obscure things like
numa_zonelist_order that changed behavior depending on number of nodes
and they happened to hurt on some system. I ended up tuning them to
the current default (until the runtime tuning was removed).

It's a bit hard to just pick the best just based on arbitrary things
like number of numa nodes or distance, especially when what is better
also depends on the actual application.

I think both options are sane behaviors with pros and cons: option 2)
is simpler and will likely perform better on smaller systems; option
1) is less risky on larger systems.

In any case the watermark optimization to set __GFP_THISNODE only if
there's plenty of PAGE_SIZEd memory in the local node, remains a valid
optimization for later for the default "defrag" value (i.e. no
MADV_HUGEPAGE) not setting __GFP_DIRECT_RECLAIM. If there's no RAM
free in the local node we can totally try to pick the THP from the
other nodes and not doing so only has the benefit of saving the
watermark check itself.
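The watermark refinement mentioned repeatedly in this thread can also
be written down as a sketch. This is a userspace model only:
node_has_free_base_pages() is a hypothetical helper standing in for a
real zone_watermark_ok()-style check, and the flag bits are
illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bit -- not the kernel's real value. */
#define __GFP_THISNODE 0x1u

/* Hypothetical: would a PAGE_SIZE allocation pass the local node's
 * watermarks right now? (Inherently racy, like zone_watermark_fast,
 * which the thread argues is acceptable here.) */
static bool local_node_has_4k_free;

static bool node_has_free_base_pages(void)
{
	return local_node_has_4k_free;
}

/* Keep THP faults local only while the local node could still satisfy
 * the PAGE_SIZE fallback; once it's full, let THP come from any node. */
static unsigned int thp_gfp_for_fault(unsigned int gfp)
{
	if (node_has_free_base_pages())
		return gfp | __GFP_THISNODE;	/* locality still worth it */
	return gfp & ~__GFP_THISNODE;		/* local node full: go wide */
}
```

The check being racy just means the allocator occasionally goes remote
(or stays local) when it didn't strictly have to, which the thread
treats as harmless.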

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-21 21:40     ` Andrea Arcangeli
@ 2018-08-22  9:02       ` Michal Hocko
  2018-08-22 11:07         ` Michal Hocko
  2018-08-22 15:52         ` Andrea Arcangeli
  0 siblings, 2 replies; 77+ messages in thread
From: Michal Hocko @ 2018-08-22  9:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Tue 21-08-18 17:40:49, Andrea Arcangeli wrote:
> On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
[...]
> > I really detest a new gfp flag for one time semantic that is muddy as
> > hell.
> 
> Well there's no way to fix this other than to prevent reclaim to run,
> if you still want to give a chance to page faults to obtain THP under
> MADV_HUGEPAGE in the page fault without waiting minutes or hours for
> khugpaged to catch up with it.

I do not get that part. Why should caller even care about reclaim vs.
compaction. How can you even make an educated guess what makes more
sense? This should be fully controlled by the allocator path. The caller
should only care about how hard to try. It's been some time since I've
looked, but we used to have gfp flags to tell that for THP allocations
as well.

> > This is simply incomprehensible. How can anybody who is not deeply
> > familiar with the allocator/reclaim internals know when to use it.
> 
> Nobody should use this in drivers, it's a __GFP flag.

Like other __GFP flags (e.g. __GFP_NOWARN, __GFP_COMP, __GFP_ZERO and
many others)?

> Note:
> 
> 	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> 
> #define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
> 	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
> 
> Only THP is ever affected by the BUG so nothing else will ever need to
> call __GFP_COMPACT_ONLY. It is a VM internal flag, I wish there was a
> way to make the build fail if a driver would use it but there isn't
> right now.

My experience tells me that there is nothing like an internal gfp flag
and abusing them is quite common and really hard to get rid of. Having a
THP specific gfp flag has also turned out to be a bad idea (e.g. GFP_THISNODE).

> > If this is really a regression then we should start by pinpointing the
> 
> You can check yourself, create a 2 node vnuma guest or pick any host
> with more than one node. Set defrag=always and run "memhog -r11111111
> 18g" if host has 16g per node. Add some swap and notice the swap storm
> while all ram is left free in the other node.

I am not disputing the bug itself. How hard defrag=always should
really try is a good question, and I would say different people would
have different ideas, but a swapping storm sounds like genuinely
unwanted behavior. I would expect that to be handled in
reclaim/compaction.
GFP_TRANSHUGE doesn't have ___GFP_RETRY_MAYFAIL so it shouldn't really
try too hard to reclaim.

> > real culprit and go from there. If this is really 5265047ac301 then just
> 
> In my view there's no single culprit, but it was easy to identify the
> last drop that made the MADV_HUGEPAGE glass overflow, and it's that
> commit that adds __GFP_THISNODE. The combination of the previous code
> that prioritized NUMA over THP and then the MADV_HUGEPAGE logic that
> still uses compaction (and in turn reclaim if compaction fails with
> COMPACT_SKIPPED because there's no 4k page in the local node) just
> falls apart with __GFP_THISNODE set as well on top of it and it
> doesn't do the expected thing either without it (i.e. THP gets
> priority over NUMA locality without such flag).
> 
> __GFP_THISNODE and the logic there, only works ok when
> __GFP_DIRECT_RECLAIM is not set, i.e. MADV_HUGEPAGE not set.

I still have to digest the __GFP_THISNODE thing but I _think_ that the
alloc_pages_vma code is just trying to be overly clever and
__GFP_THISNODE is not a good fit for it. 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-21 15:30     ` Vlastimil Babka
  2018-08-21 17:26       ` David Rientjes
  2018-08-21 22:05       ` Andrea Arcangeli
@ 2018-08-22  9:24       ` Michal Hocko
  2018-08-22 15:56         ` Andrea Arcangeli
  2 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-22  9:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes

On Tue 21-08-18 17:30:11, Vlastimil Babka wrote:
[...]
> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THP's, they don't care about NUMA locality that much.

Yes. And if they care enough to configure heavier fault-in treatment,
then they can handle numa policy as well.

> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)

Absolutely. It used to be a slab thing but then found interesting
abuse elsewhere, like here for THP. I am pretty sure that the
intention was not to stick to a specific node but rather to all local
nodes within the reclaim distance (or some other measure of nodes
being sufficiently close).

> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...

Exactly.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-22  9:02       ` Michal Hocko
@ 2018-08-22 11:07         ` Michal Hocko
  2018-08-22 14:24           ` Andrea Arcangeli
  2018-08-22 15:52         ` Andrea Arcangeli
  1 sibling, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-22 11:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Wed 22-08-18 11:02:14, Michal Hocko wrote:
> On Tue 21-08-18 17:40:49, Andrea Arcangeli wrote:
> > On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> [...]
> > > I really detest a new gfp flag for one time semantic that is muddy as
> > > hell.
> > 
> > Well there's no way to fix this other than to prevent reclaim to run,
> > if you still want to give a chance to page faults to obtain THP under
> > MADV_HUGEPAGE in the page fault without waiting minutes or hours for
> > khugpaged to catch up with it.
> 
> I do not get that part. Why should caller even care about reclaim vs.
> compaction. How can you even make an educated guess what makes more
> sense? This should be fully controlled by the allocator path. The caller
> should only care about how hard to try. It's been some time since I've
> looked but we used to have a gfp flags to tell that for THP allocations
> as well.

In other words, why do we even try to swap out when allocating costly
high order page for requests which do not insist to try really hard?

I mean why don't we do something like this?
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f86f288..41005d3d4c2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3071,6 +3071,14 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
 		return 1;
 
+	/*
+	 * If we are allocating a costly order and do not insist on trying really
+	 * hard then we should keep the reclaim impact at minimum. So only
+	 * focus on easily reclaimable memory.
+	 */
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
+		sc.may_swap = sc.may_unmap = 0;
+
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
 				sc.gfp_mask,
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-22 11:07         ` Michal Hocko
@ 2018-08-22 14:24           ` Andrea Arcangeli
  2018-08-22 14:45             ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-22 14:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Wed, Aug 22, 2018 at 01:07:37PM +0200, Michal Hocko wrote:
> On Wed 22-08-18 11:02:14, Michal Hocko wrote:
> > On Tue 21-08-18 17:40:49, Andrea Arcangeli wrote:
> > > On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> > [...]
> > > > I really detest a new gfp flag for one time semantic that is muddy as
> > > > hell.
> > > 
> > > Well there's no way to fix this other than to prevent reclaim from
> > > running, if you still want to give page faults a chance to obtain
> > > THP under MADV_HUGEPAGE without waiting minutes or hours for
> > > khugepaged to catch up with it.
> > 
> > I do not get that part. Why should caller even care about reclaim vs.
> > compaction. How can you even make an educated guess what makes more
> > sense? This should be fully controlled by the allocator path. The caller
> > should only care about how hard to try. It's been some time since I've
> > > looked but we used to have a gfp flag to tell that for THP allocations
> > as well.
> 
> In other words, why do we even try to swap out when allocating costly
> high order page for requests which do not insist to try really hard?

Note that the testcase with vfio swaps nothing and writes nothing to
disk. No memory at all is being swapped or freed because 100% of the
node is pinned with GUP pins, so I'm dubious this could possibly move
the needle for the reproducer that I used for the benchmark.

I suggested the swap storm to you as a reproducer because it's another
way the bug sees the light of day and it's easier to reproduce without
requiring device assignment, but the badness is the fact that reclaim
is called when it shouldn't be, and whatever fix we pick must cover
vfio too. I can't imagine how the change below could possibly have an
effect on vfio, and even for the swap storm case you're just
converting a swap storm into a CPU waste: allocations will still run
extremely slowly, like with vfio.

The effect of the change below should be evaluated regardless of the
issue we've been discussing in this thread, and it adds a new corner
case for order > PAGE_ALLOC_COSTLY_ORDER. I don't like order >
PAGE_ALLOC_COSTLY_ORDER checks very much; those are arbitrary numbers,
and the more such checks are needed in various places, the more it's a
sign the VM is arbitrary, with one more corner case required to hide
some badness. But again, this will have effects unrelated to what
we're discussing here: it will just convert I/O into CPU waste and
have no effect on vfio.

> 
> I mean why don't we do something like this?
> ---
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 03822f86f288..41005d3d4c2d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3071,6 +3071,14 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  	if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
>  		return 1;
>  
> +	/*
> +	 * If we are allocating a costly order and do not insist on trying really
> +	 * hard then we should keep the reclaim impact at minimum. So only
> +	 * focus on easily reclaimable memory.
> +	 */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> +		sc.may_swap = sc.may_unmap = 0;
> +
>  	trace_mm_vmscan_direct_reclaim_begin(order,
>  				sc.may_writepage,
>  				sc.gfp_mask,
> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-22 14:24           ` Andrea Arcangeli
@ 2018-08-22 14:45             ` Michal Hocko
  2018-08-22 15:24               ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-22 14:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Wed 22-08-18 10:24:46, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2018 at 01:07:37PM +0200, Michal Hocko wrote:
> > On Wed 22-08-18 11:02:14, Michal Hocko wrote:
> > > On Tue 21-08-18 17:40:49, Andrea Arcangeli wrote:
> > > > On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I really detest a new gfp flag for one time semantic that is muddy as
> > > > > hell.
> > > > 
> > > > Well there's no way to fix this other than to prevent reclaim from
> > > > running, if you still want to give page faults a chance to obtain
> > > > THP under MADV_HUGEPAGE without waiting minutes or hours for
> > > > khugepaged to catch up with it.
> > > 
> > > I do not get that part. Why should caller even care about reclaim vs.
> > > compaction. How can you even make an educated guess what makes more
> > > sense? This should be fully controlled by the allocator path. The caller
> > > should only care about how hard to try. It's been some time since I've
> > > > looked but we used to have a gfp flag to tell that for THP allocations
> > > as well.
> > 
> > In other words, why do we even try to swap out when allocating costly
> > high order page for requests which do not insist to try really hard?
> 
> Note that the testcase with vfio swaps nothing and writes nothing to
> disk. No memory at all is being swapped or freed because 100% of the
> node is pinned with GUP pins, so I'm dubious this could possibly move
> the needle for the reproducer that I used for the benchmark.

Now I am confused. How can compaction help at all then? I mean if the
node is full of GUP pins then you can hardly do anything but fall back
to another node. Or how come your new GFP flag makes any difference?

> The swap storm I suggested to you as reproducer, because it's another
> way the bug would see the light of the day and it's easier to
> reproduce without requiring device assignment, but the badness is the
> fact reclaim is called when it shouldn't be and whatever fix must
> cover vfio too. The below I can't imagine how it could possibly have
> an effect on vfio, and even for the swap storm case you're converting
> a swap storm into a CPU waste, it'll still run just extremely slow
> allocations like with vfio.

It would still try to reclaim easy targets, as compaction requires. If
you do not reclaim at all you can make the current implementation of
compaction a noop due to its own watermark checks, IIRC.

> The effect of the below should be evaluated regardless of the issue
> we've been discussing in this thread and it's a new corner case for
> order > PAGE_ALLOC_COSTLY_ORDER. I don't like very much order >
> PAGE_ALLOC_COSTLY_ORDER checks, those are arbitrary numbers, the more
> checks are needed in various places for that, the more it's a sign the
> VM is bad and arbitrary and with one more corner case required to hide
> some badness. But again this will have effects unrelated to what we're
> discussing here and it will just convert I/O into CPU waste and have
> no effect on vfio.

yeah, I agree about PAGE_ALLOC_COSTLY_ORDER being an arbitrary limit for
a different behavior. But we already do handle those specially so it
kind of makes sense to me to expand on that.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-22 14:45             ` Michal Hocko
@ 2018-08-22 15:24               ` Andrea Arcangeli
  2018-08-23 10:50                 ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-22 15:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Wed, Aug 22, 2018 at 04:45:17PM +0200, Michal Hocko wrote:
> Now I am confused. How can compaction help at all then? I mean if the
> node is full of GUP pins then you can hardly do anything but fall back
> to another node. Or how come your new GFP flag makes any difference?

It helps until the node is full.

If you don't call compaction you will get zero THP even when you've
plenty of free memory.

So the free memory goes down and down as more and more THP are
generated by compaction, until compaction fails with COMPACT_SKIPPED:
there's not enough free memory left to relocate an "order 9" amount of
physically contiguous PAGE_SIZEd fragments.

At that point the code calls reclaim to make space for a new
compaction run. If that run then fails again, it's not because there's
not enough free memory.

The problem is that if you call reclaim whenever compaction fails,
what happens is you free an "order 9" amount, it gets eaten up by the
app, and then the next compaction call returns COMPACT_SKIPPED again.

This is how compaction has worked since day zero, when it was
introduced somewhere in the 2.6.x kernels; if the inner workings of
compaction aren't crystal clear to you, you'll have a hard time
reviewing this. So I hope the above sheds some light on how this plays
out.

So in general calling reclaim is ok, because in such a case compaction
fails more often than not because it can't compact memory, not because
there aren't at least 2M free in some node. However, using
__GFP_THISNODE combined with reclaim changes the whole angle and
behavior of compaction if reclaim is still active.

Not calling compaction for MADV_HUGEPAGE means you can drop
MADV_HUGEPAGE as a whole: there's no point in ever setting it unless
we call compaction. And if you don't call compaction at least once you
have near zero chance of getting gigabytes of THP, even if it's all
compactable memory and there are gigabytes of free memory in the node,
after some runtime has shaken up the fragments in the buddy allocator.

To make it even more clear why compaction has to run at least once
when MADV_HUGEPAGE is set, just check the second-to-last column of
your /proc/buddyinfo before and after "echo 3 >/proc/sys/vm/drop_caches;
echo >/proc/sys/vm/compact_memory". Try to allocate memory without
MADV_HUGEPAGE and without running the "echo 3; echo" and see how much
THP you'll get. I have plenty of workloads, not just qemu, that use
MADV_HUGEPAGE and benefit immediately from THP, and there's no point
in ever deferring compaction to khugepaged when userland says "this is
a long lived allocation".

Compaction is only potentially wasteful for short lived allocation, so
MADV_HUGEPAGE has to call compaction.

> It would still try to reclaim easy target as compaction requires. If you
> do not reclaim at all you can make the current implementation of the
> compaction noop due to its own watermark checks IIRC.

That's the feature: if you don't make it a noop when the watermark
checks trigger, it'll end up wasting CPU and breaking vfio.

The point is that we want compaction to run when there's free memory
and compaction keeps succeeding.

So when compaction fails, if it's because we exhausted all free memory
in the node, we should just remove __GFP_THISNODE and allocate without
it (i.e. the optimization). If compaction fails because the memory is
fragmented but there's still free memory, we should fail the
allocation and trigger the THP fallback to a PAGE_SIZE fault.

Overall removing __GFP_THISNODE unconditionally would simply
prioritize THP over NUMA locality which is the point of this special
logic for THP. I can't blame the logic because it certainly helps NUMA
balancing a lot in letting the memory be in the right place from the
start. This is why __GFP_COMPACT_ONLY makes sense, to be able to
retain the logic but still preventing the corner case of such
__GFP_THISNODE that breaks the VM with MADV_HUGEPAGE.

> yeah, I agree about PAGE_ALLOC_COSTLY_ORDER being an arbitrary limit for
> a different behavior. But we already do handle those specially so it
> kind of makes sense to me to expand on that.

It's still a sign of one more place that needs magic for whatever
reason. So unless it can be justified by some runtime tests I wouldn't
make such a change just by thinking about it. Reclaim is called only
if there's no free memory left anywhere for compaction to run (i.e. if
__GFP_THISNODE is not set; if __GFP_THISNODE is set then the caller
had better use __GFP_COMPACT_ONLY).

Now we could also get away without __GFP_COMPACT_ONLY: we could check
__GFP_THISNODE and make it behave exactly like __GFP_COMPACT_ONLY
whenever __GFP_DIRECT_RECLAIM is also set in addition to
__GFP_THISNODE, but then you couldn't use __GFP_THISNODE as an mbind
anymore and it would have more obscure semantics than a new flag, I
think.

Thanks,
Andrea


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-22  9:02       ` Michal Hocko
  2018-08-22 11:07         ` Michal Hocko
@ 2018-08-22 15:52         ` Andrea Arcangeli
  2018-08-23 10:52           ` Michal Hocko
  1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-22 15:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
> I am not disputing the bug itself. How hard should defrag=allways really
> try is good question and I would say different people would have
> different ideas but a swapping storm sounds like genuinely unwanted
> behavior. I would expect that to be handled in the reclaim/compaction.
> GFP_TRANSHUGE doesn't have ___GFP_RETRY_MAYFAIL so it shouldn't really
> try too hard to reclaim.

Everything was ok as long as the THP allocation wasn't bound to the
local node with __GFP_THISNODE. Calling reclaim to free memory is part
of how compaction works if all free memory has been extinguished from
all nodes. At that point it's much more likely compaction fails not
because there isn't at least 2M free but because all memory is
fragmented. So it's true that MADV_HUGEPAGE may run better on non-NUMA
by not setting __GFP_COMPACT_ONLY (i.e. like right now, where
__GFP_THISNODE would be a noop there).

As for how hard defrag=always should try: I think it should at least
call compaction once, so that at least when there's plenty of free
memory in the local node it'll have a chance. It sounds like a sure
win that way.

Calling compaction with __GFP_THISNODE will at least defrag all the
free memory, which gives MADV_HUGEPAGE a chance.

> I still have to digest the __GFP_THISNODE thing but I _think_ that the
> alloc_pages_vma code is just trying to be overly clever and
> __GFP_THISNODE is not a good fit for it. 

My option 2 did just that: it removed __GFP_THISNODE, but only for
MADV_HUGEPAGE and in general whenever reclaim was activated by
__GFP_DIRECT_RECLAIM. That is also a signal that the user really wants
THP, so then it's less bad to prefer THP over NUMA locality.

For the default, which is tuned for short lived allocations, preferring
local memory is almost certainly the better win, since THP can't help
much for short lived allocations; this is why I didn't remove
__GFP_THISNODE from the default defrag policy.

Thanks,
Andrea


* Re: [PATCH 0/2] fix for "pathological THP behavior"
  2018-08-22  9:24       ` Michal Hocko
@ 2018-08-22 15:56         ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2018-08-22 15:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes

On Wed, Aug 22, 2018 at 11:24:40AM +0200, Michal Hocko wrote:
> abuse elsewhere. Like here for the THP. I am pretty sure that the
> intention was not to stick to a specific node but rather all local nodes
> within the reclaim distance (or other unit to define that nodes are
> sufficiently close).

If it meant more than one node but still not all, the same problem
would then happen after the app used all the RAM from those
"sufficiently close" nodes. So overall it would make zero difference;
it would just kick the bug a bit further down the road.

Thanks,
Andrea


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-22 15:24               ` Andrea Arcangeli
@ 2018-08-23 10:50                 ` Michal Hocko
  0 siblings, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2018-08-23 10:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Wed 22-08-18 11:24:02, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2018 at 04:45:17PM +0200, Michal Hocko wrote:
> > Now I am confused. How can compaction help at all then? I mean if the
> > node is full of GUP pins then you can hardly do anything but fall back
> > to another node. Or how come your new GFP flag makes any difference?
> 
> It helps until the node is full.
> 
> If you don't call compaction you will get zero THP even when you've
> plenty of free memory.
> 
> So the free memory goes down and down as more and more THP are
> generated by compaction until compaction then fails with
> COMPACT_SKIPPED, there's not enough free memory to relocate an "order
> 9" amount of physically contiguous PAGE_SIZEd fragments.
> 
> At that point the code calls reclaim to make space for a new
> compaction run. Then if that fails again it's not because there's not
> enough free memory.
> 
> Problem is if you ever call reclaim when compaction fails, what
> happens is you free an "order 9" and then it gets eaten up by the app
> so then next compaction call, calls COMPACT_SKIPPED again.
> 
> This is how compaction works since day zero it was introduced in
> kernel 2.6.x something, if you don't have crystal clear the inner
> workings of compaction you have an hard time to review this. So hope
> the above shed some light of how this plays out.
> 
> So in general calling reclaim is ok because compaction fails more
> often than not in such case because it can't compact memory not
> because there aren't at least 2m free in any node. However when you use
> __GFP_THISNODE combined with reclaim that changes the whole angle and
> behavior of compaction if reclaim is still active.
> 
> Not calling compaction in MADV_HUGEPAGE means you can drop
> MADV_HUGEPAGE as a whole. There's no point to ever set it unless we
> call compaction. And if you don't call at least once compaction you
> have near zero chances to get gigabytes of THP even if it's all
> compactable memory and there are gigabytes of free memory in the node,
> after some runtime that shakes the fragments in the buddy.
> 
> To make it even more clear why compaction has to run once at least
> when MADV_HUGEPAGE is set, just check the second last column of your
> /proc/buddyinfo before and after "echo 3 >/proc/sys/vm/drop_caches;
> echo >/proc/sys/vm/compact_memory". Try to allocate memory without
> MADV_HUGEPAGE and without running the "echo 3; echo" and see how much
> THP you'll get. I've plenty of workloads that use MADV_HUGEPAGE not
> just qemu and that totally benefit immediately from THP and there's no
> point to ever defer compaction to khugepaged when userland says "this
> is a long lived allocation".
> 
> Compaction is only potentially wasteful for short lived allocation, so
> MADV_HUGEPAGE has to call compaction.

I guess you have missed my point. I was not suggesting compaction is
pointless. I meant to say: how can compaction be useful in the scenario
you were suggesting, when the node is full of pinned pages?

> > It would still try to reclaim easy targets, as compaction requires. If
> > you do not reclaim at all you can make the current implementation of
> > compaction a noop due to its own watermark checks, IIRC.
> 
> That's the feature, if you don't make it a noop when watermark checks
> trigger, it'll end up wasting CPU and breaking vfio.
> 
> The point is that we want compaction to run when there's free memory
> and compaction keeps succeeding.
> 
> So when compaction fails, if it's because we finished all free memory
> in the node, we should just remove __GFP_THISNODE and allocate without
> it (i.e. the optimization). If compaction fails because the memory is
> fragmented but here's still free memory we should fail the allocation
> and trigger the THP fallback to PAGE_SIZE fault.
> 
> Overall removing __GFP_THISNODE unconditionally would simply
> prioritize THP over NUMA locality which is the point of this special
> logic for THP. I can't blame the logic because it certainly helps NUMA
> balancing a lot in letting the memory be in the right place from the
> start. This is why __GFP_COMPACT_ONLY makes sense, to be able to
> retain the logic but still preventing the corner case of such
> __GFP_THISNODE that breaks the VM with MADV_HUGEPAGE.

But __GFP_COMPACT_ONLY is a layering violation, because compaction
does depend on reclaim right now.
 
> > yeah, I agree about PAGE_ALLOC_COSTLY_ORDER being an arbitrary limit for
> > a different behavior. But we already do handle those specially so it
> > kind of makes sense to me to expand on that.
> 
> It's still a sign of one more place that needs magic for whatever
> reason. So unless it can be justified by some runtime tests I wouldn't
> make such change by just thinking about it. Reclaim is called if
> there's no free memory left anywhere for compaction to run (i.e. if
> __GFP_THISNODE is not set, if __GFP_THISNODE is set then the caller
> better use __GFP_COMPACT_ONLY).

I am not insisting on the hack I have proposed; it was mostly for the
sake of discussion. But I _strongly_ believe that __GFP_COMPACT_ONLY
is the wrong way around the issue. We are revolving around
__GFP_THISNODE having a negative side effect, and that is exactly an
example of gfp flag abuse for internal MM stuff which just happens to
be a complete PITA for a long time.
 
> Now we could also get away without __GFP_COMPACT_ONLY, we could check
> __GFP_THISNODE and make it behave exactly like __GFP_COMPACT_ONLY
> whenever __GFP_DIRECT_RECLAIM was also set in addition of
> __GFP_THISNODE, but then you couldn't use __GFP_THISNODE as a mbind
> anymore and it would have more obscure semantics than a new flag I
> think.

Or simply do not play tricks with __GFP_THISNODE.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-22 15:52         ` Andrea Arcangeli
@ 2018-08-23 10:52           ` Michal Hocko
  2018-08-28  7:53             ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-23 10:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
[...]
> > I still have to digest the __GFP_THISNODE thing but I _think_ that the
> > alloc_pages_vma code is just trying to be overly clever and
> > __GFP_THISNODE is not a good fit for it. 
> 
> My option 2 did just that, it removed __GFP_THISNODE but only for
> MADV_HUGEPAGE and in general whenever reclaim was activated by
> __GFP_DIRECT_RECLAIM. That is also signal that the user really wants
> THP so then it's less bad to prefer THP over NUMA locality.
> 
> For the default which is tuned for short lived allocation, preferring
> local memory is most certainly better win for short lived allocation
> where THP can't help much, this is why I didn't remove __GFP_THISNODE
> from the default defrag policy.

Yes I agree.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-23 10:52           ` Michal Hocko
@ 2018-08-28  7:53             ` Michal Hocko
  2018-08-28  8:18               ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-28  7:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka

On Thu 23-08-18 12:52:53, Michal Hocko wrote:
> On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
> > On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
> [...]
> > > I still have to digest the __GFP_THISNODE thing but I _think_ that the
> > > alloc_pages_vma code is just trying to be overly clever and
> > > __GFP_THISNODE is not a good fit for it. 
> > 
> > My option 2 did just that, it removed __GFP_THISNODE but only for
> > MADV_HUGEPAGE and in general whenever reclaim was activated by
> > __GFP_DIRECT_RECLAIM. That is also signal that the user really wants
> > THP so then it's less bad to prefer THP over NUMA locality.
> > 
> > For the default which is tuned for short lived allocation, preferring
> > local memory is most certainly better win for short lived allocation
> > where THP can't help much, this is why I didn't remove __GFP_THISNODE
> > from the default defrag policy.
> 
> Yes I agree.

I finally got back to this again. I have checked your patch and I am
really wondering whether alloc_pages_vma is really the proper place to
play these tricks. We already have that mind blowing
alloc_hugepage_direct_gfpmask and it should be the proper place to
handle this special casing. So what do you think about the following?
It should be essentially the same thing, aka use __GFP_THISNODE only
when we are doing an optimistic THP allocation. Madvise signals you
know what you are doing and THP has the top priority. If you care
enough about the numa placement then you should better use mempolicy.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..3cdb62f6aea7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -634,16 +634,16 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | __GFP_THISNODE);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | __GFP_THISNODE);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     __GFP_THISNODE);
+	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..9f0800885613 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-28  7:53             ` Michal Hocko
@ 2018-08-28  8:18               ` Michal Hocko
  2018-08-28  8:54                 ` Stefan Priebe - Profihost AG
       [not found]                 ` <D5F4A33C-0A37-495C-9468-D6866A862097@cs.rutgers.edu>
  0 siblings, 2 replies; 77+ messages in thread
From: Michal Hocko @ 2018-08-28  8:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka, Stefan Priebe - Profihost AG

[CC Stefan Priebe who has reported the same/similar issue on openSUSE
 mailing list recently - the thread starts http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com]

On Tue 28-08-18 09:53:21, Michal Hocko wrote:
> On Thu 23-08-18 12:52:53, Michal Hocko wrote:
> > On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
> > > On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
> > [...]
> > > > I still have to digest the __GFP_THISNODE thing but I _think_ that the
> > > > alloc_pages_vma code is just trying to be overly clever and
> > > > __GFP_THISNODE is not a good fit for it. 
> > > 
> > > My option 2 did just that, it removed __GFP_THISNODE but only for
> > > MADV_HUGEPAGE and in general whenever reclaim was activated by
> > > __GFP_DIRECT_RECLAIM. That is also signal that the user really wants
> > > THP so then it's less bad to prefer THP over NUMA locality.
> > > 
> > > For the default which is tuned for short lived allocation, preferring
> > > local memory is most certainly better win for short lived allocation
> > > where THP can't help much, this is why I didn't remove __GFP_THISNODE
> > > from the default defrag policy.
> > 
> > Yes I agree.
> 
> I finally got back to this again. I have checked your patch and I am
> really wondering whether alloc_pages_vma is really the proper place to
> play these tricks. We already have that mind blowing alloc_hugepage_direct_gfpmask
> and it should be the proper place to handle this special casing. So what
> do you think about the following. It should be essentially the same
> thing. Aka use __GFP_THISNODE only when we are doing an optimistic THP
> allocation. Madvise signalizes you know what you are doing and THP has
> the top priority. If you care enough about the numa placement then you
> should better use mempolicy.

Now the patch is still untested but it compiles at least.
---


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-28  8:18               ` Michal Hocko
@ 2018-08-28  8:54                 ` Stefan Priebe - Profihost AG
  2018-08-29 11:11                   ` Stefan Priebe - Profihost AG
       [not found]                 ` <D5F4A33C-0A37-495C-9468-D6866A862097@cs.rutgers.edu>
  1 sibling, 1 reply; 77+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-08-28  8:54 UTC (permalink / raw)
  To: Michal Hocko, Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka


Am 28.08.2018 um 10:18 schrieb Michal Hocko:
> [CC Stefan Priebe who has reported the same/similar issue on openSUSE
>  mailing list recently - the thread starts http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com]
> 
> On Tue 28-08-18 09:53:21, Michal Hocko wrote:
>> On Thu 23-08-18 12:52:53, Michal Hocko wrote:
>>> On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
>>>> On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
>>> [...]
>>>>> I still have to digest the __GFP_THISNODE thing but I _think_ that the
>>>>> alloc_pages_vma code is just trying to be overly clever and
>>>>> __GFP_THISNODE is not a good fit for it. 
>>>>
>>>> My option 2 did just that, it removed __GFP_THISNODE but only for
>>>> MADV_HUGEPAGE and in general whenever reclaim was activated by
>>>> __GFP_DIRECT_RECLAIM. That is also a signal that the user really wants
>>>> THP, so then it's less bad to prefer THP over NUMA locality.
>>>>
>>>> For the default, which is tuned for short-lived allocations, preferring
>>>> local memory is most certainly the better win, since THP can't help much
>>>> for short-lived allocations; this is why I didn't remove __GFP_THISNODE
>>>> from the default defrag policy.
>>>
>>> Yes I agree.
>>
>> I finally got back to this again. I have checked your patch and I am
>> really wondering whether alloc_pages_vma is really the proper place to
>> play these tricks. We already have that mind blowing alloc_hugepage_direct_gfpmask
>> and it should be the proper place to handle this special casing. So what
>> do you think about the following. It should be essentially the same
>> thing, i.e. use __GFP_THISNODE only when we are doing an optimistic THP
>> allocation. Madvise signals that you know what you are doing and that THP
>> has top priority. If you care enough about NUMA placement then you
>> should use a mempolicy.
> 
> Now the patch is still untested but it compiles at least.

Great - I recompiled the SLES15 kernel with that one applied and will
test if it helps.

Stefan

> ---
> From 88e0ca4c9c403c6046f1c47d5ee17548f9dc841a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 28 Aug 2018 09:59:19 +0200
> Subject: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on a NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on a 4.12-based
> SLES kernel which suggests the same issue:
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-phSLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80
> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for a MADV_HUGEPAGE vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
> where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
> where we juggle gfp flags for different allocation modes. The rationale is
> that __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement. The latter can still be
> controlled in a more fine-grained way via NUMA policies.
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/huge_memory.c | 10 +++++-----
>  mm/mempolicy.c   | 26 --------------------------
>  2 files changed, 5 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..a703c23f8bab 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -634,16 +634,16 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | __GFP_THISNODE);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | __GFP_THISNODE);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     __GFP_THISNODE);
> +	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
>  }
>  
>  /* Caller must hold page table lock. */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..9f0800885613 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -
>  	nmask = policy_nodemask(gfp, pol);
>  	preferred_nid = policy_node(gfp, pol, node);
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> 


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-28  8:54                 ` Stefan Priebe - Profihost AG
@ 2018-08-29 11:11                   ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 77+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-08-29 11:11 UTC (permalink / raw)
  To: Michal Hocko, Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka


On 28.08.2018 10:54, Stefan Priebe - Profihost AG wrote:
> 
> On 28.08.2018 10:18, Michal Hocko wrote:
>> [CC Stefan Priebe who has reported the same/similar issue on openSUSE
>>  mailing list recently - the thread starts http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com]
>>
>> On Tue 28-08-18 09:53:21, Michal Hocko wrote:
>>> On Thu 23-08-18 12:52:53, Michal Hocko wrote:
>>>> On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
>>>>> On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
>>>> [...]
>>>>>> I still have to digest the __GFP_THISNODE thing but I _think_ that the
>>>>>> alloc_pages_vma code is just trying to be overly clever and
>>>>>> __GFP_THISNODE is not a good fit for it. 
>>>>>
>>>>> My option 2 did just that, it removed __GFP_THISNODE but only for
>>>>> MADV_HUGEPAGE and in general whenever reclaim was activated by
>>>>> __GFP_DIRECT_RECLAIM. That is also a signal that the user really wants
>>>>> THP, so then it's less bad to prefer THP over NUMA locality.
>>>>>
>>>>> For the default, which is tuned for short-lived allocations, preferring
>>>>> local memory is most certainly the better win, since THP can't help much
>>>>> for short-lived allocations; this is why I didn't remove __GFP_THISNODE
>>>>> from the default defrag policy.
>>>>
>>>> Yes I agree.
>>>
>>> I finally got back to this again. I have checked your patch and I am
>>> really wondering whether alloc_pages_vma is really the proper place to
>>> play these tricks. We already have that mind blowing alloc_hugepage_direct_gfpmask
>>> and it should be the proper place to handle this special casing. So what
>>> do you think about the following. It should be essentially the same
>>> thing, i.e. use __GFP_THISNODE only when we are doing an optimistic THP
>>> allocation. Madvise signals that you know what you are doing and that THP
>>> has top priority. If you care enough about NUMA placement then you
>>> should use a mempolicy.
>>
>> Now the patch is still untested but it compiles at least.
> 
> Great - I recompiled the SLES15 kernel with that one applied and will
> test if it helps.

It seems to work fine. At least I was not able to reproduce the issue.

Greets,
Stefan

>> ---
>> From 88e0ca4c9c403c6046f1c47d5ee17548f9dc841a Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <mhocko@suse.com>
>> Date: Tue, 28 Aug 2018 09:59:19 +0200
>> Subject: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
>>
>> Andrea has noticed [1] that a THP allocation might be really disruptive
>> when allocated on a NUMA system with the local node full or hard to
>> reclaim. Stefan has posted an allocation stall report on a 4.12-based
>> SLES kernel which suggests the same issue:
>> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
>> [245513.363983] kvm cpuset=/ mems_allowed=0-1
>> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-phSLE15 (unreleased)
>> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
>> [245513.365905] Call Trace:
>> [245513.366535]  dump_stack+0x5c/0x84
>> [245513.367148]  warn_alloc+0xe0/0x180
>> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
>> [245513.368406]  ? __slab_free+0xa9/0x2f0
>> [245513.369048]  ? __slab_free+0xa9/0x2f0
>> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
>> [245513.370300]  alloc_pages_vma+0x1e5/0x280
>> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
>> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
>> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
>> [245513.372812]  __handle_mm_fault+0x93d/0x1060
>> [245513.373439]  handle_mm_fault+0xc6/0x1b0
>> [245513.374042]  __do_page_fault+0x230/0x430
>> [245513.374679]  ? get_vtime_delta+0x13/0xb0
>> [245513.375411]  do_page_fault+0x2a/0x70
>> [245513.376145]  ? page_fault+0x65/0x80
>> [245513.376882]  page_fault+0x7b/0x80
>> [...]
>> [245513.382056] Mem-Info:
>> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>>                  active_file:60183 inactive_file:245285 isolated_file:0
>>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>>                  free:32294031 free_pcp:6233 free_cma:0
>> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>>
>> The defrag mode is "madvise" and from the above report it is clear that
>> the THP has been allocated for a MADV_HUGEPAGE vma.
>>
>> Andrea has identified that the main source of the problem is
>> __GFP_THISNODE usage:
>>
>> : The problem is that direct compaction combined with the NUMA
>> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
>> : hard the local node, instead of failing the allocation if there's no
>> : THP available in the local node.
>> :
>> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
>> : path even with MPOL_DEFAULT.
>> :
>> : The idea behind the __GFP_THISNODE addition, is that it is better to
>> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
>> : backed memory. That largely depends on the remote latency though, on
>> : threadrippers for example the overhead is relatively low in my
>> : experience.
>> :
>> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
>> : extremely slow qemu startup with vfio, if the VM is larger than the
>> : size of one host NUMA node. This is because it will try very hard to
>> : unsuccessfully swapout get_user_pages pinned pages as result of the
>> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
>> : allocations and instead of trying to allocate THP on other nodes (it
>> : would be even worse without vfio type1 GUP pins of course, except it'd
>> : be swapping heavily instead).
>>
>> Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
>> where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
>> where we juggle gfp flags for different allocation modes. The rationale is
>> that __GFP_THISNODE is helpful in relaxed defrag modes because falling back
>> to a different node might be more harmful than the benefit of a large page.
>> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
>> a higher priority than local NUMA placement. The latter can still be
>> controlled in a more fine-grained way via NUMA policies.
>>
>> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
>>
>> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
>> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
>> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>> ---
>>  mm/huge_memory.c | 10 +++++-----
>>  mm/mempolicy.c   | 26 --------------------------
>>  2 files changed, 5 insertions(+), 31 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c3bc7e9c9a2a..a703c23f8bab 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -634,16 +634,16 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>  
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | __GFP_THISNODE);
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE;
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>> -							     __GFP_KSWAPD_RECLAIM);
>> +							     __GFP_KSWAPD_RECLAIM | __GFP_THISNODE);
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>> -							     0);
>> -	return GFP_TRANSHUGE_LIGHT;
>> +							     __GFP_THISNODE);
>> +	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
>>  }
>>  
>>  /* Caller must hold page table lock. */
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index da858f794eb6..9f0800885613 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>  		goto out;
>>  	}
>>  
>> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
>> -		int hpage_node = node;
>> -
>> -		/*
>> -		 * For hugepage allocation and non-interleave policy which
>> -		 * allows the current node (or other explicitly preferred
>> -		 * node) we only try to allocate from the current/preferred
>> -		 * node and don't fall back to other nodes, as the cost of
>> -		 * remote accesses would likely offset THP benefits.
>> -		 *
>> -		 * If the policy is interleave, or does not allow the current
>> -		 * node in its nodemask, we allocate the standard way.
>> -		 */
>> -		if (pol->mode == MPOL_PREFERRED &&
>> -						!(pol->flags & MPOL_F_LOCAL))
>> -			hpage_node = pol->v.preferred_node;
>> -
>> -		nmask = policy_nodemask(gfp, pol);
>> -		if (!nmask || node_isset(hpage_node, *nmask)) {
>> -			mpol_cond_put(pol);
>> -			page = __alloc_pages_node(hpage_node,
>> -						gfp | __GFP_THISNODE, order);
>> -			goto out;
>> -		}
>> -	}
>> -
>>  	nmask = policy_nodemask(gfp, pol);
>>  	preferred_nid = policy_node(gfp, pol, node);
>>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
>>


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
       [not found]                 ` <D5F4A33C-0A37-495C-9468-D6866A862097@cs.rutgers.edu>
@ 2018-08-29 14:28                   ` Michal Hocko
  2018-08-29 14:35                     ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-29 14:28 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Wed 29-08-18 09:28:21, Zi Yan wrote:
[...]
> This patch triggers WARN_ON_ONCE() in policy_node() when MPOL_BIND is used and THP is on.
> Should this WARN_ON_ONCE be removed?
> 
> 
> /*
> * __GFP_THISNODE shouldn't even be used with the bind policy
> * because we might easily break the expectation to stay on the
> * requested node and not break the policy.
> */
> WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));

This is really interesting. It seems it was me who added this warning,
but I simply cannot make any sense of it. Let me try to dig some more.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-29 14:28                   ` Michal Hocko
@ 2018-08-29 14:35                     ` Michal Hocko
  2018-08-29 15:22                       ` Zi Yan
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-29 14:35 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Wed 29-08-18 16:28:16, Michal Hocko wrote:
> On Wed 29-08-18 09:28:21, Zi Yan wrote:
> [...]
> > This patch triggers WARN_ON_ONCE() in policy_node() when MPOL_BIND is used and THP is on.
> > Should this WARN_ON_ONCE be removed?
> > 
> > 
> > /*
> > * __GFP_THISNODE shouldn't even be used with the bind policy
> > * because we might easily break the expectation to stay on the
> > * requested node and not break the policy.
> > */
> > WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> 
> This is really interesting. It seems it was me who added this warning,
> but I simply cannot make any sense of it. Let me try to dig some more.

OK, I get it now. The warning seems to be incomplete. It is right to
complain when __GFP_THISNODE disagrees with MPOL_BIND policy but that is
not what we check here. Does this heal the warning?
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..7bb9354b1e4c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1728,7 +1728,10 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
 		 * because we might easily break the expectation to stay on the
 		 * requested node and not break the policy.
 		 */
-		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
+		if (policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE)) {
+			nodemask_t *nmask = policy_nodemask(gfp, policy);
+			WARN_ON_ONCE(!node_isset(nd, *nmask));
+		}
 	}
 
 	return nd;
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-29 14:35                     ` Michal Hocko
@ 2018-08-29 15:22                       ` Zi Yan
  2018-08-29 15:47                         ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Zi Yan @ 2018-08-29 15:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On 29 Aug 2018, at 10:35, Michal Hocko wrote:

> On Wed 29-08-18 16:28:16, Michal Hocko wrote:
>> On Wed 29-08-18 09:28:21, Zi Yan wrote:
>> [...]
>>> This patch triggers WARN_ON_ONCE() in policy_node() when MPOL_BIND is used and THP is on.
>>> Should this WARN_ON_ONCE be removed?
>>>
>>>
>>> /*
>>> * __GFP_THISNODE shouldn't even be used with the bind policy
>>> * because we might easily break the expectation to stay on the
>>> * requested node and not break the policy.
>>> */
>>> WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>>
>> This is really interesting. It seems it was me who added this warning,
>> but I simply cannot make any sense of it. Let me try to dig some more.
>
> OK, I get it now. The warning seems to be incomplete. It is right to
> complain when __GFP_THISNODE disagrees with MPOL_BIND policy but that is
> not what we check here. Does this heal the warning?
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..7bb9354b1e4c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1728,7 +1728,10 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
>  		 * because we might easily break the expectation to stay on the
>  		 * requested node and not break the policy.
>  		 */
> -		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> +		if (policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE)) {
> +			nodemask_t *nmask = policy_nodemask(gfp, policy);
> +			WARN_ON_ONCE(!node_isset(nd, *nmask));
> +		}
>  	}
>
>  	return nd;

Unfortunately no. I simply ran “memhog -r3 1g membind 1” to test and the warning still showed up.

The reason is that nd is just a hint about which node to prefer for allocation and
can be ignored if it does not conform to mempolicy.
Taking my test as an example, if an application is only memory bound to node 1 but can run on any CPU
nodes and it launches on node 0, alloc_pages_vma() will see 0 as its node parameter
and passes 0 to policy_node()’s nd parameter. This should be OK, but your patches
would give a warning, because nd=0 is not set in nmask=1.

Now I get your comment “__GFP_THISNODE shouldn't even be used with the bind policy”,
since they are indeed incompatible: __GFP_THISNODE insists on a specific node,
which the MPOL_BIND policy's nodemask may not allow.

IMHO, we could get rid of __GFP_THISNODE when MPOL_BIND is set, like

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0d2be5786b0c..a0fcb998d277 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1722,14 +1722,6 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
 {
        if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
                nd = policy->v.preferred_node;
-       else {
-               /*
-                * __GFP_THISNODE shouldn't even be used with the bind policy
-                * because we might easily break the expectation to stay on the
-                * requested node and not break the policy.
-                */
-               WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
-       }

        return nd;
 }
@@ -2026,6 +2018,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
                goto out;
        }

+       /*
+        * __GFP_THISNODE shouldn't even be used with the bind policy
+        * because we might easily break the expectation to stay on the
+        * requested node and not break the policy.
+        */
+       if (pol->mode == MPOL_BIND)
+               gfp &= ~__GFP_THISNODE;

        nmask = policy_nodemask(gfp, pol);
        preferred_nid = policy_node(gfp, pol, node);

What do you think?

—
Best Regards,
Yan Zi



* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-29 15:22                       ` Zi Yan
@ 2018-08-29 15:47                         ` Michal Hocko
  2018-08-29 16:06                           ` Zi Yan
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-29 15:47 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Wed 29-08-18 11:22:35, Zi Yan wrote:
> On 29 Aug 2018, at 10:35, Michal Hocko wrote:
> 
> > On Wed 29-08-18 16:28:16, Michal Hocko wrote:
> >> On Wed 29-08-18 09:28:21, Zi Yan wrote:
> >> [...]
> >>> This patch triggers WARN_ON_ONCE() in policy_node() when MPOL_BIND is used and THP is on.
> >>> Should this WARN_ON_ONCE be removed?
> >>>
> >>>
> >>> /*
> >>> * __GFP_THISNODE shouldn't even be used with the bind policy
> >>> * because we might easily break the expectation to stay on the
> >>> * requested node and not break the policy.
> >>> */
> >>> WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> >>
> >> This is really interesting. It seems it was me who added this warning,
> >> but I simply cannot make any sense of it. Let me try to dig some more.
> >
> > OK, I get it now. The warning seems to be incomplete. It is right to
> > complain when __GFP_THISNODE disagrees with MPOL_BIND policy but that is
> > not what we check here. Does this heal the warning?
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da858f794eb6..7bb9354b1e4c 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1728,7 +1728,10 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
> >  		 * because we might easily break the expectation to stay on the
> >  		 * requested node and not break the policy.
> >  		 */
> > -		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> > +		if (policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE)) {
> > +			nodemask_t *nmask = policy_nodemask(gfp, policy);
> > +			WARN_ON_ONCE(!node_isset(nd, *nmask));
> > +		}
> >  	}
> >
> >  	return nd;
> 
> Unfortunately no. I simply ran "memhog -r3 1g membind 1" to test and the warning still showed up.
> 
> The reason is that nd is just a hint about which node to prefer for allocation and
> can be ignored if it does not conform to mempolicy.
>
> Taking my test as an example, if an application is only memory bound to node 1 but can run on any CPU
> nodes and it launches on node 0, alloc_pages_vma() will see 0 as its node parameter
> and passes 0 to policy_node()'s nd parameter. This should be OK, but your patches
> would give a warning, because nd=0 is not set in nmask=1.
> 
> Now I get your comment "__GFP_THISNODE shouldn't even be used with the bind policy",
> since they are indeed incompatible: __GFP_THISNODE insists on a specific node,
> which the MPOL_BIND policy's nodemask may not allow.

Well, the assumption was that you do not run on a CPU remote from your
memory policy's nodes. But that seems to be a wrong assumption.

> IMHO, we could get rid of __GFP_THISNODE when MPOL_BIND is set, like
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0d2be5786b0c..a0fcb998d277 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1722,14 +1722,6 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
>  {
>         if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
>                 nd = policy->v.preferred_node;
> -       else {
> -               /*
> -                * __GFP_THISNODE shouldn't even be used with the bind policy
> -                * because we might easily break the expectation to stay on the
> -                * requested node and not break the policy.
> -                */
> -               WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> -       }
> 
>         return nd;
>  }
> @@ -2026,6 +2018,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>                 goto out;
>         }
> 
> +       /*
> +        * __GFP_THISNODE shouldn't even be used with the bind policy
> +        * because we might easily break the expectation to stay on the
> +        * requested node and not break the policy.
> +        */
> +       if (pol->mode == MPOL_BIND)
> +               gfp &= ~__GFP_THISNODE;
> 
>         nmask = policy_nodemask(gfp, pol);
>         preferred_nid = policy_node(gfp, pol, node);
> 
> What do you think?

I do not like overwriting gfp flags like that. It is just ugly and error
prone. A more proper way would be to handle that at the layer we play
with __GFP_THISNODE. The resulting diff is larger though.

If there is a general consensus that this is growing too complicated
then Andrea's patch (the second variant to overwrite gfp mask) is much
simpler of course but I really detest the subtle gfp rewriting. I still
believe that all the nasty details should be covered at the single
place.


diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a703c23f8bab..94472bf9a31b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | __GFP_THISNODE);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM | __GFP_THISNODE);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_THISNODE);
-	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9f0800885613..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-29 15:47                         ` Michal Hocko
@ 2018-08-29 16:06                           ` Zi Yan
  2018-08-29 16:25                             ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Zi Yan @ 2018-08-29 16:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 4745 bytes --]

<snip>
>
> I do not like overwriting gfp flags like that. It is just ugly and error
> prone. A more proper way would be to handle that at the layer we play
> with __GFP_THISNODE. The resulting diff is larger though.

This makes sense to me.

>
> If there is a general consensus that this is growing too complicated
> then Andrea's patch (the second variant to overwrite gfp mask) is much
> simpler of course but I really detest the subtle gfp rewriting. I still
> believe that all the nasty details should be covered at the single
> place.
>
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a703c23f8bab..94472bf9a31b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +#endif
>
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | __GFP_THISNODE);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM | __GFP_THISNODE);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_THISNODE);
> -	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>
>  /* Caller must hold page table lock. */
> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 9f0800885613..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);

The warning goes away with this change. I am OK with this patch (plus the original one you sent out,
which could be merged with this one).

Thanks.

—
Best Regards,
Yan Zi



* Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
  2018-08-29 16:06                           ` Zi Yan
@ 2018-08-29 16:25                             ` Michal Hocko
  2018-08-29 19:24                               ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-29 16:25 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Wed 29-08-18 12:06:48, Zi Yan wrote:
> The warning goes away with this change. I am OK with this patch (plus the original one you sent out,
> which could be merged with this one).

I will respin the patch, update the changelog and repost. Tomorrow I
hope.



-- 
Michal Hocko
SUSE Labs


* [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 16:25                             ` Michal Hocko
@ 2018-08-29 19:24                               ` Michal Hocko
  2018-08-29 22:54                                 ` Zi Yan
                                                   ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Michal Hocko @ 2018-08-29 19:24 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Wed 29-08-18 18:25:28, Michal Hocko wrote:
> On Wed 29-08-18 12:06:48, Zi Yan wrote:
> > The warning goes away with this change. I am OK with this patch (plus the original one you sent out,
> > which could be merged with this one).
> 
> I will respin the patch, update the changelog and repost. Tomorrow I
> hope.

Here is the v2


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 19:24                               ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
@ 2018-08-29 22:54                                 ` Zi Yan
  2018-08-30  7:00                                   ` Michal Hocko
  2018-09-06 10:59                                   ` Vlastimil Babka
  2018-08-30  6:47                                 ` Michal Hocko
  2018-09-12 17:29                                 ` Mel Gorman
  2 siblings, 2 replies; 77+ messages in thread
From: Zi Yan @ 2018-08-29 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 7162 bytes --]

Hi Michal,

<snip>
>
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 25 +++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 20 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..94472bf9a31b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +#endif
>
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>
>  /* Caller must hold page table lock. */
> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -
>  	nmask = policy_nodemask(gfp, pol);
>  	preferred_nid = policy_node(gfp, pol, node);
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> -- 
> 2.18.0
>

Thanks for your patch.

I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
each node and got the results below. I expect this test should fill one node, then fall back to the other.

1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.
3. no madvise, THP is on by default + defrag = {always, defer, defer+madvise}: pages got swapped to the disk instead of
being allocated in the fallback node.
4. no madvise, THP is on by default + defrag = madvise: no swap, base pages are allocated in the fallback node.

Results 2 and 3 seem unexpected, since pages should be allocated in the fallback node.

The reason, as Andrea mentioned in his email, is the combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM
(plus __GFP_KSWAPD_RECLAIM in this experiment). __GFP_THISNODE uses ZONELIST_NOFALLBACK, which removes the fallback possibility,
and __GFP_*_RECLAIM triggers page reclaim in the first (and now only) allocation node once the fallback nodes
have been removed by ZONELIST_NOFALLBACK.

IMHO, __GFP_THISNODE should not be used for user memory allocation at all, since it fights against most memory policies.
Kernel memory allocation, on the other hand, may need it, acting like a kernel-side MPOL_BIND policy.

Comments?

—
Best Regards,
Yan Zi



* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 19:24                               ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
  2018-08-29 22:54                                 ` Zi Yan
@ 2018-08-30  6:47                                 ` Michal Hocko
  2018-09-06 11:18                                   ` Vlastimil Babka
  2018-09-12 17:29                                 ` Mel Gorman
  2 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-30  6:47 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

I forgot mpol_cond_put...


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 22:54                                 ` Zi Yan
@ 2018-08-30  7:00                                   ` Michal Hocko
  2018-08-30 13:22                                     ` Zi Yan
  2018-09-06 10:59                                   ` Vlastimil Babka
  1 sibling, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-30  7:00 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Wed 29-08-18 18:54:23, Zi Yan wrote:
[...]
> I tested it against Linus's tree with "memhog -r3 130g" in a two-socket machine with 128GB memory on
> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> 
> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> no swap, THPs are allocated in the fallback node.
> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> disk instead of being allocated in the fallback node.
> 3. no madvise, THP is on by default + defrag = {always, defer,
> defer+madvise}: pages got swapped to the disk instead of being
> allocated in the fallback node.
> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> pages are allocated in the fallback node.
> 
> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
> 
> The reason, as Andrea mentioned in his email, is the combination
> of __GFP_THISNODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> in this experiment).

But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
see kswapd do the reclaim to balance the node. If the node is full of
anonymous pages then there is no other way than swap out.

> __GFP_THISNODE uses ZONELIST_NOFALLBACK, which
> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> reclaim in the first page allocation node when fallback nodes are
> removed by ZONELIST_NOFALLBACK.

Yes but the point is that the allocations which use __GFP_THISNODE are
optimistic so they shouldn't fallback to remote NUMA nodes.

> IMHO, __GFP_THISNODE should not be used for user memory allocation at
> all, since it fights against most memory policies.  But kernel
> memory allocation would need it as a kernel MPOL_BIND memory policy.

__GFP_THISNODE is indeed an ugliness. I would really love to get rid of
it here. But the problem is that optimistic THP allocations should
prefer a local node because a remote node might easily offset the
advantage of the THP. I do not have a great idea how to achieve that
without __GFP_THISNODE though.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30  7:00                                   ` Michal Hocko
@ 2018-08-30 13:22                                     ` Zi Yan
  2018-08-30 13:45                                       ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Zi Yan @ 2018-08-30 13:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 3285 bytes --]

On 30 Aug 2018, at 3:00, Michal Hocko wrote:

> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> [...]
>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>
>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
>> no swap, THPs are allocated in the fallback node.
>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
>> disk instead of being allocated in the fallback node.
>> 3. no madvise, THP is on by default + defrag = {always, defer,
>> defer+madvise}: pages got swapped to the disk instead of being
>> allocated in the fallback node.
>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
>> pages are allocated in the fallback node.
>>
>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
>>
>> The reason, as Andrea mentioned in his email, is the combination
>> of __GFP_THISNODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
>> in this experiment).
>
> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> see kswapd do the reclaim to balance the node. If the node is full of
> anonymous pages then there is no other way than swap out.

GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given and THP is on
with defrag=always, the gfp_mask has both __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so
swapping can be triggered.

The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
which should allow page allocation fallback to other nodes, but as shown in
result 3, swapping is triggered instead of page allocation fallback.

>
>> __GFP_THISNODE uses ZONELIST_NOFALLBACK, which
>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
>> reclaim in the first page allocation node when fallback nodes are
>> removed by ZONELIST_NOFALLBACK.
>
> Yes but the point is that the allocations which use __GFP_THISNODE are
> optimistic so they shouldn't fallback to remote NUMA nodes.

This can be achieved by using MPOL_BIND memory policy which restricts
nodemask in struct alloc_context for user space memory allocations.

>
>> IMHO, __GFP_THISNODE should not be used for user memory allocation at
>> all, since it fights against most memory policies.  But kernel
>> memory allocation would need it as a kernel MPOL_BIND memory policy.
>
> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> it here. But the problem is that optimistic THP allocations should
> prefer a local node because a remote node might easily offset the
> advantage of the THP. I do not have a great idea how to achieve that
> without __GFP_THISNODE though.

MPOL_PREFERRED memory policy can be used to achieve this optimistic THP allocation
for user space. Even with the default memory policy, local memory node will be used
first until it is full. It seems to me that __GFP_THISNODE is not necessary
if a proper memory policy is used.

Let me know if I miss anything. Thanks.


—
Best Regards,
Yan Zi



* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 13:22                                     ` Zi Yan
@ 2018-08-30 13:45                                       ` Michal Hocko
  2018-08-30 14:02                                         ` Zi Yan
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-30 13:45 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Thu 30-08-18 09:22:21, Zi Yan wrote:
> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> 
> > On Wed 29-08-18 18:54:23, Zi Yan wrote:
> > [...]
> >> I tested it against Linus's tree with "memhog -r3 130g" in a two-socket machine with 128GB memory on
> >> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> >>
> >> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> >> no swap, THPs are allocated in the fallback node.
> >> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> >> disk instead of being allocated in the fallback node.
> >> 3. no madvise, THP is on by default + defrag = {always, defer,
> >> defer+madvise}: pages got swapped to the disk instead of being
> >> allocated in the fallback node.
> >> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> >> pages are allocated in the fallback node.
> >>
> >> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
> >>
> >> The reason, as Andrea mentioned in his email, is the combination
> >> of __GFP_THISNODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> >> in this experiment).
> >
> > But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> > We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> > see kswapd do the reclaim to balance the node. If the node is full of
> > anonymous pages then there is no other way than swap out.
> 
> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given and THP is on
> with defrag=always, the gfp_mask has both __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so
> swapping can be triggered.

Yes, but that setup says you are willing to pay a price to get a THP.
defrag=always uses that special __GFP_NORETRY (unless it is a madvised
mapping) that should back off if the compaction failed recently. How
much that reduces the reclaim is not really clear to me right now, to
be honest.

> The key issue here is that "memhog -r3 130g" uses the default memory policy (MPOL_DEFAULT),
> which should allow page allocation fallback to other nodes, but as shown in
> result 3, swapping is triggered instead of page allocation fallback.

Well, I guess this really depends. Fallback to a different node might be
seen as a bad thing and worse than the reclaim on the local node.

> >> __GFP_THISNODE uses ZONELIST_NOFALLBACK, which
> >> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> >> reclaim in the first page allocation node when fallback nodes are
> >> removed by ZONELIST_NOFALLBACK.
> >
> > Yes but the point is that the allocations which use __GFP_THISNODE are
> > optimistic so they shouldn't fallback to remote NUMA nodes.
> 
> This can be achieved by using MPOL_BIND memory policy which restricts
> nodemask in struct alloc_context for user space memory allocations.

Yes, but that requires an explicit NUMA handling. And we are trying to
handle those cases which do not really give a damn and just want to use
THP if it is available or try harder when they ask by using madvise.

> >> IMHO, __GFP_THISNODE should not be used for user memory allocation at
> >> all, since it fights against most memory policies.  But kernel
> >> memory allocation would need it as a kernel MPOL_BIND memory policy.
> >
> > __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> > it here. But the problem is that optimistic THP allocations should
> > prefer a local node because a remote node might easily offset the
> > advantage of the THP. I do not have a great idea how to achieve that
> > without __GFP_THISNODE though.
> 
> MPOL_PREFERRED memory policy can be used to achieve this optimistic
> THP allocation for user space. Even with the default memory policy,
> local memory node will be used first until it is full. It seems to
> me that __GFP_THISNODE is not necessary if a proper memory policy is
> used.
> 
> Let me know if I miss anything. Thanks.

You are missing that we are trying to define a sensible model for those
who do not really care about mempolicies. THP shouldn't cause more harm
than good for those.

I wish we could come up with a remotely sane and comprehensible model.
That means that you know how hard the allocator tries to get a THP for
you depending on the defrag configuration, your memory policy and your
madvise setting. The easiest one I can think of is to 
- always follow mempolicy when specified because you asked for it
  explicitly
- stay node local and low latency for the light THP defrag mode (defrag,
  madvise without hint and none) because THP is a nice to have
- if the defrag mode is always then you are willing to pay the latency
  price but off-node might be still a no-no.
- allow fallback for madvised mappings because you really want THP. If
  you care about specific numa placement then combine with the
  mempolicy.

As you can see I do not really mention anything about the direct reclaim
because that is just an implementation detail of the page allocator and
compaction interaction.

Maybe you can formulate a saner matrix with all the available modes that
we have.
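[Editor's sketch: the four-point model sketched earlier in this message could be written down as a toy decision function. This is purely illustrative pseudo-C; the enum and function names are invented for this sketch and are not kernel API.]

```c
/* Invented names for the sketch; the real knobs live in
 * mm/huge_memory.c and mm/mempolicy.c. */
enum defrag_mode { DEFRAG_NONE, DEFRAG_DEFER, DEFRAG_DEFER_MADVISE,
		   DEFRAG_MADVISE, DEFRAG_ALWAYS };

enum thp_numa_policy { THP_LOCAL_ONLY, THP_LOCAL_COMPACT, THP_ANY_NODE,
		       THP_FOLLOW_MEMPOLICY };

enum thp_numa_policy thp_placement(enum defrag_mode mode, int vma_madvised,
				   int has_explicit_mempolicy)
{
	if (has_explicit_mempolicy)	/* you asked for it explicitly */
		return THP_FOLLOW_MEMPOLICY;
	if (vma_madvised)		/* you really want THP: fallback ok */
		return THP_ANY_NODE;
	if (mode == DEFRAG_ALWAYS)	/* pay latency, but stay local */
		return THP_LOCAL_COMPACT;
	return THP_LOCAL_ONLY;		/* light modes: THP is nice to have */
}
```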

Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
is clearly wrong and we should address that first. Either Andrea's
option 2) patch or mine which does the similar thing except at the
proper layer (I believe). We can continue discussing other odd cases on
top I guess. Unless somebody has much brighter idea, of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 13:45                                       ` Michal Hocko
@ 2018-08-30 14:02                                         ` Zi Yan
  2018-08-30 16:19                                           ` Stefan Priebe - Profihost AG
  2018-08-30 16:40                                           ` Michal Hocko
  0 siblings, 2 replies; 77+ messages in thread
From: Zi Yan @ 2018-08-30 14:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 6014 bytes --]

On 30 Aug 2018, at 9:45, Michal Hocko wrote:

> On Thu 30-08-18 09:22:21, Zi Yan wrote:
>> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
>>
>>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
>>> [...]
>>>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>>>
>>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
>>>> no swap, THPs are allocated in the fallback node.
>>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
>>>> disk instead of being allocated in the fallback node.
>>>> 3. no madvise, THP is on by default + defrag = {always, defer,
>>>> defer+madvise}: pages got swapped to the disk instead of being
>>>> allocated in the fallback node.
>>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
>>>> pages are allocated in the fallback node.
>>>>
> >>>> Results 2 and 3 seem unexpected, since pages should be allocated in the fallback node.
> >>>>
> >>>> The reason, as Andrea mentioned in his email, is the combination
> >>>> of __GFP_THISNODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> >>>> in this experiment).
>>>
>>> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
>>> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
>>> see kswapd do the reclaim to balance the node. If the node is full of
>>> anonymous pages then there is no other way than swap out.
>>
>> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
>> + defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
>> can be triggered.
>
> Yes, but the setup tells that you are willing to pay the price to get a
> THP. defrag=always uses that special __GFP_NORETRY (unless it is a
> madvised mapping) that should back off if the compaction failed
> recently. How much that reduces the reclaim is not really clear to me
> right now to be honest.
>
>> The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
>> which should allow page allocation fallback to other nodes, but as shown in
>> result 3, swapping is triggered instead of page allocation fallback.
>
> Well, I guess this really depends. Fallback to a different node might be
> seen as a bad thing and worse than the reclaim on the local node.
>
> >>>> __GFP_THISNODE uses ZONELIST_NOFALLBACK, which
> >>>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> >>>> reclaim in the first page allocation node when fallback nodes are
> >>>> removed by ZONELIST_NOFALLBACK.
>>>
>>> Yes but the point is that the allocations which use __GFP_THISNODE are
>>> optimistic so they shouldn't fallback to remote NUMA nodes.
>>
>> This can be achieved by using MPOL_BIND memory policy which restricts
>> nodemask in struct alloc_context for user space memory allocations.
>
> Yes, but that requires explicit NUMA handling. And we are trying to
> handle those cases which do not really give a damn and just want to use
> THP if it is available, or try harder when they ask by using madvise.
>
> >>>> IMHO, __GFP_THISNODE should not be used for user memory allocation at
> >>>> all, since it fights against most memory policies.  But kernel
> >>>> memory allocation would need it as a kernel MPOL_BIND memory policy.
>>>
>>> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
>>> it here. But the problem is that optimistic THP allocations should
>>> prefer a local node because a remote node might easily offset the
>>> advantage of the THP. I do not have a great idea how to achieve that
>>> without __GFP_THISNODE though.
>>
>> MPOL_PREFERRED memory policy can be used to achieve this optimistic
>> THP allocation for user space. Even with the default memory policy,
>> local memory node will be used first until it is full. It seems to
>> me that __GFP_THISNODE is not necessary if a proper memory policy is
>> used.
>>
>> Let me know if I miss anything. Thanks.
>
> You are missing that we are trying to define a sensible model for those
> who do not really care about mempolicies. THP shouldn't cause more harm
> than good for those.
>
> I wish we could come up with a remotely sane and comprehensible model.
> That means that you know how hard the allocator tries to get a THP for
> you depending on the defrag configuration, your memory policy and your
> madvise setting. The easiest one I can think of is to
> - always follow mempolicy when specified because you asked for it
>   explicitly
> - stay node local and low latency for the light THP defrag mode (defrag,
>   madvise without hint and none) because THP is a nice to have
> - if the defrag mode is always then you are willing to pay the latency
>   price but off-node might be still a no-no.
> - allow fallback for madvised mappings because you really want THP. If
>   you care about specific numa placement then combine with the
>   mempolicy.
>
> As you can see I do not really mention anything about the direct reclaim
> because that is just an implementation detail of the page allocator and
> compaction interaction.
>
> Maybe you can formulate a saner matrix with all the available modes that
> we have.
>
> Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
> is clearly wrong and we should address that first. Either Andrea's
> option 2) patch or mine which does the similar thing except at the
> proper layer (I believe). We can continue discussing other odd cases on
> top I guess. Unless somebody has much brighter idea, of course.

Thanks for your explanation. It makes sense to me. I am fine with your patch.
You can add my Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>, since
my test result 1 shows that the problem mentioned in your changelog is solved.

—
Best Regards,
Yan Zi



* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 14:02                                         ` Zi Yan
@ 2018-08-30 16:19                                           ` Stefan Priebe - Profihost AG
  2018-08-30 16:40                                           ` Michal Hocko
  1 sibling, 0 replies; 77+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-08-30 16:19 UTC (permalink / raw)
  To: Zi Yan
  Cc: Michal Hocko, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka

[-- Attachment #1: Type: text/plain, Size: 230 bytes --]

Please add also:
Tested-by: Stefan Priebe <s.priebe@profihost.ag>

Stefan

Excuse my typo sent from my mobile phone.

> Am 30.08.2018 um 16:02 schrieb Zi Yan <zi.yan@cs.rutgers.edu>:
> 
> Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>



* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 14:02                                         ` Zi Yan
  2018-08-30 16:19                                           ` Stefan Priebe - Profihost AG
@ 2018-08-30 16:40                                           ` Michal Hocko
  2018-09-05  3:44                                             ` Andrea Arcangeli
  1 sibling, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-08-30 16:40 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Thu 30-08-18 10:02:23, Zi Yan wrote:
> On 30 Aug 2018, at 9:45, Michal Hocko wrote:
> 
> > On Thu 30-08-18 09:22:21, Zi Yan wrote:
> >> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> >>
> >>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> >>> [...]
> >>>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
> >>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> >>>>
> >>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> >>>> no swap, THPs are allocated in the fallback node.
> >>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> >>>> disk instead of being allocated in the fallback node.
> >>>> 3. no madvise, THP is on by default + defrag = {always, defer,
> >>>> defer+madvise}: pages got swapped to the disk instead of being
> >>>> allocated in the fallback node.
> >>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> >>>> pages are allocated in the fallback node.
> >>>>
> > >>>> Results 2 and 3 seem unexpected, since pages should be allocated in the fallback node.
> > >>>>
> > >>>> The reason, as Andrea mentioned in his email, is the combination
> > >>>> of __GFP_THISNODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> > >>>> in this experiment).
> >>>
> >>> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> >>> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> >>> see kswapd do the reclaim to balance the node. If the node is full of
> >>> anonymous pages then there is no other way than swap out.
> >>
> >> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
> >> + defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
> >> can be triggered.
> >
> > Yes, but the setup tells that you are willing to pay the price to get a
> > THP. defrag=always uses that special __GFP_NORETRY (unless it is a
> > madvised mapping) that should back off if the compaction failed
> > recently. How much that reduces the reclaim is not really clear to me
> > right now to be honest.
> >
> >> The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
> >> which should allow page allocation fallback to other nodes, but as shown in
> >> result 3, swapping is triggered instead of page allocation fallback.
> >
> > Well, I guess this really depends. Fallback to a different node might be
> > seen as a bad thing and worse than the reclaim on the local node.
> >
> > >>>> __GFP_THISNODE uses ZONELIST_NOFALLBACK, which
> > >>>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> > >>>> reclaim in the first page allocation node when fallback nodes are
> > >>>> removed by ZONELIST_NOFALLBACK.
> >>>
> >>> Yes but the point is that the allocations which use __GFP_THISNODE are
> >>> optimistic so they shouldn't fallback to remote NUMA nodes.
> >>
> >> This can be achieved by using MPOL_BIND memory policy which restricts
> >> nodemask in struct alloc_context for user space memory allocations.
> >
> > Yes, but that requires explicit NUMA handling. And we are trying to
> > handle those cases which do not really give a damn and just want to use
> > THP if it is available, or try harder when they ask by using madvise.
> >
> > >>>> IMHO, __GFP_THISNODE should not be used for user memory allocation at
> > >>>> all, since it fights against most memory policies.  But kernel
> > >>>> memory allocation would need it as a kernel MPOL_BIND memory policy.
> >>>
> >>> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> >>> it here. But the problem is that optimistic THP allocations should
> >>> prefer a local node because a remote node might easily offset the
> >>> advantage of the THP. I do not have a great idea how to achieve that
> >>> without __GFP_THISNODE though.
> >>
> >> MPOL_PREFERRED memory policy can be used to achieve this optimistic
> >> THP allocation for user space. Even with the default memory policy,
> >> local memory node will be used first until it is full. It seems to
> >> me that __GFP_THISNODE is not necessary if a proper memory policy is
> >> used.
> >>
> >> Let me know if I miss anything. Thanks.
> >
> > You are missing that we are trying to define a sensible model for those
> > who do not really care about mempolicies. THP shouldn't cause more harm
> > than good for those.
> >
> > I wish we could come up with a remotely sane and comprehensible model.
> > That means that you know how hard the allocator tries to get a THP for
> > you depending on the defrag configuration, your memory policy and your
> > madvise setting. The easiest one I can think of is to
> > - always follow mempolicy when specified because you asked for it
> >   explicitly
> > - stay node local and low latency for the light THP defrag mode (defrag,
> >   madvise without hint and none) because THP is a nice to have
> > - if the defrag mode is always then you are willing to pay the latency
> >   price but off-node might be still a no-no.
> > - allow fallback for madvised mappings because you really want THP. If
> >   you care about specific numa placement then combine with the
> >   mempolicy.
> >
> > As you can see I do not really mention anything about the direct reclaim
> > because that is just an implementation detail of the page allocator and
> > compaction interaction.
> >
> > Maybe you can formulate a saner matrix with all the available modes that
> > we have.
> >
> > Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
> > is clearly wrong and we should address that first. Either Andrea's
> > option 2) patch or mine which does the similar thing except at the
> > proper layer (I believe). We can continue discussing other odd cases on
> > top I guess. Unless somebody has much brighter idea, of course.
> 
> Thanks for your explanation. It makes sense to me. I am fine with your patch.
> You can add my Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>, since
> my test result 1 shows that the problem mentioned in your changelog is solved.

Thanks for your and Stefan's testing. I will wait for some more
feedback. I will be offline next few days and if there are no major
objections I will repost with both tested-bys early next week.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 16:40                                           ` Michal Hocko
@ 2018-09-05  3:44                                             ` Andrea Arcangeli
  2018-09-05  7:08                                               ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2018-09-05  3:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zi Yan, Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka, Stefan Priebe - Profihost AG

On Thu, Aug 30, 2018 at 06:40:57PM +0200, Michal Hocko wrote:
> On Thu 30-08-18 10:02:23, Zi Yan wrote:
> > On 30 Aug 2018, at 9:45, Michal Hocko wrote:
> > 
> > > On Thu 30-08-18 09:22:21, Zi Yan wrote:
> > >> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> > >>
> > >>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> > >>> [...]
> > >>>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
> > >>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> > >>>>
> > >>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> > >>>> no swap, THPs are allocated in the fallback node.

no swap

> > >>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> > >>>> disk instead of being allocated in the fallback node.

swap

> > >>>> 3. no madvise, THP is on by default + defrag = {always, defer,
> > >>>> defer+madvise}: pages got swapped to the disk instead of being
> > >>>> allocated in the fallback node.

swap

> > >>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> > >>>> pages are allocated in the fallback node.

no swap

> > >>>> Results 2 and 3 seem unexpected, since pages should be allocated in the fallback node.

I agree it's not great for 2 and 3.

I don't see how the above can be considered a 100% "pass" of the test;
at best it's a 50% pass.

Let me clarify the setup to be sure:

1) There was no hard bind at all

2) Let's also ignore NUMA balancing which is all but restrictive at
   the start and it's meant to converge over time if current
   conditions don't allow immediate convergence. For simplicity let's
   assume NUMA balancing off.

So what the test exercised is the plain normal allocation of RAM with
THP main knob enabled to "always" on a NUMA system.

No matter whether madvise was used or not, 2 cases out of 4 decided to
swap out instead of allocating totally free THP or PAGE_SIZEd pages.

As opposed there would have been absolutely zero swapouts in the exact
same test if the main THP knob would have been disabled with:

     echo never >/sys/kernel/mm/transparent_hugepage/enabled

There is no way that enabling THP (no matter what other defrag
settings were and no matter if MADV_HUGEPAGE was used or not) should
cause heavy swap storms during page faults allocating memory, when
disabling THP doesn't swap even a single 4k page. That can't possibly
be right.

This is because there is no way the overhead of swapping can be
compensated by the THP improvement.

And with swapping I really mean "reclaim", just testing with the
swapout testcase is simpler and doesn't require an iommu pinning all
memory. So setting may_swap and may_unmap to zero won't move the
needle because my test showed just massive CPU consumption in trying
so hard to generate THP from the local node, but nothing got swapped
out because of the iommu pins.

That kind of swapping may only pay off in the very long long term,
which is what khugepaged is for. khugepaged already takes care of the
long term, so we could later argue and think if khugepaged should
swapout or not in such condition, but I don't think there's much to
argue about the page fault.

> Thanks for your and Stefan's testing. I will wait for some more
> feedback. I will be offline next few days and if there are no major
> objections I will repost with both tested-bys early next week.

I'm not so positive about 2 of the above tests if I understood the
test correctly.

Those results are totally fine if you used a non-default memory
policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
I'm afraid it'll be even harder to reproduce when things go
wrong again in those two cases.

Thanks,
Andrea


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-05  3:44                                             ` Andrea Arcangeli
@ 2018-09-05  7:08                                               ` Michal Hocko
  2018-09-06 11:10                                                 ` Vlastimil Babka
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-09-05  7:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Zi Yan, Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka, Stefan Priebe - Profihost AG

On Tue 04-09-18 23:44:03, Andrea Arcangeli wrote:
[...]
> That kind of swapping may only pay off in the very long long term,
> which is what khugepaged is for. khugepaged already takes care of the
> long term, so we could later argue and think if khugepaged should
> swapout or not in such condition, but I don't think there's much to
> argue about the page fault.

I agree that defrag==always doing a reclaim is not really good and
benefits are questionable. If you remember this was the primary reason
why the default has been changed.

> > Thanks for your and Stefan's testing. I will wait for some more
> > feedback. I will be offline next few days and if there are no major
> > objections I will repost with both tested-bys early next week.
> 
> I'm not so positive about 2 of the above tests if I understood the
> test correctly.
> 
> Those results are totally fine if you used a non-default memory
> policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
> I'm afraid it'll be even harder to reproduce when things go
> wrong again in those two cases.

We can and should think about this much more but I would like to have
this regression closed. So can we address GFP_THISNODE part first and
build more complex solution on top?

Is there any objection to my patch which does the similar thing to your
patch v2 in a different location?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 22:54                                 ` Zi Yan
  2018-08-30  7:00                                   ` Michal Hocko
@ 2018-09-06 10:59                                   ` Vlastimil Babka
  2018-09-06 11:17                                     ` Zi Yan
  1 sibling, 1 reply; 77+ messages in thread
From: Vlastimil Babka @ 2018-09-06 10:59 UTC (permalink / raw)
  To: Zi Yan, Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Stefan Priebe - Profihost AG

On 08/30/2018 12:54 AM, Zi Yan wrote:
> 
> Thanks for your patch.
> 
> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> 
> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.

Hmm this is GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE.
No direct reclaim, so it would have to be kswapd causing the swapping? I
wouldn't expect it to be significant and over-reclaiming. What exactly
is your definition of "pages got swapped"?

> 3. no madvise, THP is on by default + defrag = {always, defer, defer+madvise}: pages got swapped to the disk instead of
> being allocated in the fallback node.

So this should be the most common case (no madvise, THP on). If it's
causing too much reclaim, it's not good IMHO.

depending on defrag:
defer (the default) = same as above, so it would have to be kswapd
always = GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM | __GFP_NORETRY |
__GFP_THISNODE
  - so direct reclaim also overreclaims despite __GFP_NORETRY?
defer+madvise = same as defer

Vlastimil
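[Editor's sketch: the per-defrag-mode gfp masks Vlastimil enumerates can be written as a simplified stand-in for the kernel's alloc_hugepage_direct_gfpmask(). The flag values below are illustrative constants defined just for this sketch; the real ones live in include/linux/gfp.h, and __GFP_THISNODE is folded in here because that is the (almost) unconditional behavior under discussion.]

```c
/* Illustrative stand-in values, NOT the real kernel flag bits. */
#define __GFP_KSWAPD_RECLAIM	0x01u
#define __GFP_DIRECT_RECLAIM	0x02u
#define __GFP_NORETRY		0x04u
#define __GFP_THISNODE		0x08u
#define GFP_TRANSHUGE_LIGHT	0x10u	/* base THP mask, no reclaim bits */

enum defrag { DEFRAG_ALWAYS, DEFRAG_DEFER, DEFRAG_DEFER_MADVISE,
	      DEFRAG_MADVISE, DEFRAG_NEVER };

unsigned int thp_gfpmask(enum defrag mode, int vma_madvised)
{
	unsigned int gfp = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;

	switch (mode) {
	case DEFRAG_ALWAYS:
		/* direct reclaim/compaction; __GFP_NORETRY unless madvised */
		return gfp | __GFP_DIRECT_RECLAIM |
		       (vma_madvised ? 0 : __GFP_NORETRY);
	case DEFRAG_DEFER:
		/* only wake kswapd/kcompactd: the case in result 2 above */
		return gfp | __GFP_KSWAPD_RECLAIM;
	case DEFRAG_DEFER_MADVISE:
		return gfp | (vma_madvised ? __GFP_DIRECT_RECLAIM :
					     __GFP_KSWAPD_RECLAIM);
	case DEFRAG_MADVISE:
		return gfp | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
	default:
		/* defrag=never: no reclaim of any kind */
		return gfp;
	}
}
```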


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-05  7:08                                               ` Michal Hocko
@ 2018-09-06 11:10                                                 ` Vlastimil Babka
  2018-09-06 11:16                                                   ` Vlastimil Babka
  0 siblings, 1 reply; 77+ messages in thread
From: Vlastimil Babka @ 2018-09-06 11:10 UTC (permalink / raw)
  To: Michal Hocko, Andrea Arcangeli
  Cc: Zi Yan, Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Stefan Priebe - Profihost AG

On 09/05/2018 09:08 AM, Michal Hocko wrote:
> On Tue 04-09-18 23:44:03, Andrea Arcangeli wrote:
> [...]
>> That kind of swapping may only pay off in the very long long term,
>> which is what khugepaged is for. khugepaged already takes care of the
>> long term, so we could later argue and think if khugepaged should
>> swapout or not in such condition, but I don't think there's much to
>> argue about the page fault.
> 
> I agree that defrag==always doing a reclaim is not really good and
> benefits are questionable. If you remember this was the primary reason
> why the default has been changed.
> 
>>> Thanks for your and Stefan's testing. I will wait for some more
>>> feedback. I will be offline next few days and if there are no major
>>> objections I will repost with both tested-bys early next week.
>>
>> I'm not so positive about 2 of the above tests if I understood the
>> test correctly.
>>
>> Those results are totally fine if you used the non default memory
>> policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
>> I'm afraid it'll be even be harder to reproduce when things will go
>> wrong again in those two cases.
> 
> We can and should think about this much more but I would like to have
> this regression closed. So can we address GFP_THISNODE part first and
> build more complex solution on top?
> 
> Is there any objection to my patch which does the similar thing to your
> patch v2 in a different location?

Similar but not the same. It fixes the madvise case, but I wonder about
the no-madvise defrag=defer case, where Zi Yan reports it still causes
swapping.


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 11:10                                                 ` Vlastimil Babka
@ 2018-09-06 11:16                                                   ` Vlastimil Babka
  2018-09-06 11:25                                                     ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Vlastimil Babka @ 2018-09-06 11:16 UTC (permalink / raw)
  To: Michal Hocko, Andrea Arcangeli
  Cc: Zi Yan, Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Stefan Priebe - Profihost AG

On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
>> We can and should think about this much more but I would like to have
>> this regression closed. So can we address GFP_THISNODE part first and
>> build more complex solution on top?
>>
>> Is there any objection to my patch which does the similar thing to your
>> patch v2 in a different location?
> 
> Similar but not the same. It fixes the madvise case, but I wonder about
> the no-madvise defrag=defer case, where Zi Yan reports it still causes
> swapping.

Ah, but that should be the same with Andrea's variant 2) patch. There
should only be a difference with defrag=always, which is direct reclaim
with __GFP_NORETRY: Andrea's patch would drop __GFP_THISNODE and yours
would not. Maybe Zi Yan can do the same kind of tests with Andrea's
patch [1] to confirm?

[1] https://marc.info/?l=linux-mm&m=153476267026951


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 10:59                                   ` Vlastimil Babka
@ 2018-09-06 11:17                                     ` Zi Yan
  0 siblings, 0 replies; 77+ messages in thread
From: Zi Yan @ 2018-09-06 11:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 1172 bytes --]

On 6 Sep 2018, at 6:59, Vlastimil Babka wrote:

> On 08/30/2018 12:54 AM, Zi Yan wrote:
>>
>> Thanks for your patch.
>>
>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>
>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.
>
> Hmm this is GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE.
> No direct reclaim, so it would have to be kswapd causing the swapping? I
> wouldn't expect it to be significant and over-reclaiming. What exactly
> is your definition of "pages got swapped"?

About 4GB of pages are swapped to the disk (my swap disk size is 4.7GB).
My machine has 128GB memory in each node and memhog uses 130GB memory.
When one node is filled up, the oldest pages are swapped out to disk
until memhog finishes touching all 130GB memory.

—
Best Regards,
Yan Zi



* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30  6:47                                 ` Michal Hocko
@ 2018-09-06 11:18                                   ` Vlastimil Babka
  2018-09-06 11:27                                     ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Vlastimil Babka @ 2018-09-06 11:18 UTC (permalink / raw)
  To: Michal Hocko, Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Stefan Priebe - Profihost AG

On 08/30/2018 08:47 AM, Michal Hocko wrote:
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +	mpol_cond_put(pol);

The code is better without the hack in alloc_pages_vma() but I'm not
thrilled about getting vma policy here and then immediately again in
alloc_pages_vma(). But if it can't be helped...


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 11:16                                                   ` Vlastimil Babka
@ 2018-09-06 11:25                                                     ` Michal Hocko
  2018-09-06 12:35                                                       ` Zi Yan
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-09-06 11:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrea Arcangeli, Zi Yan, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Stefan Priebe - Profihost AG

On Thu 06-09-18 13:16:00, Vlastimil Babka wrote:
> On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
> >> We can and should think about this much more but I would like to have
> >> this regression closed. So can we address GFP_THISNODE part first and
> >> build more complex solution on top?
> >>
> >> Is there any objection to my patch, which does a similar thing to your
> >> patch v2 in a different location?
> > 
> > Similar but not the same. It fixes the madvise case, but I wonder about
> > the no-madvise defrag=defer case, where Zi Yan reports it still causes
> > swapping.
> 
> Ah, but that should be the same with Andrea's variant 2) patch. There
> should only be a difference with defrag=always, which is direct reclaim
> with __GFP_NORETRY: Andrea's patch would drop __GFP_THISNODE and yours
> does not. Maybe Zi Yan can run the same kind of tests with Andrea's
> patch [1] to confirm?

Yes, that is the only difference, and that is why I've said those patches
are mostly similar. I do not want to touch the defrag=always case because
it has always been stall prone and we replaced it as the default precisely
because of that. We should discuss what should be done with that case
separately, IMHO.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 11:18                                   ` Vlastimil Babka
@ 2018-09-06 11:27                                     ` Michal Hocko
  0 siblings, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2018-09-06 11:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Stefan Priebe - Profihost AG

On Thu 06-09-18 13:18:52, Vlastimil Babka wrote:
> On 08/30/2018 08:47 AM, Michal Hocko wrote:
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > +	gfp_t this_node = 0;
> > +	struct mempolicy *pol;
> > +
> > +#ifdef CONFIG_NUMA
> > +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> > +	pol = get_vma_policy(vma, addr);
> > +	if (pol->mode != MPOL_BIND)
> > +		this_node = __GFP_THISNODE;
> > +	mpol_cond_put(pol);
> 
> The code is better without the hack in alloc_pages_vma() but I'm not
> thrilled about getting vma policy here and then immediately again in
> alloc_pages_vma(). But if it can't be helped...

The whole function is an act of beauty, isn't it? I wanted to get the
policy from the caller, but that would be even more messy, so I've tried
to keep it in the ugly corner and have it hidden there. You shouldn't ask
your friends to read alloc_hugepage_direct_gfpmask unless they have done
something terribly wrong.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 11:25                                                     ` Michal Hocko
@ 2018-09-06 12:35                                                       ` Zi Yan
  0 siblings, 0 replies; 77+ messages in thread
From: Zi Yan @ 2018-09-06 12:35 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Stefan Priebe - Profihost AG

On 6 Sep 2018, at 7:25, Michal Hocko wrote:

> On Thu 06-09-18 13:16:00, Vlastimil Babka wrote:
>> On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
>>>> We can and should think about this much more but I would like to have
>>>> this regression closed. So can we address GFP_THISNODE part first and
>>>> build more complex solution on top?
>>>>
>>>> Is there any objection to my patch, which does a similar thing to your
>>>> patch v2 in a different location?
>>>
>>> Similar but not the same. It fixes the madvise case, but I wonder about
>>> the no-madvise defrag=defer case, where Zi Yan reports it still causes
>>> swapping.
>>
>> Ah, but that should be the same with Andrea's variant 2) patch. There
>> should only be a difference with defrag=always, which is direct reclaim
>> with __GFP_NORETRY: Andrea's patch would drop __GFP_THISNODE and yours
>> does not. Maybe Zi Yan can run the same kind of tests with Andrea's
>> patch [1] to confirm?
>
> Yes, that is the only difference, and that is why I've said those patches
> are mostly similar. I do not want to touch the defrag=always case because
> it has always been stall prone and we replaced it as the default precisely
> because of that. We should discuss what should be done with that case
> separately, IMHO.

Vlastimil, my test using Andrea's patch confirms your statement.
Andrea's patch gives the same outcomes as Michal's patch, except in the
case of no madvise + THP on by default + defrag=always: instead of
swapping pages to disk, Andrea's patch causes no swapping and THPs are
allocated in the fallback node.

As I said before, the fundamental issue that causes swapping pages to disk
when allocating THPs in a filled node is that __GFP_THISNODE removes all
fallback zone/node options; thus, __GFP_KSWAPD_RECLAIM or
__GFP_DIRECT_RECLAIM can only swap pages out to satisfy the THP
allocation request.

__GFP_THISNODE can be seen as a kernel-side MPOL_BIND policy, which
overrides any user space memory policy and should be removed or limited
to kernel-only page allocations. But, as Michal said, we can discuss
this further; that discussion should not be on the critical path of
merging the patch.

—
Best Regards,
Yan Zi


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 19:24                               ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
  2018-08-29 22:54                                 ` Zi Yan
  2018-08-30  6:47                                 ` Michal Hocko
@ 2018-09-12 17:29                                 ` Mel Gorman
  2018-09-17  6:11                                   ` Michal Hocko
  2 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2018-09-12 17:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka,
	Stefan Priebe - Profihost AG

On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
> From 4dc2f772756e6f91b9e64d1a3e2df4dca3475f5b Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 28 Aug 2018 09:59:19 +0200
> Subject: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on 4.12 based
> SLES kernel which suggests the same issue:

Note that this behaviour is unhelpful, but it is not against the defined
semantics of the "madvise" defrag option.

> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> juggle gfp flags for different allocation modes.

For the short term, I think you might be better off simply avoiding the
combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM and declaring
that the fix. That would be easier for -stable and the layering can be
dealt with as a cleanup.

I recognise that this fix means that users who expect zone_reclaim_mode==1
type behaviour may get burned, but the users that benefit from that should
also be users that benefit from sizing their workload to a node. They should
be able to replicate that with mempolicies, or at least use preparation
scripts to clear memory on a target node (e.g. membind a memhog to the
desired size, exit and then start the target workload).

I think this is a more appropriate solution than prematurely introducing a
GFP flag as it's not guaranteed that a user is willing to pay a compaction
penalty until it fails. That should be decided separately when the immediate
problem is resolved.

That said, I do think that sorting out where GFP flags are set for THP
should be done in the context of THP code and not alloc_pages_vma. The
current layering is a bit odd.

> The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 
> Be careful when the vma has an explicit numa binding though, because
> __GFP_THISNODE is not playing well with it. We want to follow the
> explicit numa policy rather than enforce a node which happens to be
> local to the cpu we are running on.
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 25 +++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 20 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..94472bf9a31b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +#endif
>  

Where is the mpol_cond_put? Historically it might not have mattered
because THP could not be used with shared mappings, but it probably
matters now that tmpfs can be backed by THP.

The comment needs more expansion as well. Arguably it only makes sense in
the event we are explicitly bound to one node, because if we are bound to
two nodes without interleaving then why not fall back? The answer to that
is outside the scope of the patch, but the comment as-is will cause head
scratching in a year's time.

>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>  
>  /* Caller must hold page table lock. */
> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -

The hugepage flag passed into this function is now redundant and that
means that callers of alloc_hugepage_vma need to move back to using
alloc_pages_vma() directly and remove the API entirely. This block of
code is about both GFP flag settings and node selection but at a glance I
cannot see the point of it because it's very similar to the base page code.
The whole point may be to get around the warning in policy_node and that
could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
as you do already in this patch. There should be no reason why THP has a
different policy than a base page within a single VMA.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-12 17:29                                 ` Mel Gorman
@ 2018-09-17  6:11                                   ` Michal Hocko
  2018-09-17  7:04                                     ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-09-17  6:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka,
	Stefan Priebe - Profihost AG

[sorry I've missed your reply]

On Wed 12-09-18 18:29:25, Mel Gorman wrote:
> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
[...]
> I recognise that this fix means that users who expect zone_reclaim_mode==1
> type behaviour may get burned, but the users that benefit from that should
> also be users that benefit from sizing their workload to a node. They should
> be able to replicate that with mempolicies, or at least use preparation
> scripts to clear memory on a target node (e.g. membind a memhog to the
> desired size, exit and then start the target workload).

As I've said in other email. We probably want to add a new mempolicy
which has zone_reclaim_mode-like semantic.

[...]

> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> > index 5228c62af416..bac395f1d00a 100644
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
> >  struct mempolicy *get_task_policy(struct task_struct *p);
> >  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >  		unsigned long addr);
> > +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> > +						unsigned long addr);
> >  bool vma_policy_mof(struct vm_area_struct *vma);
> >  
> >  extern void numa_default_policy(void);
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c3bc7e9c9a2a..94472bf9a31b 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >   *	    available
> >   * never: never stall for any thp allocation
> >   */
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > +	gfp_t this_node = 0;
> > +	struct mempolicy *pol;
> > +
> > +#ifdef CONFIG_NUMA
> > +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> > +	pol = get_vma_policy(vma, addr);
> > +	if (pol->mode != MPOL_BIND)
> > +		this_node = __GFP_THISNODE;
> > +#endif
> >  
> 
> Where is the mpol_cond_put? Historically it might not have mattered
> because THP could not be used with shared mappings, but it probably
> matters now that tmpfs can be backed by THP.

http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz

> The comment needs more expansion as well. Arguably it only makes sense in
> the event we are explicitly bound to one node, because if we are bound to
> two nodes without interleaving then why not fall back? The answer to that
> is outside the scope of the patch, but the comment as-is will cause head
> scratching in a year's time.

Do you have any specific wording in mind? I have a bit of a hard time
coming up with something more precise without going into too much detail.
 
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> > -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> > +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> > -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> > +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> >  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> > -							     __GFP_KSWAPD_RECLAIM);
> > +							     __GFP_KSWAPD_RECLAIM | this_node);
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> >  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> > -							     0);
> > -	return GFP_TRANSHUGE_LIGHT;
> > +							     this_node);
> > +	return GFP_TRANSHUGE_LIGHT | this_node;
> >  }
> >  
> >  /* Caller must hold page table lock. */
> > @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >  			pte_free(vma->vm_mm, pgtable);
> >  		return ret;
> >  	}
> > -	gfp = alloc_hugepage_direct_gfpmask(vma);
> > +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
> >  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
> >  	if (unlikely(!page)) {
> >  		count_vm_event(THP_FAULT_FALLBACK);
> > @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
> >  alloc:
> >  	if (transparent_hugepage_enabled(vma) &&
> >  	    !transparent_hugepage_debug_cow()) {
> > -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> > +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
> >  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
> >  	} else
> >  		new_page = NULL;
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da858f794eb6..75bbfc3d6233 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >   * freeing by another task.  It is the caller's responsibility to free the
> >   * extra reference for shared policies.
> >   */
> > -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> > +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >  						unsigned long addr)
> >  {
> >  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> > @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  		goto out;
> >  	}
> >  
> > -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> > -		int hpage_node = node;
> > -
> > -		/*
> > -		 * For hugepage allocation and non-interleave policy which
> > -		 * allows the current node (or other explicitly preferred
> > -		 * node) we only try to allocate from the current/preferred
> > -		 * node and don't fall back to other nodes, as the cost of
> > -		 * remote accesses would likely offset THP benefits.
> > -		 *
> > -		 * If the policy is interleave, or does not allow the current
> > -		 * node in its nodemask, we allocate the standard way.
> > -		 */
> > -		if (pol->mode == MPOL_PREFERRED &&
> > -						!(pol->flags & MPOL_F_LOCAL))
> > -			hpage_node = pol->v.preferred_node;
> > -
> > -		nmask = policy_nodemask(gfp, pol);
> > -		if (!nmask || node_isset(hpage_node, *nmask)) {
> > -			mpol_cond_put(pol);
> > -			page = __alloc_pages_node(hpage_node,
> > -						gfp | __GFP_THISNODE, order);
> > -			goto out;
> > -		}
> > -	}
> > -
> 
> The hugepage flag passed into this function is now redundant and that
> means that callers of alloc_hugepage_vma need to move back to using
> alloc_pages_vma() directly and remove the API entirely. This block of
> code is about both GFP flag settings and node selection but at a glance I
> cannot see the point of it because it's very similar to the base page code.
> The whole point may be to get around the warning in policy_node and that
> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
> as you do already in this patch. There should be no reason why THP has a
> different policy than a base page within a single VMA.

OK, I can follow up with a cleanup patch once we settle down with this
approach to fix the issue.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-17  6:11                                   ` Michal Hocko
@ 2018-09-17  7:04                                     ` Stefan Priebe - Profihost AG
  2018-09-17  9:32                                       ` Stefan Priebe - Profihost AG
  2018-09-17 11:27                                       ` Michal Hocko
  0 siblings, 2 replies; 77+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-09-17  7:04 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka

Hi,

I had multiple memory stalls this weekend again. All kvm processes were
spinning, trying to get > 100% CPU, and I was not able to even log in via
ssh. After 5-10 minutes I was able to log in.

There was about 150GB of free memory on the host.

Relevant settings (no local storage involved):
        vm.dirty_background_ratio:
            3
        vm.dirty_ratio:
            10
        vm.min_free_kbytes:
            10567004

# cat /sys/kernel/mm/transparent_hugepage/defrag
always defer [defer+madvise] madvise never

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

After that i had the following traces on the host node:
https://pastebin.com/raw/0VhyQmAv

Thanks!

Greets,
Stefan


Am 17.09.2018 um 08:11 schrieb Michal Hocko:
> [sorry I've missed your reply]
> 
> On Wed 12-09-18 18:29:25, Mel Gorman wrote:
>> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
> [...]
>> I recognise that this fix means that users who expect zone_reclaim_mode==1
>> type behaviour may get burned, but the users that benefit from that should
>> also be users that benefit from sizing their workload to a node. They should
>> be able to replicate that with mempolicies, or at least use preparation
>> scripts to clear memory on a target node (e.g. membind a memhog to the
>> desired size, exit and then start the target workload).
> 
> As I've said in other email. We probably want to add a new mempolicy
> which has zone_reclaim_mode-like semantic.
> 
> [...]
> 
>>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>>> index 5228c62af416..bac395f1d00a 100644
>>> --- a/include/linux/mempolicy.h
>>> +++ b/include/linux/mempolicy.h
>>> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>>>  struct mempolicy *get_task_policy(struct task_struct *p);
>>>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>  		unsigned long addr);
>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>> +						unsigned long addr);
>>>  bool vma_policy_mof(struct vm_area_struct *vma);
>>>  
>>>  extern void numa_default_policy(void);
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c3bc7e9c9a2a..94472bf9a31b 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>   *	    available
>>>   * never: never stall for any thp allocation
>>>   */
>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>> +	gfp_t this_node = 0;
>>> +	struct mempolicy *pol;
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
>>> +	pol = get_vma_policy(vma, addr);
>>> +	if (pol->mode != MPOL_BIND)
>>> +		this_node = __GFP_THISNODE;
>>> +#endif
>>>  
>>
>> Where is the mpol_cond_put? Historically it might not have mattered
>> because THP could not be used with shared mappings, but it probably
>> matters now that tmpfs can be backed by THP.
> 
> http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz
> 
>> The comment needs more expansion as well. Arguably it only makes sense in
>> the event we are explicitly bound to one node, because if we are bound to
>> two nodes without interleaving then why not fall back? The answer to that
>> is outside the scope of the patch, but the comment as-is will cause head
>> scratching in a year's time.
> 
> Do you have any specific wording in mind? I have a bit of a hard time
> coming up with something more precise without going into too much detail.
>  
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>>> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>>> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>>> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>> -							     __GFP_KSWAPD_RECLAIM);
>>> +							     __GFP_KSWAPD_RECLAIM | this_node);
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>> -							     0);
>>> -	return GFP_TRANSHUGE_LIGHT;
>>> +							     this_node);
>>> +	return GFP_TRANSHUGE_LIGHT | this_node;
>>>  }
>>>  
>>>  /* Caller must hold page table lock. */
>>> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>>  			pte_free(vma->vm_mm, pgtable);
>>>  		return ret;
>>>  	}
>>> -	gfp = alloc_hugepage_direct_gfpmask(vma);
>>> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>  	if (unlikely(!page)) {
>>>  		count_vm_event(THP_FAULT_FALLBACK);
>>> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>>>  alloc:
>>>  	if (transparent_hugepage_enabled(vma) &&
>>>  	    !transparent_hugepage_debug_cow()) {
>>> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
>>> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>  	} else
>>>  		new_page = NULL;
>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>> index da858f794eb6..75bbfc3d6233 100644
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>   * freeing by another task.  It is the caller's responsibility to free the
>>>   * extra reference for shared policies.
>>>   */
>>> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>  						unsigned long addr)
>>>  {
>>>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
>>> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>  		goto out;
>>>  	}
>>>  
>>> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
>>> -		int hpage_node = node;
>>> -
>>> -		/*
>>> -		 * For hugepage allocation and non-interleave policy which
>>> -		 * allows the current node (or other explicitly preferred
>>> -		 * node) we only try to allocate from the current/preferred
>>> -		 * node and don't fall back to other nodes, as the cost of
>>> -		 * remote accesses would likely offset THP benefits.
>>> -		 *
>>> -		 * If the policy is interleave, or does not allow the current
>>> -		 * node in its nodemask, we allocate the standard way.
>>> -		 */
>>> -		if (pol->mode == MPOL_PREFERRED &&
>>> -						!(pol->flags & MPOL_F_LOCAL))
>>> -			hpage_node = pol->v.preferred_node;
>>> -
>>> -		nmask = policy_nodemask(gfp, pol);
>>> -		if (!nmask || node_isset(hpage_node, *nmask)) {
>>> -			mpol_cond_put(pol);
>>> -			page = __alloc_pages_node(hpage_node,
>>> -						gfp | __GFP_THISNODE, order);
>>> -			goto out;
>>> -		}
>>> -	}
>>> -
>>
>> The hugepage flag passed into this function is now redundant and that
>> means that callers of alloc_hugepage_vma need to move back to using
>> alloc_pages_vma() directly and remove the API entirely. This block of
>> code is about both GFP flag settings and node selection but at a glance I
>> cannot see the point of it because it's very similar to the base page code.
>> The whole point may be to get around the warning in policy_node and that
>> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
>> as you do already in this patch. There should be no reason why THP has a
>> different policy than a base page within a single VMA.
> 
> OK, I can follow up with a cleanup patch once we settle down with this
> approach to fix the issue.
> 
> Thanks!
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-17  7:04                                     ` Stefan Priebe - Profihost AG
@ 2018-09-17  9:32                                       ` Stefan Priebe - Profihost AG
  2018-09-17 11:27                                       ` Michal Hocko
  1 sibling, 0 replies; 77+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-09-17  9:32 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka

Maybe a missing piece:
vm.overcommit_memory=0

Greets,
Stefan

Am 17.09.2018 um 09:04 schrieb Stefan Priebe - Profihost AG:
> Hi,
> 
> I had multiple memory stalls this weekend again. All kvm processes were
> spinning, trying to get > 100% CPU, and I was not even able to log in via
> ssh. After 5-10 minutes I was able to log in.
> 
> There was about 150GB of free memory on the host.
> 
> Relevant settings (no local storage involved):
>         vm.dirty_background_ratio:
>             3
>         vm.dirty_ratio:
>             10
>         vm.min_free_kbytes:
>             10567004
> 
> # cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer [defer+madvise] madvise never
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> 
> After that I had the following traces on the host node:
> https://pastebin.com/raw/0VhyQmAv
> 
> Thanks!
> 
> Greets,
> Stefan
> 
> 
> Am 17.09.2018 um 08:11 schrieb Michal Hocko:
>> [sorry I've missed your reply]
>>
>> On Wed 12-09-18 18:29:25, Mel Gorman wrote:
>>> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
>> [...]
>>> I recognise that this fix means that users that expect zone_reclaim_mode==1
>>> type behaviour may get burned but the users that benefit from that should
>>> also be users that benefit from sizing their workload to a node. They should
>>> be able to replicate that with mempolicies or at least use preparation scripts
>>> to clear memory on a target node (e.g. membind a memhog to the desired size,
>>> exit and then start the target workload).
>>
>> As I've said in other email. We probably want to add a new mempolicy
>> which has zone_reclaim_mode-like semantic.
>>
>> [...]
>>
>>>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>>>> index 5228c62af416..bac395f1d00a 100644
>>>> --- a/include/linux/mempolicy.h
>>>> +++ b/include/linux/mempolicy.h
>>>> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>>>>  struct mempolicy *get_task_policy(struct task_struct *p);
>>>>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>>  		unsigned long addr);
>>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>> +						unsigned long addr);
>>>>  bool vma_policy_mof(struct vm_area_struct *vma);
>>>>  
>>>>  extern void numa_default_policy(void);
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index c3bc7e9c9a2a..94472bf9a31b 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>>   *	    available
>>>>   * never: never stall for any thp allocation
>>>>   */
>>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>>  {
>>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>>> +	gfp_t this_node = 0;
>>>> +	struct mempolicy *pol;
>>>> +
>>>> +#ifdef CONFIG_NUMA
>>>> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
>>>> +	pol = get_vma_policy(vma, addr);
>>>> +	if (pol->mode != MPOL_BIND)
>>>> +		this_node = __GFP_THISNODE;
>>>> +#endif
>>>>  
>>>
>>> Where is the mpol_cond_put? Historically it might not have mattered
>>> because THP could not be used with a shared possibility but it probably
>>> matters now that tmpfs can be backed by THP.
>>
>> http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz
>>
>>> The comment needs more expansion as well. Arguably it only makes sense in
>>> the event we are explicitly bound to one node because if we are bound to
>>> two nodes without interleaving then why not fall back? The answer to that
>>> is outside the scope of the patch but the comment as-is will cause head
>>> scratching in a year's time.
>>
>> Do you have any specific wording in mind? I have a bit of a hard time
>> coming up with something more precise that does not go into too much detail.
>>  
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>>>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>>>> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>>>> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>>>> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>>> -							     __GFP_KSWAPD_RECLAIM);
>>>> +							     __GFP_KSWAPD_RECLAIM | this_node);
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>>> -							     0);
>>>> -	return GFP_TRANSHUGE_LIGHT;
>>>> +							     this_node);
>>>> +	return GFP_TRANSHUGE_LIGHT | this_node;
>>>>  }
>>>>  
>>>>  /* Caller must hold page table lock. */
>>>> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>>>  			pte_free(vma->vm_mm, pgtable);
>>>>  		return ret;
>>>>  	}
>>>> -	gfp = alloc_hugepage_direct_gfpmask(vma);
>>>> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>>  	if (unlikely(!page)) {
>>>>  		count_vm_event(THP_FAULT_FALLBACK);
>>>> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>>>>  alloc:
>>>>  	if (transparent_hugepage_enabled(vma) &&
>>>>  	    !transparent_hugepage_debug_cow()) {
>>>> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
>>>> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>>  	} else
>>>>  		new_page = NULL;
>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>> index da858f794eb6..75bbfc3d6233 100644
>>>> --- a/mm/mempolicy.c
>>>> +++ b/mm/mempolicy.c
>>>> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>>   * freeing by another task.  It is the caller's responsibility to free the
>>>>   * extra reference for shared policies.
>>>>   */
>>>> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>>  						unsigned long addr)
>>>>  {
>>>>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
>>>> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>>  		goto out;
>>>>  	}
>>>>  
>>>> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
>>>> -		int hpage_node = node;
>>>> -
>>>> -		/*
>>>> -		 * For hugepage allocation and non-interleave policy which
>>>> -		 * allows the current node (or other explicitly preferred
>>>> -		 * node) we only try to allocate from the current/preferred
>>>> -		 * node and don't fall back to other nodes, as the cost of
>>>> -		 * remote accesses would likely offset THP benefits.
>>>> -		 *
>>>> -		 * If the policy is interleave, or does not allow the current
>>>> -		 * node in its nodemask, we allocate the standard way.
>>>> -		 */
>>>> -		if (pol->mode == MPOL_PREFERRED &&
>>>> -						!(pol->flags & MPOL_F_LOCAL))
>>>> -			hpage_node = pol->v.preferred_node;
>>>> -
>>>> -		nmask = policy_nodemask(gfp, pol);
>>>> -		if (!nmask || node_isset(hpage_node, *nmask)) {
>>>> -			mpol_cond_put(pol);
>>>> -			page = __alloc_pages_node(hpage_node,
>>>> -						gfp | __GFP_THISNODE, order);
>>>> -			goto out;
>>>> -		}
>>>> -	}
>>>> -
>>>
>>> The hugepage flag passed into this function is now redundant and that
>>> means that callers of alloc_hugepage_vma need to move back to using
>>> alloc_pages_vma() directly and remove the API entirely. This block of
>>> code is about both GFP flag settings and node selection but at a glance I
>>> cannot see the point of it because it's very similar to the base page code.
>>> The whole point may be to get around the warning in policy_node and that
>>> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
>>> as you do already in this patch. There should be no reason why THP has a
>>> different policy than a base page within a single VMA.
>>
>> OK, I can follow up with a cleanup patch once we settle down with this
>> approach to fix the issue.
>>
>> Thanks!
>>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-17  7:04                                     ` Stefan Priebe - Profihost AG
  2018-09-17  9:32                                       ` Stefan Priebe - Profihost AG
@ 2018-09-17 11:27                                       ` Michal Hocko
  1 sibling, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2018-09-17 11:27 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Mel Gorman, Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka

On Mon 17-09-18 09:04:02, Stefan Priebe - Profihost AG wrote:
> Hi,
> 
> I had multiple memory stalls this weekend again. All kvm processes were
> spinning, trying to get > 100% CPU, and I was not even able to log in via
> ssh. After 5-10 minutes I was able to log in.
> 
> There was about 150GB of free memory on the host.
> 
> Relevant settings (no local storage involved):
>         vm.dirty_background_ratio:
>             3
>         vm.dirty_ratio:
>             10
>         vm.min_free_kbytes:
>             10567004
> 
> # cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer [defer+madvise] madvise never
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> 
> After that I had the following traces on the host node:
> https://pastebin.com/raw/0VhyQmAv

I would suggest reporting this in a new email thread. I would also
recommend to CC kvm guys (see MAINTAINERS file in the kernel source
tree) and trace qemu/kvm processes to see what they are doing at the
time when you see the stall.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-12 12:05         ` Michal Hocko
  (?)
@ 2018-09-12 20:40         ` David Rientjes
  -1 siblings, 0 replies; 77+ messages in thread
From: David Rientjes @ 2018-09-12 20:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Wed, 12 Sep 2018, Michal Hocko wrote:

> > Saying that we really want THP isn't an all-or-nothing decision.  We 
> > certainly want to try hard to fault hugepages locally especially at task 
> > startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
> > very well for that.  Remote hugepages would be a regression that we now 
> > have no way to avoid because the kernel doesn't provide for it, if we were 
> > to remove __GFP_THISNODE that this patch introduces.
> 
> Why cannot you use mempolicy to bind to local nodes if you really care
> about the locality?
> 

Because we do not want to oom kill, we want to fall back first to local 
native pages and then to remote native pages.  That's the order of least 
to greatest latency; we do not want to work hard to allocate a remote 
hugepage when a local native page is faster.  This seems pretty 
straightforward.

> From what you have said so far it sounds like you would like to have
> something like the zone/node reclaim mode fine grained for a specific
> mapping. If we really want to support something like that then it should
> be a generic policy rather than THP specific thing IMHO.
> 
> As I've said, it is hard to come up with a solution that would satisfy
> everybody, but considering that the existing reports see this as a
> regression, and considering that their NUMA requirements are not as strict as
> yours, I would tend to think that stronger NUMA requirements should be
> expressed explicitly rather than as an implicit effect of a madvise flag. We
> do have APIs for that.

Every process on every platform we have would need to define this explicit 
mempolicy for users of libraries that remap text segments because changing 
the allocation behavior of thp out from under them would cause very 
noticeable performance regressions.  I don't know of any platform where 
remote hugepages are preferred over local native pages.  If they exist, it 
sounds reasonable to introduce a stronger variant of MADV_HUGEPAGE that 
defines exactly what you want rather than letting the existing flag become 
a dumping ground and causing userspace regressions.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-12 14:21       ` Michal Hocko
@ 2018-09-12 15:25         ` Michal Hocko
  0 siblings, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2018-09-12 15:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Rientjes, Andrew Morton, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Wed 12-09-18 16:21:26, Michal Hocko wrote:
> On Wed 12-09-18 09:54:17, Andrea Arcangeli wrote:
[...]
> > I wasn't particularly happy about your patch because it still swaps
> > with certain defrag settings, which still allows things that
> > shouldn't happen without some kind of privileged capability.
> 
> Well, I am not really sure about defrag=always. I would rather take care
> of the default behavior to plug the regression first, and think about
> `always' mode on top. Or is this a no-go from your POV?

In other words the following on top
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 56c9aac4dc86..723e8d77c5ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,7 +644,7 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, un
 #endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-12 13:54     ` Andrea Arcangeli
@ 2018-09-12 14:21       ` Michal Hocko
  2018-09-12 15:25         ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2018-09-12 14:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Rientjes, Andrew Morton, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Wed 12-09-18 09:54:17, Andrea Arcangeli wrote:
> Hello,
> 
> On Tue, Sep 11, 2018 at 01:56:13PM +0200, Michal Hocko wrote:
> > Well, it seems that expectations differ among users. It seems that kvm
> > users do not really agree with your interpretation.
> 
> Like David also mentioned here:
> 
> lkml.kernel.org/r/alpine.DEB.2.21.1808211021110.258924@chino.kir.corp.google.com
> 
> what is a win depends on the hardware, so there's no one-size-fits-all.
> 
> For two sockets providing remote THP to KVM is likely a win, but
> changing the defaults depending on boot-time NUMA topology makes
> things less deterministic and it's also impossible to define an exact
> break even point.
> 
> > I do realize that this is a gray zone because nobody bothered to define
> the semantics when MADV_HUGEPAGE was introduced (a826e422420b4
> is exceptionally short on information). So we are left with more or less
> undefined behavior and have to define it properly now. As we can see this might
> regress some workloads, but I strongly suspect that an explicit
> binding sounds like a more logical approach than a THP-specific mpol mode. If
> > anything this should be a more generic memory policy basically saying
> > that a zone/node reclaim mode should be enabled for the particular
> > allocation.
> 
> MADV_HUGEPAGE means the allocation is long lived, so the cost of
> compaction is worth it in direct reclaim. Not much else. That is not
> the problem.

It seems there is no general agreement here. My understanding is that
this means that the user really prefers THP for whatever reasons.

> The problem is that even if you ignore the breakage and regression to
> real life workloads, what is happening right now would obviously
> require root privilege, yet MADV_HUGEPAGE requires no root privilege.

I do not follow.

> Swapping heavily because of MADV_HUGEPAGE, when there are gigabytes free on
> other nodes and not even 4k would be swapped out with THP turned off
> in sysfs, simply cannot be what MADV_HUGEPAGE was ever meant to be
> about, and it's a kernel regression that never existed until the
> commit that added __GFP_THISNODE to the default THP heuristic in
> mempolicy.

agreed

> I think we should defer for later the problem of whether 4k NUMA-local
> or remote THP is better by default. I provided two options
> myself because it didn't matter so much which option we picked in the
> short term, as long as the bug was fixed.
> 
> I wasn't particularly happy about your patch because it still swaps
> with certain defrag settings, which still allows things that
> shouldn't happen without some kind of privileged capability.

Well, I am not really sure about defrag=always. I would rather take care
of the default behavior to plug the regression first, and think about
`always' mode on top. Or is this a no-go from your POV?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-11 11:56   ` Michal Hocko
  2018-09-11 20:30     ` David Rientjes
@ 2018-09-12 13:54     ` Andrea Arcangeli
  2018-09-12 14:21       ` Michal Hocko
  1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2018-09-12 13:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

Hello,

On Tue, Sep 11, 2018 at 01:56:13PM +0200, Michal Hocko wrote:
> Well, it seems that expectations differ among users. It seems that kvm
> users do not really agree with your interpretation.

Like David also mentioned here:

lkml.kernel.org/r/alpine.DEB.2.21.1808211021110.258924@chino.kir.corp.google.com

what is a win depends on the hardware, so there's no one-size-fits-all.

For two sockets providing remote THP to KVM is likely a win, but
changing the defaults depending on boot-time NUMA topology makes
things less deterministic and it's also impossible to define an exact
break even point.

> I do realize that this is a gray zone because nobody bothered to define
> the semantics when MADV_HUGEPAGE was introduced (a826e422420b4
> is exceptionally short on information). So we are left with more or less
> undefined behavior and have to define it properly now. As we can see this might
> regress some workloads, but I strongly suspect that an explicit
> binding sounds like a more logical approach than a THP-specific mpol mode. If
> anything this should be a more generic memory policy basically saying
> that a zone/node reclaim mode should be enabled for the particular
> allocation.

MADV_HUGEPAGE means the allocation is long lived, so the cost of
compaction is worth it in direct reclaim. Not much else. That is not
the problem.

The problem is that even if you ignore the breakage and regression to
real life workloads, what is happening right now would obviously
require root privilege, yet MADV_HUGEPAGE requires no root privilege.

Swapping heavily because of MADV_HUGEPAGE, when there are gigabytes free on
other nodes and not even 4k would be swapped out with THP turned off
in sysfs, simply cannot be what MADV_HUGEPAGE was ever meant to be
about, and it's a kernel regression that never existed until the
commit that added __GFP_THISNODE to the default THP heuristic in
mempolicy.

I think we should defer for later the problem of whether 4k NUMA-local
or remote THP is better by default. I provided two options
myself because it didn't matter so much which option we picked in the
short term, as long as the bug was fixed.

I wasn't particularly happy about your patch because it still swaps
with certain defrag settings, which still allows things that
shouldn't happen without some kind of privileged capability.

If you can update the patch to prevent swapping in all cases it's a go
as far as I'm concerned. The main difference is that you're dropping
the THP logic in the mempolicy code, which will make things worse for some
cases; I was trying to retain it for all cases where swapping
wouldn't happen anyway, so that logic would still have provided the
behavior David prefers in those cases.

Adding the new feature to create a THP-specific mempolicy can be done
later. In the meantime the current mempolicy code can always override
whatever THP default behavior comes out of this; it will just
require the admin to set up a mempolicy that enforces the preferred
behavior for 4k and THP allocations alike.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-11 20:30     ` David Rientjes
@ 2018-09-12 12:05         ` Michal Hocko
  0 siblings, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2018-09-12 12:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Tue 11-09-18 13:30:20, David Rientjes wrote:
> On Tue, 11 Sep 2018, Michal Hocko wrote:
[...]
> > hugepage specific MPOL flags sounds like yet another step into even more
> > cluttered API and semantic, I am afraid. Why should this be any
> > different from regular page allocations? You are getting off-node memory
> > once your local node is full. You have to use an explicit binding to
> > disallow that. THP should be similar in that regards. Once you have said
> > that you _really_ want THP then you are closer to what we do for regular
> > pages IMHO.
> > 
> 
> Saying that we really want THP isn't an all-or-nothing decision.  We 
> certainly want to try hard to fault hugepages locally especially at task 
> startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
> very well for that.  Remote hugepages would be a regression that we now 
> have no way to avoid because the kernel doesn't provide for it, if we were 
> to remove __GFP_THISNODE that this patch introduces.

Why cannot you use mempolicy to bind to local nodes if you really care
about the locality?

> On Broadwell, for example, we find 7% slower access to remote hugepages 
> than local native pages.  On Naples, that becomes worse: 14% slower access 
> latency for intrasocket hugepages compared to local native pages and 39% 
> slower for intersocket.

So, again, how does this compare to regular 4k pages? You are going to
pay for the same remote access as well.

From what you have said so far it sounds like you would like to have
something like the zone/node reclaim mode fine grained for a specific
mapping. If we really want to support something like that then it should
be a generic policy rather than THP specific thing IMHO.

As I've said, it is hard to come up with a solution that would satisfy
everybody, but considering that the existing reports see this as a
regression, and considering that their NUMA requirements are not as strict as
yours, I would tend to think that stronger NUMA requirements should be
expressed explicitly rather than as an implicit effect of a madvise flag. We
do have APIs for that.
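
[Editor's note] Those explicit APIs can be exercised from the command line. The sketch below is illustrative only: the node ids, `./workload`, and the helper names are hypothetical, and it assumes numactl is installed for the printed commands to be usable.

```shell
# Sketch: expressing strict NUMA requirements explicitly (via mempolicy)
# instead of relying on MADV_HUGEPAGE's implicit __GFP_THISNODE.
build_binding() {
    # hard bind: allocations never fall back to other nodes (may reclaim/swap)
    echo "numactl --membind=$1 $2"
}
build_preference() {
    # soft preference: allocate on node $1, fall back remotely when full
    echo "numactl --preferred=$1 $2"
}
build_binding 0 ./workload
build_preference 0 ./workload
```

The distinction between the two helpers mirrors the thread's debate: a hard bind gives zone_reclaim_mode-like locality, while a preference allows remote fallback.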
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-11 11:56   ` Michal Hocko
@ 2018-09-11 20:30     ` David Rientjes
  2018-09-12 12:05         ` Michal Hocko
  2018-09-12 13:54     ` Andrea Arcangeli
  1 sibling, 1 reply; 77+ messages in thread
From: David Rientjes @ 2018-09-11 20:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Tue, 11 Sep 2018, Michal Hocko wrote:

> > That's not entirely true, the remote access latency for remote thp on all 
> > of our platforms is greater than local small pages, this is especially 
> > true for remote thp that is allocated intersocket and must be accessed 
> > through the interconnect.
> > 
> > Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> > allocation latency, but certainly not remote access latency.  There are 
> > users who remap their text segment onto transparent hugepages are fine 
> > with startup delay if they access all of their text from local thp.  
> > Remote thp would be a significant performance degradation.
> 
> Well, it seems that expectations differ among users. It seems that kvm
> users do not really agree with your interpretation.
> 

If kvm is happy to allocate hugepages remotely, at least on a subset of 
platforms where it doesn't incur such a high remote access latency, then 
we probably shouldn't be lumping that together with the current 
semantics of MADV_HUGEPAGE.  Otherwise, we risk it becoming a dumping 
ground where current users may regress because they would be much more 
willing to fault local pages of the native page size and lose the ability 
to require that absent using mbind() -- and in that case they would be 
affected by the policy decision of native page sizes as well.

> > When Andrea brought this up, I suggested that the full solution would be a 
> > MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> > benefit is that we could replace the thp "defrag" mode default by setting 
> > this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> > concerned about (1) getting thp when system-wide it is not default and (2) 
> > additional fault latency when direct compaction is not default.  They are 
> > not anticipating the degradation of remote access latency, so overloading 
> > the meaning of the mode is probably not a good idea.
> 
> hugepage specific MPOL flags sound like yet another step toward an even more
> cluttered API and semantics, I am afraid. Why should this be any
> different from regular page allocations? You are getting off-node memory
> once your local node is full. You have to use an explicit binding to
disallow that. THP should be similar in that regard. Once you have said
> that you _really_ want THP then you are closer to what we do for regular
> pages IMHO.
> 

Saying that we really want THP isn't an all-or-nothing decision.  We 
certainly want to try hard to fault hugepages locally especially at task 
startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
very well for that.  Remote hugepages would be a regression that we would 
have no way to avoid, because the kernel doesn't provide for it, if we were 
to remove __GFP_THISNODE as this patch proposes.

On Broadwell, for example, we find 7% slower access to remote hugepages 
than local native pages.  On Naples, that becomes worse: 14% slower access 
latency for intrasocket hugepages compared to local native pages and 39% 
slower for intersocket.

> I do realize that this is a gray zone because nobody bothered to define
> the semantic since the MADV_HUGEPAGE has been introduced (a826e422420b4
> is exceptionally short on information). So we are left with more or less
> undefined behavior and should define it properly now. As we can see this might
> regress in some workloads but I strongly suspect that an explicit
> binding sounds like a more logical approach than a thp specific mpol mode. If
> anything this should be a more generic memory policy basically saying
> that a zone/node reclaim mode should be enabled for the particular
> allocation.

This would be quite a serious regression with no way to actually define 
that we want local hugepages but allow fallback to remote native pages if 
we cannot allocate local native pages.  So rather than causing userspace 
to regress and give them no alternative, I would suggest either hugepage 
specific mempolicies or another madvise mode to allow remotely allocated 
hugepages.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-10 20:08 ` David Rientjes
  2018-09-10 20:22   ` Stefan Priebe - Profihost AG
  2018-09-11  8:51   ` Vlastimil Babka
@ 2018-09-11 11:56   ` Michal Hocko
  2018-09-11 20:30     ` David Rientjes
  2018-09-12 13:54     ` Andrea Arcangeli
  2 siblings, 2 replies; 77+ messages in thread
From: Michal Hocko @ 2018-09-11 11:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Mon 10-09-18 13:08:34, David Rientjes wrote:
> On Fri, 7 Sep 2018, Michal Hocko wrote:
[...]
> > Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> > it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> > juggle gfp flags for different allocation modes. The rationale is that
> > __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> > to a different node might be more harmful than the benefit of a large page.
> > If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> > a higher priority than local NUMA placement.
> > 
> 
> That's not entirely true, the remote access latency for remote thp on all 
> of our platforms is greater than local small pages, this is especially 
> true for remote thp that is allocated intersocket and must be accessed 
> through the interconnect.
> 
> Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> allocation latency, but certainly not remote access latency.  There are 
> users who remap their text segment onto transparent hugepages and are fine 
> with the startup delay if they can access all of their text from local thp.  
> Remote thp would be a significant performance degradation.

Well, it seems that expectations differ for users. It seems that kvm
users do not really agree with your interpretation.

> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.

hugepage specific MPOL flags sound like yet another step toward an even more
cluttered API and semantics, I am afraid. Why should this be any
different from regular page allocations? You are getting off-node memory
once your local node is full. You have to use an explicit binding to
disallow that. THP should be similar in that regard. Once you have said
that you _really_ want THP then you are closer to what we do for regular
pages IMHO.

I do realize that this is a gray zone because nobody bothered to define
the semantic since the MADV_HUGEPAGE has been introduced (a826e422420b4
is exceptionally short on information). So we are left with more or less
undefined behavior and should define it properly now. As we can see this might
regress in some workloads but I strongly suspect that an explicit
binding sounds like a more logical approach than a thp specific mpol mode. If
anything this should be a more generic memory policy basically saying
that a zone/node reclaim mode should be enabled for the particular
allocation.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-08 18:52 ` Stefan Priebe - Profihost AG
  2018-09-10  7:39   ` Michal Hocko
@ 2018-09-11  9:03   ` Vlastimil Babka
  1 sibling, 0 replies; 77+ messages in thread
From: Vlastimil Babka @ 2018-09-11  9:03 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Michal Hocko, Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko

On 09/08/2018 08:52 PM, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> while using this patch I got another stall - which I never saw under
> kernel 4.4. Here is the trace:
> [305111.932698] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
> [305111.933612]       Tainted: G                   4.12.0+105-ph #1
> [305111.934456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [305111.935323] ksmtuned        D    0  1399      1 0x00080000
> [305111.936207] Call Trace:
> [305111.937118]  ? __schedule+0x3bc/0x830
> [305111.937991]  schedule+0x32/0x80
> [305111.938837]  schedule_preempt_disabled+0xa/0x10
> [305111.939687]  __mutex_lock.isra.4+0x287/0x4c0

Hmm so is this waiting on mutex_lock(&ksm_thread_mutex)? Who's holding it?

Looking at your original report [1] I see more tasks waiting on a semaphore:

[245654.463746] INFO: task pvestatd:3175 blocked for more than 120 seconds.
[245654.464730] Tainted: G W 4.12.0+98-ph #1
[245654.465666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.466591] pvestatd D 0 3175 1 0x00080000
[245654.467495] Call Trace:
[245654.468413] ? __schedule+0x3bc/0x830
[245654.469283] schedule+0x32/0x80
[245654.470108] rwsem_down_read_failed+0x121/0x170
[245654.470918] ? call_rwsem_down_read_failed+0x14/0x30
[245654.471729] ? alloc_pages_current+0x91/0x140
[245654.472530] call_rwsem_down_read_failed+0x14/0x30
[245654.473316] down_read+0x13/0x30
[245654.474064] proc_pid_cmdline_read+0xae/0x3f0

- probably down_read(&mm->mmap_sem);

[245654.485537] INFO: task service:86643 blocked for more than 120 seconds.
[245654.486015] Tainted: G W 4.12.0+98-ph #1
[245654.486502] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.486960] service D 0 86643 1 0x00080000
[245654.487423] Call Trace:
[245654.487865] ? __schedule+0x3bc/0x830
[245654.488286] schedule+0x32/0x80
[245654.488763] rwsem_down_read_failed+0x121/0x170
[245654.489200] ? __handle_mm_fault+0xd67/0x1060
[245654.489668] ? call_rwsem_down_read_failed+0x14/0x30
[245654.490088] call_rwsem_down_read_failed+0x14/0x30
[245654.490538] down_read+0x13/0x30
[245654.490964] __do_page_fault+0x32b/0x430
[245654.491389] ? get_vtime_delta+0x13/0xb0
[245654.491835] do_page_fault+0x2a/0x70
[245654.492253] ? page_fault+0x65/0x80
[245654.492703] page_fault+0x7b/0x80

- page fault, so another mmap_sem for read

[245654.496050] INFO: task safe_timer:86651 blocked for more than 120
seconds.
[245654.496466] Tainted: G W 4.12.0+98-ph #1
[245654.496916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.497336] safe_timer D 0 86651 1 0x00080000
[245654.497790] Call Trace:
[245654.498211] ? __schedule+0x3bc/0x830
[245654.498646] schedule+0x32/0x80
[245654.499074] rwsem_down_read_failed+0x121/0x170
[245654.499501] ? call_rwsem_down_read_failed+0x14/0x30
[245654.499959] call_rwsem_down_read_failed+0x14/0x30
[245654.500381] down_read+0x13/0x30
[245654.500851] __do_page_fault+0x32b/0x430
[245654.501285] ? get_vtime_delta+0x13/0xb0
[245654.501722] do_page_fault+0x2a/0x70
[245654.502163] ? page_fault+0x65/0x80
[245654.502600] page_fault+0x7b/0x80

- ditto

But those are different tasks that the one that was stalling allocation
while holding mmap_sem for write.

So I'm not sure what's going on, but maybe the reclaim is also stalled
due to waiting on some locks, and is thus a victim of something else?

[1] https://lists.opensuse.org/opensuse-kernel/2018-08/msg00012.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-10 20:08 ` David Rientjes
  2018-09-10 20:22   ` Stefan Priebe - Profihost AG
@ 2018-09-11  8:51   ` Vlastimil Babka
  2018-09-11 11:56   ` Michal Hocko
  2 siblings, 0 replies; 77+ messages in thread
From: Vlastimil Babka @ 2018-09-11  8:51 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko, Stefan Priebe

On 09/10/2018 10:08 PM, David Rientjes wrote:
> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 

Can you elaborate on the semantics of this? You mean that a given vma
could now have two mempolicies, where one would be for hugepages only?
That's likely much more easy to suggest than to implement, with all uapi
consequences...

> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-10 20:08 ` David Rientjes
@ 2018-09-10 20:22   ` Stefan Priebe - Profihost AG
  2018-09-11  8:51   ` Vlastimil Babka
  2018-09-11 11:56   ` Michal Hocko
  2 siblings, 0 replies; 77+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-09-10 20:22 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko

On 10.09.2018 at 22:08, David Rientjes wrote:
> On Fri, 7 Sep 2018, Michal Hocko wrote:
> 
>> From: Michal Hocko <mhocko@suse.com>
>>
>> Andrea has noticed [1] that a THP allocation might be really disruptive
>> when allocated on NUMA system with the local node full or hard to
>> reclaim. Stefan has posted an allocation stall report on 4.12 based
>> SLES kernel which suggests the same issue:
>> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
>> [245513.363983] kvm cpuset=/ mems_allowed=0-1
>> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
>> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
>> [245513.365905] Call Trace:
>> [245513.366535]  dump_stack+0x5c/0x84
>> [245513.367148]  warn_alloc+0xe0/0x180
>> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
>> [245513.368406]  ? __slab_free+0xa9/0x2f0
>> [245513.369048]  ? __slab_free+0xa9/0x2f0
>> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
>> [245513.370300]  alloc_pages_vma+0x1e5/0x280
>> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
>> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
>> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
>> [245513.372812]  __handle_mm_fault+0x93d/0x1060
>> [245513.373439]  handle_mm_fault+0xc6/0x1b0
>> [245513.374042]  __do_page_fault+0x230/0x430
>> [245513.374679]  ? get_vtime_delta+0x13/0xb0
>> [245513.375411]  do_page_fault+0x2a/0x70
>> [245513.376145]  ? page_fault+0x65/0x80
>> [245513.376882]  page_fault+0x7b/0x80
> 
> Since we don't have __GFP_REPEAT, this suggests that 
> __alloc_pages_direct_compact() took >100s to complete.  The memory 
> capacity of the system isn't shown, but I assume it's around 768GB?  This 
> should be with COMPACT_PRIO_ASYNC, and MIGRATE_ASYNC compaction certainly 
> should abort much earlier.

Yes it's 768GB.

Greets,
Stefan

>> [245513.382056] Mem-Info:
>> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>>                  active_file:60183 inactive_file:245285 isolated_file:0
>>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>>                  free:32294031 free_pcp:6233 free_cma:0
>> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>>
>> The defrag mode is "madvise" and from the above report it is clear that
>> the THP has been allocated for a MADV_HUGEPAGE vma.
>>
>> Andrea has identified that the main source of the problem is
>> __GFP_THISNODE usage:
>>
>> : The problem is that direct compaction combined with the NUMA
>> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
>> : hard the local node, instead of failing the allocation if there's no
>> : THP available in the local node.
>> :
>> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
>> : path even with MPOL_DEFAULT.
>> :
>> : The idea behind the __GFP_THISNODE addition, is that it is better to
>> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
>> : backed memory. That largely depends on the remote latency though, on
>> : threadrippers for example the overhead is relatively low in my
>> : experience.
>> :
>> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
>> : extremely slow qemu startup with vfio, if the VM is larger than the
>> : size of one host NUMA node. This is because it will try very hard to
>> : unsuccessfully swapout get_user_pages pinned pages as result of the
>> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
>> : allocations and instead of trying to allocate THP on other nodes (it
>> : would be even worse without vfio type1 GUP pins of course, except it'd
>> : be swapping heavily instead).
>>
>> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
>> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
>> juggle gfp flags for different allocation modes. The rationale is that
>> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
>> to a different node might be more harmful than the benefit of a large page.
>> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
>> a higher priority than local NUMA placement.
>>
> 
> That's not entirely true, the remote access latency for remote thp on all 
> of our platforms is greater than local small pages, this is especially 
> true for remote thp that is allocated intersocket and must be accessed 
> through the interconnect.
> 
> Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> allocation latency, but certainly not remote access latency.  There are 
> users who remap their text segment onto transparent hugepages and are fine 
> with the startup delay if they can access all of their text from local thp.  
> Remote thp would be a significant performance degradation.
> 
> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-07 13:05 ` Michal Hocko
  (?)
  (?)
@ 2018-09-10 20:08 ` David Rientjes
  2018-09-10 20:22   ` Stefan Priebe - Profihost AG
                     ` (2 more replies)
  -1 siblings, 3 replies; 77+ messages in thread
From: David Rientjes @ 2018-09-10 20:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko, Stefan Priebe

On Fri, 7 Sep 2018, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on 4.12 based
> SLES kernel which suggests the same issue:
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80

Since we don't have __GFP_REPEAT, this suggests that 
__alloc_pages_direct_compact() took >100s to complete.  The memory 
capacity of the system isn't shown, but I assume it's around 768GB?  This 
should be with COMPACT_PRIO_ASYNC, and MIGRATE_ASYNC compaction certainly 
should abort much earlier.

> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for a MADV_HUGEPAGE vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> juggle gfp flags for different allocation modes. The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 

That's not entirely true, the remote access latency for remote thp on all 
of our platforms is greater than local small pages, this is especially 
true for remote thp that is allocated intersocket and must be accessed 
through the interconnect.

Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
allocation latency, but certainly not remote access latency.  There are 
users who remap their text segment onto transparent hugepages and are fine 
with the startup delay if they can access all of their text from local thp.  
Remote thp would be a significant performance degradation.

When Andrea brought this up, I suggested that the full solution would be a 
MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
benefit is that we could replace the thp "defrag" mode default by setting 
this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
concerned about (1) getting thp when system-wide it is not default and (2) 
additional fault latency when direct compaction is not default.  They are 
not anticipating the degradation of remote access latency, so overloading 
the meaning of the mode is probably not a good idea.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-08 18:52 ` Stefan Priebe - Profihost AG
@ 2018-09-10  7:39   ` Michal Hocko
  2018-09-11  9:03   ` Vlastimil Babka
  1 sibling, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2018-09-10  7:39 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Andrew Morton, Andrea Arcangeli, David Rientjes, Zi Yan,
	Kirill A. Shutemov, linux-mm, LKML

[Cc Vlastimil. The full report is http://lkml.kernel.org/r/f7ed71c1-d599-5257-fd8f-041eb24d9f29@profihost.ag]
On Sat 08-09-18 20:52:35, Stefan Priebe - Profihost AG wrote:
> [305146.987742] khugepaged: page allocation stalls for 224236ms, order:9,
> mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)

This is certainly not a result of this patch AFAICS. khugepaged does add
__GFP_THISNODE regardless of what alloc_hugepage_khugepaged_gfpmask
thinks about that flag. Something to look into as well I guess.

Anyway, I guess we want to look closer at what compaction is doing here
because such a long stall is really not acceptable at all. Maybe this is
something 4.12 kernel related. This is hard to tell. Unfortunately,
upstream has lost the stall warning so you wouldn't know this is the
case with newer kernels.

Anyway, running with compaction tracepoints might tell us more.
Vlastimil will surely help to tell you which of them to enable.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-07 13:05 ` Michal Hocko
  (?)
@ 2018-09-08 18:52 ` Stefan Priebe - Profihost AG
  2018-09-10  7:39   ` Michal Hocko
  2018-09-11  9:03   ` Vlastimil Babka
  -1 siblings, 2 replies; 77+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-09-08 18:52 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko

Hello,

while using this patch I got another stall - which I never saw under
kernel 4.4. Here is the trace:
[305111.932698] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
[305111.933612]       Tainted: G                   4.12.0+105-ph #1
[305111.934456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[305111.935323] ksmtuned        D    0  1399      1 0x00080000
[305111.936207] Call Trace:
[305111.937118]  ? __schedule+0x3bc/0x830
[305111.937991]  schedule+0x32/0x80
[305111.938837]  schedule_preempt_disabled+0xa/0x10
[305111.939687]  __mutex_lock.isra.4+0x287/0x4c0
[305111.940550]  ? run_store+0x47/0x2b0
[305111.941416]  run_store+0x47/0x2b0
[305111.942284]  ? __kmalloc+0x157/0x1d0
[305111.943138]  kernfs_fop_write+0x102/0x180
[305111.943988]  __vfs_write+0x26/0x140
[305111.944827]  ? __alloc_fd+0x44/0x170
[305111.945669]  ? set_close_on_exec+0x30/0x60
[305111.946519]  vfs_write+0xb1/0x1e0
[305111.947359]  SyS_write+0x42/0x90
[305111.948193]  do_syscall_64+0x74/0x150
[305111.949014]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[305111.949854] RIP: 0033:0x7fe7cde93730
[305111.950678] RSP: 002b:00007fffab0d5e88 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[305111.951525] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007fe7cde93730
[305111.952358] RDX: 0000000000000002 RSI: 00000000011b1c08 RDI:
0000000000000001
[305111.953170] RBP: 00000000011b1c08 R08: 00007fe7ce153760 R09:
00007fe7ce797b40
[305111.953979] R10: 0000000000000073 R11: 0000000000000246 R12:
0000000000000002
[305111.954790] R13: 0000000000000001 R14: 00007fe7ce152600 R15:
0000000000000002
[305146.987742] khugepaged: page allocation stalls for 224236ms,
order:9,
mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM),
nodemask=(null)
[305146.989652] khugepaged cpuset=/ mems_allowed=0-1
[305146.990582] CPU: 1 PID: 405 Comm: khugepaged Tainted: G
     4.12.0+105-ph #1 SLE15 (unreleased)
[305146.991536] Hardware name: Supermicro
X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.00 07/05/2013
[305146.992524] Call Trace:
[305146.993493]  dump_stack+0x5c/0x84
[305146.994499]  warn_alloc+0xe0/0x180
[305146.995469]  __alloc_pages_slowpath+0x820/0xc90
[305146.996456]  ? get_vtime_delta+0x13/0xb0
[305146.997424]  ? sched_clock+0x5/0x10
[305146.998394]  ? del_timer_sync+0x35/0x40
[305146.999370]  __alloc_pages_nodemask+0x1cc/0x210
[305147.000369]  khugepaged_alloc_page+0x39/0x70
[305147.001326]  khugepaged+0xc0c/0x20c0
[305147.002214]  ? remove_wait_queue+0x60/0x60
[305147.003226]  kthread+0xff/0x130
[305147.004219]  ? collapse_shmem+0xba0/0xba0
[305147.005131]  ? kthread_create_on_node+0x40/0x40
[305147.005971]  ret_from_fork+0x35/0x40
[305147.006835] Mem-Info:
[305147.007681] active_anon:51674768 inactive_anon:69112 isolated_anon:21
                 active_file:47818 inactive_file:51708 isolated_file:0
                 unevictable:15710 dirty:187 writeback:0 unstable:0
                 slab_reclaimable:62499 slab_unreclaimable:1284920
                 mapped:66765 shmem:47623 pagetables:185294 bounce:0
                 free:44265934 free_pcp:23646 free_cma:0
[305147.012664] Node 0 active_anon:116919912kB inactive_anon:238824kB
active_file:157296kB inactive_file:112820kB unevictable:58364kB
isolated(anon):80kB isolated(file):0kB mapped:221548kB dirty:548kB
writeback:0kB shmem:153196kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 14430208kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[305147.015437] Node 1 active_anon:89781496kB inactive_anon:37624kB
active_file:33976kB inactive_file:94012kB unevictable:4476kB
isolated(anon):4kB isolated(file):0kB mapped:45512kB dirty:200kB
writeback:0kB shmem:37296kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 9279488kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[305147.018550] Node 0 DMA free:15816kB min:12kB low:24kB high:36kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15988kB managed:15816kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[305147.022085] lowmem_reserve[]: 0 1922 193381 193381 193381
[305147.023208] Node 0 DMA32 free:769020kB min:1964kB low:3944kB
high:5924kB active_anon:1204144kB inactive_anon:0kB active_file:4kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2046368kB
managed:1980800kB mlocked:0kB slab_reclaimable:32kB
slab_unreclaimable:5376kB kernel_stack:0kB pagetables:1560kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:0kB
[305147.026768] lowmem_reserve[]: 0 0 191458 191458 191458
[305147.028044] Node 0 Normal free:71769584kB min:194564kB low:390620kB
high:586676kB active_anon:115715084kB inactive_anon:238824kB
active_file:157292kB inactive_file:112820kB unevictable:58364kB
writepending:548kB present:199229440kB managed:196058772kB
mlocked:58364kB slab_reclaimable:146536kB slab_unreclaimable:2697676kB
kernel_stack:12488kB pagetables:669756kB bounce:0kB free_pcp:42284kB
local_pcp:284kB free_cma:0kB
[305147.033356] lowmem_reserve[]: 0 0 0 0 0
[305147.034754] Node 1 Normal free:104504180kB min:196664kB low:394836kB
high:593008kB active_anon:89783256kB inactive_anon:37624kB
active_file:33976kB inactive_file:94012kB unevictable:4476kB
writepending:200kB present:201326592kB managed:198175320kB
mlocked:4476kB slab_reclaimable:103428kB slab_unreclaimable:2436628kB
kernel_stack:14232kB pagetables:69860kB bounce:0kB free_pcp:51764kB
local_pcp:936kB free_cma:0kB
[305147.040667] lowmem_reserve[]: 0 0 0 0 0
[305147.042180] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 1*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) =
15816kB
[305147.045054] Node 0 DMA32: 6363*4kB (UM) 3686*8kB (UM) 2106*16kB (UM)
1102*32kB (UME) 608*64kB (UME) 304*128kB (UME) 146*256kB (UME) 59*512kB
(UME) 32*1024kB (UME) 4*2048kB (UME) 112*4096kB (M) = 769020kB
[305147.048041] Node 0 Normal: 29891*4kB (UME) 367874*8kB (UME)
978139*16kB (UME) 481963*32kB (UME) 173612*64kB (UME) 65480*128kB (UME)
28019*256kB (UME) 10993*512kB (UME) 5217*1024kB (UM) 0*2048kB 0*4096kB =
71771692kB
[305147.051291] Node 1 Normal: 396333*4kB (UME) 257656*8kB (UME)
276637*16kB (UME) 190234*32kB (ME) 101344*64kB (ME) 39168*128kB (UME)
19207*256kB (UME) 8599*512kB (UME) 3065*1024kB (UM) 2*2048kB (UM)
16206*4096kB (M) = 104501892kB
[305147.054836] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[305147.056555] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[305147.058160] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[305147.059817] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[305147.061429] 149901 total pagecache pages
[305147.063124] 2 pages in swap cache
[305147.064908] Swap cache stats: add 7, delete 5, find 0/0
[305147.066676] Free swap  = 3905020kB
[305147.068268] Total swap = 3905532kB
[305147.069955] 100654597 pages RAM
[305147.071569] 0 pages HighMem/MovableOnly
[305147.073176] 1596920 pages reserved
[305147.074946] 0 pages hwpoisoned
[326258.236694] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
[326258.237723]       Tainted: G                   4.12.0+105-ph #1
[326258.238718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[326258.239679] ksmtuned        D    0  1399      1 0x00080000
[326258.240651] Call Trace:
[326258.241602]  ? __schedule+0x3bc/0x830
[326258.242557]  schedule+0x32/0x80
[326258.243462]  schedule_preempt_disabled+0xa/0x10
[326258.244336]  __mutex_lock.isra.4+0x287/0x4c0
[326258.245205]  ? run_store+0x47/0x2b0
[326258.246064]  run_store+0x47/0x2b0
[326258.246890]  ? __kmalloc+0x157/0x1d0
[326258.247716]  kernfs_fop_write+0x102/0x180
[326258.248514]  __vfs_write+0x26/0x140
[326258.249284]  ? __alloc_fd+0x44/0x170
[326258.250062]  ? set_close_on_exec+0x30/0x60
[326258.250812]  vfs_write+0xb1/0x1e0
[326258.251548]  SyS_write+0x42/0x90
[326258.252237]  do_syscall_64+0x74/0x150
[326258.252920]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[326258.253577] RIP: 0033:0x7fe7cde93730
[326258.254263] RSP: 002b:00007fffab0d5e88 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[326258.254916] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007fe7cde93730
[326258.255564] RDX: 0000000000000002 RSI: 00000000011b1c08 RDI:
0000000000000001
[326258.256187] RBP: 00000000011b1c08 R08: 00007fe7ce153760 R09:
00007fe7ce797b40
[326258.256801] R10: 0000000000000073 R11: 0000000000000246 R12:
0000000000000002
[326258.257406] R13: 0000000000000001 R14: 00007fe7ce152600 R15:
0000000000000002

Greets,
Stefan
On 07.09.2018 at 15:05, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on 4.12 based
> SLES kernel which suggests the same issue:
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.365258] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80
> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for a MADV_HUGEPAGE vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> juggle gfp flags for different allocation modes. The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 
> Be careful when the vma has an explicit numa binding though, because
> __GFP_THISNODE is not playing well with it. We want to follow the
> explicit numa policy rather than enforce a node which happens to be
> local to the cpu we are running on.
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Tested-by: Stefan Priebe <s.priebe@profihost.ag>
> Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> 
> Hi,
> this is a follow-up for [1]. Andrea has proposed two approaches to solve
> the regression. This is an alternative implementation of the second
> approach [2]. The reason for an alternative approach is that I strongly
> believe that all the subtle THP gfp manipulation should be at a single
> place (alloc_hugepage_direct_gfpmask) rather than spread in multiple
> places with additional fixup. There is one notable difference to [2]
> and that is the defrag=always behavior, where I am preserving the original
> behavior. The reason for that is that defrag=always has always had a
> tendency to stall and reclaim, and we have addressed that by defining a
> new default defrag mode. We can discuss this behavior later but I
> believe the default mode and a regression noticed by multiple users
> should be closed regardless. Hence this patch.
> 
> [2] http://lkml.kernel.org/r/20180820032640.9896-2-aarcange@redhat.com
> 
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 26 ++++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 21 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..56c9aac4dc86 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,31 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +	mpol_cond_put(pol);
> +#endif
>  
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>  
>  /* Caller must hold page table lock. */
> @@ -715,7 +725,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1300,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -
>  	nmask = policy_nodemask(gfp, pol);
>  	preferred_nid = policy_node(gfp, pol, node);
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
@ 2018-09-07 13:05 ` Michal Hocko
  0 siblings, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2018-09-07 13:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko, Stefan Priebe

From: Michal Hocko <mhocko@suse.com>

Andrea has noticed [1] that a THP allocation might be really disruptive
when allocated on NUMA system with the local node full or hard to
reclaim. Stefan has posted an allocation stall report on 4.12 based
SLES kernel which suggests the same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.365258] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for a MADV_HUGEPAGE vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
juggle gfp flags for different allocation modes. The rationale is that
__GFP_THISNODE is helpful in relaxed defrag modes because falling back
to a different node might be more harmful than the benefit of a large page.
If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
a higher priority than local NUMA placement.

Be careful when the vma has an explicit numa binding though, because
__GFP_THISNODE is not playing well with it. We want to follow the
explicit numa policy rather than enforce a node which happens to be
local to the cpu we are running on.

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is a follow-up for [1]. Andrea has proposed two approaches to solve
the regression. This is an alternative implementation of the second
approach [2]. The reason for an alternative approach is that I strongly
believe that all the subtle THP gfp manipulation should be at a single
place (alloc_hugepage_direct_gfpmask) rather than spread in multiple
places with additional fixup. There is one notable difference to [2]
and that is the defrag=always behavior, where I am preserving the original
behavior. The reason for that is that defrag=always has always had a
tendency to stall and reclaim, and we have addressed that by defining a
new default defrag mode. We can discuss this behavior later but I
believe the default mode and a regression noticed by multiple users
should be closed regardless. Hence this patch.

[2] http://lkml.kernel.org/r/20180820032640.9896-2-aarcange@redhat.com

 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 26 ++++++++++++++++++--------
 mm/mempolicy.c            | 28 +---------------------------
 3 files changed, 21 insertions(+), 35 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..56c9aac4dc86 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,31 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +725,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1300,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
@ 2018-09-07 13:05 ` Michal Hocko
  0 siblings, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2018-09-07 13:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko, Stefan Priebe

From: Michal Hocko <mhocko@suse.com>

Andrea has noticed [1] that a THP allocation might be really disruptive
when allocated on NUMA system with the local node full or hard to
reclaim. Stefan has posted an allocation stall report on 4.12 based
SLES kernel which suggests the same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard, and
: fail, to swap out get_user_pages()-pinned pages as a result of
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without the vfio type1 GUP pins of course, except
: it'd be swapping heavily instead).

Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
where we already juggle gfp flags for the different allocation modes.
The rationale is that __GFP_THISNODE is helpful in the relaxed defrag
modes because falling back to a remote node might be more harmful than
the benefit of a large page. If the user really requires THP (e.g. by
MADV_HUGEPAGE) then the THP has a higher priority than local NUMA
placement.

Be careful when the vma has an explicit numa binding though, because
__GFP_THISNODE does not play well with it. We want to follow the
explicit numa policy rather than enforce a node which happens to be
local to the cpu we are running on.
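
The resulting decision table can be modeled in userspace as follows.
The flag names and the thp_gfp() helper are simplified stand-ins for the
real gfp bits and alloc_hugepage_direct_gfpmask(), and "bound" stands
for pol->mode == MPOL_BIND; this is an illustration of the table, not
kernel code:

```c
#include <stdbool.h>

enum { THISNODE = 1, DIRECT = 2, KSWAPD = 4, NORETRY = 8 };
enum defrag { ALWAYS, DEFER, DEFER_MADVISE, MADVISE, NEVER };

/* Mirror of the post-patch gfp decision: __GFP_THISNODE only without
 * an explicit MPOL_BIND, direct reclaim only when madvised (or under
 * defrag=always). */
static int thp_gfp(enum defrag mode, bool madvised, bool bound)
{
	int this_node = bound ? 0 : THISNODE;

	switch (mode) {
	case ALWAYS:	/* GFP_TRANSHUGE already implies direct reclaim */
		return DIRECT | (madvised ? 0 : NORETRY | this_node);
	case DEFER:
		return KSWAPD | this_node;
	case DEFER_MADVISE:
		return madvised ? DIRECT : KSWAPD | this_node;
	case MADVISE:
		return madvised ? DIRECT : this_node;
	case NEVER:
		return this_node;
	}
	return 0;
}
```

So under the default defrag=madvise, a MADV_HUGEPAGE vma gets direct
reclaim without __GFP_THISNODE (remote THP wins over local PAGE_SIZE
fallback), while non-madvised vmas stay local with only light attempts.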

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is a follow-up to [1]. Andrea has proposed two approaches to solve
the regression. This is an alternative implementation of the second
approach [2]. The reason for an alternative implementation is that I
strongly believe all the subtle THP gfp manipulation should live in a
single place (alloc_hugepage_direct_gfpmask) rather than be spread over
multiple places with additional fixups. There is one notable difference
from [2]: for defrag=always I am preserving the original behavior. The
reason is that defrag=always has always had a tendency to stall and
over-reclaim, and we have addressed that by defining a new default
defrag mode. We can discuss the defrag=always behavior later, but I
believe the regression in the default mode, noticed by multiple users,
should be closed regardless. Hence this patch.

[2] http://lkml.kernel.org/r/20180820032640.9896-2-aarcange@redhat.com

 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 26 ++++++++++++++++++--------
 mm/mempolicy.c            | 28 +---------------------------
 3 files changed, 21 insertions(+), 35 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..56c9aac4dc86 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,31 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +725,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1300,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
-- 
2.18.0
