All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
@ 2018-09-07 13:05 ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-07 13:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko, Stefan Priebe

From: Michal Hocko <mhocko@suse.com>

Andrea has noticed [1] that a THP allocation might be really disruptive
when allocated on NUMA system with the local node full or hard to
reclaim. Stefan has posted an allocation stall report on 4.12 based
SLES kernel which suggests the same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
juggle gfp flags for different allocation modes. The rationale is that
__GFP_THISNODE is helpful in relaxed defrag modes because falling back
to a different node might be more harmful than the benefit of a large page.
If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
a higher priority than local NUMA placement.

Be careful when the vma has an explicit numa binding though, because
__GFP_THISNODE is not playing well with it. We want to follow the
explicit numa policy rather than enforce a node which happens to be
local to the cpu we are running on.

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is a follow up for [1]. Anrea has proposed two approaches to solve
the regression. This is an alternative implementation of the second
approach [2]. The reason for an alternative approach is that I strongly
believe that all the subtle THP gfp manipulation should be at a single
place (alloc_hugepage_direct_gfpmask) rather than spread in multiple
places with additional fixup. There is one notable difference to [2]
and that is defrag=allways behavior where I am preserving the original
behavior. The reason for that is that defrag=always has always had
tendency to stall and reclaim and we have addressed that by defining a
new default defrag mode. We can discuss this behavior later but I
believe the default mode and a regression noticed by multiple users
should be closed regardless. Hence this patch.

[2] http://lkml.kernel.org/r/20180820032640.9896-2-aarcange@redhat.com

 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 26 ++++++++++++++++++--------
 mm/mempolicy.c            | 28 +---------------------------
 3 files changed, 21 insertions(+), 35 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..56c9aac4dc86 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,31 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +725,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1300,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
@ 2018-09-07 13:05 ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-07 13:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko, Stefan Priebe

From: Michal Hocko <mhocko@suse.com>

Andrea has noticed [1] that a THP allocation might be really disruptive
when allocated on NUMA system with the local node full or hard to
reclaim. Stefan has posted an allocation stall report on 4.12 based
SLES kernel which suggests the same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
juggle gfp flags for different allocation modes. The rationale is that
__GFP_THISNODE is helpful in relaxed defrag modes because falling back
to a different node might be more harmful than the benefit of a large page.
If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
a higher priority than local NUMA placement.

Be careful when the vma has an explicit numa binding though, because
__GFP_THISNODE is not playing well with it. We want to follow the
explicit numa policy rather than enforce a node which happens to be
local to the cpu we are running on.

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is a follow up for [1]. Anrea has proposed two approaches to solve
the regression. This is an alternative implementation of the second
approach [2]. The reason for an alternative approach is that I strongly
believe that all the subtle THP gfp manipulation should be at a single
place (alloc_hugepage_direct_gfpmask) rather than spread in multiple
places with additional fixup. There is one notable difference to [2]
and that is defrag=allways behavior where I am preserving the original
behavior. The reason for that is that defrag=always has always had
tendency to stall and reclaim and we have addressed that by defining a
new default defrag mode. We can discuss this behavior later but I
believe the default mode and a regression noticed by multiple users
should be closed regardless. Hence this patch.

[2] http://lkml.kernel.org/r/20180820032640.9896-2-aarcange@redhat.com

 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 26 ++++++++++++++++++--------
 mm/mempolicy.c            | 28 +---------------------------
 3 files changed, 21 insertions(+), 35 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..56c9aac4dc86 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,31 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +725,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1300,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-07 13:05 ` Michal Hocko
  (?)
@ 2018-09-08 18:52 ` Stefan Priebe - Profihost AG
  2018-09-10  7:39   ` Michal Hocko
  2018-09-11  9:03   ` Vlastimil Babka
  -1 siblings, 2 replies; 40+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-09-08 18:52 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko

Hello,

whlie using this path i got another stall - which i never saw under
kernel 4.4. Here is the trace:
[305111.932698] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
[305111.933612]       Tainted: G                   4.12.0+105-ph #1
[305111.934456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[305111.935323] ksmtuned        D    0  1399      1 0x00080000
[305111.936207] Call Trace:
[305111.937118]  ? __schedule+0x3bc/0x830
[305111.937991]  schedule+0x32/0x80
[305111.938837]  schedule_preempt_disabled+0xa/0x10
[305111.939687]  __mutex_lock.isra.4+0x287/0x4c0
[305111.940550]  ? run_store+0x47/0x2b0
[305111.941416]  run_store+0x47/0x2b0
[305111.942284]  ? __kmalloc+0x157/0x1d0
[305111.943138]  kernfs_fop_write+0x102/0x180
[305111.943988]  __vfs_write+0x26/0x140
[305111.944827]  ? __alloc_fd+0x44/0x170
[305111.945669]  ? set_close_on_exec+0x30/0x60
[305111.946519]  vfs_write+0xb1/0x1e0
[305111.947359]  SyS_write+0x42/0x90
[305111.948193]  do_syscall_64+0x74/0x150
[305111.949014]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[305111.949854] RIP: 0033:0x7fe7cde93730
[305111.950678] RSP: 002b:00007fffab0d5e88 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[305111.951525] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007fe7cde93730
[305111.952358] RDX: 0000000000000002 RSI: 00000000011b1c08 RDI:
0000000000000001
[305111.953170] RBP: 00000000011b1c08 R08: 00007fe7ce153760 R09:
00007fe7ce797b40
[305111.953979] R10: 0000000000000073 R11: 0000000000000246 R12:
0000000000000002
[305111.954790] R13: 0000000000000001 R14: 00007fe7ce152600 R15:
0000000000000002
[305146.987742] khugepaged: page allocation stalls for 224236ms,
order:9,
mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM),
nodemask=(null)
[305146.989652] khugepaged cpuset=/ mems_allowed=0-1
[305146.990582] CPU: 1 PID: 405 Comm: khugepaged Tainted: G
     4.12.0+105-ph #1 SLE15 (unreleased)
[305146.991536] Hardware name: Supermicro
X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.00 07/05/2013
[305146.992524] Call Trace:
[305146.993493]  dump_stack+0x5c/0x84
[305146.994499]  warn_alloc+0xe0/0x180
[305146.995469]  __alloc_pages_slowpath+0x820/0xc90
[305146.996456]  ? get_vtime_delta+0x13/0xb0
[305146.997424]  ? sched_clock+0x5/0x10
[305146.998394]  ? del_timer_sync+0x35/0x40
[305146.999370]  __alloc_pages_nodemask+0x1cc/0x210
[305147.000369]  khugepaged_alloc_page+0x39/0x70
[305147.001326]  khugepaged+0xc0c/0x20c0
[305147.002214]  ? remove_wait_queue+0x60/0x60
[305147.003226]  kthread+0xff/0x130
[305147.004219]  ? collapse_shmem+0xba0/0xba0
[305147.005131]  ? kthread_create_on_node+0x40/0x40
[305147.005971]  ret_from_fork+0x35/0x40
[305147.006835] Mem-Info:
[305147.007681] active_anon:51674768 inactive_anon:69112 isolated_anon:21
                 active_file:47818 inactive_file:51708 isolated_file:0
                 unevictable:15710 dirty:187 writeback:0 unstable:0
                 slab_reclaimable:62499 slab_unreclaimable:1284920
                 mapped:66765 shmem:47623 pagetables:185294 bounce:0
                 free:44265934 free_pcp:23646 free_cma:0
[305147.012664] Node 0 active_anon:116919912kB inactive_anon:238824kB
active_file:157296kB inactive_file:112820kB unevictable:58364kB
isolated(anon):80kB isolated(file):0kB mapped:221548kB dirty:548kB
writeback:0kB shmem:153196kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 14430208kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[305147.015437] Node 1 active_anon:89781496kB inactive_anon:37624kB
active_file:33976kB inactive_file:94012kB unevictable:4476kB
isolated(anon):4kB isolated(file):0kB mapped:45512kB dirty:200kB
writeback:0kB shmem:37296kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 9279488kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[305147.018550] Node 0 DMA free:15816kB min:12kB low:24kB high:36kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15988kB managed:15816kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[305147.022085] lowmem_reserve[]: 0 1922 193381 193381 193381
[305147.023208] Node 0 DMA32 free:769020kB min:1964kB low:3944kB
high:5924kB active_anon:1204144kB inactive_anon:0kB active_file:4kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2046368kB
managed:1980800kB mlocked:0kB slab_reclaimable:32kB
slab_unreclaimable:5376kB kernel_stack:0kB pagetables:1560kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:0kB
[305147.026768] lowmem_reserve[]: 0 0 191458 191458 191458
[305147.028044] Node 0 Normal free:71769584kB min:194564kB low:390620kB
high:586676kB active_anon:115715084kB inactive_anon:238824kB
active_file:157292kB inactive_file:112820kB unevictable:58364kB
writepending:548kB present:199229440kB managed:196058772kB
mlocked:58364kB slab_reclaimable:146536kB slab_unreclaimable:2697676kB
kernel_stack:12488kB pagetables:669756kB bounce:0kB free_pcp:42284kB
local_pcp:284kB free_cma:0kB
[305147.033356] lowmem_reserve[]: 0 0 0 0 0
[305147.034754] Node 1 Normal free:104504180kB min:196664kB low:394836kB
high:593008kB active_anon:89783256kB inactive_anon:37624kB
active_file:33976kB inactive_file:94012kB unevictable:4476kB
writepending:200kB present:201326592kB managed:198175320kB
mlocked:4476kB slab_reclaimable:103428kB slab_unreclaimable:2436628kB
kernel_stack:14232kB pagetables:69860kB bounce:0kB free_pcp:51764kB
local_pcp:936kB free_cma:0kB
[305147.040667] lowmem_reserve[]: 0 0 0 0 0
[305147.042180] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 1*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) =
15816kB
[305147.045054] Node 0 DMA32: 6363*4kB (UM) 3686*8kB (UM) 2106*16kB (UM)
1102*32kB (UME) 608*64kB (UME) 304*128kB (UME) 146*256kB (UME) 59*512kB
(UME) 32*1024kB (UME) 4*2048kB (UME) 112*4096kB (M) = 769020kB
[305147.048041] Node 0 Normal: 29891*4kB (UME) 367874*8kB (UME)
978139*16kB (UME) 481963*32kB (UME) 173612*64kB (UME) 65480*128kB (UME)
28019*256kB (UME) 10993*512kB (UME) 5217*1024kB (UM) 0*2048kB 0*4096kB =
71771692kB
[305147.051291] Node 1 Normal: 396333*4kB (UME) 257656*8kB (UME)
276637*16kB (UME) 190234*32kB (ME) 101344*64kB (ME) 39168*128kB (UME)
19207*256kB (UME) 8599*512kB (UME) 3065*1024kB (UM) 2*2048kB (UM)
16206*4096kB (M) = 104501892kB
[305147.054836] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[305147.056555] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[305147.058160] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[305147.059817] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[305147.061429] 149901 total pagecache pages
[305147.063124] 2 pages in swap cache
[305147.064908] Swap cache stats: add 7, delete 5, find 0/0
[305147.066676] Free swap  = 3905020kB
[305147.068268] Total swap = 3905532kB
[305147.069955] 100654597 pages RAM
[305147.071569] 0 pages HighMem/MovableOnly
[305147.073176] 1596920 pages reserved
[305147.074946] 0 pages hwpoisoned
[326258.236694] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
[326258.237723]       Tainted: G                   4.12.0+105-ph #1
[326258.238718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[326258.239679] ksmtuned        D    0  1399      1 0x00080000
[326258.240651] Call Trace:
[326258.241602]  ? __schedule+0x3bc/0x830
[326258.242557]  schedule+0x32/0x80
[326258.243462]  schedule_preempt_disabled+0xa/0x10
[326258.244336]  __mutex_lock.isra.4+0x287/0x4c0
[326258.245205]  ? run_store+0x47/0x2b0
[326258.246064]  run_store+0x47/0x2b0
[326258.246890]  ? __kmalloc+0x157/0x1d0
[326258.247716]  kernfs_fop_write+0x102/0x180
[326258.248514]  __vfs_write+0x26/0x140
[326258.249284]  ? __alloc_fd+0x44/0x170
[326258.250062]  ? set_close_on_exec+0x30/0x60
[326258.250812]  vfs_write+0xb1/0x1e0
[326258.251548]  SyS_write+0x42/0x90
[326258.252237]  do_syscall_64+0x74/0x150
[326258.252920]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[326258.253577] RIP: 0033:0x7fe7cde93730
[326258.254263] RSP: 002b:00007fffab0d5e88 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[326258.254916] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007fe7cde93730
[326258.255564] RDX: 0000000000000002 RSI: 00000000011b1c08 RDI:
0000000000000001
[326258.256187] RBP: 00000000011b1c08 R08: 00007fe7ce153760 R09:
00007fe7ce797b40
[326258.256801] R10: 0000000000000073 R11: 0000000000000246 R12:
0000000000000002
[326258.257406] R13: 0000000000000001 R14: 00007fe7ce152600 R15:
0000000000000002

Greets,
Stefan
Am 07.09.2018 um 15:05 schrieb Michal Hocko:
> From: Michal Hocko <mhocko@suse.com>
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on 4.12 based
> SLES kernel which suggests the same issue:
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80
> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for MADV_HUGEPAGA vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> juggle gfp flags for different allocation modes. The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 
> Be careful when the vma has an explicit numa binding though, because
> __GFP_THISNODE is not playing well with it. We want to follow the
> explicit numa policy rather than enforce a node which happens to be
> local to the cpu we are running on.
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Tested-by: Stefan Priebe <s.priebe@profihost.ag>
> Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> 
> Hi,
> this is a follow up for [1]. Anrea has proposed two approaches to solve
> the regression. This is an alternative implementation of the second
> approach [2]. The reason for an alternative approach is that I strongly
> believe that all the subtle THP gfp manipulation should be at a single
> place (alloc_hugepage_direct_gfpmask) rather than spread in multiple
> places with additional fixup. There is one notable difference to [2]
> and that is defrag=allways behavior where I am preserving the original
> behavior. The reason for that is that defrag=always has always had
> tendency to stall and reclaim and we have addressed that by defining a
> new default defrag mode. We can discuss this behavior later but I
> believe the default mode and a regression noticed by multiple users
> should be closed regardless. Hence this patch.
> 
> [2] http://lkml.kernel.org/r/20180820032640.9896-2-aarcange@redhat.com
> 
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 26 ++++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 21 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..56c9aac4dc86 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,31 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +	mpol_cond_put(pol);
> +#endif
>  
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>  
>  /* Caller must hold page table lock. */
> @@ -715,7 +725,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1300,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -
>  	nmask = policy_nodemask(gfp, pol);
>  	preferred_nid = policy_node(gfp, pol, node);
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-08 18:52 ` Stefan Priebe - Profihost AG
@ 2018-09-10  7:39   ` Michal Hocko
  2018-09-11  9:03   ` Vlastimil Babka
  1 sibling, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-10  7:39 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Andrew Morton, Andrea Arcangeli, David Rientjes, Zi Yan,
	Kirill A. Shutemov, linux-mm, LKML

[Cc Vlastimil. The full report is http://lkml.kernel.org/r/f7ed71c1-d599-5257-fd8f-041eb24d9f29@profihost.ag]
On Sat 08-09-18 20:52:35, Stefan Priebe - Profihost AG wrote:
> [305146.987742] khugepaged: page allocation stalls for 224236ms, order:9,
> mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)

This is certainly not a result of this patch AFAICS. khugepaged does add
__GFP_THISNODE regardless of what alloc_hugepage_khugepaged_gfpmask
thinks about that flag. Something to look into as well I guess.

Anyway, I guess we want to look closer at what compaction is doing here
because such a long stall is really not acceptable at all. Maybe this is
something 4.12 kernel related. This is hard to tell. Unfortunatelly,
upstream has lost the stall warning so you wouldn't know this is the
case with newer kernels.

Anyway, running with compaction tracepoints might tell us more.
Vlastimil will surely help to tell you which of them to enable.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-07 13:05 ` Michal Hocko
  (?)
  (?)
@ 2018-09-10 20:08 ` David Rientjes
  2018-09-10 20:22   ` Stefan Priebe - Profihost AG
                     ` (2 more replies)
  -1 siblings, 3 replies; 40+ messages in thread
From: David Rientjes @ 2018-09-10 20:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko, Stefan Priebe

On Fri, 7 Sep 2018, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on 4.12 based
> SLES kernel which suggests the same issue:
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80

Since we don't have __GFP_REPEAT, this suggests that 
__alloc_pages_direct_compact() took >100s to complete.  The memory 
capacity of the system isn't shown, but I assume it's around 768GB?  This 
should be with COMPACT_PRIO_ASYNC, and MIGRATE_ASYNC compaction certainly 
should abort much earlier.

> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for MADV_HUGEPAGA vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> juggle gfp flags for different allocation modes. The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 

That's not entirely true, the remote access latency for remote thp on all 
of our platforms is greater than local small pages, this is especially 
true for remote thp that is allocated intersocket and must be accessed 
through the interconnect.

Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
allocation latency, but certainly not remote access latency.  There are 
users who remap their text segment onto transparent hugepages are fine 
with startup delay if they are access all of their text from local thp.  
Remote thp would be a significant performance degradation.

When Andrea brought this up, I suggested that the full solution would be a 
MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
benefit is that we could replace the thp "defrag" mode default by setting 
this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
concerned about (1) getting thp when system-wide it is not default and (2) 
additional fault latency when direct compaction is not default.  They are 
not anticipating the degradation of remote access latency, so overloading 
the meaning of the mode is probably not a good idea.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-10 20:08 ` David Rientjes
@ 2018-09-10 20:22   ` Stefan Priebe - Profihost AG
  2018-09-11  8:51   ` Vlastimil Babka
  2018-09-11 11:56   ` Michal Hocko
  2 siblings, 0 replies; 40+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-09-10 20:22 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko

Am 10.09.2018 um 22:08 schrieb David Rientjes:
> On Fri, 7 Sep 2018, Michal Hocko wrote:
> 
>> From: Michal Hocko <mhocko@suse.com>
>>
>> Andrea has noticed [1] that a THP allocation might be really disruptive
>> when allocated on NUMA system with the local node full or hard to
>> reclaim. Stefan has posted an allocation stall report on 4.12 based
>> SLES kernel which suggests the same issue:
>> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
>> [245513.363983] kvm cpuset=/ mems_allowed=0-1
>> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
>> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
>> [245513.365905] Call Trace:
>> [245513.366535]  dump_stack+0x5c/0x84
>> [245513.367148]  warn_alloc+0xe0/0x180
>> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
>> [245513.368406]  ? __slab_free+0xa9/0x2f0
>> [245513.369048]  ? __slab_free+0xa9/0x2f0
>> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
>> [245513.370300]  alloc_pages_vma+0x1e5/0x280
>> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
>> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
>> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
>> [245513.372812]  __handle_mm_fault+0x93d/0x1060
>> [245513.373439]  handle_mm_fault+0xc6/0x1b0
>> [245513.374042]  __do_page_fault+0x230/0x430
>> [245513.374679]  ? get_vtime_delta+0x13/0xb0
>> [245513.375411]  do_page_fault+0x2a/0x70
>> [245513.376145]  ? page_fault+0x65/0x80
>> [245513.376882]  page_fault+0x7b/0x80
> 
> Since we don't have __GFP_REPEAT, this suggests that 
> __alloc_pages_direct_compact() took >100s to complete.  The memory 
> capacity of the system isn't shown, but I assume it's around 768GB?  This 
> should be with COMPACT_PRIO_ASYNC, and MIGRATE_ASYNC compaction certainly 
> should abort much earlier.

Yes it's 768GB.

Greets,
Stefan

>> [245513.382056] Mem-Info:
>> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>>                  active_file:60183 inactive_file:245285 isolated_file:0
>>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>>                  free:32294031 free_pcp:6233 free_cma:0
>> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>>
>> The defrag mode is "madvise" and from the above report it is clear that
>> the THP has been allocated for MADV_HUGEPAGA vma.
>>
>> Andrea has identified that the main source of the problem is
>> __GFP_THISNODE usage:
>>
>> : The problem is that direct compaction combined with the NUMA
>> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
>> : hard the local node, instead of failing the allocation if there's no
>> : THP available in the local node.
>> :
>> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
>> : path even with MPOL_DEFAULT.
>> :
>> : The idea behind the __GFP_THISNODE addition, is that it is better to
>> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
>> : backed memory. That largely depends on the remote latency though, on
>> : threadrippers for example the overhead is relatively low in my
>> : experience.
>> :
>> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
>> : extremely slow qemu startup with vfio, if the VM is larger than the
>> : size of one host NUMA node. This is because it will try very hard to
>> : unsuccessfully swapout get_user_pages pinned pages as result of the
>> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
>> : allocations and instead of trying to allocate THP on other nodes (it
>> : would be even worse without vfio type1 GUP pins of course, except it'd
>> : be swapping heavily instead).
>>
>> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
>> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
>> juggle gfp flags for different allocation modes. The rationale is that
>> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
>> to a different node might be more harmful than the benefit of a large page.
>> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
>> a higher priority than local NUMA placement.
>>
> 
> That's not entirely true, the remote access latency for remote thp on all 
> of our platforms is greater than local small pages, this is especially 
> true for remote thp that is allocated intersocket and must be accessed 
> through the interconnect.
> 
> Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> allocation latency, but certainly not remote access latency.  There are 
> users who remap their text segment onto transparent hugepages are fine 
> with startup delay if they are access all of their text from local thp.  
> Remote thp would be a significant performance degradation.
> 
> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-10 20:08 ` David Rientjes
  2018-09-10 20:22   ` Stefan Priebe - Profihost AG
@ 2018-09-11  8:51   ` Vlastimil Babka
  2018-09-11 11:56   ` Michal Hocko
  2 siblings, 0 replies; 40+ messages in thread
From: Vlastimil Babka @ 2018-09-11  8:51 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko, Stefan Priebe

On 09/10/2018 10:08 PM, David Rientjes wrote:
> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 

Can you elaborate on the semantics of this? You mean that a given vma
could now have two mempolicies, where one would be for hugepages only?
That's likely much more easy to suggest than to implement, with all uapi
consequences...

> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-08 18:52 ` Stefan Priebe - Profihost AG
  2018-09-10  7:39   ` Michal Hocko
@ 2018-09-11  9:03   ` Vlastimil Babka
  1 sibling, 0 replies; 40+ messages in thread
From: Vlastimil Babka @ 2018-09-11  9:03 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Michal Hocko, Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Michal Hocko

On 09/08/2018 08:52 PM, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> whlie using this path i got another stall - which i never saw under
> kernel 4.4. Here is the trace:
> [305111.932698] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
> [305111.933612]       Tainted: G                   4.12.0+105-ph #1
> [305111.934456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [305111.935323] ksmtuned        D    0  1399      1 0x00080000
> [305111.936207] Call Trace:
> [305111.937118]  ? __schedule+0x3bc/0x830
> [305111.937991]  schedule+0x32/0x80
> [305111.938837]  schedule_preempt_disabled+0xa/0x10
> [305111.939687]  __mutex_lock.isra.4+0x287/0x4c0

Hmm so is this waiting on mutex_lock(&ksm_thread_mutex)? Who's holding it?

Looking at your original report [1] I see more tasks waiting on a semaphore:

[245654.463746] INFO: task pvestatd:3175 blocked for more than 120 seconds.
[245654.464730] Tainted: G W 4.12.0+98-ph <a
href="/view.php?id=1" title="[geschlossen] Integration Ramdisk"
class="resolved">0000001</a>
[245654.465666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.466591] pvestatd D 0 3175 1 0x00080000
[245654.467495] Call Trace:
[245654.468413] ? __schedule+0x3bc/0x830
[245654.469283] schedule+0x32/0x80
[245654.470108] rwsem_down_read_failed+0x121/0x170
[245654.470918] ? call_rwsem_down_read_failed+0x14/0x30
[245654.471729] ? alloc_pages_current+0x91/0x140
[245654.472530] call_rwsem_down_read_failed+0x14/0x30
[245654.473316] down_read+0x13/0x30
[245654.474064] proc_pid_cmdline_read+0xae/0x3f0

- probably down_read(&mm->mmap_sem);

[245654.485537] INFO: task service:86643 blocked for more than 120 seconds.
[245654.486015] Tainted: G W 4.12.0+98-ph <a
href="/view.php?id=1" title="[geschlossen] Integration Ramdisk"
class="resolved">0000001</a>
[245654.486502] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.486960] service D 0 86643 1 0x00080000
[245654.487423] Call Trace:
[245654.487865] ? __schedule+0x3bc/0x830
[245654.488286] schedule+0x32/0x80
[245654.488763] rwsem_down_read_failed+0x121/0x170
[245654.489200] ? __handle_mm_fault+0xd67/0x1060
[245654.489668] ? call_rwsem_down_read_failed+0x14/0x30
[245654.490088] call_rwsem_down_read_failed+0x14/0x30
[245654.490538] down_read+0x13/0x30
[245654.490964] __do_page_fault+0x32b/0x430
[245654.491389] ? get_vtime_delta+0x13/0xb0
[245654.491835] do_page_fault+0x2a/0x70
[245654.492253] ? page_fault+0x65/0x80
[245654.492703] page_fault+0x7b/0x80

- page fault, so another mmap_sem for read

[245654.496050] INFO: task safe_timer:86651 blocked for more than 120
seconds.
[245654.496466] Tainted: G W 4.12.0+98-ph <a
href="/view.php?id=1" title="[geschlossen] Integration Ramdisk"
class="resolved">0000001</a>
[245654.496916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.497336] safe_timer D 0 86651 1 0x00080000
[245654.497790] Call Trace:
[245654.498211] ? __schedule+0x3bc/0x830
[245654.498646] schedule+0x32/0x80
[245654.499074] rwsem_down_read_failed+0x121/0x170
[245654.499501] ? call_rwsem_down_read_failed+0x14/0x30
[245654.499959] call_rwsem_down_read_failed+0x14/0x30
[245654.500381] down_read+0x13/0x30
[245654.500851] __do_page_fault+0x32b/0x430
[245654.501285] ? get_vtime_delta+0x13/0xb0
[245654.501722] do_page_fault+0x2a/0x70
[245654.502163] ? page_fault+0x65/0x80
[245654.502600] page_fault+0x7b/0x80

- ditto

But those are different tasks that the one that was stalling allocation
while holding mmap_sem for write.

So I'm not sure what's going on, but maybe the reclaim is also stalled
due to waiting on some locks, and is thus victim of something else?

[1] https://lists.opensuse.org/opensuse-kernel/2018-08/msg00012.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-10 20:08 ` David Rientjes
  2018-09-10 20:22   ` Stefan Priebe - Profihost AG
  2018-09-11  8:51   ` Vlastimil Babka
@ 2018-09-11 11:56   ` Michal Hocko
  2018-09-11 20:30     ` David Rientjes
  2018-09-12 13:54     ` Andrea Arcangeli
  2 siblings, 2 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-11 11:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Mon 10-09-18 13:08:34, David Rientjes wrote:
> On Fri, 7 Sep 2018, Michal Hocko wrote:
[...]
> > Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> > it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> > juggle gfp flags for different allocation modes. The rationale is that
> > __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> > to a different node might be more harmful than the benefit of a large page.
> > If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> > a higher priority than local NUMA placement.
> > 
> 
> That's not entirely true, the remote access latency for remote thp on all 
> of our platforms is greater than local small pages, this is especially 
> true for remote thp that is allocated intersocket and must be accessed 
> through the interconnect.
> 
> Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> allocation latency, but certainly not remote access latency.  There are 
> users who remap their text segment onto transparent hugepages are fine 
> with startup delay if they are access all of their text from local thp.  
> Remote thp would be a significant performance degradation.

Well, it seems that expectations differ for users. It seems that kvm
users do not really agree with your interpretation.

> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.

hugepage specific MPOL flags sounds like yet another step into even more
cluttered API and semantic, I am afraid. Why should this be any
different from regular page allocations? You are getting off-node memory
once your local node is full. You have to use an explicit binding to
disallow that. THP should be similar in that regards. Once you have said
that you _really_ want THP then you are closer to what we do for regular
pages IMHO.

I do realize that this is a gray zone because nobody bothered to define
the semantic since the MADV_HUGEPAGE has been introduced (a826e422420b4
is exceptionaly short of information). So we are left with more or less
undefined behavior and define it properly now. As we can see this might
regress in some workloads but I strongly suspect that an explicit
binding sounds more logical approach than a thp specific mpol mode. If
anything this should be a more generic memory policy basically saying
that a zone/node reclaim mode should be enabled for the particular
allocation.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-11 11:56   ` Michal Hocko
@ 2018-09-11 20:30     ` David Rientjes
  2018-09-12 12:05         ` Michal Hocko
  2018-09-12 13:54     ` Andrea Arcangeli
  1 sibling, 1 reply; 40+ messages in thread
From: David Rientjes @ 2018-09-11 20:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Tue, 11 Sep 2018, Michal Hocko wrote:

> > That's not entirely true, the remote access latency for remote thp on all 
> > of our platforms is greater than local small pages, this is especially 
> > true for remote thp that is allocated intersocket and must be accessed 
> > through the interconnect.
> > 
> > Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> > allocation latency, but certainly not remote access latency.  There are 
> > users who remap their text segment onto transparent hugepages are fine 
> > with startup delay if they are access all of their text from local thp.  
> > Remote thp would be a significant performance degradation.
> 
> Well, it seems that expectations differ for users. It seems that kvm
> users do not really agree with your interpretation.
> 

If kvm is happy to allocate hugepages remotely, at least on a subset of 
platforms where it doesn't incur such a high remote access latency, then 
we probably shouldn't be adding lumping that together with the current 
semantics of MADV_HUGEPAGE.  Otherwise, we risk it becoming a dumping 
ground where current users may regress because they would be much more 
willing to fault local pages of the native page size and lose the ability 
to require that absent using mbind() -- and in that case they would be 
affected by the policy decision of native page sizes as well.

> > When Andrea brought this up, I suggested that the full solution would be a 
> > MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> > benefit is that we could replace the thp "defrag" mode default by setting 
> > this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> > concerned about (1) getting thp when system-wide it is not default and (2) 
> > additional fault latency when direct compaction is not default.  They are 
> > not anticipating the degradation of remote access latency, so overloading 
> > the meaning of the mode is probably not a good idea.
> 
> hugepage specific MPOL flags sounds like yet another step into even more
> cluttered API and semantic, I am afraid. Why should this be any
> different from regular page allocations? You are getting off-node memory
> once your local node is full. You have to use an explicit binding to
> disallow that. THP should be similar in that regards. Once you have said
> that you _really_ want THP then you are closer to what we do for regular
> pages IMHO.
> 

Saying that we really want THP isn't an all-or-nothing decision.  We 
certainly want to try hard to fault hugepages locally especially at task 
startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
very well for that.  Remote hugepages would be a regression that we now 
have no way to avoid because the kernel doesn't provide for it, if we were 
to remove __GFP_THISNODE that this patch introduces.

On Broadwell, for example, we find 7% slower access to remote hugepages 
than local native pages.  On Naples, that becomes worse: 14% slower access 
latency for intrasocket hugepages compared to local native pages and 39% 
slower for intersocket.

> I do realize that this is a gray zone because nobody bothered to define
> the semantic since the MADV_HUGEPAGE has been introduced (a826e422420b4
> is exceptionaly short of information). So we are left with more or less
> undefined behavior and define it properly now. As we can see this might
> regress in some workloads but I strongly suspect that an explicit
> binding sounds more logical approach than a thp specific mpol mode. If
> anything this should be a more generic memory policy basically saying
> that a zone/node reclaim mode should be enabled for the particular
> allocation.

This would be quite a serious regression with no way to actually define 
that we want local hugepages but allow fallback to remote native pages if 
we cannot allocate local native pages.  So rather than causing userspace 
to regress and give them no alternative, I would suggest either hugepage 
specific mempolicies or another madvise mode to allow remotely allocated 
hugepages.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-11 20:30     ` David Rientjes
@ 2018-09-12 12:05         ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-12 12:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Tue 11-09-18 13:30:20, David Rientjes wrote:
> On Tue, 11 Sep 2018, Michal Hocko wrote:
[...]
> > hugepage specific MPOL flags sounds like yet another step into even more
> > cluttered API and semantic, I am afraid. Why should this be any
> > different from regular page allocations? You are getting off-node memory
> > once your local node is full. You have to use an explicit binding to
> > disallow that. THP should be similar in that regards. Once you have said
> > that you _really_ want THP then you are closer to what we do for regular
> > pages IMHO.
> > 
> 
> Saying that we really want THP isn't an all-or-nothing decision.  We 
> certainly want to try hard to fault hugepages locally especially at task 
> startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
> very well for that.  Remote hugepages would be a regression that we now 
> have no way to avoid because the kernel doesn't provide for it, if we were 
> to remove __GFP_THISNODE that this patch introduces.

Why cannot you use mempolicy to bind to local nodes if you really care
about the locality?

> On Broadwell, for example, we find 7% slower access to remote hugepages 
> than local native pages.  On Naples, that becomes worse: 14% slower access 
> latency for intrasocket hugepages compared to local native pages and 39% 
> slower for intersocket.

So, again, how does this compare to regular 4k pages? You are going to
pay for the same remote access as well.

From what you have said so far it sounds like you would like to have
something like the zone/node reclaim mode fine grained for a specific
mapping. If we really want to support something like that then it should
be a generic policy rather than THP specific thing IMHO.

As I've said it is hard to come up with a solution that would satisfy
everybody but considering that the existing reports are seeing this a
regression and cosindering their NUMA requirements are not so strict as
yours I would tend to think that stronger NUMA requirements should be
expressed explicitly rather than implicit effect of a madvise flag. We
do have APIs for that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
@ 2018-09-12 12:05         ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-12 12:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Tue 11-09-18 13:30:20, David Rientjes wrote:
> On Tue, 11 Sep 2018, Michal Hocko wrote:
[...]
> > hugepage specific MPOL flags sounds like yet another step into even more
> > cluttered API and semantic, I am afraid. Why should this be any
> > different from regular page allocations? You are getting off-node memory
> > once your local node is full. You have to use an explicit binding to
> > disallow that. THP should be similar in that regards. Once you have said
> > that you _really_ want THP then you are closer to what we do for regular
> > pages IMHO.
> > 
> 
> Saying that we really want THP isn't an all-or-nothing decision.  We 
> certainly want to try hard to fault hugepages locally especially at task 
> startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
> very well for that.  Remote hugepages would be a regression that we now 
> have no way to avoid because the kernel doesn't provide for it, if we were 
> to remove __GFP_THISNODE that this patch introduces.

Why cannot you use mempolicy to bind to local nodes if you really care
about the locality?

> On Broadwell, for example, we find 7% slower access to remote hugepages 
> than local native pages.  On Naples, that becomes worse: 14% slower access 
> latency for intrasocket hugepages compared to local native pages and 39% 
> slower for intersocket.

So, again, how does this compare to regular 4k pages? You are going to
pay for the same remote access as well.

>From what you have said so far it sounds like you would like to have
something like the zone/node reclaim mode fine grained for a specific
mapping. If we really want to support something like that then it should
be a generic policy rather than THP specific thing IMHO.

As I've said it is hard to come up with a solution that would satisfy
everybody but considering that the existing reports are seeing this a
regression and cosindering their NUMA requirements are not so strict as
yours I would tend to think that stronger NUMA requirements should be
expressed explicitly rather than implicit effect of a madvise flag. We
do have APIs for that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-11 11:56   ` Michal Hocko
  2018-09-11 20:30     ` David Rientjes
@ 2018-09-12 13:54     ` Andrea Arcangeli
  2018-09-12 14:21       ` Michal Hocko
  1 sibling, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2018-09-12 13:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

Hello,

On Tue, Sep 11, 2018 at 01:56:13PM +0200, Michal Hocko wrote:
> Well, it seems that expectations differ for users. It seems that kvm
> users do not really agree with your interpretation.

Like David also mentioned here:

lkml.kernel.org/r/alpine.DEB.2.21.1808211021110.258924@chino.kir.corp.google.com

depends on the hardware what is a win, so there's no one size fits
all.

For two sockets providing remote THP to KVM is likely a win, but
changing the defaults depending on boot-time NUMA topology makes
things less deterministic and it's also impossible to define an exact
break even point.

> I do realize that this is a gray zone because nobody bothered to define
> the semantic since the MADV_HUGEPAGE has been introduced (a826e422420b4
> is exceptionaly short of information). So we are left with more or less
> undefined behavior and define it properly now. As we can see this might
> regress in some workloads but I strongly suspect that an explicit
> binding sounds more logical approach than a thp specific mpol mode. If
> anything this should be a more generic memory policy basically saying
> that a zone/node reclaim mode should be enabled for the particular
> allocation.

MADV_HUGEPAGE means the allocation is long lived, so the cost of
compaction is worth it in direct reclaim. Not much else. That is not
the problem.

The problem is that even if you ignore the breakage and regression to
real life workloads, what is happening right now obviously would
require root privilege but MADV_HUEGPAGE requires no root privilege.

Swapping heavy because MADV_HUGEPAGE when there are gigabytes free on
other nodes and not even 4k would be swapped-out with THP turned off
in sysfs, is simply not possibly what MADV_HUGEPAGE could have been
about, and it's a kernel regression that never existed until that
commit that added __GFP_THISNODE to the default THP heuristic in
mempolicy.

I think we should defer the problem of what is better between 4k NUMA
local or remote THP by default for later, I provided two options
myself because it didn't matter so much which option we picked in the
short term, as long as the bug was fixed.

I wasn't particularly happy about your patch because it still swaps
with certain defrag settings which is still allowing things that
shouldn't happen without some kind of privileged capability.

If you can update the patch to prevent swapping in all cases it's a go
as far as I'm concerned. The main difference is that you're dropping
the THP logic in the mempolicy code which will make it worse for some
case and I was trying to retain it for all cases where swapping
wouldn't happen anyway and such logic would have still provided the
behavior David prefers to those cases.

Adding the new feature to create a THP specific mempolicy can be done
later. In the meanwhile the current mempolicy code can always override
whatever THP default behavior that gets out of this, just it will
require the admin to setup a mempolicy to enforce the preferred
behavior to 4k and THP allocations alike.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-12 13:54     ` Andrea Arcangeli
@ 2018-09-12 14:21       ` Michal Hocko
  2018-09-12 15:25         ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2018-09-12 14:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Rientjes, Andrew Morton, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Wed 12-09-18 09:54:17, Andrea Arcangeli wrote:
> Hello,
> 
> On Tue, Sep 11, 2018 at 01:56:13PM +0200, Michal Hocko wrote:
> > Well, it seems that expectations differ for users. It seems that kvm
> > users do not really agree with your interpretation.
> 
> Like David also mentioned here:
> 
> lkml.kernel.org/r/alpine.DEB.2.21.1808211021110.258924@chino.kir.corp.google.com
> 
> depends on the hardware what is a win, so there's no one size fits
> all.
> 
> For two sockets providing remote THP to KVM is likely a win, but
> changing the defaults depending on boot-time NUMA topology makes
> things less deterministic and it's also impossible to define an exact
> break even point.
> 
> > I do realize that this is a gray zone because nobody bothered to define
> > the semantic since the MADV_HUGEPAGE has been introduced (a826e422420b4
> > is exceptionaly short of information). So we are left with more or less
> > undefined behavior and define it properly now. As we can see this might
> > regress in some workloads but I strongly suspect that an explicit
> > binding sounds more logical approach than a thp specific mpol mode. If
> > anything this should be a more generic memory policy basically saying
> > that a zone/node reclaim mode should be enabled for the particular
> > allocation.
> 
> MADV_HUGEPAGE means the allocation is long lived, so the cost of
> compaction is worth it in direct reclaim. Not much else. That is not
> the problem.

It seems there is no general agreement here. My understanding is that
this means that the user really prefers THP for whatever reasons.

> The problem is that even if you ignore the breakage and regression to
> real life workloads, what is happening right now obviously would
> require root privilege but MADV_HUEGPAGE requires no root privilege.

I do not follow.

> Swapping heavy because MADV_HUGEPAGE when there are gigabytes free on
> other nodes and not even 4k would be swapped-out with THP turned off
> in sysfs, is simply not possibly what MADV_HUGEPAGE could have been
> about, and it's a kernel regression that never existed until that
> commit that added __GFP_THISNODE to the default THP heuristic in
> mempolicy.

agreed

> I think we should defer the problem of what is better between 4k NUMA
> local or remote THP by default for later, I provided two options
> myself because it didn't matter so much which option we picked in the
> short term, as long as the bug was fixed.
> 
> I wasn't particularly happy about your patch because it still swaps
> with certain defrag settings which is still allowing things that
> shouldn't happen without some kind of privileged capability.

Well, I am not really sure about defrag=always. I would rather care
about the default behavior to plug the regression first. And think about
`always' mode on top. Or is this a no-go from your POV?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-12 14:21       ` Michal Hocko
@ 2018-09-12 15:25         ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-12 15:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Rientjes, Andrew Morton, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Wed 12-09-18 16:21:26, Michal Hocko wrote:
> On Wed 12-09-18 09:54:17, Andrea Arcangeli wrote:
[...]
> > I wasn't particularly happy about your patch because it still swaps
> > with certain defrag settings which is still allowing things that
> > shouldn't happen without some kind of privileged capability.
> 
> Well, I am not really sure about defrag=always. I would rather care
> about the default behavior to plug the regression first. And think about
> `always' mode on top. Or is this a no-go from your POV?

In other words the following on top
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 56c9aac4dc86..723e8d77c5ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,7 +644,7 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, un
 #endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-12 12:05         ` Michal Hocko
  (?)
@ 2018-09-12 20:40         ` David Rientjes
  -1 siblings, 0 replies; 40+ messages in thread
From: David Rientjes @ 2018-09-12 20:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Zi Yan, Kirill A. Shutemov,
	linux-mm, LKML, Stefan Priebe

On Wed, 12 Sep 2018, Michal Hocko wrote:

> > Saying that we really want THP isn't an all-or-nothing decision.  We 
> > certainly want to try hard to fault hugepages locally especially at task 
> > startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
> > very well for that.  Remote hugepages would be a regression that we now 
> > have no way to avoid because the kernel doesn't provide for it, if we were 
> > to remove __GFP_THISNODE that this patch introduces.
> 
> Why cannot you use mempolicy to bind to local nodes if you really care
> about the locality?
> 

Because we do not want to oom kill, we want to fallback first to local 
native pages and then to remote native pages.  That's the order of least 
to greatest latency, we do not want to work hard to allocate a remote 
hugepage when a local native page is faster.  This seems pretty straight 
forward.

> From what you have said so far it sounds like you would like to have
> something like the zone/node reclaim mode fine grained for a specific
> mapping. If we really want to support something like that then it should
> be a generic policy rather than THP specific thing IMHO.
> 
> As I've said it is hard to come up with a solution that would satisfy
> everybody but considering that the existing reports are seeing this a
> regression and cosindering their NUMA requirements are not so strict as
> yours I would tend to think that stronger NUMA requirements should be
> expressed explicitly rather than implicit effect of a madvise flag. We
> do have APIs for that.

Every process on every platform we have would need to define this explicit 
mempolicy for users of libraries that remap text segments because changing 
the allocation behavior of thp out from under them would cause very 
noticeable performance regressions.  I don't know of any platform where 
remote hugepages is preferred over local native pages.  If they exist, it 
sounds resaonable to introduce a stronger variant of MADV_HUGEPAGE that 
defines exactly what you want rather than causing it to become a dumping 
ground and userspace regressions.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-17  7:04                         ` Stefan Priebe - Profihost AG
  2018-09-17  9:32                           ` Stefan Priebe - Profihost AG
@ 2018-09-17 11:27                           ` Michal Hocko
  1 sibling, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-17 11:27 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Mel Gorman, Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka

On Mon 17-09-18 09:04:02, Stefan Priebe - Profihost AG wrote:
> Hi,
> 
> i had multiple memory stalls this weekend again. All kvm processes where
> spinning trying to get > 100% CPU and i was not able to even login to
> ssh. After 5-10 minutes i was able to login.
> 
> There were about 150GB free mem on the host.
> 
> Relevant settings (no local storage involved):
>         vm.dirty_background_ratio:
>             3
>         vm.dirty_ratio:
>             10
>         vm.min_free_kbytes:
>             10567004
> 
> # cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer [defer+madvise] madvise never
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> 
> After that i had the following traces on the host node:
> https://pastebin.com/raw/0VhyQmAv

I would suggest reporting this in a new email thread. I would also
recommend to CC kvm guys (see MAINTAINERS file in the kernel source
tree) and trace qemu/kvm processes to see what they are doing at the
time when you see the stall.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-17  7:04                         ` Stefan Priebe - Profihost AG
@ 2018-09-17  9:32                           ` Stefan Priebe - Profihost AG
  2018-09-17 11:27                           ` Michal Hocko
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-09-17  9:32 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka

May be amissing piece:
vm.overcommit_memory=0

Greets,
Stefan

Am 17.09.2018 um 09:04 schrieb Stefan Priebe - Profihost AG:
> Hi,
> 
> i had multiple memory stalls this weekend again. All kvm processes where
> spinning trying to get > 100% CPU and i was not able to even login to
> ssh. After 5-10 minutes i was able to login.
> 
> There were about 150GB free mem on the host.
> 
> Relevant settings (no local storage involved):
>         vm.dirty_background_ratio:
>             3
>         vm.dirty_ratio:
>             10
>         vm.min_free_kbytes:
>             10567004
> 
> # cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer [defer+madvise] madvise never
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> 
> After that i had the following traces on the host node:
> https://pastebin.com/raw/0VhyQmAv
> 
> Thanks!
> 
> Greets,
> Stefan
> 
> 
> Am 17.09.2018 um 08:11 schrieb Michal Hocko:
>> [sorry I've missed your reply]
>>
>> On Wed 12-09-18 18:29:25, Mel Gorman wrote:
>>> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
>> [...]
>>> I recognise that this fix means that users that expect zone_reclaim_mode==1
>>> type behaviour may get burned but the users that benefit from that should
>>> also be users that benefit from sizing their workload to a node. They should
>>> be able to replicate that with mempolicies or at least use prepation scripts
>>> to clear memory on a target node (e.g. membind a memhog to the desired size,
>>> exit and then start the target workload).
>>
>> As I've said in other email. We probably want to add a new mempolicy
>> which has zone_reclaim_mode-like semantic.
>>
>> [...]
>>
>>>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>>>> index 5228c62af416..bac395f1d00a 100644
>>>> --- a/include/linux/mempolicy.h
>>>> +++ b/include/linux/mempolicy.h
>>>> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>>>>  struct mempolicy *get_task_policy(struct task_struct *p);
>>>>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>>  		unsigned long addr);
>>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>> +						unsigned long addr);
>>>>  bool vma_policy_mof(struct vm_area_struct *vma);
>>>>  
>>>>  extern void numa_default_policy(void);
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index c3bc7e9c9a2a..94472bf9a31b 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>>   *	    available
>>>>   * never: never stall for any thp allocation
>>>>   */
>>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>>  {
>>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>>> +	gfp_t this_node = 0;
>>>> +	struct mempolicy *pol;
>>>> +
>>>> +#ifdef CONFIG_NUMA
>>>> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
>>>> +	pol = get_vma_policy(vma, addr);
>>>> +	if (pol->mode != MPOL_BIND)
>>>> +		this_node = __GFP_THISNODE;
>>>> +#endif
>>>>  
>>>
>>> Where is the mpol_cond_put? Historically it might not have mattered
>>> because THP could not be used with a shared possibility but it probably
>>> matters now that tmpfs can be backed by THP.
>>
>> http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz
>>
>>> The comment needs more expansion as well. Arguably it only makes sense in
>>> the event we are explicitly bound to one node because if we are bound to
>>> two nodes without interleaving then why not fall back? The answer to that
>>> is outside the scope of the patch but the comment as-is will cause head
>>> scratches in a years time.
>>
>> Do you have any specific wording in mind? I have a bit hard time to come
>> up with something more precise and do not go into details too much.
>>  
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>>>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>>>> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>>>> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>>>> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>>> -							     __GFP_KSWAPD_RECLAIM);
>>>> +							     __GFP_KSWAPD_RECLAIM | this_node);
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>>> -							     0);
>>>> -	return GFP_TRANSHUGE_LIGHT;
>>>> +							     this_node);
>>>> +	return GFP_TRANSHUGE_LIGHT | this_node;
>>>>  }
>>>>  
>>>>  /* Caller must hold page table lock. */
>>>> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>>>  			pte_free(vma->vm_mm, pgtable);
>>>>  		return ret;
>>>>  	}
>>>> -	gfp = alloc_hugepage_direct_gfpmask(vma);
>>>> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>>  	if (unlikely(!page)) {
>>>>  		count_vm_event(THP_FAULT_FALLBACK);
>>>> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>>>>  alloc:
>>>>  	if (transparent_hugepage_enabled(vma) &&
>>>>  	    !transparent_hugepage_debug_cow()) {
>>>> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
>>>> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>>  	} else
>>>>  		new_page = NULL;
>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>> index da858f794eb6..75bbfc3d6233 100644
>>>> --- a/mm/mempolicy.c
>>>> +++ b/mm/mempolicy.c
>>>> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>>   * freeing by another task.  It is the caller's responsibility to free the
>>>>   * extra reference for shared policies.
>>>>   */
>>>> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>>  						unsigned long addr)
>>>>  {
>>>>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
>>>> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>>  		goto out;
>>>>  	}
>>>>  
>>>> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
>>>> -		int hpage_node = node;
>>>> -
>>>> -		/*
>>>> -		 * For hugepage allocation and non-interleave policy which
>>>> -		 * allows the current node (or other explicitly preferred
>>>> -		 * node) we only try to allocate from the current/preferred
>>>> -		 * node and don't fall back to other nodes, as the cost of
>>>> -		 * remote accesses would likely offset THP benefits.
>>>> -		 *
>>>> -		 * If the policy is interleave, or does not allow the current
>>>> -		 * node in its nodemask, we allocate the standard way.
>>>> -		 */
>>>> -		if (pol->mode == MPOL_PREFERRED &&
>>>> -						!(pol->flags & MPOL_F_LOCAL))
>>>> -			hpage_node = pol->v.preferred_node;
>>>> -
>>>> -		nmask = policy_nodemask(gfp, pol);
>>>> -		if (!nmask || node_isset(hpage_node, *nmask)) {
>>>> -			mpol_cond_put(pol);
>>>> -			page = __alloc_pages_node(hpage_node,
>>>> -						gfp | __GFP_THISNODE, order);
>>>> -			goto out;
>>>> -		}
>>>> -	}
>>>> -
>>>
>>> The hugepage flag passed into this function is now redundant and that
>>> means that callers of alloc_hugepage_vma need to move back to using
>>> alloc_pages_vma() directly and remove the API entirely. This block of
>>> code is about both GFP flag settings and node selection but at a glance I
>>> cannot see the point of it because it's very similar to the base page code.
>>> The whole point may be to get around the warning in policy_node and that
>>> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
>>> as you do already in this patch. There should be no reason why THP has a
>>> different policy than a base page within a single VMA.
>>
>> OK, I can follow up with a cleanup patch once we settle down with this
>> approach to fix the issue.
>>
>> Thanks!
>>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-17  6:11                       ` Michal Hocko
@ 2018-09-17  7:04                         ` Stefan Priebe - Profihost AG
  2018-09-17  9:32                           ` Stefan Priebe - Profihost AG
  2018-09-17 11:27                           ` Michal Hocko
  0 siblings, 2 replies; 40+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-09-17  7:04 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka

Hi,

i had multiple memory stalls this weekend again. All kvm processes where
spinning trying to get > 100% CPU and i was not able to even login to
ssh. After 5-10 minutes i was able to login.

There were about 150GB free mem on the host.

Relevant settings (no local storage involved):
        vm.dirty_background_ratio:
            3
        vm.dirty_ratio:
            10
        vm.min_free_kbytes:
            10567004

# cat /sys/kernel/mm/transparent_hugepage/defrag
always defer [defer+madvise] madvise never

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

After that i had the following traces on the host node:
https://pastebin.com/raw/0VhyQmAv

Thanks!

Greets,
Stefan


Am 17.09.2018 um 08:11 schrieb Michal Hocko:
> [sorry I've missed your reply]
> 
> On Wed 12-09-18 18:29:25, Mel Gorman wrote:
>> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
> [...]
>> I recognise that this fix means that users that expect zone_reclaim_mode==1
>> type behaviour may get burned but the users that benefit from that should
>> also be users that benefit from sizing their workload to a node. They should
>> be able to replicate that with mempolicies or at least use prepation scripts
>> to clear memory on a target node (e.g. membind a memhog to the desired size,
>> exit and then start the target workload).
> 
> As I've said in other email. We probably want to add a new mempolicy
> which has zone_reclaim_mode-like semantic.
> 
> [...]
> 
>>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>>> index 5228c62af416..bac395f1d00a 100644
>>> --- a/include/linux/mempolicy.h
>>> +++ b/include/linux/mempolicy.h
>>> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>>>  struct mempolicy *get_task_policy(struct task_struct *p);
>>>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>  		unsigned long addr);
>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>> +						unsigned long addr);
>>>  bool vma_policy_mof(struct vm_area_struct *vma);
>>>  
>>>  extern void numa_default_policy(void);
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c3bc7e9c9a2a..94472bf9a31b 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>   *	    available
>>>   * never: never stall for any thp allocation
>>>   */
>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>> +	gfp_t this_node = 0;
>>> +	struct mempolicy *pol;
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
>>> +	pol = get_vma_policy(vma, addr);
>>> +	if (pol->mode != MPOL_BIND)
>>> +		this_node = __GFP_THISNODE;
>>> +#endif
>>>  
>>
>> Where is the mpol_cond_put? Historically it might not have mattered
>> because THP could not be used with a shared possibility but it probably
>> matters now that tmpfs can be backed by THP.
> 
> http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz
> 
>> The comment needs more expansion as well. Arguably it only makes sense in
>> the event we are explicitly bound to one node because if we are bound to
>> two nodes without interleaving then why not fall back? The answer to that
>> is outside the scope of the patch but the comment as-is will cause head
>> scratches in a years time.
> 
> Do you have any specific wording in mind? I have a bit hard time to come
> up with something more precise and do not go into details too much.
>  
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>>> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>>> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>>> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>> -							     __GFP_KSWAPD_RECLAIM);
>>> +							     __GFP_KSWAPD_RECLAIM | this_node);
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>> -							     0);
>>> -	return GFP_TRANSHUGE_LIGHT;
>>> +							     this_node);
>>> +	return GFP_TRANSHUGE_LIGHT | this_node;
>>>  }
>>>  
>>>  /* Caller must hold page table lock. */
>>> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>>  			pte_free(vma->vm_mm, pgtable);
>>>  		return ret;
>>>  	}
>>> -	gfp = alloc_hugepage_direct_gfpmask(vma);
>>> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>  	if (unlikely(!page)) {
>>>  		count_vm_event(THP_FAULT_FALLBACK);
>>> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>>>  alloc:
>>>  	if (transparent_hugepage_enabled(vma) &&
>>>  	    !transparent_hugepage_debug_cow()) {
>>> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
>>> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>  	} else
>>>  		new_page = NULL;
>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>> index da858f794eb6..75bbfc3d6233 100644
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>   * freeing by another task.  It is the caller's responsibility to free the
>>>   * extra reference for shared policies.
>>>   */
>>> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>  						unsigned long addr)
>>>  {
>>>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
>>> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>  		goto out;
>>>  	}
>>>  
>>> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
>>> -		int hpage_node = node;
>>> -
>>> -		/*
>>> -		 * For hugepage allocation and non-interleave policy which
>>> -		 * allows the current node (or other explicitly preferred
>>> -		 * node) we only try to allocate from the current/preferred
>>> -		 * node and don't fall back to other nodes, as the cost of
>>> -		 * remote accesses would likely offset THP benefits.
>>> -		 *
>>> -		 * If the policy is interleave, or does not allow the current
>>> -		 * node in its nodemask, we allocate the standard way.
>>> -		 */
>>> -		if (pol->mode == MPOL_PREFERRED &&
>>> -						!(pol->flags & MPOL_F_LOCAL))
>>> -			hpage_node = pol->v.preferred_node;
>>> -
>>> -		nmask = policy_nodemask(gfp, pol);
>>> -		if (!nmask || node_isset(hpage_node, *nmask)) {
>>> -			mpol_cond_put(pol);
>>> -			page = __alloc_pages_node(hpage_node,
>>> -						gfp | __GFP_THISNODE, order);
>>> -			goto out;
>>> -		}
>>> -	}
>>> -
>>
>> The hugepage flag passed into this function is now redundant and that
>> means that callers of alloc_hugepage_vma need to move back to using
>> alloc_pages_vma() directly and remove the API entirely. This block of
>> code is about both GFP flag settings and node selection but at a glance I
>> cannot see the point of it because it's very similar to the base page code.
>> The whole point may be to get around the warning in policy_node and that
>> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
>> as you do already in this patch. There should be no reason why THP has a
>> different policy than a base page within a single VMA.
> 
> OK, I can follow up with a cleanup patch once we settle down with this
> approach to fix the issue.
> 
> Thanks!
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-12 17:29                     ` Mel Gorman
@ 2018-09-17  6:11                       ` Michal Hocko
  2018-09-17  7:04                         ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2018-09-17  6:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka,
	Stefan Priebe - Profihost AG

[sorry I've missed your reply]

On Wed 12-09-18 18:29:25, Mel Gorman wrote:
> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
[...]
> I recognise that this fix means that users that expect zone_reclaim_mode==1
> type behaviour may get burned but the users that benefit from that should
> also be users that benefit from sizing their workload to a node. They should
> be able to replicate that with mempolicies or at least use prepation scripts
> to clear memory on a target node (e.g. membind a memhog to the desired size,
> exit and then start the target workload).

As I've said in other email. We probably want to add a new mempolicy
which has zone_reclaim_mode-like semantic.

[...]

> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> > index 5228c62af416..bac395f1d00a 100644
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
> >  struct mempolicy *get_task_policy(struct task_struct *p);
> >  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >  		unsigned long addr);
> > +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> > +						unsigned long addr);
> >  bool vma_policy_mof(struct vm_area_struct *vma);
> >  
> >  extern void numa_default_policy(void);
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c3bc7e9c9a2a..94472bf9a31b 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >   *	    available
> >   * never: never stall for any thp allocation
> >   */
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > +	gfp_t this_node = 0;
> > +	struct mempolicy *pol;
> > +
> > +#ifdef CONFIG_NUMA
> > +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> > +	pol = get_vma_policy(vma, addr);
> > +	if (pol->mode != MPOL_BIND)
> > +		this_node = __GFP_THISNODE;
> > +#endif
> >  
> 
> Where is the mpol_cond_put? Historically it might not have mattered
> because THP could not be used with a shared possibility but it probably
> matters now that tmpfs can be backed by THP.

http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz

> The comment needs more expansion as well. Arguably it only makes sense in
> the event we are explicitly bound to one node because if we are bound to
> two nodes without interleaving then why not fall back? The answer to that
> is outside the scope of the patch but the comment as-is will cause head
> scratches in a years time.

Do you have any specific wording in mind? I have a bit hard time to come
up with something more precise and do not go into details too much.
 
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> > -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> > +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> > -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> > +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> >  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> > -							     __GFP_KSWAPD_RECLAIM);
> > +							     __GFP_KSWAPD_RECLAIM | this_node);
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> >  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> > -							     0);
> > -	return GFP_TRANSHUGE_LIGHT;
> > +							     this_node);
> > +	return GFP_TRANSHUGE_LIGHT | this_node;
> >  }
> >  
> >  /* Caller must hold page table lock. */
> > @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >  			pte_free(vma->vm_mm, pgtable);
> >  		return ret;
> >  	}
> > -	gfp = alloc_hugepage_direct_gfpmask(vma);
> > +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
> >  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
> >  	if (unlikely(!page)) {
> >  		count_vm_event(THP_FAULT_FALLBACK);
> > @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
> >  alloc:
> >  	if (transparent_hugepage_enabled(vma) &&
> >  	    !transparent_hugepage_debug_cow()) {
> > -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> > +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
> >  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
> >  	} else
> >  		new_page = NULL;
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da858f794eb6..75bbfc3d6233 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >   * freeing by another task.  It is the caller's responsibility to free the
> >   * extra reference for shared policies.
> >   */
> > -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> > +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >  						unsigned long addr)
> >  {
> >  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> > @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  		goto out;
> >  	}
> >  
> > -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> > -		int hpage_node = node;
> > -
> > -		/*
> > -		 * For hugepage allocation and non-interleave policy which
> > -		 * allows the current node (or other explicitly preferred
> > -		 * node) we only try to allocate from the current/preferred
> > -		 * node and don't fall back to other nodes, as the cost of
> > -		 * remote accesses would likely offset THP benefits.
> > -		 *
> > -		 * If the policy is interleave, or does not allow the current
> > -		 * node in its nodemask, we allocate the standard way.
> > -		 */
> > -		if (pol->mode == MPOL_PREFERRED &&
> > -						!(pol->flags & MPOL_F_LOCAL))
> > -			hpage_node = pol->v.preferred_node;
> > -
> > -		nmask = policy_nodemask(gfp, pol);
> > -		if (!nmask || node_isset(hpage_node, *nmask)) {
> > -			mpol_cond_put(pol);
> > -			page = __alloc_pages_node(hpage_node,
> > -						gfp | __GFP_THISNODE, order);
> > -			goto out;
> > -		}
> > -	}
> > -
> 
> The hugepage flag passed into this function is now redundant and that
> means that callers of alloc_hugepage_vma need to move back to using
> alloc_pages_vma() directly and remove the API entirely. This block of
> code is about both GFP flag settings and node selection but at a glance I
> cannot see the point of it because it's very similar to the base page code.
> The whole point may be to get around the warning in policy_node and that
> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
> as you do already in this patch. There should be no reason why THP has a
> different policy than a base page within a single VMA.

OK, I can follow up with a cleanup patch once we settle down with this
approach to fix the issue.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 19:24                   ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
  2018-08-29 22:54                     ` Zi Yan
  2018-08-30  6:47                     ` Michal Hocko
@ 2018-09-12 17:29                     ` Mel Gorman
  2018-09-17  6:11                       ` Michal Hocko
  2 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2018-09-12 17:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka,
	Stefan Priebe - Profihost AG

On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
> From 4dc2f772756e6f91b9e64d1a3e2df4dca3475f5b Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 28 Aug 2018 09:59:19 +0200
> Subject: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on 4.12 based
> SLES kernel which suggests the same issue:

Note that this behaviour is unhelpful but it is not against the defined
semantic of the "madvise" defrag option.

> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> juggle gfp flags for different allocation modes.

For the short term, I think you might be better off simply avoiding the
combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM and declaring
that the fix. That would be easier for -stable and the layering can be
dealt with as a cleanup.

I recognise that this fix means that users that expect zone_reclaim_mode==1
type behaviour may get burned but the users that benefit from that should
also be users that benefit from sizing their workload to a node. They should
be able to replicate that with mempolicies or at least use prepation scripts
to clear memory on a target node (e.g. membind a memhog to the desired size,
exit and then start the target workload).

I think this is a more appropriate solution than prematurely introducing a
GFP flag as it's not guaranteed that a user is willing to pay a compaction
penalty until it fails. That should be decided separately when the immediate
problem is resolved.

That said, I do think that sorting out where GFP flags are set for THP
should be done in the context of THP code and not alloc_pages_vma. The
current layering is a bit odd.

> The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 
> Be careful when the vma has an explicit numa binding though, because
> __GFP_THISNODE is not playing well with it. We want to follow the
> explicit numa policy rather than enforce a node which happens to be
> local to the cpu we are running on.
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 25 +++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 20 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..94472bf9a31b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +#endif
>  

Where is the mpol_cond_put? Historically it might not have mattered
because THP could not be used with a shared possibility but it probably
matters now that tmpfs can be backed by THP.

The comment needs more expansion as well. Arguably it only makes sense in
the event we are explicitly bound to one node because if we are bound to
two nodes without interleaving then why not fall back? The answer to that
is outside the scope of the patch but the comment as-is will cause head
scratches in a years time.

>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>  
>  /* Caller must hold page table lock. */
> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -

The hugepage flag passed into this function is now redundant and that
means that callers of alloc_hugepage_vma need to move back to using
alloc_pages_vma() directly and remove the API entirely. This block of
code is about both GFP flag settings and node selection but at a glance I
cannot see the point of it because it's very similar to the base page code.
The whole point may be to get around the warning in policy_node and that
could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
as you do already in this patch. There should be no reason why THP has a
different policy than a base page within a single VMA.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 11:25                                         ` Michal Hocko
@ 2018-09-06 12:35                                           ` Zi Yan
  0 siblings, 0 replies; 40+ messages in thread
From: Zi Yan @ 2018-09-06 12:35 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 2281 bytes --]

On 6 Sep 2018, at 7:25, Michal Hocko wrote:

> On Thu 06-09-18 13:16:00, Vlastimil Babka wrote:
>> On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
>>>> We can and should think about this much more but I would like to have
>>>> this regression closed. So can we address GFP_THISNODE part first and
>>>> build more complex solution on top?
>>>>
>>>> Is there any objection to my patch which does the similar thing to your
>>>> patch v2 in a different location?
>>>
>>> Similar but not the same. It fixes the madvise case, but I wonder about
>>> the no-madvise defrag=defer case, where Zi Yan reports it still causes
>>> swapping.
>>
>> Ah, but that should be the same with Andrea's variant 2) patch. There
>> should only be difference with defrag=always, which is direct reclaim
>> with __GFP_NORETRY, Andrea's patch would drop __GFP_THISNODE and your
>> not. Maybe Zi Yan can do the same kind of tests with Andrea's patch [1]
>> to confirm?
>
> Yes, that is the only difference and that is why I've said those patches
> are mostly similar. I do not want to touch defrag=always case because
> this one has always been stall prone and we have replaced it as a
> default just because of that. We should discuss what should be done with
> that case separately IMHO.

Vlastimil, my test using Andrea’s patch confirms your statement.
My test result of Andrea’s patch shows that it gives the same outcomes as
Michal’s patch except that when no madvise is used, THP is on by default
+ defrag = {always}, instead of swapping pages to disk, Adndrea’s patch
causes no swapping and THPs are allocated in the fallback node.

As I said before, the fundamental issue that causes swapping pages to disk
when allocating THPs in a filled node is __GFP_THISNODE removes all fallback
zone/node options, thus,  __GFP_KSWAPD_RECLAIM or __GFP_DIRECT_RECLAIM
can only swap pages out to satisfy the THP allocation request.

__GFP_THISNODE can be seen as a kernel-version MPOL_BIND policy, which
overwrites any user space memory policy and should be removed or limited
to kernel-only page allocations. But, as Michal said, we could discuss
this further but do not make this discussion on the critical path of merging
the patch.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 516 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 11:18                       ` Vlastimil Babka
@ 2018-09-06 11:27                         ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2018-09-06 11:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Zi Yan, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Stefan Priebe - Profihost AG

On Thu 06-09-18 13:18:52, Vlastimil Babka wrote:
> On 08/30/2018 08:47 AM, Michal Hocko wrote:
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > +	gfp_t this_node = 0;
> > +	struct mempolicy *pol;
> > +
> > +#ifdef CONFIG_NUMA
> > +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> > +	pol = get_vma_policy(vma, addr);
> > +	if (pol->mode != MPOL_BIND)
> > +		this_node = __GFP_THISNODE;
> > +	mpol_cond_put(pol);
> 
> The code is better without the hack in alloc_pages_vma() but I'm not
> thrilled about getting vma policy here and then immediately again in
> alloc_pages_vma(). But if it can't be helped...

The whole function is an act of beauty isn't it. I wanted to get the
policy from the caller but that would be even more messy so I've tried
to keep it in the ugly corner and have it hidden there. You should ask
your friends to read alloc_hugepage_direct_gfpmask unless they have done
something terribly wrong.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 11:16                                       ` Vlastimil Babka
@ 2018-09-06 11:25                                         ` Michal Hocko
  2018-09-06 12:35                                           ` Zi Yan
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2018-09-06 11:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrea Arcangeli, Zi Yan, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Stefan Priebe - Profihost AG

On Thu 06-09-18 13:16:00, Vlastimil Babka wrote:
> On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
> >> We can and should think about this much more but I would like to have
> >> this regression closed. So can we address GFP_THISNODE part first and
> >> build more complex solution on top?
> >>
> >> Is there any objection to my patch which does the similar thing to your
> >> patch v2 in a different location?
> > 
> > Similar but not the same. It fixes the madvise case, but I wonder about
> > the no-madvise defrag=defer case, where Zi Yan reports it still causes
> > swapping.
> 
> Ah, but that should be the same with Andrea's variant 2) patch. There
> should only be difference with defrag=always, which is direct reclaim
> with __GFP_NORETRY, Andrea's patch would drop __GFP_THISNODE and your
> not. Maybe Zi Yan can do the same kind of tests with Andrea's patch [1]
> to confirm?

Yes, that is the only difference and that is why I've said those patches
are mostly similar. I do not want to touch defrag=always case because
this one has always been stall prone and we have replaced it as a
default just because of that. We should discuss what should be done with
that case separately IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30  6:47                     ` Michal Hocko
@ 2018-09-06 11:18                       ` Vlastimil Babka
  2018-09-06 11:27                         ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Vlastimil Babka @ 2018-09-06 11:18 UTC (permalink / raw)
  To: Michal Hocko, Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Stefan Priebe - Profihost AG

On 08/30/2018 08:47 AM, Michal Hocko wrote:
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +	mpol_cond_put(pol);

The code is better without the hack in alloc_pages_vma() but I'm not
thrilled about getting vma policy here and then immediately again in
alloc_pages_vma(). But if it can't be helped...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 10:59                       ` Vlastimil Babka
@ 2018-09-06 11:17                         ` Zi Yan
  0 siblings, 0 replies; 40+ messages in thread
From: Zi Yan @ 2018-09-06 11:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 1172 bytes --]

On 6 Sep 2018, at 6:59, Vlastimil Babka wrote:

> On 08/30/2018 12:54 AM, Zi Yan wrote:
>>
>> Thanks for your patch.
>>
>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>
>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.
>
> Hmm this is GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE.
> No direct reclaim, so it would have to be kswapd causing the swapping? I
> wouldn't expect it to be significant and over-reclaiming. What exactly
> is your definition of "pages got swapped"?

About 4GB pages are swapped to the disk (my swap disk size is 4.7GB).
My machine has 128GB memory in each node and memhog uses 130GB memory.
When one node is filled up, the oldest pages are swapped into disk
until memhog finishes touching all 130GB memory.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 516 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-06 11:10                                     ` Vlastimil Babka
@ 2018-09-06 11:16                                       ` Vlastimil Babka
  2018-09-06 11:25                                         ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Vlastimil Babka @ 2018-09-06 11:16 UTC (permalink / raw)
  To: Michal Hocko, Andrea Arcangeli
  Cc: Zi Yan, Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Stefan Priebe - Profihost AG

On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
>> We can and should think about this much more but I would like to have
>> this regression closed. So can we address GFP_THISNODE part first and
>> build more complex solution on top?
>>
>> Is there any objection to my patch which does the similar thing to your
>> patch v2 in a different location?
> 
> Similar but not the same. It fixes the madvise case, but I wonder about
> the no-madvise defrag=defer case, where Zi Yan reports it still causes
> swapping.

Ah, but that should be the same with Andrea's variant 2) patch. There
should only be difference with defrag=always, which is direct reclaim
with __GFP_NORETRY, Andrea's patch would drop __GFP_THISNODE and your
not. Maybe Zi Yan can do the same kind of tests with Andrea's patch [1]
to confirm?

[1] https://marc.info/?l=linux-mm&m=153476267026951

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-05  7:08                                   ` Michal Hocko
@ 2018-09-06 11:10                                     ` Vlastimil Babka
  2018-09-06 11:16                                       ` Vlastimil Babka
  0 siblings, 1 reply; 40+ messages in thread
From: Vlastimil Babka @ 2018-09-06 11:10 UTC (permalink / raw)
  To: Michal Hocko, Andrea Arcangeli
  Cc: Zi Yan, Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Stefan Priebe - Profihost AG

On 09/05/2018 09:08 AM, Michal Hocko wrote:
> On Tue 04-09-18 23:44:03, Andrea Arcangeli wrote:
> [...]
>> That kind of swapping may only pay off in the very long long term,
>> which is what khugepaged is for. khugepaged already takes care of the
>> long term, so we could later argue and think if khugepaged should
>> swapout or not in such condition, but I don't think there's much to
>> argue about the page fault.
> 
> I agree that defrag==always doing a reclaim is not really good and
> benefits are questionable. If you remember this was the primary reason
> why the default has been changed.
> 
>>> Thanks for your and Stefan's testing. I will wait for some more
>>> feedback. I will be offline next few days and if there are no major
>>> objections I will repost with both tested-bys early next week.
>>
>> I'm not so positive about 2 of the above tests if I understood the
>> test correctly.
>>
>> Those results are totally fine if you used the non default memory
>> policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
>> I'm afraid it'll be even be harder to reproduce when things will go
>> wrong again in those two cases.
> 
> We can and should think about this much more but I would like to have
> this regression closed. So can we address GFP_THISNODE part first and
> build more complex solution on top?
> 
> Is there any objection to my patch which does the similar thing to your
> patch v2 in a different location?

Similar but not the same. It fixes the madvise case, but I wonder about
the no-madvise defrag=defer case, where Zi Yan reports it still causes
swapping.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 22:54                     ` Zi Yan
  2018-08-30  7:00                       ` Michal Hocko
@ 2018-09-06 10:59                       ` Vlastimil Babka
  2018-09-06 11:17                         ` Zi Yan
  1 sibling, 1 reply; 40+ messages in thread
From: Vlastimil Babka @ 2018-09-06 10:59 UTC (permalink / raw)
  To: Zi Yan, Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Stefan Priebe - Profihost AG

On 08/30/2018 12:54 AM, Zi Yan wrote:
> 
> Thanks for your patch.
> 
> I tested it against Linusa??s tree with a??memhog -r3 130ga?? in a two-socket machine with 128GB memory on
> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> 
> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.

Hmm this is GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE.
No direct reclaim, so it would have to be kswapd causing the swapping? I
wouldn't expect it to be significant and over-reclaiming. What exactly
is your definition of "pages got swapped"?

> 3. no madvise, THP is on by default + defrag = {always, defer, defer+madvise}: pages got swapped to the disk instead of
> being allocated in the fallback node.

So this should be the most common case (no madvise, THP on). If it's
causing too much reclaim, it's not good IMHO.

depending on defrag:
defer (the default) = same as above, so it would have to be kswapd
always = GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM | __GFP_NORETRY |
__GFP_THISNODE
  - so direct reclaim also overreclaims despite __GFP_NORETRY?
defer+madvise = same as defer

Vlastimil

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-09-05  3:44                                 ` Andrea Arcangeli
@ 2018-09-05  7:08                                   ` Michal Hocko
  2018-09-06 11:10                                     ` Vlastimil Babka
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2018-09-05  7:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Zi Yan, Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka, Stefan Priebe - Profihost AG

On Tue 04-09-18 23:44:03, Andrea Arcangeli wrote:
[...]
> That kind of swapping may only pay off in the very long long term,
> which is what khugepaged is for. khugepaged already takes care of the
> long term, so we could later argue and think if khugepaged should
> swapout or not in such condition, but I don't think there's much to
> argue about the page fault.

I agree that defrag==always doing a reclaim is not really good and
benefits are questionable. If you remember this was the primary reason
why the default has been changed.

> > Thanks for your and Stefan's testing. I will wait for some more
> > feedback. I will be offline next few days and if there are no major
> > objections I will repost with both tested-bys early next week.
> 
> I'm not so positive about 2 of the above tests if I understood the
> test correctly.
> 
> Those results are totally fine if you used the non default memory
> policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
> I'm afraid it'll be even be harder to reproduce when things will go
> wrong again in those two cases.

We can and should think about this much more but I would like to have
this regression closed. So can we address GFP_THISNODE part first and
build more complex solution on top?

Is there any objection to my patch which does the similar thing to your
patch v2 in a different location?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 16:40                               ` Michal Hocko
@ 2018-09-05  3:44                                 ` Andrea Arcangeli
  2018-09-05  7:08                                   ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2018-09-05  3:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zi Yan, Andrew Morton, linux-mm, Alex Williamson, David Rientjes,
	Vlastimil Babka, Stefan Priebe - Profihost AG

On Thu, Aug 30, 2018 at 06:40:57PM +0200, Michal Hocko wrote:
> On Thu 30-08-18 10:02:23, Zi Yan wrote:
> > On 30 Aug 2018, at 9:45, Michal Hocko wrote:
> > 
> > > On Thu 30-08-18 09:22:21, Zi Yan wrote:
> > >> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> > >>
> > >>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> > >>> [...]
> > >>>> I tested it against Linusa??s tree with a??memhog -r3 130ga?? in a two-socket machine with 128GB memory on
> > >>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> > >>>>
> > >>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> > >>>> no swap, THPs are allocated in the fallback node.

no swap

> > >>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> > >>>> disk instead of being allocated in the fallback node.

swap

> > >>>> 3. no madvise, THP is on by default + defrag = {always, defer,
> > >>>> defer+madvise}: pages got swapped to the disk instead of being
> > >>>> allocated in the fallback node.

swap

> > >>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> > >>>> pages are allocated in the fallback node.

no swap

> > >>>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.

I agree it's not great for 2 and 3.

I don't see how the above can be considered a 100% "pass" to the test,
at best it's a 50% pass.

Let me clarify the setup to be sure:

1) There was no hard bind at all

2) Let's also ignore NUMA balancing which is all but restrictive at
   the start and it's meant to converge over time if current
   conditions don't allow immediate convergence. For simplicity let's
   assume NUMA balancing off.

So what the test exercised is the plain normal allocation of RAM with
THP main knob enabled to "always" on a NUMA system.

No matter the madvise used or not used, 2 cases over 4 decided to
swapout instead of allocating totally free THP or PAGE_SIZEd pages.

As opposed there would have been absolutely zero swapouts in the exact
same test if the main THP knob would have been disabled with:

     echo never >/sys/kernel/mm/transparent_hugepage/enabled

There is no way that enabling THP (no matter what other defrag
settings were and no matter if MADV_HUGEPAGE was used or not) should
cause heavy swap storms during page faults allocating memory, when
disabling THP doesn't swap even a single 4k page. That can't possibly
be right.

This is because there is no way the overhead of swapping can be
compensated by the THP improvement.

And with swapping I really mean "reclaim", just testing with the
swapout testcase is simpler and doesn't require an iommu pinning all
memory. So setting may_swap and may_unmap to zero won't move the
needle because my test showed just massive CPU consumption in trying
so hard to generate THP from the local node, but nothing got swapped
out because of the iommu pins.

That kind of swapping may only pay off in the very long long term,
which is what khugepaged is for. khugepaged already takes care of the
long term, so we could later argue and think if khugepaged should
swapout or not in such condition, but I don't think there's much to
argue about the page fault.

> Thanks for your and Stefan's testing. I will wait for some more
> feedback. I will be offline next few days and if there are no major
> objections I will repost with both tested-bys early next week.

I'm not so positive about 2 of the above tests if I understood the
test correctly.

Those results are totally fine if you used the non default memory
policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
I'm afraid it'll be even be harder to reproduce when things will go
wrong again in those two cases.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 14:02                             ` Zi Yan
  2018-08-30 16:19                               ` Stefan Priebe - Profihost AG
@ 2018-08-30 16:40                               ` Michal Hocko
  2018-09-05  3:44                                 ` Andrea Arcangeli
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2018-08-30 16:40 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Thu 30-08-18 10:02:23, Zi Yan wrote:
> On 30 Aug 2018, at 9:45, Michal Hocko wrote:
> 
> > On Thu 30-08-18 09:22:21, Zi Yan wrote:
> >> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> >>
> >>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> >>> [...]
> >>>> I tested it against Linusa??s tree with a??memhog -r3 130ga?? in a two-socket machine with 128GB memory on
> >>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> >>>>
> >>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> >>>> no swap, THPs are allocated in the fallback node.
> >>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> >>>> disk instead of being allocated in the fallback node.
> >>>> 3. no madvise, THP is on by default + defrag = {always, defer,
> >>>> defer+madvise}: pages got swapped to the disk instead of being
> >>>> allocated in the fallback node.
> >>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> >>>> pages are allocated in the fallback node.
> >>>>
> >>>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
> >>>>
> >>>> The reason, as Andrea mentioned in his email, is that the combination
> >>>> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> >>>> from this experiment).
> >>>
> >>> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> >>> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> >>> see kswapd do the reclaim to balance the node. If the node is full of
> >>> anonymous pages then there is no other way than swap out.
> >>
> >> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
> >> + defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
> >> can be triggered.
> >
> > Yes, but the setup tells that you are willing to pay price to get a THP.
> > defered=always uses that special __GFP_NORETRY (unless it is madvised
> > mapping) that should back off if the compaction failed recently. How
> > much that reduces the reclaim is not really clear to me right now to be
> > honest.
> >
> >> The key issue here is that a??memhog -r3 130ga?? uses the default memory policy (MPOL_DEFAULT),
> >> which should allow page allocation fallback to other nodes, but as shown in
> >> result 3, swapping is triggered instead of page allocation fallback.
> >
> > Well, I guess this really depends. Fallback to a different node might be
> > seen as a bad thing and worse than the reclaim on the local node.
> >
> >>>> __THIS_NODE uses ZONELIST_NOFALLBACK, which
> >>>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> >>>> reclaim in the first page allocation node when fallback nodes are
> >>>> removed by ZONELIST_NOFALLBACK.
> >>>
> >>> Yes but the point is that the allocations which use __GFP_THISNODE are
> >>> optimistic so they shouldn't fallback to remote NUMA nodes.
> >>
> >> This can be achieved by using MPOL_BIND memory policy which restricts
> >> nodemask in struct alloc_context for user space memory allocations.
> >
> > Yes, but that requires and explicit NUMA handling. And we are trying to
> > handle those cases which do not really give a damn and just want to use
> > THP if it is available or try harder when they ask by using madvise.
> >
> >>>> IMHO, __THIS_NODE should not be used for user memory allocation at
> >>>> all, since it fights against most of memory policies.  But kernel
> >>>> memory allocation would need it as a kernel MPOL_BIND memory policy.
> >>>
> >>> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> >>> it here. But the problem is that optimistic THP allocations should
> >>> prefer a local node because a remote node might easily offset the
> >>> advantage of the THP. I do not have a great idea how to achieve that
> >>> without __GFP_THISNODE though.
> >>
> >> MPOL_PREFERRED memory policy can be used to achieve this optimistic
> >> THP allocation for user space. Even with the default memory policy,
> >> local memory node will be used first until it is full. It seems to
> >> me that __GFP_THISNODE is not necessary if a proper memory policy is
> >> used.
> >>
> >> Let me know if I miss anything. Thanks.
> >
> > You are missing that we are trying to define a sensible model for those
> > who do not really care about mempolicies. THP shouldn't cause more harm
> > than good for those.
> >
> > I wish we could come up with a remotely sane and comprehensible model.
> > That means that you know how hard the allocator tries to get a THP for
> > you depending on the defrag configuration, your memory policy and your
> > madvise setting. The easiest one I can think of is to
> > - always follow mempolicy when specified because you asked for it
> >   explicitly
> > - stay node local and low latency for the light THP defrag mode (defrag,
> >   madvise without hint and none) because THP is a nice to have
> > - if the defrag mode is always then you are willing to pay the latency
> >   price but off-node might be still a no-no.
> > - allow fallback for madvised mappings because you really want THP. If
> >   you care about specific numa placement then combine with the
> >   mempolicy.
> >
> > As you can see I do not really mention anything about the direct reclaim
> > because that is just an implementation detail of the page allocator and
> > compaction interaction.
> >
> > Maybe you can formulate a saner matrix with all the available modes that
> > we have.
> >
> > Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
> > is clearly wrong and we should address that first. Either Andrea's
> > option 2) patch or mine which does the similar thing except at the
> > proper layer (I believe). We can continue discussing other odd cases on
> > top I guess. Unless somebody has much brighter idea, of course.
> 
> Thanks for your explanation. It makes sense to me. I am fine with your patch.
> You can add my Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>, since
> my test result 1 shows that the problem mentioned in your changelog is solved.

Thanks for your and Stefan's testing. I will wait for some more
feedback. I will be offline next few days and if there are no major
objections I will repost with both tested-bys early next week.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 14:02                             ` Zi Yan
@ 2018-08-30 16:19                               ` Stefan Priebe - Profihost AG
  2018-08-30 16:40                               ` Michal Hocko
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Priebe - Profihost AG @ 2018-08-30 16:19 UTC (permalink / raw)
  To: Zi Yan
  Cc: Michal Hocko, Andrea Arcangeli, Andrew Morton, linux-mm,
	Alex Williamson, David Rientjes, Vlastimil Babka

[-- Attachment #1: Type: text/plain, Size: 230 bytes --]

Please add also:
Tested-by: Stefan Priebe <s.priebe@profihost.ag>

Stefan

Excuse my typo sent from my mobile phone.

> Am 30.08.2018 um 16:02 schrieb Zi Yan <zi.yan@cs.rutgers.edu>:
> 
> Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>

[-- Attachment #2: Type: text/html, Size: 750 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 13:45                           ` Michal Hocko
@ 2018-08-30 14:02                             ` Zi Yan
  2018-08-30 16:19                               ` Stefan Priebe - Profihost AG
  2018-08-30 16:40                               ` Michal Hocko
  0 siblings, 2 replies; 40+ messages in thread
From: Zi Yan @ 2018-08-30 14:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 6014 bytes --]

On 30 Aug 2018, at 9:45, Michal Hocko wrote:

> On Thu 30-08-18 09:22:21, Zi Yan wrote:
>> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
>>
>>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
>>> [...]
>>>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>>>
>>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
>>>> no swap, THPs are allocated in the fallback node.
>>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
>>>> disk instead of being allocated in the fallback node.
>>>> 3. no madvise, THP is on by default + defrag = {always, defer,
>>>> defer+madvise}: pages got swapped to the disk instead of being
>>>> allocated in the fallback node.
>>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
>>>> pages are allocated in the fallback node.
>>>>
>>>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
>>>>
>>>> The reason, as Andrea mentioned in his email, is that the combination
>>>> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
>>>> from this experiment).
>>>
>>> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
>>> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
>>> see kswapd do the reclaim to balance the node. If the node is full of
>>> anonymous pages then there is no other way than swap out.
>>
>> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
>> + defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
>> can be triggered.
>
> Yes, but the setup tells that you are willing to pay price to get a THP.
> defered=always uses that special __GFP_NORETRY (unless it is madvised
> mapping) that should back off if the compaction failed recently. How
> much that reduces the reclaim is not really clear to me right now to be
> honest.
>
>> The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
>> which should allow page allocation fallback to other nodes, but as shown in
>> result 3, swapping is triggered instead of page allocation fallback.
>
> Well, I guess this really depends. Fallback to a different node might be
> seen as a bad thing and worse than the reclaim on the local node.
>
>>>> __THIS_NODE uses ZONELIST_NOFALLBACK, which
>>>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
>>>> reclaim in the first page allocation node when fallback nodes are
>>>> removed by ZONELIST_NOFALLBACK.
>>>
>>> Yes but the point is that the allocations which use __GFP_THISNODE are
>>> optimistic so they shouldn't fallback to remote NUMA nodes.
>>
>> This can be achieved by using MPOL_BIND memory policy which restricts
>> nodemask in struct alloc_context for user space memory allocations.
>
> Yes, but that requires and explicit NUMA handling. And we are trying to
> handle those cases which do not really give a damn and just want to use
> THP if it is available or try harder when they ask by using madvise.
>
>>>> IMHO, __THIS_NODE should not be used for user memory allocation at
>>>> all, since it fights against most of memory policies.  But kernel
>>>> memory allocation would need it as a kernel MPOL_BIND memory policy.
>>>
>>> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
>>> it here. But the problem is that optimistic THP allocations should
>>> prefer a local node because a remote node might easily offset the
>>> advantage of the THP. I do not have a great idea how to achieve that
>>> without __GFP_THISNODE though.
>>
>> MPOL_PREFERRED memory policy can be used to achieve this optimistic
>> THP allocation for user space. Even with the default memory policy,
>> local memory node will be used first until it is full. It seems to
>> me that __GFP_THISNODE is not necessary if a proper memory policy is
>> used.
>>
>> Let me know if I miss anything. Thanks.
>
> You are missing that we are trying to define a sensible model for those
> who do not really care about mempolicies. THP shouldn't cause more harm
> than good for those.
>
> I wish we could come up with a remotely sane and comprehensible model.
> That means that you know how hard the allocator tries to get a THP for
> you depending on the defrag configuration, your memory policy and your
> madvise setting. The easiest one I can think of is to
> - always follow mempolicy when specified because you asked for it
>   explicitly
> - stay node local and low latency for the light THP defrag mode (defrag,
>   madvise without hint and none) because THP is a nice to have
> - if the defrag mode is always then you are willing to pay the latency
>   price but off-node might be still a no-no.
> - allow fallback for madvised mappings because you really want THP. If
>   you care about specific numa placement then combine with the
>   mempolicy.
>
> As you can see I do not really mention anything about the direct reclaim
> because that is just an implementation detail of the page allocator and
> compaction interaction.
>
> Maybe you can formulate a saner matrix with all the available modes that
> we have.
>
> Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
> is clearly wrong and we should address that first. Either Andrea's
> option 2) patch or mine which does the similar thing except at the
> proper layer (I believe). We can continue discussing other odd cases on
> top I guess. Unless somebody has much brighter idea, of course.

Thanks for your explanation. It makes sense to me. I am fine with your patch.
You can add my Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>, since
my test result 1 shows that the problem mentioned in your changelog is solved.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 516 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30 13:22                         ` Zi Yan
@ 2018-08-30 13:45                           ` Michal Hocko
  2018-08-30 14:02                             ` Zi Yan
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2018-08-30 13:45 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Thu 30-08-18 09:22:21, Zi Yan wrote:
> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> 
> > On Wed 29-08-18 18:54:23, Zi Yan wrote:
> > [...]
> >> I tested it against Linusa??s tree with a??memhog -r3 130ga?? in a two-socket machine with 128GB memory on
> >> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> >>
> >> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> >> no swap, THPs are allocated in the fallback node.
> >> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> >> disk instead of being allocated in the fallback node.
> >> 3. no madvise, THP is on by default + defrag = {always, defer,
> >> defer+madvise}: pages got swapped to the disk instead of being
> >> allocated in the fallback node.
> >> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> >> pages are allocated in the fallback node.
> >>
> >> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
> >>
> >> The reason, as Andrea mentioned in his email, is that the combination
> >> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> >> from this experiment).
> >
> > But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> > We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> > see kswapd do the reclaim to balance the node. If the node is full of
> > anonymous pages then there is no other way than swap out.
> 
> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
> + defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
> can be triggered.

Yes, but the setup tells that you are willing to pay price to get a THP.
defered=always uses that special __GFP_NORETRY (unless it is madvised
mapping) that should back off if the compaction failed recently. How
much that reduces the reclaim is not really clear to me right now to be
honest.

> The key issue here is that a??memhog -r3 130ga?? uses the default memory policy (MPOL_DEFAULT),
> which should allow page allocation fallback to other nodes, but as shown in
> result 3, swapping is triggered instead of page allocation fallback.

Well, I guess this really depends. Fallback to a different node might be
seen as a bad thing and worse than the reclaim on the local node.

> >> __THIS_NODE uses ZONELIST_NOFALLBACK, which
> >> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> >> reclaim in the first page allocation node when fallback nodes are
> >> removed by ZONELIST_NOFALLBACK.
> >
> > Yes but the point is that the allocations which use __GFP_THISNODE are
> > optimistic so they shouldn't fallback to remote NUMA nodes.
> 
> This can be achieved by using MPOL_BIND memory policy which restricts
> nodemask in struct alloc_context for user space memory allocations.

Yes, but that requires and explicit NUMA handling. And we are trying to
handle those cases which do not really give a damn and just want to use
THP if it is available or try harder when they ask by using madvise.

> >> IMHO, __THIS_NODE should not be used for user memory allocation at
> >> all, since it fights against most of memory policies.  But kernel
> >> memory allocation would need it as a kernel MPOL_BIND memory policy.
> >
> > __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> > it here. But the problem is that optimistic THP allocations should
> > prefer a local node because a remote node might easily offset the
> > advantage of the THP. I do not have a great idea how to achieve that
> > without __GFP_THISNODE though.
> 
> MPOL_PREFERRED memory policy can be used to achieve this optimistic
> THP allocation for user space. Even with the default memory policy,
> local memory node will be used first until it is full. It seems to
> me that __GFP_THISNODE is not necessary if a proper memory policy is
> used.
> 
> Let me know if I miss anything. Thanks.

You are missing that we are trying to define a sensible model for those
who do not really care about mempolicies. THP shouldn't cause more harm
than good for those.

I wish we could come up with a remotely sane and comprehensible model.
That means that you know how hard the allocator tries to get a THP for
you depending on the defrag configuration, your memory policy and your
madvise setting. The easiest one I can think of is to 
- always follow mempolicy when specified because you asked for it
  explicitly
- stay node local and low latency for the light THP defrag mode (defrag,
  madvise without hint and none) because THP is a nice to have
- if the defrag mode is always then you are willing to pay the latency
  price but off-node might be still a no-no.
- allow fallback for madvised mappings because you really want THP. If
  you care about specific numa placement then combine with the
  mempolicy.

As you can see I do not really mention anything about the direct reclaim
because that is just an implementation detail of the page allocator and
compaction interaction.

Maybe you can formulate a saner matrix with all the available modes that
we have.

Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
is clearly wrong and we should address that first. Either Andrea's
option 2) patch or mine which does the similar thing except at the
proper layer (I believe). We can continue discussing other odd cases on
top I guess. Unless somebody has much brighter idea, of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-30  7:00                       ` Michal Hocko
@ 2018-08-30 13:22                         ` Zi Yan
  2018-08-30 13:45                           ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Zi Yan @ 2018-08-30 13:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 3285 bytes --]

On 30 Aug 2018, at 3:00, Michal Hocko wrote:

> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> [...]
>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>
>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
>> no swap, THPs are allocated in the fallback node.
>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
>> disk instead of being allocated in the fallback node.
>> 3. no madvise, THP is on by default + defrag = {always, defer,
>> defer+madvise}: pages got swapped to the disk instead of being
>> allocated in the fallback node.
>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
>> pages are allocated in the fallback node.
>>
>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
>>
>> The reason, as Andrea mentioned in his email, is that the combination
>> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
>> from this experiment).
>
> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> see kswapd do the reclaim to balance the node. If the node is full of
> anonymous pages then there is no other way than swap out.

GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
+ defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
can be triggered.

The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
which should allow page allocation fallback to other nodes, but as shown in
result 3, swapping is triggered instead of page allocation fallback.

>
>> __THIS_NODE uses ZONELIST_NOFALLBACK, which
>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
>> reclaim in the first page allocation node when fallback nodes are
>> removed by ZONELIST_NOFALLBACK.
>
> Yes but the point is that the allocations which use __GFP_THISNODE are
> optimistic so they shouldn't fallback to remote NUMA nodes.

This can be achieved by using MPOL_BIND memory policy which restricts
nodemask in struct alloc_context for user space memory allocations.

>
>> IMHO, __THIS_NODE should not be used for user memory allocation at
>> all, since it fights against most of memory policies.  But kernel
>> memory allocation would need it as a kernel MPOL_BIND memory policy.
>
> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> it here. But the problem is that optimistic THP allocations should
> prefer a local node because a remote node might easily offset the
> advantage of the THP. I do not have a great idea how to achieve that
> without __GFP_THISNODE though.

MPOL_PREFERRED memory policy can be used to achieve this optimistic THP allocation
for user space. Even with the default memory policy, local memory node will be used
first until it is full. It seems to me that __GFP_THISNODE is not necessary
if a proper memory policy is used.

Let me know if I miss anything. Thanks.


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 516 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 22:54                     ` Zi Yan
@ 2018-08-30  7:00                       ` Michal Hocko
  2018-08-30 13:22                         ` Zi Yan
  2018-09-06 10:59                       ` Vlastimil Babka
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2018-08-30  7:00 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Wed 29-08-18 18:54:23, Zi Yan wrote:
[...]
> I tested it against Linusa??s tree with a??memhog -r3 130ga?? in a two-socket machine with 128GB memory on
> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> 
> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> no swap, THPs are allocated in the fallback node.
> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> disk instead of being allocated in the fallback node.
> 3. no madvise, THP is on by default + defrag = {always, defer,
> defer+madvise}: pages got swapped to the disk instead of being
> allocated in the fallback node.
> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> pages are allocated in the fallback node.
> 
> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
> 
> The reason, as Andrea mentioned in his email, is that the combination
> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> from this experiment).

But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
see kswapd do the reclaim to balance the node. If the node is full of
anonymous pages then there is no other way than swap out.

> __THIS_NODE uses ZONELIST_NOFALLBACK, which
> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> reclaim in the first page allocation node when fallback nodes are
> removed by ZONELIST_NOFALLBACK.

Yes but the point is that the allocations which use __GFP_THISNODE are
optimistic so they shouldn't fallback to remote NUMA nodes.

> IMHO, __THIS_NODE should not be used for user memory allocation at
> all, since it fights against most of memory policies.  But kernel
> memory allocation would need it as a kernel MPOL_BIND memory policy.

__GFP_THISNODE is indeed an ugliness. I would really love to get rid of
it here. But the problem is that optimistic THP allocations should
prefer a local node because a remote node might easily offset the
advantage of the THP. I do not have a great idea how to achieve that
without __GFP_THISNODE though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 19:24                   ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
  2018-08-29 22:54                     ` Zi Yan
@ 2018-08-30  6:47                     ` Michal Hocko
  2018-09-06 11:18                       ` Vlastimil Babka
  2018-09-12 17:29                     ` Mel Gorman
  2 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2018-08-30  6:47 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

I forgot mpol_cond_put...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 19:24                   ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
@ 2018-08-29 22:54                     ` Zi Yan
  2018-08-30  7:00                       ` Michal Hocko
  2018-09-06 10:59                       ` Vlastimil Babka
  2018-08-30  6:47                     ` Michal Hocko
  2018-09-12 17:29                     ` Mel Gorman
  2 siblings, 2 replies; 40+ messages in thread
From: Zi Yan @ 2018-08-29 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

[-- Attachment #1: Type: text/plain, Size: 7162 bytes --]

Hi Michal,

<snip>
>
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 25 +++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 20 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..94472bf9a31b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +#endif
>
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>
>  /* Caller must hold page table lock. */
> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -
>  	nmask = policy_nodemask(gfp, pol);
>  	preferred_nid = policy_node(gfp, pol, node);
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> -- 
> 2.18.0
>

Thanks for your patch.

I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
each node and got the results below. I expect this test should fill one node, then fall back to the other.

1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.
3. no madvise, THP is on by default + defrag = {always, defer, defer+madvise}: pages got swapped to the disk instead of
being allocated in the fallback node.
4. no madvise, THP is on by default + defrag = madvise: no swap, base pages are allocated in the fallback node.

The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.

The reason, as Andrea mentioned in his email, is that the combination of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM from this experiment). __THIS_NODE uses ZONELIST_NOFALLBACK, which removes the fallback possibility
and __GFP_*_RECLAIM triggers page reclaim in the first page allocation node when fallback nodes are removed by
ZONELIST_NOFALLBACK.

IMHO, __THIS_NODE should not be used for user memory allocation at all, since it fights against most of memory policies.
But kernel memory allocation would need it as a kernel MPOL_BIND memory policy.

Comments?

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 516 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
  2018-08-29 16:25                 ` Michal Hocko
@ 2018-08-29 19:24                   ` Michal Hocko
  2018-08-29 22:54                     ` Zi Yan
                                       ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Michal Hocko @ 2018-08-29 19:24 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Alex Williamson,
	David Rientjes, Vlastimil Babka, Stefan Priebe - Profihost AG

On Wed 29-08-18 18:25:28, Michal Hocko wrote:
> On Wed 29-08-18 12:06:48, Zi Yan wrote:
> > The warning goes away with this change. I am OK with this patch (plus the original one you sent out,
> > which could be merged with this one).
> 
> I will respin the patch, update the changelog and repost. Tomorrow I
> hope.

Here is the v2

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2018-09-17 11:27 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-07 13:05 [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
2018-09-07 13:05 ` Michal Hocko
2018-09-08 18:52 ` Stefan Priebe - Profihost AG
2018-09-10  7:39   ` Michal Hocko
2018-09-11  9:03   ` Vlastimil Babka
2018-09-10 20:08 ` David Rientjes
2018-09-10 20:22   ` Stefan Priebe - Profihost AG
2018-09-11  8:51   ` Vlastimil Babka
2018-09-11 11:56   ` Michal Hocko
2018-09-11 20:30     ` David Rientjes
2018-09-12 12:05       ` Michal Hocko
2018-09-12 12:05         ` Michal Hocko
2018-09-12 20:40         ` David Rientjes
2018-09-12 13:54     ` Andrea Arcangeli
2018-09-12 14:21       ` Michal Hocko
2018-09-12 15:25         ` Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2018-08-23 10:52 [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always Michal Hocko
2018-08-28  7:53 ` Michal Hocko
2018-08-28  8:18   ` Michal Hocko
     [not found]     ` <D5F4A33C-0A37-495C-9468-D6866A862097@cs.rutgers.edu>
2018-08-29 14:28       ` Michal Hocko
2018-08-29 14:35         ` Michal Hocko
2018-08-29 15:22           ` Zi Yan
2018-08-29 15:47             ` Michal Hocko
2018-08-29 16:06               ` Zi Yan
2018-08-29 16:25                 ` Michal Hocko
2018-08-29 19:24                   ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
2018-08-29 22:54                     ` Zi Yan
2018-08-30  7:00                       ` Michal Hocko
2018-08-30 13:22                         ` Zi Yan
2018-08-30 13:45                           ` Michal Hocko
2018-08-30 14:02                             ` Zi Yan
2018-08-30 16:19                               ` Stefan Priebe - Profihost AG
2018-08-30 16:40                               ` Michal Hocko
2018-09-05  3:44                                 ` Andrea Arcangeli
2018-09-05  7:08                                   ` Michal Hocko
2018-09-06 11:10                                     ` Vlastimil Babka
2018-09-06 11:16                                       ` Vlastimil Babka
2018-09-06 11:25                                         ` Michal Hocko
2018-09-06 12:35                                           ` Zi Yan
2018-09-06 10:59                       ` Vlastimil Babka
2018-09-06 11:17                         ` Zi Yan
2018-08-30  6:47                     ` Michal Hocko
2018-09-06 11:18                       ` Vlastimil Babka
2018-09-06 11:27                         ` Michal Hocko
2018-09-12 17:29                     ` Mel Gorman
2018-09-17  6:11                       ` Michal Hocko
2018-09-17  7:04                         ` Stefan Priebe - Profihost AG
2018-09-17  9:32                           ` Stefan Priebe - Profihost AG
2018-09-17 11:27                           ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.