* [patch v2 for-4.20] mm, thp: restore node-local hugepage allocations
@ 2018-12-05 22:46 David Rientjes
  2018-12-06  4:59 ` [LKP] " kernel test robot
  2018-12-06 21:42 ` David Rientjes
  0 siblings, 2 replies; 11+ messages in thread
From: David Rientjes @ 2018-12-05 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, mgorman, Vlastimil Babka, mhocko, ying.huang,
	s.priebe, Linux List Kernel Mailing, alex.williamson, lkp,
	kirill, Andrew Morton, zi.yan

This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").

By not setting __GFP_THISNODE, applications can allocate remote hugepages
when the local node is fragmented or low on memory when either the thp
defrag setting is "always" or the vma has been madvised with
MADV_HUGEPAGE.
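
For illustration (a minimal userspace sketch, not part of this patch), an
application opts a mapping into the MADV_HUGEPAGE case roughly like this;
the madvise() flag and sysfs path are the standard THP interfaces:

#define _GNU_SOURCE		/* MADV_HUGEPAGE / MAP_ANONYMOUS on older glibc */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB anonymous region */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/*
	 * Opt this vma into hugepages; with
	 * /sys/kernel/mm/transparent_hugepage/defrag set to "madvise",
	 * only mappings marked this way may stall for compaction at
	 * fault time.
	 */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
	memset(p, 0, len);	/* fault the region in */
	munmap(p, len);
	return 0;
}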

Remote access to hugepages often has much higher latency than local pages
of the native page size.  On Haswell, ac5b2c18911f was measured to cause a
13.9% regression in access latency for binaries that remap their text
segment to be backed by transparent hugepages.

The intent of ac5b2c18911f is to address an issue where a local node is
low on memory or fragmented such that a hugepage cannot be allocated.  In
every scenario where this was described as a fix, there is abundant and
unfragmented remote memory available to allocate from, even with a greater
access latency.

If remote memory is also low or fragmented, not setting __GFP_THISNODE was
also measured on Haswell to have a 40% regression in allocation latency.

Restore __GFP_THISNODE for thp allocations.

Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |  2 --
 mm/huge_memory.c          | 42 +++++++++++++++------------------------
 mm/mempolicy.c            |  2 +-
 3 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -632,37 +632,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	gfp_t this_node = 0;
-
-#ifdef CONFIG_NUMA
-	struct mempolicy *pol;
-	/*
-	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
-	 * specified, to express a general desire to stay on the current
-	 * node for optimistic allocation attempts. If the defrag mode
-	 * and/or madvise hint requires the direct reclaim then we prefer
-	 * to fallback to other node rather than node reclaim because that
-	 * can lead to excessive reclaim even though there is free memory
-	 * on other nodes. We expect that NUMA preferences are specified
-	 * by memory policies.
-	 */
-	pol = get_vma_policy(vma, addr);
-	if (pol->mode != MPOL_BIND)
-		this_node = __GFP_THISNODE;
-	mpol_cond_put(pol);
-#endif
+	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
+	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | __GFP_THISNODE |
+		       (vma_madvised ? 0 : __GFP_NORETRY);
+
+	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+		return gfp_mask | __GFP_KSWAPD_RECLAIM;
+
+	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM | this_node);
+		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+						  __GFP_KSWAPD_RECLAIM);
+
+	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     this_node);
-	return GFP_TRANSHUGE_LIGHT | this_node;
+		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+
+	return gfp_mask;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1662,7 +1662,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);


* Re: [LKP] [patch v2 for-4.20] mm, thp: restore node-local hugepage allocations
  2018-12-05 22:46 [patch v2 for-4.20] mm, thp: restore node-local hugepage allocations David Rientjes
@ 2018-12-06  4:59 ` kernel test robot
  2018-12-06 21:42 ` David Rientjes
  1 sibling, 0 replies; 11+ messages in thread
From: kernel test robot @ 2018-12-06  4:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Andrea Arcangeli, s.priebe, lkp,
	Linux List Kernel Mailing, mhocko, alex.williamson, zi.yan,
	kirill, Andrew Morton, mgorman, Vlastimil Babka

On Wed, Dec 05, 2018 at 02:46:50PM -0800, David Rientjes wrote:

Hi,

FYI, we noticed no regression of vm-scalability.throughput between the two commits.

tests: 1
testcase/path_params/tbox_group/run: vm-scalability/300-always-always-32-1-swap-w-seq-ucode=0x3d-performance/lkp-hsw-ep4

commit: 
  94e297c50b ("include/linux/notifier.h: SRCU: fix ctags")
  2f0799a0ff ("mm, thp: restore node-local hugepage allocations")

94e297c50b529f5d  2f0799a0ffc033bf3cc82d5032  
----------------  --------------------------  
         %stddev      change         %stddev
             \          |                \  
    425945 ±  9%        36%     578453 ± 18%  vm-scalability.time.minor_page_faults
       322                         313        vm-scalability.time.user_time
    905482 ±  5%        20%    1087102 ±  9%  perf-stat.page-faults
    901474 ±  5%        20%    1081707 ±  9%  perf-stat.minor-faults
    425945 ±  9%        36%     578453 ± 18%  time.minor_page_faults
       322                         313        time.user_time
      6792 ±  5%       -10%       6097        vmstat.system.cs
   1625177 ±  4%       -11%    1446045        vmstat.swap.so
   1625189 ±  4%       -11%    1446055        vmstat.io.bo
    171516 ±  5%       -24%     131111 ± 24%  vmstat.system.in
    157193 ± 15%        46%     230011 ± 20%  proc-vmstat.pgactivate
    559397 ±  8%        45%     809533 ± 12%  proc-vmstat.pgscan_direct
    207042 ±  8%        43%     295574 ± 10%  proc-vmstat.pgsteal_direct
   5852284 ± 12%        42%    8294341 ± 20%  proc-vmstat.slabs_scanned
    211428 ±  6%        39%     293175 ±  6%  proc-vmstat.pgrefill
    135763 ±  9%        39%     188155 ±  3%  proc-vmstat.nr_vmscan_write
    220023 ±  5%        38%     303404 ±  6%  proc-vmstat.nr_written
    219051 ±  5%        38%     301549 ±  6%  proc-vmstat.pgrotated
      1018 ± 10%        36%       1383 ±  4%  proc-vmstat.nr_zone_write_pending
      1017 ± 10%        36%       1379 ±  4%  proc-vmstat.nr_writeback
    919059 ±  4%        20%    1102507 ±  9%  proc-vmstat.pgfault
   1108142 ±  9%        16%    1285859 ±  5%  proc-vmstat.numa_local
   1122389 ±  9%        16%    1302063 ±  5%  proc-vmstat.numa_hit
    715724              -5%     682381 ±  4%  proc-vmstat.nr_file_pages
      5784 ±  6%        -5%       5482 ±  7%  proc-vmstat.nr_mapped
    266928             -16%     223157        proc-vmstat.nr_zone_unevictable
    266928             -16%     223157        proc-vmstat.nr_unevictable
         0            4e+05     428930 ±139%  latency_stats.avg.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled._copy_to_user.do_syslog.kmsg_read
     11973 ± 70%      3e+05     298302 ± 83%  latency_stats.avg.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_enhanced_fast_string._copy_to_user.do_syslog.kmsg_read
     17650 ± 35%      2e+05     265410 ± 84%  latency_stats.avg.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.strnlen_user.copy_strings.__do_execve_file.__x64_sys_execve
      9801 ± 42%      5e+04      61121 ±112%  latency_stats.avg.io_schedule.__lock_page.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__get_user_pages.get_user_pages_remote.__access_remote_vm.proc_pid_cmdline_read.__vfs_read.vfs_read
         0            3e+04      25998 ±141%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
         0            6e+03       6202 ±141%  latency_stats.avg.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.read_swap_cache_async.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_fault
         0            5e+03       5457 ±141%  latency_stats.avg.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.swapin_readahead.do_swap_page.__handle_mm_fault.handle_mm_fault.__get_user_pages
      4585 ±173%     -5e+03          0        latency_stats.avg.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_fault.__do_fault
      6229 ±173%     -6e+03          0        latency_stats.avg.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.read_swap_cache_async.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_write_begin
     37391 ±173%     -4e+04          0        latency_stats.avg.io_schedule.wait_on_page_bit.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
     52525 ±100%     -5e+04          0        latency_stats.avg.msleep.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.__do_page_fault
    141825 ±172%     -1e+05        590 ±126%  latency_stats.avg.io_schedule.__lock_page.shmem_getpage_gfp.shmem_file_read_iter.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
    240110 ±124%     -2e+05       2543 ±141%  latency_stats.avg.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled._copy_to_user.perf_read.__vfs_read
    272115 ±171%     -3e+05       6043 ± 71%  latency_stats.avg.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled.core_sys_select.kern_select.__x64_sys_select
    538363 ± 77%     -4e+05     136003 ± 68%  latency_stats.avg.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_enhanced_fast_string.copyout._copy_to_iter.skb_copy_datagram_iter
     23857 ± 26%      1e+06    1035782 ± 71%  latency_stats.max.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.strnlen_user.copy_strings.__do_execve_file.__x64_sys_execve
     11973 ± 70%      8e+05     778555 ± 70%  latency_stats.max.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_enhanced_fast_string._copy_to_user.do_syslog.kmsg_read
         0            4e+05     428930 ±139%  latency_stats.max.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled._copy_to_user.do_syslog.kmsg_read
         0            3e+04      25998 ±141%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
         0            8e+03       7916 ±141%  latency_stats.max.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.read_swap_cache_async.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_fault
         0            7e+03       6751 ±141%  latency_stats.max.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.swapin_readahead.do_swap_page.__handle_mm_fault.handle_mm_fault.__get_user_pages
      4585 ±173%     -5e+03          0        latency_stats.max.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_fault.__do_fault
      6229 ±173%     -6e+03          0        latency_stats.max.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.read_swap_cache_async.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_write_begin
      7425 ±117%     -7e+03         68 ±  7%  latency_stats.max.smp_call_on_cpu.lockup_detector_reconfigure.proc_watchdog_common.proc_sys_call_handler.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
     52680 ±100%     -5e+04          0        latency_stats.max.msleep.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.__do_page_fault
    107626 ±173%     -1e+05          0        latency_stats.max.io_schedule.wait_on_page_bit.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
    272938 ±170%     -3e+05      16767 ± 71%  latency_stats.max.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled.core_sys_select.kern_select.__x64_sys_select
    284952 ±170%     -3e+05        592 ±125%  latency_stats.max.io_schedule.__lock_page.shmem_getpage_gfp.shmem_file_read_iter.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
    357768 ±136%     -4e+05       2543 ±141%  latency_stats.max.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled._copy_to_user.perf_read.__vfs_read
   1155510 ± 48%      9e+06    9751594 ±110%  latency_stats.sum.io_schedule.__lock_page.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__get_user_pages.get_user_pages_remote.__access_remote_vm.proc_pid_cmdline_read.__vfs_read.vfs_read
   4573026 ± 77%      8e+06   12847407 ±122%  latency_stats.sum.io_schedule.wait_on_page_bit.shmem_getpage_gfp.shmem_write_begin.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
     48641 ± 48%      2e+06    2342957 ± 71%  latency_stats.sum.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.strnlen_user.copy_strings.__do_execve_file.__x64_sys_execve
     11973 ± 70%      8e+05     788601 ± 70%  latency_stats.sum.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_enhanced_fast_string._copy_to_user.do_syslog.kmsg_read
         0            4e+05     428930 ±139%  latency_stats.sum.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled._copy_to_user.do_syslog.kmsg_read
         0            4e+04      38202 ±141%  latency_stats.sum.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.swapin_readahead.do_swap_page.__handle_mm_fault.handle_mm_fault.__get_user_pages
         0            3e+04      25998 ±141%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
         0            2e+04      18607 ±141%  latency_stats.sum.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.read_swap_cache_async.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_fault
      4585 ±173%     -5e+03          0        latency_stats.sum.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_fault.__do_fault
      8048 ± 99%     -5e+03       3141 ± 60%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
      6229 ±173%     -6e+03          0        latency_stats.sum.io_schedule.blk_mq_get_tag.blk_mq_get_request.blk_mq_make_request.generic_make_request.submit_bio.swap_readpage.read_swap_cache_async.swap_cluster_readahead.shmem_swapin.shmem_getpage_gfp.shmem_write_begin
     10774 ± 82%     -7e+03       3648 ±  5%  latency_stats.sum.smp_call_on_cpu.lockup_detector_reconfigure.proc_watchdog_common.proc_sys_call_handler.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
    112174 ±173%     -1e+05          0        latency_stats.sum.io_schedule.wait_on_page_bit.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
    157878 ±137%     -2e+05          0        latency_stats.sum.msleep.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.__do_page_fault
    275908 ±168%     -3e+05      18129 ± 71%  latency_stats.sum.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled.core_sys_select.kern_select.__x64_sys_select
    285059 ±170%     -3e+05        633 ±113%  latency_stats.sum.io_schedule.__lock_page.shmem_getpage_gfp.shmem_file_read_iter.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
    427767 ±141%     -4e+05       2543 ±141%  latency_stats.sum.io_schedule.__lock_page_or_retry.do_swap_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.copy_user_generic_unrolled._copy_to_user.perf_read.__vfs_read

Best Regards,
Rong Chen


* Re: [patch v2 for-4.20] mm, thp: restore node-local hugepage allocations
  2018-12-05 22:46 [patch v2 for-4.20] mm, thp: restore node-local hugepage allocations David Rientjes
  2018-12-06  4:59 ` [LKP] " kernel test robot
@ 2018-12-06 21:42 ` David Rientjes
  2018-12-06 22:00   ` [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" David Rientjes
  1 sibling, 1 reply; 11+ messages in thread
From: David Rientjes @ 2018-12-06 21:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, mgorman, Vlastimil Babka, mhocko, ying.huang,
	s.priebe, Linux List Kernel Mailing, alex.williamson, lkp,
	kirill, Andrew Morton, zi.yan

On Wed, 5 Dec 2018, David Rientjes wrote:

> This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
> consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
> 
> By not setting __GFP_THISNODE, applications can allocate remote hugepages
> when the local node is fragmented or low on memory when either the thp
> defrag setting is "always" or the vma has been madvised with
> MADV_HUGEPAGE.
> 
> Remote access to hugepages often has much higher latency than local pages
> of the native page size.  On Haswell, ac5b2c18911f was measured to cause a
> 13.9% regression in access latency for binaries that remap their text
> segment to be backed by transparent hugepages.
> 
> The intent of ac5b2c18911f is to address an issue where a local node is
> low on memory or fragmented such that a hugepage cannot be allocated.  In
> every scenario where this was described as a fix, there is abundant and
> unfragmented remote memory available to allocate from, even with a greater
> access latency.
> 
> If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> also measured on Haswell to have a 40% regression in allocation latency.
> 
> Restore __GFP_THISNODE for thp allocations.
> 
> Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
> Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
> Signed-off-by: David Rientjes <rientjes@google.com>

We've identified a couple more regressions wrt 89c83fb539f9 ("mm, thp: 
consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") in 
automated testing, so I'm going to be proposing a full revert of that 
commit for 4.20 as a follow-up to this.


* [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
  2018-12-06 21:42 ` David Rientjes
@ 2018-12-06 22:00   ` David Rientjes
  2018-12-07  8:05     ` Michal Hocko
  2018-12-07 14:41     ` Vlastimil Babka
  0 siblings, 2 replies; 11+ messages in thread
From: David Rientjes @ 2018-12-06 22:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, mgorman, Vlastimil Babka, mhocko, ying.huang,
	s.priebe, Linux List Kernel Mailing, alex.williamson, lkp,
	kirill, Andrew Morton, zi.yan

This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.

There are a couple of issues with 89c83fb539f9 independent of its partial 
revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
allocations"):

Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
the same for shared vma policies, triggering the WARN_ON_ONCE() in 
policy_node().
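
For context, the check in question looks roughly like this in
mm/mempolicy.c as of 4.20 (simplified sketch):

static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
{
	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
		nd = policy->v.preferred_node;
	else {
		/*
		 * __GFP_THISNODE is not supposed to be combined with
		 * MPOL_BIND, since that could silently violate the bind
		 * policy's nodemask.
		 */
		WARN_ON_ONCE(policy->mode == MPOL_BIND &&
			     (gfp & __GFP_THISNODE));
	}

	return nd;
}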

Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
different policy for hugepage allocations, which were allocated through 
alloc_hugepage_vma().  For hugepage allocations, if the allocating 
process's node is in the set of allowed nodes, allocate with 
__GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
__GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to 
allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in 
mm/mempolicy.c which is functionally different behavior and removes the 
requirement to only allocate hugepages locally.

The latter should have been reverted as part of 2f0799a0ffc0 as well.
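
To make the policy interaction concrete, here is a minimal userspace
sketch (illustrative only; mbind() is declared in libnuma's <numaif.h>,
link with -lnuma) that attaches an MPOL_PREFERRED policy to a mapping;
with the pre-89c83fb539f9 behavior, the hugepage fault path would then try
only that preferred node with __GFP_THISNODE:

#define _GNU_SOURCE
#include <numaif.h>		/* mbind(), MPOL_PREFERRED */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 32UL << 20;
	unsigned long nodemask = 1UL << 0;	/* node 0 */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* prefer node 0 for this vma's allocations */
	if (mbind(p, len, MPOL_PREFERRED, &nodemask, 8 * sizeof(nodemask), 0))
		perror("mbind");
	madvise(p, len, MADV_HUGEPAGE);
	memset(p, 0, len);	/* hugepage faults consult the vma policy */
	return 0;
}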

Fully revert 89c83fb539f9 so that hugepage allocation policy is fully 
restored and there is no race between alloc_hugepage_direct_gfpmask() and 
alloc_pages_vma().

Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/gfp.h | 12 ++++++++----
 mm/huge_memory.c    | 27 +++++++++++++--------------
 mm/mempolicy.c      | 32 +++++++++++++++++++++++++++++---
 mm/shmem.c          |  2 +-
 4 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -510,18 +510,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node);
+			int node, bool hugepage);
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
+	alloc_pages(gfp_mask, order)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,30 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
 	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | __GFP_THISNODE |
-		       (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 
 	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 
 	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-						  __GFP_KSWAPD_RECLAIM);
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM :
+					__GFP_KSWAPD_RECLAIM);
 
 	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+		return GFP_TRANSHUGE_LIGHT |
+		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
 
-	return gfp_mask;
+	return GFP_TRANSHUGE_LIGHT;
 }
 
 /* Caller must hold page table lock. */
@@ -724,8 +724,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1295,9 +1295,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
-				haddr, numa_node_id());
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1116,8 +1116,8 @@ static struct page *new_page(struct page *page, unsigned long start)
 	} else if (PageTransHuge(page)) {
 		struct page *thp;
 
-		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
-				address, numa_node_id());
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 HPAGE_PMD_ORDER);
 		if (!thp)
 			return NULL;
 		prep_transhuge_page(thp);
@@ -2011,6 +2011,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  * 	@vma:  Pointer to VMA or NULL if not available.
  *	@addr: Virtual Address of the allocation. Must be inside the VMA.
  *	@node: Which node to prefer for allocation (modulo policy).
+ *	@hugepage: for hugepages try only the preferred node if possible
  *
  * 	This function allocates a page from the kernel page pool and applies
  *	a NUMA policy associated with the VMA or the current process.
@@ -2021,7 +2022,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node)
+		unsigned long addr, int node, bool hugepage)
 {
 	struct mempolicy *pol;
 	struct page *page;
@@ -2039,6 +2040,31 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+		int hpage_node = node;
+
+		/*
+		 * For hugepage allocation and non-interleave policy which
+		 * allows the current node (or other explicitly preferred
+		 * node) we only try to allocate from the current/preferred
+		 * node and don't fall back to other nodes, as the cost of
+		 * remote accesses would likely offset THP benefits.
+		 *
+		 * If the policy is interleave, or does not allow the current
+		 * node in its nodemask, we allocate the standard way.
+		 */
+		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+			hpage_node = pol->v.preferred_node;
+
+		nmask = policy_nodemask(gfp, pol);
+		if (!nmask || node_isset(hpage_node, *nmask)) {
+			mpol_cond_put(pol);
+			page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_THISNODE, order);
+			goto out;
+		}
+	}
+
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
diff --git a/mm/shmem.c b/mm/shmem.c
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1439,7 +1439,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);


* Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
  2018-12-06 22:00   ` [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" David Rientjes
@ 2018-12-07  8:05     ` Michal Hocko
  2018-12-07 23:05       ` David Rientjes
  2018-12-07 14:41     ` Vlastimil Babka
  1 sibling, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2018-12-07  8:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Andrea Arcangeli, mgorman, Vlastimil Babka,
	ying.huang, s.priebe, Linux List Kernel Mailing, alex.williamson,
	lkp, kirill, Andrew Morton, zi.yan

On Thu 06-12-18 14:00:20, David Rientjes wrote:
> This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> 
> There are a couple of issues with 89c83fb539f9 independent of its partial 
> revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> allocations"):
> 
> Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> policy_node().

Could you share a test case?

> Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> different policy for hugepage allocations, which were allocated through 
> alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> process's node is in the set of allowed nodes, allocate with 
> __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> __GFP_THISNODE instead).

Why is it wrong to fallback to an explicitly configured mbind mask?

> This was changed for shmem_alloc_hugepage() to 
> allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in 
> mm/mempolicy.c which is functionally different behavior and removes the 
> requirement to only allocate hugepages locally.

Also why should new_page behave any differently for THP from any other
pages? There is no fallback allocation for those and so the mbind could
easily fail. So what is the actual rationale?

> The latter should have been reverted as part of 2f0799a0ffc0 as well.
> 
> Fully revert 89c83fb539f9 so that hugepage allocation policy is fully 
> restored and there is no race between alloc_hugepage_direct_gfpmask() and 
> alloc_pages_vma().
> 
> Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
> Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
> Signed-off-by: David Rientjes <rientjes@google.com>

-- 
Michal Hocko
SUSE Labs


* Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
  2018-12-06 22:00   ` [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" David Rientjes
  2018-12-07  8:05     ` Michal Hocko
@ 2018-12-07 14:41     ` Vlastimil Babka
  2018-12-07 22:27       ` David Rientjes
  1 sibling, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2018-12-07 14:41 UTC (permalink / raw)
  To: David Rientjes, Linus Torvalds
  Cc: Andrea Arcangeli, mgorman, mhocko, ying.huang, s.priebe,
	Linux List Kernel Mailing, alex.williamson, lkp, kirill,
	Andrew Morton, zi.yan

On 12/6/18 11:00 PM, David Rientjes wrote:
> This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> 
> There are a couple of issues with 89c83fb539f9 independent of its partial 
> revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> allocations"):
> 
> Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> policy_node().

AFAICS 2f0799a0ffc0 removed the policy check in
alloc_hugepage_direct_gfpmask() completely, so it's not racy and the
warning will always trigger for a MPOL_BIND policy right now?


* Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
  2018-12-07 14:41     ` Vlastimil Babka
@ 2018-12-07 22:27       ` David Rientjes
  2018-12-07 22:33         ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: David Rientjes @ 2018-12-07 22:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, Andrea Arcangeli, mgorman, mhocko, ying.huang,
	s.priebe, Linux List Kernel Mailing, alex.williamson, lkp,
	kirill, Andrew Morton, zi.yan

On Fri, 7 Dec 2018, Vlastimil Babka wrote:

> > This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> > 
> > There are a couple of issues with 89c83fb539f9 independent of its partial 
> > revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> > allocations"):
> > 
> > Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> > alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> > alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> > an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> > the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> > policy_node().
> 
> AFAICS 2f0799a0ffc0 removed the policy check in
> alloc_hugepage_direct_gfpmask() completely, so it's not racy and the
> warning will always trigger for a MPOL_BIND policy right now?
> 

Yup, looks like you hit it on the head.  This revert should have been done 
alongside 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
allocations"), I didn't appreciate how invasive the consolidation patch 
was.

I noticed the race in 89c83fb539f9 ("mm, thp: consolidate THP gfp handling 
into alloc_hugepage_direct_gfpmask") that is fixed by the revert, but as 
you noted it didn't clean up the second part, which is the balancing act for 
gfp flags between alloc_hugepage_direct_gfpmask() and alloc_pages_vma().  
Syzbot found this to trigger the WARN_ON_ONCE() you mention.

So we certainly need this patch for 4.20 as a follow-up to 2f0799a0ffc0.  
It's likely better to regroup and discuss the NUMA aspects of all thp 
allocations separately while keeping 4.20 stable.


* Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
  2018-12-07 22:27       ` David Rientjes
@ 2018-12-07 22:33         ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2018-12-07 22:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, Andrea Arcangeli, mgorman, Michal Hocko,
	ying.huang, s.priebe, Linux List Kernel Mailing, alex.williamson,
	lkp, kirill, Andrew Morton, zi.yan

On Fri, Dec 7, 2018 at 2:27 PM David Rientjes <rientjes@google.com> wrote:
>
> I noticed the race in 89c83fb539f9 ("mm, thp: consolidate THP gfp handling
> into alloc_hugepage_direct_gfpmask") that is fixed by the revert, but as
> you noted it didn't cleanup the second part which is the balancing act for
> gfp flags between alloc_hugepage_direct_gfpmask() and alloc_pages_vma().
> Syzbot found this to trigger the WARN_ON_ONCE() you mention.

Can you rewrite the commit message to explain this better rather than
the misleading "race" description?

                 Linus


* Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
  2018-12-07  8:05     ` Michal Hocko
@ 2018-12-07 23:05       ` David Rientjes
  2018-12-10  8:34         ` Michal Hocko
  2018-12-10 13:28         ` Kirill A. Shutemov
  0 siblings, 2 replies; 11+ messages in thread
From: David Rientjes @ 2018-12-07 23:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Andrea Arcangeli, mgorman, Vlastimil Babka,
	ying.huang, s.priebe, Linux List Kernel Mailing, alex.williamson,
	lkp, kirill, Andrew Morton, zi.yan

On Fri, 7 Dec 2018, Michal Hocko wrote:

> > This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> > 
> > There are a couple of issues with 89c83fb539f9 independent of its partial 
> > revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> > allocations"):
> > 
> > Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> > alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> > alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> > an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> > the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> > policy_node().
> 
> Could you share a test case?
> 

Sorry, as Vlastimil pointed out, this race does not exist anymore since 
commit 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations"), 
which removed the restructuring of alloc_hugepage_direct_gfpmask().  It 
existed prior to that commit for shared vma policies that could modify the 
policy between alloc_hugepage_direct_gfpmask() and alloc_pages_vma() if 
the policy switched to MPOL_BIND in that window.

> > Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> > different policy for hugepage allocations, which were allocated through 
> > alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> > process's node is in the set of allowed nodes, allocate with 
> > __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> > __GFP_THISNODE instead).
> 
> Why is it wrong to fallback to an explicitly configured mbind mask?
> 

The new_page() case is similar to the shmem_alloc_hugepage() case.  Prior 
to 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
alloc_hugepage_direct_gfpmask"), shmem_alloc_hugepage() did 
alloc_pages_vma() with hugepage == true, which effected a different 
allocation policy: if the node current is running on is allowed by the 
policy, use __GFP_THISNODE (considering ac5b2c18911ff is reverted, which 
it is in Linus's tree).

After 89c83fb539f9, we lose that and can fallback to remote memory.  Since 
the discussion is on-going wrt the NUMA aspects of hugepage allocations, 
it's better to have a stable 4.20 tree while that is being worked out and 
likely deserves separate patches for both new_page() and 
shmem_alloc_hugepage().  For the latter specifically, I assume it would be 
nice to get an Acked-by by Kirill who implemented shmem_alloc_hugepage() 
with hugepage == true back in 4.8 that also had the __GFP_THISNODE 
behavior before the allocation policy is suddenly changed.


* Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
  2018-12-07 23:05       ` David Rientjes
@ 2018-12-10  8:34         ` Michal Hocko
  2018-12-10 13:28         ` Kirill A. Shutemov
  1 sibling, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2018-12-10  8:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Andrea Arcangeli, mgorman, Vlastimil Babka,
	ying.huang, s.priebe, Linux List Kernel Mailing, alex.williamson,
	lkp, kirill, Andrew Morton, zi.yan

On Fri 07-12-18 15:05:28, David Rientjes wrote:
> On Fri, 7 Dec 2018, Michal Hocko wrote:
[...]
> > > Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> > > different policy for hugepage allocations, which were allocated through 
> > > alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> > > process's node is in the set of allowed nodes, allocate with 
> > > __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> > > __GFP_THISNODE instead).
> > 
> > Why is it wrong to fallback to an explicitly configured mbind mask?
> > 
> 
> The new_page() case is similar to the shmem_alloc_hugepage() case.  Prior 
> to 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
> alloc_hugepage_direct_gfpmask"), shmem_alloc_hugepage() did 
> alloc_pages_vma() with hugepage == true, which effected a different 
> allocation policy: if the node current is running on is allowed by the 
> policy, use __GFP_THISNODE (considering ac5b2c18911ff is reverted, which 
> it is in Linus's tree).
> 
> After 89c83fb539f9, we lose that and can fallback to remote memory.  Since 
> the discussion is on-going wrt the NUMA aspects of hugepage allocations, 
> it's better to have a stable 4.20 tree while that is being worked out and 
> likely deserves separate patches for both new_page() and 
> shmem_alloc_hugepage().  For the latter specifically, I assume it would be 
> nice to get an Acked-by by Kirill who implemented shmem_alloc_hugepage() 
> with hugepage == true back in 4.8 that also had the __GFP_THISNODE 
> behavior before the allocation policy is suddenly changed.

This should be a part of the changelog.

-- 
Michal Hocko
SUSE Labs


* Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
  2018-12-07 23:05       ` David Rientjes
  2018-12-10  8:34         ` Michal Hocko
@ 2018-12-10 13:28         ` Kirill A. Shutemov
  1 sibling, 0 replies; 11+ messages in thread
From: Kirill A. Shutemov @ 2018-12-10 13:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Linus Torvalds, Andrea Arcangeli, mgorman,
	Vlastimil Babka, ying.huang, s.priebe, Linux List Kernel Mailing,
	alex.williamson, lkp, Andrew Morton, zi.yan

On Fri, Dec 07, 2018 at 03:05:28PM -0800, David Rientjes wrote:
> > > Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> > > different policy for hugepage allocations, which were allocated through 
> > > alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> > > process's node is in the set of allowed nodes, allocate with 
> > > __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> > > __GFP_THISNODE instead).
> > 
> > Why is it wrong to fallback to an explicitly configured mbind mask?
> > 
> 
> The new_page() case is similar to the shmem_alloc_hugepage() case.  Prior 
> to 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
> alloc_hugepage_direct_gfpmask"), shmem_alloc_hugepage() did 
> alloc_pages_vma() with hugepage == true, which effected a different 
> allocation policy: if the node current is running on is allowed by the 
> policy, use __GFP_THISNODE (considering ac5b2c18911ff is reverted, which 
> it is in Linus's tree).
> 
> After 89c83fb539f9, we lose that and can fallback to remote memory.  Since 
> the discussion is on-going wrt the NUMA aspects of hugepage allocations, 
> it's better to have a stable 4.20 tree while that is being worked out and 
> likely deserves separate patches for both new_page() and 
> shmem_alloc_hugepage().  For the latter specifically, I assume it would be 
> nice to get an Acked-by by Kirill who implemented shmem_alloc_hugepage() 
> with hugepage == true back in 4.8 that also had the __GFP_THISNODE 
> behavior before the allocation policy is suddenly changed.

I do not have much experience with page_alloc/compaction/reclaim paths and
I don't feel that my opinion should have much weight here. Do not gate it
on me.

(I do follow the discussion, but I don't have anything meaningful to
contribute so far.)

-- 
 Kirill A. Shutemov

