On 4 Oct 2018, at 16:17, David Rientjes wrote:

> On Wed, 26 Sep 2018, Kirill A. Shutemov wrote:
>
>> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c3bc7e9c9a2a..c0bcede31930 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>   *		  available
>>>   * never: never stall for any thp allocation
>>>   */
>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>> +	gfp_t this_node = 0;
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +	struct mempolicy *pol;
>>> +	/*
>>> +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
>>> +	 * specified, to express a general desire to stay on the current
>>> +	 * node for optimistic allocation attempts. If the defrag mode
>>> +	 * and/or madvise hint requires the direct reclaim then we prefer
>>> +	 * to fallback to other node rather than node reclaim because that
>>> +	 * can lead to excessive reclaim even though there is free memory
>>> +	 * on other nodes. We expect that NUMA preferences are specified
>>> +	 * by memory policies.
>>> +	 */
>>> +	pol = get_vma_policy(vma, addr);
>>> +	if (pol->mode != MPOL_BIND)
>>> +		this_node = __GFP_THISNODE;
>>> +	mpol_cond_put(pol);
>>> +#endif
>>
>> I'm not very good with NUMA policies. Could you explain in more detail how
>> the code above is equivalent to the code below?
>>
>
> It breaks mbind() because new_page() is now using numa_node_id() to
> allocate migration targets instead of using the mempolicy. I'm not
> sure that this patch was tested for mbind().

I do not see mbind() being broken.

With both patches applied, I ran "numactl -N 0 memhog -r1 4096m membind 1"
and saw that all pages were allocated on Node 1, not on Node 0, which is
the node returned by numa_node_id().

From the source code: in alloc_pages_vma(), the nodemask is generated from
the memory policy (i.e. the mbind policy in the case above), and it contains
only the nodes specified by mbind(). Then __alloc_pages_nodemask() only uses
zones from that nodemask, so the numa_node_id() return value is ignored in
the actual page allocation whenever an mbind policy is applied.

Let me know if I missed anything.

--
Best Regards
Yan Zi
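
For reference, the relevant path, simplified from my reading of
mm/mempolicy.c (this is a paraphrase, so the exact lines may differ
between kernel versions), looks roughly like this:

struct page *
alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
		unsigned long addr, int node, bool hugepage)
{
	struct mempolicy *pol = get_vma_policy(vma, addr);
	nodemask_t *nmask;
	int preferred_nid;
	struct page *page;

	/* special-casing of MPOL_INTERLEAVE and THP omitted here */

	/* for MPOL_BIND this returns the nodemask set up by mbind() */
	nmask = policy_nodemask(gfp, pol);

	/* "node" is only a hint, e.g. numa_node_id() from new_page() */
	preferred_nid = policy_node(gfp, pol, node);

	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
	mpol_cond_put(pol);

	return page;
}

static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
	/* lower zones don't get a nodemask applied for MPOL_BIND */
	if (unlikely(policy->mode == MPOL_BIND) &&
			apply_policy_zone(policy, gfp_zone(gfp)) &&
			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
		return &policy->v.nodes;

	return NULL;
}

Since __alloc_pages_nodemask() only walks zones on the nodes in nmask,
the preferred_nid derived from numa_node_id() can only choose where to
start within the bound nodes; it cannot make the allocation escape the
mbind() nodemask.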