[01/10] mm: control memory placement by nodemask for two tier main memory

Message ID 1553316275-21985-2-git-send-email-yang.shi@linux.alibaba.com
State Superseded
Series
  • Another Approach to Use PMEM as NUMA Node

Commit Message

Yang Shi March 23, 2019, 4:44 a.m. UTC
When running applications on a machine that exposes NVDIMM as a NUMA
node, memory allocations may end up on the NVDIMM node.  This may result
in silent performance degradation and regressions due to the different
hardware properties.

A DRAM-first policy should be obeyed to prevent surprising regressions.
Any non-DRAM nodes should be excluded from default allocation.  Use a
nodemask to control the memory placement.  Introduce def_alloc_nodemask,
which has only DRAM nodes set.  Any non-DRAM allocation should be
specified explicitly by NUMA policy.

In the future we may be able to extract the memory characteristics from
HMAT or another source to build up the default allocation nodemask.
For the time being, however, just distinguish DRAM and PMEM (non-DRAM)
nodes by the SRAT flag.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 arch/x86/mm/numa.c     |  1 +
 drivers/acpi/numa.c    |  8 ++++++++
 include/linux/mmzone.h |  3 +++
 mm/page_alloc.c        | 18 ++++++++++++++++--
 4 files changed, 28 insertions(+), 2 deletions(-)
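
The mask construction described in the commit message can be sketched in
userspace as follows. This is only an illustration, not the kernel code: the
sketch_ names, the 64-node limit, and the plain uint64_t bitmask (standing in
for nodemask_t) are assumptions made for the example. The flag value mirrors
bit 2 of the SRAT memory affinity flags, which marks non-volatile memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit for this sketch; the kernel uses MAX_NUMNODES. */
#define SKETCH_MAX_NODES 64

/* Bit 2 of the SRAT memory affinity flags marks non-volatile memory. */
#define SKETCH_SRAT_MEM_NON_VOLATILE (1u << 2)

/* Simplified stand-in for one SRAT memory affinity entry. */
struct sketch_mem_affinity {
	unsigned int node;  /* NUMA node the memory range belongs to */
	uint32_t flags;     /* SRAT memory affinity flags */
};

/*
 * Build a default allocation mask: a node's bit is set if at least one
 * of its memory ranges is not flagged non-volatile.
 */
uint64_t sketch_build_def_alloc_mask(const struct sketch_mem_affinity *ma,
				     size_t nr_entries)
{
	uint64_t mask = 0;

	for (size_t i = 0; i < nr_entries; i++) {
		assert(ma[i].node < SKETCH_MAX_NODES);
		if (!(ma[i].flags & SKETCH_SRAT_MEM_NON_VOLATILE))
			mask |= UINT64_C(1) << ma[i].node;
	}
	return mask;
}
```

With two volatile nodes (0 and 1) and one node flagged non-volatile (2), the
resulting mask has only bits 0 and 1 set, so node 2 would need an explicit
NUMA policy to be used.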

Comments

Dan Williams March 23, 2019, 5:21 p.m. UTC | #1
On Fri, Mar 22, 2019 at 9:45 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> When running applications on the machine with NVDIMM as NUMA node, the
> memory allocation may end up on NVDIMM node.  This may result in silent
> performance degradation and regression due to the difference of hardware
> property.
>
> DRAM first should be obeyed to prevent from surprising regression.  Any
> non-DRAM nodes should be excluded from default allocation.  Use nodemask
> to control the memory placement.  Introduce def_alloc_nodemask which has
> DRAM nodes set only.  Any non-DRAM allocation should be specified by
> NUMA policy explicitly.
>
> In the future we may be able to extract the memory charasteristics from
> HMAT or other source to build up the default allocation nodemask.
> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
> for the time being.
>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  arch/x86/mm/numa.c     |  1 +
>  drivers/acpi/numa.c    |  8 ++++++++
>  include/linux/mmzone.h |  3 +++
>  mm/page_alloc.c        | 18 ++++++++++++++++--
>  4 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index dfb6c4d..d9e0ca4 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
>         nodes_clear(numa_nodes_parsed);
>         nodes_clear(node_possible_map);
>         nodes_clear(node_online_map);
> +       nodes_clear(def_alloc_nodemask);
>         memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>         WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
>                                   MAX_NUMNODES));
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 867f6e3..79dfedf 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>                 goto out_err_bad_srat;
>         }
>
> +       /*
> +        * Non volatile memory is excluded from zonelist by default.
> +        * Only regular DRAM nodes are set in default allocation node
> +        * mask.
> +        */
> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
> +               node_set(node, def_alloc_nodemask);

Hmm, no, I don't think we should do this. Especially considering
current generation NVDIMMs are energy backed DRAM there is no
performance difference that should be assumed by the non-volatile
flag.

Why isn't default SLIT distance sufficient for ensuring a DRAM-first
default policy?

Yang Shi March 25, 2019, 7:28 p.m. UTC | #2
On 3/23/19 10:21 AM, Dan Williams wrote:
> On Fri, Mar 22, 2019 at 9:45 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>> When running applications on the machine with NVDIMM as NUMA node, the
>> memory allocation may end up on NVDIMM node.  This may result in silent
>> performance degradation and regression due to the difference of hardware
>> property.
>>
>> DRAM first should be obeyed to prevent from surprising regression.  Any
>> non-DRAM nodes should be excluded from default allocation.  Use nodemask
>> to control the memory placement.  Introduce def_alloc_nodemask which has
>> DRAM nodes set only.  Any non-DRAM allocation should be specified by
>> NUMA policy explicitly.
>>
>> In the future we may be able to extract the memory charasteristics from
>> HMAT or other source to build up the default allocation nodemask.
>> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
>> for the time being.
>>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   arch/x86/mm/numa.c     |  1 +
>>   drivers/acpi/numa.c    |  8 ++++++++
>>   include/linux/mmzone.h |  3 +++
>>   mm/page_alloc.c        | 18 ++++++++++++++++--
>>   4 files changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index dfb6c4d..d9e0ca4 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
>>          nodes_clear(numa_nodes_parsed);
>>          nodes_clear(node_possible_map);
>>          nodes_clear(node_online_map);
>> +       nodes_clear(def_alloc_nodemask);
>>          memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>>          WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
>>                                    MAX_NUMNODES));
>> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
>> index 867f6e3..79dfedf 100644
>> --- a/drivers/acpi/numa.c
>> +++ b/drivers/acpi/numa.c
>> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>                  goto out_err_bad_srat;
>>          }
>>
>> +       /*
>> +        * Non volatile memory is excluded from zonelist by default.
>> +        * Only regular DRAM nodes are set in default allocation node
>> +        * mask.
>> +        */
>> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
>> +               node_set(node, def_alloc_nodemask);
> Hmm, no, I don't think we should do this. Especially considering
> current generation NVDIMMs are energy backed DRAM there is no
> performance difference that should be assumed by the non-volatile
> flag.

Actually, here I would like to initialize a node mask for default 
allocation. Memory allocation should not end up on any nodes excluded by 
this node mask unless they are specified by mempolicy.

We may have a few different ways or criteria to initialize the node 
mask. For example, we can read from HMAT (when HMAT is ready in the 
future), and we definitely could have non-DRAM nodes set if they have no 
performance difference (I suppose you mean NVDIMM-F or HBM).

As long as there are different tiers, distinguished by performance, for 
main memory, IMHO, there should be a defined default allocation node 
mask to control the memory placement no matter where we get the information.

But for now we don't have such information ready for this use, so 
the SRAT flag might be an option.

>
> Why isn't default SLIT distance sufficient for ensuring a DRAM-first
> default policy?

"DRAM-first" may sound ambiguous, actually I mean "DRAM only by 
default". SLIT can only tell us which node is local and which node is 
remote; it can't tell us the performance difference.

Thanks,
Yang
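
The difference between "DRAM first" and "DRAM only by default" discussed in
the reply above can be illustrated with a small userspace sketch. Distances
only order the fallback candidates; a mask is what excludes a node. The
function name, the 64-node limit, and the distance values in the usage note
are all made up for illustration and do not come from a real SLIT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill order[] with the ids of allowed nodes, sorted by ascending
 * distance from the local node. dist[n] is the distance from the local
 * node to node n. Nodes whose bit is clear in `allowed` are skipped.
 * Returns the number of entries written to order[].
 */
size_t sketch_fallback_order(const uint8_t *dist, size_t nr_nodes,
			     uint64_t allowed, unsigned int *order)
{
	size_t count = 0;

	assert(nr_nodes <= 64);
	for (size_t n = 0; n < nr_nodes; n++) {
		size_t pos = count;

		if (!(allowed & (UINT64_C(1) << n)))
			continue;

		/* Insertion sort; ties keep ascending node-id order. */
		while (pos > 0 && dist[order[pos - 1]] > dist[n]) {
			order[pos] = order[pos - 1];
			pos--;
		}
		order[pos] = (unsigned int)n;
		count++;
	}
	return count;
}
```

For a hypothetical distance row {10, 21, 17}, where node 2 is a PMEM node
closer than the remote DRAM node 1, allowing all nodes gives the order
0, 2, 1: the PMEM node is still a fallback target and even comes before
remote DRAM. Restricting `allowed` to nodes 0 and 1 gives 0, 1.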

Dan Williams March 25, 2019, 11:18 p.m. UTC | #3
On Mon, Mar 25, 2019 at 12:28 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>
>
> On 3/23/19 10:21 AM, Dan Williams wrote:
> > On Fri, Mar 22, 2019 at 9:45 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> >> When running applications on the machine with NVDIMM as NUMA node, the
> >> memory allocation may end up on NVDIMM node.  This may result in silent
> >> performance degradation and regression due to the difference of hardware
> >> property.
> >>
> >> DRAM first should be obeyed to prevent from surprising regression.  Any
> >> non-DRAM nodes should be excluded from default allocation.  Use nodemask
> >> to control the memory placement.  Introduce def_alloc_nodemask which has
> >> DRAM nodes set only.  Any non-DRAM allocation should be specified by
> >> NUMA policy explicitly.
> >>
> >> In the future we may be able to extract the memory charasteristics from
> >> HMAT or other source to build up the default allocation nodemask.
> >> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
> >> for the time being.
> >>
> >> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> >> ---
> >>   arch/x86/mm/numa.c     |  1 +
> >>   drivers/acpi/numa.c    |  8 ++++++++
> >>   include/linux/mmzone.h |  3 +++
> >>   mm/page_alloc.c        | 18 ++++++++++++++++--
> >>   4 files changed, 28 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >> index dfb6c4d..d9e0ca4 100644
> >> --- a/arch/x86/mm/numa.c
> >> +++ b/arch/x86/mm/numa.c
> >> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
> >>          nodes_clear(numa_nodes_parsed);
> >>          nodes_clear(node_possible_map);
> >>          nodes_clear(node_online_map);
> >> +       nodes_clear(def_alloc_nodemask);
> >>          memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> >>          WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> >>                                    MAX_NUMNODES));
> >> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> >> index 867f6e3..79dfedf 100644
> >> --- a/drivers/acpi/numa.c
> >> +++ b/drivers/acpi/numa.c
> >> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
> >>                  goto out_err_bad_srat;
> >>          }
> >>
> >> +       /*
> >> +        * Non volatile memory is excluded from zonelist by default.
> >> +        * Only regular DRAM nodes are set in default allocation node
> >> +        * mask.
> >> +        */
> >> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
> >> +               node_set(node, def_alloc_nodemask);
> > Hmm, no, I don't think we should do this. Especially considering
> > current generation NVDIMMs are energy backed DRAM there is no
> > performance difference that should be assumed by the non-volatile
> > flag.
>
> Actually, here I would like to initialize a node mask for default
> allocation. Memory allocation should not end up on any nodes excluded by
> this node mask unless they are specified by mempolicy.
>
> We may have a few different ways or criteria to initialize the node
> mask, for example, we can read from HMAT (when HMAT is ready in the
> future), and we definitely could have non-DRAM nodes set if they have no
> performance difference (I'm supposed you mean NVDIMM-F  or HBM).
>
> As long as there are different tiers, distinguished by performance, for
> main memory, IMHO, there should be a defined default allocation node
> mask to control the memory placement no matter where we get the information.

I understand the intent, but I don't think the kernel should have such
a hardline policy by default. However, it would be a worthwhile
mechanism and policy to consider for the dax-hotplug userspace
tooling. I.e. arrange for a given device-dax instance to be onlined,
but set the policy to require explicit opt-in by numa binding for it
to be an allocation / migration option.

I added Vishal to the cc who is looking into such policy tooling.

> But, for now we haven't had such information ready for such use yet, so
> the SRAT flag might be a choice.
>
> >
> > Why isn't default SLIT distance sufficient for ensuring a DRAM-first
> > default policy?
>
> "DRAM-first" may sound ambiguous, actually I mean "DRAM only by
> default". SLIT should just can tell us what node is local what node is
> remote, but can't tell us the performance difference.

I think it's a useful semantic, but let's leave the selection of that
policy to an explicit userspace decision.

Yang Shi March 25, 2019, 11:36 p.m. UTC | #4
On 3/25/19 4:18 PM, Dan Williams wrote:
> On Mon, Mar 25, 2019 at 12:28 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>
>>
>> On 3/23/19 10:21 AM, Dan Williams wrote:
>>> On Fri, Mar 22, 2019 at 9:45 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>> When running applications on the machine with NVDIMM as NUMA node, the
>>>> memory allocation may end up on NVDIMM node.  This may result in silent
>>>> performance degradation and regression due to the difference of hardware
>>>> property.
>>>>
>>>> DRAM first should be obeyed to prevent from surprising regression.  Any
>>>> non-DRAM nodes should be excluded from default allocation.  Use nodemask
>>>> to control the memory placement.  Introduce def_alloc_nodemask which has
>>>> DRAM nodes set only.  Any non-DRAM allocation should be specified by
>>>> NUMA policy explicitly.
>>>>
>>>> In the future we may be able to extract the memory charasteristics from
>>>> HMAT or other source to build up the default allocation nodemask.
>>>> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
>>>> for the time being.
>>>>
>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>> ---
>>>>    arch/x86/mm/numa.c     |  1 +
>>>>    drivers/acpi/numa.c    |  8 ++++++++
>>>>    include/linux/mmzone.h |  3 +++
>>>>    mm/page_alloc.c        | 18 ++++++++++++++++--
>>>>    4 files changed, 28 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>>>> index dfb6c4d..d9e0ca4 100644
>>>> --- a/arch/x86/mm/numa.c
>>>> +++ b/arch/x86/mm/numa.c
>>>> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
>>>>           nodes_clear(numa_nodes_parsed);
>>>>           nodes_clear(node_possible_map);
>>>>           nodes_clear(node_online_map);
>>>> +       nodes_clear(def_alloc_nodemask);
>>>>           memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>>>>           WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
>>>>                                     MAX_NUMNODES));
>>>> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
>>>> index 867f6e3..79dfedf 100644
>>>> --- a/drivers/acpi/numa.c
>>>> +++ b/drivers/acpi/numa.c
>>>> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>>>                   goto out_err_bad_srat;
>>>>           }
>>>>
>>>> +       /*
>>>> +        * Non volatile memory is excluded from zonelist by default.
>>>> +        * Only regular DRAM nodes are set in default allocation node
>>>> +        * mask.
>>>> +        */
>>>> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
>>>> +               node_set(node, def_alloc_nodemask);
>>> Hmm, no, I don't think we should do this. Especially considering
>>> current generation NVDIMMs are energy backed DRAM there is no
>>> performance difference that should be assumed by the non-volatile
>>> flag.
>> Actually, here I would like to initialize a node mask for default
>> allocation. Memory allocation should not end up on any nodes excluded by
>> this node mask unless they are specified by mempolicy.
>>
>> We may have a few different ways or criteria to initialize the node
>> mask, for example, we can read from HMAT (when HMAT is ready in the
>> future), and we definitely could have non-DRAM nodes set if they have no
>> performance difference (I'm supposed you mean NVDIMM-F  or HBM).
>>
>> As long as there are different tiers, distinguished by performance, for
>> main memory, IMHO, there should be a defined default allocation node
>> mask to control the memory placement no matter where we get the information.
> I understand the intent, but I don't think the kernel should have such
> a hardline policy by default. However, it would be worthwhile
> mechanism and policy to consider for the dax-hotplug userspace
> tooling. I.e. arrange for a given device-dax instance to be onlined,
> but set the policy to require explicit opt-in by numa binding for it
> to be an allocation / migration option.
>
> I added Vishal to the cc who is looking into such policy tooling.

We may assume the nodes returned by cpu_to_node() would be treated as 
the default allocation nodes from the kernel point of view.

So, the below code may do the job:

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d9e0ca4..a3e07da 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -764,6 +764,8 @@ void __init init_cpu_to_node(void)
                         init_memory_less_node(node);

                 numa_set_node(cpu, node);
+
+              node_set(node, def_alloc_nodemask);
         }
  }

Actually, the kernel should not care too much what kind of memory is 
used; any node could be used for memory allocation. But it may be better 
to restrict allocation to some default nodes due to the performance 
disparity, for example, default to regular DRAM only. Here the kernel 
assumes the nodes associated with CPUs are DRAM nodes.

The node mask could be exported to userspace to be overridden by a 
userspace tool, sysfs, or the kernel command line. But I still think the 
kernel does need a default node mask.

>
>> But, for now we haven't had such information ready for such use yet, so
>> the SRAT flag might be a choice.
>>
>>> Why isn't default SLIT distance sufficient for ensuring a DRAM-first
>>> default policy?
>> "DRAM-first" may sound ambiguous, actually I mean "DRAM only by
>> default". SLIT should just can tell us what node is local what node is
>> remote, but can't tell us the performance difference.
> I think it's a useful semantic, but let's leave the selection of that
> policy to an explicit userspace decision.

Yes, mempolicy is a kind of userspace decision too.

Thanks,
Yang
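
The cpu_to_node() idea from the message above can be sketched in userspace
as follows. It assumes, as the message does, that nodes with CPUs attached
are DRAM nodes. The function name and the uint64_t mask are hypothetical
stand-ins; a negative entry models a CPU with no node mapping (the kernel
uses NUMA_NO_NODE, which is -1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Build a default allocation mask from a CPU-to-node table: every node
 * that has at least one CPU gets its bit set. CPU-less nodes (such as a
 * PMEM-only node) never appear in the table and so stay excluded.
 */
uint64_t sketch_mask_from_cpu_nodes(const int *cpu_to_node, size_t nr_cpus)
{
	uint64_t mask = 0;

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		int node = cpu_to_node[cpu];

		if (node < 0)
			continue;  /* CPU without a node mapping */
		assert(node < 64);
		mask |= UINT64_C(1) << node;
	}
	return mask;
}
```

With four CPUs mapped to nodes {0, 0, 1, 1} on a machine that also has a
CPU-less node 2, the mask has only bits 0 and 1 set.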

Dan Williams March 25, 2019, 11:42 p.m. UTC | #5
On Mon, Mar 25, 2019 at 4:36 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
[..]
> >>> Hmm, no, I don't think we should do this. Especially considering
> >>> current generation NVDIMMs are energy backed DRAM there is no
> >>> performance difference that should be assumed by the non-volatile
> >>> flag.
> >> Actually, here I would like to initialize a node mask for default
> >> allocation. Memory allocation should not end up on any nodes excluded by
> >> this node mask unless they are specified by mempolicy.
> >>
> >> We may have a few different ways or criteria to initialize the node
> >> mask, for example, we can read from HMAT (when HMAT is ready in the
> >> future), and we definitely could have non-DRAM nodes set if they have no
> >> performance difference (I'm supposed you mean NVDIMM-F  or HBM).
> >>
> >> As long as there are different tiers, distinguished by performance, for
> >> main memory, IMHO, there should be a defined default allocation node
> >> mask to control the memory placement no matter where we get the information.
> > I understand the intent, but I don't think the kernel should have such
> > a hardline policy by default. However, it would be worthwhile
> > mechanism and policy to consider for the dax-hotplug userspace
> > tooling. I.e. arrange for a given device-dax instance to be onlined,
> > but set the policy to require explicit opt-in by numa binding for it
> > to be an allocation / migration option.
> >
> > I added Vishal to the cc who is looking into such policy tooling.
>
> We may assume the nodes returned by cpu_to_node() would be treated as
> the default allocation nodes from the kernel point of view.
>
> So, the below code may do the job:
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index d9e0ca4..a3e07da 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -764,6 +764,8 @@ void __init init_cpu_to_node(void)
>                          init_memory_less_node(node);
>
>                  numa_set_node(cpu, node);
> +
> +              node_set(node, def_alloc_nodemask);
>          }
>   }
>
> Actually, the kernel should not care too much what kind of memory is
> used, any node could be used for memory allocation. But it may be better
> to restrict to some default nodes due to the performance disparity, for
> example, default to regular DRAM only. Here kernel assumes the nodes
> associated with CPUs would be DRAM nodes.
>
> The node mask could be exported to user space to be override by
> userspace tool or sysfs or kernel commandline.

Yes, sounds good.

> But I still think kernel does need a default node mask.

Yes, just depends on what is less surprising for userspace to contend
with by default. I would expect an unaware userspace to be confused by
the fact that the system has free memory, but it's unusable. So,
usable by default sounds like the safer option, and forbidding default
usage of given nodes in special cases is an administrator / application
opt-in mechanism.

Patch

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index dfb6c4d..d9e0ca4 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -626,6 +626,7 @@  static int __init numa_init(int (*init_func)(void))
 	nodes_clear(numa_nodes_parsed);
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
+	nodes_clear(def_alloc_nodemask);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
 				  MAX_NUMNODES));
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 867f6e3..79dfedf 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -296,6 +296,14 @@  void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
 		goto out_err_bad_srat;
 	}
 
+	/*
+	 * Non volatile memory is excluded from zonelist by default.
+	 * Only regular DRAM nodes are set in default allocation node
+	 * mask.
+	 */
+	if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
+		node_set(node, def_alloc_nodemask);
+
 	node_set(node, numa_nodes_parsed);
 
 	pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741..063c3b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -927,6 +927,9 @@  extern int numa_zonelist_order_handler(struct ctl_table *, int,
 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
 extern struct zone *next_zone(struct zone *zone);
 
+/* Regular DRAM nodes */
+extern nodemask_t def_alloc_nodemask;
+
 /**
  * for_each_online_pgdat - helper macro to iterate over all online nodes
  * @pgdat - pointer to a pg_data_t variable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03fcf73..68ad8c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -134,6 +134,8 @@  struct pcpu_drain {
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
+nodemask_t def_alloc_nodemask __read_mostly;
+
 /*
  * A cached value of the page's pageblock's migratetype, used when the page is
  * put on a pcplist. Used to avoid the pageblock migratetype lookup when
@@ -4524,12 +4526,24 @@  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 {
 	ac->high_zoneidx = gfp_zone(gfp_mask);
 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
-	ac->nodemask = nodemask;
 	ac->migratetype = gfpflags_to_migratetype(gfp_mask);
 
+	if (!nodemask) {
+		/* Non-DRAM node is preferred node */
+		if (!node_isset(preferred_nid, def_alloc_nodemask))
+			/*
+			 * With MPOL_PREFERRED policy, once PMEM is allowed,
+			 * can fall back to all memory nodes.
+			 */
+			ac->nodemask = &node_states[N_MEMORY];
+		else
+			ac->nodemask = &def_alloc_nodemask;
+	} else
+		ac->nodemask = nodemask;
+
 	if (cpusets_enabled()) {
 		*alloc_mask |= __GFP_HARDWALL;
-		if (!ac->nodemask)
+		if (nodes_equal(*ac->nodemask, def_alloc_nodemask))
 			ac->nodemask = &cpuset_current_mems_allowed;
 		else
 			*alloc_flags |= ALLOC_CPUSET;
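
The nodemask selection added to prepare_alloc_pages() above can be
summarized by the following userspace sketch. It covers only the first
if/else in that hunk and leaves out the cpuset handling. The function name
and the uint64_t masks are hypothetical stand-ins for nodemask_t,
def_alloc_nodemask, and node_states[N_MEMORY].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the effective node mask for an allocation:
 *  - an explicit policy mask always wins;
 *  - with no policy mask and a preferred node outside the default mask
 *    (for example a PMEM node), all memory nodes are allowed;
 *  - otherwise only the default (DRAM) nodes are allowed.
 */
uint64_t sketch_effective_mask(const uint64_t *policy_mask,
			       unsigned int preferred_nid,
			       uint64_t def_alloc_mask,
			       uint64_t all_memory_mask)
{
	assert(preferred_nid < 64);

	if (policy_mask)
		return *policy_mask;
	if (!(def_alloc_mask & (UINT64_C(1) << preferred_nid)))
		return all_memory_mask;
	return def_alloc_mask;
}
```

With DRAM nodes 0 and 1 (mask 0x3) and memory nodes 0 to 2 (mask 0x7), an
allocation with no policy mask that prefers node 0 is limited to 0x3, one
that prefers node 2 may use 0x7, and an explicit policy mask of 0x4 is
returned unchanged.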