linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: fix invalid node in alloc_migrate_target()
@ 2016-03-25  6:56 Xishi Qiu
  2016-03-25 19:22 ` Andrew Morton
  2016-03-29 12:25 ` Vlastimil Babka
  0 siblings, 2 replies; 14+ messages in thread
From: Xishi Qiu @ 2016-03-25  6:56 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Joonsoo Kim, David Rientjes,
	Naoya Horiguchi, Laura Abbott, zhuhui, wangxq10
  Cc: Xishi Qiu, Linux MM, LKML

It is incorrect to use next_node() to find a target node; it may
return MAX_NUMNODES or an otherwise invalid node. This can lead to a
crash in the buddy system allocator.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
---
 mm/page_isolation.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 92c4c36..31555b6 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
 	 * now as a simple work-around, we use the next node for destination.
 	 */
 	if (PageHuge(page)) {
-		nodemask_t src = nodemask_of_node(page_to_nid(page));
-		nodemask_t dst;
-		nodes_complement(dst, src);
+		int node = next_online_node(page_to_nid(page));
+		if (node == MAX_NUMNODES)
+			node = first_online_node;
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
-					    next_node(page_to_nid(page), dst));
+					    node);
 	}
 
 	if (PageHighMem(page))
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-25  6:56 [PATCH] mm: fix invalid node in alloc_migrate_target() Xishi Qiu
@ 2016-03-25 19:22 ` Andrew Morton
  2016-03-26  5:31   ` Xishi Qiu
                     ` (2 more replies)
  2016-03-29 12:25 ` Vlastimil Babka
  1 sibling, 3 replies; 14+ messages in thread
From: Andrew Morton @ 2016-03-25 19:22 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Vlastimil Babka, Joonsoo Kim, David Rientjes, Naoya Horiguchi,
	Laura Abbott, zhuhui, wangxq10, Linux MM, LKML

On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <qiuxishi@huawei.com> wrote:

> It is incorrect to use next_node to find a target node, it will
> return MAX_NUMNODES or invalid node. This will lead to crash in
> buddy system allocation.
> 
> ...
>
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>  	 * now as a simple work-around, we use the next node for destination.
>  	 */
>  	if (PageHuge(page)) {
> -		nodemask_t src = nodemask_of_node(page_to_nid(page));
> -		nodemask_t dst;
> -		nodes_complement(dst, src);
> +		int node = next_online_node(page_to_nid(page));
> +		if (node == MAX_NUMNODES)
> +			node = first_online_node;
>  		return alloc_huge_page_node(page_hstate(compound_head(page)),
> -					    next_node(page_to_nid(page), dst));
> +					    node);
>  	}
>  
>  	if (PageHighMem(page))

Indeed.  Can you tell us more about the circumstances under which the
kernel will crash?  I need to decide which kernel version(s) need the
patch, but the changelog doesn't contain the info needed to make that
decision (it should).



next_node() isn't a very useful interface, really.  Just about every
caller does this:


	node = next_node(node, XXX);
	if (node == MAX_NUMNODES)
		node = first_node(XXX);

so how about we write a function which does that, and stop open-coding
the same thing everywhere?

And I think your fix could then use such a function:

	int node = that_new_function(page_to_nid(page), node_online_map);



Also, mm/mempolicy.c:offset_il_node() worries me:

	do {
		nid = next_node(nid, pol->v.nodes);
		c++;
	} while (c <= target);

Can't `nid' hit MAX_NUMNODES?


And can someone please explain mem_cgroup_select_victim_node() to me? 
How can we hit the "node = numa_node_id()" path?  Only if
memcg->scan_nodes is empty?  Is that even valid?  The comment seems to
have little to do with the code.

mpol_rebind_nodemask() is similar.



Something like this?


From: Andrew Morton <akpm@linux-foundation.org>
Subject: include/linux/nodemask.h: create next_node_in() helper

Lots of code does

	node = next_node(node, XXX);
	if (node == MAX_NUMNODES)
		node = first_node(XXX);

so create next_node_in() to do this and use it in various places.

Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Laura Abbott" <lauraa@codeaurora.org>
Cc: Hui Zhu <zhuhui@xiaomi.com>
Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/nodemask.h |   18 +++++++++++++++++-
 kernel/cpuset.c          |    8 +-------
 mm/hugetlb.c             |    4 +---
 mm/memcontrol.c          |    4 +---
 mm/mempolicy.c           |    8 ++------
 mm/page_isolation.c      |    9 +++------
 mm/slab.c                |   13 +++----------
 7 files changed, 28 insertions(+), 36 deletions(-)

diff -puN include/linux/nodemask.h~include-linux-nodemaskh-create-next_node_in-helper include/linux/nodemask.h
--- a/include/linux/nodemask.h~include-linux-nodemaskh-create-next_node_in-helper
+++ a/include/linux/nodemask.h
@@ -43,8 +43,10 @@
  *
  * int first_node(mask)			Number lowest set bit, or MAX_NUMNODES
  * int next_node(node, mask)		Next node past 'node', or MAX_NUMNODES
+ * int next_node_in(node, mask)		Next node past 'node', or wrap to first,
+ *					or MAX_NUMNODES
  * int first_unset_node(mask)		First node not set in mask, or 
- *					MAX_NUMNODES.
+ *					MAX_NUMNODES
  *
  * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
  * NODE_MASK_ALL			Initializer - all bits set
@@ -259,6 +261,20 @@ static inline int __next_node(int n, con
 	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
 }
 
+/*
+ * Find the next present node in src, starting after node n, wrapping around to
+ * the first node in src if needed.  Returns MAX_NUMNODES if src is empty.
+ */
+#define next_node_in(n, src) __next_node_in((n), &(src))
+static inline int __next_node_in(int node, const nodemask_t *srcp)
+{
+	int ret = __next_node(node, srcp);
+
+	if (ret == MAX_NUMNODES)
+		ret = __first_node(srcp);
+	return ret;
+}
+
 static inline void init_nodemask_of_node(nodemask_t *mask, int node)
 {
 	nodes_clear(*mask);
diff -puN kernel/cpuset.c~include-linux-nodemaskh-create-next_node_in-helper kernel/cpuset.c
--- a/kernel/cpuset.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/kernel/cpuset.c
@@ -2591,13 +2591,7 @@ int __cpuset_node_allowed(int node, gfp_
 
 static int cpuset_spread_node(int *rotor)
 {
-	int node;
-
-	node = next_node(*rotor, current->mems_allowed);
-	if (node == MAX_NUMNODES)
-		node = first_node(current->mems_allowed);
-	*rotor = node;
-	return node;
+	return *rotor = next_node_in(*rotor, current->mems_allowed);
 }
 
 int cpuset_mem_spread_node(void)
diff -puN mm/hugetlb.c~include-linux-nodemaskh-create-next_node_in-helper mm/hugetlb.c
--- a/mm/hugetlb.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/hugetlb.c
@@ -937,9 +937,7 @@ err:
  */
 static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
 {
-	nid = next_node(nid, *nodes_allowed);
-	if (nid == MAX_NUMNODES)
-		nid = first_node(*nodes_allowed);
+	nid = next_node_in(nid, *nodes_allowed);
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 
 	return nid;
diff -puN mm/memcontrol.c~include-linux-nodemaskh-create-next_node_in-helper mm/memcontrol.c
--- a/mm/memcontrol.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/memcontrol.c
@@ -1388,9 +1388,7 @@ int mem_cgroup_select_victim_node(struct
 	mem_cgroup_may_update_nodemask(memcg);
 	node = memcg->last_scanned_node;
 
-	node = next_node(node, memcg->scan_nodes);
-	if (node == MAX_NUMNODES)
-		node = first_node(memcg->scan_nodes);
+	node = next_node_in(node, memcg->scan_nodes);
 	/*
 	 * We call this when we hit limit, not when pages are added to LRU.
 	 * No LRU may hold pages because all pages are UNEVICTABLE or
diff -puN mm/mempolicy.c~include-linux-nodemaskh-create-next_node_in-helper mm/mempolicy.c
--- a/mm/mempolicy.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/mempolicy.c
@@ -347,9 +347,7 @@ static void mpol_rebind_nodemask(struct
 		BUG();
 
 	if (!node_isset(current->il_next, tmp)) {
-		current->il_next = next_node(current->il_next, tmp);
-		if (current->il_next >= MAX_NUMNODES)
-			current->il_next = first_node(tmp);
+		current->il_next = next_node_in(current->il_next, tmp);
 		if (current->il_next >= MAX_NUMNODES)
 			current->il_next = numa_node_id();
 	}
@@ -1709,9 +1707,7 @@ static unsigned interleave_nodes(struct
 	struct task_struct *me = current;
 
 	nid = me->il_next;
-	next = next_node(nid, policy->v.nodes);
-	if (next >= MAX_NUMNODES)
-		next = first_node(policy->v.nodes);
+	next = next_node_in(nid, policy->v.nodes);
 	if (next < MAX_NUMNODES)
 		me->il_next = next;
 	return nid;
diff -puN mm/page_isolation.c~include-linux-nodemaskh-create-next_node_in-helper mm/page_isolation.c
--- a/mm/page_isolation.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/page_isolation.c
@@ -288,13 +288,10 @@ struct page *alloc_migrate_target(struct
 	 * accordance with memory policy of the user process if possible. For
 	 * now as a simple work-around, we use the next node for destination.
 	 */
-	if (PageHuge(page)) {
-		int node = next_online_node(page_to_nid(page));
-		if (node == MAX_NUMNODES)
-			node = first_online_node;
+	if (PageHuge(page))
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
-					    node);
-	}
+					    next_node_in(page_to_nid(page),
+							 node_online_map));
 
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
diff -puN mm/slab.c~include-linux-nodemaskh-create-next_node_in-helper mm/slab.c
--- a/mm/slab.c~include-linux-nodemaskh-create-next_node_in-helper
+++ a/mm/slab.c
@@ -519,22 +519,15 @@ static DEFINE_PER_CPU(unsigned long, sla
 
 static void init_reap_node(int cpu)
 {
-	int node;
-
-	node = next_node(cpu_to_mem(cpu), node_online_map);
-	if (node == MAX_NUMNODES)
-		node = first_node(node_online_map);
-
-	per_cpu(slab_reap_node, cpu) = node;
+	per_cpu(slab_reap_node, cpu) = next_node_in(cpu_to_mem(cpu),
+						    node_online_map);
 }
 
 static void next_reap_node(void)
 {
 	int node = __this_cpu_read(slab_reap_node);
 
-	node = next_node(node, node_online_map);
-	if (unlikely(node >= MAX_NUMNODES))
-		node = first_node(node_online_map);
+	node = next_node_in(node, node_online_map);
 	__this_cpu_write(slab_reap_node, node);
 }
 
_


* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-25 19:22 ` Andrew Morton
@ 2016-03-26  5:31   ` Xishi Qiu
  2016-03-29  9:52     ` Vlastimil Babka
  2016-03-29 13:06   ` Vlastimil Babka
  2016-03-29 15:52   ` Michal Hocko
  2 siblings, 1 reply; 14+ messages in thread
From: Xishi Qiu @ 2016-03-26  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Joonsoo Kim, David Rientjes, Naoya Horiguchi,
	Laura Abbott, zhuhui, wangxq10, Linux MM, LKML

On 2016/3/26 3:22, Andrew Morton wrote:

> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <qiuxishi@huawei.com> wrote:
> 
>> It is incorrect to use next_node to find a target node, it will
>> return MAX_NUMNODES or invalid node. This will lead to crash in
>> buddy system allocation.
>>
>> ...
>>
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>>  	 * now as a simple work-around, we use the next node for destination.
>>  	 */
>>  	if (PageHuge(page)) {
>> -		nodemask_t src = nodemask_of_node(page_to_nid(page));
>> -		nodemask_t dst;
>> -		nodes_complement(dst, src);
>> +		int node = next_online_node(page_to_nid(page));
>> +		if (node == MAX_NUMNODES)
>> +			node = first_online_node;
>>  		return alloc_huge_page_node(page_hstate(compound_head(page)),
>> -					    next_node(page_to_nid(page), dst));
>> +					    node);
>>  	}
>>  
>>  	if (PageHighMem(page))
> 
> Indeed.  Can you tell us more about this circumstances under which the
> kernel will crash?  I need to decide which kernel version(s) need the
> patch, but the changelog doesn't contain the info needed to make this
> decision (it should).
> 

Hi Andrew,

I read the code in v4.4 and found that the following path may trigger the bug.

alloc_migrate_target()
	alloc_huge_page_node()  // the node may be offline or MAX_NUMNODES
		__alloc_buddy_huge_page_no_mpol()
			__alloc_buddy_huge_page()
				__hugetlb_alloc_buddy_huge_page()
					alloc_pages_node()
						__alloc_pages_node()
							VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
							VM_WARN_ON(!node_online(nid));

Thanks,
Xishi Qiu


* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-26  5:31   ` Xishi Qiu
@ 2016-03-29  9:52     ` Vlastimil Babka
  2016-03-29 10:06       ` Vlastimil Babka
  2016-03-29 10:37       ` Xishi Qiu
  0 siblings, 2 replies; 14+ messages in thread
From: Vlastimil Babka @ 2016-03-29  9:52 UTC (permalink / raw)
  To: Xishi Qiu, Andrew Morton
  Cc: Joonsoo Kim, David Rientjes, Naoya Horiguchi, Laura Abbott,
	zhuhui, wangxq10, Linux MM, LKML, Dave Hansen

On 03/26/2016 06:31 AM, Xishi Qiu wrote:
> On 2016/3/26 3:22, Andrew Morton wrote:
>
>> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <qiuxishi@huawei.com> wrote:
>>
>>> It is incorrect to use next_node to find a target node, it will
>>> return MAX_NUMNODES or invalid node. This will lead to crash in
>>> buddy system allocation.
>>>
>>> ...
>>>
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>>>   	 * now as a simple work-around, we use the next node for destination.
>>>   	 */
>>>   	if (PageHuge(page)) {
>>> -		nodemask_t src = nodemask_of_node(page_to_nid(page));
>>> -		nodemask_t dst;
>>> -		nodes_complement(dst, src);
>>> +		int node = next_online_node(page_to_nid(page));
>>> +		if (node == MAX_NUMNODES)
>>> +			node = first_online_node;
>>>   		return alloc_huge_page_node(page_hstate(compound_head(page)),
>>> -					    next_node(page_to_nid(page), dst));
>>> +					    node);
>>>   	}
>>>
>>>   	if (PageHighMem(page))
>>
>> Indeed.  Can you tell us more about this circumstances under which the
>> kernel will crash?  I need to decide which kernel version(s) need the
>> patch, but the changelog doesn't contain the info needed to make this
>> decision (it should).
>>
>
> Hi Andrew,
>
> I read the code v4.4, and find the following path maybe trigger the bug.
>
> alloc_migrate_target()
> 	alloc_huge_page_node()  // the node may be offline or MAX_NUMNODES
> 		__alloc_buddy_huge_page_no_mpol()
> 			__alloc_buddy_huge_page()
> 				__hugetlb_alloc_buddy_huge_page()

The code in this function seems to come from commit 099730d67417d ("mm, 
hugetlb: use memory policy when available") by Dave Hansen (adding to 
CC), which was indeed merged in 4.4-rc1.

However, alloc_pages_node() is only called in the block guarded by:

if (!IS_ENABLED(CONFIG_NUMA) || !vma) {

The rather weird "!IS_ENABLED(CONFIG_NUMA)" part comes from the 
immediate followup commit e0ec90ee7e6f ("mm, hugetlbfs: optimize when 
NUMA=n").

So I doubt the code path here can actually happen. But it's fragile and 
confusing nevertheless.

> 					alloc_pages_node()
> 						__alloc_pages_node()
> 							VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> 							VM_WARN_ON(!node_online(nid));
>
> Thanks,
> Xishi Qiu
>


* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-29  9:52     ` Vlastimil Babka
@ 2016-03-29 10:06       ` Vlastimil Babka
  2016-03-29 10:37       ` Xishi Qiu
  1 sibling, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2016-03-29 10:06 UTC (permalink / raw)
  To: Xishi Qiu, Andrew Morton
  Cc: Joonsoo Kim, David Rientjes, Naoya Horiguchi, Laura Abbott,
	zhuhui, wangxq10, Linux MM, LKML, Dave Hansen

On 03/29/2016 11:52 AM, Vlastimil Babka wrote:
> On 03/26/2016 06:31 AM, Xishi Qiu wrote:
>> On 2016/3/26 3:22, Andrew Morton wrote:
>>
>>> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <qiuxishi@huawei.com> wrote:
>>>
>>>> It is incorrect to use next_node to find a target node, it will
>>>> return MAX_NUMNODES or invalid node. This will lead to crash in
>>>> buddy system allocation.
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_isolation.c
>>>> +++ b/mm/page_isolation.c
>>>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>>>>    	 * now as a simple work-around, we use the next node for destination.
>>>>    	 */
>>>>    	if (PageHuge(page)) {
>>>> -		nodemask_t src = nodemask_of_node(page_to_nid(page));
>>>> -		nodemask_t dst;
>>>> -		nodes_complement(dst, src);
>>>> +		int node = next_online_node(page_to_nid(page));
>>>> +		if (node == MAX_NUMNODES)
>>>> +			node = first_online_node;
>>>>    		return alloc_huge_page_node(page_hstate(compound_head(page)),
>>>> -					    next_node(page_to_nid(page), dst));
>>>> +					    node);
>>>>    	}
>>>>
>>>>    	if (PageHighMem(page))
>>>
>>> Indeed.  Can you tell us more about this circumstances under which the
>>> kernel will crash?  I need to decide which kernel version(s) need the
>>> patch, but the changelog doesn't contain the info needed to make this
>>> decision (it should).
>>>
>>
>> Hi Andrew,
>>
>> I read the code v4.4, and find the following path maybe trigger the bug.
>>
>> alloc_migrate_target()
>> 	alloc_huge_page_node()  // the node may be offline or MAX_NUMNODES
>> 		__alloc_buddy_huge_page_no_mpol()
>> 			__alloc_buddy_huge_page()
>> 				__hugetlb_alloc_buddy_huge_page()
>
> The code in this functions seems to come from 099730d67417d ("mm,
> hugetlb: use memory policy when available") by Dave Hansen (adding to
> CC), which was indeed merged in 4.4-rc1.
>
> However, alloc_pages_node() is only called in the block guarded by:
>
> if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
>
> The rather weird "!IS_ENABLED(CONFIG_NUMA)" part comes from immediate
> followup commit e0ec90ee7e6f ("mm, hugetlbfs: optimize when NUMA=n")
>
> So I doubt the code path here can actually happen. But it's fragile and
> confusing nevertheless.

Ah, so there's actually a dangerous path:
alloc_huge_page_node()
     dequeue_huge_page_node()
         list_for_each_entry(page, &h->hugepage_freelists[nid], lru)

hugepage_freelists is MAX_NUMNODES-sized, so when nid is MAX_NUMNODES 
we access past the end of it.

However, look closer at how nid is obtained in alloc_migrate_target():

nodemask_t src = nodemask_of_node(page_to_nid(page));
nodemask_t dst;
nodes_complement(dst, src);

nid = next_node(page_to_nid(page), dst)

for nid to be MAX_NUMNODES, the original page has to be on node 
MAX_NUMNODES-1; otherwise the complement means we hit the very next 
bit, which is always set.

It's actually a rather obfuscated way of doing:

nid = page_to_nid(page) + 1;

In that case the problem is in commit c8721bbbdd36 ("mm: memory-hotplug: 
enable memory hotplug to handle hugepage") from 3.12 and will likely 
affect only people that tune down MAX_NUMNODES to match their machine.

>> 					alloc_pages_node()
>> 						__alloc_pages_node()
>> 							VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> 							VM_WARN_ON(!node_online(nid));
>>
>> Thanks,
>> Xishi Qiu
>>
>


* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-29  9:52     ` Vlastimil Babka
  2016-03-29 10:06       ` Vlastimil Babka
@ 2016-03-29 10:37       ` Xishi Qiu
  2016-03-29 12:21         ` Vlastimil Babka
  1 sibling, 1 reply; 14+ messages in thread
From: Xishi Qiu @ 2016-03-29 10:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Joonsoo Kim, David Rientjes, Naoya Horiguchi,
	Laura Abbott, zhuhui, wangxq10, Linux MM, LKML, Dave Hansen

On 2016/3/29 17:52, Vlastimil Babka wrote:

> On 03/26/2016 06:31 AM, Xishi Qiu wrote:
>> On 2016/3/26 3:22, Andrew Morton wrote:
>>
>>> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <qiuxishi@huawei.com> wrote:
>>>
>>>> It is incorrect to use next_node to find a target node, it will
>>>> return MAX_NUMNODES or invalid node. This will lead to crash in
>>>> buddy system allocation.
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_isolation.c
>>>> +++ b/mm/page_isolation.c
>>>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>>>>        * now as a simple work-around, we use the next node for destination.
>>>>        */
>>>>       if (PageHuge(page)) {
>>>> -        nodemask_t src = nodemask_of_node(page_to_nid(page));
>>>> -        nodemask_t dst;
>>>> -        nodes_complement(dst, src);
>>>> +        int node = next_online_node(page_to_nid(page));
>>>> +        if (node == MAX_NUMNODES)
>>>> +            node = first_online_node;
>>>>           return alloc_huge_page_node(page_hstate(compound_head(page)),
>>>> -                        next_node(page_to_nid(page), dst));
>>>> +                        node);
>>>>       }
>>>>
>>>>       if (PageHighMem(page))
>>>
>>> Indeed.  Can you tell us more about this circumstances under which the
>>> kernel will crash?  I need to decide which kernel version(s) need the
>>> patch, but the changelog doesn't contain the info needed to make this
>>> decision (it should).
>>>
>>
>> Hi Andrew,
>>
>> I read the code v4.4, and find the following path maybe trigger the bug.
>>
>> alloc_migrate_target()
>>     alloc_huge_page_node()  // the node may be offline or MAX_NUMNODES
>>         __alloc_buddy_huge_page_no_mpol()
>>             __alloc_buddy_huge_page()
>>                 __hugetlb_alloc_buddy_huge_page()
> 
> The code in this functions seems to come from 099730d67417d ("mm, hugetlb: use memory policy when available") by Dave Hansen (adding to CC), which was indeed merged in 4.4-rc1.
> 
> However, alloc_pages_node() is only called in the block guarded by:
> 
> if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
> 
> The rather weird "!IS_ENABLED(CONFIG_NUMA)" part comes from immediate followup commit e0ec90ee7e6f ("mm, hugetlbfs: optimize when NUMA=n")
> 
> So I doubt the code path here can actually happen. But it's fragile and confusing nevertheless.
> 

Hi Vlastimil

__alloc_buddy_huge_page(h, NULL, addr, nid); // so the vma is NULL

Thanks,
Xishi Qiu

>>                     alloc_pages_node()
>>                         __alloc_pages_node()
>>                             VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>>                             VM_WARN_ON(!node_online(nid));
>>
>> Thanks,
>> Xishi Qiu
>>
> 
> 
> .
> 


* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-29 10:37       ` Xishi Qiu
@ 2016-03-29 12:21         ` Vlastimil Babka
  0 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2016-03-29 12:21 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Andrew Morton, Joonsoo Kim, David Rientjes, Naoya Horiguchi,
	Laura Abbott, zhuhui, wangxq10, Linux MM, LKML, Dave Hansen

On 03/29/2016 12:37 PM, Xishi Qiu wrote:
> On 2016/3/29 17:52, Vlastimil Babka wrote:
>> The code in this functions seems to come from 099730d67417d ("mm, hugetlb: use memory policy when available") by Dave Hansen (adding to CC), which was indeed merged in 4.4-rc1.
>>
>> However, alloc_pages_node() is only called in the block guarded by:
>>
>> if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
>>
>> The rather weird "!IS_ENABLED(CONFIG_NUMA)" part comes from immediate followup commit e0ec90ee7e6f ("mm, hugetlbfs: optimize when NUMA=n")
>>
>> So I doubt the code path here can actually happen. But it's fragile and confusing nevertheless.
>>
>
> Hi Vlastimil
>
> __alloc_buddy_huge_page(h, NULL, addr, nid); // so the vma is NULL

Hm, that's true, I got lost in the logic, thanks.
But the problem with dequeue_huge_page_node() is IMHO also real, and 
older, so we should fix 3.12+.


* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-25  6:56 [PATCH] mm: fix invalid node in alloc_migrate_target() Xishi Qiu
  2016-03-25 19:22 ` Andrew Morton
@ 2016-03-29 12:25 ` Vlastimil Babka
  2016-03-30  1:13   ` Naoya Horiguchi
  1 sibling, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2016-03-29 12:25 UTC (permalink / raw)
  To: Xishi Qiu, Andrew Morton, Joonsoo Kim, David Rientjes,
	Naoya Horiguchi, Laura Abbott, zhuhui, wangxq10
  Cc: Linux MM, LKML

On 03/25/2016 07:56 AM, Xishi Qiu wrote:
> It is incorrect to use next_node to find a target node, it will
> return MAX_NUMNODES or invalid node. This will lead to crash in
> buddy system allocation.

One possible place of crash is:
alloc_huge_page_node()
     dequeue_huge_page_node()
         [accesses h->hugepage_freelists[nid] with size MAX_NUMNODES]

> Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>

Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to 
handle hugepage")
Cc: stable
Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/page_isolation.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 92c4c36..31555b6 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>   	 * now as a simple work-around, we use the next node for destination.
>   	 */
>   	if (PageHuge(page)) {
> -		nodemask_t src = nodemask_of_node(page_to_nid(page));
> -		nodemask_t dst;
> -		nodes_complement(dst, src);
> +		int node = next_online_node(page_to_nid(page));
> +		if (node == MAX_NUMNODES)
> +			node = first_online_node;
>   		return alloc_huge_page_node(page_hstate(compound_head(page)),
> -					    next_node(page_to_nid(page), dst));
> +					    node);
>   	}
>
>   	if (PageHighMem(page))
>


* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-25 19:22 ` Andrew Morton
  2016-03-26  5:31   ` Xishi Qiu
@ 2016-03-29 13:06   ` Vlastimil Babka
  2016-03-31 13:13     ` Vlastimil Babka
  2016-03-29 15:52   ` Michal Hocko
  2 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2016-03-29 13:06 UTC (permalink / raw)
  To: Andrew Morton, Xishi Qiu
  Cc: Joonsoo Kim, David Rientjes, Naoya Horiguchi, Laura Abbott,
	zhuhui, wangxq10, Linux MM, LKML

On 03/25/2016 08:22 PM, Andrew Morton wrote:
> On Fri, 25 Mar 2016 14:56:04 +0800 Xishi Qiu <qiuxishi@huawei.com> wrote:
>
>> It is incorrect to use next_node to find a target node, it will
>> return MAX_NUMNODES or invalid node. This will lead to crash in
>> buddy system allocation.
>>
>> ...
>>
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>>   	 * now as a simple work-around, we use the next node for destination.
>>   	 */
>>   	if (PageHuge(page)) {
>> -		nodemask_t src = nodemask_of_node(page_to_nid(page));
>> -		nodemask_t dst;
>> -		nodes_complement(dst, src);
>> +		int node = next_online_node(page_to_nid(page));
>> +		if (node == MAX_NUMNODES)
>> +			node = first_online_node;
>>   		return alloc_huge_page_node(page_hstate(compound_head(page)),
>> -					    next_node(page_to_nid(page), dst));
>> +					    node);
>>   	}
>>
>>   	if (PageHighMem(page))
>
> Indeed.  Can you tell us more about this circumstances under which the
> kernel will crash?  I need to decide which kernel version(s) need the
> patch, but the changelog doesn't contain the info needed to make this
> decision (it should).
>
>
>
> next_node() isn't a very useful interface, really.  Just about every
> caller does this:
>
>
> 	node = next_node(node, XXX);
> 	if (node == MAX_NUMNODES)
> 		node = first_node(XXX);
>
> so how about we write a function which does that, and stop open-coding
> the same thing everywhere?

Good idea.

> And I think your fix could then use such a function:
>
> 	int node = that_new_function(page_to_nid(page), node_online_map);
>
>
>
> Also, mm/mempolicy.c:offset_il_node() worries me:
>
> 	do {
> 		nid = next_node(nid, pol->v.nodes);
> 		c++;
> 	} while (c <= target);
>
> Can't `nid' hit MAX_NUMNODES?

AFAICS it can. interleave_nid() uses this, and the nid then reaches 
e.g. node_zonelist(), where it's used for NODE_DATA(nid). That's quite 
scary. It also predates git. Why don't we see crashes, or KASAN finding this?

>
> And can someone please explain mem_cgroup_select_victim_node() to me?
> How can we hit the "node = numa_node_id()" path?  Only if
> memcg->scan_nodes is empty?  is that even valid?  The comment seems to
> have not much to do with the code?

I understand the comment to say that it's valid for scan_nodes to be 
empty, and it lists reasons why that can happen (in somewhat broken 
language). Note that I didn't verify these reasons:
- we call this when hitting the memcg limit, not when adding pages to 
LRU; adding to LRU would mean scan_nodes contains the given LRU's node
- adding to the unevictable LRU means the node is not added to 
scan_nodes (probably because scanning an unevictable LRU would be 
useless)
- for other reasons (which?) the memcg might have pages not on any LRU, 
and be so small that there are no other pages that would be on an LRU

> mpol_rebind_nodemask() is similar.
>
>
>
> Something like this?
>
>
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: include/linux/nodemask.h: create next_node_in() helper
>
> Lots of code does
>
> 	node = next_node(node, XXX);
> 	if (node == MAX_NUMNODES)
> 		node = first_node(XXX);
>
> so create next_node_in() to do this and use it in various places.
>
> Cc: Xishi Qiu <qiuxishi@huawei.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Patch doesn't address offset_il_node() which is good, because if it's 
indeed buggy, it's serious and needs a non-cleanup patch.


* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-25 19:22 ` Andrew Morton
  2016-03-26  5:31   ` Xishi Qiu
  2016-03-29 13:06   ` Vlastimil Babka
@ 2016-03-29 15:52   ` Michal Hocko
  2 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2016-03-29 15:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Xishi Qiu, Vlastimil Babka, Joonsoo Kim, David Rientjes,
	Naoya Horiguchi, Laura Abbott, zhuhui, wangxq10, Linux MM, LKML

On Fri 25-03-16 12:22:37, Andrew Morton wrote:
[...]
> And can someone please explain mem_cgroup_select_victim_node() to me? 
> How can we hit the "node = numa_node_id()" path?  Only if
> memcg->scan_nodes is empty?

Yes, this seems to be the primary motivation.
mem_cgroup_may_update_nodemask might have seen all the pages on the
unevictable LRU the last time it checked.

> is that even valid?

I suspect it is really rare, but it seems possible.

> The comment seems to have not much to do with the code?

I guess the comment tries to say that this code path is triggered when
we charge a page, which happens _before_ it is added to the LRU list,
so last_scanned_node might contain stale data. Would something like
the following be clearer?
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17a847c96618..cff095318950 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1390,10 +1390,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 
 	node = next_node_in(node, memcg->scan_nodes);
 	/*
-	 * We call this when we hit limit, not when pages are added to LRU.
-	 * No LRU may hold pages because all pages are UNEVICTABLE or
-	 * memcg is too small and all pages are not on LRU. In that case,
-	 * we use curret node.
+	 * mem_cgroup_may_update_nodemask might have seen no reclaimable pages
+	 * last time it really checked all the LRUs due to rate limiting.
+	 * Fallback to the current node in that case for simplicity.
 	 */
 	if (unlikely(node == MAX_NUMNODES))
 		node = numa_node_id();
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-29 12:25 ` Vlastimil Babka
@ 2016-03-30  1:13   ` Naoya Horiguchi
  0 siblings, 0 replies; 14+ messages in thread
From: Naoya Horiguchi @ 2016-03-30  1:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Xishi Qiu, Andrew Morton, Joonsoo Kim, David Rientjes,
	Laura Abbott, zhuhui, wangxq10, Linux MM, LKML

On Tue, Mar 29, 2016 at 02:25:03PM +0200, Vlastimil Babka wrote:
> On 03/25/2016 07:56 AM, Xishi Qiu wrote:
> >It is incorrect to use next_node to find a target node, it will
> >return MAX_NUMNODES or invalid node. This will lead to crash in
> >buddy system allocation.
> 
> One possible place of crash is:
> alloc_huge_page_node()
>     dequeue_huge_page_node()
>         [accesses h->hugepage_freelists[nid] with size MAX_NUMNODES]
> 
> >Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
> 
> Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle
> hugepage")
> Cc: stable
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks everyone for finding/fixing the bug!

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

> >---
> >  mm/page_isolation.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> >diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> >index 92c4c36..31555b6 100644
> >--- a/mm/page_isolation.c
> >+++ b/mm/page_isolation.c
> >@@ -289,11 +289,11 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
> >  	 * now as a simple work-around, we use the next node for destination.
> >  	 */
> >  	if (PageHuge(page)) {
> >-		nodemask_t src = nodemask_of_node(page_to_nid(page));
> >-		nodemask_t dst;
> >-		nodes_complement(dst, src);
> >+		int node = next_online_node(page_to_nid(page));
> >+		if (node == MAX_NUMNODES)
> >+			node = first_online_node;
> >  		return alloc_huge_page_node(page_hstate(compound_head(page)),
> >-					    next_node(page_to_nid(page), dst));
> >+					    node);
> >  	}
> >
> >  	if (PageHighMem(page))
> >
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-29 13:06   ` Vlastimil Babka
@ 2016-03-31 13:13     ` Vlastimil Babka
  2016-03-31 21:01       ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2016-03-31 13:13 UTC (permalink / raw)
  To: Andrew Morton, Xishi Qiu
  Cc: Joonsoo Kim, David Rientjes, Naoya Horiguchi, Laura Abbott,
	zhuhui, wangxq10, Linux MM, LKML

On 03/29/2016 03:06 PM, Vlastimil Babka wrote:
> On 03/25/2016 08:22 PM, Andrew Morton wrote:
>> Also, mm/mempolicy.c:offset_il_node() worries me:
>>
>> 	do {
>> 		nid = next_node(nid, pol->v.nodes);
>> 		c++;
>> 	} while (c <= target);
>>
>> Can't `nid' hit MAX_NUMNODES?
>
> AFAICS it can. interleave_nid() uses this and the nid is then used e.g.
> in node_zonelist() where it's used for NODE_DATA(nid). That's quite
> scary. It also predates git. Why don't we see crashes or KASAN finding this?

Ah, I see. In offset_il_node(), nid is initialized to -1, and the number 
of do-while iterations calling next_node() is up to the number of bits 
set in the pol->v.nodes bitmap, so it can't reach past the last set bit 
and return MAX_NUMNODES.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-31 13:13     ` Vlastimil Babka
@ 2016-03-31 21:01       ` Andrew Morton
  2016-04-01  8:42         ` Vlastimil Babka
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2016-03-31 21:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Xishi Qiu, Joonsoo Kim, David Rientjes, Naoya Horiguchi,
	Laura Abbott, zhuhui, wangxq10, Linux MM, LKML

On Thu, 31 Mar 2016 15:13:41 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 03/29/2016 03:06 PM, Vlastimil Babka wrote:
> > On 03/25/2016 08:22 PM, Andrew Morton wrote:
> >> Also, mm/mempolicy.c:offset_il_node() worries me:
> >>
> >> 	do {
> >> 		nid = next_node(nid, pol->v.nodes);
> >> 		c++;
> >> 	} while (c <= target);
> >>
> >> Can't `nid' hit MAX_NUMNODES?
> >
> > AFAICS it can. interleave_nid() uses this and the nid is then used e.g.
> > in node_zonelist() where it's used for NODE_DATA(nid). That's quite
> > scary. It also predates git. Why don't we see crashes or KASAN finding this?
> 
> Ah, I see. In offset_il_node(), nid is initialized to -1, and the number 
> of do-while iterations calling next_node() is up to the number of bits 
> set in the pol->v.nodes bitmap, so it can't reach past the last set bit 
> and return MAX_NUMNODES.

Gack.  offset_il_node() should be dragged out, strangled, shot then burnt.

static unsigned offset_il_node(struct mempolicy *pol,
		struct vm_area_struct *vma, unsigned long off)
{
	unsigned nnodes = nodes_weight(pol->v.nodes);
	unsigned target;
	int c;
	int nid = NUMA_NO_NODE;

	if (!nnodes)
		return numa_node_id();
	target = (unsigned int)off % nnodes;
	c = 0;
	do {
		nid = next_node(nid, pol->v.nodes);
		c++;
	} while (c <= target);
	return nid;
}

For starters it is relying upon next_node(-1, ...) behaving like
first_node().  Fair enough I guess, but that isn't very clear.

static inline int __next_node(int n, const nodemask_t *srcp)
{
	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
}

will start from node 0 when it does the n+1.

Also it is relying upon NUMA_NO_NODE having a value of -1.  That's just
grubby - this code shouldn't "know" that NUMA_NO_NODE==-1.  It would have
been better to use plain old "-1" here.


Does this look clearer and correct?

/*
 * Do static interleaving for a VMA with known offset @n.  Returns the n'th
 * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the
 * number of present nodes.
 */
static unsigned offset_il_node(struct mempolicy *pol,
			       struct vm_area_struct *vma, unsigned long n)
{
	unsigned nnodes = nodes_weight(pol->v.nodes);
	unsigned target;
	int i;
	int nid;

	if (!nnodes)
		return numa_node_id();
	target = (unsigned int)n % nnodes;
	nid = first_node(pol->v.nodes);
	for (i = 0; i < target; i++)
		nid = next_node(nid, pol->v.nodes);
	return nid;
}


From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm/mempolicy.c:offset_il_node() document and clarify

This code was pretty obscure and was relying upon obscure side-effects of
next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to -1.

Clean that all up and document the function's intent.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |   20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff -puN mm/mempolicy.c~mm-mempolicyc-offset_il_node-document-and-clarify mm/mempolicy.c
--- a/mm/mempolicy.c~mm-mempolicyc-offset_il_node-document-and-clarify
+++ a/mm/mempolicy.c
@@ -1763,23 +1763,25 @@ unsigned int mempolicy_slab_node(void)
 	}
 }
 
-/* Do static interleaving for a VMA with known offset. */
+/*
+ * Do static interleaving for a VMA with known offset @n.  Returns the n'th
+ * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the
+ * number of present nodes.
+ */
 static unsigned offset_il_node(struct mempolicy *pol,
-		struct vm_area_struct *vma, unsigned long off)
+			       struct vm_area_struct *vma, unsigned long n)
 {
 	unsigned nnodes = nodes_weight(pol->v.nodes);
 	unsigned target;
-	int c;
-	int nid = NUMA_NO_NODE;
+	int i;
+	int nid;
 
 	if (!nnodes)
 		return numa_node_id();
-	target = (unsigned int)off % nnodes;
-	c = 0;
-	do {
+	target = (unsigned int)n % nnodes;
+	nid = first_node(pol->v.nodes);
+	for (i = 0; i < target; i++)
 		nid = next_node(nid, pol->v.nodes);
-		c++;
-	} while (c <= target);
 	return nid;
 }
 
_

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm: fix invalid node in alloc_migrate_target()
  2016-03-31 21:01       ` Andrew Morton
@ 2016-04-01  8:42         ` Vlastimil Babka
  0 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2016-04-01  8:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Xishi Qiu, Joonsoo Kim, David Rientjes, Naoya Horiguchi,
	Laura Abbott, zhuhui, wangxq10, Linux MM, LKML

On 03/31/2016 11:01 PM, Andrew Morton wrote:
> On Thu, 31 Mar 2016 15:13:41 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>
>> On 03/29/2016 03:06 PM, Vlastimil Babka wrote:
>> > On 03/25/2016 08:22 PM, Andrew Morton wrote:
>> >> Also, mm/mempolicy.c:offset_il_node() worries me:
>> >>
>> >> 	do {
>> >> 		nid = next_node(nid, pol->v.nodes);
>> >> 		c++;
>> >> 	} while (c <= target);
>> >>
>> >> Can't `nid' hit MAX_NUMNODES?
>> >
>> > AFAICS it can. interleave_nid() uses this and the nid is then used e.g.
>> > in node_zonelist() where it's used for NODE_DATA(nid). That's quite
>> > scary. It also predates git. Why don't we see crashes or KASAN finding this?
>>
>> Ah, I see. In offset_il_node(), nid is initialized to -1, and the number
>> of do-while iterations calling next_node() is up to the number of bits
>> set in the pol->v.nodes bitmap, so it can't reach past the last set bit
>> and return MAX_NUMNODES.
>
> Gack.  offset_il_node() should be dragged out, strangled, shot then burnt.

Ah, but you went with the much less amusing alternative of just fixing it.

> static unsigned offset_il_node(struct mempolicy *pol,
> 		struct vm_area_struct *vma, unsigned long off)
> {
> 	unsigned nnodes = nodes_weight(pol->v.nodes);
> 	unsigned target;
> 	int c;
> 	int nid = NUMA_NO_NODE;
>
> 	if (!nnodes)
> 		return numa_node_id();
> 	target = (unsigned int)off % nnodes;
> 	c = 0;
> 	do {
> 		nid = next_node(nid, pol->v.nodes);
> 		c++;
> 	} while (c <= target);
> 	return nid;
> }
>
> For starters it is relying upon next_node(-1, ...) behaving like
> first_node().  Fair enough I guess, but that isn't very clear.
>
> static inline int __next_node(int n, const nodemask_t *srcp)
> {
> 	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
> }
>
> will start from node 0 when it does the n+1.
>
> Also it is relying upon NUMA_NO_NODE having a value of -1.  That's just
> grubby - this code shouldn't "know" that NUMA_NO_NODE==-1.  It would have
> been better to use plain old "-1" here.

Yeah, looks like a blind change of all "-1" to "NUMA_NO_NODE" happened at some point.

>
> Does this look clearer and correct?

Definitely.

> /*
>   * Do static interleaving for a VMA with known offset @n.  Returns the n'th
>   * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the
>   * number of present nodes.
>   */
> static unsigned offset_il_node(struct mempolicy *pol,
> 			       struct vm_area_struct *vma, unsigned long n)
> {
> 	unsigned nnodes = nodes_weight(pol->v.nodes);
> 	unsigned target;
> 	int i;
> 	int nid;
>
> 	if (!nnodes)
> 		return numa_node_id();
> 	target = (unsigned int)n % nnodes;
> 	nid = first_node(pol->v.nodes);
> 	for (i = 0; i < target; i++)
> 		nid = next_node(nid, pol->v.nodes);
> 	return nid;
> }
>
>
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/mempolicy.c:offset_il_node() document and clarify
>
> This code was pretty obscure and was relying upon obscure side-effects of
> next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to -1.
>
> Clean that all up and document the function's intent.
>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Xishi Qiu <qiuxishi@huawei.com>
> Cc: Joonsoo Kim <js1304@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Laura Abbott <lauraa@codeaurora.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-04-01  8:43 UTC | newest]

Thread overview: 14+ messages
2016-03-25  6:56 [PATCH] mm: fix invalid node in alloc_migrate_target() Xishi Qiu
2016-03-25 19:22 ` Andrew Morton
2016-03-26  5:31   ` Xishi Qiu
2016-03-29  9:52     ` Vlastimil Babka
2016-03-29 10:06       ` Vlastimil Babka
2016-03-29 10:37       ` Xishi Qiu
2016-03-29 12:21         ` Vlastimil Babka
2016-03-29 13:06   ` Vlastimil Babka
2016-03-31 13:13     ` Vlastimil Babka
2016-03-31 21:01       ` Andrew Morton
2016-04-01  8:42         ` Vlastimil Babka
2016-03-29 15:52   ` Michal Hocko
2016-03-29 12:25 ` Vlastimil Babka
2016-03-30  1:13   ` Naoya Horiguchi
