linux-kernel.vger.kernel.org archive mirror
* [patch] mm, thp: only collapse hugepages to nodes with affinity
@ 2014-07-15  1:09 David Rientjes
  2014-07-15  4:47 ` Dave Hansen
  2014-07-16  0:13 ` [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode David Rientjes
  0 siblings, 2 replies; 15+ messages in thread
From: David Rientjes @ 2014-07-15  1:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Bob Liu, linux-mm, linux-kernel

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target 
node") improved the previous khugepaged logic which allocated a 
transparent hugepages from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory which may 
suffer from additional access latency.  With the current policy, it is 
possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely 
if the majority are allocated from that node.

Introduce a strict requirement that pages can only be collapsed to nodes 
at or below RECLAIM_DISTANCE to ensure the access latency of the pages 
scanned does not regress.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/huge_memory.c | 54 ++++++++++++------------------------------------------
 1 file changed, 12 insertions(+), 42 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2231,34 +2231,7 @@ static void khugepaged_alloc_sleep(void)
 			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 }
 
-static int khugepaged_node_load[MAX_NUMNODES];
-
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
-{
-	static int last_khugepaged_target_node = NUMA_NO_NODE;
-	int nid, target_node = 0, max_value = 0;
-
-	/* find first node with max normal pages hit */
-	for (nid = 0; nid < MAX_NUMNODES; nid++)
-		if (khugepaged_node_load[nid] > max_value) {
-			max_value = khugepaged_node_load[nid];
-			target_node = nid;
-		}
-
-	/* do some balance if several nodes have the same hit record */
-	if (target_node <= last_khugepaged_target_node)
-		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
-				nid++)
-			if (max_value == khugepaged_node_load[nid]) {
-				target_node = nid;
-				break;
-			}
-
-	last_khugepaged_target_node = target_node;
-	return target_node;
-}
-
 static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 {
 	if (IS_ERR(*hpage)) {
@@ -2309,11 +2282,6 @@ static struct page
 	return *hpage;
 }
 #else
-static int khugepaged_find_target_node(void)
-{
-	return 0;
-}
-
 static inline struct page *alloc_hugepage(int defrag)
 {
 	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
@@ -2522,7 +2490,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	if (!pmd)
 		goto out;
 
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -2538,14 +2505,18 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			goto out_unmap;
-		/*
-		 * Record which node the original page is from and save this
-		 * information to khugepaged_node_load[].
-		 * Khupaged will allocate hugepage from the node has the max
-		 * hit record.
-		 */
-		node = page_to_nid(page);
-		khugepaged_node_load[node]++;
+		if (node == NUMA_NO_NODE) {
+			node = page_to_nid(page);
+		} else {
+			int distance = node_distance(page_to_nid(page), node);
+
+			/*
+			 * Do not migrate to memory that would not be reclaimed
+			 * from.
+			 */
+			if (distance > RECLAIM_DISTANCE)
+				goto out_unmap;
+		}
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
@@ -2561,7 +2532,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node();
 		/* collapse_huge_page will return with the mmap_sem released */
 		collapse_huge_page(mm, address, hpage, vma, node);
 	}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] mm, thp: only collapse hugepages to nodes with affinity
  2014-07-15  1:09 [patch] mm, thp: only collapse hugepages to nodes with affinity David Rientjes
@ 2014-07-15  4:47 ` Dave Hansen
  2014-07-15 23:17   ` David Rientjes
  2014-07-16  0:13 ` [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode David Rientjes
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2014-07-15  4:47 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Andrea Arcangeli, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Bob Liu, linux-mm, linux-kernel

On 07/14/2014 06:09 PM, David Rientjes wrote:
> +		if (node == NUMA_NO_NODE) {
> +			node = page_to_nid(page);
> +		} else {
> +			int distance = node_distance(page_to_nid(page), node);
> +
> +			/*
> +			 * Do not migrate to memory that would not be reclaimed
> +			 * from.
> +			 */
> +			if (distance > RECLAIM_DISTANCE)
> +				goto out_unmap;
> +		}

Isn't the reclaim behavior based on zone_reclaim_mode and not
RECLAIM_DISTANCE directly?  And isn't that reclaim behavior disabled by
default?

I think you should at least be consulting zone_reclaim_mode.
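
Roughly, the relationship between the two knobs looks like the sketch
below (a paraphrase of the allocator's behavior for illustration only,
not the exact upstream code):

/*
 * Paraphrase only: zone reclaim is consulted when zone_reclaim_mode is
 * set, and a remote node is only treated as a reclaim target if it is
 * within RECLAIM_DISTANCE of the preferred node.
 */
static bool would_zone_reclaim(int local_nid, int remote_nid)
{
	return zone_reclaim_mode &&
	       node_distance(local_nid, remote_nid) <= RECLAIM_DISTANCE;
}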

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] mm, thp: only collapse hugepages to nodes with affinity
  2014-07-15  4:47 ` Dave Hansen
@ 2014-07-15 23:17   ` David Rientjes
  0 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2014-07-15 23:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Andrea Arcangeli, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Bob Liu, linux-mm, linux-kernel

On Mon, 14 Jul 2014, Dave Hansen wrote:

> > +		if (node == NUMA_NO_NODE) {
> > +			node = page_to_nid(page);
> > +		} else {
> > +			int distance = node_distance(page_to_nid(page), node);
> > +
> > +			/*
> > +			 * Do not migrate to memory that would not be reclaimed
> > +			 * from.
> > +			 */
> > +			if (distance > RECLAIM_DISTANCE)
> > +				goto out_unmap;
> > +		}
> 
> Isn't the reclaim behavior based on zone_reclaim_mode and not
> RECLAIM_DISTANCE directly?  And isn't that reclaim behavior disabled by
> default?
> 

Seems that RECLAIM_DISTANCE has taken on a life of its own as a heuristic 
independent of zone_reclaim_mode, such as its use in creating sched 
domains, which is unrelated to reclaim.

> I think you should at least be consulting zone_reclaim_mode.
> 

Good point, and it matches what the comment is saying about whether we'd 
actually reclaim from the remote node to allocate thp on fault or not.  
I'll add it.

After this change, we'll also need to consider the behavior of thp at 
fault: when local memory is low or fragmented, is remote HPAGE_PMD_SIZE 
memory better than local PAGE_SIZE memory?  In my page fault latency 
testing on true NUMA machines, it is convincing that it is not.

This makes me believe that, somewhat similar to this patch, when we 
allocate thp memory at fault and zone_reclaim_mode is non-zero, we should 
restrict the allocation to nodes within RECLAIM_DISTANCE of 
numa_node_id() and otherwise fall back to the PAGE_SIZE fault path.
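
For illustration, a rough sketch of what that fault-path restriction
might look like (a hypothetical helper, not part of this patch; it only
assumes the existing numa_node_id()/node_distance()/nodemask helpers and
RECLAIM_DISTANCE):

/*
 * Hypothetical sketch only: build the set of nodes close enough to the
 * faulting CPU's node for a thp allocation when zone_reclaim_mode is
 * set.  A caller would fall back to the PAGE_SIZE fault path if the
 * hugepage cannot be allocated from this mask.
 */
static void thp_fault_allowed_nodes(nodemask_t *mask)
{
	int local_nid = numa_node_id();
	int nid;

	nodes_clear(*mask);
	for_each_online_node(nid)
		if (node_distance(local_nid, nid) <= RECLAIM_DISTANCE)
			node_set(nid, *mask);
}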

I've been hesitant to make that exact change, though, because it's a 
systemwide setting and I really hope to avoid a prctl() that controls 
zone reclaim for a particular process.  Perhaps the NUMA balancing work 
makes this more dependable.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-15  1:09 [patch] mm, thp: only collapse hugepages to nodes with affinity David Rientjes
  2014-07-15  4:47 ` Dave Hansen
@ 2014-07-16  0:13 ` David Rientjes
  2014-07-16  1:22   ` Bob Liu
  2014-07-16 15:38   ` Vlastimil Babka
  1 sibling, 2 replies; 15+ messages in thread
From: David Rientjes @ 2014-07-16  0:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Bob Liu, linux-mm, linux-kernel

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target 
node") improved the previous khugepaged logic which allocated a 
transparent hugepages from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory which may 
suffer from additional access latency.  With the current policy, it is 
possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely 
if the majority are allocated from that node.

When zone_reclaim_mode is enabled, it means the VM should make every attempt
to allocate locally to prevent NUMA performance degradation.  In this case,
we do not want to collapse hugepages to remote nodes that would suffer from
increased access latency.  Thus, when zone_reclaim_mode is enabled, only
allow collapsing to nodes with RECLAIM_DISTANCE or less.

There is no functional change for systems that disable zone_reclaim_mode.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v2: only change behavior for zone_reclaim_mode per Dave Hansen

 mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2234,6 +2234,26 @@ static void khugepaged_alloc_sleep(void)
 static int khugepaged_node_load[MAX_NUMNODES];
 
 #ifdef CONFIG_NUMA
+static bool khugepaged_scan_abort(int nid)
+{
+	int i;
+
+	/*
+	 * If zone_reclaim_mode is disabled, then no extra effort is made to
+	 * allocate memory locally.
+	 */
+	if (!zone_reclaim_mode)
+		return false;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (!khugepaged_node_load[i])
+			continue;
+		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+			return true;
+	}
+	return false;
+}
+
 static int khugepaged_find_target_node(void)
 {
 	static int last_khugepaged_target_node = NUMA_NO_NODE;
@@ -2309,6 +2329,11 @@ static struct page
 	return *hpage;
 }
 #else
+static bool khugepaged_scan_abort(int nid)
+{
+	return false;
+}
+
 static int khugepaged_find_target_node(void)
 {
 	return 0;
@@ -2515,6 +2540,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	unsigned long _address;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE;
+	int last_node = node;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = page_to_nid(page);
+		if (node != last_node) {
+			if (khugepaged_scan_abort(node))
+				goto out_unmap;
+			last_node = node;
+		}
 		khugepaged_node_load[node]++;
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-16  0:13 ` [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode David Rientjes
@ 2014-07-16  1:22   ` Bob Liu
  2014-07-16 15:47     ` Vlastimil Babka
  2014-07-16 15:38   ` Vlastimil Babka
  1 sibling, 1 reply; 15+ messages in thread
From: Bob Liu @ 2014-07-16  1:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andrea Arcangeli, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, linux-mm, linux-kernel


On 07/16/2014 08:13 AM, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target 
> node") improved the previous khugepaged logic which allocated a 
> transparent hugepages from the node of the first page being collapsed.
> 
> However, it is still possible to collapse pages to remote memory which may 
> suffer from additional access latency.  With the current policy, it is 
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely 
> if the majority are allocated from that node.
> 
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation.  In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency.  Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
> 
> There is no functional change for systems that disable zone_reclaim_mode.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  v2: only change behavior for zone_reclaim_mode per Dave Hansen
> 
>  mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2234,6 +2234,26 @@ static void khugepaged_alloc_sleep(void)
>  static int khugepaged_node_load[MAX_NUMNODES];
>  
>  #ifdef CONFIG_NUMA
> +static bool khugepaged_scan_abort(int nid)
> +{
> +	int i;
> +
> +	/*
> +	 * If zone_reclaim_mode is disabled, then no extra effort is made to
> +	 * allocate memory locally.
> +	 */
> +	if (!zone_reclaim_mode)
> +		return false;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		if (!khugepaged_node_load[i])
> +			continue;
> +		if (node_distance(nid, i) > RECLAIM_DISTANCE)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  static int khugepaged_find_target_node(void)
>  {
>  	static int last_khugepaged_target_node = NUMA_NO_NODE;
> @@ -2309,6 +2329,11 @@ static struct page
>  	return *hpage;
>  }
>  #else
> +static bool khugepaged_scan_abort(int nid)
> +{
> +	return false;
> +}
> +
>  static int khugepaged_find_target_node(void)
>  {
>  	return 0;
> @@ -2515,6 +2540,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  	unsigned long _address;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE;
> +	int last_node = node;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> @@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		 * hit record.
>  		 */
>  		node = page_to_nid(page);
> +		if (node != last_node) {
> +			if (khugepaged_scan_abort(node))
> +				goto out_unmap;

Nitpick: how about not breaking the loop, but instead resetting the
related khugepaged_node_load[] entry to zero?  E.g. modify
khugepaged_scan_abort() like this:
if (node_distance(nid, i) > RECLAIM_DISTANCE)
   khugepaged_node_load[i] = 0;

That way, we may have a chance to find a more suitable node.

> +			last_node = node;
> +		}
>  		khugepaged_node_load[node]++;
>  		VM_BUG_ON_PAGE(PageCompound(page), page);
>  		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
> 

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-16  0:13 ` [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode David Rientjes
  2014-07-16  1:22   ` Bob Liu
@ 2014-07-16 15:38   ` Vlastimil Babka
  2014-07-17  0:54     ` David Rientjes
  1 sibling, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2014-07-16 15:38 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Andrea Arcangeli, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Bob Liu, linux-mm, linux-kernel

On 07/16/2014 02:13 AM, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target 
> node") improved the previous khugepaged logic which allocated a 
> transparent hugepages from the node of the first page being collapsed.
> 
> However, it is still possible to collapse pages to remote memory which may 
> suffer from additional access latency.  With the current policy, it is 
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely 
> if the majority are allocated from that node.
> 
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation.  In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency.  Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
> 
> There is no functional change for systems that disable zone_reclaim_mode.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  v2: only change behavior for zone_reclaim_mode per Dave Hansen
> 
>  mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2234,6 +2234,26 @@ static void khugepaged_alloc_sleep(void)
>  static int khugepaged_node_load[MAX_NUMNODES];
>  
>  #ifdef CONFIG_NUMA
> +static bool khugepaged_scan_abort(int nid)
> +{
> +	int i;
> +
> +	/*
> +	 * If zone_reclaim_mode is disabled, then no extra effort is made to
> +	 * allocate memory locally.
> +	 */
> +	if (!zone_reclaim_mode)
> +		return false;
> +

I wonder if you could do this here?

if (khugepaged_node_load[nid])
	return false;

If the condition is true, it means you already checked the 'nid' node against
all other nodes present in the pmd in a previous khugepaged_scan_pmd iteration.
And if it passed then, it would also pass now. If meanwhile a new node was found
and recorded, it was also checked against everything present at that point,
including 'nid'. So it should be safe?

The worst case (perfect interleaving page per page, so that "node != last_node"
is true in each iteration) complexity then reduces from O(HPAGE_PMD_NR *
MAX_NUMNODES) to O(HPAGE_PMD_NR + MAX_NUMNODES) iterations.

> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		if (!khugepaged_node_load[i])
> +			continue;
> +		if (node_distance(nid, i) > RECLAIM_DISTANCE)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  static int khugepaged_find_target_node(void)
>  {
>  	static int last_khugepaged_target_node = NUMA_NO_NODE;
> @@ -2309,6 +2329,11 @@ static struct page
>  	return *hpage;
>  }
>  #else
> +static bool khugepaged_scan_abort(int nid)
> +{
> +	return false;
> +}
> +
>  static int khugepaged_find_target_node(void)
>  {
>  	return 0;
> @@ -2515,6 +2540,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  	unsigned long _address;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE;
> +	int last_node = node;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> @@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		 * hit record.
>  		 */
>  		node = page_to_nid(page);
> +		if (node != last_node) {
> +			if (khugepaged_scan_abort(node))
> +				goto out_unmap;
> +			last_node = node;
> +		}
>  		khugepaged_node_load[node]++;
>  		VM_BUG_ON_PAGE(PageCompound(page), page);
>  		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-16  1:22   ` Bob Liu
@ 2014-07-16 15:47     ` Vlastimil Babka
  2014-07-16 19:37       ` Hugh Dickins
  2014-07-17  0:49       ` David Rientjes
  0 siblings, 2 replies; 15+ messages in thread
From: Vlastimil Babka @ 2014-07-16 15:47 UTC (permalink / raw)
  To: Bob Liu, David Rientjes
  Cc: Andrew Morton, Andrea Arcangeli, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, linux-mm, linux-kernel

On 07/16/2014 03:22 AM, Bob Liu wrote:
>> @@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>  		 * hit record.
>>  		 */
>>  		node = page_to_nid(page);
>> +		if (node != last_node) {
>> +			if (khugepaged_scan_abort(node))
>> +				goto out_unmap;
> 
> Nitpick: How about not break the loop but only reset the related
> khugepaged_node_load[] to zero. E.g. modify khugepaged_scan_abort() like
> this:
> if (node_distance(nid, i) > RECLAIM_DISTANCE)
>    khugepaged_node_load[i] = 0;
> 
> By this way, we may have a chance to find a more suitable node.

Hm, theoretically there might be a suitable node, but this approach wouldn't
work.  By resetting it to zero you forget that there ever was node 'i'.  If
no further base page comes from node 'i', its load stays zero and the next
call with 'nid' will think that 'nid' is OK.

So the correct way would be more complex, but I wonder if it's worth the trouble...

>> +			last_node = node;
>> +		}
>>  		khugepaged_node_load[node]++;
>>  		VM_BUG_ON_PAGE(PageCompound(page), page);
>>  		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>> 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-16 15:47     ` Vlastimil Babka
@ 2014-07-16 19:37       ` Hugh Dickins
  2014-07-17  0:49       ` David Rientjes
  1 sibling, 0 replies; 15+ messages in thread
From: Hugh Dickins @ 2014-07-16 19:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, Bob Liu, Andrew Morton, Andrea Arcangeli,
	Mel Gorman, Rik van Riel, Kirill A. Shutemov, linux-mm,
	linux-kernel

I shall worry less if you change the Subject from "mm, tmp:" to "mm, thp:"

Hugh :)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-16 15:47     ` Vlastimil Babka
  2014-07-16 19:37       ` Hugh Dickins
@ 2014-07-17  0:49       ` David Rientjes
  1 sibling, 0 replies; 15+ messages in thread
From: David Rientjes @ 2014-07-17  0:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Bob Liu, Andrew Morton, Andrea Arcangeli, Mel Gorman,
	Rik van Riel, Kirill A. Shutemov, linux-mm, linux-kernel

On Wed, 16 Jul 2014, Vlastimil Babka wrote:

> >> @@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >>  		 * hit record.
> >>  		 */
> >>  		node = page_to_nid(page);
> >> +		if (node != last_node) {
> >> +			if (khugepaged_scan_abort(node))
> >> +				goto out_unmap;
> > 
> > Nitpick: How about not break the loop but only reset the related
> > khugepaged_node_load[] to zero. E.g. modify khugepaged_scan_abort() like
> > this:
> > if (node_distance(nid, i) > RECLAIM_DISTANCE)
> >    khugepaged_node_load[i] = 0;
> > 
> > By this way, we may have a chance to find a more suitable node.
> 
> Hm theoretically there might be a suitable node, but this approach wouldn't
> work. By resetting it to zero you forget that there ever was node 'i'. If there
> is no more base page from node 'i', the load remains zero and the next call with
> 'nid' will think that 'nid' is OK.
> 

Right, the suggestion is wrong because we never want to collapse to a 
node when the distance from the source page is > RECLAIM_DISTANCE; 
that's the entire point of the patch.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-16 15:38   ` Vlastimil Babka
@ 2014-07-17  0:54     ` David Rientjes
  2014-07-17  0:59       ` [patch v3] mm, thp: " David Rientjes
  0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2014-07-17  0:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Andrea Arcangeli, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Bob Liu, linux-mm, linux-kernel

On Wed, 16 Jul 2014, Vlastimil Babka wrote:

> I wonder if you could do this here?
> 
> if (khugepaged_node_load[nid])
> 	return false;
> 
> If the condition is true, it means you already checked the 'nid' node against
> all other nodes present in the pmd in a previous khugepaged_scan_pmd iteration.
> And if it passed then, it would also pass now. If meanwhile a new node was found
> and recorded, it was also checked against everything present at that point,
> including 'nid'. So it should be safe?
> 
> The worst case (perfect interleaving page per page, so that "node != last_node"
> is true in each iteration) complexity then reduces from O(HPAGE_PMD_NR *
> MAX_NUMNODES) to O(HPAGE_PMD_NR + MAX_NUMNODES) iterations.
> 

Excellent suggestion, thanks Vlastimil!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch v3] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-17  0:54     ` David Rientjes
@ 2014-07-17  0:59       ` David Rientjes
  2014-07-17 16:28         ` Dave Hansen
  2014-07-28  8:42         ` [patch v3] " Vlastimil Babka
  0 siblings, 2 replies; 15+ messages in thread
From: David Rientjes @ 2014-07-17  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Vlastimil Babka, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Bob Liu, linux-mm, linux-kernel

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target 
node") improved the previous khugepaged logic which allocated a 
transparent hugepages from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory which may 
suffer from additional access latency.  With the current policy, it is 
possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely 
if the majority are allocated from that node.

When zone_reclaim_mode is enabled, it means the VM should make every attempt
to allocate locally to prevent NUMA performance degradation.  In this case,
we do not want to collapse hugepages to remote nodes that would suffer from
increased access latency.  Thus, when zone_reclaim_mode is enabled, only
allow collapsing to nodes with RECLAIM_DISTANCE or less.

There is no functional change for systems that disable zone_reclaim_mode.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v2: only change behavior for zone_reclaim_mode per Dave Hansen
 v3: optimization based on previous node counts per Vlastimil Babka

 mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2234,6 +2234,30 @@ static void khugepaged_alloc_sleep(void)
 static int khugepaged_node_load[MAX_NUMNODES];
 
 #ifdef CONFIG_NUMA
+static bool khugepaged_scan_abort(int nid)
+{
+	int i;
+
+	/*
+	 * If zone_reclaim_mode is disabled, then no extra effort is made to
+	 * allocate memory locally.
+	 */
+	if (!zone_reclaim_mode)
+		return false;
+
+	/* If there is a count for this node already, it must be acceptable */
+	if (khugepaged_node_load[nid])
+		return false;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (!khugepaged_node_load[i])
+			continue;
+		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+			return true;
+	}
+	return false;
+}
+
 static int khugepaged_find_target_node(void)
 {
 	static int last_khugepaged_target_node = NUMA_NO_NODE;
@@ -2309,6 +2333,11 @@ static struct page
 	return *hpage;
 }
 #else
+static bool khugepaged_scan_abort(int nid)
+{
+	return false;
+}
+
 static int khugepaged_find_target_node(void)
 {
 	return 0;
@@ -2545,6 +2574,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = page_to_nid(page);
+		if (khugepaged_scan_abort(node))
+			goto out_unmap;
 		khugepaged_node_load[node]++;
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v3] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-17  0:59       ` [patch v3] mm, thp: " David Rientjes
@ 2014-07-17 16:28         ` Dave Hansen
  2014-07-17 21:48           ` [patch v4] " David Rientjes
  2014-07-28  8:42         ` [patch v3] " Vlastimil Babka
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2014-07-17 16:28 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Andrea Arcangeli, Vlastimil Babka, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Bob Liu, linux-mm, linux-kernel

On 07/16/2014 05:59 PM, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target 
> node") improved the previous khugepaged logic which allocated a 
> transparent hugepages from the node of the first page being collapsed.
> 
> However, it is still possible to collapse pages to remote memory which may 
> suffer from additional access latency.  With the current policy, it is 
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely 
> if the majority are allocated from that node.
> 
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation.  In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency.  Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
> 
> There is no functional change for systems that disable zone_reclaim_mode.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  v2: only change behavior for zone_reclaim_mode per Dave Hansen
>  v3: optimization based on previous node counts per Vlastimil Babka
> 
>  mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2234,6 +2234,30 @@ static void khugepaged_alloc_sleep(void)
>  static int khugepaged_node_load[MAX_NUMNODES];
>  
>  #ifdef CONFIG_NUMA
> +static bool khugepaged_scan_abort(int nid)
> +{
> +	int i;
> +
> +	/*
> +	 * If zone_reclaim_mode is disabled, then no extra effort is made to
> +	 * allocate memory locally.
> +	 */
> +	if (!zone_reclaim_mode)
> +		return false;
> +
> +	/* If there is a count for this node already, it must be acceptable */
> +	if (khugepaged_node_load[nid])
> +		return false;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		if (!khugepaged_node_load[i])
> +			continue;
> +		if (node_distance(nid, i) > RECLAIM_DISTANCE)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  static int khugepaged_find_target_node(void)
>  {
>  	static int last_khugepaged_target_node = NUMA_NO_NODE;
> @@ -2309,6 +2333,11 @@ static struct page
>  	return *hpage;
>  }
>  #else
> +static bool khugepaged_scan_abort(int nid)
> +{
> +	return false;
> +}

Minor nit: I guess this makes it more explicit, but this #ifdef is
unnecessary in practice because we define zone_reclaim_mode this way:

#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
#else
#define zone_reclaim_mode 0
#endif

Looks fine to me otherwise, though.  Definitely addresses the concerns I
had about RECLAIM_DISTANCE being consulted directly.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch v4] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-17 16:28         ` Dave Hansen
@ 2014-07-17 21:48           ` David Rientjes
  2014-07-25 15:34             ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2014-07-17 21:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, Andrea Arcangeli, Vlastimil Babka, Mel Gorman,
	Rik van Riel, Kirill A. Shutemov, Bob Liu, linux-mm,
	linux-kernel

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target 
node") improved the previous khugepaged logic which allocated a 
transparent hugepages from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory which may 
suffer from additional access latency.  With the current policy, it is 
possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely 
if the majority are allocated from that node.

When zone_reclaim_mode is enabled, it means the VM should make every attempt
to allocate locally to prevent NUMA performance degradation.  In this case,
we do not want to collapse hugepages to remote nodes that would suffer from
increased access latency.  Thus, when zone_reclaim_mode is enabled, only
allow collapsing to nodes with RECLAIM_DISTANCE or less.

There is no functional change for systems that disable zone_reclaim_mode.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v2: only change behavior for zone_reclaim_mode per Dave Hansen
 v3: optimization based on previous node counts per Vlastimil Babka
 v4: unconditionally define khugepaged_scan_abort per Dave Hansen
     no increase in .text size for CONFIG_NUMA=n

 mm/huge_memory.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2233,6 +2233,30 @@ static void khugepaged_alloc_sleep(void)
 
 static int khugepaged_node_load[MAX_NUMNODES];
 
+static bool khugepaged_scan_abort(int nid)
+{
+	int i;
+
+	/*
+	 * If zone_reclaim_mode is disabled, then no extra effort is made to
+	 * allocate memory locally.
+	 */
+	if (!zone_reclaim_mode)
+		return false;
+
+	/* If there is a count for this node already, it must be acceptable */
+	if (khugepaged_node_load[nid])
+		return false;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (!khugepaged_node_load[i])
+			continue;
+		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+			return true;
+	}
+	return false;
+}
+
 #ifdef CONFIG_NUMA
 static int khugepaged_find_target_node(void)
 {
@@ -2545,6 +2569,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = page_to_nid(page);
+		if (khugepaged_scan_abort(node))
+			goto out_unmap;
 		khugepaged_node_load[node]++;
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v4] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-17 21:48           ` [patch v4] " David Rientjes
@ 2014-07-25 15:34             ` Mel Gorman
  0 siblings, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2014-07-25 15:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Dave Hansen, Andrea Arcangeli, Vlastimil Babka,
	Rik van Riel, Kirill A. Shutemov, Bob Liu, linux-mm,
	linux-kernel

On Thu, Jul 17, 2014 at 02:48:07PM -0700, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target 
> node") improved the previous khugepaged logic which allocated a 
> transparent hugepages from the node of the first page being collapsed.
> 
> However, it is still possible to collapse pages to remote memory which may 
> suffer from additional access latency.  With the current policy, it is 
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely 
> if the majority are allocated from that node.
> 
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation.  In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency.  Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
> 
> There is no functional change for systems that disable zone_reclaim_mode.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

The patch looks ok for what it is intended to do so

Acked-by: Mel Gorman <mgorman@suse.de>

However, I would consider it likely that pages allocated on different nodes
within a hugepage boundary indicate that multiple threads on different nodes
are accessing those pages. I would be skeptical that the reduced TLB misses
offset the remote access penalties. Should we simply refuse to collapse huge
pages when the 4K pages are allocated from different nodes? If automatic
NUMA balancing is enabled and the accesses are really coming from one node,
then the 4K pages will eventually be migrated to a local node and khugepaged
can collapse them.
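
A minimal sketch of that stricter alternative, purely for illustration
(it reuses khugepaged_node_load[] from the patch; the helper name is
hypothetical and not proposed anywhere in this thread):

/*
 * Illustration only: abort the collapse as soon as base pages from a
 * second node are seen, regardless of node_distance().
 */
static bool khugepaged_scan_abort_strict(int nid)
{
	int i;

	for (i = 0; i < MAX_NUMNODES; i++)
		if (khugepaged_node_load[i] && i != nid)
			return true;
	return false;
}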

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v3] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
  2014-07-17  0:59       ` [patch v3] mm, thp: " David Rientjes
  2014-07-17 16:28         ` Dave Hansen
@ 2014-07-28  8:42         ` Vlastimil Babka
  1 sibling, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2014-07-28  8:42 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Andrea Arcangeli, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Bob Liu, linux-mm, linux-kernel

On 07/17/2014 02:59 AM, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
> node") improved the previous khugepaged logic which allocated a
> transparent hugepages from the node of the first page being collapsed.
>
> However, it is still possible to collapse pages to remote memory which may
> suffer from additional access latency.  With the current policy, it is
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
> if the majority are allocated from that node.
>
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation.  In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency.  Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
>
> There is no functional change for systems that disable zone_reclaim_mode.
>
> Signed-off-by: David Rientjes <rientjes@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-07-28  8:42 UTC | newest]

Thread overview: 15+ messages
2014-07-15  1:09 [patch] mm, thp: only collapse hugepages to nodes with affinity David Rientjes
2014-07-15  4:47 ` Dave Hansen
2014-07-15 23:17   ` David Rientjes
2014-07-16  0:13 ` [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode David Rientjes
2014-07-16  1:22   ` Bob Liu
2014-07-16 15:47     ` Vlastimil Babka
2014-07-16 19:37       ` Hugh Dickins
2014-07-17  0:49       ` David Rientjes
2014-07-16 15:38   ` Vlastimil Babka
2014-07-17  0:54     ` David Rientjes
2014-07-17  0:59       ` [patch v3] mm, thp: " David Rientjes
2014-07-17 16:28         ` Dave Hansen
2014-07-17 21:48           ` [patch v4] " David Rientjes
2014-07-25 15:34             ` Mel Gorman
2014-07-28  8:42         ` [patch v3] " Vlastimil Babka
