linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage
@ 2017-01-24  7:49 Jia He
  2017-01-24  7:49 ` [PATCH RFC 1/3] mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path Jia He
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Jia He @ 2017-01-24  7:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Naoya Horiguchi, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel, Jia He

If there is a server with uneven numa memory layout:
available: 7 nodes (0-6)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 6603 MB
node 0 free: 91 MB
node 1 cpus:
node 1 size: 12527 MB
node 1 free: 157 MB
node 2 cpus:
node 2 size: 15087 MB
node 2 free: 189 MB
node 3 cpus:
node 3 size: 16111 MB
node 3 free: 205 MB
node 4 cpus: 8 9 10 11 12 13 14 15
node 4 size: 24815 MB
node 4 free: 310 MB
node 5 cpus:
node 5 size: 4095 MB
node 5 free: 61 MB
node 6 cpus:
node 6 size: 22750 MB
node 6 free: 283 MB
node distances:
node   0   1   2   3   4   5   6
  0:  10  20  40  40  40  40  40
  1:  20  10  40  40  40  40  40
  2:  40  40  10  20  40  40  40
  3:  40  40  20  10  40  40  40
  4:  40  40  40  40  10  20  40
  5:  40  40  40  40  20  10  40
  6:  40  40  40  40  40  40  10

In this case node 5 has less memory, and the hugepages will be allocated
from these nodes one by one after we trigger:
echo 4000 > /proc/sys/vm/nr_hugepages

Then kswapd5 will take 100% CPU for a long time. This is a livelock
issue in kswapd. This patch set fixes it.

The 3rd patch improves kswapd's performance significantly.

Jia He (3):
  mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path
  mm, vmscan: limit kswapd loop if no progress is made
  mm, vmscan: correct prepare_kswapd_sleep return value

 mm/hugetlb.c |  9 +++++++++
 mm/vmscan.c  | 28 ++++++++++++++++++++++++----
 2 files changed, 33 insertions(+), 4 deletions(-)

-- 
2.5.5


* [PATCH RFC 1/3] mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path
  2017-01-24  7:49 [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage Jia He
@ 2017-01-24  7:49 ` Jia He
  2017-01-24 16:52   ` Michal Hocko
  2017-01-24  7:49 ` [PATCH RFC 2/3] mm, vmscan: limit kswapd loop if no progress is made Jia He
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Jia He @ 2017-01-24  7:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Naoya Horiguchi, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel, Jia He

This patch splits alloc_fresh_huge_page_node into two parts:
- a fast path without the __GFP_REPEAT flag
- a slow path with the __GFP_REPEAT flag

Thus, if there is a server with uneven numa memory layout:
available: 7 nodes (0-6)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 6603 MB
node 0 free: 91 MB
node 1 cpus:
node 1 size: 12527 MB
node 1 free: 157 MB
node 2 cpus:
node 2 size: 15087 MB
node 2 free: 189 MB
node 3 cpus:
node 3 size: 16111 MB
node 3 free: 205 MB
node 4 cpus: 8 9 10 11 12 13 14 15
node 4 size: 24815 MB
node 4 free: 310 MB
node 5 cpus:
node 5 size: 4095 MB
node 5 free: 61 MB
node 6 cpus:
node 6 size: 22750 MB
node 6 free: 283 MB
node distances:
node   0   1   2   3   4   5   6
  0:  10  20  40  40  40  40  40
  1:  20  10  40  40  40  40  40
  2:  40  40  10  20  40  40  40
  3:  40  40  20  10  40  40  40
  4:  40  40  40  40  10  20  40
  5:  40  40  40  40  20  10  40
  6:  40  40  40  40  40  40  10

In this case node 5 has less memory, and the hugepages will be allocated
from these nodes one by one.
After this patch, we will not trigger direct memory/kswapd reclaim too
early for node 5 if there is enough memory in other nodes.
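
The allocation ordering the changelog describes can be sketched as
standalone C. fake_alloc and the GFP_* constants below are illustrative
stand-ins, not kernel code (note that the flag placement in the posted
diff does not match its own comments):

```c
#include <assert.h>
#include <stddef.h>

#define GFP_BASE   0x1u
#define GFP_REPEAT 0x2u  /* hypothetical stand-in for __GFP_REPEAT */

/* Records the gfp flags of each attempt so the order can be inspected. */
static unsigned int attempts[2];
static int nr_attempts;

/* Hypothetical allocator stub: fails unless GFP_REPEAT is set, which
 * simulates a node under memory pressure. */
static void *fake_alloc(unsigned int gfp)
{
	attempts[nr_attempts++] = gfp;
	return (gfp & GFP_REPEAT) ? (void *)attempts : NULL;
}

/* The order described above: a cheap first attempt without __GFP_REPEAT,
 * and only on failure a second, retrying attempt with it. */
static void *alloc_fresh_model(void)
{
	void *page;

	page = fake_alloc(GFP_BASE);                  /* fast path */
	if (!page)
		page = fake_alloc(GFP_BASE | GFP_REPEAT); /* slow path */
	return page;
}
```

On a node under pressure the model performs exactly two attempts, the
second carrying the retry flag.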

Signed-off-by: Jia He <hejianet@gmail.com>
---
 mm/hugetlb.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c7025c1..f2415ce 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1364,10 +1364,19 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
+	/* fast path without __GFP_REPEAT */
 	page = __alloc_pages_node(nid,
 		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
 		huge_page_order(h));
+
+	/* slow path with __GFP_REPEAT*/
+	if (!page)
+		page = __alloc_pages_node(nid,
+			htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
+					__GFP_NOWARN,
+			huge_page_order(h));
+
 	if (page) {
 		prep_new_huge_page(h, page, nid);
 	}
-- 
2.5.5


* [PATCH RFC 2/3] mm, vmscan: limit kswapd loop if no progress is made
  2017-01-24  7:49 [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage Jia He
  2017-01-24  7:49 ` [PATCH RFC 1/3] mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path Jia He
@ 2017-01-24  7:49 ` Jia He
  2017-01-24 16:54   ` Michal Hocko
  2017-01-24  7:49 ` [PATCH RFC 3/3] mm, vmscan: correct prepare_kswapd_sleep return value Jia He
  2017-01-24 16:46 ` [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage Michal Hocko
  3 siblings, 1 reply; 12+ messages in thread
From: Jia He @ 2017-01-24  7:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Naoya Horiguchi, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel, Jia He

Currently there is no hard limit on kswapd retry attempts when no progress
is made. Then kswapd will take 100% CPU for a long time.

In my test, I tried to allocate 4000 hugepages by:
echo 4000 > /proc/sys/vm/nr_hugepages

Then kswapd will take 100% CPU for a long time.

The numa layout is:
available: 7 nodes (0-6)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 6611 MB
node 0 free: 1103 MB
node 1 cpus:
node 1 size: 12527 MB
node 1 free: 8477 MB
node 2 cpus:
node 2 size: 15087 MB
node 2 free: 11037 MB
node 3 cpus:
node 3 size: 16111 MB
node 3 free: 12060 MB
node 4 cpus: 8 9 10 11 12 13 14 15
node 4 size: 24815 MB
node 4 free: 20704 MB
node 5 cpus:
node 5 size: 4095 MB
node 5 free: 61 MB 
node 6 cpus:
node 6 size: 22750 MB
node 6 free: 18716 MB

The cause is that kswapd loops for a long time even if no progress is made
in balance_pgdat.
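
The retry bound this patch adds can be modeled in standalone C.
kswapd_loop_model and fake_balance below are illustrative stand-ins for
the kswapd main loop and balance_pgdat(), not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_KSWAPD_RECLAIM_RETRIES 16

/* Stand-in for balance_pgdat(): reports via *did_some_progress whether
 * any pages were scanned or reclaimed on this pass. */
static void fake_balance(int *did_some_progress, bool stuck)
{
	*did_some_progress = stuck ? 0 : 1;
}

/* Model of the bounded kswapd loop: returns how many reclaim passes run
 * before the thread would go back to sleep. */
static int kswapd_loop_model(bool stuck)
{
	int no_progress_loops = 0;
	int did_some_progress = 0;
	int passes = 0;

	for (;;) {
		passes++;
		fake_balance(&did_some_progress, stuck);

		if (did_some_progress)
			no_progress_loops = 0;
		else
			no_progress_loops++;

		/* The escape hatch: after 16 fruitless passes, stop
		 * spinning and try to sleep. */
		if (no_progress_loops >= MAX_KSWAPD_RECLAIM_RETRIES)
			return passes;

		if (did_some_progress)
			return passes; /* progress: pretend the node balanced */
	}
}
```

A stuck node exits after exactly MAX_KSWAPD_RECLAIM_RETRIES passes
instead of looping forever.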

Signed-off-by: Jia He <hejianet@gmail.com>
---
 mm/vmscan.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 532a2a7..7396a0a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -59,6 +59,7 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+#define MAX_KSWAPD_RECLAIM_RETRIES 16
 struct scan_control {
 	/* How many pages shrink_list() should reclaim */
 	unsigned long nr_to_reclaim;
@@ -3202,7 +3203,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
  * or lower is eligible for reclaim until at least one usable zone is
  * balanced.
  */
-static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx,
+						 int *did_some_progress)
 {
 	int i;
 	unsigned long nr_soft_reclaimed;
@@ -3322,6 +3324,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	 * entered the allocator slow path while kswapd was awake, order will
 	 * remain at the higher level.
 	 */
+	*did_some_progress = !!(sc.nr_scanned || sc.nr_reclaimed);
 	return sc.order;
 }
 
@@ -3417,6 +3420,8 @@ static int kswapd(void *p)
 	unsigned int alloc_order, reclaim_order, classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
+	int no_progress_loops = 0;
+	int did_some_progress = 0;
 
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
@@ -3480,9 +3485,23 @@ static int kswapd(void *p)
 		 */
 		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
 						alloc_order);
-		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
-		if (reclaim_order < alloc_order)
+		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx,
+						&did_some_progress);
+
+		if (reclaim_order < alloc_order) {
+			no_progress_loops = 0;
 			goto kswapd_try_sleep;
+		}
+
+		if (did_some_progress)
+			no_progress_loops = 0;
+		else
+			no_progress_loops++;
+
+		if (no_progress_loops >= MAX_KSWAPD_RECLAIM_RETRIES) {
+			no_progress_loops = 0;
+			goto kswapd_try_sleep;
+		}
 
 		alloc_order = reclaim_order = pgdat->kswapd_order;
 		classzone_idx = pgdat->kswapd_classzone_idx;
-- 
2.5.5


* [PATCH RFC 3/3] mm, vmscan: correct prepare_kswapd_sleep return value
  2017-01-24  7:49 [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage Jia He
  2017-01-24  7:49 ` [PATCH RFC 1/3] mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path Jia He
  2017-01-24  7:49 ` [PATCH RFC 2/3] mm, vmscan: limit kswapd loop if no progress is made Jia He
@ 2017-01-24  7:49 ` Jia He
  2017-01-24 22:01   ` Rik van Riel
  2017-01-24 16:46 ` [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage Michal Hocko
  3 siblings, 1 reply; 12+ messages in thread
From: Jia He @ 2017-01-24  7:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Naoya Horiguchi, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel, Jia He

When there are no reclaimable pages in a zone, let kswapd go to sleep even
if the zone is not balanced; that is, prepare_kswapd_sleep will return true
in this case.
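
The sleep rule stated above can be modeled in standalone C. zone_model
and may_sleep are illustrative stand-ins for the kernel's zone and
prepare_kswapd_sleep(), not real kernel code; the model encodes the
changelog's intent, where a zone only blocks sleep if it is unbalanced
and still has reclaimable pages:

```c
#include <assert.h>
#include <stdbool.h>

struct zone_model {
	bool managed;              /* stand-in for managed_zone() */
	bool balanced;             /* stand-in for zone_balanced() */
	unsigned long reclaimable; /* stand-in for zone_reclaimable_pages() */
};

/* kswapd may sleep when every managed zone is either balanced or has
 * nothing left to reclaim; only an unbalanced zone with reclaimable
 * pages keeps it awake. */
static bool may_sleep(const struct zone_model *zones, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!zones[i].managed)
			continue;
		if (!zones[i].balanced && zones[i].reclaimable > 0)
			return false;
	}
	return true;
}
```

An unbalanced zone with nothing reclaimable no longer keeps kswapd
awake, while a reclaimable unbalanced zone still does.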

Signed-off-by: Jia He <hejianet@gmail.com>
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7396a0a..54445e2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3140,7 +3140,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;
 
-		if (!zone_balanced(zone, order, classzone_idx))
+		if (!zone_balanced(zone, order, classzone_idx)
+			&& !zone_reclaimable_pages(zone))
 			return false;
 	}
 
-- 
2.5.5


* Re: [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage
  2017-01-24  7:49 [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage Jia He
                   ` (2 preceding siblings ...)
  2017-01-24  7:49 ` [PATCH RFC 3/3] mm, vmscan: correct prepare_kswapd_sleep return value Jia He
@ 2017-01-24 16:46 ` Michal Hocko
  2017-01-25  2:13   ` hejianet
  3 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-01-24 16:46 UTC (permalink / raw)
  To: Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Naoya Horiguchi,
	Mike Kravetz, Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

On Tue 24-01-17 15:49:01, Jia He wrote:
> If there is a server with uneven numa memory layout:
> available: 7 nodes (0-6)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 6603 MB
> node 0 free: 91 MB
> node 1 cpus:
> node 1 size: 12527 MB
> node 1 free: 157 MB
> node 2 cpus:
> node 2 size: 15087 MB
> node 2 free: 189 MB
> node 3 cpus:
> node 3 size: 16111 MB
> node 3 free: 205 MB
> node 4 cpus: 8 9 10 11 12 13 14 15
> node 4 size: 24815 MB
> node 4 free: 310 MB
> node 5 cpus:
> node 5 size: 4095 MB
> node 5 free: 61 MB
> node 6 cpus:
> node 6 size: 22750 MB
> node 6 free: 283 MB
> node distances:
> node   0   1   2   3   4   5   6
>   0:  10  20  40  40  40  40  40
>   1:  20  10  40  40  40  40  40
>   2:  40  40  10  20  40  40  40
>   3:  40  40  20  10  40  40  40
>   4:  40  40  40  40  10  20  40
>   5:  40  40  40  40  20  10  40
>   6:  40  40  40  40  40  40  10
> 
> In this case node 5 has less memory, and the hugepages will be allocated
> from these nodes one by one after we trigger:
> echo 4000 > /proc/sys/vm/nr_hugepages
> 
> Then kswapd5 will take 100% CPU for a long time. This is a livelock
> issue in kswapd. This patch set fixes it.

It would be really helpful to describe what the issue is and whether it
is specific to the configuration above. Also, a high-level overview of the
fix and why it is the right approach would be appreciated.
 
> The 3rd patch improves kswapd's performance significantly.

Numbers?

> Jia He (3):
>   mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path
>   mm, vmscan: limit kswapd loop if no progress is made
>   mm, vmscan: correct prepare_kswapd_sleep return value
> 
>  mm/hugetlb.c |  9 +++++++++
>  mm/vmscan.c  | 28 ++++++++++++++++++++++++----
>  2 files changed, 33 insertions(+), 4 deletions(-)
> 
> -- 
> 2.5.5
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC 1/3] mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path
  2017-01-24  7:49 ` [PATCH RFC 1/3] mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path Jia He
@ 2017-01-24 16:52   ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-01-24 16:52 UTC (permalink / raw)
  To: Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Naoya Horiguchi,
	Mike Kravetz, Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

On Tue 24-01-17 15:49:02, Jia He wrote:
> This patch splits alloc_fresh_huge_page_node into two parts:
> - a fast path without the __GFP_REPEAT flag
> - a slow path with the __GFP_REPEAT flag
> 
> Thus, if there is a server with uneven numa memory layout:
> available: 7 nodes (0-6)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 6603 MB
> node 0 free: 91 MB
> node 1 cpus:
> node 1 size: 12527 MB
> node 1 free: 157 MB
> node 2 cpus:
> node 2 size: 15087 MB
> node 2 free: 189 MB
> node 3 cpus:
> node 3 size: 16111 MB
> node 3 free: 205 MB
> node 4 cpus: 8 9 10 11 12 13 14 15
> node 4 size: 24815 MB
> node 4 free: 310 MB
> node 5 cpus:
> node 5 size: 4095 MB
> node 5 free: 61 MB
> node 6 cpus:
> node 6 size: 22750 MB
> node 6 free: 283 MB
> node distances:
> node   0   1   2   3   4   5   6
>   0:  10  20  40  40  40  40  40
>   1:  20  10  40  40  40  40  40
>   2:  40  40  10  20  40  40  40
>   3:  40  40  20  10  40  40  40
>   4:  40  40  40  40  10  20  40
>   5:  40  40  40  40  20  10  40
>   6:  40  40  40  40  40  40  10
> 
> In this case node 5 has less memory, and the hugepages will be allocated
> from these nodes one by one.
> After this patch, we will not trigger direct memory/kswapd reclaim too
> early for node 5 if there is enough memory in other nodes.

This description doesn't explain what the problem is, why it matters,
and how the fix actually works. Moreover, it does the opposite of what it
claims. Which brings me to another question: how has this been tested?

> Signed-off-by: Jia He <hejianet@gmail.com>
> ---
>  mm/hugetlb.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c7025c1..f2415ce 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1364,10 +1364,19 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
>  {
>  	struct page *page;
>  
> +	/* fast path without __GFP_REPEAT */
>  	page = __alloc_pages_node(nid,
>  		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
>  						__GFP_REPEAT|__GFP_NOWARN,
>  		huge_page_order(h));

this does the opposite of what the comment says.

> +
> +	/* slow path with __GFP_REPEAT*/
> +	if (!page)
> +		page = __alloc_pages_node(nid,
> +			htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
> +					__GFP_NOWARN,
> +			huge_page_order(h));
> +
>  	if (page) {
>  		prep_new_huge_page(h, page, nid);
>  	}
> -- 
> 2.5.5
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC 2/3] mm, vmscan: limit kswapd loop if no progress is made
  2017-01-24  7:49 ` [PATCH RFC 2/3] mm, vmscan: limit kswapd loop if no progress is made Jia He
@ 2017-01-24 16:54   ` Michal Hocko
  2017-01-25  3:03     ` hejianet
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-01-24 16:54 UTC (permalink / raw)
  To: Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Naoya Horiguchi,
	Mike Kravetz, Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

On Tue 24-01-17 15:49:03, Jia He wrote:
> Currently there is no hard limit on kswapd retry attempts when no progress
> is made.

Yes, because the main objective of the kswapd is to balance all memory
zones. So having a hard limit on retries doesn't make any sense.

> Then kswapd will take 100% CPU for a long time.

Where is it spending time?

> In my test, I tried to allocate 4000 hugepages by:
> echo 4000 > /proc/sys/vm/nr_hugepages
> 
> Then kswapd will take 100% CPU for a long time.
> 
> The numa layout is:
> available: 7 nodes (0-6)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 6611 MB
> node 0 free: 1103 MB
> node 1 cpus:
> node 1 size: 12527 MB
> node 1 free: 8477 MB
> node 2 cpus:
> node 2 size: 15087 MB
> node 2 free: 11037 MB
> node 3 cpus:
> node 3 size: 16111 MB
> node 3 free: 12060 MB
> node 4 cpus: 8 9 10 11 12 13 14 15
> node 4 size: 24815 MB
> node 4 free: 20704 MB
> node 5 cpus:
> node 5 size: 4095 MB
> node 5 free: 61 MB 
> node 6 cpus:
> node 6 size: 22750 MB
> node 6 free: 18716 MB
> 
> The cause is that kswapd loops for a long time even if no progress is made
> in balance_pgdat.

How does this solve anything? If kswapd just backs off, then more
work has to be done in the direct reclaim context.

> Signed-off-by: Jia He <hejianet@gmail.com>
> ---
>  mm/vmscan.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 532a2a7..7396a0a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -59,6 +59,7 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/vmscan.h>
>  
> +#define MAX_KSWAPD_RECLAIM_RETRIES 16
>  struct scan_control {
>  	/* How many pages shrink_list() should reclaim */
>  	unsigned long nr_to_reclaim;
> @@ -3202,7 +3203,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
>   * or lower is eligible for reclaim until at least one usable zone is
>   * balanced.
>   */
> -static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx,
> +						 int *did_some_progress)
>  {
>  	int i;
>  	unsigned long nr_soft_reclaimed;
> @@ -3322,6 +3324,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  	 * entered the allocator slow path while kswapd was awake, order will
>  	 * remain at the higher level.
>  	 */
> +	*did_some_progress = !!(sc.nr_scanned || sc.nr_reclaimed);
>  	return sc.order;
>  }
>  
> @@ -3417,6 +3420,8 @@ static int kswapd(void *p)
>  	unsigned int alloc_order, reclaim_order, classzone_idx;
>  	pg_data_t *pgdat = (pg_data_t*)p;
>  	struct task_struct *tsk = current;
> +	int no_progress_loops = 0;
> +	int did_some_progress = 0;
>  
>  	struct reclaim_state reclaim_state = {
>  		.reclaimed_slab = 0,
> @@ -3480,9 +3485,23 @@ static int kswapd(void *p)
>  		 */
>  		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
>  						alloc_order);
> -		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> -		if (reclaim_order < alloc_order)
> +		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx,
> +						&did_some_progress);
> +
> +		if (reclaim_order < alloc_order) {
> +			no_progress_loops = 0;
>  			goto kswapd_try_sleep;
> +		}
> +
> +		if (did_some_progress)
> +			no_progress_loops = 0;
> +		else
> +			no_progress_loops++;
> +
> +		if (no_progress_loops >= MAX_KSWAPD_RECLAIM_RETRIES) {
> +			no_progress_loops = 0;
> +			goto kswapd_try_sleep;
> +		}
>  
>  		alloc_order = reclaim_order = pgdat->kswapd_order;
>  		classzone_idx = pgdat->kswapd_classzone_idx;
> -- 
> 2.5.5
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC 3/3] mm, vmscan: correct prepare_kswapd_sleep return value
  2017-01-24  7:49 ` [PATCH RFC 3/3] mm, vmscan: correct prepare_kswapd_sleep return value Jia He
@ 2017-01-24 22:01   ` Rik van Riel
  2017-01-25  2:24     ` hejianet
  0 siblings, 1 reply; 12+ messages in thread
From: Rik van Riel @ 2017-01-24 22:01 UTC (permalink / raw)
  To: Jia He, linux-mm, linux-kernel
  Cc: Andrew Morton, Naoya Horiguchi, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim

On Tue, 2017-01-24 at 15:49 +0800, Jia He wrote:
> When there are no reclaimable pages in a zone, let kswapd go to sleep even
> if the zone is not balanced; that is, prepare_kswapd_sleep will return true
> in this case.
> 
> Signed-off-by: Jia He <hejianet@gmail.com>
> ---
>  mm/vmscan.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7396a0a..54445e2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3140,7 +3140,8 @@ static bool prepare_kswapd_sleep(pg_data_t
> *pgdat, int order, int classzone_idx)
>  		if (!managed_zone(zone))
>  			continue;
>  
> -		if (!zone_balanced(zone, order, classzone_idx))
> +		if (!zone_balanced(zone, order, classzone_idx)
> +			&& !zone_reclaimable_pages(zone))
>  			return false;
>  	}

This patch does the opposite of what your changelog
says.  The above keeps kswapd running forever if
the zone is not balanced, and there are no reclaimable
pages.


* Re: [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage
  2017-01-24 16:46 ` [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage Michal Hocko
@ 2017-01-25  2:13   ` hejianet
  0 siblings, 0 replies; 12+ messages in thread
From: hejianet @ 2017-01-25  2:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Naoya Horiguchi,
	Mike Kravetz, Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

Hi Michal,
Thanks for the comments. I will resend the patch as per your
comments after my two-week vacation.

B.R.
Jia

On 25/01/2017 12:46 AM, Michal Hocko wrote:
> On Tue 24-01-17 15:49:01, Jia He wrote:
>> If there is a server with uneven numa memory layout:
>> available: 7 nodes (0-6)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 6603 MB
>> node 0 free: 91 MB
>> node 1 cpus:
>> node 1 size: 12527 MB
>> node 1 free: 157 MB
>> node 2 cpus:
>> node 2 size: 15087 MB
>> node 2 free: 189 MB
>> node 3 cpus:
>> node 3 size: 16111 MB
>> node 3 free: 205 MB
>> node 4 cpus: 8 9 10 11 12 13 14 15
>> node 4 size: 24815 MB
>> node 4 free: 310 MB
>> node 5 cpus:
>> node 5 size: 4095 MB
>> node 5 free: 61 MB
>> node 6 cpus:
>> node 6 size: 22750 MB
>> node 6 free: 283 MB
>> node distances:
>> node   0   1   2   3   4   5   6
>>   0:  10  20  40  40  40  40  40
>>   1:  20  10  40  40  40  40  40
>>   2:  40  40  10  20  40  40  40
>>   3:  40  40  20  10  40  40  40
>>   4:  40  40  40  40  10  20  40
>>   5:  40  40  40  40  20  10  40
>>   6:  40  40  40  40  40  40  10
>>
>> In this case node 5 has less memory, and the hugepages will be allocated
>> from these nodes one by one after we trigger:
>> echo 4000 > /proc/sys/vm/nr_hugepages
>>
>> Then kswapd5 will take 100% CPU for a long time. This is a livelock
>> issue in kswapd. This patch set fixes it.
>
> It would be really helpful to describe what the issue is and whether it
> is specific to the configuration above. Also, a high-level overview of the
> fix and why it is the right approach would be appreciated.
>
>> The 3rd patch improves kswapd's performance significantly.
>
> Numbers?
>
>> Jia He (3):
>>   mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path
>>   mm, vmscan: limit kswapd loop if no progress is made
>>   mm, vmscan: correct prepare_kswapd_sleep return value
>>
>>  mm/hugetlb.c |  9 +++++++++
>>  mm/vmscan.c  | 28 ++++++++++++++++++++++++----
>>  2 files changed, 33 insertions(+), 4 deletions(-)
>>
>> --
>> 2.5.5
>>
>


* Re: [PATCH RFC 3/3] mm, vmscan: correct prepare_kswapd_sleep return value
  2017-01-24 22:01   ` Rik van Riel
@ 2017-01-25  2:24     ` hejianet
  0 siblings, 0 replies; 12+ messages in thread
From: hejianet @ 2017-01-25  2:24 UTC (permalink / raw)
  To: Rik van Riel, linux-mm, linux-kernel
  Cc: Andrew Morton, Naoya Horiguchi, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim



On 25/01/2017 6:01 AM, Rik van Riel wrote:
> On Tue, 2017-01-24 at 15:49 +0800, Jia He wrote:
>> When there are no reclaimable pages in a zone, let kswapd go to sleep even
>> if the zone is not balanced; that is, prepare_kswapd_sleep will return true
>> in this case.
>>
>> Signed-off-by: Jia He <hejianet@gmail.com>
>> ---
>>  mm/vmscan.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 7396a0a..54445e2 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3140,7 +3140,8 @@ static bool prepare_kswapd_sleep(pg_data_t
>> *pgdat, int order, int classzone_idx)
>>  		if (!managed_zone(zone))
>>  			continue;
>>
>> -		if (!zone_balanced(zone, order, classzone_idx))
>> +		if (!zone_balanced(zone, order, classzone_idx)
>> +			&& !zone_reclaimable_pages(zone))
>>  			return false;
>>  	}
>
> This patch does the opposite of what your changelog
> says.  The above keeps kswapd running forever if
> the zone is not balanced, and there are no reclaimable
> pages.
Sorry for the mistake, I will check what happened.
I tested it on my local system.

B.R.
Jia
>
>


* Re: [PATCH RFC 2/3] mm, vmscan: limit kswapd loop if no progress is made
  2017-01-24 16:54   ` Michal Hocko
@ 2017-01-25  3:03     ` hejianet
  2017-01-25  9:34       ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: hejianet @ 2017-01-25  3:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Naoya Horiguchi,
	Mike Kravetz, Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel



On 25/01/2017 12:54 AM, Michal Hocko wrote:
> On Tue 24-01-17 15:49:03, Jia He wrote:
>> Currently there is no hard limit on kswapd retry attempts when no progress
>> is made.
>
> Yes, because the main objective of the kswapd is to balance all memory
> zones. So having a hard limit on retries doesn't make any sense.
>
But do you think that even when no progress is being made, kswapd still
needs to run and take CPU uselessly?

>> Then kswapd will take 100% CPU for a long time.
>
> Where is it spending time?
I've watched kswapd take 100% CPU for a whole night.

>
>> In my test, I tried to allocate 4000 hugepages by:
>> echo 4000 > /proc/sys/vm/nr_hugepages
>>
>> Then kswapd will take 100% CPU for a long time.
>>
>> The numa layout is:
>> available: 7 nodes (0-6)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 6611 MB
>> node 0 free: 1103 MB
>> node 1 cpus:
>> node 1 size: 12527 MB
>> node 1 free: 8477 MB
>> node 2 cpus:
>> node 2 size: 15087 MB
>> node 2 free: 11037 MB
>> node 3 cpus:
>> node 3 size: 16111 MB
>> node 3 free: 12060 MB
>> node 4 cpus: 8 9 10 11 12 13 14 15
>> node 4 size: 24815 MB
>> node 4 free: 20704 MB
>> node 5 cpus:
>> node 5 size: 4095 MB
>> node 5 free: 61 MB
>> node 6 cpus:
>> node 6 size: 22750 MB
>> node 6 free: 18716 MB
>>
>> The cause is that kswapd loops for a long time even if no progress is made
>> in balance_pgdat.
>
> How does this solve anything? If kswapd just backs off, then more
> work has to be done in the direct reclaim context.
What if there is still no progress in the direct reclaim context?

B.R.
Jia
>
>> Signed-off-by: Jia He <hejianet@gmail.com>
>> ---
>>  mm/vmscan.c | 25 ++++++++++++++++++++++---
>>  1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 532a2a7..7396a0a 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -59,6 +59,7 @@
>>  #define CREATE_TRACE_POINTS
>>  #include <trace/events/vmscan.h>
>>
>> +#define MAX_KSWAPD_RECLAIM_RETRIES 16
>>  struct scan_control {
>>  	/* How many pages shrink_list() should reclaim */
>>  	unsigned long nr_to_reclaim;
>> @@ -3202,7 +3203,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
>>   * or lower is eligible for reclaim until at least one usable zone is
>>   * balanced.
>>   */
>> -static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>> +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx,
>> +						 int *did_some_progress)
>>  {
>>  	int i;
>>  	unsigned long nr_soft_reclaimed;
>> @@ -3322,6 +3324,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>>  	 * entered the allocator slow path while kswapd was awake, order will
>>  	 * remain at the higher level.
>>  	 */
>> +	*did_some_progress = !!(sc.nr_scanned || sc.nr_reclaimed);
>>  	return sc.order;
>>  }
>>
>> @@ -3417,6 +3420,8 @@ static int kswapd(void *p)
>>  	unsigned int alloc_order, reclaim_order, classzone_idx;
>>  	pg_data_t *pgdat = (pg_data_t*)p;
>>  	struct task_struct *tsk = current;
>> +	int no_progress_loops = 0;
>> +	int did_some_progress = 0;
>>
>>  	struct reclaim_state reclaim_state = {
>>  		.reclaimed_slab = 0,
>> @@ -3480,9 +3485,23 @@ static int kswapd(void *p)
>>  		 */
>>  		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
>>  						alloc_order);
>> -		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
>> -		if (reclaim_order < alloc_order)
>> +		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx,
>> +						&did_some_progress);
>> +
>> +		if (reclaim_order < alloc_order) {
>> +			no_progress_loops = 0;
>>  			goto kswapd_try_sleep;
>> +		}
>> +
>> +		if (did_some_progress)
>> +			no_progress_loops = 0;
>> +		else
>> +			no_progress_loops++;
>> +
>> +		if (no_progress_loops >= MAX_KSWAPD_RECLAIM_RETRIES) {
>> +			no_progress_loops = 0;
>> +			goto kswapd_try_sleep;
>> +		}
>>
>>  		alloc_order = reclaim_order = pgdat->kswapd_order;
>>  		classzone_idx = pgdat->kswapd_classzone_idx;
>> --
>> 2.5.5
>>
>


* Re: [PATCH RFC 2/3] mm, vmscan: limit kswapd loop if no progress is made
  2017-01-25  3:03     ` hejianet
@ 2017-01-25  9:34       ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-01-25  9:34 UTC (permalink / raw)
  To: hejianet
  Cc: linux-mm, linux-kernel, Andrew Morton, Naoya Horiguchi,
	Mike Kravetz, Aneesh Kumar K.V, Gerald Schaefer, zhong jiang,
	Kirill A. Shutemov, Vaishali Thakkar, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

On Wed 25-01-17 11:03:53, hejianet wrote:
> 
> 
> On 25/01/2017 12:54 AM, Michal Hocko wrote:
> > On Tue 24-01-17 15:49:03, Jia He wrote:
> > > Currently there is no hard limit on kswapd retry attempts when no progress
> > > is made.
> > 
> > Yes, because the main objective of the kswapd is to balance all memory
> > zones. So having a hard limit on retries doesn't make any sense.
> > 
> But do you think that even when no progress is being made, kswapd still
> needs to run and take CPU uselessly?

The question is whether we can get into such a state during reasonable
workloads. So far you haven't explained what you are seeing and on which
kernel version.
 
> > > Then kswapd will take 100% CPU for a long time.
> > 
> > Where is it spending time?
> I've watched kswapd take 100% CPU for a whole night.

I assume it didn't get to sleep because your request has consumed enough
memory for hugetlb pages to get below watermarks which would keep kswapd
active. Is that correct?

> > > In my test, I tried to allocate 4000 hugepages by:
> > > echo 4000 > /proc/sys/vm/nr_hugepages
> > > 
> > > Then kswapd will take 100% CPU for a long time.
> > > 
> > > The numa layout is:
> > > available: 7 nodes (0-6)
> > > node 0 cpus: 0 1 2 3 4 5 6 7
> > > node 0 size: 6611 MB
> > > node 0 free: 1103 MB
> > > node 1 cpus:
> > > node 1 size: 12527 MB
> > > node 1 free: 8477 MB
> > > node 2 cpus:
> > > node 2 size: 15087 MB
> > > node 2 free: 11037 MB
> > > node 3 cpus:
> > > node 3 size: 16111 MB
> > > node 3 free: 12060 MB
> > > node 4 cpus: 8 9 10 11 12 13 14 15
> > > node 4 size: 24815 MB
> > > node 4 free: 20704 MB
> > > node 5 cpus:
> > > node 5 size: 4095 MB
> > > node 5 free: 61 MB
> > > node 6 cpus:
> > > node 6 size: 22750 MB
> > > node 6 free: 18716 MB
> > > 
> > > The cause is that kswapd loops for a long time even if no progress is made
> > > in balance_pgdat.
> > 
> > How does this solve anything? If kswapd just backs off, then more
> > work has to be done in the direct reclaim context.
> What if there is still no progress in the direct reclaim context?

Then we trigger the OOM killer when applicable.
-- 
Michal Hocko
SUSE Labs

