* [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages
@ 2017-02-24  6:49 Jia He
  2017-02-24  8:49 ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Jia He @ 2017-02-24  6:49 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Johannes Weiner, Mel Gorman,
	Vlastimil Babka, Michal Hocko, Minchan Kim, Rik van Riel, Jia He

On a NUMA server, the topology looks like:
available: 3 nodes (0,2-3)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 15299 MB
node 2 free: 289 MB
node 3 cpus:
node 3 size: 15336 MB
node 3 free: 184 MB
node distances:
node   0   2   3
  0:  10  40  40
  2:  40  10  20
  3:  40  20  10
 
When I try to dynamically allocate more hugepages than the system's total
free memory, e.g.:
echo 4000 > /proc/sys/vm/nr_hugepages
 
kswapd then takes 100% CPU for a long time (more than 3 hours, with no sign
of stopping).
top result:
top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3 

The root cause is that kswapd3 is woken up and then tries to reclaim again
and again, but it makes no progress. In the end, fewer than 4000 hugepages
have been allocated:
HugePages_Total:    1864
HugePages_Free:     1864
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB
  
At that time, even though there are no reclaimable pages left in node 3,
kswapd3 will not go to sleep:
Node 3, zone      DMA
  per-node stats
      nr_inactive_anon 0
      nr_active_anon 0
      nr_inactive_file 0
      nr_active_file 0
      nr_unevictable 0
      nr_isolated_anon 0
      nr_isolated_file 0
      nr_pages_scanned 0
      workingset_refault 0
      workingset_activate 0
      workingset_nodereclaim 0
      nr_anon_pages 0
      nr_mapped    0
      nr_file_pages 0
      nr_dirty     0
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     0
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   0
      nr_written   0
  pages free     2951
        min      2821
        low      3526
        high     4231
   node_scanned  0
        spanned  245760
        present  245760
        managed  245388
      nr_free_pages 2951
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_slab_reclaimable 46
      nr_slab_unreclaimable 90
      nr_page_table_pages 0
      nr_kernel_stack 0
      nr_bounce    0
      nr_zspages   0
      numa_hit     2257
      numa_miss    0
      numa_foreign 0
      numa_interleave 982
      numa_local   0
      numa_other   2257
      nr_free_cma  0
        protection: (0, 0, 0, 0) 
This could be called a misconfiguration, but it seems quite easy to hit on
NUMA machines which have large differences in node sizes.

Furthermore, once the hugepage allocation has consumed most of the memory in
node 3, every allocation slow path may wake up kswapd3 and make things worse:
__alloc_pages_slowpath
    wake_all_kswapds
        wakeup_kswapd

This patch resolves the issue in two ways (a condensed sketch of the checks
follows this list):
1. In prepare_kswapd_sleep(), kswapd only keeps reclaiming instead of going
   to sleep when a zone is both unbalanced and still has reclaimable pages.
2. Don't wake up kswapd if there are no reclaimable pages in the node.
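
A condensed sketch of the two checks (this paraphrases the diff below rather
than reproducing it literally; zone_balanced() and zone_reclaimable_pages()
are the existing vmscan helpers):

	/* 1. prepare_kswapd_sleep(): only an unbalanced zone that still has
	 *    reclaimable pages keeps kswapd from going to sleep.
	 */
	if (!zone_balanced(zone, order, classzone_idx) &&
	    zone_reclaimable_pages(zone))
		return false;

	/* 2. wakeup_kswapd(): skip the wakeup entirely when no zone in the
	 *    node has reclaimable pages left.
	 */
	if (!node_has_reclaimable_pages)
		return;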

After this patch:
top - 07:29:43 up 3 min,  1 user,  load average: 0.12, 0.13, 0.06
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 97.8 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total,   938112 used, 30433408 free,     5504 buffers
KiB Swap:  6284224 total,        0 used,  6284224 free.   632448 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
   78 root      20   0       0      0      0 S 0.000 0.000   0:00.00 kswapd3    

Changes:
V2: - fix an incorrect condition for the assignment of node_has_reclaimable_pages
    - improve the commit description

Signed-off-by: Jia He <hejianet@gmail.com>
---
 mm/vmscan.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 532a2a7..7c5a563 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3139,7 +3139,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;
 
-		if (!zone_balanced(zone, order, classzone_idx))
+		if (!zone_balanced(zone, order, classzone_idx) &&
+		    zone_reclaimable_pages(zone))
 			return false;
 	}
 
@@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
 	int z;
+	int node_has_reclaimable_pages = 0;
 
 	if (!managed_zone(zone))
 		return;
@@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 
 		if (zone_balanced(zone, order, classzone_idx))
 			return;
+
+		if (zone_reclaimable_pages(zone))
+			node_has_reclaimable_pages = 1;
 	}
 
+	/* Don't wake kswapd if no zone in this node has reclaimable pages */
+	if (!node_has_reclaimable_pages)
+		return;
+
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
-- 
1.8.5.6


* Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages
  2017-02-24  6:49 [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages Jia He
@ 2017-02-24  8:49 ` Michal Hocko
  2017-02-24 16:51   ` Johannes Weiner
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2017-02-24  8:49 UTC (permalink / raw)
  To: Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

On Fri 24-02-17 14:49:52, Jia He wrote:
> On a NUMA server, the topology looks like:
> available: 3 nodes (0,2-3)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 0 1 2 3 4 5 6 7
> node 2 size: 15299 MB
> node 2 free: 289 MB
> node 3 cpus:
> node 3 size: 15336 MB
> node 3 free: 184 MB
> node distances:
> node   0   2   3
>   0:  10  40  40
>   2:  40  10  20
>   3:  40  20  10
>  
> When I try to dynamically allocate more hugepages than the system's total
> free memory, e.g.:
> echo 4000 > /proc/sys/vm/nr_hugepages
> 
> kswapd then takes 100% CPU for a long time (more than 3 hours, with no sign
> of stopping).
> top result:
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3 
> 
> The root cause is that kswapd3 is woken up and then tries to reclaim again
> and again, but it makes no progress. In the end, fewer than 4000 hugepages
> have been allocated:
> HugePages_Total:    1864
> HugePages_Free:     1864
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:      16384 kB
>   
> At that time, even though there are no reclaimable pages left in node 3,
> kswapd3 will not go to sleep:
> Node 3, zone      DMA
>   per-node stats
>       nr_inactive_anon 0
>       nr_active_anon 0
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_unevictable 0
>       nr_isolated_anon 0
>       nr_isolated_file 0
>       nr_pages_scanned 0
>       workingset_refault 0
>       workingset_activate 0
>       workingset_nodereclaim 0
>       nr_anon_pages 0
>       nr_mapped    0
>       nr_file_pages 0
>       nr_dirty     0
>       nr_writeback 0
>       nr_writeback_temp 0
>       nr_shmem     0
>       nr_shmem_hugepages 0
>       nr_shmem_pmdmapped 0
>       nr_anon_transparent_hugepages 0
>       nr_unstable  0
>       nr_vmscan_write 0
>       nr_vmscan_immediate_reclaim 0
>       nr_dirtied   0
>       nr_written   0
>   pages free     2951
>         min      2821
>         low      3526
>         high     4231
>    node_scanned  0
>         spanned  245760
>         present  245760
>         managed  245388
>       nr_free_pages 2951
>       nr_zone_inactive_anon 0
>       nr_zone_active_anon 0
>       nr_zone_inactive_file 0
>       nr_zone_active_file 0
>       nr_zone_unevictable 0
>       nr_zone_write_pending 0
>       nr_mlock     0
>       nr_slab_reclaimable 46
>       nr_slab_unreclaimable 90
>       nr_page_table_pages 0
>       nr_kernel_stack 0
>       nr_bounce    0
>       nr_zspages   0
>       numa_hit     2257
>       numa_miss    0
>       numa_foreign 0
>       numa_interleave 982
>       numa_local   0
>       numa_other   2257
>       nr_free_cma  0
>         protection: (0, 0, 0, 0) 
> This could be called a misconfiguration, but it seems quite easy to hit on
> NUMA machines which have large differences in node sizes.
> 
> Furthermore, once the hugepage allocation has consumed most of the memory in
> node 3, every allocation slow path may wake up kswapd3 and make things worse:
> __alloc_pages_slowpath
>     wake_all_kswapds
>         wakeup_kswapd
> 
> This patch resolves the issue in two ways:
> 1. In prepare_kswapd_sleep(), kswapd only keeps reclaiming instead of going
>    to sleep when a zone is both unbalanced and still has reclaimable pages.
> 2. Don't wake up kswapd if there are no reclaimable pages in the node.
> 
> After this patch:
> top - 07:29:43 up 3 min,  1 user,  load average: 0.12, 0.13, 0.06
> Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 97.8 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total,   938112 used, 30433408 free,     5504 buffers
> KiB Swap:  6284224 total,        0 used,  6284224 free.   632448 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
>    78 root      20   0       0      0      0 S 0.000 0.000   0:00.00 kswapd3    
> 
> Changes:
> V2: - fix an incorrect condition for the assignment of node_has_reclaimable_pages
>     - improve the commit description

I believe we should pursue the proposal from Johannes which is more
generic and copes with corner cases much better.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages
  2017-02-24  8:49 ` Michal Hocko
@ 2017-02-24 16:51   ` Johannes Weiner
  2017-02-27  6:04     ` hejianet
  2017-02-27  8:50     ` Michal Hocko
  0 siblings, 2 replies; 8+ messages in thread
From: Johannes Weiner @ 2017-02-24 16:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

On Fri, Feb 24, 2017 at 09:49:50AM +0100, Michal Hocko wrote:
> I believe we should pursue the proposal from Johannes which is more
> generic and copes with corner cases much better.

Jia, can you try this? I'll put the cleanups in follow-up patches.

---

From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 24 Feb 2017 10:56:32 -0500
Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
kswapd had such a mechanism. It considered zones whose theoretically
reclaimable pages it had reclaimed six times over as unreclaimable and
backed away from them. This guard was erroneously removed as the patch
changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.

Reported-by: Jia He <hejianet@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  2 ++
 include/linux/swap.h   |  1 +
 mm/page_alloc.c        |  6 ------
 mm/vmscan.c            | 20 ++++++++++++++++++++
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e02b3750fe0..d2c50ab6ae40 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -630,6 +630,8 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 45e91dd6716d..5c06581a730b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -288,6 +288,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+#define MAX_RECLAIM_RETRIES 16
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 614cd0397ce3..83f0442f07fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 }
 
 /*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
-/*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..8e9bdd172182 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
+	/*
+	 * Kswapd gives up on balancing particular nodes after too
+	 * many failures to reclaim anything from them. If reclaim
+	 * progress happens, reset the failure counter. A successful
+	 * direct reclaim run will knock a stuck kswapd loose again.
+	 */
+	if (reclaimable)
+		pgdat->kswapd_failures = 0;
+
 	return reclaimable;
 }
 
@@ -3134,6 +3143,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -3316,6 +3329,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
+	if (!sc.nr_reclaimed)
+		pgdat->kswapd_failures++;
+
 out:
 	/*
 	 * Return the order kswapd stopped reclaiming at as
@@ -3515,6 +3531,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return;
+
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
-- 
2.11.1


* Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages
  2017-02-24 16:51   ` Johannes Weiner
@ 2017-02-27  6:04     ` hejianet
  2017-02-27  8:50     ` Michal Hocko
  1 sibling, 0 replies; 8+ messages in thread
From: hejianet @ 2017-02-27  6:04 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel


Hi
Tested-by: Jia He <hejianet@gmail.com>

cat /proc/meminfo
[...]
CmaFree:               0 kB
HugePages_Total:    1831
HugePages_Free:     1831
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

top - 06:50:29 up  1:26,  1 user,  load average: 0.00, 0.00, 0.00
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 99.6 id,  0.2 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30577664 used,   793856 free,      256 buffers
KiB Swap:  6284224 total,      128 used,  6284096 free.   281280 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    79 root      20   0       0      0      0 S 0.000 0.000   0:00.00 kswapd3


On 25/02/2017 12:51 AM, Johannes Weiner wrote:
> On Fri, Feb 24, 2017 at 09:49:50AM +0100, Michal Hocko wrote:
>> I believe we should pursue the proposal from Johannes which is more
>> generic and copes with corner cases much better.
>
> Jia, can you try this? I'll put the cleanups in follow-up patches.
>
> ---
>
> From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 24 Feb 2017 10:56:32 -0500
> Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes
>
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
>
> $ echo 4000 >/proc/sys/vm/nr_hugepages
>
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
>
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
>
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
>
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
>
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
>
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h |  2 ++
>  include/linux/swap.h   |  1 +
>  mm/page_alloc.c        |  6 ------
>  mm/vmscan.c            | 20 ++++++++++++++++++++
>  4 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>
> +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
>  	enum zone_type kcompactd_classzone_idx;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 45e91dd6716d..5c06581a730b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -288,6 +288,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  						struct vm_area_struct *vma);
>
>  /* linux/mm/vmscan.c */
> +#define MAX_RECLAIM_RETRIES 16
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..83f0442f07fa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..8e9bdd172182 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>
> +	/*
> +	 * Kswapd gives up on balancing particular nodes after too
> +	 * many failures to reclaim anything from them. If reclaim
> +	 * progress happens, reset the failure counter. A successful
> +	 * direct reclaim run will knock a stuck kswapd loose again.
> +	 */
> +	if (reclaimable)
> +		pgdat->kswapd_failures = 0;
> +
>  	return reclaimable;
>  }
>
> @@ -3134,6 +3143,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>
> +	/* Hopeless node, leave it to direct reclaim */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
>  	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>
> @@ -3316,6 +3329,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>
> +	if (!sc.nr_reclaimed)
> +		pgdat->kswapd_failures++;
> +
>  out:
>  	/*
>  	 * Return the order kswapd stopped reclaiming at as
> @@ -3515,6 +3531,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
>
> +	/* Hopeless node, leave it to direct reclaim */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return;
> +
>  	/* Only wake kswapd if all zones are unbalanced */
>  	for (z = 0; z <= classzone_idx; z++) {
>  		zone = pgdat->node_zones + z;
>


* Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages
  2017-02-24 16:51   ` Johannes Weiner
  2017-02-27  6:04     ` hejianet
@ 2017-02-27  8:50     ` Michal Hocko
  2017-02-27 17:06       ` Johannes Weiner
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2017-02-27  8:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

On Fri 24-02-17 11:51:05, Johannes Weiner wrote:
[...]
> From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 24 Feb 2017 10:56:32 -0500
> Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes
> 
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.

Yes, this looks nice & simple. I would just be a bit worried about [1].
Maybe that is worth a separate patch, though.

[1] http://lkml.kernel.org/r/20170223111609.hlncnvokhq3quxwz@dhcp22.suse.cz
 
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

I would have just one more suggestion: please move MAX_RECLAIM_RETRIES to
mm/internal.h. This is an MM-internal thing and there is no need to make it
visible outside of mm/.
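
Roughly like this (just an illustration of where the definition would live;
the comment is the one being removed from page_alloc.c above, with its
wording cleaned up):

	/* mm/internal.h */

	/*
	 * Maximum number of reclaim retries without any progress before the
	 * OOM killer is considered the only way to move forward.
	 */
	#define MAX_RECLAIM_RETRIES 16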

> ---
>  include/linux/mmzone.h |  2 ++
>  include/linux/swap.h   |  1 +
>  mm/page_alloc.c        |  6 ------
>  mm/vmscan.c            | 20 ++++++++++++++++++++
>  4 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>  
> +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
>  	enum zone_type kcompactd_classzone_idx;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 45e91dd6716d..5c06581a730b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -288,6 +288,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  						struct vm_area_struct *vma);
>  
>  /* linux/mm/vmscan.c */
> +#define MAX_RECLAIM_RETRIES 16
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..83f0442f07fa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>  
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..8e9bdd172182 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>  
> +	/*
> +	 * Kswapd gives up on balancing particular nodes after too
> +	 * many failures to reclaim anything from them. If reclaim
> +	 * progress happens, reset the failure counter. A successful
> +	 * direct reclaim run will knock a stuck kswapd loose again.
> +	 */
> +	if (reclaimable)
> +		pgdat->kswapd_failures = 0;
> +
>  	return reclaimable;
>  }
>  
> @@ -3134,6 +3143,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>  
> +	/* Hopeless node, leave it to direct reclaim */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
>  	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>  
> @@ -3316,6 +3329,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>  
> +	if (!sc.nr_reclaimed)
> +		pgdat->kswapd_failures++;
> +
>  out:
>  	/*
>  	 * Return the order kswapd stopped reclaiming at as
> @@ -3515,6 +3531,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
>  
> +	/* Hopeless node, leave it to direct reclaim */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return;
> +
>  	/* Only wake kswapd if all zones are unbalanced */
>  	for (z = 0; z <= classzone_idx; z++) {
>  		zone = pgdat->node_zones + z;
> -- 
> 2.11.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages
  2017-02-27  8:50     ` Michal Hocko
@ 2017-02-27 17:06       ` Johannes Weiner
  2017-02-27 17:29         ` Michal Hocko
  2017-02-28  1:53         ` hejianet
  0 siblings, 2 replies; 8+ messages in thread
From: Johannes Weiner @ 2017-02-27 17:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

On Mon, Feb 27, 2017 at 09:50:24AM +0100, Michal Hocko wrote:
> On Fri 24-02-17 11:51:05, Johannes Weiner wrote:
> [...]
> > From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Fri, 24 Feb 2017 10:56:32 -0500
> > Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes
> > 
> > Jia He reports a problem with kswapd spinning at 100% CPU when
> > requesting more hugepages than memory available in the system:
> > 
> > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > 
> > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> > KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> > 
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> > 
> > At that time, there are no reclaimable pages left in the node, but as
> > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > 
> > Kswapd needs to back away from nodes that fail to balance. Up until
> > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > kswapd had such a mechanism. It considered zones whose theoretically
> > reclaimable pages it had reclaimed six times over as unreclaimable and
> > backed away from them. This guard was erroneously removed as the patch
> > changed the definition of a balanced node.
> > 
> > However, simply restoring this code wouldn't help in the case reported
> > here: there *are* no reclaimable pages that could be scanned until the
> > threshold is met. Kswapd would stay awake anyway.
> > 
> > Introduce a new and much simpler way of backing off. If kswapd runs
> > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > page, make it back off from the node. This is the same number of shots
> > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > that node until a direct reclaimer manages to reclaim some pages, thus
> > proving the node reclaimable again.
> 
> Yes, this looks nice & simple. I would just be a bit worried about [1].
> Maybe that is worth a separate patch, though.
> 
> [1] http://lkml.kernel.org/r/20170223111609.hlncnvokhq3quxwz@dhcp22.suse.cz

I think I'd prefer the simplicity of keeping this contained inside
vmscan.c, as an interaction between direct reclaimers and kswapd, as
well as leaving the wakeup tied to actually seeing reclaimable pages
rather than merely producing free pages (should we also add a kick to a
large munmap(), for example?).

OOM kills come with such high latencies that I cannot imagine a
slightly quicker kswapd restart would matter in practice.

> > Reported-by: Jia He <hejianet@gmail.com>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> I would have just one more suggestion: please move MAX_RECLAIM_RETRIES to
> mm/internal.h. This is an MM-internal thing and there is no need to make it
> visible outside of mm/.

Good point, I'll move it.


* Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages
  2017-02-27 17:06       ` Johannes Weiner
@ 2017-02-27 17:29         ` Michal Hocko
  2017-02-28  1:53         ` hejianet
  1 sibling, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2017-02-27 17:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

On Mon 27-02-17 12:06:34, Johannes Weiner wrote:
> On Mon, Feb 27, 2017 at 09:50:24AM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 11:51:05, Johannes Weiner wrote:
> > [...]
> > > From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Date: Fri, 24 Feb 2017 10:56:32 -0500
> > > Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes
> > > 
> > > Jia He reports a problem with kswapd spinning at 100% CPU when
> > > requesting more hugepages than memory available in the system:
> > > 
> > > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > > 
> > > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > > KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> > > KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> > > 
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > >    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> > > 
> > > At that time, there are no reclaimable pages left in the node, but as
> > > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > > 
> > > Kswapd needs to back away from nodes that fail to balance. Up until
> > > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > > kswapd had such a mechanism. It considered zones whose theoretically
> > > reclaimable pages it had reclaimed six times over as unreclaimable and
> > > backed away from them. This guard was erroneously removed as the patch
> > > changed the definition of a balanced node.
> > > 
> > > However, simply restoring this code wouldn't help in the case reported
> > > here: there *are* no reclaimable pages that could be scanned until the
> > > threshold is met. Kswapd would stay awake anyway.
> > > 
> > > Introduce a new and much simpler way of backing off. If kswapd runs
> > > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > > page, make it back off from the node. This is the same number of shots
> > > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > > that node until a direct reclaimer manages to reclaim some pages, thus
> > > proving the node reclaimable again.
> > 
> > Yes, this looks nice & simple. I would just be a bit worried about [1].
> > Maybe that is worth a separate patch, though.
> > 
> > [1] http://lkml.kernel.org/r/20170223111609.hlncnvokhq3quxwz@dhcp22.suse.cz
> 
> I think I'd prefer the simplicity of keeping this contained inside
> vmscan.c, as an interaction between direct reclaimers and kswapd, as
> well as leaving the wakeup tied to actually seeing reclaimable pages
> rather than merely producing free pages (should we also add a kick to a
> large munmap(), for example?).

OK, that is a good point as well. I was about to argue that a runaway
mlock process killed by the OOM killer should restart kswapd, otherwise
the subsequent behaviour would be quite surprising. But you are right
that there are other sources of large amounts of free pages, so let's
keep it simple for now; something based on freed pages can always be
added later if it turns out to be needed.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages
  2017-02-27 17:06       ` Johannes Weiner
  2017-02-27 17:29         ` Michal Hocko
@ 2017-02-28  1:53         ` hejianet
  1 sibling, 0 replies; 8+ messages in thread
From: hejianet @ 2017-02-28  1:53 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

Hi Johannes

I have another concern:
kswapd -> balance_pgdat -> age_active_anon
This code path does some background work to age the anon lists. Will this
patch have any impact on that if the failure count goes above 16 and kswapd
is no longer woken up?

B.R.
Jia

On 28/02/2017 1:06 AM, Johannes Weiner wrote:
> On Mon, Feb 27, 2017 at 09:50:24AM +0100, Michal Hocko wrote:
>> On Fri 24-02-17 11:51:05, Johannes Weiner wrote:
>> [...]
>>> From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
>>> From: Johannes Weiner <hannes@cmpxchg.org>
>>> Date: Fri, 24 Feb 2017 10:56:32 -0500
>>> Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes
>>>
>>> Jia He reports a problem with kswapd spinning at 100% CPU when
>>> requesting more hugepages than memory available in the system:
>>>
>>> $ echo 4000 >/proc/sys/vm/nr_hugepages
>>>
>>> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
>>> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
>>> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
>>> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
>>>
>>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
>>>
>>> At that time, there are no reclaimable pages left in the node, but as
>>> kswapd fails to restore the high watermarks it refuses to go to sleep.
>>>
>>> Kswapd needs to back away from nodes that fail to balance. Up until
>>> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
>>> kswapd had such a mechanism. It considered zones whose theoretically
>>> reclaimable pages it had reclaimed six times over as unreclaimable and
>>> backed away from them. This guard was erroneously removed as the patch
>>> changed the definition of a balanced node.
>>>
>>> However, simply restoring this code wouldn't help in the case reported
>>> here: there *are* no reclaimable pages that could be scanned until the
>>> threshold is met. Kswapd would stay awake anyway.
>>>
>>> Introduce a new and much simpler way of backing off. If kswapd runs
>>> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
>>> page, make it back off from the node. This is the same number of shots
>>> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
>>> that node until a direct reclaimer manages to reclaim some pages, thus
>>> proving the node reclaimable again.
>>
>> Yes, this looks nice & simple. I would just be a bit worried about [1].
>> Maybe that is worth a separate patch, though.
>>
>> [1] http://lkml.kernel.org/r/20170223111609.hlncnvokhq3quxwz@dhcp22.suse.cz
>
> I think I'd prefer the simplicity of keeping this contained inside
> vmscan.c, as an interaction between direct reclaimers and kswapd, as
> well as leaving the wakeup tied to actually seeing reclaimable pages
> rather than merely producing free pages (should we also add a kick to a
> large munmap(), for example?).
>
> OOM kills come with such high latencies that I cannot imagine a
> slightly quicker kswapd restart would matter in practice.
>
>>> Reported-by: Jia He <hejianet@gmail.com>
>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
>
>> I would have just one more suggestion: please move MAX_RECLAIM_RETRIES to
>> mm/internal.h. This is an MM-internal thing and there is no need to make it
>> visible outside of mm/.
>
> Good point, I'll move it.
>

