* [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
@ 2017-02-22  9:04 ` Jia He
  0 siblings, 0 replies; 25+ messages in thread
From: Jia He @ 2017-02-22  9:04 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Johannes Weiner, Mel Gorman,
	Vlastimil Babka, Michal Hocko, Minchan Kim, Rik van Riel, Jia He

When I try to dynamically allocate more hugepages than the system's total
free memory:
e.g. echo 4000 > /proc/sys/vm/nr_hugepages

Then kswapd takes 100% CPU for a long time (more than 3 hours) with no sign
of stopping.
top result:
top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3 

The root cause is that kswapd3 tries to reclaim again and again but makes
no progress:
# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 15299 MB
node 2 free: 289 MB
node 3 cpus:
node 3 size: 15336 MB
node 3 free: 184 MB        <--- kswapd works
node distances:
node   0   2   3 
  0:  10  40  40 
  2:  40  10  20 
  3:  40  20  10 
At that time, there are no reclaimable pages in that node:
Node 3, zone      DMA
  per-node stats
      nr_inactive_anon 0
      nr_active_anon 0
      nr_inactive_file 0
      nr_active_file 0
      nr_unevictable 0
      nr_isolated_anon 0
      nr_isolated_file 0
      nr_pages_scanned 0
      workingset_refault 0
      workingset_activate 0
      workingset_nodereclaim 0
      nr_anon_pages 0
      nr_mapped    0
      nr_file_pages 0
      nr_dirty     0
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     0
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   0
      nr_written   0
  pages free     2951
        min      2821
        low      3526
        high     4231
   node_scanned  0
        spanned  245760
        present  245760
        managed  245388
      nr_free_pages 2951
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_slab_reclaimable 46
      nr_slab_unreclaimable 90
      nr_page_table_pages 0
      nr_kernel_stack 0
      nr_bounce    0
      nr_zspages   0
      numa_hit     2257
      numa_miss    0
      numa_foreign 0
      numa_interleave 982
      numa_local   0
      numa_other   2257
      nr_free_cma  0
        protection: (0, 0, 0, 0) 
  
This patch resolves the issue in two ways:
1. In prepare_kswapd_sleep, kswapd keeps reclaiming instead of sleeping only
   when a zone is both unbalanced and still has reclaimable pages
2. Don't wake up kswapd if there are no reclaimable pages in that node

After this patch:
top - 07:13:40 up 28 min,  1 user,  load average: 0.00, 0.00, 0.00
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 99.9 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30908096 used,   463424 free,      384 buffers
KiB Swap:  6284224 total,    77504 used,  6206720 free.   131328 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
   77 root      20   0       0      0      0 S 0.000 0.000   0:00.00 kswapd3

Signed-off-by: Jia He <hejianet@gmail.com>
---
 mm/vmscan.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 532a2a7..a05e3ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3139,7 +3139,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;
 
-		if (!zone_balanced(zone, order, classzone_idx))
+		if (!zone_balanced(zone, order, classzone_idx)
+			&& zone_reclaimable_pages(zone))
 			return false;
 	}
 
@@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
 	int z;
+	int node_has_relaimable_pages = 0;
 
 	if (!managed_zone(zone))
 		return;
@@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 
 		if (zone_balanced(zone, order, classzone_idx))
 			return;
+
+		if (!zone_reclaimable_pages(zone))
+			node_has_relaimable_pages = 1;
 	}
 
+	/* Dont wake kswapd if no reclaimable pages */
+	if (!node_has_relaimable_pages)
+		return;
+
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
-- 
1.8.5.6

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22  9:04 ` Jia He
@ 2017-02-22 11:41   ` Michal Hocko
  -1 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-02-22 11:41 UTC (permalink / raw)
  To: Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

On Wed 22-02-17 17:04:48, Jia He wrote:
> When I try to dynamically allocate more hugepages than the system's total
> free memory:
> e.g. echo 4000 > /proc/sys/vm/nr_hugepages

I assume that the command has terminated with fewer huge pages allocated
than requested but

> Node 3, zone      DMA
[...]
>   pages free     2951
>         min      2821
>         low      3526
>         high     4231

it left the zone below high watermark with

>    node_scanned  0
>         spanned  245760
>         present  245760
>         managed  245388
>       nr_free_pages 2951
>       nr_zone_inactive_anon 0
>       nr_zone_active_anon 0
>       nr_zone_inactive_file 0
>       nr_zone_active_file 0

no pages reclaimable, so kswapd will not go to sleep. It would be quite
easy and comfortable to call it a misconfiguration but it seems that
it might be quite easy to hit with NUMA machines which have large
differences in the node sizes. I guess it makes sense to back off
the kswapd rather than burning CPU without any way to make forward
progress.
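
(For context, zone_balanced() here is essentially just a high watermark
check - roughly, and from memory, so take it as a sketch rather than the
exact code in this tree:

static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
{
	/* balanced == free pages above the high watermark for this order */
	return zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
				      classzone_idx);
}

so a zone sitting below its high watermark with nothing left to reclaim
keeps failing this check and kswapd is never allowed to sleep.)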

[...]

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 532a2a7..a05e3ab 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3139,7 +3139,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  		if (!managed_zone(zone))
>  			continue;
>  
> -		if (!zone_balanced(zone, order, classzone_idx))
> +		if (!zone_balanced(zone, order, classzone_idx)
> +			&& zone_reclaimable_pages(zone))
>  			return false;

OK, this makes some sense, although zone_reclaimable_pages doesn't count
SLAB reclaimable pages. So we might go to sleep with a reclaimable slab
still around. This is not really easy to address because the reclaimable
slab doesn't really imply that those pages will be reclaimed...
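
(Just for reference - zone_reclaimable_pages() only sums the per-zone LRU
counters, roughly something like the following, again from memory and not
necessarily the exact code here:

unsigned long zone_reclaimable_pages(struct zone *zone)
{
	unsigned long nr;

	/* file LRU pages are always potentially reclaimable */
	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
	     zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);

	/* anon only counts when there is swap to put it on */
	if (get_nr_swap_pages() > 0)
		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
		      zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);

	return nr;
}

so reclaimable slab never shows up in it.)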

>  	}
>  
> @@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  {
>  	pg_data_t *pgdat;
>  	int z;
> +	int node_has_relaimable_pages = 0;
>  
>  	if (!managed_zone(zone))
>  		return;
> @@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  
>  		if (zone_balanced(zone, order, classzone_idx))
>  			return;
> +
> +		if (!zone_reclaimable_pages(zone))
> +			node_has_relaimable_pages = 1;

What, this doesn't make any sense? Did you mean if (zone_reclaimable_pages)?

>  	}
>  
> +	/* Dont wake kswapd if no reclaimable pages */
> +	if (!node_has_relaimable_pages)
> +		return;
> +
>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>  	wake_up_interruptible(&pgdat->kswapd_wait);
>  }
> -- 
> 1.8.5.6
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 11:41   ` Michal Hocko
@ 2017-02-22 14:31     ` hejianet
  -1 siblings, 0 replies; 25+ messages in thread
From: hejianet @ 2017-02-22 14:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

Hi Michal

On 22/02/2017 7:41 PM, Michal Hocko wrote:
> On Wed 22-02-17 17:04:48, Jia He wrote:
>> When I try to dynamically allocate more hugepages than the system's total
>> free memory:
>> e.g. echo 4000 > /proc/sys/vm/nr_hugepages
>
> I assume that the command has terminated with fewer huge pages allocated
> than requested but
>
Yes, in the end fewer than 4000 hugepages were allocated:
HugePages_Total:    1864
HugePages_Free:     1864
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

In the bad case, although kswapd takes 100% CPU, HugePages_Total does not
increase at all.

>> Node 3, zone      DMA
> [...]
>>   pages free     2951
>>         min      2821
>>         low      3526
>>         high     4231
>
> it left the zone below high watermark with
>
>>    node_scanned  0
>>         spanned  245760
>>         present  245760
>>         managed  245388
>>       nr_free_pages 2951
>>       nr_zone_inactive_anon 0
>>       nr_zone_active_anon 0
>>       nr_zone_inactive_file 0
>>       nr_zone_active_file 0
>
> no pages reclaimable, so kswapd will not go to sleep. It would be quite
> easy and comfortable to call it a misconfiguration but it seems that
> it might be quite easy to hit with NUMA machines which have large
> differences in the node sizes. I guess it makes sense to back off
> the kswapd rather than burning CPU without any way to make forward
> progress.
agree.
>
> [...]
>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 532a2a7..a05e3ab 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3139,7 +3139,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>>  		if (!managed_zone(zone))
>>  			continue;
>>
>> -		if (!zone_balanced(zone, order, classzone_idx))
>> +		if (!zone_balanced(zone, order, classzone_idx)
>> +			&& zone_reclaimable_pages(zone))
>>  			return false;
>
> OK, this makes some sense, although zone_reclaimable_pages doesn't count
> SLAB reclaimable pages. So we might go to sleep with a reclaimable slab
> still around. This is not really easy to address because the reclaimable
> slab doesn't really imply that those pages will be reclaimed...
Yes, even in the bad case, when kswapd takes all the CPU, the number of
reclaimable pages does not decrease.
>
>>  	}
>>
>> @@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>>  {
>>  	pg_data_t *pgdat;
>>  	int z;
>> +	int node_has_relaimable_pages = 0;
>>
>>  	if (!managed_zone(zone))
>>  		return;
>> @@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>>
>>  		if (zone_balanced(zone, order, classzone_idx))
>>  			return;
>> +
>> +		if (!zone_reclaimable_pages(zone))
>> +			node_has_relaimable_pages = 1;
>
> What, this doesn't make any sense? Did you mean if (zone_reclaimable_pages)?
I mean, if any one zone has reclaimable pages, then that zone's *node* has
reclaimable pages, so the kswapdN for this node should be woken up.
e.g. node 1 has 2 zones:
zone A has no reclaimable pages but zone B has some.
Then node 1 has reclaimable pages, and kswapd1 will be woken up.
I use node_has_relaimable_pages in the loop to check every zone's number of
reclaimable pages. So I prefer the name node_has_relaimable_pages instead of
zone_has_relaimable_pages.

Did I understand it correctly? Thanks

B.R.
Jia
>
>>  	}
>>
>> +	/* Dont wake kswapd if no reclaimable pages */
>> +	if (!node_has_relaimable_pages)
>> +		return;
>> +
>>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>>  	wake_up_interruptible(&pgdat->kswapd_wait);
>>  }
>> --
>> 1.8.5.6
>>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 14:31     ` hejianet
@ 2017-02-22 15:48       ` Michal Hocko
  -1 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-02-22 15:48 UTC (permalink / raw)
  To: hejianet
  Cc: linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

On Wed 22-02-17 22:31:50, hejianet wrote:
> Hi Michal
> 
> On 22/02/2017 7:41 PM, Michal Hocko wrote:
> > On Wed 22-02-17 17:04:48, Jia He wrote:
> > > When I try to dynamically allocate more hugepages than the system's total
> > > free memory:
> > > e.g. echo 4000 > /proc/sys/vm/nr_hugepages
> > 
> > I assume that the command has terminated with fewer huge pages allocated
> > than requested but
> > 
> Yes, in the end fewer than 4000 hugepages were allocated:
> HugePages_Total:    1864
> HugePages_Free:     1864
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:      16384 kB
> 
> In the bad case, although kswapd takes 100% CPU, HugePages_Total does not
> increase at all.
> 
> > > Node 3, zone      DMA
> > [...]
> > >   pages free     2951
> > >         min      2821
> > >         low      3526
> > >         high     4231
> > 
> > it left the zone below high watermark with
> > 
> > >    node_scanned  0
> > >         spanned  245760
> > >         present  245760
> > >         managed  245388
> > >       nr_free_pages 2951
> > >       nr_zone_inactive_anon 0
> > >       nr_zone_active_anon 0
> > >       nr_zone_inactive_file 0
> > >       nr_zone_active_file 0
> > 
> > no pages reclaimable, so kswapd will not go to sleep. It would be quite
> > easy and comfortable to call it a misconfiguration but it seems that
> > it might be quite easy to hit with NUMA machines which have large
> > differences in the node sizes. I guess it makes sense to back off
> > the kswapd rather than burning CPU without any way to make forward
> > progress.
>
> agree.

please make sure that this information is in the changelog

[...]
> > > @@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
> > >  {
> > >  	pg_data_t *pgdat;
> > >  	int z;
> > > +	int node_has_relaimable_pages = 0;
> > > 
> > >  	if (!managed_zone(zone))
> > >  		return;
> > > @@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
> > > 
> > >  		if (zone_balanced(zone, order, classzone_idx))
> > >  			return;
> > > +
> > > +		if (!zone_reclaimable_pages(zone))
> > > +			node_has_relaimable_pages = 1;
> > 
> > What, this doesn't make any sense? Did you mean if (zone_reclaimable_pages)?
>
> I mean, if any one zone has reclaimable pages, then that zone's *node* has
> reclaimable pages, so the kswapdN for this node should be woken up.
> e.g. node 1 has 2 zones:
> zone A has no reclaimable pages but zone B has some.
> Then node 1 has reclaimable pages, and kswapd1 will be woken up.
> I use node_has_relaimable_pages in the loop to check every zone's number of
> reclaimable pages. So I prefer the name node_has_relaimable_pages instead of
> zone_has_relaimable_pages.

I still do not understand. This code starts with
node_has_relaimable_pages = 0. If you see a zone with no reclaimable
pages then you make it node_has_relaimable_pages = 1 which means that 

> > > +	/* Dont wake kswapd if no reclaimable pages */
> > > +	if (!node_has_relaimable_pages)
> > > +		return;

this will not hold and we will wake up kswapd. I believe what you want
instead is to skip the wakeup if _all_ zones have
!zone_reclaimable_pages(). Or am I missing your point? This means that
you want:
	if (zone_reclaimable_pages(zone))
		node_has_relaimable_pages = 1;

> > > +
> > >  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
> > >  	wake_up_interruptible(&pgdat->kswapd_wait);
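
IOW, the tail of wakeup_kswapd() with your patch would then look something
like this (untested, just your hunk with the check flipped):

		if (zone_balanced(zone, order, classzone_idx))
			return;

		/* any zone with reclaimable pages means the node has some */
		if (zone_reclaimable_pages(zone))
			node_has_relaimable_pages = 1;
	}

	/* don't wake kswapd if the whole node has nothing reclaimable */
	if (!node_has_relaimable_pages)
		return;

	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
	wake_up_interruptible(&pgdat->kswapd_wait);
}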

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22  9:04 ` Jia He
@ 2017-02-22 20:16   ` Johannes Weiner
  -1 siblings, 0 replies; 25+ messages in thread
From: Johannes Weiner @ 2017-02-22 20:16 UTC (permalink / raw)
  To: Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Michal Hocko, Minchan Kim, Rik van Riel

On Wed, Feb 22, 2017 at 05:04:48PM +0800, Jia He wrote:
> When I try to dynamically allocate more hugepages than the system's total
> free memory:

> Then kswapd takes 100% CPU for a long time (more than 3 hours) with no sign
> of stopping.

> The root cause is that kswapd3 tries to reclaim again and again but makes
> no progress:

> At that time, there are no reclaimable pages in that node:

Yes, this is a problem with the current kswapd code.

A less artificial scenario that I observed recently was machines with
two NUMA nodes, after being up for 200+ days, getting into a state
where node0 is mostly consumed by anon and some kernel allocations,
leaving less than the high watermark free. The machines don't have
swap, so the anon isn't reclaimable. But also, anon LRU is never even
*scanned*, so the "all unreclaimable" logic doesn't kick in. Kswapd is
spinning at 100% CPU calculating scan counts and checking zone states.

One specific problem with your patch, Jia, is that there might be some
cache pages that are pinned one way or another. That was the case on
our machines, and so reclaimable pages wasn't 0. Even if we check the
reclaimable pages, we need a hard cutoff after X attempts. And then it
sounds pretty much like what the allocator/direct reclaim already does.

Can we use the *exact* same cutoff conditions for direct reclaim and
kswapd, though? I don't think so. For direct reclaim, the goal is the
watermark, to make an allocation happen in the caller. While kswapd
tries to restore the watermarks too, it might never meet them but
still do useful work on behalf of concurrently allocating threads. It
should only stop when it tries and fails to free any pages at all.

The recent removal of the scanned > reclaimable * 6 exit condition
from kswapd was a mistake, we need something like it. But it's also
broken the way we perform reclaim on swapless systems, so instead of
re-instating it, we should remove the remnants and do something new.

Can we simply count the number of balance_pgdat() runs that didn't
reclaim anything and have kswapd sleep after MAX_RECLAIM_RETRIES?

And a follow-up: once it gives up, when should kswapd return to work?
We used to reset NR_PAGES_SCANNED whenever a page gets freed. But
that's a branch in a common allocator path, just to recover kswapd - a
latency tool, not a necessity for functional correctness - from a
situation that's exceedingly rare. How about we leave it
disabled until a direct reclaimer manages to free something?

Something like this (untested):

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 96c78d840916..611c5fb52d7b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -149,7 +149,6 @@ enum node_stat_item {
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
-	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	PGSCAN_ANON,
 	PGSCAN_FILE,
 	PGSTEAL_ANON,
@@ -630,6 +629,8 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+	int kswapd_failed_runs;
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 145005e6dc85..9de49e2710af 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -285,8 +285,8 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+#define MAX_RECLAIM_RETRIES 16
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
-extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/internal.h b/mm/internal.h
index 537ac9951f5f..812e1aaf4142 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -78,7 +78,6 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
-extern bool pgdat_reclaimable(struct pglist_data *pgdat);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/migrate.c b/mm/migrate.c
index c83186c61257..d4f0a499b4e5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1731,9 +1731,6 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 {
 	int z;
 
-	if (!pgdat_reclaimable(pgdat))
-		return false;
-
 	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 		struct zone *zone = pgdat->node_zones + z;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c470b8fe28cf..c467a4fff16a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1087,15 +1087,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	unsigned long nr_scanned;
 	bool isolated_pageblocks;
 
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
-	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
-	if (nr_scanned)
-		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
-
 	while (count) {
 		struct page *page;
 		struct list_head *list;
@@ -1147,12 +1142,7 @@ static void free_one_page(struct zone *zone,
 				unsigned int order,
 				int migratetype)
 {
-	unsigned long nr_scanned;
 	spin_lock(&zone->lock);
-	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
-	if (nr_scanned)
-		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
-
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
@@ -3388,12 +3378,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 }
 
 /*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
-/*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
@@ -4301,7 +4285,6 @@ void show_free_areas(unsigned int filter)
 #endif
 			" writeback_tmp:%lukB"
 			" unstable:%lukB"
-			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -4324,8 +4307,8 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_SHMEM)),
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
-			node_page_state(pgdat, NR_PAGES_SCANNED),
-			!pgdat_reclaimable(pgdat) ? "yes" : "no");
+		        pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES ?
+				"yes" : "no");
 	}
 
 	for_each_populated_zone(zone) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1b112f291fbe..2b4ce7cd97d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -212,28 +212,6 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
 	return nr;
 }
 
-unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
-{
-	unsigned long nr;
-
-	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
-
-	if (get_nr_swap_pages() > 0)
-		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
-		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
-		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
-
-	return nr;
-}
-
-bool pgdat_reclaimable(struct pglist_data *pgdat)
-{
-	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
-		pgdat_reclaimable_pages(pgdat) * 6;
-}
-
 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
 	if (!mem_cgroup_disabled())
@@ -1438,10 +1416,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			continue;
 		}
 
-		/*
-		 * Account for scanned and skipped separetly to avoid the pgdat
-		 * being prematurely marked unreclaimable by pgdat_reclaimable.
-		 */
 		scan++;
 
 		switch (__isolate_lru_page(page, mode)) {
@@ -1728,7 +1702,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	if (global_reclaim(sc)) {
 		__mod_node_page_state(pgdat, PGSCAN_ANON + file, nr_scanned);
-		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
 		if (current_is_kswapd())
 			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
 		else
@@ -1916,8 +1889,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 
-	if (global_reclaim(sc))
-		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
 	__count_vm_events(PGREFILL, nr_scanned);
 
 	spin_unlock_irq(&pgdat->lru_lock);
@@ -2114,12 +2085,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd()) {
-		if (!pgdat_reclaimable(pgdat))
-			force_scan = true;
-		if (!mem_cgroup_online(memcg))
-			force_scan = true;
-	}
+	if (current_is_kswapd() && !mem_cgroup_online(memcg))
+		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
 
@@ -2593,6 +2560,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
+	/*
+	 * Kswapd might have declared a node hopeless after too many
+	 * successless balancing attempts. If we reclaim anything at
+	 * all here, knock it loose.
+	 */
+	if (reclaimable)
+		pgdat->kswapd_failed_runs = 0;
+
 	return reclaimable;
 }
 
@@ -2667,10 +2642,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
 
-			if (sc->priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;	/* Let kswapd poll it */
-
 			/*
 			 * If we already have plenty of memory free for
 			 * compaction in this zone, don't free any more.
@@ -2817,8 +2788,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!managed_zone(zone) ||
-		    pgdat_reclaimable_pages(pgdat) == 0)
+		if (!managed_zone(zone))
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);
@@ -3117,6 +3087,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
+	/* Hopeless node until direct reclaim knocks us free */
+	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -3260,7 +3234,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
 		 */
-		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
+		if (sc.priority < DEF_PRIORITY - 2)
 			sc.may_writepage = 1;
 
 		/* Call soft limit reclaim before calling shrink_node. */
@@ -3299,6 +3273,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
+	if (!sc.nr_reclaimed)
+		pgdat->kswapd_failed_runs++;
+
 out:
 	/*
 	 * Return the order kswapd stopped reclaiming at as
@@ -3498,6 +3475,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
+	/* Hopeless node until direct reclaim knocks us free */
+	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
+		return;
+
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
@@ -3768,9 +3749,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(pgdat))
-		return NODE_RECLAIM_FULL;
-
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ef3b683f1f56..2e022d537d25 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -954,7 +954,6 @@ const char * const vmstat_text[] = {
 	"nr_unevictable",
 	"nr_isolated_anon",
 	"nr_isolated_file",
-	"nr_pages_scanned",
 	"pgscan_anon",
 	"pgscan_file",
 	"pgsteal_anon",
@@ -1378,7 +1377,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n   node_scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu",
@@ -1386,7 +1384,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone->managed_pages);
@@ -1425,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  node_unreclaimable:  %u"
 		   "\n  start_pfn:           %lu"
 		   "\n  node_inactive_ratio: %u",
-		   !pgdat_reclaimable(zone->zone_pgdat),
+		   zone->zone_pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES,
 		   zone->zone_start_pfn,
 		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');
@@ -1587,26 +1584,11 @@ int vmstat_refresh(struct ctl_table *table, int write,
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
-			switch (i) {
-			case NR_PAGES_SCANNED:
-				/*
-				 * This is often seen to go negative in
-				 * recent kernels, but not to go permanently
-				 * negative.  Whilst it would be nicer not to
-				 * have exceptions, rooting them out would be
-				 * another task, of rather low priority.
-				 */
-				break;
-			default:
-				pr_warn("%s: %s %ld\n",
-					__func__, vmstat_text[i], val);
-				err = -EINVAL;
-				break;
-			}
+			pr_warn("%s: %s %ld\n",
+				__func__, vmstat_text[i], val);
+			return -EINVAL;
 		}
 	}
-	if (err)
-		return err;
 	if (write)
 		*ppos += *lenp;
 	else

^ permalink raw reply related	[flat|nested] 25+ messages in thread

+		if (sc.priority < DEF_PRIORITY - 2)
 			sc.may_writepage = 1;
 
 		/* Call soft limit reclaim before calling shrink_node. */
@@ -3299,6 +3273,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
+	if (!sc.nr_reclaimed)
+		pgdat->kswapd_failed_runs++;
+
 out:
 	/*
 	 * Return the order kswapd stopped reclaiming at as
@@ -3498,6 +3475,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
+	/* Hopeless node until direct reclaim knocks us free */
+	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
+		return;
+
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
@@ -3768,9 +3749,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(pgdat))
-		return NODE_RECLAIM_FULL;
-
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ef3b683f1f56..2e022d537d25 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -954,7 +954,6 @@ const char * const vmstat_text[] = {
 	"nr_unevictable",
 	"nr_isolated_anon",
 	"nr_isolated_file",
-	"nr_pages_scanned",
 	"pgscan_anon",
 	"pgscan_file",
 	"pgsteal_anon",
@@ -1378,7 +1377,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n   node_scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu",
@@ -1386,7 +1384,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone->managed_pages);
@@ -1425,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  node_unreclaimable:  %u"
 		   "\n  start_pfn:           %lu"
 		   "\n  node_inactive_ratio: %u",
-		   !pgdat_reclaimable(zone->zone_pgdat),
+		   zone->zone_pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES,
 		   zone->zone_start_pfn,
 		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');
@@ -1587,26 +1584,11 @@ int vmstat_refresh(struct ctl_table *table, int write,
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
-			switch (i) {
-			case NR_PAGES_SCANNED:
-				/*
-				 * This is often seen to go negative in
-				 * recent kernels, but not to go permanently
-				 * negative.  Whilst it would be nicer not to
-				 * have exceptions, rooting them out would be
-				 * another task, of rather low priority.
-				 */
-				break;
-			default:
-				pr_warn("%s: %s %ld\n",
-					__func__, vmstat_text[i], val);
-				err = -EINVAL;
-				break;
-			}
+			pr_warn("%s: %s %ld\n",
+				__func__, vmstat_text[i], val);
+			return -EINVAL;
 		}
 	}
-	if (err)
-		return err;
 	if (write)
 		*ppos += *lenp;
 	else

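Taken together, the hunks above boil down to a small state machine
around pgdat->kswapd_failed_runs. A simplified stand-alone sketch of
that life cycle (sketch_pgdat and the helper names are stand-ins, not
the kernel's interfaces):

#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

struct sketch_pgdat {
	int kswapd_failed_runs;
};

/* balance_pgdat(): count a run that reclaimed nothing at all */
static void kswapd_run_finished(struct sketch_pgdat *pgdat,
				unsigned long nr_reclaimed)
{
	if (!nr_reclaimed)
		pgdat->kswapd_failed_runs++;
}

/* shrink_node() in direct reclaim: any progress knocks the node loose */
static void reclaim_made_progress(struct sketch_pgdat *pgdat)
{
	pgdat->kswapd_failed_runs = 0;
}

/* wakeup_kswapd()/prepare_kswapd_sleep(): skip the node once hopeless */
static bool node_is_hopeless(struct sketch_pgdat *pgdat)
{
	return pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES;
}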

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 20:16   ` Johannes Weiner
@ 2017-02-22 20:24     ` Johannes Weiner
  -1 siblings, 0 replies; 25+ messages in thread
From: Johannes Weiner @ 2017-02-22 20:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Jia He, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

On Wed, Feb 22, 2017 at 03:16:57PM -0500, Johannes Weiner wrote:
> [...] And then it sounds pretty much like what the allocator/direct
> reclaim already does.

On a side note: Michal, I'm not sure I fully understand why we need
the backoff code in should_reclaim_retry(). If no_progress_loops is
growing steadily, then we quickly reach 16 and bail anyway. Layering
on top a backoff function that *might* cut out an iteration or two
earlier in the cold path of an OOM situation seems unnecessary.
Conversely, if there *are* intermittent reclaims, no_progress_loops
gets reset straight to 0, which then also makes the backoff function
jump back to square one. So in the only situation where backing off
would make sense - making some progress, but not enough - it's not
actually backing off. It seems to me it should be enough to bail after
either 16 iterations or when free + reclaimable < watermark.

Hm?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c470b8fe28cf..b0e9495c0530 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3396,11 +3396,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 /*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
- * The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round) and no_progress_loops (number of reclaim rounds without
- * any progress in a row) is considered as well as the reclaimable pages on the
- * applicable zone list (with a backoff mechanism which is a function of
- * no_progress_loops).
+ *
+ * We give up when we either have tried MAX_RECLAIM_RETRIES in a row
+ * without success, or when we couldn't even meet the watermark if we
+ * reclaimed all remaining pages on the LRU lists.
  *
  * Returns true if a retry is viable or false to enter the oom path.
  */
@@ -3441,13 +3440,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		unsigned long reclaimable;
 
 		available = reclaimable = zone_reclaimable_pages(zone);
-		available -= DIV_ROUND_UP((*no_progress_loops) * available,
-					  MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
 		/*
-		 * Would the allocation succeed if we reclaimed the whole
-		 * available?
+		 * Would the allocation succeed if we reclaimed all
+		 * the reclaimable pages?
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac_classzone_idx(ac), alloc_flags, available)) {

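To put a number on "an iteration or two": a throwaway userspace sketch
of the removed backoff term, with made-up values and the watermark
check reduced to a plain comparison instead of __zone_watermark_ok().
With 10000 reclaimable pages, 500 free pages and a 2000-page target,
the backoff only stops the retries at loop 14 instead of 16.

#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long reclaimable = 10000;	/* assumed LRU pages */
	unsigned long free_pages = 500;		/* assumed free pages */
	unsigned long watermark = 2000;		/* assumed allocation target */
	int loops;

	for (loops = 0; loops <= MAX_RECLAIM_RETRIES; loops++) {
		unsigned long available = reclaimable;

		/* the backoff term removed by the hunk above */
		available -= DIV_ROUND_UP(loops * available, MAX_RECLAIM_RETRIES);
		available += free_pages;

		printf("loops=%2d available=%5lu retry=%s\n", loops,
		       available, available >= watermark ? "yes" : "no");
	}
	return 0;
}
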
^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 20:16   ` Johannes Weiner
  (?)
  (?)
@ 2017-02-23  2:24   ` hejianet
  -1 siblings, 0 replies; 25+ messages in thread
From: hejianet @ 2017-02-23  2:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Michal Hocko, Minchan Kim, Rik van Riel


Hi Johannes
On 23/02/2017 4:16 AM, Johannes Weiner wrote:
> On Wed, Feb 22, 2017 at 05:04:48PM +0800, Jia He wrote:
>> When I try to dynamically allocate the hugepages more than system total
>> free memory:
>
>> Then the kswapd will take 100% cpu for a long time(more than 3 hours, and
>> will not be about to end)
>
>> The root cause is kswapd3 is trying to do relaim again and again but it
>> makes no progress
>
>> At that time, there are no relaimable pages in that node:
>
> Yes, this is a problem with the current kswapd code.
>
> A less artificial scenario that I observed recently was machines with
> two NUMA nodes, after being up for 200+ days, getting into a state
> where node0 is mostly consumed by anon and some kernel allocations,
> leaving less than the high watermark free. The machines don't have
> swap, so the anon isn't reclaimable. But also, anon LRU is never even
> *scanned*, so the "all unreclaimable" logic doesn't kick in. Kswapd is
> spinning at 100% CPU calculating scan counts and checking zone states.
>
> One specific problem with your patch, Jia, is that there might be some
> cache pages that are pinned one way or another. That was the case on
> our machines, and so reclaimable pages wasn't 0. Even if we check the
> reclaimable pages, we need a hard cutoff after X attempts. And then it
> sounds pretty much like what the allocator/direct reclaim already does.
>
> Can we use the *exact* same cutoff conditions for direct reclaim and
> kswapd, though? I don't think so. For direct reclaim, the goal is the
> watermark, to make an allocation happen in the caller. While kswapd
> tries to restore the watermarks too, it might never meet them but
> still do useful work on behalf of concurrently allocating threads. It
> should only stop when it tries and fails to free any pages at all.
>
Yes, this is what I thought before this patch, but seems Michal
doesn't like this idea :)
Please see https://lkml.org/lkml/2017/1/24/543

Frankly speaking, it had some impact, but not that much, on the
kswapd high cpu usage (I haven't tested your patch here, but I've
tested my 1st patch).

Because when the memory in nodeN is mostly consumed, every direct
reclaim path will try to wake up the kswapd:
__alloc_pages_slowpath
	wake_all_kswapds
		wakeup_kswapd
			(the slowpath retries this up to 16 times)

Compared with the idea of limiting kswapd's no-progress loops,
avoiding the kswapd wakeup has a much more significant impact on the
high cpu usage.
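
A minimal stand-alone sketch of the two cut-off points being compared
(sketch_zone/sketch_node and the helpers are made-up stand-ins, not
kernel types or functions):

#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16
#define SKETCH_MAX_ZONES 4

struct sketch_zone { unsigned long reclaimable_pages; };

struct sketch_node {
	struct sketch_zone zones[SKETCH_MAX_ZONES];
	int kswapd_failed_runs;
};

/* cut-off inside kswapd: give up after too many no-progress balance runs */
static bool kswapd_is_hopeless(struct sketch_node *node)
{
	return node->kswapd_failed_runs >= MAX_RECLAIM_RETRIES;
}

/*
 * cut-off at the wake-up site: the allocator slowpath retries up to 16
 * times and wakes kswapd on every retry, so refusing the wake-up for a
 * node with nothing left to reclaim saves the most CPU.
 */
static bool should_wake_kswapd(struct sketch_node *node)
{
	int z;

	for (z = 0; z < SKETCH_MAX_ZONES; z++)
		if (node->zones[z].reclaimable_pages)
			return true;	/* at least one zone still has work */
	return false;
}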

B.R.
Jia He
> The recent removal of the scanned > reclaimable * 6 exit condition
> from kswapd was a mistake, we need something like it. But it's also
> broken the way we perform reclaim on swapless systems, so instead of
> re-instating it, we should remove the remnants and do something new.
>
> Can we simply count the number of balance_pgdat() runs that didn't
> reclaim anything and have kswapd sleep after MAX_RECLAIM_RETRIES?
>
> And a follow-up: once it gives up, when should kswapd return to work?
> We used to reset NR_PAGES_SCANNED whenever a page gets freed. But
> that's a branch in a common allocator path, just to recover kswapd - a
> latency tool, not a necessity for functional correctness - from a
> situation that's exceedingly pretty rare. How about we leave it
> disabled until a direct reclaimer manages to free something?
>
> Something like this (untested):
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 96c78d840916..611c5fb52d7b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -149,7 +149,6 @@ enum node_stat_item {
>  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
>  	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
>  	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
> -	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
>  	PGSCAN_ANON,
>  	PGSCAN_FILE,
>  	PGSTEAL_ANON,
> @@ -630,6 +629,8 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>
> +	int kswapd_failed_runs;
> +
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
>  	enum zone_type kcompactd_classzone_idx;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 145005e6dc85..9de49e2710af 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -285,8 +285,8 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  						struct vm_area_struct *vma);
>
>  /* linux/mm/vmscan.c */
> +#define MAX_RECLAIM_RETRIES 16
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> -extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/mm/internal.h b/mm/internal.h
> index 537ac9951f5f..812e1aaf4142 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -78,7 +78,6 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c83186c61257..d4f0a499b4e5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1731,9 +1731,6 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>  {
>  	int z;
>
> -	if (!pgdat_reclaimable(pgdat))
> -		return false;
> -
>  	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>  		struct zone *zone = pgdat->node_zones + z;
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c470b8fe28cf..c467a4fff16a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1087,15 +1087,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  {
>  	int migratetype = 0;
>  	int batch_free = 0;
> -	unsigned long nr_scanned;
>  	bool isolated_pageblocks;
>
>  	spin_lock(&zone->lock);
>  	isolated_pageblocks = has_isolate_pageblock(zone);
> -	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
> -	if (nr_scanned)
> -		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> -
>  	while (count) {
>  		struct page *page;
>  		struct list_head *list;
> @@ -1147,12 +1142,7 @@ static void free_one_page(struct zone *zone,
>  				unsigned int order,
>  				int migratetype)
>  {
> -	unsigned long nr_scanned;
>  	spin_lock(&zone->lock);
> -	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
> -	if (nr_scanned)
> -		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> -
>  	if (unlikely(has_isolate_pageblock(zone) ||
>  		is_migrate_isolate(migratetype))) {
>  		migratetype = get_pfnblock_migratetype(page, pfn);
> @@ -3388,12 +3378,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4301,7 +4285,6 @@ void show_free_areas(unsigned int filter)
>  #endif
>  			" writeback_tmp:%lukB"
>  			" unstable:%lukB"
> -			" pages_scanned:%lu"
>  			" all_unreclaimable? %s"
>  			"\n",
>  			pgdat->node_id,
> @@ -4324,8 +4307,8 @@ void show_free_areas(unsigned int filter)
>  			K(node_page_state(pgdat, NR_SHMEM)),
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> -			node_page_state(pgdat, NR_PAGES_SCANNED),
> -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> +		        pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES ?
> +				"yes" : "no");
>  	}
>
>  	for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1b112f291fbe..2b4ce7cd97d8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -212,28 +212,6 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>  	return nr;
>  }
>
> -unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> -{
> -	unsigned long nr;
> -
> -	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> -	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> -	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
> -
> -	if (get_nr_swap_pages() > 0)
> -		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> -		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> -		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
> -
> -	return nr;
> -}
> -
> -bool pgdat_reclaimable(struct pglist_data *pgdat)
> -{
> -	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> -		pgdat_reclaimable_pages(pgdat) * 6;
> -}
> -
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>  {
>  	if (!mem_cgroup_disabled())
> @@ -1438,10 +1416,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>
> -		/*
> -		 * Account for scanned and skipped separetly to avoid the pgdat
> -		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> -		 */
>  		scan++;
>
>  		switch (__isolate_lru_page(page, mode)) {
> @@ -1728,7 +1702,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>
>  	if (global_reclaim(sc)) {
>  		__mod_node_page_state(pgdat, PGSCAN_ANON + file, nr_scanned);
> -		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>  		if (current_is_kswapd())
>  			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>  		else
> @@ -1916,8 +1889,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>
> -	if (global_reclaim(sc))
> -		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>  	__count_vm_events(PGREFILL, nr_scanned);
>
>  	spin_unlock_irq(&pgdat->lru_lock);
> @@ -2114,12 +2085,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * latencies, so it's better to scan a minimum amount there as
>  	 * well.
>  	 */
> -	if (current_is_kswapd()) {
> -		if (!pgdat_reclaimable(pgdat))
> -			force_scan = true;
> -		if (!mem_cgroup_online(memcg))
> -			force_scan = true;
> -	}
> +	if (current_is_kswapd() && !mem_cgroup_online(memcg))
> +		force_scan = true;
>  	if (!global_reclaim(sc))
>  		force_scan = true;
>
> @@ -2593,6 +2560,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>
> +	/*
> +	 * Kswapd might have declared a node hopeless after too many
> +	 * successless balancing attempts. If we reclaim anything at
> +	 * all here, knock it loose.
> +	 */
> +	if (reclaimable)
> +		pgdat->kswapd_failed_runs = 0;
> +
>  	return reclaimable;
>  }
>
> @@ -2667,10 +2642,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;
>
> -			if (sc->priority != DEF_PRIORITY &&
> -			    !pgdat_reclaimable(zone->zone_pgdat))
> -				continue;	/* Let kswapd poll it */
> -
>  			/*
>  			 * If we already have plenty of memory free for
>  			 * compaction in this zone, don't free any more.
> @@ -2817,8 +2788,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>
>  	for (i = 0; i <= ZONE_NORMAL; i++) {
>  		zone = &pgdat->node_zones[i];
> -		if (!managed_zone(zone) ||
> -		    pgdat_reclaimable_pages(pgdat) == 0)
> +		if (!managed_zone(zone))
>  			continue;
>
>  		pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3117,6 +3087,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>
> +	/* Hopeless node until direct reclaim knocks us free */
> +	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
>  	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>
> @@ -3260,7 +3234,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * If we're getting trouble reclaiming, start doing writepage
>  		 * even in laptop mode.
>  		 */
> -		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
> +		if (sc.priority < DEF_PRIORITY - 2)
>  			sc.may_writepage = 1;
>
>  		/* Call soft limit reclaim before calling shrink_node. */
> @@ -3299,6 +3273,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>
> +	if (!sc.nr_reclaimed)
> +		pgdat->kswapd_failed_runs++;
> +
>  out:
>  	/*
>  	 * Return the order kswapd stopped reclaiming at as
> @@ -3498,6 +3475,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
>
> +	/* Hopeless node until direct reclaim knocks us free */
> +	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
> +		return;
> +
>  	/* Only wake kswapd if all zones are unbalanced */
>  	for (z = 0; z <= classzone_idx; z++) {
>  		zone = pgdat->node_zones + z;
> @@ -3768,9 +3749,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
>  		return NODE_RECLAIM_FULL;
>
> -	if (!pgdat_reclaimable(pgdat))
> -		return NODE_RECLAIM_FULL;
> -
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
>  	 */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ef3b683f1f56..2e022d537d25 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -954,7 +954,6 @@ const char * const vmstat_text[] = {
>  	"nr_unevictable",
>  	"nr_isolated_anon",
>  	"nr_isolated_file",
> -	"nr_pages_scanned",
>  	"pgscan_anon",
>  	"pgscan_file",
>  	"pgsteal_anon",
> @@ -1378,7 +1377,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n        min      %lu"
>  		   "\n        low      %lu"
>  		   "\n        high     %lu"
> -		   "\n   node_scanned  %lu"
>  		   "\n        spanned  %lu"
>  		   "\n        present  %lu"
>  		   "\n        managed  %lu",
> @@ -1386,7 +1384,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   min_wmark_pages(zone),
>  		   low_wmark_pages(zone),
>  		   high_wmark_pages(zone),
> -		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>  		   zone->spanned_pages,
>  		   zone->present_pages,
>  		   zone->managed_pages);
> @@ -1425,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  node_unreclaimable:  %u"
>  		   "\n  start_pfn:           %lu"
>  		   "\n  node_inactive_ratio: %u",
> -		   !pgdat_reclaimable(zone->zone_pgdat),
> +		   zone->zone_pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES,
>  		   zone->zone_start_pfn,
>  		   zone->zone_pgdat->inactive_ratio);
>  	seq_putc(m, '\n');
> @@ -1587,26 +1584,11 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
>  		if (val < 0) {
> -			switch (i) {
> -			case NR_PAGES_SCANNED:
> -				/*
> -				 * This is often seen to go negative in
> -				 * recent kernels, but not to go permanently
> -				 * negative.  Whilst it would be nicer not to
> -				 * have exceptions, rooting them out would be
> -				 * another task, of rather low priority.
> -				 */
> -				break;
> -			default:
> -				pr_warn("%s: %s %ld\n",
> -					__func__, vmstat_text[i], val);
> -				err = -EINVAL;
> -				break;
> -			}
> +			pr_warn("%s: %s %ld\n",
> +				__func__, vmstat_text[i], val);
> +			return -EINVAL;
>  		}
>  	}
> -	if (err)
> -		return err;
>  	if (write)
>  		*ppos += *lenp;
>  	else
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 15:48       ` Michal Hocko
@ 2017-02-23  2:25         ` hejianet
  -1 siblings, 0 replies; 25+ messages in thread
From: hejianet @ 2017-02-23  2:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

Hi Michal

On 22/02/2017 11:48 PM, Michal Hocko wrote:
> On Wed 22-02-17 22:31:50, hejianet wrote:
>> Hi Michal
>>
>> On 22/02/2017 7:41 PM, Michal Hocko wrote:
>>> On Wed 22-02-17 17:04:48, Jia He wrote:
>>>> When I try to dynamically allocate the hugepages more than system total
>>>> free memory:
>>>> e.g. echo 4000 >/proc/sys/vm/nr_hugepages
>>>
>>> I assume that the command has terminated with less huge pages allocated
>>> than requested but
>>>
>> Yes, at last the allocated hugepages are less than 4000
>> HugePages_Total:    1864
>> HugePages_Free:     1864
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>> Hugepagesize:      16384 kB
>>
>> In the bad case, although kswapd takes 100% cpu, the number of
>> HugePages_Total is not increase at all.
>>
>>>> Node 3, zone      DMA
>>> [...]
>>>>   pages free     2951
>>>>         min      2821
>>>>         low      3526
>>>>         high     4231
>>>
>>> it left the zone below high watermark with
>>>
>>>>    node_scanned  0
>>>>         spanned  245760
>>>>         present  245760
>>>>         managed  245388
>>>>       nr_free_pages 2951
>>>>       nr_zone_inactive_anon 0
>>>>       nr_zone_active_anon 0
>>>>       nr_zone_inactive_file 0
>>>>       nr_zone_active_file 0
>>>
>>> no pages reclaimable, so kswapd will not go to sleep. It would be quite
>>> easy and comfortable to call it a misconfiguration but it seems that
>>> it might be quite easy to hit with NUMA machines which have large
>>> differences in the node sizes. I guess it makes sense to back off
>>> the kswapd rather than burning CPU without any way to make forward
>>> progress.
>>
>> agree.
>
> please make sure that this information is in the changelog
>
> [...]
>>>> @@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>>>>  {
>>>>  	pg_data_t *pgdat;
>>>>  	int z;
>>>> +	int node_has_relaimable_pages = 0;
>>>>
>>>>  	if (!managed_zone(zone))
>>>>  		return;
>>>> @@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>>>>
>>>>  		if (zone_balanced(zone, order, classzone_idx))
>>>>  			return;
>>>> +
>>>> +		if (!zone_reclaimable_pages(zone))
>>>> +			node_has_relaimable_pages = 1;
>>>
>>> What, this doesn't make any sense? Did you mean if (zone_reclaimable_pages)?
>>
>> I mean, if any one zone has reclaimable pages, then this zone's *node* has
>> reclaimable pages. Thus, the kswapN for this node should be waken up.
>> e.g. node 1 has 2 zones.
>> zone A has no reclaimable pages but zone B has.
>> Thus node 1 has reclaimable pages, and kswapd1 will be waken up.
>> I use node_has_relaimable_pages in the loop to check all the zones' reclaimable
>> pages number. So I prefer the name node_has_relaimable_pages instead of
>> zone_has_relaimable_pages
>
> I still do not understand. This code starts with
> node_has_relaimable_pages = 0. If you see a zone with no reclaimable
> pages then you make it node_has_relaimable_pages = 1 which means that
>
>>>> +	/* Dont wake kswapd if no reclaimable pages */
>>>> +	if (!node_has_relaimable_pages)
>>>> +		return;
>
> this will not hold and we will wake up the kswapd. I believe what
> you want instead, is to skip the wake up if _all_ zones have
> !zone_reclaimable_pages() Or I am missing your point. This means that
> you want
> 	if (zone_reclaimable_pages(zone))
> 		node_has_relaimable_pages = 1;	
You are right, will send v2 soon after testing it
B.R.
Jia
>
>>>> +
>>>>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>>>>  	wake_up_interruptible(&pgdat->kswapd_wait);
>

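For reference, the corrected shape being suggested above, as a
stand-alone sketch (sketch_zone is a made-up type; the real loop in
wakeup_kswapd() walks pgdat->node_zones up to classzone_idx):

#include <stdbool.h>

struct sketch_zone {
	bool managed;
	unsigned long reclaimable_pages;
};

/* true if any zone up to classzone_idx still has reclaimable pages */
static bool node_has_reclaimable_pages(struct sketch_zone *zones,
				       int classzone_idx)
{
	int z;

	for (z = 0; z <= classzone_idx; z++) {
		if (!zones[z].managed)
			continue;
		if (zones[z].reclaimable_pages)
			return true;
	}
	return false;
}

/*
 * In the wake-up path, kswapd is only skipped when *all* zones in the
 * node have nothing left to reclaim:
 *
 *	if (!node_has_reclaimable_pages(zones, classzone_idx))
 *		return;
 */
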
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 20:16   ` Johannes Weiner
                     ` (2 preceding siblings ...)
  (?)
@ 2017-02-23  2:46   ` hejianet
  2017-02-23  3:15     ` Fwd: " hejianet
  2017-02-23  7:21       ` Michal Hocko
  -1 siblings, 2 replies; 25+ messages in thread
From: hejianet @ 2017-02-23  2:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Michal Hocko, Minchan Kim, Rik van Riel

Sorry, resending this due to a delivery failure:
"Wrong MIME labeling on 8-bit character texts"
I apologize if anybody received it twice.
------------
Hi Johannes
On 23/02/2017 4:16 AM, Johannes Weiner wrote:
> On Wed, Feb 22, 2017 at 05:04:48PM +0800, Jia He wrote:
>> When I try to dynamically allocate the hugepages more than system total
>> free memory:
>
>> Then the kswapd will take 100% cpu for a long time(more than 3 hours, and
>> will not be about to end)
>
>> The root cause is kswapd3 is trying to do relaim again and again but it
>> makes no progress
>
>> At that time, there are no relaimable pages in that node:
>
> Yes, this is a problem with the current kswapd code.
>
> A less artificial scenario that I observed recently was machines with
> two NUMA nodes, after being up for 200+ days, getting into a state
> where node0 is mostly consumed by anon and some kernel allocations,
> leaving less than the high watermark free. The machines don't have
> swap, so the anon isn't reclaimable. But also, anon LRU is never even
> *scanned*, so the "all unreclaimable" logic doesn't kick in. Kswapd is
> spinning at 100% CPU calculating scan counts and checking zone states.
>
> One specific problem with your patch, Jia, is that there might be some
> cache pages that are pinned one way or another. That was the case on
> our machines, and so reclaimable pages wasn't 0. Even if we check the
> reclaimable pages, we need a hard cutoff after X attempts. And then it
> sounds pretty much like what the allocator/direct reclaim already does.
>
> Can we use the *exact* same cutoff conditions for direct reclaim and
> kswapd, though? I don't think so. For direct reclaim, the goal is the
> watermark, to make an allocation happen in the caller. While kswapd
> tries to restore the watermarks too, it might never meet them but
> still do useful work on behalf of concurrently allocating threads. It
> should only stop when it tries and fails to free any pages at all.
>
Yes, this is what I thought before this patch, but seems Michal
doesn't like this idea :)
Please see https://lkml.org/lkml/2017/1/24/543

Frankly speaking, it had some impact, but not that much, on the
kswapd high cpu usage (I haven't tested your patch here, but I've
tested my 1st patch).

Because when the memory in nodeN is mostly consumed, every direct
reclaim path will try to wake up the kswapd:
__alloc_pages_slowpath
     wake_all_kswapds
         wakeup_kswapd
             (the slowpath retries this up to 16 times)

Compared with the idea of limiting kswapd's no-progress loops,
avoiding the kswapd wakeup has a much more significant impact on the
high cpu usage.

B.R.
Jia
> The recent removal of the scanned > reclaimable * 6 exit condition
> from kswapd was a mistake, we need something like it. But it's also
> broken the way we perform reclaim on swapless systems, so instead of
> re-instating it, we should remove the remnants and do something new.
>
> Can we simply count the number of balance_pgdat() runs that didn't
> reclaim anything and have kswapd sleep after MAX_RECLAIM_RETRIES?
>
> And a follow-up: once it gives up, when should kswapd return to work?
> We used to reset NR_PAGES_SCANNED whenever a page gets freed. But
> that's a branch in a common allocator path, just to recover kswapd - a
> latency tool, not a necessity for functional correctness - from a
> situation that's exceedingly pretty rare. How about we leave it
> disabled until a direct reclaimer manages to free something?
>
> Something like this (untested):
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 96c78d840916..611c5fb52d7b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -149,7 +149,6 @@ enum node_stat_item {
>  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
>  	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
>  	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
> -	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
>  	PGSCAN_ANON,
>  	PGSCAN_FILE,
>  	PGSTEAL_ANON,
> @@ -630,6 +629,8 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>
> +	int kswapd_failed_runs;
> +
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
>  	enum zone_type kcompactd_classzone_idx;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 145005e6dc85..9de49e2710af 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -285,8 +285,8 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  						struct vm_area_struct *vma);
>
>  /* linux/mm/vmscan.c */
> +#define MAX_RECLAIM_RETRIES 16
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> -extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/mm/internal.h b/mm/internal.h
> index 537ac9951f5f..812e1aaf4142 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -78,7 +78,6 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c83186c61257..d4f0a499b4e5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1731,9 +1731,6 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>  {
>  	int z;
>
> -	if (!pgdat_reclaimable(pgdat))
> -		return false;
> -
>  	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>  		struct zone *zone = pgdat->node_zones + z;
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c470b8fe28cf..c467a4fff16a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1087,15 +1087,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  {
>  	int migratetype = 0;
>  	int batch_free = 0;
> -	unsigned long nr_scanned;
>  	bool isolated_pageblocks;
>
>  	spin_lock(&zone->lock);
>  	isolated_pageblocks = has_isolate_pageblock(zone);
> -	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
> -	if (nr_scanned)
> -		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> -
>  	while (count) {
>  		struct page *page;
>  		struct list_head *list;
> @@ -1147,12 +1142,7 @@ static void free_one_page(struct zone *zone,
>  				unsigned int order,
>  				int migratetype)
>  {
> -	unsigned long nr_scanned;
>  	spin_lock(&zone->lock);
> -	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
> -	if (nr_scanned)
> -		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> -
>  	if (unlikely(has_isolate_pageblock(zone) ||
>  		is_migrate_isolate(migratetype))) {
>  		migratetype = get_pfnblock_migratetype(page, pfn);
> @@ -3388,12 +3378,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4301,7 +4285,6 @@ void show_free_areas(unsigned int filter)
>  #endif
>  			" writeback_tmp:%lukB"
>  			" unstable:%lukB"
> -			" pages_scanned:%lu"
>  			" all_unreclaimable? %s"
>  			"\n",
>  			pgdat->node_id,
> @@ -4324,8 +4307,8 @@ void show_free_areas(unsigned int filter)
>  			K(node_page_state(pgdat, NR_SHMEM)),
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> -			node_page_state(pgdat, NR_PAGES_SCANNED),
> -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> +		        pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES ?
> +				"yes" : "no");
>  	}
>
>  	for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1b112f291fbe..2b4ce7cd97d8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -212,28 +212,6 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>  	return nr;
>  }
>
> -unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> -{
> -	unsigned long nr;
> -
> -	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> -	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> -	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
> -
> -	if (get_nr_swap_pages() > 0)
> -		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> -		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> -		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
> -
> -	return nr;
> -}
> -
> -bool pgdat_reclaimable(struct pglist_data *pgdat)
> -{
> -	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> -		pgdat_reclaimable_pages(pgdat) * 6;
> -}
> -
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>  {
>  	if (!mem_cgroup_disabled())
> @@ -1438,10 +1416,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>
> -		/*
> -		 * Account for scanned and skipped separetly to avoid the pgdat
> -		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> -		 */
>  		scan++;
>
>  		switch (__isolate_lru_page(page, mode)) {
> @@ -1728,7 +1702,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>
>  	if (global_reclaim(sc)) {
>  		__mod_node_page_state(pgdat, PGSCAN_ANON + file, nr_scanned);
> -		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>  		if (current_is_kswapd())
>  			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>  		else
> @@ -1916,8 +1889,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>
> -	if (global_reclaim(sc))
> -		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>  	__count_vm_events(PGREFILL, nr_scanned);
>
>  	spin_unlock_irq(&pgdat->lru_lock);
> @@ -2114,12 +2085,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * latencies, so it's better to scan a minimum amount there as
>  	 * well.
>  	 */
> -	if (current_is_kswapd()) {
> -		if (!pgdat_reclaimable(pgdat))
> -			force_scan = true;
> -		if (!mem_cgroup_online(memcg))
> -			force_scan = true;
> -	}
> +	if (current_is_kswapd() && !mem_cgroup_online(memcg))
> +		force_scan = true;
>  	if (!global_reclaim(sc))
>  		force_scan = true;
>
> @@ -2593,6 +2560,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>
> +	/*
> +	 * Kswapd might have declared a node hopeless after too many
> +	 * successless balancing attempts. If we reclaim anything at
> +	 * all here, knock it loose.
> +	 */
> +	if (reclaimable)
> +		pgdat->kswapd_failed_runs = 0;
> +
>  	return reclaimable;
>  }
>
> @@ -2667,10 +2642,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;
>
> -			if (sc->priority != DEF_PRIORITY &&
> -			    !pgdat_reclaimable(zone->zone_pgdat))
> -				continue;	/* Let kswapd poll it */
> -
>  			/*
>  			 * If we already have plenty of memory free for
>  			 * compaction in this zone, don't free any more.
> @@ -2817,8 +2788,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>
>  	for (i = 0; i <= ZONE_NORMAL; i++) {
>  		zone = &pgdat->node_zones[i];
> -		if (!managed_zone(zone) ||
> -		    pgdat_reclaimable_pages(pgdat) == 0)
> +		if (!managed_zone(zone))
>  			continue;
>
>  		pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3117,6 +3087,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>
> +	/* Hopeless node until direct reclaim knocks us free */
> +	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
>  	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>
> @@ -3260,7 +3234,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * If we're getting trouble reclaiming, start doing writepage
>  		 * even in laptop mode.
>  		 */
> -		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
> +		if (sc.priority < DEF_PRIORITY - 2)
>  			sc.may_writepage = 1;
>
>  		/* Call soft limit reclaim before calling shrink_node. */
> @@ -3299,6 +3273,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>
> +	if (!sc.nr_reclaimed)
> +		pgdat->kswapd_failed_runs++;
> +
>  out:
>  	/*
>  	 * Return the order kswapd stopped reclaiming at as
> @@ -3498,6 +3475,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
>
> +	/* Hopeless node until direct reclaim knocks us free */
> +	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
> +		return;
> +
>  	/* Only wake kswapd if all zones are unbalanced */
>  	for (z = 0; z <= classzone_idx; z++) {
>  		zone = pgdat->node_zones + z;
> @@ -3768,9 +3749,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
>  		return NODE_RECLAIM_FULL;
>
> -	if (!pgdat_reclaimable(pgdat))
> -		return NODE_RECLAIM_FULL;
> -
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
>  	 */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ef3b683f1f56..2e022d537d25 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -954,7 +954,6 @@ const char * const vmstat_text[] = {
>  	"nr_unevictable",
>  	"nr_isolated_anon",
>  	"nr_isolated_file",
> -	"nr_pages_scanned",
>  	"pgscan_anon",
>  	"pgscan_file",
>  	"pgsteal_anon",
> @@ -1378,7 +1377,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n        min      %lu"
>  		   "\n        low      %lu"
>  		   "\n        high     %lu"
> -		   "\n   node_scanned  %lu"
>  		   "\n        spanned  %lu"
>  		   "\n        present  %lu"
>  		   "\n        managed  %lu",
> @@ -1386,7 +1384,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   min_wmark_pages(zone),
>  		   low_wmark_pages(zone),
>  		   high_wmark_pages(zone),
> -		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>  		   zone->spanned_pages,
>  		   zone->present_pages,
>  		   zone->managed_pages);
> @@ -1425,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  node_unreclaimable:  %u"
>  		   "\n  start_pfn:           %lu"
>  		   "\n  node_inactive_ratio: %u",
> -		   !pgdat_reclaimable(zone->zone_pgdat),
> +		   zone->zone_pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES,
>  		   zone->zone_start_pfn,
>  		   zone->zone_pgdat->inactive_ratio);
>  	seq_putc(m, '\n');
> @@ -1587,26 +1584,11 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
>  		if (val < 0) {
> -			switch (i) {
> -			case NR_PAGES_SCANNED:
> -				/*
> -				 * This is often seen to go negative in
> -				 * recent kernels, but not to go permanently
> -				 * negative.  Whilst it would be nicer not to
> -				 * have exceptions, rooting them out would be
> -				 * another task, of rather low priority.
> -				 */
> -				break;
> -			default:
> -				pr_warn("%s: %s %ld\n",
> -					__func__, vmstat_text[i], val);
> -				err = -EINVAL;
> -				break;
> -			}
> +			pr_warn("%s: %s %ld\n",
> +				__func__, vmstat_text[i], val);
> +			return -EINVAL;
>  		}
>  	}
> -	if (err)
> -		return err;
>  	if (write)
>  		*ppos += *lenp;
>  	else
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Fwd: Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-23  2:46   ` hejianet
@ 2017-02-23  3:15     ` hejianet
  2017-02-23  7:21       ` Michal Hocko
  1 sibling, 0 replies; 25+ messages in thread
From: hejianet @ 2017-02-23  3:15 UTC (permalink / raw)
  To: LKML


Resending it to lkml only.

-------- Forwarded Message --------
Subject: Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
Date: Thu, 23 Feb 2017 10:46:01 +0800
From: hejianet <hejianet@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton 
<akpm@linux-foundation.org>, Mel Gorman <mgorman@techsingularity.net>, Vlastimil 
Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>, Minchan Kim 
<minchan@kernel.org>, Rik van Riel <riel@redhat.com>

Sorry, resending this due to a delivery failure:
"Wrong MIME labeling on 8-bit character texts"
I apologize if anybody received it twice.
------------
Hi Johannes
On 23/02/2017 4:16 AM, Johannes Weiner wrote:
> On Wed, Feb 22, 2017 at 05:04:48PM +0800, Jia He wrote:
>> When I try to dynamically allocate the hugepages more than system total
>> free memory:
>
>> Then the kswapd will take 100% cpu for a long time(more than 3 hours, and
>> will not be about to end)
>
>> The root cause is kswapd3 is trying to do relaim again and again but it
>> makes no progress
>
>> At that time, there are no relaimable pages in that node:
>
> Yes, this is a problem with the current kswapd code.
>
> A less artificial scenario that I observed recently was machines with
> two NUMA nodes, after being up for 200+ days, getting into a state
> where node0 is mostly consumed by anon and some kernel allocations,
> leaving less than the high watermark free. The machines don't have
> swap, so the anon isn't reclaimable. But also, anon LRU is never even
> *scanned*, so the "all unreclaimable" logic doesn't kick in. Kswapd is
> spinning at 100% CPU calculating scan counts and checking zone states.
>
> One specific problem with your patch, Jia, is that there might be some
> cache pages that are pinned one way or another. That was the case on
> our machines, and so reclaimable pages wasn't 0. Even if we check the
> reclaimable pages, we need a hard cutoff after X attempts. And then it
> sounds pretty much like what the allocator/direct reclaim already does.
>
> Can we use the *exact* same cutoff conditions for direct reclaim and
> kswapd, though? I don't think so. For direct reclaim, the goal is the
> watermark, to make an allocation happen in the caller. While kswapd
> tries to restore the watermarks too, it might never meet them but
> still do useful work on behalf of concurrently allocating threads. It
> should only stop when it tries and fails to free any pages at all.
>
Yes, this is what I thought before this patch, but it seems Michal
doesn't like this idea :)
Please see https://lkml.org/lkml/2017/1/24/543

Frankly speaking, it had some impact, but not much, on the high kswapd
CPU usage (I haven't tested your patch here, but I have tested my 1st
patch).

That is because, when the memory in node N is mostly consumed, every
direct reclaim path will try to wake up kswapd again:
__alloc_pages_slowpath
     wake_all_kswapds
         wakeup_kswapd
             retried up to 16 times

Compared with the idea of limiting kswapd's no-progress loop, avoiding
the kswapd wakeup has a more significant impact on the high CPU usage.
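
As a rough user-space model of that retry behaviour (the helpers below
are just stand-ins for illustration, not the real allocator code),
every one of the up-to-16 no-progress loops kicks kswapd again:

#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

/* stand-ins for the real allocator hooks, assumed for illustration */
static void wake_all_kswapds(void) { /* would wake kswapd on every node */ }
static bool direct_reclaim_made_progress(void) { return false; /* nothing reclaimable */ }

static bool alloc_slowpath_model(void)
{
        int no_progress_loops = 0;

        while (no_progress_loops < MAX_RECLAIM_RETRIES) {
                wake_all_kswapds();      /* kswapd is woken again on every retry */
                if (direct_reclaim_made_progress())
                        return true;     /* allocation can be retried */
                no_progress_loops++;
        }
        return false;                    /* fall through towards OOM */
}

So gating wakeup_kswapd() on the failure counter, as your diff does,
stops the repeated kicks at their source.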

B.R.
Jia
> The recent removal of the scanned > reclaimable * 6 exit condition
> from kswapd was a mistake, we need something like it. But it's also
> broken the way we perform reclaim on swapless systems, so instead of
> re-instating it, we should remove the remnants and do something new.
>
> Can we simply count the number of balance_pgdat() runs that didn't
> reclaim anything and have kswapd sleep after MAX_RECLAIM_RETRIES?
>
> And a follow-up: once it gives up, when should kswapd return to work?
> We used to reset NR_PAGES_SCANNED whenever a page gets freed. But
> that's a branch in a common allocator path, just to recover kswapd - a
> latency tool, not a necessity for functional correctness - from a
> situation that's exceedingly pretty rare. How about we leave it
> disabled until a direct reclaimer manages to free something?
>
> Something like this (untested):
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 96c78d840916..611c5fb52d7b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -149,7 +149,6 @@ enum node_stat_item {
>  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
>  	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
>  	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
> -	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
>  	PGSCAN_ANON,
>  	PGSCAN_FILE,
>  	PGSTEAL_ANON,
> @@ -630,6 +629,8 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>
> +	int kswapd_failed_runs;
> +
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
>  	enum zone_type kcompactd_classzone_idx;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 145005e6dc85..9de49e2710af 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -285,8 +285,8 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  						struct vm_area_struct *vma);
>
>  /* linux/mm/vmscan.c */
> +#define MAX_RECLAIM_RETRIES 16
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> -extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/mm/internal.h b/mm/internal.h
> index 537ac9951f5f..812e1aaf4142 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -78,7 +78,6 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c83186c61257..d4f0a499b4e5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1731,9 +1731,6 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>  {
>  	int z;
>
> -	if (!pgdat_reclaimable(pgdat))
> -		return false;
> -
>  	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>  		struct zone *zone = pgdat->node_zones + z;
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c470b8fe28cf..c467a4fff16a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1087,15 +1087,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  {
>  	int migratetype = 0;
>  	int batch_free = 0;
> -	unsigned long nr_scanned;
>  	bool isolated_pageblocks;
>
>  	spin_lock(&zone->lock);
>  	isolated_pageblocks = has_isolate_pageblock(zone);
> -	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
> -	if (nr_scanned)
> -		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> -
>  	while (count) {
>  		struct page *page;
>  		struct list_head *list;
> @@ -1147,12 +1142,7 @@ static void free_one_page(struct zone *zone,
>  				unsigned int order,
>  				int migratetype)
>  {
> -	unsigned long nr_scanned;
>  	spin_lock(&zone->lock);
> -	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
> -	if (nr_scanned)
> -		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> -
>  	if (unlikely(has_isolate_pageblock(zone) ||
>  		is_migrate_isolate(migratetype))) {
>  		migratetype = get_pfnblock_migratetype(page, pfn);
> @@ -3388,12 +3378,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4301,7 +4285,6 @@ void show_free_areas(unsigned int filter)
>  #endif
>  			" writeback_tmp:%lukB"
>  			" unstable:%lukB"
> -			" pages_scanned:%lu"
>  			" all_unreclaimable? %s"
>  			"\n",
>  			pgdat->node_id,
> @@ -4324,8 +4307,8 @@ void show_free_areas(unsigned int filter)
>  			K(node_page_state(pgdat, NR_SHMEM)),
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> -			node_page_state(pgdat, NR_PAGES_SCANNED),
> -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> +		        pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES ?
> +				"yes" : "no");
>  	}
>
>  	for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1b112f291fbe..2b4ce7cd97d8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -212,28 +212,6 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>  	return nr;
>  }
>
> -unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> -{
> -	unsigned long nr;
> -
> -	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> -	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> -	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
> -
> -	if (get_nr_swap_pages() > 0)
> -		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> -		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> -		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
> -
> -	return nr;
> -}
> -
> -bool pgdat_reclaimable(struct pglist_data *pgdat)
> -{
> -	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> -		pgdat_reclaimable_pages(pgdat) * 6;
> -}
> -
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>  {
>  	if (!mem_cgroup_disabled())
> @@ -1438,10 +1416,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>
> -		/*
> -		 * Account for scanned and skipped separetly to avoid the pgdat
> -		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> -		 */
>  		scan++;
>
>  		switch (__isolate_lru_page(page, mode)) {
> @@ -1728,7 +1702,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>
>  	if (global_reclaim(sc)) {
>  		__mod_node_page_state(pgdat, PGSCAN_ANON + file, nr_scanned);
> -		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>  		if (current_is_kswapd())
>  			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>  		else
> @@ -1916,8 +1889,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>
> -	if (global_reclaim(sc))
> -		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>  	__count_vm_events(PGREFILL, nr_scanned);
>
>  	spin_unlock_irq(&pgdat->lru_lock);
> @@ -2114,12 +2085,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * latencies, so it's better to scan a minimum amount there as
>  	 * well.
>  	 */
> -	if (current_is_kswapd()) {
> -		if (!pgdat_reclaimable(pgdat))
> -			force_scan = true;
> -		if (!mem_cgroup_online(memcg))
> -			force_scan = true;
> -	}
> +	if (current_is_kswapd() && !mem_cgroup_online(memcg))
> +		force_scan = true;
>  	if (!global_reclaim(sc))
>  		force_scan = true;
>
> @@ -2593,6 +2560,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>
> +	/*
> +	 * Kswapd might have declared a node hopeless after too many
> +	 * successless balancing attempts. If we reclaim anything at
> +	 * all here, knock it loose.
> +	 */
> +	if (reclaimable)
> +		pgdat->kswapd_failed_runs = 0;
> +
>  	return reclaimable;
>  }
>
> @@ -2667,10 +2642,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;
>
> -			if (sc->priority != DEF_PRIORITY &&
> -			    !pgdat_reclaimable(zone->zone_pgdat))
> -				continue;	/* Let kswapd poll it */
> -
>  			/*
>  			 * If we already have plenty of memory free for
>  			 * compaction in this zone, don't free any more.
> @@ -2817,8 +2788,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>
>  	for (i = 0; i <= ZONE_NORMAL; i++) {
>  		zone = &pgdat->node_zones[i];
> -		if (!managed_zone(zone) ||
> -		    pgdat_reclaimable_pages(pgdat) == 0)
> +		if (!managed_zone(zone))
>  			continue;
>
>  		pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3117,6 +3087,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>
> +	/* Hopeless node until direct reclaim knocks us free */
> +	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
>  	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>
> @@ -3260,7 +3234,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * If we're getting trouble reclaiming, start doing writepage
>  		 * even in laptop mode.
>  		 */
> -		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
> +		if (sc.priority < DEF_PRIORITY - 2)
>  			sc.may_writepage = 1;
>
>  		/* Call soft limit reclaim before calling shrink_node. */
> @@ -3299,6 +3273,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>
> +	if (!sc.nr_reclaimed)
> +		pgdat->kswapd_failed_runs++;
> +
>  out:
>  	/*
>  	 * Return the order kswapd stopped reclaiming at as
> @@ -3498,6 +3475,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
>
> +	/* Hopeless node until direct reclaim knocks us free */
> +	if (pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES)
> +		return;
> +
>  	/* Only wake kswapd if all zones are unbalanced */
>  	for (z = 0; z <= classzone_idx; z++) {
>  		zone = pgdat->node_zones + z;
> @@ -3768,9 +3749,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
>  		return NODE_RECLAIM_FULL;
>
> -	if (!pgdat_reclaimable(pgdat))
> -		return NODE_RECLAIM_FULL;
> -
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
>  	 */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ef3b683f1f56..2e022d537d25 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -954,7 +954,6 @@ const char * const vmstat_text[] = {
>  	"nr_unevictable",
>  	"nr_isolated_anon",
>  	"nr_isolated_file",
> -	"nr_pages_scanned",
>  	"pgscan_anon",
>  	"pgscan_file",
>  	"pgsteal_anon",
> @@ -1378,7 +1377,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n        min      %lu"
>  		   "\n        low      %lu"
>  		   "\n        high     %lu"
> -		   "\n   node_scanned  %lu"
>  		   "\n        spanned  %lu"
>  		   "\n        present  %lu"
>  		   "\n        managed  %lu",
> @@ -1386,7 +1384,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   min_wmark_pages(zone),
>  		   low_wmark_pages(zone),
>  		   high_wmark_pages(zone),
> -		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>  		   zone->spanned_pages,
>  		   zone->present_pages,
>  		   zone->managed_pages);
> @@ -1425,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  node_unreclaimable:  %u"
>  		   "\n  start_pfn:           %lu"
>  		   "\n  node_inactive_ratio: %u",
> -		   !pgdat_reclaimable(zone->zone_pgdat),
> +		   zone->zone_pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES,
>  		   zone->zone_start_pfn,
>  		   zone->zone_pgdat->inactive_ratio);
>  	seq_putc(m, '\n');
> @@ -1587,26 +1584,11 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
>  		if (val < 0) {
> -			switch (i) {
> -			case NR_PAGES_SCANNED:
> -				/*
> -				 * This is often seen to go negative in
> -				 * recent kernels, but not to go permanently
> -				 * negative.  Whilst it would be nicer not to
> -				 * have exceptions, rooting them out would be
> -				 * another task, of rather low priority.
> -				 */
> -				break;
> -			default:
> -				pr_warn("%s: %s %ld\n",
> -					__func__, vmstat_text[i], val);
> -				err = -EINVAL;
> -				break;
> -			}
> +			pr_warn("%s: %s %ld\n",
> +				__func__, vmstat_text[i], val);
> +			return -EINVAL;
>  		}
>  	}
> -	if (err)
> -		return err;
>  	if (write)
>  		*ppos += *lenp;
>  	else
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-23  2:46   ` hejianet
@ 2017-02-23  7:21       ` Michal Hocko
  2017-02-23  7:21       ` Michal Hocko
  1 sibling, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-02-23  7:21 UTC (permalink / raw)
  To: hejianet
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	Mel Gorman, Vlastimil Babka, Minchan Kim, Rik van Riel

On Thu 23-02-17 10:46:01, hejianet wrote:
> sorry, resend it due to a delivery-failure:
> "Wrong MIME labeling on 8-bit character texts"
> I am sorry if anybody received it twice
> ------------
> Hi Johannes
> On 23/02/2017 4:16 AM, Johannes Weiner wrote:
> > On Wed, Feb 22, 2017 at 05:04:48PM +0800, Jia He wrote:
> > > When I try to dynamically allocate the hugepages more than system total
> > > free memory:
> > 
> > > Then the kswapd will take 100% cpu for a long time(more than 3 hours, and
> > > will not be about to end)
> > 
> > > The root cause is kswapd3 is trying to do relaim again and again but it
> > > makes no progress
> > 
> > > At that time, there are no relaimable pages in that node:
> > 
> > Yes, this is a problem with the current kswapd code.
> > 
> > A less artificial scenario that I observed recently was machines with
> > two NUMA nodes, after being up for 200+ days, getting into a state
> > where node0 is mostly consumed by anon and some kernel allocations,
> > leaving less than the high watermark free. The machines don't have
> > swap, so the anon isn't reclaimable. But also, anon LRU is never even
> > *scanned*, so the "all unreclaimable" logic doesn't kick in. Kswapd is
> > spinning at 100% CPU calculating scan counts and checking zone states.
> > 
> > One specific problem with your patch, Jia, is that there might be some
> > cache pages that are pinned one way or another. That was the case on
> > our machines, and so reclaimable pages wasn't 0. Even if we check the
> > reclaimable pages, we need a hard cutoff after X attempts. And then it
> > sounds pretty much like what the allocator/direct reclaim already does.
> > 
> > Can we use the *exact* same cutoff conditions for direct reclaim and
> > kswapd, though? I don't think so. For direct reclaim, the goal is the
> > watermark, to make an allocation happen in the caller. While kswapd
> > tries to restore the watermarks too, it might never meet them but
> > still do useful work on behalf of concurrently allocating threads. It
> > should only stop when it tries and fails to free any pages at all.
> > 
> Yes, this is what I thought before this patch,but seems Michal
> doesn't like this idea :)
> Please see https://lkml.org/lkml/2017/1/24/543

Yeah, I didn't like the hard limit on kswapd retries as you proposed it.
It didn't make much sense to me because the current condition for kswapd
to back off is to have all zones balanced. Without a further criterion,
kswapd would just wake up and go around the same retry loops again with
no progress. I didn't realize that direct reclaim progress might be that
criterion. The proposal from Johannes makes much more sense. I have to
think about it some more, but this looks like a way forward.
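
If I read the diff right, the moving parts reduce to roughly this
condensed sketch (not the kernel code; only the field and constant
names follow the diff):

#define MAX_RECLAIM_RETRIES 16

struct pgdat_model {
        int kswapd_failed_runs;          /* mirrors pgdat->kswapd_failed_runs */
};

/* balance_pgdat(): count runs that reclaimed nothing at all */
static void balance_run_finished(struct pgdat_model *pgdat,
                                 unsigned long nr_reclaimed)
{
        if (!nr_reclaimed)
                pgdat->kswapd_failed_runs++;
}

/* prepare_kswapd_sleep() / wakeup_kswapd(): treat the node as hopeless */
static int node_is_hopeless(const struct pgdat_model *pgdat)
{
        return pgdat->kswapd_failed_runs >= MAX_RECLAIM_RETRIES;
}

/* shrink_node(): any successful reclaim knocks the node loose again */
static void reclaim_made_progress(struct pgdat_model *pgdat)
{
        pgdat->kswapd_failed_runs = 0;
}
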
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 20:24     ` Johannes Weiner
@ 2017-02-23  7:29       ` Michal Hocko
  -1 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-02-23  7:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Jia He, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

On Wed 22-02-17 15:24:06, Johannes Weiner wrote:
> On Wed, Feb 22, 2017 at 03:16:57PM -0500, Johannes Weiner wrote:
> > [...] And then it sounds pretty much like what the allocator/direct
> > reclaim already does.
> 
> On a side note: Michal, I'm not sure I fully understand why we need
> the backoff code in should_reclaim_retry(). If no_progress_loops is
> growing steadily, then we quickly reach 16 and bail anyway. Layering
> on top a backoff function that *might* cut out an iteration or two
> earlier in the cold path of an OOM situation seems unnecessary.
> Conversely, if there *are* intermittent reclaims, no_progress_loops
> gets reset straight to 0, which then also makes the backoff function
> jump back to square one. So in the only situation where backing off
> would make sense - making some progress, but not enough - it's not
> actually backing off. It seems to me it should be enough to bail after
> either 16 iterations or when free + reclaimable < watermark.

Hmm, yes, you are right! I wanted to use this backoff to reduce the
chances of thrashing over the last remaining reclaimable pages. But the
code evolved in a way that it no longer works like that, as you say. I
just got stuck with the code and didn't rethink its relevance during
development.

That being said, I think we will eventually want some backoff logic for
those cases where we still make a little progress but not enough (e.g.
count the number of reclaimed pages and give up once we have reclaimed
some portion of the available reclaimable memory), but the patch below
is a good start toward making the code simpler. Feel free to add my
Acked-by when posting a full patch.
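
Just to spell out the kind of backoff I have in mind, an illustrative
sketch only (the helper name and the 1/4 threshold are made up; nothing
like this is in the patch below):

#include <stdbool.h>

/*
 * Give up retrying reclaim once the pages reclaimed so far add up to
 * some portion of what was reclaimable when we started, even though
 * each individual round still made a little progress.
 */
static bool reclaim_backoff(unsigned long total_reclaimed,
                            unsigned long reclaimable_at_start)
{
        /* the 1/4 portion is an arbitrary example */
        return total_reclaimed >= reclaimable_at_start / 4;
}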

Thanks!

> 
> Hm?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c470b8fe28cf..b0e9495c0530 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3396,11 +3396,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  /*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
> - * The reclaim feedback represented by did_some_progress (any progress during
> - * the last reclaim round) and no_progress_loops (number of reclaim rounds without
> - * any progress in a row) is considered as well as the reclaimable pages on the
> - * applicable zone list (with a backoff mechanism which is a function of
> - * no_progress_loops).
> + *
> + * We give up when we either have tried MAX_RECLAIM_RETRIES in a row
> + * without success, or when we couldn't even meet the watermark if we
> + * reclaimed all remaining pages on the LRU lists.
>   *
>   * Returns true if a retry is viable or false to enter the oom path.
>   */
> @@ -3441,13 +3440,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		unsigned long reclaimable;
>  
>  		available = reclaimable = zone_reclaimable_pages(zone);
> -		available -= DIV_ROUND_UP((*no_progress_loops) * available,
> -					  MAX_RECLAIM_RETRIES);
>  		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  
>  		/*
> -		 * Would the allocation succeed if we reclaimed the whole
> -		 * available?
> +		 * Would the allocation succeed if we reclaimed all
> +		 * the reclaimable pages?
>  		 */
>  		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
>  				ac_classzone_idx(ac), alloc_flags, available)) {

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 20:16   ` Johannes Weiner
@ 2017-02-23 10:19     ` Michal Hocko
  -1 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-02-23 10:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

On Wed 22-02-17 15:16:57, Johannes Weiner wrote:
[...]
> Can we simply count the number of balance_pgdat() runs that didn't
> reclaim anything and have kswapd sleep after MAX_RECLAIM_RETRIES?
> 
> And a follow-up: once it gives up, when should kswapd return to work?
> We used to reset NR_PAGES_SCANNED whenever a page gets freed. But
> that's a branch in a common allocator path, just to recover kswapd - a
> latency tool, not a necessity for functional correctness - from a
> situation that's exceedingly pretty rare. How about we leave it
> disabled until a direct reclaimer manages to free something?

Yes, this makes sense to me and it looks much better than the solution
proposed here. There are some theoretical corner cases, like a heavy
metadata and GFP_NOFS workload which wouldn't be able to reclaim from
the FS shrinkers, where kswapd would be really helpful. But that would
need a general solution of its own.

I also welcome removing NR_PAGES_SCANNED, because it was just too
ephemeral to be actually useful when debugging reclaim behavior. I
think we can accomplish much more with the existing tracepoints. I
would just split that out into a separate follow-up patch.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
  2017-02-22 20:16   ` Johannes Weiner
@ 2017-02-23 11:16     ` Michal Hocko
  -1 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-02-23 11:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Vlastimil Babka, Minchan Kim, Rik van Riel

On Wed 22-02-17 15:16:57, Johannes Weiner wrote:
[...]
> And a follow-up: once it gives up, when should kswapd return to work?
> We used to reset NR_PAGES_SCANNED whenever a page gets freed. But
> that's a branch in a common allocator path, just to recover kswapd - a
> latency tool, not a necessity for functional correctness - from a
> situation that's exceedingly pretty rare. How about we leave it
> disabled until a direct reclaimer manages to free something?

Hmm, I guess we also want to reset the counter after an OOM invocation,
which might free a lot of memory, and we do not want to wait for direct
reclaim to resurrect kswapd. Something like the following on top of
yours:
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ddf27c435225..6be11c18551f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3446,7 +3446,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
+static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac,
+		bool force)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -3454,8 +3455,11 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 					ac->high_zoneidx, ac->nodemask) {
-		if (last_pgdat != zone->zone_pgdat)
+		if (last_pgdat != zone->zone_pgdat) {
+			if (force)
+				zone->zone_pgdat->kswapd_failed_runs = 0;
 			wakeup_kswapd(zone, order, ac->high_zoneidx);
+		}
 		last_pgdat = zone->zone_pgdat;
 	}
 }
@@ -3640,6 +3644,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
+	bool kick_kswapd = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3685,7 +3690,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
-		wake_all_kswapds(order, ac);
+		wake_all_kswapds(order, ac, false);
 
 	/*
 	 * The adjusted alloc_flags might result in immediate success, so try
@@ -3738,7 +3743,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 retry:
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
-		wake_all_kswapds(order, ac);
+		wake_all_kswapds(order, ac, kick_kswapd);
+	kick_kswapd = false;
 
 	if (gfp_pfmemalloc_allowed(gfp_mask))
 		alloc_flags = ALLOC_NO_WATERMARKS;
@@ -3833,6 +3839,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
+		kick_kswapd = true;
 		goto retry;
 	}
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-02-23 11:16 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-22  9:04 [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there Jia He
2017-02-22  9:04 ` Jia He
2017-02-22 11:41 ` Michal Hocko
2017-02-22 11:41   ` Michal Hocko
2017-02-22 14:31   ` hejianet
2017-02-22 14:31     ` hejianet
2017-02-22 15:48     ` Michal Hocko
2017-02-22 15:48       ` Michal Hocko
2017-02-23  2:25       ` hejianet
2017-02-23  2:25         ` hejianet
2017-02-22 20:16 ` Johannes Weiner
2017-02-22 20:16   ` Johannes Weiner
2017-02-22 20:24   ` Johannes Weiner
2017-02-22 20:24     ` Johannes Weiner
2017-02-23  7:29     ` Michal Hocko
2017-02-23  7:29       ` Michal Hocko
2017-02-23  2:24   ` hejianet
2017-02-23  2:46   ` hejianet
2017-02-23  3:15     ` Fwd: " hejianet
2017-02-23  7:21     ` Michal Hocko
2017-02-23  7:21       ` Michal Hocko
2017-02-23 10:19   ` Michal Hocko
2017-02-23 10:19     ` Michal Hocko
2017-02-23 11:16   ` Michal Hocko
2017-02-23 11:16     ` Michal Hocko
