[1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
diff mbox series

Message ID 20170228214007.5621-2-hannes@cmpxchg.org
State New, archived
Headers show
Series
  • mm: kswapd spinning on unreclaimable nodes - fixes and cleanups
Related show

Commit Message

Johannes Weiner Feb. 28, 2017, 9:39 p.m. UTC
Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
kswapd had such a mechanism. It considered zones whose theoretically
reclaimable pages it had reclaimed six times over as unreclaimable and
backed away from them. This guard was erroneously removed as the patch
changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.

v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)

Reported-by: Jia He <hejianet@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  2 ++
 mm/internal.h          |  6 ++++++
 mm/page_alloc.c        |  9 ++-------
 mm/vmscan.c            | 27 ++++++++++++++++++++-------
 mm/vmstat.c            |  2 +-
 5 files changed, 31 insertions(+), 15 deletions(-)

Comments

Hillf Danton March 2, 2017, 3:23 a.m. UTC | #1
On March 01, 2017 5:40 AM Johannes Weiner wrote:
> 
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
> 
> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> 
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Jia He <hejianet@gmail.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Shakeel Butt March 2, 2017, 11:30 p.m. UTC | #2
On Tue, Feb 28, 2017 at 1:39 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
>
> $ echo 4000 >/proc/sys/vm/nr_hugepages
>
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
>
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
>
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
>
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
>
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
>

Should the condition of wait_event_killable in throttle_direct_reclaim
be changed to (pfmemalloc_watermark_ok(pgdat) ||
pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)?

> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
>
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Jia He <hejianet@gmail.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/internal.h          |  6 ++++++
>  mm/page_alloc.c        |  9 ++-------
>  mm/vmscan.c            | 27 ++++++++++++++++++++-------
>  mm/vmstat.c            |  2 +-
>  5 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
>         int kswapd_order;
>         enum zone_type kswapd_classzone_idx;
>
> +       int kswapd_failures;            /* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
>         int kcompactd_max_order;
>         enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/internal.h b/mm/internal.h
> index ccfc2a2969f4..aae93e3fd984 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
>  extern unsigned long highest_memmap_pfn;
>
>  /*
> + * Maximum number of reclaim retries without progress before the OOM
> + * killer is consider the only way forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
>   * in mm/vmscan.c:
>   */
>  extern int isolate_lru_page(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..f50e36e7b024 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>                         K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>                         K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>                         node_page_state(pgdat, NR_PAGES_SCANNED),
> -                       !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +                       pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> +                               "yes" : "no");
>         }
>
>         for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..407b27831ff7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>                                          sc->nr_scanned - nr_scanned, sc));
>
> +       /*
> +        * Kswapd gives up on balancing particular nodes after too
> +        * many failures to reclaim anything from them and goes to
> +        * sleep. On reclaim progress, reset the failure counter. A
> +        * successful direct reclaim run will revive a dormant kswapd.
> +        */
> +       if (reclaimable)
> +               pgdat->kswapd_failures = 0;
> +
>         return reclaimable;
>  }
>
> @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>                                                  GFP_KERNEL | __GFP_HARDWALL))
>                                 continue;
>
> -                       if (sc->priority != DEF_PRIORITY &&
> -                           !pgdat_reclaimable(zone->zone_pgdat))
> -                               continue;       /* Let kswapd poll it */
> -
>                         /*
>                          * If we already have plenty of memory free for
>                          * compaction in this zone, don't free any more.
> @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>         if (waitqueue_active(&pgdat->pfmemalloc_wait))
>                 wake_up_all(&pgdat->pfmemalloc_wait);
>
> +       /* Hopeless node, leave it to direct reclaim */
> +       if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +               return true;
> +
>         for (i = 0; i <= classzone_idx; i++) {
>                 struct zone *zone = pgdat->node_zones + i;
>
> @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                         sc.priority--;
>         } while (sc.priority >= 1);
>
> +       if (!sc.nr_reclaimed)
> +               pgdat->kswapd_failures++;
> +
>  out:
>         /*
>          * Return the order kswapd stopped reclaiming at as
> @@ -3515,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>         if (!waitqueue_active(&pgdat->kswapd_wait))
>                 return;
>
> +       /* Hopeless node, leave it to direct reclaim */
> +       if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +               return;
> +
>         /* Only wake kswapd if all zones are unbalanced */
>         for (z = 0; z <= classzone_idx; z++) {
>                 zone = pgdat->node_zones + z;
> @@ -3785,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>             sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
>                 return NODE_RECLAIM_FULL;
>
> -       if (!pgdat_reclaimable(pgdat))
> -               return NODE_RECLAIM_FULL;
> -
>         /*
>          * Do not scan if the allocation should not be delayed.
>          */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff39a2e..ff16cdc15df2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1422,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n  node_unreclaimable:  %u"
>                    "\n  start_pfn:           %lu"
>                    "\n  node_inactive_ratio: %u",
> -                  !pgdat_reclaimable(zone->zone_pgdat),
> +                  pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
>                    zone->zone_start_pfn,
>                    zone->zone_pgdat->inactive_ratio);
>         seq_putc(m, '\n');
> --
> 2.11.1
>
Minchan Kim March 3, 2017, 1:26 a.m. UTC | #3
Hi Johannes,

On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
> 
> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> 
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Jia He <hejianet@gmail.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/internal.h          |  6 ++++++
>  mm/page_alloc.c        |  9 ++-------
>  mm/vmscan.c            | 27 ++++++++++++++++++++-------
>  mm/vmstat.c            |  2 +-
>  5 files changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>  
> +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
>  	enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/internal.h b/mm/internal.h
> index ccfc2a2969f4..aae93e3fd984 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
>  extern unsigned long highest_memmap_pfn;
>  
>  /*
> + * Maximum number of reclaim retries without progress before the OOM
> + * killer is consider the only way forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
>   * in mm/vmscan.c:
>   */
>  extern int isolate_lru_page(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..f50e36e7b024 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>  
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>  			node_page_state(pgdat, NR_PAGES_SCANNED),
> -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> +			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> +				"yes" : "no");
>  	}
>  
>  	for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..407b27831ff7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>  
> +	/*
> +	 * Kswapd gives up on balancing particular nodes after too
> +	 * many failures to reclaim anything from them and goes to
> +	 * sleep. On reclaim progress, reset the failure counter. A
> +	 * successful direct reclaim run will revive a dormant kswapd.
> +	 */
> +	if (reclaimable)
> +		pgdat->kswapd_failures = 0;
> +
>  	return reclaimable;
>  }
>  
> @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;
>  
> -			if (sc->priority != DEF_PRIORITY &&
> -			    !pgdat_reclaimable(zone->zone_pgdat))
> -				continue;	/* Let kswapd poll it */
> -
>  			/*
>  			 * If we already have plenty of memory free for
>  			 * compaction in this zone, don't free any more.
> @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>  
> +	/* Hopeless node, leave it to direct reclaim */

I hope to clear what we want by deferring the job to direct reclaim.
Direct reclaim is much limited reclaim worker by serveral things(e.g.,
avoid writeback for stack overflow, NOIO|NOFS context) so what do we
want for direct reclaimer to do even if kswapd can make forward
progress? OOM?

> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
>  	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>  
> @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>  
> +	if (!sc.nr_reclaimed)
> +		pgdat->kswapd_failures++;

sc.nr_reclaimed is reset to zero in above big loop's beginning so most of time,
it pgdat->kswapd_failures is increased.

> +
>  out:
>  	/*
>  	 * Return the order kswapd stopped reclaiming at as
> @@ -3515,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
>  
> +	/* Hopeless node, leave it to direct reclaim */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return;
> +
>  	/* Only wake kswapd if all zones are unbalanced */
>  	for (z = 0; z <= classzone_idx; z++) {
>  		zone = pgdat->node_zones + z;
> @@ -3785,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
>  		return NODE_RECLAIM_FULL;
>  
> -	if (!pgdat_reclaimable(pgdat))
> -		return NODE_RECLAIM_FULL;
> -
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
>  	 */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff39a2e..ff16cdc15df2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1422,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  node_unreclaimable:  %u"
>  		   "\n  start_pfn:           %lu"
>  		   "\n  node_inactive_ratio: %u",
> -		   !pgdat_reclaimable(zone->zone_pgdat),
> +		   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
>  		   zone->zone_start_pfn,
>  		   zone->zone_pgdat->inactive_ratio);
>  	seq_putc(m, '\n');
> -- 
> 2.11.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Michal Hocko March 3, 2017, 7:59 a.m. UTC | #4
On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> Hi Johannes,
> 
> On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > Jia He reports a problem with kswapd spinning at 100% CPU when
> > requesting more hugepages than memory available in the system:
> > 
> > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > 
> > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> > KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> > 
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> > 
> > At that time, there are no reclaimable pages left in the node, but as
> > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > 
> > Kswapd needs to back away from nodes that fail to balance. Up until
> > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > kswapd had such a mechanism. It considered zones whose theoretically
> > reclaimable pages it had reclaimed six times over as unreclaimable and
> > backed away from them. This guard was erroneously removed as the patch
> > changed the definition of a balanced node.
> > 
> > However, simply restoring this code wouldn't help in the case reported
> > here: there *are* no reclaimable pages that could be scanned until the
> > threshold is met. Kswapd would stay awake anyway.
> > 
> > Introduce a new and much simpler way of backing off. If kswapd runs
> > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > page, make it back off from the node. This is the same number of shots
> > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > that node until a direct reclaimer manages to reclaim some pages, thus
> > proving the node reclaimable again.
> > 
> > v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> > 
> > Reported-by: Jia He <hejianet@gmail.com>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Tested-by: Jia He <hejianet@gmail.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  include/linux/mmzone.h |  2 ++
> >  mm/internal.h          |  6 ++++++
> >  mm/page_alloc.c        |  9 ++-------
> >  mm/vmscan.c            | 27 ++++++++++++++++++++-------
> >  mm/vmstat.c            |  2 +-
> >  5 files changed, 31 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 8e02b3750fe0..d2c50ab6ae40 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -630,6 +630,8 @@ typedef struct pglist_data {
> >  	int kswapd_order;
> >  	enum zone_type kswapd_classzone_idx;
> >  
> > +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> > +
> >  #ifdef CONFIG_COMPACTION
> >  	int kcompactd_max_order;
> >  	enum zone_type kcompactd_classzone_idx;
> > diff --git a/mm/internal.h b/mm/internal.h
> > index ccfc2a2969f4..aae93e3fd984 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
> >  extern unsigned long highest_memmap_pfn;
> >  
> >  /*
> > + * Maximum number of reclaim retries without progress before the OOM
> > + * killer is consider the only way forward.
> > + */
> > +#define MAX_RECLAIM_RETRIES 16
> > +
> > +/*
> >   * in mm/vmscan.c:
> >   */
> >  extern int isolate_lru_page(struct page *page);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 614cd0397ce3..f50e36e7b024 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >  }
> >  
> >  /*
> > - * Maximum number of reclaim retries without any progress before OOM killer
> > - * is consider as the only way to move forward.
> > - */
> > -#define MAX_RECLAIM_RETRIES 16
> > -
> > -/*
> >   * Checks whether it makes sense to retry the reclaim to make a forward progress
> >   * for the given allocation request.
> >   * The reclaim feedback represented by did_some_progress (any progress during
> > @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> >  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> >  			node_page_state(pgdat, NR_PAGES_SCANNED),
> > -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> > +			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> > +				"yes" : "no");
> >  	}
> >  
> >  	for_each_populated_zone(zone) {
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 26c3b405ef34..407b27831ff7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> >  					 sc->nr_scanned - nr_scanned, sc));
> >  
> > +	/*
> > +	 * Kswapd gives up on balancing particular nodes after too
> > +	 * many failures to reclaim anything from them and goes to
> > +	 * sleep. On reclaim progress, reset the failure counter. A
> > +	 * successful direct reclaim run will revive a dormant kswapd.
> > +	 */
> > +	if (reclaimable)
> > +		pgdat->kswapd_failures = 0;
> > +
> >  	return reclaimable;
> >  }
> >  
> > @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> >  						 GFP_KERNEL | __GFP_HARDWALL))
> >  				continue;
> >  
> > -			if (sc->priority != DEF_PRIORITY &&
> > -			    !pgdat_reclaimable(zone->zone_pgdat))
> > -				continue;	/* Let kswapd poll it */
> > -
> >  			/*
> >  			 * If we already have plenty of memory free for
> >  			 * compaction in this zone, don't free any more.
> > @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
> >  		wake_up_all(&pgdat->pfmemalloc_wait);
> >  
> > +	/* Hopeless node, leave it to direct reclaim */
> 
> I hope to clear what we want by deferring the job to direct reclaim.
> Direct reclaim is much limited reclaim worker by serveral things(e.g.,
> avoid writeback for stack overflow, NOIO|NOFS context)

This is true but if kswapd cannot reclaim anything at all then we do not
have much choice left

> so what do we
> want for direct reclaimer to do even if kswapd can make forward
> progress? OOM?

yes resp. back off for costly high order requests and leave the node
unbalanced.

> > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +		return true;
> > +
> >  	for (i = 0; i <= classzone_idx; i++) {
> >  		struct zone *zone = pgdat->node_zones + i;
> >  
> > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> >  			sc.priority--;
> >  	} while (sc.priority >= 1);
> >  
> > +	if (!sc.nr_reclaimed)
> > +		pgdat->kswapd_failures++;
> 
> sc.nr_reclaimed is reset to zero in above big loop's beginning so most of time,
> it pgdat->kswapd_failures is increased.

But then we increase the counter in kswapd_shrink_node or do I miss your
point? Are you suggesting to use the aggregate nr_reclaimed over all
priorities because the last round might have made no progress?
Minchan Kim March 6, 2017, 1:37 a.m. UTC | #5
Hi Michal,

On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > Hi Johannes,
> > 
> > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > Jia He reports a problem with kswapd spinning at 100% CPU when
> > > requesting more hugepages than memory available in the system:
> > > 
> > > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > > 
> > > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > > KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> > > KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> > > 
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > >    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> > > 
> > > At that time, there are no reclaimable pages left in the node, but as
> > > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > > 
> > > Kswapd needs to back away from nodes that fail to balance. Up until
> > > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > > kswapd had such a mechanism. It considered zones whose theoretically
> > > reclaimable pages it had reclaimed six times over as unreclaimable and
> > > backed away from them. This guard was erroneously removed as the patch
> > > changed the definition of a balanced node.
> > > 
> > > However, simply restoring this code wouldn't help in the case reported
> > > here: there *are* no reclaimable pages that could be scanned until the
> > > threshold is met. Kswapd would stay awake anyway.
> > > 
> > > Introduce a new and much simpler way of backing off. If kswapd runs
> > > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > > page, make it back off from the node. This is the same number of shots
> > > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > > that node until a direct reclaimer manages to reclaim some pages, thus
> > > proving the node reclaimable again.
> > > 
> > > v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> > > 
> > > Reported-by: Jia He <hejianet@gmail.com>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Tested-by: Jia He <hejianet@gmail.com>
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  include/linux/mmzone.h |  2 ++
> > >  mm/internal.h          |  6 ++++++
> > >  mm/page_alloc.c        |  9 ++-------
> > >  mm/vmscan.c            | 27 ++++++++++++++++++++-------
> > >  mm/vmstat.c            |  2 +-
> > >  5 files changed, 31 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 8e02b3750fe0..d2c50ab6ae40 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -630,6 +630,8 @@ typedef struct pglist_data {
> > >  	int kswapd_order;
> > >  	enum zone_type kswapd_classzone_idx;
> > >  
> > > +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> > > +
> > >  #ifdef CONFIG_COMPACTION
> > >  	int kcompactd_max_order;
> > >  	enum zone_type kcompactd_classzone_idx;
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index ccfc2a2969f4..aae93e3fd984 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
> > >  extern unsigned long highest_memmap_pfn;
> > >  
> > >  /*
> > > + * Maximum number of reclaim retries without progress before the OOM
> > > + * killer is consider the only way forward.
> > > + */
> > > +#define MAX_RECLAIM_RETRIES 16
> > > +
> > > +/*
> > >   * in mm/vmscan.c:
> > >   */
> > >  extern int isolate_lru_page(struct page *page);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 614cd0397ce3..f50e36e7b024 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  }
> > >  
> > >  /*
> > > - * Maximum number of reclaim retries without any progress before OOM killer
> > > - * is consider as the only way to move forward.
> > > - */
> > > -#define MAX_RECLAIM_RETRIES 16
> > > -
> > > -/*
> > >   * Checks whether it makes sense to retry the reclaim to make a forward progress
> > >   * for the given allocation request.
> > >   * The reclaim feedback represented by did_some_progress (any progress during
> > > @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > >  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> > >  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> > >  			node_page_state(pgdat, NR_PAGES_SCANNED),
> > > -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> > > +			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> > > +				"yes" : "no");
> > >  	}
> > >  
> > >  	for_each_populated_zone(zone) {
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 26c3b405ef34..407b27831ff7 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > >  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> > >  					 sc->nr_scanned - nr_scanned, sc));
> > >  
> > > +	/*
> > > +	 * Kswapd gives up on balancing particular nodes after too
> > > +	 * many failures to reclaim anything from them and goes to
> > > +	 * sleep. On reclaim progress, reset the failure counter. A
> > > +	 * successful direct reclaim run will revive a dormant kswapd.
> > > +	 */
> > > +	if (reclaimable)
> > > +		pgdat->kswapd_failures = 0;
> > > +
> > >  	return reclaimable;
> > >  }
> > >  
> > > @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > >  						 GFP_KERNEL | __GFP_HARDWALL))
> > >  				continue;
> > >  
> > > -			if (sc->priority != DEF_PRIORITY &&
> > > -			    !pgdat_reclaimable(zone->zone_pgdat))
> > > -				continue;	/* Let kswapd poll it */
> > > -
> > >  			/*
> > >  			 * If we already have plenty of memory free for
> > >  			 * compaction in this zone, don't free any more.
> > > @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> > >  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
> > >  		wake_up_all(&pgdat->pfmemalloc_wait);
> > >  
> > > +	/* Hopeless node, leave it to direct reclaim */
> > 
> > I hope to clear what we want by deferring the job to direct reclaim.
> > Direct reclaim is much limited reclaim worker by serveral things(e.g.,
> > avoid writeback for stack overflow, NOIO|NOFS context)
> 
> This is true but if kswapd cannot reclaim anything at all then we do not
> have much choice left
> 
> > so what do we
> > want for direct reclaimer to do even if kswapd can make forward
> > progress? OOM?
> 
> yes resp. back off for costly high order requests and leave the node
> unbalanced.

Okay, I just wanted to clear it out because we have kept logic to
prevent CPU burn out of direct reclaim in case of being full with
unreclaiamble pages of zones. And this patch removes it in
shrink_zones. It might be optimized to cut off direct reclaim
if kswapd failure is higher than threshold so we reach OOM fast
without pointless retrial to reclaim in direct reclaim path but
I guess it would be rare case so no worth to optimize.

commit 36fb7f8
Author: Andrew Morton <akpm@digeo.com>
Date:   Thu Nov 21 19:32:34 2002 -0800

    [PATCH] handle zones which are full of unreclaimable pages

> 
> > > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > > +		return true;
> > > +
> > >  	for (i = 0; i <= classzone_idx; i++) {
> > >  		struct zone *zone = pgdat->node_zones + i;
> > >  
> > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > >  			sc.priority--;
> > >  	} while (sc.priority >= 1);
> > >  
> > > +	if (!sc.nr_reclaimed)
> > > +		pgdat->kswapd_failures++;
> > 
> > sc.nr_reclaimed is reset to zero in above big loop's beginning so most of time,
> > it pgdat->kswapd_failures is increased.
> 
> But then we increase the counter in kswapd_shrink_node or do I miss your
> point? Are you suggesting to use the aggregate nr_reclaimed over all
> priorities because the last round might have made no progress?

Yes.

Let's assume there is severe memory pressure so there would be less
LRU pages than sum += highwatermark of eligible zones(As well,
user can configure watermark big in a specific zone). In that case,
kswapd will increase prioirity by kswapd_shrink_node's return check
although it reclaims a few of pages.

Also, process can consume pages kswapd reclaimed in parallel without
entering slow path because it uses *low* watermark. So there would be
no chance to reset kswapd_failure to zero until it goes slow path.

Also, although it goes direct reclaim's slow path, it cannot wakeup
kswapd until it can make forward progress which is condition to
reset kswapd_failure and consider direct reclaimer's context is
easily limited with NO_[FS|IO] so sometime, it would be hard to make
forward progress.

We can rule out that situation easily via aggregating nr_reclaimed
in balance_pgdat, simply. Why not?
Johannes Weiner March 6, 2017, 4:24 p.m. UTC | #6
On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > > >  			sc.priority--;
> > > >  	} while (sc.priority >= 1);
> > > >  
> > > > +	if (!sc.nr_reclaimed)
> > > > +		pgdat->kswapd_failures++;
> > > 
> > > sc.nr_reclaimed is reset to zero in above big loop's beginning so most of time,
> > > it pgdat->kswapd_failures is increased.

That wasn't intentional; I didn't see the sc.nr_reclaimed reset.

---

>From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 6 Mar 2017 10:53:59 -0500
Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix

Check kswapd failure against the cumulative nr_reclaimed count, not
against the count from the lowest priority iteration.

Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ddcff8a11c1e..b834b2dd4e19 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	count_vm_event(PAGEOUTRUN);
 
 	do {
+		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
 
-		sc.nr_reclaimed = 0;
 		sc.reclaim_idx = classzone_idx;
 
 		/*
@@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-		if (raise_priority || !sc.nr_reclaimed)
+		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
Hillf Danton March 7, 2017, 12:59 a.m. UTC | #7
On March 07, 2017 12:24 AM Johannes Weiner wrote: 
> On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> > On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > > > >  			sc.priority--;
> > > > >  	} while (sc.priority >= 1);
> > > > >
> > > > > +	if (!sc.nr_reclaimed)
> > > > > +		pgdat->kswapd_failures++;
> > > >
> > > > sc.nr_reclaimed is reset to zero in above big loop's beginning so most of time,
> > > > it pgdat->kswapd_failures is increased.
> 
> That wasn't intentional; I didn't see the sc.nr_reclaimed reset.
> 
> ---
> 
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ddcff8a11c1e..b834b2dd4e19 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  	count_vm_event(PAGEOUTRUN);
> 
>  	do {
> +		unsigned long nr_reclaimed = sc.nr_reclaimed;
>  		bool raise_priority = true;
> 
> -		sc.nr_reclaimed = 0;

This has another effect that we'll reclaim less pages than we're
currently doing if we are balancing for high order request. And 
it looks worth including that info also in log message.

>  		sc.reclaim_idx = classzone_idx;
> 
>  		/*
> @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * Raise priority if scanning rate is too low or there was no
>  		 * progress in reclaiming pages
>  		 */
> -		if (raise_priority || !sc.nr_reclaimed)
> +		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> +		if (raise_priority || !nr_reclaimed)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
> 
> --
> 2.11.1
Minchan Kim March 7, 2017, 7:28 a.m. UTC | #8
On Mon, Mar 06, 2017 at 11:24:10AM -0500, Johannes Weiner wrote:
> On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> > On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > > > >  			sc.priority--;
> > > > >  	} while (sc.priority >= 1);
> > > > >  
> > > > > +	if (!sc.nr_reclaimed)
> > > > > +		pgdat->kswapd_failures++;
> > > > 
> > > > sc.nr_reclaimed is reset to zero in above big loop's beginning so most of time,
> > > > it pgdat->kswapd_failures is increased.
> 
> That wasn't intentional; I didn't see the sc.nr_reclaimed reset.
> 
> ---
> 
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Michal Hocko March 7, 2017, 10:17 a.m. UTC | #9
On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
[...]
> >From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ddcff8a11c1e..b834b2dd4e19 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  	count_vm_event(PAGEOUTRUN);
>  
>  	do {
> +		unsigned long nr_reclaimed = sc.nr_reclaimed;
>  		bool raise_priority = true;
>  
> -		sc.nr_reclaimed = 0;
>  		sc.reclaim_idx = classzone_idx;
>  
>  		/*
> @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * Raise priority if scanning rate is too low or there was no
>  		 * progress in reclaiming pages
>  		 */
> -		if (raise_priority || !sc.nr_reclaimed)
> +		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> +		if (raise_priority || !nr_reclaimed)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>  

I would rather not play with the sc state here. From a quick look at
least 
	/*
	 * Fragmentation may mean that the system cannot be rebalanced for
	 * high-order allocations. If twice the allocation size has been
	 * reclaimed then recheck watermarks only at order-0 to prevent
	 * excessive reclaim. Assume that a process requested a high-order
	 * can direct reclaim/compact.
	 */
	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
		sc->order = 0;

does rely on the value. Wouldn't something like the following be safer?
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..b731f24fed12 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3183,6 +3183,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		.may_unmap = 1,
 		.may_swap = 1,
 	};
+	bool reclaimable = false;
 	count_vm_event(PAGEOUTRUN);
 
 	do {
@@ -3274,6 +3275,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		if (try_to_freeze() || kthread_should_stop())
 			break;
 
+		if (sc.nr_reclaimed)
+			reclaimable = true;
+
 		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
@@ -3282,7 +3286,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
-	if (!sc.nr_reclaimed)
+	if (!reclaimable)
 		pgdat->kswapd_failures++;
 
 out:
Johannes Weiner March 7, 2017, 4:56 p.m. UTC | #10
On Tue, Mar 07, 2017 at 11:17:02AM +0100, Michal Hocko wrote:
> On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
> > @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> >  		 * Raise priority if scanning rate is too low or there was no
> >  		 * progress in reclaiming pages
> >  		 */
> > -		if (raise_priority || !sc.nr_reclaimed)
> > +		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> > +		if (raise_priority || !nr_reclaimed)
> >  			sc.priority--;
> >  	} while (sc.priority >= 1);
> >  
> 
> I would rather not play with the sc state here. From a quick look at
> least 
> 	/*
> 	 * Fragmentation may mean that the system cannot be rebalanced for
> 	 * high-order allocations. If twice the allocation size has been
> 	 * reclaimed then recheck watermarks only at order-0 to prevent
> 	 * excessive reclaim. Assume that a process requested a high-order
> 	 * can direct reclaim/compact.
> 	 */
> 	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> 		sc->order = 0;
> 
> does rely on the value. Wouldn't something like the following be safer?

Well, what behavior is correct, though? This check looks like an
argument *against* resetting sc.nr_reclaimed.

If kswapd is woken up for a higher order, this check sets a reclaim
cutoff beyond which it should give up on the order and balance for 0.

That's on the scope of the kswapd invocation. Applying this threshold
to the outcome of just the preceeding priority seems like a mistake.

Mel? Vlastimil?
Mel Gorman March 9, 2017, 2:20 p.m. UTC | #11
On Tue, Mar 07, 2017 at 11:56:31AM -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 11:17:02AM +0100, Michal Hocko wrote:
> > On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
> > > @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > >  		 * Raise priority if scanning rate is too low or there was no
> > >  		 * progress in reclaiming pages
> > >  		 */
> > > -		if (raise_priority || !sc.nr_reclaimed)
> > > +		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> > > +		if (raise_priority || !nr_reclaimed)
> > >  			sc.priority--;
> > >  	} while (sc.priority >= 1);
> > >  
> > 
> > I would rather not play with the sc state here. From a quick look at
> > least 
> > 	/*
> > 	 * Fragmentation may mean that the system cannot be rebalanced for
> > 	 * high-order allocations. If twice the allocation size has been
> > 	 * reclaimed then recheck watermarks only at order-0 to prevent
> > 	 * excessive reclaim. Assume that a process requested a high-order
> > 	 * can direct reclaim/compact.
> > 	 */
> > 	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> > 		sc->order = 0;
> > 
> > does rely on the value. Wouldn't something like the following be safer?
> 
> Well, what behavior is correct, though? This check looks like an
> argument *against* resetting sc.nr_reclaimed.
> 
> If kswapd is woken up for a higher order, this check sets a reclaim
> cutoff beyond which it should give up on the order and balance for 0.
> 
> That's on the scope of the kswapd invocation. Applying this threshold
> to the outcome of just the preceeding priority seems like a mistake.
> 
> Mel? Vlastimil?

I cannot say which is definitely the correct behaviour. The current
behaviour is conservative due to the historical concerns about kswapd
reclaiming the world. The hazard as I see it is that resetting it *may*
lead to more aggressive reclaim for high-order allocations. That may be a
welcome outcome to some that really want high-order pages and be unwelcome
to others that prefer pages to remain resident.

However, in this case it's a tight window and problems would be tricky to
detect. THP allocations won't trigger the behaviour and with vmalloc'd
stack, I'd expect that only SLUB-intensive workloads using high-order
pages would trigger any adverse behaviour. While I'm mildly concerned, I
would be a little surprised if it actually caused runaway reclaim.

Patch
diff mbox series

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e02b3750fe0..d2c50ab6ae40 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -630,6 +630,8 @@  typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
diff --git a/mm/internal.h b/mm/internal.h
index ccfc2a2969f4..aae93e3fd984 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -81,6 +81,12 @@  static inline void set_page_refcounted(struct page *page)
 extern unsigned long highest_memmap_pfn;
 
 /*
+ * Maximum number of reclaim retries without progress before the OOM
+ * killer is consider the only way forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
  * in mm/vmscan.c:
  */
 extern int isolate_lru_page(struct page *page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 614cd0397ce3..f50e36e7b024 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3516,12 +3516,6 @@  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 }
 
 /*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
-/*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
@@ -4527,7 +4521,8 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			node_page_state(pgdat, NR_PAGES_SCANNED),
-			!pgdat_reclaimable(pgdat) ? "yes" : "no");
+			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
+				"yes" : "no");
 	}
 
 	for_each_populated_zone(zone) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..407b27831ff7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2626,6 +2626,15 @@  static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
+	/*
+	 * Kswapd gives up on balancing particular nodes after too
+	 * many failures to reclaim anything from them and goes to
+	 * sleep. On reclaim progress, reset the failure counter. A
+	 * successful direct reclaim run will revive a dormant kswapd.
+	 */
+	if (reclaimable)
+		pgdat->kswapd_failures = 0;
+
 	return reclaimable;
 }
 
@@ -2700,10 +2709,6 @@  static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
 
-			if (sc->priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;	/* Let kswapd poll it */
-
 			/*
 			 * If we already have plenty of memory free for
 			 * compaction in this zone, don't free any more.
@@ -3134,6 +3139,10 @@  static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -3316,6 +3325,9 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
+	if (!sc.nr_reclaimed)
+		pgdat->kswapd_failures++;
+
 out:
 	/*
 	 * Return the order kswapd stopped reclaiming at as
@@ -3515,6 +3527,10 @@  void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return;
+
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
@@ -3785,9 +3801,6 @@  int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(pgdat))
-		return NODE_RECLAIM_FULL;
-
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 69f9aff39a2e..ff16cdc15df2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1422,7 +1422,7 @@  static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  node_unreclaimable:  %u"
 		   "\n  start_pfn:           %lu"
 		   "\n  node_inactive_ratio: %u",
-		   !pgdat_reclaimable(zone->zone_pgdat),
+		   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
 		   zone->zone_start_pfn,
 		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');