* [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
@ 2020-02-19 18:25 Sultan Alsawaf
  2020-02-19 19:13 ` Dave Hansen
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-19 18:25 UTC (permalink / raw)
  Cc: Sultan Alsawaf, Andrew Morton, linux-mm, linux-kernel

From: Sultan Alsawaf <sultan@kerneltoast.com>

Keeping kswapd running when all the failed allocations that invoked it
are satisfied incurs a high overhead due to unnecessary page eviction
and writeback, as well as spurious VM pressure events to various
registered shrinkers. When kswapd doesn't need to work to make an
allocation succeed anymore, stop it prematurely to save resources.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 17 ++++++++++++++---
 mm/vmscan.c            |  3 ++-
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..49c922abfb90 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
+#include <linux/refcount.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -735,6 +736,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
+	refcount_t kswapd_waiters;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;	/* Protected by
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..2d4caacfd2fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4401,6 +4401,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
+	pg_data_t *pgdat = ac->preferred_zoneref->zone->zone_pgdat;
+	bool woke_kswapd = false;
 
 	/*
 	 * We also sanity check to catch abuse of atomic reserves being used by
@@ -4434,8 +4436,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
-	if (alloc_flags & ALLOC_KSWAPD)
+	if (alloc_flags & ALLOC_KSWAPD) {
+		if (!woke_kswapd) {
+			refcount_inc(&pgdat->kswapd_waiters);
+			woke_kswapd = true;
+		}
 		wake_all_kswapds(order, gfp_mask, ac);
+	}
 
 	/*
 	 * The adjusted alloc_flags might result in immediate success, so try
@@ -4640,9 +4647,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 	}
 fail:
-	warn_alloc(gfp_mask, ac->nodemask,
-			"page allocation failure: order:%u", order);
 got_pg:
+	if (woke_kswapd)
+		refcount_dec(&pgdat->kswapd_waiters);
+	if (!page)
+		warn_alloc(gfp_mask, ac->nodemask,
+				"page allocation failure: order:%u", order);
 	return page;
 }
 
@@ -6711,6 +6721,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
+	pgdat->kswapd_waiters = (refcount_t)REFCOUNT_INIT(0);
 }
 
 static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05eb9efec07..e795add372d1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3694,7 +3694,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		__fs_reclaim_release();
 		ret = try_to_freeze();
 		__fs_reclaim_acquire();
-		if (ret || kthread_should_stop())
+		if (ret || kthread_should_stop() ||
+		    !refcount_read(&pgdat->kswapd_waiters))
 			break;
 
 		/*
-- 
2.25.1



* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 18:25 [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages Sultan Alsawaf
@ 2020-02-19 19:13 ` Dave Hansen
  2020-02-19 19:40   ` Sultan Alsawaf
  2020-02-19 19:26 ` Andrew Morton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2020-02-19 19:13 UTC (permalink / raw)
  To: Sultan Alsawaf; +Cc: Andrew Morton, linux-mm, linux-kernel

On 2/19/20 10:25 AM, Sultan Alsawaf wrote:
> Keeping kswapd running when all the failed allocations that invoked it
> are satisfied incurs a high overhead due to unnecessary page eviction
> and writeback, as well as spurious VM pressure events to various
> registered shrinkers. When kswapd doesn't need to work to make an
> allocation succeed anymore, stop it prematurely to save resources.

But kswapd isn't just to provide memory to waiters.  It also serves to
get free memory back up to the high watermark.  This seems like it might
result in more frequent allocation stalls and kswapd wakeups, which
consumes extra resources.

I guess I'd wonder what positive effects you have observed as a result
of this patch and whether you've gone looking for any negative effects.


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 18:25 [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages Sultan Alsawaf
  2020-02-19 19:13 ` Dave Hansen
@ 2020-02-19 19:26 ` Andrew Morton
  2020-02-19 22:45   ` Sultan Alsawaf
  2020-02-19 19:35 ` Michal Hocko
  2020-02-21  4:30 ` [PATCH v2] " Sultan Alsawaf
  3 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2020-02-19 19:26 UTC (permalink / raw)
  To: Sultan Alsawaf; +Cc: linux-mm, linux-kernel

On Wed, 19 Feb 2020 10:25:22 -0800 Sultan Alsawaf <sultan@kerneltoast.com> wrote:

> Keeping kswapd running when all the failed allocations that invoked it
> are satisfied incurs a high overhead due to unnecessary page eviction
> and writeback, as well as spurious VM pressure events to various
> registered shrinkers. When kswapd doesn't need to work to make an
> allocation succeed anymore, stop it prematurely to save resources.

Seems sensible.

Please fully describe the userspace-visible runtime effects of this
change?



* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 18:25 [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages Sultan Alsawaf
  2020-02-19 19:13 ` Dave Hansen
  2020-02-19 19:26 ` Andrew Morton
@ 2020-02-19 19:35 ` Michal Hocko
  2020-02-21  4:30 ` [PATCH v2] " Sultan Alsawaf
  3 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2020-02-19 19:35 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman, Johannes Weiner

[Cc Mel and Johannes]

On Wed 19-02-20 10:25:22, Sultan Alsawaf wrote:
> From: Sultan Alsawaf <sultan@kerneltoast.com>
> 
> Keeping kswapd running when all the failed allocations that invoked it
> are satisfied incurs a high overhead due to unnecessary page eviction
> and writeback, as well as spurious VM pressure events to various
> registered shrinkers. When kswapd doesn't need to work to make an
> allocation succeed anymore, stop it prematurely to save resources.

I do not think this patch is correct. kswapd is supposed to balance a
node and get it up to the high watermark. The number of contexts which
woke it up is not really relevant, if for no other reason than that each
allocation request might be of a different size.

Could you be more specific about the problem you are trying to address
please?

> Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/page_alloc.c        | 17 ++++++++++++++---
>  mm/vmscan.c            |  3 ++-
>  3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 462f6873905a..49c922abfb90 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -20,6 +20,7 @@
>  #include <linux/atomic.h>
>  #include <linux/mm_types.h>
>  #include <linux/page-flags.h>
> +#include <linux/refcount.h>
>  #include <asm/page.h>
>  
>  /* Free memory management - zoned buddy allocator.  */
> @@ -735,6 +736,7 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> +	refcount_t kswapd_waiters;
>  	wait_queue_head_t kswapd_wait;
>  	wait_queue_head_t pfmemalloc_wait;
>  	struct task_struct *kswapd;	/* Protected by
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..2d4caacfd2fc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4401,6 +4401,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int no_progress_loops;
>  	unsigned int cpuset_mems_cookie;
>  	int reserve_flags;
> +	pg_data_t *pgdat = ac->preferred_zoneref->zone->zone_pgdat;
> +	bool woke_kswapd = false;
>  
>  	/*
>  	 * We also sanity check to catch abuse of atomic reserves being used by
> @@ -4434,8 +4436,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (!ac->preferred_zoneref->zone)
>  		goto nopage;
>  
> -	if (alloc_flags & ALLOC_KSWAPD)
> +	if (alloc_flags & ALLOC_KSWAPD) {
> +		if (!woke_kswapd) {
> +			refcount_inc(&pgdat->kswapd_waiters);
> +			woke_kswapd = true;
> +		}
>  		wake_all_kswapds(order, gfp_mask, ac);
> +	}
>  
>  	/*
>  	 * The adjusted alloc_flags might result in immediate success, so try
> @@ -4640,9 +4647,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto retry;
>  	}
>  fail:
> -	warn_alloc(gfp_mask, ac->nodemask,
> -			"page allocation failure: order:%u", order);
>  got_pg:
> +	if (woke_kswapd)
> +		refcount_dec(&pgdat->kswapd_waiters);
> +	if (!page)
> +		warn_alloc(gfp_mask, ac->nodemask,
> +				"page allocation failure: order:%u", order);
>  	return page;
>  }
>  
> @@ -6711,6 +6721,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  	pgdat_page_ext_init(pgdat);
>  	spin_lock_init(&pgdat->lru_lock);
>  	lruvec_init(&pgdat->__lruvec);
> +	pgdat->kswapd_waiters = (refcount_t)REFCOUNT_INIT(0);
>  }
>  
>  static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c05eb9efec07..e795add372d1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3694,7 +3694,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		__fs_reclaim_release();
>  		ret = try_to_freeze();
>  		__fs_reclaim_acquire();
> -		if (ret || kthread_should_stop())
> +		if (ret || kthread_should_stop() ||
> +		    !refcount_read(&pgdat->kswapd_waiters))
>  			break;
>  
>  		/*
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 19:13 ` Dave Hansen
@ 2020-02-19 19:40   ` Sultan Alsawaf
  2020-02-19 20:05     ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-19 19:40 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Andrew Morton, linux-mm, linux-kernel

On Wed, Feb 19, 2020 at 11:13:21AM -0800, Dave Hansen wrote:
> On 2/19/20 10:25 AM, Sultan Alsawaf wrote:
> > Keeping kswapd running when all the failed allocations that invoked it
> > are satisfied incurs a high overhead due to unnecessary page eviction
> > and writeback, as well as spurious VM pressure events to various
> > registered shrinkers. When kswapd doesn't need to work to make an
> > allocation succeed anymore, stop it prematurely to save resources.
> 
> But kswapd isn't just to provide memory to waiters.  It also serves to
> get free memory back up to the high watermark.  This seems like it might
> result in more frequent allocation stalls and kswapd wakeups, which
> consumes extra resources.
> 
> I guess I'd wonder what positive effects you have observed as a result
> of this patch and whether you've gone looking for any negative effects.

This patch essentially stops kswapd from going overboard when a failed
allocation fires up kswapd. Otherwise, when memory pressure is really high,
kswapd just chomps through CPU time freeing pages nonstop when it isn't needed.
On a constrained system I tested (mem=2G), this patch had the positive effect of
improving overall responsiveness at high memory pressure.

On systems with more memory I tested (>=4G), kswapd becomes more expensive to
run at its higher scan depths, so stopping kswapd prematurely when there aren't
any memory allocations waiting for it prevents it from reaching the *really*
expensive scan depths and burning through even more resources.
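
To illustrate what "scan depth" means here: the per-LRU scan target
grows exponentially as the reclaim priority drops. Roughly (a
simplified sketch of the logic in get_scan_count(), where DEF_PRIORITY
is 12):

/*
 * Simplified sketch: each failed balancing pass lowers the reclaim
 * priority by one, doubling the scan target, so the last few
 * priorities are where kswapd gets really expensive.
 */
static unsigned long scan_target(unsigned long lru_size, int priority)
{
        return lru_size >> priority;  /* priority 0 scans the whole LRU */
}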

Combine a large amount of memory with a slow CPU and the current problematic
behavior of kswapd at high memory pressure shows. My personal test scenario for
this was an arm64 CPU with a variable amount of memory (up to 4G RAM + 2G swap).

Sultan


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 19:40   ` Sultan Alsawaf
@ 2020-02-19 20:05     ` Michal Hocko
  2020-02-19 20:42       ` Sultan Alsawaf
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-02-19 20:05 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Dave Hansen, Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner

[Oops, for some reason I had missed Dave's response earlier]

On Wed 19-02-20 11:40:06, Sultan Alsawaf wrote:
> On Wed, Feb 19, 2020 at 11:13:21AM -0800, Dave Hansen wrote:
> > On 2/19/20 10:25 AM, Sultan Alsawaf wrote:
> > > Keeping kswapd running when all the failed allocations that invoked it
> > > are satisfied incurs a high overhead due to unnecessary page eviction
> > > and writeback, as well as spurious VM pressure events to various
> > > registered shrinkers. When kswapd doesn't need to work to make an
> > > allocation succeed anymore, stop it prematurely to save resources.
> > 
> > But kswapd isn't just to provide memory to waiters.  It also serves to
> > get free memory back up to the high watermark.  This seems like it might
> > result in more frequent allocation stalls and kswapd wakeups, which
> > consumes extra resources.

Agreed as expressed in my other reply

> > I guess I'd wonder what positive effects you have observed as a result
> > of this patch and whether you've gone looking for any negative effects.
> 
> This patch essentially stops kswapd from going overboard when a failed
> allocation fires up kswapd. Otherwise, when memory pressure is really high,
> kswapd just chomps through CPU time freeing pages nonstop when it isn't needed.

Could you be more specific please? kswapd should stop as soon as the
high watermark is reached. If that is not the case then there is a bug
which should be fixed.

Sure, it is quite possible that kswapd is busy for an extended amount of
time if the memory pressure is continuous.

> On a constrained system I tested (mem=2G), this patch had the positive effect of
> improving overall responsiveness at high memory pressure.

Again, do you have more details about the workload and what was the
cause of responsiveness issues? Because I would expect that the
situation would be quite opposite because it is usually the direct
reclaim that is a source of stalls visible from userspace. Or is this
about a single CPU situation where kswapd saturates the single CPU and
all other tasks are just not getting enough CPU cycles?
 
> On systems with more memory I tested (>=4G), kswapd becomes more expensive to
> run at its higher scan depths, so stopping kswapd prematurely when there aren't
> any memory allocations waiting for it prevents it from reaching the *really*
> expensive scan depths and burning through even more resources.
> 
> Combine a large amount of memory with a slow CPU and the current problematic
> behavior of kswapd at high memory pressure shows. My personal test scenario for
> this was an arm64 CPU with a variable amount of memory (up to 4G RAM + 2G swap).

But still, somebody has to put the system into balanced state so who is
going to do all the work?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 20:05     ` Michal Hocko
@ 2020-02-19 20:42       ` Sultan Alsawaf
  2020-02-19 21:45         ` Mel Gorman
  2020-02-20  8:29         ` Michal Hocko
  0 siblings, 2 replies; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-19 20:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner

On Wed, Feb 19, 2020 at 09:05:27PM +0100, Michal Hocko wrote:
> Could you be more specific please? kswapd should stop as soon as the
> high watermark is reached. If that is not the case then there is a bug
> which should be fixed.

No, there is no bug causing kswapd to continue beyond the high watermark.

> Sure, it is quite possible that kswapd is busy for an extended amount of
> time if the memory pressure is continuous.
> 
> > On a constrained system I tested (mem=2G), this patch had the positive effect of
> > improving overall responsiveness at high memory pressure.
> 
> Again, do you have more details about the workload and what was the
> cause of responsiveness issues? Because I would expect that the
> situation would be quite opposite because it is usually the direct
> reclaim that is a source of stalls visible from userspace. Or is this
> about a single CPU situation where kswapd saturates the single CPU and
> all other tasks are just not getting enough CPU cycles?

The workload was having lots of applications open at once. At a certain point
when memory ran low, my system became sluggish and kswapd CPU usage skyrocketed.
I added printks into kswapd with this patch, and my premature exit in kswapd
kicked in quite often.
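
Something along these lines next to the new bail-out check in
balance_pgdat() is enough to observe that (a hypothetical
reconstruction, not part of the posted patch):

/* Hypothetical debug instrumentation; not in the posted patch. */
if (!refcount_read(&pgdat->kswapd_waiters)) {
        pr_info("kswapd%d: no waiters left, stopping at priority %d\n",
                pgdat->node_id, sc.priority);
        break;
}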

> > On systems with more memory I tested (>=4G), kswapd becomes more expensive to
> > run at its higher scan depths, so stopping kswapd prematurely when there aren't
> > any memory allocations waiting for it prevents it from reaching the *really*
> > expensive scan depths and burning through even more resources.
> > 
> > Combine a large amount of memory with a slow CPU and the current problematic
> > behavior of kswapd at high memory pressure shows. My personal test scenario for
> > this was an arm64 CPU with a variable amount of memory (up to 4G RAM + 2G swap).
> 
> But still, somebody has to put the system into balanced state so who is
> going to do all the work?

All the work will be done by kswapd of course, but only if it's needed.

The real problem is that a single memory allocation failure, and free memory
being some amount below the high watermark, are not good heuristics to predict
*future* memory allocation needs. They are good for determining how to steer
kswapd to help satisfy a failed allocation in the present, but anything more is
pure speculation (which turns out to be wrong speculation, since this behavior
causes problems).

If there are outstanding failed allocations that won't go away, then it's
perfectly reasonable to keep kswapd running until it frees pages up to the high
watermark. But beyond that is unnecessary, since there's no way to know if or
when kswapd will need to fire up again. This makes sense considering how kswapd
is currently invoked: it's fired up due to a failed allocation of some sort, not
because the amount of free memory dropped below the high watermark.

Sultan


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 20:42       ` Sultan Alsawaf
@ 2020-02-19 21:45         ` Mel Gorman
  2020-02-19 22:42           ` Sultan Alsawaf
  2020-02-20  8:29         ` Michal Hocko
  1 sibling, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2020-02-19 21:45 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Michal Hocko, Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner

On Wed, Feb 19, 2020 at 12:42:20PM -0800, Sultan Alsawaf wrote:
> > Again, do you have more details about the workload and what was the
> > cause of responsiveness issues? Because I would expect that the
> > situation would be quite opposite because it is usually the direct
> > reclaim that is a source of stalls visible from userspace. Or is this
> > about a single CPU situation where kswapd saturates the single CPU and
> > all other tasks are just not getting enough CPU cycles?
> 
> The workload was having lots of applications open at once. At a certain point
> when memory ran low, my system became sluggish and kswapd CPU usage skyrocketed.
> I added printks into kswapd with this patch, and my premature exit in kswapd
> kicked in quite often.
> 

This could be watermark boosting run wild again. Can you test with
sysctl vm.watermark_boost_factor=0 or the following patch? (preferably
both to compare and contrast).

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17c6273..71dd47172cef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3462,6 +3462,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	return false;
 }
 
+static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
+				unsigned long *zone_boosts)
+{
+	struct zone *zone;
+	unsigned long flags;
+	int i;
+
+	for (i = 0; i <= classzone_idx; i++) {
+		if (!zone_boosts[i])
+			continue;
+
+		/* Increments are under the zone lock */
+		zone = pgdat->node_zones + i;
+		spin_lock_irqsave(&zone->lock, flags);
+		zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+}
+
 /* Clear pgdat state for congested, dirty or under writeback. */
 static void clear_pgdat_congested(pg_data_t *pgdat)
 {
@@ -3654,9 +3673,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!nr_boost_reclaim && balanced)
 			goto out;
 
-		/* Limit the priority of boosting to avoid reclaim writeback */
-		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
-			raise_priority = false;
+		/*
+		 * Abort boosting if reclaiming at higher priority is not
+		 * working to avoid excessive reclaim due to lower zones
+		 * being boosted.
+		 */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
+			acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
+			boosted = false;
+			nr_boost_reclaim = 0;
+			goto restart;
+		}
 
 		/*
 		 * Do not writeback or swap pages for boosted reclaim. The
@@ -3738,18 +3765,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 out:
 	/* If reclaim was boosted, account for the reclaim done in this pass */
 	if (boosted) {
-		unsigned long flags;
-
-		for (i = 0; i <= classzone_idx; i++) {
-			if (!zone_boosts[i])
-				continue;
-
-			/* Increments are under the zone lock */
-			zone = pgdat->node_zones + i;
-			spin_lock_irqsave(&zone->lock, flags);
-			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
-			spin_unlock_irqrestore(&zone->lock, flags);
-		}
+		acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
 
 		/*
 		 * As there is now likely space, wakeup kcompact to defragment


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 21:45         ` Mel Gorman
@ 2020-02-19 22:42           ` Sultan Alsawaf
  2020-02-20 10:19             ` Mel Gorman
  0 siblings, 1 reply; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-19 22:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner

On Wed, Feb 19, 2020 at 09:45:13PM +0000, Mel Gorman wrote:
> This could be watermark boosting run wild again. Can you test with
> sysctl vm.watermark_boost_factor=0 or the following patch? (preferably
> both to compare and contrast).

I can test that, but something to note is that I've been doing equal testing
with this on 4.4, which exhibits the same behavior, and that kernel doesn't have
watermark boosting in it, as far as I can tell.

I don't think what we're addressing here is a "bug", but rather something
fundamental about how we've been thinking about kswapd lifetime. The argument
here is that it's not coherent to be letting kswapd run as it does, and instead
gating it on outstanding allocation requests provides much more reasonable
behavior, given real workloads and use patterns.

Does that make sense and seem reasonable?

Sultan


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 19:26 ` Andrew Morton
@ 2020-02-19 22:45   ` Sultan Alsawaf
  0 siblings, 0 replies; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-19 22:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Wed, Feb 19, 2020 at 11:26:53AM -0800, Andrew Morton wrote:
> On Wed, 19 Feb 2020 10:25:22 -0800 Sultan Alsawaf <sultan@kerneltoast.com> wrote:
> 
> > Keeping kswapd running when all the failed allocations that invoked it
> > are satisfied incurs a high overhead due to unnecessary page eviction
> > and writeback, as well as spurious VM pressure events to various
> > registered shrinkers. When kswapd doesn't need to work to make an
> > allocation succeed anymore, stop it prematurely to save resources.
> 
> Seems sensible.
> 
> Please fully describe the userspace-visible runtime effects of this
> change?
> 

FWIW, it looks like the refcount API doesn't allow refcounts to be zero, so I'd
have to update this patch to use just plain atomic_t instead. But since there
doesn't seem to be much interest in this, I'll only send an updated patch if
it's desired.
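
The underlying issue, sketched: refcount_t treats zero as "object
freed", so an increment from zero saturates and warns rather than going
0 -> 1 (see the REFCOUNT_ADD_UAF case in lib/refcount.c):

refcount_t waiters = REFCOUNT_INIT(0);

refcount_inc(&waiters); /* WARNs: "addition on 0; use-after-free" */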

Sultan


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 20:42       ` Sultan Alsawaf
  2020-02-19 21:45         ` Mel Gorman
@ 2020-02-20  8:29         ` Michal Hocko
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2020-02-20  8:29 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Dave Hansen, Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner

On Wed 19-02-20 12:42:20, Sultan Alsawaf wrote:
> On Wed, Feb 19, 2020 at 09:05:27PM +0100, Michal Hocko wrote:
[...]
> > Again, do you have more details about the workload and what was the
> > cause of responsiveness issues? Because I would expect that the
> > situation would be quite opposite because it is usually the direct
> > reclaim that is a source of stalls visible from userspace. Or is this
> > about a single CPU situation where kswapd saturates the single CPU and
> > all other tasks are just not getting enough CPU cycles?
> 
> The workload was having lots of applications open at once. At a certain point
> when memory ran low, my system became sluggish and kswapd CPU usage skyrocketed.

Could you provide more details please? Is kswapd making forward
progress? Have you checked why other processes are sluggish? Do they
not get CPU time, or are they blocked on something?

> I added printks into kswapd with this patch, and my premature exit in kswapd
> kicked in quite often.
> 
> > > On systems with more memory I tested (>=4G), kswapd becomes more expensive to
> > > run at its higher scan depths, so stopping kswapd prematurely when there aren't
> > > any memory allocations waiting for it prevents it from reaching the *really*
> > > expensive scan depths and burning through even more resources.
> > > 
> > > Combine a large amount of memory with a slow CPU and the current problematic
> > > behavior of kswapd at high memory pressure shows. My personal test scenario for
> > > this was an arm64 CPU with a variable amount of memory (up to 4G RAM + 2G swap).
> > 
> > But still, somebody has to put the system into balanced state so who is
> > going to do all the work?
> 
> All the work will be done by kswapd of course, but only if it's needed.
> 
> The real problem is that a single memory allocation failure, and free memory
> being some amount below the high watermark, are not good heuristics to predict
> *future* memory allocation needs. They are good for determining how to steer
> kswapd to help satisfy a failed allocation in the present, but anything more is
> pure speculation (which turns out to be wrong speculation, since this behavior
> causes problems).

Well, you might be right that there might be better heuristics than the
existing watermark based one. After all nobody can predict the future.
The existing heuristic aims at providing min_free_kbytes of free memory
as much as possible and that tends to work reasonably well for a large
set of workloads.
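
For reference, those watermarks are derived from min_free_kbytes
roughly like this (a simplified sketch of __setup_per_zone_wmarks(); it
is consistent with the zoneinfo posted later in this thread, where
DMA32's min 11090, low 13862 and high 16634 differ by the same
tmp = min/4 = 2772 pages):

static void setup_wmarks_sketch(struct zone *zone, unsigned long min)
{
        unsigned long tmp = max(min >> 2,
                zone_managed_pages(zone) * watermark_scale_factor / 10000);

        zone->_watermark[WMARK_MIN]  = min;
        zone->_watermark[WMARK_LOW]  = min + tmp;
        zone->_watermark[WMARK_HIGH] = min + 2 * tmp;
}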

> If there are outstanding failed allocations that won't go away, then it's
> perfectly reasonable to keep kswapd running until it frees pages up to the high
> watermark. But beyond that is unnecessary, since there's no way to know if or
> when kswapd will need to fire up again. This makes sense considering how kswapd
> is currently invoked: it's fired up due to a failed allocation of some sort, not
> because the amount of free memory dropped below the high watermark.

Very broadly speaking (sorry if I am stating the obvious here), kswapd
is woken up when the allocator hits the low watermark or the requested
high-order pages are depleted. Then the allocator enters its slow path.
That means the background reclaim aims at reclaiming the high-low
watermark gap, or invokes compaction to keep the balance. That whole
gap has to be consumed again before kswapd is woken for order-0 (the
most common) requests. So this is usually not about a single allocation
triggering the background reclaim, and counting failures on low
watermark attempts is unlikely to work with the current code as you
suggested.
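
In code terms, the gating described above is roughly this (a simplified
sketch; the helper names are made up, but zone_page_state(),
low_wmark_pages() and high_wmark_pages() are the real accessors):

/* Falling below the low watermark in the fast path wakes kswapd. */
static bool need_kswapd(struct zone *zone)
{
        return zone_page_state(zone, NR_FREE_PAGES) < low_wmark_pages(zone);
}

/* kswapd keeps reclaiming until the high watermark is restored. */
static bool kswapd_done(struct zone *zone)
{
        return zone_page_state(zone, NR_FREE_PAGES) >= high_wmark_pages(zone);
}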
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 22:42           ` Sultan Alsawaf
@ 2020-02-20 10:19             ` Mel Gorman
  2020-02-21  4:22               ` Sultan Alsawaf
  0 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2020-02-20 10:19 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Michal Hocko, Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner

On Wed, Feb 19, 2020 at 02:42:31PM -0800, Sultan Alsawaf wrote:
> On Wed, Feb 19, 2020 at 09:45:13PM +0000, Mel Gorman wrote:
> > This could be watermark boosting run wild again. Can you test with
> > sysctl vm.watermark_boost_factor=0 or the following patch? (preferably
> > both to compare and contrast).
> 
> I can test that, but something to note is that I've been doing equal testing
> with this on 4.4, which exhibits the same behavior, and that kernel doesn't have
> watermark boosting in it, as far as I can tell.
> 
> I don't think what we're addressing here is a "bug", but rather something
> fundamental about how we've been thinking about kswapd lifetime. The argument
> here is that it's not coherent to be letting kswapd run as it does, and instead
> gating it on outstanding allocation requests provides much more reasonable
> behavior, given real workloads and use patterns.
> 
> Does that make sense and seem reasonable?
> 

I'm not entirely convinced. The reason the high watermark exists is to have
kswapd work long enough to make progress without a process having to direct
reclaim. The most straight-forward example would be a streaming reader of
a large file. It'll keep pushing the zone towards the low watermark and
kswapd has to keep ahead of the reader. If we cut kswapd off too quickly,
the min watermark is hit and stalls occur. While kswapd could stop at the
min watermark, it leaves a very short window for kswapd to make enough
progress before the min watermark is hit.
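
A trivial userspace reproducer for that pattern, if anyone wants to
watch /proc/vmstat while it runs (a hypothetical example; any file
larger than RAM will do):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sequentially fault in a file larger than RAM: the page cache keeps
 * growing, free memory keeps sinking toward the low watermark, and
 * kswapd has to reclaim ahead of the reader to avoid direct reclaim.
 */
int main(int argc, char **argv)
{
        static char buf[1 << 20];
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <large-file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        while (read(fd, buf, sizeof(buf)) > 0)
                ;
        close(fd);
        return 0;
}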

At minimum, any change in this area would need to include the
/proc/vmstat counters for allocstall and pg*direct* to ensure that
direct reclaim stalls are not worse.

I'm not a fan of the patch in question because kswapd can be woken between
the low and min watermark without stalling but we really do expect kswapd
to make progress and continue to make progress to avoid future stalls. The
changelog had no information on the before/after impact of the patch and
this is an area where intuition can disagree with real behaviour.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-20 10:19             ` Mel Gorman
@ 2020-02-21  4:22               ` Sultan Alsawaf
  2020-02-21  8:07                 ` Michal Hocko
  2020-02-21 18:04                 ` Shakeel Butt
  0 siblings, 2 replies; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-21  4:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner

On Thu, Feb 20, 2020 at 10:19:45AM +0000, Mel Gorman wrote:
> I'm not entirely convinced. The reason the high watermark exists is to have
> kswapd work long enough to make progress without a process having to direct
> reclaim. The most straight-forward example would be a streaming reader of
> a large file. It'll keep pushing the zone towards the low watermark and
> kswapd has to keep ahead of the reader. If we cut kswapd off too quickly,
> the min watermark is hit and stalls occur. While kswapd could stop at the
> min watermark, it leaves a very short window for kswapd to make enough
> progress before the min watermark is hit.
> 
> > At minimum, any change in this area would need to include the
> > /proc/vmstat counters for allocstall and pg*direct* to ensure that
> > direct reclaim stalls are not worse.
> 
> I'm not a fan of the patch in question because kswapd can be woken between
> the low and min watermark without stalling but we really do expect kswapd
> to make progress and continue to make progress to avoid future stalls. The
> changelog had no information on the before/after impact of the patch and
> this is an area where intuition can disagree with real behaviour.
> 
> -- 
> Mel Gorman
> SUSE Labs

Okay, then let's test real behavior.

I fired up my i5-8265U laptop with vanilla linux-next and passed mem=2G on the
command line. After boot up, I opened up chromium and played a video on YouTube.
Immediately after the video started, my laptop completely froze for a few
seconds; then, a few seconds later, my cursor began to respond again, but moving
it around was very laggy. The audio from the video playing was choppy during
this time. About 15-20 seconds after I had started the YouTube video, my system
finally stopped lagging.

Then I tried again with my patch applied (albeit a correct version that doesn't
use the refcount API). Upon starting the same YouTube video in chromium, my
laptop didn't freeze or stutter at all. The cursor was responsive and there was
no stuttering, or choppy audio.

I tested this multiple times with reproducible results each time.

I will attach a functional v2 of the patch that I used.

Sultan


* [PATCH v2] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-19 18:25 [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages Sultan Alsawaf
                   ` (2 preceding siblings ...)
  2020-02-19 19:35 ` Michal Hocko
@ 2020-02-21  4:30 ` Sultan Alsawaf
       [not found]   ` <20200221182201.GB4462@iweiny-DESK2.sc.intel.com>
  3 siblings, 1 reply; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-21  4:30 UTC (permalink / raw)
  Cc: Sultan Alsawaf, Andrew Morton, linux-mm, linux-kernel

From: Sultan Alsawaf <sultan@kerneltoast.com>

Keeping kswapd running when all the failed allocations that invoked it
are satisfied incurs a high overhead due to unnecessary page eviction
and writeback, as well as spurious VM pressure events to various
registered shrinkers. When kswapd doesn't need to work to make an
allocation succeed anymore, stop it prematurely to save resources.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 17 ++++++++++++++---
 mm/vmscan.c            |  3 ++-
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..23861cdaab7f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -735,6 +735,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
+	atomic_t kswapd_waiters;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;	/* Protected by
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..923b994c38c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4401,6 +4401,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
+	pg_data_t *pgdat = ac->preferred_zoneref->zone->zone_pgdat;
+	bool woke_kswapd = false;
 
 	/*
 	 * We also sanity check to catch abuse of atomic reserves being used by
@@ -4434,8 +4436,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
-	if (alloc_flags & ALLOC_KSWAPD)
+	if (alloc_flags & ALLOC_KSWAPD) {
+		if (!woke_kswapd) {
+			atomic_inc(&pgdat->kswapd_waiters);
+			woke_kswapd = true;
+		}
 		wake_all_kswapds(order, gfp_mask, ac);
+	}
 
 	/*
 	 * The adjusted alloc_flags might result in immediate success, so try
@@ -4640,9 +4647,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 	}
 fail:
-	warn_alloc(gfp_mask, ac->nodemask,
-			"page allocation failure: order:%u", order);
 got_pg:
+	if (woke_kswapd)
+		atomic_dec(&pgdat->kswapd_waiters);
+	if (!page)
+		warn_alloc(gfp_mask, ac->nodemask,
+				"page allocation failure: order:%u", order);
 	return page;
 }
 
@@ -6711,6 +6721,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
+	pgdat->kswapd_waiters = (atomic_t)ATOMIC_INIT(0);
 }
 
 static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05eb9efec07..59d9f3dd14f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3694,7 +3694,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		__fs_reclaim_release();
 		ret = try_to_freeze();
 		__fs_reclaim_acquire();
-		if (ret || kthread_should_stop())
+		if (ret || kthread_should_stop() ||
+		    !atomic_read(&pgdat->kswapd_waiters))
 			break;
 
 		/*
-- 
2.25.1



* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-21  4:22               ` Sultan Alsawaf
@ 2020-02-21  8:07                 ` Michal Hocko
       [not found]                   ` <20200221210824.GA3605@sultan-book.localdomain>
  2020-02-21 18:04                 ` Shakeel Butt
  1 sibling, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-02-21  8:07 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Mel Gorman, Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner

On Thu 20-02-20 20:22:32, Sultan Alsawaf wrote:
> On Thu, Feb 20, 2020 at 10:19:45AM +0000, Mel Gorman wrote:
> > I'm not entirely convinced. The reason the high watermark exists is to have
> > kswapd work long enough to make progress without a process having to direct
> > reclaim. The most straight-forward example would be a streaming reader of
> > a large file. It'll keep pushing the zone towards the low watermark and
> > kswapd has to keep ahead of the reader. If we cut kswapd off too quickly,
> > the min watermark is hit and stalls occur. While kswapd could stop at the
> > min watermark, it leaves a very short window for kswapd to make enough
> > progress before the min watermark is hit.
> > 
> > At minimum, any change in this area would need to include the
> > /proc/vmstat counters for allocstall and pg*direct* to ensure that
> > direct reclaim stalls are not worse.
> > 
> > I'm not a fan of the patch in question because kswapd can be woken between
> > the low and min watermark without stalling but we really do expect kswapd
> > to make progress and continue to make progress to avoid future stalls. The
> > changelog had no information on the before/after impact of the patch and
> > this is an area where intuition can disagree with real behaviour.
> > 
> > -- 
> > Mel Gorman
> > SUSE Labs
> 
> Okay, then let's test real behavior.
> 
> I fired up my i5-8265U laptop with vanilla linux-next and passed mem=2G on the
> command line. After boot up, I opened up chromium and played a video on YouTube.
> Immediately after the video started, my laptop completely froze for a few
> seconds; then, a few seconds later, my cursor began to respond again, but moving
> it around was very laggy. The audio from the video playing was choppy during
> this time. About 15-20 seconds after I had started the YouTube video, my system
> finally stopped lagging.

Could you provide regular (e.g. each second) snapshots of /proc/vmstat,
ideally started before and finished after the observed behavior?
Something like
while true
do
	cp /proc/vmstat vmstat.$(date +%s)
	sleep 1
done

If you can perf record and see where the kernel spends time during that
time period then it would be really helpful as well.

> Then I tried again with my patch applied (albeit a correct version that doesn't
> use the refcount API). Upon starting the same YouTube video in chromium, my
> laptop didn't freeze or stutter at all. The cursor was responsive and there was
> no stuttering, or choppy audio.
> 
> I tested this multiple times with reproducible results each time.

Your patch might be simply papering over a real problem.

> I will attach a functional v2 of the patch that I used.
> 
> Sultan

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-21  4:22               ` Sultan Alsawaf
  2020-02-21  8:07                 ` Michal Hocko
@ 2020-02-21 18:04                 ` Shakeel Butt
  2020-02-21 20:06                   ` Sultan Alsawaf
  1 sibling, 1 reply; 27+ messages in thread
From: Shakeel Butt @ 2020-02-21 18:04 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Mel Gorman, Michal Hocko, Dave Hansen, Andrew Morton, Linux MM,
	LKML, Johannes Weiner

On Thu, Feb 20, 2020 at 8:22 PM Sultan Alsawaf <sultan@kerneltoast.com> wrote:
>
> On Thu, Feb 20, 2020 at 10:19:45AM +0000, Mel Gorman wrote:
> > I'm not entirely convinced. The reason the high watermark exists is to have
> > kswapd work long enough to make progress without a process having to direct
> > reclaim. The most straight-forward example would be a streaming reader of
> > a large file. It'll keep pushing the zone towards the low watermark and
> > kswapd has to keep ahead of the reader. If we cut kswapd off too quickly,
> > the min watermark is hit and stalls occur. While kswapd could stop at the
> > min watermark, it leaves a very short window for kswapd to make enough
> > progress before the min watermark is hit.
> >
> > At minimum, any change in this area would need to include the
> > /proc/vmstat counters for allocstall and pg*direct* to ensure that
> > direct reclaim stalls are not worse.
> >
> > I'm not a fan of the patch in question because kswapd can be woken between
> > the low and min watermark without stalling but we really do expect kswapd
> > to make progress and continue to make progress to avoid future stalls. The
> > changelog had no information on the before/after impact of the patch and
> > this is an area where intuition can disagree with real behaviour.
> >
> > --
> > Mel Gorman
> > SUSE Labs
>
> Okay, then let's test real behavior.
>
> I fired up my i5-8265U laptop with vanilla linux-next and passed mem=2G on the
> command line. After boot up, I opened up chromium and played a video on YouTube.
> Immediately after the video started, my laptop completely froze for a few
> seconds; then, a few seconds later, my cursor began to respond again, but moving
> it around was very laggy. The audio from the video playing was choppy during
> this time. About 15-20 seconds after I had started the YouTube video, my system
> finally stopped lagging.
>
> Then I tried again with my patch applied (albeit a correct version that doesn't
> use the refcount API). Upon starting the same YouTube video in chromium, my
> laptop didn't freeze or stutter at all. The cursor was responsive and there was
> no stuttering, or choppy audio.
>
> I tested this multiple times with reproducible results each time.
>
> I will attach a functional v2 of the patch that I used.
>

Can you also attach the /proc/zoneinfo? Maybe the watermarks are too high.


* Re: [PATCH v2] mm: Stop kswapd early when nothing's waiting for it to free pages
       [not found]   ` <20200221182201.GB4462@iweiny-DESK2.sc.intel.com>
@ 2020-02-21 20:00     ` Sultan Alsawaf
  0 siblings, 0 replies; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-21 20:00 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Andrew Morton, linux-mm, linux-kernel

On Fri, Feb 21, 2020 at 10:22:02AM -0800, Ira Weiny wrote:
> On Thu, Feb 20, 2020 at 08:30:52PM -0800, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf <sultan@kerneltoast.com>
>  
> [snip]
> 
> > @@ -4640,9 +4647,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		goto retry;
> >  	}
> >  fail:
> > -	warn_alloc(gfp_mask, ac->nodemask,
> > -			"page allocation failure: order:%u", order);
> >  got_pg:
> 
> I have no insight into whether this is masking a deeper problem or if this
> fixes something, but doesn't the above result in 'fail' and 'got_pg' being
> the same label?
> 
> Ira
> 
> > +	if (woke_kswapd)
> > +		atomic_dec(&pgdat->kswapd_waiters);
> > +	if (!page)
> > +		warn_alloc(gfp_mask, ac->nodemask,
> > +				"page allocation failure: order:%u", order);
> >  	return page;
> >  }
>   
> [snip]

Yes. This was to reduce the patch delta for the initial submission so it's
clearer what's going on; it can of course be altered to coalesce the labels
into a single one. I typically write my patches to upstream components to be
as unintrusive as possible, to aid in backporting and forward porting in case
a patch is rejected and I want to keep it for myself.
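
For reference, the coalesced form would look roughly like this (a
sketch only; same behavior, with the jump to "fail" retargeted at
"got_pg"):

got_pg:
	if (woke_kswapd)
		atomic_dec(&pgdat->kswapd_waiters);
	if (!page)
		warn_alloc(gfp_mask, ac->nodemask,
				"page allocation failure: order:%u", order);
	return page;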

Sultan


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-21 18:04                 ` Shakeel Butt
@ 2020-02-21 20:06                   ` Sultan Alsawaf
  0 siblings, 0 replies; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-21 20:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Mel Gorman, Michal Hocko, Dave Hansen, Andrew Morton, Linux MM,
	LKML, Johannes Weiner

On Fri, Feb 21, 2020 at 10:04:38AM -0800, Shakeel Butt wrote:
> Can you also attach the /proc/zoneinfo? Maybe the watermarks are too high.

Here it is (when booted with mem=2G).

Sultan

Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 236
      nr_active_anon 47914
      nr_inactive_file 44716
      nr_active_file 22397
      nr_unevictable 11665
      nr_slab_reclaimable 11217
      nr_slab_unreclaimable 16465
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 0
      workingset_refault 0
      workingset_activate 0
      workingset_restore 0
      workingset_nodereclaim 0
      nr_anon_pages 47716
      nr_mapped    34753
      nr_file_pages 79237
      nr_dirty     66
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     12120
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   3143
      nr_written   3073
      nr_kernel_misc_reclaimable 0
  pages free     3975
        min      173
        low      216
        high     259
        spanned  4095
        present  3998
        managed  3975
        protection: (0, 992, 992, 992, 992)
      nr_free_pages 3975
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_page_table_pages 0
      nr_kernel_stack 0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     0
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   0
      numa_other   0
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 2
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 3
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 4
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 5
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 6
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 7
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone    DMA32
  pages free     87925
        min      11090
        low      13862
        high     16634
        spanned  294144
        present  270018
        managed  257874
        protection: (0, 0, 0, 0, 0)
      nr_free_pages 87925
      nr_zone_inactive_anon 236
      nr_zone_active_anon 47914
      nr_zone_inactive_file 44716
      nr_zone_active_file 22397
      nr_zone_unevictable 11665
      nr_zone_write_pending 66
      nr_mlock     16
      nr_page_table_pages 1284
      nr_kernel_stack 5024
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     544686
      numa_miss    0
      numa_foreign 0
      numa_interleave 8149
      numa_local   544686
      numa_other   0
  pagesets
    cpu: 0
              count: 358
              high:  378
              batch: 63
  vm stats threshold: 32
    cpu: 1
              count: 335
              high:  378
              batch: 63
  vm stats threshold: 32
    cpu: 2
              count: 308
              high:  378
              batch: 63
  vm stats threshold: 32
    cpu: 3
              count: 337
              high:  378
              batch: 63
  vm stats threshold: 32
    cpu: 4
              count: 326
              high:  378
              batch: 63
  vm stats threshold: 32
    cpu: 5
              count: 315
              high:  378
              batch: 63
  vm stats threshold: 32
    cpu: 6
              count: 370
              high:  378
              batch: 63
  vm stats threshold: 32
    cpu: 7
              count: 292
              high:  378
              batch: 63
  vm stats threshold: 32
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0, 0)
Node 0, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0, 0)
Node 0, zone   Device
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0, 0)
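
(For scale, with 4KiB pages the DMA32 watermarks above work out to
roughly: min ~43MiB, low ~54MiB, high ~65MiB, out of ~1007MiB managed.)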


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
       [not found]                   ` <20200221210824.GA3605@sultan-book.localdomain>
@ 2020-02-21 21:24                     ` Dave Hansen
  2020-02-25  9:09                     ` Michal Hocko
  1 sibling, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2020-02-21 21:24 UTC (permalink / raw)
  To: Sultan Alsawaf, Michal Hocko
  Cc: Mel Gorman, Andrew Morton, linux-mm, linux-kernel, Johannes Weiner

On 2/21/20 1:08 PM, Sultan Alsawaf wrote:
> Both of these logs are attached in a tarball.

Could you post your vmlinux somewhere?  It's hard to do much with
perf.data without it.  The other option is to do a 'perf report' on your
system and send the resulting text output.


* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
       [not found]                   ` <20200221210824.GA3605@sultan-book.localdomain>
  2020-02-21 21:24                     ` Dave Hansen
@ 2020-02-25  9:09                     ` Michal Hocko
  2020-02-25 17:12                       ` Sultan Alsawaf
                                         ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread
From: Michal Hocko @ 2020-02-25  9:09 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Mel Gorman, Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner

On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
[...]
> Both of these logs are attached in a tarball.

Thanks! First of all
$ grep pswp vmstat.1582318979
pswpin 0
pswpout 0

suggests that you do not have any swap storage, right? I will get back
to this later. Now, let's have a look at snapshots. We have regular 1s
snapshots initially, but then we have
vmstat.1582318734
vmstat.1582318736
vmstat.1582318758
vmstat.1582318763
vmstat.1582318768
[...]
vmstat.1582318965
vmstat.1582318975
vmstat.1582318976

That is a 242s time period during which even a simple bash script was
struggling to write a snapshot of /proc/vmstat, which by itself
shouldn't really depend on the system activity much. Let's have a look
at two randomly chosen consecutive snapshots from this time period:

		vmstat.1582318736	vmstat.1582318758
			base		diff
allocstall_dma  	0       	0
allocstall_dma32        0       	0
allocstall_movable      5773    	0
allocstall_normal       906     	0

To my surprise there was no invocation of direct reclaim in this time
period. I would have expected it, considering the long stall. But the
source of the stall might be something other than the DR.

compact_stall   	13      	1

Direct compaction has been invoked but this shouldn't cause a major
stall for all processes.

nr_active_anon  	133932  	236
nr_inactive_anon        9350    	-1179
nr_active_file  	318     	190
nr_inactive_file        673     	56
nr_unevictable  	51984   	0

The amount of anonymous memory is not really high (~560MB) but the file
LRU is _really_ low (~3MB), and the unevictable list is at ~200MB. That
gets us to ~760MB, which is 74% of the memory. Please note that your mem=2G
setup gives you only 1G of memory in fact (based on the zone_info you
have posted). That is not something unusual but the amount of the page
cache is worrying because I would expect heavy thrashing because most
of the executables are going to require major faults. Anonymous memory
is not swapped out obviously so there is no other option than to refault
constantly.

pgscan_kswapd   	64788716        14157035
pgsteal_kswapd  	29378868        4393216
pswpin  		0       	0
pswpout 		0       	0
workingset_activate     3840226 	169674
workingset_refault      29396942        4393013
workingset_restore      2883042 	106358

And here we can see it clearly happening. Note how the working set
refaults match the amount of memory reclaimed by kswapd.
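
(Over this interval that is 4,393,216 pages stolen by kswapd versus
4,393,013 refaults, i.e. practically every page kswapd freed was
faulted straight back in.)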

I would be really curious whether adding swap space would help some.

Now to your patch and why it helps here. It seems quite obvious that the
only effectively reclaimable memory (page cache) is not going to satisfy
the high watermark target:
Node 0, zone    DMA32
  pages free     87925
        min      11090
        low      13862
        high     16634

kswapd has some feedback mechanism to back off when the zone is hopeless
from the reclaim point of view AFAIR, but it seems it has failed in this
particular situation. It should have relied on the direct reclaim and
eventually triggered the OOM killer. Your patch has worked around this by
bailing out of the kswapd reclaim too early, so the part of the page
cache required for the code to make progress stays resident and things
move forward.

The proper fix should, however, check the amount of reclaimable pages
and back off if they cannot meet the target IMO. We cannot rely on the
general reclaimability here because that could really be thrashing.
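
For illustration, a minimal sketch of that back-off against a v5.5-era
mm/vmscan.c (the helper name and its placement are assumptions for the
sake of the sketch, not a tested patch):

static bool zone_reclaim_hopeless(struct zone *zone)
{
	/*
	 * If reclaiming every reclaimable page in this zone still could
	 * not reach the high watermark, further kswapd reclaim can only
	 * thrash the remaining page cache.
	 */
	return zone_reclaimable_pages(zone) < high_wmark_pages(zone);
}

pgdat_balanced() (or balance_pgdat()) could then treat such a zone as
balanced, letting kswapd sleep and leaving further progress to direct
reclaim and, eventually, the OOM killer.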

Thoughts?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-25  9:09                     ` Michal Hocko
@ 2020-02-25 17:12                       ` Sultan Alsawaf
  2020-02-26  9:05                         ` Michal Hocko
  2020-02-25 22:30                       ` Shakeel Butt
       [not found]                       ` <20200226105137.9088-1-hdanton@sina.com>
  2 siblings, 1 reply; 27+ messages in thread
From: Sultan Alsawaf @ 2020-02-25 17:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner

On Tue, Feb 25, 2020 at 10:09:45AM +0100, Michal Hocko wrote:
> On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
> [...]
> > Both of these logs are attached in a tarball.
> 
> Thanks! First of all
> $ grep pswp vmstat.1582318979
> pswpin 0
> pswpout 0
> 
> suggests that you do not have any swap storage, right?

Correct. I'm not using any swap (and it should not be necessary for Linux mm
to work, of course). If I were to divide my RAM in half and use one half as
swap, do you think the results would be different? IMO they shouldn't be.

> The amount of anonymous memory is not really high (~560MB), but the file LRU
> is _really_ low (~3MB) and the unevictable list is at ~200MB. That gets us to
> ~760MB of memory, which is 74% of the total. Please note that your mem=2G
> setup in fact gives you only 1G of memory (based on the zone_info you
> have posted). That is not unusual, but the amount of page cache is
> worrying because I would expect heavy thrashing: most of the
> executables are going to require major faults. Anonymous memory is
> obviously not swapped out, so there is no option but to refault
> constantly.

I noticed that only 1G was available as well. Perhaps direct reclaim wasn't
attempted due to the zone_reclaimable_pages() check, though I don't think direct
reclaim would've been particularly helpful in this case (see below).

> kswapd has some feedback mechanism to back off when the zone is hopeless
> from the reclaim point of view AFAIR, but it seems to have failed in this
> particular situation. It should have deferred to direct reclaim, which
> would eventually trigger the OOM killer. Your patch has worked around
> this by bailing out of the kswapd reclaim early, so the part of the page
> cache required for the code to make progress stays resident and things
> can move forward.
> 
> The proper fix should, however, check the amount of reclaimable pages
> and back off if they cannot meet the target IMO. We cannot rely on the
> general reclaimability here because that could really be thrashing.

Yes, my guess was that thrashing out pages used by the running programs was the
cause for my freezes, but I didn't think of making kswapd back off a different
way.

Right now I don't see any such back-off mechanism in kswapd. Also, if we add
this into kswapd, we would need to plug it into the direct reclaim path as well,
no? I don't think direct reclaim would help with the situation I've run into;
although it wouldn't be as bad as letting kswapd evict pages to the high
watermark, it would still cause page thrashing that would just be capped at the
number of pages a direct reclaimer is looking to steal.

Considering that my patch remedies this issue for me without invoking the OOM
killer, a proper solution should produce the same or better results. I don't
think the OOM killer should have been triggered in this case.

Sultan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-25  9:09                     ` Michal Hocko
  2020-02-25 17:12                       ` Sultan Alsawaf
@ 2020-02-25 22:30                       ` Shakeel Butt
  2020-02-26  9:08                         ` Michal Hocko
       [not found]                       ` <20200226105137.9088-1-hdanton@sina.com>
  2 siblings, 1 reply; 27+ messages in thread
From: Shakeel Butt @ 2020-02-25 22:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sultan Alsawaf, Mel Gorman, Dave Hansen, Andrew Morton, Linux MM,
	LKML, Johannes Weiner

On Tue, Feb 25, 2020 at 1:10 AM Michal Hocko <mhocko@kernel.org> wrote:
>
[snip]
>
> The proper fix should, however, check the amount of reclaimable pages
> and back off if they cannot meet the target IMO. We cannot rely on the
> general reclaimability here because that could really be thrashing.
>

"check the amount of reclaimable pages" vs "cannot rely on the general
reclaimability"? Can you clarify?

BTW we are seeing a similar situation in our production environment.
We have swappiness=0, no swap from kswapd (because we don't swap out on
pressure, only on cold age) and too few file pages; kswapd goes crazy
in shrink_slab and spends 100% CPU on it.

Shakeel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-25 17:12                       ` Sultan Alsawaf
@ 2020-02-26  9:05                         ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2020-02-26  9:05 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Mel Gorman, Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner

On Tue 25-02-20 09:12:42, Sultan Alsawaf wrote:
> On Tue, Feb 25, 2020 at 10:09:45AM +0100, Michal Hocko wrote:
> > On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
> > [...]
> > > Both of these logs are attached in a tarball.
> > 
> > Thanks! First of all
> > $ grep pswp vmstat.1582318979
> > pswpin 0
> > pswpout 0
> > 
> > suggests that you do not have any swap storage, right?
> 
> Correct. I'm not using any swap (and it should not be necessary for Linux mm
> to work, of course).

In your particular situation you are simply overcommitting memory way
too much, AFAICS. Look at the vmstat data:
vmstat.1582318758
nr_free_pages 16292
nr_page_table_pages 2822
nr_inactive_anon 8171
nr_active_anon 134168
nr_inactive_file 729
nr_active_file 508
nr_unevictable 51984
nr_slab_reclaimable 10919
nr_slab_unreclaimable 19766

This is roughly the memory we have clear accounting for (well, except
for hugetlb pages, but you do not seem to be using those). It amounts
to 82% of the memory; the rest is used by various kernel subsystems.
Of that, only ~47MB is reclaimable without invoking the OOM killer
(nr_inactive_file 729 + nr_active_file 508 + nr_slab_reclaimable 10919
= 12156 pages, i.e. ~47MB, and the slab part might be quite hard to
reclaim effectively). I can imagine that working only if the page cache
footprint were negligible, but that doesn't seem to be the case here.
If you add swap storage the math is completely different, so yes, swap
is likely going to make a difference here.
If you add swap storage then the math is completely different so yes the
swap is likely going to make a difference here.

> If I were to divide my RAM in half and use one half as swap,
> do you think the results would be different? IMO they shouldn't be.

That really depends on what the swap-backed memory footprint is.

> > The amount of anonymous memory is not really high (~560MB), but the file LRU
> > is _really_ low (~3MB) and the unevictable list is at ~200MB. That gets us to
> > ~760MB of memory, which is 74% of the total. Please note that your mem=2G
> > setup in fact gives you only 1G of memory (based on the zone_info you
> > have posted). That is not unusual, but the amount of page cache is
> > worrying because I would expect heavy thrashing: most of the
> > executables are going to require major faults. Anonymous memory is
> > obviously not swapped out, so there is no option but to refault
> > constantly.
> 
> I noticed that only 1G was available as well.

This is because of your physical memory layout. mem=2G doesn't restrict
the amount of memory to 2G but rather caps the top physical address at
2G. If there are holes in the memory layout, you are likely to get much
less memory.

> Perhaps direct reclaim wasn't
> attempted due to the zone_reclaimable_pages() check, though I don't think direct
> reclaim would've been particularly helpful in this case (see below).

There are reclaimable pages, so zone_reclaimable_pages() shouldn't
really stop direct reclaim for the Normal zone.

> > kswapd has some feedback mechanism to back off when the zone is hopeless
> > from the reclaim point of view AFAIR, but it seems to have failed in this
> > particular situation. It should have deferred to direct reclaim, which
> > would eventually trigger the OOM killer. Your patch has worked around
> > this by bailing out of the kswapd reclaim early, so the part of the page
> > cache required for the code to make progress stays resident and things
> > can move forward.
> > 
> > The proper fix should, however, check the amount of reclaimable pages
> > and back off if they cannot meet the target IMO. We cannot rely on the
> > general reclaimability here because that could really be thrashing.
> 
> Yes, my guess was that thrashing out pages used by the running programs was the
> cause for my freezes, but I didn't think of making kswapd back off a different
> way.
> 
> Right now I don't see any such back-off mechanism in kswapd.

There is a pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES check in
prepare_kswapd_sleep(), but it is not really helping here, quite the
contrary, because kswapd is still able to reclaim page cache. The
problem, I believe, is that we even try to reclaim that page cache in
the first place. If we had a signal that all the reclaimed memory is
essentially refaulted right back, we should back off.
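
For reference, the existing mechanism looks roughly like this in a
v5.5-era mm/vmscan.c (condensed; treat it as a sketch rather than an
exact quote):

static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order,
				 int classzone_idx)
{
	/* Hopeless node, leave it to direct reclaim */
	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
		return true;

	if (pgdat_balanced(pgdat, order, classzone_idx)) {
		clear_pgdat_congested(pgdat);
		return true;
	}

	return false;
}

kswapd_failures only increments when a balance_pgdat() run reclaims
nothing at all, which is why it never fires here: the page cache is
always reclaimable, it just gets refaulted immediately.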

A simpler, though probably much more subtle, solution would be to back
off kswapd when zone_reclaimable_pages() < high_wmark_pages. This would
push the work out to direct reclaim, which itself changes the overall
timing, as the reclaim would be more tightly bound to the memory demand.

> Also, if we add
> this into kswapd, we would need to plug it into the direct reclaim path as well,
> no? I don't think direct reclaim would help with the situation I've run into;
> although it wouldn't be as bad as letting kswapd evict pages to the high
> watermark, it would still cause page thrashing that would just be capped at the
> number of pages a direct reclaimer is looking to steal.

And more importantly, it would be bound to the allocating context, so
the feedback mechanism would be more closely tied to the workload.

> Considering that my patch remedies this issue for me without invoking the OOM
> killer, a proper solution should produce the same or better results. I don't
> think the OOM killer should have been triggered in this case.

Your system is likely struggling already; it is just less visible
because kswapd reclaims faster than the pages can be faulted back in,
so a single access might trigger several refaults.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-25 22:30                       ` Shakeel Butt
@ 2020-02-26  9:08                         ` Michal Hocko
  2020-02-26 17:00                           ` Shakeel Butt
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-02-26  9:08 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Sultan Alsawaf, Mel Gorman, Dave Hansen, Andrew Morton, Linux MM,
	LKML, Johannes Weiner

On Tue 25-02-20 14:30:03, Shakeel Butt wrote:
> On Tue, Feb 25, 2020 at 1:10 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> [snip]
> >
> > The proper fix should, however, check the amount of reclaimable pages
> > and back off if they cannot meet the target IMO. We cannot rely on the
> > general reclaimability here because that could really be thrashing.
> >
> 
> "check the amount of reclaimable pages" vs "cannot rely on the general
> reclaimability"? Can you clarify?

kswapd targets the high watermark, and if your reclaimable memory (aka
zone_reclaimable_pages) is lower than the high wmark then it simply
cannot satisfy that target, right? Keeping up the reclaim in that
situation seems counterproductive to me because you keep evicting pages
that might be reused, without any feedback mechanism on the actual usage.
Please see my other reply.

> BTW we are seeing a similar situation in our production environment.
> We have swappiness=0, no swap from kswapd (because we don't swap out on
> pressure, only on cold age) and too few file pages; kswapd goes crazy
> in shrink_slab and spends 100% CPU on it.

I am not sure this is the same problem. It seems that the slab shrinkers
are not really a bottleneck here. I would recommend identifying which
shrinkers are eating the CPU time in your case.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-26  9:08                         ` Michal Hocko
@ 2020-02-26 17:00                           ` Shakeel Butt
  2020-02-26 17:41                             ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Shakeel Butt @ 2020-02-26 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sultan Alsawaf, Mel Gorman, Dave Hansen, Andrew Morton, Linux MM,
	LKML, Johannes Weiner

On Wed, Feb 26, 2020 at 1:08 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 25-02-20 14:30:03, Shakeel Butt wrote:
> > On Tue, Feb 25, 2020 at 1:10 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > [snip]
> > >
> > > The proper fix should, however, check the amount of reclaimable pages
> > > and back off if they cannot meet the target IMO. We cannot rely on the
> > > general reclaimability here because that could really be thrashing.
> > >
> >
> > "check the amount of reclaimable pages" vs "cannot rely on the general
> > reclaimability"? Can you clarify?
>
> kswapd targets the high watermark, and if your reclaimable memory (aka
> zone_reclaimable_pages) is lower than the high wmark then it simply
> cannot satisfy that target, right? Keeping up the reclaim in that
> situation seems counterproductive to me because you keep evicting pages
> that might be reused, without any feedback mechanism on the actual usage.
> Please see my other reply.
>

I understand and agree with the argument that if reclaimable pages are
below the high wmark then there is no need to reclaim. Regarding not
depending on general reclaimability, I thought you meant that even if
reclaimable pages are above the high wmark, we might not want to
continue the reclaim, so as not to cause thrashing. Is that right?

> > BTW we are seeing a similar situation in our production environment.
> > We have swappiness=0, no swap from kswapd (because we don't swap out on
> > pressure, only on cold age) and too few file pages; kswapd goes crazy
> > in shrink_slab and spends 100% CPU on it.
>
> I am not sure this is the same problem. It seems that the slab shrinkers
> are not really a bottleneck here. I would recommend identifying which
> shrinkers are eating the CPU time in your case.
>

The perf profile shows that kswapd is spending almost all its time in
list_lru_count_one() and in the memcg tree traversal, so it's not just
one shrinker.

Yes, it's not exactly the same problem, but I would say it is similar.
For Sultan's issue, even if there are many reclaimable pages, we don't
want to thrash. In our issue, thrashing is not happening, but kswapd is
going nuts on the slab shrinkers.
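
To make the traversal cost concrete: kswapd's reclaim path walks the
whole memcg hierarchy and invokes the shrinkers for each memcg, roughly
like this (heavily simplified from a v5.5-era shrink_node(); the loop
shape is the point, not the exact code):

	memcg = mem_cgroup_iter(root, NULL, NULL);
	do {
		/*
		 * shrink_slab() iterates every registered memcg-aware
		 * shrinker and calls its ->count_objects(), which for
		 * the superblock shrinker ends up in
		 * list_lru_count_one().
		 */
		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
			    sc->priority);
	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));

With thousands of memcgs and very little to actually reclaim, the
counting walks alone can keep a CPU busy.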

Shakeel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
       [not found]                       ` <20200226105137.9088-1-hdanton@sina.com>
@ 2020-02-26 17:04                         ` Shakeel Butt
  0 siblings, 0 replies; 27+ messages in thread
From: Shakeel Butt @ 2020-02-26 17:04 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Sultan Alsawaf, Mel Gorman, Dave Hansen, Andrew Morton, Linux MM, LKML

On Wed, Feb 26, 2020 at 4:15 AM Hillf Danton <hdanton@sina.com> wrote:
>
>
> On Tue, 25 Feb 2020 14:30:03 -0800 Shakeel Butt wrote:
> >
> > BTW we are seeing a similar situation in our production environment.
> > We have swappiness=0, no swap from kswapd (because we don't swap out on
> > pressure, only on cold age) and too few file pages; kswapd goes crazy
> > in shrink_slab and spends 100% CPU on it.
>
> Dunno if checking swappiness can calm your kswapd down.
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2631,8 +2631,14 @@ static inline bool should_continue_recla
>          */
>         pages_for_compaction = compact_gap(sc->order);
>         inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -       if (get_nr_swap_pages() > 0)
> -               inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> +       do {
> +               struct lruvec *lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +               struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +               int swappiness = mem_cgroup_swappiness(memcg);
> +
> +               if (swappiness && get_nr_swap_pages() > 0)

Thanks for finding this. I think we also need to check sc->may_swap
(see the combined sketch below the quoted patch). Can you please send
a signed-off patch? It may or may not help kswapd, but I think it is
needed regardless.

> +                       inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> +       } while (0);
>
>         return inactive_lru_pages > pages_for_compaction;
>  }
>
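
Putting the two together, a sketch of the resulting fragment of
should_continue_reclaim() with the sc->may_swap check folded in
(untested; the do/while scoping block from the patch above is flattened
here):

	pages_for_compaction = compact_gap(sc->order);
	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
	if (sc->may_swap && get_nr_swap_pages() > 0) {
		struct lruvec *lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
							  pgdat);

		/* only count anon pages if this reclaim may swap at all */
		if (mem_cgroup_swappiness(lruvec_memcg(lruvec)))
			inactive_lru_pages += node_page_state(pgdat,
							      NR_INACTIVE_ANON);
	}

	return inactive_lru_pages > pages_for_compaction;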

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages
  2020-02-26 17:00                           ` Shakeel Butt
@ 2020-02-26 17:41                             ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2020-02-26 17:41 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Sultan Alsawaf, Mel Gorman, Dave Hansen, Andrew Morton, Linux MM,
	LKML, Johannes Weiner

On Wed 26-02-20 09:00:57, Shakeel Butt wrote:
> On Wed, Feb 26, 2020 at 1:08 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 25-02-20 14:30:03, Shakeel Butt wrote:
> > > On Tue, Feb 25, 2020 at 1:10 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > [snip]
> > > >
> > > > The proper fix should, however, check the amount of reclaimable pages
> > > > and back off if they cannot meet the target IMO. We cannot rely on the
> > > > general reclaimability here because that could really be thrashing.
> > > >
> > >
> > > "check the amount of reclaimable pages" vs "cannot rely on the general
> > > reclaimability"? Can you clarify?
> >
> > kswapd targets the high watermark, and if your reclaimable memory (aka
> > zone_reclaimable_pages) is lower than the high wmark then it simply
> > cannot satisfy that target, right? Keeping up the reclaim in that
> > situation seems counterproductive to me because you keep evicting pages
> > that might be reused, without any feedback mechanism on the actual usage.
> > Please see my other reply.
> >
> 
> I understand and agree with the argument that if reclaimable pages are
> below the high wmark then there is no need to reclaim. Regarding not
> depending on general reclaimability, I thought you meant that even if
> reclaimable pages are above the high wmark, we might not want to
> continue the reclaim, so as not to cause thrashing. Is that right?

That is a completely different story. I would stick to the pathological
problem reported here. The general thrashing problem is much more complex
and harder to solve without introducing a lot of policy into the reclaim.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2020-02-26 17:41 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-19 18:25 [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages Sultan Alsawaf
2020-02-19 19:13 ` Dave Hansen
2020-02-19 19:40   ` Sultan Alsawaf
2020-02-19 20:05     ` Michal Hocko
2020-02-19 20:42       ` Sultan Alsawaf
2020-02-19 21:45         ` Mel Gorman
2020-02-19 22:42           ` Sultan Alsawaf
2020-02-20 10:19             ` Mel Gorman
2020-02-21  4:22               ` Sultan Alsawaf
2020-02-21  8:07                 ` Michal Hocko
     [not found]                   ` <20200221210824.GA3605@sultan-book.localdomain>
2020-02-21 21:24                     ` Dave Hansen
2020-02-25  9:09                     ` Michal Hocko
2020-02-25 17:12                       ` Sultan Alsawaf
2020-02-26  9:05                         ` Michal Hocko
2020-02-25 22:30                       ` Shakeel Butt
2020-02-26  9:08                         ` Michal Hocko
2020-02-26 17:00                           ` Shakeel Butt
2020-02-26 17:41                             ` Michal Hocko
     [not found]                       ` <20200226105137.9088-1-hdanton@sina.com>
2020-02-26 17:04                         ` Shakeel Butt
2020-02-21 18:04                 ` Shakeel Butt
2020-02-21 20:06                   ` Sultan Alsawaf
2020-02-20  8:29         ` Michal Hocko
2020-02-19 19:26 ` Andrew Morton
2020-02-19 22:45   ` Sultan Alsawaf
2020-02-19 19:35 ` Michal Hocko
2020-02-21  4:30 ` [PATCH v2] " Sultan Alsawaf
     [not found]   ` <20200221182201.GB4462@iweiny-DESK2.sc.intel.com>
2020-02-21 20:00     ` Sultan Alsawaf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).