* [PATCH] vmscan: set correct defer count for shrinker
@ 2016-10-12 16:09 ` Shaohua Li
  0 siblings, 0 replies; 5+ messages in thread
From: Shaohua Li @ 2016-10-12 16:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Kernel-team, Johannes Weiner, Vladimir Davydov, Andrew Morton, v4.0+

Our system uses significantly more slab memory with memcg enabled on
the latest kernel. With a 3.10 kernel, slab uses 2G of memory; with a
4.6 kernel, 6G is used. It looks like the shrinker has a problem.
Suppose we have two memcgs for one shrinker. In do_shrink_slab:

1. Check cg1. nr_deferred = 0; assume total_scan = 700. The batch size
is 1024, so no memory is freed. nr_deferred = 700.
2. Check cg2. nr_deferred = 700. Assume freeable = 20; then total_scan
is clamped to 10 or 40. Let's assume it's 10. No memory is freed.
nr_deferred = 10.

cg1's deferred share is lost in this case. kswapd will free no memory
even if it runs the above steps again and again.
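
The "10 or 40" above comes from the windup clamps in do_shrink_slab().
A simplified sketch of that logic, with the example numbers from step 2
as comments (paraphrased from mm/vmscan.c, not a verbatim excerpt):

	total_scan = nr + delta;	/* 700 + a small delta */
	if (delta < freeable / 4)	/* windup limiter */
		total_scan = min(total_scan, freeable / 2);	/* -> 10 */
	if (total_scan > freeable * 2)	/* cap at twice freeable */
		total_scan = freeable * 2;	/* -> 40 */

The deferred count is xchg'd to 0 up front and only the unscanned part
of this clamped total_scan is added back, so cg1's 700 deferred objects
collapse to at most 40. With this patch, next_deferred starts from the
unclamped nr + delta and is only decremented by what was actually
scanned, so cg1's share survives to the next pass.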

The fix makes sure one memcg's deferred share isn't lost.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org (v4.0+)
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/vmscan.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fe8b71..c3822ae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -291,6 +291,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
+	long scanned = 0, next_deferred;
 
 	freeable = shrinker->count_objects(shrinker, shrinkctl);
 	if (freeable == 0)
@@ -312,7 +313,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
 		       shrinker->scan_objects, total_scan);
 		total_scan = freeable;
-	}
+		next_deferred = nr;
+	} else
+		next_deferred = total_scan;
 
 	/*
 	 * We need to avoid excessive windup on filesystem shrinkers
@@ -369,17 +372,22 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 
 		count_vm_events(SLABS_SCANNED, nr_to_scan);
 		total_scan -= nr_to_scan;
+		scanned += nr_to_scan;
 
 		cond_resched();
 	}
 
+	if (next_deferred >= scanned)
+		next_deferred -= scanned;
+	else
+		next_deferred = 0;
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates. If we exhausted the
 	 * scan, there is no need to do an update.
 	 */
-	if (total_scan > 0)
-		new_nr = atomic_long_add_return(total_scan,
+	if (next_deferred > 0)
+		new_nr = atomic_long_add_return(next_deferred,
 						&shrinker->nr_deferred[nid]);
 	else
 		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
-- 
2.9.3

* Re: [PATCH] vmscan: set correct defer count for shrinker
  2016-10-12 16:09 ` Shaohua Li
@ 2016-10-13  6:53 ` Michal Hocko
  2016-10-15 20:48   ` Vladimir Davydov
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2016-10-13  6:53 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, Kernel-team, Johannes Weiner, Andrew Morton, Vladimir Davydov

[Fixup Vladimir's email and drop the stable mailing list]

On Wed 12-10-16 09:09:49, Shaohua Li wrote:
> Our system uses significantly more slab memory with memcg enabled on
> the latest kernel. With a 3.10 kernel, slab uses 2G of memory; with a
> 4.6 kernel, 6G is used. It looks like the shrinker has a problem.
> Suppose we have two memcgs for one shrinker. In do_shrink_slab:
> 
> 1. Check cg1. nr_deferred = 0; assume total_scan = 700. The batch size
> is 1024, so no memory is freed. nr_deferred = 700.
> 2. Check cg2. nr_deferred = 700. Assume freeable = 20; then total_scan
> is clamped to 10 or 40. Let's assume it's 10. No memory is freed.
> nr_deferred = 10.
> 
> cg1's deferred share is lost in this case. kswapd will free no memory
> even if it runs the above steps again and again.
> 
> The fix makes sure one memcg's deferred share isn't lost.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: stable@vger.kernel.org (v4.0+)
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  mm/vmscan.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0fe8b71..c3822ae 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -291,6 +291,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	int nid = shrinkctl->nid;
>  	long batch_size = shrinker->batch ? shrinker->batch
>  					  : SHRINK_BATCH;
> +	long scanned = 0, next_deferred;
>  
>  	freeable = shrinker->count_objects(shrinker, shrinkctl);
>  	if (freeable == 0)
> @@ -312,7 +313,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
>  		       shrinker->scan_objects, total_scan);
>  		total_scan = freeable;
> -	}
> +		next_deferred = nr;
> +	} else
> +		next_deferred = total_scan;
>  
>  	/*
>  	 * We need to avoid excessive windup on filesystem shrinkers
> @@ -369,17 +372,22 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  
>  		count_vm_events(SLABS_SCANNED, nr_to_scan);
>  		total_scan -= nr_to_scan;
> +		scanned += nr_to_scan;
>  
>  		cond_resched();
>  	}
>  
> +	if (next_deferred >= scanned)
> +		next_deferred -= scanned;
> +	else
> +		next_deferred = 0;
>  	/*
>  	 * move the unused scan count back into the shrinker in a
>  	 * manner that handles concurrent updates. If we exhausted the
>  	 * scan, there is no need to do an update.
>  	 */
> -	if (total_scan > 0)
> -		new_nr = atomic_long_add_return(total_scan,
> +	if (next_deferred > 0)
> +		new_nr = atomic_long_add_return(next_deferred,
>  						&shrinker->nr_deferred[nid]);
>  	else
>  		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> -- 
> 2.9.3

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] vmscan: set correct defer count for shrinker
  2016-10-13  6:53 ` Michal Hocko
@ 2016-10-15 20:48   ` Vladimir Davydov
  2016-10-17 17:16     ` Shaohua Li
  0 siblings, 1 reply; 5+ messages in thread
From: Vladimir Davydov @ 2016-10-15 20:48 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Michal Hocko, linux-mm, Kernel-team, Johannes Weiner, Andrew Morton

On Thu, Oct 13, 2016 at 08:53:28AM +0200, Michal Hocko wrote:
> On Wed 12-10-16 09:09:49, Shaohua Li wrote:
> > Our system uses significantly more slab memory with memcg enabled on
> > the latest kernel. With a 3.10 kernel, slab uses 2G of memory; with a
> > 4.6 kernel, 6G is used. It looks like the shrinker has a problem.
> > Suppose we have two memcgs for one shrinker. In do_shrink_slab:
> > 
> > 1. Check cg1. nr_deferred = 0; assume total_scan = 700. The batch size
> > is 1024, so no memory is freed. nr_deferred = 700.
> > 2. Check cg2. nr_deferred = 700. Assume freeable = 20; then total_scan
> > is clamped to 10 or 40. Let's assume it's 10. No memory is freed.
> > nr_deferred = 10.
> > 
> > cg1's deferred share is lost in this case. kswapd will free no
> > memory even if it runs the above steps again and again.

I agree this is possible. IMO the ideal way to fix this problem would be
making deferred counters per memory cgroup. That would also resolve
possible fairness issues when objects deferred by one cgroup are
reclaimed from another. However, it's unclear to me how to implement it
w/o bringing in a lot of awkward code. So I guess your patch is
reasonable for now. Apart from a couple nitpicks (below), it looks good
to me:

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
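
For the record, a rough sketch of what per-memcg deferred counters
might look like (purely hypothetical; the struct and names below are
invented for illustration, not existing code):

	/* One set of deferred counters per memcg, indexed by shrinker id. */
	struct memcg_shrinker_deferred {
		struct rcu_head rcu;
		atomic_long_t nr_deferred[];
	};

do_shrink_slab() would then update the counter of shrinkctl->memcg
instead of the global shrinker->nr_deferred[nid], which is what lets
deferred work leak between cgroups today.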

> > @@ -312,7 +313,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
> >  		       shrinker->scan_objects, total_scan);
> >  		total_scan = freeable;
> > -	}
> > +		next_deferred = nr;
> > +	} else
> > +		next_deferred = total_scan;

nitpick: Why do we want to handle this what-the-heck-is-going-on case in
a special way? Why not just always assign total_scan to next_deferred
here? I don't see how it could make things worse when total_scan gets
screwed up.
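
I.e. something like (sketch):

		total_scan = freeable;
	}
	next_deferred = total_scan;	/* unconditionally */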

> >  
> >  	/*
> >  	 * We need to avoid excessive windup on filesystem shrinkers
> > @@ -369,17 +372,22 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  
> >  		count_vm_events(SLABS_SCANNED, nr_to_scan);
> >  		total_scan -= nr_to_scan;
> > +		scanned += nr_to_scan;

nitpick: We could get along w/o 'scanned' here:

                next_deferred -= nr_to_scan;

> >  
> >  		cond_resched();
> >  	}
> >  
> > +	if (next_deferred >= scanned)
> > +		next_deferred -= scanned;
> > +	else
> > +		next_deferred = 0;

... and this chunk wouldn't be needed then.
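
The scan loop would then read (sketch, assuming the unconditional
initialization above):

	while (total_scan >= batch_size ||
	       total_scan >= freeable) {
		...
		total_scan -= nr_to_scan;
		next_deferred -= nr_to_scan;	/* replaces 'scanned' */

		cond_resched();
	}

with no post-loop adjustment needed.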

> >  	/*
> >  	 * move the unused scan count back into the shrinker in a
> >  	 * manner that handles concurrent updates. If we exhausted the
> >  	 * scan, there is no need to do an update.
> >  	 */
> > -	if (total_scan > 0)
> > -		new_nr = atomic_long_add_return(total_scan,
> > +	if (next_deferred > 0)
> > +		new_nr = atomic_long_add_return(next_deferred,
> >  						&shrinker->nr_deferred[nid]);
> >  	else
> >  		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);


* Re: [PATCH] vmscan: set correct defer count for shrinker
  2016-10-15 20:48   ` Vladimir Davydov
@ 2016-10-17 17:16     ` Shaohua Li
  0 siblings, 0 replies; 5+ messages in thread
From: Shaohua Li @ 2016-10-17 17:16 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, linux-mm, Kernel-team, Johannes Weiner, Andrew Morton

On Sat, Oct 15, 2016 at 11:48:13PM +0300, Vladimir Davydov wrote:
> On Thu, Oct 13, 2016 at 08:53:28AM +0200, Michal Hocko wrote:
> > On Wed 12-10-16 09:09:49, Shaohua Li wrote:
> > > Our system uses significantly more slab memory with memcg enabled on
> > > the latest kernel. With a 3.10 kernel, slab uses 2G of memory; with a
> > > 4.6 kernel, 6G is used. It looks like the shrinker has a problem.
> > > Suppose we have two memcgs for one shrinker. In do_shrink_slab:
> > > 
> > > 1. Check cg1. nr_deferred = 0; assume total_scan = 700. The batch size
> > > is 1024, so no memory is freed. nr_deferred = 700.
> > > 2. Check cg2. nr_deferred = 700. Assume freeable = 20; then total_scan
> > > is clamped to 10 or 40. Let's assume it's 10. No memory is freed.
> > > nr_deferred = 10.
> > > 
> > > cg1's deferred share is lost in this case. kswapd will free no
> > > memory even if it runs the above steps again and again.
> 
> I agree this is possible. IMO the ideal way to fix this problem would be
> making deferred counters per memory cgroup. That would also resolve
> possible fairness issues when objects deferred by one cgroup are
> reclaimed from another. However, it's unclear to me how to implement it
> w/o bringing in a lot of awkward code. So I guess your patch is
> reasonable for now. Apart from a couple nitpicks (below), it looks good
> to me:
> 
> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
> 
> > > @@ -312,7 +313,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
> > >  		       shrinker->scan_objects, total_scan);
> > >  		total_scan = freeable;
> > > -	}
> > > +		next_deferred = nr;
> > > +	} else
> > > +		next_deferred = total_scan;
> 
> nitpick: Why do we want to handle this what-the-heck-is-going-on case in
> a special way? Why not just always assign total_scan to next_deferred
> here? I don't see how it could make things worse when total_scan gets
> screwed up.

I have no idea when this special case will happen, so I'd like to be
conservative here. Somebody who knows this code better could probably
help clean it up.
 
> > >  
> > >  	/*
> > >  	 * We need to avoid excessive windup on filesystem shrinkers
> > > @@ -369,17 +372,22 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  
> > >  		count_vm_events(SLABS_SCANNED, nr_to_scan);
> > >  		total_scan -= nr_to_scan;
> > > +		scanned += nr_to_scan;
> 
> nitpick: We could get along w/o 'scanned' here:
> 
>                 next_deferred -= nr_to_scan;

In that special case, next_deferred could be smaller than total_scan. That
said, if we drop the special case, your suggestion is good. I'm totally
open here.

> > >  
> > >  		cond_resched();
> > >  	}
> > >  
> > > +	if (next_deferred >= scanned)
> > > +		next_deferred -= scanned;
> > > +	else
> > > +		next_deferred = 0;
> 
> ... and this chunk wouldn't be needed then.

Ditto.

Thanks,
Shaohua
