linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] mm, slab: Extend slab/shrink to shrink all memcg caches
@ 2019-07-17 20:24 Waiman Long
  2019-07-17 20:24 ` [PATCH v2 1/2] " Waiman Long
  2019-07-17 20:24 ` [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read Waiman Long
  0 siblings, 2 replies; 13+ messages in thread
From: Waiman Long @ 2019-07-17 20:24 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, Waiman Long

 v2:
  - Just extend the shrink sysfs file to shrink all memcg caches without
    adding new semantics.
  - Add a patch to report the time of the shrink operation.

This patchset enables the slab/shrink sysfs file to shrink all the
memcg caches that are associated with the given root cache. The time of
the shrink operation can now be read from the shrink file.

Waiman Long (2):
  mm, slab: Extend slab/shrink to shrink all memcg caches
  mm, slab: Show last shrink time in us when slab/shrink is read

 Documentation/ABI/testing/sysfs-kernel-slab | 14 +++++---
 include/linux/slub_def.h                    |  1 +
 mm/slab.h                                   |  1 +
 mm/slab_common.c                            | 37 +++++++++++++++++++++
 mm/slub.c                                   | 14 +++++---
 5 files changed, 59 insertions(+), 8 deletions(-)

-- 
2.18.1



* [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
  2019-07-17 20:24 [PATCH v2 0/2] mm, slab: Extend slab/shrink to shrink all memcg caches Waiman Long
@ 2019-07-17 20:24 ` Waiman Long
  2019-07-18 11:38   ` Christopher Lameter
  2019-07-19  6:20   ` Michal Hocko
  2019-07-17 20:24 ` [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read Waiman Long
  1 sibling, 2 replies; 13+ messages in thread
From: Waiman Long @ 2019-07-17 20:24 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, Waiman Long

Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
file to shrink the slab by flushing out all the per-cpu slabs and free
slabs in partial lists. This can be useful to squeeze out a bit more memory
under extreme conditions as well as to make the active object counts in
/proc/slabinfo more accurate.
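
As an illustration of the interface described above, here is a minimal
user-space sketch in C that writes "1" to a cache's shrink file. The helper
names slab_shrink_path() and slab_shrink_request() are hypothetical and not
part of any kernel or library API. Only the sysfs path layout and the "1"
input value are taken from this thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Build the sysfs path of a cache's shrink file (hypothetical helper).
 * Returns 0 on success, -1 if the path does not fit in the buffer.
 */
int slab_shrink_path(char *out, size_t len, const char *cache)
{
	int n = snprintf(out, len, "/sys/kernel/slab/%s/shrink", cache);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "1" to the given shrink file (hypothetical helper).
 * Returns 0 on success or a negative errno value on failure.
 */
int slab_shrink_request(const char *path)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;
	int err;

	if (fd < 0)
		return -errno;

	n = write(fd, "1", 1);
	/* Capture errno before close() can overwrite it. */
	err = (n == 1) ? 0 : (n < 0 ? -errno : -EIO);
	close(fd);
	return err;
}
```

Building the path for "task_struct" and passing it to slab_shrink_request()
as root would request a shrink of that cache. A write of any other value is
expected to fail with EINVAL, per the shrink_store() code in this patch.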

This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
if memcg sysfs is turned on, it is too cumbersome and impractical to
manage all those per-memcg sysfs files in a real production system.

So there is no practical way to shrink memcg caches.  Fix this by
enabling a proper write to the shrink sysfs file of the root cache
to scan all the available memcg caches and shrink them as well. For a
non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is
on), only that cache will be shrunk when written.
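
The root versus non-root behavior described above can be modeled with a
small user-space toy. This is only a sketch of the semantics: struct
toy_cache and the toy_* functions are hypothetical stand-ins, not kernel
structures, and all locking (slab_mutex, CPU and memory hotplug) is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct kmem_cache; not the kernel structure. */
struct toy_cache {
	bool is_root;                /* root cache vs. per-memcg child */
	bool deactivated;            /* models the SLAB_DEACTIVATED flag */
	unsigned int shrink_calls;   /* how often this cache was shrunk */
	struct toy_cache **children; /* memcg children of a root cache */
	size_t nr_children;
};

/* Models __kmem_cache_shrink(); here it only counts the call. */
static void toy_shrink(struct toy_cache *c)
{
	c->shrink_calls++;
}

/*
 * Models a write of "1" to a cache's shrink file.
 * Returns the number of caches that were shrunk.
 */
size_t toy_shrink_all(struct toy_cache *s)
{
	size_t shrunk = 0;

	toy_shrink(s);
	shrunk++;

	/* A non-root (memcg) cache only shrinks itself. */
	if (!s->is_root)
		return shrunk;

	for (size_t i = 0; i < s->nr_children; i++) {
		struct toy_cache *c = s->children[i];

		/* Deactivated memcg caches are skipped. */
		if (c->deactivated)
			continue;
		toy_shrink(c);
		shrunk++;
	}
	return shrunk;
}
```

With a root cache that has three children, one of them deactivated,
toy_shrink_all() on the root touches three caches, while calling it on a
single child touches only that child.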

On a 2-socket 64-core 256-thread arm64 system with 64k pages after
a parallel kernel build, the amount of memory occupied by slabs
before shrinking slabs was:

 # grep task_struct /proc/slabinfo
 task_struct        53137  53192   4288   61    4 : tunables    0    0
 0 : slabdata    872    872      0
 # grep "^S[lRU]" /proc/meminfo
 Slab:            3936832 kB
 SReclaimable:     399104 kB
 SUnreclaim:      3537728 kB

After shrinking slabs:

 # grep "^S[lRU]" /proc/meminfo
 Slab:            1356288 kB
 SReclaimable:     263296 kB
 SUnreclaim:      1092992 kB
 # grep task_struct /proc/slabinfo
 task_struct         2764   6832   4288   61    4 : tunables    0    0
 0 : slabdata    112    112      0
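
To put the numbers above in perspective, the following sketch works through
the arithmetic. It assumes the usual /proc/slabinfo column order (objects
per slab, then pages per slab, then the slab counts after "slabdata") and
the 64 kB page size stated above. The function names are hypothetical.

```c
/* Memory held by a cache's slabs, in kB (hypothetical helper). */
unsigned long slab_mem_kb(unsigned long num_slabs,
			  unsigned long pages_per_slab,
			  unsigned long page_kb)
{
	return num_slabs * pages_per_slab * page_kb;
}

/* Relative reduction from 'before' to 'after', in percent. */
double reduction_pct(unsigned long before, unsigned long after)
{
	return 100.0 * (double)(before - after) / (double)before;
}
```

With these inputs, task_struct goes from 872 slabs (872 * 4 * 64 = 223232 kB)
to 112 slabs (28672 kB), and the total Slab figure drops from 3936832 kB to
1356288 kB, a reduction of roughly 65.5%.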

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 Documentation/ABI/testing/sysfs-kernel-slab | 12 ++++---
 mm/slab.h                                   |  1 +
 mm/slab_common.c                            | 37 +++++++++++++++++++++
 mm/slub.c                                   |  2 +-
 4 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..94ffd47fc8d7 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -429,10 +429,14 @@ KernelVersion:	2.6.22
 Contact:	Pekka Enberg <penberg@cs.helsinki.fi>,
 		Christoph Lameter <cl@linux-foundation.org>
 Description:
-		The shrink file is written when memory should be reclaimed from
-		a cache.  Empty partial slabs are freed and the partial list is
-		sorted so the slabs with the fewest available objects are used
-		first.
+		The shrink file is used to enable some unused slab cache
+		memory to be reclaimed from a cache.  Empty per-cpu
+		or partial slabs are freed and the partial list is
+		sorted so the slabs with the fewest available objects
+		are used first.  It only accepts a value of "1" on
+		write for shrinking the cache. Other input values are
+		considered invalid.  If it is a root cache, all the
+		child memcg caches will also be shrunk, if available.
 
 What:		/sys/kernel/slab/cache/slab_size
 Date:		May 2007
diff --git a/mm/slab.h b/mm/slab.h
index 9057b8056b07..5bf615cb3f99 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *);
 void __kmemcg_cache_deactivate(struct kmem_cache *s);
 void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
+void kmem_cache_shrink_all(struct kmem_cache *s);
 
 struct seq_file;
 struct file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 807490fe217a..6491c3a41805 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -981,6 +981,43 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+/**
+ * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
+ * @s: The cache pointer
+ */
+void kmem_cache_shrink_all(struct kmem_cache *s)
+{
+	struct kmem_cache *c;
+
+	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
+		kmem_cache_shrink(s);
+		return;
+	}
+
+	get_online_cpus();
+	get_online_mems();
+	kasan_cache_shrink(s);
+	__kmem_cache_shrink(s);
+
+	/*
+	 * We have to take the slab_mutex to protect from the memcg list
+	 * modification.
+	 */
+	mutex_lock(&slab_mutex);
+	for_each_memcg_cache(c, s) {
+		/*
+		 * Don't need to shrink deactivated memcg caches.
+		 */
+		if (c->flags & SLAB_DEACTIVATED)
+			continue;
+		kasan_cache_shrink(c);
+		__kmem_cache_shrink(c);
+	}
+	mutex_unlock(&slab_mutex);
+	put_online_mems();
+	put_online_cpus();
+}
+
 bool slab_is_available(void)
 {
 	return slab_state >= UP;
diff --git a/mm/slub.c b/mm/slub.c
index e6c030e47364..9736eb10dcb8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5294,7 +5294,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
 	if (buf[0] == '1')
-		kmem_cache_shrink(s);
+		kmem_cache_shrink_all(s);
 	else
 		return -EINVAL;
 	return length;
-- 
2.18.1



* [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read
  2019-07-17 20:24 [PATCH v2 0/2] mm, slab: Extend slab/shrink to shrink all memcg caches Waiman Long
  2019-07-17 20:24 ` [PATCH v2 1/2] " Waiman Long
@ 2019-07-17 20:24 ` Waiman Long
  2019-07-18 11:39   ` Christopher Lameter
  2019-07-19  6:14   ` Michal Hocko
  1 sibling, 2 replies; 13+ messages in thread
From: Waiman Long @ 2019-07-17 20:24 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, Waiman Long

The show method of /sys/kernel/slab/<slab>/shrink sysfs file currently
returns nothing. This is now modified to show the time of the last
cache shrink operation in us.

CONFIG_SLUB_DEBUG depends on CONFIG_SYSFS. So the new shrink_us field
is always available to the shrink methods.
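
As a rough sketch of how a user-space tool might consume this value,
assuming a kernel with this patch applied, the file can be read and parsed
as a single unsigned decimal number followed by a newline. The helper names
parse_shrink_us() and read_shrink_us() are hypothetical.

```c
#include <stdio.h>

/*
 * Parse the text read from the shrink file ("<microseconds>\n").
 * Returns 0 on success, -1 if no number is present.
 */
int parse_shrink_us(const char *text, unsigned int *us)
{
	return sscanf(text, "%u", us) == 1 ? 0 : -1;
}

/*
 * Read the last shrink time from the given file.
 * Returns 0 on success, -1 on error or if the file reads back empty,
 * as it does when the show method returns nothing.
 */
int read_shrink_us(const char *path, unsigned int *us)
{
	char buf[32];
	char *line;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	line = fgets(buf, sizeof(buf), f);
	fclose(f);
	if (!line)
		return -1;
	return parse_shrink_us(buf, us);
}
```

The empty-file case is handled explicitly so that the same helper reports
an error rather than a stale value on a kernel without this patch.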

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/ABI/testing/sysfs-kernel-slab |  2 ++
 include/linux/slub_def.h                    |  1 +
 mm/slub.c                                   | 12 +++++++++---
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 94ffd47fc8d7..9869a3f57dc3 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -437,6 +437,8 @@ Description:
 		write for shrinking the cache. Other input values are
 		considered invalid.  If it is a root cache, all the
 		child memcg caches will also be shrunk, if available.
+		When read, the time in us of the last cache shrink
+		operation is shown.
 
 What:		/sys/kernel/slab/cache/slab_size
 Date:		May 2007
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..055474197e83 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -113,6 +113,7 @@ struct kmem_cache {
 	/* For propagation, maximum size of a stored attr */
 	unsigned int max_attr_size;
 #ifdef CONFIG_SYSFS
+	unsigned int shrink_us;	/* Cache shrink time in us */
 	struct kset *memcg_kset;
 #endif
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index 9736eb10dcb8..77d67a55ce43 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -34,6 +34,7 @@
 #include <linux/prefetch.h>
 #include <linux/memcontrol.h>
 #include <linux/random.h>
+#include <linux/sched/clock.h>
 
 #include <trace/events/kmem.h>
 
@@ -5287,16 +5288,21 @@ SLAB_ATTR(failslab);
 
 static ssize_t shrink_show(struct kmem_cache *s, char *buf)
 {
-	return 0;
+	return sprintf(buf, "%u\n", s->shrink_us);
 }
 
 static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
-	if (buf[0] == '1')
+	if (buf[0] == '1') {
+		u64 start = sched_clock();
+
 		kmem_cache_shrink_all(s);
-	else
+		s->shrink_us = (unsigned int)div_u64(sched_clock() - start,
+						     NSEC_PER_USEC);
+	} else {
 		return -EINVAL;
+	}
 	return length;
 }
 SLAB_ATTR(shrink);
-- 
2.18.1



* Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
  2019-07-17 20:24 ` [PATCH v2 1/2] " Waiman Long
@ 2019-07-18 11:38   ` Christopher Lameter
  2019-07-18 17:05     ` Roman Gushchin
  2019-07-19  6:20   ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Christopher Lameter @ 2019-07-18 11:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	linux-mm, linux-kernel, Michal Hocko, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On Wed, 17 Jul 2019, Waiman Long wrote:

> Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
> file to shrink the slab by flushing out all the per-cpu slabs and free
> slabs in partial lists. This can be useful to squeeze out a bit more memory
> under extreme conditions as well as to make the active object counts in
> /proc/slabinfo more accurate.

Acked-by: Christoph Lameter <cl@linux.com>

>  # grep task_struct /proc/slabinfo
>  task_struct        53137  53192   4288   61    4 : tunables    0    0
>  0 : slabdata    872    872      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            3936832 kB
>  SReclaimable:     399104 kB
>  SUnreclaim:      3537728 kB
>
> After shrinking slabs:
>
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            1356288 kB
>  SReclaimable:     263296 kB
>  SUnreclaim:      1092992 kB

Well, this is another indicator that it may not be a good decision to
replicate the whole set of slabs for each memcg. Migrating the memcg
ownership into the objects may allow the use of the same slab cache. In
particular, together with the slab migration patches, this may be a viable
way to reduce memory consumption.



* Re: [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read
  2019-07-17 20:24 ` [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read Waiman Long
@ 2019-07-18 11:39   ` Christopher Lameter
  2019-07-18 14:36     ` Waiman Long
  2019-07-19  6:14   ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Christopher Lameter @ 2019-07-18 11:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	linux-mm, linux-kernel, Michal Hocko, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On Wed, 17 Jul 2019, Waiman Long wrote:

> The show method of /sys/kernel/slab/<slab>/shrink sysfs file currently
> returns nothing. This is now modified to show the time of the last
> cache shrink operation in us.

What is this useful for? Any use cases?

> CONFIG_SLUB_DEBUG depends on CONFIG_SYSFS. So the new shrink_us field
> is always available to the shrink methods.

Aside from minimal systems without CONFIG_SYSFS... Does this build without
CONFIG_SYSFS?


* Re: [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read
  2019-07-18 11:39   ` Christopher Lameter
@ 2019-07-18 14:36     ` Waiman Long
  2019-07-18 18:04       ` Waiman Long
  0 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2019-07-18 14:36 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	linux-mm, linux-kernel, Michal Hocko, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On 7/18/19 7:39 AM, Christopher Lameter wrote:
> On Wed, 17 Jul 2019, Waiman Long wrote:
>
>> The show method of /sys/kernel/slab/<slab>/shrink sysfs file currently
>> returns nothing. This is now modified to show the time of the last
>> cache shrink operation in us.
> What is this useful for? Any use cases?

I got a query about how long the slab_mutex will be held when
shrinking the cache. I don't have a solid answer as it depends on how
many memcg caches there are. This patch is a partial answer to that as
it gives a rough upper bound of the lock hold time.
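
For reference, the value stored by the patch is the sched_clock() delta in
nanoseconds around the whole kmem_cache_shrink_all() call, divided by
NSEC_PER_USEC and truncated to an unsigned int. Since the measured interval
also covers work done outside the slab_mutex, it bounds the lock hold time
from above. A user-space sketch of that conversion follows. The function
name elapsed_us() is hypothetical, the constant mirrors the kernel's, and a
32-bit unsigned int is assumed.

```c
#include <stdint.h>

/* Mirrors the kernel constant of the same name. */
#define NSEC_PER_USEC 1000ULL

/*
 * Convert a nanosecond interval to microseconds, truncated to
 * unsigned int the same way the patch stores shrink_us.
 */
unsigned int elapsed_us(uint64_t start_ns, uint64_t end_ns)
{
	return (unsigned int)((end_ns - start_ns) / NSEC_PER_USEC);
}
```

Sub-microsecond remainders are dropped, and with a 32-bit unsigned int the
value wraps after 2^32 us, which is about 4295 seconds. That limit should
be far beyond any realistic shrink duration.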


>> CONFIG_SLUB_DEBUG depends on CONFIG_SYSFS. So the new shrink_us field
>> is always available to the shrink methods.
> Aside from minimal systems without CONFIG_SYSFS... Does this build without
> CONFIG_SYSFS?

The sysfs code in mm/slub.c is guarded by CONFIG_SLUB_DEBUG which, in
turn, depends on CONFIG_SYSFS. So if CONFIG_SYSFS is off, the shrink
sysfs methods will be off as well. I haven't tried doing a minimal
build. I will certainly try that, but I don't expect any problem here.

Cheers,
Longman



* Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
  2019-07-18 11:38   ` Christopher Lameter
@ 2019-07-18 17:05     ` Roman Gushchin
  0 siblings, 0 replies; 13+ messages in thread
From: Roman Gushchin @ 2019-07-18 17:05 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Waiman Long, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On Thu, Jul 18, 2019 at 11:38:11AM +0000, Christopher Lameter wrote:
> On Wed, 17 Jul 2019, Waiman Long wrote:
> 
> > Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
> > file to shrink the slab by flushing out all the per-cpu slabs and free
> > slabs in partial lists. This can be useful to squeeze out a bit more memory
> > under extreme conditions as well as to make the active object counts in
> > /proc/slabinfo more accurate.
> 
> Acked-by: Christoph Lameter <cl@linux.com>
> 
> >  # grep task_struct /proc/slabinfo
> >  task_struct        53137  53192   4288   61    4 : tunables    0    0
> >  0 : slabdata    872    872      0
> >  # grep "^S[lRU]" /proc/meminfo
> >  Slab:            3936832 kB
> >  SReclaimable:     399104 kB
> >  SUnreclaim:      3537728 kB
> >
> > After shrinking slabs:
> >
> >  # grep "^S[lRU]" /proc/meminfo
> >  Slab:            1356288 kB
> >  SReclaimable:     263296 kB
> >  SUnreclaim:      1092992 kB
> 
> Well, this is another indicator that it may not be a good decision to
> replicate the whole set of slabs for each memcg. Migrating the memcg
> ownership into the objects may allow the use of the same slab cache. In
> particular, together with the slab migration patches, this may be a viable
> way to reduce memory consumption.
> 

Btw I'm working on an alternative solution. It's way too early to present
anything, but preliminary results are looking promising: slab memory usage
is decreased by 10-40% depending on the workload.

Thanks!


* Re: [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read
  2019-07-18 14:36     ` Waiman Long
@ 2019-07-18 18:04       ` Waiman Long
  0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2019-07-18 18:04 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	linux-mm, linux-kernel, Michal Hocko, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On 7/18/19 10:36 AM, Waiman Long wrote:
>>> CONFIG_SLUB_DEBUG depends on CONFIG_SYSFS. So the new shrink_us field
>>> is always available to the shrink methods.
>> Aside from minimal systems without CONFIG_SYSFS... Does this build without
>> CONFIG_SYSFS?
> The sysfs code in mm/slub.c is guarded by CONFIG_SLUB_DEBUG which, in
> turn, depends on CONFIG_SYSFS. So if CONFIG_SYSFS is off, the shrink
> sysfs methods will be off as well. I haven't tried doing a minimal
> build. I will certainly try that, but I don't expect any problem here.

I have tried a tiny config with slub. There was no compilation problem.

-Longman



* Re: [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read
  2019-07-17 20:24 ` [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read Waiman Long
  2019-07-18 11:39   ` Christopher Lameter
@ 2019-07-19  6:14   ` Michal Hocko
  2019-07-19 14:07     ` Waiman Long
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2019-07-19  6:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On Wed 17-07-19 16:24:13, Waiman Long wrote:
> The show method of /sys/kernel/slab/<slab>/shrink sysfs file currently
> returns nothing. This is now modified to show the time of the last
> cache shrink operation in us.

Isn't this something that tracing can be used for without any kernel
modifications?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
  2019-07-17 20:24 ` [PATCH v2 1/2] " Waiman Long
  2019-07-18 11:38   ` Christopher Lameter
@ 2019-07-19  6:20   ` Michal Hocko
  2019-07-19 14:09     ` Waiman Long
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2019-07-19  6:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On Wed 17-07-19 16:24:12, Waiman Long wrote:
> Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
> file to shrink the slab by flushing out all the per-cpu slabs and free
> slabs in partial lists. This can be useful to squeeze out a bit more memory
> under extreme conditions as well as to make the active object counts in
> /proc/slabinfo more accurate.
> 
> This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
> option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
> if memcg sysfs is turned on, it is too cumbersome and impractical to
> manage all those per-memcg sysfs files in a real production system.
> 
> So there is no practical way to shrink memcg caches.  Fix this by
> enabling a proper write to the shrink sysfs file of the root cache
> to scan all the available memcg caches and shrink them as well. For a
> non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is
> on), only that cache will be shrunk when written.

I would mention that memcg unawareness was an oversight more than
anything else. The interface is intended to shrink all pcp data of the
cache. The fact that we are using per-memcg internal caches is an
implementation detail.

> On a 2-socket 64-core 256-thread arm64 system with 64k pages after
> a parallel kernel build, the amount of memory occupied by slabs
> before shrinking slabs was:
> 
>  # grep task_struct /proc/slabinfo
>  task_struct        53137  53192   4288   61    4 : tunables    0    0
>  0 : slabdata    872    872      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            3936832 kB
>  SReclaimable:     399104 kB
>  SUnreclaim:      3537728 kB
> 
> After shrinking slabs:
> 
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            1356288 kB
>  SReclaimable:     263296 kB
>  SUnreclaim:      1092992 kB
>  # grep task_struct /proc/slabinfo
>  task_struct         2764   6832   4288   61    4 : tunables    0    0
>  0 : slabdata    112    112      0

Now that you are touching the documentation I would just add a note that
shrinking might be expensive and block other slab operations so it
should be used with some care.

> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>

The patch looks good to me. I do not feel qualified to give my ack but
it is definitely a change in the good direction.

Let's just be careful about recommending this to people as a workaround
for over-caching and the resulting skewed stats. That needs to be addressed
separately.

Thanks!

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read
  2019-07-19  6:14   ` Michal Hocko
@ 2019-07-19 14:07     ` Waiman Long
  2019-07-19 14:29       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2019-07-19 14:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On 7/19/19 2:14 AM, Michal Hocko wrote:
> On Wed 17-07-19 16:24:13, Waiman Long wrote:
>> The show method of /sys/kernel/slab/<slab>/shrink sysfs file currently
>> returns nothing. This is now modified to show the time of the last
>> cache shrink operation in us.
> Isn't this something that tracing can be used for without any kernel
> modifications?

That is true, but it will be a bit more cumbersome to get the data.
Anyway, this is just a nice-to-have patch for me. I am perfectly fine
with dropping it if it does not prove to be that useful.

Thanks,
Longman



* Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
  2019-07-19  6:20   ` Michal Hocko
@ 2019-07-19 14:09     ` Waiman Long
  0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2019-07-19 14:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On 7/19/19 2:20 AM, Michal Hocko wrote:
> On Wed 17-07-19 16:24:12, Waiman Long wrote:
>> Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
>> file to shrink the slab by flushing out all the per-cpu slabs and free
>> slabs in partial lists. This can be useful to squeeze out a bit more memory
>> under extreme conditions as well as to make the active object counts in
>> /proc/slabinfo more accurate.
>>
>> This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
>> option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
>> if memcg sysfs is turned on, it is too cumbersome and impractical to
>> manage all those per-memcg sysfs files in a real production system.
>>
>> So there is no practical way to shrink memcg caches.  Fix this by
>> enabling a proper write to the shrink sysfs file of the root cache
>> to scan all the available memcg caches and shrink them as well. For a
>> non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is
>> on), only that cache will be shrunk when written.
> I would mention that memcg unawareness was an oversight more than
> anything else. The interface is intended to shrink all pcp data of the
> cache. The fact that we are using per-memcg internal caches is an
> implementation detail.
>
>> On a 2-socket 64-core 256-thread arm64 system with 64k pages after
>> a parallel kernel build, the amount of memory occupied by slabs
>> before shrinking slabs was:
>>
>>  # grep task_struct /proc/slabinfo
>>  task_struct        53137  53192   4288   61    4 : tunables    0    0
>>  0 : slabdata    872    872      0
>>  # grep "^S[lRU]" /proc/meminfo
>>  Slab:            3936832 kB
>>  SReclaimable:     399104 kB
>>  SUnreclaim:      3537728 kB
>>
>> After shrinking slabs:
>>
>>  # grep "^S[lRU]" /proc/meminfo
>>  Slab:            1356288 kB
>>  SReclaimable:     263296 kB
>>  SUnreclaim:      1092992 kB
>>  # grep task_struct /proc/slabinfo
>>  task_struct         2764   6832   4288   61    4 : tunables    0    0
>>  0 : slabdata    112    112      0
> Now that you are touching the documentation I would just add a note that
> shrinking might be expensive and block other slab operations so it
> should be used with some care.
>
Good point. I will update the patch to include such a note in the
documentation.

Thanks,
Longman



* Re: [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read
  2019-07-19 14:07     ` Waiman Long
@ 2019-07-19 14:29       ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2019-07-19 14:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov

On Fri 19-07-19 10:07:20, Waiman Long wrote:
> On 7/19/19 2:14 AM, Michal Hocko wrote:
> > On Wed 17-07-19 16:24:13, Waiman Long wrote:
> >> The show method of /sys/kernel/slab/<slab>/shrink sysfs file currently
> >> returns nothing. This is now modified to show the time of the last
> >> cache shrink operation in us.
> > Isn't this something that tracing can be used for without any kernel
> > modifications?
> 
> That is true, but it will be a bit more cumbersome to get the data.

I have no say over this code, but if there is a way to capture timing data,
I prefer to rely on the tracing infrastructure. If the current tooling
makes it cumbersome to get, then that is a good reason to ask for a less
cumbersome way. On the other hand, if you hardwire it into a
user-visible interface, then you establish an ABI which might stand in the
way of potential future development.

So take it as my 2c
-- 
Michal Hocko
SUSE Labs
