linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
@ 2019-06-24 17:42 Waiman Long
  2019-06-24 17:42 ` [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all() Waiman Long
  2019-06-24 17:42 ` [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs Waiman Long
  0 siblings, 2 replies; 19+ messages in thread
From: Waiman Long @ 2019-06-24 17:42 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Roman Gushchin, Shakeel Butt, Andrea Arcangeli, Waiman Long

The purpose of this patchset is to allow system administrators to shrink
all the kmem slabs in order to free up memory and get a more accurate
picture of how many slab objects are actually being used.

Patch 1 adds a new memcg_iterate_all() helper that patch 2 uses to
iterate over all the memory cgroups.

Waiman Long (2):
  mm, memcontrol: Add memcg_iterate_all()
  mm, slab: Extend vm/drop_caches to shrink kmem slabs

 Documentation/sysctl/vm.txt | 11 ++++++++--
 fs/drop_caches.c            |  4 ++++
 include/linux/memcontrol.h  |  3 +++
 include/linux/slab.h        |  1 +
 kernel/sysctl.c             |  4 ++--
 mm/memcontrol.c             | 13 +++++++++++
 mm/slab_common.c            | 44 +++++++++++++++++++++++++++++++++++++
 7 files changed, 76 insertions(+), 4 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all()
  2019-06-24 17:42 [PATCH 0/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs Waiman Long
@ 2019-06-24 17:42 ` Waiman Long
  2019-06-27 15:07   ` Michal Hocko
  2019-06-24 17:42 ` [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs Waiman Long
  1 sibling, 1 reply; 19+ messages in thread
From: Waiman Long @ 2019-06-24 17:42 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Roman Gushchin, Shakeel Butt, Andrea Arcangeli, Waiman Long

Add a memcg_iterate_all() function that iterates over all the available
memory cgroups and calls the given callback function for each of them.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 13 +++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1dcb763bb610..0e31418e5a47 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1268,6 +1268,9 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
+extern void memcg_iterate_all(void (*callback)(struct mem_cgroup *memcg,
+					       void *arg), void *arg);
+
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge(struct page *page, int order);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba9138a4a1de..c1c4706f7696 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -443,6 +443,19 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { }
 #endif /* CONFIG_MEMCG_KMEM */
 
+/*
+ * Iterate all the memory cgroups and call the given callback function
+ * for each of the memory cgroups.
+ */
+void memcg_iterate_all(void (*callback)(struct mem_cgroup *memcg, void *arg),
+		       void *arg)
+{
+	struct mem_cgroup *memcg;
+
+	for_each_mem_cgroup(memcg)
+		callback(memcg, arg);
+}
+
 /**
  * mem_cgroup_css_from_page - css of the memcg associated with a page
  * @page: page of interest
-- 
2.18.1



* [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-24 17:42 [PATCH 0/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs Waiman Long
  2019-06-24 17:42 ` [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all() Waiman Long
@ 2019-06-24 17:42 ` Waiman Long
  2019-06-26 20:19   ` Roman Gushchin
  2019-06-27 15:15   ` Michal Hocko
  1 sibling, 2 replies; 19+ messages in thread
From: Waiman Long @ 2019-06-24 17:42 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Roman Gushchin, Shakeel Butt, Andrea Arcangeli, Waiman Long

With the slub memory allocator, the number of active slab objects
reported in /proc/slabinfo is not accurate because it includes objects
that are held in the per-cpu slab structures whether they are actually
in use or not.  The problem gets worse the more CPUs a system has. For
instance, looking at the reported number of active task_struct objects,
one will wonder where all the missing tasks have gone.

I know it is hard and costly to get a real count of active objects. So
I am not advocating for that. Instead, this patch extends the
/proc/sys/vm/drop_caches sysctl parameter with a new bit (bit 3) that
shrinks all the kmem slabs. This flushes out all the slabs held in the
per-cpu structures and gives a more accurate view of how much memory is
really used by active slab objects. It is a costly operation, of course,
but it provides a way to get a clearer picture of the actual number of
slab objects in use, should the need arise.

The upper range of the drop_caches sysctl parameter is increased to 15
to allow all possible combinations of the lowest 4 bits.

On a 2-socket 64-core 256-thread ARM64 system with a 64k page size,
after a parallel kernel build, the amount of memory occupied by slabs
before and after echoing to drop_caches was:

 # grep task_struct /proc/slabinfo
 task_struct        48376  48434   4288   61    4 : tunables    0    0
 0 : slabdata    794    794      0
 # grep "^S[lRU]" /proc/meminfo
 Slab:            3419072 kB
 SReclaimable:     354688 kB
 SUnreclaim:      3064384 kB
 # echo 3 > /proc/sys/vm/drop_caches
 # grep "^S[lRU]" /proc/meminfo
 Slab:            3351680 kB
 SReclaimable:     316096 kB
 SUnreclaim:      3035584 kB
 # echo 8 > /proc/sys/vm/drop_caches
 # grep "^S[lRU]" /proc/meminfo
 Slab:            1008192 kB
 SReclaimable:     126912 kB
 SUnreclaim:       881280 kB
 # grep task_struct /proc/slabinfo
 task_struct         2601   6588   4288   61    4 : tunables    0    0
 0 : slabdata    108    108      0

Shrinking the slabs saves more than 2GB of memory in this case. This
new feature certainly fulfills the promise of dropping caches.

Unlike counting objects in the per-node caches, as done by
/proc/slabinfo, which is rather lightweight, iterating over all the
per-cpu caches and shrinking them is much more heavyweight.

For this particular instance, the time taken to shrink all the root
caches was about 30.2ms. There were 73 memory cgroups, and the longest
time taken to shrink the largest one was about 16.4ms. The total
shrinking time was about 101ms.

Because shrinking all the caches can take a long time, the slab_mutex
is taken multiple times: once for all the root caches and once for each
memory cgroup. This reduces the slab_mutex hold time and minimizes the
impact on other running applications that may need to acquire the
mutex.

The slab shrinking feature is only available when CONFIG_MEMCG_KMEM is
defined, as the code needs to access slab_root_caches to iterate over
all the root caches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/sysctl/vm.txt | 11 ++++++++--
 fs/drop_caches.c            |  4 ++++
 include/linux/slab.h        |  1 +
 kernel/sysctl.c             |  4 ++--
 mm/slab_common.c            | 44 +++++++++++++++++++++++++++++++++++++
 5 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 749322060f10..b643ac8968d2 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -207,8 +207,8 @@ Setting this to zero disables periodic writeback altogether.
 drop_caches
 
 Writing to this will cause the kernel to drop clean caches, as well as
-reclaimable slab objects like dentries and inodes.  Once dropped, their
-memory becomes free.
+reclaimable slab objects like dentries and inodes.  It can also be used
+to shrink the slabs.  Once dropped, their memory becomes free.
 
 To free pagecache:
 	echo 1 > /proc/sys/vm/drop_caches
@@ -216,6 +216,8 @@ To free reclaimable slab objects (includes dentries and inodes):
 	echo 2 > /proc/sys/vm/drop_caches
 To free slab objects and pagecache:
 	echo 3 > /proc/sys/vm/drop_caches
+To shrink the slabs:
+	echo 8 > /proc/sys/vm/drop_caches
 
 This is a non-destructive operation and will not free any dirty objects.
 To increase the number of objects freed by this operation, the user may run
@@ -223,6 +225,11 @@ To increase the number of objects freed by this operation, the user may run
 number of dirty objects on the system and create more candidates to be
 dropped.
 
+Shrinking the slabs can reduce the memory footprint used by the slabs.
+It also makes the number of active objects reported in /proc/slabinfo
+more representative of the actual number of objects used for the slub
+memory allocator.
+
 This file is not a means to control the growth of the various kernel caches
 (inodes, dentries, pagecache, etc...)  These objects are automatically
 reclaimed by the kernel when memory is needed elsewhere on the system.
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d31b6c72b476..633b99e25dab 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -9,6 +9,7 @@
 #include <linux/writeback.h>
 #include <linux/sysctl.h>
 #include <linux/gfp.h>
+#include <linux/slab.h>
 #include "internal.h"
 
 /* A global variable is a bit ugly, but it keeps the code simple */
@@ -65,6 +66,9 @@ int drop_caches_sysctl_handler(struct ctl_table *table, int write,
 			drop_slab();
 			count_vm_event(DROP_SLAB);
 		}
+		if (sysctl_drop_caches & 8) {
+			kmem_cache_shrink_all();
+		}
 		if (!stfu) {
 			pr_info("%s (%d): drop_caches: %d\n",
 				current->comm, task_pid_nr(current),
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..f7c1626b2aa6 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 			void (*ctor)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+void kmem_cache_shrink_all(void);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1beca96fb625..feeb867dabd7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -129,7 +129,7 @@ static int __maybe_unused neg_one = -1;
 static int zero;
 static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
-static int __maybe_unused four = 4;
+static int __maybe_unused fifteen = 15;
 static unsigned long zero_ul;
 static unsigned long one_ul = 1;
 static unsigned long long_max = LONG_MAX;
@@ -1455,7 +1455,7 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= drop_caches_sysctl_handler,
 		.extra1		= &one,
-		.extra2		= &four,
+		.extra2		= &fifteen,
 	},
 #ifdef CONFIG_COMPACTION
 	{
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..b3c5b64f9bfb 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -956,6 +956,50 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+#ifdef CONFIG_MEMCG_KMEM
+static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
+				    void __maybe_unused *arg)
+{
+	struct kmem_cache *s;
+
+	if (memcg == root_mem_cgroup)
+		return;
+	mutex_lock(&slab_mutex);
+	list_for_each_entry(s, &memcg->kmem_caches,
+			    memcg_params.kmem_caches_node) {
+		kmem_cache_shrink(s);
+	}
+	mutex_unlock(&slab_mutex);
+	cond_resched();
+}
+
+/*
+ * Shrink all the kmem caches.
+ *
+ * If there are a large number of memory cgroups outstanding, it may take
+ * a while to shrink all of them. So we may need to release the lock, call
+ * cond_resched() and reacquire the lock from time to time.
+ */
+void kmem_cache_shrink_all(void)
+{
+	struct kmem_cache *s;
+
+	/* Shrink all the root caches */
+	mutex_lock(&slab_mutex);
+	list_for_each_entry(s, &slab_root_caches, root_caches_node)
+		kmem_cache_shrink(s);
+	mutex_unlock(&slab_mutex);
+	cond_resched();
+
+	/*
+	 * Flush each of the memcg individually
+	 */
+	memcg_iterate_all(kmem_cache_shrink_memcg, NULL);
+}
+#else
+void kmem_cache_shrink_all(void) { }
+#endif
+
 bool slab_is_available(void)
 {
 	return slab_state >= UP;
-- 
2.18.1



* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-24 17:42 ` [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs Waiman Long
@ 2019-06-26 20:19   ` Roman Gushchin
  2019-06-27 20:57     ` Waiman Long
  2019-06-27 15:15   ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2019-06-26 20:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Shakeel Butt, Andrea Arcangeli

On Mon, Jun 24, 2019 at 01:42:19PM -0400, Waiman Long wrote:
> With the slub memory allocator, the number of active slab objects
> reported in /proc/slabinfo is not accurate because it includes objects
> that are held in the per-cpu slab structures whether they are actually
> in use or not.  The problem gets worse the more CPUs a system has. For
> instance, looking at the reported number of active task_struct objects,
> one will wonder where all the missing tasks have gone.
> 
> I know it is hard and costly to get a real count of active objects. So
> I am not advocating for that. Instead, this patch extends the
> /proc/sys/vm/drop_caches sysctl parameter with a new bit (bit 3) that
> shrinks all the kmem slabs. This flushes out all the slabs held in the
> per-cpu structures and gives a more accurate view of how much memory is
> really used by active slab objects. It is a costly operation, of course,
> but it provides a way to get a clearer picture of the actual number of
> slab objects in use, should the need arise.
> 
> The upper range of the drop_caches sysctl parameter is increased to 15
> to allow all possible combinations of the lowest 4 bits.
> 
> On a 2-socket 64-core 256-thread ARM64 system with a 64k page size,
> after a parallel kernel build, the amount of memory occupied by slabs
> before and after echoing to drop_caches was:
> 
>  # grep task_struct /proc/slabinfo
>  task_struct        48376  48434   4288   61    4 : tunables    0    0
>  0 : slabdata    794    794      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            3419072 kB
>  SReclaimable:     354688 kB
>  SUnreclaim:      3064384 kB
>  # echo 3 > /proc/sys/vm/drop_caches
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            3351680 kB
>  SReclaimable:     316096 kB
>  SUnreclaim:      3035584 kB
>  # echo 8 > /proc/sys/vm/drop_caches
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            1008192 kB
>  SReclaimable:     126912 kB
>  SUnreclaim:       881280 kB
>  # grep task_struct /proc/slabinfo
>  task_struct         2601   6588   4288   61    4 : tunables    0    0
>  0 : slabdata    108    108      0
> 
> Shrinking the slabs saves more than 2GB of memory in this case. This
> new feature certainly fulfills the promise of dropping caches.
> 
> Unlike counting objects in the per-node caches, as done by
> /proc/slabinfo, which is rather lightweight, iterating over all the
> per-cpu caches and shrinking them is much more heavyweight.
> 
> For this particular instance, the time taken to shrink all the root
> caches was about 30.2ms. There were 73 memory cgroups, and the longest
> time taken to shrink the largest one was about 16.4ms. The total
> shrinking time was about 101ms.
> 
> Because shrinking all the caches can take a long time, the slab_mutex
> is taken multiple times: once for all the root caches and once for each
> memory cgroup. This reduces the slab_mutex hold time and minimizes the
> impact on other running applications that may need to acquire the
> mutex.
> 
> The slab shrinking feature is only available when CONFIG_MEMCG_KMEM is
> defined, as the code needs to access slab_root_caches to iterate over
> all the root caches.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  Documentation/sysctl/vm.txt | 11 ++++++++--
>  fs/drop_caches.c            |  4 ++++
>  include/linux/slab.h        |  1 +
>  kernel/sysctl.c             |  4 ++--
>  mm/slab_common.c            | 44 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 60 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 749322060f10..b643ac8968d2 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -207,8 +207,8 @@ Setting this to zero disables periodic writeback altogether.
>  drop_caches
>  
>  Writing to this will cause the kernel to drop clean caches, as well as
> -reclaimable slab objects like dentries and inodes.  Once dropped, their
> -memory becomes free.
> +reclaimable slab objects like dentries and inodes.  It can also be used
> +to shrink the slabs.  Once dropped, their memory becomes free.
>  
>  To free pagecache:
>  	echo 1 > /proc/sys/vm/drop_caches
> @@ -216,6 +216,8 @@ To free reclaimable slab objects (includes dentries and inodes):
>  	echo 2 > /proc/sys/vm/drop_caches
>  To free slab objects and pagecache:
>  	echo 3 > /proc/sys/vm/drop_caches
> +To shrink the slabs:
> +	echo 8 > /proc/sys/vm/drop_caches
>  
>  This is a non-destructive operation and will not free any dirty objects.
>  To increase the number of objects freed by this operation, the user may run
> @@ -223,6 +225,11 @@ To increase the number of objects freed by this operation, the user may run
>  number of dirty objects on the system and create more candidates to be
>  dropped.
>  
> +Shrinking the slabs can reduce the memory footprint used by the slabs.
> +It also makes the number of active objects reported in /proc/slabinfo
> +more representative of the actual number of objects used for the slub
> +memory allocator.
> +
>  This file is not a means to control the growth of the various kernel caches
>  (inodes, dentries, pagecache, etc...)  These objects are automatically
>  reclaimed by the kernel when memory is needed elsewhere on the system.
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index d31b6c72b476..633b99e25dab 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -9,6 +9,7 @@
>  #include <linux/writeback.h>
>  #include <linux/sysctl.h>
>  #include <linux/gfp.h>
> +#include <linux/slab.h>
>  #include "internal.h"
>  
>  /* A global variable is a bit ugly, but it keeps the code simple */
> @@ -65,6 +66,9 @@ int drop_caches_sysctl_handler(struct ctl_table *table, int write,
>  			drop_slab();
>  			count_vm_event(DROP_SLAB);
>  		}
> +		if (sysctl_drop_caches & 8) {
> +			kmem_cache_shrink_all();
> +		}
>  		if (!stfu) {
>  			pr_info("%s (%d): drop_caches: %d\n",
>  				current->comm, task_pid_nr(current),
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 9449b19c5f10..f7c1626b2aa6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
>  			void (*ctor)(void *));
>  void kmem_cache_destroy(struct kmem_cache *);
>  int kmem_cache_shrink(struct kmem_cache *);
> +void kmem_cache_shrink_all(void);
>  
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 1beca96fb625..feeb867dabd7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -129,7 +129,7 @@ static int __maybe_unused neg_one = -1;
>  static int zero;
>  static int __maybe_unused one = 1;
>  static int __maybe_unused two = 2;
> -static int __maybe_unused four = 4;
> +static int __maybe_unused fifteen = 15;
>  static unsigned long zero_ul;
>  static unsigned long one_ul = 1;
>  static unsigned long long_max = LONG_MAX;
> @@ -1455,7 +1455,7 @@ static struct ctl_table vm_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= drop_caches_sysctl_handler,
>  		.extra1		= &one,
> -		.extra2		= &four,
> +		.extra2		= &fifteen,
>  	},
>  #ifdef CONFIG_COMPACTION
>  	{
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 58251ba63e4a..b3c5b64f9bfb 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -956,6 +956,50 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
>  }
>  EXPORT_SYMBOL(kmem_cache_shrink);

Hi Waiman!

>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
> +				    void __maybe_unused *arg)
> +{
> +	struct kmem_cache *s;
> +
> +	if (memcg == root_mem_cgroup)
> +		return;
> +	mutex_lock(&slab_mutex);
> +	list_for_each_entry(s, &memcg->kmem_caches,
> +			    memcg_params.kmem_caches_node) {
> +		kmem_cache_shrink(s);
> +	}
> +	mutex_unlock(&slab_mutex);
> +	cond_resched();
> +}

A couple of questions:
1) how about skipping already offlined kmem_caches? They are already shrunk,
   so you probably won't get much out of them. Or isn't it true?
2) what's your long-term vision here? do you think that we need to shrink
   kmem_caches periodically, depending on memory pressure? how will a user
   use this new sysctl?

What's the problem you're trying to solve in general?

Thanks!

Roman


* Re: [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all()
  2019-06-24 17:42 ` [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all() Waiman Long
@ 2019-06-27 15:07   ` Michal Hocko
  2019-06-27 21:03     ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2019-06-27 15:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli

On Mon 24-06-19 13:42:18, Waiman Long wrote:
> Add a memcg_iterate_all() function that iterates over all the
> available memory cgroups and calls the given callback function for
> each of them.

Why is a trivial wrapper any better than open coded usage of the
iterator?

> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/memcontrol.h |  3 +++
>  mm/memcontrol.c            | 13 +++++++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1dcb763bb610..0e31418e5a47 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1268,6 +1268,9 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
>  struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
>  void memcg_kmem_put_cache(struct kmem_cache *cachep);
>  
> +extern void memcg_iterate_all(void (*callback)(struct mem_cgroup *memcg,
> +					       void *arg), void *arg);
> +
>  #ifdef CONFIG_MEMCG_KMEM
>  int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
>  void __memcg_kmem_uncharge(struct page *page, int order);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ba9138a4a1de..c1c4706f7696 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -443,6 +443,19 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>  static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { }
>  #endif /* CONFIG_MEMCG_KMEM */
>  
> +/*
> + * Iterate all the memory cgroups and call the given callback function
> + * for each of the memory cgroups.
> + */
> +void memcg_iterate_all(void (*callback)(struct mem_cgroup *memcg, void *arg),
> +		       void *arg)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	for_each_mem_cgroup(memcg)
> +		callback(memcg, arg);
> +}
> +
>  /**
>   * mem_cgroup_css_from_page - css of the memcg associated with a page
>   * @page: page of interest
> -- 
> 2.18.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-24 17:42 ` [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs Waiman Long
  2019-06-26 20:19   ` Roman Gushchin
@ 2019-06-27 15:15   ` Michal Hocko
  2019-06-27 21:16     ` Waiman Long
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2019-06-27 15:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli

On Mon 24-06-19 13:42:19, Waiman Long wrote:
> With the slub memory allocator, the number of active slab objects
> reported in /proc/slabinfo is not accurate because it includes objects
> that are held in the per-cpu slab structures whether they are actually
> in use or not.  The problem gets worse the more CPUs a system has. For
> instance, looking at the reported number of active task_struct objects,
> one will wonder where all the missing tasks have gone.
> 
> I know it is hard and costly to get a real count of active objects.

What exactly is expensive? Why cannot slabinfo reduce the number of
active objects by per-cpu cached objects?

> So
> I am not advocating for that. Instead, this patch extends the
> /proc/sys/vm/drop_caches sysctl parameter with a new bit (bit 3) that
> shrinks all the kmem slabs. This flushes out all the slabs held in the
> per-cpu structures and gives a more accurate view of how much memory is
> really used by active slab objects. It is a costly operation, of course,
> but it provides a way to get a clearer picture of the actual number of
> slab objects in use, should the need arise.

drop_caches is a terrible interface. It destroys all the caching, and
people are far too quick to use it to solve any kind of problem they
think they might have, causing other problems they might not see
immediately. I am strongly discouraging anybody - except for some tests
which really do want to see reproducible results without cache effects -
from using this interface, and therefore I am not really happy to paper
over something that might be a real problem with yet another mode. If
SLUB indeed caches too aggressively on large machines then this should
be fixed.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-26 20:19   ` Roman Gushchin
@ 2019-06-27 20:57     ` Waiman Long
  2019-06-27 21:24       ` Roman Gushchin
  2019-06-27 21:25       ` Luis Chamberlain
  0 siblings, 2 replies; 19+ messages in thread
From: Waiman Long @ 2019-06-27 20:57 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Shakeel Butt, Andrea Arcangeli

On 6/26/19 4:19 PM, Roman Gushchin wrote:
>>  
>> +#ifdef CONFIG_MEMCG_KMEM
>> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
>> +				    void __maybe_unused *arg)
>> +{
>> +	struct kmem_cache *s;
>> +
>> +	if (memcg == root_mem_cgroup)
>> +		return;
>> +	mutex_lock(&slab_mutex);
>> +	list_for_each_entry(s, &memcg->kmem_caches,
>> +			    memcg_params.kmem_caches_node) {
>> +		kmem_cache_shrink(s);
>> +	}
>> +	mutex_unlock(&slab_mutex);
>> +	cond_resched();
>> +}
> A couple of questions:
> 1) how about skipping already offlined kmem_caches? They are already shrunk,
>    so you probably won't get much out of them. Or isn't it true?

I have been thinking about that. This patch is based on the linux tree,
so there is no easy way to find out whether the kmem caches have
already been shrunk. Rebasing this on top of linux-next, I can use the
SLAB_DEACTIVATED flag as a marker for skipping the shrink.

With all the latest patches, I am still seeing 121 out of a total of 726
memcg kmem caches (about 1/6) that are deactivated caches after system
bootup on one of the test systems. My system is still using cgroup v1,
so the number may be different in a v2 setup. The next step is probably
to figure out why those deactivated caches are still there.

> 2) what's your long-term vision here? do you think that we need to shrink
>    kmem_caches periodically, depending on memory pressure? how will a user
>    use this new sysctl?
Shrinking the kmem caches under extreme memory pressure can be one way
to free up extra pages, but the effect will probably be temporary.
> What's the problem you're trying to solve in general?

At least for the slub allocator, shrinking the caches allows the number
of active objects reported in slabinfo to be more accurate. In
addition, it lets us see the real slab memory consumption. I have been
working on a BZ about continuous memory leaks with container-based
workloads. The ability to shrink caches allows us to get a more
accurate memory consumption picture. Another alternative is to turn on
slub_debug, which disables all the per-cpu slabs.

Anyway, I think this can be useful to others; that is why I posted the
patch.

Cheers,
Longman



* Re: [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all()
  2019-06-27 15:07   ` Michal Hocko
@ 2019-06-27 21:03     ` Waiman Long
  2019-06-28  7:10       ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2019-06-27 21:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli

On 6/27/19 11:07 AM, Michal Hocko wrote:
> On Mon 24-06-19 13:42:18, Waiman Long wrote:
>> Add a memcg_iterate_all() function for iterating all the available
>> memory cgroups and call the given callback function for each of the
>> memory cgruops.
> Why is a trivial wrapper any better than open coded usage of the
> iterator?

Because the iterator is only defined within memcontrol.c. So an
alternative may be to put the iterator into a header file that can be
used by others. Will take a look at that.

Cheers,
Longman



* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-27 15:15   ` Michal Hocko
@ 2019-06-27 21:16     ` Waiman Long
  2019-06-28  7:31       ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2019-06-27 21:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli

On 6/27/19 11:15 AM, Michal Hocko wrote:
> On Mon 24-06-19 13:42:19, Waiman Long wrote:
>> With the slub memory allocator, the numbers of active slab objects
>> reported in /proc/slabinfo are not real because they include objects
>> that are held by the per-cpu slab structures whether they are actually
>> used or not.  The problem gets worse the more CPUs a system has. For
>> instance, looking at the reported number of active task_struct objects,
>> one will wonder where all the missing tasks have gone.
>>
>> I know it is hard and costly to get a real count of active objects.
> What exactly is expensive? Why cannot slabinfo reduce the number of
> active objects by per-cpu cached objects?
>
The number of cachelines that need to be accessed in order to get an
accurate count will be much higher if we need to iterate through all the
per-cpu structures. In addition, accessing the per-cpu partial list will
be racy.


>> So
>> I am not advocating for that. Instead, this patch extends the
>> /proc/sys/vm/drop_caches sysctl parameter by using a new bit (bit 3)
>> to shrink all the kmem slabs which will flush out all the slabs in the
>> per-cpu structures and give a more accurate view of how much memory is
>> really used up by the active slab objects. This is a costly operation,
>> of course, but it gives a way to have a clearer picture of the actual
>> number of slab objects used, if the need arises.
> drop_caches is a terrible interface. It destroys all the caching and
> people are just too easy in using it to solve any kind of problem they
> think they might have and cause others they might not see immediately.
> I am strongly discouraging anybody - except for some tests which really
> do want to see reproducible results without cache effects - from using
> this interface and therefore I am not really happy to paper over
> something that might be a real problem with yet another mode. If SLUB
> indeed caches too aggressively on large machines then this should be
> fixed.
>
OK, as explained in another thread, the main reason for doing this patch
is to be able to make more accurate measurements of changes in kmem cache
memory consumption. Yes, I do agree that drop_caches is not a
general-purpose interface that should be used lightly.

Cheers,
Longman



* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-27 20:57     ` Waiman Long
@ 2019-06-27 21:24       ` Roman Gushchin
  2019-06-27 21:31         ` Waiman Long
  2019-06-28 15:32         ` Christopher Lameter
  2019-06-27 21:25       ` Luis Chamberlain
  1 sibling, 2 replies; 19+ messages in thread
From: Roman Gushchin @ 2019-06-27 21:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Shakeel Butt, Andrea Arcangeli

On Thu, Jun 27, 2019 at 04:57:50PM -0400, Waiman Long wrote:
> On 6/26/19 4:19 PM, Roman Gushchin wrote:
> >>  
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
> >> +				    void __maybe_unused *arg)
> >> +{
> >> +	struct kmem_cache *s;
> >> +
> >> +	if (memcg == root_mem_cgroup)
> >> +		return;
> >> +	mutex_lock(&slab_mutex);
> >> +	list_for_each_entry(s, &memcg->kmem_caches,
> >> +			    memcg_params.kmem_caches_node) {
> >> +		kmem_cache_shrink(s);
> >> +	}
> >> +	mutex_unlock(&slab_mutex);
> >> +	cond_resched();
> >> +}
> > A couple of questions:
> > 1) how about skipping already offlined kmem_caches? They are already shrunk,
> >    so you probably won't get much out of them. Or isn't it true?
> 
> I have been thinking about that. This patch is based on the linux tree
> and so doesn't have an easy way to find out if the kmem caches have been
> shrunk. Rebasing this on top of linux-next, I can use the
> SLAB_DEACTIVATED flag as a marker for skipping the shrink.
> 
> With all the latest patches, I am still seeing 121 out of a total of 726
> memcg kmem caches (1/6) that are deactivated caches after system bootup
> on one of the test systems. My system is still using cgroup v1 and so the
> number may be different in a v2 setup. The next step is probably to
> figure out why those deactivated caches are still there.

It's not a secret: these kmem_caches are holding objects which are in use.
It's a drawback of the current slab accounting implementation: even a
single object pins a whole page and the corresponding kmem_cache. It's
optimized for a large number of objects which are created and destroyed
within the life of the cgroup (e.g. task_structs), and it works worse for
long-living objects like the vfs cache.

Long-term I think we need a different implementation for long-living objects,
so that objects belonging to different memory cgroups can share the same page
and kmem_caches.

It's a fairly big change though.

> 
> > 2) what's your long-term vision here? do you think that we need to shrink
> >    kmem_caches periodically, depending on memory pressure? how a user
> >    will use this new sysctl?
> Shrinking the kmem caches under extreme memory pressure can be one way
> to free up extra pages, but the effect will probably be temporary.
> > What's the problem you're trying to solve in general?
> 
> At least for the slub allocator, shrinking the caches allows the number
> of active objects reported in slabinfo to be more accurate. In addition,
> this allows us to know the real slab memory consumption. I have been working
> on a BZ about continuous memory leaks with container-based workloads.
> The ability to shrink caches allows us to get a more accurate memory
> consumption picture. Another alternative is to turn on slub_debug, which
> then disables all the per-cpu slabs.

I see... I agree with Michal here, that extending drop_caches sysctl isn't
the best idea. Isn't it possible to achieve the same effect using slub sysfs?

Thanks!


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-27 20:57     ` Waiman Long
  2019-06-27 21:24       ` Roman Gushchin
@ 2019-06-27 21:25       ` Luis Chamberlain
  1 sibling, 0 replies; 19+ messages in thread
From: Luis Chamberlain @ 2019-06-27 21:25 UTC (permalink / raw)
  To: Waiman Long, Masami Hiramatsu, Masoud Asgharifard Sharbiani
  Cc: Roman Gushchin, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Alexander Viro, Jonathan Corbet,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Shakeel Butt, Andrea Arcangeli

On Thu, Jun 27, 2019 at 04:57:50PM -0400, Waiman Long wrote:
> On 6/26/19 4:19 PM, Roman Gushchin wrote:
> >>  
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
> >> +				    void __maybe_unused *arg)
> >> +{
> >> +	struct kmem_cache *s;
> >> +
> >> +	if (memcg == root_mem_cgroup)
> >> +		return;
> >> +	mutex_lock(&slab_mutex);
> >> +	list_for_each_entry(s, &memcg->kmem_caches,
> >> +			    memcg_params.kmem_caches_node) {
> >> +		kmem_cache_shrink(s);
> >> +	}
> >> +	mutex_unlock(&slab_mutex);
> >> +	cond_resched();
> >> +}
> > A couple of questions:
> > 1) how about skipping already offlined kmem_caches? They are already shrunk,
> >    so you probably won't get much out of them. Or isn't it true?
> 
> I have been thinking about that. This patch is based on the linux tree
> and so doesn't have an easy way to find out if the kmem caches have been
> shrunk. Rebasing this on top of linux-next, I can use the
> SLAB_DEACTIVATED flag as a marker for skipping the shrink.
> 
> With all the latest patches, I am still seeing 121 out of a total of 726
> memcg kmem caches (1/6) that are deactivated caches after system bootup
> on one of the test systems. My system is still using cgroup v1 and so the
> number may be different in a v2 setup. The next step is probably to
> figure out why those deactivated caches are still there.
> 
> > 2) what's your long-term vision here? do you think that we need to shrink
> >    kmem_caches periodically, depending on memory pressure? how a user
> >    will use this new sysctl?
> Shrinking the kmem caches under extreme memory pressure can be one way
> to free up extra pages, but the effect will probably be temporary.
> > What's the problem you're trying to solve in general?
> 
> At least for the slub allocator, shrinking the caches allows the number
> of active objects reported in slabinfo to be more accurate. In addition,
> this allows us to know the real slab memory consumption. I have been working
> on a BZ about continuous memory leaks with container-based workloads.

So... this is still a workaround?

> The ability to shrink caches allows us to get a more accurate memory
> consumption picture. Another alternative is to turn on slub_debug, which
> then disables all the per-cpu slabs.

So this is a debugging mechanism?

> Anyway, I think this can be useful to others, which is why I posted the patch.

Since this is debug stuff, please add this to /proc/sys/debug/ instead.
That would reflect the intention, and would avoid the concern that folks
in production would use these things.

Since we only have 2 users of /proc/sys/debug/, I am now wondering if it
would be best to add a new sysctl debug taint flag. That way bug reports
involving these stupid knobs can go to the /dev/null inbox.

Masami, /proc/sys/debug/kprobes-optimization is debug. Would you be OK
to add the taint for it too?

Masoud, /proc/sys/debug/exception-trace seems to actually be enabled
by default, and its goal seems to be to allow disabling it. So I
don't think it would make sense to taint there.

So... maybe we need something like /proc/sys/taints/ or
/proc/sys/debug/taints/ so it's *very* clear this is in no way ever
expected to be used in production.

Long term, it may even be good to add a symlink for vm/drop_caches
there as well?

  Luis


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-27 21:24       ` Roman Gushchin
@ 2019-06-27 21:31         ` Waiman Long
  2019-06-28 15:32         ` Christopher Lameter
  1 sibling, 0 replies; 19+ messages in thread
From: Waiman Long @ 2019-06-27 21:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Shakeel Butt, Andrea Arcangeli

On 6/27/19 5:24 PM, Roman Gushchin wrote:
>>> 2) what's your long-term vision here? do you think that we need to shrink
>>>    kmem_caches periodically, depending on memory pressure? how a user
>>>    will use this new sysctl?
>> Shrinking the kmem caches under extreme memory pressure can be one way
>> to free up extra pages, but the effect will probably be temporary.
>>> What's the problem you're trying to solve in general?
>> At least for the slub allocator, shrinking the caches allows the number
>> of active objects reported in slabinfo to be more accurate. In addition,
>> this allows us to know the real slab memory consumption. I have been working
>> on a BZ about continuous memory leaks with container-based workloads.
>> The ability to shrink caches allows us to get a more accurate memory
>> consumption picture. Another alternative is to turn on slub_debug, which
>> then disables all the per-cpu slabs.
> I see... I agree with Michal here, that extending drop_caches sysctl isn't
> the best idea. Isn't it possible to achieve the same effect using slub sysfs?

Yes, using the slub sysfs interface is a possible alternative.

Cheers,
Longman



* Re: [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all()
  2019-06-27 21:03     ` Waiman Long
@ 2019-06-28  7:10       ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2019-06-28  7:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli

On Thu 27-06-19 17:03:06, Waiman Long wrote:
> On 6/27/19 11:07 AM, Michal Hocko wrote:
> > On Mon 24-06-19 13:42:18, Waiman Long wrote:
> >> Add a memcg_iterate_all() function for iterating over all the available
> >> memory cgroups and calling the given callback function for each of the
> >> memory cgroups.
> > Why is a trivial wrapper any better than open coded usage of the
> > iterator?
> 
> Because the iterator is only defined within memcontrol.c. So an
> alternative may be to put the iterator into a header file that can be
> used by others. Will take a look at that.

That would be preferred.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-27 21:16     ` Waiman Long
@ 2019-06-28  7:31       ` Michal Hocko
  2019-07-02 18:41         ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2019-06-28  7:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli

On Thu 27-06-19 17:16:04, Waiman Long wrote:
> On 6/27/19 11:15 AM, Michal Hocko wrote:
> > On Mon 24-06-19 13:42:19, Waiman Long wrote:
> >> With the slub memory allocator, the numbers of active slab objects
> >> reported in /proc/slabinfo are not real because they include objects
> >> that are held by the per-cpu slab structures whether they are actually
> >> used or not.  The problem gets worse the more CPUs a system has. For
> >> instance, looking at the reported number of active task_struct objects,
> >> one will wonder where all the missing tasks have gone.
> >>
> >> I know it is hard and costly to get a real count of active objects.
> > What exactly is expensive? Why cannot slabinfo reduce the number of
> > active objects by per-cpu cached objects?
> >
> The number of cachelines that need to be accessed in order to get an
> accurate count will be much higher if we need to iterate through all the
> per-cpu structures. In addition, accessing the per-cpu partial list will
> be racy.

Why is all that a problem for a root-only interface that should be used
quite rarely (it is not something that you should be reading hundreds of
times per second, right)?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-27 21:24       ` Roman Gushchin
  2019-06-27 21:31         ` Waiman Long
@ 2019-06-28 15:32         ` Christopher Lameter
  2019-06-28 16:33           ` Roman Gushchin
  2019-06-28 17:16           ` Yang Shi
  1 sibling, 2 replies; 19+ messages in thread
From: Christopher Lameter @ 2019-06-28 15:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Waiman Long, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Shakeel Butt, Andrea Arcangeli

On Thu, 27 Jun 2019, Roman Gushchin wrote:

> so that objects belonging to different memory cgroups can share the same page
> and kmem_caches.
>
> It's a fairly big change though.

Could this be done at another level? Put a cgroup pointer into the
corresponding structures and then go back to just a single kmem_cache for
the system as a whole? You can still account them per cgroup and there
will be no cleanup problem anymore. You could scan through a slab cache
to remove the objects of a certain cgroup, and then the fragmentation
problem that cgroups create here will be handled by the slab allocators in
the traditional way. The duplication of the kmem_cache was not designed
into the allocators but bolted on later.



* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-28 15:32         ` Christopher Lameter
@ 2019-06-28 16:33           ` Roman Gushchin
  2019-06-28 17:16           ` Yang Shi
  1 sibling, 0 replies; 19+ messages in thread
From: Roman Gushchin @ 2019-06-28 16:33 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Waiman Long, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Shakeel Butt, Andrea Arcangeli

On Fri, Jun 28, 2019 at 03:32:28PM +0000, Christopher Lameter wrote:
> On Thu, 27 Jun 2019, Roman Gushchin wrote:
> 
> > so that objects belonging to different memory cgroups can share the same page
> > and kmem_caches.
> >
> > It's a fairly big change though.
> 
> Could this be done at another level? Put a cgroup pointer into the
> corresponding structures and then go back to just a single kmem_cache for
> the system as a whole?
> You can still account them per cgroup and there
> will be no cleanup problem anymore. You could scan through a slab cache
> to remove the objects of a certain cgroup and then the fragmentation
> problem that cgroups create here will be handled by the slab allocators in
> the traditional way. The duplication of the kmem_cache was not designed
> into the allocators but bolted on later.
> 

Yeah, this is exactly what I'm talking about. Idk how big the performance
penalty will be for small and short-living objects; it should be measured.
But for long-living objects it will be much better for sure...

Thanks!


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-28 15:32         ` Christopher Lameter
  2019-06-28 16:33           ` Roman Gushchin
@ 2019-06-28 17:16           ` Yang Shi
  2019-06-28 17:30             ` Roman Gushchin
  1 sibling, 1 reply; 19+ messages in thread
From: Yang Shi @ 2019-06-28 17:16 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Roman Gushchin, Waiman Long, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, linux-mm, linux-doc, linux-fsdevel, cgroups,
	linux-kernel, Shakeel Butt, Andrea Arcangeli

On Fri, Jun 28, 2019 at 8:32 AM Christopher Lameter <cl@linux.com> wrote:
>
> On Thu, 27 Jun 2019, Roman Gushchin wrote:
>
> > so that objects belonging to different memory cgroups can share the same page
> > and kmem_caches.
> >
> > It's a fairly big change though.
>
> Could this be done at another level? Put a cgroup pointer into the
> corresponding structures and then go back to just a single kmem_cache for
> the system as a whole? You can still account them per cgroup and there
> will be no cleanup problem anymore. You could scan through a slab cache
> to remove the objects of a certain cgroup and then the fragmentation
> problem that cgroups create here will be handled by the slab allocators in
> the traditional way. The duplication of the kmem_cache was not designed
> into the allocators but bolted on later.

I'm afraid this may bring in another problem for memcg page reclaim.
When shrinking the slabs, the shrinker may end up scanning a very long
list to find the slabs for a specific memcg. Particularly for the
count operation, it may have to scan the list from the beginning all
the way down to the end, which may take unbounded time.

When I worked on the THP deferred split shrinker problem, I did
something like this, and it turned out it could take milliseconds to
count the objects on the list when only a few of them needed to be
reclaimed.

>


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-28 17:16           ` Yang Shi
@ 2019-06-28 17:30             ` Roman Gushchin
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Gushchin @ 2019-06-28 17:30 UTC (permalink / raw)
  To: Yang Shi
  Cc: Christopher Lameter, Waiman Long, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, linux-mm, linux-doc, linux-fsdevel, cgroups,
	linux-kernel, Shakeel Butt, Andrea Arcangeli

On Fri, Jun 28, 2019 at 10:16:13AM -0700, Yang Shi wrote:
> On Fri, Jun 28, 2019 at 8:32 AM Christopher Lameter <cl@linux.com> wrote:
> >
> > On Thu, 27 Jun 2019, Roman Gushchin wrote:
> >
> > > so that objects belonging to different memory cgroups can share the same page
> > > and kmem_caches.
> > >
> > > It's a fairly big change though.
> >
> > Could this be done at another level? Put a cgroup pointer into the
> > corresponding structures and then go back to just a single kmem_cache for
> > the system as a whole? You can still account them per cgroup and there
> > will be no cleanup problem anymore. You could scan through a slab cache
> > to remove the objects of a certain cgroup and then the fragmentation
> > problem that cgroups create here will be handled by the slab allocators in
> > the traditional way. The duplication of the kmem_cache was not designed
> > into the allocators but bolted on later.
> 
> I'm afraid this may bring in another problem for memcg page reclaim.
> When shrinking the slabs, the shrinker may end up scanning a very long
> list to find the slabs for a specific memcg. Particularly for the
> count operation, it may have to scan the list from the beginning all
> the way down to the end, which may take unbounded time.
> 
> When I worked on the THP deferred split shrinker problem, I did
> something like this, and it turned out it could take milliseconds to
> count the objects on the list when only a few of them needed to be
> reclaimed.

I don't think the shrinker mechanism should be altered. Shrinker lists
already contain individual objects, and I don't see any reasons, why
these objects can't reside on a shared set of pages.

What we're discussing is that it's way too costly (under some conditions)
to have many sets of kmem_caches if each of them contains only a
few objects.

Thanks!


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
  2019-06-28  7:31       ` Michal Hocko
@ 2019-07-02 18:41         ` Waiman Long
  0 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2019-07-02 18:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli

On 6/28/19 3:31 AM, Michal Hocko wrote:
> On Thu 27-06-19 17:16:04, Waiman Long wrote:
>> On 6/27/19 11:15 AM, Michal Hocko wrote:
>>> On Mon 24-06-19 13:42:19, Waiman Long wrote:
>>>> With the slub memory allocator, the numbers of active slab objects
>>>> reported in /proc/slabinfo are not real because they include objects
>>>> that are held by the per-cpu slab structures whether they are actually
>>>> used or not.  The problem gets worse the more CPUs a system has. For
>>>> instance, looking at the reported number of active task_struct objects,
>>>> one will wonder where all the missing tasks have gone.
>>>>
>>>> I know it is hard and costly to get a real count of active objects.
>>> What exactly is expensive? Why cannot slabinfo reduce the number of
>>> active objects by per-cpu cached objects?
>>>
>> The number of cachelines that need to be accessed in order to get an
>> accurate count will be much higher if we need to iterate through all the
>> per-cpu structures. In addition, accessing the per-cpu partial list will
>> be racy.
> Why is all that a problem for a root-only interface that should be used
> quite rarely (it is not something that you should be reading hundreds of
> times per second, right)?

That can be true. Anyway, I have posted a new patch to use the existing
<slab>/shrink sysfs file to perform memcg cache shrinking as well. So I
am not going to pursue this patch.

Thanks,
Longman



end of thread, other threads:[~2019-07-02 18:41 UTC | newest]

Thread overview: 19+ messages
2019-06-24 17:42 [PATCH 0/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs Waiman Long
2019-06-24 17:42 ` [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all() Waiman Long
2019-06-27 15:07   ` Michal Hocko
2019-06-27 21:03     ` Waiman Long
2019-06-28  7:10       ` Michal Hocko
2019-06-24 17:42 ` [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs Waiman Long
2019-06-26 20:19   ` Roman Gushchin
2019-06-27 20:57     ` Waiman Long
2019-06-27 21:24       ` Roman Gushchin
2019-06-27 21:31         ` Waiman Long
2019-06-28 15:32         ` Christopher Lameter
2019-06-28 16:33           ` Roman Gushchin
2019-06-28 17:16           ` Yang Shi
2019-06-28 17:30             ` Roman Gushchin
2019-06-27 21:25       ` Luis Chamberlain
2019-06-27 15:15   ` Michal Hocko
2019-06-27 21:16     ` Waiman Long
2019-06-28  7:31       ` Michal Hocko
2019-07-02 18:41         ` Waiman Long
