LKML Archive on lore.kernel.org
* [PATCH v2 0/2] mm: memcg/slab: Fix objcg pointer array handling problem
@ 2021-05-04 13:23 Waiman Long
  2021-05-04 13:23 ` [PATCH v2 1/2] mm: memcg/slab: Properly set up gfp flags for objcg pointer array Waiman Long
  2021-05-04 13:23 ` [PATCH v2 2/2] mm: memcg/slab: Create a new set of kmalloc-cg-<n> caches Waiman Long
  0 siblings, 2 replies; 9+ messages in thread
From: Waiman Long @ 2021-05-04 13:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, Shakeel Butt
  Cc: linux-kernel, cgroups, linux-mm, Waiman Long

 v2:
  - Take suggestion from Vlastimil to use a new set of kmalloc-cg-* to
    handle the objcg pointer array allocation and freeing problems.

Since the merging of the new slab memory controller in v5.9,
the page structure stores a pointer to the objcg pointer array for
slab pages. When the slab has no used objects, it can be freed in
free_slab(), which will call kfree() to free the objcg pointer array
allocated in memcg_alloc_page_obj_cgroups(). If it happens that the
objcg pointer array is the last used object in its slab, that slab may
then be freed, which may cause kfree() to be called again.

With the right workload, the slab cache may be set up in a way that
allows the recursive kfree() calling loop to nest deep enough to
cause a kernel stack overflow and panic the system. In fact, we have
a reproducer that can cause a kernel stack overflow on an s390 system
involving kmalloc-rcl-256 and kmalloc-rcl-128 slabs with the following
kfree() loop recursively called 74 times:

  [  285.520739]  [<000000000ec432fc>] kfree+0x4bc/0x560
  [  285.520740]  [<000000000ec43466>] __free_slab+0xc6/0x228
  [  285.520741]  [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0
  [  285.520742]  [<000000000ec432fc>] kfree+0x4bc/0x560
					:
While investigating this issue, I also found an issue on the allocation
side. If the objcg pointer array happens to come from the same slab, or
a circular dependency is formed across multiple slabs, the affected
slabs can never be freed again.

This patch series addresses these two issues by introducing a new
set of kmalloc-cg-<n> caches split from the kmalloc-<n> caches. The new
set will only contain non-reclaimable and non-DMA objects that are
accounted in memory cgroups, whereas the old set is now for unaccounted
objects only. With this split, all the objcg pointer arrays will
come from the kmalloc-<n> caches, and those caches will never have
objcg pointer arrays of their own. As a result, both the deeply nested
kfree() calls and the unfreeable slab problem are now gone.
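
To make the freeing recursion concrete, here is a minimal user-space sketch
in C. It is a toy model, not kernel code: demo_slab, demo_free_object() and
demo_chain_depth() are hypothetical names, and every slab is assumed to hold
exactly one live object, which for all but the first slab is the objcg
pointer array of the previous slab.

```c
#include <stddef.h>

/* Toy stand-in for a slab page; not the kernel's data structure. */
struct demo_slab {
	int inuse;                   /* live objects left in this slab */
	struct demo_slab *vec_home;  /* slab holding this slab's objcg array, or NULL */
};

static int demo_max_depth;

static void demo_free_object(struct demo_slab *slab, int depth);

/* Freeing a slab frees its objcg pointer array, which is an object elsewhere. */
static void demo_free_slab(struct demo_slab *slab, int depth)
{
	if (slab->vec_home)
		demo_free_object(slab->vec_home, depth + 1);
}

/* Models kfree(): drop one object; a slab that becomes empty is freed in turn. */
static void demo_free_object(struct demo_slab *slab, int depth)
{
	if (depth > demo_max_depth)
		demo_max_depth = depth;
	if (--slab->inuse == 0)
		demo_free_slab(slab, depth);
}

/*
 * Link n slabs (n >= 1) so that slab i's array is the only object in
 * slab i + 1, free the single object in slab 0, and report the deepest
 * nesting of demo_free_object() that was reached.
 */
int demo_chain_depth(struct demo_slab *slabs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		slabs[i].inuse = 1;
		slabs[i].vec_home = (i + 1 < n) ? &slabs[i + 1] : NULL;
	}
	demo_max_depth = 0;
	demo_free_object(&slabs[0], 1);
	return demo_max_depth;
}
```

In this model a chain of n slabs nests n levels deep, so nothing bounds the
depth. Once the arrays can only live in slabs that have no array of their
own, the chain stops after the second slab, which corresponds to n = 2 here.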

Waiman Long (2):
  mm: memcg/slab: Properly set up gfp flags for objcg pointer array
  mm: memcg/slab: Create a new set of kmalloc-cg-<n> caches

 include/linux/slab.h | 15 +++++++++++++++
 mm/memcontrol.c      |  8 ++++++++
 mm/slab.h            |  1 -
 mm/slab_common.c     | 23 +++++++++++++++--------
 4 files changed, 38 insertions(+), 9 deletions(-)

-- 
2.18.1



* [PATCH v2 1/2] mm: memcg/slab: Properly set up gfp flags for objcg pointer array
  2021-05-04 13:23 [PATCH v2 0/2] mm: memcg/slab: Fix objcg pointer array handling problem Waiman Long
@ 2021-05-04 13:23 ` Waiman Long
  2021-05-04 19:37   ` Shakeel Butt
  2021-05-04 13:23 ` [PATCH v2 2/2] mm: memcg/slab: Create a new set of kmalloc-cg-<n> caches Waiman Long
  1 sibling, 1 reply; 9+ messages in thread
From: Waiman Long @ 2021-05-04 13:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, Shakeel Butt
  Cc: linux-kernel, cgroups, linux-mm, Waiman Long

Since the merging of the new slab memory controller in v5.9, the page
structure may store a pointer to obj_cgroup pointer array for slab pages.
Currently, only the __GFP_ACCOUNT bit is masked off. However, the array
is not readily reclaimable and doesn't need to come from the DMA buffer.
So those GFP bits should be masked off as well.

Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
that it is consistently applied no matter where it is called.

Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 8 ++++++++
 mm/slab.h       | 1 -
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c100265dc393..5e3b4f23b830 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2863,6 +2863,13 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+/*
+ * The allocated objcg pointers array is not accounted directly.
+ * Moreover, it should not come from DMA buffer and is not readily
+ * reclaimable. So those GFP bits should be masked off.
+ */
+#define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
+
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 				 gfp_t gfp, bool new_page)
 {
@@ -2870,6 +2877,7 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 	unsigned long memcg_data;
 	void *vec;
 
+	gfp &= ~OBJCGS_CLEAR_MASK;
 	vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
 			   page_to_nid(page));
 	if (!vec)
diff --git a/mm/slab.h b/mm/slab.h
index 18c1927cd196..b3294712a686 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -309,7 +309,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 	if (!memcg_kmem_enabled() || !objcg)
 		return;
 
-	flags &= ~__GFP_ACCOUNT;
 	for (i = 0; i < size; i++) {
 		if (likely(p[i])) {
 			page = virt_to_head_page(p[i]);
-- 
2.18.1
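
As an illustration of the flag clearing described in this patch, the
following user-space sketch applies the same kind of mask to a caller's
flags. The DEMO_* names and bit values are placeholders invented for the
example; they are not the kernel's gfp_t definitions.

```c
typedef unsigned int demo_gfp_t;

/* Placeholder bit values chosen for this sketch, not the kernel's. */
#define DEMO_GFP_DMA          0x01u
#define DEMO_GFP_RECLAIMABLE  0x02u
#define DEMO_GFP_ACCOUNT      0x04u
#define DEMO_GFP_ZERO         0x08u  /* stands in for any unrelated flag */

/* Mirrors the shape of OBJCGS_CLEAR_MASK from the patch. */
#define DEMO_OBJCGS_CLEAR_MASK \
	(DEMO_GFP_DMA | DEMO_GFP_RECLAIMABLE | DEMO_GFP_ACCOUNT)

/* Flags to use for the objcg pointer array, derived from the caller's flags. */
demo_gfp_t demo_objcg_vec_gfp(demo_gfp_t gfp)
{
	return gfp & ~DEMO_OBJCGS_CLEAR_MASK;
}
```

Only the three masked bits are dropped; any other flag the caller passed,
represented here by DEMO_GFP_ZERO, is carried through unchanged.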



* [PATCH v2 2/2] mm: memcg/slab: Create a new set of kmalloc-cg-<n> caches
  2021-05-04 13:23 [PATCH v2 0/2] mm: memcg/slab: Fix objcg pointer array handling problem Waiman Long
  2021-05-04 13:23 ` [PATCH v2 1/2] mm: memcg/slab: Properly set up gfp flags for objcg pointer array Waiman Long
@ 2021-05-04 13:23 ` Waiman Long
  2021-05-04 16:01   ` Vlastimil Babka
  1 sibling, 1 reply; 9+ messages in thread
From: Waiman Long @ 2021-05-04 13:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, Shakeel Butt
  Cc: linux-kernel, cgroups, linux-mm, Waiman Long

There are currently two problems in the way the objcg pointer array
(memcg_data) in the page structure is being allocated and freed.

On its allocation, it is possible that the allocated objcg pointer
array comes from the same slab that requires memory accounting. If this
happens, the slab will never become empty again as there is at least
one object left (the obj_cgroup array) in the slab.

When it is freed, the objcg pointer array object may be the last one
in its slab and hence causes kfree() to be called again. With the
right workload, the slab cache may be set up in a way that allows the
recursive kfree() calling loop to nest deep enough to cause a kernel
stack overflow and panic the system.

One way to solve this problem is to split the kmalloc-<n> caches
(KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
(KMALLOC_NORMAL) caches for non-accounted objects only and a new set of
kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
the other caches can allow a mix of accounted and non-accounted objects.

With this change, all the objcg pointer array objects will come from
KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
both the recursive kfree() problem and non-freeable slab problem
are gone.

The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
will include the newly added caches without change.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/slab.h | 15 +++++++++++++++
 mm/slab_common.c     | 23 +++++++++++++++--------
 2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0c97d788762c..fca03c22ea7c 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -305,9 +305,16 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
 /*
  * Whenever changing this, take care of that kmalloc_type() and
  * create_kmalloc_caches() still work as intended.
+ *
+ * KMALLOC_NORMAL is for non-accounted objects only whereas KMALLOC_CGROUP
+ * is for accounted objects only. All the other kmem caches can have both
+ * accounted and non-accounted objects.
  */
 enum kmalloc_cache_type {
 	KMALLOC_NORMAL = 0,
+#ifdef CONFIG_MEMCG_KMEM
+	KMALLOC_CGROUP,
+#endif
 	KMALLOC_RECLAIM,
 #ifdef CONFIG_ZONE_DMA
 	KMALLOC_DMA,
@@ -321,6 +328,14 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
 
 static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
 {
+#ifdef CONFIG_MEMCG_KMEM
+	/*
+	 * KMALLOC_CGROUP for non-reclaimable and non-DMA object with
+	 * accounting enabled.
+	 */
+	if ((flags & (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)) == __GFP_ACCOUNT)
+		return KMALLOC_CGROUP;
+#endif
 #ifdef CONFIG_ZONE_DMA
 	/*
 	 * The most common case is KMALLOC_NORMAL, so test for it
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f8833d3e5d47..d750e3ba7af5 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -727,21 +727,25 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
 }
 
 #ifdef CONFIG_ZONE_DMA
-#define INIT_KMALLOC_INFO(__size, __short_size)			\
-{								\
-	.name[KMALLOC_NORMAL]  = "kmalloc-" #__short_size,	\
-	.name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size,	\
-	.name[KMALLOC_DMA]     = "dma-kmalloc-" #__short_size,	\
-	.size = __size,						\
-}
+#define KMALLOC_DMA_NAME(sz)	.name[KMALLOC_DMA] = "dma-kmalloc-" #sz,
+#else
+#define KMALLOC_DMA_NAME(sz)
+#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+#define KMALLOC_CGROUP_NAME(sz)	.name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz,
 #else
+#define KMALLOC_CGROUP_NAME(sz)
+#endif
+
 #define INIT_KMALLOC_INFO(__size, __short_size)			\
 {								\
 	.name[KMALLOC_NORMAL]  = "kmalloc-" #__short_size,	\
 	.name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size,	\
+	KMALLOC_CGROUP_NAME(__short_size)			\
+	KMALLOC_DMA_NAME(__short_size)				\
 	.size = __size,						\
 }
-#endif
 
 /*
  * kmalloc_info[] is to make slub_debug=,kmalloc-xx option work at boot time.
@@ -847,6 +851,9 @@ void __init create_kmalloc_caches(slab_flags_t flags)
 	int i;
 	enum kmalloc_cache_type type;
 
+	/*
+	 * Including KMALLOC_CGROUP if CONFIG_MEMCG_KMEM defined
+	 */
 	for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) {
 		for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
 			if (!kmalloc_caches[type][i])
-- 
2.18.1
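
The enum placement and the name macros in this patch can be tried out in
user space. The sketch below is an assumption-laden miniature: the DEMO_*
names are hypothetical, DEMO_CONFIG_MEMCG_KMEM stands in for
CONFIG_MEMCG_KMEM, the DMA type is left out, and only a single 256-byte
entry is built.

```c
/* Hypothetical stand-in for CONFIG_MEMCG_KMEM; remove to compile it out. */
#define DEMO_CONFIG_MEMCG_KMEM 1

enum demo_cache_type {
	DEMO_KMALLOC_NORMAL = 0,
#ifdef DEMO_CONFIG_MEMCG_KMEM
	DEMO_KMALLOC_CGROUP,     /* sits between NORMAL and RECLAIM */
#endif
	DEMO_KMALLOC_RECLAIM,
	DEMO_NR_KMALLOC_TYPES
};

/* Expands to one designated initializer, or to nothing when compiled out. */
#ifdef DEMO_CONFIG_MEMCG_KMEM
#define DEMO_CGROUP_NAME(sz)	.name[DEMO_KMALLOC_CGROUP] = "kmalloc-cg-" #sz,
#else
#define DEMO_CGROUP_NAME(sz)
#endif

struct demo_info {
	const char *name[DEMO_NR_KMALLOC_TYPES];
	unsigned int size;
};

#define DEMO_INIT_INFO(__size, __short_size)				\
{									\
	.name[DEMO_KMALLOC_NORMAL]  = "kmalloc-" #__short_size,		\
	.name[DEMO_KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size,	\
	DEMO_CGROUP_NAME(__short_size)					\
	.size = __size,							\
}

static const struct demo_info demo_info_256 = DEMO_INIT_INFO(256, 256);

/* Name of the 256-byte cache of the given type. */
const char *demo_cache_name(enum demo_cache_type type)
{
	return demo_info_256.name[type];
}

/* Counts the types visited by a NORMAL..RECLAIM loop like the one above. */
int demo_types_in_first_loop(void)
{
	int type, n = 0;

	for (type = DEMO_KMALLOC_NORMAL; type <= DEMO_KMALLOC_RECLAIM; type++)
		n++;
	return n;
}
```

Because the new enumerator sits between the two existing ones, the
unchanged loop bounds pick it up, and the helper macro keeps the
initializer free of nested #ifdef blocks.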



* Re: [PATCH v2 2/2] mm: memcg/slab: Create a new set of kmalloc-cg-<n> caches
  2021-05-04 13:23 ` [PATCH v2 2/2] mm: memcg/slab: Create a new set of kmalloc-cg-<n> caches Waiman Long
@ 2021-05-04 16:01   ` Vlastimil Babka
  2021-05-05  1:55     ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2021-05-04 16:01 UTC (permalink / raw)
  To: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Roman Gushchin, Shakeel Butt
  Cc: linux-kernel, cgroups, linux-mm

On 5/4/21 3:23 PM, Waiman Long wrote:
> There are currently two problems in the way the objcg pointer array
> (memcg_data) in the page structure is being allocated and freed.
> 
> On its allocation, it is possible that the allocated objcg pointer
> array comes from the same slab that requires memory accounting. If this
> happens, the slab will never become empty again as there is at least
> one object left (the obj_cgroup array) in the slab.
> 
> When it is freed, the objcg pointer array object may be the last one
> in its slab and hence causes kfree() to be called again. With the
> right workload, the slab cache may be set up in a way that allows the
> recursive kfree() calling loop to nest deep enough to cause a kernel
> stack overflow and panic the system.
> 
> One way to solve this problem is to split the kmalloc-<n> caches
> (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
> (KMALLOC_NORMAL) caches for non-accounted objects only and a new set of
> kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
> the other caches can allow a mix of accounted and non-accounted objects.
> 
> With this change, all the objcg pointer array objects will come from
> KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
> both the recursive kfree() problem and non-freeable slab problem
> are gone.
> 
> The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
> KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
> will include the newly added caches without change.

Great, thanks. I hope there will also be benefits from objcg arrays no
longer being created for all the normal caches (where they were possibly
poorly used due to the mix of accounted and non-accounted objects in the
same cache). Perhaps it's possible for you to quantify the reduction?

> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Waiman Long <longman@redhat.com>

...

> @@ -321,6 +328,14 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
>  
>  static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
>  {
> +#ifdef CONFIG_MEMCG_KMEM
> +	/*
> +	 * KMALLOC_CGROUP for non-reclaimable and non-DMA object with
> +	 * accounting enabled.
> +	 */
> +	if ((flags & (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)) == __GFP_ACCOUNT)
> +		return KMALLOC_CGROUP;
> +#endif

This function was designed so that KMALLOC_NORMAL would be the first tested and
returned possibility, as it's expected to be the most common. What about the
following on top?

----8<----
diff --git a/include/linux/slab.h b/include/linux/slab.h
index fca03c22ea7c..418c5df0305b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -328,30 +328,40 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
 
 static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
 {
-#ifdef CONFIG_MEMCG_KMEM
 	/*
-	 * KMALLOC_CGROUP for non-reclaimable and non-DMA object with
-	 * accounting enabled.
+	 * The most common case is KMALLOC_NORMAL, so test for it
+	 * with a single branch for all flags that might affect it
 	 */
-	if ((flags & (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)) == __GFP_ACCOUNT)
-		return KMALLOC_CGROUP;
+	if (likely((flags & (__GFP_RECLAIMABLE
+#ifdef CONFIG_MEMCG_KMEM
+			     | __GFP_ACCOUNT
 #endif
 #ifdef CONFIG_ZONE_DMA
-	/*
-	 * The most common case is KMALLOC_NORMAL, so test for it
-	 * with a single branch for both flags.
-	 */
-	if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
+			     | __GFP_DMA
+#endif
+			    )) == 0))
 		return KMALLOC_NORMAL;
 
+#ifdef CONFIG_MEMCG_KMEM
 	/*
-	 * At least one of the flags has to be set. If both are, __GFP_DMA
-	 * is more important.
+	 * KMALLOC_CGROUP for non-reclaimable and non-DMA object with
+	 * accounting enabled.
 	 */
-	return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM;
-#else
-	return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL;
+	if ((flags & (__GFP_ACCOUNT | __GFP_RECLAIMABLE
+#ifdef CONFIG_ZONE_DMA
+		      | __GFP_DMA
+#endif
+		     )) == __GFP_ACCOUNT)
+		return KMALLOC_CGROUP;
 #endif
+
+#ifdef CONFIG_ZONE_DMA
+	if (flags & __GFP_DMA)
+		return KMALLOC_DMA;
+#endif
+
+	/* if we got here, it has to be __GFP_RECLAIMABLE */
+	return KMALLOC_RECLAIM;
 }
 
 /*
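
As a sanity check on the reordering, the following user-space sketch
compares the two test orders for every combination of the three relevant
flags. It assumes both CONFIG_MEMCG_KMEM and CONFIG_ZONE_DMA are enabled;
the DEMO_* names and bit values are hypothetical placeholders rather than
the kernel's definitions.

```c
typedef unsigned int demo_gfp_t;

/* Placeholder bit values chosen for this sketch, not the kernel's. */
#define DEMO_GFP_DMA          0x01u
#define DEMO_GFP_RECLAIMABLE  0x02u
#define DEMO_GFP_ACCOUNT      0x04u
#define DEMO_ALL_BITS \
	(DEMO_GFP_DMA | DEMO_GFP_RECLAIMABLE | DEMO_GFP_ACCOUNT)

enum demo_cache_type {
	DEMO_KMALLOC_NORMAL = 0,
	DEMO_KMALLOC_CGROUP,
	DEMO_KMALLOC_RECLAIM,
	DEMO_KMALLOC_DMA
};

/* Test order of patch 2 as posted: KMALLOC_CGROUP is checked first. */
enum demo_cache_type demo_type_cgroup_first(demo_gfp_t flags)
{
	if ((flags & DEMO_ALL_BITS) == DEMO_GFP_ACCOUNT)
		return DEMO_KMALLOC_CGROUP;
	if ((flags & (DEMO_GFP_DMA | DEMO_GFP_RECLAIMABLE)) == 0)
		return DEMO_KMALLOC_NORMAL;
	return (flags & DEMO_GFP_DMA) ? DEMO_KMALLOC_DMA : DEMO_KMALLOC_RECLAIM;
}

/* Test order of the follow-up diff: KMALLOC_NORMAL is checked first. */
enum demo_cache_type demo_type_normal_first(demo_gfp_t flags)
{
	if ((flags & DEMO_ALL_BITS) == 0)
		return DEMO_KMALLOC_NORMAL;
	if ((flags & DEMO_ALL_BITS) == DEMO_GFP_ACCOUNT)
		return DEMO_KMALLOC_CGROUP;
	if (flags & DEMO_GFP_DMA)
		return DEMO_KMALLOC_DMA;
	/* Only __GFP_RECLAIMABLE (possibly with accounting) is left. */
	return DEMO_KMALLOC_RECLAIM;
}

/* Number of flag combinations on which the two orders disagree. */
int demo_count_mismatches(void)
{
	demo_gfp_t flags;
	int mismatches = 0;

	for (flags = 0; flags <= DEMO_ALL_BITS; flags++)
		if (demo_type_cgroup_first(flags) != demo_type_normal_first(flags))
			mismatches++;
	return mismatches;
}
```

In this model the two orders agree on all eight combinations, so the
reordering only changes which case is reached with the fewest branches.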


* Re: [PATCH v2 1/2] mm: memcg/slab: Properly set up gfp flags for objcg pointer array
  2021-05-04 13:23 ` [PATCH v2 1/2] mm: memcg/slab: Properly set up gfp flags for objcg pointer array Waiman Long
@ 2021-05-04 19:37   ` Shakeel Butt
  2021-05-04 20:02     ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Shakeel Butt @ 2021-05-04 19:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, LKML, Cgroups, Linux MM

On Tue, May 4, 2021 at 6:24 AM Waiman Long <longman@redhat.com> wrote:
>
> Since the merging of the new slab memory controller in v5.9, the page
> structure may store a pointer to obj_cgroup pointer array for slab pages.
> Currently, only the __GFP_ACCOUNT bit is masked off. However, the array
> is not readily reclaimable and doesn't need to come from the DMA buffer.
> So those GFP bits should be masked off as well.
>
> Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
> that it is consistently applied no matter where it is called.
>
> Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/memcontrol.c | 8 ++++++++
>  mm/slab.h       | 1 -
>  2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c100265dc393..5e3b4f23b830 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2863,6 +2863,13 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
>  }
>
>  #ifdef CONFIG_MEMCG_KMEM
> +/*
> + * The allocated objcg pointers array is not accounted directly.
> + * Moreover, it should not come from DMA buffer and is not readily
> + * reclaimable. So those GFP bits should be masked off.
> + */
> +#define OBJCGS_CLEAR_MASK      (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)

What about __GFP_DMA32? Does it matter? It seems like DMA32 requests
go to normal caches.

> +
>  int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
>                                  gfp_t gfp, bool new_page)
>  {
> @@ -2870,6 +2877,7 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
>         unsigned long memcg_data;
>         void *vec;
>
> +       gfp &= ~OBJCGS_CLEAR_MASK;
>         vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
>                            page_to_nid(page));
>         if (!vec)
> diff --git a/mm/slab.h b/mm/slab.h
> index 18c1927cd196..b3294712a686 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -309,7 +309,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
>         if (!memcg_kmem_enabled() || !objcg)
>                 return;
>
> -       flags &= ~__GFP_ACCOUNT;
>         for (i = 0; i < size; i++) {
>                 if (likely(p[i])) {
>                         page = virt_to_head_page(p[i]);
> --
> 2.18.1
>


* Re: [PATCH v2 1/2] mm: memcg/slab: Properly set up gfp flags for objcg pointer array
  2021-05-04 19:37   ` Shakeel Butt
@ 2021-05-04 20:02     ` Waiman Long
  2021-05-04 20:06       ` Shakeel Butt
  0 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2021-05-04 20:02 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, LKML, Cgroups, Linux MM

On 5/4/21 3:37 PM, Shakeel Butt wrote:
> On Tue, May 4, 2021 at 6:24 AM Waiman Long <longman@redhat.com> wrote:
>> Since the merging of the new slab memory controller in v5.9, the page
>> structure may store a pointer to obj_cgroup pointer array for slab pages.
>> Currently, only the __GFP_ACCOUNT bit is masked off. However, the array
>> is not readily reclaimable and doesn't need to come from the DMA buffer.
>> So those GFP bits should be masked off as well.
>>
>> Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
>> that it is consistently applied no matter where it is called.
>>
>> Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   mm/memcontrol.c | 8 ++++++++
>>   mm/slab.h       | 1 -
>>   2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c100265dc393..5e3b4f23b830 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2863,6 +2863,13 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
>>   }
>>
>>   #ifdef CONFIG_MEMCG_KMEM
>> +/*
>> + * The allocated objcg pointers array is not accounted directly.
>> + * Moreover, it should not come from DMA buffer and is not readily
>> + * reclaimable. So those GFP bits should be masked off.
>> + */
>> +#define OBJCGS_CLEAR_MASK      (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
> What about __GFP_DMA32? Does it matter? It seems like DMA32 requests
> go to normal caches.

I included __GFP_DMA32 in my first draft patch. However, __GFP_DMA32 is 
not considered in determining the right kmalloc_type() (patch 2), so I 
took it out to make it consistent. I can certainly add it back.

Cheers,
Longman



* Re: [PATCH v2 1/2] mm: memcg/slab: Properly set up gfp flags for objcg pointer array
  2021-05-04 20:02     ` Waiman Long
@ 2021-05-04 20:06       ` Shakeel Butt
  2021-05-05 11:32         ` Vlastimil Babka
  0 siblings, 1 reply; 9+ messages in thread
From: Shakeel Butt @ 2021-05-04 20:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, LKML, Cgroups, Linux MM

On Tue, May 4, 2021 at 1:02 PM Waiman Long <llong@redhat.com> wrote:
>
> On 5/4/21 3:37 PM, Shakeel Butt wrote:
> > On Tue, May 4, 2021 at 6:24 AM Waiman Long <longman@redhat.com> wrote:
> >> Since the merging of the new slab memory controller in v5.9, the page
> >> structure may store a pointer to obj_cgroup pointer array for slab pages.
> >> Currently, only the __GFP_ACCOUNT bit is masked off. However, the array
> >> is not readily reclaimable and doesn't need to come from the DMA buffer.
> >> So those GFP bits should be masked off as well.
> >>
> >> Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
> >> that it is consistently applied no matter where it is called.
> >>
> >> Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> >> ---
> >>   mm/memcontrol.c | 8 ++++++++
> >>   mm/slab.h       | 1 -
> >>   2 files changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index c100265dc393..5e3b4f23b830 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -2863,6 +2863,13 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> >>   }
> >>
> >>   #ifdef CONFIG_MEMCG_KMEM
> >> +/*
> >> + * The allocated objcg pointers array is not accounted directly.
> >> + * Moreover, it should not come from DMA buffer and is not readily
> >> + * reclaimable. So those GFP bits should be masked off.
> >> + */
> >> +#define OBJCGS_CLEAR_MASK      (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
> > What about __GFP_DMA32? Does it matter? It seems like DMA32 requests
> > go to normal caches.
>
> I included __GFP_DMA32 in my first draft patch. However, __GFP_DMA32 is
> not considered in determining the right kmalloc_type() (patch 2), so I
> took it out to make it consistent. I can certainly add it back.
>

No, this is fine; the DMA32 question is unrelated to this patch series.


* Re: [PATCH v2 2/2] mm: memcg/slab: Create a new set of kmalloc-cg-<n> caches
  2021-05-04 16:01   ` Vlastimil Babka
@ 2021-05-05  1:55     ` Waiman Long
  0 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2021-05-05  1:55 UTC (permalink / raw)
  To: Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Roman Gushchin, Shakeel Butt
  Cc: linux-kernel, cgroups, linux-mm

On 5/4/21 12:01 PM, Vlastimil Babka wrote:
> On 5/4/21 3:23 PM, Waiman Long wrote:
>> There are currently two problems in the way the objcg pointer array
>> (memcg_data) in the page structure is being allocated and freed.
>>
>> On its allocation, it is possible that the allocated objcg pointer
>> array comes from the same slab that requires memory accounting. If this
>> happens, the slab will never become empty again as there is at least
>> one object left (the obj_cgroup array) in the slab.
>>
>> When it is freed, the objcg pointer array object may be the last one
>> in its slab and hence causes kfree() to be called again. With the
>> right workload, the slab cache may be set up in a way that allows the
>> recursive kfree() calling loop to nest deep enough to cause a kernel
>> stack overflow and panic the system.
>>
>> One way to solve this problem is to split the kmalloc-<n> caches
>> (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
>> (KMALLOC_NORMAL) caches for non-accounted objects only and a new set of
>> kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
>> the other caches can allow a mix of accounted and non-accounted objects.
>>
>> With this change, all the objcg pointer array objects will come from
>> KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
>> both the recursive kfree() problem and non-freeable slab problem
>> are gone.
>>
>> The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
>> KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
>> will include the newly added caches without change.
> Great, thanks. I hope there will also be benefits from objcg arrays no
> longer being created for all the normal caches (where they were possibly
> poorly used due to the mix of accounted and non-accounted objects in the
> same cache). Perhaps it's possible for you to quantify the reduction?
Right, I will update the commit log to mention that as well. Thanks!
>> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> ...
>
>> @@ -321,6 +328,14 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
>>   
>>   static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
>>   {
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	/*
>> +	 * KMALLOC_CGROUP for non-reclaimable and non-DMA object with
>> +	 * accounting enabled.
>> +	 */
>> +	if ((flags & (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)) == __GFP_ACCOUNT)
>> +		return KMALLOC_CGROUP;
>> +#endif
> This function was designed so that KMALLOC_NORMAL would be the first tested and
> returned possibility, as it's expected to be the most common. What about the
> following on top?
>
> ----8<----
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index fca03c22ea7c..418c5df0305b 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -328,30 +328,40 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
>   
>   static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
>   {
> -#ifdef CONFIG_MEMCG_KMEM
>   	/*
> -	 * KMALLOC_CGROUP for non-reclaimable and non-DMA object with
> -	 * accounting enabled.
> +	 * The most common case is KMALLOC_NORMAL, so test for it
> +	 * with a single branch for all flags that might affect it
>   	 */
> -	if ((flags & (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)) == __GFP_ACCOUNT)
> -		return KMALLOC_CGROUP;
> +	if (likely((flags & (__GFP_RECLAIMABLE
> +#ifdef CONFIG_MEMCG_KMEM
> +			     | __GFP_ACCOUNT
>   #endif
>   #ifdef CONFIG_ZONE_DMA
> -	/*
> -	 * The most common case is KMALLOC_NORMAL, so test for it
> -	 * with a single branch for both flags.
> -	 */
> -	if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
> +			     | __GFP_DMA
> +#endif
> +			    )) == 0))
>   		return KMALLOC_NORMAL;
>   
> +#ifdef CONFIG_MEMCG_KMEM
>   	/*
> -	 * At least one of the flags has to be set. If both are, __GFP_DMA
> -	 * is more important.
> +	 * KMALLOC_CGROUP for non-reclaimable and non-DMA object with
> +	 * accounting enabled.
>   	 */
> -	return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM;
> -#else
> -	return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL;
> +	if ((flags & (__GFP_ACCOUNT | __GFP_RECLAIMABLE
> +#ifdef CONFIG_ZONE_DMA
> +		      | __GFP_DMA
> +#endif
> +		     )) == __GFP_ACCOUNT)
> +		return KMALLOC_CGROUP;
>   #endif
> +
> +#ifdef CONFIG_ZONE_DMA
> +	if (flags & __GFP_DMA)
> +		return KMALLOC_DMA;
> +#endif
> +
> +	/* if we got here, it has to be __GFP_RECLAIMABLE */
> +	return KMALLOC_RECLAIM;
>   }
>   
>   /*
>
OK, I will make KMALLOC_NORMAL the first in the test. However, the
proposed change is a bit hard to read, so I will probably change it a bit.

Thanks,
Longman
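
For illustration only, here is one way the same decision could be written
with fewer preprocessor blocks. It is a user-space sketch, not the version
that was or will be posted: the DEMO_* names and bit values are hypothetical,
and DEMO_CONFIG_ZONE_DMA and DEMO_CONFIG_MEMCG_KMEM stand in for the real
config options.

```c
typedef unsigned int demo_gfp_t;

/* Placeholder bit values chosen for this sketch, not the kernel's. */
#define DEMO_GFP_DMA          0x01u
#define DEMO_GFP_RECLAIMABLE  0x02u
#define DEMO_GFP_ACCOUNT      0x04u

/* Hypothetical stand-ins for CONFIG_ZONE_DMA and CONFIG_MEMCG_KMEM. */
#define DEMO_CONFIG_ZONE_DMA   1
#define DEMO_CONFIG_MEMCG_KMEM 1

/* A bit only counts when the matching option is enabled. */
#if DEMO_CONFIG_ZONE_DMA
#define DEMO_DMA_BIT      DEMO_GFP_DMA
#else
#define DEMO_DMA_BIT      0u
#endif

#if DEMO_CONFIG_MEMCG_KMEM
#define DEMO_ACCOUNT_BIT  DEMO_GFP_ACCOUNT
#else
#define DEMO_ACCOUNT_BIT  0u
#endif

#define DEMO_NOT_NORMAL_BITS \
	(DEMO_DMA_BIT | DEMO_GFP_RECLAIMABLE | DEMO_ACCOUNT_BIT)

enum demo_cache_type {
	DEMO_KMALLOC_NORMAL = 0,
	DEMO_KMALLOC_CGROUP,
	DEMO_KMALLOC_RECLAIM,
	DEMO_KMALLOC_DMA
};

enum demo_cache_type demo_kmalloc_type(demo_gfp_t flags)
{
	/* Common case first: none of the relevant bits is set. */
	if ((flags & DEMO_NOT_NORMAL_BITS) == 0)
		return DEMO_KMALLOC_NORMAL;
	/* DMA wins over the other bits when it is supported. */
	if (flags & DEMO_DMA_BIT)
		return DEMO_KMALLOC_DMA;
	if (flags & DEMO_GFP_RECLAIMABLE)
		return DEMO_KMALLOC_RECLAIM;
	/* Only the account bit can be left at this point. */
	return DEMO_KMALLOC_CGROUP;
}
```

The config dependence is folded into the mask macros, so the function body
itself reads as a plain priority list: normal, DMA, reclaimable, cgroup.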



* Re: [PATCH v2 1/2] mm: memcg/slab: Properly set up gfp flags for objcg pointer array
  2021-05-04 20:06       ` Shakeel Butt
@ 2021-05-05 11:32         ` Vlastimil Babka
  0 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2021-05-05 11:32 UTC (permalink / raw)
  To: Shakeel Butt, Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, LKML, Cgroups, Linux MM

On 5/4/21 10:06 PM, Shakeel Butt wrote:
> On Tue, May 4, 2021 at 1:02 PM Waiman Long <llong@redhat.com> wrote:
>>
>> On 5/4/21 3:37 PM, Shakeel Butt wrote:
>> > On Tue, May 4, 2021 at 6:24 AM Waiman Long <longman@redhat.com> wrote:
>> >> Since the merging of the new slab memory controller in v5.9, the page
>> >> structure may store a pointer to obj_cgroup pointer array for slab pages.
>> >> Currently, only the __GFP_ACCOUNT bit is masked off. However, the array
>> >> is not readily reclaimable and doesn't need to come from the DMA buffer.
>> >> So those GFP bits should be masked off as well.
>> >>
>> >> Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
>> >> that it is consistently applied no matter where it is called.
>> >>
>> >> Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
>> >> Signed-off-by: Waiman Long <longman@redhat.com>
>> >> ---
>> >>   mm/memcontrol.c | 8 ++++++++
>> >>   mm/slab.h       | 1 -
>> >>   2 files changed, 8 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> >> index c100265dc393..5e3b4f23b830 100644
>> >> --- a/mm/memcontrol.c
>> >> +++ b/mm/memcontrol.c
>> >> @@ -2863,6 +2863,13 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
>> >>   }
>> >>
>> >>   #ifdef CONFIG_MEMCG_KMEM
>> >> +/*
>> >> + * The allocated objcg pointers array is not accounted directly.
>> >> + * Moreover, it should not come from DMA buffer and is not readily
>> >> + * reclaimable. So those GFP bits should be masked off.
>> >> + */
>> >> +#define OBJCGS_CLEAR_MASK      (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
>> > What about __GFP_DMA32? Does it matter? It seems like DMA32 requests
>> > go to normal caches.
>>
>> I included __GFP_DMA32 in my first draft patch. However, __GFP_DMA32 is
>> not considered in determining the right kmalloc_type() (patch 2), so I
>> took it out to make it consistent. I can certainly add it back.
>>
> 
> No, this is fine; the DMA32 question is unrelated to this patch series.

We never supported them in kmalloc(), only explicit caches with SLAB_CACHE_DMA32
flag.
