linux-mm.kvack.org archive mirror
* [PATCH v5 0/4] mm/slub: some debug enhancements for kmalloc
@ 2022-09-07  7:10 Feng Tang
  2022-09-07  7:10 ` [PATCH v5 1/4] mm/slub: enable debugging memory wasting of kmalloc Feng Tang
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-07  7:10 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet
  Cc: Dave Hansen, linux-mm, linux-kernel, kasan-dev, Feng Tang

kmalloc's API family is critical for mm, and one of its characteristics
is that it rounds up the request size to a fixed size (mostly a power
of 2). When a user requests '2^n + 1' bytes, 2^(n+1) bytes may actually
be allocated, so in the worst case around 50% of the memory space is
wasted.
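
For illustration only (not part of this patchset), a request just above
a power of two lands in the next kmalloc size class:

	buf = kmalloc(1025, GFP_KERNEL);            /* served from kmalloc-2k */
	pr_info("usable size: %zu\n", ksize(buf));  /* prints 2048, ~50% unused */
	kfree(buf);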

The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems for objects with a longer lifetime,
and there have been OOM failures in some extreme cases.

This patchset tries to:
* Add a debug method to track each kmalloced object's wastage info,
  and show the call stack of the original allocation (depends on the
  SLAB_STORE_USER flag)
* Extend the redzone sanity check to the extra space kmalloc allocates
  beyond the requested size, to better detect illegitimate access to
  it (depends on SLAB_STORE_USER & SLAB_RED_ZONE)

The redzone part has been tested with the code below:

	for (shift = 3; shift <= 12; shift++) {
		size = 1 << shift;
		buf = kmalloc(size + 4, GFP_KERNEL);
		/* We have 96 and 192 kmalloc sizes, which are not powers of 2 */
		if (size == 64 || size == 128)
			oob_size = 16;
		else
			oob_size = size - 4;
		memset(buf + size + 4, 0xee, oob_size);
		kfree(buf);
	}
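
(With patch 4/4 applied, and both redzone and user tracking debug
enabled, each of these writes is expected to be reported as a corrupted
kmalloc redzone when the buffer is freed.)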

Please help to review, thanks!

- Feng

---
Changelogs:
 
  since v4:
    * fix a race issue in v3, by moving kmalloc debug init into
      alloc_debug_processing (Hyeonggon Yoo)
    * add 'partial_context' for better parameter passing in get_partial()
      call chain (Vlastimil Babka)
    * update 'slub.rst' for 'alloc_traces' part (Hyeonggon Yoo)
    * update code comments for 'orig_size'

  since v3:
    * rebase against the latest post-6.0-rc1 slab tree's 'for-next' branch
    * fix a bug reported by 0Day, where the kmalloc-redzoned data and kasan's
      free meta data overlap in the same kmalloc object data area

  since v2:
    * rebase against slab tree's 'for-next' branch
    * fix pointer handling (Kefeng Wang)
    * move kzalloc zeroing handling change to a separate patch (Vlastimil Babka) 
    * make 'orig_size' only depend on KMALLOC & STORE_USER flag
      bits (Vlastimil Babka)

  since v1:
    * limit the 'orig_size' to kmalloc objects only, and save
      it after the tracking data in the metadata area (Vlastimil Babka)
    * fix an offset calculation problem in print_trailer

  since RFC:
    * fix problems in kmem_cache_alloc_bulk() and records sorting,
      improve the print format (Hyeonggon Yoo)
    * fix a compile issue found by the 0Day bot
    * update the commit log based on info from iova developers

Feng Tang (4):
  mm/slub: enable debugging memory wasting of kmalloc
  mm/slub: only zero the requested size of buffer for kzalloc
  mm: kasan: Add free_meta size info in struct kasan_cache
  mm/slub: extend redzone check to extra allocated kmalloc space than
    requested

 Documentation/mm/slub.rst |  33 +++---
 include/linux/kasan.h     |   2 +
 include/linux/slab.h      |   2 +
 mm/kasan/common.c         |   2 +
 mm/slab.c                 |   6 +-
 mm/slab.h                 |  13 ++-
 mm/slab_common.c          |   4 +
 mm/slub.c                 | 219 ++++++++++++++++++++++++++++++--------
 8 files changed, 220 insertions(+), 61 deletions(-)

-- 
2.34.1




* [PATCH v5 1/4] mm/slub: enable debugging memory wasting of kmalloc
  2022-09-07  7:10 [PATCH v5 0/4] mm/slub: some debug enhancements for kmalloc Feng Tang
@ 2022-09-07  7:10 ` Feng Tang
  2022-09-07 14:17   ` Hyeonggon Yoo
  2022-09-07  7:10 ` [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc Feng Tang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Feng Tang @ 2022-09-07  7:10 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet
  Cc: Dave Hansen, linux-mm, linux-kernel, kasan-dev, Feng Tang,
	Robin Murphy, John Garry, Kefeng Wang

kmalloc's API family is critical for mm, and one of its characteristics
is that it rounds up the request size to a fixed size (mostly a power
of 2). Say a user requests memory for '2^n + 1' bytes; 2^(n+1) bytes
may actually be allocated, so in the worst case around 50% of the
memory space is wasted.

The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems for objects with a longer lifetime.

We've met a kernel boot OOM panic (v5.10), and from the dumped slab
info:

    [   26.062145] kmalloc-2k            814056KB     814056KB

From debugging we found a huge number of 'struct iova_magazine' objects,
whose size is 1032 bytes (1024 + 8), so each allocation wastes 1016
bytes. Though the issue was solved by installing the right (bigger)
amount of RAM, it is still nice to optimize the size (either use a
kmalloc-friendly size or create a dedicated slab for it).
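
(For reference, the v5.10-era definition looked roughly like the sketch
below; the exact code may differ slightly:

	#define IOVA_MAG_SIZE 128

	struct iova_magazine {
		unsigned long size;
		unsigned long pfns[IOVA_MAG_SIZE];
	};

i.e. 8 + 128 * 8 = 1032 bytes on 64-bit, which falls into kmalloc-2k.)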

And from the lkml archive, there was another crash-kernel OOM case [1]
back in 2019, which seems to be related to a similar slab waste
situation, as the log is similar:

    [    4.332648] iommu: Adding device 0000:20:02.0 to group 16
    [    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
    ...
    [    4.857565] kmalloc-2048           59164KB      59164KB

The crash kernel only has 256M of memory, and 59M is pretty big here.
(Note: the related code has been changed and optimised in recent
kernels [2]; these logs are just picked to demo the problem, and a
patch changing its size to 1024 bytes has also been merged.)

So add a way to track each kmalloc's memory waste info, and leverage
the existing SLUB debug framework (specifically SLAB_STORE_USER) to
show the call stack of the original allocation, so that users can
evaluate the waste situation, identify hot spots and optimize
accordingly, for better utilization of memory.

The waste info is integrated into the existing interface
'/sys/kernel/debug/slab/kmalloc-xx/alloc_traces'; one example of
'kmalloc-4k' after boot is:

 126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
     __kmem_cache_alloc_node+0x11f/0x4e0
     __kmalloc_node+0x4e/0x140
     ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
     ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
     ixgbe_probe+0x165f/0x1d20 [ixgbe]
     local_pci_probe+0x78/0xc0
     work_for_cpu_fn+0x26/0x40
     ...

which means that in the 'kmalloc-4k' slab there are 126 requests of
2240 bytes which got a 4KB space (wasting 1856 bytes each and
233856 bytes in total), from ixgbe_alloc_q_vector().

And when the system starts some real workload like multiple docker
instances, the waste could be more severe.
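
(Usage sketch: output like the above can be obtained by booting with
something like 'slub_debug=U,kmalloc-4k', which sets SLAB_STORE_USER
for that cache, and then reading
'/sys/kernel/debug/slab/kmalloc-4k/alloc_traces'; see
Documentation/mm/slub.rst for the exact syntax.)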

[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/

[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
 orig_size and keep it only for kmalloc objects]

Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 Documentation/mm/slub.rst |  33 +++++---
 include/linux/slab.h      |   2 +
 mm/slub.c                 | 156 ++++++++++++++++++++++++++++----------
 3 files changed, 141 insertions(+), 50 deletions(-)

diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
index 43063ade737a..4e1578186b4f 100644
--- a/Documentation/mm/slub.rst
+++ b/Documentation/mm/slub.rst
@@ -400,21 +400,30 @@ information:
     allocated objects. The output is sorted by frequency of each trace.
 
     Information in the output:
-    Number of objects, allocating function, minimal/average/maximal jiffies since alloc,
-    pid range of the allocating processes, cpu mask of allocating cpus, and stack trace.
+    Number of objects, allocating function, possible memory wastage of
+    kmalloc objects (total/per-object), minimal/average/maximal jiffies
+    since alloc, pid range of the allocating processes, cpu mask of
+    allocating cpus, numa node mask of origins of memory, and stack trace.
 
     Example:::
 
-    1085 populate_error_injection_list+0x97/0x110 age=166678/166680/166682 pid=1 cpus=1::
-	__slab_alloc+0x6d/0x90
-	kmem_cache_alloc_trace+0x2eb/0x300
-	populate_error_injection_list+0x97/0x110
-	init_error_injection+0x1b/0x71
-	do_one_initcall+0x5f/0x2d0
-	kernel_init_freeable+0x26f/0x2d7
-	kernel_init+0xe/0x118
-	ret_from_fork+0x22/0x30
-
+    338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1
+        __kmem_cache_alloc_node+0x11f/0x4e0
+        kmalloc_trace+0x26/0xa0
+        pci_alloc_dev+0x2c/0xa0
+        pci_scan_single_device+0xd2/0x150
+        pci_scan_slot+0xf7/0x2d0
+        pci_scan_child_bus_extend+0x4e/0x360
+        acpi_pci_root_create+0x32e/0x3b0
+        pci_acpi_scan_root+0x2b9/0x2d0
+        acpi_pci_root_add.cold.11+0x110/0xb0a
+        acpi_bus_attach+0x262/0x3f0
+        device_for_each_child+0xb7/0x110
+        acpi_dev_for_each_child+0x77/0xa0
+        acpi_bus_attach+0x108/0x3f0
+        device_for_each_child+0xb7/0x110
+        acpi_dev_for_each_child+0x77/0xa0
+        acpi_bus_attach+0x108/0x3f0
 
 2. free_traces::
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9b592e611cb1..6dc495f76644 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -29,6 +29,8 @@
 #define SLAB_RED_ZONE		((slab_flags_t __force)0x00000400U)
 /* DEBUG: Poison objects */
 #define SLAB_POISON		((slab_flags_t __force)0x00000800U)
+/* Indicate a kmalloc slab */
+#define SLAB_KMALLOC		((slab_flags_t __force)0x00001000U)
 /* Align objs on cache lines */
 #define SLAB_HWCACHE_ALIGN	((slab_flags_t __force)0x00002000U)
 /* Use GFP_DMA memory */
diff --git a/mm/slub.c b/mm/slub.c
index fe4fe0e72daf..effd994438e6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -194,11 +194,24 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
 #endif
 #endif		/* CONFIG_SLUB_DEBUG */
 
+/* Structure holding parameters for get_partial() call chain */
+struct partial_context {
+	struct slab **slab;
+	gfp_t flags;
+	int orig_size;
+};
+
 static inline bool kmem_cache_debug(struct kmem_cache *s)
 {
 	return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
 }
 
+static inline bool slub_debug_orig_size(struct kmem_cache *s)
+{
+	return (kmem_cache_debug_flags(s, SLAB_STORE_USER) &&
+			(s->flags & SLAB_KMALLOC));
+}
+
 void *fixup_red_left(struct kmem_cache *s, void *p)
 {
 	if (kmem_cache_debug_flags(s, SLAB_RED_ZONE))
@@ -785,6 +798,39 @@ static void print_slab_info(const struct slab *slab)
 	       folio_flags(folio, 0));
 }
 
+/*
+ * kmalloc caches have fixed sizes (mostly powers of 2), and the kmalloc()
+ * API family will round up the real request size to these fixed ones, so
+ * there could be extra space beyond what is requested. Save the original
+ * request size in the metadata area, for better debugging and sanity checks.
+ */
+static inline void set_orig_size(struct kmem_cache *s,
+				void *object, unsigned int orig_size)
+{
+	void *p = kasan_reset_tag(object);
+
+	if (!slub_debug_orig_size(s))
+		return;
+
+	p += get_info_end(s);
+	p += sizeof(struct track) * 2;
+
+	*(unsigned int *)p = orig_size;
+}
+
+static unsigned int get_orig_size(struct kmem_cache *s, void *object)
+{
+	void *p = kasan_reset_tag(object);
+
+	if (!slub_debug_orig_size(s))
+		return s->object_size;
+
+	p += get_info_end(s);
+	p += sizeof(struct track) * 2;
+
+	return *(unsigned int *)p;
+}
+
 static void slab_bug(struct kmem_cache *s, char *fmt, ...)
 {
 	struct va_format vaf;
@@ -844,6 +890,9 @@ static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
 	if (s->flags & SLAB_STORE_USER)
 		off += 2 * sizeof(struct track);
 
+	if (slub_debug_orig_size(s))
+		off += sizeof(unsigned int);
+
 	off += kasan_metadata_size(s);
 
 	if (off != size_from_object(s))
@@ -977,7 +1026,8 @@ static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
  *
  * 	A. Free pointer (if we cannot overwrite object on free)
  * 	B. Tracking data for SLAB_STORE_USER
- *	C. Padding to reach required alignment boundary or at minimum
+ *	C. Original request size for kmalloc object (SLAB_STORE_USER enabled)
+ *	D. Padding to reach required alignment boundary or at minimum
  * 		one word if debugging is on to be able to detect writes
  * 		before the word boundary.
  *
@@ -995,10 +1045,14 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
 {
 	unsigned long off = get_info_end(s);	/* The end of info */
 
-	if (s->flags & SLAB_STORE_USER)
+	if (s->flags & SLAB_STORE_USER) {
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
 
+		if (s->flags & SLAB_KMALLOC)
+			off += sizeof(unsigned int);
+	}
+
 	off += kasan_metadata_size(s);
 
 	if (size_from_object(s) == off)
@@ -1293,7 +1347,7 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
 }
 
 static noinline int alloc_debug_processing(struct kmem_cache *s,
-					struct slab *slab, void *object)
+			struct slab *slab, void *object, int orig_size)
 {
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!alloc_consistency_checks(s, slab, object))
@@ -1302,6 +1356,7 @@ static noinline int alloc_debug_processing(struct kmem_cache *s,
 
 	/* Success. Perform special debug activities for allocs */
 	trace(s, slab, object, 1);
+	set_orig_size(s, object, orig_size);
 	init_object(s, object, SLUB_RED_ACTIVE);
 	return 1;
 
@@ -1570,7 +1625,10 @@ static inline
 void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	struct slab *slab, void *object) { return 0; }
+	struct slab *slab, void *object, int orig_size) { return 0; }
+
+static inline void set_orig_size(struct kmem_cache *s,
+	void *object, unsigned int orig_size) {}
 
 static inline void free_debug_processing(
 	struct kmem_cache *s, struct slab *slab,
@@ -1999,7 +2057,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
  * it to full list if it was the last free object.
  */
 static void *alloc_single_from_partial(struct kmem_cache *s,
-		struct kmem_cache_node *n, struct slab *slab)
+		struct kmem_cache_node *n, struct slab *slab, int orig_size)
 {
 	void *object;
 
@@ -2009,7 +2067,7 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
 	slab->freelist = get_freepointer(s, object);
 	slab->inuse++;
 
-	if (!alloc_debug_processing(s, slab, object)) {
+	if (!alloc_debug_processing(s, slab, object, orig_size)) {
 		remove_partial(n, slab);
 		return NULL;
 	}
@@ -2028,7 +2086,7 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
  * and put the slab to the partial (or full) list.
  */
 static void *alloc_single_from_new_slab(struct kmem_cache *s,
-					struct slab *slab)
+					struct slab *slab, int orig_size)
 {
 	int nid = slab_nid(slab);
 	struct kmem_cache_node *n = get_node(s, nid);
@@ -2040,7 +2098,7 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
 	slab->freelist = get_freepointer(s, object);
 	slab->inuse = 1;
 
-	if (!alloc_debug_processing(s, slab, object))
+	if (!alloc_debug_processing(s, slab, object, orig_size))
 		/*
 		 * It's not really expected that this would fail on a
 		 * freshly allocated slab, but a concurrent memory
@@ -2118,7 +2176,7 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
  * Try to allocate a partial slab from a specific node.
  */
 static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
-			      struct slab **ret_slab, gfp_t gfpflags)
+			      struct partial_context *pc)
 {
 	struct slab *slab, *slab2;
 	void *object = NULL;
@@ -2138,11 +2196,12 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		void *t;
 
-		if (!pfmemalloc_match(slab, gfpflags))
+		if (!pfmemalloc_match(slab, pc->flags))
 			continue;
 
 		if (kmem_cache_debug(s)) {
-			object = alloc_single_from_partial(s, n, slab);
+			object = alloc_single_from_partial(s, n, slab,
+							pc->orig_size);
 			if (object)
 				break;
 			continue;
@@ -2153,7 +2212,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 			break;
 
 		if (!object) {
-			*ret_slab = slab;
+			*pc->slab = slab;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
@@ -2177,14 +2236,13 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 /*
  * Get a slab from somewhere. Search in increasing NUMA distances.
  */
-static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
-			     struct slab **ret_slab)
+static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type highest_zoneidx = gfp_zone(flags);
+	enum zone_type highest_zoneidx = gfp_zone(pc->flags);
 	void *object;
 	unsigned int cpuset_mems_cookie;
 
@@ -2212,15 +2270,15 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 
 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		zonelist = node_zonelist(mempolicy_slab_node(), flags);
+		zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
 		for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
 			struct kmem_cache_node *n;
 
 			n = get_node(s, zone_to_nid(zone));
 
-			if (n && cpuset_zone_allowed(zone, flags) &&
+			if (n && cpuset_zone_allowed(zone, pc->flags) &&
 					n->nr_partial > s->min_partial) {
-				object = get_partial_node(s, n, ret_slab, flags);
+				object = get_partial_node(s, n, pc);
 				if (object) {
 					/*
 					 * Don't check read_mems_allowed_retry()
@@ -2241,8 +2299,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 /*
  * Get a partial slab, lock it and return it.
  */
-static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
-			 struct slab **ret_slab)
+static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
 {
 	void *object;
 	int searchnode = node;
@@ -2250,11 +2307,11 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 	if (node == NUMA_NO_NODE)
 		searchnode = numa_mem_id();
 
-	object = get_partial_node(s, get_node(s, searchnode), ret_slab, flags);
+	object = get_partial_node(s, get_node(s, searchnode), pc);
 	if (object || node != NUMA_NO_NODE)
 		return object;
 
-	return get_any_partial(s, flags, ret_slab);
+	return get_any_partial(s, pc);
 }
 
 #ifdef CONFIG_PREEMPTION
@@ -2974,11 +3031,12 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
  * already disabled (which is the case for bulk allocation).
  */
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
 	void *freelist;
 	struct slab *slab;
 	unsigned long flags;
+	struct partial_context pc;
 
 	stat(s, ALLOC_SLOWPATH);
 
@@ -3092,7 +3150,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 new_objects:
 
-	freelist = get_partial(s, gfpflags, node, &slab);
+	pc.flags = gfpflags;
+	pc.slab = &slab;
+	pc.orig_size = orig_size;
+	freelist = get_partial(s, node, &pc);
 	if (freelist)
 		goto check_new_slab;
 
@@ -3108,7 +3169,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	stat(s, ALLOC_SLAB);
 
 	if (kmem_cache_debug(s)) {
-		freelist = alloc_single_from_new_slab(s, slab);
+		freelist = alloc_single_from_new_slab(s, slab, orig_size);
 
 		if (unlikely(!freelist))
 			goto new_objects;
@@ -3140,6 +3201,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		 */
 		if (s->flags & SLAB_STORE_USER)
 			set_track(s, freelist, TRACK_ALLOC, addr);
+
 		return freelist;
 	}
 
@@ -3182,7 +3244,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
  * pointer.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
 	void *p;
 
@@ -3195,7 +3257,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	c = slub_get_cpu_ptr(s->cpu_slab);
 #endif
 
-	p = ___slab_alloc(s, gfpflags, node, addr, c);
+	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
 #ifdef CONFIG_PREEMPT_COUNT
 	slub_put_cpu_ptr(s->cpu_slab);
 #endif
@@ -3280,7 +3342,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
 
 	if (!USE_LOCKLESS_FAST_PATH() ||
 	    unlikely(!object || !slab || !node_match(slab, node))) {
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
 
@@ -3747,7 +3809,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			 * of re-populating per CPU c->freelist
 			 */
 			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
-					    _RET_IP_, c);
+					    _RET_IP_, c, s->object_size);
 			if (unlikely(!p[i]))
 				goto error;
 
@@ -4150,12 +4212,17 @@ static int calculate_sizes(struct kmem_cache *s)
 	}
 
 #ifdef CONFIG_SLUB_DEBUG
-	if (flags & SLAB_STORE_USER)
+	if (flags & SLAB_STORE_USER) {
 		/*
 		 * Need to store information about allocs and frees after
 		 * the object.
 		 */
 		size += 2 * sizeof(struct track);
+
+		/* Save the original kmalloc request size */
+		if (flags & SLAB_KMALLOC)
+			size += sizeof(unsigned int);
+	}
 #endif
 
 	kasan_cache_create(s, &size, &s->flags);
@@ -4770,7 +4837,7 @@ void __init kmem_cache_init(void)
 
 	/* Now we can use the kmem_cache to allocate kmalloc slabs */
 	setup_kmalloc_cache_index_table();
-	create_kmalloc_caches(0);
+	create_kmalloc_caches(SLAB_KMALLOC);
 
 	/* Setup random freelists for each cache */
 	init_freelist_randomization();
@@ -4937,6 +5004,7 @@ struct location {
 	depot_stack_handle_t handle;
 	unsigned long count;
 	unsigned long addr;
+	unsigned long waste;
 	long long sum_time;
 	long min_time;
 	long max_time;
@@ -4983,13 +5051,15 @@ static int alloc_loc_track(struct loc_track *t, unsigned long max, gfp_t flags)
 }
 
 static int add_location(struct loc_track *t, struct kmem_cache *s,
-				const struct track *track)
+				const struct track *track,
+				unsigned int orig_size)
 {
 	long start, end, pos;
 	struct location *l;
-	unsigned long caddr, chandle;
+	unsigned long caddr, chandle, cwaste;
 	unsigned long age = jiffies - track->when;
 	depot_stack_handle_t handle = 0;
+	unsigned int waste = s->object_size - orig_size;
 
 #ifdef CONFIG_STACKDEPOT
 	handle = READ_ONCE(track->handle);
@@ -5007,11 +5077,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 		if (pos == end)
 			break;
 
-		caddr = t->loc[pos].addr;
-		chandle = t->loc[pos].handle;
-		if ((track->addr == caddr) && (handle == chandle)) {
+		l = &t->loc[pos];
+		caddr = l->addr;
+		chandle = l->handle;
+		cwaste = l->waste;
+		if ((track->addr == caddr) && (handle == chandle) &&
+			(waste == cwaste)) {
 
-			l = &t->loc[pos];
 			l->count++;
 			if (track->when) {
 				l->sum_time += age;
@@ -5036,6 +5108,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 			end = pos;
 		else if (track->addr == caddr && handle < chandle)
 			end = pos;
+		else if (track->addr == caddr && handle == chandle &&
+				waste < cwaste)
+			end = pos;
 		else
 			start = pos;
 	}
@@ -5059,6 +5134,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 	l->min_pid = track->pid;
 	l->max_pid = track->pid;
 	l->handle = handle;
+	l->waste = waste;
 	cpumask_clear(to_cpumask(l->cpus));
 	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 	nodes_clear(l->nodes);
@@ -5077,7 +5153,7 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s,
 
 	for_each_object(p, s, addr, slab->objects)
 		if (!test_bit(__obj_to_index(s, addr, p), obj_map))
-			add_location(t, s, get_track(s, p, alloc));
+			add_location(t, s, get_track(s, p, alloc), get_orig_size(s, p));
 }
 #endif  /* CONFIG_DEBUG_FS   */
 #endif	/* CONFIG_SLUB_DEBUG */
@@ -5942,6 +6018,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
 		else
 			seq_puts(seq, "<not-available>");
 
+		if (l->waste)
+			seq_printf(seq, " waste=%lu/%lu",
+				l->count * l->waste, l->waste);
+
 		if (l->sum_time != l->min_time) {
 			seq_printf(seq, " age=%ld/%llu/%ld",
 				l->min_time, div_u64(l->sum_time, l->count),
-- 
2.34.1




* [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc
  2022-09-07  7:10 [PATCH v5 0/4] mm/slub: some debug enhancements for kmalloc Feng Tang
  2022-09-07  7:10 ` [PATCH v5 1/4] mm/slub: enable debugging memory wasting of kmalloc Feng Tang
@ 2022-09-07  7:10 ` Feng Tang
  2022-09-07 14:57   ` Hyeonggon Yoo
  2022-09-10 23:11   ` Andrey Konovalov
  2022-09-07  7:10 ` [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache Feng Tang
  2022-09-07  7:10 ` [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested Feng Tang
  3 siblings, 2 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-07  7:10 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet
  Cc: Dave Hansen, linux-mm, linux-kernel, kasan-dev, Feng Tang

kzalloc/kmalloc will round up the request size to a fixed size (mostly
a power of 2), so the allocated memory could be more than requested.
Currently the kzalloc family of APIs zeroes all of the allocated memory.

To detect out-of-bounds usage of the extra allocated memory, only zero
the requested part, so that a sanity check can be added for the extra
space later.

For kzalloc users who will call ksize() later and utilize this extra
space, please be aware that the space is no longer zeroed.
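
A rough sketch of the affected pattern (sizes are illustrative):

	buf = kzalloc(60, GFP_KERNEL);  /* served from kmalloc-64 */
	sz = ksize(buf);                /* 64: caller may use bytes [60, 64) */
	/* after this patch, those extra 4 bytes are no longer pre-zeroed */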

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/slab.c | 6 +++---
 mm/slab.h | 9 +++++++--
 mm/slub.c | 6 +++---
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index a5486ff8362a..73ecaa7066e1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3253,7 +3253,7 @@ slab_alloc_node(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags,
 	init = slab_want_init_on_alloc(flags, cachep);
 
 out:
-	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init, 0);
 	return objp;
 }
 
@@ -3506,13 +3506,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	 * Done outside of the IRQ disabled section.
 	 */
 	slab_post_alloc_hook(s, objcg, flags, size, p,
-				slab_want_init_on_alloc(flags, s));
+				slab_want_init_on_alloc(flags, s), 0);
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
 	local_irq_enable();
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
-	slab_post_alloc_hook(s, objcg, flags, i, p, false);
+	slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
 	kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
diff --git a/mm/slab.h b/mm/slab.h
index d0ef9dd44b71..20f9e2a9814f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -730,12 +730,17 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					struct obj_cgroup *objcg, gfp_t flags,
-					size_t size, void **p, bool init)
+					size_t size, void **p, bool init,
+					unsigned int orig_size)
 {
 	size_t i;
 
 	flags &= gfp_allowed_mask;
 
+	/* If the original (kmalloc) request size is not set, use object_size */
+	if (!orig_size)
+		orig_size = s->object_size;
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * kasan_slab_alloc and initialization memset must be
@@ -746,7 +751,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 	for (i = 0; i < size; i++) {
 		p[i] = kasan_slab_alloc(s, p[i], flags, init);
 		if (p[i] && init && !kasan_has_integrated_init())
-			memset(p[i], 0, s->object_size);
+			memset(p[i], 0, orig_size);
 		kmemleak_alloc_recursive(p[i], s->object_size, 1,
 					 s->flags, flags);
 	}
diff --git a/mm/slub.c b/mm/slub.c
index effd994438e6..f523601d3fcf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3376,7 +3376,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
 	init = slab_want_init_on_alloc(gfpflags, s);
 
 out:
-	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
+	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size);
 
 	return object;
 }
@@ -3833,11 +3833,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	 * Done outside of the IRQ disabled fastpath loop.
 	 */
 	slab_post_alloc_hook(s, objcg, flags, size, p,
-				slab_want_init_on_alloc(flags, s));
+				slab_want_init_on_alloc(flags, s), 0);
 	return i;
 error:
 	slub_put_cpu_ptr(s->cpu_slab);
-	slab_post_alloc_hook(s, objcg, flags, i, p, false);
+	slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
 	kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
-- 
2.34.1




* [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache
  2022-09-07  7:10 [PATCH v5 0/4] mm/slub: some debug enhancements for kmalloc Feng Tang
  2022-09-07  7:10 ` [PATCH v5 1/4] mm/slub: enable debugging memory wasting of kmalloc Feng Tang
  2022-09-07  7:10 ` [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc Feng Tang
@ 2022-09-07  7:10 ` Feng Tang
  2022-09-10 23:14   ` Andrey Konovalov
  2022-09-07  7:10 ` [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested Feng Tang
  3 siblings, 1 reply; 19+ messages in thread
From: Feng Tang @ 2022-09-07  7:10 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet
  Cc: Dave Hansen, linux-mm, linux-kernel, kasan-dev, Feng Tang,
	kernel test robot

When kasan is enabled for slab/slub, it may save kasan's free_meta
data in the first part of the slab object data area on the object
free path, which works fine.

There is an ongoing effort to extend slub's debug functionality to
redzone the latter part of the kmalloc object area, and when both
debug features are enabled there is a possible conflict, especially
when the kmalloc object is small, as caught by the 0Day bot [1].

To give slab/slub better information, add free_meta's data size to
'struct kasan_cache', so that its users can take the right action
to avoid the data conflict.
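
A rough sketch of the concern, for a small kmalloc object (offsets are
illustrative):

	|<------------- object_size ------------->|
	| kasan free meta (saved on free) ...     |
	|               | kmalloc redzone         |  <- [orig_size, object_size)

If free_meta is large relative to the object, the two regions can
overlap, so the redzone logic needs to know how much of the object
free_meta may occupy.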

[1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
---
 include/linux/kasan.h | 2 ++
 mm/kasan/common.c     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index b092277bf48d..293bdaa0ba09 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -100,6 +100,8 @@ static inline bool kasan_has_integrated_init(void)
 struct kasan_cache {
 	int alloc_meta_offset;
 	int free_meta_offset;
+	/* size of free_meta data saved in object's data area */
+	int free_meta_size_in_object;
 	bool is_kmalloc;
 };
 
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 69f583855c8b..762ae7a7793e 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -201,6 +201,8 @@ void __kasan_cache_create(struct kmem_cache *cache, unsigned int *size,
 			cache->kasan_info.free_meta_offset = KASAN_NO_FREE_META;
 			*size = ok_size;
 		}
+	} else {
+		cache->kasan_info.free_meta_size_in_object = sizeof(struct kasan_free_meta);
 	}
 
 	/* Calculate size with optimal redzone. */
-- 
2.34.1




* [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested
  2022-09-07  7:10 [PATCH v5 0/4] mm/slub: some debug enhancements for kmalloc Feng Tang
                   ` (2 preceding siblings ...)
  2022-09-07  7:10 ` [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache Feng Tang
@ 2022-09-07  7:10 ` Feng Tang
  2022-09-09  6:26   ` Hyeonggon Yoo
  2022-09-10 23:12   ` Andrey Konovalov
  3 siblings, 2 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-07  7:10 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet
  Cc: Dave Hansen, linux-mm, linux-kernel, kasan-dev, Feng Tang

kmalloc will round up the request size to a fixed size (mostly a power
of 2), so there can be extra space beyond what is requested, whose size
is the actual buffer size minus the original request size.

To better detect out-of-bounds access or abuse of this space, add a
redzone sanity check for it.

In the current kernel, some kmalloc users already know about the
existence of this space and utilize it after calling 'ksize()' to learn
the real size of the allocated buffer. So skip the sanity check for
objects on which ksize() has been called, treating them as legitimate
users.
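
A rough illustration of what the extended check catches (sizes are
hypothetical, and both redzone and user tracking debug must be enabled):

	buf = kmalloc(10, GFP_KERNEL);  /* rounded up to a 16-byte object */
	((char *)buf)[12] = 0xee;       /* beyond the 10 requested bytes */
	kfree(buf);                     /* reported as a corrupted kmalloc redzone */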

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/slab.h        |  4 ++++
 mm/slab_common.c |  4 ++++
 mm/slub.c        | 57 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 20f9e2a9814f..0bc91b30b031 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -885,4 +885,8 @@ void __check_heap_object(const void *ptr, unsigned long n,
 }
 #endif
 
+#ifdef CONFIG_SLUB_DEBUG
+void skip_orig_size_check(struct kmem_cache *s, const void *object);
+#endif
+
 #endif /* MM_SLAB_H */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8e13e3aac53f..5106667d6adb 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1001,6 +1001,10 @@ size_t __ksize(const void *object)
 		return folio_size(folio);
 	}
 
+#ifdef CONFIG_SLUB_DEBUG
+	skip_orig_size_check(folio_slab(folio)->slab_cache, object);
+#endif
+
 	return slab_ksize(folio_slab(folio)->slab_cache);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index f523601d3fcf..2f0302136604 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -812,12 +812,27 @@ static inline void set_orig_size(struct kmem_cache *s,
 	if (!slub_debug_orig_size(s))
 		return;
 
+#ifdef CONFIG_KASAN_GENERIC
+	/*
+	 * KASAN could save its free meta data in the start part of the object
+	 * area, so skip the redzone check if kasan's meta data size is big
+	 * enough to possibly overlap with the kmalloc redzone
+	 */
+	if (s->kasan_info.free_meta_size_in_object * 2 >= s->object_size)
+		orig_size = s->object_size;
+#endif
+
 	p += get_info_end(s);
 	p += sizeof(struct track) * 2;
 
 	*(unsigned int *)p = orig_size;
 }
 
+void skip_orig_size_check(struct kmem_cache *s, const void *object)
+{
+	set_orig_size(s, (void *)object, s->object_size);
+}
+
 static unsigned int get_orig_size(struct kmem_cache *s, void *object)
 {
 	void *p = kasan_reset_tag(object);
@@ -949,13 +964,34 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
 static void init_object(struct kmem_cache *s, void *object, u8 val)
 {
 	u8 *p = kasan_reset_tag(object);
+	unsigned int orig_size = s->object_size;
 
-	if (s->flags & SLAB_RED_ZONE)
+	if (s->flags & SLAB_RED_ZONE) {
 		memset(p - s->red_left_pad, val, s->red_left_pad);
 
+		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
+			unsigned int zone_start;
+
+			orig_size = get_orig_size(s, object);
+			zone_start = orig_size;
+
+			if (!freeptr_outside_object(s))
+				zone_start = max_t(unsigned int, orig_size,
+						s->offset + sizeof(void *));
+
+			/*
+			 * Redzone the extra space allocated by kmalloc
+			 * beyond what was requested.
+			 */
+			if (zone_start < s->object_size)
+				memset(p + zone_start, val,
+					s->object_size - zone_start);
+		}
+	}
+
 	if (s->flags & __OBJECT_POISON) {
-		memset(p, POISON_FREE, s->object_size - 1);
-		p[s->object_size - 1] = POISON_END;
+		memset(p, POISON_FREE, orig_size - 1);
+		p[orig_size - 1] = POISON_END;
 	}
 
 	if (s->flags & SLAB_RED_ZONE)
@@ -1103,6 +1139,7 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
 {
 	u8 *p = object;
 	u8 *endobject = object + s->object_size;
+	unsigned int orig_size;
 
 	if (s->flags & SLAB_RED_ZONE) {
 		if (!check_bytes_and_report(s, slab, object, "Left Redzone",
@@ -1112,6 +1149,20 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
 		if (!check_bytes_and_report(s, slab, object, "Right Redzone",
 			endobject, val, s->inuse - s->object_size))
 			return 0;
+
+		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
+			orig_size = get_orig_size(s, object);
+
+			if (!freeptr_outside_object(s))
+				orig_size = max_t(unsigned int, orig_size,
+						s->offset + sizeof(void *));
+			if (s->object_size > orig_size  &&
+				!check_bytes_and_report(s, slab, object,
+					"kmalloc Redzone", p + orig_size,
+					val, s->object_size - orig_size)) {
+				return 0;
+			}
+		}
 	} else {
 		if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
 			check_bytes_and_report(s, slab, p, "Alignment padding",
-- 
2.34.1




* Re: [PATCH v5 1/4] mm/slub: enable debugging memory wasting of kmalloc
  2022-09-07  7:10 ` [PATCH v5 1/4] mm/slub: enable debugging memory wasting of kmalloc Feng Tang
@ 2022-09-07 14:17   ` Hyeonggon Yoo
  2022-09-08  2:25     ` Feng Tang
  0 siblings, 1 reply; 19+ messages in thread
From: Hyeonggon Yoo @ 2022-09-07 14:17 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Dmitry Vyukov,
	Jonathan Corbet, Dave Hansen, linux-mm, linux-kernel, kasan-dev,
	Robin Murphy, John Garry, Kefeng Wang

On Wed, Sep 07, 2022 at 03:10:20PM +0800, Feng Tang wrote:
> kmalloc's API family is critical for mm, with one nature that it will
> round up the request size to a fixed one (mostly power of 2). Say
> when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
> could be allocated, so in worst case, there is around 50% memory
> space waste.
> 
> The wastage is not a big issue for requests that get allocated/freed
> quickly, but may cause problems with objects that have longer life
> time.
> 
> We've met a kernel boot OOM panic (v5.10), and from the dumped slab
> info:
> 
>     [   26.062145] kmalloc-2k            814056KB     814056KB
> 
> >From debug we found there are huge number of 'struct iova_magazine',
> whose size is 1032 bytes (1024 + 8), so each allocation will waste
> 1016 bytes. Though the issue was solved by giving the right (bigger)
> size of RAM, it is still nice to optimize the size (either use a
> kmalloc friendly size or create a dedicated slab for it).
> 
> And from lkml archive, there was another crash kernel OOM case [1]
> back in 2019, which seems to be related with the similar slab waste
> situation, as the log is similar:
> 
>     [    4.332648] iommu: Adding device 0000:20:02.0 to group 16
>     [    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
>     ...
>     [    4.857565] kmalloc-2048           59164KB      59164KB
> 
> The crash kernel only has 256M memory, and 59M is pretty big here.
> (Note: the related code has been changed and optimised in recent
> kernel [2], these logs are just picked to demo the problem, also
> a patch changing its size to 1024 bytes has been merged)
> 
> So add an way to track each kmalloc's memory waste info, and
> leverage the existing SLUB debug framework (specifically
> SLUB_STORE_USER) to show its call stack of original allocation,
> so that user can evaluate the waste situation, identify some hot
> spots and optimize accordingly, for a better utilization of memory.
> 
> The waste info is integrated into existing interface:
> '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
> 'kmalloc-4k' after boot is:
> 
>  126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
>      __kmem_cache_alloc_node+0x11f/0x4e0
>      __kmalloc_node+0x4e/0x140
>      ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
>      ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
>      ixgbe_probe+0x165f/0x1d20 [ixgbe]
>      local_pci_probe+0x78/0xc0
>      work_for_cpu_fn+0x26/0x40
>      ...
> 
> which means in 'kmalloc-4k' slab, there are 126 requests of
> 2240 bytes which got a 4KB space (wasting 1856 bytes each
> and 233856 bytes in total), from ixgbe_alloc_q_vector().
> 
> And when system starts some real workload like multiple docker
> instances, there could are more severe waste.
> 
> [1]. https://lkml.org/lkml/2019/8/12/266
> [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/
> 
> [Thanks Hyeonggon for pointing out several bugs about sorting/format]
> [Thanks Vlastimil for suggesting way to reduce memory usage of
>  orig_size and keep it only for kmalloc objects]
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  Documentation/mm/slub.rst |  33 +++++---
>  include/linux/slab.h      |   2 +
>  mm/slub.c                 | 156 ++++++++++++++++++++++++++++----------
>  3 files changed, 141 insertions(+), 50 deletions(-)
> 

Looks good to me.
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

> diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst

[...]

> +/* Structure holding parameters for get_partial() call chain */
> +struct partial_context {
> +	struct slab **slab;
> +	gfp_t flags;
> +	int orig_size;

Nit: unsigned int orig_size

Thanks!

> +};
> +

[...]

-- 
Thanks,
Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc
  2022-09-07  7:10 ` [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc Feng Tang
@ 2022-09-07 14:57   ` Hyeonggon Yoo
  2022-09-08  7:38     ` Feng Tang
  2022-09-10 23:11   ` Andrey Konovalov
  1 sibling, 1 reply; 19+ messages in thread
From: Hyeonggon Yoo @ 2022-09-07 14:57 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Dmitry Vyukov,
	Jonathan Corbet, Dave Hansen, linux-mm, linux-kernel, kasan-dev

On Wed, Sep 07, 2022 at 03:10:21PM +0800, Feng Tang wrote:
> kzalloc/kmalloc will round up the request size to a fixed size
> (mostly power of 2), so the allocated memory could be more than
> requested. Currently kzalloc family APIs will zero all the
> allocated memory.
> 
> To detect out-of-bound usage of the extra allocated memory, only
> zero the requested part, so that sanity check could be added to
> the extra space later.
> 
> For kzalloc users who will call ksize() later and utilize this
> extra space, please be aware that the space is not zeroed any
> more.

Can this break existing users?
Or should we initialize the extra bytes to zero when someone calls ksize()?

If it is not going to break anything, I think we should add a comment about this,
something like "... kzalloc() will initialize to zero only for @size bytes ..."
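e.g. something along these lines in kzalloc()'s kerneldoc in
include/linux/slab.h (a rough sketch of the wording only; the surrounding
kerneldoc lines are just my recollection, not a final patch):

	/**
	 * kzalloc - allocate memory. The memory is set to zero.
	 * @size: how many bytes of memory are required.
	 * @flags: the type of memory to allocate (see kmalloc).
	 *
	 * Note: only the first @size bytes are guaranteed to be zeroed;
	 * any extra space that kmalloc() rounds the allocation up to is
	 * no longer zeroed.
	 */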

> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/slab.c | 6 +++---
>  mm/slab.h | 9 +++++++--
>  mm/slub.c | 6 +++---
>  3 files changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index a5486ff8362a..73ecaa7066e1 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3253,7 +3253,7 @@ slab_alloc_node(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags,
>  	init = slab_want_init_on_alloc(flags, cachep);
>  
>  out:
> -	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
> +	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init, 0);
>  	return objp;
>  }
>  
> @@ -3506,13 +3506,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>  	 * Done outside of the IRQ disabled section.
>  	 */
>  	slab_post_alloc_hook(s, objcg, flags, size, p,
> -				slab_want_init_on_alloc(flags, s));
> +				slab_want_init_on_alloc(flags, s), 0);
>  	/* FIXME: Trace call missing. Christoph would like a bulk variant */
>  	return size;
>  error:
>  	local_irq_enable();
>  	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
> -	slab_post_alloc_hook(s, objcg, flags, i, p, false);
> +	slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
>  	kmem_cache_free_bulk(s, i, p);
>  	return 0;
>  }
> diff --git a/mm/slab.h b/mm/slab.h
> index d0ef9dd44b71..20f9e2a9814f 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -730,12 +730,17 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
>  
>  static inline void slab_post_alloc_hook(struct kmem_cache *s,
>  					struct obj_cgroup *objcg, gfp_t flags,
> -					size_t size, void **p, bool init)
> +					size_t size, void **p, bool init,
> +					unsigned int orig_size)
>  {
>  	size_t i;
>  
>  	flags &= gfp_allowed_mask;
>  
> +	/* If original request size(kmalloc) is not set, use object_size */
> +	if (!orig_size)
> +		orig_size = s->object_size;

I think it is more readable to pass s->object_size than zero

> +
>  	/*
>  	 * As memory initialization might be integrated into KASAN,
>  	 * kasan_slab_alloc and initialization memset must be
> @@ -746,7 +751,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
>  	for (i = 0; i < size; i++) {
>  		p[i] = kasan_slab_alloc(s, p[i], flags, init);
>  		if (p[i] && init && !kasan_has_integrated_init())
> -			memset(p[i], 0, s->object_size);
> +			memset(p[i], 0, orig_size);
>  		kmemleak_alloc_recursive(p[i], s->object_size, 1,
>  					 s->flags, flags);
>  	}
> diff --git a/mm/slub.c b/mm/slub.c
> index effd994438e6..f523601d3fcf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3376,7 +3376,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
>  	init = slab_want_init_on_alloc(gfpflags, s);
>  
>  out:
> -	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
> +	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size);
>  
>  	return object;
>  }
> @@ -3833,11 +3833,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>  	 * Done outside of the IRQ disabled fastpath loop.
>  	 */
>  	slab_post_alloc_hook(s, objcg, flags, size, p,
> -				slab_want_init_on_alloc(flags, s));
> +				slab_want_init_on_alloc(flags, s), 0);
>  	return i;
>  error:
>  	slub_put_cpu_ptr(s->cpu_slab);
> -	slab_post_alloc_hook(s, objcg, flags, i, p, false);
> +	slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
>  	kmem_cache_free_bulk(s, i, p);

>  	return 0;
>  }
> -- 
> 2.34.1
>

-- 
Thanks,
Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] mm/slub: enable debugging memory wasting of kmalloc
  2022-09-07 14:17   ` Hyeonggon Yoo
@ 2022-09-08  2:25     ` Feng Tang
  0 siblings, 0 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-08  2:25 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Dmitry Vyukov,
	Jonathan Corbet, Hansen, Dave, linux-mm, linux-kernel, kasan-dev,
	Robin Murphy, John Garry, Kefeng Wang

On Wed, Sep 07, 2022 at 10:17:22PM +0800, Hyeonggon Yoo wrote:
> On Wed, Sep 07, 2022 at 03:10:20PM +0800, Feng Tang wrote:
> > kmalloc's API family is critical for mm, with one nature that it will
> > round up the request size to a fixed one (mostly power of 2). Say
> > when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
> > could be allocated, so in worst case, there is around 50% memory
> > space waste.
> > 
> > The wastage is not a big issue for requests that get allocated/freed
> > quickly, but may cause problems with objects that have longer life
> > time.
> > 
> > We've met a kernel boot OOM panic (v5.10), and from the dumped slab
> > info:
> > 
> >     [   26.062145] kmalloc-2k            814056KB     814056KB
> > 
> > From debug we found there are huge number of 'struct iova_magazine',
> > whose size is 1032 bytes (1024 + 8), so each allocation will waste
> > 1016 bytes. Though the issue was solved by giving the right (bigger)
> > size of RAM, it is still nice to optimize the size (either use a
> > kmalloc friendly size or create a dedicated slab for it).
> > 
> > And from lkml archive, there was another crash kernel OOM case [1]
> > back in 2019, which seems to be related with the similar slab waste
> > situation, as the log is similar:
> > 
> >     [    4.332648] iommu: Adding device 0000:20:02.0 to group 16
> >     [    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
> >     ...
> >     [    4.857565] kmalloc-2048           59164KB      59164KB
> > 
> > The crash kernel only has 256M memory, and 59M is pretty big here.
> > (Note: the related code has been changed and optimised in recent
> > kernel [2], these logs are just picked to demo the problem, also
> > a patch changing its size to 1024 bytes has been merged)
> > 
> > So add a way to track each kmalloc's memory waste info, and
> > leverage the existing SLUB debug framework (specifically
> > SLUB_STORE_USER) to show its call stack of original allocation,
> > so that user can evaluate the waste situation, identify some hot
> > spots and optimize accordingly, for a better utilization of memory.
> > 
> > The waste info is integrated into existing interface:
> > '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
> > 'kmalloc-4k' after boot is:
> > 
> >  126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
> >      __kmem_cache_alloc_node+0x11f/0x4e0
> >      __kmalloc_node+0x4e/0x140
> >      ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
> >      ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
> >      ixgbe_probe+0x165f/0x1d20 [ixgbe]
> >      local_pci_probe+0x78/0xc0
> >      work_for_cpu_fn+0x26/0x40
> >      ...
> > 
> > which means in 'kmalloc-4k' slab, there are 126 requests of
> > 2240 bytes which got a 4KB space (wasting 1856 bytes each
> > and 233856 bytes in total), from ixgbe_alloc_q_vector().
> > 
> > And when the system starts some real workload like multiple docker
> > instances, there could be more severe waste.
> > 
> > [1]. https://lkml.org/lkml/2019/8/12/266
> > [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/
> > 
> > [Thanks Hyeonggon for pointing out several bugs about sorting/format]
> > [Thanks Vlastimil for suggesting way to reduce memory usage of
> >  orig_size and keep it only for kmalloc objects]
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: John Garry <john.garry@huawei.com>
> > Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> > ---
> >  Documentation/mm/slub.rst |  33 +++++---
> >  include/linux/slab.h      |   2 +
> >  mm/slub.c                 | 156 ++++++++++++++++++++++++++++----------
> >  3 files changed, 141 insertions(+), 50 deletions(-)
> > 
> 
> Looks good to me.
> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
 
Thank you!

> > diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
> 
> [...]
> 
> > +/* Structure holding parameters for get_partial() call chain */
> > +struct partial_context {
> > +	struct slab **slab;
> > +	gfp_t flags;
> > +	int orig_size;
> 
> Nit: unsigned int orig_size
 
Yes, will change. 'unsigned int' is more consistent with the orig_size saved
in the metadata and with the other members (size/object_size/inuse/offset)
of kmem_cache.

Thanks,
Feng

> Thanks!
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc
  2022-09-07 14:57   ` Hyeonggon Yoo
@ 2022-09-08  7:38     ` Feng Tang
  0 siblings, 0 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-08  7:38 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Dmitry Vyukov,
	Jonathan Corbet, Hansen, Dave, linux-mm, linux-kernel, kasan-dev

On Wed, Sep 07, 2022 at 10:57:34PM +0800, Hyeonggon Yoo wrote:
> On Wed, Sep 07, 2022 at 03:10:21PM +0800, Feng Tang wrote:
> > kzalloc/kmalloc will round up the request size to a fixed size
> > (mostly power of 2), so the allocated memory could be more than
> > requested. Currently kzalloc family APIs will zero all the
> > allocated memory.
> > 
> > To detect out-of-bound usage of the extra allocated memory, only
> > zero the requested part, so that sanity check could be added to
> > the extra space later.
> > 
> > For kzalloc users who will call ksize() later and utilize this
> > extra space, please be aware that the space is not zeroed any
> > more.
> 
> Can this break existing users?
> Or should we initialize the extra bytes to zero when someone calls ksize()?

Good point!

As kmalloc caches' sizes are not strictly powers of 2, the logical
usage for users is to call ksize() first to know the actual size.

I did a grep for both "xxzalloc" + "ksize" with the cmd:

#git-grep " ksize(" | cut -f 1 -d':' | xargs grep zalloc | cut -f 1 -d':' | sort  -u

and got:

	arch/x86/kernel/cpu/microcode/amd.c
	drivers/base/devres.c
	drivers/net/ethernet/intel/igb/igb_main.c
	drivers/net/wireless/intel/iwlwifi/mvm/scan.c
	fs/btrfs/send.c
	include/linux/slab.h
	lib/test_kasan.c
	mm/mempool.c
	mm/nommu.c
	mm/slab_common.c
	security/tomoyo/memory.c

I roughly went through these files and haven't found obvious breakage
regarding data zeroing (I could have missed something).
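
The pattern I was checking for looks roughly like this (illustrative
only; 'struct foo' and the error handling are made up):

	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);
	size_t real_size;

	if (!f)
		return -ENOMEM;

	/* actual object size, >= sizeof(*f) due to kmalloc rounding */
	real_size = ksize(f);

	/*
	 * With this patch only the first sizeof(*f) bytes are zeroed,
	 * so code relying on [sizeof(*f), real_size) being zero would
	 * now have to clear that range itself.
	 */
	if (real_size > sizeof(*f))
		memset((char *)f + sizeof(*f), 0, real_size - sizeof(*f));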

Also, these patches have been in a tree monitored by 0Day, and some basic
sanity tests should have been run with 0Day's help; no problem with
this patch has shown up so far (one KASAN-related problem was found
though, see patch 3/4).

And in the worst case, if a problem does show up, we can fix it quickly.


> If it is not going to break anything, I think we should add a comment about this,
> something like "... kzalloc() will initialize to zero only for @size bytes ..."
 
Agree, this is necessary. 

> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  mm/slab.c | 6 +++---
> >  mm/slab.h | 9 +++++++--
> >  mm/slub.c | 6 +++---
> >  3 files changed, 13 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/slab.c b/mm/slab.c
> > index a5486ff8362a..73ecaa7066e1 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -3253,7 +3253,7 @@ slab_alloc_node(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags,
> >  	init = slab_want_init_on_alloc(flags, cachep);
> >  
> >  out:
> > -	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
> > +	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init, 0);
> >  	return objp;
> >  }
> >  
> > @@ -3506,13 +3506,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> >  	 * Done outside of the IRQ disabled section.
> >  	 */
> >  	slab_post_alloc_hook(s, objcg, flags, size, p,
> > -				slab_want_init_on_alloc(flags, s));
> > +				slab_want_init_on_alloc(flags, s), 0);
> >  	/* FIXME: Trace call missing. Christoph would like a bulk variant */
> >  	return size;
> >  error:
> >  	local_irq_enable();
> >  	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
> > -	slab_post_alloc_hook(s, objcg, flags, i, p, false);
> > +	slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
> >  	kmem_cache_free_bulk(s, i, p);
> >  	return 0;
> >  }
> > diff --git a/mm/slab.h b/mm/slab.h
> > index d0ef9dd44b71..20f9e2a9814f 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -730,12 +730,17 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> >  
> >  static inline void slab_post_alloc_hook(struct kmem_cache *s,
> >  					struct obj_cgroup *objcg, gfp_t flags,
> > -					size_t size, void **p, bool init)
> > +					size_t size, void **p, bool init,
> > +					unsigned int orig_size)
> >  {
> >  	size_t i;
> >  
> >  	flags &= gfp_allowed_mask;
> >  
> > +	/* If original request size(kmalloc) is not set, use object_size */
> > +	if (!orig_size)
> > +		orig_size = s->object_size;
> 
> I think it is more readable to pass s->object_size than zero

OK, will change. 

Thanks,
Feng




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested
  2022-09-07  7:10 ` [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested Feng Tang
@ 2022-09-09  6:26   ` Hyeonggon Yoo
  2022-09-09  7:33     ` Feng Tang
  2022-09-10 23:12   ` Andrey Konovalov
  1 sibling, 1 reply; 19+ messages in thread
From: Hyeonggon Yoo @ 2022-09-09  6:26 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Dmitry Vyukov,
	Jonathan Corbet, Dave Hansen, linux-mm, linux-kernel, kasan-dev

On Wed, Sep 07, 2022 at 03:10:23PM +0800, Feng Tang wrote:
> kmalloc will round up the request size to a fixed size (mostly power
> of 2), so there could be a extra space than what is requested, whose
> size is the actual buffer size minus original request size.
> 
> To better detect out of bound access or abuse of this space, add
> redzone sanity check for it.
> 
> And in current kernel, some kmalloc user already knows the existence
> of the space and utilizes it after calling 'ksize()' to know the real
> size of the allocated buffer. So we skip the sanity check for objects
> which have been called with ksize(), as treating them as legitimate
> users.
> 
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/slab.h        |  4 ++++
>  mm/slab_common.c |  4 ++++
>  mm/slub.c        | 57 +++++++++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 62 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 20f9e2a9814f..0bc91b30b031 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -885,4 +885,8 @@ void __check_heap_object(const void *ptr, unsigned long n,
>  }
>  #endif
>  
> +#ifdef CONFIG_SLUB_DEBUG
> +void skip_orig_size_check(struct kmem_cache *s, const void *object);
> +#endif
> +
>  #endif /* MM_SLAB_H */
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 8e13e3aac53f..5106667d6adb 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1001,6 +1001,10 @@ size_t __ksize(const void *object)
>  		return folio_size(folio);
>  	}
>  
> +#ifdef CONFIG_SLUB_DEBUG
> +	skip_orig_size_check(folio_slab(folio)->slab_cache, object);
> +#endif
> +
>  	return slab_ksize(folio_slab(folio)->slab_cache);
>  }
>  
> diff --git a/mm/slub.c b/mm/slub.c
> index f523601d3fcf..2f0302136604 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -812,12 +812,27 @@ static inline void set_orig_size(struct kmem_cache *s,
>  	if (!slub_debug_orig_size(s))
>  		return;
>  
> +#ifdef CONFIG_KASAN_GENERIC
> +	/*
> +	 * KASAN could save its free meta data in the start part of object
> +	 * area, so skip the redzone check if kasan's meta data size is
> +	 * bigger enough to possibly overlap with kmalloc redzone
> +	 */
> +	if (s->kasan_info.free_meta_size_in_object * 2 >= s->object_size)
> +		orig_size = s->object_size;
> +#endif
> +
>  	p += get_info_end(s);
>  	p += sizeof(struct track) * 2;
>  
>  	*(unsigned int *)p = orig_size;
>  }
>  
> +void skip_orig_size_check(struct kmem_cache *s, const void *object)
> +{
> +	set_orig_size(s, (void *)object, s->object_size);
> +}
> +
>  static unsigned int get_orig_size(struct kmem_cache *s, void *object)
>  {
>  	void *p = kasan_reset_tag(object);
> @@ -949,13 +964,34 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
>  static void init_object(struct kmem_cache *s, void *object, u8 val)
>  {
>  	u8 *p = kasan_reset_tag(object);
> +	unsigned int orig_size = s->object_size;
>  
> -	if (s->flags & SLAB_RED_ZONE)
> +	if (s->flags & SLAB_RED_ZONE) {
>  		memset(p - s->red_left_pad, val, s->red_left_pad);
>  
> +		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> +			unsigned int zone_start;
> +
> +			orig_size = get_orig_size(s, object);
> +			zone_start = orig_size;
> +
> +			if (!freeptr_outside_object(s))
> +				zone_start = max_t(unsigned int, orig_size,
> +						s->offset + sizeof(void *));
> +
> +			/*
> +			 * Redzone the extra allocated space by kmalloc
> +			 * than requested.
> +			 */
> +			if (zone_start < s->object_size)
> +				memset(p + zone_start, val,
> +					s->object_size - zone_start);
> +		}
> +	}
> +
>  	if (s->flags & __OBJECT_POISON) {
> -		memset(p, POISON_FREE, s->object_size - 1);
> -		p[s->object_size - 1] = POISON_END;
> +		memset(p, POISON_FREE, orig_size - 1);
> +		p[orig_size - 1] = POISON_END;
>  	}
>  
>  	if (s->flags & SLAB_RED_ZONE)
> @@ -1103,6 +1139,7 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
>  {
>  	u8 *p = object;
>  	u8 *endobject = object + s->object_size;
> +	unsigned int orig_size;
>  
>  	if (s->flags & SLAB_RED_ZONE) {
>  		if (!check_bytes_and_report(s, slab, object, "Left Redzone",
> @@ -1112,6 +1149,20 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
>  		if (!check_bytes_and_report(s, slab, object, "Right Redzone",
>  			endobject, val, s->inuse - s->object_size))
>  			return 0;
> +
> +		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> +			orig_size = get_orig_size(s, object);
> +
> +			if (!freeptr_outside_object(s))
> +				orig_size = max_t(unsigned int, orig_size,
> +						s->offset + sizeof(void *));
> +			if (s->object_size > orig_size  &&
> +				!check_bytes_and_report(s, slab, object,
> +					"kmalloc Redzone", p + orig_size,
> +					val, s->object_size - orig_size)) {
> +				return 0;
> +			}
> +		}
>  	} else {
>  		if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
>  			check_bytes_and_report(s, slab, p, "Alignment padding",
> -- 
> 2.34.1
> 

Looks good, but what about putting
free pointer outside object when slub_debug_orig_size(s)?

diff --git a/mm/slub.c b/mm/slub.c
index 9d1a985c9ede..7e57d9f718d1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -970,22 +970,15 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
 		memset(p - s->red_left_pad, val, s->red_left_pad);
 
 		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
-			unsigned int zone_start;
-
 			orig_size = get_orig_size(s, object);
-			zone_start = orig_size;
-
-			if (!freeptr_outside_object(s))
-				zone_start = max_t(unsigned int, orig_size,
-						s->offset + sizeof(void *));
 
 			/*
 			 * Redzone the extra allocated space by kmalloc
 			 * than requested.
 			 */
-			if (zone_start < s->object_size)
-				memset(p + zone_start, val,
-					s->object_size - zone_start);
+			if (orig_size < s->object_size)
+				memset(p + orig_size, val,
+				       s->object_size - orig_size);
 		}
 	}
 
@@ -1153,9 +1146,6 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
 		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
 			orig_size = get_orig_size(s, object);
 
-			if (!freeptr_outside_object(s))
-				orig_size = max_t(unsigned int, orig_size,
-						s->offset + sizeof(void *));
 			if (s->object_size > orig_size  &&
 				!check_bytes_and_report(s, slab, object,
 					"kmalloc Redzone", p + orig_size,
@@ -4234,7 +4224,8 @@ static int calculate_sizes(struct kmem_cache *s)
 	 */
 	s->inuse = size;
 
-	if ((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
+	if (slub_debug_orig_size(s) ||
+	    (flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
 	    ((flags & SLAB_RED_ZONE) && s->object_size < sizeof(void *)) ||
 	    s->ctor) {
 		/*

-- 
Thanks,
Hyeonggon


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested
  2022-09-09  6:26   ` Hyeonggon Yoo
@ 2022-09-09  7:33     ` Feng Tang
  0 siblings, 0 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-09  7:33 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Dmitry Vyukov,
	Jonathan Corbet, Hansen, Dave, linux-mm, linux-kernel, kasan-dev

On Fri, Sep 09, 2022 at 02:26:34PM +0800, Hyeonggon Yoo wrote:
> On Wed, Sep 07, 2022 at 03:10:23PM +0800, Feng Tang wrote:
> > kmalloc will round up the request size to a fixed size (mostly power
> > of 2), so there could be a extra space than what is requested, whose
> > size is the actual buffer size minus original request size.
> > 
> > To better detect out of bound access or abuse of this space, add
> > redzone sanity check for it.
> > 
> > And in current kernel, some kmalloc user already knows the existence
> > of the space and utilizes it after calling 'ksize()' to know the real
> > size of the allocated buffer. So we skip the sanity check for objects
> > which have been called with ksize(), as treating them as legitimate
> > users.
> > 
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
[...]

> > -	if (s->flags & SLAB_RED_ZONE)
> > +	if (s->flags & SLAB_RED_ZONE) {
> >  		memset(p - s->red_left_pad, val, s->red_left_pad);
> >  
> > +		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> > +			unsigned int zone_start;
> > +
> > +			orig_size = get_orig_size(s, object);
> > +			zone_start = orig_size;
> > +
> > +			if (!freeptr_outside_object(s))
> > +				zone_start = max_t(unsigned int, orig_size,
> > +						s->offset + sizeof(void *));
> > +
> > +			/*
> > +			 * Redzone the extra allocated space by kmalloc
> > +			 * than requested.
> > +			 */
> > +			if (zone_start < s->object_size)
> > +				memset(p + zone_start, val,
> > +					s->object_size - zone_start);
> > +		}
> > +	}
> > +
> >  	if (s->flags & __OBJECT_POISON) {
> > -		memset(p, POISON_FREE, s->object_size - 1);
> > -		p[s->object_size - 1] = POISON_END;
> > +		memset(p, POISON_FREE, orig_size - 1);
> > +		p[orig_size - 1] = POISON_END;
> >  	}
> >  
> >  	if (s->flags & SLAB_RED_ZONE)
> > @@ -1103,6 +1139,7 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
> >  {
> >  	u8 *p = object;
> >  	u8 *endobject = object + s->object_size;
> > +	unsigned int orig_size;
> >  
> >  	if (s->flags & SLAB_RED_ZONE) {
> >  		if (!check_bytes_and_report(s, slab, object, "Left Redzone",
> > @@ -1112,6 +1149,20 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
> >  		if (!check_bytes_and_report(s, slab, object, "Right Redzone",
> >  			endobject, val, s->inuse - s->object_size))
> >  			return 0;
> > +
> > +		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> > +			orig_size = get_orig_size(s, object);
> > +
> > +			if (!freeptr_outside_object(s))
> > +				orig_size = max_t(unsigned int, orig_size,
> > +						s->offset + sizeof(void *));
> > +			if (s->object_size > orig_size  &&
> > +				!check_bytes_and_report(s, slab, object,
> > +					"kmalloc Redzone", p + orig_size,
> > +					val, s->object_size - orig_size)) {
> > +				return 0;
> > +			}
> > +		}
> >  	} else {
> >  		if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
> >  			check_bytes_and_report(s, slab, p, "Alignment padding",
> > -- 
> > 2.34.1
> > 
> 
> Looks good, but what about putting
> free pointer outside object when slub_debug_orig_size(s)?
 
Sounds good to me. This makes all kmalloc slabs covered by the redzone
check. I just gave the code a shot and it works with my test case!
Thanks!

- Feng


> diff --git a/mm/slub.c b/mm/slub.c
> index 9d1a985c9ede..7e57d9f718d1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -970,22 +970,15 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
>  		memset(p - s->red_left_pad, val, s->red_left_pad);
>  
>  		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> -			unsigned int zone_start;
> -
>  			orig_size = get_orig_size(s, object);
> -			zone_start = orig_size;
> -
> -			if (!freeptr_outside_object(s))
> -				zone_start = max_t(unsigned int, orig_size,
> -						s->offset + sizeof(void *));
>  
>  			/*
>  			 * Redzone the extra allocated space by kmalloc
>  			 * than requested.
>  			 */
> -			if (zone_start < s->object_size)
> -				memset(p + zone_start, val,
> -					s->object_size - zone_start);
> +			if (orig_size < s->object_size)
> +				memset(p + orig_size, val,
> +				       s->object_size - orig_size);
>  		}
>  	}
>  
> @@ -1153,9 +1146,6 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
>  		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
>  			orig_size = get_orig_size(s, object);
>  
> -			if (!freeptr_outside_object(s))
> -				orig_size = max_t(unsigned int, orig_size,
> -						s->offset + sizeof(void *));
>  			if (s->object_size > orig_size  &&
>  				!check_bytes_and_report(s, slab, object,
>  					"kmalloc Redzone", p + orig_size,
> @@ -4234,7 +4224,8 @@ static int calculate_sizes(struct kmem_cache *s)
>  	 */
>  	s->inuse = size;
>  
> -	if ((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
> +	if (slub_debug_orig_size(s) ||
> +	    (flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
>  	    ((flags & SLAB_RED_ZONE) && s->object_size < sizeof(void *)) ||
>  	    s->ctor) {
>  		/*
> 
> -- 
> Thanks,
> Hyeonggon
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc
  2022-09-07  7:10 ` [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc Feng Tang
  2022-09-07 14:57   ` Hyeonggon Yoo
@ 2022-09-10 23:11   ` Andrey Konovalov
  2022-09-11  5:04     ` Feng Tang
  1 sibling, 1 reply; 19+ messages in thread
From: Andrey Konovalov @ 2022-09-10 23:11 UTC (permalink / raw)
  To: Feng Tang, Alexander Potapenko
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet, Dave Hansen,
	Linux Memory Management List, LKML, kasan-dev

On Wed, Sep 7, 2022 at 9:10 AM Feng Tang <feng.tang@intel.com> wrote:
>
> kzalloc/kmalloc will round up the request size to a fixed size
> (mostly power of 2), so the allocated memory could be more than
> requested. Currently kzalloc family APIs will zero all the
> allocated memory.
>
> To detect out-of-bound usage of the extra allocated memory, only
> zero the requested part, so that sanity check could be added to
> the extra space later.
>
> For kzalloc users who will call ksize() later and utilize this
> extra space, please be aware that the space is not zeroed any
> more.
>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/slab.c | 6 +++---
>  mm/slab.h | 9 +++++++--
>  mm/slub.c | 6 +++---
>  3 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index a5486ff8362a..73ecaa7066e1 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3253,7 +3253,7 @@ slab_alloc_node(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags,
>         init = slab_want_init_on_alloc(flags, cachep);
>
>  out:
> -       slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
> +       slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init, 0);
>         return objp;
>  }
>
> @@ -3506,13 +3506,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>          * Done outside of the IRQ disabled section.
>          */
>         slab_post_alloc_hook(s, objcg, flags, size, p,
> -                               slab_want_init_on_alloc(flags, s));
> +                               slab_want_init_on_alloc(flags, s), 0);
>         /* FIXME: Trace call missing. Christoph would like a bulk variant */
>         return size;
>  error:
>         local_irq_enable();
>         cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
> -       slab_post_alloc_hook(s, objcg, flags, i, p, false);
> +       slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
>         kmem_cache_free_bulk(s, i, p);
>         return 0;
>  }
> diff --git a/mm/slab.h b/mm/slab.h
> index d0ef9dd44b71..20f9e2a9814f 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -730,12 +730,17 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
>
>  static inline void slab_post_alloc_hook(struct kmem_cache *s,
>                                         struct obj_cgroup *objcg, gfp_t flags,
> -                                       size_t size, void **p, bool init)
> +                                       size_t size, void **p, bool init,
> +                                       unsigned int orig_size)
>  {
>         size_t i;
>
>         flags &= gfp_allowed_mask;
>
> +       /* If original request size(kmalloc) is not set, use object_size */
> +       if (!orig_size)
> +               orig_size = s->object_size;
> +
>         /*
>          * As memory initialization might be integrated into KASAN,
>          * kasan_slab_alloc and initialization memset must be
> @@ -746,7 +751,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
>         for (i = 0; i < size; i++) {
>                 p[i] = kasan_slab_alloc(s, p[i], flags, init);
>                 if (p[i] && init && !kasan_has_integrated_init())
> -                       memset(p[i], 0, s->object_size);
> +                       memset(p[i], 0, orig_size);

Arguably, with slab_want_init_on_alloc(), all allocated memory should
be zeroed to prevent the possibility of info-leaks, even the unused padding.
Perhaps, Alexander can give his opinion here.

Thanks!


>                 kmemleak_alloc_recursive(p[i], s->object_size, 1,
>                                          s->flags, flags);
>         }
> diff --git a/mm/slub.c b/mm/slub.c
> index effd994438e6..f523601d3fcf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3376,7 +3376,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
>         init = slab_want_init_on_alloc(gfpflags, s);
>
>  out:
> -       slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
> +       slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size);
>
>         return object;
>  }
> @@ -3833,11 +3833,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>          * Done outside of the IRQ disabled fastpath loop.
>          */
>         slab_post_alloc_hook(s, objcg, flags, size, p,
> -                               slab_want_init_on_alloc(flags, s));
> +                               slab_want_init_on_alloc(flags, s), 0);
>         return i;
>  error:
>         slub_put_cpu_ptr(s->cpu_slab);
> -       slab_post_alloc_hook(s, objcg, flags, i, p, false);
> +       slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
>         kmem_cache_free_bulk(s, i, p);
>         return 0;
>  }
> --
> 2.34.1
>
> --
> You received this message because you are subscribed to the Google Groups "kasan-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kasan-dev+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/kasan-dev/20220907071023.3838692-3-feng.tang%40intel.com.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested
  2022-09-07  7:10 ` [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested Feng Tang
  2022-09-09  6:26   ` Hyeonggon Yoo
@ 2022-09-10 23:12   ` Andrey Konovalov
  2022-09-11  4:10     ` Feng Tang
  1 sibling, 1 reply; 19+ messages in thread
From: Andrey Konovalov @ 2022-09-10 23:12 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet, Dave Hansen,
	Linux Memory Management List, LKML, kasan-dev

On Wed, Sep 7, 2022 at 9:11 AM Feng Tang <feng.tang@intel.com> wrote:
>
> kmalloc will round up the request size to a fixed size (mostly power
> of 2), so there could be a extra space than what is requested, whose
> size is the actual buffer size minus original request size.
>
> To better detect out of bound access or abuse of this space, add
> redzone sanity check for it.
>
> And in current kernel, some kmalloc user already knows the existence
> of the space and utilizes it after calling 'ksize()' to know the real
> size of the allocated buffer. So we skip the sanity check for objects
> which have been called with ksize(), as treating them as legitimate
> users.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/slab.h        |  4 ++++
>  mm/slab_common.c |  4 ++++
>  mm/slub.c        | 57 +++++++++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 62 insertions(+), 3 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 20f9e2a9814f..0bc91b30b031 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -885,4 +885,8 @@ void __check_heap_object(const void *ptr, unsigned long n,
>  }
>  #endif
>
> +#ifdef CONFIG_SLUB_DEBUG
> +void skip_orig_size_check(struct kmem_cache *s, const void *object);
> +#endif
> +
>  #endif /* MM_SLAB_H */
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 8e13e3aac53f..5106667d6adb 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1001,6 +1001,10 @@ size_t __ksize(const void *object)
>                 return folio_size(folio);
>         }
>
> +#ifdef CONFIG_SLUB_DEBUG
> +       skip_orig_size_check(folio_slab(folio)->slab_cache, object);
> +#endif
> +
>         return slab_ksize(folio_slab(folio)->slab_cache);
>  }
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f523601d3fcf..2f0302136604 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -812,12 +812,27 @@ static inline void set_orig_size(struct kmem_cache *s,
>         if (!slub_debug_orig_size(s))
>                 return;
>
> +#ifdef CONFIG_KASAN_GENERIC
> +       /*
> +        * KASAN could save its free meta data in the start part of object
> +        * area, so skip the redzone check if kasan's meta data size is
> +        * bigger enough to possibly overlap with kmalloc redzone
> +        */
> +       if (s->kasan_info.free_meta_size_in_object * 2 >= s->object_size)

Why is free_meta_size_in_object multiplied by 2? Looks cryptic,
probably needs a comment.

Thanks!

> +               orig_size = s->object_size;
> +#endif
> +
>         p += get_info_end(s);
>         p += sizeof(struct track) * 2;
>
>         *(unsigned int *)p = orig_size;
>  }
>
> +void skip_orig_size_check(struct kmem_cache *s, const void *object)
> +{
> +       set_orig_size(s, (void *)object, s->object_size);
> +}
> +
>  static unsigned int get_orig_size(struct kmem_cache *s, void *object)
>  {
>         void *p = kasan_reset_tag(object);
> @@ -949,13 +964,34 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
>  static void init_object(struct kmem_cache *s, void *object, u8 val)
>  {
>         u8 *p = kasan_reset_tag(object);
> +       unsigned int orig_size = s->object_size;
>
> -       if (s->flags & SLAB_RED_ZONE)
> +       if (s->flags & SLAB_RED_ZONE) {
>                 memset(p - s->red_left_pad, val, s->red_left_pad);
>
> +               if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> +                       unsigned int zone_start;
> +
> +                       orig_size = get_orig_size(s, object);
> +                       zone_start = orig_size;
> +
> +                       if (!freeptr_outside_object(s))
> +                               zone_start = max_t(unsigned int, orig_size,
> +                                               s->offset + sizeof(void *));
> +
> +                       /*
> +                        * Redzone the extra allocated space by kmalloc
> +                        * than requested.
> +                        */
> +                       if (zone_start < s->object_size)
> +                               memset(p + zone_start, val,
> +                                       s->object_size - zone_start);
> +               }
> +       }
> +
>         if (s->flags & __OBJECT_POISON) {
> -               memset(p, POISON_FREE, s->object_size - 1);
> -               p[s->object_size - 1] = POISON_END;
> +               memset(p, POISON_FREE, orig_size - 1);
> +               p[orig_size - 1] = POISON_END;
>         }
>
>         if (s->flags & SLAB_RED_ZONE)
> @@ -1103,6 +1139,7 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
>  {
>         u8 *p = object;
>         u8 *endobject = object + s->object_size;
> +       unsigned int orig_size;
>
>         if (s->flags & SLAB_RED_ZONE) {
>                 if (!check_bytes_and_report(s, slab, object, "Left Redzone",
> @@ -1112,6 +1149,20 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
>                 if (!check_bytes_and_report(s, slab, object, "Right Redzone",
>                         endobject, val, s->inuse - s->object_size))
>                         return 0;
> +
> +               if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> +                       orig_size = get_orig_size(s, object);
> +
> +                       if (!freeptr_outside_object(s))
> +                               orig_size = max_t(unsigned int, orig_size,
> +                                               s->offset + sizeof(void *));
> +                       if (s->object_size > orig_size  &&
> +                               !check_bytes_and_report(s, slab, object,
> +                                       "kmalloc Redzone", p + orig_size,
> +                                       val, s->object_size - orig_size)) {
> +                               return 0;
> +                       }
> +               }
>         } else {
>                 if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
>                         check_bytes_and_report(s, slab, p, "Alignment padding",
> --
> 2.34.1
>
> --
> You received this message because you are subscribed to the Google Groups "kasan-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kasan-dev+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/kasan-dev/20220907071023.3838692-5-feng.tang%40intel.com.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache
  2022-09-07  7:10 ` [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache Feng Tang
@ 2022-09-10 23:14   ` Andrey Konovalov
  2022-09-11  3:56     ` Feng Tang
  0 siblings, 1 reply; 19+ messages in thread
From: Andrey Konovalov @ 2022-09-10 23:14 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet, Dave Hansen,
	Linux Memory Management List, LKML, kasan-dev, kernel test robot

On Wed, Sep 7, 2022 at 9:11 AM Feng Tang <feng.tang@intel.com> wrote:
>
> When kasan is enabled for slab/slub, it may save kasan' free_meta
> data in the former part of slab object data area in slab object
> free path, which works fine.
>
> There is ongoing effort to extend slub's debug function which will
> redzone the latter part of kmalloc object area, and when both of
> the debug are enabled, there is possible conflict, especially when
> the kmalloc object has small size, as caught by 0Day bot [1]
>
> For better information for slab/slub, add free_meta's data size
> into 'struct kasan_cache', so that its users can take right action
> to avoid data conflict.
>
> [1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Acked-by: Dmitry Vyukov <dvyukov@google.com>
> ---
>  include/linux/kasan.h | 2 ++
>  mm/kasan/common.c     | 2 ++
>  2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index b092277bf48d..293bdaa0ba09 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -100,6 +100,8 @@ static inline bool kasan_has_integrated_init(void)
>  struct kasan_cache {
>         int alloc_meta_offset;
>         int free_meta_offset;
> +       /* size of free_meta data saved in object's data area */
> +       int free_meta_size_in_object;

I think calling this field free_meta_size is clear enough. Thanks!

>         bool is_kmalloc;
>  };
>
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 69f583855c8b..762ae7a7793e 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -201,6 +201,8 @@ void __kasan_cache_create(struct kmem_cache *cache, unsigned int *size,
>                         cache->kasan_info.free_meta_offset = KASAN_NO_FREE_META;
>                         *size = ok_size;
>                 }
> +       } else {
> +               cache->kasan_info.free_meta_size_in_object = sizeof(struct kasan_free_meta);
>         }
>
>         /* Calculate size with optimal redzone. */
> --
> 2.34.1
>
> --
> You received this message because you are subscribed to the Google Groups "kasan-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kasan-dev+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/kasan-dev/20220907071023.3838692-4-feng.tang%40intel.com.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache
  2022-09-10 23:14   ` Andrey Konovalov
@ 2022-09-11  3:56     ` Feng Tang
  2022-09-11 11:51       ` Andrey Konovalov
  0 siblings, 1 reply; 19+ messages in thread
From: Feng Tang @ 2022-09-11  3:56 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet, Hansen, Dave,
	Linux Memory Management List, LKML, kasan-dev, Sang, Oliver

Hi Andrey,

Thanks for reviewing this series!

On Sun, Sep 11, 2022 at 07:14:55AM +0800, Andrey Konovalov wrote:
> On Wed, Sep 7, 2022 at 9:11 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > When kasan is enabled for slab/slub, it may save kasan' free_meta
> > data in the former part of slab object data area in slab object
> > free path, which works fine.
> >
> > There is ongoing effort to extend slub's debug function which will
> > redzone the latter part of kmalloc object area, and when both of
> > the debug are enabled, there is possible conflict, especially when
> > the kmalloc object has small size, as caught by 0Day bot [1]
> >
> > For better information for slab/slub, add free_meta's data size
> > into 'struct kasan_cache', so that its users can take right action
> > to avoid data conflict.
> >
> > [1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > Acked-by: Dmitry Vyukov <dvyukov@google.com>
> > ---
> >  include/linux/kasan.h | 2 ++
> >  mm/kasan/common.c     | 2 ++
> >  2 files changed, 4 insertions(+)
> >
> > diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> > index b092277bf48d..293bdaa0ba09 100644
> > --- a/include/linux/kasan.h
> > +++ b/include/linux/kasan.h
> > @@ -100,6 +100,8 @@ static inline bool kasan_has_integrated_init(void)
> >  struct kasan_cache {
> >         int alloc_meta_offset;
> >         int free_meta_offset;
> > +       /* size of free_meta data saved in object's data area */
> > +       int free_meta_size_in_object;
> 
> > I think calling this field free_meta_size is clear enough. Thanks!

Yes, the name does look long. The "in_object" was added to make it
also a flag for whether the free meta is saved inside object's data
area. 

For 'free_meta_size', the code logic in slub should be:
  
  if (info->free_meta_offset == 0 &&
	info->free_meta_size >= ...)

Thanks,
Feng


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested
  2022-09-10 23:12   ` Andrey Konovalov
@ 2022-09-11  4:10     ` Feng Tang
  0 siblings, 0 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-11  4:10 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet, Hansen, Dave,
	Linux Memory Management List, LKML, kasan-dev

On Sun, Sep 11, 2022 at 07:12:05AM +0800, Andrey Konovalov wrote:
> On Wed, Sep 7, 2022 at 9:11 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > kmalloc will round up the request size to a fixed size (mostly power
> > of 2), so there could be a extra space than what is requested, whose
> > size is the actual buffer size minus original request size.
> >
> > To better detect out of bound access or abuse of this space, add
> > redzone sanity check for it.
> >
> > And in current kernel, some kmalloc user already knows the existence
> > of the space and utilizes it after calling 'ksize()' to know the real
> > size of the allocated buffer. So we skip the sanity check for objects
> > which have been called with ksize(), as treating them as legitimate
> > users.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  mm/slab.h        |  4 ++++
> >  mm/slab_common.c |  4 ++++
> >  mm/slub.c        | 57 +++++++++++++++++++++++++++++++++++++++++++++---
> >  3 files changed, 62 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 20f9e2a9814f..0bc91b30b031 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -885,4 +885,8 @@ void __check_heap_object(const void *ptr, unsigned long n,
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_SLUB_DEBUG
> > +void skip_orig_size_check(struct kmem_cache *s, const void *object);
> > +#endif
> > +
> >  #endif /* MM_SLAB_H */
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 8e13e3aac53f..5106667d6adb 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -1001,6 +1001,10 @@ size_t __ksize(const void *object)
> >                 return folio_size(folio);
> >         }
> >
> > +#ifdef CONFIG_SLUB_DEBUG
> > +       skip_orig_size_check(folio_slab(folio)->slab_cache, object);
> > +#endif
> > +
> >         return slab_ksize(folio_slab(folio)->slab_cache);
> >  }
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f523601d3fcf..2f0302136604 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -812,12 +812,27 @@ static inline void set_orig_size(struct kmem_cache *s,
> >         if (!slub_debug_orig_size(s))
> >                 return;
> >
> > +#ifdef CONFIG_KASAN_GENERIC
> > +       /*
> > +        * KASAN could save its free meta data in the start part of object
> > +        * area, so skip the redzone check if kasan's meta data size is
> > +        * bigger enough to possibly overlap with kmalloc redzone
> > +        */
> > +       if (s->kasan_info.free_meta_size_in_object * 2 >= s->object_size)
> 
> Why is free_meta_size_in_object multiplied by 2? Looks cryptic,
> probably needs a comment.
 
OK, will change; I didn't explain it clearly.

The basic idea is that kasan's free meta could be saved in the object's
data area at offset 0, and it could overlap the kmalloc in-object
redzone, which can only be in the second half of the data area. As
long as kasan's free meta sits in the first half, it's fine.

Maybe I can change the check to

  if (s->kasan_info.free_meta_size_in_object > orig_size)
	...
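
To put it as a picture (a rough sketch of the object data area and the
simplified check, not an actual hunk):

	/*
	 *  0                            orig_size          object_size
	 *  |<- kasan free meta ->| ...  |<- kmalloc redzone ->|
	 *
	 * The kmalloc redzone lives in [orig_size, object_size); if the
	 * free meta could reach past orig_size it would clobber that
	 * pattern, so in that case just skip the extended check:
	 */
	if (s->kasan_info.free_meta_size_in_object > orig_size)
		orig_size = s->object_size;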

Thanks,
Feng

> Thanks!
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc
  2022-09-10 23:11   ` Andrey Konovalov
@ 2022-09-11  5:04     ` Feng Tang
  0 siblings, 0 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-11  5:04 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Alexander Potapenko, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, Hyeonggon Yoo, Dmitry Vyukov, Jonathan Corbet,
	Hansen, Dave, Linux Memory Management List, LKML, kasan-dev

On Sun, Sep 11, 2022 at 07:11:18AM +0800, Andrey Konovalov wrote:
> On Wed, Sep 7, 2022 at 9:10 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > kzalloc/kmalloc will round up the request size to a fixed size
> > (mostly power of 2), so the allocated memory could be more than
> > requested. Currently kzalloc family APIs will zero all the
> > allocated memory.
> >
> > To detect out-of-bound usage of the extra allocated memory, only
> > zero the requested part, so that sanity check could be added to
> > the extra space later.
> >
> > For kzalloc users who will call ksize() later and utilize this
> > extra space, please be aware that the space is not zeroed any
> > more.
> >
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  mm/slab.c | 6 +++---
> >  mm/slab.h | 9 +++++++--
> >  mm/slub.c | 6 +++---
> >  3 files changed, 13 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/slab.c b/mm/slab.c
> > index a5486ff8362a..73ecaa7066e1 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -3253,7 +3253,7 @@ slab_alloc_node(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags,
> >         init = slab_want_init_on_alloc(flags, cachep);
> >
> >  out:
> > -       slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
> > +       slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init, 0);
> >         return objp;
> >  }
> >
> > @@ -3506,13 +3506,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> >          * Done outside of the IRQ disabled section.
> >          */
> >         slab_post_alloc_hook(s, objcg, flags, size, p,
> > -                               slab_want_init_on_alloc(flags, s));
> > +                               slab_want_init_on_alloc(flags, s), 0);
> >         /* FIXME: Trace call missing. Christoph would like a bulk variant */
> >         return size;
> >  error:
> >         local_irq_enable();
> >         cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
> > -       slab_post_alloc_hook(s, objcg, flags, i, p, false);
> > +       slab_post_alloc_hook(s, objcg, flags, i, p, false, 0);
> >         kmem_cache_free_bulk(s, i, p);
> >         return 0;
> >  }
> > diff --git a/mm/slab.h b/mm/slab.h
> > index d0ef9dd44b71..20f9e2a9814f 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -730,12 +730,17 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> >
> >  static inline void slab_post_alloc_hook(struct kmem_cache *s,
> >                                         struct obj_cgroup *objcg, gfp_t flags,
> > -                                       size_t size, void **p, bool init)
> > +                                       size_t size, void **p, bool init,
> > +                                       unsigned int orig_size)
> >  {
> >         size_t i;
> >
> >         flags &= gfp_allowed_mask;
> >
> > +       /* If original request size(kmalloc) is not set, use object_size */
> > +       if (!orig_size)
> > +               orig_size = s->object_size;
> > +
> >         /*
> >          * As memory initialization might be integrated into KASAN,
> >          * kasan_slab_alloc and initialization memset must be
> > @@ -746,7 +751,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
> >         for (i = 0; i < size; i++) {
> >                 p[i] = kasan_slab_alloc(s, p[i], flags, init);
> >                 if (p[i] && init && !kasan_has_integrated_init())
> > -                       memset(p[i], 0, s->object_size);
> > +                       memset(p[i], 0, orig_size);
> 
> Arguably, with slab_want_init_on_alloc(), all allocated memory should
> be zeroed to prevent the possibility of info-leaks, even the unused padding.
> Perhaps, Alexander can give his opinion here.

Initially, I thought about only zeroing the requested part (orig_size)
when slub_debug is enabled for that slab. But from the profiling,
zeroing 4096+1 bytes and zeroing 8192 bytes show an obvious difference
in execution time (about 10 us vs 18 us).

Semantics-wise, requesting 'A' bytes to be zeroed and then expecting
'A+B' zeroed bytes is not very valid, IMHO.

Also, this 2/4 patch is a preparation for the redzone extension in 4/4;
without it, the redzone initialization would be overridden by the
zeroing.
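
To illustrate the "overridden" part (a sketch against the 4/4 layout,
not an actual hunk):

	/*
	 * On the debug path, init_object() has already painted
	 * [orig_size, object_size) of the object with the redzone
	 * pattern by the time slab_post_alloc_hook() runs.  Zeroing
	 * the whole object_size here would wipe that pattern, so the
	 * zeroing has to stop at the requested size:
	 */
	if (p[i] && init && !kasan_has_integrated_init())
		memset(p[i], 0, orig_size);	/* not s->object_size */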

Thanks,
Feng

> Thanks!
> 
 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache
  2022-09-11  3:56     ` Feng Tang
@ 2022-09-11 11:51       ` Andrey Konovalov
  2022-09-11 12:29         ` Feng Tang
  0 siblings, 1 reply; 19+ messages in thread
From: Andrey Konovalov @ 2022-09-11 11:51 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet, Hansen, Dave,
	Linux Memory Management List, LKML, kasan-dev, Sang, Oliver

On Sun, Sep 11, 2022 at 5:57 AM Feng Tang <feng.tang@intel.com> wrote:
>
> Hi Andrey,
>
> Thanks for reviewing this series!
>
> On Sun, Sep 11, 2022 at 07:14:55AM +0800, Andrey Konovalov wrote:
> > On Wed, Sep 7, 2022 at 9:11 AM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > When kasan is enabled for slab/slub, it may save kasan's free_meta
> > > data in the first part of the slab object data area on the slab
> > > object free path, which works fine.
> > >
> > > There is an ongoing effort to extend slub's debug function to
> > > redzone the latter part of the kmalloc object area, and when both
> > > debug features are enabled they may conflict, especially when the
> > > kmalloc object is small, as caught by the 0Day bot [1].
> > >
> > > To give slab/slub better information, add free_meta's data size
> > > into 'struct kasan_cache', so that its users can take the right
> > > action to avoid the conflict.
> > >
> > > [1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
> > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > > Acked-by: Dmitry Vyukov <dvyukov@google.com>
> > > ---
> > >  include/linux/kasan.h | 2 ++
> > >  mm/kasan/common.c     | 2 ++
> > >  2 files changed, 4 insertions(+)
> > >
> > > diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> > > index b092277bf48d..293bdaa0ba09 100644
> > > --- a/include/linux/kasan.h
> > > +++ b/include/linux/kasan.h
> > > @@ -100,6 +100,8 @@ static inline bool kasan_has_integrated_init(void)
> > >  struct kasan_cache {
> > >         int alloc_meta_offset;
> > >         int free_meta_offset;
> > > +       /* size of free_meta data saved in object's data area */
> > > +       int free_meta_size_in_object;
> >
> > I think calling this field free_meta_size is clear enough. Thanks!
>
> Yes, the name does look long. The "in_object" was added so that the
> field also acts as a flag for whether the free meta is saved inside
> the object's data area.
>
> With a plain 'free_meta_size', the check in slub would have to be:
>
>   if (info->free_meta_offset == 0 &&
>         info->free_meta_size >= ...)

I'd say you can keep the current logic and just rename the field to
make it shorter. But up to you, I'm fine with either approach. Thanks!
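
To make the trade-off concrete, below is a small standalone C model
(not the kernel code) of the two checks being compared; the struct is
pared down to the fields discussed here, and 'kmalloc_redzone_start'
plus the sizes are hypothetical placeholders, not values from the
patch:

	#include <stdbool.h>
	#include <stdio.h>

	/* modeled after the fields discussed above, not the full struct */
	struct kasan_cache {
		int alloc_meta_offset;
		int free_meta_offset;
		int free_meta_size_in_object;
	};

	/*
	 * Variant A: a non-zero free_meta_size_in_object already implies
	 * the free meta lives inside the object, so one check is enough.
	 */
	static bool overlaps_a(const struct kasan_cache *info,
			       int kmalloc_redzone_start)
	{
		return info->free_meta_size_in_object > kmalloc_redzone_start;
	}

	/*
	 * Variant B: with a plain size field, the offset must also be
	 * checked to know whether the free meta is in the object at all.
	 */
	static bool overlaps_b(const struct kasan_cache *info,
			       int free_meta_size, int kmalloc_redzone_start)
	{
		return info->free_meta_offset == 0 &&
		       free_meta_size > kmalloc_redzone_start;
	}

	int main(void)
	{
		struct kasan_cache info = {
			.free_meta_offset = 0,
			.free_meta_size_in_object = 16,	/* illustrative only */
		};

		printf("A: %d  B: %d\n",
		       overlaps_a(&info, 8),
		       overlaps_b(&info, 16, 8));
		return 0;
	}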


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache
  2022-09-11 11:51       ` Andrey Konovalov
@ 2022-09-11 12:29         ` Feng Tang
  0 siblings, 0 replies; 19+ messages in thread
From: Feng Tang @ 2022-09-11 12:29 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Dmitry Vyukov, Jonathan Corbet, Hansen, Dave,
	Linux Memory Management List, LKML, kasan-dev, Sang, Oliver

On Sun, Sep 11, 2022 at 07:51:54PM +0800, Andrey Konovalov wrote:
> On Sun, Sep 11, 2022 at 5:57 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > Hi Andrey,
> >
> > Thanks for reviewing this series!
> >
> > On Sun, Sep 11, 2022 at 07:14:55AM +0800, Andrey Konovalov wrote:
> > > On Wed, Sep 7, 2022 at 9:11 AM Feng Tang <feng.tang@intel.com> wrote:
> > > >
> > > > When kasan is enabled for slab/slub, it may save kasan's free_meta
> > > > data in the first part of the slab object data area on the slab
> > > > object free path, which works fine.
> > > >
> > > > There is an ongoing effort to extend slub's debug function to
> > > > redzone the latter part of the kmalloc object area, and when both
> > > > debug features are enabled they may conflict, especially when the
> > > > kmalloc object is small, as caught by the 0Day bot [1].
> > > >
> > > > To give slab/slub better information, add free_meta's data size
> > > > into 'struct kasan_cache', so that its users can take the right
> > > > action to avoid the conflict.
> > > >
> > > > [1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
> > > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > > > Acked-by: Dmitry Vyukov <dvyukov@google.com>
> > > > ---
> > > >  include/linux/kasan.h | 2 ++
> > > >  mm/kasan/common.c     | 2 ++
> > > >  2 files changed, 4 insertions(+)
> > > >
> > > > diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> > > > index b092277bf48d..293bdaa0ba09 100644
> > > > --- a/include/linux/kasan.h
> > > > +++ b/include/linux/kasan.h
> > > > @@ -100,6 +100,8 @@ static inline bool kasan_has_integrated_init(void)
> > > >  struct kasan_cache {
> > > >         int alloc_meta_offset;
> > > >         int free_meta_offset;
> > > > +       /* size of free_meta data saved in object's data area */
> > > > +       int free_meta_size_in_object;
> > >
> > > I think calling this field free_meta_size is clear enough. Thanks!
> >
> > Yes, the name does look long. The "in_object" was added so that the
> > field also acts as a flag for whether the free meta is saved inside
> > the object's data area.
> >
> > With a plain 'free_meta_size', the check in slub would have to be:
> >
> >   if (info->free_meta_offset == 0 &&
> >         info->free_meta_size >= ...)
> 
> I'd say you can keep the current logic and just rename the field to
> make it shorter. But up to you, I'm fine with either approach. Thanks!

OK, I don't have a strong opinion either. As the comment for that
member clearly states it holds the in-object data size, we can use
the shorter name.
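
For reference, a rough userspace model of the overlap that the 0Day
report behind this patch was about: free_meta stored at the start of a
freed object colliding with the kmalloc redzone that 4/4 paints after
the requested size. All sizes below are made up for illustration and
are not the real cache geometry:

	#include <stdio.h>

	int main(void)
	{
		int object_size = 32;		/* small kmalloc bucket, illustrative */
		int orig_size = 10;		/* what the caller asked for */
		int free_meta_offset = 0;	/* free_meta stored at object start */
		int free_meta_size = 16;	/* illustrative free_meta footprint */

		/* 4/4 redzones the [orig_size, object_size) part of the object */
		int redzone_start = orig_size;

		if (free_meta_offset == 0 && free_meta_size > redzone_start)
			printf("free_meta [0,%d) overlaps redzone [%d,%d)\n",
			       free_meta_size, redzone_start, object_size);

		return 0;
	}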

Thanks,
Feng





^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-09-11 12:30 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-07  7:10 [PATCH v5 0/4] mm/slub: some debug enhancements for kmalloc Feng Tang
2022-09-07  7:10 ` [PATCH v5 1/4] mm/slub: enable debugging memory wasting of kmalloc Feng Tang
2022-09-07 14:17   ` Hyeonggon Yoo
2022-09-08  2:25     ` Feng Tang
2022-09-07  7:10 ` [PATCH v5 2/4] mm/slub: only zero the requested size of buffer for kzalloc Feng Tang
2022-09-07 14:57   ` Hyeonggon Yoo
2022-09-08  7:38     ` Feng Tang
2022-09-10 23:11   ` Andrey Konovalov
2022-09-11  5:04     ` Feng Tang
2022-09-07  7:10 ` [PATCH v5 3/4] mm: kasan: Add free_meta size info in struct kasan_cache Feng Tang
2022-09-10 23:14   ` Andrey Konovalov
2022-09-11  3:56     ` Feng Tang
2022-09-11 11:51       ` Andrey Konovalov
2022-09-11 12:29         ` Feng Tang
2022-09-07  7:10 ` [PATCH v5 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested Feng Tang
2022-09-09  6:26   ` Hyeonggon Yoo
2022-09-09  7:33     ` Feng Tang
2022-09-10 23:12   ` Andrey Konovalov
2022-09-11  4:10     ` Feng Tang
