* [PATCH 0/5] slab cleanups
@ 2022-02-21 10:53 Hyeonggon Yoo
2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
` (4 more replies)
0 siblings, 5 replies; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
Hyeonggon Yoo
Hi. These are slab cleanup patches. Please consider them for slab-next :)
The correctness of the locking in patch 5 was checked using lockdep.
Hyeonggon Yoo (5):
mm/sl[au]b: Unify __ksize()
mm/sl[auo]b: Do not export __ksize()
mm/slab: Do not call kmalloc_large() for unsupported size
mm/slub: Limit min_partial only in cache creation
mm/slub: Refactor deactivate_slab()
include/linux/slab.h | 23 +++++++---
mm/slab.c | 23 ----------
mm/slab_common.c | 28 ++++++++++++
mm/slob.c | 1 -
mm/slub.c | 101 +++++++++++++++++--------------------------
5 files changed, 84 insertions(+), 92 deletions(-)
--
2.33.1
* [PATCH 1/5] mm/sl[au]b: Unify __ksize()
2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
2022-02-23 18:39 ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 2/5] mm/sl[auo]b: Do not export __ksize() Hyeonggon Yoo
` (3 subsequent siblings)
4 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
Hyeonggon Yoo
Only SLOB needs to implement __ksize() separately, because SLOB records the
size in the object header for kmalloc objects. Unify SLAB's and SLUB's __ksize().
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
mm/slab.c | 23 -----------------------
mm/slab_common.c | 29 +++++++++++++++++++++++++++++
mm/slub.c | 16 ----------------
3 files changed, 29 insertions(+), 39 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index ddf5737c63d9..eb73d2499480 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4199,27 +4199,4 @@ void __check_heap_object(const void *ptr, unsigned long n,
}
#endif /* CONFIG_HARDENED_USERCOPY */
-/**
- * __ksize -- Uninstrumented ksize.
- * @objp: pointer to the object
- *
- * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
- * safety checks as ksize() with KASAN instrumentation enabled.
- *
- * Return: size of the actual memory used by @objp in bytes
- */
-size_t __ksize(const void *objp)
-{
- struct kmem_cache *c;
- size_t size;
- BUG_ON(!objp);
- if (unlikely(objp == ZERO_SIZE_PTR))
- return 0;
-
- c = virt_to_cache(objp);
- size = c ? c->object_size : 0;
-
- return size;
-}
-EXPORT_SYMBOL(__ksize);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 23f2ab0713b7..488997db0d97 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1245,6 +1245,35 @@ void kfree_sensitive(const void *p)
}
EXPORT_SYMBOL(kfree_sensitive);
+#ifndef CONFIG_SLOB
+/**
+ * __ksize -- Uninstrumented ksize.
+ * @objp: pointer to the object
+ *
+ * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
+ * safety checks as ksize() with KASAN instrumentation enabled.
+ *
+ * Return: size of the actual memory used by @objp in bytes
+ */
+size_t __ksize(const void *object)
+{
+ struct folio *folio;
+
+ if (unlikely(object == ZERO_SIZE_PTR))
+ return 0;
+
+ folio = virt_to_folio(object);
+
+#ifdef CONFIG_SLUB
+ if (unlikely(!folio_test_slab(folio)))
+ return folio_size(folio);
+#endif
+
+ return slab_ksize(folio_slab(folio)->slab_cache);
+}
+EXPORT_SYMBOL(__ksize);
+#endif
+
/**
* ksize - get the actual amount of memory allocated for a given object
* @objp: Pointer to the object
diff --git a/mm/slub.c b/mm/slub.c
index 261474092e43..3a4458976ab7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4526,22 +4526,6 @@ void __check_heap_object(const void *ptr, unsigned long n,
}
#endif /* CONFIG_HARDENED_USERCOPY */
-size_t __ksize(const void *object)
-{
- struct folio *folio;
-
- if (unlikely(object == ZERO_SIZE_PTR))
- return 0;
-
- folio = virt_to_folio(object);
-
- if (unlikely(!folio_test_slab(folio)))
- return folio_size(folio);
-
- return slab_ksize(folio_slab(folio)->slab_cache);
-}
-EXPORT_SYMBOL(__ksize);
-
void kfree(const void *x)
{
struct folio *folio;
--
2.33.1
* [PATCH 2/5] mm/sl[auo]b: Do not export __ksize()
2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
2022-02-21 15:46 ` Matthew Wilcox
2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
` (2 subsequent siblings)
4 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
Hyeonggon Yoo
Do not export __ksize(). Only kasan calls __ksize() directly.
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
mm/slab_common.c | 1 -
mm/slob.c | 1 -
2 files changed, 2 deletions(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 488997db0d97..f2c021b57579 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1271,7 +1271,6 @@ size_t __ksize(const void *object)
return slab_ksize(folio_slab(folio)->slab_cache);
}
-EXPORT_SYMBOL(__ksize);
#endif
/**
diff --git a/mm/slob.c b/mm/slob.c
index 60c5842215f1..d8af6c54f133 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -588,7 +588,6 @@ size_t __ksize(const void *block)
m = (unsigned int *)(block - align);
return SLOB_UNITS(*m) * SLOB_UNIT;
}
-EXPORT_SYMBOL(__ksize);
int __kmem_cache_create(struct kmem_cache *c, slab_flags_t flags)
{
--
2.33.1
* [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
2022-02-21 10:53 ` [PATCH 2/5] mm/sl[auo]b: Do not export __ksize() Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
2022-02-21 15:53 ` Matthew Wilcox
2022-02-24 12:48 ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 4/5] mm/slub: Limit min_partial only in cache creation Hyeonggon Yoo
2022-02-21 10:53 ` [PATCH 5/5] mm/slub: Refactor deactivate_slab() Hyeonggon Yoo
4 siblings, 2 replies; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
Hyeonggon Yoo
SLAB's kfree() does not support freeing an object allocated with
kmalloc_large(). Fix this, as SLAB does not pass requests larger than
KMALLOC_MAX_CACHE_SIZE directly to the page allocator.
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
include/linux/slab.h | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 37bde99b74af..aeda3e863f2b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -564,15 +564,19 @@ static __always_inline __alloc_size(1) void *kmalloc_large(size_t size, gfp_t fl
* Try really hard to succeed the allocation but fail
* eventually.
*/
+#ifndef CONFIG_SLOB
static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size)) {
-#ifndef CONFIG_SLOB
unsigned int index;
-#endif
- if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large(size, flags);
-#ifndef CONFIG_SLOB
+
+ if (size > KMALLOC_MAX_CACHE_SIZE) {
+ if (IS_ENABLED(CONFIG_SLUB))
+ return kmalloc_large(size, flags);
+ else
+ return NULL;
+ }
+
index = kmalloc_index(size);
if (!index)
@@ -581,10 +585,17 @@ static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
return kmem_cache_alloc_trace(
kmalloc_caches[kmalloc_type(flags)][index],
flags, size);
-#endif
}
return __kmalloc(size, flags);
}
+#else
+static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
+{
+ if (__builtin_constant_p(size) && size > KMALLOC_MAX_CACHE_SIZE)
+ return kmalloc_large(size, flags);
+ return __kmalloc(size, flags);
+}
+#endif
static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
{
--
2.33.1
* [PATCH 4/5] mm/slub: Limit min_partial only in cache creation
2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
` (2 preceding siblings ...)
2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
2022-02-22 23:48 ` David Rientjes
2022-02-21 10:53 ` [PATCH 5/5] mm/slub: Refactor deactivate_slab() Hyeonggon Yoo
4 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
Hyeonggon Yoo
SLUB sets the number of minimum partial slabs for a node (min_partial) using
set_min_partial(). SLUB keeps at least min_partial slabs around even if they
are empty, to avoid excessive use of the page allocator.
set_min_partial() clamps the value of min_partial between MIN_PARTIAL and
MAX_PARTIAL. As set_min_partial() can also be called by min_partial_store(),
only clamp min_partial in kmem_cache_open() so that it can later be changed
to whatever value a user wants.
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
mm/slub.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 3a4458976ab7..a4964deccb61 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4002,10 +4002,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
static void set_min_partial(struct kmem_cache *s, unsigned long min)
{
- if (min < MIN_PARTIAL)
- min = MIN_PARTIAL;
- else if (min > MAX_PARTIAL)
- min = MAX_PARTIAL;
s->min_partial = min;
}
@@ -4184,6 +4180,8 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
{
+ int min_partial;
+
s->flags = kmem_cache_flags(s->size, flags, s->name);
#ifdef CONFIG_SLAB_FREELIST_HARDENED
s->random = get_random_long();
@@ -4215,7 +4213,10 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
* The larger the object size is, the more slabs we want on the partial
* list to avoid pounding the page allocator excessively.
*/
- set_min_partial(s, ilog2(s->size) / 2);
+ min_partial = min(MAX_PARTIAL, ilog2(s->size) / 2);
+ min_partial = max(MIN_PARTIAL, min_partial);
+
+ set_min_partial(s, min_partial);
set_cpu_partial(s);
--
2.33.1
* [PATCH 5/5] mm/slub: Refactor deactivate_slab()
2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
` (3 preceding siblings ...)
2022-02-21 10:53 ` [PATCH 4/5] mm/slub: Limit min_partial only in cache creation Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
2022-02-24 18:16 ` Vlastimil Babka
4 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
Hyeonggon Yoo
Simplify deactivate_slab() by removing the variable 'lock' and replacing
'l' and 'm' with 'mode'. Instead, remove the slab from the list and unlock
n->list_lock when cmpxchg_double() fails, and then retry.
One slight functional change is releasing and taking n->list_lock again
when cmpxchg_double() fails. This is not harmful because SLUB avoids
deactivating slabs as much as possible.
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
mm/slub.c | 74 +++++++++++++++++++++++++------------------------------
1 file changed, 33 insertions(+), 41 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index a4964deccb61..2d0663befb9e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2350,8 +2350,8 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
{
enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
struct kmem_cache_node *n = get_node(s, slab_nid(slab));
- int lock = 0, free_delta = 0;
- enum slab_modes l = M_NONE, m = M_NONE;
+ int free_delta = 0;
+ enum slab_modes mode = M_NONE;
void *nextfree, *freelist_iter, *freelist_tail;
int tail = DEACTIVATE_TO_HEAD;
unsigned long flags = 0;
@@ -2420,57 +2420,49 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
new.frozen = 0;
if (!new.inuse && n->nr_partial >= s->min_partial)
- m = M_FREE;
+ mode = M_FREE;
else if (new.freelist) {
- m = M_PARTIAL;
- if (!lock) {
- lock = 1;
- /*
- * Taking the spinlock removes the possibility that
- * acquire_slab() will see a slab that is frozen
- */
- spin_lock_irqsave(&n->list_lock, flags);
- }
- } else {
- m = M_FULL;
- if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
- lock = 1;
- /*
- * This also ensures that the scanning of full
- * slabs from diagnostic functions will not see
- * any frozen slabs.
- */
- spin_lock_irqsave(&n->list_lock, flags);
- }
+ mode = M_PARTIAL;
+ /*
+ * Taking the spinlock removes the possibility that
+ * acquire_slab() will see a slab that is frozen
+ */
+ spin_lock_irqsave(&n->list_lock, flags);
+ add_partial(n, slab, tail);
+ } else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
+ mode = M_FULL;
+ /*
+ * This also ensures that the scanning of full
+ * slabs from diagnostic functions will not see
+ * any frozen slabs.
+ */
+ spin_lock_irqsave(&n->list_lock, flags);
+ add_full(s, n, slab);
}
- if (l != m) {
- if (l == M_PARTIAL)
- remove_partial(n, slab);
- else if (l == M_FULL)
- remove_full(s, n, slab);
- if (m == M_PARTIAL)
- add_partial(n, slab, tail);
- else if (m == M_FULL)
- add_full(s, n, slab);
- }
-
- l = m;
if (!cmpxchg_double_slab(s, slab,
old.freelist, old.counters,
new.freelist, new.counters,
- "unfreezing slab"))
+ "unfreezing slab")) {
+ if (mode == M_PARTIAL) {
+ remove_partial(n, slab);
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ } else if (mode == M_FULL) {
+ remove_full(s, n, slab);
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
goto redo;
+ }
- if (lock)
- spin_unlock_irqrestore(&n->list_lock, flags);
- if (m == M_PARTIAL)
+ if (mode == M_PARTIAL) {
+ spin_unlock_irqrestore(&n->list_lock, flags);
stat(s, tail);
- else if (m == M_FULL)
+ } else if (mode == M_FULL) {
+ spin_unlock_irqrestore(&n->list_lock, flags);
stat(s, DEACTIVATE_FULL);
- else if (m == M_FREE) {
+ } else if (mode == M_FREE) {
stat(s, DEACTIVATE_EMPTY);
discard_slab(s, slab);
stat(s, FREE_SLAB);
--
2.33.1
* Re: [PATCH 2/5] mm/sl[auo]b: Do not export __ksize()
2022-02-21 10:53 ` [PATCH 2/5] mm/sl[auo]b: Do not export __ksize() Hyeonggon Yoo
@ 2022-02-21 15:46 ` Matthew Wilcox
2022-02-23 3:26 ` Hyeonggon Yoo
2022-02-23 18:40 ` Vlastimil Babka
0 siblings, 2 replies; 28+ messages in thread
From: Matthew Wilcox @ 2022-02-21 15:46 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
Pekka Enberg
On Mon, Feb 21, 2022 at 10:53:33AM +0000, Hyeonggon Yoo wrote:
> Do not export __ksize(). Only kasan calls __ksize() directly.
We should probably delete (most of) the kernel-doc comment on __ksize,
leaving only the note that it's uninstrumented and only called by
kasan. Also, we should probably move the declaration of __ksize from
include/linux/slab.h to mm/slab.h since we don't want to grow new callers.
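Roughly something like the below is what I have in mind (untested, and the
exact comment wording and placement in mm/slab.h are just a suggestion):

/* mm/slab.h */

/*
 * Uninstrumented version of ksize(); does not perform the safety checks
 * that ksize() does with KASAN enabled. Only meant to be called from
 * kasan code; do not add new callers.
 */
size_t __ksize(const void *objp);

and then drop the declaration (and most of the kernel-doc) from
include/linux/slab.h.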
* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
@ 2022-02-21 15:53 ` Matthew Wilcox
2022-02-22 8:10 ` Hyeonggon Yoo
2022-02-24 12:48 ` Vlastimil Babka
1 sibling, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2022-02-21 15:53 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
Pekka Enberg
On Mon, Feb 21, 2022 at 10:53:34AM +0000, Hyeonggon Yoo wrote:
> SLAB's kfree() does not support freeing an object that is allocated from
> kmalloc_large(). Fix this as SLAB do not pass requests larger than
> KMALLOC_MAX_CACHE_SIZE directly to page allocator.
I was wondering if we wanted to go in the other direction and get rid of
kmalloc cache sizes larger than, say, 64kB from the SLAB allocator.
* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
2022-02-21 15:53 ` Matthew Wilcox
@ 2022-02-22 8:10 ` Hyeonggon Yoo
2022-02-22 19:59 ` Matthew Wilcox
0 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-22 8:10 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
Pekka Enberg
On Mon, Feb 21, 2022 at 03:53:39PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 21, 2022 at 10:53:34AM +0000, Hyeonggon Yoo wrote:
> > SLAB's kfree() does not support freeing an object that is allocated from
> > kmalloc_large(). Fix this as SLAB do not pass requests larger than
> > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
>
> I was wondering if we wanted to go in the other direction and get rid of
> kmalloc cache sizes larger than, say, 64kB from the SLAB allocator.
Good point.
Hmm, I don't think SLAB benefits from queueing such large objects,
and the maximum size is still limited by what the buddy allocator supports.
I'll try reducing the kmalloc caches to at most an order-1 page, like SLUB.
That would be easier to maintain.
Thanks,
Hyeonggon
* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
2022-02-22 8:10 ` Hyeonggon Yoo
@ 2022-02-22 19:59 ` Matthew Wilcox
2022-02-23 3:24 ` Hyeonggon Yoo
0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2022-02-22 19:59 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
Pekka Enberg
On Tue, Feb 22, 2022 at 08:10:32AM +0000, Hyeonggon Yoo wrote:
> On Mon, Feb 21, 2022 at 03:53:39PM +0000, Matthew Wilcox wrote:
> > On Mon, Feb 21, 2022 at 10:53:34AM +0000, Hyeonggon Yoo wrote:
> > > SLAB's kfree() does not support freeing an object that is allocated from
> > > kmalloc_large(). Fix this as SLAB do not pass requests larger than
> > > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
> >
> > I was wondering if we wanted to go in the other direction and get rid of
> > kmalloc cache sizes larger than, say, 64kB from the SLAB allocator.
>
> Good point.
>
> Hmm.. I don't think SLAB is benefiting from queueing that large objects,
> and maximum size is still limited to what buddy allocator supports.
>
> I'll try reducing kmalloc caches up to order-1 page like SLUB.
> That would be easier to maintain.
If you have time to investigate these kinds of things, I think SLUB would
benefit from caching order-2 and order-3 slabs as well. Maybe not so much
now that Mel included order-2 and order-3 caching in the page allocator.
But it'd be interesting to have numbers.
* Re: [PATCH 4/5] mm/slub: Limit min_partial only in cache creation
2022-02-21 10:53 ` [PATCH 4/5] mm/slub: Limit min_partial only in cache creation Hyeonggon Yoo
@ 2022-02-22 23:48 ` David Rientjes
2022-02-23 3:37 ` Hyeonggon Yoo
0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2022-02-22 23:48 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
linux-kernel, Joonsoo Kim, Christoph Lameter, Pekka Enberg
On Mon, 21 Feb 2022, Hyeonggon Yoo wrote:
> SLUB sets number of minimum partial slabs for node (min_partial) using
> set_min_partial(). SLUB holds at least min_partial slabs even if they're empty
> to avoid excessive use of page allocator.
>
> set_min_partial() limits value of min_partial between MIN_PARTIAL and
> MAX_PARTIAL. As set_min_partial() can be called by min_partial_store()
> too, Only limit value of min_partial in kmem_cache_open() so that it
> can be changed to value that a user wants.
>
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
I think this makes sense and there is no reason to limit the bounds that
may be set at runtime with undocumented behavior.
However, since set_min_partial() is only setting the value in the
kmem_cache, could we remove the helper function entirely and fold it into
its two callers?
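Something like the below (untested, just to illustrate the fold; it assumes
the local variable parsed in min_partial_store() is still called 'min') is
what I have in mind:

	/* in kmem_cache_open() */
	min_partial = min(MAX_PARTIAL, ilog2(s->size) / 2);
	min_partial = max(MIN_PARTIAL, min_partial);
	s->min_partial = min_partial;

	/* in min_partial_store() */
	s->min_partial = min;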
> ---
> mm/slub.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 3a4458976ab7..a4964deccb61 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4002,10 +4002,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>
> static void set_min_partial(struct kmem_cache *s, unsigned long min)
> {
> - if (min < MIN_PARTIAL)
> - min = MIN_PARTIAL;
> - else if (min > MAX_PARTIAL)
> - min = MAX_PARTIAL;
> s->min_partial = min;
> }
>
> @@ -4184,6 +4180,8 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
>
> static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> {
> + int min_partial;
> +
> s->flags = kmem_cache_flags(s->size, flags, s->name);
> #ifdef CONFIG_SLAB_FREELIST_HARDENED
> s->random = get_random_long();
> @@ -4215,7 +4213,10 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> * The larger the object size is, the more slabs we want on the partial
> * list to avoid pounding the page allocator excessively.
> */
> - set_min_partial(s, ilog2(s->size) / 2);
> + min_partial = min(MAX_PARTIAL, ilog2(s->size) / 2);
> + min_partial = max(MIN_PARTIAL, min_partial);
> +
> + set_min_partial(s, min_partial);
>
> set_cpu_partial(s);
>
> --
> 2.33.1
>
>
* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
2022-02-22 19:59 ` Matthew Wilcox
@ 2022-02-23 3:24 ` Hyeonggon Yoo
0 siblings, 0 replies; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-23 3:24 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
Pekka Enberg
On Tue, Feb 22, 2022 at 07:59:08PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 22, 2022 at 08:10:32AM +0000, Hyeonggon Yoo wrote:
> > On Mon, Feb 21, 2022 at 03:53:39PM +0000, Matthew Wilcox wrote:
> > > On Mon, Feb 21, 2022 at 10:53:34AM +0000, Hyeonggon Yoo wrote:
> > > > SLAB's kfree() does not support freeing an object that is allocated from
> > > > kmalloc_large(). Fix this as SLAB do not pass requests larger than
> > > > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
> > >
> > > I was wondering if we wanted to go in the other direction and get rid of
> > > kmalloc cache sizes larger than, say, 64kB from the SLAB allocator.
> >
> > Good point.
> >
> > Hmm.. I don't think SLAB is benefiting from queueing that large objects,
> > and maximum size is still limited to what buddy allocator supports.
> >
> > I'll try reducing kmalloc caches up to order-1 page like SLUB.
> > That would be easier to maintain.
>
> If you have time to investigate these kinds of things, I think SLUB would
> benefit from caching order-2 and order-3 slabs as well. Maybe not so much
> now that Mel included order-2 and order-3 caching in the page allocator.
> But it'd be interesting to have numbers.
That's an interesting topic, but I think it's a slightly different one.
AFAIK it's rare for a workload to benefit more from using slab for
large objects (8K, 16K, etc.) than from using the page allocator.
And yeah, caching high-order slabs may affect the numbers even if the page
allocator caches high-order pages. SLUB already caches them, and SLUB can
cache more slabs by tuning the number of cpu partial slabs (s->cpu_partial_slabs)
and the number of node partial slabs (s->min_partial).
I need to investigate what Mel actually did and learn how it affects
SLUB, so it will take some time. Thanks!
--
Hyeonggon
* Re: [PATCH 2/5] mm/sl[auo]b: Do not export __ksize()
2022-02-21 15:46 ` Matthew Wilcox
@ 2022-02-23 3:26 ` Hyeonggon Yoo
2022-02-23 18:40 ` Vlastimil Babka
1 sibling, 0 replies; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-23 3:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
Pekka Enberg
On Mon, Feb 21, 2022 at 03:46:01PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 21, 2022 at 10:53:33AM +0000, Hyeonggon Yoo wrote:
> > Do not export __ksize(). Only kasan calls __ksize() directly.
>
> We should probably delete (most of) the kernel-doc comment on __ksize,
> leaving only the note that it's uninstrumented and only called by
> kasan. Also, we should probably move the declaration of __ksize from
> include/linux/slab.h to mm/slab.h since we don't want to grow new callers.
Seems reasonable. I'll update it in next version!
Thanks.
--
Hyeonggon
* Re: [PATCH 4/5] mm/slub: Limit min_partial only in cache creation
2022-02-22 23:48 ` David Rientjes
@ 2022-02-23 3:37 ` Hyeonggon Yoo
2022-02-24 12:52 ` Vlastimil Babka
0 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-23 3:37 UTC (permalink / raw)
To: David Rientjes
Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
linux-kernel, Joonsoo Kim, Christoph Lameter, Pekka Enberg
On Tue, Feb 22, 2022 at 03:48:16PM -0800, David Rientjes wrote:
> On Mon, 21 Feb 2022, Hyeonggon Yoo wrote:
>
> > SLUB sets number of minimum partial slabs for node (min_partial) using
> > set_min_partial(). SLUB holds at least min_partial slabs even if they're empty
> > to avoid excessive use of page allocator.
> >
> > set_min_partial() limits value of min_partial between MIN_PARTIAL and
> > MAX_PARTIAL. As set_min_partial() can be called by min_partial_store()
> > too, Only limit value of min_partial in kmem_cache_open() so that it
> > can be changed to value that a user wants.
> >
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>
> I think this makes sense and there is no reason to limit the bounds that
> may be set at runtime with undocumented behavior.
Thank you for comment.
>
> However, since set_min_partial() is only setting the value in the
> kmem_cache, could we remove the helper function entirely and fold it into
> its two callers?
Right. We don't need to keep this as a separate function. I'll update it
in the next version. Thanks!
>
> > ---
> > mm/slub.c | 11 ++++++-----
> > 1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 3a4458976ab7..a4964deccb61 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4002,10 +4002,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> >
> > static void set_min_partial(struct kmem_cache *s, unsigned long min)
> > {
> > - if (min < MIN_PARTIAL)
> > - min = MIN_PARTIAL;
> > - else if (min > MAX_PARTIAL)
> > - min = MAX_PARTIAL;
> > s->min_partial = min;
> > }
> >
> > @@ -4184,6 +4180,8 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
> >
> > static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> > {
> > + int min_partial;
> > +
> > s->flags = kmem_cache_flags(s->size, flags, s->name);
> > #ifdef CONFIG_SLAB_FREELIST_HARDENED
> > s->random = get_random_long();
> > @@ -4215,7 +4213,10 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> > * The larger the object size is, the more slabs we want on the partial
> > * list to avoid pounding the page allocator excessively.
> > */
> > - set_min_partial(s, ilog2(s->size) / 2);
> > + min_partial = min(MAX_PARTIAL, ilog2(s->size) / 2);
> > + min_partial = max(MIN_PARTIAL, min_partial);
> > +
> > + set_min_partial(s, min_partial);
> >
> > set_cpu_partial(s);
> >
> > --
> > 2.33.1
> >
> >
--
Hyeonggon
* Re: [PATCH 1/5] mm/sl[au]b: Unify __ksize()
2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
@ 2022-02-23 18:39 ` Vlastimil Babka
2022-02-23 19:06 ` Marco Elver
0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2022-02-23 18:39 UTC (permalink / raw)
To: Hyeonggon Yoo, linux-mm
Cc: Roman Gushchin, Andrew Morton, linux-kernel, Joonsoo Kim,
David Rientjes, Christoph Lameter, Pekka Enberg, Kees Cook,
kasan-dev, Marco Elver
On 2/21/22 11:53, Hyeonggon Yoo wrote:
> Only SLOB need to implement __ksize() separately because SLOB records
> size in object header for kmalloc objects. Unify SLAB/SLUB's __ksize().
>
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
> mm/slab.c | 23 -----------------------
> mm/slab_common.c | 29 +++++++++++++++++++++++++++++
> mm/slub.c | 16 ----------------
> 3 files changed, 29 insertions(+), 39 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index ddf5737c63d9..eb73d2499480 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -4199,27 +4199,4 @@ void __check_heap_object(const void *ptr, unsigned long n,
> }
> #endif /* CONFIG_HARDENED_USERCOPY */
>
> -/**
> - * __ksize -- Uninstrumented ksize.
> - * @objp: pointer to the object
> - *
> - * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
> - * safety checks as ksize() with KASAN instrumentation enabled.
> - *
> - * Return: size of the actual memory used by @objp in bytes
> - */
> -size_t __ksize(const void *objp)
> -{
> - struct kmem_cache *c;
> - size_t size;
>
> - BUG_ON(!objp);
> - if (unlikely(objp == ZERO_SIZE_PTR))
> - return 0;
> -
> - c = virt_to_cache(objp);
> - size = c ? c->object_size : 0;
This comes from commit a64b53780ec3 ("mm/slab: sanity-check page type when
looking up cache") by Kees and virt_to_cache() is an implicit check for
folio slab flag ...
> -
> - return size;
> -}
> -EXPORT_SYMBOL(__ksize);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 23f2ab0713b7..488997db0d97 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1245,6 +1245,35 @@ void kfree_sensitive(const void *p)
> }
> EXPORT_SYMBOL(kfree_sensitive);
>
> +#ifndef CONFIG_SLOB
> +/**
> + * __ksize -- Uninstrumented ksize.
> + * @objp: pointer to the object
> + *
> + * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
> + * safety checks as ksize() with KASAN instrumentation enabled.
> + *
> + * Return: size of the actual memory used by @objp in bytes
> + */
> +size_t __ksize(const void *object)
> +{
> + struct folio *folio;
> +
> + if (unlikely(object == ZERO_SIZE_PTR))
> + return 0;
> +
> + folio = virt_to_folio(object);
> +
> +#ifdef CONFIG_SLUB
> + if (unlikely(!folio_test_slab(folio)))
> + return folio_size(folio);
> +#endif
> +
> + return slab_ksize(folio_slab(folio)->slab_cache);
... and here in the common version, for SLAB you now trust that the folio
will be a slab folio, thus undoing the intention of that commit. Maybe
that's not good and we should keep the folio_test_slab() check for both cases?
Although it's also a bit strange that prior to this patch, SLAB would return 0
if the test fails while SLUB would return folio_size(), probably because with
SLUB this can be a large kmalloc here and with SLAB it cannot. So we could keep
doing that in the unified version, or the KASAN devs (CC'd) could advise
something better?
> +}
> +EXPORT_SYMBOL(__ksize);
> +#endif
> +
> /**
> * ksize - get the actual amount of memory allocated for a given object
> * @objp: Pointer to the object
> diff --git a/mm/slub.c b/mm/slub.c
> index 261474092e43..3a4458976ab7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4526,22 +4526,6 @@ void __check_heap_object(const void *ptr, unsigned long n,
> }
> #endif /* CONFIG_HARDENED_USERCOPY */
>
> -size_t __ksize(const void *object)
> -{
> - struct folio *folio;
> -
> - if (unlikely(object == ZERO_SIZE_PTR))
> - return 0;
> -
> - folio = virt_to_folio(object);
> -
> - if (unlikely(!folio_test_slab(folio)))
> - return folio_size(folio);
> -
> - return slab_ksize(folio_slab(folio)->slab_cache);
> -}
> -EXPORT_SYMBOL(__ksize);
> -
> void kfree(const void *x)
> {
> struct folio *folio;
* Re: [PATCH 2/5] mm/sl[auo]b: Do not export __ksize()
2022-02-21 15:46 ` Matthew Wilcox
2022-02-23 3:26 ` Hyeonggon Yoo
@ 2022-02-23 18:40 ` Vlastimil Babka
1 sibling, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2022-02-23 18:40 UTC (permalink / raw)
To: Matthew Wilcox, Hyeonggon Yoo
Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg
On 2/21/22 16:46, Matthew Wilcox wrote:
> On Mon, Feb 21, 2022 at 10:53:33AM +0000, Hyeonggon Yoo wrote:
>> Do not export __ksize(). Only kasan calls __ksize() directly.
>
> We should probably delete (most of) the kernel-doc comment on __ksize,
> leaving only the note that it's uninstrumented and only called by
> kasan. Also, we should probably move the declaration of __ksize from
> include/linux/slab.h to mm/slab.h since we don't want to grow new callers.
Agreed, thanks.
* Re: [PATCH 1/5] mm/sl[au]b: Unify __ksize()
2022-02-23 18:39 ` Vlastimil Babka
@ 2022-02-23 19:06 ` Marco Elver
2022-02-24 12:26 ` Vlastimil Babka
0 siblings, 1 reply; 28+ messages in thread
From: Marco Elver @ 2022-02-23 19:06 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Hyeonggon Yoo, linux-mm, Roman Gushchin, Andrew Morton,
linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
Pekka Enberg, Kees Cook, kasan-dev, Andrey Konovalov
On Wed, 23 Feb 2022 at 19:39, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 2/21/22 11:53, Hyeonggon Yoo wrote:
> > Only SLOB need to implement __ksize() separately because SLOB records
> > size in object header for kmalloc objects. Unify SLAB/SLUB's __ksize().
> >
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > ---
> > mm/slab.c | 23 -----------------------
> > mm/slab_common.c | 29 +++++++++++++++++++++++++++++
> > mm/slub.c | 16 ----------------
> > 3 files changed, 29 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/slab.c b/mm/slab.c
> > index ddf5737c63d9..eb73d2499480 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -4199,27 +4199,4 @@ void __check_heap_object(const void *ptr, unsigned long n,
> > }
> > #endif /* CONFIG_HARDENED_USERCOPY */
> >
> > -/**
> > - * __ksize -- Uninstrumented ksize.
> > - * @objp: pointer to the object
> > - *
> > - * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
> > - * safety checks as ksize() with KASAN instrumentation enabled.
> > - *
> > - * Return: size of the actual memory used by @objp in bytes
> > - */
> > -size_t __ksize(const void *objp)
> > -{
> > - struct kmem_cache *c;
> > - size_t size;
> >
> > - BUG_ON(!objp);
> > - if (unlikely(objp == ZERO_SIZE_PTR))
> > - return 0;
> > -
> > - c = virt_to_cache(objp);
> > - size = c ? c->object_size : 0;
>
> This comes from commit a64b53780ec3 ("mm/slab: sanity-check page type when
> looking up cache") by Kees and virt_to_cache() is an implicit check for
> folio slab flag ...
>
> > -
> > - return size;
> > -}
> > -EXPORT_SYMBOL(__ksize);
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 23f2ab0713b7..488997db0d97 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -1245,6 +1245,35 @@ void kfree_sensitive(const void *p)
> > }
> > EXPORT_SYMBOL(kfree_sensitive);
> >
> > +#ifndef CONFIG_SLOB
> > +/**
> > + * __ksize -- Uninstrumented ksize.
> > + * @objp: pointer to the object
> > + *
> > + * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
> > + * safety checks as ksize() with KASAN instrumentation enabled.
> > + *
> > + * Return: size of the actual memory used by @objp in bytes
> > + */
> > +size_t __ksize(const void *object)
> > +{
> > + struct folio *folio;
> > +
> > + if (unlikely(object == ZERO_SIZE_PTR))
> > + return 0;
> > +
> > + folio = virt_to_folio(object);
> > +
> > +#ifdef CONFIG_SLUB
> > + if (unlikely(!folio_test_slab(folio)))
> > + return folio_size(folio);
> > +#endif
> > +
> > + return slab_ksize(folio_slab(folio)->slab_cache);
>
> ... and here in the common version you now for SLAB trust that the folio
> will be a slab folio, thus undoing the intention of that commit. Maybe
> that's not good and we should keep the folio_test_slab() for both cases?
> Although maybe it's also strange that prior this patch, SLAB would return 0
> if the test fails, and SLUB would return folio_size(). Probably because with
> SLUB this can be a large kmalloc here and with SLAB not. So we could keep
> doing that in the unified version, or KASAN devs (CC'd) could advise
> something better?
Is this a definitive failure case? My opinion here is that returning 0
from ksize() in case of failure will a) provide a way to check for the
error, and b) if the size is used unconditionally to compute an
address, be the more graceful failure mode (see the comment added in
0d4ca4c9bab39 for what happens if memory that KASAN considers invalid
is accessed).
* Re: [PATCH 1/5] mm/sl[au]b: Unify __ksize()
2022-02-23 19:06 ` Marco Elver
@ 2022-02-24 12:26 ` Vlastimil Babka
0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2022-02-24 12:26 UTC (permalink / raw)
To: Marco Elver
Cc: Hyeonggon Yoo, linux-mm, Roman Gushchin, Andrew Morton,
linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
Pekka Enberg, Kees Cook, kasan-dev, Andrey Konovalov
On 2/23/22 20:06, Marco Elver wrote:
> On Wed, 23 Feb 2022 at 19:39, Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 2/21/22 11:53, Hyeonggon Yoo wrote:
>> > Only SLOB need to implement __ksize() separately because SLOB records
>> > size in object header for kmalloc objects. Unify SLAB/SLUB's __ksize().
>> >
>> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> > ---
>> > mm/slab.c | 23 -----------------------
>> > mm/slab_common.c | 29 +++++++++++++++++++++++++++++
>> > mm/slub.c | 16 ----------------
>> > 3 files changed, 29 insertions(+), 39 deletions(-)
>> >
>> > diff --git a/mm/slab.c b/mm/slab.c
>> > index ddf5737c63d9..eb73d2499480 100644
>> > --- a/mm/slab.c
>> > +++ b/mm/slab.c
>> > @@ -4199,27 +4199,4 @@ void __check_heap_object(const void *ptr, unsigned long n,
>> > }
>> > #endif /* CONFIG_HARDENED_USERCOPY */
>> >
>> > -/**
>> > - * __ksize -- Uninstrumented ksize.
>> > - * @objp: pointer to the object
>> > - *
>> > - * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
>> > - * safety checks as ksize() with KASAN instrumentation enabled.
>> > - *
>> > - * Return: size of the actual memory used by @objp in bytes
>> > - */
>> > -size_t __ksize(const void *objp)
>> > -{
>> > - struct kmem_cache *c;
>> > - size_t size;
>> >
>> > - BUG_ON(!objp);
>> > - if (unlikely(objp == ZERO_SIZE_PTR))
>> > - return 0;
>> > -
>> > - c = virt_to_cache(objp);
>> > - size = c ? c->object_size : 0;
>>
>> This comes from commit a64b53780ec3 ("mm/slab: sanity-check page type when
>> looking up cache") by Kees and virt_to_cache() is an implicit check for
>> folio slab flag ...
>>
>> > -
>> > - return size;
>> > -}
>> > -EXPORT_SYMBOL(__ksize);
>> > diff --git a/mm/slab_common.c b/mm/slab_common.c
>> > index 23f2ab0713b7..488997db0d97 100644
>> > --- a/mm/slab_common.c
>> > +++ b/mm/slab_common.c
>> > @@ -1245,6 +1245,35 @@ void kfree_sensitive(const void *p)
>> > }
>> > EXPORT_SYMBOL(kfree_sensitive);
>> >
>> > +#ifndef CONFIG_SLOB
>> > +/**
>> > + * __ksize -- Uninstrumented ksize.
>> > + * @objp: pointer to the object
>> > + *
>> > + * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
>> > + * safety checks as ksize() with KASAN instrumentation enabled.
>> > + *
>> > + * Return: size of the actual memory used by @objp in bytes
>> > + */
>> > +size_t __ksize(const void *object)
>> > +{
>> > + struct folio *folio;
>> > +
>> > + if (unlikely(object == ZERO_SIZE_PTR))
>> > + return 0;
>> > +
>> > + folio = virt_to_folio(object);
>> > +
>> > +#ifdef CONFIG_SLUB
>> > + if (unlikely(!folio_test_slab(folio)))
>> > + return folio_size(folio);
>> > +#endif
>> > +
>> > + return slab_ksize(folio_slab(folio)->slab_cache);
>>
>> ... and here in the common version you now for SLAB trust that the folio
>> will be a slab folio, thus undoing the intention of that commit. Maybe
>> that's not good and we should keep the folio_test_slab() for both cases?
>> Although maybe it's also strange that prior this patch, SLAB would return 0
>> if the test fails, and SLUB would return folio_size(). Probably because with
>> SLUB this can be a large kmalloc here and with SLAB not. So we could keep
>> doing that in the unified version, or KASAN devs (CC'd) could advise
>> something better?
>
> Is this a definitive failure case?
Yeah, if we call it on a supposed object pointer that turns out not to be
a slab object, it usually means some UAF, so a failure.
> My opinion here is that returning 0
> from ksize() in case of failure will a) provide a way to check for
> error, and b) if the size is used unconditionally to compute an
> address may be the more graceful failure mode (see comment added in
> 0d4ca4c9bab39 for what happens if we see invalid memory per KASAN
> being accessed).
Sounds good, thanks. Then the patch should be fixed up to keep checking for
the slab flag and return 0 otherwise for CONFIG_SLAB.
For SLUB we might fail to detect the failure case by assuming it was a large
kmalloc. Maybe we could improve that and only assume a large kmalloc when
folio_size() is large enough that the corresponding allocation would actually
have been done as a large kmalloc, and the object pointer points to the
beginning of the folio?
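For example something like this completely untested sketch; the exact sanity
checks for the SLUB large kmalloc case are just an idea:

size_t __ksize(const void *object)
{
	struct folio *folio;

	if (unlikely(object == ZERO_SIZE_PTR))
		return 0;

	folio = virt_to_folio(object);

	if (unlikely(!folio_test_slab(folio))) {
		/* SLAB has no large kmallocs, so this is likely e.g. a UAF */
		if (!IS_ENABLED(CONFIG_SLUB))
			return 0;
		/* Sanity check that this plausibly is a large kmalloc */
		if (WARN_ON(folio_size(folio) <= KMALLOC_MAX_CACHE_SIZE ||
			    object != folio_address(folio)))
			return 0;
		return folio_size(folio);
	}

	return slab_ksize(folio_slab(folio)->slab_cache);
}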
* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
2022-02-21 15:53 ` Matthew Wilcox
@ 2022-02-24 12:48 ` Vlastimil Babka
2022-02-24 13:31 ` Hyeonggon Yoo
1 sibling, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2022-02-24 12:48 UTC (permalink / raw)
To: Hyeonggon Yoo, linux-mm
Cc: Roman Gushchin, Andrew Morton, linux-kernel, Joonsoo Kim,
David Rientjes, Christoph Lameter, Pekka Enberg, Matthew Wilcox
On 2/21/22 11:53, Hyeonggon Yoo wrote:
> SLAB's kfree() does not support freeing an object that is allocated from
> kmalloc_large(). Fix this as SLAB do not pass requests larger than
> KMALLOC_MAX_CACHE_SIZE directly to page allocator.
AFAICS this issue is limited to build-time constant sizes. Might be better
to make this a build error rather than build-time NULL?
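E.g. something like this (untested sketch; exactly where the check ends up
would depend on how the patch is restructured):

	if (size > KMALLOC_MAX_CACHE_SIZE) {
		if (IS_ENABLED(CONFIG_SLUB))
			return kmalloc_large(size, flags);
		/* SLAB has no kmalloc caches beyond KMALLOC_MAX_CACHE_SIZE */
		BUILD_BUG();
	}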
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
* Re: [PATCH 4/5] mm/slub: Limit min_partial only in cache creation
2022-02-23 3:37 ` Hyeonggon Yoo
@ 2022-02-24 12:52 ` Vlastimil Babka
0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2022-02-24 12:52 UTC (permalink / raw)
To: Hyeonggon Yoo, David Rientjes
Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
Joonsoo Kim, Christoph Lameter, Pekka Enberg
On 2/23/22 04:37, Hyeonggon Yoo wrote:
> On Tue, Feb 22, 2022 at 03:48:16PM -0800, David Rientjes wrote:
>> On Mon, 21 Feb 2022, Hyeonggon Yoo wrote:
>>
>> > SLUB sets number of minimum partial slabs for node (min_partial) using
>> > set_min_partial(). SLUB holds at least min_partial slabs even if they're empty
>> > to avoid excessive use of page allocator.
>> >
>> > set_min_partial() limits value of min_partial between MIN_PARTIAL and
>> > MAX_PARTIAL. As set_min_partial() can be called by min_partial_store()
>> > too, Only limit value of min_partial in kmem_cache_open() so that it
>> > can be changed to value that a user wants.
>> >
>> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>
>> I think this makes sense and there is no reason to limit the bounds that
>> may be set at runtime with undocumented behavior.
Right.
> Thank you for comment.
>
>>
>> However, since set_min_partial() is only setting the value in the
>> kmem_cache, could we remove the helper function entirely and fold it into
>> its two callers?
>
> Right. We don't need to separate this as function. I'll update this
> in next version. Thanks!
Agreed, thanks!
* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
2022-02-24 12:48 ` Vlastimil Babka
@ 2022-02-24 13:31 ` Hyeonggon Yoo
2022-02-24 15:08 ` Vlastimil Babka
0 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-24 13:31 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
Matthew Wilcox
On Thu, Feb 24, 2022 at 01:48:47PM +0100, Vlastimil Babka wrote:
> On 2/21/22 11:53, Hyeonggon Yoo wrote:
> > SLAB's kfree() does not support freeing an object that is allocated from
> > kmalloc_large(). Fix this as SLAB do not pass requests larger than
> > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
>
> AFAICS this issue is limited to build-time constant sizes. Might be better
> to make this a build error rather than build-time NULL?
Right, and that sounds better. But what about the other direction, as Matthew
suggested: passing large requests to the page allocator, like SLUB does?
I think that's better for maintenance. Are there any obstacles to that direction?
Thank you!
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
--
Hyeonggon
* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
2022-02-24 13:31 ` Hyeonggon Yoo
@ 2022-02-24 15:08 ` Vlastimil Babka
0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2022-02-24 15:08 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
Matthew Wilcox
On 2/24/22 14:31, Hyeonggon Yoo wrote:
> On Thu, Feb 24, 2022 at 01:48:47PM +0100, Vlastimil Babka wrote:
>> On 2/21/22 11:53, Hyeonggon Yoo wrote:
>> > SLAB's kfree() does not support freeing an object that is allocated from
>> > kmalloc_large(). Fix this as SLAB do not pass requests larger than
>> > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
>>
>> AFAICS this issue is limited to build-time constant sizes. Might be better
>> to make this a build error rather than build-time NULL?
>
> Right. And sounds better. But what about another direction as Matthew said:
> passing large requests to page allocator like SLUB?
Sounds like a good idea, that would reduce the number of kmalloc caches with
SLAB, and I expect also simplify the common code further.
> I think it's better for maintenance. Any obstacles for this direction?
>
> Thank you!
>
>> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>
* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
2022-02-21 10:53 ` [PATCH 5/5] mm/slub: Refactor deactivate_slab() Hyeonggon Yoo
@ 2022-02-24 18:16 ` Vlastimil Babka
2022-02-25 9:34 ` Hyeonggon Yoo
0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2022-02-24 18:16 UTC (permalink / raw)
To: Hyeonggon Yoo, linux-mm
Cc: Roman Gushchin, Andrew Morton, linux-kernel, Joonsoo Kim,
David Rientjes, Christoph Lameter, Pekka Enberg
On 2/21/22 11:53, Hyeonggon Yoo wrote:
> Simply deactivate_slab() by removing variable 'lock' and replacing
> 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
> n->list_lock when cmpxchg_double() fails, and then retry.
>
> One slight functional change is releasing and taking n->list_lock again
> when cmpxchg_double() fails. This is not harmful because SLUB avoids
> deactivating slabs as much as possible.
>
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Hm, I wonder if we could simplify even a bit more. Do we have to actually
place the slab on a partial (full) list before the cmpxchg, only to remove
it when the cmpxchg fails? It seems that's to avoid anyone else seeing the
slab unfrozen but not on the list, which would be unexpected. However, if
anyone sees such a slab, they have to take the list_lock first to start
working with it... so this should be safe, because we hold the list_lock here
and will place the slab on the list before we release it. But then it
shouldn't matter whether the placement happens before or after a successful
cmpxchg, no? So we could do it just once, after a successful cmpxchg, and
need no undos?
Specifically AFAIK the only possible race should be with a __slab_free()
which might observe !was_frozen after we succeed an unfreezing cmpxchg and
go through the
"} else { /* Needs to be taken off a list */"
branch but then it takes the list_lock as the first thing, so will be able
to proceed only after the slab is actually on the list.
Do I miss anything or would you agree?
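Roughly, I mean something like this (untested sketch, only to show the shape
of what I'm suggesting):

	if (!new.inuse && n->nr_partial >= s->min_partial) {
		mode = M_FREE;
	} else if (new.freelist) {
		mode = M_PARTIAL;
		/*
		 * Taking the spinlock removes the possibility that
		 * acquire_slab() will see a slab that is frozen
		 */
		spin_lock_irqsave(&n->list_lock, flags);
	} else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
		mode = M_FULL;
		spin_lock_irqsave(&n->list_lock, flags);
	}

	if (!cmpxchg_double_slab(s, slab,
				 old.freelist, old.counters,
				 new.freelist, new.counters,
				 "unfreezing slab")) {
		if (mode == M_PARTIAL || mode == M_FULL)
			spin_unlock_irqrestore(&n->list_lock, flags);
		goto redo;
	}

	if (mode == M_PARTIAL) {
		/* put it on the list only after the cmpxchg succeeded */
		add_partial(n, slab, tail);
		spin_unlock_irqrestore(&n->list_lock, flags);
		stat(s, tail);
	} else if (mode == M_FULL) {
		add_full(s, n, slab);
		spin_unlock_irqrestore(&n->list_lock, flags);
		stat(s, DEACTIVATE_FULL);
	} else if (mode == M_FREE) {
		stat(s, DEACTIVATE_EMPTY);
		discard_slab(s, slab);
		stat(s, FREE_SLAB);
	}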
> ---
> mm/slub.c | 74 +++++++++++++++++++++++++------------------------------
> 1 file changed, 33 insertions(+), 41 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index a4964deccb61..2d0663befb9e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2350,8 +2350,8 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> {
> enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
> struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> - int lock = 0, free_delta = 0;
> - enum slab_modes l = M_NONE, m = M_NONE;
> + int free_delta = 0;
> + enum slab_modes mode = M_NONE;
> void *nextfree, *freelist_iter, *freelist_tail;
> int tail = DEACTIVATE_TO_HEAD;
> unsigned long flags = 0;
> @@ -2420,57 +2420,49 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> new.frozen = 0;
>
> if (!new.inuse && n->nr_partial >= s->min_partial)
> - m = M_FREE;
> + mode = M_FREE;
> else if (new.freelist) {
> - m = M_PARTIAL;
> - if (!lock) {
> - lock = 1;
> - /*
> - * Taking the spinlock removes the possibility that
> - * acquire_slab() will see a slab that is frozen
> - */
> - spin_lock_irqsave(&n->list_lock, flags);
> - }
> - } else {
> - m = M_FULL;
> - if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
> - lock = 1;
> - /*
> - * This also ensures that the scanning of full
> - * slabs from diagnostic functions will not see
> - * any frozen slabs.
> - */
> - spin_lock_irqsave(&n->list_lock, flags);
> - }
> + mode = M_PARTIAL;
> + /*
> + * Taking the spinlock removes the possibility that
> + * acquire_slab() will see a slab that is frozen
> + */
> + spin_lock_irqsave(&n->list_lock, flags);
> + add_partial(n, slab, tail);
> + } else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
> + mode = M_FULL;
> + /*
> + * This also ensures that the scanning of full
> + * slabs from diagnostic functions will not see
> + * any frozen slabs.
> + */
> + spin_lock_irqsave(&n->list_lock, flags);
> + add_full(s, n, slab);
> }
>
> - if (l != m) {
> - if (l == M_PARTIAL)
> - remove_partial(n, slab);
> - else if (l == M_FULL)
> - remove_full(s, n, slab);
>
> - if (m == M_PARTIAL)
> - add_partial(n, slab, tail);
> - else if (m == M_FULL)
> - add_full(s, n, slab);
> - }
> -
> - l = m;
> if (!cmpxchg_double_slab(s, slab,
> old.freelist, old.counters,
> new.freelist, new.counters,
> - "unfreezing slab"))
> + "unfreezing slab")) {
> + if (mode == M_PARTIAL) {
> + remove_partial(n, slab);
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + } else if (mode == M_FULL) {
> + remove_full(s, n, slab);
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + }
> goto redo;
> + }
>
> - if (lock)
> - spin_unlock_irqrestore(&n->list_lock, flags);
>
> - if (m == M_PARTIAL)
> + if (mode == M_PARTIAL) {
> + spin_unlock_irqrestore(&n->list_lock, flags);
> stat(s, tail);
> - else if (m == M_FULL)
> + } else if (mode == M_FULL) {
> + spin_unlock_irqrestore(&n->list_lock, flags);
> stat(s, DEACTIVATE_FULL);
> - else if (m == M_FREE) {
> + } else if (mode == M_FREE) {
> stat(s, DEACTIVATE_EMPTY);
> discard_slab(s, slab);
> stat(s, FREE_SLAB);
* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
2022-02-24 18:16 ` Vlastimil Babka
@ 2022-02-25 9:34 ` Hyeonggon Yoo
2022-02-25 9:50 ` Hyeonggon Yoo
0 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-25 9:34 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg
On Thu, Feb 24, 2022 at 07:16:11PM +0100, Vlastimil Babka wrote:
> On 2/21/22 11:53, Hyeonggon Yoo wrote:
> > Simply deactivate_slab() by removing variable 'lock' and replacing
> > 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
> > n->list_lock when cmpxchg_double() fails, and then retry.
> >
> > One slight functional change is releasing and taking n->list_lock again
> > when cmpxchg_double() fails. This is not harmful because SLUB avoids
> > deactivating slabs as much as possible.
> >
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>
> Hm I wonder if we could simplify even a bit more. Do we have to actually
> place the slab on a partial (full) list before the cmpxchg, only to remove
> it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
> un-frozen, but not on the list, which would be unexpected. However if anyone
> sees such slab, they have to take the list_lock first to start working with
> the slab... so this should be safe, because we hold the list_lock here, and
> will place the slab on the list before we release it. But it thus shouldn't
> matter if the placement happens before or after a successful cmpxchg, no? So
> we can only do it once after a successful cmpxchg and need no undo's?
>
My thought was similar, but after testing I noticed that holding n->list_lock
prevents a race between __slab_free() and deactivate_slab().
> Specifically AFAIK the only possible race should be with a __slab_free()
> which might observe !was_frozen after we succeed an unfreezing cmpxchg and
> go through the
> "} else { /* Needs to be taken off a list */"
> branch but then it takes the list_lock as the first thing, so will be able
> to proceed only after the slab is actually on the list.
>
> Do I miss anything or would you agree?
>
It's quite tricky.
I tried to simplify it further as you said. Seeing a frozen slab on the list
was not the problem. The problem was that something might interfere between
the cmpxchg_double() and taking the spinlock.
This is what I faced:
CPU A                              CPU B
deactivate_slab() {                __slab_free() {
    /* slab is full */
    slab.frozen = 0;
    cmpxchg_double();
                                       /* Hmm...
                                          slab->frozen == 0 &&
                                          slab->freelist != NULL?
                                          Oh, this must be on the list.. */
                                       spin_lock_irqsave();
                                       cmpxchg_double();
                                       /* Corruption: the slab was not yet
                                          inserted into the list, but we
                                          try removing it */
                                       remove_full();
                                       spin_unlock_irqrestore();
                                   }
    spin_lock_irqsave();
    add_full();
    spin_unlock_irqrestore();
}
I think it's quite confusing because the lock is protecting code, not data.
Would you have an idea of how to solve it, or should we add a comment about this?
> > ---
> > mm/slub.c | 74 +++++++++++++++++++++++++------------------------------
> > 1 file changed, 33 insertions(+), 41 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a4964deccb61..2d0663befb9e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2350,8 +2350,8 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> > {
> > enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
> > struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> > - int lock = 0, free_delta = 0;
> > - enum slab_modes l = M_NONE, m = M_NONE;
> > + int free_delta = 0;
> > + enum slab_modes mode = M_NONE;
> > void *nextfree, *freelist_iter, *freelist_tail;
> > int tail = DEACTIVATE_TO_HEAD;
> > unsigned long flags = 0;
> > @@ -2420,57 +2420,49 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> > new.frozen = 0;
> >
> > if (!new.inuse && n->nr_partial >= s->min_partial)
> > - m = M_FREE;
> > + mode = M_FREE;
> > else if (new.freelist) {
> > - m = M_PARTIAL;
> > - if (!lock) {
> > - lock = 1;
> > - /*
> > - * Taking the spinlock removes the possibility that
> > - * acquire_slab() will see a slab that is frozen
> > - */
> > - spin_lock_irqsave(&n->list_lock, flags);
> > - }
> > - } else {
> > - m = M_FULL;
> > - if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
> > - lock = 1;
> > - /*
> > - * This also ensures that the scanning of full
> > - * slabs from diagnostic functions will not see
> > - * any frozen slabs.
> > - */
> > - spin_lock_irqsave(&n->list_lock, flags);
> > - }
> > + mode = M_PARTIAL;
> > + /*
> > + * Taking the spinlock removes the possibility that
> > + * acquire_slab() will see a slab that is frozen
> > + */
> > + spin_lock_irqsave(&n->list_lock, flags);
> > + add_partial(n, slab, tail);
> > + } else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
> > + mode = M_FULL;
> > + /*
> > + * This also ensures that the scanning of full
> > + * slabs from diagnostic functions will not see
> > + * any frozen slabs.
> > + */
> > + spin_lock_irqsave(&n->list_lock, flags);
> > + add_full(s, n, slab);
> > }
> >
> > - if (l != m) {
> > - if (l == M_PARTIAL)
> > - remove_partial(n, slab);
> > - else if (l == M_FULL)
> > - remove_full(s, n, slab);
> >
> > - if (m == M_PARTIAL)
> > - add_partial(n, slab, tail);
> > - else if (m == M_FULL)
> > - add_full(s, n, slab);
> > - }
> > -
> > - l = m;
> > if (!cmpxchg_double_slab(s, slab,
> > old.freelist, old.counters,
> > new.freelist, new.counters,
> > - "unfreezing slab"))
> > + "unfreezing slab")) {
> > + if (mode == M_PARTIAL) {
> > + remove_partial(n, slab);
> > + spin_unlock_irqrestore(&n->list_lock, flags);
> > + } else if (mode == M_FULL) {
> > + remove_full(s, n, slab);
> > + spin_unlock_irqrestore(&n->list_lock, flags);
> > + }
> > goto redo;
> > + }
> >
> > - if (lock)
> > - spin_unlock_irqrestore(&n->list_lock, flags);
> >
> > - if (m == M_PARTIAL)
> > + if (mode == M_PARTIAL) {
> > + spin_unlock_irqrestore(&n->list_lock, flags);
> > stat(s, tail);
> > - else if (m == M_FULL)
> > + } else if (mode == M_FULL) {
> > + spin_unlock_irqrestore(&n->list_lock, flags);
> > stat(s, DEACTIVATE_FULL);
> > - else if (m == M_FREE) {
> > + } else if (mode == M_FREE) {
> > stat(s, DEACTIVATE_EMPTY);
> > discard_slab(s, slab);
> > stat(s, FREE_SLAB);
>
--
Thank you, You are awesome!
Hyeonggon :-)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
2022-02-25 9:34 ` Hyeonggon Yoo
@ 2022-02-25 9:50 ` Hyeonggon Yoo
2022-02-25 10:07 ` Vlastimil Babka
0 siblings, 1 reply; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-25 9:50 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg
On Fri, Feb 25, 2022 at 09:34:09AM +0000, Hyeonggon Yoo wrote:
> On Thu, Feb 24, 2022 at 07:16:11PM +0100, Vlastimil Babka wrote:
> > On 2/21/22 11:53, Hyeonggon Yoo wrote:
> > > Simply deactivate_slab() by removing variable 'lock' and replacing
> > > 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
> > > n->list_lock when cmpxchg_double() fails, and then retry.
> > >
> > > One slight functional change is releasing and taking n->list_lock again
> > > when cmpxchg_double() fails. This is not harmful because SLUB avoids
> > > deactivating slabs as much as possible.
> > >
> > > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >
> > Hm I wonder if we could simplify even a bit more. Do we have to actually
> > place the slab on a partial (full) list before the cmpxchg, only to remove
> > it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
> > un-frozen, but not on the list, which would be unexpected. However if anyone
> > sees such slab, they have to take the list_lock first to start working with
> > the slab... so this should be safe, because we hold the list_lock here, and
> > will place the slab on the list before we release it. But it thus shouldn't
> > matter if the placement happens before or after a successful cmpxchg, no? So
> > we can only do it once after a successful cmpxchg and need no undo's?
> >
>
> My thought was similar. But after testing I noticed that &n->list_lock prevents
> race between __slab_free() and deactivate_slab().
>
> > Specifically AFAIK the only possible race should be with a __slab_free()
> > which might observe !was_frozen after we succeed an unfreezing cmpxchg and
> > go through the
> > "} else { /* Needs to be taken off a list */"
> > branch but then it takes the list_lock as the first thing, so will be able
> > to proceed only after the slab is actually on the list.
> >
> > Do I miss anything or would you agree?
> >
>
> It's so tricky.
>
> I tried to simplify more as you said. Seeing frozen slab on list was not
> problem. But the problem was that something might interfere between
> cmpxchg_double() and taking spinlock.
>
> This is what I faced:
>
> CPU A                                   CPU B
> deactivate_slab() {                     __slab_free() {
>     /* slab is full */
>     slab.frozen = 0;
>     cmpxchg_double();
>                                             /* Hmm...
>                                                slab->frozen == 0 &&
>                                                slab->freelist != NULL?
>                                                Oh This must be on the list.. */

                                            Oh this is wrong.
                                            slab->freelist must be
                                            NULL because it's full
                                            slab.

                                            It's more complex
                                            than I thought...

>                                             spin_lock_irqsave();
>                                             cmpxchg_double();
>                                             /* Corruption: slab
>                                              * was not yet inserted to
>                                              * list but try removing */
>                                             remove_full();
>                                             spin_unlock_irqrestore();
>                                         }
>     spin_lock_irqsave();
>     add_full();
>     spin_unlock_irqrestore();
> }
So it was...
CPU A                                   CPU B
deactivate_slab() {                     __slab_free() {
    /* slab is full */
    slab.frozen = 0;
    cmpxchg_double();
                                            /*
                                               Hmm...
                                               !was_frozen &&
                                               prior == NULL?
                                               Let's freeze this!
                                            */
                                            put_cpu_partial();
                                        }
    spin_lock_irqsave();
    add_full();
    /* It's now frozen by CPU B and at the same time on full list */
    spin_unlock_irqrestore();
And &n->list_lock prevents such a race.
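For reference, the __slab_free() branch we keep referring to looks roughly
like this (heavily condensed from mm/slub.c, not an exact excerpt, just to
show where CPU B either freezes the slab or has to take the same lock):

	do {
		prior = slab->freelist;
		counters = slab->counters;
		...
		was_frozen = new.frozen;
		if ((!new.inuse || !prior) && !was_frozen) {
			if (kmem_cache_has_cpu_partial(s) && !prior) {
				/* CPU B's path in the diagram above;
				 * put_cpu_partial() follows once the
				 * cmpxchg succeeds */
				new.frozen = 1;
			} else { /* Needs to be taken off a list */
				n = get_node(s, slab_nid(slab));
				/* serializes with deactivate_slab()
				 * holding n->list_lock across its
				 * own cmpxchg */
				spin_lock_irqsave(&n->list_lock, flags);
			}
		}
	} while (!cmpxchg_double_slab(s, slab, prior, counters,
				      head, new.counters, "__slab_free"));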
>
> I think it's quite confusing because it's protecting code, not data.
>
> Would you have an idea to solve it, or should we add a comment for this?
>
> > > ---
> > > mm/slub.c | 74 +++++++++++++++++++++++++------------------------------
> > > 1 file changed, 33 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index a4964deccb61..2d0663befb9e 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2350,8 +2350,8 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> > > {
> > > enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
> > > struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> > > - int lock = 0, free_delta = 0;
> > > - enum slab_modes l = M_NONE, m = M_NONE;
> > > + int free_delta = 0;
> > > + enum slab_modes mode = M_NONE;
> > > void *nextfree, *freelist_iter, *freelist_tail;
> > > int tail = DEACTIVATE_TO_HEAD;
> > > unsigned long flags = 0;
> > > @@ -2420,57 +2420,49 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> > > new.frozen = 0;
> > >
> > > if (!new.inuse && n->nr_partial >= s->min_partial)
> > > - m = M_FREE;
> > > + mode = M_FREE;
> > > else if (new.freelist) {
> > > - m = M_PARTIAL;
> > > - if (!lock) {
> > > - lock = 1;
> > > - /*
> > > - * Taking the spinlock removes the possibility that
> > > - * acquire_slab() will see a slab that is frozen
> > > - */
> > > - spin_lock_irqsave(&n->list_lock, flags);
> > > - }
> > > - } else {
> > > - m = M_FULL;
> > > - if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
> > > - lock = 1;
> > > - /*
> > > - * This also ensures that the scanning of full
> > > - * slabs from diagnostic functions will not see
> > > - * any frozen slabs.
> > > - */
> > > - spin_lock_irqsave(&n->list_lock, flags);
> > > - }
> > > + mode = M_PARTIAL;
> > > + /*
> > > + * Taking the spinlock removes the possibility that
> > > + * acquire_slab() will see a slab that is frozen
> > > + */
> > > + spin_lock_irqsave(&n->list_lock, flags);
> > > + add_partial(n, slab, tail);
> > > + } else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
> > > + mode = M_FULL;
> > > + /*
> > > + * This also ensures that the scanning of full
> > > + * slabs from diagnostic functions will not see
> > > + * any frozen slabs.
> > > + */
> > > + spin_lock_irqsave(&n->list_lock, flags);
> > > + add_full(s, n, slab);
> > > }
> > >
> > > - if (l != m) {
> > > - if (l == M_PARTIAL)
> > > - remove_partial(n, slab);
> > > - else if (l == M_FULL)
> > > - remove_full(s, n, slab);
> > >
> > > - if (m == M_PARTIAL)
> > > - add_partial(n, slab, tail);
> > > - else if (m == M_FULL)
> > > - add_full(s, n, slab);
> > > - }
> > > -
> > > - l = m;
> > > if (!cmpxchg_double_slab(s, slab,
> > > old.freelist, old.counters,
> > > new.freelist, new.counters,
> > > - "unfreezing slab"))
> > > + "unfreezing slab")) {
> > > + if (mode == M_PARTIAL) {
> > > + remove_partial(n, slab);
> > > + spin_unlock_irqrestore(&n->list_lock, flags);
> > > + } else if (mode == M_FULL) {
> > > + remove_full(s, n, slab);
> > > + spin_unlock_irqrestore(&n->list_lock, flags);
> > > + }
> > > goto redo;
> > > + }
> > >
> > > - if (lock)
> > > - spin_unlock_irqrestore(&n->list_lock, flags);
> > >
> > > - if (m == M_PARTIAL)
> > > + if (mode == M_PARTIAL) {
> > > + spin_unlock_irqrestore(&n->list_lock, flags);
> > > stat(s, tail);
> > > - else if (m == M_FULL)
> > > + } else if (mode == M_FULL) {
> > > + spin_unlock_irqrestore(&n->list_lock, flags);
> > > stat(s, DEACTIVATE_FULL);
> > > - else if (m == M_FREE) {
> > > + } else if (mode == M_FREE) {
> > > stat(s, DEACTIVATE_EMPTY);
> > > discard_slab(s, slab);
> > > stat(s, FREE_SLAB);
> >
>
> --
> Thank you, You are awesome!
> Hyeonggon :-)
--
Thank you, You are awesome!
Hyeonggon :-)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
2022-02-25 9:50 ` Hyeonggon Yoo
@ 2022-02-25 10:07 ` Vlastimil Babka
2022-02-25 10:26 ` Hyeonggon Yoo
0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2022-02-25 10:07 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg
On 2/25/22 10:50, Hyeonggon Yoo wrote:
> On Fri, Feb 25, 2022 at 09:34:09AM +0000, Hyeonggon Yoo wrote:
>> On Thu, Feb 24, 2022 at 07:16:11PM +0100, Vlastimil Babka wrote:
>> > On 2/21/22 11:53, Hyeonggon Yoo wrote:
>> > > Simply deactivate_slab() by removing variable 'lock' and replacing
>> > > 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
>> > > n->list_lock when cmpxchg_double() fails, and then retry.
>> > >
>> > > One slight functional change is releasing and taking n->list_lock again
>> > > when cmpxchg_double() fails. This is not harmful because SLUB avoids
>> > > deactivating slabs as much as possible.
>> > >
>> > > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> >
>> > Hm I wonder if we could simplify even a bit more. Do we have to actually
>> > place the slab on a partial (full) list before the cmpxchg, only to remove
>> > it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
>> > un-frozen, but not on the list, which would be unexpected. However if anyone
>> > sees such slab, they have to take the list_lock first to start working with
>> > the slab... so this should be safe, because we hold the list_lock here, and
>> > will place the slab on the list before we release it. But it thus shouldn't
>> > matter if the placement happens before or after a successful cmpxchg, no? So
>> > we can only do it once after a successful cmpxchg and need no undo's?
>> >
>>
>> My thought was similar. But after testing I noticed that &n->list_lock prevents
>> race between __slab_free() and deactivate_slab().
>>
>> > Specifically AFAIK the only possible race should be with a __slab_free()
>> > which might observe !was_frozen after we succeed an unfreezing cmpxchg and
>> > go through the
>> > "} else { /* Needs to be taken off a list */"
>> > branch but then it takes the list_lock as the first thing, so will be able
>> > to proceed only after the slab is actually on the list.
>> >
>> > Do I miss anything or would you agree?
>> >
>>
>> It's so tricky.
>>
>> I tried to simplify more as you said. Seeing frozen slab on list was not
>> problem. But the problem was that something might interfere between
>> cmpxchg_double() and taking spinlock.
>>
>> This is what I faced:
>>
>> CPU A                                   CPU B
>> deactivate_slab() {                     __slab_free() {
>>     /* slab is full */
>>     slab.frozen = 0;
>>     cmpxchg_double();
>>                                             /* Hmm...
>>                                                slab->frozen == 0 &&
>>                                                slab->freelist != NULL?
>>                                                Oh This must be on the list.. */
> >                                         Oh this is wrong.
> >                                         slab->freelist must be
> >                                         NULL because it's full
> >                                         slab.
> >
> >                                         It's more complex
> >                                         than I thought...
> >
> >
>>                                             spin_lock_irqsave();
>>                                             cmpxchg_double();
>>                                             /* Corruption: slab
>>                                              * was not yet inserted to
>>                                              * list but try removing */
>>                                             remove_full();
>>                                             spin_unlock_irqrestore();
>>                                         }
>>     spin_lock_irqsave();
>>     add_full();
>>     spin_unlock_irqrestore();
>> }
>
> So it was...
>
> CPU A                                   CPU B
> deactivate_slab() {                     __slab_free() {
>     /* slab is full */
>     slab.frozen = 0;
>     cmpxchg_double();
>                                             /*
>                                                Hmm...
>                                                !was_frozen &&
>                                                prior == NULL?
>                                                Let's freeze this!
>                                             */
>                                             put_cpu_partial();
>                                         }
>     spin_lock_irqsave();
Yeah in my proposal I didn't intend to only take spin_lock_irqsave() here.
My idea for CPU A would be something like:
spin_lock_irqsave();
slab.frozen = 0;
if (cmpxchg_double()) {
	/* success */
	add_partial(); // (or add_full())
	spin_unlock_irqrestore();
} else {
	/* fail */
	spin_unlock_irqrestore();
	goto redo;
}
So we would still have the list_lock protection around cmpxchg as in the
current code. We just wouldn't do e.g. add_partial() before cmpxchg, only to
remove_partial() when cmpxchg failed.
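Spelled out a bit more on top of your patch (an untested sketch only, the
exact details and stat handling may well differ), the tail of
deactivate_slab() could then look like:

	if (!new.inuse && n->nr_partial >= s->min_partial)
		mode = M_FREE;
	else if (new.freelist)
		mode = M_PARTIAL;
	else if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
		mode = M_FULL;

	if (mode == M_PARTIAL || mode == M_FULL)
		spin_lock_irqsave(&n->list_lock, flags);

	if (!cmpxchg_double_slab(s, slab,
				 old.freelist, old.counters,
				 new.freelist, new.counters,
				 "unfreezing slab")) {
		if (mode == M_PARTIAL || mode == M_FULL)
			spin_unlock_irqrestore(&n->list_lock, flags);
		goto redo;
	}

	if (mode == M_PARTIAL) {
		add_partial(n, slab, tail);
		spin_unlock_irqrestore(&n->list_lock, flags);
		stat(s, tail);
	} else if (mode == M_FULL) {
		add_full(s, n, slab);
		spin_unlock_irqrestore(&n->list_lock, flags);
		stat(s, DEACTIVATE_FULL);
	} else if (mode == M_FREE) {
		stat(s, DEACTIVATE_EMPTY);
		discard_slab(s, slab);
		stat(s, FREE_SLAB);
	}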
> add_full();
> /* It's now frozen by CPU B and at the same time on full list */
> spin_unlock_irqrestore();
>
> And &n->list_lock prevents such a race.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
2022-02-25 10:07 ` Vlastimil Babka
@ 2022-02-25 10:26 ` Hyeonggon Yoo
0 siblings, 0 replies; 28+ messages in thread
From: Hyeonggon Yoo @ 2022-02-25 10:26 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg
On Fri, Feb 25, 2022 at 11:07:45AM +0100, Vlastimil Babka wrote:
> On 2/25/22 10:50, Hyeonggon Yoo wrote:
> > On Fri, Feb 25, 2022 at 09:34:09AM +0000, Hyeonggon Yoo wrote:
> >> On Thu, Feb 24, 2022 at 07:16:11PM +0100, Vlastimil Babka wrote:
> >> > On 2/21/22 11:53, Hyeonggon Yoo wrote:
> >> > > Simply deactivate_slab() by removing variable 'lock' and replacing
> >> > > 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
> >> > > n->list_lock when cmpxchg_double() fails, and then retry.
> >> > >
> >> > > One slight functional change is releasing and taking n->list_lock again
> >> > > when cmpxchg_double() fails. This is not harmful because SLUB avoids
> >> > > deactivating slabs as much as possible.
> >> > >
> >> > > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >> >
> >> > Hm I wonder if we could simplify even a bit more. Do we have to actually
> >> > place the slab on a partial (full) list before the cmpxchg, only to remove
> >> > it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
> >> > un-frozen, but not on the list, which would be unexpected. However if anyone
> >> > sees such slab, they have to take the list_lock first to start working with
> >> > the slab... so this should be safe, because we hold the list_lock here, and
> >> > will place the slab on the list before we release it. But it thus shouldn't
> >> > matter if the placement happens before or after a successful cmpxchg, no? So
> >> > we can only do it once after a successful cmpxchg and need no undo's?
> >> >
> >>
> >> My thought was similar. But after testing I noticed that &n->list_lock prevents
> >> race between __slab_free() and deactivate_slab().
> >>
> >> > Specifically AFAIK the only possible race should be with a __slab_free()
> >> > which might observe !was_frozen after we succeed an unfreezing cmpxchg and
> >> > go through the
> >> > "} else { /* Needs to be taken off a list */"
> >> > branch but then it takes the list_lock as the first thing, so will be able
> >> > to proceed only after the slab is actually on the list.
> >> >
> >> > Do I miss anything or would you agree?
> >> >
> >>
> >> It's so tricky.
> >>
> >> I tried to simplify more as you said. Seeing frozen slab on list was not
> >> problem. But the problem was that something might interfere between
> >> cmpxchg_double() and taking spinlock.
> >>
> >> This is what I faced:
> >>
> >> CPU A                                   CPU B
> >> deactivate_slab() {                     __slab_free() {
> >>     /* slab is full */
> >>     slab.frozen = 0;
> >>     cmpxchg_double();
> >>                                             /* Hmm...
> >>                                                slab->frozen == 0 &&
> >>                                                slab->freelist != NULL?
> >>                                                Oh This must be on the list.. */
> >                                          Oh this is wrong.
> >                                          slab->freelist must be
> >                                          NULL because it's full
> >                                          slab.
> >
> >                                          It's more complex
> >                                          than I thought...
> >
> >
> >>                                             spin_lock_irqsave();
> >>                                             cmpxchg_double();
> >>                                             /* Corruption: slab
> >>                                              * was not yet inserted to
> >>                                              * list but try removing */
> >>                                             remove_full();
> >>                                             spin_unlock_irqrestore();
> >>                                         }
> >>     spin_lock_irqsave();
> >>     add_full();
> >>     spin_unlock_irqrestore();
> >> }
> >
> > So it was...
> >
> > CPU A                                   CPU B
> > deactivate_slab() {                     __slab_free() {
> >     /* slab is full */
> >     slab.frozen = 0;
> >     cmpxchg_double();
> >                                             /*
> >                                                Hmm...
> >                                                !was_frozen &&
> >                                                prior == NULL?
> >                                                Let's freeze this!
> >                                             */
> >                                             put_cpu_partial();
> >                                         }
> >     spin_lock_irqsave();
>
> Yeah in my proposal I didn't intend to only take spin_lock_irqsave() here.
> My idea for CPU A would be something like:
>
Oh, I misunderstood your proposal.
I spent hours figuring out what was wrong, haha.
> spin_lock_irqsave();
> slab.frozen = 0;
> if (cmpxchg_double()) {
> 	/* success */
> 	add_partial(); // (or add_full())
> 	spin_unlock_irqrestore();
> } else {
> 	/* fail */
> 	spin_unlock_irqrestore();
> 	goto redo;
> }
>
> So we would still have the list_lock protection around cmpxchg as in the
> current code. We just wouldn't do e.g. add_partial() before cmpxchg, only to
> remove_partial() when cmpxchg failed.
Now I get what you mean...
I think that would work. Will try that.
Thank you for the nice proposal!
>
> > add_full();
> > /* It's now frozen by CPU B and at the same time on full list */
> > spin_unlock_irqrestore();
> >
> > And &n->list_lock prevents such a race.
>
--
Thank you, You are awesome!
Hyeonggon :-)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
@ 2022-02-22 7:53 kernel test robot
0 siblings, 0 replies; 28+ messages in thread
From: kernel test robot @ 2022-02-22 7:53 UTC (permalink / raw)
To: kbuild
[-- Attachment #1: Type: text/plain, Size: 14980 bytes --]
CC: kbuild-all(a)lists.01.org
In-Reply-To: <20220221105336.522086-4-42.hyeyoo@gmail.com>
References: <20220221105336.522086-4-42.hyeyoo@gmail.com>
TO: Hyeonggon Yoo <42.hyeyoo@gmail.com>
TO: linux-mm(a)kvack.org
CC: Roman Gushchin <guro@fb.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux Memory Management List <linux-mm@kvack.org>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: linux-kernel(a)vger.kernel.org
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: David Rientjes <rientjes@google.com>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Pekka Enberg <penberg@kernel.org>
CC: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Hi Hyeonggon,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on hnaz-mm/master]
[also build test WARNING on linus/master v5.17-rc5 next-20220217]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Hyeonggon-Yoo/slab-cleanups/20220221-185522
base: https://github.com/hnaz/linux-mm master
:::::: branch date: 21 hours ago
:::::: commit date: 21 hours ago
config: x86_64-randconfig-m001-20220221 (https://download.01.org/0day-ci/archive/20220222/202202221538.EHaGbteQ-lkp(a)intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
smatch warnings:
fs/ubifs/debug.c:2043 check_leaf() warn: is 'node' large enough for 'struct ubifs_ch'? s32min
fs/ubifs/debug.c:2072 check_leaf() warn: is 'node' large enough for 'struct ubifs_data_node'? s32min
fs/ubifs/debug.c:2094 check_leaf() warn: is 'node' large enough for 'struct ubifs_dent_node'? s32min
vim +2043 fs/ubifs/debug.c
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1980
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1981 /**
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1982 * check_leaf - check leaf node.
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1983 * @c: UBIFS file-system description object
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1984 * @zbr: zbranch of the leaf node to check
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1985 * @priv: FS checking information
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1986 *
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1987 * This is a helper function for 'dbg_check_filesystem()' which is called for
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1988 * every single leaf node while walking the indexing tree. It checks that the
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1989 * leaf node referred from the indexing tree exists, has correct CRC, and does
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1990 * some other basic validation. This function is also responsible for building
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1991 * an RB-tree of inodes - it adds all inodes into the RB-tree. It also
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1992 * calculates reference count, size, etc for each inode in order to later
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1993 * compare them to the information stored inside the inodes and detect possible
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1994 * inconsistencies. Returns zero in case of success and a negative error code
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1995 * in case of failure.
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1996 */
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1997 static int check_leaf(struct ubifs_info *c, struct ubifs_zbranch *zbr,
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1998 void *priv)
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 1999 {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2000 ino_t inum;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2001 void *node;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2002 struct ubifs_ch *ch;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2003 int err, type = key_type(c, &zbr->key);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2004 struct fsck_inode *fscki;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2005
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2006 if (zbr->len < UBIFS_CH_SZ) {
235c362bd0f6af Sheng Yong 2015-03-20 2007 ubifs_err(c, "bad leaf length %d (LEB %d:%d)",
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2008 zbr->len, zbr->lnum, zbr->offs);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2009 return -EINVAL;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2010 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2011
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2012 node = kmalloc(zbr->len, GFP_NOFS);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2013 if (!node)
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2014 return -ENOMEM;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2015
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2016 err = ubifs_tnc_read_node(c, zbr, node);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2017 if (err) {
235c362bd0f6af Sheng Yong 2015-03-20 2018 ubifs_err(c, "cannot read leaf node at LEB %d:%d, error %d",
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2019 zbr->lnum, zbr->offs, err);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2020 goto out_free;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2021 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2022
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2023 /* If this is an inode node, add it to RB-tree of inodes */
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2024 if (type == UBIFS_INO_KEY) {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2025 fscki = add_inode(c, priv, node);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2026 if (IS_ERR(fscki)) {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2027 err = PTR_ERR(fscki);
235c362bd0f6af Sheng Yong 2015-03-20 2028 ubifs_err(c, "error %d while adding inode node", err);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2029 goto out_dump;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2030 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2031 goto out;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2032 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2033
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2034 if (type != UBIFS_DENT_KEY && type != UBIFS_XENT_KEY &&
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2035 type != UBIFS_DATA_KEY) {
235c362bd0f6af Sheng Yong 2015-03-20 2036 ubifs_err(c, "unexpected node type %d at LEB %d:%d",
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2037 type, zbr->lnum, zbr->offs);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2038 err = -EINVAL;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2039 goto out_free;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2040 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2041
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2042 ch = node;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 @2043 if (le64_to_cpu(ch->sqnum) > c->max_sqnum) {
235c362bd0f6af Sheng Yong 2015-03-20 2044 ubifs_err(c, "too high sequence number, max. is %llu",
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2045 c->max_sqnum);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2046 err = -EINVAL;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2047 goto out_dump;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2048 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2049
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2050 if (type == UBIFS_DATA_KEY) {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2051 long long blk_offs;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2052 struct ubifs_data_node *dn = node;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2053
6eb61d587f4515 Richard Weinberger 2018-07-12 2054 ubifs_assert(c, zbr->len >= UBIFS_DATA_NODE_SZ);
fb4325a3d9f983 Artem Bityutskiy 2014-11-25 2055
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2056 /*
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2057 * Search the inode node this data node belongs to and insert
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2058 * it to the RB-tree of inodes.
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2059 */
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2060 inum = key_inum_flash(c, &dn->key);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2061 fscki = read_add_inode(c, priv, inum);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2062 if (IS_ERR(fscki)) {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2063 err = PTR_ERR(fscki);
235c362bd0f6af Sheng Yong 2015-03-20 2064 ubifs_err(c, "error %d while processing data node and trying to find inode node %lu",
e84461ad9c4f0f Artem Bityutskiy 2008-10-29 2065 err, (unsigned long)inum);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2066 goto out_dump;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2067 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2068
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2069 /* Make sure the data node is within inode size */
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2070 blk_offs = key_block_flash(c, &dn->key);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2071 blk_offs <<= UBIFS_BLOCK_SHIFT;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 @2072 blk_offs += le32_to_cpu(dn->size);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2073 if (blk_offs > fscki->size) {
235c362bd0f6af Sheng Yong 2015-03-20 2074 ubifs_err(c, "data node at LEB %d:%d is not within inode size %lld",
79fda5179a5227 Artem Bityutskiy 2012-08-27 2075 zbr->lnum, zbr->offs, fscki->size);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2076 err = -EINVAL;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2077 goto out_dump;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2078 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2079 } else {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2080 int nlen;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2081 struct ubifs_dent_node *dent = node;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2082 struct fsck_inode *fscki1;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2083
6eb61d587f4515 Richard Weinberger 2018-07-12 2084 ubifs_assert(c, zbr->len >= UBIFS_DENT_NODE_SZ);
fb4325a3d9f983 Artem Bityutskiy 2014-11-25 2085
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2086 err = ubifs_validate_entry(c, dent);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2087 if (err)
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2088 goto out_dump;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2089
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2090 /*
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2091 * Search the inode node this entry refers to and the parent
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2092 * inode node and insert them to the RB-tree of inodes.
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2093 */
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 @2094 inum = le64_to_cpu(dent->inum);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2095 fscki = read_add_inode(c, priv, inum);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2096 if (IS_ERR(fscki)) {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2097 err = PTR_ERR(fscki);
235c362bd0f6af Sheng Yong 2015-03-20 2098 ubifs_err(c, "error %d while processing entry node and trying to find inode node %lu",
e84461ad9c4f0f Artem Bityutskiy 2008-10-29 2099 err, (unsigned long)inum);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2100 goto out_dump;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2101 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2102
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2103 /* Count how many direntries or xentries refers this inode */
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2104 fscki->references += 1;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2105
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2106 inum = key_inum_flash(c, &dent->key);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2107 fscki1 = read_add_inode(c, priv, inum);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2108 if (IS_ERR(fscki1)) {
b38882f5c066dc Roel Kluin 2009-12-07 2109 err = PTR_ERR(fscki1);
235c362bd0f6af Sheng Yong 2015-03-20 2110 ubifs_err(c, "error %d while processing entry node and trying to find parent inode node %lu",
e84461ad9c4f0f Artem Bityutskiy 2008-10-29 2111 err, (unsigned long)inum);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2112 goto out_dump;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2113 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2114
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2115 nlen = le16_to_cpu(dent->nlen);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2116 if (type == UBIFS_XENT_KEY) {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2117 fscki1->calc_xcnt += 1;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2118 fscki1->calc_xsz += CALC_DENT_SIZE(nlen);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2119 fscki1->calc_xsz += CALC_XATTR_BYTES(fscki->size);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2120 fscki1->calc_xnms += nlen;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2121 } else {
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2122 fscki1->calc_sz += CALC_DENT_SIZE(nlen);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2123 if (dent->type == UBIFS_ITYPE_DIR)
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2124 fscki1->calc_cnt += 1;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2125 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2126 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2127
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2128 out:
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2129 kfree(node);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2130 return 0;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2131
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2132 out_dump:
235c362bd0f6af Sheng Yong 2015-03-20 2133 ubifs_msg(c, "dump of node at LEB %d:%d", zbr->lnum, zbr->offs);
a33e30a0e023e9 Zhihao Cheng 2020-06-16 2134 ubifs_dump_node(c, node, zbr->len);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2135 out_free:
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2136 kfree(node);
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2137 return err;
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2138 }
1e51764a3c2ac0 Artem Bityutskiy 2008-07-14 2139
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2022-02-25 10:27 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
2022-02-23 18:39 ` Vlastimil Babka
2022-02-23 19:06 ` Marco Elver
2022-02-24 12:26 ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 2/5] mm/sl[auo]b: Do not export __ksize() Hyeonggon Yoo
2022-02-21 15:46 ` Matthew Wilcox
2022-02-23 3:26 ` Hyeonggon Yoo
2022-02-23 18:40 ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
2022-02-21 15:53 ` Matthew Wilcox
2022-02-22 8:10 ` Hyeonggon Yoo
2022-02-22 19:59 ` Matthew Wilcox
2022-02-23 3:24 ` Hyeonggon Yoo
2022-02-24 12:48 ` Vlastimil Babka
2022-02-24 13:31 ` Hyeonggon Yoo
2022-02-24 15:08 ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 4/5] mm/slub: Limit min_partial only in cache creation Hyeonggon Yoo
2022-02-22 23:48 ` David Rientjes
2022-02-23 3:37 ` Hyeonggon Yoo
2022-02-24 12:52 ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 5/5] mm/slub: Refactor deactivate_slab() Hyeonggon Yoo
2022-02-24 18:16 ` Vlastimil Babka
2022-02-25 9:34 ` Hyeonggon Yoo
2022-02-25 9:50 ` Hyeonggon Yoo
2022-02-25 10:07 ` Vlastimil Babka
2022-02-25 10:26 ` Hyeonggon Yoo
2022-02-22 7:53 [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size kernel test robot