From: Rongwei Wang <rongwei.wang@linux.alibaba.com>
To: Vlastimil Babka <vbabka@suse.cz>, Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: David Rientjes <rientjes@google.com>,
	songmuchun@bytedance.com, Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	akpm@linux-foundation.org, roman.gushchin@linux.dev,
	iamjoonsoo.kim@lge.com, penberg@kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/3] mm/slub: fix the race between validate_slab and slab_free
Date: Fri, 15 Jul 2022 16:05:29 +0800
Message-ID: <aceab1c8-0c10-fa5f-da39-6820294494c4@linux.alibaba.com>
In-Reply-To: <faf416b9-f46c-8534-7fb7-557c046a564d@suse.cz>



On 6/17/22 5:40 PM, Vlastimil Babka wrote:
> On 6/8/22 14:23, Christoph Lameter wrote:
>> On Wed, 8 Jun 2022, Rongwei Wang wrote:
>>
>>> If acceptable, I think documenting the issue and warning about this incorrect
>>> behavior is OK. But it still prints a large number of confusing messages, which disturbs us?
>>
>> Correct, it would be great if you could fix this in a way that does not
>> impact performance.
>>
>>>> are current operations on the slab being validated.
>>> And I am trying to fix it in the following way. In short, these changes only
>>> work under slub debug mode, and do not affect the normal mode (I'm not
>>> sure). It does not look elegant enough. And if everyone approves of this
>>> way, I can submit the next version.
>>
>>
>>>
>>> Anyway, thanks for your time:).
>>> -wrw
>>>
>>> @@ -3304,7 +3300,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>>>
>>>   {
>>>          void *prior;
>>> -       int was_frozen;
>>> +       int was_frozen, to_take_off = 0;
>>>          struct slab new;
>>
>> to_take_off has the role of !n ? Why is that needed?
>>
>>> -       do {
>>> -               if (unlikely(n)) {
>>> +               spin_lock_irqsave(&n->list_lock, flags);
>>> +               ret = free_debug_processing(s, slab, head, tail, cnt, addr);
>>
>> OK, so the idea is to take the lock only if kmem_cache_debug. That looks
>> OK. But it still adds a number of new branches etc. to the free loop.
> 
Hi Vlastimil, sorry for missing your message for such a long time.
> It also further complicates the already tricky code. I wonder if we should
> make more benefit from the fact that for kmem_cache_debug() caches we don't
> leave any slabs on percpu or percpu partial lists, and also in
> free_debug_processing() we already take both list_lock and slab_lock. If we
> just did the freeing immediately there under those locks, we would be
> protected against other freeing cpus by that list_lock and don't need the
> double cmpxchg tricks.
Hmm, I'm not sure I completely understand what you mean by "don't need the
double cmpxchg tricks". Do you mean replacing cmpxchg_double_slab() here
with the following code when kmem_cache_debug(s)?

__slab_lock(slab);
if (slab->freelist == freelist_old && slab->counters == counters_old) {
     /* Nothing changed concurrently: commit the new values. */
     slab->freelist = freelist_new;
     slab->counters = counters_new;
     __slab_unlock(slab);
     local_irq_restore(flags);
     return true;
}
/* Lost the race: leave the slab untouched and let the caller retry. */
__slab_unlock(slab);
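
(If so, I guess the point is that with n->list_lock held, a plain
compare-and-update like this is already serialized against other freeing
CPUs, so the lockless double-cmpxchg fast path becomes unnecessary for
debug caches?)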

If I have misunderstood your words, please let me know.
Thanks!
> 
> What about against allocating cpus? More tricky as those will currently end
> up privatizing the freelist via get_partial(), only to deactivate it again,
> so our list_lock+slab_lock in freeing path would not protect in the
> meanwhile. But the allocation is currently very inefficient for debug
> caches, as in get_partial() it will take the list_lock to take the slab from
> partial list and then in most cases again in deactivate_slab() to return it.
It seems that I need to spend some time digesting these words. Anyway, thanks.
> 
> If instead the allocation path for kmem_cache_debug() cache would take a
> single object from the partial list (not whole freelist) under list_lock, it
> would be ultimately more efficient, and protect against freeing using
> list_lock. Sounds like an idea worth trying to me?
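If I understand your idea correctly, the allocation path for debug caches
would take just one object off a partial slab while holding list_lock,
roughly like below (only a sketch of my understanding, not tested; the
helper name and details are made up by me):

static void *alloc_single_from_partial(struct kmem_cache *s,
                struct kmem_cache_node *n, struct slab *slab)
{
        void *object;

        lockdep_assert_held(&n->list_lock);

        /* Take a single object, not the whole freelist. */
        object = slab->freelist;
        slab->freelist = get_freepointer(s, object);
        slab->inuse++;

        /* Slab now fully used: move it from the partial list to full. */
        if (slab->inuse == slab->objects) {
                remove_partial(n, slab);
                add_full(s, n, slab);
        }

        return object;
}

This way a free from another CPU always contends on the same list_lock,
if I follow you.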

Hyeonggon had similar advice: split the freeing and allocating of slabs
out for debugging, like below:


__slab_alloc() {
     if (kmem_cache_debug(s))
         slab_alloc_debug()
     else
         ___slab_alloc()
}

I guess the above code aims to solve the problem (idea) you mentioned?

slab_free() {
     if (kmem_cache_debug(s))
         slab_free_debug()
     else
         __do_slab_free()
}
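
For the freeing side, slab_free_debug() would then do everything under
n->list_lock, roughly like my earlier diff (again only a sketch; it
assumes free_debug_processing() is reworked to be called with list_lock
already held):

static void slab_free_debug(struct kmem_cache *s, struct slab *slab,
                void *head, void *tail, int cnt, unsigned long addr)
{
        struct kmem_cache_node *n = get_node(s, slab_nid(slab));
        unsigned long flags;

        spin_lock_irqsave(&n->list_lock, flags);
        if (free_debug_processing(s, slab, head, tail, cnt, addr)) {
                /*
                 * Checks passed: put the objects back on the slab's
                 * freelist here, protected against other freeing CPUs
                 * by list_lock, without cmpxchg_double_slab().
                 */
        }
        spin_unlock_irqrestore(&n->list_lock, flags);
}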

Currently, I only modify the slab freeing code to fix the confusing
messages from "slabinfo -v". If you agree, I can try to implement the
slab_alloc_debug() code mentioned above. It may also be a challenge for me.

Thanks for your time.

> And of course we would stop creating the 'validate' sysfs files for
> non-debug caches.
