Message-ID: <23a79337-a5eb-a959-a764-1296a3e8e7c1@linux.alibaba.com>
Date: Fri, 15 Jul 2022 18:51:54 +0800
From: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Subject: Re: [PATCH 1/3] mm/slub: fix the race between validate_slab and slab_free
To: Vlastimil Babka, Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: David Rientjes, songmuchun@bytedance.com, akpm@linux-foundation.org,
    roman.gushchin@linux.dev, iamjoonsoo.kim@lge.com, penberg@kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20220529081535.69275-1-rongwei.wang@linux.alibaba.com>
 <9794df4f-3ffe-4e99-0810-a1346b139ce8@linux.alibaba.com>
 <29723aaa-5e28-51d3-7f87-9edf0f7b9c33@linux.alibaba.com>

On 7/15/22 6:33 PM, Vlastimil Babka wrote:
> On 7/15/22 10:05, Rongwei Wang wrote:
>>
>> On 6/17/22 5:40 PM, Vlastimil Babka wrote:
>>> On 6/8/22 14:23, Christoph Lameter wrote:
>>>> On Wed, 8 Jun 2022, Rongwei Wang wrote:
>>>>
>>>>> If possible, I think documenting the issue and warning about this
>>>>> incorrect behavior is OK. But it still prints a large amount of
>>>>> confusing messages, and that disturbs us?
>>>>
>>>> Correct. It would be great if you could fix this in a way that does not
>>>> impact performance.
>>>>
>>>>>> are current operations on the slab being validated.
>>>>>
>>>>> And I am trying to fix it in the following way. In short, these changes
>>>>> only work under slub debug mode and do not affect the normal mode (I'm
>>>>> not sure). It does not look elegant enough. If everyone approves of this
>>>>> way, I can submit the next version.
>>>>>
>>>>> Anyway, thanks for your time :).
>>>>> -wrw
>>>>>
>>>>> @@ -3304,7 +3300,7 @@ static void __slab_free(struct kmem_cache *s,
>>>>> struct slab *slab,
>>>>>  {
>>>>>         void *prior;
>>>>> -       int was_frozen;
>>>>> +       int was_frozen, to_take_off = 0;
>>>>>         struct slab new;
>>>>
>>>> to_take_off has the role of !n? Why is that needed?
>>>>
>>>>> -       do {
>>>>> -               if (unlikely(n)) {
>>>>> +               spin_lock_irqsave(&n->list_lock, flags);
>>>>> +               ret = free_debug_processing(s, slab, head, tail, cnt,
>>>>> +                                           addr);
>>>>
>>>> Ok, so the idea is to take the lock only if kmem_cache_debug. That looks
>>>> ok. But it still adds a number of new branches etc. to the free loop.
>>>
>> Hi Vlastimil, sorry for missing your message for so long.
>
> Hi, no problem.
>
>>> It also further complicates the already tricky code. I wonder if we should
>>> take more advantage of the fact that for kmem_cache_debug() caches we don't
>>> leave any slabs on percpu or percpu partial lists, and also that in
>>> free_debug_processing() we already take both list_lock and slab_lock. If we
>>> just did the freeing immediately there under those locks, we would be
>>> protected against other freeing cpus by that list_lock and wouldn't need
>>> the double cmpxchg tricks.
>>
>> Hmm, I'm not sure I completely understand what you mean by "don't need the
>> double cmpxchg tricks". Do you mean replacing cmpxchg_double_slab() here
>> with the following code when kmem_cache_debug(s)?
>>
>> __slab_lock(slab);
>> if (slab->freelist == freelist_old && slab->counters == counters_old) {
>>     slab->freelist = freelist_new;
>>     slab->counters = counters_new;
>>     __slab_unlock(slab);
>>     local_irq_restore(flags);
>>     return true;
>> }
>> __slab_unlock(slab);
>
> Pretty much, but it's more complicated.

Yes. Actually, I think reusing cmpxchg_double_slab() here keeps the code more
concise. I have already finished this part of the code, but I am hesitating
whether to replace cmpxchg_double_slab().

>
>> If I have misunderstood your words, please let me know.
>> Thanks!
>>>
>>> What about against allocating cpus? More tricky, as those will currently
>>> end up privatizing the freelist via get_partial(), only to deactivate it
>>> again, so our list_lock+slab_lock in the freeing path would not protect in
>>> the meanwhile. But the allocation is currently very inefficient for debug
>>> caches, as in get_partial() it will take the list_lock to take the slab
>>> from the partial list and then in most cases again in deactivate_slab() to
>>> return it.
>>
>> It seems I need to spend some time digesting these words. Anyway, thanks.
>>>
>>> If instead the allocation path for a kmem_cache_debug() cache would take a
>>> single object from the partial list (not the whole freelist) under
>>> list_lock, it would be ultimately more efficient, and protect against
>>> freeing using list_lock. Sounds like an idea worth trying to me?
>>
>> Hyeonggon had similar advice: split the freeing and allocation paths for
>> debug caches, like below:
>>
>> __slab_alloc() {
>>     if (kmem_cache_debug(s))
>>         slab_alloc_debug()
>>     else
>>         ___slab_alloc()
>> }
>>
>> I guess the above code aims to solve the problem (idea) you mentioned?
>>
>> slab_free() {
>>     if (kmem_cache_debug(s))
>>         slab_free_debug()
>>     else
>>         __do_slab_free()
>> }
>>
>> Currently, I only modify the slab freeing code to fix the confusing
>> messages from "slabinfo -v". If you agree, I can try to implement the
>> slab_alloc_debug() code mentioned above. It may also be a challenge for me.
>
> I already started working on this approach and hope to post an RFC soon.

OK, that's great.

>
>> Thanks for your time.
>>
>>> And of course we would stop creating the 'validate' sysfs files for
>>> non-debug caches.
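
By the way, to make the slab_free_debug() direction a bit more concrete, here
is a rough, untested sketch of what I have in mind (not a real patch). The
helper name slab_free_debug() just follows the split above, and it assumes a
free_debug_processing() variant that is called with n->list_lock already held
by the caller (today it takes that lock itself), so the details will surely
differ from your RFC:

/*
 * Rough sketch only (untested): free object(s) of a kmem_cache_debug()
 * cache directly under n->list_lock, so validate_slab() and the free
 * path serialize on the same lock and the cmpxchg_double fast path is
 * bypassed entirely for debug caches.
 *
 * Assumes free_debug_processing() is reworked to be called with
 * n->list_lock already held (currently it takes that lock itself).
 */
static void slab_free_debug(struct kmem_cache *s, struct slab *slab,
                            void *head, void *tail, int cnt,
                            unsigned long addr)
{
        struct kmem_cache_node *n = get_node(s, slab_nid(slab));
        unsigned long flags;

        spin_lock_irqsave(&n->list_lock, flags);

        /* Consistency checks; slab_lock() is taken internally. */
        if (!free_debug_processing(s, slab, head, tail, cnt, addr))
                goto out;

        /* Put the freed object(s) back on the slab's freelist. */
        set_freepointer(s, tail, slab->freelist);
        slab->freelist = head;
        slab->inuse -= cnt;

        /*
         * Move the slab between full/partial lists, or discard it when
         * it becomes empty (omitted here for brevity).
         */
out:
        spin_unlock_irqrestore(&n->list_lock, flags);
}

The allocation side (slab_alloc_debug()) would then take a single object from
the partial list under the same list_lock, as you described.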