Date: Sun, 31 Jul 2022 17:44:14 +0900
From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
To: Vlastimil Babka
Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, linux-mm@kvack.org, Feng Tang
Subject: Re: [RFC PATCH] mm, slub: restrict sysfs validation to debug caches and make it safe
Message-ID:
References: <20220727182909.11231-1-vbabka@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20220727182909.11231-1-vbabka@suse.cz>

On Wed, Jul 27, 2022 at 08:29:09PM +0200, Vlastimil Babka wrote:
> Rongwei Wang reports [1] that cache validation triggered by writing to
> /sys/kernel/slab/<cache>/validate is racy against normal cache
> operations (e.g. freeing) in a way that can cause false positive
> inconsistency reports for caches with debugging enabled. The problem is
> that debugging actions that mark object free or active and actual
> freelist operations are not atomic, and the validation can see an
> inconsistent state.
>
> For caches that do or don't have debugging enabled, additional races
> regarding n->nr_slabs are possible that result in false reports of wrong
> slab counts.
>
> This patch attempts to solve these issues while not adding overhead to
> normal (especially fastpath) operations for caches that do not have
> debugging enabled, just to make possible userspace-triggered validation
> safe. Instead, disable the validation for caches that don't have
> debugging enabled and make the sysfs handler return -EINVAL.
>
> For caches that do have debugging enabled, we can instead extend the
> existing approach of not using percpu freelists to force all operations
> to the slow paths where debugging is checked for and processed.
>
> The processing on free in free_debug_processing() already happens under
> n->list_lock and slab_lock() so we can extend it to actually do the
> freeing as well and thus make it atomic against concurrent validation.
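
Just to make the locking argument above concrete (and to check I read
it right): below is my own simplified sketch of the debug-cache free
path after this patch, not the actual patch code, single-object case,
with the debug checks collapsed into free_consistency_checks():

	spin_lock_irqsave(&n->list_lock, flags);
	slab_lock(slab, &flags2);

	/* debug checks and the actual freeing in one critical section */
	if (free_consistency_checks(s, slab, object, addr)) {
		set_freepointer(s, object, slab->freelist);
		slab->freelist = object;
		slab->inuse--;
		/* move the slab between n->partial / full lists as needed */
	}

	slab_unlock(slab, &flags2);
	spin_unlock_irqrestore(&n->list_lock, flags);

If that is the shape, then validation, which takes the same
n->list_lock and slab_lock(), indeed cannot observe a half-done free
anymore.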
>
> The processing on alloc in alloc_debug_processing() currently doesn't
> take any locks, but we have to first allocate the object from a slab on
> the partial list (as percpu slabs are always non-existent) and thus take
> the n->list_lock anyway. Add a function alloc_single_from_partial() that
> additionally takes slab_lock() for the debug processing and then grabs
> just the allocated object instead of the whole freelist. This again
> makes it atomic against validation and it is also ultimately more
> efficient than the current grabbing of freelist immediately followed by
> slab deactivation.
>
> To prevent races on n->nr_slabs, make sure that for caches with
> debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
> n->list_lock. When allocating a new slab for a debug cache, handle the
> allocation by a new function alloc_single_from_new_slab() instead of the
> current forced deactivation path.
>
> Neither of these changes affect the fast paths.
>
> The function free_debug_processing() is moved so that it is placed
> later than the definitions of add_partial(), remove_partial() and
> discard_slab(), to avoid a need for forward declarations.
>
> [1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/
>
> Reported-by: Rongwei Wang
> Signed-off-by: Vlastimil Babka
> ---
> Hi, this extends the pre-RFC from [1] to cover also racy n->nr_slab updates
> and hopefully thus addresses everything that Rongwei's series did, and
> testing will show that?
> Thanks, Vlastimil

I don't care whose patch to ACK. Maybe Rongwei will post his own patch?

Anyway, this patch overall looks good. Also, all issues (as far as I
know) related to the validate attribute are gone after this patch.

Silly question: Do we want to apply this on stable trees?
I doubt someone would use the validate attribute when not debugging.

> [1] https://lore.kernel.org/all/69462916-2d1c-dd50-2e64-b31c2b61690e@suse.cz/
>
>  mm/slub.c | 322 +++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 231 insertions(+), 91 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index b1281b8654bd..01e5228809d7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
>  }

[...]

> +/*
> + * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
> + * slab from the n->partial list. Removes only a single object from the slab
> + * under slab_lock(), does the alloc_debug_processing() checks and leaves the
> + * slab on the list, or moves it to full list if it was the last object.
> + */
> +static void *alloc_single_from_partial(struct kmem_cache *s,
> +		struct kmem_cache_node *n, struct slab *slab)
> +{
> +	void *object;
> +	unsigned long flags;
> +
> +	lockdep_assert_held(&n->list_lock);
> +
> +	slab_lock(slab, &flags);
> +
> +	object = slab->freelist;
> +	slab->freelist = get_freepointer(s, object);
> +	slab->inuse++;
> +
> +	if (!alloc_debug_processing(s, slab, object)) {
> +		remove_partial(n, slab);
> +		slab_unlock(slab, &flags);
> +		return NULL;
> +	}
> +
> +	if (slab->inuse == slab->objects) {
> +		remove_partial(n, slab);
> +		add_full(s, n, slab);
> +	}
> +
> +	slab_unlock(slab, &flags);

AFAIK add_full()/remove_full()/add_partial()/remove_partial() can be
called outside slab_lock but inside list_lock.

> +	return object;
> +}
> +
> +/*
> + * Called only for kmem_cache_debug() caches to allocate from a freshly
> + * allocated slab. Allocates a single object instead of whole freelist
> + * and puts the slab to the partial (or full) list.
> + */
> +static void *alloc_single_from_new_slab(struct kmem_cache *s,
> +		struct slab *slab)
> +{
> +	int nid = slab_nid(slab);
> +	struct kmem_cache_node *n = get_node(s, nid);
> +	unsigned long flags, flags2;
> +	void *object;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	slab_lock(slab, &flags2);
> +
> +	object = slab->freelist;
> +	slab->freelist = get_freepointer(s, object);
> +	/* Undo what allocate_slab() did */
> +	slab->frozen = 0;
> +	slab->inuse = 1;

Maybe do it in allocate_slab()?

> +	if (!alloc_debug_processing(s, slab, object)) {
> +		/*
> +		 * It's not really expected that this would fail on a
> +		 * freshly allocated slab, but a concurrent memory
> +		 * corruption in theory could cause that.
> +		 */
> +		slab_unlock(slab, &flags2);
> +		spin_unlock_irqrestore(&n->list_lock, flags);
> +		return NULL;
> +	}
> +
> +	if (slab->inuse == slab->objects)
> +		add_full(s, n, slab);
> +	else
> +		add_partial(n, slab, DEACTIVATE_TO_HEAD);
> +
> +	slab_unlock(slab, &flags2);
> +	inc_slabs_node(s, nid, slab->objects);
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +	return object;
> +}

[...]

>  #endif /* CONFIG_SLUB_DEBUG */
>
>  #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
> @@ -3036,6 +3165,20 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  		return NULL;
>  	}
>
> +	stat(s, ALLOC_SLAB);
> +
> +	if (kmem_cache_debug(s)) {
> +		freelist = alloc_single_from_new_slab(s, slab);
> +
> +		if (unlikely(!freelist))
> +			goto new_objects;
> +
> +		if (s->flags & SLAB_STORE_USER)
> +			set_track(s, freelist, TRACK_ALLOC, addr);
> +
> +		return freelist;
> +	}
> +
>  	/*
>  	 * No other reference to the slab yet so we can
>  	 * muck around with it freely without cmpxchg
> @@ -3043,29 +3186,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	freelist = slab->freelist;
>  	slab->freelist = NULL;
>
> -	stat(s, ALLOC_SLAB);
> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
>
>  check_new_slab:
>
>  	if (kmem_cache_debug(s)) {
> -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
> -			/* Slab failed checks. Next slab needed */
> -			goto new_slab;
> -		} else {
> -			/*
> -			 * For debug case, we don't load freelist so that all
> -			 * allocations go through alloc_debug_processing()
> -			 */
> -			goto return_single;
> -		}
> +		/*
> +		 * For debug caches here we had to go through
> +		 * alloc_single_from_partial() so just store the tracking info
> +		 * and return the object
> +		 */
> +		if (s->flags & SLAB_STORE_USER)
> +			set_track(s, freelist, TRACK_ALLOC, addr);
> +		return freelist;
>  	}
>
> -	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
> +	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
>  		/*
>  		 * For !pfmemalloc_match() case we don't load freelist so that
>  		 * we don't make further mismatched allocations easier.
>  		 */
> -		goto return_single;
> +		deactivate_slab(s, slab, get_freepointer(s, freelist));
> +		return freelist;
> +	}
>
>  retry_load_slab:
>
> @@ -3089,11 +3232,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	c->slab = slab;
>
>  	goto load_freelist;
> -
> -return_single:
> -
> -	deactivate_slab(s, slab, get_freepointer(s, freelist));
> -	return freelist;
>  }
>
>  /*
> @@ -3341,9 +3479,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	if (kfence_free(head))
>  		return;
>
> -	if (kmem_cache_debug(s) &&
> -	    !free_debug_processing(s, slab, head, tail, cnt, addr))
> +	if (kmem_cache_debug(s)) {
> +		free_debug_processing(s, slab, head, tail, cnt, addr);
>  		return;
> +	}

Oh, so now debugging caches do not share the free path with
non-debugging caches. Can free_debug_processing()'s return type be
void now?

>
>  	do {
>  		if (unlikely(n)) {
> @@ -3958,6 +4097,7 @@ static void early_kmem_cache_node_alloc(int node)
>  	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
>
>  	BUG_ON(!slab);
> +	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
>  	if (slab_nid(slab) != node) {
>  		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
>  		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
> @@ -5625,7 +5765,7 @@ static ssize_t validate_store(struct kmem_cache *s,
>  {
>  	int ret = -EINVAL;
>
> -	if (buf[0] == '1') {
> +	if (buf[0] == '1' && kmem_cache_debug(s)) {
>  		ret = validate_slab_cache(s);
>  		if (ret >= 0)
>  			ret = length;

Yeah, definitely this is what it should be, instead of serializing
inc_slabs_node()/dec_slabs_node() for non-debugging caches.

> --
> 2.37.1
>

--
Thanks,
Hyeonggon