From: Vlastimil Babka <vbabka@suse.cz>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Christoph Lameter, David Rientjes, Pekka Enberg, Joonsoo Kim
Cc: Mike Galbraith, Sebastian Andrzej Siewior, Thomas Gleixner, Mel Gorman, Jesper Dangaard Brouer, Jann Horn, Vlastimil Babka
Subject: [PATCH v3 00/35] SLUB: reduce irq disabled scope and make it RT compatible
Date: Thu, 29 Jul 2021 15:20:57 +0200
Message-Id: <20210729132132.19691-1-vbabka@suse.cz>

Changes since v2 [5]:
* Rebase to 5.14-rc3.
* A number of fixes to the RT parts, big thanks to Mike Galbraith for
  testing and debugging!
  * The largest fix is to protect kmem_cache_cpu->partial by local_lock
    instead of cmpxchg tricks, which are insufficient on RT. To avoid
    divergence between RT and !RT, just do it everywhere. Affected mainly
    patch 25 and a new patch 33. This also addresses a theoretical race
    raised earlier by Jann Horn.
  * Smaller fixes reported by Sebastian Andrzej Siewior and Cyrill Gorcunov.

Changes since RFC v1 [1]:
* Addressed feedback from Christoph and Mel, added their acks.
* Finished the RT conversion, adopting 2 patches from the RT tree.
* The local_lock conversion has to sacrifice lockless fast paths on
  PREEMPT_RT.
* Added some more cleanup patches to the front.

This series was initially inspired by Mel's pcplist local_lock rewrite,
and also by an interest in better understanding SLUB's locking and the
new primitives, their RT variants and implications. It should make SLUB
more preemption-friendly, especially for RT, hopefully without noticeable
regressions, as the fast paths are not affected.

The series is based on 5.14-rc3 and also available as a git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v3r1

The series should now be sufficiently tested in both RT and !RT configs,
mainly thanks to Mike. The RFC/v1 version also got basic performance
screening by Mel that didn't show major regressions.
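Since the series leans heavily on the local_lock primitive, here is a
minimal usage sketch for readers unfamiliar with it (illustrative only,
not code from the series; the pcp_data structure and pcp_inc() are made
up for the example). On !PREEMPT_RT, local_lock_irqsave() boils down to
local_irq_save(); on PREEMPT_RT it acquires a per-CPU spinlock instead,
so irqs stay enabled and the critical section stays preemptible:

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  /* Hypothetical per-cpu data guarded by a local_lock. */
  struct pcp_data {
          local_lock_t lock;
          unsigned long count;
  };

  static DEFINE_PER_CPU(struct pcp_data, pcp_data) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void pcp_inc(void)
  {
          unsigned long flags;

          /* !RT: irqs off; RT: per-cpu spinlock, still preemptible. */
          local_lock_irqsave(&pcp_data.lock, flags);
          this_cpu_inc(pcp_data.count);
          local_unlock_irqrestore(&pcp_data.lock, flags);
  }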
Mike's testing with hackbench of v2 on !RT reported negligible
differences [6]:

virgin(ish) tip 5.13.0.g60ab3ed-tip
          7,320.67 msec task-clock        #  7.792 CPUs utilized    ( +- 0.31% )
           221,215      context-switches  #  0.030 M/sec            ( +- 3.97% )
            16,234      cpu-migrations    #  0.002 M/sec            ( +- 4.07% )
            13,233      page-faults       #  0.002 M/sec            ( +- 0.91% )
    27,592,205,252      cycles            #  3.769 GHz              ( +- 0.32% )
     8,309,495,040      instructions      #  0.30  insn per cycle   ( +- 0.37% )
     1,555,210,607      branches          #  212.441 M/sec          ( +- 0.42% )
         5,484,209      branch-misses     #  0.35% of all branches  ( +- 2.13% )

           0.93949 +- 0.00423 seconds time elapsed  ( +- 0.45% )
           0.94608 +- 0.00384 seconds time elapsed  ( +- 0.41% )  (repeat)
           0.94422 +- 0.00410 seconds time elapsed  ( +- 0.43% )

5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
          7,343.57 msec task-clock        #  7.776 CPUs utilized    ( +- 0.44% )
           223,044      context-switches  #  0.030 M/sec            ( +- 3.02% )
            16,057      cpu-migrations    #  0.002 M/sec            ( +- 4.03% )
            13,164      page-faults       #  0.002 M/sec            ( +- 0.97% )
    27,684,906,017      cycles            #  3.770 GHz              ( +- 0.45% )
     8,323,273,871      instructions      #  0.30  insn per cycle   ( +- 0.28% )
     1,556,106,680      branches          #  211.901 M/sec          ( +- 0.31% )
         5,463,468      branch-misses     #  0.35% of all branches  ( +- 1.33% )

           0.94440 +- 0.00352 seconds time elapsed  ( +- 0.37% )
           0.94830 +- 0.00228 seconds time elapsed  ( +- 0.24% )  (repeat)
           0.93813 +- 0.00440 seconds time elapsed  ( +- 0.47% )  (repeat)

RT configs showed some throughput regressions, but that's an expected
tradeoff for the preemption improvements through the RT mutex. It didn't
prevent v2 from being incorporated into the 5.13 RT tree [7], leading to
testing exposure and bugfixes.

Before the series, SLUB is lockless in both allocation and free fast
paths, but elsewhere it disables irqs for considerable periods of time -
especially in the allocation slowpath and bulk allocation, where IRQs are
re-enabled only when a new page from the page allocator is needed and the
context allows blocking. The irq disabled sections can then include
deactivate_slab(), which walks a full freelist and frees the slab back to
the page allocator, or unfreeze_partials(), which goes through a list of
percpu partial slabs. The RT tree currently has some patches mitigating
these, but we can do much better in mainline too.

Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequisites for it.

Patches 7-10 are also preparatory code changes without functional effect,
but not so useful without the rest of the series.

Patch 11 simplifies the fast paths on systems with preemption, based on
the (hopefully correct) observation that the current loops to verify tid
are unnecessary.

Patches 12-21 focus on reducing the irq disabled scope in the allocation
slowpath. Patch 12 moves the disabling of irqs into ___slab_alloc() from
its callers, which are the allocation slowpath and bulk allocation.
Instead, these callers only disable preemption to stabilize the cpu (a
condensed sketch of this pattern follows below). The subsequent patches
then gradually reduce the scope of disabled irqs in ___slab_alloc() and
the functions called from there. As of patch 15, the re-enabling of irqs
based on gfp flags before calling the page allocator is removed from
allocate_slab(). As of patch 18, it's possible to reach the page
allocator (in case existing slabs are depleted) without disabling and
re-enabling irqs a single time.

Patches 22-27 reduce the scope of disabled irqs in functions related to
unfreezing percpu partial slabs. Patch 28 is preparatory.
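As promised above, a condensed sketch of the patch 12 pattern (not the
actual diff; error paths and config details are omitted or simplified):

  /*
   * The caller of the slow path no longer disables irqs around the
   * whole of it; it only disables preemption, which is enough to keep
   * the kmem_cache_cpu pointer stable and is cheaper. ___slab_alloc()
   * then disables irqs internally, only around the sections that
   * really need it.
   */
  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                            unsigned long addr, struct kmem_cache_cpu *c)
  {
          void *p;

          /*
           * We may have been preempted and rescheduled on a different
           * cpu before disabling preemption, so reload the cpu area.
           * get_cpu_ptr() implies preempt_disable().
           */
          c = get_cpu_ptr(s->cpu_slab);
          p = ___slab_alloc(s, gfpflags, node, addr, c);
          put_cpu_ptr(s->cpu_slab);
          return p;
  }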
Patch 29 is adopted from the RT tree and converts the flushing of percpu
slabs on all cpus from using IPIs to a workqueue, so that the processing
isn't happening with irqs disabled in the IPI handler. The flushing is
not performance critical, so it should be acceptable.

Patch 30 also comes from the RT tree and makes object_map_lock RT
compatible.

Patches 31-32 make slab_lock irq-safe on RT, where we cannot rely on
having irqs disabled from the list_lock spin lock usage.

Patch 33 changes kmem_cache_cpu->partial handling in put_cpu_partial()
from a cmpxchg loop to a short irq disabled section, which is what all
other code modifying the field uses. This addresses a theoretical race
scenario pointed out by Jann, and makes the critical section safe wrt RT
local_lock semantics after the conversion in patch 35.

Patch 34 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT. Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations
(sketched below). As of this patch, SLUB should be compatible with RT's
lock semantics, to the best of my knowledge.

Finally, patch 35 changes the irq disabled sections that protect
kmem_cache_cpu fields in the slow paths to a local lock (also sketched
below). However, on PREEMPT_RT this means the lockless fast paths can now
preempt slow paths which don't expect that, so the local lock has to be
taken also in the fast paths and they are no longer lockless. It's up to
RT folks to decide if this is a good tradeoff. The patch also updates the
locking documentation in the file's comment.

The main results of this series:

* irq disabling is only done for the minimum amount of time needed to
  protect the kmem_cache_cpu data, and as part of spin lock, local lock
  and bit lock operations to make them irq-safe

* SLUB should be fully PREEMPT_RT compatible

This should have obvious implications for better preemptibility,
especially on RT.

Some details are different from how the previous SLUB RT tree patches
were implemented:

  mm: sl[au]b: Change list_lock to raw_spinlock_t [2] - the SLAB part can
  be dropped as a different patch restricts RT to SLUB anyway. And after
  this series the list_lock in SLUB is never used with irqs already
  disabled before taking the lock, so it doesn't have to be converted to
  raw_spinlock_t.

  mm: slub: Move discard_slab() invocations out of IRQ-off sections [3]
  should be unnecessary, as this series moves these invocations outside
  irq disabled sections in a different way.

The remaining patches to upstream from the RT tree are small ones related
to KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or
SLOB) makes sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with
PREEMPT_RT could perhaps be re-evaluated, as the series addresses some
latency issues with it.
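The private wrappers mentioned for patch 34 could look roughly like this
(a sketch; the slub_get_cpu_ptr()/slub_put_cpu_ptr() names are picked for
the example - see the actual patch for the real definitions):

  #ifndef CONFIG_PREEMPT_RT
  /* !RT: keep the cheap preempt_disable() based get_cpu_ptr(). */
  #define slub_get_cpu_ptr(var)   get_cpu_ptr(var)
  #define slub_put_cpu_ptr(var)   put_cpu_ptr(var)
  #else
  /*
   * RT: migrate_disable() pins the task to this cpu but leaves it
   * preemptible, so the nested list_lock spinlock can be taken safely.
   */
  #define slub_get_cpu_ptr(var)           \
  ({                                      \
          migrate_disable();              \
          this_cpu_ptr(var);              \
  })
  #define slub_put_cpu_ptr(var)           \
  do {                                    \
          (void)(var);                    \
          migrate_enable();               \
  } while (0)
  #endif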
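Similarly, the shape of the patch 35 conversion, again only as a sketch
under the assumption that the local lock is embedded in struct
kmem_cache_cpu (field layout abbreviated):

  struct kmem_cache_cpu {
          void **freelist;    /* Pointer to next available object */
          unsigned long tid;  /* Globally unique transaction id */
          struct page *page;  /* The slab we are allocating from */
          local_lock_t lock;  /* Protects the fields above */
          /* ... */
  };

  /* Slow paths then replace local_irq_save()/local_irq_restore() with: */
  local_lock_irqsave(&s->cpu_slab->lock, flags);
  /* ... manipulate c->page / c->freelist ... */
  local_unlock_irqrestore(&s->cpu_slab->lock, flags);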
[1] [RFC 00/26] SLUB: use local_lock for kmem_cache_cpu protection and
    reduce disabling irqs
    https://lore.kernel.org/lkml/20210524233946.20352-1-vbabka@suse.cz/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0001-mm-sl-au-b-Change-list_lock-to-raw_spinlock_t.patch?h=linux-5.12.y-rt-patches
[3] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0004-mm-slub-Move-discard_slab-invocations-out-of-IRQ-off.patch?h=linux-5.12.y-rt-patches
[4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0005-mm-slub-Move-flush_cpu_slab-invocations-__free_slab-.patch?h=linux-5.12.y-rt-patches
[5] https://lore.kernel.org/lkml/20210609113903.1421-1-vbabka@suse.cz/
[6] https://lore.kernel.org/lkml/891dc24e38106f8542f4c72831d52dc1a1863ae8.camel@gmx.de
[7] https://lore.kernel.org/linux-rt-users/87tul5p2fa.ffs@nanos.tec.linutronix.de/

Sebastian Andrzej Siewior (2):
  mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations
    out of IRQ context
  mm: slub: Make object_map_lock a raw_spinlock_t

Vlastimil Babka (33):
  mm, slub: don't call flush_all() from slab_debug_trace_open()
  mm, slub: allocate private object map for debugfs listings
  mm, slub: allocate private object map for validate_slab_cache()
  mm, slub: don't disable irq for debug_check_no_locks_freed()
  mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
  mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab()
  mm, slub: extract get_partial() from new_slab_objects()
  mm, slub: dissolve new_slab_objects() into ___slab_alloc()
  mm, slub: return slab page from get_partial() and set c->page afterwards
  mm, slub: restructure new page checks in ___slab_alloc()
  mm, slub: simplify kmem_cache_cpu and tid setup
  mm, slub: move disabling/enabling irqs to ___slab_alloc()
  mm, slub: do initial checks in ___slab_alloc() with irqs enabled
  mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()
  mm, slub: restore irqs around calling new_slab()
  mm, slub: validate slab from partial list or page allocator before
    making it cpu slab
  mm, slub: check new pages with restored irqs
  mm, slub: stop disabling irqs around get_partial()
  mm, slub: move reset of c->page and freelist out of deactivate_slab()
  mm, slub: make locking in deactivate_slab() irq-safe
  mm, slub: call deactivate_slab() without disabling irqs
  mm, slub: move irq control into unfreeze_partials()
  mm, slub: discard slabs in unfreeze_partials() without irqs disabled
  mm, slub: detach whole partial list at once in unfreeze_partials()
  mm, slub: separate detaching of partial list in unfreeze_partials()
    from unfreezing
  mm, slub: only disable irq with spin_lock in __unfreeze_partials()
  mm, slub: don't disable irqs in slub_cpu_dead()
  mm, slab: make flush_slab() possible to call with irqs enabled
  mm, slub: optionally save/restore irqs in slab_[un]lock()
  mm, slub: make slab_lock() disable irqs with PREEMPT_RT
  mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  mm, slub: use migrate_disable() on PREEMPT_RT
  mm, slub: convert kmem_cpu_slab protection to local_lock

 include/linux/slub_def.h |   2 +
 mm/slub.c                | 808 +++++++++++++++++++++++++--------------
 2 files changed, 530 insertions(+), 280 deletions(-)

-- 
2.32.0