Date: Thu, 16 Feb 2023 21:24:36 -0800
From: Boqun Feng
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, linux-kernel@vger.kernel.org, Hillf Danton
Subject: Re: [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff
References: <20230216210933.1169097-1-longman@redhat.com>
 <20230216210933.1169097-3-longman@redhat.com>
In-Reply-To: <20230216210933.1169097-3-longman@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Feb 16, 2023 at 04:09:32PM -0500, Waiman Long wrote:
> The lock handoff provided in rwsem isn't a true handoff like that in
> the mutex. Instead, it is more like a quiescent state where optimistic
> spinning and lock stealing are disabled to make it easier for the first
> waiter to acquire the lock.
>
> For readers, setting the HANDOFF bit will disable writers from stealing
> the lock. The actual handoff is done at rwsem_wake() time after taking
> the wait_lock. There isn't much we need to improve here other than
> setting the RWSEM_NONSPINNABLE bit in owner.
>
> For writers, setting the HANDOFF bit does not guarantee that it can
> acquire the rwsem successfully in a subsequent rwsem_try_write_lock()
> after setting the bit there. A reader can come in and add a
> RWSEM_READER_BIAS temporarily which can spoil the takeover of the rwsem
> in rwsem_try_write_lock(), leading to additional delay.
>
> For mutex, lock handoff is done at unlock time as the owner value and
> the handoff bit are in the same lock word and can be updated atomically.
>
> That is not the case for rwsem, which has a count value for locking and
> a separate owner value for storing the lock owner. In addition, the handoff
> processing differs depending on whether the first waiter is a writer or a
> reader. We can only make that waiter type determination after acquiring
> the wait lock.
> Together with the fact that the RWSEM_FLAG_HANDOFF bit
> is stable while holding the wait_lock, the most convenient place to
> do the early handoff is at rwsem_mark_wake() where wait_lock has to be
> acquired anyway.
>
> There isn't much additional cost in doing this check and early handoff
> in rwsem_mark_wake() while increasing the chance that a lock handoff
> will be successful when the HANDOFF setting writer wakes up without
> even the need to take the wait_lock at all. Note that if early handoff
> fails to happen in rwsem_mark_wake(), a late handoff can still happen
> when the awoken writer calls rwsem_try_write_lock().
>
> Kernel test robot noticed a 19.3% improvement of
> will-it-scale.per_thread_ops in an earlier version of this commit [1].
>
> [1] https://lore.kernel.org/lkml/202302122155.87699b56-oliver.sang@intel.com/
>
> Signed-off-by: Waiman Long
> ---
>  kernel/locking/lock_events_list.h |  1 +
>  kernel/locking/rwsem.c            | 71 +++++++++++++++++++++++++++----
>  2 files changed, 64 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
> index 97fb6f3f840a..fd80f5828f24 100644
> --- a/kernel/locking/lock_events_list.h
> +++ b/kernel/locking/lock_events_list.h
> @@ -67,3 +67,4 @@ LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs */
>  LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired */
>  LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions */
>  LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs */
> +LOCK_EVENT(rwsem_wlock_ehandoff) /* # of write lock early handoffs */
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index e589f69793df..fc3961ceabe8 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -412,8 +412,9 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>  			    enum rwsem_wake_type wake_type,
>  			    struct wake_q_head *wake_q)
>  {
> +	long count = atomic_long_read(&sem->count);
>  	struct rwsem_waiter *waiter, *tmp;
> -	long count, woken = 0, adjustment = 0;
> +	long woken = 0, adjustment = 0;
>  	struct list_head wlist;
>
>  	lockdep_assert_held(&sem->wait_lock);
> @@ -432,19 +433,39 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>  		 * Mark writer at the front of the queue for wakeup.
>  		 * Until the task is actually later awoken later by
>  		 * the caller, other writers are able to steal it.
> +		 *
> +		 * *Unless* HANDOFF is set, in which case only the
> +		 * first waiter is allowed to take it.
> +		 *
>  		 * Readers, on the other hand, will block as they
>  		 * will notice the queued writer.
>  		 */
>  		wake_q_add(wake_q, waiter->task);
>  		lockevent_inc(rwsem_wake_writer);
> +
> +		if ((count & RWSEM_LOCK_MASK) || !(count & RWSEM_FLAG_HANDOFF))
> +			return;
> +
> +		/*
> +		 * If the rwsem is free and handoff flag is set with wait_lock
> +		 * held, no other CPUs can take an active lock. We can do an
> +		 * early handoff.
> +		 */
> +		adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
> +		atomic_long_set(&sem->owner, (long)waiter->task);
> +		waiter->task = NULL;
> +		atomic_long_add(adjustment, &sem->count);
> +		rwsem_del_waiter(sem, waiter);
> +		lockevent_inc(rwsem_wlock_ehandoff);
>  	}
>  	return;
>
>  wake_readers:
>  	/*
> -	 * No reader wakeup if there are too many of them already.
> +	 * No reader wakeup if there are too many of them already or
> +	 * something wrong happens.
>  	 */
> -	if (unlikely(atomic_long_read(&sem->count) < 0))
> +	if (unlikely(count < 0))
>  		return;
>
>  	/*
> @@ -468,7 +489,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>  			adjustment -= RWSEM_FLAG_HANDOFF;
>  			lockevent_inc(rwsem_rlock_handoff);
>  		}
> +		/*
> +		 * With HANDOFF set for reader, we must
> +		 * terminate all spinning.
> +		 */
>  		waiter->handoff_set = true;
> +		rwsem_set_nonspinnable(sem);
>  	}
>
>  	atomic_long_add(-adjustment, &sem->count);
> @@ -610,6 +636,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>
>  	lockdep_assert_held(&sem->wait_lock);
>
> +	if (!waiter->task) {
> +		/* Write lock handed off */
> +		smp_acquire__after_ctrl_dep();

I don't think you need smp_acquire__after_ctrl_dep() here, since:

*	the other side is just a normal write "waiter->task = NULL", and

*	the "&sem->wait_lock" already provides the necessary ACQUIRE.

That said, I came to this series late, so I may be missing something subtle.

> +		return true;
> +	}
> +
>  	count = atomic_long_read(&sem->count);
>  	do {
>  		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
> @@ -755,6 +787,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>
>  	owner = rwsem_owner_flags(sem, &flags);
>  	state = rwsem_owner_state(owner, flags);
> +
> +	if (owner == current)
> +		return OWNER_NONSPINNABLE;	/* Handoff granted */
> +
>  	if (state != OWNER_WRITER)
>  		return state;
>
> @@ -1164,32 +1200,51 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>  		 * the lock, attempt to spin on owner to accelerate lock
>  		 * transfer. If the previous owner is a on-cpu writer and it
>  		 * has just released the lock, OWNER_NULL will be returned.
> -		 * In this case, we attempt to acquire the lock again
> -		 * without sleeping.
> +		 * In this case, the waker may be in the process of early
> +		 * lock handoff. Use the wait_lock to synchronize with that
> +		 * before checking for handoff.
>  		 */
>  		if (waiter.handoff_set) {
>  			enum owner_state owner_state;
>
>  			owner_state = rwsem_spin_on_owner(sem);
> -			if (owner_state == OWNER_NULL)
> -				goto trylock_again;
> +			if ((owner_state == OWNER_NULL) &&
> +			    READ_ONCE(waiter.task)) {

In theory, if there is a read outside any synchronization (say, locks), then
not only READ_ONCE() but also WRITE_ONCE() is needed, even for the write done
inside the lock, otherwise KCSAN will yell (with
KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n). This is because a plain write may be torn.

Regards,
Boqun

> +				raw_spin_lock_irq(&sem->wait_lock);
> +				raw_spin_unlock_irq(&sem->wait_lock);
> +			}
> +			if (!READ_ONCE(waiter.task))
> +				goto handed_off;
>  		}
>
>  		schedule_preempt_disabled();
>  		lockevent_inc(rwsem_sleep_writer);
> +		if (!READ_ONCE(waiter.task))
> +			goto handed_off;
> +
>  		set_current_state(state);
> -trylock_again:
>  		raw_spin_lock_irq(&sem->wait_lock);
>  	}
>  	__set_current_state(TASK_RUNNING);
>  	raw_spin_unlock_irq(&sem->wait_lock);
> +out:
>  	lockevent_inc(rwsem_wlock);
>  	trace_contention_end(sem, 0);
>  	return sem;
>
> +handed_off:
> +	/* Write lock handed off */
> +	set_current_state(TASK_RUNNING);	/* smp_mb() */
> +	goto out;
> +
>  out_nolock:
>  	__set_current_state(TASK_RUNNING);
>  	raw_spin_lock_irq(&sem->wait_lock);
> +	if (!waiter.task) {
> +		smp_acquire__after_ctrl_dep();
> +		raw_spin_unlock_irq(&sem->wait_lock);
> +		goto out;
> +	}
>  	rwsem_del_wake_waiter(sem, &waiter, &wake_q);
>  	lockevent_inc(rwsem_wlock_fail);
>  	trace_contention_end(sem, -EINTR);
> -- 
> 2.31.1
>
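
To illustrate the READ_ONCE()/WRITE_ONCE() pairing mentioned above, a minimal
sketch (the helpers grant_handoff() and poll_handoff() are hypothetical, not
code from this patch; only the marked-access pairing is the point):

	/* Waker side, hypothetical helper, called with sem->wait_lock held. */
	static void grant_handoff(struct rwsem_waiter *waiter)
	{
		/* Pairs with the lockless READ_ONCE() in poll_handoff(). */
		WRITE_ONCE(waiter->task, NULL);
	}

	/* Sleeper side, hypothetical helper, polled without holding sem->wait_lock. */
	static bool poll_handoff(struct rwsem_waiter *waiter)
	{
		return !READ_ONCE(waiter->task);
	}

The WRITE_ONCE() here corresponds to the plain "waiter->task = NULL" store
added in rwsem_mark_wake(): since the awoken writer reads waiter.task without
holding wait_lock, marking the store avoids store tearing and the KCSAN
report described above.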