From: "André Almeida" <andrealmeid@igalia.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
"Paul E . McKenney" <paulmck@kernel.org>,
Boqun Feng <boqun.feng@gmail.com>,
"H . Peter Anvin" <hpa@zytor.com>, Paul Turner <pjt@google.com>,
"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
David Laight <David.Laight@ACULAB.COM>,
Christian Brauner <brauner@kernel.org>,
Florian Weimer <fw@deneb.enyo.de>,
"carlos@redhat.com" <carlos@redhat.com>,
Peter Oskolkov <posk@posk.io>,
Alexander Mikhalitsyn <alexander@mihalicyn.com>,
'Peter Zijlstra' <peterz@infradead.org>,
Chris Kennelly <ckennelly@google.com>,
Ingo Molnar <mingo@redhat.com>,
Darren Hart <dvhart@infradead.org>,
Davidlohr Bueso <dave@stgolabs.net>,
"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>,
Steven Rostedt <rostedt@goodmis.org>,
Jonathan Corbet <corbet@lwn.net>,
Noah Goldstein <goldstein.w.n@gmail.com>,
Daniel Colascione <dancol@google.com>,
"longman@redhat.com" <longman@redhat.com>,
Florian Weimer <fweimer@redhat.com>
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
Date: Thu, 28 Sep 2023 17:05:59 +0200
Message-ID: <ab59863f-25f0-4635-8408-4aaec39ec6c2@igalia.com>
In-Reply-To: <34ddb730-8893-19a8-00fe-84c4e281eef1@efficios.com>
On 9/28/23 15:20, Mathieu Desnoyers wrote:
> On 9/28/23 07:22, David Laight wrote:
>> From: Peter Zijlstra
>>> Sent: 28 September 2023 11:39
>>>
>>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
>>>> Expose the "on-cpu" state for each thread through struct rseq to
>>>> allow adaptive mutexes to decide more accurately between busy-waiting
>>>> and calling sys_futex() to release the CPU, based on the on-cpu state
>>>> of the mutex owner.
>>
>> Are you trying to avoid spinning when the owning process is sleeping?
>
> Yes, this is my main intent.
>
>> Or trying to avoid the system call when it will find that the futex
>> is no longer held?
>>
>> The latter is really horribly detrimental.
>
> That's a good question. What should we do in those three situations
> when trying to grab the lock:
>
> 1) Lock has no owner
>
> We probably want to simply grab the lock with an atomic instruction.
> But then if other threads are queued on sys_futex and did not manage
> to grab the lock yet, this would be detrimental to fairness.
>
> 2) Lock owner is running:
>
> The lock owner is certainly running on another cpu (I'm using the term
> "cpu" here as logical cpu).
>
> I guess we could either decide to bypass sys_futex entirely and try to
> grab the lock with an atomic, or we go through sys_futex nevertheless
> to allow futex to guarantee some fairness across threads.
About the fairness part:

Even if you enqueue everyone, the futex syscall doesn't provide any
guarantee about the order of wake-ups. The current implementation tries
to be fair, but I don't think it works for every case. I wouldn't be
too concerned about fairness here, given that it's an inherent
limitation of futex anyway. From the man pages:

"No guarantee is provided about which waiters are awoken"
>
> 3) Lock owner is sleeping:
>
> The lock owner may be either tied to the same cpu as the requester, or
> a different cpu. Here calling FUTEX_WAIT and friends is pretty much
> required.
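Putting the three scenarios together, the hint-based lock could look
roughly like this. This is only a sketch: the owner_state pointer and
the ON_CPU flag value are assumptions made up for illustration, not the
actual rseq ABI proposed by the patch.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ON_CPU 1u	/* stand-in for the RFC's on-cpu flag */

struct adaptive_lock {
	atomic_int word;		/* 0 = unlocked, 1 = locked */
	atomic_uint *owner_state;	/* owner's sched state, NULL if none */
};

static void futex_wait(atomic_int *addr, int val)
{
	/* Sleep in the kernel only while *addr still equals val. */
	syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void adaptive_lock_acquire(struct adaptive_lock *l)
{
	for (;;) {
		int expected = 0;
		if (atomic_compare_exchange_weak(&l->word, &expected, 1))
			return;		/* scenario 1: no owner */
		if (l->owner_state &&
		    (atomic_load(l->owner_state) & ON_CPU))
			continue;	/* scenario 2: owner running, spin */
		/* scenario 3: owner off-cpu, release the CPU */
		futex_wait(&l->word, 1);
	}
}
```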
>
> Can you elaborate on why skipping sys_futex in scenario (2) would be
> so bad ? I wonder if we could get away with skipping futex entirely in
> this scenario and still guarantee fairness by implementing MCS locking
> or ticket locks in userspace. Basically, if userspace queues itself on
> the lock through either MCS locking or ticket locks, it could
> guarantee fairness on its own.
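A userspace ticket lock along those lines could be sketched as below
(spin-only for brevity; the futex path is elided):

```c
#include <stdatomic.h>

/*
 * Minimal ticket lock: fetch_add hands out tickets in order, so
 * acquisition is FIFO regardless of futex wake-up ordering.
 */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint serving;	/* ticket currently allowed in */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);
	while (atomic_load(&l->serving) != me)
		;	/* spin; a real version would futex-wait here
			 * whenever the owner's on-cpu hint is clear */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add(&l->serving, 1);
}
```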
>
> Of course things are more complicated with PI-futex, is that what you
> have in mind ?
>
>>
>>>>
>>>> It is only provided as an optimization hint, because there is no
>>>> guarantee that the page containing this field is in the page cache,
>>>> and therefore the scheduler may very well fail to clear the on-cpu
>>>> state on preemption. This is expected to be rare though, and is
>>>> resolved as soon as the task returns to user-space.
>>>>
>>>> The goal is to improve use-cases where the duration of the critical
>>>> sections for a given lock follows a multi-modal distribution,
>>>> preventing statistical guesses from doing a good job at choosing
>>>> between busy-wait and futex wait behavior.
>>>
>>> As always, are syscalls really *that* expensive? Why can't we busy wait
>>> in the kernel instead?
>>>
>>> I mean, sure, meltdown sucked, but most people should now be running
>>> chips that are not affected by that particular horror show, no?
>>
>> IIRC 'page table separation', which is what makes system calls
>> expensive, is only a compile-time option, so it is likely to be
>> enabled on any 'distro' kernel.
>> But a lot of other mitigations (eg RSB stuffing) are also pretty
>> detrimental.
>>
>> OTOH if you have a 'hot' userspace mutex you are going to lose
>> whatever. All that needs to happen is for an ethernet interrupt to
>> decide to discard completed transmits and refill the rx ring, and then
>> for the softint code to free a load of stuff deferred by rcu while
>> you've grabbed the mutex, and no matter how short the user-space code
>> path is, the mutex won't be released for absolutely ages.
>>
>> I had to change a load of code to use arrays and atomic increments
>> to avoid delays acquiring the mutex.
>
> That's good input, thanks! I mostly defer to André Almeida on the
> use-case motivation. I mostly provided this POC patch to show that it
> _can_ be done with sys_rseq(2).
>
> Thanks!
>
> Mathieu
>
>>
>> David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
>> MK1 1PT, UK
>> Registration No: 1397386 (Wales)
>>
>