From: "André Almeida" <andrealmeid@igalia.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	"H . Peter Anvin" <hpa@zytor.com>, Paul Turner <pjt@google.com>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	David Laight <David.Laight@ACULAB.COM>,
	Christian Brauner <brauner@kernel.org>,
	Florian Weimer <fw@deneb.enyo.de>,
	"carlos@redhat.com" <carlos@redhat.com>,
	Peter Oskolkov <posk@posk.io>,
	Alexander Mikhalitsyn <alexander@mihalicyn.com>,
	'Peter Zijlstra' <peterz@infradead.org>,
	Chris Kennelly <ckennelly@google.com>,
	Ingo Molnar <mingo@redhat.com>,
	Darren Hart <dvhart@infradead.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Noah Goldstein <goldstein.w.n@gmail.com>,
	Daniel Colascione <dancol@google.com>,
	"longman@redhat.com" <longman@redhat.com>,
	Florian Weimer <fweimer@redhat.com>
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
Date: Thu, 28 Sep 2023 17:05:59 +0200
Message-ID: <ab59863f-25f0-4635-8408-4aaec39ec6c2@igalia.com>
In-Reply-To: <34ddb730-8893-19a8-00fe-84c4e281eef1@efficios.com>


On 9/28/23 15:20, Mathieu Desnoyers wrote:
> On 9/28/23 07:22, David Laight wrote:
>> From: Peter Zijlstra
>>> Sent: 28 September 2023 11:39
>>>
>>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
>>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>>> adaptative mutexes to decide more accurately between busy-waiting and
>>>> calling sys_futex() to release the CPU, based on the on-cpu state 
>>>> of the
>>>> mutex owner.
>>
>> Are you trying to avoid spinning when the owning process is sleeping?
>
> Yes, this is my main intent.
>
>> Or trying to avoid the system call when it will find that the futex
>> is no longer held?
>>
>> The latter is really horribly detrimental.
>
> That's a good question. What should we do in those three situations
> when trying to grab the lock:
>
> 1) Lock has no owner
>
> We probably want to simply grab the lock with an atomic instruction. 
> But then if other threads are queued on sys_futex and did not manage 
> to grab the lock yet, this would be detrimental to fairness.
>
> 2) Lock owner is running:
>
> The lock owner is certainly running on another cpu (I'm using the term
> "cpu" here to mean a logical cpu).
>
> I guess we could either decide to bypass sys_futex entirely and try to 
> grab the lock with an atomic, or we go through sys_futex nevertheless 
> to allow futex to guarantee some fairness across threads.

About the fairness part:

Even if you enqueue everyone, the futex syscall doesn't provide any
guarantee about the order of wakeups. The current implementation tries
to be fair, but I don't think it works for every case. I wouldn't be
too concerned about fairness here, given that it's an inherent
limitation of futex anyway.

From the futex(2) man page:

"No guarantee is provided about which waiters are awoken"

>
> 3) Lock owner is sleeping:
>
> The lock owner may be either tied to the same cpu as the requester, or 
> a different cpu. Here calling FUTEX_WAIT and friends is pretty much 
> required.
>
> Can you elaborate on why skipping sys_futex in scenario (2) would be
> so bad? I wonder if we could get away with skipping futex entirely in
> this scenario and still guarantee fairness by implementing MCS locking
> or ticket locks in userspace. Basically, if userspace queues itself on
> the lock through either MCS locking or ticket locks, it could
> guarantee fairness on its own.
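
A userspace ticket lock would indeed give FIFO ordering without
relying on futex wake order. A minimal illustrative sketch (not
production code):

/* Illustrative only: a futex-backed ticket lock, FIFO by construction. */
#include <stdint.h>
#include <limits.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

struct ticket_lock {
	uint32_t next;		/* next ticket to hand out */
	uint32_t serving;	/* ticket currently allowed in */
};

static void ticket_lock(struct ticket_lock *l)
{
	uint32_t me = __atomic_fetch_add(&l->next, 1, __ATOMIC_RELAXED);
	uint32_t cur;

	while ((cur = __atomic_load_n(&l->serving, __ATOMIC_ACQUIRE)) != me) {
		/* Sleep until 'serving' changes; a stale 'cur' fails fast. */
		syscall(SYS_futex, &l->serving, FUTEX_WAIT, cur, NULL, NULL, 0);
	}
}

static void ticket_unlock(struct ticket_lock *l)
{
	__atomic_add_fetch(&l->serving, 1, __ATOMIC_RELEASE);
	/* Wake all waiters; only the next ticket holder stops looping. */
	syscall(SYS_futex, &l->serving, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

The FUTEX_WAKE of all waiters is the usual cost of this scheme; MCS
avoids the thundering herd by waking only the direct successor, at the
price of per-waiter queue nodes.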
>
> Of course, things are more complicated with PI-futex; is that what you
> have in mind?
>
>>
>>>>
>>>> It is only provided as an optimization hint, because there is no
>>>> guarantee that the page containing this field is in the page cache,
>>>> and therefore the scheduler may very well fail to clear the on-cpu
>>>> state on preemption. This is expected to be rare though, and is
>>>> resolved as soon as the task returns to user-space.
>>>>
>>>> The goal is to improve use-cases where the duration of the critical
>>>> sections for a given lock follows a multi-modal distribution,
>>>> preventing statistical guesses from doing a good job at choosing
>>>> between busy-wait and futex wait behavior.
>>>
>>> As always, are syscalls really *that* expensive? Why can't we busy wait
>>> in the kernel instead?
>>>
>>> I mean, sure, meltdown sucked, but most people should now be running
>>> chips that are not affected by that particular horror show, no?
>>
>> IIRC 'page table separation', which is what makes system calls
>> expensive, is only a compile-time option, so it is likely to be
>> enabled on any 'distro' kernel. But a lot of other mitigations
>> (eg RSB stuffing) are also pretty detrimental.
>>
>> OTOH if you have a 'hot' userspace mutex you are going to lose
>> whatever happens. All that needs to happen is for an ethernet
>> interrupt to decide to discard completed transmits and refill the rx
>> ring, and then for the softint code to free a load of stuff deferred
>> by RCU while you've grabbed the mutex; no matter how short the
>> user-space code path, the mutex won't be released for absolutely
>> ages.
>>
>> I had to change a load of code to use arrays and atomic increments
>> to avoid delays acquiring a mutex.
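
If I understand the pattern correctly, it looks roughly like this.
An illustrative sketch only, not David's actual code, and it ignores
wraparound and consumer-side coordination:

/*
 * Producers claim a slot with one atomic increment instead of
 * serializing on a mutex; no lock is ever held across the update.
 */
#include <stdint.h>

#define RING_SIZE	1024	/* power of two */

struct event {
	uint64_t timestamp;
	uint32_t data;
};

struct event_ring {
	uint32_t head;		/* slot counter, only ever incremented */
	struct event slots[RING_SIZE];
};

static struct event *claim_slot(struct event_ring *r)
{
	uint32_t idx = __atomic_fetch_add(&r->head, 1, __ATOMIC_RELAXED);

	return &r->slots[idx & (RING_SIZE - 1)];
}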
>
> That's good input, thanks! I defer to André Almeida on the use-case
> motivation; I mostly provided this POC patch to show that it _can_ be
> done with sys_rseq(2).
>
> Thanks!
>
> Mathieu
>
>>
>>     David
>>
>>
>
