linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Steven Rostedt <rostedt@goodmis.org>
To: David Laight <David.Laight@ACULAB.COM>
Cc: "Peter Zijlstra" <peterz@infradead.org>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"H . Peter Anvin" <hpa@zytor.com>, "Paul Turner" <pjt@google.com>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Florian Weimer" <fw@deneb.enyo.de>,
	"carlos@redhat.com" <carlos@redhat.com>,
	"Peter Oskolkov" <posk@posk.io>,
	"Alexander Mikhalitsyn" <alexander@mihalicyn.com>,
	"Chris Kennelly" <ckennelly@google.com>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Darren Hart" <dvhart@infradead.org>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"André Almeida" <andrealmeid@igalia.com>,
	"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Noah Goldstein" <goldstein.w.n@gmail.com>,
	"Daniel Colascione" <dancol@google.com>,
	"longman@redhat.com" <longman@redhat.com>,
	"Florian Weimer" <fweimer@redhat.com>
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
Date: Mon, 2 Oct 2023 12:51:09 -0400	[thread overview]
Message-ID: <20231002125109.55c35030@gandalf.local.home> (raw)
In-Reply-To: <40b76cbd00d640e49f727abbd0c39693@AcuMS.aculab.com>

On Thu, 28 Sep 2023 15:51:47 +0000
David Laight <David.Laight@ACULAB.COM> wrote:


> > This is when I thought that having an adaptive spinner that could get
> > hints from the kernel via memory mapping would be extremely useful.  
> 
> Did you consider writing a timestamp into the mutex when it was
> acquired - or even as the 'acquired' value?
> A 'moderately synched TSC' should do.
> Then the waiter should be able to tell how long the mutex
> has been held for - and then not spin if it had been held ages.

And what heuristic would you use. My experience with picking "time to spin"
may work for one workload but cause major regressions in another workload.
I came to the conclusion to "hate" heuristics and NACK them whenever
someone suggested adding them to the rt_mutex in the kernel (back before
adaptive mutexes were introduced).

> 
> > The obvious problem with their implementation is that if the owner is
> > sleeping, there's no point in spinning. Worse, the owner may even be
> > waiting for the spinner to get off the CPU before it can run again. But
> > according to Robert, the gain in the general performance greatly
> > outweighed the few times this happened in practice.  
> 
> Unless you can use atomics (ok for bits and linked lists) you
> always have the problem that userspace can't disable interrupts.
> So, unlike the kernel, you can't implement a proper spinlock.

Why do you need to disable interrupts? If you know the owner is running on
the CPU, you know it's not trying to run on the CPU that is acquiring the
lock. Heck, there's normal spin locks outside of PREEMPT_RT that do not
disable interrupts. The only time you need to disable interrupts is if the
interrupt itself takes the spin lock, and that's just to prevent deadlocks.

> 
> I've NFI how CONFIG_RT manages to get anything done with all
> the spinlocks replaced by sleep locks.
> Clearly there are a spinlocks that are held for far too long.
> But you really do want to spin most of the time.

It spins as long as the owner of the lock is running on the CPU. This is
what we are looking to get from this patch series for user space.

Back in 2007, we had an issue with scaling on SMP machines. The RT kernel
with the sleeping spin locks would start to exponentially slow down with
the more CPUs you had. Once we hit more than 16 CPUs,  the time to boot a
kernel took 10s of minutes to boot RT when the normal CONFIG_PREEMPT kernel
would only take a couple of minutes. The more CPUs you added, the worse it
became.

Then SUSE submitted a patch to have the rt_mutex spin only if the owner of
the mutex was still running on another CPU. This actually mimics a real
spin lock (because that's exactly what they do, they spin while the owner
is running on a CPU). The difference between a true spin lock and an
rt_mutex was that the spinner would stop spinning if the owner was
preempted (a true spin lock owner could not be preempted).

After applying the adaptive spinning, we were able to scale PREEMPT_RT to
any number of CPUs that the normal kernel could do with just a linear
performance hit.

This is why I'm very much interested in getting the same ability into user
space spin locks.

-- Steve

  reply	other threads:[~2023-10-02 16:50 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-29 19:14 [RFC PATCH v2 0/4] Extend rseq with sched_state_ptr field Mathieu Desnoyers
2023-05-29 19:14 ` [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
2023-05-29 19:35   ` Florian Weimer
2023-05-29 19:48     ` Mathieu Desnoyers
2023-05-30  8:20       ` Florian Weimer
2023-05-30 14:25         ` Mathieu Desnoyers
2023-05-30 15:13           ` Mathieu Desnoyers
2023-09-26 20:52       ` Dmitry Vyukov
2023-09-26 23:49         ` Dmitry Vyukov
2023-09-26 23:54           ` Dmitry Vyukov
2023-09-27  4:51           ` Florian Weimer
2023-09-27 15:58             ` Dmitry Vyukov
2023-09-28  8:52               ` Florian Weimer
2023-09-28 14:44                 ` Dmitry Vyukov
2023-09-28 14:47           ` Dmitry Vyukov
2023-09-28 10:39   ` Peter Zijlstra
2023-09-28 11:22     ` David Laight
2023-09-28 13:20       ` Mathieu Desnoyers
2023-09-28 14:26         ` Peter Zijlstra
2023-09-28 14:33         ` David Laight
2023-09-28 15:05         ` André Almeida
2023-09-28 14:43     ` Steven Rostedt
2023-09-28 15:51       ` David Laight
2023-10-02 16:51         ` Steven Rostedt [this message]
2023-10-02 17:22           ` David Laight
2023-10-02 17:56             ` Steven Rostedt
2023-09-28 20:21   ` Thomas Gleixner
2023-09-28 20:43     ` Mathieu Desnoyers
2023-09-28 20:54   ` Thomas Gleixner
2023-09-28 22:11     ` Mathieu Desnoyers
2023-05-29 19:14 ` [RFC PATCH v2 2/4] selftests/rseq: Add sched_state rseq field and getter Mathieu Desnoyers
2023-05-29 19:14 ` [RFC PATCH v2 3/4] selftests/rseq: Implement sched state test program Mathieu Desnoyers
2023-05-29 19:14 ` [RFC PATCH v2 4/4] selftests/rseq: Implement rseq_mutex " Mathieu Desnoyers
2023-09-28 19:55   ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231002125109.55c35030@gandalf.local.home \
    --to=rostedt@goodmis.org \
    --cc=David.Laight@ACULAB.COM \
    --cc=alexander@mihalicyn.com \
    --cc=andrealmeid@igalia.com \
    --cc=boqun.feng@gmail.com \
    --cc=brauner@kernel.org \
    --cc=carlos@redhat.com \
    --cc=ckennelly@google.com \
    --cc=corbet@lwn.net \
    --cc=dancol@google.com \
    --cc=dave@stgolabs.net \
    --cc=dvhart@infradead.org \
    --cc=fw@deneb.enyo.de \
    --cc=fweimer@redhat.com \
    --cc=goldstein.w.n@gmail.com \
    --cc=hpa@zytor.com \
    --cc=libc-alpha@sourceware.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=posk@posk.io \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).