linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	akpm@linux-foundation.org, luto@kernel.org, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	willy@infradead.org, mgorman@suse.de, rostedt@goodmis.org,
	jon.grimm@amd.com, bharata@amd.com, raghavendra.kt@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
	jgross@suse.com, andrew.cooper3@citrix.com
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Date: Tue, 19 Sep 2023 14:30:59 +0200	[thread overview]
Message-ID: <87led2wdj0.ffs@tglx> (raw)
In-Reply-To: <CAHk-=whnwC01m_1f-gaM1xbvvwzwTiKitrWniA-ChZv+bM03dg@mail.gmail.com>

Linus!

On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote:
>>    2) When the scheduler wants to set NEED_RESCHED due it sets
>>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>>       user space preemption points.
>
> Is this just to try to emulate the existing PREEMPT_NONE behavior?

To some extent yes.

> If the new world order is that the time slice is always honored, then
> the "this might be a latency issue" goes away. Good.

That's the point.

> And we'd also get better coverage for the *debug* aim of
> "might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
> PREEMPT_COUNT always existing.
>
> But because the latency argument is gone, the "might_resched()" should
> then just be removed entirely from "might_sleep()", so that
> might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.

True. And this gives the scheduler the flexibility to enforce preemption
under certain conditions, e.g. when a task with RT scheduling class or a
task with a sporadic event handler is woken up. That's what VOLUNTARY
tries to achieve with all the might_sleep()/might_resched() magic.

> That argues for your suggestion too, since we had a performance issue
> due to "might_sleep()" _not_ being just a debug thing, and pointlessly
> causing a reschedule in a place where reschedules were _allowed_, but
> certainly much less than optimal.
>
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
> 'might_sleep()' in get_mmap_lock_carefully()").

Awesome.

> However, that does bring up an issue: even with full preemption, there
> are certainly places where we are *allowed* to schedule (when the
> preempt count is zero), but there are also some places that are
> *better* than other places to schedule (for example, when we don't
> hold any other locks).
>
> So, I do think that if we just decide to go "let's just always be
> preemptible", we might still have points in the kernel where
> preemption might be *better* than in others points.
>
> But none of might_resched(), might_sleep() _or_ cond_resched() are
> necessarily that kind of "this is a good point" thing. They come from
> a different background.

They are subject to subsystem/driver specific preferences and therefore
biased towards a certain usage scenario, which is not necessarily to the
benefit of everyone else.

> So what I think what you are saying is that we'd have the following situation:
>
>  - scheduling at "return to user space" is presumably always a good thing.
>
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
>
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
>
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.

Correct.

>  - but the "timeslice over" case would always set the
> preempt-count-bit, regardless of any config, and would guarantee that
> we have reasonable latencies.

Yes. That's the reasoning.

> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.

:)

So the decision matrix would be:

                Ret2user        Ret2kernel      PreemptCnt=0

NEED_RESCHED       Y                Y               Y 
LAZY_RESCHED       Y                N               N

That is completely independent of the preemption model and the
differentiation of the preemption models happens solely at the scheduler
level:

PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time
slice where it sets NEED_RESCHED.

PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class
tasks or sporadic event tasks sets NEED_RESCHED too.

PREEMPT_FULL always sets NEED_RESCHED like today.

We should be able merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
only end up with two variants or even subsume PREEMPT_FULL into that
model because that's what is closer to the RT LAZY preempt behaviour,
which has two goals:

      1) Make low latency guarantees for RT workloads

      2) Preserve the throughput for non-RT workloads

But in any case this decision happens solely in the core scheduler code
and nothing outside of it needs to be changed.

So we not only get rid of the cond/might_resched() muck, we also get rid
of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
The only place which still needs that runtime tweaking is the scheduler
itself.

Though it just occured to me that there are dragons lurking:

arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
arch/um/Kconfig:        select ARCH_NO_PREEMPT

So we have four architectures which refuse to enable preemption points,
i.e. the only model they allow is NONE and they rely on cond_resched()
for breaking large computations.

But they support PREEMPT_COUNT, so we might get away with a reduced
preemption point coverage:

                Ret2user        Ret2kernel      PreemptCnt=0

NEED_RESCHED       Y                N               Y 
LAZY_RESCHED       Y                N               N

i.e. the only difference is that Ret2kernel is not a preemption
point. That's where the scheduler tick enforcement of the time slice
happens.

It still might work out good enough and if not then it should not be
real rocket science to add that Ret2kernel preemption point to cure it.

> Then the question becomes whether we'd want to introduce a *new*
> concept, which is a "if you are going to schedule, do it now rather
> than later, because I'm taking a lock, and while it's a preemptible
> lock, I'd rather not sleep while holding this resource".
>
> I suspect we want to avoid that for now, on the assumption that it's
> hopefully not a problem in practice (the recently addressed problem
> with might_sleep() was that it actively *moved* the scheduling point
> to a bad place, not that scheduling could happen there, so instead of
> optimizing scheduling, it actively pessimized it). But I thought I'd
> mention it.

I think we want to avoid that completely and if this becomes an issue,
we rather be smart about it at the core level.

It's trivial enough to have a per task counter which tells whether a
preemtible lock is held (or about to be acquired) or not. Then the
scheduler can take that hint into account and decide to grant a
timeslice extension once in the expectation that the task leaves the
lock held section soonish and either returns to user space or schedules
out. It still can enforce it later on.

We really want to let the scheduler decide and rather give it proper
hints at the conceptual level instead of letting developers make random
decisions which might work well for a particular use case and completely
suck for the rest. I think we wasted enough time already on those.

> Anyway, I'm definitely not opposed. We'd get rid of a config option
> that is presumably not very widely used, and we'd simplify a lot of
> issues, and get rid of all these badly defined "cond_preempt()"
> things.

Hmm. Didn't I promise a year ago that I won't do further large scale
cleanups and simplifications beyond printk.

Maybe I get away this time with just suggesting it. :)

Thanks,

        tglx

  parent reply	other threads:[~2023-09-19 12:31 UTC|newest]

Thread overview: 152+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-30 18:49 [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
2023-08-30 18:49 ` [PATCH v2 1/9] mm/clear_huge_page: allow arch override for clear_huge_page() Ankur Arora
2023-08-30 18:49 ` [PATCH v2 2/9] mm/huge_page: separate clear_huge_page() and copy_huge_page() Ankur Arora
2023-08-30 18:49 ` [PATCH v2 3/9] mm/huge_page: cleanup clear_/copy_subpage() Ankur Arora
2023-09-08 13:09   ` Matthew Wilcox
2023-09-11 17:22     ` Ankur Arora
2023-08-30 18:49 ` [PATCH v2 4/9] x86/clear_page: extend clear_page*() for multi-page clearing Ankur Arora
2023-09-08 13:11   ` Matthew Wilcox
2023-08-30 18:49 ` [PATCH v2 5/9] x86/clear_page: add clear_pages() Ankur Arora
2023-08-30 18:49 ` [PATCH v2 6/9] x86/clear_huge_page: multi-page clearing Ankur Arora
2023-08-31 18:26   ` kernel test robot
2023-09-08 12:38   ` Peter Zijlstra
2023-09-13  6:43   ` Raghavendra K T
2023-08-30 18:49 ` [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
2023-09-08  7:02   ` Peter Zijlstra
2023-09-08 17:15     ` Linus Torvalds
2023-09-08 22:50       ` Peter Zijlstra
2023-09-09  5:15         ` Linus Torvalds
2023-09-09  6:39           ` Ankur Arora
2023-09-09  9:11             ` Peter Zijlstra
2023-09-09 20:04               ` Ankur Arora
2023-09-09  5:30       ` Ankur Arora
2023-09-09  9:12         ` Peter Zijlstra
2023-09-09 20:15     ` Ankur Arora
2023-09-09 21:16       ` Linus Torvalds
2023-09-10  3:48         ` Ankur Arora
2023-09-10  4:35           ` Linus Torvalds
2023-09-10 10:01             ` Ankur Arora
2023-09-10 18:32               ` Linus Torvalds
2023-09-11 15:04                 ` Peter Zijlstra
2023-09-11 16:29                   ` andrew.cooper3
2023-09-11 17:04                   ` Ankur Arora
2023-09-12  8:26                     ` Peter Zijlstra
2023-09-12 12:24                       ` Phil Auld
2023-09-12 12:33                       ` Matthew Wilcox
2023-09-18 23:42                       ` Thomas Gleixner
2023-09-19  1:57                         ` Linus Torvalds
2023-09-19  8:03                           ` Ingo Molnar
2023-09-19  8:43                             ` Ingo Molnar
2023-09-19 13:43                               ` Thomas Gleixner
2023-09-19 13:25                             ` Thomas Gleixner
2023-09-19 12:30                           ` Thomas Gleixner [this message]
2023-09-19 13:00                             ` Arches that don't support PREEMPT Matthew Wilcox
2023-09-19 13:34                               ` Geert Uytterhoeven
2023-09-19 13:37                               ` John Paul Adrian Glaubitz
2023-09-19 13:42                                 ` Peter Zijlstra
2023-09-19 13:48                                   ` John Paul Adrian Glaubitz
2023-09-19 14:16                                     ` Peter Zijlstra
2023-09-19 14:24                                       ` John Paul Adrian Glaubitz
2023-09-19 14:32                                         ` Matthew Wilcox
2023-09-19 15:31                                           ` Steven Rostedt
2023-09-20 14:38                                       ` Anton Ivanov
2023-09-21 12:20                                       ` Arnd Bergmann
2023-09-19 14:17                                     ` Thomas Gleixner
2023-09-19 14:50                                       ` H. Peter Anvin
2023-09-19 14:57                                         ` Matt Turner
2023-09-19 17:09                                         ` Ulrich Teichert
2023-09-19 17:25                                     ` Linus Torvalds
2023-09-19 17:58                                       ` John Paul Adrian Glaubitz
2023-09-19 18:31                                       ` Thomas Gleixner
2023-09-19 18:38                                         ` Steven Rostedt
2023-09-19 18:52                                           ` Linus Torvalds
2023-09-19 19:53                                             ` Thomas Gleixner
2023-09-20  7:32                                           ` Ingo Molnar
2023-09-20  7:29                                         ` Ingo Molnar
2023-09-20  8:26                                       ` Thomas Gleixner
2023-09-20 10:37                                       ` David Laight
2023-09-19 14:21                                   ` Anton Ivanov
2023-09-19 15:17                                     ` Thomas Gleixner
2023-09-19 15:21                                       ` Anton Ivanov
2023-09-19 16:22                                         ` Richard Weinberger
2023-09-19 16:41                                           ` Anton Ivanov
2023-09-19 17:33                                             ` Thomas Gleixner
2023-10-06 14:51                               ` Geert Uytterhoeven
2023-09-20 14:22                             ` [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
2023-09-20 20:51                               ` Thomas Gleixner
2023-09-21  0:14                                 ` Thomas Gleixner
2023-09-21  0:58                                 ` Ankur Arora
2023-09-21  2:12                                   ` Thomas Gleixner
2023-09-20 23:58                             ` Thomas Gleixner
2023-09-21  0:57                               ` Ankur Arora
2023-09-21  2:02                                 ` Thomas Gleixner
2023-09-21  4:16                                   ` Ankur Arora
2023-09-21 13:59                                     ` Steven Rostedt
2023-09-21 16:00                               ` Linus Torvalds
2023-09-21 22:55                                 ` Thomas Gleixner
2023-09-23  1:11                                   ` Thomas Gleixner
2023-10-02 14:15                                     ` Steven Rostedt
2023-10-02 16:13                                       ` Thomas Gleixner
2023-10-18  1:03                                     ` Paul E. McKenney
2023-10-18 12:09                                       ` Ankur Arora
2023-10-18 17:51                                         ` Paul E. McKenney
2023-10-18 22:53                                           ` Thomas Gleixner
2023-10-18 23:25                                             ` Paul E. McKenney
2023-10-18 13:16                                       ` Thomas Gleixner
2023-10-18 14:31                                         ` Steven Rostedt
2023-10-18 17:55                                           ` Paul E. McKenney
2023-10-18 18:00                                             ` Steven Rostedt
2023-10-18 18:13                                               ` Paul E. McKenney
2023-10-19 12:37                                                 ` Daniel Bristot de Oliveira
2023-10-19 17:08                                                   ` Paul E. McKenney
2023-10-18 17:19                                         ` Paul E. McKenney
2023-10-18 17:41                                           ` Steven Rostedt
2023-10-18 17:59                                             ` Paul E. McKenney
2023-10-18 20:15                                           ` Ankur Arora
2023-10-18 20:42                                             ` Paul E. McKenney
2023-10-19  0:21                                           ` Thomas Gleixner
2023-10-19 19:13                                             ` Paul E. McKenney
2023-10-20 21:59                                               ` Paul E. McKenney
2023-10-20 22:56                                               ` Ankur Arora
2023-10-20 23:36                                                 ` Paul E. McKenney
2023-10-21  1:05                                                   ` Ankur Arora
2023-10-21  2:08                                                     ` Paul E. McKenney
2023-10-24 12:15                                               ` Thomas Gleixner
2023-10-24 18:59                                                 ` Paul E. McKenney
2023-09-23 22:50                             ` Thomas Gleixner
2023-09-24  0:10                               ` Thomas Gleixner
2023-09-24  7:19                               ` Matthew Wilcox
2023-09-24  7:55                                 ` Thomas Gleixner
2023-09-24 10:29                                   ` Matthew Wilcox
2023-09-25  0:13                               ` Ankur Arora
2023-10-06 13:01                             ` Geert Uytterhoeven
2023-09-19  7:21                         ` Ingo Molnar
2023-09-19 19:05                         ` Ankur Arora
2023-10-24 14:34                         ` Steven Rostedt
2023-10-25  1:49                           ` Steven Rostedt
2023-10-26  7:50                           ` Sergey Senozhatsky
2023-10-26 12:48                             ` Steven Rostedt
2023-09-11 16:48             ` Steven Rostedt
2023-09-11 20:50               ` Linus Torvalds
2023-09-11 21:16                 ` Linus Torvalds
2023-09-12  7:20                   ` Peter Zijlstra
2023-09-12  7:38                     ` Ingo Molnar
2023-09-11 22:20                 ` Steven Rostedt
2023-09-11 23:10                   ` Ankur Arora
2023-09-11 23:16                     ` Steven Rostedt
2023-09-12 16:30                   ` Linus Torvalds
2023-09-12  3:27                 ` Matthew Wilcox
2023-09-12 16:20                   ` Linus Torvalds
2023-09-19  3:21   ` Andy Lutomirski
2023-09-19  9:20     ` Thomas Gleixner
2023-09-19  9:49       ` Ingo Molnar
2023-08-30 18:49 ` [PATCH v2 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
2023-09-08 12:42   ` Peter Zijlstra
2023-09-11 17:24     ` Ankur Arora
2023-08-30 18:49 ` [PATCH v2 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
2023-09-08 12:45   ` Peter Zijlstra
2023-09-03  8:14 ` [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing Mateusz Guzik
2023-09-05 22:14   ` Ankur Arora
2023-09-08  2:18   ` Raghavendra K T
2023-09-05  1:06 ` Raghavendra K T
2023-09-05 19:36   ` Ankur Arora

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87led2wdj0.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=akpm@linux-foundation.org \
    --cc=andrew.cooper3@citrix.com \
    --cc=ankur.a.arora@oracle.com \
    --cc=bharata@amd.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=jgross@suse.com \
    --cc=jon.grimm@amd.com \
    --cc=juri.lelli@redhat.com \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=raghavendra.kt@amd.com \
    --cc=rostedt@goodmis.org \
    --cc=torvalds@linux-foundation.org \
    --cc=vincent.guittot@linaro.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).