linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH RT v2 0/2] Add PINNED_HARD mode to hrtimers
@ 2021-06-16  7:17 Juri Lelli
  2021-06-16  7:17 ` [RFC PATCH RT v2 1/2] time/hrtimer: Add PINNED_HARD mode for realtime hrtimers Juri Lelli
  2021-06-16  7:17 ` [RFC PATCH RT v2 2/2] time/hrtimer: Embed hrtimer mode into hrtimer_sleeper Juri Lelli
  0 siblings, 2 replies; 6+ messages in thread
From: Juri Lelli @ 2021-06-16  7:17 UTC (permalink / raw)
  To: bigeasy, tglx; +Cc: linux-rt-users, peterz, linux-kernel, bristot, Juri Lelli

Hi,

I rebased an RFC series I already proposed a while ago [1] and I'd like
people to consider it again for inclusion.

When running cyclictest on isolated CPUs with timer_migration enabled,
the following, where CPU0 is one of the housekeeping CPUs and CPU2 is
isolated, is (of course) happening:

     <idle>-0     [000] ... hrtimer_cancel:       hrtimer=0xffffb4a74be7fe70
     <idle>-0     [000] ... hrtimer_expire_entry: hrtimer=0xffffb4a74be7fe70 now=144805770984 function=hrtimer_wakeup/0x0
     <idle>-0     [000] ... sched_wakeup:         cyclictest:1171 [4] success=1 CPU:002
     <idle>-0     [000] ... hrtimer_expire_exit:  hrtimer=0xffffb4a74be7fe70
     <idle>-0     [002] ... sched_switch:         swapper/2:0 [120] R ==> cyclictest:1171 [4]
 cyclictest-1171  [002] ... hrtimer_init:         hrtimer=0xffffb4a74be7fe70 clockid=CLOCK_MONOTONIC mode=0x8
 cyclictest-1171  [002] ... hrtimer_start:        hrtimer=0xffffb4a74be7fe70 function=hrtimer_wakeup/0x0 ...
 cyclictest-1171  [002] ... sched_switch:         cyclictest:1171 [4] S ==> swapper/2:0 [120]

While cyclitest is arming the hrtimer while running on isolated CPU2
(by means of clock_nanosleep), the hrtimer is then firing on CPU0. This
is due to the fact that switch_hrtimer_base(), called at hrtimer enqueue
time, will prefer to enqueue the timer on an housekeeping !idle CPU, if
the timer is not pinned, as per timer_migration feature.

The problem with this is that we are measuring wake up latencies across
isolated and !isolated domains, which is against the purpose of
configuring the latter, while having timer_migration enabled is required
for certain workloads that are not using timers and don't want to be
ever interrupted.

Since PREEMPT_RT already forces HARD mode for hrtimers armed by tasks
running with RT policies, it seems to make sense to also force PINNED
mode under the same conditions.

This set implements the behavior, achieving something like the
following:

     <idle>-0     [002] ... hrtimer_cancel:       hrtimer=0xffffafbacc19fe78
     <idle>-0     [002] ... hrtimer_expire_entry: hrtimer=0xffffafbacc19fe78 now=104335855898 function=hrtimer_wakeup/0x0
     <idle>-0     [002] ... sched_wakeup:         cyclictest:1165 [4] success=1 CPU:002
     <idle>-0     [002] ... hrtimer_expire_exit:  hrtimer=0xffffafbacc19fe78
     <idle>-0     [002] ... sched_switch:         swapper/2:0 [120] R ==> cyclictest:1165 [4]
 cyclictest-1165  [002] ... hrtimer_init:         hrtimer=0xffffafbacc19fe78 clockid=CLOCK_MONOTONIC mode=0xa
 cyclictest-1165  [002] ... hrtimer_start:        hrtimer=0xffffafbacc19fe78 function=hrtimer_wakeup/0x0 ...
 cyclictest-1165  [002] ... sched_switch:         cyclictest:1165 [4] S ==> swapper/2:0 [120]

Sebastian didn't look against the proposed changes, but I didn't follow
up back then because it looked like we could meet workloads requirements
at that time w/o this set. Now things have changed, looks like the mix
of the two types of workloads - interrupt driven and always running - is
very relevant and we need to accommodate both types on the same system
setup.

Does this still make sense or do you suggest alternative approaches?

Thanks!

- Juri

1 - https://lore.kernel.org/lkml/20190214133716.10187-1-juri.lelli@redhat.com/

Juri Lelli (2):
  time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
  time/hrtimer: Embed hrtimer mode into hrtimer_sleeper

 include/linux/hrtimer.h |  3 +++
 kernel/time/hrtimer.c   | 13 +++++++------
 2 files changed, 10 insertions(+), 6 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH RT v2 1/2] time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
  2021-06-16  7:17 [RFC PATCH RT v2 0/2] Add PINNED_HARD mode to hrtimers Juri Lelli
@ 2021-06-16  7:17 ` Juri Lelli
  2021-06-18 23:35   ` Thomas Gleixner
  2021-06-16  7:17 ` [RFC PATCH RT v2 2/2] time/hrtimer: Embed hrtimer mode into hrtimer_sleeper Juri Lelli
  1 sibling, 1 reply; 6+ messages in thread
From: Juri Lelli @ 2021-06-16  7:17 UTC (permalink / raw)
  To: bigeasy, tglx; +Cc: linux-rt-users, peterz, linux-kernel, bristot, Juri Lelli

While running cyclictest on isolated CPUs with timer_migration enabled,
I noticed the following behavior, where CPU0 is one of the housekeeping
CPUs and CPU2 is isolated:

     <idle>-0     [000] ... hrtimer_cancel:       hrtimer=0xffffb4a74be7fe70
     <idle>-0     [000] ... hrtimer_expire_entry: hrtimer=0xffffb4a74be7fe70 now=144805770984 function=hrtimer_wakeup/0x0
     <idle>-0     [000] ... sched_wakeup:         cyclictest:1171 [4] success=1 CPU:002
     <idle>-0     [000] ... hrtimer_expire_exit:  hrtimer=0xffffb4a74be7fe70
     <idle>-0     [002] ... sched_switch:         swapper/2:0 [120] R ==> cyclictest:1171 [4]
 cyclictest-1171  [002] ... hrtimer_init:         hrtimer=0xffffb4a74be7fe70 clockid=CLOCK_MONOTONIC mode=0x8
 cyclictest-1171  [002] ... hrtimer_start:        hrtimer=0xffffb4a74be7fe70 function=hrtimer_wakeup/0x0 ...
 cyclictest-1171  [002] ... sched_switch:         cyclictest:1171 [4] S ==> swapper/2:0 [120]

While cyclitest was arming the hrtimer while running on isolated CPU2
(by means of clock_nanosleep), the hrtimer was then firing on CPU0. This
is due to the fact that switch_hrtimer_base(), called at hrtimer enqueue
time, will prefer to enqueue the timer on an housekeeping !idle CPU, if
the timer is not pinned and timer_migration is enabled.

The problem with this is that we are measuring wake up latencies across
isolated and !isolated domains, which is against the purpose of
configuring the latter.

Since PREEMPT_RT already forces HARD mode for hrtimers armed by tasks
running with RT policies, it makes sense to also force PINNED mode under
the same conditions.

This patch implements this behavior, achieving something like the
following:

     <idle>-0     [002] ... hrtimer_cancel:       hrtimer=0xffffafbacc19fe78
     <idle>-0     [002] ... hrtimer_expire_entry: hrtimer=0xffffafbacc19fe78 now=104335855898 function=hrtimer_wakeup/0x0
     <idle>-0     [002] ... sched_wakeup:         cyclictest:1165 [4] success=1 CPU:002
     <idle>-0     [002] ... hrtimer_expire_exit:  hrtimer=0xffffafbacc19fe78
     <idle>-0     [002] ... sched_switch:         swapper/2:0 [120] R ==> cyclictest:1165 [4]
 cyclictest-1165  [002] ... hrtimer_init:         hrtimer=0xffffafbacc19fe78 clockid=CLOCK_MONOTONIC mode=0xa
 cyclictest-1165  [002] ... hrtimer_start:        hrtimer=0xffffafbacc19fe78 function=hrtimer_wakeup/0x0 ...
 cyclictest-1165  [002] ... sched_switch:         cyclictest:1165 [4] S ==> swapper/2:0 [120]

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 include/linux/hrtimer.h | 2 ++
 kernel/time/hrtimer.c   | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index e425a26a5ed8..bca9402a1030 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -55,6 +55,8 @@ enum hrtimer_mode {
 	HRTIMER_MODE_ABS_HARD	= HRTIMER_MODE_ABS | HRTIMER_MODE_HARD,
 	HRTIMER_MODE_REL_HARD	= HRTIMER_MODE_REL | HRTIMER_MODE_HARD,
 
+	HRTIMER_MODE_PINNED_HARD = HRTIMER_MODE_PINNED | HRTIMER_MODE_HARD,
+
 	HRTIMER_MODE_ABS_PINNED_HARD = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_HARD,
 	HRTIMER_MODE_REL_PINNED_HARD = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_HARD,
 };
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3fa18a01f5b2..f64954d5c8f8 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1842,7 +1842,7 @@ static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
 	 */
 	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
 		if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT))
-			mode |= HRTIMER_MODE_HARD;
+			mode |= HRTIMER_MODE_PINNED_HARD;
 	}
 
 	__hrtimer_init(&sl->timer, clock_id, mode);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC PATCH RT v2 2/2] time/hrtimer: Embed hrtimer mode into hrtimer_sleeper
  2021-06-16  7:17 [RFC PATCH RT v2 0/2] Add PINNED_HARD mode to hrtimers Juri Lelli
  2021-06-16  7:17 ` [RFC PATCH RT v2 1/2] time/hrtimer: Add PINNED_HARD mode for realtime hrtimers Juri Lelli
@ 2021-06-16  7:17 ` Juri Lelli
  1 sibling, 0 replies; 6+ messages in thread
From: Juri Lelli @ 2021-06-16  7:17 UTC (permalink / raw)
  To: bigeasy, tglx; +Cc: linux-rt-users, peterz, linux-kernel, bristot, Juri Lelli

Changes to hrtimer mode (potentially made by __hrtimer_init_sleeper() on
PREEMPT_RT are not visible to do_nanosleep(), and thus not accounted for
by hrtimer_start_expires() call path.

Embed hrtimer mode into hrtimer_sleeper struct, so that the same mode is
used by code following hrtimer_sleeper initialization.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 include/linux/hrtimer.h |  1 +
 kernel/time/hrtimer.c   | 11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index bca9402a1030..7573823d91d9 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -138,6 +138,7 @@ struct hrtimer {
 struct hrtimer_sleeper {
 	struct hrtimer timer;
 	struct task_struct *task;
+	enum hrtimer_mode mode;
 };
 
 #ifdef CONFIG_64BIT
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index f64954d5c8f8..badcfb6e19aa 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1848,6 +1848,7 @@ static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
 	__hrtimer_init(&sl->timer, clock_id, mode);
 	sl->timer.function = hrtimer_wakeup;
 	sl->task = current;
+	sl->mode = mode;
 }
 
 /**
@@ -1884,19 +1885,19 @@ int nanosleep_copyout(struct restart_block *restart, struct timespec64 *ts)
 	return -ERESTART_RESTARTBLOCK;
 }
 
-static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
+static int __sched do_nanosleep(struct hrtimer_sleeper *t)
 {
 	struct restart_block *restart;
 
 	do {
 		set_current_state(TASK_INTERRUPTIBLE);
-		hrtimer_sleeper_start_expires(t, mode);
+		hrtimer_sleeper_start_expires(t, t->mode);
 
 		if (likely(t->task))
 			freezable_schedule();
 
 		hrtimer_cancel(&t->timer);
-		mode = HRTIMER_MODE_ABS;
+		t->mode = HRTIMER_MODE_ABS;
 
 	} while (t->task && !signal_pending(current));
 
@@ -1927,7 +1928,7 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
 	hrtimer_init_sleeper_on_stack(&t, restart->nanosleep.clockid,
 				      HRTIMER_MODE_ABS);
 	hrtimer_set_expires_tv64(&t.timer, restart->nanosleep.expires);
-	ret = do_nanosleep(&t, HRTIMER_MODE_ABS);
+	ret = do_nanosleep(&t);
 	destroy_hrtimer_on_stack(&t.timer);
 	return ret;
 }
@@ -1946,7 +1947,7 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
 
 	hrtimer_init_sleeper_on_stack(&t, clockid, mode);
 	hrtimer_set_expires_range_ns(&t.timer, rqtp, slack);
-	ret = do_nanosleep(&t, mode);
+	ret = do_nanosleep(&t);
 	if (ret != -ERESTART_RESTARTBLOCK)
 		goto out;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH RT v2 1/2] time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
  2021-06-16  7:17 ` [RFC PATCH RT v2 1/2] time/hrtimer: Add PINNED_HARD mode for realtime hrtimers Juri Lelli
@ 2021-06-18 23:35   ` Thomas Gleixner
  2021-06-19  7:56     ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2021-06-18 23:35 UTC (permalink / raw)
  To: Juri Lelli, bigeasy
  Cc: linux-rt-users, peterz, linux-kernel, bristot, Juri Lelli,
	Anna-Maria Behnsen

Juri,

On Wed, Jun 16 2021 at 09:17, Juri Lelli wrote:
> While running cyclictest on isolated CPUs with timer_migration enabled,
> I noticed the following behavior, where CPU0 is one of the housekeeping
> CPUs and CPU2 is isolated:
>
>      <idle>-0     [000] ... hrtimer_cancel:       hrtimer=0xffffb4a74be7fe70
>      <idle>-0     [000] ... hrtimer_expire_entry: hrtimer=0xffffb4a74be7fe70 now=144805770984 function=hrtimer_wakeup/0x0
>      <idle>-0     [000] ... sched_wakeup:         cyclictest:1171 [4] success=1 CPU:002
>      <idle>-0     [000] ... hrtimer_expire_exit:  hrtimer=0xffffb4a74be7fe70
>      <idle>-0     [002] ... sched_switch:         swapper/2:0 [120] R ==> cyclictest:1171 [4]
>  cyclictest-1171  [002] ... hrtimer_init:         hrtimer=0xffffb4a74be7fe70 clockid=CLOCK_MONOTONIC mode=0x8
>  cyclictest-1171  [002] ... hrtimer_start:        hrtimer=0xffffb4a74be7fe70 function=hrtimer_wakeup/0x0 ...
>  cyclictest-1171  [002] ... sched_switch:         cyclictest:1171 [4] S ==> swapper/2:0 [120]
>
> While cyclitest was arming the hrtimer while running on isolated CPU2
> (by means of clock_nanosleep), the hrtimer was then firing on CPU0. This
> is due to the fact that switch_hrtimer_base(), called at hrtimer enqueue
> time, will prefer to enqueue the timer on an housekeeping !idle CPU, if
> the timer is not pinned and timer_migration is enabled.
>
> The problem with this is that we are measuring wake up latencies across
> isolated and !isolated domains, which is against the purpose of
> configuring the latter.
>
> Since PREEMPT_RT already forces HARD mode for hrtimers armed by tasks
> running with RT policies, it makes sense to also force PINNED mode under
> the same conditions.
>
> This patch implements this behavior, achieving something like the

 git grep 'This patch' Documentation/process

Also look at the recommended usage of 'We, I' while at it.

> @@ -55,6 +55,8 @@ enum hrtimer_mode {
>  	HRTIMER_MODE_ABS_HARD	= HRTIMER_MODE_ABS | HRTIMER_MODE_HARD,
>  	HRTIMER_MODE_REL_HARD	= HRTIMER_MODE_REL | HRTIMER_MODE_HARD,
>  
> +	HRTIMER_MODE_PINNED_HARD = HRTIMER_MODE_PINNED | HRTIMER_MODE_HARD,
> +
>  	HRTIMER_MODE_ABS_PINNED_HARD = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_HARD,
>  	HRTIMER_MODE_REL_PINNED_HARD = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_HARD,
>  };
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 3fa18a01f5b2..f64954d5c8f8 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1842,7 +1842,7 @@ static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
>  	 */
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>  		if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT))
> -			mode |= HRTIMER_MODE_HARD;
> +			mode |= HRTIMER_MODE_PINNED_HARD;
>  	}
>  
>  	__hrtimer_init(&sl->timer, clock_id, mode);

It makes sense to some extent, but in fact you are curing the symptom.

The root cause is that all of this is semantically ill defined.

The underlying problem is get_nohz_timer_target() which is a completely
broken heuristics trying to predict which CPU is the proper target for
handling the timer some unspecified time in the future.

In hindsight I regret that I even helped to merge that, but hindsight.
Is get_nohz_timer_target() anything near correct for some unspecified
reason? You surely know that it's not.

In fact your patch makes it even more semantically undefined simply
because it is solving the single RT thread per CPU use case which is
exposed by cyclictest. Is that universaly true for all RT tasks and use
cases?

The wild west of anything which scratches 'my itch' based on 'my use
case numbers' in Linux ended many years ago and while RT was always a
valuable playground for unthinkable ideas we definitely tried hard not
to accept use case specific hacks wihtout a proper justification that it
makes sense in general.

So why are you even trying to sell this to me?

get_nohz_timer_target() is broken by definition and while it made some
sense years ago despite it's heuristic nature, this is something which
really needs to be cleaned up because it causes more trouble than it
solves. Tagging every other timer as pinned just to work around that
underlying nonsense is just wrong.

We have been working on getting rid of this at least for the timer list
timers (which are admittedly the easier part of the problem) on and off
for years. I can't find the public links right now, but I'll ask
Anna-Maria to fill the void. Might take a while as she's AFK for a
while.

Thanks,

        tglx









^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH RT v2 1/2] time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
  2021-06-18 23:35   ` Thomas Gleixner
@ 2021-06-19  7:56     ` Thomas Gleixner
  2021-06-21  5:35       ` Juri Lelli
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2021-06-19  7:56 UTC (permalink / raw)
  To: Juri Lelli, bigeasy
  Cc: linux-rt-users, peterz, linux-kernel, bristot, Juri Lelli,
	Anna-Maria Behnsen

On Sat, Jun 19 2021 at 01:35, Thomas Gleixner wrote:
> The wild west of anything which scratches 'my itch' based on 'my use
> case numbers' in Linux ended many years ago and while RT was always a
> valuable playground for unthinkable ideas we definitely tried hard not
> to accept use case specific hacks wihtout a proper justification that it
> makes sense in general.
>
> So why are you even trying to sell this to me?

I wouldn't have been that grumpy if you'd at least checked whether the
task is pinned. Still I would have told you that you "fix" it at the
wrong place.

Why on earth is that nohz heuristic trainwreck not even checking that?
It's not a RT problem and it's not a problem restricted to RT tasks
either. If a task is pinned then arming the timer on a random other CPU
is blatant nonsense independent of the scheduling class.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH RT v2 1/2] time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
  2021-06-19  7:56     ` Thomas Gleixner
@ 2021-06-21  5:35       ` Juri Lelli
  0 siblings, 0 replies; 6+ messages in thread
From: Juri Lelli @ 2021-06-21  5:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bigeasy, linux-rt-users, peterz, linux-kernel, bristot,
	Anna-Maria Behnsen

Hi,

On 19/06/21 09:56, Thomas Gleixner wrote:
> On Sat, Jun 19 2021 at 01:35, Thomas Gleixner wrote:
> > The wild west of anything which scratches 'my itch' based on 'my use
> > case numbers' in Linux ended many years ago and while RT was always a
> > valuable playground for unthinkable ideas we definitely tried hard not
> > to accept use case specific hacks wihtout a proper justification that it
> > makes sense in general.
> >
> > So why are you even trying to sell this to me?
> 
> I wouldn't have been that grumpy if you'd at least checked whether the
> task is pinned. Still I would have told you that you "fix" it at the
> wrong place.

Ah, indeed. Pulled the trigger too early it seems. I'll ponder more.

> Why on earth is that nohz heuristic trainwreck not even checking that?
> It's not a RT problem and it's not a problem restricted to RT tasks
> either. If a task is pinned then arming the timer on a random other CPU
> is blatant nonsense independent of the scheduling class.

Agree. Lemme look more into it.

Thanks for the comments!

Best,
Juri


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-06-21  5:36 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-16  7:17 [RFC PATCH RT v2 0/2] Add PINNED_HARD mode to hrtimers Juri Lelli
2021-06-16  7:17 ` [RFC PATCH RT v2 1/2] time/hrtimer: Add PINNED_HARD mode for realtime hrtimers Juri Lelli
2021-06-18 23:35   ` Thomas Gleixner
2021-06-19  7:56     ` Thomas Gleixner
2021-06-21  5:35       ` Juri Lelli
2021-06-16  7:17 ` [RFC PATCH RT v2 2/2] time/hrtimer: Embed hrtimer mode into hrtimer_sleeper Juri Lelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).