* [PATCH 0/2] timers/nohz fixes
@ 2021-10-26 14:10 Frederic Weisbecker
  2021-10-26 14:10 ` [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry Frederic Weisbecker
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker
  0 siblings, 2 replies; 7+ messages in thread
From: Frederic Weisbecker @ 2021-10-26 14:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Peter Zijlstra,
	Hasegawa Hitomi, Mel Gorman

A bunch of fixes that don't need to be on the urgent queue since
they are not regression fixes. But they are still fixes...

Frederic Weisbecker (2):
  timers/nohz: Last resort update jiffies on nohz_full IRQ entry
  sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 kernel/softirq.c              |  3 ++-
 kernel/time/tick-sched.c      |  7 +++++++
 4 files changed, 21 insertions(+), 6 deletions(-)

-- 
2.25.1



* [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry
  2021-10-26 14:10 [PATCH 0/2] timers/nohz fixes Frederic Weisbecker
@ 2021-10-26 14:10 ` Frederic Weisbecker
  2021-12-02 14:12   ` [tip: timers/urgent] " tip-bot2 for Frederic Weisbecker
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker
  1 sibling, 1 reply; 7+ messages in thread
From: Frederic Weisbecker @ 2021-10-26 14:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Peter Zijlstra,
	Hasegawa Hitomi

When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.

However, in some rare cases the dedicated timekeeper may run with
interrupts disabled for a while, such as during stop_machine.

If jiffies stop being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in a tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
   can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm: the nohz_full CPU is drowned in tick interrupts and
   takes ages to reach MULTI_STOP_DISABLE_IRQ, which is the only way
   out of this situation.

Solve this by unconditionally updating jiffies if the value is stale
on nohz_full IRQ entry. IRQs and other disturbances are expected to be
rare enough on nohz_full for the overhead of the unconditional
ktime_get() call not to matter.
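
For illustration, here is a minimal userspace model of steps 3)-6)
(a sketch with made-up constants and helper names, not the actual
kernel code paths):

#include <stdint.h>
#include <stdio.h>

#define TICK_NSEC	1000000ULL	/* one jiffy = 1ms at HZ=1000 (assumed) */
#define MIN_DELTA_NS	1000ULL		/* hypothetical clockevent min delta */

/* Frozen while the timekeeper spins with interrupts disabled */
static uint64_t last_jiffies_update_ns;

/* Stands in for tick scheduling: next tick is one jiffy past the base */
static uint64_t next_tick_expiry(void)
{
	return last_jiffies_update_ns + TICK_NSEC;
}

/* Stands in for clockevents_program_event(): past expiry => min delta */
static uint64_t program_event(uint64_t expiry, uint64_t now)
{
	return expiry > now ? expiry : now + MIN_DELTA_NS;
}

int main(void)
{
	uint64_t now = 5 * TICK_NSEC;		/* ktime_get(): 5 jiffies in */
	int i;

	last_jiffies_update_ns = 3 * TICK_NSEC;	/* base is 2 jiffies stale */

	/* Every IRQ exit re-arms the tick from the same stale base */
	for (i = 0; i < 3; i++) {
		uint64_t fires = program_event(next_tick_expiry(), now);

		printf("tick re-fires after %llu ns\n",
		       (unsigned long long)(fires - now));
		now = fires;	/* the tick IRQ arrives, goto 3) */
	}
	return 0;
}

Because next_tick_expiry() never moves past now, every iteration clamps
to the minimum delta: the storm only ends once jiffies is refreshed,
which is exactly what this patch forces on nohz_full IRQ entry.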

Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/softirq.c         | 3 ++-
 kernel/time/tick-sched.c | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 322b65d45676..41f470929e99 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
-	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+	if (tick_nohz_full_cpu(smp_processor_id()) ||
+	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
 
 	account_hardirq_enter(current);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6bffe5af8cb1..17a283ce2b20 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
 	now = ktime_get();
 	if (ts->idle_active)
 		tick_nohz_stop_idle(ts, now);
+	/*
+	 * If all CPUs are idle, we may need to update a stale jiffies value.
+	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
+	 * alive but it might be busy looping with interrupts disabled in some
+	 * rare cases (typically stop_machine). So we must make sure we have a
+	 * last resort.
+	 */
 	if (ts->tick_stopped)
 		tick_nohz_update_jiffies(now);
 }
-- 
2.25.1



* [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
  2021-10-26 14:10 [PATCH 0/2] timers/nohz fixes Frederic Weisbecker
  2021-10-26 14:10 ` [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry Frederic Weisbecker
@ 2021-10-26 14:10 ` Frederic Weisbecker
  2021-10-26 17:40   ` Masayoshi Mizuma
                     ` (2 more replies)
  1 sibling, 3 replies; 7+ messages in thread
From: Frederic Weisbecker @ 2021-10-26 14:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

getrusage(RUSAGE_THREAD) with nohz_full may return utime/stime values
shorter than the time actually consumed.

task_cputime_adjusted() snapshots utime and stime and then adjusts their
sum to match the scheduler-maintained cputime.sum_exec_runtime.
Unfortunately, under nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime,
which can be updated at any time by the reader using vtime.

To fix this, update cputime.sum_exec_runtime when the cputime snapshot
reports the task as actually running while the tick is disabled. The
related overhead is then confined to the situations that need it.
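
The scaling effect of a stale sum_exec_runtime can be seen in a toy
model (a sketch with made-up numbers; the real cputime_adjust() is
more involved and also enforces monotonic progress):

#include <stdint.h>
#include <stdio.h>

struct task_cputime {
	uint64_t utime;			/* ns, fresh via vtime */
	uint64_t stime;			/* ns, fresh via vtime */
	uint64_t sum_exec_runtime;	/* ns, scheduler-maintained */
};

/* Crude stand-in for cputime_adjust(): rescale utime+stime to rtime */
static void toy_cputime_adjust(const struct task_cputime *c,
			       uint64_t *ut, uint64_t *st)
{
	uint64_t sum = c->utime + c->stime;

	if (!sum) {
		*ut = 0;
		*st = c->sum_exec_runtime;
		return;
	}
	*ut = c->sum_exec_runtime * c->utime / sum;
	*st = c->sum_exec_runtime - *ut;
}

int main(void)
{
	/* The task ran ~2s, but the scheduler runtime lags ~1s behind */
	struct task_cputime c = {
		.utime			= 1800000000ULL,	/* 1.8s */
		.stime			=  200000000ULL,	/* 0.2s */
		.sum_exec_runtime	= 1000000000ULL,	/* 1.0s, stale */
	};
	uint64_t ut, st;

	toy_cputime_adjust(&c, &ut, &st);
	printf("stale runtime: ut=%llu st=%llu\n",
	       (unsigned long long)ut, (unsigned long long)st);

	/* The fix: refresh the runtime when vtime reports a running task */
	c.sum_exec_runtime = 2000000000ULL;	/* as task_sched_runtime() would */
	toy_cputime_adjust(&c, &ut, &st);
	printf("fresh runtime: ut=%llu st=%llu\n",
	       (unsigned long long)ut, (unsigned long long)st);
	return 0;
}

With the stale runtime, the reported times add up to only 1s although
the thread consumed 2s of CPU time; refreshing sum_exec_runtime before
cputime_adjust(), as the patch does via task_sched_runtime(), restores
the full amount.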

Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
---
 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 6c9f19a33865..ce3c58286062 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -18,15 +18,16 @@
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void task_cputime(struct task_struct *t,
+extern bool task_cputime(struct task_struct *t,
 			 u64 *utime, u64 *stime);
 extern u64 task_gtime(struct task_struct *t);
 #else
-static inline void task_cputime(struct task_struct *t,
+static inline bool task_cputime(struct task_struct *t,
 				u64 *utime, u64 *stime)
 {
 	*utime = t->utime;
 	*stime = t->stime;
+	return false;
 }
 
 static inline u64 task_gtime(struct task_struct *t)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481d5098..9392aea1804e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -615,7 +615,8 @@ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 		.sum_exec_runtime = p->se.sum_exec_runtime,
 	};
 
-	task_cputime(p, &cputime.utime, &cputime.stime);
+	if (task_cputime(p, &cputime.utime, &cputime.stime))
+		cputime.sum_exec_runtime = task_sched_runtime(p);
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 EXPORT_SYMBOL_GPL(task_cputime_adjusted);
@@ -828,19 +829,21 @@ u64 task_gtime(struct task_struct *t)
  * add up the pending nohz execution time since the last
  * cputime snapshot.
  */
-void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
+bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
 	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 delta;
+	int ret;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
 		*stime = t->stime;
-		return;
+		return false;
 	}
 
 	do {
+		ret = false;
 		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
@@ -850,6 +853,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		if (vtime->state < VTIME_SYS)
 			continue;
 
+		ret = true;
 		delta = vtime_delta(vtime);
 
 		/*
@@ -861,6 +865,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		else
 			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return ret;
 }
 
 static int vtime_state_fetch(struct vtime *vtime, int cpu)
-- 
2.25.1



* Re: [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker
@ 2021-10-26 17:40   ` Masayoshi Mizuma
  2021-11-10 19:30   ` Phil Auld
  2021-12-02 14:12   ` [tip: sched/urgent] " tip-bot2 for Frederic Weisbecker
  2 siblings, 0 replies; 7+ messages in thread
From: Masayoshi Mizuma @ 2021-10-26 17:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

On Tue, Oct 26, 2021 at 04:10:55PM +0200, Frederic Weisbecker wrote:
> getrusage(RUSAGE_THREAD) with nohz_full may return utime/stime values
> shorter than the time actually consumed.
> 
> task_cputime_adjusted() snapshots utime and stime and then adjusts their
> sum to match the scheduler-maintained cputime.sum_exec_runtime.
> Unfortunately, under nohz_full, sum_exec_runtime is only updated once per
> second in the worst case, causing a discrepancy against utime and stime,
> which can be updated at any time by the reader using vtime.
> 
> To fix this, update cputime.sum_exec_runtime when the cputime snapshot
> reports the task as actually running while the tick is disabled. The
> related overhead is then confined to the situations that need it.
> 
> Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>

Thank you for this patch. getrusage(RUSAGE_THREAD) with nohz_full works well!
Please feel free to add:

Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks,
Masa

> [ ... diff snipped ... ]


* Re: [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker
  2021-10-26 17:40   ` Masayoshi Mizuma
@ 2021-11-10 19:30   ` Phil Auld
  2021-12-02 14:12   ` [tip: sched/urgent] " tip-bot2 for Frederic Weisbecker
  2 siblings, 0 replies; 7+ messages in thread
From: Phil Auld @ 2021-11-10 19:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

Hi,

On Tue, Oct 26, 2021 at 04:10:55PM +0200 Frederic Weisbecker wrote:
> getrusage(RUSAGE_THREAD) with nohz_full may return utime/stime values
> shorter than the time actually consumed.
> 
> task_cputime_adjusted() snapshots utime and stime and then adjusts their
> sum to match the scheduler-maintained cputime.sum_exec_runtime.
> Unfortunately, under nohz_full, sum_exec_runtime is only updated once per
> second in the worst case, causing a discrepancy against utime and stime,
> which can be updated at any time by the reader using vtime.
> 
> To fix this, update cputime.sum_exec_runtime when the cputime snapshot
> reports the task as actually running while the tick is disabled. The
> related overhead is then confined to the situations that need it.
> 
> Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
> [ ... diff snipped ... ]


Could someone please pick this (or, rather, these) up?

Acked-by: Phil Auld <pauld@redhat.com>

Thanks!


Phil

-- 



* [tip: sched/urgent] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker
  2021-10-26 17:40   ` Masayoshi Mizuma
  2021-11-10 19:30   ` Phil Auld
@ 2021-12-02 14:12   ` tip-bot2 for Frederic Weisbecker
  2 siblings, 0 replies; 7+ messages in thread
From: tip-bot2 for Frederic Weisbecker @ 2021-12-02 14:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Hasegawa Hitomi, Frederic Weisbecker, Thomas Gleixner,
	Masayoshi Mizuma, Phil Auld, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     e7f2be115f0746b969c0df14c0d182f65f005ca5
Gitweb:        https://git.kernel.org/tip/e7f2be115f0746b969c0df14c0d182f65f005ca5
Author:        Frederic Weisbecker <frederic@kernel.org>
AuthorDate:    Tue, 26 Oct 2021 16:10:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 02 Dec 2021 15:08:22 +01:00

sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

getrusage(RUSAGE_THREAD) with nohz_full may return utime/stime values
shorter than the time actually consumed.

task_cputime_adjusted() snapshots utime and stime and then adjusts their
sum to match the scheduler-maintained cputime.sum_exec_runtime.
Unfortunately, under nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime,
which can be updated at any time by the reader using vtime.

To fix this, update cputime.sum_exec_runtime when the cputime snapshot
reports the task as actually running while the tick is disabled. The
related overhead is then confined to the situations that need it.

Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20211026141055.57358-3-frederic@kernel.org

---
 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 6c9f19a..ce3c582 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -18,15 +18,16 @@
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void task_cputime(struct task_struct *t,
+extern bool task_cputime(struct task_struct *t,
 			 u64 *utime, u64 *stime);
 extern u64 task_gtime(struct task_struct *t);
 #else
-static inline void task_cputime(struct task_struct *t,
+static inline bool task_cputime(struct task_struct *t,
 				u64 *utime, u64 *stime)
 {
 	*utime = t->utime;
 	*stime = t->stime;
+	return false;
 }
 
 static inline u64 task_gtime(struct task_struct *t)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481..9392aea 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -615,7 +615,8 @@ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 		.sum_exec_runtime = p->se.sum_exec_runtime,
 	};
 
-	task_cputime(p, &cputime.utime, &cputime.stime);
+	if (task_cputime(p, &cputime.utime, &cputime.stime))
+		cputime.sum_exec_runtime = task_sched_runtime(p);
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 EXPORT_SYMBOL_GPL(task_cputime_adjusted);
@@ -828,19 +829,21 @@ u64 task_gtime(struct task_struct *t)
  * add up the pending nohz execution time since the last
  * cputime snapshot.
  */
-void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
+bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
 	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 delta;
+	int ret;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
 		*stime = t->stime;
-		return;
+		return false;
 	}
 
 	do {
+		ret = false;
 		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
@@ -850,6 +853,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		if (vtime->state < VTIME_SYS)
 			continue;
 
+		ret = true;
 		delta = vtime_delta(vtime);
 
 		/*
@@ -861,6 +865,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		else
 			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return ret;
 }
 
 static int vtime_state_fetch(struct vtime *vtime, int cpu)


* [tip: timers/urgent] timers/nohz: Last resort update jiffies on nohz_full IRQ entry
  2021-10-26 14:10 ` [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry Frederic Weisbecker
@ 2021-12-02 14:12   ` tip-bot2 for Frederic Weisbecker
  0 siblings, 0 replies; 7+ messages in thread
From: tip-bot2 for Frederic Weisbecker @ 2021-12-02 14:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Paul E. McKenney, Frederic Weisbecker, Thomas Gleixner, x86,
	linux-kernel

The following commit has been merged into the timers/urgent branch of tip:

Commit-ID:     53e87e3cdc155f20c3417b689df8d2ac88d79576
Gitweb:        https://git.kernel.org/tip/53e87e3cdc155f20c3417b689df8d2ac88d79576
Author:        Frederic Weisbecker <frederic@kernel.org>
AuthorDate:    Tue, 26 Oct 2021 16:10:54 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 02 Dec 2021 15:07:22 +01:00

timers/nohz: Last resort update jiffies on nohz_full IRQ entry

When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.

However, in some rare cases the dedicated timekeeper may run with
interrupts disabled for a while, such as during stop_machine.

If jiffies stop being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in a tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
   can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm: the nohz_full CPU is drowned in tick interrupts and
   takes ages to reach MULTI_STOP_DISABLE_IRQ, which is the only way
   out of this situation.

Solve this by unconditionally updating jiffies if the value is stale
on nohz_full IRQ entry. IRQs and other disturbances are expected to be
rare enough on nohz_full for the overhead of the unconditional
ktime_get() call not to matter.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org

---
 kernel/softirq.c         | 3 ++-
 kernel/time/tick-sched.c | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 322b65d..41f4709 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
-	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+	if (tick_nohz_full_cpu(smp_processor_id()) ||
+	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
 
 	account_hardirq_enter(current);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6bffe5a..17a283c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
 	now = ktime_get();
 	if (ts->idle_active)
 		tick_nohz_stop_idle(ts, now);
+	/*
+	 * If all CPUs are idle, we may need to update a stale jiffies value.
+	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
+	 * alive but it might be busy looping with interrupts disabled in some
+	 * rare cases (typically stop_machine). So we must make sure we have a
+	 * last resort.
+	 */
 	if (ts->tick_stopped)
 		tick_nohz_update_jiffies(now);
 }

