* [PATCH 0/2] timers/nohz fixes

From: Frederic Weisbecker @ 2021-10-26 14:10 UTC
To: Thomas Gleixner
Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Peter Zijlstra,
    Hasegawa Hitomi, Mel Gorman

A bunch of fixes that don't need to be on the urgent queue since they
are not regression fixes. But they are still fixes...

Frederic Weisbecker (2):
  timers/nohz: Last resort update jiffies on nohz_full IRQ entry
  sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 kernel/softirq.c              |  3 ++-
 kernel/time/tick-sched.c      |  7 +++++++
 4 files changed, 21 insertions(+), 6 deletions(-)

-- 
2.25.1
* [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry

From: Frederic Weisbecker @ 2021-10-26 14:10 UTC
To: Thomas Gleixner
Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Peter Zijlstra,
    Hasegawa Hitomi

When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.

Meanwhile, in some rare cases, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.

If jiffies stops being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in a tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty.
   CPU 1 can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm; the nohz_full CPU is drowned and takes ages to reach
   MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.

Solve this by unconditionally updating jiffies if the value is stale on
nohz_full IRQ entry. IRQs and other disturbances are expected to be rare
enough on nohz_full for the unconditional call to ktime_get() not to
actually matter.

Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/softirq.c         | 3 ++-
 kernel/time/tick-sched.c | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 322b65d45676..41f470929e99 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
-	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+	if (tick_nohz_full_cpu(smp_processor_id()) ||
+	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
 
 	account_hardirq_enter(current);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6bffe5af8cb1..17a283ce2b20 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
 	now = ktime_get();
 	if (ts->idle_active)
 		tick_nohz_stop_idle(ts, now);
+	/*
+	 * If all CPUs are idle. We may need to update a stale jiffies value.
+	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
+	 * alive but it might be busy looping with interrupts disabled in some
+	 * rare case (typically stop machine). So we must make sure we have a
+	 * last resort.
+	 */
 	if (ts->tick_stopped)
 		tick_nohz_update_jiffies(now);
 }
-- 
2.25.1
* [tip: timers/urgent] timers/nohz: Last resort update jiffies on nohz_full IRQ entry

From: tip-bot2 for Frederic Weisbecker @ 2021-12-02 14:12 UTC
To: linux-tip-commits
Cc: Paul E. McKenney, Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/urgent branch of tip:

Commit-ID:     53e87e3cdc155f20c3417b689df8d2ac88d79576
Gitweb:        https://git.kernel.org/tip/53e87e3cdc155f20c3417b689df8d2ac88d79576
Author:        Frederic Weisbecker <frederic@kernel.org>
AuthorDate:    Tue, 26 Oct 2021 16:10:54 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 02 Dec 2021 15:07:22 +01:00

timers/nohz: Last resort update jiffies on nohz_full IRQ entry

When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.

Meanwhile, in some rare cases, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.

If jiffies stops being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in a tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty.
   CPU 1 can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm; the nohz_full CPU is drowned and takes ages to reach
   MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.

Solve this by unconditionally updating jiffies if the value is stale on
nohz_full IRQ entry. IRQs and other disturbances are expected to be rare
enough on nohz_full for the unconditional call to ktime_get() not to
actually matter.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
---
 kernel/softirq.c         | 3 ++-
 kernel/time/tick-sched.c | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 322b65d..41f4709 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
-	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+	if (tick_nohz_full_cpu(smp_processor_id()) ||
+	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
 
 	account_hardirq_enter(current);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6bffe5a..17a283c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
 	now = ktime_get();
 	if (ts->idle_active)
 		tick_nohz_stop_idle(ts, now);
+	/*
+	 * If all CPUs are idle. We may need to update a stale jiffies value.
+	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
+	 * alive but it might be busy looping with interrupts disabled in some
+	 * rare case (typically stop machine). So we must make sure we have a
+	 * last resort.
+	 */
 	if (ts->tick_stopped)
 		tick_nohz_update_jiffies(now);
 }
* [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

From: Frederic Weisbecker @ 2021-10-26 14:10 UTC
To: Thomas Gleixner
Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.

task_cputime_adjusted() snapshots utime and stime and then adjusts their
sum to match the scheduler-maintained cputime.sum_exec_runtime.
Unfortunately in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime
that can be updated anytime by the reader using vtime.

To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.

Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
---
 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 6c9f19a33865..ce3c58286062 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -18,15 +18,16 @@
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void task_cputime(struct task_struct *t,
+extern bool task_cputime(struct task_struct *t,
 			 u64 *utime, u64 *stime);
 extern u64 task_gtime(struct task_struct *t);
 #else
-static inline void task_cputime(struct task_struct *t,
+static inline bool task_cputime(struct task_struct *t,
 				u64 *utime, u64 *stime)
 {
 	*utime = t->utime;
 	*stime = t->stime;
+	return false;
 }
 
 static inline u64 task_gtime(struct task_struct *t)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481d5098..9392aea1804e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -615,7 +615,8 @@ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 		.sum_exec_runtime = p->se.sum_exec_runtime,
 	};
 
-	task_cputime(p, &cputime.utime, &cputime.stime);
+	if (task_cputime(p, &cputime.utime, &cputime.stime))
+		cputime.sum_exec_runtime = task_sched_runtime(p);
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 EXPORT_SYMBOL_GPL(task_cputime_adjusted);
@@ -828,19 +829,21 @@ u64 task_gtime(struct task_struct *t)
  * add up the pending nohz execution time since the last
  * cputime snapshot.
  */
-void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
+bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
 	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 delta;
+	int ret;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
 		*stime = t->stime;
-		return;
+		return false;
 	}
 
 	do {
+		ret = false;
 		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
@@ -850,6 +853,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		if (vtime->state < VTIME_SYS)
 			continue;
 
+		ret = true;
 		delta = vtime_delta(vtime);
 
 		/*
@@ -861,6 +865,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		else
 			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return ret;
 }
 
 static int vtime_state_fetch(struct vtime *vtime, int cpu)
-- 
2.25.1
* Re: [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

From: Masayoshi Mizuma @ 2021-10-26 17:40 UTC
To: Frederic Weisbecker
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

On Tue, Oct 26, 2021 at 04:10:55PM +0200, Frederic Weisbecker wrote:
> getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
> than the actual time.
>
> task_cputime_adjusted() snapshots utime and stime and then adjusts their
> sum to match the scheduler-maintained cputime.sum_exec_runtime.
> Unfortunately in nohz_full, sum_exec_runtime is only updated once per
> second in the worst case, causing a discrepancy against utime and stime
> that can be updated anytime by the reader using vtime.
>
> To fix this situation, perform an update of cputime.sum_exec_runtime
> when the cputime snapshot reports the task as actually running while
> the tick is disabled. The related overhead is then contained within the
> relevant situations.
>
> Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>

Thank you for this patch. getrusage(RUSAGE_THREAD) with nohz_full works well!
Please feel free to add:

Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks,
Masa
* Re: [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

From: Phil Auld @ 2021-11-10 19:30 UTC
To: Frederic Weisbecker
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

Hi,

On Tue, Oct 26, 2021 at 04:10:55PM +0200 Frederic Weisbecker wrote:
> getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
> than the actual time.
>
> task_cputime_adjusted() snapshots utime and stime and then adjusts their
> sum to match the scheduler-maintained cputime.sum_exec_runtime.
> Unfortunately in nohz_full, sum_exec_runtime is only updated once per
> second in the worst case, causing a discrepancy against utime and stime
> that can be updated anytime by the reader using vtime.
>
> To fix this situation, perform an update of cputime.sum_exec_runtime
> when the cputime snapshot reports the task as actually running while
> the tick is disabled. The related overhead is then contained within the
> relevant situations.

Could someone please pick this (or, rather, these) up?

Acked-by: Phil Auld <pauld@redhat.com>

Thanks!

Phil
* [tip: sched/urgent] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

From: tip-bot2 for Frederic Weisbecker @ 2021-12-02 14:12 UTC
To: linux-tip-commits
Cc: Hasegawa Hitomi, Frederic Weisbecker, Thomas Gleixner, Masayoshi Mizuma,
    Phil Auld, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     e7f2be115f0746b969c0df14c0d182f65f005ca5
Gitweb:        https://git.kernel.org/tip/e7f2be115f0746b969c0df14c0d182f65f005ca5
Author:        Frederic Weisbecker <frederic@kernel.org>
AuthorDate:    Tue, 26 Oct 2021 16:10:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 02 Dec 2021 15:08:22 +01:00

sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.

task_cputime_adjusted() snapshots utime and stime and then adjusts their
sum to match the scheduler-maintained cputime.sum_exec_runtime.
Unfortunately in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime
that can be updated anytime by the reader using vtime.

To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.

Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20211026141055.57358-3-frederic@kernel.org
---
 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 6c9f19a..ce3c582 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -18,15 +18,16 @@
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void task_cputime(struct task_struct *t,
+extern bool task_cputime(struct task_struct *t,
 			 u64 *utime, u64 *stime);
 extern u64 task_gtime(struct task_struct *t);
 #else
-static inline void task_cputime(struct task_struct *t,
+static inline bool task_cputime(struct task_struct *t,
 				u64 *utime, u64 *stime)
 {
 	*utime = t->utime;
 	*stime = t->stime;
+	return false;
 }
 
 static inline u64 task_gtime(struct task_struct *t)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481..9392aea 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -615,7 +615,8 @@ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 		.sum_exec_runtime = p->se.sum_exec_runtime,
 	};
 
-	task_cputime(p, &cputime.utime, &cputime.stime);
+	if (task_cputime(p, &cputime.utime, &cputime.stime))
+		cputime.sum_exec_runtime = task_sched_runtime(p);
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 EXPORT_SYMBOL_GPL(task_cputime_adjusted);
@@ -828,19 +829,21 @@ u64 task_gtime(struct task_struct *t)
  * add up the pending nohz execution time since the last
  * cputime snapshot.
  */
-void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
+bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
 	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 delta;
+	int ret;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
 		*stime = t->stime;
-		return;
+		return false;
 	}
 
 	do {
+		ret = false;
 		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
@@ -850,6 +853,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		if (vtime->state < VTIME_SYS)
 			continue;
 
+		ret = true;
 		delta = vtime_delta(vtime);
 
 		/*
@@ -861,6 +865,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		else
 			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return ret;
 }
 
 static int vtime_state_fetch(struct vtime *vtime, int cpu)