linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full)
@ 2018-11-14  2:45 Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 01/25] sched/vtime: Fix guest/system mis-accounting on task switch Frederic Weisbecker
                   ` (24 more replies)
  0 siblings, 25 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Kcpustat (the per-CPU stats you see in /proc/stat) is partly
maintained by the tick: every jiffy, TICK_NSEC gets added to the
relevant field, the same way we account the cputime for tasks.
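
For illustration, here is a simplified sketch of that tick-side
accounting. The field names match kernel_stat.h but the function itself
is made up for the example:

	/*
	 * Illustrative only: on each tick, TICK_NSEC worth of cputime
	 * is added to the kcpustat field matching the context the CPU
	 * was interrupted in.
	 */
	static void tick_account_kcpustat(int cpu, bool user)
	{
		u64 *cpustat = kcpustat_cpu(cpu).cpustat;

		if (user)
			cpustat[CPUTIME_USER] += TICK_NSEC;
		else
			cpustat[CPUTIME_SYSTEM] += TICK_NSEC;
	}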

In the case of nohz_full, however, kcpustat doesn't get accounted anymore
once the tick is stopped: vtime maintains the task cputime, but not kcpustat.

This issue was hidden as long as we had the remaining 1Hz tick, until
Yauheni Kaliuta reminded me of the problem.

I scratched my head a lot on this, due to all the possible races.
The solution here is to fetch the task running on a CPU with RCU, read
its vtime delta (like we do for cputime) and add it to the relevant
kcpustat field. There have been several subtleties on the way (concurrent
task nice changes, the early RCU delayed put_task_struct(), ordering
against vtime) and I couldn't resist a few cleanups, so the patchset
isn't too small, sorry about that...
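
A rough sketch of that reader side, purely illustrative (the
vtime_state_matches() and vtime_delta() helpers here are placeholders,
not the final API introduced by the series):

	u64 kcpustat_field_vtime(int cpu, enum cpu_usage_stat idx)
	{
		u64 val = kcpustat_cpu(cpu).cpustat[idx];
		struct task_struct *curr;

		rcu_read_lock();
		/* Task currently running on the target CPU, if tracked */
		curr = rcu_dereference(kcpustat_cpu(cpu).curr);
		/* Add the pending, not yet flushed, vtime delta */
		if (curr && vtime_state_matches(curr, idx))
			val += vtime_delta(&curr->vtime);
		rcu_read_unlock();

		return val;
	}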

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	nohz/kcpustat

HEAD: c7c45c06334346f62dbbf7bb12e2a8ab954532e5

Thanks,
	Frederic
---

Frederic Weisbecker (25):
      sched/vtime: Fix guest/system mis-accounting on task switch
      sched/vtime: Protect idle accounting under vtime seqcount
      vtime: Rename vtime_account_system() to vtime_account_kernel()
      vtime: Spare a seqcount lock/unlock cycle on context switch
      sched/vtime: Record CPU under seqcount for kcpustat needs
      sched/cputime: Add vtime idle task state
      sched/cputime: Add vtime guest task state
      vtime: Exit vtime before exit_notify()
      kcpustat: Track running task following vtime sequences
      context_tracking: Remove context_tracking_active()
      context_tracking: s/context_tracking_is_enabled/context_tracking_enabled()
      context_tracking: Rename context_tracking_is_cpu_enabled() to context_tracking_enabled_this_cpu()
      context_tracking: Introduce context_tracking_enabled_cpu()
      sched/vtime: Rename vtime_accounting_cpu_enabled() to vtime_accounting_enabled_this_cpu()
      sched/vtime: Introduce vtime_accounting_enabled_cpu()
      sched/cputime: Allow to pass cputime index on user/guest accounting
      sched/cputime: Standardize the kcpustat index based accounting functions
      vtime: Track nice-ness on top of context switch
      sched/vtime: Handle nice updates under vtime
      sched/kcpustat: Introduce vtime-aware kcpustat accessor
      procfs: Use vtime aware kcpustat accessor
      cpufreq: Use vtime aware kcpustat accessor
      leds: Use vtime aware kcpustat accessors
      rackmeter: Use vtime aware kcpustat accessors
      sched/vtime: Clarify vtime_task_switch() argument layout


 arch/ia64/include/asm/cputime.h         |   3 +-
 arch/ia64/kernel/time.c                 |  15 +-
 arch/powerpc/include/asm/cputime.h      |   8 +-
 arch/powerpc/kernel/time.c              |  12 +-
 arch/s390/kernel/vtime.c                |  19 +-
 arch/x86/entry/calling.h                |   2 +-
 drivers/cpufreq/cpufreq.c               |  18 +-
 drivers/cpufreq/cpufreq_governor.c      |  27 ++-
 drivers/leds/trigger/ledtrig-activity.c |   9 +-
 drivers/macintosh/rack-meter.c          |  14 +-
 fs/proc/stat.c                          |  21 +-
 include/linux/context_tracking.h        |  30 +--
 include/linux/context_tracking_state.h  |  19 +-
 include/linux/kernel_stat.h             |  28 ++-
 include/linux/sched.h                   |  12 +-
 include/linux/tick.h                    |   2 +-
 include/linux/vtime.h                   |  72 ++++---
 kernel/context_tracking.c               |   6 +-
 kernel/exit.c                           |   1 +
 kernel/sched/core.c                     |   6 +-
 kernel/sched/cputime.c                  | 372 +++++++++++++++++++++++++-------
 kernel/sched/sched.h                    |  39 ++++
 kernel/time/tick-sched.c                |   2 +-
 23 files changed, 548 insertions(+), 189 deletions(-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 01/25] sched/vtime: Fix guest/system mis-accounting on task switch
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 02/25] sched/vtime: Protect idle accounting under vtime seqcount Frederic Weisbecker
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

vtime_account_system() assumes that the task to account cputime to is
always the current task. This is usually true, except on task switch
where we call:

	vtime_common_task_switch(prev)
		vtime_account_system(prev)

Here prev is the scheduling-out task that we account the cputime to. It
doesn't match current, which is already the scheduling-in task at this
stage of the context switch.

So we end up checking the flags of the wrong task to determine whether
we are accounting guest or system time to the previous task. We may then
spuriously account or leak either system or guest time on task switch.

Fix this assumption and also make vtime_guest_enter/exit() use the task
passed as a parameter, to avoid similar issues in the future.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f93..54eb945 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -739,7 +739,7 @@ void vtime_account_system(struct task_struct *tsk)
 
 	write_seqcount_begin(&vtime->seqcount);
 	/* We might have scheduled out from guest path */
-	if (current->flags & PF_VCPU)
+	if (tsk->flags & PF_VCPU)
 		vtime_account_guest(tsk, vtime);
 	else
 		__vtime_account_system(tsk, vtime);
@@ -782,7 +782,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 */
 	write_seqcount_begin(&vtime->seqcount);
 	__vtime_account_system(tsk, vtime);
-	current->flags |= PF_VCPU;
+	tsk->flags |= PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_enter);
@@ -793,7 +793,7 @@ void vtime_guest_exit(struct task_struct *tsk)
 
 	write_seqcount_begin(&vtime->seqcount);
 	vtime_account_guest(tsk, vtime);
-	current->flags &= ~PF_VCPU;
+	tsk->flags &= ~PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_exit);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 02/25] sched/vtime: Protect idle accounting under vtime seqcount
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 01/25] sched/vtime: Fix guest/system mis-accounting on task switch Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-20 13:19   ` Peter Zijlstra
  2018-11-14  2:45 ` [PATCH 03/25] vtime: Rename vtime_account_system() to vtime_account_kernel() Frederic Weisbecker
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Locking the seqcount on idle vtime accounting wasn't thought to be
necessary because the readers of idle cputime don't use vtime (yet).

However updating vtime now expects the related seqcount to be locked,
so do it for locking coherency purposes.

Also, while idle cputime updates use vtime, idle cputime readers still
use the traditional ad-hoc nohz time delta. We may want to move the
readers to vtime to consolidate the overall accounting scheme; the
seqcount will then be a functional requirement.
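
As an illustration of why writers must hold the seqcount, this is the
usual read side pattern (as used by task_cputime()), which retries
until it observes a stable snapshot:

	do {
		seq = read_seqcount_begin(&vtime->seqcount);
		/* read a consistent snapshot of the vtime fields */
	} while (read_seqcount_retry(&vtime->seqcount, seq));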

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 54eb945..6e74ec2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -800,7 +800,11 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
+	struct vtime *vtime = &tsk->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
 	account_idle_time(get_vtime_delta(&tsk->vtime));
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void arch_vtime_task_switch(struct task_struct *prev)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 03/25] vtime: Rename vtime_account_system() to vtime_account_kernel()
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 01/25] sched/vtime: Fix guest/system mis-accounting on task switch Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 02/25] sched/vtime: Protect idle accounting under vtime seqcount Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 04/25] vtime: Spare a seqcount lock/unlock cycle on context switch Frederic Weisbecker
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

vtime_account_system() decides whether we need to account the time to
the system (__vtime_account_system()) or to the guest
(vtime_account_guest()).

So this function is a misnomer, as we operate on a higher level than
"system". All we know when we call it is that we are accounting kernel
cputime; whether it belongs to guest or system time is a lower level
detail.

Rename this function to vtime_account_kernel(). This clarifies things
and avoids too many underscored vtime_account_system() versions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/ia64/kernel/time.c          |  4 ++--
 arch/powerpc/kernel/time.c       |  6 +++---
 arch/s390/kernel/vtime.c         |  4 ++--
 include/linux/context_tracking.h |  4 ++--
 include/linux/vtime.h            |  6 +++---
 kernel/sched/cputime.c           | 18 +++++++++---------
 6 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 9025699..a3b09ea 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -132,7 +132,7 @@ static __u64 vtime_delta(struct task_struct *tsk)
 	return delta_stime;
 }
 
-void vtime_account_system(struct task_struct *tsk)
+void vtime_account_kernel(struct task_struct *tsk)
 {
 	struct thread_info *ti = task_thread_info(tsk);
 	__u64 stime = vtime_delta(tsk);
@@ -146,7 +146,7 @@ void vtime_account_system(struct task_struct *tsk)
 	else
 		ti->stime += stime;
 }
-EXPORT_SYMBOL_GPL(vtime_account_system);
+EXPORT_SYMBOL_GPL(vtime_account_kernel);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 3646aff..cc591a4 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -342,7 +342,7 @@ static unsigned long vtime_delta(struct task_struct *tsk,
 	return stime;
 }
 
-void vtime_account_system(struct task_struct *tsk)
+void vtime_account_kernel(struct task_struct *tsk)
 {
 	unsigned long stime, stime_scaled, steal_time;
 	struct cpu_accounting_data *acct = get_accounting(tsk);
@@ -370,7 +370,7 @@ void vtime_account_system(struct task_struct *tsk)
 #endif
 	}
 }
-EXPORT_SYMBOL_GPL(vtime_account_system);
+EXPORT_SYMBOL_GPL(vtime_account_kernel);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
@@ -399,7 +399,7 @@ static void vtime_flush_scaled(struct task_struct *tsk,
 /*
  * Account the whole cputime accumulated in the paca
  * Must be called with interrupts disabled.
- * Assumes that vtime_account_system/idle() has been called
+ * Assumes that vtime_account_kernel/idle() has been called
  * recently (i.e. since the last entry from usermode) so that
  * get_paca()->user_time_scaled is up to date.
  */
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index f24395a..af48db3 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -242,9 +242,9 @@ void vtime_account_irq_enter(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 
-void vtime_account_system(struct task_struct *tsk)
+void vtime_account_kernel(struct task_struct *tsk)
 __attribute__((alias("vtime_account_irq_enter")));
-EXPORT_SYMBOL_GPL(vtime_account_system);
+EXPORT_SYMBOL_GPL(vtime_account_kernel);
 
 /*
  * Sorted add to a list. List is linear searched until first bigger
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d05609a..558a209 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -141,7 +141,7 @@ static inline void guest_enter_irqoff(void)
 	 * to assume that it's the stime pending cputime
 	 * to flush.
 	 */
-	vtime_account_system(current);
+	vtime_account_kernel(current);
 	current->flags |= PF_VCPU;
 	rcu_virt_note_context_switch(smp_processor_id());
 }
@@ -149,7 +149,7 @@ static inline void guest_enter_irqoff(void)
 static inline void guest_exit_irqoff(void)
 {
 	/* Flush the guest cputime we spent on the guest */
-	vtime_account_system(current);
+	vtime_account_kernel(current);
 	current->flags &= ~PF_VCPU;
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index a26ed10..2fd247f 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -57,13 +57,13 @@ static inline void vtime_task_switch(struct task_struct *prev)
 }
 #endif /* __ARCH_HAS_VTIME_TASK_SWITCH */
 
-extern void vtime_account_system(struct task_struct *tsk);
+extern void vtime_account_kernel(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline void vtime_task_switch(struct task_struct *prev) { }
-static inline void vtime_account_system(struct task_struct *tsk) { }
+static inline void vtime_account_kernel(struct task_struct *tsk) { }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -86,7 +86,7 @@ extern void vtime_account_irq_enter(struct task_struct *tsk);
 static inline void vtime_account_irq_exit(struct task_struct *tsk)
 {
 	/* On hard|softirq exit we always account to hard|softirq cputime */
-	vtime_account_system(tsk);
+	vtime_account_kernel(tsk);
 }
 extern void vtime_flush(struct task_struct *tsk);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 6e74ec2..63f10ac 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -411,7 +411,7 @@ void vtime_common_task_switch(struct task_struct *prev)
 	if (is_idle_task(prev))
 		vtime_account_idle(prev);
 	else
-		vtime_account_system(prev);
+		vtime_account_kernel(prev);
 
 	vtime_flush(prev);
 	arch_vtime_task_switch(prev);
@@ -424,7 +424,7 @@ void vtime_common_task_switch(struct task_struct *prev)
 /*
  * Archs that account the whole time spent in the idle task
  * (outside irq) as idle time can rely on this and just implement
- * vtime_account_system() and vtime_account_idle(). Archs that
+ * vtime_account_kernel() and vtime_account_idle(). Archs that
  * have other meaning of the idle time (s390 only includes the
  * time spent by the CPU when it's in low power mode) must override
  * vtime_account().
@@ -435,7 +435,7 @@ void vtime_account_irq_enter(struct task_struct *tsk)
 	if (!in_interrupt() && is_idle_task(tsk))
 		vtime_account_idle(tsk);
 	else
-		vtime_account_system(tsk);
+		vtime_account_kernel(tsk);
 }
 EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
@@ -710,8 +710,8 @@ static u64 get_vtime_delta(struct vtime *vtime)
 	return delta - other;
 }
 
-static void __vtime_account_system(struct task_struct *tsk,
-				   struct vtime *vtime)
+static void vtime_account_system(struct task_struct *tsk,
+				 struct vtime *vtime)
 {
 	vtime->stime += get_vtime_delta(vtime);
 	if (vtime->stime >= TICK_NSEC) {
@@ -730,7 +730,7 @@ static void vtime_account_guest(struct task_struct *tsk,
 	}
 }
 
-void vtime_account_system(struct task_struct *tsk)
+void vtime_account_kernel(struct task_struct *tsk)
 {
 	struct vtime *vtime = &tsk->vtime;
 
@@ -742,7 +742,7 @@ void vtime_account_system(struct task_struct *tsk)
 	if (tsk->flags & PF_VCPU)
 		vtime_account_guest(tsk, vtime);
 	else
-		__vtime_account_system(tsk, vtime);
+		vtime_account_system(tsk, vtime);
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -751,7 +751,7 @@ void vtime_user_enter(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	__vtime_account_system(tsk, vtime);
+	vtime_account_system(tsk, vtime);
 	vtime->state = VTIME_USER;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -781,7 +781,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&vtime->seqcount);
-	__vtime_account_system(tsk, vtime);
+	vtime_account_system(tsk, vtime);
 	tsk->flags |= PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 04/25] vtime: Spare a seqcount lock/unlock cycle on context switch
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 03/25] vtime: Rename vtime_account_system() to vtime_account_kernel() Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-20 13:25   ` Peter Zijlstra
  2018-11-14  2:45 ` [PATCH 05/25] sched/vtime: Record CPU under seqcount for kcpustat needs Frederic Weisbecker
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

On context switch we lock the vtime seqcount of the scheduling-out
task twice:

 * In vtime_task_switch_common(), when we flush the pending vtime through
   vtime_account_system() / vtime_account_idle()

 * In arch_vtime_task_switch(), to reset the vtime state.

This is pointless as these actions can be performed without
unlocking/relocking in the middle. The only reason these steps are
separated is to consolidate a very small amount of common code between
CONFIG_VIRT_CPU_ACCOUNTING_GEN and CONFIG_VIRT_CPU_ACCOUNTING_NATIVE.

Performance in this fast path is definitely a priority over artificial
code factorization, so split the task switch code between GEN and
NATIVE and share the parts that can run under a single seqcount
locked block.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/vtime.h  | 32 ++++++++++++++++----------------
 kernel/sched/cputime.c | 38 +++++++++++++++++++++-----------------
 2 files changed, 37 insertions(+), 33 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 2fd247f..d9160ab 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,8 +14,12 @@ struct task_struct;
  * vtime_accounting_cpu_enabled() definitions/declarations
  */
 #if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE)
+
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+extern void vtime_task_switch(struct task_struct *prev);
+
 #elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
+
 /*
  * Checks if vtime is enabled on some CPU. Cputime readers want to be careful
  * in that case and compute the tickless cputime.
@@ -36,33 +40,29 @@ static inline bool vtime_accounting_cpu_enabled(void)
 
 	return false;
 }
+
+extern void vtime_task_switch_generic(struct task_struct *prev);
+
+static inline void vtime_task_switch(struct task_struct *prev)
+{
+	if (vtime_accounting_cpu_enabled())
+		vtime_task_switch_generic(prev);
+}
+
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
+
 static inline bool vtime_accounting_cpu_enabled(void) { return false; }
-#endif
+static inline void vtime_task_switch(struct task_struct *prev) { }
 
+#endif
 
 /*
  * Common vtime APIs
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
-
-#ifdef __ARCH_HAS_VTIME_TASK_SWITCH
-extern void vtime_task_switch(struct task_struct *prev);
-#else
-extern void vtime_common_task_switch(struct task_struct *prev);
-static inline void vtime_task_switch(struct task_struct *prev)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_task_switch(prev);
-}
-#endif /* __ARCH_HAS_VTIME_TASK_SWITCH */
-
 extern void vtime_account_kernel(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
-
-static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_kernel(struct task_struct *tsk) { }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 63f10ac..facc665 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -404,9 +404,10 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
 /*
  * Use precise platform statistics if available:
  */
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+
 # ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_common_task_switch(struct task_struct *prev)
+void vtime_task_switch(struct task_struct *prev)
 {
 	if (is_idle_task(prev))
 		vtime_account_idle(prev);
@@ -417,10 +418,7 @@ void vtime_common_task_switch(struct task_struct *prev)
 	arch_vtime_task_switch(prev);
 }
 # endif
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /*
  * Archs that account the whole time spent in the idle task
  * (outside irq) as idle time can rely on this and just implement
@@ -730,19 +728,25 @@ static void vtime_account_guest(struct task_struct *tsk,
 	}
 }
 
-void vtime_account_kernel(struct task_struct *tsk)
+static void __vtime_account_kernel(struct task_struct *tsk,
+				   struct vtime *vtime)
 {
-	struct vtime *vtime = &tsk->vtime;
-
-	if (!vtime_delta(vtime))
-		return;
-
-	write_seqcount_begin(&vtime->seqcount);
 	/* We might have scheduled out from guest path */
 	if (tsk->flags & PF_VCPU)
 		vtime_account_guest(tsk, vtime);
 	else
 		vtime_account_system(tsk, vtime);
+}
+
+void vtime_account_kernel(struct task_struct *tsk)
+{
+	struct vtime *vtime = &tsk->vtime;
+
+	if (!vtime_delta(vtime))
+		return;
+
+	write_seqcount_begin(&vtime->seqcount);
+	__vtime_account_kernel(tsk, vtime);
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -800,18 +804,18 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	struct vtime *vtime = &tsk->vtime;
-
-	write_seqcount_begin(&vtime->seqcount);
 	account_idle_time(get_vtime_delta(&tsk->vtime));
-	write_seqcount_end(&vtime->seqcount);
 }
 
-void arch_vtime_task_switch(struct task_struct *prev)
+void vtime_task_switch_generic(struct task_struct *prev)
 {
 	struct vtime *vtime = &prev->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
+	if (is_idle_task(prev))
+		vtime_account_idle(prev);
+	else
+		__vtime_account_kernel(prev, vtime);
 	vtime->state = VTIME_INACTIVE;
 	write_seqcount_end(&vtime->seqcount);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 05/25] sched/vtime: Record CPU under seqcount for kcpustat needs
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 04/25] vtime: Spare a seqcount lock/unlock cycle on context switch Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 06/25] sched/cputime: Add vtime idle task state Frederic Weisbecker
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

In order to compute the kcpustat delta on a nohz CPU, we'll need to
fetch the task running on that target. Checking that its vtime state
snapshot actually refers to the relevant target involves recording that
CPU, under the seqcount lock, on task switch.

This is a step toward making kcpustat move forward on full nohz CPUs.
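
An illustrative sketch of the check a remote reader can then perform,
inside its seqcount read section (this anticipates a later patch, it is
not code from this one):

	do {
		seq = read_seqcount_begin(&vtime->seqcount);
		/* Does this snapshot refer to the CPU we watch? */
		if (vtime->cpu != cpu)
			break;
		/* ... read the vtime delta for that CPU ... */
	} while (read_seqcount_retry(&vtime->seqcount, seq));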

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h  | 1 +
 kernel/sched/cputime.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a51c13c..9f56880 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -275,6 +275,7 @@ struct vtime {
 	seqcount_t		seqcount;
 	unsigned long long	starttime;
 	enum vtime_state	state;
+	unsigned int		cpu;
 	u64			utime;
 	u64			stime;
 	u64			gtime;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index facc665..6ac27687 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -817,6 +817,7 @@ void vtime_task_switch_generic(struct task_struct *prev)
 	else
 		__vtime_account_kernel(prev, vtime);
 	vtime->state = VTIME_INACTIVE;
+	vtime->cpu = -1;
 	write_seqcount_end(&vtime->seqcount);
 
 	vtime = &current->vtime;
@@ -824,6 +825,7 @@ void vtime_task_switch_generic(struct task_struct *prev)
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
 	vtime->starttime = sched_clock();
+	vtime->cpu = smp_processor_id();
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -836,6 +838,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
 	vtime->starttime = sched_clock();
+	vtime->cpu = cpu;
 	write_seqcount_end(&vtime->seqcount);
 	local_irq_restore(flags);
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 06/25] sched/cputime: Add vtime idle task state
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 05/25] sched/vtime: Record CPU under seqcount for kcpustat needs Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 07/25] sched/cputime: Add vtime guest " Frederic Weisbecker
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Record idle as a VTIME state instead of guessing it from VTIME_SYS and
is_idle_task(). This is going to simplify the cputime read side,
especially as its state machine is going to expand further in order to
fully support kcpustat on nohz_full.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h  |  6 ++++--
 kernel/sched/cputime.c | 13 ++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f56880..6d13938 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -265,10 +265,12 @@ struct task_cputime {
 enum vtime_state {
 	/* Task is sleeping or running in a CPU with VTIME inactive: */
 	VTIME_INACTIVE = 0,
-	/* Task runs in userspace in a CPU with VTIME active: */
-	VTIME_USER,
+	/* Task is idle */
+	VTIME_IDLE,
 	/* Task runs in kernelspace in a CPU with VTIME active: */
 	VTIME_SYS,
+	/* Task runs in userspace in a CPU with VTIME active: */
+	VTIME_USER,
 };
 
 struct vtime {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 6ac27687..a9f42cc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -812,7 +812,7 @@ void vtime_task_switch_generic(struct task_struct *prev)
 	struct vtime *vtime = &prev->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	if (is_idle_task(prev))
+	if (vtime->state == VTIME_IDLE)
 		vtime_account_idle(prev);
 	else
 		__vtime_account_kernel(prev, vtime);
@@ -823,7 +823,10 @@ void vtime_task_switch_generic(struct task_struct *prev)
 	vtime = &current->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	vtime->state = VTIME_SYS;
+	if (is_idle_task(current))
+		vtime->state = VTIME_IDLE;
+	else
+		vtime->state = VTIME_SYS;
 	vtime->starttime = sched_clock();
 	vtime->cpu = smp_processor_id();
 	write_seqcount_end(&vtime->seqcount);
@@ -836,7 +839,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 
 	local_irq_save(flags);
 	write_seqcount_begin(&vtime->seqcount);
-	vtime->state = VTIME_SYS;
+	vtime->state = VTIME_IDLE;
 	vtime->starttime = sched_clock();
 	vtime->cpu = cpu;
 	write_seqcount_end(&vtime->seqcount);
@@ -887,8 +890,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		*utime = t->utime;
 		*stime = t->stime;
 
-		/* Task is sleeping, nothing to add */
-		if (vtime->state == VTIME_INACTIVE || is_idle_task(t))
+		/* Task is sleeping or idle, nothing to add */
+		if (vtime->state < VTIME_SYS)
 			continue;
 
 		delta = vtime_delta(vtime);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 07/25] sched/cputime: Add vtime guest task state
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 06/25] sched/cputime: Add vtime idle task state Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 08/25] vtime: Exit vtime before exit_notify() Frederic Weisbecker
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Record guest as a VTIME state instead of guessing it from VTIME_SYS and
PF_VCPU. This is going to simplify the cputime read side, especially as
its state machine is going to expand further in order to fully support
kcpustat on nohz_full.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h  |  2 ++
 kernel/sched/cputime.c | 18 +++++++++++-------
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d13938..d458d65 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -271,6 +271,8 @@ enum vtime_state {
 	VTIME_SYS,
 	/* Task runs in userspace in a CPU with VTIME active: */
 	VTIME_USER,
+	/* Task runs as guests in a CPU with VTIME active: */
+	VTIME_GUEST,
 };
 
 struct vtime {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a9f42cc..f64afd7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -732,7 +732,7 @@ static void __vtime_account_kernel(struct task_struct *tsk,
 				   struct vtime *vtime)
 {
 	/* We might have scheduled out from guest path */
-	if (tsk->flags & PF_VCPU)
+	if (vtime->state == VTIME_GUEST)
 		vtime_account_guest(tsk, vtime);
 	else
 		vtime_account_system(tsk, vtime);
@@ -787,6 +787,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 	write_seqcount_begin(&vtime->seqcount);
 	vtime_account_system(tsk, vtime);
 	tsk->flags |= PF_VCPU;
+	vtime->state = VTIME_GUEST;
 	write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_enter);
@@ -798,6 +799,7 @@ void vtime_guest_exit(struct task_struct *tsk)
 	write_seqcount_begin(&vtime->seqcount);
 	vtime_account_guest(tsk, vtime);
 	tsk->flags &= ~PF_VCPU;
+	vtime->state = VTIME_SYS;
 	write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_exit);
@@ -825,6 +827,8 @@ void vtime_task_switch_generic(struct task_struct *prev)
 	write_seqcount_begin(&vtime->seqcount);
 	if (is_idle_task(current))
 		vtime->state = VTIME_IDLE;
+	else if (current->flags & PF_VCPU)
+		vtime->state = VTIME_GUEST;
 	else
 		vtime->state = VTIME_SYS;
 	vtime->starttime = sched_clock();
@@ -859,7 +863,7 @@ u64 task_gtime(struct task_struct *t)
 		seq = read_seqcount_begin(&vtime->seqcount);
 
 		gtime = t->gtime;
-		if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
+		if (vtime->state == VTIME_GUEST)
 			gtime += vtime->gtime + vtime_delta(vtime);
 
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
@@ -897,13 +901,13 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		delta = vtime_delta(vtime);
 
 		/*
-		 * Task runs either in user or kernel space, add pending nohz time to
-		 * the right place.
+		 * Task runs either in user (including guest) or kernel space,
+		 * add pending nohz time to the right place.
 		 */
-		if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
-			*utime += vtime->utime + delta;
-		else if (vtime->state == VTIME_SYS)
+		if (vtime->state == VTIME_SYS)
 			*stime += vtime->stime + delta;
+		else
+			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 08/25] vtime: Exit vtime before exit_notify()
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 07/25] sched/cputime: Add vtime guest " Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-20 13:54   ` Peter Zijlstra
  2018-11-14  2:45 ` [PATCH 09/25] kcpustat: Track running task following vtime sequences Frederic Weisbecker
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

In order to correctly implement kcpustat under nohz_full, we need to
track the task running on a given CPU and read its vtime state safely,
reliably and locklessly.

This leaves us with tracking and fetching that task under RCU. This will
be done in a later patch. Until then we need to prepare vtime to handle
that properly and close the accounting before we reach the earliest
opportunity for the RCU delayed put_task_struct() to be queued. That
point happens to be in exit_notify() in the case of auto-reaping.

Therefore we need to finish the accounting right before exit_notify().
After that we shouldn't track the exiting task any further.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h  |  2 ++
 include/linux/vtime.h  |  2 ++
 kernel/exit.c          |  1 +
 kernel/sched/cputime.c | 56 ++++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d458d65..27e0544 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -265,6 +265,8 @@ struct task_cputime {
 enum vtime_state {
 	/* Task is sleeping or running in a CPU with VTIME inactive: */
 	VTIME_INACTIVE = 0,
+	/* Task has passed exit_notify() */
+	VTIME_DEAD,
 	/* Task is idle */
 	VTIME_IDLE,
 	/* Task runs in kernelspace in a CPU with VTIME active: */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d9160ab..8350a0b 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -73,12 +73,14 @@ extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
 extern void vtime_guest_exit(struct task_struct *tsk);
 extern void vtime_init_idle(struct task_struct *tsk, int cpu);
+extern void vtime_exit_task(struct task_struct *tsk);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN  */
 static inline void vtime_user_enter(struct task_struct *tsk) { }
 static inline void vtime_user_exit(struct task_struct *tsk) { }
 static inline void vtime_guest_enter(struct task_struct *tsk) { }
 static inline void vtime_guest_exit(struct task_struct *tsk) { }
 static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
+static inline void vtime_exit_task(struct task_struct *tsk) { }
 #endif
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
diff --git a/kernel/exit.c b/kernel/exit.c
index 0e21e6d..cae3fe9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -883,6 +883,7 @@ void __noreturn do_exit(long code)
 	 */
 	flush_ptrace_hw_breakpoint(tsk);
 
+	vtime_exit_task(tsk);
 	exit_tasks_rcu_start();
 	exit_notify(tsk, group_dead);
 	proc_exit_connector(tsk);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f64afd7..a0c3a82 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -702,7 +702,7 @@ static u64 get_vtime_delta(struct vtime *vtime)
 	 * errors from causing elapsed vtime to go negative.
 	 */
 	other = account_other_time(delta);
-	WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
+	WARN_ON_ONCE(vtime->state < VTIME_IDLE);
 	vtime->starttime += delta;
 
 	return delta - other;
@@ -813,17 +813,31 @@ void vtime_task_switch_generic(struct task_struct *prev)
 {
 	struct vtime *vtime = &prev->vtime;
 
-	write_seqcount_begin(&vtime->seqcount);
-	if (vtime->state == VTIME_IDLE)
-		vtime_account_idle(prev);
-	else
-		__vtime_account_kernel(prev, vtime);
-	vtime->state = VTIME_INACTIVE;
-	vtime->cpu = -1;
-	write_seqcount_end(&vtime->seqcount);
+	/*
+	 * Flush the prev task vtime, unless it has passed
+	 * vtime_exit_task(), in which case there is nothing
+	 * left to account.
+	 */
+	if (vtime->state != VTIME_DEAD) {
+		write_seqcount_begin(&vtime->seqcount);
+		if (vtime->state == VTIME_IDLE)
+			vtime_account_idle(prev);
+		else
+			__vtime_account_kernel(prev, vtime);
+		vtime->state = VTIME_INACTIVE;
+		vtime->cpu = -1;
+		write_seqcount_end(&vtime->seqcount);
+	}
 
 	vtime = &current->vtime;
 
+	/*
+	 * Ignore the next task if it has been preempted after
+	 * vtime_exit_task().
+	 */
+	if (vtime->state == VTIME_DEAD)
+		return;
+
 	write_seqcount_begin(&vtime->seqcount);
 	if (is_idle_task(current))
 		vtime->state = VTIME_IDLE;
@@ -850,6 +864,30 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_restore(flags);
 }
 
+/*
+ * This is the final settlement point after which we don't account
+ * anymore vtime for this task.
+ */
+void vtime_exit_task(struct task_struct *t)
+{
+	struct vtime *vtime = &t->vtime;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	write_seqcount_begin(&vtime->seqcount);
+	/*
+	 * A task that has never run on a nohz_full CPU hasn't
+	 * been tracked by vtime. Thus it's in VTIME_INACTIVE
+	 * state. Nothing to account for it.
+	 */
+	if (vtime->state != VTIME_INACTIVE)
+		vtime_account_system(t, vtime);
+	vtime->state = VTIME_DEAD;
+	vtime->cpu = -1;
+	write_seqcount_end(&vtime->seqcount);
+	local_irq_restore(flags);
+}
+
 u64 task_gtime(struct task_struct *t)
 {
 	struct vtime *vtime = &t->vtime;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 09/25] kcpustat: Track running task following vtime sequences
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 08/25] vtime: Exit vtime before exit_notify() Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-20 13:58   ` Peter Zijlstra
  2018-11-14  2:45 ` [PATCH 10/25] context_tracking: Remove context_tracking_active() Frederic Weisbecker
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

In order to make kcpustat vtime aware (ie: work on nohz_full without
freezing), we need to track the task running on the CPU in order to
fetch its vtime delta and add it to the relevant kcpustat field.

The most efficient way to track this task is to use RCU. The task is
assigned on context switch, right after the vtime of the previous task
has been flushed and the vtime of the next task has been initialized.

Things are then prepared to be ordered that way:

             WRITER (ctx switch)                READER
             ------------------            -----------------------
        vtime_seqcount_write_lock(prev)     rcu_read_lock()
        //flush prev vtime                  curr = rcu_dereference(kcpustat->curr)
        vtime_seqcount_write_unlock(prev)   vtime_seqcount_read_start(curr)
                                            //fetch curr vtime
        vtime_seqcount_write_lock(next)     vtime_seqcount_read_end(curr)
        //Init vtime                        rcu_read_unlock()
        vtime_seqcount_write_unlock(next)

        rcu_assign_pointer(kcpustat->curr, next)

With this ordering layout, we are sure to read a coherent pair
(task cputime, kcpustat).
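
The same reader spelled out with the plain seqcount API, for
illustration (the pseudo helpers above map to these calls):

	rcu_read_lock();
	curr = rcu_dereference(kcpustat->curr);
	if (curr) {
		do {
			seq = read_seqcount_begin(&curr->vtime.seqcount);
			/* fetch curr's pending vtime delta here */
		} while (read_seqcount_retry(&curr->vtime.seqcount, seq));
	}
	rcu_read_unlock();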

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/kernel_stat.h |  1 +
 kernel/sched/cputime.c      | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 7ee2bb4..86fdbce 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -32,6 +32,7 @@ enum cpu_usage_stat {
 };
 
 struct kernel_cpustat {
+	struct task_struct __rcu *curr;
 	u64 cpustat[NR_STATS];
 };
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a0c3a82..2eb313a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -812,6 +812,7 @@ void vtime_account_idle(struct task_struct *tsk)
 void vtime_task_switch_generic(struct task_struct *prev)
 {
 	struct vtime *vtime = &prev->vtime;
+	struct kernel_cpustat *kcpustat = kcpustat_this_cpu;
 
 	/*
 	 * Flush the prev task vtime, unless it has passed
@@ -835,8 +836,10 @@ void vtime_task_switch_generic(struct task_struct *prev)
 	 * Ignore the next task if it has been preempted after
 	 * vtime_exit_task().
 	 */
-	if (vtime->state == VTIME_DEAD)
+	if (vtime->state == VTIME_DEAD) {
+		rcu_assign_pointer(kcpustat->curr, NULL);
 		return;
+	}
 
 	write_seqcount_begin(&vtime->seqcount);
 	if (is_idle_task(current))
@@ -848,10 +851,13 @@ void vtime_task_switch_generic(struct task_struct *prev)
 	vtime->starttime = sched_clock();
 	vtime->cpu = smp_processor_id();
 	write_seqcount_end(&vtime->seqcount);
+
+	rcu_assign_pointer(kcpustat->curr, current);
 }
 
 void vtime_init_idle(struct task_struct *t, int cpu)
 {
+	struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
 	struct vtime *vtime = &t->vtime;
 	unsigned long flags;
 
@@ -862,6 +868,8 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	vtime->cpu = cpu;
 	write_seqcount_end(&vtime->seqcount);
 	local_irq_restore(flags);
+
+	rcu_assign_pointer(kcpustat->curr, t);
 }
 
 /*
@@ -885,6 +893,7 @@ void vtime_exit_task(struct task_struct *t)
 	vtime->state = VTIME_DEAD;
 	vtime->cpu = -1;
 	write_seqcount_end(&vtime->seqcount);
+	rcu_assign_pointer(kcpustat_this_cpu->curr, NULL);
 	local_irq_restore(flags);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 10/25] context_tracking: Remove context_tracking_active()
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 09/25] kcpustat: Track running task following vtime sequences Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 11/25] context_tracking: s/context_tracking_is_enabled/context_tracking_enabled() Frederic Weisbecker
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

This function is a leftover from an old removal or rename. We can drop it.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/context_tracking_state.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index f128dc3..f4633c2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -42,7 +42,6 @@ static inline bool context_tracking_in_user(void)
 }
 #else
 static inline bool context_tracking_in_user(void) { return false; }
-static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
 #endif /* CONFIG_CONTEXT_TRACKING */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 11/25] context_tracking: s/context_tracking_is_enabled/context_tracking_enabled()
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 10/25] context_tracking: Remove context_tracking_active() Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 12/25] context_tracking: Rename context_tracking_is_cpu_enabled() to context_tracking_enabled_this_cpu() Frederic Weisbecker
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Remove the superfluous "is" in the middle of the name. We want to
standardize the naming so that it can be expanded through suffixes:

	context_tracking_enabled()
	context_tracking_enabled_cpu()
	context_tracking_enabled_this_cpu()

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h               |  2 +-
 include/linux/context_tracking.h       | 20 ++++++++++----------
 include/linux/context_tracking_state.h |  8 ++++----
 include/linux/tick.h                   |  2 +-
 include/linux/vtime.h                  |  2 +-
 kernel/context_tracking.c              |  6 +++---
 6 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 25e5a6b..9fe7ed3 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -352,7 +352,7 @@ For 32-bit we have the following conventions - kernel is built with
 .macro CALL_enter_from_user_mode
 #ifdef CONFIG_CONTEXT_TRACKING
 #ifdef HAVE_JUMP_LABEL
-	STATIC_BRANCH_JMP l_yes=.Lafter_call_\@, key=context_tracking_enabled, branch=1
+	STATIC_BRANCH_JMP l_yes=.Lafter_call_\@, key=context_tracking_key, branch=1
 #endif
 	call enter_from_user_mode
 .Lafter_call_\@:
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 558a209..f1601ba 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -22,26 +22,26 @@ extern void context_tracking_user_exit(void);
 
 static inline void user_enter(void)
 {
-	if (context_tracking_is_enabled())
+	if (context_tracking_enabled())
 		context_tracking_enter(CONTEXT_USER);
 
 }
 static inline void user_exit(void)
 {
-	if (context_tracking_is_enabled())
+	if (context_tracking_enabled())
 		context_tracking_exit(CONTEXT_USER);
 }
 
 /* Called with interrupts disabled.  */
 static inline void user_enter_irqoff(void)
 {
-	if (context_tracking_is_enabled())
+	if (context_tracking_enabled())
 		__context_tracking_enter(CONTEXT_USER);
 
 }
 static inline void user_exit_irqoff(void)
 {
-	if (context_tracking_is_enabled())
+	if (context_tracking_enabled())
 		__context_tracking_exit(CONTEXT_USER);
 }
 
@@ -49,7 +49,7 @@ static inline enum ctx_state exception_enter(void)
 {
 	enum ctx_state prev_ctx;
 
-	if (!context_tracking_is_enabled())
+	if (!context_tracking_enabled())
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
@@ -61,7 +61,7 @@ static inline enum ctx_state exception_enter(void)
 
 static inline void exception_exit(enum ctx_state prev_ctx)
 {
-	if (context_tracking_is_enabled()) {
+	if (context_tracking_enabled()) {
 		if (prev_ctx != CONTEXT_KERNEL)
 			context_tracking_enter(prev_ctx);
 	}
@@ -77,7 +77,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
  */
 static inline enum ctx_state ct_state(void)
 {
-	return context_tracking_is_enabled() ?
+	return context_tracking_enabled() ?
 		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
 }
 #else
@@ -90,7 +90,7 @@ static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
 #endif /* !CONFIG_CONTEXT_TRACKING */
 
-#define CT_WARN_ON(cond) WARN_ON(context_tracking_is_enabled() && (cond))
+#define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
 
 #ifdef CONFIG_CONTEXT_TRACKING_FORCE
 extern void context_tracking_init(void);
@@ -108,7 +108,7 @@ static inline void guest_enter_irqoff(void)
 	else
 		current->flags |= PF_VCPU;
 
-	if (context_tracking_is_enabled())
+	if (context_tracking_enabled())
 		__context_tracking_enter(CONTEXT_GUEST);
 
 	/* KVM does not hold any references to rcu protected data when it
@@ -124,7 +124,7 @@ static inline void guest_enter_irqoff(void)
 
 static inline void guest_exit_irqoff(void)
 {
-	if (context_tracking_is_enabled())
+	if (context_tracking_enabled())
 		__context_tracking_exit(CONTEXT_GUEST);
 
 	if (vtime_accounting_cpu_enabled())
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index f4633c2..91250bd 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -23,12 +23,12 @@ struct context_tracking {
 };
 
 #ifdef CONFIG_CONTEXT_TRACKING
-extern struct static_key_false context_tracking_enabled;
+extern struct static_key_false context_tracking_key;
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
 
-static inline bool context_tracking_is_enabled(void)
+static inline bool context_tracking_enabled(void)
 {
-	return static_branch_unlikely(&context_tracking_enabled);
+	return static_branch_unlikely(&context_tracking_key);
 }
 
 static inline bool context_tracking_cpu_is_enabled(void)
@@ -42,7 +42,7 @@ static inline bool context_tracking_in_user(void)
 }
 #else
 static inline bool context_tracking_in_user(void) { return false; }
-static inline bool context_tracking_is_enabled(void) { return false; }
+static inline bool context_tracking_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
 #endif /* CONFIG_CONTEXT_TRACKING */
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 55388ab..67fb467 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -163,7 +163,7 @@ extern cpumask_var_t tick_nohz_full_mask;
 
 static inline bool tick_nohz_full_enabled(void)
 {
-	if (!context_tracking_is_enabled())
+	if (!context_tracking_enabled())
 		return false;
 
 	return tick_nohz_full_running;
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 8350a0b..01f09ca 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -28,7 +28,7 @@ extern void vtime_task_switch(struct task_struct *prev);
  */
 static inline bool vtime_accounting_enabled(void)
 {
-	return context_tracking_is_enabled();
+	return context_tracking_enabled();
 }
 
 static inline bool vtime_accounting_cpu_enabled(void)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 9ad37b9..1533862 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -24,8 +24,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
 
-DEFINE_STATIC_KEY_FALSE(context_tracking_enabled);
-EXPORT_SYMBOL_GPL(context_tracking_enabled);
+DEFINE_STATIC_KEY_FALSE(context_tracking_key);
+EXPORT_SYMBOL_GPL(context_tracking_key);
 
 DEFINE_PER_CPU(struct context_tracking, context_tracking);
 EXPORT_SYMBOL_GPL(context_tracking);
@@ -191,7 +191,7 @@ void __init context_tracking_cpu_set(int cpu)
 
 	if (!per_cpu(context_tracking.active, cpu)) {
 		per_cpu(context_tracking.active, cpu) = true;
-		static_branch_inc(&context_tracking_enabled);
+		static_branch_inc(&context_tracking_key);
 	}
 
 	if (initialized)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 12/25] context_tracking: Rename context_tracking_is_cpu_enabled() to context_tracking_enabled_this_cpu()
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 11/25] context_tracking: s/context_tracking_is_enabled/context_tracking_enabled() Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 13/25] context_tracking: Introduce context_tracking_enabled_cpu() Frederic Weisbecker
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Standardize the naming on top of the context_tracking_enabled_*() base.
Also make it clear we are checking the context tracking state of the
*current* CPU with this function. We'll need to add an API to check that
state on remote CPUs as well, so we must disambiguate the naming.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/context_tracking.h       | 2 +-
 include/linux/context_tracking_state.h | 4 ++--
 include/linux/vtime.h                  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index f1601ba..c9065ad 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -118,7 +118,7 @@ static inline void guest_enter_irqoff(void)
 	 * one time slice). Lets treat guest mode as quiescent state, just like
 	 * we do with user-mode execution.
 	 */
-	if (!context_tracking_cpu_is_enabled())
+	if (!context_tracking_enabled_this_cpu())
 		rcu_virt_note_context_switch(smp_processor_id());
 }
 
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 91250bd..08f125f 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -31,7 +31,7 @@ static inline bool context_tracking_enabled(void)
 	return static_branch_unlikely(&context_tracking_key);
 }
 
-static inline bool context_tracking_cpu_is_enabled(void)
+static inline bool context_tracking_enabled_this_cpu(void)
 {
 	return __this_cpu_read(context_tracking.active);
 }
@@ -43,7 +43,7 @@ static inline bool context_tracking_in_user(void)
 #else
 static inline bool context_tracking_in_user(void) { return false; }
 static inline bool context_tracking_enabled(void) { return false; }
-static inline bool context_tracking_cpu_is_enabled(void) { return false; }
+static inline bool context_tracking_enabled_this_cpu(void) { return false; }
 #endif /* CONFIG_CONTEXT_TRACKING */
 
 #endif
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 01f09ca..df03b37 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -34,7 +34,7 @@ static inline bool vtime_accounting_enabled(void)
 static inline bool vtime_accounting_cpu_enabled(void)
 {
 	if (vtime_accounting_enabled()) {
-		if (context_tracking_cpu_is_enabled())
+		if (context_tracking_enabled_this_cpu())
 			return true;
 	}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 13/25] context_tracking: Introduce context_tracking_enabled_cpu()
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (11 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 12/25] context_tracking: Rename context_tracking_is_cpu_enabled() to context_tracking_enabled_this_cpu() Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-20 14:02   ` Peter Zijlstra
  2018-11-14  2:45 ` [PATCH 14/25] sched/vtime: Rename vtime_accounting_cpu_enabled() to vtime_accounting_enabled_this_cpu() Frederic Weisbecker
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

This allows us to check if a remote CPU runs context tracking
(i.e. is nohz_full). We'll need that to reliably support "nice"
accounting on kcpustat.
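
As an illustrative sketch (not part of this patch): a remote reader,
such as the kcpustat accessor introduced later in this series, would
use it along these lines, where "cpu" is the target CPU:

	/*
	 * Hypothetical helper: tell whether the raw kcpustat values of
	 * @cpu can be trusted as-is, or whether a pending vtime delta
	 * has to be added on top of them.
	 */
	static inline bool kcpustat_needs_vtime(int cpu)
	{
		/* nohz_full CPU: the tick doesn't maintain kcpustat */
		return context_tracking_enabled_cpu(cpu);
	}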

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/context_tracking_state.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 08f125f..5877177 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -31,6 +31,11 @@ static inline bool context_tracking_enabled(void)
 	return static_branch_unlikely(&context_tracking_key);
 }
 
+static inline bool context_tracking_enabled_cpu(int cpu)
+{
+	return per_cpu(context_tracking.active, cpu);
+}
+
 static inline bool context_tracking_enabled_this_cpu(void)
 {
 	return __this_cpu_read(context_tracking.active);
@@ -43,6 +48,7 @@ static inline bool context_tracking_in_user(void)
 #else
 static inline bool context_tracking_in_user(void) { return false; }
 static inline bool context_tracking_enabled(void) { return false; }
+static inline bool context_tracking_enabled_cpu(int cpu) { return false; }
 static inline bool context_tracking_enabled_this_cpu(void) { return false; }
 #endif /* CONFIG_CONTEXT_TRACKING */
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 14/25] sched/vtime: Rename vtime_accounting_cpu_enabled() to vtime_accounting_enabled_this_cpu()
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (12 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 13/25] context_tracking: Introduce context_tracking_enabled_cpu() Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-14  2:45 ` [PATCH 15/25] sched/vtime: Introduce vtime_accounting_enabled_cpu() Frederic Weisbecker
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Standardize the naming on top of the vtime_accounting_enabled_*() base.
Also make it clear we are checking the vtime state of the
*current* CPU with this function. We'll need to add an API to check that
state on remote CPUs as well, so we must disambiguate the naming.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/context_tracking.h |  4 ++--
 include/linux/vtime.h            | 10 +++++-----
 kernel/sched/cputime.c           |  2 +-
 kernel/time/tick-sched.c         |  2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index c9065ad..64ec828 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -103,7 +103,7 @@ static inline void context_tracking_init(void) { }
 /* must be called with irqs disabled */
 static inline void guest_enter_irqoff(void)
 {
-	if (vtime_accounting_cpu_enabled())
+	if (vtime_accounting_enabled_this_cpu())
 		vtime_guest_enter(current);
 	else
 		current->flags |= PF_VCPU;
@@ -127,7 +127,7 @@ static inline void guest_exit_irqoff(void)
 	if (context_tracking_enabled())
 		__context_tracking_exit(CONTEXT_GUEST);
 
-	if (vtime_accounting_cpu_enabled())
+	if (vtime_accounting_enabled_this_cpu())
 		vtime_guest_exit(current);
 	else
 		current->flags &= ~PF_VCPU;
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index df03b37..82fed7a 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -11,11 +11,11 @@
 struct task_struct;
 
 /*
- * vtime_accounting_cpu_enabled() definitions/declarations
+ * vtime_accounting_enabled_this_cpu() definitions/declarations
  */
 #if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE)
 
-static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+static inline bool vtime_accounting_enabled_this_cpu(void) { return true; }
 extern void vtime_task_switch(struct task_struct *prev);
 
 #elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
@@ -31,7 +31,7 @@ static inline bool vtime_accounting_enabled(void)
 	return context_tracking_enabled();
 }
 
-static inline bool vtime_accounting_cpu_enabled(void)
+static inline bool vtime_accounting_enabled_this_cpu(void)
 {
 	if (vtime_accounting_enabled()) {
 		if (context_tracking_enabled_this_cpu())
@@ -45,13 +45,13 @@ extern void vtime_task_switch_generic(struct task_struct *prev);
 
 static inline void vtime_task_switch(struct task_struct *prev)
 {
-	if (vtime_accounting_cpu_enabled())
+	if (vtime_accounting_enabled_this_cpu())
 		vtime_task_switch_generic(prev);
 }
 
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
-static inline bool vtime_accounting_cpu_enabled(void) { return false; }
+static inline bool vtime_accounting_enabled_this_cpu(void) { return false; }
 static inline void vtime_task_switch(struct task_struct *prev) { }
 
 #endif
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2eb313a..61aa7ba 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -474,7 +474,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	u64 cputime, steal;
 	struct rq *rq = this_rq();
 
-	if (vtime_accounting_cpu_enabled())
+	if (vtime_accounting_enabled_this_cpu())
 		return;
 
 	if (sched_clock_irqtime) {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69e673b..482da8e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1094,7 +1094,7 @@ static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	unsigned long ticks;
 
-	if (vtime_accounting_cpu_enabled())
+	if (vtime_accounting_enabled_this_cpu())
 		return;
 	/*
 	 * We stopped the tick in idle. Update process times would miss the
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 15/25] sched/vtime: Introduce vtime_accounting_enabled_cpu()
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (13 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 14/25] sched/vtime: Rename vtime_accounting_cpu_enabled() to vtime_accounting_enabled_this_cpu() Frederic Weisbecker
@ 2018-11-14  2:45 ` Frederic Weisbecker
  2018-11-20 14:04   ` Peter Zijlstra
  2018-11-14  2:46 ` [PATCH 16/25] sched/cputime: Allow to pass cputime index on user/guest accounting Frederic Weisbecker
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

This allows us to check if a remote CPU runs vtime accounting
(i.e. is nohz_full). We'll need that to reliably support "nice"
accounting on kcpustat.
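
For illustration, this mirrors how the nice-update path later in this
series bails out early (sketch, not part of this patch):

	/* Nothing to flush if the target CPU still runs its tick */
	if (!vtime_accounting_enabled_cpu(cpu))
		return;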

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/vtime.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 82fed7a..a53f6ea 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -31,6 +31,16 @@ static inline bool vtime_accounting_enabled(void)
 	return context_tracking_enabled();
 }
 
+static inline bool vtime_accounting_enabled_cpu(int cpu)
+{
+	if (vtime_accounting_enabled()) {
+		if (context_tracking_enabled_cpu(cpu))
+			return true;
+	}
+
+	return false;
+}
+
 static inline bool vtime_accounting_enabled_this_cpu(void)
 {
 	if (vtime_accounting_enabled()) {
@@ -51,6 +61,7 @@ static inline void vtime_task_switch(struct task_struct *prev)
 
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
+static inline bool vtime_accounting_enabled_cpu(int cpu) { return false; }
 static inline bool vtime_accounting_enabled_this_cpu(void) { return false; }
 static inline void vtime_task_switch(struct task_struct *prev) { }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 16/25] sched/cputime: Allow to pass cputime index on user/guest accounting
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (14 preceding siblings ...)
  2018-11-14  2:45 ` [PATCH 15/25] sched/vtime: Introduce vtime_accounting_enabled_cpu() Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-14  2:46 ` [PATCH 17/25] sched/cputime: Standardize the kcpustat index based accounting functions Frederic Weisbecker
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

When we account user or guest cputime, we add the delta either to the
nice fields or to the normal fields of kcpustat, depending on the nice
value of the task passed as a parameter.

Since we are going to track nice-ness from vtime instead, we'll need to
be able to use a source other than that task when accounting the time.
So allow the callers of account_user/guest_time() to pass a custom
kcpustat destination index.
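
To illustrate the resulting split (a sketch based on the code below;
the vtime caller only appears in a later patch of this series):

	/* Tick path: index deduced from the task's live nice value */
	account_user_time(p, cputime);

	/*
	 * vtime path: index chosen from the nice value snapshotted
	 * under the vtime seqcount, not from task_nice(p)
	 */
	account_user_time_index(p, cputime,
				vtime->nice ? CPUTIME_NICE : CPUTIME_USER);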

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 50 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 61aa7ba..63c4f0b 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -108,6 +108,20 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	cgroup_account_cputime_field(p, index, tmp);
 }
 
+static void account_user_time_index(struct task_struct *p,
+				    u64 cputime, enum cpu_usage_stat index)
+{
+	/* Add user time to process. */
+	p->utime += cputime;
+	account_group_user_time(p, cputime);
+
+	/* Add user time to cpustat. */
+	task_group_account_field(p, index, cputime);
+
+	/* Account for user time used */
+	acct_account_cputime(p);
+}
+
 /*
  * Account user CPU time to a process.
  * @p: the process that the CPU time gets accounted to
@@ -115,27 +129,14 @@ static inline void task_group_account_field(struct task_struct *p, int index,
  */
 void account_user_time(struct task_struct *p, u64 cputime)
 {
-	int index;
-
-	/* Add user time to process. */
-	p->utime += cputime;
-	account_group_user_time(p, cputime);
+	enum cpu_usage_stat index;
 
 	index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
-
-	/* Add user time to cpustat. */
-	task_group_account_field(p, index, cputime);
-
-	/* Account for user time used */
-	acct_account_cputime(p);
+	account_user_time_index(p, cputime, index);
 }
 
-/*
- * Account guest CPU time to a process.
- * @p: the process that the CPU time gets accounted to
- * @cputime: the CPU time spent in virtual machine since the last update
- */
-void account_guest_time(struct task_struct *p, u64 cputime)
+static void account_guest_time_index(struct task_struct *p,
+				     u64 cputime, enum cpu_usage_stat index)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 
@@ -145,7 +146,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
 	p->gtime += cputime;
 
 	/* Add guest time to cpustat. */
-	if (task_nice(p) > 0) {
+	if (index == CPUTIME_GUEST_NICE) {
 		cpustat[CPUTIME_NICE] += cputime;
 		cpustat[CPUTIME_GUEST_NICE] += cputime;
 	} else {
@@ -155,6 +156,19 @@ void account_guest_time(struct task_struct *p, u64 cputime)
 }
 
 /*
+ * Account guest CPU time to a process.
+ * @p: the process that the CPU time gets accounted to
+ * @cputime: the CPU time spent in virtual machine since the last update
+ */
+void account_guest_time(struct task_struct *p, u64 cputime)
+{
+	enum cpu_usage_stat index;
+
+	index = (task_nice(p) > 0) ? CPUTIME_GUEST_NICE : CPUTIME_GUEST;
+	account_guest_time_index(p, cputime, index);
+}
+
+/*
  * Account system CPU time to a process and desired cpustat field
  * @p: the process that the CPU time gets accounted to
  * @cputime: the CPU time spent in kernel space since the last update
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 17/25] sched/cputime: Standardize the kcpustat index based accounting functions
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (15 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 16/25] sched/cputime: Allow to pass cputime index on user/guest accounting Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-14  2:46 ` [PATCH 18/25] vtime: Track nice-ness on top of context switch Frederic Weisbecker
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Clean up the functions that do cputime accounting with custom
kcpustat indexes (the resulting signatures are summarized after the
list below):

* Rename account_system_index_time() to account_system_time_index() to
  comply with account_guest/user_time_index()

* Use proper enum cpu_usage_stat type in account_system_time()

* Reorder task_group_account_field() parameters to comply with those of
  account_*_time_index().

* Rename task_group_account_field()'s tmp parameter to cputime

* Make the type of index in task_group_account_field() explicit: enum cpu_usage_stat
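
For reference, the signatures after this patch line up as follows
(summary only, static qualifiers omitted):

	void account_user_time_index(struct task_struct *p, u64 cputime,
				     enum cpu_usage_stat index);
	void account_guest_time_index(struct task_struct *p, u64 cputime,
				      enum cpu_usage_stat index);
	void account_system_time_index(struct task_struct *p, u64 cputime,
				       enum cpu_usage_stat index);
	void task_group_account_field(struct task_struct *p, u64 cputime,
				      enum cpu_usage_stat index);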

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/ia64/kernel/time.c     |  6 +++---
 arch/powerpc/kernel/time.c  |  6 +++---
 arch/s390/kernel/vtime.c    |  2 +-
 include/linux/kernel_stat.h |  2 +-
 kernel/sched/cputime.c      | 22 +++++++++++-----------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index a3b09ea..46a9798 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -79,17 +79,17 @@ void vtime_flush(struct task_struct *tsk)
 
 	if (ti->stime) {
 		delta = cycle_to_nsec(ti->stime);
-		account_system_index_time(tsk, delta, CPUTIME_SYSTEM);
+		account_system_time_index(tsk, delta, CPUTIME_SYSTEM);
 	}
 
 	if (ti->hardirq_time) {
 		delta = cycle_to_nsec(ti->hardirq_time);
-		account_system_index_time(tsk, delta, CPUTIME_IRQ);
+		account_system_time_index(tsk, delta, CPUTIME_IRQ);
 	}
 
 	if (ti->softirq_time) {
 		delta = cycle_to_nsec(ti->softirq_time);
-		account_system_index_time(tsk, delta, CPUTIME_SOFTIRQ);
+		account_system_time_index(tsk, delta, CPUTIME_SOFTIRQ);
 	}
 
 	ti->utime = 0;
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index cc591a4..6e6e0c8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -422,14 +422,14 @@ void vtime_flush(struct task_struct *tsk)
 		account_idle_time(cputime_to_nsecs(acct->idle_time));
 
 	if (acct->stime)
-		account_system_index_time(tsk, cputime_to_nsecs(acct->stime),
+		account_system_time_index(tsk, cputime_to_nsecs(acct->stime),
 					  CPUTIME_SYSTEM);
 
 	if (acct->hardirq_time)
-		account_system_index_time(tsk, cputime_to_nsecs(acct->hardirq_time),
+		account_system_time_index(tsk, cputime_to_nsecs(acct->hardirq_time),
 					  CPUTIME_IRQ);
 	if (acct->softirq_time)
-		account_system_index_time(tsk, cputime_to_nsecs(acct->softirq_time),
+		account_system_time_index(tsk, cputime_to_nsecs(acct->softirq_time),
 					  CPUTIME_SOFTIRQ);
 
 	vtime_flush_scaled(tsk, acct);
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index af48db3..b6b888d 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -115,7 +115,7 @@ static void account_system_index_scaled(struct task_struct *p, u64 cputime,
 					enum cpu_usage_stat index)
 {
 	p->stimescaled += cputime_to_nsecs(scale_vtime(cputime));
-	account_system_index_time(p, cputime_to_nsecs(cputime), index);
+	account_system_time_index(p, cputime_to_nsecs(cputime), index);
 }
 
 /*
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 86fdbce..049d973 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -82,7 +82,7 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 extern void account_user_time(struct task_struct *, u64);
 extern void account_guest_time(struct task_struct *, u64);
 extern void account_system_time(struct task_struct *, int, u64);
-extern void account_system_index_time(struct task_struct *, u64,
+extern void account_system_time_index(struct task_struct *, u64,
 				      enum cpu_usage_stat);
 extern void account_steal_time(u64);
 extern void account_idle_time(u64);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 63c4f0b..8f5dee2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -94,8 +94,8 @@ static u64 irqtime_tick_accounted(u64 dummy)
 
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
-static inline void task_group_account_field(struct task_struct *p, int index,
-					    u64 tmp)
+static inline void task_group_account_field(struct task_struct *p, u64 cputime,
+					    enum cpu_usage_stat index)
 {
 	/*
 	 * Since all updates are sure to touch the root cgroup, we
@@ -103,9 +103,9 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	 * is the only cgroup, then nothing else should be necessary.
 	 *
 	 */
-	__this_cpu_add(kernel_cpustat.cpustat[index], tmp);
+	__this_cpu_add(kernel_cpustat.cpustat[index], cputime);
 
-	cgroup_account_cputime_field(p, index, tmp);
+	cgroup_account_cputime_field(p, index, cputime);
 }
 
 static void account_user_time_index(struct task_struct *p,
@@ -116,7 +116,7 @@ static void account_user_time_index(struct task_struct *p,
 	account_group_user_time(p, cputime);
 
 	/* Add user time to cpustat. */
-	task_group_account_field(p, index, cputime);
+	task_group_account_field(p, cputime, index);
 
 	/* Account for user time used */
 	acct_account_cputime(p);
@@ -174,7 +174,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
  * @cputime: the CPU time spent in kernel space since the last update
  * @index: pointer to cpustat field that has to be updated
  */
-void account_system_index_time(struct task_struct *p,
+void account_system_time_index(struct task_struct *p,
 			       u64 cputime, enum cpu_usage_stat index)
 {
 	/* Add system time to process. */
@@ -182,7 +182,7 @@ void account_system_index_time(struct task_struct *p,
 	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
-	task_group_account_field(p, index, cputime);
+	task_group_account_field(p, cputime, index);
 
 	/* Account for system time used */
 	acct_account_cputime(p);
@@ -196,7 +196,7 @@ void account_system_index_time(struct task_struct *p,
  */
 void account_system_time(struct task_struct *p, int hardirq_offset, u64 cputime)
 {
-	int index;
+	enum cpu_usage_stat index;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime);
@@ -210,7 +210,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset, u64 cputime)
 	else
 		index = CPUTIME_SYSTEM;
 
-	account_system_index_time(p, cputime, index);
+	account_system_time_index(p, cputime, index);
 }
 
 /*
@@ -391,7 +391,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		 * So, we have to handle it separately here.
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
-		account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
+		account_system_time_index(p, cputime, CPUTIME_SOFTIRQ);
 	} else if (user_tick) {
 		account_user_time(p, cputime);
 	} else if (p == rq->idle) {
@@ -399,7 +399,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	} else if (p->flags & PF_VCPU) { /* System time or guest time */
 		account_guest_time(p, cputime);
 	} else {
-		account_system_index_time(p, cputime, CPUTIME_SYSTEM);
+		account_system_time_index(p, cputime, CPUTIME_SYSTEM);
 	}
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 18/25] vtime: Track nice-ness on top of context switch
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (16 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 17/25] sched/cputime: Standardize the kcpustat index based accounting functions Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-20 14:09   ` Peter Zijlstra
  2018-11-14  2:46 ` [PATCH 19/25] sched/vtime: Handle nice updates under vtime Frederic Weisbecker
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

We need to read the nice value of the task running on any CPU, possibly
remotely, in order to correctly support kcpustat on nohz_full.
Unfortunately we can't just read task_nice(tsk) when tsk runs on another
CPU because its nice value may be changed concurrently. The risk is that
a recently modified nice value is believed to apply for longer than it
actually did.

For example, if a task runs with nice = -10 starting at T0 and its nice
value is changed to 10 at T0 + 1 second, a reader at T0 + 1 second could
think that the task has had "nice == 10" since the beginning (T0) and
spuriously account 1 second of nice time in kcpustat instead of 1 second
of user time.

So we need to track nice value changes under the vtime seqcount. Start
with context switches and account the vtime nice-ness on top of that.
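
For context, here is roughly how the remote reader introduced later in
this series will consume the new field, which is why vtime->nice may
only be written under the vtime seqcount (reader-side sketch):

	unsigned int seq;
	u64 delta;
	int nice;

	do {
		seq = read_seqcount_begin(&vtime->seqcount);
		/* both values are stable within one read section */
		nice = vtime->nice;
		delta = vtime_delta(vtime);
	} while (read_seqcount_retry(&vtime->seqcount, seq));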

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h  |  1 +
 kernel/sched/cputime.c | 44 +++++++++++++++++++++++++++++++++++---------
 2 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 27e0544..356326f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -280,6 +280,7 @@ enum vtime_state {
 struct vtime {
 	seqcount_t		seqcount;
 	unsigned long long	starttime;
+	int			nice;
 	enum vtime_state	state;
 	unsigned int		cpu;
 	u64			utime;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8f5dee2..07c2e7f 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -735,13 +735,42 @@ static void vtime_account_system(struct task_struct *tsk,
 static void vtime_account_guest(struct task_struct *tsk,
 				struct vtime *vtime)
 {
+	enum cpu_usage_stat index;
+
 	vtime->gtime += get_vtime_delta(vtime);
-	if (vtime->gtime >= TICK_NSEC) {
-		account_guest_time(tsk, vtime->gtime);
-		vtime->gtime = 0;
-	}
+
+	if (vtime->gtime < TICK_NSEC)
+		return;
+
+	if (vtime->nice)
+		index = CPUTIME_GUEST_NICE;
+	else
+		index = CPUTIME_GUEST;
+
+	account_guest_time_index(tsk, vtime->gtime, index);
+	vtime->gtime = 0;
 }
 
+static void vtime_account_user(struct task_struct *tsk,
+			       struct vtime *vtime)
+{
+	enum cpu_usage_stat index;
+
+	vtime->utime += get_vtime_delta(vtime);
+
+	if (vtime->utime < TICK_NSEC)
+		return;
+
+	if (vtime->nice)
+		index = CPUTIME_NICE;
+	else
+		index = CPUTIME_USER;
+
+	account_user_time_index(tsk, vtime->utime, index);
+	vtime->utime = 0;
+}
+
 static void __vtime_account_kernel(struct task_struct *tsk,
 				   struct vtime *vtime)
 {
@@ -779,11 +808,7 @@ void vtime_user_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	vtime->utime += get_vtime_delta(vtime);
-	if (vtime->utime >= TICK_NSEC) {
-		account_user_time(tsk, vtime->utime);
-		vtime->utime = 0;
-	}
+	vtime_account_user(tsk, vtime);
 	vtime->state = VTIME_SYS;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -864,6 +889,7 @@ void vtime_task_switch_generic(struct task_struct *prev)
 		vtime->state = VTIME_SYS;
 	vtime->starttime = sched_clock();
 	vtime->cpu = smp_processor_id();
+	vtime->nice = (task_nice(current) > 0) ? 1 : 0;
 	write_seqcount_end(&vtime->seqcount);
 
 	rcu_assign_pointer(kcpustat->curr, current);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 19/25] sched/vtime: Handle nice updates under vtime
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (17 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 18/25] vtime: Track nice-ness on top of context switch Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-20 14:17   ` Peter Zijlstra
  2018-11-14  2:46 ` [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor Frederic Weisbecker
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

On the vtime level, nice updates are currently handled on context
switches. When a task's nice value gets updated while it is sleeping,
the context switch takes into account the new nice value in order to
later record the vtime delta to the appropriate kcpustat index.

We have yet to handle live updates: when set_user_nice() is called
while the target is running. We'll handle that on two sides:

* If the caller of set_user_nice() is the current task, we update the
  vtime state in place.

* If the target runs on a different CPU, we interrupt it with an IPI to
  update the vtime state in place.

The vtime update in question consists of flushing the pending vtime
delta to the task/kcpustat and resuming the accounting on top of the
new nice value.
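
As a concrete scenario (hypothetical values): task p runs on CPU 3 and
another task changes its nice value from -5 to 5. The call chain,
condensed from the code below, is:

	set_user_nice(p, 5)
	  vtime_set_nice(rq, p, -5)		/* old_nice == -5 */
	    /* nice crossed 0, p is rq->curr but not the caller */
	    vtime_set_nice_remote(3)
	      irq_work_queue_on(..., 3)		/* IPI to CPU 3 */
		vtime_set_nice_func()
		  vtime_set_nice_local(current)	/* flush + resample nice */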

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/vtime.h  |  2 ++
 kernel/sched/core.c    |  4 ++++
 kernel/sched/cputime.c | 41 ++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h   | 39 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index a53f6ea..b4566d5 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -85,6 +85,8 @@ extern void vtime_guest_enter(struct task_struct *tsk);
 extern void vtime_guest_exit(struct task_struct *tsk);
 extern void vtime_init_idle(struct task_struct *tsk, int cpu);
 extern void vtime_exit_task(struct task_struct *tsk);
+extern void vtime_set_nice_local(struct task_struct *tsk);
+extern void vtime_set_nice_remote(int cpu);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN  */
 static inline void vtime_user_enter(struct task_struct *tsk) { }
 static inline void vtime_user_exit(struct task_struct *tsk) { }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f12225f..e8f0437 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3868,6 +3868,7 @@ void set_user_nice(struct task_struct *p, long nice)
 	int old_prio, delta;
 	struct rq_flags rf;
 	struct rq *rq;
+	long old_nice;
 
 	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
 		return;
@@ -3878,6 +3879,8 @@ void set_user_nice(struct task_struct *p, long nice)
 	rq = task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
+	old_nice = task_nice(p);
+
 	/*
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
@@ -3913,6 +3916,7 @@ void set_user_nice(struct task_struct *p, long nice)
 	if (running)
 		set_curr_task(rq, p);
 out_unlock:
+	vtime_set_nice(rq, p, old_nice);
 	task_rq_unlock(rq, p, &rf);
 }
 EXPORT_SYMBOL(set_user_nice);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 07c2e7f..2b35132 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -733,13 +733,13 @@ static void vtime_account_system(struct task_struct *tsk,
 }
 
 static void vtime_account_guest(struct task_struct *tsk,
-				struct vtime *vtime)
+				struct vtime *vtime, bool force)
 {
 	enum cpu_usage_stat index;
 
 	vtime->gtime += get_vtime_delta(vtime);
 
-	if (vtime->gtime < TICK_NSEC)
+	if (vtime->gtime < TICK_NSEC && !force)
 		return;
 
 	if (vtime->nice)
@@ -752,13 +752,13 @@ static void vtime_account_guest(struct task_struct *tsk,
 }
 
 static void vtime_account_user(struct task_struct *tsk,
-			       struct vtime *vtime)
+			       struct vtime *vtime, bool force)
 {
 	enum cpu_usage_stat index;
 
 	vtime->utime += get_vtime_delta(vtime);
 
-	if (vtime->utime < TICK_NSEC)
+	if (vtime->utime < TICK_NSEC && !force)
 		return;
 
 	if (vtime->nice)
@@ -776,7 +776,7 @@ static void __vtime_account_kernel(struct task_struct *tsk,
 {
 	/* We might have scheduled out from guest path */
 	if (vtime->state == VTIME_GUEST)
-		vtime_account_guest(tsk, vtime);
+		vtime_account_guest(tsk, vtime, false);
 	else
 		vtime_account_system(tsk, vtime);
 }
@@ -808,7 +808,7 @@ void vtime_user_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	vtime_account_user(tsk, vtime);
+	vtime_account_user(tsk, vtime, false);
 	vtime->state = VTIME_SYS;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -836,7 +836,7 @@ void vtime_guest_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	vtime_account_guest(tsk, vtime);
+	vtime_account_guest(tsk, vtime, false);
 	tsk->flags &= ~PF_VCPU;
 	vtime->state = VTIME_SYS;
 	write_seqcount_end(&vtime->seqcount);
@@ -937,6 +937,33 @@ void vtime_exit_task(struct task_struct *t)
 	local_irq_restore(flags);
 }
 
+void vtime_set_nice_local(struct task_struct *t)
+{
+	struct vtime *vtime = &t->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
+	if (vtime->state == VTIME_USER)
+		vtime_account_user(t, vtime, true);
+	else if (vtime->state == VTIME_GUEST)
+		vtime_account_guest(t, vtime, true);
+	vtime->nice = (task_nice(t) > 0) ? 1 : 0;
+	write_seqcount_end(&vtime->seqcount);
+}
+
+static void vtime_set_nice_func(struct irq_work *work)
+{
+	vtime_set_nice_local(current);
+}
+
+static DEFINE_PER_CPU(struct irq_work, vtime_set_nice_work) = {
+	.func = vtime_set_nice_func,
+};
+
+void vtime_set_nice_remote(int cpu)
+{
+	irq_work_queue_on(&per_cpu(vtime_set_nice_work, cpu), cpu);
+}
+
 u64 task_gtime(struct task_struct *t)
 {
 	struct vtime *vtime = &t->vtime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 618577f..c7846ca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1790,6 +1790,45 @@ static inline int sched_tick_offload_init(void) { return 0; }
 static inline void sched_update_tick_dependency(struct rq *rq) { }
 #endif
 
+static inline void vtime_set_nice(struct rq *rq,
+				  struct task_struct *p, long old_nice)
+{
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+	long nice;
+	int cpu;
+
+	if (!vtime_accounting_enabled())
+		return;
+
+	cpu = cpu_of(rq);
+
+	if (!vtime_accounting_enabled_cpu(cpu))
+		return;
+
+	/*
+	 * Task not running, nice update will be seen by vtime on its
+	 * next context switch.
+	 */
+	if (!task_current(rq, p))
+		return;
+
+	nice = task_nice(p);
+
+	/* Task stays nice, still accounted as nice in kcpustat */
+	if (old_nice > 0 && nice > 0)
+		return;
+
+	/* Task stays rude, still accounted as non-nice in kcpustat */
+	if (old_nice <= 0 && nice <= 0)
+		return;
+
+	if (p == current)
+		vtime_set_nice_local(p);
+	else
+		vtime_set_nice_remote(cpu);
+#endif
+}
+
 static inline void add_nr_running(struct rq *rq, unsigned count)
 {
 	unsigned prev_nr = rq->nr_running;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (18 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 19/25] sched/vtime: Handle nice updates under vtime Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-20 14:23   ` Peter Zijlstra
  2018-11-14  2:46 ` [PATCH 21/25] procfs: Use vtime aware " Frederic Weisbecker
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Kcpustat is not correctly supported on nohz_full CPUs. The tick doesn't
fire there, so the cputime values don't move forward. The issue had been
hidden by the remaining 1Hz tick; its removal made the problem visible.

We solve that by tracking the task running on a CPU through RCU and
reading its vtime delta, which we add to the raw kcpustat values.

We make sure to fetch a coherent raw-kcpustat/vtime-delta pair,
checking under the vtime seqcount that the CPU referred to by the
target vtime is the expected one.
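
For reference, a reader is then expected to look like this (the
follow-up patches convert the existing users to exactly this pattern):

	u64 user, nice, sys, guest, guest_nice;

	kcpustat_cputime(&kcpustat_cpu(cpu), cpu, &user, &nice,
			 &sys, &guest, &guest_nice);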

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/kernel_stat.h | 25 +++++++++++++
 kernel/sched/cputime.c      | 90 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 115 insertions(+)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 049d973..2d4d301 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -79,6 +79,31 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
+static inline void kcpustat_cputime_raw(struct kernel_cpustat *kcpustat,
+					u64 *user, u64 *nice, u64 *system,
+					u64 *guest, u64 *guest_nice)
+{
+	*user = kcpustat->cpustat[CPUTIME_USER];
+	*nice = kcpustat->cpustat[CPUTIME_NICE];
+	*system = kcpustat->cpustat[CPUTIME_SYSTEM];
+	*guest = kcpustat->cpustat[CPUTIME_GUEST];
+	*guest_nice = kcpustat->cpustat[CPUTIME_GUEST_NICE];
+}
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+extern void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+			     u64 *user, u64 *nice, u64 *system,
+			     u64 *guest, u64 *guest_nice);
+#else
+static inline void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+				    u64 *user, u64 *nice, u64 *system,
+				    u64 *guest, u64 *guest_nice)
+{
+	kcpustat_cputime_raw(kcpustat, user, nice, system, guest, guest_nice);
+}
+#endif
+
 extern void account_user_time(struct task_struct *, u64);
 extern void account_guest_time(struct task_struct *, u64);
 extern void account_system_time(struct task_struct *, int, u64);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2b35132..3afde9f 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -1024,4 +1024,94 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
+
+static int kcpustat_vtime(struct kernel_cpustat *kcpustat, struct vtime *vtime,
+			  int cpu, u64 *user, u64 *nice,
+			  u64 *system, u64 *guest, u64 *guest_nice)
+{
+	unsigned int seq;
+	u64 delta;
+	int err;
+
+	do {
+		seq = read_seqcount_begin(&vtime->seqcount);
+
+		/*
+		 * We raced against context switch, fetch the
+		 * kcpustat task again.
+		 */
+		if (vtime->cpu != cpu && vtime->cpu != -1) {
+			err = -EAGAIN;
+			continue;
+		}
+
+		err = 0;
+
+		kcpustat_cputime_raw(kcpustat, user, nice,
+				     system, guest, guest_nice);
+
+		/* Task is sleeping, dead or idle, nothing to add */
+		if (vtime->state < VTIME_SYS)
+			continue;
+
+		delta = vtime_delta(vtime);
+
+		/*
+		 * Task runs either in user (including guest) or kernel space,
+		 * add pending nohz time to the right place.
+		 */
+		if (vtime->state == VTIME_SYS) {
+			*system += vtime->stime + delta;
+		} else if (vtime->state == VTIME_USER) {
+			if (vtime->nice)
+				*nice += vtime->utime + delta;
+			else
+				*user += vtime->utime + delta;
+		} else {
+			WARN_ON_ONCE(vtime->state != VTIME_GUEST);
+			if (vtime->nice) {
+				*guest_nice += vtime->gtime + delta;
+				*nice += vtime->gtime + delta;
+			} else {
+				*guest += vtime->gtime + delta;
+				*user += vtime->gtime + delta;
+			}
+		}
+	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return err;
+}
+
+void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+		      u64 *user, u64 *nice, u64 *system,
+		      u64 *guest, u64 *guest_nice)
+{
+	struct task_struct *curr;
+	struct vtime *vtime;
+	int err;
+
+	if (!vtime_accounting_enabled()) {
+		kcpustat_cputime_raw(kcpustat, user, nice,
+				     system, guest, guest_nice);
+		return;
+	}
+
+	rcu_read_lock();
+
+	do {
+		curr = rcu_dereference(kcpustat->curr);
+		if (!curr) {
+			kcpustat_cputime_raw(kcpustat, user, nice,
+					     system, guest, guest_nice);
+			break;
+		}
+
+		vtime = &curr->vtime;
+		err = kcpustat_vtime(kcpustat, vtime, cpu, user,
+				     nice, system, guest, guest_nice);
+	} while (err == -EAGAIN);
+
+	rcu_read_unlock();
+}
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 21/25] procfs: Use vtime aware kcpustat accessor
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (19 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-20 14:24   ` Peter Zijlstra
  2018-11-14  2:46 ` [PATCH 22/25] cpufreq: " Frederic Weisbecker
                   ` (3 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Now that we have a vtime safe kcpustat accessor, use it to fix frozen
kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 fs/proc/stat.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 535eda7..2eed662 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -95,16 +95,20 @@ static int show_stat(struct seq_file *p, void *v)
 	getboottime64(&boottime);
 
 	for_each_possible_cpu(i) {
-		user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
-		nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
-		system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
+		u64 cpu_user, cpu_nice, cpu_sys, cpu_guest, cpu_guest_nice;
+
+		kcpustat_cputime(&kcpustat_cpu(i), i, &cpu_user, &cpu_nice,
+				 &cpu_sys, &cpu_guest, &cpu_guest_nice);
+		user += cpu_user;
+		nice += cpu_nice;
+		system += cpu_sys;
 		idle += get_idle_time(i);
 		iowait += get_iowait_time(i);
 		irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
 		softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
-		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
-		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+		guest += cpu_guest;
+		guest_nice += cpu_guest_nice;
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -130,17 +134,14 @@ static int show_stat(struct seq_file *p, void *v)
 	seq_putc(p, '\n');
 
 	for_each_online_cpu(i) {
+		kcpustat_cputime(&kcpustat_cpu(i), i, &user, &nice,
+				 &system, &guest, &guest_nice);
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kcpustat_cpu(i).cpustat[CPUTIME_USER];
-		nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE];
-		system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
 		idle = get_idle_time(i);
 		iowait = get_iowait_time(i);
 		irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
 		softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
 		steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
-		guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
-		guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
 		seq_printf(p, "cpu%d", i);
 		seq_put_decimal_ull(p, " ", nsec_to_clock_t(user));
 		seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 22/25] cpufreq: Use vtime aware kcpustat accessor
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (20 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 21/25] procfs: Use vtime aware " Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-14  2:46 ` [PATCH 23/25] leds: Use vtime aware kcpustat accessors Frederic Weisbecker
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Now that we have a vtime safe kcpustat accessor, use it to fix frozen
kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 drivers/cpufreq/cpufreq.c          | 18 +++++++++++-------
 drivers/cpufreq/cpufreq_governor.c | 27 ++++++++++++++++++++-------
 2 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 7aa3dca..e9cff3f 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -119,18 +119,22 @@ EXPORT_SYMBOL_GPL(get_governor_parent_kobj);
 
 static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
 {
-	u64 idle_time;
+	struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
+	u64 user, nice, sys, guest, guest_nice;
 	u64 cur_wall_time;
+	u64 idle_time;
 	u64 busy_time;
 
 	cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());
 
-	busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+	kcpustat_cputime(kcpustat, cpu, &user, &nice, &sys,
+			 &guest, &guest_nice);
+	busy_time = user;
+	busy_time += sys;
+	busy_time += kcpustat->cpustat[CPUTIME_IRQ];
+	busy_time += kcpustat->cpustat[CPUTIME_SOFTIRQ];
+	busy_time += kcpustat->cpustat[CPUTIME_STEAL];
+	busy_time += nice;
 
 	idle_time = cur_wall_time - busy_time;
 	if (wall)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 6d53f7d..5ebe726 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -107,8 +107,13 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 
 			j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
 								  dbs_data->io_is_busy);
-			if (dbs_data->ignore_nice_load)
-				j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+			if (dbs_data->ignore_nice_load) {
+				u64 user, nice, sys, guest, guest_nice;
+
+				kcpustat_cputime(&kcpustat_cpu(j), j, &user, &nice, &sys,
+						 &guest, &guest_nice);
+				j_cdbs->prev_cpu_nice = nice;
+			}
 		}
 	}
 }
@@ -152,10 +157,13 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
 		j_cdbs->prev_cpu_idle = cur_idle_time;
 
 		if (ignore_nice) {
-			u64 cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+			u64 user, nice, sys, guest, guest_nice;
 
-			idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
-			j_cdbs->prev_cpu_nice = cur_nice;
+			kcpustat_cputime(&kcpustat_cpu(j), j, &user, &nice, &sys,
+					 &guest, &guest_nice);
+
+			idle_time += div_u64(nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
+			j_cdbs->prev_cpu_nice = nice;
 		}
 
 		if (unlikely(!time_elapsed)) {
@@ -530,8 +538,13 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 		 */
 		j_cdbs->prev_load = 0;
 
-		if (ignore_nice)
-			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+		if (ignore_nice) {
+			u64 user, nice, sys, guest, guest_nice;
+
+			kcpustat_cputime(&kcpustat_cpu(j), j, &user, &nice, &sys,
+					 &guest, &guest_nice);
+			j_cdbs->prev_cpu_nice = nice;
+		}
 	}
 
 	gov->start(policy);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 23/25] leds: Use vtime aware kcpustat accessors
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (21 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 22/25] cpufreq: " Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-14  2:46 ` [PATCH 24/25] rackmeter: " Frederic Weisbecker
  2018-11-14  2:46 ` [PATCH 25/25] sched/vtime: Clarify vtime_task_switch() argument layout Frederic Weisbecker
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Now that we have a vtime safe kcpustat accessor, use it to fix frozen
kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 drivers/leds/trigger/ledtrig-activity.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c
index bcbf41c..b4a1f66 100644
--- a/drivers/leds/trigger/ledtrig-activity.c
+++ b/drivers/leds/trigger/ledtrig-activity.c
@@ -60,9 +60,12 @@ static void led_activity_function(struct timer_list *t)
 	curr_used = 0;
 
 	for_each_possible_cpu(i) {
-		curr_used += kcpustat_cpu(i).cpustat[CPUTIME_USER]
-			  +  kcpustat_cpu(i).cpustat[CPUTIME_NICE]
-			  +  kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]
+		u64 user, nice, sys, guest, guest_nice;
+
+		kcpustat_cputime(&kcpustat_cpu(i), i, &user, &nice, &sys,
+				 &guest, &guest_nice);
+
+		curr_used += user + nice + sys
 			  +  kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]
 			  +  kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
 		cpus++;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 24/25] rackmeter: Use vtime aware kcpustat accessors
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (22 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 23/25] leds: Use vtime aware kcpustat accessors Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  2018-11-14  2:46 ` [PATCH 25/25] sched/vtime: Clarify vtime_task_switch() argument layout Frederic Weisbecker
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

Now that we have a vtime safe kcpustat accessor, use it to fix frozen
kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 drivers/macintosh/rack-meter.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 1f29d24..1936e0ec 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -83,13 +83,19 @@ static int rackmeter_ignore_nice;
  */
 static inline u64 get_cpu_idle_time(unsigned int cpu)
 {
+	struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
 	u64 retval;
 
-	retval = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] +
-		 kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
+	retval = kcpustat->cpustat[CPUTIME_IDLE] +
+		 kcpustat->cpustat[CPUTIME_IOWAIT];
 
-	if (rackmeter_ignore_nice)
-		retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+	if (rackmeter_ignore_nice) {
+		u64 user, nice, sys, guest, guest_nice;
+
+		kcpustat_cputime(kcpustat, cpu, &user, &nice, &sys,
+				 &guest, &guest_nice);
+		retval += nice;
+	}
 
 	return retval;
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 25/25] sched/vtime: Clarify vtime_task_switch() argument layout
  2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
                   ` (23 preceding siblings ...)
  2018-11-14  2:46 ` [PATCH 24/25] rackmeter: " Frederic Weisbecker
@ 2018-11-14  2:46 ` Frederic Weisbecker
  24 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-14  2:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Wanpeng Li, Thomas Gleixner,
	Yauheni Kaliuta, Ingo Molnar, Rik van Riel

This function deals with the previous and next tasks during a context
switch. But only the previous one is passed as an argument, the next
task being deduced from current. Make the code clearer by passing both
the previous and the next task as arguments.
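
Concretely, the scheduler core side of this change boils down to the
context switch path passing current explicitly (sketch; the actual
core.c hunk follows further down in the patch):

	/* finish_task_switch(): the next task is current at this point */
	vtime_task_switch(prev, current);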

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/ia64/include/asm/cputime.h    |  3 ++-
 arch/ia64/kernel/time.c            |  5 +++--
 arch/powerpc/include/asm/cputime.h |  8 +++++---
 arch/s390/kernel/vtime.c           | 13 +++++++------
 include/linux/vtime.h              | 17 +++++++++++------
 kernel/sched/core.c                |  2 +-
 kernel/sched/cputime.c             | 18 ++++++++++--------
 7 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/arch/ia64/include/asm/cputime.h b/arch/ia64/include/asm/cputime.h
index 3d665c0..0bc90a1 100644
--- a/arch/ia64/include/asm/cputime.h
+++ b/arch/ia64/include/asm/cputime.h
@@ -19,7 +19,8 @@
 #define __IA64_CPUTIME_H
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-extern void arch_vtime_task_switch(struct task_struct *tsk);
+extern void arch_vtime_task_switch(struct task_struct *prev,
+				   struct task_struct *next);
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #endif /* __IA64_CPUTIME_H */
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 46a9798..908bd4f 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -105,10 +105,11 @@ void vtime_flush(struct task_struct *tsk)
  * accumulated times to the current process, and to prepare accounting on
  * the next process.
  */
-void arch_vtime_task_switch(struct task_struct *prev)
+void arch_vtime_task_switch(struct task_struct *prev,
+			    struct task_struct *next)
 {
 	struct thread_info *pi = task_thread_info(prev);
-	struct thread_info *ni = task_thread_info(current);
+	struct thread_info *ni = task_thread_info(next);
 
 	ni->ac_stamp = pi->ac_stamp;
 	ni->ac_stime = ni->ac_utime = 0;
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index ae73dc8..9d68040 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -47,7 +47,8 @@ static inline unsigned long cputime_to_usecs(const cputime_t ct)
  */
 #ifdef CONFIG_PPC64
 #define get_accounting(tsk)	(&get_paca()->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
+static inline void arch_vtime_task_switch(struct task_struct *prev,
+					  struct task_struct *next) { }
 #else
 #define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
 /*
@@ -55,9 +56,10 @@ static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
  * accumulated times to the current process, and to prepare accounting on
  * the next process.
  */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
+static inline void arch_vtime_task_switch(struct task_struct *prev,
+					  struct task_struct *next)
 {
-	struct cpu_accounting_data *acct = get_accounting(current);
+	struct cpu_accounting_data *acct = get_accounting(next);
 	struct cpu_accounting_data *acct0 = get_accounting(prev);
 
 	acct->starttime = acct0->starttime;
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index b6b888d..fcfeb63 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -191,7 +191,8 @@ static int do_account_vtime(struct task_struct *tsk)
 	return virt_timer_forward(user + guest + system + hardirq + softirq);
 }
 
-void vtime_task_switch(struct task_struct *prev)
+void vtime_task_switch(struct task_struct *prev,
+		       struct task_struct *next)
 {
 	do_account_vtime(prev);
 	prev->thread.user_timer = S390_lowcore.user_timer;
@@ -199,11 +200,11 @@ void vtime_task_switch(struct task_struct *prev)
 	prev->thread.system_timer = S390_lowcore.system_timer;
 	prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
 	prev->thread.softirq_timer = S390_lowcore.softirq_timer;
-	S390_lowcore.user_timer = current->thread.user_timer;
-	S390_lowcore.guest_timer = current->thread.guest_timer;
-	S390_lowcore.system_timer = current->thread.system_timer;
-	S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
-	S390_lowcore.softirq_timer = current->thread.softirq_timer;
+	S390_lowcore.user_timer = next->thread.user_timer;
+	S390_lowcore.guest_timer = next->thread.guest_timer;
+	S390_lowcore.system_timer = next->thread.system_timer;
+	S390_lowcore.hardirq_timer = next->thread.hardirq_timer;
+	S390_lowcore.softirq_timer = next->thread.softirq_timer;
 }
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index b4566d5..188eace 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -16,7 +16,8 @@ struct task_struct;
 #if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE)
 
 static inline bool vtime_accounting_enabled_this_cpu(void) { return true; }
-extern void vtime_task_switch(struct task_struct *prev);
+extern void vtime_task_switch(struct task_struct *prev,
+			      struct task_struct *next);
 
 #elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
 
@@ -51,19 +52,22 @@ static inline bool vtime_accounting_enabled_this_cpu(void)
 	return false;
 }
 
-extern void vtime_task_switch_generic(struct task_struct *prev);
+extern void vtime_task_switch_generic(struct task_struct *prev,
+				      struct task_struct *next);
 
-static inline void vtime_task_switch(struct task_struct *prev)
+static inline void vtime_task_switch(struct task_struct *prev,
+				     struct task_struct *next)
 {
 	if (vtime_accounting_enabled_this_cpu())
-		vtime_task_switch_generic(prev);
+		vtime_task_switch_generic(prev, next);
 }
 
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline bool vtime_accounting_enabled_cpu(int cpu) {return false; }
 static inline bool vtime_accounting_enabled_this_cpu(void) { return false; }
-static inline void vtime_task_switch(struct task_struct *prev) { }
+static inline void vtime_task_switch(struct task_struct *prev,
+				     struct task_struct *next) { }
 
 #endif
 
@@ -78,7 +82,8 @@ static inline void vtime_account_kernel(struct task_struct *tsk) { }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void arch_vtime_task_switch(struct task_struct *tsk);
+extern void arch_vtime_task_switch(struct task_struct *prev,
+				   struct task_struct *next);
 extern void vtime_user_enter(struct task_struct *tsk);
 extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8f0437..6e315b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2675,7 +2675,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * transition, resulting in a double drop.
 	 */
 	prev_state = prev->state;
-	vtime_task_switch(prev);
+	vtime_task_switch(prev, current);
 	perf_event_task_sched_in(prev, current);
 	finish_task(prev);
 	finish_lock_switch(rq);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3afde9f..c6a953d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -421,7 +421,8 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
 # ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_task_switch(struct task_struct *prev)
+void vtime_task_switch(struct task_struct *prev,
+		       struct task_struct *next)
 {
 	if (is_idle_task(prev))
 		vtime_account_idle(prev);
@@ -429,7 +430,7 @@ void vtime_task_switch(struct task_struct *prev)
 		vtime_account_kernel(prev);
 
 	vtime_flush(prev);
-	arch_vtime_task_switch(prev);
+	arch_vtime_task_switch(prev, next);
 }
 # endif
 
@@ -848,7 +849,8 @@ void vtime_account_idle(struct task_struct *tsk)
 	account_idle_time(get_vtime_delta(&tsk->vtime));
 }
 
-void vtime_task_switch_generic(struct task_struct *prev)
+void vtime_task_switch_generic(struct task_struct *prev,
+			       struct task_struct *next)
 {
 	struct vtime *vtime = &prev->vtime;
 	struct kernel_cpustat *kcpustat = kcpustat_this_cpu;
@@ -869,7 +871,7 @@ void vtime_task_switch_generic(struct task_struct *prev)
 		write_seqcount_end(&vtime->seqcount);
 	}
 
-	vtime = &current->vtime;
+	vtime = &next->vtime;
 
 	/*
 	 * Ignore the next task if it has been preempted after
@@ -881,18 +883,18 @@ void vtime_task_switch_generic(struct task_struct *prev)
 	}
 
 	write_seqcount_begin(&vtime->seqcount);
-	if (is_idle_task(current))
+	if (is_idle_task(next))
 		vtime->state = VTIME_IDLE;
-	else if (current->flags & PF_VCPU)
+	else if (next->flags & PF_VCPU)
 		vtime->state = VTIME_GUEST;
 	else
 		vtime->state = VTIME_SYS;
 	vtime->starttime = sched_clock();
 	vtime->cpu = smp_processor_id();
-	vtime->nice = (task_nice(current) > 0) ? 1 : 0;
+	vtime->nice = (task_nice(next) > 0) ? 1 : 0;
 	write_seqcount_end(&vtime->seqcount);
 
-	rcu_assign_pointer(kcpustat->curr, current);
+	rcu_assign_pointer(kcpustat->curr, next);
 }
 
 void vtime_init_idle(struct task_struct *t, int cpu)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/25] sched/vtime: Protect idle accounting under vtime seqcount
  2018-11-14  2:45 ` [PATCH 02/25] sched/vtime: Protect idle accounting under vtime seqcount Frederic Weisbecker
@ 2018-11-20 13:19   ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 13:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:45:46AM +0100, Frederic Weisbecker wrote:
> Locking the seqcount on idle vtime accounting wasn't thought to be
> necessary because the readers of idle cputime don't use vtime (yet).
> 
> Now vtime updates are expected to be done under the related seqcount,
> so do it here as well for locking coherency purposes.
> 
> Also idle cputime updates use vtime, but idle cputime readers use the
> traditional ad-hoc nohz time delta. We may want to consider moving
> readers to use vtime to consolidate the overall accounting scheme. The
> seqcount will be a functional requirement for it.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/cputime.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 54eb945..6e74ec2 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -800,7 +800,11 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
>  
>  void vtime_account_idle(struct task_struct *tsk)
>  {
> +	struct vtime *vtime = &tsk->vtime;
> +
> +	write_seqcount_begin(&vtime->seqcount);
>  	account_idle_time(get_vtime_delta(&tsk->vtime));
> +	write_seqcount_end(&vtime->seqcount);
>  }

So this makes switching away from idle more expensive? Also,
vtime_account_system() has this fast-path check in there before taking
that lock; should we not do the same? Or should it be removed from
vtime_account_system()?
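
Something like the following, say; a minimal sketch only, assuming the
vtime_delta()/get_vtime_delta() helpers from kernel/sched/cputime.c are
usable here:

	void vtime_account_idle(struct task_struct *tsk)
	{
		struct vtime *vtime = &tsk->vtime;

		/* Fast path as in vtime_account_system(): nothing
		 * to flush, so skip the seqcount write side. */
		if (!vtime_delta(vtime))
			return;

		write_seqcount_begin(&vtime->seqcount);
		account_idle_time(get_vtime_delta(vtime));
		write_seqcount_end(&vtime->seqcount);
	}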

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/25] vtime: Spare a seqcount lock/unlock cycle on context switch
  2018-11-14  2:45 ` [PATCH 04/25] vtime: Spare a seqcount lock/unlock cycle on context switch Frederic Weisbecker
@ 2018-11-20 13:25   ` Peter Zijlstra
  2019-09-25 14:42     ` Frederic Weisbecker
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 13:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:45:48AM +0100, Frederic Weisbecker wrote:

So I definitely like avoiding that superfluous atomic op, however:

> @@ -730,19 +728,25 @@ static void vtime_account_guest(struct task_struct *tsk,
>  	}
>  }
>  
> +static void __vtime_account_kernel(struct task_struct *tsk,
> +				   struct vtime *vtime)

Your last patch removed a __function, and now you're adding one back :/

>  {
>  	/* We might have scheduled out from guest path */
>  	if (tsk->flags & PF_VCPU)
>  		vtime_account_guest(tsk, vtime);
>  	else
>  		vtime_account_system(tsk, vtime);
> +}
> +
> +void vtime_account_kernel(struct task_struct *tsk)
> +{
> +	struct vtime *vtime = &tsk->vtime;
> +
> +	if (!vtime_delta(vtime))
> +		return;
> +

See here the fast path (is it worth it?)

> +	write_seqcount_begin(&vtime->seqcount);
> +	__vtime_account_kernel(tsk, vtime);
>  	write_seqcount_end(&vtime->seqcount);
>  }

> +void vtime_task_switch_generic(struct task_struct *prev)
>  {
>  	struct vtime *vtime = &prev->vtime;

And observe a distinct lack of that same fast path..

>  
>  	write_seqcount_begin(&vtime->seqcount);
> +	if (is_idle_task(prev))
> +		vtime_account_idle(prev);
> +	else
> +		__vtime_account_kernel(prev, vtime);
>  	vtime->state = VTIME_INACTIVE;
>  	write_seqcount_end(&vtime->seqcount);
>  
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/25] vtime: Exit vtime before exit_notify()
  2018-11-14  2:45 ` [PATCH 08/25] vtime: Exit vtime before exit_notify() Frederic Weisbecker
@ 2018-11-20 13:54   ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 13:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:45:52AM +0100, Frederic Weisbecker wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d458d65..27e0544 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -265,6 +265,8 @@ struct task_cputime {
>  enum vtime_state {
>  	/* Task is sleeping or running in a CPU with VTIME inactive: */
>  	VTIME_INACTIVE = 0,
> +	/* Task has passed exit_notify() */
> +	VTIME_DEAD,

How does it make sense for VTIME_DEAD > VTIME_INACTIVE?

>  	/* Task is idle */
>  	VTIME_IDLE,
>  	/* Task runs in kernelspace in a CPU with VTIME active: */


> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f64afd7..a0c3a82 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -813,17 +813,31 @@ void vtime_task_switch_generic(struct task_struct *prev)
>  {
>  	struct vtime *vtime = &prev->vtime;
>  
> -	write_seqcount_begin(&vtime->seqcount);
> -	if (vtime->state == VTIME_IDLE)
> -		vtime_account_idle(prev);
> -	else
> -		__vtime_account_kernel(prev, vtime);
> -	vtime->state = VTIME_INACTIVE;
> -	vtime->cpu = -1;
> -	write_seqcount_end(&vtime->seqcount);
> +	/*
> +	 * Flush the prev task vtime, unless it has passed
> +	 * vtime_exit_task(), in which case there is nothing
> +	 * left to account.
> +	 */
> +	if (vtime->state != VTIME_DEAD) {
> +		write_seqcount_begin(&vtime->seqcount);
> +		if (vtime->state == VTIME_IDLE)
> +			vtime_account_idle(prev);
> +		else
> +			__vtime_account_kernel(prev, vtime);
> +		vtime->state = VTIME_INACTIVE;
> +		vtime->cpu = -1;
> +		write_seqcount_end(&vtime->seqcount);
> +	}
>  
>  	vtime = &current->vtime;
>  
> +	/*
> +	 * Ignore the next task if it has been preempted after
> +	 * vtime_exit_task().
> +	 */
> +	if (vtime->state == VTIME_DEAD)
> +		return;
> +
>  	write_seqcount_begin(&vtime->seqcount);
>  	if (is_idle_task(current))
>  		vtime->state = VTIME_IDLE;

Bit inconsistent; having the one as an indented block and the other as
an early return.

> @@ -850,6 +864,30 @@ void vtime_init_idle(struct task_struct *t, int cpu)
>  	local_irq_restore(flags);
>  }
>  
> +/*
> + * This is the final settlement point after which we don't account
> + * anymore vtime for this task.
> + */
> +void vtime_exit_task(struct task_struct *t)
> +{
> +	struct vtime *vtime = &t->vtime;
> +	unsigned long flags;

Note that the code in vtime_task_switch_generic() (above) relies on @t
== current (which is true, but not explicit).

> +	local_irq_save(flags);
> +	write_seqcount_begin(&vtime->seqcount);
> +	/*
> +	 * A task that has never run on a nohz_full CPU hasn't
> +	 * been tracked by vtime. Thus it's in VTIME_INACTIVE
> +	 * state. Nothing to account for it.
> +	 */
> +	if (vtime->state != VTIME_INACTIVE)
> +		vtime_account_system(t, vtime);
> +	vtime->state = VTIME_DEAD;
> +	vtime->cpu = -1;
> +	write_seqcount_end(&vtime->seqcount);
> +	local_irq_restore(flags);
> +}
> +
>  u64 task_gtime(struct task_struct *t)
>  {
>  	struct vtime *vtime = &t->vtime;
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 09/25] kcpustat: Track running task following vtime sequences
  2018-11-14  2:45 ` [PATCH 09/25] kcpustat: Track running task following vtime sequences Frederic Weisbecker
@ 2018-11-20 13:58   ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 13:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:45:53AM +0100, Frederic Weisbecker wrote:
> In order to make kcpustat vtime aware (ie: work on nohz_full without
> freezing), we need to track the task running on the CPU so that we can
> fetch its vtime delta and add it to the relevant kcpustat field.
> 
> The most efficient way to track this task is to use RCU. The task is
> assigned on context switch, right after the previous task's vtime has
> been flushed and the next task's vtime has been initialized.
> 
> Things are then ordered this way:
> 
>              WRITER (ctx switch)                READER
>              ------------------            -----------------------
>         vtime_seqcount_write_lock(prev)     rcu_read_lock()
>         //flush prev vtime                  curr = rcu_dereference(kcpustat->curr)
>         vtime_seqcount_write_unlock(prev)   vtime_seqcount_read_start(curr)
>                                             //fetch curr vtime
>         vtime_seqcount_lock(next)           vtime_seqcount_read_end(curr)
>         //Init vtime                        rcu_read_unlock()
>         vtime_seqcount_unlock(next)
> 
>         rcu_assign_pointer(kcpustat->curr, next)
> 
> With this ordering layout, we are sure to read a coherent pair of
> (task cputime, kcpustat) values.

I'm confused; earlier you added a ->cpu member, but I don't see it used
here.

Also, I'm confuddled about the purpose of rcu_assign_pointer(); what does
the store_release therein ensure?

Also, I'm pretty sure the below is terminally broken: task_struct is not
rcu-freed, so the above scenario doesn't hold. Nothing stops the task
from going away right after rcu_dereference().

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 13/25] context_tracking: Introduce context_tracking_enabled_cpu()
  2018-11-14  2:45 ` [PATCH 13/25] context_tracking: Introduce context_tracking_enabled_cpu() Frederic Weisbecker
@ 2018-11-20 14:02   ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 14:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:45:57AM +0100, Frederic Weisbecker wrote:
> This allows us to check if a remote CPU runs context tracking
> (ie: is nohz_full). We'll need that to reliably support "nice"
> accounting on kcpustat.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/context_tracking_state.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index 08f125f..5877177 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -31,6 +31,11 @@ static inline bool context_tracking_enabled(void)
>  	return static_branch_unlikely(&context_tracking_key);
>  }
>  
> +static inline bool context_tracking_enabled_cpu(int cpu)
> +{
> +	return per_cpu(context_tracking.active, cpu);
> +}
> +
>  static inline bool context_tracking_enabled_this_cpu(void)
>  {
>  	return __this_cpu_read(context_tracking.active);

Should not the last two be something like:

static inline bool context_tracking_enabled_cpu(int cpu)
{
	return context_tracking_enabled() && per_cpu(context_tracking.active, cpu);
}

static inline bool context_tracking_enabled_this_cpu(void)
{
	return context_tracking_enabled() && __this_cpu_read(context_tracking.active);
}

Because, afaict, not all callsites first check the static key.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 15/25] sched/vtime: Introduce vtime_accounting_enabled_cpu()
  2018-11-14  2:45 ` [PATCH 15/25] sched/vtime: Introduce vtime_accounting_enabled_cpu() Frederic Weisbecker
@ 2018-11-20 14:04   ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 14:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:45:59AM +0100, Frederic Weisbecker wrote:
> This allows us to check if a remote CPU runs vtime accounting
> (ie: is nohz_full). We'll need that to reliably support "nice"
> accounting on kcpustat.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/vtime.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/include/linux/vtime.h b/include/linux/vtime.h
> index 82fed7a..a53f6ea 100644
> --- a/include/linux/vtime.h
> +++ b/include/linux/vtime.h
> @@ -31,6 +31,16 @@ static inline bool vtime_accounting_enabled(void)
>  	return context_tracking_enabled();
>  }
>  
> +static inline bool vtime_accounting_enabled_cpu(int cpu)
> +{
> +	if (vtime_accounting_enabled()) {
> +		if (context_tracking_enabled_cpu(cpu))
> +			return true;
> +	}
> +
> +	return false;
> +}

static inline bool vtime_account_enabled_cpu(int cpu)
{
	return vtime_account_enabled() && context_tracking_enabled_cpu(cpu);
}

Hmm?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 18/25] vtime: Track nice-ness on top of context switch
  2018-11-14  2:46 ` [PATCH 18/25] vtime: Track nice-ness on top of context switch Frederic Weisbecker
@ 2018-11-20 14:09   ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 14:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:46:02AM +0100, Frederic Weisbecker wrote:
> We need to read the nice value of the task running on any CPU, possibly
> remotely, in order to correctly support kcpustat on nohz_full.
> Unfortunately we can't just read task_nice(tsk) when tsk runs on another
> CPU because its nice value may be concurrently changed. There is a risk
> that a recently modified nice value is believed to have applied for
> longer than it actually did.
> 
> For example, if a task runs from T0 with nice = -10 and its nice value
> is changed to 10 at T0 + 1 second, a reader at T0 + 1 second could think
> that the task has had "nice == 10" since the beginning (T0) and
> spuriously account 1 second of nice time on kcpustat instead of 1 second
> of user time.
> 
> So we need to track nice value changes under the vtime seqcount. Start
> with context switches and account the vtime nice-ness on top of it.

Huh, what!? That doesn't make any sense..

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 19/25] sched/vite: Handle nice updates under vtime
  2018-11-14  2:46 ` [PATCH 19/25] sched/vite: Handle nice updates under vtime Frederic Weisbecker
@ 2018-11-20 14:17   ` Peter Zijlstra
  2018-11-26 15:53     ` Frederic Weisbecker
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 14:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:46:03AM +0100, Frederic Weisbecker wrote:
> On the vtime level, nice updates are currently handled on context
> switches. When a task's nice value gets updated while it is sleeping,
> the context switch takes into account the new nice value in order to
> later record the vtime delta to the appropriate kcpustat index.

Urgh, so this patch should be folded into the previous one. On their own
neither really makes sense.

> We have yet to handle live updates: when set_user_nice() is called
> while the target is running. We'll handle that on two sides:
> 
> * If the caller of set_user_nice() is the current task, we update the
>   vtime state in place.
> 
> * If the target runs on a different CPU, we interrupt it with an IPI to
>   update the vtime state in place.

*groan*... So what are the rules for vtime updates? Who can do that
when?

So when we change nice, we'll have the respective rq locked and task
effectively unqueued. It cannot schedule at such a point. Can
'concurrent' vtime updates still happen?

> The vtime update in question consists in flushing the pending vtime
> delta to the task/kcpustat and resume the accounting on top of the new
> nice value.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f12225f..e8f0437 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3868,6 +3868,7 @@ void set_user_nice(struct task_struct *p, long nice)
>  	int old_prio, delta;
>  	struct rq_flags rf;
>  	struct rq *rq;
> +	long old_nice;
>  
>  	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
>  		return;
> @@ -3878,6 +3879,8 @@ void set_user_nice(struct task_struct *p, long nice)
>  	rq = task_rq_lock(p, &rf);
>  	update_rq_clock(rq);
>  
> +	old_nice = task_nice(p);
> +
>  	/*
>  	 * The RT priorities are set via sched_setscheduler(), but we still
>  	 * allow the 'normal' nice value to be set - but as expected
> @@ -3913,6 +3916,7 @@ void set_user_nice(struct task_struct *p, long nice)
>  	if (running)
>  		set_curr_task(rq, p);
>  out_unlock:
> +	vtime_set_nice(rq, p, old_nice);
>  	task_rq_unlock(rq, p, &rf);
>  }

That's not sufficient; I think you want to hook set_load_weight() or
something. Things like sys_sched_setattr() can also change the nice
value.

>  EXPORT_SYMBOL(set_user_nice);
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 07c2e7f..2b35132 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c

> @@ -937,6 +937,33 @@ void vtime_exit_task(struct task_struct *t)
>  	local_irq_restore(flags);
>  }
>  
> +void vtime_set_nice_local(struct task_struct *t)
> +{
> +	struct vtime *vtime = &t->vtime;
> +
> +	write_seqcount_begin(&vtime->seqcount);
> +	if (vtime->state == VTIME_USER)
> +		vtime_account_user(t, vtime, true);
> +	else if (vtime->state == VTIME_GUEST)
> +		vtime_account_guest(t, vtime, true);
> +	vtime->nice = (task_nice(t) > 0) ? 1 : 0;
> +	write_seqcount_end(&vtime->seqcount);
> +}
> +
> +static void vtime_set_nice_func(struct irq_work *work)
> +{
> +	vtime_set_nice_local(current);
> +}
> +
> +static DEFINE_PER_CPU(struct irq_work, vtime_set_nice_work) = {
> +	.func = vtime_set_nice_func,
> +};
> +
> +void vtime_set_nice_remote(int cpu)
> +{
> +	irq_work_queue_on(&per_cpu(vtime_set_nice_work, cpu), cpu);

What happens if you already had one pending? Do we lose updates?

> +}
> +
>  u64 task_gtime(struct task_struct *t)
>  {
>  	struct vtime *vtime = &t->vtime;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 618577f..c7846ca 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1790,6 +1790,45 @@ static inline int sched_tick_offload_init(void) { return 0; }
>  static inline void sched_update_tick_dependency(struct rq *rq) { }
>  #endif
>  
> +static inline void vtime_set_nice(struct rq *rq,
> +				  struct task_struct *p, long old_nice)
> +{
> +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> +	long nice;
> +	int cpu;
> +
> +	if (!vtime_accounting_enabled())
> +		return;
> +
> +	cpu = cpu_of(rq);
> +
> +	if (!vtime_accounting_enabled_cpu(cpu))
> +		return;
> +
> +	/*
> +	 * Task not running, nice update will be seen by vtime on its
> +	 * next context switch.
> +	 */
> +	if (!task_current(rq, p))
> +		return;
> +
> +	nice = task_nice(p);
> +
> +	/* Task stays nice, still accounted as nice in kcpustat */
> +	if (old_nice > 0 && nice > 0)
> +		return;
> +
> +	/* Task stays rude, still accounted as non-nice in kcpustat */
> +	if (old_nice <= 0 && nice <= 0)
> +		return;
> +
> +	if (p == current)
> +		vtime_set_nice_local(p);
> +	else
> +		vtime_set_nice_remote(cpu);
> +#endif
> +}

That's _far_ too large for an inline I'm thinking. Also, changing nice
really isn't a fast path or anything.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor
  2018-11-14  2:46 ` [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor Frederic Weisbecker
@ 2018-11-20 14:23   ` Peter Zijlstra
  2018-11-20 22:40     ` Frederic Weisbecker
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 14:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:46:04AM +0100, Frederic Weisbecker wrote:

> +void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
> +		      u64 *user, u64 *nice, u64 *system,
> +		      u64 *guest, u64 *guest_nice)
> +{
> +	struct task_struct *curr;
> +	struct vtime *vtime;
> +	int err;
> +
> +	if (!vtime_accounting_enabled()) {
> +		kcpustat_cputime_raw(kcpustat, user, nice,
> +				     system, guest, guest_nice);
> +		return;
> +	}
> +
> +	rcu_read_lock();
> +
> +	do {
> +		curr = rcu_dereference(kcpustat->curr);

Like I explained earlier; I don't think the above is correct.
task_struct is itself not RCU protected.

> +		if (!curr) {
> +			kcpustat_cputime_raw(kcpustat, user, nice,
> +					     system, guest, guest_nice);
> +			break;
> +		}
> +
> +		vtime = &curr->vtime;
> +		err = kcpustat_vtime(kcpustat, vtime, cpu, user,
> +				     nice, system, guest, guest_nice);
> +	} while (err == -EAGAIN);
> +
> +	rcu_read_unlock();
> +}
> +
>  #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 21/25] procfs: Use vtime aware kcpustat accessor
  2018-11-14  2:46 ` [PATCH 21/25] procfs: Use vtime aware " Frederic Weisbecker
@ 2018-11-20 14:24   ` Peter Zijlstra
  2018-11-20 22:31     ` Frederic Weisbecker
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-20 14:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Wed, Nov 14, 2018 at 03:46:05AM +0100, Frederic Weisbecker wrote:
>  		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */

Just a note to let you know the current minimum GCC version is 4.6 :-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 21/25] procfs: Use vtime aware kcpustat accessor
  2018-11-20 14:24   ` Peter Zijlstra
@ 2018-11-20 22:31     ` Frederic Weisbecker
  0 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-20 22:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Tue, Nov 20, 2018 at 03:24:22PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 14, 2018 at 03:46:05AM +0100, Frederic Weisbecker wrote:
> >  		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
> 
> Just a note to let you know the current minimum GCC version is 4.6 :-)

Yeah, comments tend to become desperately conservative over time :-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor
  2018-11-20 14:23   ` Peter Zijlstra
@ 2018-11-20 22:40     ` Frederic Weisbecker
  2018-11-21  8:18       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-20 22:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Tue, Nov 20, 2018 at 03:23:06PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 14, 2018 at 03:46:04AM +0100, Frederic Weisbecker wrote:
> 
> > +void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
> > +		      u64 *user, u64 *nice, u64 *system,
> > +		      u64 *guest, u64 *guest_nice)
> > +{
> > +	struct task_struct *curr;
> > +	struct vtime *vtime;
> > +	int err;
> > +
> > +	if (!vtime_accounting_enabled()) {
> > +		kcpustat_cputime_raw(kcpustat, user, nice,
> > +				     system, guest, guest_nice);
> > +		return;
> > +	}
> > +
> > +	rcu_read_lock();
> > +
> > +	do {
> > +		curr = rcu_dereference(kcpustat->curr);
> 
> Like I explained earlier; I don't think the above is correct.
> task_struct is itself not RCU protected.

But there is at least one put_task_struct() that is enqueued as an RCU callback
on release_task(). This patchset tries to make sure that kcpustat->curr can't
be assigned beyond that point.

Or did I misunderstand something?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor
  2018-11-20 22:40     ` Frederic Weisbecker
@ 2018-11-21  8:18       ` Peter Zijlstra
  2018-11-21  8:35         ` Peter Zijlstra
  2018-11-21 16:33         ` Frederic Weisbecker
  0 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-21  8:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel, Oleg Nesterov

On Tue, Nov 20, 2018 at 11:40:22PM +0100, Frederic Weisbecker wrote:
> On Tue, Nov 20, 2018 at 03:23:06PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 14, 2018 at 03:46:04AM +0100, Frederic Weisbecker wrote:
> > 
> > > +void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
> > > +		      u64 *user, u64 *nice, u64 *system,
> > > +		      u64 *guest, u64 *guest_nice)
> > > +{
> > > +	struct task_struct *curr;
> > > +	struct vtime *vtime;
> > > +	int err;
> > > +
> > > +	if (!vtime_accounting_enabled()) {
> > > +		kcpustat_cputime_raw(kcpustat, user, nice,
> > > +				     system, guest, guest_nice);
> > > +		return;
> > > +	}
> > > +
> > > +	rcu_read_lock();
> > > +
> > > +	do {
> > > +		curr = rcu_dereference(kcpustat->curr);
> > 
> > Like I explained earlier; I don't think the above is correct.
> > task_struct is itself not RCU protected.
> 
> But there is at least one put_task_struct() that is enqueued as an RCU callback
> on release_task(). This patchset tries to make sure that kcpustat->curr can't
> be assigned beyond that point.
> 
> Or did I misunderstand something?

Yeah; release_task() is not the normal exit path. Oleg can probably
remember how all that works, because I always get lost there :-/

In any case, have a look at task_rcu_dereference(), but that still does
not explain the rcu_assign_pointer() stuff you use to set
kcpustat->curr.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor
  2018-11-21  8:18       ` Peter Zijlstra
@ 2018-11-21  8:35         ` Peter Zijlstra
  2018-11-21 16:33         ` Frederic Weisbecker
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-21  8:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel, Oleg Nesterov

On Wed, Nov 21, 2018 at 09:18:19AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 20, 2018 at 11:40:22PM +0100, Frederic Weisbecker wrote:
> > On Tue, Nov 20, 2018 at 03:23:06PM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 14, 2018 at 03:46:04AM +0100, Frederic Weisbecker wrote:
> > > 
> > > > +void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
> > > > +		      u64 *user, u64 *nice, u64 *system,
> > > > +		      u64 *guest, u64 *guest_nice)
> > > > +{
> > > > +	struct task_struct *curr;
> > > > +	struct vtime *vtime;
> > > > +	int err;
> > > > +
> > > > +	if (!vtime_accounting_enabled()) {
> > > > +		kcpustat_cputime_raw(kcpustat, user, nice,
> > > > +				     system, guest, guest_nice);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	do {
> > > > +		curr = rcu_dereference(kcpustat->curr);
> > > 
> > > Like I explained earlier; I don't think the above is correct.
> > > task_struct is itself not RCU protected.
> > 
> > But there is at least one put_task_struct() that is enqueued as an RCU callback
> > on release_task(). This patchset tries to make sure that kcpustat->curr can't
> > be assigned beyond that point.
> > 
> > Or did I misunderstand something?
> 
> Yeah; release_task() is not the normal exit path. Oleg can probably
> remember how all that works, because I always get lost there :-/
> 
> In any case, have a look at task_rcu_dereference(), but that still does
> not explain the rcu_assign_pointer() stuff you use to set
> kcpustat->curr.

Also, why do you need kcpustat->curr at all? The above function has
@cpu, so you could equally use cpu_curr(cpu), no?
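
For reference, cpu_curr() is simply the runqueue's current task:

	/* kernel/sched/sched.h */
	#define cpu_curr(cpu)	(cpu_rq(cpu)->curr)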


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor
  2018-11-21  8:18       ` Peter Zijlstra
  2018-11-21  8:35         ` Peter Zijlstra
@ 2018-11-21 16:33         ` Frederic Weisbecker
  1 sibling, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-21 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel, Oleg Nesterov

On Wed, Nov 21, 2018 at 09:18:19AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 20, 2018 at 11:40:22PM +0100, Frederic Weisbecker wrote:
> > On Tue, Nov 20, 2018 at 03:23:06PM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 14, 2018 at 03:46:04AM +0100, Frederic Weisbecker wrote:
> > > 
> > > > +void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
> > > > +		      u64 *user, u64 *nice, u64 *system,
> > > > +		      u64 *guest, u64 *guest_nice)
> > > > +{
> > > > +	struct task_struct *curr;
> > > > +	struct vtime *vtime;
> > > > +	int err;
> > > > +
> > > > +	if (!vtime_accounting_enabled()) {
> > > > +		kcpustat_cputime_raw(kcpustat, user, nice,
> > > > +				     system, guest, guest_nice);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	do {
> > > > +		curr = rcu_dereference(kcpustat->curr);
> > > 
> > > Like I explained earlier; I don't think the above is correct.
> > > task_struct is itself not RCU protected.
> > 
> > But there is at least one put_task_struct() that is enqueued as an RCU callback
> > on release_task(). This patchset tries to make sure that kcpustat->curr can't
> > be assigned beyond that point.
> > 
> > Or did I misunderstand something?
> 
> Yeah; release_task() is not the normal exit path. Oleg can probably
> remember how all that works, because I always get lost there :-/
> 
> In any case, have a look at task_rcu_dereference(), but that still does
> not explain the rcu_assign_pointer() stuff you use to set
> kcpustat->curr.

So there are two reasons here.

One is the ordering. Here is the write side:


       schedule() {
           rq->curr = next;
	   context_switch() {
               vtime_task_switch() {
	           vtime_seqcount_write_lock(prev);
                   //flush pending vtime
                   vtime_seqcount_write_unlock(prev);

       	           vtime_seqcount_write_lock(next);
                   //Init vtime
                   vtime_seqcount_write_unlock(next);

                   rcu_assign_pointer(this_kcpustat->curr, next);
               }
           }
       }

So now from the read side, if I fetch the task from rq->curr, there is a risk
that I see the next task but don't see the vtime flushed by the prev task yet.
If the prev task has run 1 second without being interrupted, I'm going to miss
that whole time.

Assigning the next task only after the prev task has flushed its vtime prevents
that kind of race, thanks to the barriers implied by the seqcount and by
rcu_assign_pointer().

The second reason is that I looked at task_rcu_dereference() and its guts are really
not appealing. The ordering guarantee is hard to parse there.

At least with kcpustat->curr I can make sure that the pointer is NULL before the
earliest opportunity for call_rcu(delayed_put_task_struct) to be queued (which is
exit_notify() in case of autoreap).
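
In other words, the exit path stops publishing the task before it can be
RCU-freed. A rough sketch of the shape this takes (hypothetical placement;
the actual hooks live in the patches not quoted here):

	/* On the exit path, before exit_notify(). Sketch only. */
	vtime_exit_task(current);	/* flush pending vtime, mark VTIME_DEAD */
	rcu_assign_pointer(kcpustat_this_cpu->curr, NULL);	/* unpublish */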

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 19/25] sched/vite: Handle nice updates under vtime
  2018-11-20 14:17   ` Peter Zijlstra
@ 2018-11-26 15:53     ` Frederic Weisbecker
  2018-11-26 16:11       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-26 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Tue, Nov 20, 2018 at 03:17:54PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 14, 2018 at 03:46:03AM +0100, Frederic Weisbecker wrote:
> > On the vtime level, nice updates are currently handled on context
> > switches. When a task's nice value gets updated while it is sleeping,
> > the context switch takes into account the new nice value in order to
> > later record the vtime delta to the appropriate kcpustat index.
> 
> Urgh, so this patch should be folded into the previous one. On their own
> neither really makes sense.

Indeed, I split the patchset too finely; some pieces need to be folded back.

> 
> > We have yet to handle live updates: when set_user_nice() is called
> > while the target is running. We'll handle that on two sides:
> > 
> > * If the caller of set_user_nice() is the current task, we update the
> >   vtime state in place.
> > 
> > * If the target runs on a different CPU, we interrupt it with an IPI to
> >   update the vtime state in place.
> 
> *groan*... So what are the rules for vtime updates? Who can do that
> when?
> 
> So when we change nice, we'll have the respective rq locked and task
> effectively unqueued. It cannot schedule at such a point. Can
> 'concurrent' vtime updates still happen?

Yes, but that's fine. The target vtime doesn't need to see the update immediately.
All we want is for it to enter nice mode in the near future, hence the use of
a non-waiting IPI.

> 
> > The vtime update in question consists in flushing the pending vtime
> > delta to the task/kcpustat and resume the accounting on top of the new
> > nice value.
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f12225f..e8f0437 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3868,6 +3868,7 @@ void set_user_nice(struct task_struct *p, long nice)
> >  	int old_prio, delta;
> >  	struct rq_flags rf;
> >  	struct rq *rq;
> > +	long old_nice;
> >  
> >  	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
> >  		return;
> > @@ -3878,6 +3879,8 @@ void set_user_nice(struct task_struct *p, long nice)
> >  	rq = task_rq_lock(p, &rf);
> >  	update_rq_clock(rq);
> >  
> > +	old_nice = task_nice(p);
> > +
> >  	/*
> >  	 * The RT priorities are set via sched_setscheduler(), but we still
> >  	 * allow the 'normal' nice value to be set - but as expected
> > @@ -3913,6 +3916,7 @@ void set_user_nice(struct task_struct *p, long nice)
> >  	if (running)
> >  		set_curr_task(rq, p);
> >  out_unlock:
> > +	vtime_set_nice(rq, p, old_nice);
> >  	task_rq_unlock(rq, p, &rf);
> >  }
> 
> That's not sufficient; I think you want to hook set_load_weight() or
> something. Things like sys_sched_setattr() can also change the nice
> value.

Ah good point, I need to check that.

> 
> >  EXPORT_SYMBOL(set_user_nice);
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 07c2e7f..2b35132 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> 
> > @@ -937,6 +937,33 @@ void vtime_exit_task(struct task_struct *t)
> >  	local_irq_restore(flags);
> >  }
> >  
> > +void vtime_set_nice_local(struct task_struct *t)
> > +{
> > +	struct vtime *vtime = &t->vtime;
> > +
> > +	write_seqcount_begin(&vtime->seqcount);
> > +	if (vtime->state == VTIME_USER)
> > +		vtime_account_user(t, vtime, true);
> > +	else if (vtime->state == VTIME_GUEST)
> > +		vtime_account_guest(t, vtime, true);
> > +	vtime->nice = (task_nice(t) > 0) ? 1 : 0;
> > +	write_seqcount_end(&vtime->seqcount);
> > +}
> > +
> > +static void vtime_set_nice_func(struct irq_work *work)
> > +{
> > +	vtime_set_nice_local(current);
> > +}
> > +
> > +static DEFINE_PER_CPU(struct irq_work, vtime_set_nice_work) = {
> > +	.func = vtime_set_nice_func,
> > +};
> > +
> > +void vtime_set_nice_remote(int cpu)
> > +{
> > +	irq_work_queue_on(&per_cpu(vtime_set_nice_work, cpu), cpu);
> 
> What happens if you already had one pending? Do we lose updates?

No: if the irq_work is already pending (queued but not yet executed), it
doesn't get requeued, and the handler is guaranteed to see the freshest update.
(you should trust the code you wrote more ;-)

> 
> > +}
> > +
> >  u64 task_gtime(struct task_struct *t)
> >  {
> >  	struct vtime *vtime = &t->vtime;
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 618577f..c7846ca 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1790,6 +1790,45 @@ static inline int sched_tick_offload_init(void) { return 0; }
> >  static inline void sched_update_tick_dependency(struct rq *rq) { }
> >  #endif
> >  
> > +static inline void vtime_set_nice(struct rq *rq,
> > +				  struct task_struct *p, long old_nice)
> > +{
> > +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> > +	long nice;
> > +	int cpu;
> > +
> > +	if (!vtime_accounting_enabled())
> > +		return;
> > +
> > +	cpu = cpu_of(rq);
> > +
> > +	if (!vtime_accounting_enabled_cpu(cpu))
> > +		return;
> > +
> > +	/*
> > +	 * Task not running, nice update will be seen by vtime on its
> > +	 * next context switch.
> > +	 */
> > +	if (!task_current(rq, p))
> > +		return;
> > +
> > +	nice = task_nice(p);
> > +
> > +	/* Task stays nice, still accounted as nice in kcpustat */
> > +	if (old_nice > 0 && nice > 0)
> > +		return;
> > +
> > +	/* Task stays rude, still accounted as non-nice in kcpustat */
> > +	if (old_nice <= 0 && nice <= 0)
> > +		return;
> > +
> > +	if (p == current)
> > +		vtime_set_nice_local(p);
> > +	else
> > +		vtime_set_nice_remote(cpu);
> > +#endif
> > +}
> 
> That's _far_ too large for an inline I'm thinking. Also, changing nice
> really isn't a fast path or anything.

Agreed, I'll move that to sched/cputime.c

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 19/25] sched/vite: Handle nice updates under vtime
  2018-11-26 15:53     ` Frederic Weisbecker
@ 2018-11-26 16:11       ` Peter Zijlstra
  2018-11-26 18:41         ` Frederic Weisbecker
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-11-26 16:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Mon, Nov 26, 2018 at 04:53:54PM +0100, Frederic Weisbecker wrote:
> > > +	irq_work_queue_on(&per_cpu(vtime_set_nice_work, cpu), cpu);
> > 
> > What happens if you already had one pending? Do we lose updates?
> 
> No: if the irq_work is already pending (queued but not yet executed), it
> doesn't get requeued, and the handler is guaranteed to see the freshest update.
> (you should trust the code you wrote more ;-)

Yeah, I do remember how irq_work works. What I was asking was about how
this specific handler deals with 'missing' updates.

Suppose we start with state A, set it to B and raise the IPI, then set
it to C before the interrupt happens.

That means the irq_work handler will see C and never observe B.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 19/25] sched/vite: Handle nice updates under vtime
  2018-11-26 16:11       ` Peter Zijlstra
@ 2018-11-26 18:41         ` Frederic Weisbecker
  0 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2018-11-26 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

On Mon, Nov 26, 2018 at 05:11:03PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 26, 2018 at 04:53:54PM +0100, Frederic Weisbecker wrote:
> > > > +	irq_work_queue_on(&per_cpu(vtime_set_nice_work, cpu), cpu);
> > > 
> > > What happens if you already had one pending? Do we lose updates?
> > 
> > No: if the irq_work is already pending (queued but not yet executed), it
> > doesn't get requeued, and the handler is guaranteed to see the freshest update.
> > (you should trust the code you wrote more ;-)
> 
> Yeah, I do remember how irq_work works. What I was asking was about how
> this specific handler deals with 'missing' updates.
> 
> Suppose we start with state A, set it to B and raise the IPI, then set
> it to C before the interrupt happens.
> 
> That means the irq_work handler will see C and never observe B.

Ah right, but that's OK: we can miss a few very short intermediate states. In
this case the most recent update is the relevant one.
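
Note that the handler from the quoted patch reads the nice value at
execution time rather than at queueing time, so an intermediate state
like B above is simply never consulted:

	static void vtime_set_nice_func(struct irq_work *work)
	{
		/* vtime_set_nice_local() samples task_nice(current)
		 * here, at IRQ time, so only the latest nice value
		 * gets recorded. */
		vtime_set_nice_local(current);
	}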

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/25] vtime: Spare a seqcount lock/unlock cycle on context switch
  2018-11-20 13:25   ` Peter Zijlstra
@ 2019-09-25 14:42     ` Frederic Weisbecker
  0 siblings, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2019-09-25 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Wanpeng Li, Thomas Gleixner, Yauheni Kaliuta, Ingo Molnar,
	Rik van Riel

Sorry to answer 10 months later. You have certainly lost track of this; I have.
I'm re-issuing this patchset, but in smaller pieces to make the review easier.

In case you still care, I'm answering your comments below. But you can skip
that and wait for the new version that I'm about to post.


On Tue, Nov 20, 2018 at 02:25:12PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 14, 2018 at 03:45:48AM +0100, Frederic Weisbecker wrote:
> 
> So I definitely like avoiding that superfluous atomic op, however:
> 
> > @@ -730,19 +728,25 @@ static void vtime_account_guest(struct task_struct *tsk,
> >  	}
> >  }
> >  
> > +static void __vtime_account_kernel(struct task_struct *tsk,
> > +				   struct vtime *vtime)
> 
> Your last patch removed a __function, and now you're adding one back :/

Yes, in fact I removed a __function to avoid having two in the end.

I can't think of a better name. vtime_account_kernel_locked() maybe,
but it's not even locked, it's a seqcount write.

> 
> >  {
> >  	/* We might have scheduled out from guest path */
> >  	if (tsk->flags & PF_VCPU)
> >  		vtime_account_guest(tsk, vtime);
> >  	else
> >  		vtime_account_system(tsk, vtime);
> > +}
> > +
> > +void vtime_account_kernel(struct task_struct *tsk)
> > +{
> > +	struct vtime *vtime = &tsk->vtime;
> > +
> > +	if (!vtime_delta(vtime))
> > +		return;
> > +
> 
> See here the fast path (is it worth it?)

Might indeed be worth testing whether that fast path is often hit. With
any sensible clock we should have at least a few nanosecs to account.

> 
> > +	write_seqcount_begin(&vtime->seqcount);
> > +	__vtime_account_kernel(tsk, vtime);
> >  	write_seqcount_end(&vtime->seqcount);
> >  }
> 
> > +void vtime_task_switch_generic(struct task_struct *prev)
> >  {
> >  	struct vtime *vtime = &prev->vtime;
> 
> And observe a distinct lack of that same fast path..

Right, but in any case we have to take the seqcount here, since at
least vtime->state has to be set to VTIME_INACTIVE.

> 
> >  
> >  	write_seqcount_begin(&vtime->seqcount);
> > +	if (is_idle_task(prev))
> > +		vtime_account_idle(prev);
> > +	else
> > +		__vtime_account_kernel(prev, vtime);
> >  	vtime->state = VTIME_INACTIVE;
> >  	write_seqcount_end(&vtime->seqcount);
> >  
> > -- 
> > 2.7.4
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2019-09-25 14:42 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-14  2:45 [PATCH 00/25] sched/nohz: Make kcpustat vtime aware (Fix kcpustat on nohz_full) Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 01/25] sched/vtime: Fix guest/system mis-accounting on task switch Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 02/25] sched/vtime: Protect idle accounting under vtime seqcount Frederic Weisbecker
2018-11-20 13:19   ` Peter Zijlstra
2018-11-14  2:45 ` [PATCH 03/25] vtime: Rename vtime_account_system() to vtime_account_kernel() Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 04/25] vtime: Spare a seqcount lock/unlock cycle on context switch Frederic Weisbecker
2018-11-20 13:25   ` Peter Zijlstra
2019-09-25 14:42     ` Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 05/25] sched/vtime: Record CPU under seqcount for kcpustat needs Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 06/25] sched/cputime: Add vtime idle task state Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 07/25] sched/cputime: Add vtime guest " Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 08/25] vtime: Exit vtime before exit_notify() Frederic Weisbecker
2018-11-20 13:54   ` Peter Zijlstra
2018-11-14  2:45 ` [PATCH 09/25] kcpustat: Track running task following vtime sequences Frederic Weisbecker
2018-11-20 13:58   ` Peter Zijlstra
2018-11-14  2:45 ` [PATCH 10/25] context_tracking: Remove context_tracking_active() Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 11/25] context_tracking: s/context_tracking_is_enabled/context_tracking_enabled() Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 12/25] context_tracking: Rename context_tracking_is_cpu_enabled() to context_tracking_enabled_this_cpu() Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 13/25] context_tracking: Introduce context_tracking_enabled_cpu() Frederic Weisbecker
2018-11-20 14:02   ` Peter Zijlstra
2018-11-14  2:45 ` [PATCH 14/25] sched/vtime: Rename vtime_accounting_cpu_enabled() to vtime_accounting_enabled_this_cpu() Frederic Weisbecker
2018-11-14  2:45 ` [PATCH 15/25] sched/vtime: Introduce vtime_accounting_enabled_cpu() Frederic Weisbecker
2018-11-20 14:04   ` Peter Zijlstra
2018-11-14  2:46 ` [PATCH 16/25] sched/cputime: Allow to pass cputime index on user/guest accounting Frederic Weisbecker
2018-11-14  2:46 ` [PATCH 17/25] sched/cputime: Standardize the kcpustat index based accounting functions Frederic Weisbecker
2018-11-14  2:46 ` [PATCH 18/25] vtime: Track nice-ness on top of context switch Frederic Weisbecker
2018-11-20 14:09   ` Peter Zijlstra
2018-11-14  2:46 ` [PATCH 19/25] sched/vite: Handle nice updates under vtime Frederic Weisbecker
2018-11-20 14:17   ` Peter Zijlstra
2018-11-26 15:53     ` Frederic Weisbecker
2018-11-26 16:11       ` Peter Zijlstra
2018-11-26 18:41         ` Frederic Weisbecker
2018-11-14  2:46 ` [PATCH 20/25] sched/kcpustat: Introduce vtime-aware kcpustat accessor Frederic Weisbecker
2018-11-20 14:23   ` Peter Zijlstra
2018-11-20 22:40     ` Frederic Weisbecker
2018-11-21  8:18       ` Peter Zijlstra
2018-11-21  8:35         ` Peter Zijlstra
2018-11-21 16:33         ` Frederic Weisbecker
2018-11-14  2:46 ` [PATCH 21/25] procfs: Use vtime aware " Frederic Weisbecker
2018-11-20 14:24   ` Peter Zijlstra
2018-11-20 22:31     ` Frederic Weisbecker
2018-11-14  2:46 ` [PATCH 22/25] cpufreq: " Frederic Weisbecker
2018-11-14  2:46 ` [PATCH 23/25] leds: Use vtime aware kcpustat accessors Frederic Weisbecker
2018-11-14  2:46 ` [PATCH 24/25] rackmeter: " Frederic Weisbecker
2018-11-14  2:46 ` [PATCH 25/25] sched/vtime: Clarify vtime_task_switch() argument layout Frederic Weisbecker
