linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware
@ 2019-11-06  3:07 Frederic Weisbecker
  2019-11-06  3:07 ` [PATCH 1/9] sched/cputime: Allow to pass cputime index on user/guest accounting Frederic Weisbecker
                   ` (9 more replies)
  0 siblings, 10 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

We have fixed CPUTIME_SYSTEM on nohz_full; now it's the turn of
CPUTIME_USER, CPUTIME_NICE, CPUTIME_GUEST and CPUTIME_GUEST_NICE.

The tricky piece was to handle "nice-ness" correctly. I think I addressed
most of the review comments peterz made (one year ago now) in
https://lore.kernel.org/lkml/20181120141754.GW2131@hirez.programming.kicks-ass.net/
Among them:

* Merge patches that didn't make sense together
* Clarify the changelog (I hope) and add comments
* Also hook nice updates on __setscheduler_params()
* Uninline a big function

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	nohz/kcpustat-v3

HEAD: cfd377b0f261f1107ba5f3d2cf68a8ac5f359943

Thanks,
	Frederic
---

Frederic Weisbecker (9):
      sched/cputime: Allow to pass cputime index on user/guest accounting
      sched/cputime: Standardize the kcpustat index based accounting functions
      sched/vtime: Handle nice updates under vtime
      sched/cputime: Support other fields on kcpustat_field()
      sched/vtime: Bring all-in-one kcpustat accessor for vtime fields
      procfs: Use all-in-one vtime aware kcpustat accessor
      cpufreq: Use vtime aware kcpustat accessors for user time
      leds: Use all-in-one vtime aware kcpustat accessor
      rackmeter: Use vtime aware kcpustat accessor


 arch/ia64/kernel/time.c                 |   6 +-
 arch/powerpc/kernel/time.c              |   6 +-
 arch/s390/kernel/vtime.c                |   2 +-
 drivers/cpufreq/cpufreq.c               |  13 +-
 drivers/cpufreq/cpufreq_governor.c      |   6 +-
 drivers/leds/trigger/ledtrig-activity.c |   9 +-
 drivers/macintosh/rack-meter.c          |   7 +-
 fs/proc/stat.c                          |  20 +-
 include/linux/kernel_stat.h             |  25 ++-
 include/linux/sched.h                   |   1 +
 include/linux/vtime.h                   |   2 +
 kernel/sched/core.c                     |  16 +-
 kernel/sched/cputime.c                  | 378 ++++++++++++++++++++++++++------
 kernel/sched/sched.h                    |  18 ++
 14 files changed, 412 insertions(+), 97 deletions(-)


* [PATCH 1/9] sched/cputime: Allow to pass cputime index on user/guest accounting
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
@ 2019-11-06  3:07 ` Frederic Weisbecker
  2019-11-06  3:08 ` [PATCH 2/9] sched/cputime: Standardize the kcpustat index based accounting functions Frederic Weisbecker
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

When we account user or guest cputime, we add the delta either to the
nice fields or to the normal fields of kcpustat, depending on the nice
value of the task passed as parameter.

Since we are going to track nice-ness from vtime instead, we'll need to
be able to pass a custom kcpustat destination index to the accounting
functions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 50 +++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index e0cd20693ef5..738ed7db615e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -109,6 +109,20 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	cgroup_account_cputime_field(p, index, tmp);
 }
 
+static void account_user_time_index(struct task_struct *p,
+				    u64 cputime, enum cpu_usage_stat index)
+{
+	/* Add user time to process. */
+	p->utime += cputime;
+	account_group_user_time(p, cputime);
+
+	/* Add user time to cpustat. */
+	task_group_account_field(p, index, cputime);
+
+	/* Account for user time used */
+	acct_account_cputime(p);
+}
+
 /*
  * Account user CPU time to a process.
  * @p: the process that the CPU time gets accounted to
@@ -116,27 +130,14 @@ static inline void task_group_account_field(struct task_struct *p, int index,
  */
 void account_user_time(struct task_struct *p, u64 cputime)
 {
-	int index;
-
-	/* Add user time to process. */
-	p->utime += cputime;
-	account_group_user_time(p, cputime);
+	enum cpu_usage_stat index;
 
 	index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
-
-	/* Add user time to cpustat. */
-	task_group_account_field(p, index, cputime);
-
-	/* Account for user time used */
-	acct_account_cputime(p);
+	account_user_time_index(p, cputime, index);
 }
 
-/*
- * Account guest CPU time to a process.
- * @p: the process that the CPU time gets accounted to
- * @cputime: the CPU time spent in virtual machine since the last update
- */
-void account_guest_time(struct task_struct *p, u64 cputime)
+static void account_guest_time_index(struct task_struct *p,
+				     u64 cputime, enum cpu_usage_stat index)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 
@@ -146,7 +147,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
 	p->gtime += cputime;
 
 	/* Add guest time to cpustat. */
-	if (task_nice(p) > 0) {
+	if (index == CPUTIME_GUEST_NICE) {
 		cpustat[CPUTIME_NICE] += cputime;
 		cpustat[CPUTIME_GUEST_NICE] += cputime;
 	} else {
@@ -155,6 +156,19 @@ void account_guest_time(struct task_struct *p, u64 cputime)
 	}
 }
 
+/*
+ * Account guest CPU time to a process.
+ * @p: the process that the CPU time gets accounted to
+ * @cputime: the CPU time spent in virtual machine since the last update
+ */
+void account_guest_time(struct task_struct *p, u64 cputime)
+{
+	enum cpu_usage_stat index;
+
+	index = (task_nice(p) > 0) ? CPUTIME_GUEST_NICE : CPUTIME_GUEST;
+	account_guest_time_index(p, cputime, index);
+}
+
 /*
  * Account system CPU time to a process and desired cpustat field
  * @p: the process that the CPU time gets accounted to
-- 
2.23.0



* [PATCH 2/9] sched/cputime: Standardize the kcpustat index based accounting functions
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
  2019-11-06  3:07 ` [PATCH 1/9] sched/cputime: Allow to pass cputime index on user/guest accounting Frederic Weisbecker
@ 2019-11-06  3:08 ` Frederic Weisbecker
  2019-11-06  3:08 ` [PATCH 3/9] sched/vtime: Handle nice updates under vtime Frederic Weisbecker
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:08 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

Sanitize the functions that do cputime accounting with custom
kcpustat indexes:

* Rename account_system_index_time() to account_system_time_index() to
  comply with account_guest/user_time_index()

* Use the proper enum cpu_usage_stat type in account_system_time()

* Reorder task_group_account_field()'s parameters to comply with those of
  account_*_time_index()

* Rename task_group_account_field()'s tmp parameter to cputime

* Make the type of task_group_account_field()'s index parameter explicit:
  enum cpu_usage_stat

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/ia64/kernel/time.c     |  6 +++---
 arch/powerpc/kernel/time.c  |  6 +++---
 arch/s390/kernel/vtime.c    |  2 +-
 include/linux/kernel_stat.h |  2 +-
 kernel/sched/cputime.c      | 22 +++++++++++-----------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 91b4024c9351..836e17c8b004 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -79,17 +79,17 @@ void vtime_flush(struct task_struct *tsk)
 
 	if (ti->stime) {
 		delta = cycle_to_nsec(ti->stime);
-		account_system_index_time(tsk, delta, CPUTIME_SYSTEM);
+		account_system_time_index(tsk, delta, CPUTIME_SYSTEM);
 	}
 
 	if (ti->hardirq_time) {
 		delta = cycle_to_nsec(ti->hardirq_time);
-		account_system_index_time(tsk, delta, CPUTIME_IRQ);
+		account_system_time_index(tsk, delta, CPUTIME_IRQ);
 	}
 
 	if (ti->softirq_time) {
 		delta = cycle_to_nsec(ti->softirq_time);
-		account_system_index_time(tsk, delta, CPUTIME_SOFTIRQ);
+		account_system_time_index(tsk, delta, CPUTIME_SOFTIRQ);
 	}
 
 	ti->utime = 0;
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 84827da01d45..f03469fe398b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -418,14 +418,14 @@ void vtime_flush(struct task_struct *tsk)
 		account_idle_time(cputime_to_nsecs(acct->idle_time));
 
 	if (acct->stime)
-		account_system_index_time(tsk, cputime_to_nsecs(acct->stime),
+		account_system_time_index(tsk, cputime_to_nsecs(acct->stime),
 					  CPUTIME_SYSTEM);
 
 	if (acct->hardirq_time)
-		account_system_index_time(tsk, cputime_to_nsecs(acct->hardirq_time),
+		account_system_time_index(tsk, cputime_to_nsecs(acct->hardirq_time),
 					  CPUTIME_IRQ);
 	if (acct->softirq_time)
-		account_system_index_time(tsk, cputime_to_nsecs(acct->softirq_time),
+		account_system_time_index(tsk, cputime_to_nsecs(acct->softirq_time),
 					  CPUTIME_SOFTIRQ);
 
 	vtime_flush_scaled(tsk, acct);
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 8df10d3c8f6c..3428f28d3df1 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -115,7 +115,7 @@ static void account_system_index_scaled(struct task_struct *p, u64 cputime,
 					enum cpu_usage_stat index)
 {
 	p->stimescaled += cputime_to_nsecs(scale_vtime(cputime));
-	account_system_index_time(p, cputime_to_nsecs(cputime), index);
+	account_system_time_index(p, cputime_to_nsecs(cputime), index);
 }
 
 /*
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 79781196eb25..1b9b97f6946e 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -92,7 +92,7 @@ static inline u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 extern void account_user_time(struct task_struct *, u64);
 extern void account_guest_time(struct task_struct *, u64);
 extern void account_system_time(struct task_struct *, int, u64);
-extern void account_system_index_time(struct task_struct *, u64,
+extern void account_system_time_index(struct task_struct *, u64,
 				      enum cpu_usage_stat);
 extern void account_steal_time(u64);
 extern void account_idle_time(u64);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 738ed7db615e..687b8684e480 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -95,8 +95,8 @@ static u64 irqtime_tick_accounted(u64 dummy)
 
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
-static inline void task_group_account_field(struct task_struct *p, int index,
-					    u64 tmp)
+static inline void task_group_account_field(struct task_struct *p, u64 cputime,
+					    enum cpu_usage_stat index)
 {
 	/*
 	 * Since all updates are sure to touch the root cgroup, we
@@ -104,9 +104,9 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	 * is the only cgroup, then nothing else should be necessary.
 	 *
 	 */
-	__this_cpu_add(kernel_cpustat.cpustat[index], tmp);
+	__this_cpu_add(kernel_cpustat.cpustat[index], cputime);
 
-	cgroup_account_cputime_field(p, index, tmp);
+	cgroup_account_cputime_field(p, index, cputime);
 }
 
 static void account_user_time_index(struct task_struct *p,
@@ -117,7 +117,7 @@ static void account_user_time_index(struct task_struct *p,
 	account_group_user_time(p, cputime);
 
 	/* Add user time to cpustat. */
-	task_group_account_field(p, index, cputime);
+	task_group_account_field(p, cputime, index);
 
 	/* Account for user time used */
 	acct_account_cputime(p);
@@ -175,7 +175,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
  * @cputime: the CPU time spent in kernel space since the last update
  * @index: pointer to cpustat field that has to be updated
  */
-void account_system_index_time(struct task_struct *p,
+void account_system_time_index(struct task_struct *p,
 			       u64 cputime, enum cpu_usage_stat index)
 {
 	/* Add system time to process. */
@@ -183,7 +183,7 @@ void account_system_index_time(struct task_struct *p,
 	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
-	task_group_account_field(p, index, cputime);
+	task_group_account_field(p, cputime, index);
 
 	/* Account for system time used */
 	acct_account_cputime(p);
@@ -197,7 +197,7 @@ void account_system_index_time(struct task_struct *p,
  */
 void account_system_time(struct task_struct *p, int hardirq_offset, u64 cputime)
 {
-	int index;
+	enum cpu_usage_stat index;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime);
@@ -211,7 +211,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset, u64 cputime)
 	else
 		index = CPUTIME_SYSTEM;
 
-	account_system_index_time(p, cputime, index);
+	account_system_time_index(p, cputime, index);
 }
 
 /*
@@ -392,7 +392,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		 * So, we have to handle it separately here.
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
-		account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
+		account_system_time_index(p, cputime, CPUTIME_SOFTIRQ);
 	} else if (user_tick) {
 		account_user_time(p, cputime);
 	} else if (p == rq->idle) {
@@ -400,7 +400,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	} else if (p->flags & PF_VCPU) { /* System time or guest time */
 		account_guest_time(p, cputime);
 	} else {
-		account_system_index_time(p, cputime, CPUTIME_SYSTEM);
+		account_system_time_index(p, cputime, CPUTIME_SYSTEM);
 	}
 }
 
-- 
2.23.0



* [PATCH 3/9] sched/vtime: Handle nice updates under vtime
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
  2019-11-06  3:07 ` [PATCH 1/9] sched/cputime: Allow to pass cputime index on user/guest accounting Frederic Weisbecker
  2019-11-06  3:08 ` [PATCH 2/9] sched/cputime: Standardize the kcpustat index based accounting functions Frederic Weisbecker
@ 2019-11-06  3:08 ` Frederic Weisbecker
  2019-11-15 10:14   ` Peter Zijlstra
  2019-11-15 10:16   ` Peter Zijlstra
  2019-11-06  3:08 ` [PATCH 4/9] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:08 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

The cputime nice-ness is determined by checking the target's nice value
at cputime accounting time. Under vtime this happens on context switch,
user exit and guest exit. But nice value updates while the target is
running are not taken into account.

So if a task runs tickless for 10 seconds in userspace but gets reniced
after 5 seconds from -1 (not nice) to 1 (nice), on user exit vtime will
account the whole 10 seconds as CPUTIME_NICE, because it only sees the
latest nice value at accounting time, which here is 1. That's wrong:
5 seconds should be accounted as CPUTIME_USER and 5 seconds as
CPUTIME_NICE.

In order to solve this, we now handle nice updates in three cases:

* If the nice updater is the current task, although we are in kernel
  mode there can be pending user or guest time in the cache to flush
  under the prior nice state. Account these if any. Also toggle the
  vtime nice flag for further user/guest cputime accounting.

* If the target runs on a different CPU, we interrupt it with an IPI to
  update the vtime state in place. If the task is running in user or
  guest, the pending cputime is accounted under the prior nice state.
  Then the vtime nice flag is toggled for further user/guest cputime
  accounting.

* When a task's nice value gets updated while it is sleeping, the next
  context switch takes into account the new nice value in order to
  later record the vtime delta to the appropriate kcpustat index.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h  |   1 +
 include/linux/vtime.h  |   2 +
 kernel/sched/core.c    |  16 +++--
 kernel/sched/cputime.c | 136 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h   |  18 ++++++
 5 files changed, 159 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 988c4da00c31..44552b74bd44 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -262,6 +262,7 @@ enum vtime_state {
 struct vtime {
 	seqcount_t		seqcount;
 	unsigned long long	starttime;
+	int			nice;
 	enum vtime_state	state;
 	unsigned int		cpu;
 	u64			utime;
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 2cdeca062db3..eb2fa3c00f3e 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -9,6 +9,7 @@
 
 
 struct task_struct;
+struct rq;
 
 /*
  * vtime_accounting_enabled_this_cpu() definitions/declarations
@@ -74,6 +75,7 @@ extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
 extern void vtime_guest_exit(struct task_struct *tsk);
 extern void vtime_init_idle(struct task_struct *tsk, int cpu);
+extern void __vtime_set_nice(struct rq *rq, struct task_struct *tsk, int old_prio);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN  */
 static inline void vtime_user_enter(struct task_struct *tsk) { }
 static inline void vtime_user_exit(struct task_struct *tsk) { }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7880f4f64d0e..fa56a1bdd5d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4488,6 +4488,7 @@ void set_user_nice(struct task_struct *p, long nice)
 	int old_prio, delta;
 	struct rq_flags rf;
 	struct rq *rq;
+	int old_static;
 
 	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
 		return;
@@ -4498,6 +4499,8 @@ void set_user_nice(struct task_struct *p, long nice)
 	rq = task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
+	old_static = p->static_prio;
+
 	/*
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
@@ -4532,7 +4535,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	}
 	if (running)
 		set_next_task(rq, p);
+
 out_unlock:
+	vtime_set_nice(rq, p, old_static);
 	task_rq_unlock(rq, p, &rf);
 }
 EXPORT_SYMBOL(set_user_nice);
@@ -4668,8 +4673,8 @@ static struct task_struct *find_process_by_pid(pid_t pid)
  */
 #define SETPARAM_POLICY	-1
 
-static void __setscheduler_params(struct task_struct *p,
-		const struct sched_attr *attr)
+static void __setscheduler_params(struct rq *rq, struct task_struct *p,
+				  const struct sched_attr *attr)
 {
 	int policy = attr->sched_policy;
 
@@ -4680,8 +4685,11 @@ static void __setscheduler_params(struct task_struct *p,
 
 	if (dl_policy(policy))
 		__setparam_dl(p, attr);
-	else if (fair_policy(policy))
+	else if (fair_policy(policy)) {
+		int old_prio = p->static_prio;
 		p->static_prio = NICE_TO_PRIO(attr->sched_nice);
+		vtime_set_nice(rq, p, old_prio);
+	}
 
 	/*
 	 * __sched_setscheduler() ensures attr->sched_priority == 0 when
@@ -4704,7 +4712,7 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
 	if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
 		return;
 
-	__setscheduler_params(p, attr);
+	__setscheduler_params(rq, p, attr);
 
 	/*
 	 * Keep a potential priority boosting if called from
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 687b8684e480..e09194415c35 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -733,16 +733,60 @@ static void vtime_account_system(struct task_struct *tsk,
 	}
 }
 
+static void vtime_account_guest_delta(struct task_struct *tsk,
+				      struct vtime *vtime,
+				      u64 delta, u64 thresh)
+{
+	enum cpu_usage_stat index;
+
+	vtime->gtime += delta;
+
+	if (vtime->gtime < thresh)
+		return;
+
+	if (vtime->nice)
+		index = CPUTIME_GUEST_NICE;
+	else
+		index = CPUTIME_GUEST;
+
+	account_guest_time_index(tsk, vtime->gtime, index);
+	vtime->gtime = 0;
+}
+
 static void vtime_account_guest(struct task_struct *tsk,
 				struct vtime *vtime)
 {
-	vtime->gtime += get_vtime_delta(vtime);
-	if (vtime->gtime >= TICK_NSEC) {
-		account_guest_time(tsk, vtime->gtime);
-		vtime->gtime = 0;
-	}
+	vtime_account_guest_delta(tsk, vtime, get_vtime_delta(vtime), TICK_NSEC);
 }
 
+static void vtime_account_user_delta(struct task_struct *tsk,
+				     struct vtime *vtime,
+				     u64 delta, u64 thresh)
+{
+	enum cpu_usage_stat index;
+
+	vtime->utime += delta;
+
+	if (vtime->utime < thresh)
+		return;
+
+	if (vtime->nice)
+		index = CPUTIME_NICE;
+	else
+		index = CPUTIME_USER;
+
+	account_user_time_index(tsk, vtime->utime, index);
+	vtime->utime = 0;
+
+}
+
+static void vtime_account_user(struct task_struct *tsk,
+			       struct vtime *vtime)
+{
+	vtime_account_user_delta(tsk, vtime, get_vtime_delta(vtime), TICK_NSEC);
+}
+
+
 static void __vtime_account_kernel(struct task_struct *tsk,
 				   struct vtime *vtime)
 {
@@ -780,11 +824,7 @@ void vtime_user_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	vtime->utime += get_vtime_delta(vtime);
-	if (vtime->utime >= TICK_NSEC) {
-		account_user_time(tsk, vtime->utime);
-		vtime->utime = 0;
-	}
+	vtime_account_user(tsk, vtime);
 	vtime->state = VTIME_SYS;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -848,6 +888,7 @@ void vtime_task_switch_generic(struct task_struct *prev)
 		vtime->state = VTIME_SYS;
 	vtime->starttime = sched_clock();
 	vtime->cpu = smp_processor_id();
+	vtime->nice = (task_nice(current) > 0) ? 1 : 0;
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -865,6 +906,81 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_restore(flags);
 }
 
+static void vtime_set_nice_local(void)
+{
+	struct task_struct *t = current;
+	struct vtime *vtime = &t->vtime;
+	u64 utime = 0, gtime = 0;
+
+	write_seqcount_begin(&vtime->seqcount);
+
+	if (vtime->state == VTIME_USER)
+		utime += get_vtime_delta(vtime);
+	else if (vtime->state == VTIME_GUEST)
+		gtime += get_vtime_delta(vtime);
+
+	vtime_account_user_delta(t, vtime, utime, 1);
+	vtime_account_guest_delta(t, vtime, gtime, 1);
+	vtime->nice = (task_nice(t) > 0) ? 1 : 0;
+	write_seqcount_end(&vtime->seqcount);
+}
+
+static void vtime_set_nice_remote(struct irq_work *work)
+{
+	struct vtime *vtime = &current->vtime;
+	int nice = task_nice(current);
+	/*
+	 * Make sure nice state actually toggled. We might have raced in
+	 * 2 ways:
+	 *
+	 * - Reniced Task has scheduled out before the IPI reached out,
+	 *   vtime->nice will be evaluated next time it schedules in.
+	 *
+	 * - Another nice update happened before the IPI reached out
+	 *   that cancelled the need for an update. We don't care if a
+	 *   few nanoseconds of cputime are misaccounted.
+	 */
+	if (nice > 0 && vtime->nice)
+		return;
+	if (nice <= 0 && !vtime->nice)
+		return;
+
+	vtime_set_nice_local();
+}
+
+static DEFINE_PER_CPU(struct irq_work, vtime_set_nice_work) = {
+	.func = vtime_set_nice_remote,
+};
+
+void __vtime_set_nice(struct rq *rq, struct task_struct *p, int old_prio)
+{
+	int nice, old_nice = PRIO_TO_NICE(old_prio);
+	int cpu = cpu_of(rq);
+
+	lockdep_assert_held(&rq->lock);
+	/*
+	 * Task not running, nice update will be seen by vtime on its
+	 * next context switch.
+	 */
+	if (!task_current(rq, p))
+		return;
+
+	nice = task_nice(p);
+
+	/* Task stays nice, still accounted as nice in kcpustat */
+	if (old_nice > 0 && nice > 0)
+		return;
+
+	/* Task stays rude, still accounted as non-nice in kcpustat */
+	if (old_nice <= 0 && nice <= 0)
+		return;
+
+	if (p == current)
+		vtime_set_nice_local();
+	else
+		irq_work_queue_on(&per_cpu(vtime_set_nice_work, cpu), cpu);
+}
+
 u64 task_gtime(struct task_struct *t)
 {
 	struct vtime *vtime = &t->vtime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0db2c1b3361e..b870042e65ae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1895,6 +1895,24 @@ static inline int sched_tick_offload_init(void) { return 0; }
 static inline void sched_update_tick_dependency(struct rq *rq) { }
 #endif
 
+static inline void vtime_set_nice(struct rq *rq,
+				  struct task_struct *p, int old_prio)
+{
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+	int cpu;
+
+	if (!vtime_accounting_enabled())
+		return;
+
+	cpu = cpu_of(rq);
+
+	if (!vtime_accounting_enabled_cpu(cpu))
+		return;
+
+	__vtime_set_nice(rq, p, old_prio);
+#endif
+}
+
 static inline void add_nr_running(struct rq *rq, unsigned count)
 {
 	unsigned prev_nr = rq->nr_running;
-- 
2.23.0



* [PATCH 4/9] sched/cputime: Support other fields on kcpustat_field()
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2019-11-06  3:08 ` [PATCH 3/9] sched/vtime: Handle nice updates under vtime Frederic Weisbecker
@ 2019-11-06  3:08 ` Frederic Weisbecker
  2019-11-06  3:08 ` [PATCH 5/9] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields Frederic Weisbecker
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:08 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

Now that we support nice updates under vtime accounting, we can provide
tickless values for kcpustat readers of user and guest time.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index e09194415c35..bf4b61f71194 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -1042,6 +1042,15 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 
+static u64 kcpustat_user_vtime(struct vtime *vtime)
+{
+	if (vtime->state == VTIME_USER)
+		return vtime->utime + vtime_delta(vtime);
+	else if (vtime->state == VTIME_GUEST)
+		return vtime->gtime + vtime_delta(vtime);
+	return 0;
+}
+
 static int kcpustat_field_vtime(u64 *cpustat,
 				struct vtime *vtime,
 				enum cpu_usage_stat usage,
@@ -1076,9 +1085,30 @@ static int kcpustat_field_vtime(u64 *cpustat,
 
 		*val = cpustat[usage];
 
-		if (vtime->state == VTIME_SYS)
-			*val += vtime->stime + vtime_delta(vtime);
-
+		switch (usage) {
+		case CPUTIME_SYSTEM:
+			if (vtime->state == VTIME_SYS)
+				*val += vtime->stime + vtime_delta(vtime);
+			break;
+		case CPUTIME_USER:
+			if (!vtime->nice)
+				*val += kcpustat_user_vtime(vtime);
+			break;
+		case CPUTIME_NICE:
+			if (vtime->nice)
+				*val += kcpustat_user_vtime(vtime);
+			break;
+		case CPUTIME_GUEST:
+			if (vtime->state == VTIME_GUEST && !vtime->nice)
+				*val += vtime->gtime + vtime_delta(vtime);
+			break;
+		case CPUTIME_GUEST_NICE:
+			if (vtime->state == VTIME_GUEST && vtime->nice)
+				*val += vtime->gtime + vtime_delta(vtime);
+			break;
+		default:
+			break;
+		}
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
 	return 0;
@@ -1095,10 +1125,6 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 	if (!vtime_accounting_enabled_cpu(cpu))
 		return cpustat[usage];
 
-	/* Only support sys vtime for now */
-	if (usage != CPUTIME_SYSTEM)
-		return cpustat[usage];
-
 	rq = cpu_rq(cpu);
 
 	for (;;) {
-- 
2.23.0



* [PATCH 5/9] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2019-11-06  3:08 ` [PATCH 4/9] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
@ 2019-11-06  3:08 ` Frederic Weisbecker
  2019-11-06  3:08 ` [PATCH 6/9] procfs: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:08 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

Many callsites want to fetch the values of the system, user, user_nice,
guest or guest_nice kcpustat fields all at once, or at least a pair of
them.

In that case, calling kcpustat_field() for each requested field brings
unnecessary overhead when we could fetch all of them in one go.

So provide kcpustat_cputime(), which fetches all vtime-sensitive fields
under the same RCU and seqcount block.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/kernel_stat.h |  23 ++++++
 kernel/sched/cputime.c      | 138 ++++++++++++++++++++++++++++++------
 2 files changed, 141 insertions(+), 20 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1b9b97f6946e..c76daad2d8e2 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -78,15 +78,38 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
+
+static inline void kcpustat_cputime_raw(u64 *cpustat, u64 *user, u64 *nice,
+					u64 *system, u64 *guest, u64 *guest_nice)
+{
+	*user = cpustat[CPUTIME_USER];
+	*nice = cpustat[CPUTIME_NICE];
+	*system = cpustat[CPUTIME_SYSTEM];
+	*guest = cpustat[CPUTIME_GUEST];
+	*guest_nice = cpustat[CPUTIME_GUEST_NICE];
+}
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 			  enum cpu_usage_stat usage, int cpu);
+extern void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+			     u64 *user, u64 *nice, u64 *system,
+			     u64 *guest, u64 *guest_nice);
 #else
 static inline u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 				 enum cpu_usage_stat usage, int cpu)
 {
 	return kcpustat->cpustat[usage];
 }
+
+static inline void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+				    u64 *user, u64 *nice, u64 *system,
+				    u64 *guest, u64 *guest_nice)
+{
+	kcpustat_cputime_raw(kcpustat->cpustat, user, nice,
+			     system, guest, guest_nice);
+}
+
 #endif
 
 extern void account_user_time(struct task_struct *, u64);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index bf4b61f71194..0006dfccbeb7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -1042,6 +1042,30 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 
+static int vtime_state_check(struct vtime *vtime, int cpu)
+{
+	/*
+	 * We raced against context switch, fetch the
+	 * kcpustat task again.
+	 */
+	if (vtime->cpu != cpu && vtime->cpu != -1)
+		return -EAGAIN;
+
+	/*
+	 * Two possible things here:
+	 * 1) We are seeing the scheduling out task (prev) or any past one.
+	 * 2) We are seeing the scheduling in task (next) but it hasn't
+	 *    passed through vtime_task_switch() yet so the pending
+	 *    cputime of the prev task may not be flushed yet.
+	 *
+	 * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
+	 */
+	if (vtime->state == VTIME_INACTIVE)
+		return -EAGAIN;
+
+	return 0;
+}
+
 static u64 kcpustat_user_vtime(struct vtime *vtime)
 {
 	if (vtime->state == VTIME_USER)
@@ -1062,26 +1086,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
 	do {
 		seq = read_seqcount_begin(&vtime->seqcount);
 
-		/*
-		 * We raced against context switch, fetch the
-		 * kcpustat task again.
-		 */
-		if (vtime->cpu != cpu && vtime->cpu != -1)
-			return -EAGAIN;
-
-		/*
-		 * Two possible things here:
-		 * 1) We are seeing the scheduling out task (prev) or any past one.
-		 * 2) We are seeing the scheduling in task (next) but it hasn't
-		 *    passed though vtime_task_switch() yet so the pending
-		 *    cputime of the prev task may not be flushed yet.
-		 *
-		 * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
-		 */
-		if (vtime->state == VTIME_INACTIVE)
-			return -EAGAIN;
-
-		err = 0;
+		err = vtime_state_check(vtime, cpu);
+		if (err < 0)
+			return err;
 
 		*val = cpustat[usage];
 
@@ -1149,4 +1156,95 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 	}
 }
 EXPORT_SYMBOL_GPL(kcpustat_field);
+
+static int kcpustat_cputime_vtime(u64 *cpustat, struct vtime *vtime,
+				  int cpu, u64 *user, u64 *nice,
+				  u64 *system, u64 *guest, u64 *guest_nice)
+{
+	unsigned int seq;
+	u64 delta;
+	int err;
+
+	do {
+		seq = read_seqcount_begin(&vtime->seqcount);
+
+		err = vtime_state_check(vtime, cpu);
+		if (err < 0)
+			return err;
+
+		kcpustat_cputime_raw(cpustat, user, nice,
+				     system, guest, guest_nice);
+
+		/* Task is sleeping, dead or idle, nothing to add */
+		if (vtime->state < VTIME_SYS)
+			continue;
+
+		delta = vtime_delta(vtime);
+
+		/*
+		 * Task runs either in user (including guest) or kernel space,
+		 * add pending nohz time to the right place.
+		 */
+		if (vtime->state == VTIME_SYS) {
+			*system += vtime->stime + delta;
+		} else if (vtime->state == VTIME_USER) {
+			if (vtime->nice)
+				*nice += vtime->utime + delta;
+			else
+				*user += vtime->utime + delta;
+		} else {
+			WARN_ON_ONCE(vtime->state != VTIME_GUEST);
+			if (vtime->nice) {
+				*guest_nice += vtime->gtime + delta;
+				*nice += vtime->gtime + delta;
+			} else {
+				*guest += vtime->gtime + delta;
+				*user += vtime->gtime + delta;
+			}
+		}
+	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return err;
+}
+
+void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+		      u64 *user, u64 *nice, u64 *system,
+		      u64 *guest, u64 *guest_nice)
+{
+	u64 *cpustat = kcpustat->cpustat;
+	struct rq *rq;
+	int err;
+
+	if (!vtime_accounting_enabled_cpu(cpu)) {
+		kcpustat_cputime_raw(cpustat, user, nice,
+				     system, guest, guest_nice);
+		return;
+	}
+
+	rq = cpu_rq(cpu);
+
+	for (;;) {
+		struct task_struct *curr;
+
+		rcu_read_lock();
+		curr = rcu_dereference(rq->curr);
+		if (WARN_ON_ONCE(!curr)) {
+			rcu_read_unlock();
+			kcpustat_cputime_raw(cpustat, user, nice,
+					     system, guest, guest_nice);
+			return;
+		}
+
+		err = kcpustat_cputime_vtime(cpustat, &curr->vtime, cpu, user,
+					     nice, system, guest, guest_nice);
+		rcu_read_unlock();
+
+		if (!err)
+			return;
+
+		cpu_relax();
+	}
+}
+EXPORT_SYMBOL_GPL(kcpustat_cputime);
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/9] procfs: Use all-in-one vtime aware kcpustat accessor
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2019-11-06  3:08 ` [PATCH 5/9] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields Frederic Weisbecker
@ 2019-11-06  3:08 ` Frederic Weisbecker
  2019-11-06  3:08 ` [PATCH 7/9] cpufreq: Use vtime aware kcpustat accessors for user time Frederic Weisbecker
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:08 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

Now that we can also read user and guest time safely under vtime, use
the relevant accessor to fix frozen kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 fs/proc/stat.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 5c6bd0ae3802..b2ee5418dece 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -120,18 +120,21 @@ static int show_stat(struct seq_file *p, void *v)
 	getboottime64(&boottime);
 
 	for_each_possible_cpu(i) {
+		u64 cpu_user, cpu_nice, cpu_sys, cpu_guest, cpu_guest_nice;
 		struct kernel_cpustat *kcs = &kcpustat_cpu(i);
 
-		user += kcs->cpustat[CPUTIME_USER];
-		nice += kcs->cpustat[CPUTIME_NICE];
-		system += kcpustat_field(kcs, CPUTIME_SYSTEM, i);
+		kcpustat_cputime(kcs, i, &cpu_user, &cpu_nice,
+				 &cpu_sys, &cpu_guest, &cpu_guest_nice);
+		user += cpu_user;
+		nice += cpu_nice;
+		system += cpu_sys;
 		idle += get_idle_time(kcs, i);
 		iowait += get_iowait_time(kcs, i);
 		irq += kcs->cpustat[CPUTIME_IRQ];
 		softirq += kcs->cpustat[CPUTIME_SOFTIRQ];
 		steal += kcs->cpustat[CPUTIME_STEAL];
-		guest += kcs->cpustat[CPUTIME_GUEST];
-		guest_nice += kcs->cpustat[CPUTIME_GUEST_NICE];
+		guest += cpu_guest;
+		guest_nice += cpu_guest_nice;
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -159,17 +162,14 @@ static int show_stat(struct seq_file *p, void *v)
 	for_each_online_cpu(i) {
 		struct kernel_cpustat *kcs = &kcpustat_cpu(i);
 
+		kcpustat_cputime(kcs, i, &user, &nice,
+				 &system, &guest, &guest_nice);
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kcs->cpustat[CPUTIME_USER];
-		nice = kcs->cpustat[CPUTIME_NICE];
-		system = kcpustat_field(kcs, CPUTIME_SYSTEM, i);
 		idle = get_idle_time(kcs, i);
 		iowait = get_iowait_time(kcs, i);
 		irq = kcs->cpustat[CPUTIME_IRQ];
 		softirq = kcs->cpustat[CPUTIME_SOFTIRQ];
 		steal = kcs->cpustat[CPUTIME_STEAL];
-		guest = kcs->cpustat[CPUTIME_GUEST];
-		guest_nice = kcs->cpustat[CPUTIME_GUEST_NICE];
 		seq_printf(p, "cpu%d", i);
 		seq_put_decimal_ull(p, " ", nsec_to_clock_t(user));
 		seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 7/9] cpufreq: Use vtime aware kcpustat accessors for user time
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2019-11-06  3:08 ` [PATCH 6/9] procfs: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
@ 2019-11-06  3:08 ` Frederic Weisbecker
  2019-11-06  3:08 ` [PATCH 8/9] leds: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:08 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

We can now safely read user and guest kcpustat fields on nohz_full CPUs.
Use the appropriate accessors.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
---
 drivers/cpufreq/cpufreq.c          | 13 +++++++++----
 drivers/cpufreq/cpufreq_governor.c |  6 +++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index a37ebfd7e0e8..cdc44a854ae1 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -113,18 +113,23 @@ EXPORT_SYMBOL_GPL(get_governor_parent_kobj);
 
 static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
 {
-	u64 idle_time;
+	struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
+	u64 user, nice, sys, guest, guest_nice;
 	u64 cur_wall_time;
+	u64 idle_time;
 	u64 busy_time;
 
 	cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());
 
-	busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
-	busy_time += kcpustat_field(&kcpustat_cpu(cpu), CPUTIME_SYSTEM, cpu);
+	kcpustat_cputime(kcpustat, cpu, &user, &nice, &sys,
+			 &guest, &guest_nice);
+
+	busy_time = user;
+	busy_time += sys;
 	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
 	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
 	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+	busy_time += nice;
 
 	idle_time = cur_wall_time - busy_time;
 	if (wall)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 4bb054d0cb43..f99ae45efaea 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -105,7 +105,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 			j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
 								  dbs_data->io_is_busy);
 			if (dbs_data->ignore_nice_load)
-				j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+				j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 		}
 	}
 }
@@ -149,7 +149,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
 		j_cdbs->prev_cpu_idle = cur_idle_time;
 
 		if (ignore_nice) {
-			u64 cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+			u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 
 			idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
 			j_cdbs->prev_cpu_nice = cur_nice;
@@ -530,7 +530,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 		j_cdbs->prev_load = 0;
 
 		if (ignore_nice)
-			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+			j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 	}
 
 	gov->start(policy);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 8/9] leds: Use all-in-one vtime aware kcpustat accessor
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2019-11-06  3:08 ` [PATCH 7/9] cpufreq: Use vtime aware kcpustat accessors for user time Frederic Weisbecker
@ 2019-11-06  3:08 ` Frederic Weisbecker
  2019-11-06  3:08 ` [PATCH 9/9] rackmeter: Use " Frederic Weisbecker
  2019-11-14 15:25 ` [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:08 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

We can now safely read user kcpustat fields on nohz_full CPUs.
Use the appropriate accessor.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 drivers/leds/trigger/ledtrig-activity.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c
index ddfc5edd07c8..20250de1138d 100644
--- a/drivers/leds/trigger/ledtrig-activity.c
+++ b/drivers/leds/trigger/ledtrig-activity.c
@@ -57,9 +57,12 @@ static void led_activity_function(struct timer_list *t)
 	curr_used = 0;
 
 	for_each_possible_cpu(i) {
-		curr_used += kcpustat_cpu(i).cpustat[CPUTIME_USER]
-			  +  kcpustat_cpu(i).cpustat[CPUTIME_NICE]
-			  +  kcpustat_field(&kcpustat_cpu(i), CPUTIME_SYSTEM, i)
+		u64 user, nice, sys, guest, guest_nice;
+
+		kcpustat_cputime(&kcpustat_cpu(i), i, &user, &nice, &sys,
+				 &guest, &guest_nice);
+
+		curr_used += user + nice + sys
 			  +  kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]
 			  +  kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
 		cpus++;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 9/9] rackmeter: Use vtime aware kcpustat accessor
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2019-11-06  3:08 ` [PATCH 8/9] leds: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
@ 2019-11-06  3:08 ` Frederic Weisbecker
  2019-11-14 15:25 ` [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-06  3:08 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Pavel Machek, Benjamin Herrenschmidt, Thomas Gleixner,
	Yauheni Kaliuta, Rafael J . Wysocki, Viresh Kumar, Rik van Riel

Now that we have a vtime-safe kcpustat accessor, use it to fetch
CPUTIME_NICE and fix frozen kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 drivers/macintosh/rack-meter.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 4134e580f786..60311e8d6240 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -81,13 +81,14 @@ static int rackmeter_ignore_nice;
  */
 static inline u64 get_cpu_idle_time(unsigned int cpu)
 {
+	struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
 	u64 retval;
 
-	retval = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] +
-		 kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
+	retval = kcpustat->cpustat[CPUTIME_IDLE] +
+		 kcpustat->cpustat[CPUTIME_IOWAIT];
 
 	if (rackmeter_ignore_nice)
-		retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+		retval += kcpustat_field(kcpustat, CPUTIME_NICE, cpu);
 
 	return retval;
 }
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware
  2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2019-11-06  3:08 ` [PATCH 9/9] rackmeter: Use " Frederic Weisbecker
@ 2019-11-14 15:25 ` Frederic Weisbecker
  9 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-14 15:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Jacek Anaszewski, Wanpeng Li, Pavel Machek,
	Benjamin Herrenschmidt, Thomas Gleixner, Yauheni Kaliuta,
	Rafael J . Wysocki, Viresh Kumar, Rik van Riel

On Wed, Nov 06, 2019 at 04:07:58AM +0100, Frederic Weisbecker wrote:
> So we fixed CPUTIME_SYSTEM on nohz_full, now is the turn for
> CPUTIME_USER, CPUTIME_NICE, CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> 
> The tricky piece was to handle "nice-ness" correctly. I think I addressed
> most reviews (one year ago now) from peterz
> (https://lore.kernel.org/lkml/20181120141754.GW2131@hirez.programming.kicks-ass.net/)
> Among which:
> 
> * Merge patches that didn't make sense together
> * Clarify the changelog (I hope) and add comments
> * Also hook nice updates on __setscheduler_params()
> * Uninline a big function

Gentle ping :)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/9] sched/vtime: Handle nice updates under vtime
  2019-11-06  3:08 ` [PATCH 3/9] sched/vtime: Handle nice updates under vtime Frederic Weisbecker
@ 2019-11-15 10:14   ` Peter Zijlstra
  2019-11-15 10:16   ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2019-11-15 10:14 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Jacek Anaszewski, Wanpeng Li, Pavel Machek,
	Benjamin Herrenschmidt, Thomas Gleixner, Yauheni Kaliuta,
	Rafael J . Wysocki, Viresh Kumar, Rik van Riel

On Wed, Nov 06, 2019 at 04:08:01AM +0100, Frederic Weisbecker wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7880f4f64d0e..fa56a1bdd5d8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4488,6 +4488,7 @@ void set_user_nice(struct task_struct *p, long nice)
>  	int old_prio, delta;
>  	struct rq_flags rf;
>  	struct rq *rq;
> +	int old_static;
>  
>  	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
>  		return;
> @@ -4498,6 +4499,8 @@ void set_user_nice(struct task_struct *p, long nice)
>  	rq = task_rq_lock(p, &rf);
>  	update_rq_clock(rq);
>  
> +	old_static = p->static_prio;
> +
>  	/*
>  	 * The RT priorities are set via sched_setscheduler(), but we still
>  	 * allow the 'normal' nice value to be set - but as expected
> @@ -4532,7 +4535,9 @@ void set_user_nice(struct task_struct *p, long nice)
>  	}
>  	if (running)
>  		set_next_task(rq, p);
> +
>  out_unlock:
> +	vtime_set_nice(rq, p, old_static);
>  	task_rq_unlock(rq, p, &rf);
>  }
>  EXPORT_SYMBOL(set_user_nice);
> @@ -4668,8 +4673,8 @@ static struct task_struct *find_process_by_pid(pid_t pid)
>   */
>  #define SETPARAM_POLICY	-1
>  
> -static void __setscheduler_params(struct task_struct *p,
> -		const struct sched_attr *attr)
> +static void __setscheduler_params(struct rq *rq, struct task_struct *p,
> +				  const struct sched_attr *attr)
>  {
>  	int policy = attr->sched_policy;
>  
> @@ -4680,8 +4685,11 @@ static void __setscheduler_params(struct task_struct *p,
>  
>  	if (dl_policy(policy))
>  		__setparam_dl(p, attr);
> -	else if (fair_policy(policy))
> +	else if (fair_policy(policy)) {
> +		int old_prio = p->static_prio;
>  		p->static_prio = NICE_TO_PRIO(attr->sched_nice);
> +		vtime_set_nice(rq, p, old_prio);
> +	}
>  
>  	/*
>  	 * __sched_setscheduler() ensures attr->sched_priority == 0 when
> @@ -4704,7 +4712,7 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
>  	if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
>  		return;
>  
> -	__setscheduler_params(p, attr);
> +	__setscheduler_params(rq, p, attr);
>  
>  	/*
>  	 * Keep a potential priority boosting if called from

Would it make sense to add a helper like:

static void
__sched_set_nice(struct rq *rq, struct task_struct *t, long nice)
{
	int old_prio;

	old_prio = t->static_prio;
	t->static_prio = NICE_TO_PRIO(nice);
	vtime_set_nice(rq, t, old_prio);
}


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/9] sched/vtime: Handle nice updates under vtime
  2019-11-06  3:08 ` [PATCH 3/9] sched/vtime: Handle nice updates under vtime Frederic Weisbecker
  2019-11-15 10:14   ` Peter Zijlstra
@ 2019-11-15 10:16   ` Peter Zijlstra
  2019-11-15 10:18     ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2019-11-15 10:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Jacek Anaszewski, Wanpeng Li, Pavel Machek,
	Benjamin Herrenschmidt, Thomas Gleixner, Yauheni Kaliuta,
	Rafael J . Wysocki, Viresh Kumar, Rik van Riel

On Wed, Nov 06, 2019 at 04:08:01AM +0100, Frederic Weisbecker wrote:
> The cputime niceness is determined while checking the target's nice value
> at cputime accounting time. Under vtime this happens on context switches,
> user exit and guest exit. But nice value updates while the target is
> running are not taken into account.
> 
> So if a task runs tickless for 10 seconds in userspace but it has been
> reniced after 5 seconds from -1 (not nice) to 1 (nice), on user exit
> vtime will account the whole 10 seconds as CPUTIME_NICE because it only
> sees the latest nice value at accounting time which is 1 here. Yet it's
> wrong, 5 seconds should be accounted as CPUTIME_USER and 5 seconds as
> CPUTIME_NICE.
> 
> In order to solve this, we now cover nice updates within three cases:
> 
> * If the nice updater is the current task, although we are in kernel
>   mode there can be pending user or guest time in the cache to flush
>   under the prior nice state. Account these if any. Also toggle the
>   vtime nice flag for further user/guest cputime accounting.
> 
> * If the target runs on a different CPU, we interrupt it with an IPI to
>   update the vtime state in place. If the task is running in user or
>   guest, the pending cputime is accounted under the prior nice state.
>   Then the vtime nice flag is toggled for further user/guest cputime
>   accounting.

But but but, I thought the idea was to _never_ send interrupts to
NOHZ_FULL cpus ?!?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/9] sched/vtime: Handle nice updates under vtime
  2019-11-15 10:16   ` Peter Zijlstra
@ 2019-11-15 10:18     ` Peter Zijlstra
  2019-11-15 15:27       ` Frederic Weisbecker
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2019-11-15 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Jacek Anaszewski, Wanpeng Li, Pavel Machek,
	Benjamin Herrenschmidt, Thomas Gleixner, Yauheni Kaliuta,
	Rafael J . Wysocki, Viresh Kumar, Rik van Riel

On Fri, Nov 15, 2019 at 11:16:48AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2019 at 04:08:01AM +0100, Frederic Weisbecker wrote:
> > The cputime niceness is determined while checking the target's nice value
> > at cputime accounting time. Under vtime this happens on context switches,
> > user exit and guest exit. But nice value updates while the target is
> > running are not taken into account.
> > 
> > So if a task runs tickless for 10 seconds in userspace but it has been
> > reniced after 5 seconds from -1 (not nice) to 1 (nice), on user exit
> > vtime will account the whole 10 seconds as CPUTIME_NICE because it only
> > sees the latest nice value at accounting time which is 1 here. Yet it's
> > wrong, 5 seconds should be accounted as CPUTIME_USER and 5 seconds as
> > CPUTIME_NICE.
> > 
> > In order to solve this, we now cover nice updates within three cases:
> > 
> > * If the nice updater is the current task, although we are in kernel
> >   mode there can be pending user or guest time in the cache to flush
> >   under the prior nice state. Account these if any. Also toggle the
> >   vtime nice flag for further user/guest cputime accounting.
> > 
> > * If the target runs on a different CPU, we interrupt it with an IPI to
> >   update the vtime state in place. If the task is running in user or
> >   guest, the pending cputime is accounted under the prior nice state.
> >   Then the vtime nice flag is toggled for further user/guest cputime
> >   accounting.
> 
> But but but, I thought the idea was to _never_ send interrupts to
> NOHZ_FULL cpus ?!?

That is, isn't the cure worse than the problem? I mean, who bloody cares
about silly accounting crud more than not getting interrupts on their
NOHZ_FULL cpus.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/9] sched/vtime: Handle nice updates under vtime
  2019-11-15 10:18     ` Peter Zijlstra
@ 2019-11-15 15:27       ` Frederic Weisbecker
  2019-11-18 13:04         ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2019-11-15 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Jacek Anaszewski, Wanpeng Li, Pavel Machek,
	Benjamin Herrenschmidt, Thomas Gleixner, Yauheni Kaliuta,
	Rafael J . Wysocki, Viresh Kumar, Rik van Riel

On Fri, Nov 15, 2019 at 11:18:31AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 15, 2019 at 11:16:48AM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 06, 2019 at 04:08:01AM +0100, Frederic Weisbecker wrote:
> > > The cputime niceness is determined while checking the target's nice value
> > > at cputime accounting time. Under vtime this happens on context switches,
> > > user exit and guest exit. But nice value updates while the target is
> > > running are not taken into account.
> > > 
> > > So if a task runs tickless for 10 seconds in userspace but it has been
> > > reniced after 5 seconds from -1 (not nice) to 1 (nice), on user exit
> > > vtime will account the whole 10 seconds as CPUTIME_NICE because it only
> > > sees the latest nice value at accounting time which is 1 here. Yet it's
> > > wrong, 5 seconds should be accounted as CPUTIME_USER and 5 seconds as
> > > CPUTIME_NICE.
> > > 
> > > In order to solve this, we now cover nice updates within three cases:
> > > 
> > > * If the nice updater is the current task, although we are in kernel
> > >   mode there can be pending user or guest time in the cache to flush
> > >   under the prior nice state. Account these if any. Also toggle the
> > >   vtime nice flag for further user/guest cputime accounting.
> > > 
> > > * If the target runs on a different CPU, we interrupt it with an IPI to
> > >   update the vtime state in place. If the task is running in user or
> > >   guest, the pending cputime is accounted under the prior nice state.
> > >   Then the vtime nice flag is toggled for further user/guest cputime
> > >   accounting.
> > 
> > But but but, I thought the idea was to _never_ send interrupts to
> > NOHZ_FULL cpus ?!?
> 
> That is, isn't the cure worse than the problem? I mean, who bloody cares
> about silly accounting crud more than not getting interrupts on their
> NOHZ_FULL cpus.

Yeah indeed. I tend to sacrifice everything for correctness, but perhaps
we can live with small issues like nice time not being accounted to the
right place if that avoids disturbing nohz_full CPUs. Also, who cares
about renicing a task that is supposed to run alone.

So here is what I can do: I'll make a simplified version of that set
which accounts on top of the task_nice() value found at accounting time
(context switch, user exit, guest exit), and if some user ever
complains, I can still bring back the IPI solution.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/9] sched/vtime: Handle nice updates under vtime
  2019-11-15 15:27       ` Frederic Weisbecker
@ 2019-11-18 13:04         ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2019-11-18 13:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Jacek Anaszewski, Wanpeng Li, Pavel Machek,
	Benjamin Herrenschmidt, Thomas Gleixner, Yauheni Kaliuta,
	Rafael J . Wysocki, Viresh Kumar, Rik van Riel

On Fri, Nov 15, 2019 at 04:27:04PM +0100, Frederic Weisbecker wrote:
> On Fri, Nov 15, 2019 at 11:18:31AM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 15, 2019 at 11:16:48AM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 06, 2019 at 04:08:01AM +0100, Frederic Weisbecker wrote:
> > > > The cputime niceness is determined while checking the target's nice value
> > > > at cputime accounting time. Under vtime this happens on context switches,
> > > > user exit and guest exit. But nice value updates while the target is
> > > > running are not taken into account.
> > > > 
> > > > So if a task runs tickless for 10 seconds in userspace but it has been
> > > > reniced after 5 seconds from -1 (not nice) to 1 (nice), on user exit
> > > > vtime will account the whole 10 seconds as CPUTIME_NICE because it only
> > > > sees the latest nice value at accounting time which is 1 here. Yet it's
> > > > wrong, 5 seconds should be accounted as CPUTIME_USER and 5 seconds as
> > > > CPUTIME_NICE.
> > > > 
> > > > In order to solve this, we now cover nice updates within three cases:
> > > > 
> > > > * If the nice updater is the current task, although we are in kernel
> > > >   mode there can be pending user or guest time in the cache to flush
> > > >   under the prior nice state. Account these if any. Also toggle the
> > > >   vtime nice flag for further user/guest cputime accounting.
> > > > 
> > > > * If the target runs on a different CPU, we interrupt it with an IPI to
> > > >   update the vtime state in place. If the task is running in user or
> > > >   guest, the pending cputime is accounted under the prior nice state.
> > > >   Then the vtime nice flag is toggled for further user/guest cputime
> > > >   accounting.
> > > 
> > > But but but, I thought the idea was to _never_ send interrupts to
> > > NOHZ_FULL cpus ?!?
> > 
> > That is, isn't the cure worse than the problem? I mean, who bloody cares
> > about silly accounting crud more than not getting interrupts on their
> > NOHZ_FULL cpus.
> 
> Yeah indeed. I tend to sacrifice everything for correctness but perhaps we can live with
> small issues like nice accounting not being accounted to the right place if that
> can avoid disturbing nohz_full CPUs. Also who cares about renicing a task that is
> supposed to run alone.
> 
> So here is what I can do: I'll make a simplified version of that set which accounts
> on top of the task_nice() value found on accounting time (context switch, user exit,
> guest exit) and if some user ever complains, I can still bring back that IPI solution.

Makes sense, thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-11-18 13:05 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-06  3:07 [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker
2019-11-06  3:07 ` [PATCH 1/9] sched/cputime: Allow to pass cputime index on user/guest accounting Frederic Weisbecker
2019-11-06  3:08 ` [PATCH 2/9] sched/cputime: Standardize the kcpustat index based accounting functions Frederic Weisbecker
2019-11-06  3:08 ` [PATCH 3/9] sched/vtime: Handle nice updates under vtime Frederic Weisbecker
2019-11-15 10:14   ` Peter Zijlstra
2019-11-15 10:16   ` Peter Zijlstra
2019-11-15 10:18     ` Peter Zijlstra
2019-11-15 15:27       ` Frederic Weisbecker
2019-11-18 13:04         ` Peter Zijlstra
2019-11-06  3:08 ` [PATCH 4/9] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
2019-11-06  3:08 ` [PATCH 5/9] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields Frederic Weisbecker
2019-11-06  3:08 ` [PATCH 6/9] procfs: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
2019-11-06  3:08 ` [PATCH 7/9] cpufreq: Use vtime aware kcpustat accessors for user time Frederic Weisbecker
2019-11-06  3:08 ` [PATCH 8/9] leds: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
2019-11-06  3:08 ` [PATCH 9/9] rackmeter: Use " Frederic Weisbecker
2019-11-14 15:25 ` [PATCH 0/9] sched/nohz: Make the rest of kcpustat vtime aware Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).