linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v2
@ 2019-11-19 23:22 Frederic Weisbecker
  2019-11-19 23:22 ` [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-19 23:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

(See https://lore.kernel.org/lkml/20191106030807.31091-1-frederic@kernel.org/
for the record)

After review from Peter, I eventually gave up on the idea of fixing
the nice fields of kcpustat. Therefore, if a nice update happens on a
task while it runs on a nohz_full CPU, the whole cputime will be
accounted under the nice value observed at accounting time, hence a
possibly inaccurate distribution between the nice and unnice kcpustat
fields.

But that's a lesser evil compared to interrupting a nohz_full CPU, which
would be required to fix it. Also, users of nohz_full shouldn't care
about nice at all, since a single task is expected to run on the CPU.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	nohz/kcpustat-v4

HEAD: dd59f186dc21e164097f659b4d60815e232f9555

Thanks,
	Frederic
---

Frederic Weisbecker (6):
      sched/cputime: Support other fields on kcpustat_field()
      sched/vtime: Bring all-in-one kcpustat accessor for vtime fields
      procfs: Use all-in-one vtime aware kcpustat accessor
      cpufreq: Use vtime aware kcpustat accessors for user time
      leds: Use all-in-one vtime aware kcpustat accessor
      rackmeter: Use vtime aware kcpustat accessor


 drivers/cpufreq/cpufreq.c               |  13 ++-
 drivers/cpufreq/cpufreq_governor.c      |   6 +-
 drivers/leds/trigger/ledtrig-activity.c |   9 +-
 drivers/macintosh/rack-meter.c          |   7 +-
 fs/proc/stat.c                          |  20 ++--
 include/linux/kernel_stat.h             |  23 ++++
 kernel/sched/cputime.c                  | 192 ++++++++++++++++++++++++++------
 7 files changed, 216 insertions(+), 54 deletions(-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field()
  2019-11-19 23:22 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v2 Frederic Weisbecker
@ 2019-11-19 23:22 ` Frederic Weisbecker
  2019-11-20 11:51   ` Ingo Molnar
  2019-11-19 23:22 ` [PATCH 2/6] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields Frederic Weisbecker
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-19 23:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

Provide support for user, nice, guest and guest_nice fields through
kcpustat_field().

Whether we account the delta to a nice or non-nice field is decided
based on the nice value snapshot taken at the time kcpustat_field() is
called. If the nice value of the task has changed since the last vtime
update, the distribution between nice and unnice cputime may be
inaccurate.

However, this is considered a minor issue compared to the proper fix,
which would involve interrupting the target on nice updates; that is
undesired on nohz_full CPUs.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 53 +++++++++++++++++++++++++++++++++---------
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index e0cd20693ef5..b2cf544e2109 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -912,11 +912,21 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 
+static u64 kcpustat_user_vtime(struct vtime *vtime)
+{
+	if (vtime->state == VTIME_USER)
+		return vtime->utime + vtime_delta(vtime);
+	else if (vtime->state == VTIME_GUEST)
+		return vtime->gtime + vtime_delta(vtime);
+	return 0;
+}
+
 static int kcpustat_field_vtime(u64 *cpustat,
-				struct vtime *vtime,
+				struct task_struct *tsk,
 				enum cpu_usage_stat usage,
 				int cpu, u64 *val)
 {
+	struct vtime *vtime = &tsk->vtime;
 	unsigned int seq;
 	int err;
 
@@ -946,9 +956,36 @@ static int kcpustat_field_vtime(u64 *cpustat,
 
 		*val = cpustat[usage];
 
-		if (vtime->state == VTIME_SYS)
-			*val += vtime->stime + vtime_delta(vtime);
-
+		/*
+		 * Nice VS unnice cputime accounting may be inaccurate if
+		 * the nice value has changed since the last vtime update.
+		 * But proper fix would involve interrupting target on nice
+		 * updates which is a no go on nohz_full.
+		 */
+		switch (usage) {
+		case CPUTIME_SYSTEM:
+			if (vtime->state == VTIME_SYS)
+				*val += vtime->stime + vtime_delta(vtime);
+			break;
+		case CPUTIME_USER:
+			if (task_nice(tsk) <= 0)
+				*val += kcpustat_user_vtime(vtime);
+			break;
+		case CPUTIME_NICE:
+			if (task_nice(tsk) > 0)
+				*val += kcpustat_user_vtime(vtime);
+			break;
+		case CPUTIME_GUEST:
+			if (vtime->state == VTIME_GUEST && task_nice(tsk) <= 0)
+				*val += vtime->gtime + vtime_delta(vtime);
+			break;
+		case CPUTIME_GUEST_NICE:
+			if (vtime->state == VTIME_GUEST && task_nice(tsk) > 0)
+				*val += vtime->gtime + vtime_delta(vtime);
+			break;
+		default:
+			break;
+		}
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
 	return 0;
@@ -965,15 +1002,10 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 	if (!vtime_accounting_enabled_cpu(cpu))
 		return cpustat[usage];
 
-	/* Only support sys vtime for now */
-	if (usage != CPUTIME_SYSTEM)
-		return cpustat[usage];
-
 	rq = cpu_rq(cpu);
 
 	for (;;) {
 		struct task_struct *curr;
-		struct vtime *vtime;
 
 		rcu_read_lock();
 		curr = rcu_dereference(rq->curr);
@@ -982,8 +1014,7 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 			return cpustat[usage];
 		}
 
-		vtime = &curr->vtime;
-		err = kcpustat_field_vtime(cpustat, vtime, usage, cpu, &val);
+		err = kcpustat_field_vtime(cpustat, curr, usage, cpu, &val);
 		rcu_read_unlock();
 
 		if (!err)
-- 
2.23.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/6] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields
  2019-11-19 23:22 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v2 Frederic Weisbecker
  2019-11-19 23:22 ` [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
@ 2019-11-19 23:22 ` Frederic Weisbecker
  2019-11-20 12:04   ` Ingo Molnar
  2019-11-19 23:22 ` [PATCH 3/6] procfs: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-19 23:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

Many callsites want to fetch the values of the system, user, nice, guest
or guest_nice kcpustat fields altogether, or at least a pair of them.

In that case, calling kcpustat_field() for each requested field brings
unnecessary overhead when we could fetch all of them in one go.

So provide kcpustat_cputime(), which fetches all vtime-sensitive fields
under the same RCU and seqcount block.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/kernel_stat.h |  23 ++++++
 kernel/sched/cputime.c      | 139 ++++++++++++++++++++++++++++++------
 2 files changed, 142 insertions(+), 20 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 79781196eb25..6bd70e464c61 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -78,15 +78,38 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
+
+static inline void kcpustat_cputime_raw(u64 *cpustat, u64 *user, u64 *nice,
+					u64 *system, u64 *guest, u64 *guest_nice)
+{
+	*user = cpustat[CPUTIME_USER];
+	*nice = cpustat[CPUTIME_NICE];
+	*system = cpustat[CPUTIME_SYSTEM];
+	*guest = cpustat[CPUTIME_GUEST];
+	*guest_nice = cpustat[CPUTIME_GUEST_NICE];
+}
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 			  enum cpu_usage_stat usage, int cpu);
+extern void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+			     u64 *user, u64 *nice, u64 *system,
+			     u64 *guest, u64 *guest_nice);
 #else
 static inline u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 				 enum cpu_usage_stat usage, int cpu)
 {
 	return kcpustat->cpustat[usage];
 }
+
+static inline void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+				    u64 *user, u64 *nice, u64 *system,
+				    u64 *guest, u64 *guest_nice)
+{
+	kcpustat_cputime_raw(kcpustat->cpustat, user, nice,
+			     system, guest, guest_nice);
+}
+
 #endif
 
 extern void account_user_time(struct task_struct *, u64);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2cf544e2109..f576bbb1f4ee 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -912,6 +912,30 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 
+static int vtime_state_check(struct vtime *vtime, int cpu)
+{
+	/*
+	 * We raced against context switch, fetch the
+	 * kcpustat task again.
+	 */
+	if (vtime->cpu != cpu && vtime->cpu != -1)
+		return -EAGAIN;
+
+	/*
+	 * Two possible things here:
+	 * 1) We are seeing the scheduling out task (prev) or any past one.
+	 * 2) We are seeing the scheduling in task (next) but it hasn't
+	 *    passed through vtime_task_switch() yet so the pending
+	 *    cputime of the prev task may not be flushed yet.
+	 *
+	 * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
+	 */
+	if (vtime->state == VTIME_INACTIVE)
+		return -EAGAIN;
+
+	return 0;
+}
+
 static u64 kcpustat_user_vtime(struct vtime *vtime)
 {
 	if (vtime->state == VTIME_USER)
@@ -933,26 +957,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
 	do {
 		seq = read_seqcount_begin(&vtime->seqcount);
 
-		/*
-		 * We raced against context switch, fetch the
-		 * kcpustat task again.
-		 */
-		if (vtime->cpu != cpu && vtime->cpu != -1)
-			return -EAGAIN;
-
-		/*
-		 * Two possible things here:
-		 * 1) We are seeing the scheduling out task (prev) or any past one.
-		 * 2) We are seeing the scheduling in task (next) but it hasn't
-		 *    passed though vtime_task_switch() yet so the pending
-		 *    cputime of the prev task may not be flushed yet.
-		 *
-		 * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
-		 */
-		if (vtime->state == VTIME_INACTIVE)
-			return -EAGAIN;
-
-		err = 0;
+		err = vtime_state_check(vtime, cpu);
+		if (err < 0)
+			return err;
 
 		*val = cpustat[usage];
 
@@ -1024,4 +1031,96 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 	}
 }
 EXPORT_SYMBOL_GPL(kcpustat_field);
+
+static int kcpustat_cputime_vtime(u64 *cpustat, struct task_struct *tsk,
+				  int cpu, u64 *user, u64 *nice,
+				  u64 *system, u64 *guest, u64 *guest_nice)
+{
+	struct vtime *vtime = &tsk->vtime;
+	unsigned int seq;
+	u64 delta;
+	int err;
+
+	do {
+		seq = read_seqcount_begin(&vtime->seqcount);
+
+		err = vtime_state_check(vtime, cpu);
+		if (err < 0)
+			return err;
+
+		kcpustat_cputime_raw(cpustat, user, nice,
+				     system, guest, guest_nice);
+
+		/* Task is sleeping, dead or idle, nothing to add */
+		if (vtime->state < VTIME_SYS)
+			continue;
+
+		delta = vtime_delta(vtime);
+
+		/*
+		 * Task runs either in user (including guest) or kernel space,
+		 * add pending nohz time to the right place.
+		 */
+		if (vtime->state == VTIME_SYS) {
+			*system += vtime->stime + delta;
+		} else if (vtime->state == VTIME_USER) {
+			if (task_nice(tsk) > 0)
+				*nice += vtime->utime + delta;
+			else
+				*user += vtime->utime + delta;
+		} else {
+			WARN_ON_ONCE(vtime->state != VTIME_GUEST);
+			if (task_nice(tsk) > 0) {
+				*guest_nice += vtime->gtime + delta;
+				*nice += vtime->gtime + delta;
+			} else {
+				*guest += vtime->gtime + delta;
+				*user += vtime->gtime + delta;
+			}
+		}
+	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return err;
+}
+
+void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
+		      u64 *user, u64 *nice, u64 *system,
+		      u64 *guest, u64 *guest_nice)
+{
+	u64 *cpustat = kcpustat->cpustat;
+	struct rq *rq;
+	int err;
+
+	if (!vtime_accounting_enabled_cpu(cpu)) {
+		kcpustat_cputime_raw(cpustat, user, nice,
+				     system, guest, guest_nice);
+		return;
+	}
+
+	rq = cpu_rq(cpu);
+
+	for (;;) {
+		struct task_struct *curr;
+
+		rcu_read_lock();
+		curr = rcu_dereference(rq->curr);
+		if (WARN_ON_ONCE(!curr)) {
+			rcu_read_unlock();
+			kcpustat_cputime_raw(cpustat, user, nice,
+					     system, guest, guest_nice);
+			return;
+		}
+
+		err = kcpustat_cputime_vtime(cpustat, curr, cpu, user,
+					     nice, system, guest, guest_nice);
+		rcu_read_unlock();
+
+		if (!err)
+			return;
+
+		cpu_relax();
+	}
+}
+EXPORT_SYMBOL_GPL(kcpustat_cputime);
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-- 
2.23.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/6] procfs: Use all-in-one vtime aware kcpustat accessor
  2019-11-19 23:22 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v2 Frederic Weisbecker
  2019-11-19 23:22 ` [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
  2019-11-19 23:22 ` [PATCH 2/6] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields Frederic Weisbecker
@ 2019-11-19 23:22 ` Frederic Weisbecker
  2019-11-19 23:22 ` [PATCH 4/6] cpufreq: Use vtime aware kcpustat accessors for user time Frederic Weisbecker
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-19 23:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

Now that we can also read user and guest time safely under vtime, use
the relevant accessor to fix frozen kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 fs/proc/stat.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 5c6bd0ae3802..b2ee5418dece 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -120,18 +120,21 @@ static int show_stat(struct seq_file *p, void *v)
 	getboottime64(&boottime);
 
 	for_each_possible_cpu(i) {
+		u64 cpu_user, cpu_nice, cpu_sys, cpu_guest, cpu_guest_nice;
 		struct kernel_cpustat *kcs = &kcpustat_cpu(i);
 
-		user += kcs->cpustat[CPUTIME_USER];
-		nice += kcs->cpustat[CPUTIME_NICE];
-		system += kcpustat_field(kcs, CPUTIME_SYSTEM, i);
+		kcpustat_cputime(kcs, i, &cpu_user, &cpu_nice,
+				 &cpu_sys, &cpu_guest, &cpu_guest_nice);
+		user += cpu_user;
+		nice += cpu_nice;
+		system += cpu_sys;
 		idle += get_idle_time(kcs, i);
 		iowait += get_iowait_time(kcs, i);
 		irq += kcs->cpustat[CPUTIME_IRQ];
 		softirq += kcs->cpustat[CPUTIME_SOFTIRQ];
 		steal += kcs->cpustat[CPUTIME_STEAL];
-		guest += kcs->cpustat[CPUTIME_GUEST];
-		guest_nice += kcs->cpustat[CPUTIME_GUEST_NICE];
+		guest += cpu_guest;
+		guest_nice += cpu_guest_nice;
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -159,17 +162,14 @@ static int show_stat(struct seq_file *p, void *v)
 	for_each_online_cpu(i) {
 		struct kernel_cpustat *kcs = &kcpustat_cpu(i);
 
+		kcpustat_cputime(kcs, i, &user, &nice,
+				 &system, &guest, &guest_nice);
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kcs->cpustat[CPUTIME_USER];
-		nice = kcs->cpustat[CPUTIME_NICE];
-		system = kcpustat_field(kcs, CPUTIME_SYSTEM, i);
 		idle = get_idle_time(kcs, i);
 		iowait = get_iowait_time(kcs, i);
 		irq = kcs->cpustat[CPUTIME_IRQ];
 		softirq = kcs->cpustat[CPUTIME_SOFTIRQ];
 		steal = kcs->cpustat[CPUTIME_STEAL];
-		guest = kcs->cpustat[CPUTIME_GUEST];
-		guest_nice = kcs->cpustat[CPUTIME_GUEST_NICE];
 		seq_printf(p, "cpu%d", i);
 		seq_put_decimal_ull(p, " ", nsec_to_clock_t(user));
 		seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
-- 
2.23.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 4/6] cpufreq: Use vtime aware kcpustat accessors for user time
  2019-11-19 23:22 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v2 Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2019-11-19 23:22 ` [PATCH 3/6] procfs: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
@ 2019-11-19 23:22 ` Frederic Weisbecker
  2019-11-19 23:22 ` [PATCH 5/6] leds: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
  2019-11-19 23:22 ` [PATCH 6/6] rackmeter: Use " Frederic Weisbecker
  5 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-19 23:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

We can now safely read user and guest kcpustat fields on nohz_full CPUs.
Use the appropriate accessors.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
---
 drivers/cpufreq/cpufreq.c          | 13 +++++++++----
 drivers/cpufreq/cpufreq_governor.c |  6 +++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 527fd068dc12..34f51e700ee6 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -113,18 +113,23 @@ EXPORT_SYMBOL_GPL(get_governor_parent_kobj);
 
 static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
 {
-	u64 idle_time;
+	struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
+	u64 user, nice, sys, guest, guest_nice;
 	u64 cur_wall_time;
+	u64 idle_time;
 	u64 busy_time;
 
 	cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());
 
-	busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
-	busy_time += kcpustat_field(&kcpustat_cpu(cpu), CPUTIME_SYSTEM, cpu);
+	kcpustat_cputime(kcpustat, cpu, &user, &nice, &sys,
+			 &guest, &guest_nice);
+
+	busy_time = user;
+	busy_time += sys;
 	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
 	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
 	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+	busy_time += nice;
 
 	idle_time = cur_wall_time - busy_time;
 	if (wall)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 4bb054d0cb43..f99ae45efaea 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -105,7 +105,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 			j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
 								  dbs_data->io_is_busy);
 			if (dbs_data->ignore_nice_load)
-				j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+				j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 		}
 	}
 }
@@ -149,7 +149,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
 		j_cdbs->prev_cpu_idle = cur_idle_time;
 
 		if (ignore_nice) {
-			u64 cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+			u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 
 			idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
 			j_cdbs->prev_cpu_nice = cur_nice;
@@ -530,7 +530,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 		j_cdbs->prev_load = 0;
 
 		if (ignore_nice)
-			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+			j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 	}
 
 	gov->start(policy);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 5/6] leds: Use all-in-one vtime aware kcpustat accessor
  2019-11-19 23:22 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v2 Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2019-11-19 23:22 ` [PATCH 4/6] cpufreq: Use vtime aware kcpustat accessors for user time Frederic Weisbecker
@ 2019-11-19 23:22 ` Frederic Weisbecker
  2019-11-19 23:22 ` [PATCH 6/6] rackmeter: Use " Frederic Weisbecker
  5 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-19 23:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

We can now safely read user kcpustat fields on nohz_full CPUs.
Use the appropriate accessor.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 drivers/leds/trigger/ledtrig-activity.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c
index ddfc5edd07c8..20250de1138d 100644
--- a/drivers/leds/trigger/ledtrig-activity.c
+++ b/drivers/leds/trigger/ledtrig-activity.c
@@ -57,9 +57,12 @@ static void led_activity_function(struct timer_list *t)
 	curr_used = 0;
 
 	for_each_possible_cpu(i) {
-		curr_used += kcpustat_cpu(i).cpustat[CPUTIME_USER]
-			  +  kcpustat_cpu(i).cpustat[CPUTIME_NICE]
-			  +  kcpustat_field(&kcpustat_cpu(i), CPUTIME_SYSTEM, i)
+		u64 user, nice, sys, guest, guest_nice;
+
+		kcpustat_cputime(&kcpustat_cpu(i), i, &user, &nice, &sys,
+				 &guest, &guest_nice);
+
+		curr_used += user + nice + sys
 			  +  kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]
 			  +  kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
 		cpus++;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 6/6] rackmeter: Use vtime aware kcpustat accessor
  2019-11-19 23:22 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v2 Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2019-11-19 23:22 ` [PATCH 5/6] leds: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
@ 2019-11-19 23:22 ` Frederic Weisbecker
  5 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-19 23:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

Now that we have a vtime safe kcpustat accessor, use it to fetch
CPUTIME_NICE and fix frozen kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 drivers/macintosh/rack-meter.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 4134e580f786..60311e8d6240 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -81,13 +81,14 @@ static int rackmeter_ignore_nice;
  */
 static inline u64 get_cpu_idle_time(unsigned int cpu)
 {
+	struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
 	u64 retval;
 
-	retval = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] +
-		 kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
+	retval = kcpustat->cpustat[CPUTIME_IDLE] +
+		 kcpustat->cpustat[CPUTIME_IOWAIT];
 
 	if (rackmeter_ignore_nice)
-		retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+		retval += kcpustat_field(kcpustat, CPUTIME_NICE, cpu);
 
 	return retval;
 }
-- 
2.23.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field()
  2019-11-19 23:22 ` [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
@ 2019-11-20 11:51   ` Ingo Molnar
  2019-11-20 21:04     ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2019-11-20 11:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek


* Frederic Weisbecker <frederic@kernel.org> wrote:

> Provide support for user, nice, guest and guest_nice fields through
> kcpustat_field().
> 
> Whether we account the delta to a nice or not nice field is decided on
> top of the nice value snapshot taken at the time we call kcpustat_field().
> If the nice value of the task has been changed since the last vtime
> update, we may have inaccurate distribution of the nice VS unnice
> cputime.
> 
> However this is considered as a minor issue compared to the proper fix
> that would involve interrupting the target on nice updates, which is
> undesired on nohz_full CPUs.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/cputime.c | 53 +++++++++++++++++++++++++++++++++---------
>  1 file changed, 42 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index e0cd20693ef5..b2cf544e2109 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -912,11 +912,21 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
>  	} while (read_seqcount_retry(&vtime->seqcount, seq));
>  }
>  
> +static u64 kcpustat_user_vtime(struct vtime *vtime)
> +{
> +	if (vtime->state == VTIME_USER)
> +		return vtime->utime + vtime_delta(vtime);
> +	else if (vtime->state == VTIME_GUEST)
> +		return vtime->gtime + vtime_delta(vtime);
> +	return 0;
> +}
> +
>  static int kcpustat_field_vtime(u64 *cpustat,
> -				struct vtime *vtime,
> +				struct task_struct *tsk,
>  				enum cpu_usage_stat usage,
>  				int cpu, u64 *val)
>  {
> +	struct vtime *vtime = &tsk->vtime;
>  	unsigned int seq;
>  	int err;
>  
> @@ -946,9 +956,36 @@ static int kcpustat_field_vtime(u64 *cpustat,
>  
>  		*val = cpustat[usage];
>  
> -		if (vtime->state == VTIME_SYS)
> -			*val += vtime->stime + vtime_delta(vtime);
> -
> +		/*
> +		 * Nice VS unnice cputime accounting may be inaccurate if
> +		 * the nice value has changed since the last vtime update.
> +		 * But proper fix would involve interrupting target on nice
> +		 * updates which is a no go on nohz_full.

Well, we actually already interrupt the target in both sys_nice() and 
sys_setpriority() etc. syscall variants: we call set_user_nice() which 
calls resched_curr() and the task is sent an IPI and runs through a 
reschedule.

But ... I do agree that this kind of granularity of nice/non-nice 
accounting doesn't really matter in practice: the changing of nice values 
is a relatively low frequency operation on most systems.

But nevertheless the comment should probably be updated to reflect this.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/6] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields
  2019-11-19 23:22 ` [PATCH 2/6] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields Frederic Weisbecker
@ 2019-11-20 12:04   ` Ingo Molnar
  2019-11-20 15:00     ` Frederic Weisbecker
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2019-11-20 12:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek


* Frederic Weisbecker <frederic@kernel.org> wrote:

> Many callsites want to fetch the values of system, user, user_nice, guest
> or guest_nice kcpustat fields altogether or at least a pair of these.
> 
> In that case calling kcpustat_field() for each requested field brings
> unnecessary overhead when we could fetch all of them in a row.
> 
> So provide kcpustat_cputime() that fetches all vtime sensitive fields
> under the same RCU and seqcount block.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/kernel_stat.h |  23 ++++++
>  kernel/sched/cputime.c      | 139 ++++++++++++++++++++++++++++++------
>  2 files changed, 142 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> index 79781196eb25..6bd70e464c61 100644
> --- a/include/linux/kernel_stat.h
> +++ b/include/linux/kernel_stat.h
> @@ -78,15 +78,38 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
>  	return kstat_cpu(cpu).irqs_sum;
>  }
>  
> +
> +static inline void kcpustat_cputime_raw(u64 *cpustat, u64 *user, u64 *nice,
> +					u64 *system, u64 *guest, u64 *guest_nice)
> +{
> +	*user = cpustat[CPUTIME_USER];
> +	*nice = cpustat[CPUTIME_NICE];
> +	*system = cpustat[CPUTIME_SYSTEM];
> +	*guest = cpustat[CPUTIME_GUEST];
> +	*guest_nice = cpustat[CPUTIME_GUEST_NICE];

Could the 'cpustat' pointer be constified?

Also, please:

> +	*user	    = cpustat[CPUTIME_USER];
> +	*nice	    = cpustat[CPUTIME_NICE];
> +	*system	    = cpustat[CPUTIME_SYSTEM];
> +	*guest	    = cpustat[CPUTIME_GUEST];
> +	*guest_nice = cpustat[CPUTIME_GUEST_NICE];

More pleasing to look at and easier to verify as well.

> +static int vtime_state_check(struct vtime *vtime, int cpu)
> +{
> +	/*
> +	 * We raced against context switch, fetch the
> +	 * kcpustat task again.
> +	 */

s/against context switch
 /against a context switch

> +void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
> +		      u64 *user, u64 *nice, u64 *system,
> +		      u64 *guest, u64 *guest_nice)
> +{
> +	u64 *cpustat = kcpustat->cpustat;
> +	struct rq *rq;
> +	int err;
> +
> +	if (!vtime_accounting_enabled_cpu(cpu)) {
> +		kcpustat_cputime_raw(cpustat, user, nice,
> +				     system, guest, guest_nice);
> +		return;
> +	}
> +
> +	rq = cpu_rq(cpu);
> +
> +	for (;;) {
> +		struct task_struct *curr;
> +
> +		rcu_read_lock();
> +		curr = rcu_dereference(rq->curr);
> +		if (WARN_ON_ONCE(!curr)) {
> +			rcu_read_unlock();
> +			kcpustat_cputime_raw(cpustat, user, nice,
> +					     system, guest, guest_nice);
> +			return;
> +		}
> +
> +		err = kcpustat_cputime_vtime(cpustat, curr, cpu, user,
> +					     nice, system, guest, guest_nice);
> +		rcu_read_unlock();
> +
> +		if (!err)
> +			return;
> +
> +		cpu_relax();
> +	}
> +}
> +EXPORT_SYMBOL_GPL(kcpustat_cputime);

I'm wondering whether it's worth introducing a helper structure for this 
train of parameters: user, nice, system, guest, guest_nice?

We also have similar constructs in other places:

+               u64 cpu_user, cpu_nice, cpu_sys, cpu_guest, cpu_guest_nice;

But more broadly, what do we gain by passing along a quartet of pointers, 
while we could also just use a 'struct kernel_cpustat' and store the 
values there naturally?

Yes, it's larger, because it also has 5 other fields - but we lose much 
of the space savings due to always passing along the 4 pointers already.

So I really think the parameter passing should be organized better here. 
This probably affects similar cpustat functions as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/6] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields
  2019-11-20 12:04   ` Ingo Molnar
@ 2019-11-20 15:00     ` Frederic Weisbecker
  0 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-20 15:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, LKML, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

On Wed, Nov 20, 2019 at 01:04:49PM +0100, Ingo Molnar wrote:
> > +static int vtime_state_check(struct vtime *vtime, int cpu)
> > +{
> > +	/*
> > +	 * We raced against context switch, fetch the
> > +	 * kcpustat task again.
> > +	 */
> 
> s/against context switch
>  /against a context switch

Ok.

> 
> > +void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
> > +		      u64 *user, u64 *nice, u64 *system,
> > +		      u64 *guest, u64 *guest_nice)
> > +{
> > +	u64 *cpustat = kcpustat->cpustat;
> > +	struct rq *rq;
> > +	int err;
> > +
> > +	if (!vtime_accounting_enabled_cpu(cpu)) {
> > +		kcpustat_cputime_raw(cpustat, user, nice,
> > +				     system, guest, guest_nice);
> > +		return;
> > +	}
> > +
> > +	rq = cpu_rq(cpu);
> > +
> > +	for (;;) {
> > +		struct task_struct *curr;
> > +
> > +		rcu_read_lock();
> > +		curr = rcu_dereference(rq->curr);
> > +		if (WARN_ON_ONCE(!curr)) {
> > +			rcu_read_unlock();
> > +			kcpustat_cputime_raw(cpustat, user, nice,
> > +					     system, guest, guest_nice);
> > +			return;
> > +		}
> > +
> > +		err = kcpustat_cputime_vtime(cpustat, curr, cpu, user,
> > +					     nice, system, guest, guest_nice);
> > +		rcu_read_unlock();
> > +
> > +		if (!err)
> > +			return;
> > +
> > +		cpu_relax();
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(kcpustat_cputime);
> 
> I'm wondering whether it's worth introducing a helper structure for this 
> train of parameters: user, nice, system, guest, guest_nice?
> 
> We also have similar constructs in other places:
> 
> +               u64 cpu_user, cpu_nice, cpu_sys, cpu_guest, cpu_guest_nice;
> 
> But more broadly, what do we gain by passing along a quartet of pointers, 
> while we could also just use a 'struct kernel_cpustat' and store the 
> values there naturally?
> 
> Yes, it's larger, because it also has 5 other fields - but we lose much 
> of the space savings due to always passing along the 4 pointers already.
> 
> So I really think the parameter passing should be organized better here.

Yeah I've been thinking about that too but I was worried about the stack use.
It's probably not a big worry eventually. I'll do that for the next series.

> This probably affects similar cpustat functions as well.

Only this one fortunately :-)

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field()
  2019-11-20 11:51   ` Ingo Molnar
@ 2019-11-20 21:04     ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-11-20 21:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, LKML, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

On Wed, Nov 20, 2019 at 12:51:42PM +0100, Ingo Molnar wrote:
> * Frederic Weisbecker <frederic@kernel.org> wrote:

> > +		/*
> > +		 * Nice VS unnice cputime accounting may be inaccurate if
> > +		 * the nice value has changed since the last vtime update.
> > +		 * But a proper fix would involve interrupting the target on nice
> > +		 * updates, which is a no-go on nohz_full.
> 
> Well, we actually already interrupt the target in both sys_nice() and 
> sys_setpriority() etc. syscall variants: we call set_user_nice() which 
> calls resched_curr() and the task is sent an IPI and runs through a 
> reschedule.

I think we can easily avoid doing that IPI when we find it is the only
task on that runqueue. Which is exactly the case for NOHZ_FULL.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field()
  2019-11-21  2:44 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v3 Frederic Weisbecker
@ 2019-11-21  2:44 ` Frederic Weisbecker
  0 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2019-11-21  2:44 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jacek Anaszewski, Wanpeng Li,
	Rafael J . Wysocki, Benjamin Herrenschmidt, Rik van Riel,
	Thomas Gleixner, Yauheni Kaliuta, Viresh Kumar, Pavel Machek

Provide support for user, nice, guest and guest_nice fields through
kcpustat_field().

Whether we account the delta to a nice or non-nice field is decided
based on the nice value snapshot taken at the time kcpustat_field() is
called. If the nice value of the task has changed since the last vtime
update, the distribution between nice and unnice cputime may be
inaccurate.

However, this is considered a minor issue compared to the proper fix,
which would involve interrupting the target on nice updates, something
undesired on nohz_full CPUs.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 54 +++++++++++++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index e0cd20693ef5..27b5406222fc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -912,11 +912,21 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 
+static u64 kcpustat_user_vtime(struct vtime *vtime)
+{
+	if (vtime->state == VTIME_USER)
+		return vtime->utime + vtime_delta(vtime);
+	else if (vtime->state == VTIME_GUEST)
+		return vtime->gtime + vtime_delta(vtime);
+	return 0;
+}
+
 static int kcpustat_field_vtime(u64 *cpustat,
-				struct vtime *vtime,
+				struct task_struct *tsk,
 				enum cpu_usage_stat usage,
 				int cpu, u64 *val)
 {
+	struct vtime *vtime = &tsk->vtime;
 	unsigned int seq;
 	int err;
 
@@ -946,9 +956,37 @@ static int kcpustat_field_vtime(u64 *cpustat,
 
 		*val = cpustat[usage];
 
-		if (vtime->state == VTIME_SYS)
-			*val += vtime->stime + vtime_delta(vtime);
-
+		/*
+		 * Nice VS unnice cputime accounting may be inaccurate if
+		 * the nice value has changed since the last vtime update.
+		 * But a proper fix would involve interrupting the target on nice
+		 * updates, which is a no-go on nohz_full (although the scheduler
+		 * may still interrupt the target if rescheduling is needed...)
+		 */
+		switch (usage) {
+		case CPUTIME_SYSTEM:
+			if (vtime->state == VTIME_SYS)
+				*val += vtime->stime + vtime_delta(vtime);
+			break;
+		case CPUTIME_USER:
+			if (task_nice(tsk) <= 0)
+				*val += kcpustat_user_vtime(vtime);
+			break;
+		case CPUTIME_NICE:
+			if (task_nice(tsk) > 0)
+				*val += kcpustat_user_vtime(vtime);
+			break;
+		case CPUTIME_GUEST:
+			if (vtime->state == VTIME_GUEST && task_nice(tsk) <= 0)
+				*val += vtime->gtime + vtime_delta(vtime);
+			break;
+		case CPUTIME_GUEST_NICE:
+			if (vtime->state == VTIME_GUEST && task_nice(tsk) > 0)
+				*val += vtime->gtime + vtime_delta(vtime);
+			break;
+		default:
+			break;
+		}
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
 	return 0;
@@ -965,15 +1003,10 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 	if (!vtime_accounting_enabled_cpu(cpu))
 		return cpustat[usage];
 
-	/* Only support sys vtime for now */
-	if (usage != CPUTIME_SYSTEM)
-		return cpustat[usage];
-
 	rq = cpu_rq(cpu);
 
 	for (;;) {
 		struct task_struct *curr;
-		struct vtime *vtime;
 
 		rcu_read_lock();
 		curr = rcu_dereference(rq->curr);
@@ -982,8 +1015,7 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
 			return cpustat[usage];
 		}
 
-		vtime = &curr->vtime;
-		err = kcpustat_field_vtime(cpustat, vtime, usage, cpu, &val);
+		err = kcpustat_field_vtime(cpustat, curr, usage, cpu, &val);
 		rcu_read_unlock();
 
 		if (!err)
-- 
2.23.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-11-21  2:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-19 23:22 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v2 Frederic Weisbecker
2019-11-19 23:22 ` [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
2019-11-20 11:51   ` Ingo Molnar
2019-11-20 21:04     ` Peter Zijlstra
2019-11-19 23:22 ` [PATCH 2/6] sched/vtime: Bring all-in-one kcpustat accessor for vtime fields Frederic Weisbecker
2019-11-20 12:04   ` Ingo Molnar
2019-11-20 15:00     ` Frederic Weisbecker
2019-11-19 23:22 ` [PATCH 3/6] procfs: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
2019-11-19 23:22 ` [PATCH 4/6] cpufreq: Use vtime aware kcpustat accessors for user time Frederic Weisbecker
2019-11-19 23:22 ` [PATCH 5/6] leds: Use all-in-one vtime aware kcpustat accessor Frederic Weisbecker
2019-11-19 23:22 ` [PATCH 6/6] rackmeter: Use " Frederic Weisbecker
2019-11-21  2:44 [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v3 Frederic Weisbecker
2019-11-21  2:44 ` [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field() Frederic Weisbecker
