* [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle
@ 2016-06-23  2:25 riel
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

Currently irq time accounting only works in these cases:
1) purely tick based accounting
2) nohz_full accounting, but only on housekeeping & nohz_full CPUs
3) architectures with native vtime accounting

On nohz_idle CPUs, which are probably the majority nowadays,
irq time accounting is currently broken. This leads to systems
reporting a dramatically lower amount of irq & softirq time than
is actually spent handling them, with all the time spent while the
system is in the idle task being accounted as idle.

This patch set seems to bring the amount of irq time reported by
top (and /proc/stat) roughly in line with that measured when I do
a "perf record -g -a" run to see what is using all that time.

The amount of irq time used, especially softirq, is shockingly high,
to the point of me thinking this patch set may be wrong, but the
numbers seem to match what perf is giving me...

These patches apply on top of Wanpeng Li's steal time patches.

CONFIG_IRQ_TIME_ACCOUNTING is now a config option that is available
as a separate choice from tick based / nohz_idle / nohz_full mode,
as suggested by Frederic Weisbecker.

Next up: look at the things that are using CPU time on an otherwise
idle system, and see if I can make those a little faster :)

v2: address Peterz's concerns, some more cleanups


* [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-23  2:25 ` riel
  2016-06-27 12:25   ` Frederic Weisbecker
  2016-06-23  2:25 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of 'ticks' jiffies was
actually spent handling irqs, and could conceivably mis-count
all of the ticks during a period as irq time, when there was
some irq and some softirq time.

This can actually happen when irqtime_account_process_tick
is called from account_idle_ticks, which can pass a larger
number of ticks down all at once.

Fix this by changing irqtime_account_hi_update and
irqtime_account_si_update to round elapsed irq and softirq
time to jiffies, and return the number of jiffies spent in
each mode, similar to how steal time is handled.

Additionally, have irqtime_account_process_tick take into
account how much time was spent in each of steal, irq,
and softirq time.

The latter could help improve the accuracy of timekeeping
when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 60 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d76fdb..15b813c014be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long irq_jiffies = 0;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 irq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
+	if (irq > cputime_one_jiffy) {
+		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
+		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
+	}
 	local_irq_restore(flags);
-	return ret;
+	return irq_jiffies;
 }
 
-static int irqtime_account_si_update(void)
+static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long si_jiffies = 0;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 softirq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
+	if (softirq > cputime_one_jiffy) {
+		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
+		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
+	}
 	local_irq_restore(flags);
-	return ret;
+	return si_jiffies;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static unsigned long irqtime_account_hi_update(unsigned long dummy)
+{
+	return 0;
+}
+
+static unsigned long irqtime_account_si_update(unsigned long dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -283,6 +297,26 @@ static __always_inline unsigned long steal_account_process_tick(unsigned long ma
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ * Due to rounding errors, the calculated amount can sometimes exceed
+ * max_jiffies; be careful not to account more than max_jiffies.
+ */
+static inline int account_other_ticks(unsigned long max_jiffies)
+{
+	unsigned long accounted;
+
+	accounted = steal_account_process_tick(max_jiffies);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_hi_update(max_jiffies - accounted);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_si_update(max_jiffies - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 {
 	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_ticks(ticks);
+	if (other >= ticks)
 		return;
+	ticks -= other;
 
 	cputime *= ticks;
 	scaled *= ticks;
 
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-- 
2.5.5


* [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-23  2:25 ` riel
  2016-06-27 23:21   ` Frederic Weisbecker
  2016-06-23  2:25 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
appear to currently work right.

On CPUs that are nohz_full, people typically do not assign IRQs.
On the housekeeping CPU (when a system is booted up with nohz_full),
sampling should work ok to determine irq and softirq time use, but
that only covers the housekeeping CPU itself, not the other
non-nohz_full CPUs.

On CPUs that are nohz_idle (the typical way a distro kernel is
booted), irq time is not accounted at all while the CPU is idle,
due to the lack of timer ticks.

Remove the VTIME_GEN vtime irq time code. The next patch will
allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vtime.h  | 32 ++++++++++++++------------------
 kernel/sched/cputime.c | 16 +++-------------
 2 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index fa2196990f84..d1977d84ebdf 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,6 +14,18 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+
+#ifdef __ARCH_HAS_VTIME_ACCOUNT
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+#else
+extern void vtime_common_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_enter(struct task_struct *tsk)
+{
+	if (vtime_accounting_cpu_enabled())
+		vtime_common_account_irq_enter(tsk);
+}
+#endif /* __ARCH_HAS_VTIME_ACCOUNT */
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -64,17 +76,6 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline void vtime_task_switch(struct task_struct *prev) { }
@@ -85,13 +86,8 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_gen_account_irq_exit(struct task_struct *tsk);
-
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_gen_account_irq_exit(tsk);
-}
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
 
 extern void vtime_user_enter(struct task_struct *tsk);
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 15b813c014be..0280528800f2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -720,14 +720,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	unsigned long delta_jiffies, other_jiffies;
 
 	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	other_jiffies = account_other_ticks(delta_jiffies);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return jiffies_to_cputime(delta_jiffies - other_jiffies);
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -747,16 +747,6 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_gen_account_irq_exit(struct task_struct *tsk)
-{
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		__vtime_account_system(tsk);
-	if (context_tracking_in_user())
-		tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
-}
-
 void vtime_account_user(struct task_struct *tsk)
 {
 	cputime_t delta_cpu;
-- 
2.5.5


* [PATCH 3/5] cputime: allow irq time accounting to be selected as an option
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
  2016-06-23  2:25 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
@ 2016-06-23  2:25 ` riel
  2016-06-26  3:03   ` kbuild test robot
  2016-06-23  2:25 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
  2016-06-23  2:25 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  4 siblings, 1 reply; 31+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Allow CONFIG_IRQ_TIME_ACCOUNTING to be selected as an option, on top
of CONFIG_VIRT_CPU_ACCOUNTING_GEN (and potentially others?).

This allows for the irq time accounting code to be used with nohz_idle
CPUs, which is how several distributions ship their kernels. Using the
same code for several timer modes also allows us to drop duplicate code.
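
For illustration, the combinations this makes expressible look like the
following .config fragments (hypothetical examples, not taken from the
patch):

	# Common distro case: tick accounting + tickless idle, now with
	# fine grained irq time accounting on top.
	CONFIG_TICK_CPU_ACCOUNTING=y
	CONFIG_NO_HZ_IDLE=y
	CONFIG_IRQ_TIME_ACCOUNTING=y

	# Newly possible: full dynticks accounting + irq time accounting.
	CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
	CONFIG_NO_HZ_FULL=y
	CONFIG_IRQ_TIME_ACCOUNTING=y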

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 init/Kconfig | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 0dfd09d54c65..4c7ee4f136cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -375,9 +375,11 @@ config VIRT_CPU_ACCOUNTING_GEN
 
 	  If unsure, say N.
 
+endchoice
+
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
-	depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
+	depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
 	help
 	  Select this option to enable fine granularity task irq time
 	  accounting. This is done by reading a timestamp on each
@@ -386,8 +388,6 @@ config IRQ_TIME_ACCOUNTING
 
 	  If in doubt, say N here.
 
-endchoice
-
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
 	depends on MULTIUSER
-- 
2.5.5


* [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
                   ` (2 preceding siblings ...)
  2016-06-23  2:25 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
@ 2016-06-23  2:25 ` riel
  2016-06-23  2:25 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  4 siblings, 0 replies; 31+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter and documentation to irqtime_account_irq.
This can be used to distinguish between transitioning from process
context to hardirq time, and from process context to softirq time.

This is necessary to be able to remove the local_irq_save/restore
from irqtime_account_irq.
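
For context, the irqtype argument takes the values HARDIRQ_OFFSET or
SOFTIRQ_OFFSET, matching the preempt count offsets used at the call
sites below. A minimal sketch of the intended dispatch (illustrative
only; the real use of the parameter lands in patch 5/5):

	void irqtime_account_irq(struct task_struct *curr, int irqtype)
	{
		if (irqtype == HARDIRQ_OFFSET) {
			/* entering or leaving hardirq context */
		} else {
			/* SOFTIRQ_OFFSET: entering or leaving softirq */
		}
	}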

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84ebdf..31e486d741b0 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0280528800f2..fc4122afc022 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.5.5


* [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
                   ` (3 preceding siblings ...)
  2016-06-23  2:25 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
@ 2016-06-23  2:25 ` riel
  4 siblings, 0 replies; 31+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Drop local_irq_save/restore from irqtime_account_irq.
Instead, have softirq and hardirq track their time spent
independently, with the softirq code subtracting hardirq
time that happened during the duration of the softirq run.

The softirq code can be interrupted by hardirq code at
any point in time, but it can check whether it got a
consistent snapshot of the timekeeping variables it wants,
and loop around in the unlikely case that it did not.
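
The retry idea can be illustrated with a small userspace analogue (an
editorial sketch with made-up helpers, not code from this patch): the
softirq exit path snapshots the hardirq total, computes its own delta,
and loops if a hardirq changed the total in the meantime.

	#include <stdint.h>
	#include <time.h>

	/* Stand-in for the per-cpu hardirq time counter, updated by
	 * "hardirq" context and read locklessly below. */
	volatile uint64_t hardirq_total_ns;

	uint64_t now_ns(void)
	{
		struct timespec ts;
		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
	}

	/* On softirq exit: elapsed time minus the hardirq time that
	 * interrupted this softirq run, retried until consistent. */
	uint64_t softirq_delta(uint64_t softirq_start, uint64_t hardirq_at_entry)
	{
		uint64_t hi, delta;

		do {
			hi = hardirq_total_ns;
			delta = (now_ns() - softirq_start) -
				(hi - hardirq_at_entry);
			/* If a hardirq fired between the two reads, the
			 * snapshot was inconsistent: take it again. */
		} while (hi != hardirq_total_ns);

		return delta;
	}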

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 72 +++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h   | 38 +++++++++++++++++++++-----
 2 files changed, 90 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index fc4122afc022..e4c7b2d17141 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -26,7 +26,9 @@
 DEFINE_PER_CPU(u64, cpu_hardirq_time);
 DEFINE_PER_CPU(u64, cpu_softirq_time);
 
-static DEFINE_PER_CPU(u64, irq_start_time);
+static DEFINE_PER_CPU(u64, hardirq_start_time);
+static DEFINE_PER_CPU(u64, softirq_start_time);
+static DEFINE_PER_CPU(u64, prev_hardirq_time);
 static int sched_clock_irqtime;
 
 void enable_sched_clock_irqtime(void)
@@ -41,6 +43,7 @@ void disable_sched_clock_irqtime(void)
 
 #ifndef CONFIG_64BIT
 DEFINE_PER_CPU(seqcount_t, irq_time_seq);
+DEFINE_PER_CPU(seqcount_t, softirq_time_seq);
 #endif /* CONFIG_64BIT */
 
 /*
@@ -53,36 +56,79 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  * softirq -> hardirq, hardirq -> softirq
  *
  * When exiting hardirq or softirq time, account the elapsed time.
+ *
+ * When exiting softirq time, subtract the amount of hardirq time that
+ * interrupted this softirq run, to avoid double accounting of that time.
  */
 void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
-	unsigned long flags;
+	u64 prev_softirq_start;
+	bool leaving_softirq;
+	u64 prev_hardirq;
+	u64 hardirq_time;
 	s64 delta;
 	int cpu;
 
 	if (!sched_clock_irqtime)
 		return;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
-	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
-	__this_cpu_add(irq_start_time, delta);
 
-	irq_time_write_begin();
+	/*
+	 * Hardirq time accounting is pretty straightforward. If not in
+	 * hardirq context yet (entering hardirq), set the start time.
+	 * If already in hardirq context (leaving), account the elapsed time.
+	 */
+	if (irqtype == HARDIRQ_OFFSET) {
+		bool leaving_hardirq = hardirq_count();
+		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
+		__this_cpu_add(hardirq_start_time, delta);
+		if (leaving_hardirq) {
+			hardirq_time_write_begin();
+			__this_cpu_add(cpu_hardirq_time, delta);
+			hardirq_time_write_end();
+		}
+		return;
+	}
+
+	/*
+	 * Softirq context may get interrupted by hardirq context, on the
+	 * same CPU. At softirq entry time the amount of time this CPU spent
+	 * in hardirq context is stored. At softirq exit time, the time spent
+	 * in hardirq context during the softirq is subtracted.
+	 */
+	prev_softirq_start = __this_cpu_read(softirq_start_time);
+	prev_hardirq = __this_cpu_read(prev_hardirq_time);
+	leaving_softirq = in_serving_softirq();
+
+	do {
+		u64 now = sched_clock_cpu(cpu);
+
+		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));
+		__this_cpu_write(softirq_start_time, now);
+		__this_cpu_write(prev_hardirq_time, hardirq_time);
+
+		if (leaving_softirq) {
+			/*
+			 * Subtract hardirq time that happened during this
+			 * softirq.
+			 */
+			s64 hi_delta = hardirq_time - prev_hardirq;
+			delta = now - prev_softirq_start - hi_delta;
+		}
+		/* Loop around if interrupted by a hardirq. */
+	} while (hardirq_time != READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
+
 	/*
 	 * We do not account for softirq time from ksoftirqd here.
 	 * We want to continue accounting softirq time to ksoftirqd thread
 	 * in that case, so as not to confuse scheduler with a special task
 	 * that do not consume any time, but still wants to run.
 	 */
-	if (hardirq_count())
-		__this_cpu_add(cpu_hardirq_time, delta);
-	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+	softirq_time_write_begin();
+	if (leaving_softirq && curr != this_cpu_ksoftirqd())
 		__this_cpu_add(cpu_softirq_time, delta);
-
-	irq_time_write_end();
-	local_irq_restore(flags);
+	softirq_time_write_end();
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d23527e..cad4df9835f7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1752,38 +1752,62 @@ DECLARE_PER_CPU(u64, cpu_softirq_time);
 
 #ifndef CONFIG_64BIT
 DECLARE_PER_CPU(seqcount_t, irq_time_seq);
+DECLARE_PER_CPU(seqcount_t, softirq_time_seq);
 
-static inline void irq_time_write_begin(void)
+static inline void hardirq_time_write_begin(void)
 {
 	__this_cpu_inc(irq_time_seq.sequence);
 	smp_wmb();
 }
 
-static inline void irq_time_write_end(void)
+static inline void hardirq_time_write_end(void)
 {
 	smp_wmb();
 	__this_cpu_inc(irq_time_seq.sequence);
 }
 
+static inline void softirq_time_write_begin(void)
+{
+	__this_cpu_inc(softirq_time_seq.sequence);
+	smp_wmb();
+}
+
+static inline void softirq_time_write_end(void)
+{
+	smp_wmb();
+	__this_cpu_inc(softirq_time_seq.sequence);
+}
+
 static inline u64 irq_time_read(int cpu)
 {
 	u64 irq_time;
-	unsigned seq;
+	unsigned hi_seq;
+	unsigned si_seq;
 
 	do {
-		seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		hi_seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		si_seq = read_seqcount_begin(&per_cpu(softirq_time_seq, cpu));
 		irq_time = per_cpu(cpu_softirq_time, cpu) +
 			   per_cpu(cpu_hardirq_time, cpu);
-	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), hi_seq) ||
+		 read_seqcount_retry(&per_cpu(softirq_time_seq, cpu), si_seq));
 
 	return irq_time;
 }
 #else /* CONFIG_64BIT */
-static inline void irq_time_write_begin(void)
+static inline void hardirq_time_write_begin(void)
+{
+}
+
+static inline void hardirq_time_write_end(void)
+{
+}
+
+static inline void softirq_time_write_begin(void)
 {
 }
 
-static inline void irq_time_write_end(void)
+static inline void softirq_time_write_end(void)
 {
 }
 
-- 
2.5.5


* Re: [PATCH 3/5] cputime: allow irq time accounting to be selected as an option
  2016-06-23  2:25 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
@ 2016-06-26  3:03   ` kbuild test robot
  0 siblings, 0 replies; 31+ messages in thread
From: kbuild test robot @ 2016-06-26  3:03 UTC (permalink / raw)
  To: riel
  Cc: kbuild-all, linux-kernel, peterz, mingo, pbonzini, fweisbec,
	wanpeng.li, efault, tglx, rkrcmar


Hi,

[auto build test WARNING on tip/sched/core]
[also build test WARNING on next-20160624]
[cannot apply to v4.7-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/sched-time-fix-irq-time-accounting-with-nohz_idle/20160623-102923
config: x86_64-randconfig-a0-06260951 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/sched.h:17:0,
                    from kernel/sched/cputime.c:2:
   kernel/sched/cputime.c: In function 'irqtime_account_hi_update':
   include/linux/kernel.h:742:17: warning: comparison of distinct pointer types lacks a cast
     (void) (&_min1 == &_min2);  \
                    ^
>> kernel/sched/cputime.c:92:17: note: in expansion of macro 'min'
      irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
                    ^~~
   kernel/sched/cputime.c: In function 'irqtime_account_si_update':
   include/linux/kernel.h:742:17: warning: comparison of distinct pointer types lacks a cast
     (void) (&_min1 == &_min2);  \
                    ^
   kernel/sched/cputime.c:109:16: note: in expansion of macro 'min'
      si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
                   ^~~
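
The warning fires because the kernel's min() insists that both arguments
have the same type, and cputime_to_jiffies() may not evaluate to an
unsigned long depending on the cputime representation. The conventional
fix for this class of warning (offered here as a suggestion, not a change
taken from this thread) is min_t(), which casts both sides explicitly:

	irq_jiffies = min_t(unsigned long, max_jiffies, cputime_to_jiffies(irq));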

vim +/min +92 kernel/sched/cputime.c

73fbec604 Frederic Weisbecker 2012-06-16   1  #include <linux/export.h>
73fbec604 Frederic Weisbecker 2012-06-16  @2  #include <linux/sched.h>
73fbec604 Frederic Weisbecker 2012-06-16   3  #include <linux/tsacct_kern.h>
73fbec604 Frederic Weisbecker 2012-06-16   4  #include <linux/kernel_stat.h>
73fbec604 Frederic Weisbecker 2012-06-16   5  #include <linux/static_key.h>
abf917cd9 Frederic Weisbecker 2012-07-25   6  #include <linux/context_tracking.h>
73fbec604 Frederic Weisbecker 2012-06-16   7  #include "sched.h"
1fe7c4ef8 Stefano Stabellini  2015-11-10   8  #ifdef CONFIG_PARAVIRT
1fe7c4ef8 Stefano Stabellini  2015-11-10   9  #include <asm/paravirt.h>
1fe7c4ef8 Stefano Stabellini  2015-11-10  10  #endif
73fbec604 Frederic Weisbecker 2012-06-16  11  
73fbec604 Frederic Weisbecker 2012-06-16  12  
73fbec604 Frederic Weisbecker 2012-06-16  13  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
73fbec604 Frederic Weisbecker 2012-06-16  14  
73fbec604 Frederic Weisbecker 2012-06-16  15  /*
73fbec604 Frederic Weisbecker 2012-06-16  16   * There are no locks covering percpu hardirq/softirq time.
bf9fae9f5 Frederic Weisbecker 2012-09-08  17   * They are only modified in vtime_account, on corresponding CPU
73fbec604 Frederic Weisbecker 2012-06-16  18   * with interrupts disabled. So, writes are safe.
73fbec604 Frederic Weisbecker 2012-06-16  19   * They are read and saved off onto struct rq in update_rq_clock().
73fbec604 Frederic Weisbecker 2012-06-16  20   * This may result in other CPU reading this CPU's irq time and can
bf9fae9f5 Frederic Weisbecker 2012-09-08  21   * race with irq/vtime_account on this CPU. We would either get old
73fbec604 Frederic Weisbecker 2012-06-16  22   * or new value with a side effect of accounting a slice of irq time to wrong
73fbec604 Frederic Weisbecker 2012-06-16  23   * task when irq is in progress while we read rq->clock. That is a worthy
73fbec604 Frederic Weisbecker 2012-06-16  24   * compromise in place of having locks on each irq in account_system_time.
73fbec604 Frederic Weisbecker 2012-06-16  25   */
73fbec604 Frederic Weisbecker 2012-06-16  26  DEFINE_PER_CPU(u64, cpu_hardirq_time);
73fbec604 Frederic Weisbecker 2012-06-16  27  DEFINE_PER_CPU(u64, cpu_softirq_time);
73fbec604 Frederic Weisbecker 2012-06-16  28  
73fbec604 Frederic Weisbecker 2012-06-16  29  static DEFINE_PER_CPU(u64, irq_start_time);
73fbec604 Frederic Weisbecker 2012-06-16  30  static int sched_clock_irqtime;
73fbec604 Frederic Weisbecker 2012-06-16  31  
73fbec604 Frederic Weisbecker 2012-06-16  32  void enable_sched_clock_irqtime(void)
73fbec604 Frederic Weisbecker 2012-06-16  33  {
73fbec604 Frederic Weisbecker 2012-06-16  34  	sched_clock_irqtime = 1;
73fbec604 Frederic Weisbecker 2012-06-16  35  }
73fbec604 Frederic Weisbecker 2012-06-16  36  
73fbec604 Frederic Weisbecker 2012-06-16  37  void disable_sched_clock_irqtime(void)
73fbec604 Frederic Weisbecker 2012-06-16  38  {
73fbec604 Frederic Weisbecker 2012-06-16  39  	sched_clock_irqtime = 0;
73fbec604 Frederic Weisbecker 2012-06-16  40  }
73fbec604 Frederic Weisbecker 2012-06-16  41  
73fbec604 Frederic Weisbecker 2012-06-16  42  #ifndef CONFIG_64BIT
73fbec604 Frederic Weisbecker 2012-06-16  43  DEFINE_PER_CPU(seqcount_t, irq_time_seq);
73fbec604 Frederic Weisbecker 2012-06-16  44  #endif /* CONFIG_64BIT */
73fbec604 Frederic Weisbecker 2012-06-16  45  
73fbec604 Frederic Weisbecker 2012-06-16  46  /*
73fbec604 Frederic Weisbecker 2012-06-16  47   * Called before incrementing preempt_count on {soft,}irq_enter
73fbec604 Frederic Weisbecker 2012-06-16  48   * and before decrementing preempt_count on {soft,}irq_exit.
73fbec604 Frederic Weisbecker 2012-06-16  49   */
3e1df4f50 Frederic Weisbecker 2012-10-06  50  void irqtime_account_irq(struct task_struct *curr)
73fbec604 Frederic Weisbecker 2012-06-16  51  {
73fbec604 Frederic Weisbecker 2012-06-16  52  	unsigned long flags;
73fbec604 Frederic Weisbecker 2012-06-16  53  	s64 delta;
73fbec604 Frederic Weisbecker 2012-06-16  54  	int cpu;
73fbec604 Frederic Weisbecker 2012-06-16  55  
73fbec604 Frederic Weisbecker 2012-06-16  56  	if (!sched_clock_irqtime)
73fbec604 Frederic Weisbecker 2012-06-16  57  		return;
73fbec604 Frederic Weisbecker 2012-06-16  58  
73fbec604 Frederic Weisbecker 2012-06-16  59  	local_irq_save(flags);
73fbec604 Frederic Weisbecker 2012-06-16  60  
73fbec604 Frederic Weisbecker 2012-06-16  61  	cpu = smp_processor_id();
73fbec604 Frederic Weisbecker 2012-06-16  62  	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
73fbec604 Frederic Weisbecker 2012-06-16  63  	__this_cpu_add(irq_start_time, delta);
73fbec604 Frederic Weisbecker 2012-06-16  64  
73fbec604 Frederic Weisbecker 2012-06-16  65  	irq_time_write_begin();
73fbec604 Frederic Weisbecker 2012-06-16  66  	/*
73fbec604 Frederic Weisbecker 2012-06-16  67  	 * We do not account for softirq time from ksoftirqd here.
73fbec604 Frederic Weisbecker 2012-06-16  68  	 * We want to continue accounting softirq time to ksoftirqd thread
73fbec604 Frederic Weisbecker 2012-06-16  69  	 * in that case, so as not to confuse scheduler with a special task
73fbec604 Frederic Weisbecker 2012-06-16  70  	 * that do not consume any time, but still wants to run.
73fbec604 Frederic Weisbecker 2012-06-16  71  	 */
73fbec604 Frederic Weisbecker 2012-06-16  72  	if (hardirq_count())
73fbec604 Frederic Weisbecker 2012-06-16  73  		__this_cpu_add(cpu_hardirq_time, delta);
73fbec604 Frederic Weisbecker 2012-06-16  74  	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
73fbec604 Frederic Weisbecker 2012-06-16  75  		__this_cpu_add(cpu_softirq_time, delta);
73fbec604 Frederic Weisbecker 2012-06-16  76  
73fbec604 Frederic Weisbecker 2012-06-16  77  	irq_time_write_end();
73fbec604 Frederic Weisbecker 2012-06-16  78  	local_irq_restore(flags);
73fbec604 Frederic Weisbecker 2012-06-16  79  }
3e1df4f50 Frederic Weisbecker 2012-10-06  80  EXPORT_SYMBOL_GPL(irqtime_account_irq);
73fbec604 Frederic Weisbecker 2012-06-16  81  
ad7578c31 Rik van Riel        2016-06-22  82  static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
73fbec604 Frederic Weisbecker 2012-06-16  83  {
73fbec604 Frederic Weisbecker 2012-06-16  84  	u64 *cpustat = kcpustat_this_cpu->cpustat;
ad7578c31 Rik van Riel        2016-06-22  85  	unsigned long irq_jiffies = 0;
73fbec604 Frederic Weisbecker 2012-06-16  86  	unsigned long flags;
ad7578c31 Rik van Riel        2016-06-22  87  	u64 irq;
73fbec604 Frederic Weisbecker 2012-06-16  88  
73fbec604 Frederic Weisbecker 2012-06-16  89  	local_irq_save(flags);
ad7578c31 Rik van Riel        2016-06-22  90  	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
ad7578c31 Rik van Riel        2016-06-22  91  	if (irq > cputime_one_jiffy) {
ad7578c31 Rik van Riel        2016-06-22 @92  		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
ad7578c31 Rik van Riel        2016-06-22  93  		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
ad7578c31 Rik van Riel        2016-06-22  94  	}
73fbec604 Frederic Weisbecker 2012-06-16  95  	local_irq_restore(flags);

:::::: The code at line 92 was first introduced by commit
:::::: ad7578c312973ca898de22b799774bd08ffe819f sched,time: count actually elapsed irq & softirq time

:::::: TO: Rik van Riel <riel@redhat.com>
:::::: CC: 0day robot <fengguang.wu@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-27 12:25   ` Frederic Weisbecker
  2016-06-27 12:50     ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2016-06-27 12:25 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period will be accounted as irq or softirq
> time.
> 
> This is inaccurate if only a subset of 'ticks' jiffies was
> actually spent handling irqs, and could conceivably mis-count
> all of the ticks during a period as irq time, when there was
> some irq and some softirq time.

Good catch!

Many comments following.

> 
> This can actually happen when irqtime_account_process_tick
> is called from account_idle_ticks, which can pass a larger
> number of ticks down all at once.
> 
> Fix this by changing irqtime_account_hi_update and
> irqtime_account_si_update to round elapsed irq and softirq
> time to jiffies, and return the number of jiffies spent in
> each mode, similar to how steal time is handled.
> 
> Additionally, have irqtime_account_process_tick take into
> account how much time was spent in each of steal, irq,
> and softirq time.
> 
> The latter could help improve the accuracy of timekeeping

Maybe you meant cputime? Timekeeping is rather about jiffies and GTOD.

> when returning from idle on a NO_HZ_IDLE CPU.
> 
> Properly accounting how much time was spent in hardirq and
> softirq time will also allow the NO_HZ_FULL code to re-use
> these same functions for hardirq and softirq accounting.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 60 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 3d60e5d76fdb..15b813c014be 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr)
>  }
>  EXPORT_SYMBOL_GPL(irqtime_account_irq);
>  
> -static int irqtime_account_hi_update(void)
> +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long irq_jiffies = 0;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	u64 irq;

Should this be cputime_t? And perhaps renamed to irq_cputime to
distinguish it clearly from irq_jiffies?

>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_hardirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> -		ret = 1;
> +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];

cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in fact
even cputime64_t). So you need to convert cpu_hardirq_time before doing the
subtraction.

> +	if (irq > cputime_one_jiffy) {
> +		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
> +		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
> +	}
>  	local_irq_restore(flags);
> -	return ret;
> +	return irq_jiffies;

I think we shouldn't use jiffies at all here and only rely on cputime_t.
max_jiffies should be of cputime_t and this function should return the cputime_t.

The resulting code should be more simple and in fact more precise (more below on that).
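
One shape such a cputime_t-only helper could take (a sketch of the
suggestion above, not code from the series):

	static cputime_t irqtime_account_hi_update(cputime_t maxtime)
	{
		u64 *cpustat = kcpustat_this_cpu->cpustat;
		unsigned long flags;
		cputime_t irq_cputime;

		local_irq_save(flags);
		/* Convert the nsec counter to cputime before subtracting. */
		irq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)) -
			      cpustat[CPUTIME_IRQ];
		/* Never account more than the caller's remaining budget. */
		irq_cputime = min(irq_cputime, maxtime);
		cpustat[CPUTIME_IRQ] += irq_cputime;
		local_irq_restore(flags);
		return irq_cputime;
	}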

>  }
>  
> -static int irqtime_account_si_update(void)
> +static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long si_jiffies = 0;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	u64 softirq;
>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_softirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
> -		ret = 1;
> +	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
> +	if (softirq > cputime_one_jiffy) {
> +		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
> +		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
> +	}
>  	local_irq_restore(flags);
> -	return ret;
> +	return si_jiffies;

So same comments apply here.

[...]
>   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
>   * tasks (sum on group iteration) belonging to @tsk's group.
>   */
> @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>  {
>  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
>  	u64 cputime = (__force u64) cputime_one_jiffy;
> -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long other;
>  
> -	if (steal_account_process_tick(ULONG_MAX))
> +	/*
> +	 * When returning from idle, many ticks can get accounted at
> +	 * once, including some ticks of steal, irq, and softirq time.
> +	 * Subtract those ticks from the amount of time accounted to
> +	 * idle, or potentially user or system time. Due to rounding,
> +	 * other time can exceed ticks occasionally.
> +	 */
> +	other = account_other_ticks(ticks);
> +	if (other >= ticks)
>  		return;
> +	ticks -= other;
>  
>  	cputime *= ticks;
>  	scaled *= ticks;

So instead of dealing with ticks here, I think you should rather use the above
cputime as both the limit and the remaining time to account after steal/irqs.

This should avoid some middle conversions and improve precision when cputime_t == nsecs granularity.

If we account 2 ticks to idle (let's say HZ=100) and the irq time to account
is 15 ms: 2 ticks = 20 ms, so we have 5 ms left to account to idle. With the
jiffies granularity in this patch, we would account one tick to irqtime
(1 tick = 10 ms, leaving 5 ms to account back later) and one tick to idle
time, whereas if you deal with cputime_t, you are going to account the
correct amount of idle time.

Beyond the scope of this patchset, we should probably kill cputime_t and account everything to nsecs.
I have a patchset that did that 2 years ago, I should probably re-iterate it.

Thanks.


* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-27 12:25   ` Frederic Weisbecker
@ 2016-06-27 12:50     ` Rik van Riel
  2016-06-28 20:56       ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2016-06-27 12:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar


On Mon, 2016-06-27 at 14:25 +0200, Frederic Weisbecker wrote:
> On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@redhat.com wrote:
> > 
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Currently, if there was any irq or softirq time during 'ticks'
> > jiffies, the entire period will be accounted as irq or softirq
> > time.
> > 
> > This is inaccurate if only a subset of 'ticks' jiffies was
> > actually spent handling irqs, and could conceivably mis-count
> > all of the ticks during a period as irq time, when there was
> > some irq and some softirq time.
> Good catch!
> 
> Many comments following.
> 
> > 
> > 
> > This can actually happen when irqtime_account_process_tick
> > is called from account_idle_ticks, which can pass a larger
> > number of ticks down all at once.
> > 
> > Fix this by changing irqtime_account_hi_update and
> > irqtime_account_si_update to round elapsed irq and softirq
> > time to jiffies, and return the number of jiffies spent in
> > each mode, similar to how steal time is handled.
> > 
> > Additionally, have irqtime_account_process_tick take into
> > account how much time was spent in each of steal, irq,
> > and softirq time.
> > 
> > The latter could help improve the accuracy of timekeeping
> Maybe you meant cputime? Timekeeping is rather about jiffies and GTOD.
> 
> > 
> > when returning from idle on a NO_HZ_IDLE CPU.
> > 
> > Properly accounting how much time was spent in hardirq and
> > softirq time will also allow the NO_HZ_FULL code to re-use
> > these same functions for hardirq and softirq accounting.
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > 
> >  	local_irq_save(flags);
> > -	latest_ns = this_cpu_read(cpu_hardirq_time);
> > -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> > -		ret = 1;
> > +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
> cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in fact
> even cputime64_t). So you need to convert cpu_hardirq_time before doing the
> subtraction.

Doh. Good catch!

> > -static int irqtime_account_si_update(void)
> > +static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
> >  {
> >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long si_jiffies = 0;
> >  	unsigned long flags;
> > -	u64 latest_ns;
> > -	int ret = 0;
> > +	u64 softirq;
> >  
> >  	local_irq_save(flags);
> > -	latest_ns = this_cpu_read(cpu_softirq_time);
> > -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
> > -		ret = 1;
> > +	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
> > +	if (softirq > cputime_one_jiffy) {
> > +		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
> > +		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
> > +	}
> >  	local_irq_restore(flags);
> > -	return ret;
> > +	return si_jiffies;
> So same comments apply here.
> 
> [...]
> > 
> >   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
> >   * tasks (sum on group iteration) belonging to @tsk's group.
> >   */
> > @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> >  {
> >  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
> >  	u64 cputime = (__force u64) cputime_one_jiffy;
> > -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long other;
> >  
> > -	if (steal_account_process_tick(ULONG_MAX))
> > +	/*
> > +	 * When returning from idle, many ticks can get accounted at
> > +	 * once, including some ticks of steal, irq, and softirq time.
> > +	 * Subtract those ticks from the amount of time accounted to
> > +	 * idle, or potentially user or system time. Due to rounding,
> > +	 * other time can exceed ticks occasionally.
> > +	 */
> > +	other = account_other_ticks(ticks);
> > +	if (other >= ticks)
> >  		return;
> > +	ticks -= other;
> >  
> >  	cputime *= ticks;
> >  	scaled *= ticks;
> So instead of dealing with ticks here, I think you should rather use the above
> cputime as both the limit and the remaining time to account after steal/irqs.
> 
> This should avoid some middle conversions and improve precision when
> cputime_t == nsecs granularity.
> 
> If we account 2 ticks to idle (let's say HZ=100) and the irq time to
> account is 15 ms: 2 ticks = 20 ms, so we have 5 ms left to account to
> idle. With the jiffies granularity in this patch, we would account one
> tick to irqtime (1 tick = 10 ms, leaving 5 ms to account back later) and
> one tick to idle time, whereas if you deal with cputime_t, you are going
> to account the correct amount of idle time.
> 

Ahhh, so you want irqtime_account_process_tick to work with
and account fractional ticks when calling account_system_time,
account_user_time, account_idle_time, etc?

I guess that should work fine since we already pass cputime
values in, anyway.

I suppose we can do the same for get_vtime_delta, too.

They can both work with the actual remaining time (in cputime_t),
after the other time has been subtracted.

I can rework the series to do that.

-- 
All Rights Reversed.




* Re: [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-23  2:25 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
@ 2016-06-27 23:21   ` Frederic Weisbecker
  2016-06-27 23:31     ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2016-06-27 23:21 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Wed, Jun 22, 2016 at 10:25:48PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
> appear to currently work right.
> 
> On CPUs that are nohz_full, people typically do not assign IRQs.

Right, but they can still fire. At least one tick per second, plus the
pinned timers, etc...

>
> On the housekeeping CPU (when a system is booted up with nohz_full),
> sampling should work ok to determine irq and softirq time use, but
> that only covers the housekeeping CPU itself, not the other
> non-nohz_full CPUs.

Hmm, every non-nohz_full CPU, including CPU 0, accounts the irqtime
the same way: through the tick (and therefore can't account much of it).

So I'm a bit confused by the above statements..

> 
> On CPUs that are nohz_idle (the typical way a distro kernel is
> booted), irq time is not accounted at all while the CPU is idle,
> due to the lack of timer ticks.

But as soon as a timer tick fires in idle or afterward, the pending
irqtime is accounted.

That said I don't see how it explains why we do the below:

> 
> Remove the VTIME_GEN vtime irq time code. The next patch will
> allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.

I don't get the reason why we are doing this. Now arguably the irqtime
accounting is probably not working as well as before since we switched to
jiffy clock. But I still see some hard irqs accounted when account_irq_exit()
is lucky enough to observe that jiffies changed since the beginning of
the interrupt.

So it's not entirely broken. I agree that we need to switch it to the
generic irqtime accounting code but breaking the code now to reactivate it
in a subsequent patch is prone to future bisection issues.

Thanks.


* Re: [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-27 23:21   ` Frederic Weisbecker
@ 2016-06-27 23:31     ` Rik van Riel
  2016-06-27 23:51       ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2016-06-27 23:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar


On Tue, 2016-06-28 at 01:21 +0200, Frederic Weisbecker wrote:
> On Wed, Jun 22, 2016 at 10:25:48PM -0400, riel@redhat.com wrote:
> > 
> > From: Rik van Riel <riel@redhat.com>
> > 
> > The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
> > appear to currently work right.
> > 
> > On CPUs that are nohz_full, people typically do not assign IRQs.
> Right, but they can still fire. At least one tick per second, plus the
> pinned timers, etc...
> 
> > On the housekeeping CPU (when a system is booted up with nohz_full),
> > sampling should work ok to determine irq and softirq time use, but
> > that only covers the housekeeping CPU itself, not the other
> > non-nohz_full CPUs.
> Hmm, every non-nohz_full CPU, including CPU 0, accounts the irqtime
> the same way: through the tick (and therefore can't account much of it).
> 
But it will be subtracted from the user time, rather
than the idle time during which the irqs happened.

Furthermore, we might well have 100 jiffies worth of
irq & softirq time on a CPU, and get just 1 jiffy
of userspace time, on systems acting like routers.

> > On CPUs that are nohz_idle (the typical way a distro kernel is
> > booted), irq time is not accounted at all while the CPU is idle,
> > due to the lack of timer ticks.
> But as soon as a timer tick fires in idle or afterward, the pending
> irqtime is accounted.
> 
> That said I don't see how it explains why we do the below:
> 
> > Remove the VTIME_GEN vtime irq time code. The next patch will
> > allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.
> I don't get the reason why we are doing this. Now arguably the irqtime
> accounting is probably not working as well as before since we switched to
> jiffy clock. But I still see some hard irqs accounted when account_irq_exit()
> is lucky enough to observe that jiffies changed since the beginning of
> the interrupt.
> 
> So it's not entirely broken. I agree that we need to switch it to the
> generic irqtime accounting code but breaking the code now to reactivate it
> in a subsequent patch is prone to future bisection issues.

Want me to merge patches 2 & 3 into one, so we immediately
start using the generic code and do not run into bisect
issues?

-- 
All Rights Reversed.




* Re: [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-27 23:31     ` Rik van Riel
@ 2016-06-27 23:51       ` Frederic Weisbecker
  0 siblings, 0 replies; 31+ messages in thread
From: Frederic Weisbecker @ 2016-06-27 23:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Mon, Jun 27, 2016 at 07:31:29PM -0400, Rik van Riel wrote:
> On Tue, 2016-06-28 at 01:21 +0200, Frederic Weisbecker wrote:
> > On Wed, Jun 22, 2016 at 10:25:48PM -0400, riel@redhat.com wrote:
> > > 
> > > From: Rik van Riel <riel@redhat.com>
> > > 
> > > The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
> > > appear to currently work right.
> > > 
> > > On CPUs that are nohz_full, people typically do not assign IRQs.
> > Right, but they can still fire. At least one tick per second, plus
> > the pinned timers, etc...
> > 
> > > On the housekeeping CPU (when a system is booted up with nohz_full),
> > > sampling should work ok to determine irq and softirq time use, but
> > > that only covers the housekeeping CPU itself, not the other
> > > non-nohz_full CPUs.
> > Hmm, every non-nohz_full CPU, including CPU 0, accounts the irqtime
> > the same way: through the tick (and therefore can't account much of it).
> > 
> But it will be subtracted from the user time, rather
> than the idle time during which the irqs happened.
> 
> Furthermore, we might well have 100 jiffies worth of
> irq & softirq time on a CPU, and get just 1 jiffy
> of userspace time, on systems acting like routers.

Indeed.

> > > Remove the VTIME_GEN vtime irq time code. The next patch will
> > > allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.
> > I don't get the reason why we are doing this. Now arguably the irqtime
> > accounting is probably not working as well as before since we switched to
> > jiffy clock. But I still see some hard irqs accounted when account_irq_exit()
> > is lucky enough to observe that jiffies changed since the beginning of
> > the interrupt.
> > 
> > So it's not entirely broken. I agree that we need to switch it to the
> > generic irqtime accounting code but breaking the code now to reactivate it
> > in a subsequent patch is prone to future bisection issues.
> 
> Want me to merge patches 2 & 3 into one, so we immediately
> start using the generic code and do not run into bisect
> issues?

Yeah that would be better.

Thanks.


* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-27 12:50     ` Rik van Riel
@ 2016-06-28 20:56       ` Frederic Weisbecker
  0 siblings, 0 replies; 31+ messages in thread
From: Frederic Weisbecker @ 2016-06-28 20:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Mon, Jun 27, 2016 at 08:50:06AM -0400, Rik van Riel wrote:
> On Mon, 2016-06-27 at 14:25 +0200, Frederic Weisbecker wrote:
> > > 
> > >   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
> > >   * tasks (sum on group iteration) belonging to @tsk's group.
> > >   */
> > > @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> > >  {
> > >  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
> > >  	u64 cputime = (__force u64) cputime_one_jiffy;
> > > -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > > +	unsigned long other;
> > >  
> > > -	if (steal_account_process_tick(ULONG_MAX))
> > > +	/*
> > > +	 * When returning from idle, many ticks can get accounted at
> > > +	 * once, including some ticks of steal, irq, and softirq time.
> > > +	 * Subtract those ticks from the amount of time accounted to
> > > +	 * idle, or potentially user or system time. Due to rounding,
> > > +	 * other time can exceed ticks occasionally.
> > > +	 */
> > > +	other = account_other_ticks(ticks);
> > > +	if (other >= ticks)
> > >  		return;
> > > +	ticks -= other;
> > >  
> > >  	cputime *= ticks;
> > >  	scaled *= ticks;
> > So instead of dealing with ticks here, I think you should rather
> > use the above cputime as both the limit and the remaining time to
> > account after steal/irqs.
> > 
> > This should avoid some middle conversions and improve precision
> > when cputime_t == nsecs granularity.
> > 
> > If we account 2 ticks to idle (let's say HZ=100) and irq time to
> > account is 15 ms, 2 ticks = 20 ms, so we have 5 ms left to account
> > to idle. With the jiffies granularity in this patch, we would
> > account one tick to irqtime (1 tick = 10 ms; there will be 5 ms to
> > account back later) and one tick to idle time, whereas if you deal
> > with cputime_t, you are going to account the correct amount of idle
> > time.
> > 
> 
> Ahhh, so you want irqtime_account_process_tick to work with, and
> account, fractional ticks when calling account_system_time,
> account_user_time, account_idle_time, etc.?

Not exactly. This function is passed ticks but works with the
cputime-converted value of those ticks. And it would be more precise
to work on top of cputime values, without converting things to jiffies
in the middle, as we may lose precision on the way.

> 
> I guess that should work fine since we already pass cputime
> values in, anyway.
> 
> I suppose we can do the same for get_vtime_delta, too.

But get_vtime_delta() already returns cputime, right?

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:07       ` Wanpeng Li
@ 2016-08-10  7:51         ` Wanpeng Li
  0 siblings, 0 replies; 31+ messages in thread
From: Wanpeng Li @ 2016-08-10  7:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:07 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during a period as irq time, when there was some irq and some
>>> > softirq time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called from account_idle_ticks, which can pass a larger number of
>>> > ticks down all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update, and steal_account_process_ticks to work
>>> > with cputime_t time units, and return the amount of time spent in
>>> > each mode.
>>>
>>> Do we need to subtract st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I tried to add this logic
>>> w/ the noirqtime and idle=poll boot parameters for a full dynticks
>>> guest; however, there is no difference. What am I missing?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> I will try a nohz idle kvm guest and some more tests; a patch will be
> sent out once successful.

After applying the same logic to account_idle_ticks() for a nohz idle
kvm guest (noirqtime, idle=poll, one pCPU and four vCPUs), the average
idle drops from 56.8% to 54.75%; I think it makes sense to make a
formal patch.
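
A minimal sketch of what that treatment could look like (an assumption
about the eventual patch, reusing steal_account_process_time() from
this series; not the code as actually posted):

	void account_idle_ticks(unsigned long ticks)
	{
		cputime_t cputime, steal;

		if (sched_clock_irqtime) {
			irqtime_account_idle_ticks(ticks);
			return;
		}

		cputime = jiffies_to_cputime(ticks);
		steal = steal_account_process_time(cputime);

		if (steal >= cputime)
			return;

		/* subtract steal time before accounting the rest as idle */
		account_idle_time(cputime - steal);
	}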

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-10  5:07           ` Rik van Riel
@ 2016-08-10  6:33             ` Wanpeng Li
  0 siblings, 0 replies; 31+ messages in thread
From: Wanpeng Li @ 2016-08-10  6:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 13:07 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Wed, 2016-08-10 at 07:39 +0800, Wanpeng Li wrote:
>> 2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>> > 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > > On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> > > > Hi Rik,
>> > > > 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>> > > > > From: Rik van Riel <riel@redhat.com>
>> > > > >
>> > > > > Currently, if there was any irq or softirq time during 'ticks'
>> > > > > jiffies, the entire period will be accounted as irq or softirq
>> > > > > time.
>> > > > >
>> > > > > This is inaccurate if only a subset of the time was actually
>> > > > > spent handling irqs, and could conceivably mis-count all of the
>> > > > > ticks during a period as irq time, when there was some irq and
>> > > > > some softirq time.
>> > > > >
>> > > > > This can actually happen when irqtime_account_process_tick is
>> > > > > called from account_idle_ticks, which can pass a larger number
>> > > > > of ticks down all at once.
>> > > > >
>> > > > > Fix this by changing irqtime_account_hi_update,
>> > > > > irqtime_account_si_update, and steal_account_process_ticks to
>> > > > > work with cputime_t time units, and return the amount of time
>> > > > > spent in each mode.
>> > > >
>> > > > Do we need to subtract st cputime from idle cputime in
>> > > > account_idle_ticks() when noirqtime is true? I tried to add this
>> > > > logic w/ the noirqtime and idle=poll boot parameters for a full
>> > > > dynticks guest; however, there is no difference. What am I missing?
>> > >
>> > > Yes, you are right. The code in account_idle_ticks()
>> > > could use the same treatment.
>> > >
>> > > I am not sure why it would not work, though...
>> >
>> > Actually I observed a regression caused by this patch. I use an i5
>>
>> The regression is caused by your commit "sched,time: Count actually
>> elapsed irq & softirq time".
>
> Wanpeng and I discussed this issue, and discovered
> that this bug is triggered by my patch, specifically
> this bit:
>
> -       if (steal_account_process_tick(ULONG_MAX))
> +       other = account_other_time(cputime);
> +       if (other >= cputime)
>                 return;
>
> Replacing "cputime" with "ULONG_MAX" as the argument
> to account_other_time makes the bug disappear.
>
> However, this is not the cause of the bug.
>
> The cause of the bug appears to be that the values
> used to figure out how much steal time has passed
> are never initialized.
>
>                 steal = paravirt_steal_clock(smp_processor_id());
>                 steal -= this_rq()->prev_steal_time;
>
> The first of the two may be initialized by the host
> (I did not verify that), but the second one does not
> have any explicit initializers anywhere in the kernel
> tree.
>
> This can lead to an arbitrarily large difference between
> paravirt_steal_clock(smp_processor_id()) and
> this_rq()->prev_steal_time, which results in nothing but
> steal time getting accounted for a potentially a very
> long amount of time.
>
> Previously we carried this patch to initialize the
> various rq->prev_* values at CPU hotplug time:
>
> https://patchwork.codeaurora.org/patch/27699/
>
> Which got reverted by Paolo here:
>
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=3d89e5478bf550a50c99e93adf659369798263b0
>
> Which leads me to this question:
>
> Paolo, how would you like us to fix this bug?
>
> It seems like the host and guest steal time CAN get out
> of sync, sometimes by a ridiculous amount, and we need
> some way to get the excessive difference out of the way,
> without it getting accounted as steal time (not immediately,
> and not over the next 17 hours, or months).

I can be the volunteer to fix it if you guys have an idea. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:39         ` Wanpeng Li
@ 2016-08-10  5:07           ` Rik van Riel
  2016-08-10  6:33             ` Wanpeng Li
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2016-08-10  5:07 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 3911 bytes --]

On Wed, 2016-08-10 at 07:39 +0800, Wanpeng Li wrote:
> 2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> > 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > > On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
> > > > Hi Rik,
> > > > 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> > > > > From: Rik van Riel <riel@redhat.com>
> > > > > 
> > > > > Currently, if there was any irq or softirq time during 'ticks'
> > > > > jiffies, the entire period will be accounted as irq or softirq
> > > > > time.
> > > > > 
> > > > > This is inaccurate if only a subset of the time was actually
> > > > > spent handling irqs, and could conceivably mis-count all of the
> > > > > ticks during a period as irq time, when there was some irq and
> > > > > some softirq time.
> > > > > 
> > > > > This can actually happen when irqtime_account_process_tick is
> > > > > called from account_idle_ticks, which can pass a larger number
> > > > > of ticks down all at once.
> > > > > 
> > > > > Fix this by changing irqtime_account_hi_update,
> > > > > irqtime_account_si_update, and steal_account_process_ticks to
> > > > > work with cputime_t time units, and return the amount of time
> > > > > spent in each mode.
> > > > 
> > > > Do we need to subtract st cputime from idle cputime in
> > > > account_idle_ticks() when noirqtime is true? I tried to add this
> > > > logic w/ the noirqtime and idle=poll boot parameters for a full
> > > > dynticks guest; however, there is no difference. What am I missing?
> > > 
> > > Yes, you are right. The code in account_idle_ticks()
> > > could use the same treatment.
> > > 
> > > I am not sure why it would not work, though...
> > 
> > Actually I observed a regression caused by this patch. I use an i5
> 
> The regression is caused by your commit "sched,time: Count actually
> elapsed irq & softirq time".

Wanpeng and I discussed this issue, and discovered
that this bug is triggered by my patch, specifically
this bit:

-       if (steal_account_process_tick(ULONG_MAX))
+       other = account_other_time(cputime);
+       if (other >= cputime)
                return;

Replacing "cputime" with "ULONG_MAX" as the argument
to account_other_time makes the bug disappear.

However, this is not the cause of the bug.

The cause of the bug appears to be that the values
used to figure out how much steal time has passed
are never initialized.

                steal = paravirt_steal_clock(smp_processor_id());
                steal -= this_rq()->prev_steal_time;

The first of the two may be initialized by the host
(I did not verify that), but the second one does not
have any explicit initializers anywhere in the kernel
tree.

This can lead to an arbitrarily large difference between
paravirt_steal_clock(smp_processor_id()) and
this_rq()->prev_steal_time, which results in nothing but
steal time getting accounted for a potentially very
long amount of time.

Previously we carried this patch to initialize the
various rq->prev_* values at CPU hotplug time:

https://patchwork.codeaurora.org/patch/27699/

Which got reverted by Paolo here:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=3d89e5478bf550a50c99e93adf659369798263b0

Which leads me to this question:

Paolo, how would you like us to fix this bug?

It seems like the host and guest steal time CAN get out
of sync, sometimes by a ridiculous amount, and we need
some way to get the excessive difference out of the way,
without it getting accounted as steal time (not immediately,
and not over the next 17 hours, or months).
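
For illustration, a sketch along the lines of the reverted patch linked
above (hypothetical code, modeled on that approach rather than quoted
from it): snapshot the steal clock when a CPU comes online, so the
first delta starts from zero instead of from whatever accumulated while
the vCPU was offline.

	/* hypothetical helper, called from the CPU hotplug path */
	static void init_prev_steal_time(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

	#ifdef CONFIG_PARAVIRT
		rq->prev_steal_time = paravirt_steal_clock(cpu);
	#endif
	#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
		rq->prev_steal_time_rq = paravirt_steal_clock(cpu);
	#endif
	}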

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
  2016-08-09 23:35         ` Wanpeng Li
@ 2016-08-09 23:39         ` Wanpeng Li
  2016-08-10  5:07           ` Rik van Riel
  2 siblings, 1 reply; 31+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during a period as irq time, when there was some irq and some
>>> > softirq time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called from account_idle_ticks, which can pass a larger number of
>>> > ticks down all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update, and steal_account_process_ticks to work
>>> > with cputime_t time units, and return the amount of time spent in
>>> > each mode.
>>>
>>> Do we need to subtract st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I tried to add this logic
>>> w/ the noirqtime and idle=poll boot parameters for a full dynticks
>>> guest; however, there is no difference. What am I missing?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use an i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, and 4 vCPUs for one full dynticks guest, with four
> cpu hog processes (for loops) running in the guest. I hot-unplug the
> pCPUs on the host one by one until there is only one left, then observe
> top in the guest: there is 100% st for cpu0 (housekeeping), and 75% st
> for the other cpus (nohz full). However, w/o this patch, it is 75% for
> all four cpus.
>
> I have been trying to figure this out recently; any tip is greatly
> appreciated. :)
>
> Regards,
> Wanpeng Li

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
@ 2016-08-09 23:35         ` Wanpeng Li
  2016-08-09 23:39         ` Wanpeng Li
  2 siblings, 0 replies; 31+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during a period as irq time, when there was some irq and some
>>> > softirq time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called from account_idle_ticks, which can pass a larger number of
>>> > ticks down all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update, and steal_account_process_ticks to work
>>> > with cputime_t time units, and return the amount of time spent in
>>> > each mode.
>>>
>>> Do we need to subtract st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I tried to add this logic
>>> w/ the noirqtime and idle=poll boot parameters for a full dynticks
>>> guest; however, there is no difference. What am I missing?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use an i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, and 4 vCPUs for one full dynticks guest, with four
> cpu hog processes (for loops) running in the guest. I hot-unplug the
> pCPUs on the host one by one until there is only one left, then observe
> top in the guest: there is 100% st for cpu0 (housekeeping), and 75% st
> for the other cpus (nohz full). However, w/o this patch, it is 75% for
> all four cpus.
>
> I have been trying to figure this out recently; any tip is greatly
> appreciated. :)
>
> Regards,
> Wanpeng Li

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
@ 2016-08-09 23:31         ` Wanpeng Li
  2016-08-09 23:35         ` Wanpeng Li
  2016-08-09 23:39         ` Wanpeng Li
  2 siblings, 0 replies; 31+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during a period as irq time, when there was some irq and some
>>> > softirq time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called from account_idle_ticks, which can pass a larger number of
>>> > ticks down all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update, and steal_account_process_ticks to work
>>> > with cputime_t time units, and return the amount of time spent in
>>> > each mode.
>>>
>>> Do we need to subtract st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I tried to add this logic
>>> w/ the noirqtime and idle=poll boot parameters for a full dynticks
>>> guest; however, there is no difference. What am I missing?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use an i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, and 4 vCPUs for one full dynticks guest, with four
> cpu hog processes (for loops) running in the guest. I hot-unplug the
> pCPUs on the host one by one until there is only one left, then observe
> top in the guest: there is 100% st for cpu0 (housekeeping), and 75% st
> for the other cpus (nohz full). However, w/o this patch, it is 75% for
> all four cpus.
>
> I have been trying to figure this out recently; any tip is greatly
> appreciated. :)
>
> Regards,
> Wanpeng Li

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 14:06     ` Rik van Riel
  2016-08-09 23:07       ` Wanpeng Li
@ 2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
                           ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> Hi Rik,
>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>> > From: Rik van Riel <riel@redhat.com>
>> >
>> > Currently, if there was any irq or softirq time during 'ticks'
>> > jiffies, the entire period will be accounted as irq or softirq
>> > time.
>> >
>> > This is inaccurate if only a subset of the time was actually spent
>> > handling irqs, and could conceivably mis-count all of the ticks
>> > during a period as irq time, when there was some irq and some
>> > softirq time.
>> >
>> > This can actually happen when irqtime_account_process_tick is
>> > called from account_idle_ticks, which can pass a larger number of
>> > ticks down all at once.
>> >
>> > Fix this by changing irqtime_account_hi_update,
>> > irqtime_account_si_update, and steal_account_process_ticks to work
>> > with cputime_t time units, and return the amount of time spent in
>> > each mode.
>>
>> Do we need to subtract st cputime from idle cputime in
>> account_idle_ticks() when noirqtime is true? I tried to add this logic
>> w/ the noirqtime and idle=poll boot parameters for a full dynticks
>> guest; however, there is no difference. What am I missing?
>
> Yes, you are right. The code in account_idle_ticks()
> could use the same treatment.
>
> I am not sure why it would not work, though...

Actually I observed a regression caused by this patch. I use an i5
laptop, 4 pCPUs, and 4 vCPUs for one full dynticks guest, with four
cpu hog processes (for loops) running in the guest. I hot-unplug the
pCPUs on the host one by one until there is only one left, then observe
top in the guest: there is 100% st for cpu0 (housekeeping), and 75% st
for the other cpus (nohz full). However, w/o this patch, it is 75% for
all four cpus.

I have been trying to figure this out recently; any tip is greatly
appreciated. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 14:06     ` Rik van Riel
@ 2016-08-09 23:07       ` Wanpeng Li
  2016-08-10  7:51         ` Wanpeng Li
  2016-08-09 23:25       ` Wanpeng Li
  1 sibling, 1 reply; 31+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> Hi Rik,
>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>> > From: Rik van Riel <riel@redhat.com>
>> >
>> > Currently, if there was any irq or softirq time during 'ticks'
>> > jiffies, the entire period will be accounted as irq or softirq
>> > time.
>> >
>> > This is inaccurate if only a subset of the time was actually spent
>> > handling irqs, and could conceivably mis-count all of the ticks
>> > during a period as irq time, when there was some irq and some
>> > softirq time.
>> >
>> > This can actually happen when irqtime_account_process_tick is
>> > called from account_idle_ticks, which can pass a larger number of
>> > ticks down all at once.
>> >
>> > Fix this by changing irqtime_account_hi_update,
>> > irqtime_account_si_update, and steal_account_process_ticks to work
>> > with cputime_t time units, and return the amount of time spent in
>> > each mode.
>>
>> Do we need to subtract st cputime from idle cputime in
>> account_idle_ticks() when noirqtime is true? I tried to add this logic
>> w/ the noirqtime and idle=poll boot parameters for a full dynticks
>> guest; however, there is no difference. What am I missing?
>
> Yes, you are right. The code in account_idle_ticks()
> could use the same treatment.
>
> I am not sure why it would not work, though...

I will try a nohz idle kvm guest and some more tests; a patch will be
sent out once successful.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09  3:59   ` Wanpeng Li
@ 2016-08-09 14:06     ` Rik van Riel
  2016-08-09 23:07       ` Wanpeng Li
  2016-08-09 23:25       ` Wanpeng Li
  0 siblings, 2 replies; 31+ messages in thread
From: Rik van Riel @ 2016-08-09 14:06 UTC (permalink / raw)
  To: Wanpeng Li, Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Paolo Bonzini, Peter Zijlstra, Wanpeng Li,
	Thomas Gleixner, Radim Krcmar, Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 1395 bytes --]

On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
> Hi Rik,
> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Currently, if there was any irq or softirq time during 'ticks'
> > jiffies, the entire period will be accounted as irq or softirq
> > time.
> > 
> > This is inaccurate if only a subset of the time was actually spent
> > handling irqs, and could conceivably mis-count all of the ticks
> > during a period as irq time, when there was some irq and some
> > softirq time.
> >
> > This can actually happen when irqtime_account_process_tick is
> > called from account_idle_ticks, which can pass a larger number of
> > ticks down all at once.
> >
> > Fix this by changing irqtime_account_hi_update,
> > irqtime_account_si_update, and steal_account_process_ticks to work
> > with cputime_t time units, and return the amount of time spent in
> > each mode.
> 
> Do we need to subtract st cputime from idle cputime in
> account_idle_ticks() when noirqtime is true? I tried to add this logic
> w/ the noirqtime and idle=poll boot parameters for a full dynticks
> guest; however, there is no difference. What am I missing?

Yes, you are right. The code in account_idle_ticks()
could use the same treatment.

I am not sure why it would not work, though...

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
@ 2016-08-09  3:59   ` Wanpeng Li
  2016-08-09 14:06     ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Wanpeng Li @ 2016-08-09  3:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Rik van Riel, Paolo Bonzini, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith

Hi Rik,
2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> From: Rik van Riel <riel@redhat.com>
>
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period will be accounted as irq or softirq
> time.
>
> This is inaccurate if only a subset of the time was actually spent
> handling irqs, and could conceivably mis-count all of the ticks during
> a period as irq time, when there was some irq and some softirq time.
>
> This can actually happen when irqtime_account_process_tick is called
> from account_idle_ticks, which can pass a larger number of ticks down
> all at once.
>
> Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
> and steal_account_process_ticks to work with cputime_t time units, and
> return the amount of time spent in each mode.

Do we need to subtract st cputime from idle cputime in
account_idle_ticks() when noirqtime is true? I tried to add this logic
w/ the noirqtime and idle=poll boot parameters for a full dynticks
guest; however, there is no difference. What am I missing?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
@ 2016-07-13 14:50 ` Frederic Weisbecker
  2016-08-09  3:59   ` Wanpeng Li
  0 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2016-07-13 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Rik van Riel, Paolo Bonzini, Peter Zijlstra, Wanpeng Li,
	Thomas Gleixner, Radim Krcmar, Frederic Weisbecker,
	Mike Galbraith

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of the time was actually spent
handling irqs, and could conceivably mis-count all of the ticks during
a period as irq time, when there was some irq and some softirq time.

This can actually happen when irqtime_account_process_tick is called
from account_idle_ticks, which can pass a larger number of ticks down
all at once.

Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
and steal_account_process_ticks to work with cputime_t time units, and
return the amount of time spent in each mode.

Rename steal_account_process_ticks to steal_account_process_time, to
reflect that time is now accounted in cputime_t, instead of ticks.

Additionally, have irqtime_account_process_tick take into account how
much time was spent in each of steal, irq, and softirq time.

The latter could help improve the accuracy of cputime
accounting when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Radim Krcmar <rkrcmar@redhat.com>
[fweisbec: Make nsecs_to_cputime64() actually return cputime64_t]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/asm-generic/cputime_nsecs.h |   2 +
 kernel/sched/cputime.c              | 124 ++++++++++++++++++++++--------------
 2 files changed, 79 insertions(+), 47 deletions(-)

diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h
index 0f1c6f3..a84e28e 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -50,6 +50,8 @@ typedef u64 __nocast cputime64_t;
 	(__force u64)(__ct)
 #define nsecs_to_cputime(__nsecs)	\
 	(__force cputime_t)(__nsecs)
+#define nsecs_to_cputime64(__nsecs)	\
+	(__force cputime64_t)(__nsecs)
 
 
 /*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d..db82ae1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static cputime_t irqtime_account_hi_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t irq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)) -
+		      cpustat[CPUTIME_IRQ];
+	irq_cputime = min(irq_cputime, maxtime);
+	cpustat[CPUTIME_IRQ] += irq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return irq_cputime;
 }
 
-static int irqtime_account_si_update(void)
+static cputime_t irqtime_account_si_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t softirq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_softirq_time)) -
+			  cpustat[CPUTIME_SOFTIRQ];
+	softirq_cputime = min(softirq_cputime, maxtime);
+	cpustat[CPUTIME_SOFTIRQ] += softirq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return softirq_cputime;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static cputime_t irqtime_account_hi_update(cputime_t dummy)
+{
+	return 0;
+}
+
+static cputime_t irqtime_account_si_update(cputime_t dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -257,32 +267,45 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline unsigned long steal_account_process_tick(unsigned long max_jiffies)
+static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
+		cputime_t steal_cputime;
 		u64 steal;
-		unsigned long steal_jiffies;
 
 		steal = paravirt_steal_clock(smp_processor_id());
 		steal -= this_rq()->prev_steal_time;
 
-		/*
-		 * steal is in nsecs but our caller is expecting steal
-		 * time in jiffies. Lets cast the result to jiffies
-		 * granularity and account the rest on the next rounds.
-		 */
-		steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
-		this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);
+		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		account_steal_time(steal_cputime);
+		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
 
-		account_steal_time(jiffies_to_cputime(steal_jiffies));
-		return steal_jiffies;
+		return steal_cputime;
 	}
 #endif
 	return 0;
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ */
+static inline cputime_t account_other_time(cputime_t max)
+{
+	cputime_t accounted;
+
+	accounted = steal_account_process_time(max);
+
+	if (accounted < max)
+		accounted += irqtime_account_hi_update(max - accounted);
+
+	if (accounted < max)
+		accounted += irqtime_account_si_update(max - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -342,21 +365,23 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 					 struct rq *rq, int ticks)
 {
-	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
-	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	u64 cputime = (__force u64) cputime_one_jiffy * ticks;
+	cputime_t scaled, other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_time(cputime);
+	if (other >= cputime)
 		return;
+	cputime -= other;
+	scaled = cputime_to_scaled(cputime);
 
-	cputime *= ticks;
-	scaled *= ticks;
-
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
@@ -466,7 +491,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
  */
 void account_process_tick(struct task_struct *p, int user_tick)
 {
-	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
+	cputime_t cputime, scaled, steal;
 	struct rq *rq = this_rq();
 
 	if (vtime_accounting_cpu_enabled())
@@ -477,16 +502,21 @@ void account_process_tick(struct task_struct *p, int user_tick)
 		return;
 	}
 
-	if (steal_account_process_tick(ULONG_MAX))
+	cputime = cputime_one_jiffy;
+	steal = steal_account_process_time(cputime);
+
+	if (steal >= cputime)
 		return;
 
+	cputime -= steal;
+	scaled = cputime_to_scaled(cputime);
+
 	if (user_tick)
-		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
+		account_user_time(p, cputime, scaled);
 	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
-		account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
-				    one_jiffy_scaled);
+		account_system_time(p, HARDIRQ_OFFSET, cputime, scaled);
 	else
-		account_idle_time(cputime_one_jiffy);
+		account_idle_time(cputime);
 }
 
 /*
@@ -681,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	cputime_t delta, steal;
 
-	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	delta = jiffies_to_cputime(now - tsk->vtime_snap);
+	steal = steal_account_process_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return delta - steal;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-22 10:40     ` Paolo Bonzini
@ 2016-06-22 10:52       ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-22 10:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: riel, linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar

On Wed, Jun 22, 2016 at 12:40:31PM +0200, Paolo Bonzini wrote:
> 
> 
> On 21/06/2016 23:21, Peter Zijlstra wrote:
> > cputime_to_jiffies is a division, could we not avoid that by doing
> > something like:
> > 
> > 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> > 	while (irq_jiffies > cputime_one_jiffy) {
> > 		irq_jiffies -= cputime_one_jiffy;
> > 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> > 	}
> > 
> > assuming that the loop is 'rare' etc. If not, only do the division on
> > that same > cputime_one_jiffy condition.
> 
> It's a division by a constant, so it ought to become a multiplication.  For
> 64-bit it will, and context tracking is only enabled for 64-bit.

Right, it's not enabled on i386; however, plenty of 32-bit archs
(including ARM) do have it enabled.

> BTW, for 32-bit there's a monster of a macro to turn do_div with
> constant divisor into multiplications in include/asm-generic/div64.h.
> However, x86-32 doesn't use it.

Right, but some ARM chips don't exactly have a fast multiplier either,
so avoiding even the 64-bit mult (which ends up being 3 or so actual
mult instructions) is worthwhile, I'd say.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-21 21:21   ` Peter Zijlstra
  2016-06-21 22:20     ` Rik van Riel
@ 2016-06-22 10:40     ` Paolo Bonzini
  2016-06-22 10:52       ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Paolo Bonzini @ 2016-06-22 10:40 UTC (permalink / raw)
  To: Peter Zijlstra, riel
  Cc: linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar



On 21/06/2016 23:21, Peter Zijlstra wrote:
> cputime_to_jiffies is a division, could we not avoid that by doing
> something like:
> 
> 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> 	while (irq_jiffies > cputime_one_jiffy) {
> 		irq_jiffies -= cputime_one_jiffy;
> 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> 	}
> 
> assuming that the loop is 'rare' etc. If not, only do the division on
> that same > cputime_one_jiffy condition.

It's a division by a constant, so it ought to become a multiplication.  For
64-bit it will, and context tracking is only enabled for 64-bit.

BTW, for 32-bit there's a monster of a macro to turn do_div with
constant divisor into multiplications in include/asm-generic/div64.h.
However, x86-32 doesn't use it.
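
As a concrete, stand-alone sketch of the point (not kernel code, and
assuming nanosecond cputime with HZ=100), a division like the one
inside cputime_to_jiffies() has a compile-time-constant divisor:

	unsigned long ns_to_jiffies_demo(unsigned long long ns)
	{
		/* constant divisor: on 64-bit, gcc emits a multiply by the
		 * reciprocal plus a shift instead of a div instruction */
		return ns / 10000000ULL;	/* NSEC_PER_SEC / HZ at HZ=100 */
	}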

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-21 21:21   ` Peter Zijlstra
@ 2016-06-21 22:20     ` Rik van Riel
  2016-06-22 10:40     ` Paolo Bonzini
  1 sibling, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2016-06-21 22:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 1244 bytes --]

On Tue, 2016-06-21 at 23:21 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 12:06:03PM -0400, riel@redhat.com wrote:
> > 
> > +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
> >  {
> >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long irq_jiffies;
> >  	unsigned long flags;
> > +	u64 irq;
> >  
> >  	local_irq_save(flags);
> > +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
> > +	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);
> cputime_to_jiffies is a division, could we not avoid that by doing
> something like:
> 
> 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> 	while (irq_jiffies > cputime_one_jiffy) {
> 		irq_jiffies -= cputime_one_jiffy;
> 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> 	}
> 
> assuming that the loop is 'rare' etc. If not, only do the division on
> that same > cputime_one_jiffy condition.

I suspect the loop is not rare on systems with nohz_idle,
where it may be quite a while before a timer tick happens
on an idle cpu.

I can certainly make sure the division is only done when
irq > 2*cputime_one_jiffy. I will do that in the next
version.
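
A sketch of what that guard could look like inside
irqtime_account_hi_update() (hypothetical code, written here ahead of
the next version of the patch):

	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
	if (irq < 2 * cputime_one_jiffy) {
		/* common case: less than two jiffies elapsed, no division */
		irq_jiffies = (irq >= cputime_one_jiffy) ? 1 : 0;
	} else {
		irq_jiffies = cputime_to_jiffies(irq);
	}
	irq_jiffies = min(irq_jiffies, max_jiffies);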

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
  2016-06-16 16:22   ` kbuild test robot
@ 2016-06-21 21:21   ` Peter Zijlstra
  2016-06-21 22:20     ` Rik van Riel
  2016-06-22 10:40     ` Paolo Bonzini
  1 sibling, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-21 21:21 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Thu, Jun 16, 2016 at 12:06:03PM -0400, riel@redhat.com wrote:
> +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long irq_jiffies;
>  	unsigned long flags;
> +	u64 irq;
>  
>  	local_irq_save(flags);
> +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
> +	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);

cputime_to_jiffies is a division, could we not avoid that by doing
something like:

	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
	while (irq_jiffies > cputime_one_jiffy) {
		irq_jiffies -= cputime_one_jiffy;
		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
	}

assuming that the loop is 'rare' etc. If not, only do the division on
that same > cputime_one_jiffy condition.

> +	if (irq_jiffies)
> +		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
>  	local_irq_restore(flags);
> +	return irq_jiffies;
>  }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-16 16:22   ` kbuild test robot
  2016-06-21 21:21   ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: kbuild test robot @ 2016-06-16 16:22 UTC (permalink / raw)
  To: riel
  Cc: kbuild-all, linux-kernel, mingo, pbonzini, peterz, fweisbec,
	wanpeng.li, efault, tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 2010 bytes --]

Hi,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20160616]
[cannot apply to v4.7-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/sched-time-fix-irq-time-accounting-with-nohz_idle/20160617-001150
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

Note: the linux-review/riel-redhat-com/sched-time-fix-irq-time-accounting-with-nohz_idle/20160617-001150 HEAD cc827b6bc02ef052d79b5dbd2c355d9483a8ada7 builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

   kernel/sched/cputime.c: In function 'account_other_ticks':
>> kernel/sched/cputime.c:299:16: error: implicit declaration of function 'irqtime_account_hi_update' [-Werror=implicit-function-declaration]
      accounted += irqtime_account_hi_update(max_jiffies - accounted);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/sched/cputime.c:302:16: error: implicit declaration of function 'irqtime_account_si_update' [-Werror=implicit-function-declaration]
      accounted += irqtime_account_si_update(max_jiffies - accounted);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/irqtime_account_hi_update +299 kernel/sched/cputime.c

   293	{
   294		unsigned long accounted;
   295	
   296		accounted = steal_account_process_tick(max_jiffies);
   297	
   298		if (accounted < max_jiffies)
 > 299			accounted += irqtime_account_hi_update(max_jiffies - accounted);
   300	
   301		if (accounted < max_jiffies)
 > 302			accounted += irqtime_account_si_update(max_jiffies - accounted);
   303	
   304		return accounted;
   305	}
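
The errors come from the !CONFIG_IRQ_TIME_ACCOUNTING branch, which in
this version of the patch has no fallback definitions. No-op stubs
along these lines (a sketch in the jiffies units this version uses;
the later cputime_t version of the patch adds the equivalent stubs)
let account_other_ticks() build in the tiny config:

	#else /* CONFIG_IRQ_TIME_ACCOUNTING */

	static unsigned long irqtime_account_hi_update(unsigned long dummy)
	{
		return 0;
	}

	static unsigned long irqtime_account_si_update(unsigned long dummy)
	{
		return 0;
	}

	#endif /* !CONFIG_IRQ_TIME_ACCOUNTING */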

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 6335 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:22   ` kbuild test robot
  2016-06-21 21:21   ` Peter Zijlstra
  0 siblings, 2 replies; 31+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of 'ticks' jiffies was
actually spent handling irqs, and could conceivably mis-count
all of the ticks during a period as irq time, when there was
some irq and some softirq time.

This can actually happen when irqtime_account_process_tick
is called from account_idle_ticks, which can pass a larger
number of ticks down all at once.

Fix this by changing irqtime_account_hi_update and
irqtime_account_si_update to round elapsed irq and softirq
time to jiffies, and return the number of jiffies spent in
each mode, similar to how steal time is handled.

Additionally, have irqtime_account_process_tick take into
account how much time was spent in each of steal, irq,
and softirq time.

The latter could help improve the accuracy of timekeeping
when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 69 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 48 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d76fdb..9bd2d4f42037 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,34 +79,36 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long irq_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 irq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
+	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);
+	if (irq_jiffies)
+		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return irq_jiffies;
 }
 
-static int irqtime_account_si_update(void)
+static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long si_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 softirq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
+	si_jiffies = min(cputime_to_jiffies(softirq), max_jiffies);
+	if (si_jiffies)
+		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return si_jiffies;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
@@ -283,6 +285,26 @@ static __always_inline unsigned long steal_account_process_tick(unsigned long ma
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ * Due to rounding errors, the calculated amount can sometimes exceed
+ * max_jiffies; be careful not to account more than max_jiffies.
+ */
+static inline int account_other_ticks(unsigned long max_jiffies)
+{
+	unsigned long accounted;
+
+	accounted = steal_account_process_tick(max_jiffies);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_hi_update(max_jiffies - accounted);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_si_update(max_jiffies - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -344,19 +366,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 {
 	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_ticks(ticks);
+	if (other >= ticks)
 		return;
+	ticks -= other;
 
 	cputime *= ticks;
 	scaled *= ticks;
 
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-08  2:29 [PATCH RFC 0/5] sched,time: make irq time accounting work for nohz_idle riel
@ 2016-06-08  2:30 ` riel
  0 siblings, 0 replies; 31+ messages in thread
From: riel @ 2016-06-08  2:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: kernellwp, mingo, peterz, tglx, fweisbec

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of 'ticks' jiffies was
actually spent handling irqs, and could conceivably mis-count
all of the ticks during a period as irq time, when there was
some irq and some softirq time.

Fix this by changing irqtime_account_hi_update and
irqtime_account_si_update to round elapsed irq and softirq
time to jiffies, and return the number of jiffies spent in
each mode, similar to how steal time is handled.

Additionally, have irqtime_account_process_tick take into
account how much time was spent in each of steal, irq,
and softirq time.

The latter could help improve the accuracy of timekeeping
when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 54 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f51c98c740b5..4bd6d1b774ab 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,34 +79,36 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static unsigned long irqtime_account_hi_update(void)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long irq_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 irq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)) - cpustat[CPUTIME_IRQ];
+	irq_jiffies = cputime_to_jiffies(irq);
+	if (irq_jiffies)
+		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return irq_jiffies;
 }
 
-static int irqtime_account_si_update(void)
+static unsigned long irqtime_account_si_update(void)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long si_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 softirq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq = nsecs_to_cputime64(this_cpu_read(cpu_softirq_time)) - cpustat[CPUTIME_SOFTIRQ];
+	si_jiffies = cputime_to_jiffies(softirq);
+	if (si_jiffies)
+		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return si_jiffies;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
@@ -345,18 +347,30 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 cputime = (__force u64) cputime_one_jiffy;
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long other;
 
-	if (steal_account_process_tick())
+	/*
+	 * Subtract steal, irq & softirq time (if any) from ticks.
+	 * These are counted in nanoseconds and not aligned with jiffies.
+	 * Several of these can cross a jiffy boundary in the same tick due
+	 * to rounding; be careful not to account more time than has elapsed.
+	 */
+	other = steal_account_process_tick();
+	if (other >= ticks)
+		return;
+
+	other += irqtime_account_hi_update();
+	if (other >= ticks)
+		return;
+
+	other += irqtime_account_si_update();
+	if (other >= ticks)
 		return;
 
 	cputime *= ticks;
 	scaled *= ticks;
 
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time does not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread
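
The jiffy-granularity rounding in irqtime_account_hi_update() above can be
sanity-checked with a small self-contained user-space C sketch. Everything
here is an illustrative assumption: the NSEC_PER_JIFFY value presumes
HZ=1000, and the plain variables stand in for the per-cpu counters that
the kernel converts with cputime_to_jiffies():

#include <stdio.h>

#define NSEC_PER_JIFFY 1000000ULL	/* assume HZ=1000 for illustration */

static unsigned long long hardirq_time_ns;	/* stands in for cpu_hardirq_time */
static unsigned long long cpustat_irq_ns;	/* stands in for cpustat[CPUTIME_IRQ] */

/*
 * Account whole jiffies of elapsed hardirq time; the sub-jiffy
 * remainder stays in the delta and carries over to the next tick.
 */
static unsigned long hi_update(void)
{
	unsigned long long delta = hardirq_time_ns - cpustat_irq_ns;
	unsigned long irq_jiffies = (unsigned long)(delta / NSEC_PER_JIFFY);

	if (irq_jiffies)
		cpustat_irq_ns += irq_jiffies * NSEC_PER_JIFFY;
	return irq_jiffies;
}

int main(void)
{
	/* 1.6 jiffies of hardirq time arrive before the first tick... */
	hardirq_time_ns += 1600000;
	printf("tick 1: %lu jiffy accounted as irq\n", hi_update());

	/*
	 * ...and 0.5 more before the second; the 0.6 remainder carried,
	 * so 1.1 jiffies are pending and one more jiffy is accounted.
	 */
	hardirq_time_ns += 500000;
	printf("tick 2: %lu jiffy accounted as irq\n", hi_update());
	return 0;
}

Run as-is, this prints one accounted jiffy on each tick. The old boolean
helpers could not express this carry-over: they either claimed a whole
tick period or nothing, which is how long idle stretches ended up
mis-accounted.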

end of thread, other threads:[~2016-08-10 21:26 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
2016-06-27 12:25   ` Frederic Weisbecker
2016-06-27 12:50     ` Rik van Riel
2016-06-28 20:56       ` Frederic Weisbecker
2016-06-23  2:25 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
2016-06-27 23:21   ` Frederic Weisbecker
2016-06-27 23:31     ` Rik van Riel
2016-06-27 23:51       ` Frederic Weisbecker
2016-06-23  2:25 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
2016-06-26  3:03   ` kbuild test robot
2016-06-23  2:25 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
2016-06-23  2:25 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  -- strict thread matches above, loose matches on Subject: below --
2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
2016-08-09  3:59   ` Wanpeng Li
2016-08-09 14:06     ` Rik van Riel
2016-08-09 23:07       ` Wanpeng Li
2016-08-10  7:51         ` Wanpeng Li
2016-08-09 23:25       ` Wanpeng Li
2016-08-09 23:31         ` Wanpeng Li
2016-08-09 23:35         ` Wanpeng Li
2016-08-09 23:39         ` Wanpeng Li
2016-08-10  5:07           ` Rik van Riel
2016-08-10  6:33             ` Wanpeng Li
2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
2016-06-16 16:22   ` kbuild test robot
2016-06-21 21:21   ` Peter Zijlstra
2016-06-21 22:20     ` Rik van Riel
2016-06-22 10:40     ` Paolo Bonzini
2016-06-22 10:52       ` Peter Zijlstra
2016-06-08  2:29 [PATCH RFC 0/5] sched,time: make irq time accounting work for nohz_idle riel
2016-06-08  2:30 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
