All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle
@ 2016-06-16 16:06 riel
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

Currently irq time accounting only works in these cases:
1) purely ticke based accounting
2) nohz_full accounting, but only on housekeeping & nohz_full CPUs
3) architectures with native vtime accounting

On nohz_idle CPUs, which are probably the majority nowadays,
irq time accounting is currently broken. This leads to systems
reporting a dramatically lower amount of irq & softirq time than
is actually spent handling them, with all the time spent while the
system is in the idle task being accounted as idle.

This patch set seems to bring the amount of irq time reported by
top (and /proc/stat) roughly in line with that measured when I do
a "perf record -g -a" run to see what is using all that time.

The amount of irq time used, especially softirq, is shockingly high,
to the point of me thinking this patch set may be wrong, but the
numbers seem to match what perf is giving me...

These patches apply on top of Wanpeng Li's steal time patches.

CONFIG_IRQ_TIME_ACCOUNTING is now a config option that is available
as a separate choice from tick based / nohz_idle / nohz_full mode,
a suggested by Frederic Weisbecker.

Next up: look at the things that are using CPU time on an otherwise
idle system, and see if I can make those a little faster :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:22   ` kbuild test robot
  2016-06-21 21:21   ` Peter Zijlstra
  2016-06-16 16:06 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 34+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of 'ticks' jiffies was
actually spent handling irqs, and could conceivably mis-count
all of the ticks during a period as irq time, when there was
some irq and some softirq time.

This can actually happen when irqtime_account_process_tick
is called from account_idle_ticks, which can pass a larger
number of ticks down all at once.

Fix this by changing irqtime_account_hi_update and
irqtime_account_si_update to round elapsed irq and softirq
time to jiffies, and return the number of jiffies spent in
each mode, similar to how steal time is handled.

Additionally, have irqtime_account_process_tick take into
account how much time was spent in each of steal, irq,
and softirq time.

The latter could help improve the accuracy of timekeeping
when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 69 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 48 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d76fdb..9bd2d4f42037 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,34 +79,36 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long irq_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 irq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
+	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);
+	if (irq_jiffies)
+		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return irq_jiffies;
 }
 
-static int irqtime_account_si_update(void)
+static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long si_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 softirq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
+	si_jiffies = min(cputime_to_jiffies(softirq), max_jiffies);
+	if (si_jiffies)
+		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return si_jiffies;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
@@ -283,6 +285,26 @@ static __always_inline unsigned long steal_account_process_tick(unsigned long ma
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ * Due to rounding errors, the calculated amount can sometimes exceed
+ * max_jiffies; be careful not to account more than max_jiffies.
+ */
+static inline int account_other_ticks(unsigned long max_jiffies)
+{
+	unsigned long accounted;
+
+	accounted = steal_account_process_tick(max_jiffies);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_hi_update(max_jiffies - accounted);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_si_update(max_jiffies - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -344,19 +366,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 {
 	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_ticks(ticks);
+	if (other >= ticks)
 		return;
+	ticks -= other;
 
 	cputime *= ticks;
 	scaled *= ticks;
 
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:06 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
appear to currently work right.

On CPUs that are nohz_full, people typically do not assign IRQs.
On the housekeeping CPU (when a system is booted up with nohz_full),
sampling should work ok to determine irq and softirq time use, but
that only covers the housekeeping CPU itself, not the other
non-nohz_full CPUs.

On CPUs that are nohz_idle (the typical way a distro kernel is
booted), irq time is not accounted at all while the CPU is idle,
due to the lack of timer ticks.

Remove the VTIME_GEN vtime irq time code. The next patch will
allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vtime.h  | 32 ++++++++++++++------------------
 kernel/sched/cputime.c | 26 +++++++++++++-------------
 2 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index fa2196990f84..d1977d84ebdf 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,6 +14,18 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+
+#ifdef __ARCH_HAS_VTIME_ACCOUNT
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+#else
+extern void vtime_common_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_enter(struct task_struct *tsk)
+{
+	if (vtime_accounting_cpu_enabled())
+		vtime_common_account_irq_enter(tsk);
+}
+#endif /* __ARCH_HAS_VTIME_ACCOUNT */
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -64,17 +76,6 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline void vtime_task_switch(struct task_struct *prev) { }
@@ -85,13 +86,8 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_gen_account_irq_exit(struct task_struct *tsk);
-
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_gen_account_irq_exit(tsk);
-}
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
 
 extern void vtime_user_enter(struct task_struct *tsk);
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9bd2d4f42037..261c9002348a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -115,6 +115,16 @@ static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
 
 #define sched_clock_irqtime	(0)
 
+static unsigned long irqtime_account_hi_update(unsigned long dummy)
+{
+	return 0;
+}
+
+static unsigned long irqtime_account_si_update(unsigned long dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -708,14 +718,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	unsigned long delta_jiffies, other_jiffies;
 
 	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	other_jiffies = account_other_ticks(delta_jiffies);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return jiffies_to_cputime(delta_jiffies - other_jiffies);
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -735,16 +745,6 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_gen_account_irq_exit(struct task_struct *tsk)
-{
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		__vtime_account_system(tsk);
-	if (context_tracking_in_user())
-		tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
-}
-
 void vtime_account_user(struct task_struct *tsk)
 {
 	cputime_t delta_cpu;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/5] cputime: allow irq time accounting to be selected as an option
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
  2016-06-16 16:06 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:06 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
  2016-06-16 16:06 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  4 siblings, 0 replies; 34+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Allow CONFIG_IRQ_TIME_ACCOUNTING to be selected as an option, on top
of CONFIG_VIRT_CPU_ACCOUNTING_GEN (and potentially others?).

This allows for the irq time accounting code to be used with nohz_idle
CPUs, which is how several distributions ship their kernels. Using the
same code for several timer modes also allows us to drop duplicate code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 init/Kconfig | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 0dfd09d54c65..4c7ee4f136cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -375,9 +375,11 @@ config VIRT_CPU_ACCOUNTING_GEN
 
 	  If unsure, say N.
 
+endchoice
+
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
-	depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
+	depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
 	help
 	  Select this option to enable fine granularity task irq time
 	  accounting. This is done by reading a timestamp on each
@@ -386,8 +388,6 @@ config IRQ_TIME_ACCOUNTING
 
 	  If in doubt, say N here.
 
-endchoice
-
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
 	depends on MULTIUSER
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
                   ` (2 preceding siblings ...)
  2016-06-16 16:06 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:06 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  4 siblings, 0 replies; 34+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter and documentation to irqtime_account_irq,
this can be used to distinguish between transitioning from process
context to hardirq time, and from process context to softirq time.

This is necessary to be able to remove the local_irq_disable from
irqtime_account_irq.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84ebdf..31e486d741b0 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 261c9002348a..07f847267e4e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
                   ` (3 preceding siblings ...)
  2016-06-16 16:06 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
@ 2016-06-16 16:06 ` riel
  2016-06-21 21:49   ` Peter Zijlstra
  4 siblings, 1 reply; 34+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Drop local_irq_save/restore from irqtime_account_irq.
Instead, have softirq and hardirq track their time spent
independently, with the softirq code subtracting hardirq
time that happened during the duration of the softirq run.

The softirq code can be interrupted by hardirq code at
any point in time, but it can check whether it got a
consistent snapshot of the timekeeping variables it wants,
and loop around in the unlikely case that it did not.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 63 ++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h   | 28 ++++++++++++++--------
 2 files changed, 70 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 07f847267e4e..ba18d51a091c 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -26,7 +26,9 @@
 DEFINE_PER_CPU(u64, cpu_hardirq_time);
 DEFINE_PER_CPU(u64, cpu_softirq_time);
 
-static DEFINE_PER_CPU(u64, irq_start_time);
+static DEFINE_PER_CPU(u64, hardirq_start_time);
+static DEFINE_PER_CPU(u64, softirq_start_time);
+static DEFINE_PER_CPU(u64, prev_hardirq_time);
 static int sched_clock_irqtime;
 
 void enable_sched_clock_irqtime(void)
@@ -41,6 +43,7 @@ void disable_sched_clock_irqtime(void)
 
 #ifndef CONFIG_64BIT
 DEFINE_PER_CPU(seqcount_t, irq_time_seq);
+DEFINE_PER_CPU(seqcount_t, softirq_time_seq);
 #endif /* CONFIG_64BIT */
 
 /*
@@ -53,36 +56,72 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  * softirq -> hardirq, hardirq -> softirq
  *
  * When exiting hardirq or softirq time, account the elapsed time.
+ *
+ * When exiting softirq time, subtract the amount of hardirq time that
+ * interrupted this softirq run, to avoid double accounting of that time.
  */
 void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
-	unsigned long flags;
-	s64 delta;
+	u64 prev_softirq_start;
+	u64 prev_hardirq;
+	u64 hardirq_time;
+	s64 delta = 0;
 	int cpu;
 
 	if (!sched_clock_irqtime)
 		return;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
-	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
-	__this_cpu_add(irq_start_time, delta);
+	/*
+	 * Softirq context may get interrupted by hardirq context,
+	 * on the same CPU. At softirq entry time the amount of time
+	 * spent in hardirq context is stored. At softirq exit time,
+	 * the time spent in hardirq context during the softirq is
+	 * subtracted.
+	 */
+	prev_hardirq = __this_cpu_read(prev_hardirq_time);
+	prev_softirq_start = __this_cpu_read(softirq_start_time);
+
+	if (irqtype == HARDIRQ_OFFSET) {
+		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
+		__this_cpu_add(hardirq_start_time, delta);
+	} else do {
+		u64 now = sched_clock_cpu(cpu);
+		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));
+
+		delta = now - prev_softirq_start;
+		if (in_serving_softirq()) {
+			/*
+			 * Leaving softirq context. Avoid double counting by
+			 * subtracting hardirq time from this interval.
+			 */
+			s64 hi_delta = hardirq_time - prev_hardirq;
+			delta -= hi_delta;
+		} else {
+			/* Entering softirq context. Note start times. */
+			__this_cpu_write(softirq_start_time, now);
+			__this_cpu_write(prev_hardirq_time, hardirq_time);
+		}
+		/*
+		 * If a hardirq happened during this calculation, it may not
+		 * have gotten a consistent snapshot. Try again.
+		 */
+	} while (hardirq_time != READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
 
-	irq_time_write_begin();
+	irq_time_write_begin(irqtype);
 	/*
 	 * We do not account for softirq time from ksoftirqd here.
 	 * We want to continue accounting softirq time to ksoftirqd thread
 	 * in that case, so as not to confuse scheduler with a special task
 	 * that do not consume any time, but still wants to run.
 	 */
-	if (hardirq_count())
+	if (irqtype == HARDIRQ_OFFSET && hardirq_count())
 		__this_cpu_add(cpu_hardirq_time, delta);
-	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+	else if (irqtype == SOFTIRQ_OFFSET && in_serving_softirq() &&
+				curr != this_cpu_ksoftirqd())
 		__this_cpu_add(cpu_softirq_time, delta);
 
-	irq_time_write_end();
-	local_irq_restore(flags);
+	irq_time_write_end(irqtype);
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d23527e..65d013447e0c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1752,38 +1752,48 @@ DECLARE_PER_CPU(u64, cpu_softirq_time);
 
 #ifndef CONFIG_64BIT
 DECLARE_PER_CPU(seqcount_t, irq_time_seq);
+DECLARE_PER_CPU(seqcount_t, softirq_time_seq);
 
-static inline void irq_time_write_begin(void)
+static inline void irq_time_write_begin(int irqtype)
 {
-	__this_cpu_inc(irq_time_seq.sequence);
+	if (irqtype == HARDIRQ_OFFSET)
+		__this_cpu_inc(irq_time_seq.sequence);
+	else
+		__this_cpu_inc(softirq_time_seq.sequence);
 	smp_wmb();
 }
 
-static inline void irq_time_write_end(void)
+static inline void irq_time_write_end(int irqtype)
 {
 	smp_wmb();
-	__this_cpu_inc(irq_time_seq.sequence);
+	if (irqtype == HARDIRQ_OFFSET)
+		__this_cpu_inc(irq_time_seq.sequence);
+	else
+		__this_cpu_inc(softirq_time_seq.sequence);
 }
 
 static inline u64 irq_time_read(int cpu)
 {
 	u64 irq_time;
-	unsigned seq;
+	unsigned hi_seq;
+	unsigned si_seq;
 
 	do {
-		seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		hi_seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		si_seq = read_seqcount_begin(&per_cpu(softirq_time_seq, cpu));
 		irq_time = per_cpu(cpu_softirq_time, cpu) +
 			   per_cpu(cpu_hardirq_time, cpu);
-	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), hi_seq) ||
+		 read_seqcount_retry(&per_cpu(softirq_time_seq, cpu), si_seq));
 
 	return irq_time;
 }
 #else /* CONFIG_64BIT */
-static inline void irq_time_write_begin(void)
+static inline void irq_time_write_begin(int irqtype)
 {
 }
 
-static inline void irq_time_write_end(void)
+static inline void irq_time_write_end(int irqtype)
 {
 }
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-16 16:22   ` kbuild test robot
  2016-06-21 21:21   ` Peter Zijlstra
  1 sibling, 0 replies; 34+ messages in thread
From: kbuild test robot @ 2016-06-16 16:22 UTC (permalink / raw)
  To: riel
  Cc: kbuild-all, linux-kernel, mingo, pbonzini, peterz, fweisbec,
	wanpeng.li, efault, tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 2010 bytes --]

Hi,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20160616]
[cannot apply to v4.7-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/sched-time-fix-irq-time-accounting-with-nohz_idle/20160617-001150
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

Note: the linux-review/riel-redhat-com/sched-time-fix-irq-time-accounting-with-nohz_idle/20160617-001150 HEAD cc827b6bc02ef052d79b5dbd2c355d9483a8ada7 builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

   kernel/sched/cputime.c: In function 'account_other_ticks':
>> kernel/sched/cputime.c:299:16: error: implicit declaration of function 'irqtime_account_hi_update' [-Werror=implicit-function-declaration]
      accounted += irqtime_account_hi_update(max_jiffies - accounted);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/sched/cputime.c:302:16: error: implicit declaration of function 'irqtime_account_si_update' [-Werror=implicit-function-declaration]
      accounted += irqtime_account_si_update(max_jiffies - accounted);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/irqtime_account_hi_update +299 kernel/sched/cputime.c

   293	{
   294		unsigned long accounted;
   295	
   296		accounted = steal_account_process_tick(max_jiffies);
   297	
   298		if (accounted < max_jiffies)
 > 299			accounted += irqtime_account_hi_update(max_jiffies - accounted);
   300	
   301		if (accounted < max_jiffies)
 > 302			accounted += irqtime_account_si_update(max_jiffies - accounted);
   303	
   304		return accounted;
   305	}

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 6335 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
  2016-06-16 16:22   ` kbuild test robot
@ 2016-06-21 21:21   ` Peter Zijlstra
  2016-06-21 22:20     ` Rik van Riel
  2016-06-22 10:40     ` Paolo Bonzini
  1 sibling, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2016-06-21 21:21 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Thu, Jun 16, 2016 at 12:06:03PM -0400, riel@redhat.com wrote:
> +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long irq_jiffies;
>  	unsigned long flags;
> +	u64 irq;
>  
>  	local_irq_save(flags);
> +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
> +	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);

cputime_to_jiffies is a division, could we not avoid that by doing
something like:

	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
	while (irq_jiffies > cputime_one_jiffy) {
		irq_jiffies -= cputime_one_jiffy;
		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
	}

assuming that the loop is 'rare' etc.. If not, only do the division on
that same > cputime_one_jiffy condition.

> +	if (irq_jiffies)
> +		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
>  	local_irq_restore(flags);
> +	return irq_jiffies;
>  }

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-16 16:06 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
@ 2016-06-21 21:49   ` Peter Zijlstra
  2016-06-21 22:23     ` Rik van Riel
  2016-06-22 21:55     ` Rik van Riel
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2016-06-21 21:49 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Thu, Jun 16, 2016 at 12:06:07PM -0400, riel@redhat.com wrote:
> @@ -53,36 +56,72 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
>   * softirq -> hardirq, hardirq -> softirq
>   *
>   * When exiting hardirq or softirq time, account the elapsed time.
> + *
> + * When exiting softirq time, subtract the amount of hardirq time that
> + * interrupted this softirq run, to avoid double accounting of that time.
>   */
>  void irqtime_account_irq(struct task_struct *curr, int irqtype)
>  {
> +	u64 prev_softirq_start;
> +	u64 prev_hardirq;
> +	u64 hardirq_time;
> +	s64 delta = 0;

We appear to always assign to delta, so this initialization seems
superfluous.

>  	int cpu;
>  
>  	if (!sched_clock_irqtime)
>  		return;
>  
>  	cpu = smp_processor_id();

Per this smp_processor_id() usage, preemption is disabled.

> +	/*
> +	 * Softirq context may get interrupted by hardirq context,
> +	 * on the same CPU. At softirq entry time the amount of time
> +	 * spent in hardirq context is stored. At softirq exit time,
> +	 * the time spent in hardirq context during the softirq is
> +	 * subtracted.
> +	 */
> +	prev_hardirq = __this_cpu_read(prev_hardirq_time);
> +	prev_softirq_start = __this_cpu_read(softirq_start_time);
> +
> +	if (irqtype == HARDIRQ_OFFSET) {
> +		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
> +		__this_cpu_add(hardirq_start_time, delta);
> +	} else do {
> +		u64 now = sched_clock_cpu(cpu);
> +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));

Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong with
__this_cpu_read() ?

> +
> +		delta = now - prev_softirq_start;
> +		if (in_serving_softirq()) {
> +			/*
> +			 * Leaving softirq context. Avoid double counting by
> +			 * subtracting hardirq time from this interval.
> +			 */
> +			s64 hi_delta = hardirq_time - prev_hardirq;
> +			delta -= hi_delta;
> +		} else {
> +			/* Entering softirq context. Note start times. */
> +			__this_cpu_write(softirq_start_time, now);
> +			__this_cpu_write(prev_hardirq_time, hardirq_time);
> +		}
> +		/*
> +		 * If a hardirq happened during this calculation, it may not
> +		 * have gotten a consistent snapshot. Try again.
> +		 */
> +	} while (hardirq_time != READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));

That whole thing is somewhat hard to read; but its far too late for me
to suggest anything more readable :/

> +	irq_time_write_begin(irqtype);
>  	/*
>  	 * We do not account for softirq time from ksoftirqd here.
>  	 * We want to continue accounting softirq time to ksoftirqd thread
>  	 * in that case, so as not to confuse scheduler with a special task
>  	 * that do not consume any time, but still wants to run.
>  	 */
> +	if (irqtype == HARDIRQ_OFFSET && hardirq_count())
>  		__this_cpu_add(cpu_hardirq_time, delta);
> +	else if (irqtype == SOFTIRQ_OFFSET && in_serving_softirq() &&
> +				curr != this_cpu_ksoftirqd())
>  		__this_cpu_add(cpu_softirq_time, delta);
>  
> +	irq_time_write_end(irqtype);

Maybe split the whole thing on irqtype at the very start, instead of the
endless repeated branches?

>  }
>  EXPORT_SYMBOL_GPL(irqtime_account_irq);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-21 21:21   ` Peter Zijlstra
@ 2016-06-21 22:20     ` Rik van Riel
  2016-06-22 10:40     ` Paolo Bonzini
  1 sibling, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2016-06-21 22:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 1244 bytes --]

On Tue, 2016-06-21 at 23:21 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 12:06:03PM -0400, riel@redhat.com wrote:
> > 
> > +static unsigned long irqtime_account_hi_update(unsigned long
> > max_jiffies)
> >  {
> >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long irq_jiffies;
> >  	unsigned long flags;
> > +	u64 irq;
> >  
> >  	local_irq_save(flags);
> > +	irq = this_cpu_read(cpu_hardirq_time) -
> > cpustat[CPUTIME_IRQ];
> > +	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);
> cputime_to_jiffies is a division, could we not avoid that by doing
> something like:
> 
> 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> 	while (irq_jiffies > cputime_one_jiffy) {
> 		irq_jiffies -= cputime_one_jiffy;
> 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> 	}
> 
> assuming that the loop is 'rare' etc.. If not, only do the division
> on
> that same > cputime_one_jiffy condition.

I suspect the loop is not rare on systems with nohz_idle,
where it may be quite a while before a timer tick happens
on an idle cpu.

I can certainly make sure the division is only done when
irq > 2*cputime_one_jiffy. I will do that in the next
version.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-21 21:49   ` Peter Zijlstra
@ 2016-06-21 22:23     ` Rik van Riel
  2016-06-21 22:28       ` Peter Zijlstra
  2016-06-22 21:55     ` Rik van Riel
  1 sibling, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2016-06-21 22:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 4092 bytes --]

On Tue, 2016-06-21 at 23:49 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 12:06:07PM -0400, riel@redhat.com wrote:
> > 
> > @@ -53,36 +56,72 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
> >   * softirq -> hardirq, hardirq -> softirq
> >   *
> >   * When exiting hardirq or softirq time, account the elapsed time.
> > + *
> > + * When exiting softirq time, subtract the amount of hardirq time
> > that
> > + * interrupted this softirq run, to avoid double accounting of
> > that time.
> >   */
> >  void irqtime_account_irq(struct task_struct *curr, int irqtype)
> >  {
> > +	u64 prev_softirq_start;
> > +	u64 prev_hardirq;
> > +	u64 hardirq_time;
> > +	s64 delta = 0;
> We appear to always assign to delta, so this initialization seems
> superfluous.
> 
> > 
> >  	int cpu;
> >  
> >  	if (!sched_clock_irqtime)
> >  		return;
> >  
> >  	cpu = smp_processor_id();
> Per this smp_processor_id() usage, preemption is disabled.

This code is called from the timer code. Surely preemption
is already disabled?

Should I change this into raw_smp_processor_id()?

> > 
> > +	/*
> > +	 * Softirq context may get interrupted by hardirq context,
> > +	 * on the same CPU. At softirq entry time the amount of
> > time
> > +	 * spent in hardirq context is stored. At softirq exit
> > time,
> > +	 * the time spent in hardirq context during the softirq is
> > +	 * subtracted.
> > +	 */
> > +	prev_hardirq = __this_cpu_read(prev_hardirq_time);
> > +	prev_softirq_start = __this_cpu_read(softirq_start_time);
> > +
> > +	if (irqtype == HARDIRQ_OFFSET) {
> > +		delta = sched_clock_cpu(cpu) -
> > __this_cpu_read(hardirq_start_time);
> > +		__this_cpu_add(hardirq_start_time, delta);
> > +	} else do {
> > +		u64 now = sched_clock_cpu(cpu);
> > +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time,
> > cpu));
> Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong
> with
> __this_cpu_read() ?

Is __this_cpu_read() as fast as per_cpu(,cpu) on all
architectures?

> > 
> > +
> > +		delta = now - prev_softirq_start;
> > +		if (in_serving_softirq()) {
> > +			/*
> > +			 * Leaving softirq context. Avoid double
> > counting by
> > +			 * subtracting hardirq time from this
> > interval.
> > +			 */
> > +			s64 hi_delta = hardirq_time -
> > prev_hardirq;
> > +			delta -= hi_delta;
> > +		} else {
> > +			/* Entering softirq context. Note start
> > times. */
> > +			__this_cpu_write(softirq_start_time, now);
> > +			__this_cpu_write(prev_hardirq_time,
> > hardirq_time);
> > +		}
> > +		/*
> > +		 * If a hardirq happened during this calculation,
> > it may not
> > +		 * have gotten a consistent snapshot. Try again.
> > +		 */
> > +	} while (hardirq_time !=
> > READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
> That whole thing is somewhat hard to read; but its far too late for
> me
> to suggest anything more readable :/

I only had 2 1/2 hours of sleep last night, so I will not
try to rewrite it now, but I will see if there is anything
I can do to make it more readable tomorrow.

If you have any ideas before then, please let me know :)

> > 
> > +	irq_time_write_begin(irqtype);
> >  	/*
> >  	 * We do not account for softirq time from ksoftirqd here.
> >  	 * We want to continue accounting softirq time to
> > ksoftirqd thread
> >  	 * in that case, so as not to confuse scheduler with a
> > special task
> >  	 * that do not consume any time, but still wants to run.
> >  	 */
> > +	if (irqtype == HARDIRQ_OFFSET && hardirq_count())
> >  		__this_cpu_add(cpu_hardirq_time, delta);
> > +	else if (irqtype == SOFTIRQ_OFFSET && in_serving_softirq()
> > &&
> > +				curr != this_cpu_ksoftirqd())
> >  		__this_cpu_add(cpu_softirq_time, delta);
> >  
> > +	irq_time_write_end(irqtype);
> Maybe split the whole thing on irqtype at the very start, instead of
> the
> endless repeated branches?

Let me try if I can make things more readable that way.

Thanks for the review!

Rik
-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-21 22:23     ` Rik van Riel
@ 2016-06-21 22:28       ` Peter Zijlstra
  2016-06-21 22:32         ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2016-06-21 22:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Tue, Jun 21, 2016 at 06:23:34PM -0400, Rik van Riel wrote:
> > >  	cpu = smp_processor_id();
> > Per this smp_processor_id() usage, preemption is disabled.
> 
> This code is called from the timer code. Surely preemption
> is already disabled?

That's what I said.

> > > 
> > > +	/*
> > > +	 * Softirq context may get interrupted by hardirq context,
> > > +	 * on the same CPU. At softirq entry time the amount of
> > > time
> > > +	 * spent in hardirq context is stored. At softirq exit
> > > time,
> > > +	 * the time spent in hardirq context during the softirq is
> > > +	 * subtracted.
> > > +	 */
> > > +	prev_hardirq = __this_cpu_read(prev_hardirq_time);
> > > +	prev_softirq_start = __this_cpu_read(softirq_start_time);
> > > +
> > > +	if (irqtype == HARDIRQ_OFFSET) {
> > > +		delta = sched_clock_cpu(cpu) -
> > > __this_cpu_read(hardirq_start_time);
> > > +		__this_cpu_add(hardirq_start_time, delta);
> > > +	} else do {
> > > +		u64 now = sched_clock_cpu(cpu);
> > > +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time,
> > > cpu));
> > Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong
> > with
> > __this_cpu_read() ?
> 
> Is __this_cpu_read() as fast as per_cpu(,cpu) on all
> architectures?

Can't be slower. Don't get the argument though; you've used __this_cpu
stuff all over the place, and here you use a per_cpu() for no reason.

> > > 
> > > +
> > > +		delta = now - prev_softirq_start;
> > > +		if (in_serving_softirq()) {
> > > +			/*
> > > +			 * Leaving softirq context. Avoid double
> > > counting by
> > > +			 * subtracting hardirq time from this
> > > interval.
> > > +			 */
> > > +			s64 hi_delta = hardirq_time -
> > > prev_hardirq;
> > > +			delta -= hi_delta;
> > > +		} else {
> > > +			/* Entering softirq context. Note start
> > > times. */
> > > +			__this_cpu_write(softirq_start_time, now);
> > > +			__this_cpu_write(prev_hardirq_time,
> > > hardirq_time);
> > > +		}
> > > +		/*
> > > +		 * If a hardirq happened during this calculation,
> > > it may not
> > > +		 * have gotten a consistent snapshot. Try again.
> > > +		 */
> > > +	} while (hardirq_time !=
> > > READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
> > That whole thing is somewhat hard to read; but its far too late for
> > me
> > to suggest anything more readable :/
> 
> I only had 2 1/2 hours of sleep last night, so I will not
> try to rewrite it now, but I will see if there is anything
> I can do to make it more readable tomorrow.
> 
> If you have any ideas before then, please let me know :)

Heh, step away from the computer ... ;-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-21 22:28       ` Peter Zijlstra
@ 2016-06-21 22:32         ` Rik van Riel
  0 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2016-06-21 22:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]

On Wed, 2016-06-22 at 00:28 +0200, Peter Zijlstra wrote:
> On Tue, Jun 21, 2016 at 06:23:34PM -0400, Rik van Riel wrote:
> > 
> > > > +	/*
> > > > +	 * Softirq context may get interrupted by hardirq
> > > > context,
> > > > +	 * on the same CPU. At softirq entry time the amount
> > > > of
> > > > time
> > > > +	 * spent in hardirq context is stored. At softirq exit
> > > > time,
> > > > +	 * the time spent in hardirq context during the
> > > > softirq is
> > > > +	 * subtracted.
> > > > +	 */
> > > > +	prev_hardirq = __this_cpu_read(prev_hardirq_time);
> > > > +	prev_softirq_start =
> > > > __this_cpu_read(softirq_start_time);
> > > > +
> > > > +	if (irqtype == HARDIRQ_OFFSET) {
> > > > +		delta = sched_clock_cpu(cpu) -
> > > > __this_cpu_read(hardirq_start_time);
> > > > +		__this_cpu_add(hardirq_start_time, delta);
> > > > +	} else do {
> > > > +		u64 now = sched_clock_cpu(cpu);
> > > > +		hardirq_time =
> > > > READ_ONCE(per_cpu(cpu_hardirq_time,
> > > > cpu));
> > > Which makes this per_cpu(,cpu) usage somewhat curious. What's
> > > wrong
> > > with
> > > __this_cpu_read() ?
> > Is __this_cpu_read() as fast as per_cpu(,cpu) on all
> > architectures?
> Can't be slower. Don't get the argument though; you've used
> __this_cpu
> stuff all over the place, and here you use a per_cpu() for no reason.
> 
Good point. I will use __this_cpu_read here.

> > > That whole thing is somewhat hard to read; but its far too late
> > > for
> > > me
> > > to suggest anything more readable :/
> > I only had 2 1/2 hours of sleep last night, so I will not
> > try to rewrite it now, but I will see if there is anything
> > I can do to make it more readable tomorrow.
> > 
> > If you have any ideas before then, please let me know :)
> Heh, step away from the computer ... ;-)

No worries, I have booze with me. Everything will be
just fine! ;)

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-21 21:21   ` Peter Zijlstra
  2016-06-21 22:20     ` Rik van Riel
@ 2016-06-22 10:40     ` Paolo Bonzini
  2016-06-22 10:52       ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2016-06-22 10:40 UTC (permalink / raw)
  To: Peter Zijlstra, riel
  Cc: linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar



On 21/06/2016 23:21, Peter Zijlstra wrote:
> cputime_to_jiffies is a division, could we not avoid that by doing
> something like:
> 
> 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> 	while (irq_jiffies > cputime_one_jiffy) {
> 		irq_jiffies -= cputime_one_jiffy;
> 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> 	}
> 
> assuming that the loop is 'rare' etc.. If not, only do the division on
> that same > cputime_one_jiffy condition.

It's a division by a constant, it ought to become a multiplication.  For
64-bit it will, and context tracking is only enabled for 64-bit.

BTW, for 32-bit there's a monster of a macro to turn do_div with
constant divisor into multiplications in include/asm-generic/div64.h.
However, x86-32 doesn't use it.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-22 10:40     ` Paolo Bonzini
@ 2016-06-22 10:52       ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2016-06-22 10:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: riel, linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar

On Wed, Jun 22, 2016 at 12:40:31PM +0200, Paolo Bonzini wrote:
> 
> 
> On 21/06/2016 23:21, Peter Zijlstra wrote:
> > cputime_to_jiffies is a division, could we not avoid that by doing
> > something like:
> > 
> > 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> > 	while (irq_jiffies > cputime_one_jiffy) {
> > 		irq_jiffies -= cputime_one_jiffy;
> > 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> > 	}
> > 
> > assuming that the loop is 'rare' etc.. If not, only do the division on
> > that same > cputime_one_jiffy condition.
> 
> It's a division by a constant, it ought to become a multiplication.  For
> 64-bit it will, and context tracking is only enabled for 64-bit.

Right, not enabled on i386, however plenty 32bit archs (including ARM)
do have it enabled.

> BTW, for 32-bit there's a monster of a macro to turn do_div with
> constant divisor into multiplications in include/asm-generic/div64.h.
> However, x86-32 doesn't use it.

Right, but some ARM chips don't exactly have a fast multiplier either,
so avoiding even the 64bit mult (which ends up being 3 or so actual
mult instructions) is worthwhile I'd say.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-21 21:49   ` Peter Zijlstra
  2016-06-21 22:23     ` Rik van Riel
@ 2016-06-22 21:55     ` Rik van Riel
  2016-06-23 13:52       ` Paolo Bonzini
  1 sibling, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2016-06-22 21:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 1918 bytes --]

On Tue, 2016-06-21 at 23:49 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 12:06:07PM -0400, riel@redhat.com wrote:
> > 
> > @@ -53,36 +56,72 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
> >   * softirq -> hardirq, hardirq -> softirq
> >   *
> >   * When exiting hardirq or softirq time, account the elapsed time.
> > + *
> > + * When exiting softirq time, subtract the amount of hardirq time
> > that
> > + * interrupted this softirq run, to avoid double accounting of
> > that time.
> >   */
> >  void irqtime_account_irq(struct task_struct *curr, int irqtype)
> >  {
> > +	u64 prev_softirq_start;
> > +	u64 prev_hardirq;
> > +	u64 hardirq_time;
> > +	s64 delta = 0;
> We appear to always assign to delta, so this initialization seems
> superfluous.

It gets rid of a compiler warning, since gcc is not
smart enough to know that the result of in_softirq()
will be the same throughout the function.

Using a bool leaving_softirq = in_softirq() also
gets rid of the warning, and makes the function a
little more readable, so I am doing that.

> > +	if (irqtype == HARDIRQ_OFFSET) {
> > +		delta = sched_clock_cpu(cpu) -
> > __this_cpu_read(hardirq_start_time);
> > +		__this_cpu_add(hardirq_start_time, delta);
> > +	} else do {
> > +		u64 now = sched_clock_cpu(cpu);
> > +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time,
> > cpu));
> Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong
> with
> __this_cpu_read() ?

I played around with it a bit, and it seems that
__this_cpu_read does not want to nest inside
READ_ONCE.  Nobody else seems to be doing that,
either.

Back to READ_ONCE(per_cpu(,cpu)) it is...

> Maybe split the whole thing on irqtype at the very start, instead of
> the
> endless repeated branches?

I untangled the whole thing in the next version,
which I will post after testing.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-22 21:55     ` Rik van Riel
@ 2016-06-23 13:52       ` Paolo Bonzini
  2016-06-23 15:24         ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2016-06-23 13:52 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra
  Cc: linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar



On 22/06/2016 23:55, Rik van Riel wrote:
> > > +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time,
> > > cpu));
> > Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong
> > with __this_cpu_read() ?
>
> I played around with it a bit, and it seems that
> __this_cpu_read does not want to nest inside
> READ_ONCE.  Nobody else seems to be doing that,
> either.

According to arch/x86/include/asm/percpu.h, this_cpu_read always has 
READ_ONCE semantics, but I cannot find that in include/asm-generic
/percpu.h.  It probably just works because of all the layers of goo, but
something like this (101% untested) would make me feel safer:

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 4d9f233c4ba8..d057568f1926 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -67,7 +67,7 @@ extern void setup_per_cpu_areas(void);
 
 #define raw_cpu_generic_to_op(pcp, val, op)				\
 do {									\
-	*raw_cpu_ptr(&(pcp)) op val;					\
+	ACCESS_ONCE(*raw_cpu_ptr(&(pcp))) op val;			\
 } while (0)
 
 #define raw_cpu_generic_add_return(pcp, val)				\
@@ -109,7 +109,7 @@ do {
 ({									\
 	typeof(pcp) __ret;						\
 	preempt_disable();						\
-	__ret = *this_cpu_ptr(&(pcp));					\
+	__ret = READ_ONCE(this_cpu_ptr(&(pcp)));			\
 	preempt_enable();						\
 	__ret;								\
 })
@@ -118,7 +118,7 @@ do {
 do {									\
 	unsigned long __flags;						\
 	raw_local_irq_save(__flags);					\
-	*raw_cpu_ptr(&(pcp)) op val;					\
+	ACCESS_ONCE(*raw_cpu_ptr(&(pcp))) op val;			\
 	raw_local_irq_restore(__flags);					\
 } while (0)
 
@@ -168,16 +168,16 @@ do {
 })
 
 #ifndef raw_cpu_read_1
-#define raw_cpu_read_1(pcp)		(*raw_cpu_ptr(&(pcp)))
+#define raw_cpu_read_1(pcp)		READ_ONCE(raw_cpu_ptr(&(pcp)))
 #endif
 #ifndef raw_cpu_read_2
-#define raw_cpu_read_2(pcp)		(*raw_cpu_ptr(&(pcp)))
+#define raw_cpu_read_2(pcp)		READ_ONCE(raw_cpu_ptr(&(pcp)))
 #endif
 #ifndef raw_cpu_read_4
-#define raw_cpu_read_4(pcp)		(*raw_cpu_ptr(&(pcp)))
+#define raw_cpu_read_4(pcp)		READ_ONCE(raw_cpu_ptr(&(pcp)))
 #endif
 #ifndef raw_cpu_read_8
-#define raw_cpu_read_8(pcp)		(*raw_cpu_ptr(&(pcp)))
+#define raw_cpu_read_8(pcp)		READ_ONCE(raw_cpu_ptr(&(pcp)))
 #endif
 
 #ifndef raw_cpu_write_1


> Back to READ_ONCE(per_cpu(,cpu)) it is...

What about READ_ONCE(this_cpu_ptr())?

Paolo

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-23 13:52       ` Paolo Bonzini
@ 2016-06-23 15:24         ` Rik van Riel
  0 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2016-06-23 15:24 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Zijlstra
  Cc: linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 2947 bytes --]

On Thu, 2016-06-23 at 15:52 +0200, Paolo Bonzini wrote:
> 
> On 22/06/2016 23:55, Rik van Riel wrote:
> > 
> > > 
> > > > 
> > > > +		hardirq_time =
> > > > READ_ONCE(per_cpu(cpu_hardirq_time,
> > > > cpu));
> > > Which makes this per_cpu(,cpu) usage somewhat curious. What's
> > > wrong
> > > with __this_cpu_read() ?
> > I played around with it a bit, and it seems that
> > __this_cpu_read does not want to nest inside
> > READ_ONCE.  Nobody else seems to be doing that,
> > either.
> According to arch/x86/include/asm/percpu.h, this_cpu_read always has 
> READ_ONCE semantics, but I cannot find that in include/asm-generic
> /percpu.h.  It probably just works because of all the layers of goo,
> but
> something like this (101% untested) would make me feel safer:
> 
Are READ_ONCE semantics desired for every
per-cpu read?

> > Back to READ_ONCE(per_cpu(,cpu)) it is...
> What about READ_ONCE(this_cpu_ptr())?

I tried that yesterday, because it looked like
it would work. Unfortunately, it does not.

  CC      kernel/sched/cputime.o
kernel/sched/cputime.c: In function ‘irqtime_account_irq’:
kernel/sched/cputime.c:107:71: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:298: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c:107:428: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:655: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c:107:749: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:976: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c:107:1087: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:1314: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c:107:1408: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:1635: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c: In function ‘irqtime_account_hi_update’:
kernel/sched/cputime.c:145:175: warning: comparison of distinct pointer
types lacks a cast
kernel/sched/cputime.c: In function ‘irqtime_account_si_update’:
kernel/sched/cputime.c:162:182: warning: comparison of distinct pointer
types lacks a cast
scripts/Makefile.build:291: recipe for target 'kernel/sched/cputime.o'
failed
make[1]: *** [kernel/sched/cputime.o] Error 1
Makefile:1594: recipe for target 'kernel/sched/' failed
make: *** [kernel/sched/] Error 2

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:07       ` Wanpeng Li
@ 2016-08-10  7:51         ` Wanpeng Li
  0 siblings, 0 replies; 34+ messages in thread
From: Wanpeng Li @ 2016-08-10  7:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:07 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during
>>> > a period as irq time, when there was some irq and some softirq
>>> > time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called
>>> > from account_idle_ticks, which can pass a larger number of ticks
>>> > down
>>> > all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update,
>>> > and steal_account_process_ticks to work with cputime_t time units,
>>> > and
>>> > return the amount of time spent in each mode.
>>>
>>> Do we need to minus st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I try to add this logic
>>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>>> however, there is no difference, where I miss?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> I will try nohz idle kvm guest and other more tests, a patch will be
> sent out once successful.

After apply the same logic to account_idle_ticks() for nohz idle kvm
guest(noirqtime, idle=poll, one pCPU and four vCPUs), the average idle
drop from 56.8% to 54.75%, I think it makes sense to make a formal
patch.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-10  5:07           ` Rik van Riel
@ 2016-08-10  6:33             ` Wanpeng Li
  0 siblings, 0 replies; 34+ messages in thread
From: Wanpeng Li @ 2016-08-10  6:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 13:07 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Wed, 2016-08-10 at 07:39 +0800, Wanpeng Li wrote:
>> 2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>> > 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > > On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> > > > Hi Rik,
>> > > > 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.
>> > > > com>:
>> > > > > From: Rik van Riel <riel@redhat.com>
>> > > > >
>> > > > > Currently, if there was any irq or softirq time during
>> > > > > 'ticks'
>> > > > > jiffies, the entire period will be accounted as irq or
>> > > > > softirq
>> > > > > time.
>> > > > >
>> > > > > This is inaccurate if only a subset of the time was actually
>> > > > > spent
>> > > > > handling irqs, and could conceivably mis-count all of the
>> > > > > ticks
>> > > > > during
>> > > > > a period as irq time, when there was some irq and some
>> > > > > softirq
>> > > > > time.
>> > > > >
>> > > > > This can actually happen when irqtime_account_process_tick is
>> > > > > called
>> > > > > from account_idle_ticks, which can pass a larger number of
>> > > > > ticks
>> > > > > down
>> > > > > all at once.
>> > > > >
>> > > > > Fix this by changing irqtime_account_hi_update,
>> > > > > irqtime_account_si_update,
>> > > > > and steal_account_process_ticks to work with cputime_t time
>> > > > > units,
>> > > > > and
>> > > > > return the amount of time spent in each mode.
>> > > >
>> > > > Do we need to minus st cputime from idle cputime in
>> > > > account_idle_ticks() when noirqtime is true? I try to add this
>> > > > logic
>> > > > w/ noirqtime and idle=poll boot parameter for a full dynticks
>> > > > guest,
>> > > > however, there is no difference, where I miss?
>> > >
>> > > Yes, you are right. The code in account_idle_ticks()
>> > > could use the same treatment.
>> > >
>> > > I am not sure why it would not work, though...
>> >
>> > Actually I observed a regression caused by this patch. I use a i5
>>
>> The regression is caused by your commit "sched,time: Count actually
>> elapsed irq & softirq time".
>
> Wanpeng and I discussed this issue, and discovered
> that this bug is triggered by my patch, specifically
> this bit:
>
> -       if (steal_account_process_tick(ULONG_MAX))
> +       other = account_other_time(cputime);
> +       if (other >= cputime)
>                 return;
>
> Replacing "cputime" with "ULONG_MAX" as the argument
> to account_other_time makes the bug disappear.
>
> However, this is not the cause of the bug.
>
> The cause of the bug appears to be that the values
> used to figure out how much steal time has passed
> are never initialized.
>
>                 steal = paravirt_steal_clock(smp_processor_id());
>                 steal -= this_rq()->prev_steal_time;
>
> The first of the two may be initialized by the host
> (I did not verify that), but the second one does not
> have any explicit initializers anywhere in the kernel
> tree.
>
> This can lead to an arbitrarily large difference between
> paravirt_steal_clock(smp_processor_id()) and
> this_rq()->prev_steal_time, which results in nothing but
> steal time getting accounted for a potentially a very
> long amount of time.
>
> Previously we carried this patch to initialize the
> various rq->prev_* values at CPU hotplug time:
>
> https://patchwork.codeaurora.org/patch/27699/
>
> Which got reverted by Paolo here:
>
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sche
> d/core&id=3d89e5478bf550a50c99e93adf659369798263b0
>
> Which leads me to this question:
>
> Paulo, how would you like us to fix this bug?
>
> It seems like the host and guest steal time CAN get out
> of sync, sometimes by a ridiculous amount, and we need
> some way to get the excessive difference out of the way,
> without it getting accounted as steal time (not immediately,
> and not over the next 17 hours, or months).

I can be the volunteer to fix it if you guys have an idea. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:39         ` Wanpeng Li
@ 2016-08-10  5:07           ` Rik van Riel
  2016-08-10  6:33             ` Wanpeng Li
  0 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2016-08-10  5:07 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 3911 bytes --]

On Wed, 2016-08-10 at 07:39 +0800, Wanpeng Li wrote:
> 2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> > 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > > On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
> > > > Hi Rik,
> > > > 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.
> > > > com>:
> > > > > From: Rik van Riel <riel@redhat.com>
> > > > > 
> > > > > Currently, if there was any irq or softirq time during
> > > > > 'ticks'
> > > > > jiffies, the entire period will be accounted as irq or
> > > > > softirq
> > > > > time.
> > > > > 
> > > > > This is inaccurate if only a subset of the time was actually
> > > > > spent
> > > > > handling irqs, and could conceivably mis-count all of the
> > > > > ticks
> > > > > during
> > > > > a period as irq time, when there was some irq and some
> > > > > softirq
> > > > > time.
> > > > > 
> > > > > This can actually happen when irqtime_account_process_tick is
> > > > > called
> > > > > from account_idle_ticks, which can pass a larger number of
> > > > > ticks
> > > > > down
> > > > > all at once.
> > > > > 
> > > > > Fix this by changing irqtime_account_hi_update,
> > > > > irqtime_account_si_update,
> > > > > and steal_account_process_ticks to work with cputime_t time
> > > > > units,
> > > > > and
> > > > > return the amount of time spent in each mode.
> > > > 
> > > > Do we need to minus st cputime from idle cputime in
> > > > account_idle_ticks() when noirqtime is true? I try to add this
> > > > logic
> > > > w/ noirqtime and idle=poll boot parameter for a full dynticks
> > > > guest,
> > > > however, there is no difference, where I miss?
> > > 
> > > Yes, you are right. The code in account_idle_ticks()
> > > could use the same treatment.
> > > 
> > > I am not sure why it would not work, though...
> > 
> > Actually I observed a regression caused by this patch. I use a i5
> 
> The regression is caused by your commit "sched,time: Count actually
> elapsed irq & softirq time".

Wanpeng and I discussed this issue, and discovered
that this bug is triggered by my patch, specifically
this bit:

-       if (steal_account_process_tick(ULONG_MAX))
+       other = account_other_time(cputime);
+       if (other >= cputime)
                return;

Replacing "cputime" with "ULONG_MAX" as the argument
to account_other_time makes the bug disappear.

However, this is not the cause of the bug.

The cause of the bug appears to be that the values
used to figure out how much steal time has passed
are never initialized.

                steal = paravirt_steal_clock(smp_processor_id());
                steal -= this_rq()->prev_steal_time;

The first of the two may be initialized by the host
(I did not verify that), but the second one does not
have any explicit initializers anywhere in the kernel
tree.

This can lead to an arbitrarily large difference between
paravirt_steal_clock(smp_processor_id()) and
this_rq()->prev_steal_time, which results in nothing but
steal time getting accounted for a potentially a very
long amount of time.

Previously we carried this patch to initialize the
various rq->prev_* values at CPU hotplug time:

https://patchwork.codeaurora.org/patch/27699/

Which got reverted by Paolo here:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sche
d/core&id=3d89e5478bf550a50c99e93adf659369798263b0

Which leads me to this question:

Paulo, how would you like us to fix this bug?

It seems like the host and guest steal time CAN get out
of sync, sometimes by a ridiculous amount, and we need
some way to get the excessive difference out of the way,
without it getting accounted as steal time (not immediately,
and not over the next 17 hours, or months).

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
  2016-08-09 23:35         ` Wanpeng Li
@ 2016-08-09 23:39         ` Wanpeng Li
  2016-08-10  5:07           ` Rik van Riel
  2 siblings, 1 reply; 34+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during
>>> > a period as irq time, when there was some irq and some softirq
>>> > time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called
>>> > from account_idle_ticks, which can pass a larger number of ticks
>>> > down
>>> > all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update,
>>> > and steal_account_process_ticks to work with cputime_t time units,
>>> > and
>>> > return the amount of time spent in each mode.
>>>
>>> Do we need to minus st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I try to add this logic
>>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>>> however, there is no difference, where I miss?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use a i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
> cpu hog processes(for loop) running in the guest, I hot-unplug the
> pCPUs on host one by one until there is only one left, then observe
> the top in guest, there are 100% st for cpu0(housekeeping), and 75% st
> for other cpus(nohz full). However, w/o this patch, 75% for all the
> four cpus.
>
> I try to figure out this recently, any tip is a great appreciated. :)
>
> Regards,
> Wapeng Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
@ 2016-08-09 23:35         ` Wanpeng Li
  2016-08-09 23:39         ` Wanpeng Li
  2 siblings, 0 replies; 34+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during
>>> > a period as irq time, when there was some irq and some softirq
>>> > time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called
>>> > from account_idle_ticks, which can pass a larger number of ticks
>>> > down
>>> > all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update,
>>> > and steal_account_process_ticks to work with cputime_t time units,
>>> > and
>>> > return the amount of time spent in each mode.
>>>
>>> Do we need to minus st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I try to add this logic
>>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>>> however, there is no difference, where I miss?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use a i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
> cpu hog processes(for loop) running in the guest, I hot-unplug the
> pCPUs on host one by one until there is only one left, then observe
> the top in guest, there are 100% st for cpu0(housekeeping), and 75% st
> for other cpus(nohz full). However, w/o this patch, 75% for all the
> four cpus.
>
> I try to figure out this recently, any tip is a great appreciated. :)
>
> Regards,
> Wapeng Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
@ 2016-08-09 23:31         ` Wanpeng Li
  2016-08-09 23:35         ` Wanpeng Li
  2016-08-09 23:39         ` Wanpeng Li
  2 siblings, 0 replies; 34+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during
>>> > a period as irq time, when there was some irq and some softirq
>>> > time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called
>>> > from account_idle_ticks, which can pass a larger number of ticks
>>> > down
>>> > all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update,
>>> > and steal_account_process_ticks to work with cputime_t time units,
>>> > and
>>> > return the amount of time spent in each mode.
>>>
>>> Do we need to minus st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I try to add this logic
>>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>>> however, there is no difference, where I miss?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use a i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
> cpu hog processes(for loop) running in the guest, I hot-unplug the
> pCPUs on host one by one until there is only one left, then observe
> the top in guest, there are 100% st for cpu0(housekeeping), and 75% st
> for other cpus(nohz full). However, w/o this patch, 75% for all the
> four cpus.
>
> I try to figure out this recently, any tip is a great appreciated. :)
>
> Regards,
> Wapeng Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 14:06     ` Rik van Riel
  2016-08-09 23:07       ` Wanpeng Li
@ 2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
                           ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> Hi Rik,
>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>> > From: Rik van Riel <riel@redhat.com>
>> >
>> > Currently, if there was any irq or softirq time during 'ticks'
>> > jiffies, the entire period will be accounted as irq or softirq
>> > time.
>> >
>> > This is inaccurate if only a subset of the time was actually spent
>> > handling irqs, and could conceivably mis-count all of the ticks
>> > during
>> > a period as irq time, when there was some irq and some softirq
>> > time.
>> >
>> > This can actually happen when irqtime_account_process_tick is
>> > called
>> > from account_idle_ticks, which can pass a larger number of ticks
>> > down
>> > all at once.
>> >
>> > Fix this by changing irqtime_account_hi_update,
>> > irqtime_account_si_update,
>> > and steal_account_process_ticks to work with cputime_t time units,
>> > and
>> > return the amount of time spent in each mode.
>>
>> Do we need to minus st cputime from idle cputime in
>> account_idle_ticks() when noirqtime is true? I try to add this logic
>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>> however, there is no difference, where I miss?
>
> Yes, you are right. The code in account_idle_ticks()
> could use the same treatment.
>
> I am not sure why it would not work, though...

Actually I observed a regression caused by this patch. I use a i5
laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
cpu hog processes(for loop) running in the guest, I hot-unplug the
pCPUs on host one by one until there is only one left, then observe
the top in guest, there are 100% st for cpu0(housekeeping), and 75% st
for other cpus(nohz full). However, w/o this patch, 75% for all the
four cpus.

I try to figure out this recently, any tip is a great appreciated. :)

Regards,
Wapeng Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 14:06     ` Rik van Riel
@ 2016-08-09 23:07       ` Wanpeng Li
  2016-08-10  7:51         ` Wanpeng Li
  2016-08-09 23:25       ` Wanpeng Li
  1 sibling, 1 reply; 34+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> Hi Rik,
>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>> > From: Rik van Riel <riel@redhat.com>
>> >
>> > Currently, if there was any irq or softirq time during 'ticks'
>> > jiffies, the entire period will be accounted as irq or softirq
>> > time.
>> >
>> > This is inaccurate if only a subset of the time was actually spent
>> > handling irqs, and could conceivably mis-count all of the ticks
>> > during
>> > a period as irq time, when there was some irq and some softirq
>> > time.
>> >
>> > This can actually happen when irqtime_account_process_tick is
>> > called
>> > from account_idle_ticks, which can pass a larger number of ticks
>> > down
>> > all at once.
>> >
>> > Fix this by changing irqtime_account_hi_update,
>> > irqtime_account_si_update,
>> > and steal_account_process_ticks to work with cputime_t time units,
>> > and
>> > return the amount of time spent in each mode.
>>
>> Do we need to minus st cputime from idle cputime in
>> account_idle_ticks() when noirqtime is true? I try to add this logic
>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>> however, there is no difference, where I miss?
>
> Yes, you are right. The code in account_idle_ticks()
> could use the same treatment.
>
> I am not sure why it would not work, though...

I will try nohz idle kvm guest and other more tests, a patch will be
sent out once successful.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09  3:59   ` Wanpeng Li
@ 2016-08-09 14:06     ` Rik van Riel
  2016-08-09 23:07       ` Wanpeng Li
  2016-08-09 23:25       ` Wanpeng Li
  0 siblings, 2 replies; 34+ messages in thread
From: Rik van Riel @ 2016-08-09 14:06 UTC (permalink / raw)
  To: Wanpeng Li, Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Paolo Bonzini, Peter Zijlstra, Wanpeng Li,
	Thomas Gleixner, Radim Krcmar, Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 1395 bytes --]

On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
> Hi Rik,
> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Currently, if there was any irq or softirq time during 'ticks'
> > jiffies, the entire period will be accounted as irq or softirq
> > time.
> > 
> > This is inaccurate if only a subset of the time was actually spent
> > handling irqs, and could conceivably mis-count all of the ticks
> > during
> > a period as irq time, when there was some irq and some softirq
> > time.
> > 
> > This can actually happen when irqtime_account_process_tick is
> > called
> > from account_idle_ticks, which can pass a larger number of ticks
> > down
> > all at once.
> > 
> > Fix this by changing irqtime_account_hi_update,
> > irqtime_account_si_update,
> > and steal_account_process_ticks to work with cputime_t time units,
> > and
> > return the amount of time spent in each mode.
> 
> Do we need to minus st cputime from idle cputime in
> account_idle_ticks() when noirqtime is true? I try to add this logic
> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
> however, there is no difference, where I miss?

Yes, you are right. The code in account_idle_ticks()
could use the same treatment.

I am not sure why it would not work, though...

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
@ 2016-08-09  3:59   ` Wanpeng Li
  2016-08-09 14:06     ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Wanpeng Li @ 2016-08-09  3:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Rik van Riel, Paolo Bonzini, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith

Hi Rik,
2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> From: Rik van Riel <riel@redhat.com>
>
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period will be accounted as irq or softirq
> time.
>
> This is inaccurate if only a subset of the time was actually spent
> handling irqs, and could conceivably mis-count all of the ticks during
> a period as irq time, when there was some irq and some softirq time.
>
> This can actually happen when irqtime_account_process_tick is called
> from account_idle_ticks, which can pass a larger number of ticks down
> all at once.
>
> Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
> and steal_account_process_ticks to work with cputime_t time units, and
> return the amount of time spent in each mode.

Do we need to minus st cputime from idle cputime in
account_idle_ticks() when noirqtime is true? I try to add this logic
w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
however, there is no difference, where I miss?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
@ 2016-07-13 14:50 ` Frederic Weisbecker
  2016-08-09  3:59   ` Wanpeng Li
  0 siblings, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2016-07-13 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Rik van Riel, Paolo Bonzini, Peter Zijlstra, Wanpeng Li,
	Thomas Gleixner, Radim Krcmar, Frederic Weisbecker,
	Mike Galbraith

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of the time was actually spent
handling irqs, and could conceivably mis-count all of the ticks during
a period as irq time, when there was some irq and some softirq time.

This can actually happen when irqtime_account_process_tick is called
from account_idle_ticks, which can pass a larger number of ticks down
all at once.

Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
and steal_account_process_ticks to work with cputime_t time units, and
return the amount of time spent in each mode.

Rename steal_account_process_ticks to steal_account_process_time, to
reflect that time is now accounted in cputime_t, instead of ticks.

Additionally, have irqtime_account_process_tick take into account how
much time was spent in each of steal, irq, and softirq time.

The latter could help improve the accuracy of cputime
accounting when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Radim Krcmar <rkrcmar@redhat.com>
[fweisbec: Make nsecs_to_cputime64() actually return cputime64_t]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/asm-generic/cputime_nsecs.h |   2 +
 kernel/sched/cputime.c              | 124 ++++++++++++++++++++++--------------
 2 files changed, 79 insertions(+), 47 deletions(-)

diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h
index 0f1c6f3..a84e28e 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -50,6 +50,8 @@ typedef u64 __nocast cputime64_t;
 	(__force u64)(__ct)
 #define nsecs_to_cputime(__nsecs)	\
 	(__force cputime_t)(__nsecs)
+#define nsecs_to_cputime64(__nsecs)	\
+	(__force cputime64_t)(__nsecs)
 
 
 /*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d..db82ae1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static cputime_t irqtime_account_hi_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t irq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)) -
+		      cpustat[CPUTIME_IRQ];
+	irq_cputime = min(irq_cputime, maxtime);
+	cpustat[CPUTIME_IRQ] += irq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return irq_cputime;
 }
 
-static int irqtime_account_si_update(void)
+static cputime_t irqtime_account_si_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t softirq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_softirq_time)) -
+			  cpustat[CPUTIME_SOFTIRQ];
+	softirq_cputime = min(softirq_cputime, maxtime);
+	cpustat[CPUTIME_SOFTIRQ] += softirq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return softirq_cputime;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static cputime_t irqtime_account_hi_update(cputime_t dummy)
+{
+	return 0;
+}
+
+static cputime_t irqtime_account_si_update(cputime_t dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -257,32 +267,45 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline unsigned long steal_account_process_tick(unsigned long max_jiffies)
+static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
+		cputime_t steal_cputime;
 		u64 steal;
-		unsigned long steal_jiffies;
 
 		steal = paravirt_steal_clock(smp_processor_id());
 		steal -= this_rq()->prev_steal_time;
 
-		/*
-		 * steal is in nsecs but our caller is expecting steal
-		 * time in jiffies. Lets cast the result to jiffies
-		 * granularity and account the rest on the next rounds.
-		 */
-		steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
-		this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);
+		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		account_steal_time(steal_cputime);
+		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
 
-		account_steal_time(jiffies_to_cputime(steal_jiffies));
-		return steal_jiffies;
+		return steal_cputime;
 	}
 #endif
 	return 0;
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ */
+static inline cputime_t account_other_time(cputime_t max)
+{
+	cputime_t accounted;
+
+	accounted = steal_account_process_time(max);
+
+	if (accounted < max)
+		accounted += irqtime_account_hi_update(max - accounted);
+
+	if (accounted < max)
+		accounted += irqtime_account_si_update(max - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -342,21 +365,23 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 					 struct rq *rq, int ticks)
 {
-	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
-	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	u64 cputime = (__force u64) cputime_one_jiffy * ticks;
+	cputime_t scaled, other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_time(cputime);
+	if (other >= cputime)
 		return;
+	cputime -= other;
+	scaled = cputime_to_scaled(cputime);
 
-	cputime *= ticks;
-	scaled *= ticks;
-
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
@@ -466,7 +491,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
  */
 void account_process_tick(struct task_struct *p, int user_tick)
 {
-	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
+	cputime_t cputime, scaled, steal;
 	struct rq *rq = this_rq();
 
 	if (vtime_accounting_cpu_enabled())
@@ -477,16 +502,21 @@ void account_process_tick(struct task_struct *p, int user_tick)
 		return;
 	}
 
-	if (steal_account_process_tick(ULONG_MAX))
+	cputime = cputime_one_jiffy;
+	steal = steal_account_process_time(cputime);
+
+	if (steal >= cputime)
 		return;
 
+	cputime -= steal;
+	scaled = cputime_to_scaled(cputime);
+
 	if (user_tick)
-		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
+		account_user_time(p, cputime, scaled);
 	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
-		account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
-				    one_jiffy_scaled);
+		account_system_time(p, HARDIRQ_OFFSET, cputime, scaled);
 	else
-		account_idle_time(cputime_one_jiffy);
+		account_idle_time(cputime);
 }
 
 /*
@@ -681,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	cputime_t delta, steal;
 
-	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	delta = jiffies_to_cputime(now - tsk->vtime_snap);
+	steal = steal_account_process_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return delta - steal;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-27 12:50     ` Rik van Riel
@ 2016-06-28 20:56       ` Frederic Weisbecker
  0 siblings, 0 replies; 34+ messages in thread
From: Frederic Weisbecker @ 2016-06-28 20:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Mon, Jun 27, 2016 at 08:50:06AM -0400, Rik van Riel wrote:
> On Mon, 2016-06-27 at 14:25 +0200, Frederic Weisbecker wrote:
> > > 
> > >   * Accumulate raw cputime values of dead tasks (sig->[us]time) and
> > > live
> > >   * tasks (sum on group iteration) belonging to @tsk's group.
> > >   */
> > > @@ -344,19 +378,24 @@ static void
> > > irqtime_account_process_tick(struct task_struct *p, int user_tick,
> > >  {
> > >  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
> > >  	u64 cputime = (__force u64) cputime_one_jiffy;
> > > -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > > +	unsigned long other;
> > >  
> > > -	if (steal_account_process_tick(ULONG_MAX))
> > > +	/*
> > > +	 * When returning from idle, many ticks can get accounted
> > > at
> > > +	 * once, including some ticks of steal, irq, and softirq
> > > time.
> > > +	 * Subtract those ticks from the amount of time accounted
> > > to
> > > +	 * idle, or potentially user or system time. Due to
> > > rounding,
> > > +	 * other time can exceed ticks occasionally.
> > > +	 */
> > > +	other = account_other_ticks(ticks);
> > > +	if (other >= ticks)
> > >  		return;
> > > +	ticks -= other;
> > >  
> > >  	cputime *= ticks;
> > >  	scaled *= ticks;
> > So instead of dealing with ticks here, I think you should rather use
> > the above
> > cputime as both the limit and the remaining time to account after
> > steal/irqs.
> > 
> > This should avoid some middle conversions and improve precision when
> > cputime_t == nsecs granularity.
> > 
> > If we account 2 ticks to idle (lets say HZ=100) and irq time to
> > account is 15 ms. 2 ticks = 20 ms
> > so we have 5 ms left to account to idle. With the jiffies granularity
> > in this patch, we would account
> > one tick to irqtime (1 tick = 10 ms, there will be 5 ms to account
> > back later) and one tick to idle
> > time whereas if you deal with cputime_t, you are going to account the
> > correct amount of idle time.
> > 
> 
> Ahhh, so you want irqtime_account_process_tick to work with
> and account fractional ticks when calling account_system_time,
> account_user_time, account_idle_time, etc?

Not exactly. This function is passed ticks but works with the cputime
converted value of those ticks. And it would be more precise to work on
top of cputime values without converting things to jiffies in the middle
as we may lose precision on the way.

> 
> I guess that should work fine since we already pass cputime
> values in, anyway.
> 
> I suppose we can do the same for get_vtime_delta, too.

But get_vtime_delta() already returns cputime, right?

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-27 12:25   ` Frederic Weisbecker
@ 2016-06-27 12:50     ` Rik van Riel
  2016-06-28 20:56       ` Frederic Weisbecker
  0 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2016-06-27 12:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

[-- Attachment #1: Type: text/plain, Size: 5303 bytes --]

On Mon, 2016-06-27 at 14:25 +0200, Frederic Weisbecker wrote:
> On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@redhat.com wrote:
> > 
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Currently, if there was any irq or softirq time during 'ticks'
> > jiffies, the entire period will be accounted as irq or softirq
> > time.
> > 
> > This is inaccurate if only a subset of 'ticks' jiffies was
> > actually spent handling irqs, and could conceivably mis-count
> > all of the ticks during a period as irq time, when there was
> > some irq and some softirq time.
> Good catch!
> 
> Many comments following.
> 
> > 
> > 
> > This can actually happen when irqtime_account_process_tick
> > is called from account_idle_ticks, which can pass a larger
> > number of ticks down all at once.
> > 
> > Fix this by changing irqtime_account_hi_update and
> > irqtime_account_si_update to round elapsed irq and softirq
> > time to jiffies, and return the number of jiffies spent in
> > each mode, similar to how steal time is handled.
> > 
> > Additionally, have irqtime_account_process_tick take into
> > account how much time was spent in each of steal, irq,
> > and softirq time.
> > 
> > The latter could help improve the accuracy of timekeeping
> Maybe you meant cputime? Timekeeping is rather about jiffies and
> GTOD.
> 
> > 
> > when returning from idle on a NO_HZ_IDLE CPU.
> > 
> > Properly accounting how much time was spent in hardirq and
> > softirq time will also allow the NO_HZ_FULL code to re-use
> > these same functions for hardirq and softirq accounting.
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > 
> >  	local_irq_save(flags);
> > -	latest_ns = this_cpu_read(cpu_hardirq_time);
> > -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> > -		ret = 1;
> > +	irq = this_cpu_read(cpu_hardirq_time) -
> > cpustat[CPUTIME_IRQ];
> cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in
> fact
> even cputime64_t). So you need to convert cpu_hardirq_time before
> doing the
> substract.

Doh. Good catch!

> > -static int irqtime_account_si_update(void)
> > +static unsigned long irqtime_account_si_update(unsigned long
> > max_jiffies)
> >  {
> >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long si_jiffies = 0;
> >  	unsigned long flags;
> > -	u64 latest_ns;
> > -	int ret = 0;
> > +	u64 softirq;
> >  
> >  	local_irq_save(flags);
> > -	latest_ns = this_cpu_read(cpu_softirq_time);
> > -	if (nsecs_to_cputime64(latest_ns) >
> > cpustat[CPUTIME_SOFTIRQ])
> > -		ret = 1;
> > +	softirq = this_cpu_read(cpu_softirq_time) -
> > cpustat[CPUTIME_SOFTIRQ];
> > +	if (softirq > cputime_one_jiffy) {
> > +		si_jiffies = min(max_jiffies,
> > cputime_to_jiffies(softirq));
> > +		cpustat[CPUTIME_SOFTIRQ] +=
> > jiffies_to_cputime(si_jiffies);
> > +	}
> >  	local_irq_restore(flags);
> > -	return ret;
> > +	return si_jiffies;
> So same comments apply here.
> 
> [...]
> > 
> >   * Accumulate raw cputime values of dead tasks (sig->[us]time) and
> > live
> >   * tasks (sum on group iteration) belonging to @tsk's group.
> >   */
> > @@ -344,19 +378,24 @@ static void
> > irqtime_account_process_tick(struct task_struct *p, int user_tick,
> >  {
> >  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
> >  	u64 cputime = (__force u64) cputime_one_jiffy;
> > -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long other;
> >  
> > -	if (steal_account_process_tick(ULONG_MAX))
> > +	/*
> > +	 * When returning from idle, many ticks can get accounted
> > at
> > +	 * once, including some ticks of steal, irq, and softirq
> > time.
> > +	 * Subtract those ticks from the amount of time accounted
> > to
> > +	 * idle, or potentially user or system time. Due to
> > rounding,
> > +	 * other time can exceed ticks occasionally.
> > +	 */
> > +	other = account_other_ticks(ticks);
> > +	if (other >= ticks)
> >  		return;
> > +	ticks -= other;
> >  
> >  	cputime *= ticks;
> >  	scaled *= ticks;
> So instead of dealing with ticks here, I think you should rather use
> the above
> cputime as both the limit and the remaining time to account after
> steal/irqs.
> 
> This should avoid some middle conversions and improve precision when
> cputime_t == nsecs granularity.
> 
> If we account 2 ticks to idle (lets say HZ=100) and irq time to
> account is 15 ms. 2 ticks = 20 ms
> so we have 5 ms left to account to idle. With the jiffies granularity
> in this patch, we would account
> one tick to irqtime (1 tick = 10 ms, there will be 5 ms to account
> back later) and one tick to idle
> time whereas if you deal with cputime_t, you are going to account the
> correct amount of idle time.
> 

Ahhh, so you want irqtime_account_process_tick to work with
and account fractional ticks when calling account_system_time,
account_user_time, account_idle_time, etc?

I guess that should work fine since we already pass cputime
values in, anyway.

I suppose we can do the same for get_vtime_delta, too.

They can both work with the actual remaining time (in cputime_t),
after the other time has been subtracted.

I can rework the series to do that.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-27 12:25   ` Frederic Weisbecker
  2016-06-27 12:50     ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2016-06-27 12:25 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period will be accounted as irq or softirq
> time.
> 
> This is inaccurate if only a subset of 'ticks' jiffies was
> actually spent handling irqs, and could conceivably mis-count
> all of the ticks during a period as irq time, when there was
> some irq and some softirq time.

Good catch!

Many comments following.

> 
> This can actually happen when irqtime_account_process_tick
> is called from account_idle_ticks, which can pass a larger
> number of ticks down all at once.
> 
> Fix this by changing irqtime_account_hi_update and
> irqtime_account_si_update to round elapsed irq and softirq
> time to jiffies, and return the number of jiffies spent in
> each mode, similar to how steal time is handled.
> 
> Additionally, have irqtime_account_process_tick take into
> account how much time was spent in each of steal, irq,
> and softirq time.
> 
> The latter could help improve the accuracy of timekeeping

Maybe you meant cputime? Timekeeping is rather about jiffies and GTOD.

> when returning from idle on a NO_HZ_IDLE CPU.
> 
> Properly accounting how much time was spent in hardirq and
> softirq time will also allow the NO_HZ_FULL code to re-use
> these same functions for hardirq and softirq accounting.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 60 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 3d60e5d76fdb..15b813c014be 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr)
>  }
>  EXPORT_SYMBOL_GPL(irqtime_account_irq);
>  
> -static int irqtime_account_hi_update(void)
> +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long irq_jiffies = 0;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	u64 irq;

Should be cputime_t ? And perhaps renamed to irq_cputime to distinguish
clearly against irq_jiffies?

>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_hardirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> -		ret = 1;
> +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];

cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in fact
even cputime64_t). So you need to convert cpu_hardirq_time before doing the
substract.

> +	if (irq > cputime_one_jiffy) {
> +		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
> +		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
> +	}
>  	local_irq_restore(flags);
> -	return ret;
> +	return irq_jiffies;

I think we shouldn't use jiffies at all here and only rely on cputime_t.
max_jiffies should be of cputime_t and this function should return the cputime_t.

The resulting code should be more simple and in fact more precise (more below on that).

>  }
>  
> -static int irqtime_account_si_update(void)
> +static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long si_jiffies = 0;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	u64 softirq;
>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_softirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
> -		ret = 1;
> +	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
> +	if (softirq > cputime_one_jiffy) {
> +		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
> +		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
> +	}
>  	local_irq_restore(flags);
> -	return ret;
> +	return si_jiffies;

So same comments apply here.

[...]
>   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
>   * tasks (sum on group iteration) belonging to @tsk's group.
>   */
> @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>  {
>  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
>  	u64 cputime = (__force u64) cputime_one_jiffy;
> -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long other;
>  
> -	if (steal_account_process_tick(ULONG_MAX))
> +	/*
> +	 * When returning from idle, many ticks can get accounted at
> +	 * once, including some ticks of steal, irq, and softirq time.
> +	 * Subtract those ticks from the amount of time accounted to
> +	 * idle, or potentially user or system time. Due to rounding,
> +	 * other time can exceed ticks occasionally.
> +	 */
> +	other = account_other_ticks(ticks);
> +	if (other >= ticks)
>  		return;
> +	ticks -= other;
>  
>  	cputime *= ticks;
>  	scaled *= ticks;

So instead of dealing with ticks here, I think you should rather use the above
cputime as both the limit and the remaining time to account after steal/irqs.

This should avoid some middle conversions and improve precision when cputime_t == nsecs granularity.

If we account 2 ticks to idle (lets say HZ=100) and irq time to account is 15 ms. 2 ticks = 20 ms
so we have 5 ms left to account to idle. With the jiffies granularity in this patch, we would account
one tick to irqtime (1 tick = 10 ms, there will be 5 ms to account back later) and one tick to idle
time whereas if you deal with cputime_t, you are going to account the correct amount of idle time.

Beyond the scope of this patchset, we should probably kill cputime_t and account everything to nsecs.
I have a patchset that did that 2 years ago, I should probably re-iterate it.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-23  2:25 ` riel
  2016-06-27 12:25   ` Frederic Weisbecker
  0 siblings, 1 reply; 34+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of 'ticks' jiffies was
actually spent handling irqs, and could conceivably mis-count
all of the ticks during a period as irq time, when there was
some irq and some softirq time.

This can actually happen when irqtime_account_process_tick
is called from account_idle_ticks, which can pass a larger
number of ticks down all at once.

Fix this by changing irqtime_account_hi_update and
irqtime_account_si_update to round elapsed irq and softirq
time to jiffies, and return the number of jiffies spent in
each mode, similar to how steal time is handled.

Additionally, have irqtime_account_process_tick take into
account how much time was spent in each of steal, irq,
and softirq time.

The latter could help improve the accuracy of timekeeping
when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 60 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d76fdb..15b813c014be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long irq_jiffies = 0;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 irq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
+	if (irq > cputime_one_jiffy) {
+		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
+		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
+	}
 	local_irq_restore(flags);
-	return ret;
+	return irq_jiffies;
 }
 
-static int irqtime_account_si_update(void)
+static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long si_jiffies = 0;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 softirq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
+	if (softirq > cputime_one_jiffy) {
+		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
+		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
+	}
 	local_irq_restore(flags);
-	return ret;
+	return si_jiffies;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static unsigned long irqtime_account_hi_update(unsigned long dummy)
+{
+	return 0;
+}
+
+static unsigned long irqtime_account_si_update(unsigned long dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -283,6 +297,26 @@ static __always_inline unsigned long steal_account_process_tick(unsigned long ma
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ * Due to rounding errors, the calculated amount can sometimes exceed
+ * max_jiffies; be careful not to account more than max_jiffies.
+ */
+static inline int account_other_ticks(unsigned long max_jiffies)
+{
+	unsigned long accounted;
+
+	accounted = steal_account_process_tick(max_jiffies);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_hi_update(max_jiffies - accounted);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_si_update(max_jiffies - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 {
 	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_ticks(ticks);
+	if (other >= ticks)
 		return;
+	ticks -= other;
 
 	cputime *= ticks;
 	scaled *= ticks;
 
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-08  2:29 [PATCH RFC 0/5] sched,time: make irq time accounting work for nohz_idle riel
@ 2016-06-08  2:30 ` riel
  0 siblings, 0 replies; 34+ messages in thread
From: riel @ 2016-06-08  2:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: kernellwp, mingo, peterz, tglx, fweisbec

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of 'ticks' jiffies was
actually spent handling irqs, and could conceivably mis-count
all of the ticks during a period as irq time, when there was
some irq and some softirq time.

Fix this by changing irqtime_account_hi_update and
irqtime_account_si_update to round elapsed irq and softirq
time to jiffies, and return the number of jiffies spent in
each mode, similar to how steal time is handled.

Additionally, have irqtime_account_process_tick take into
account how much time was spent in each of steal, irq,
and softirq time.

The latter could help improve the accuracy of timekeeping
when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 54 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f51c98c740b5..4bd6d1b774ab 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,34 +79,36 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static unsigned long irqtime_account_hi_update(void)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long irq_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 irq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
+	irq_jiffies = cputime_to_jiffies(irq);
+	if (irq_jiffies)
+		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return irq_jiffies;
 }
 
-static int irqtime_account_si_update(void)
+static unsigned long irqtime_account_si_update(void)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long si_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 softirq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	delta = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
+	si_jiffies = cputime_to_jiffies(delta);
+	if (si_jiffies)
+		cpustat[CPUSTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return si_jiffies;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
@@ -345,18 +347,30 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 cputime = (__force u64) cputime_one_jiffy;
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long other;
 
-	if (steal_account_process_tick())
+	/*
+	 * Subtract steal, irq & softirq time (if any) from ticks.
+	 * These are counted in nanoseconds and not aligned with jiffies.
+	 * Multiple of these could "break a jiffy" simultaneously due to
+	 * rounding; be careful to not account too much of this time at once.
+	 */
+	other = steal_account_process_tick();
+	if (other >= ticks)
+		return;
+
+	other += irqtime_account_hi_update();
+	if (other >= ticks)
+		return;
+
+	other += irqtime_account_si_update();
+	if (other >= ticks)
 		return;
 
 	cputime *= ticks;
 	scaled *= ticks;
 
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2016-08-10 21:26 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
2016-06-16 16:22   ` kbuild test robot
2016-06-21 21:21   ` Peter Zijlstra
2016-06-21 22:20     ` Rik van Riel
2016-06-22 10:40     ` Paolo Bonzini
2016-06-22 10:52       ` Peter Zijlstra
2016-06-16 16:06 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
2016-06-16 16:06 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
2016-06-16 16:06 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
2016-06-16 16:06 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
2016-06-21 21:49   ` Peter Zijlstra
2016-06-21 22:23     ` Rik van Riel
2016-06-21 22:28       ` Peter Zijlstra
2016-06-21 22:32         ` Rik van Riel
2016-06-22 21:55     ` Rik van Riel
2016-06-23 13:52       ` Paolo Bonzini
2016-06-23 15:24         ` Rik van Riel
  -- strict thread matches above, loose matches on Subject: below --
2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
2016-08-09  3:59   ` Wanpeng Li
2016-08-09 14:06     ` Rik van Riel
2016-08-09 23:07       ` Wanpeng Li
2016-08-10  7:51         ` Wanpeng Li
2016-08-09 23:25       ` Wanpeng Li
2016-08-09 23:31         ` Wanpeng Li
2016-08-09 23:35         ` Wanpeng Li
2016-08-09 23:39         ` Wanpeng Li
2016-08-10  5:07           ` Rik van Riel
2016-08-10  6:33             ` Wanpeng Li
2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
2016-06-27 12:25   ` Frederic Weisbecker
2016-06-27 12:50     ` Rik van Riel
2016-06-28 20:56       ` Frederic Weisbecker
2016-06-08  2:29 [PATCH RFC 0/5] sched,time: make irq time accounting work for nohz_idle riel
2016-06-08  2:30 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.