linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle
@ 2016-06-16 16:06 riel
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

Currently irq time accounting only works in these cases:
1) purely tick-based accounting
2) nohz_full accounting, but only on housekeeping & nohz_full CPUs
3) architectures with native vtime accounting

On nohz_idle CPUs, which are probably the majority nowadays,
irq time accounting is currently broken. This leads to systems
reporting a dramatically lower amount of irq & softirq time than
is actually spent handling them, because all of the time spent
while the system is running the idle task gets accounted as idle.

This patch set seems to bring the amount of irq time reported by
top (and /proc/stat) roughly in line with that measured when I do
a "perf record -g -a" run to see what is using all that time.

The amount of irq time used, especially softirq, is shockingly high,
to the point of me thinking this patch set may be wrong, but the
numbers seem to match what perf is giving me...

These patches apply on top of Wanpeng Li's steal time patches.

CONFIG_IRQ_TIME_ACCOUNTING is now a config option that can be
selected independently of the tick based / nohz_idle / nohz_full
mode, as suggested by Frederic Weisbecker.

Next up: look at the things that are using CPU time on an otherwise
idle system, and see if I can make those a little faster :)

* [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:22   ` kbuild test robot
  2016-06-21 21:21   ` Peter Zijlstra
  2016-06-16 16:06 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period is accounted as irq or softirq
time.

This is inaccurate if only a subset of those 'ticks' jiffies was
actually spent handling irqs, and it can mis-count all of the
ticks during a period as irq time when the period contained a mix
of irq and softirq time.

This can actually happen when irqtime_account_process_tick
is called from account_idle_ticks, which can pass a larger
number of ticks down all at once.

Fix this by changing irqtime_account_hi_update and
irqtime_account_si_update to round elapsed irq and softirq
time to jiffies, and return the number of jiffies spent in
each mode, similar to how steal time is handled.

Additionally, have irqtime_account_process_tick take into
account how much of the elapsed time was steal, irq, and
softirq time.

The latter could help improve the accuracy of timekeeping
when returning from idle on a NO_HZ_IDLE CPU.
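
For example (made-up numbers): if 100 ticks are accounted at once
when returning from idle, and 2 jiffies of steal, 3 jiffies of irq,
and 1 jiffy of softirq time elapsed in that window, those 6 jiffies
go to their own buckets and only the remaining 94 ticks are
accounted as idle (or user/system) time.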

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 69 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 48 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d76fdb..9bd2d4f42037 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,34 +79,36 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long irq_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 irq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
+	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);
+	if (irq_jiffies)
+		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return irq_jiffies;
 }
 
-static int irqtime_account_si_update(void)
+static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long si_jiffies;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 softirq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
+	si_jiffies = min(cputime_to_jiffies(softirq), max_jiffies);
+	if (si_jiffies)
+		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
 	local_irq_restore(flags);
-	return ret;
+	return si_jiffies;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
@@ -283,6 +285,26 @@ static __always_inline unsigned long steal_account_process_tick(unsigned long ma
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ * Due to rounding errors, the calculated amount can sometimes exceed
+ * max_jiffies; be careful not to account more than max_jiffies.
+ */
+static inline int account_other_ticks(unsigned long max_jiffies)
+{
+	unsigned long accounted;
+
+	accounted = steal_account_process_tick(max_jiffies);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_hi_update(max_jiffies - accounted);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_si_update(max_jiffies - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -344,19 +366,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 {
 	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_ticks(ticks);
+	if (other >= ticks)
 		return;
+	ticks -= other;
 
 	cputime *= ticks;
 	scaled *= ticks;
 
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-- 
2.5.5

* [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:06 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
currently appear to work correctly.

On CPUs that are nohz_full, people typically do not assign IRQs.
On the housekeeping CPU (when a system is booted up with nohz_full),
sampling should work ok to determine irq and softirq time use, but
that only covers the housekeeping CPU itself, not the other
non-nohz_full CPUs.

On CPUs that are nohz_idle (the typical way a distro kernel is
booted), irq time is not accounted at all while the CPU is idle,
due to the lack of timer ticks.

Remove the VTIME_GEN vtime irq time code. The next patch will
allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vtime.h  | 32 ++++++++++++++------------------
 kernel/sched/cputime.c | 26 +++++++++++++-------------
 2 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index fa2196990f84..d1977d84ebdf 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,6 +14,18 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+
+#ifdef __ARCH_HAS_VTIME_ACCOUNT
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+#else
+extern void vtime_common_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_enter(struct task_struct *tsk)
+{
+	if (vtime_accounting_cpu_enabled())
+		vtime_common_account_irq_enter(tsk);
+}
+#endif /* __ARCH_HAS_VTIME_ACCOUNT */
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -64,17 +76,6 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline void vtime_task_switch(struct task_struct *prev) { }
@@ -85,13 +86,8 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_gen_account_irq_exit(struct task_struct *tsk);
-
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_gen_account_irq_exit(tsk);
-}
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
 
 extern void vtime_user_enter(struct task_struct *tsk);
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9bd2d4f42037..261c9002348a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -115,6 +115,16 @@ static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
 
 #define sched_clock_irqtime	(0)
 
+static unsigned long irqtime_account_hi_update(unsigned long dummy)
+{
+	return 0;
+}
+
+static unsigned long irqtime_account_si_update(unsigned long dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -708,14 +718,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	unsigned long delta_jiffies, other_jiffies;
 
 	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	other_jiffies = account_other_ticks(delta_jiffies);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return jiffies_to_cputime(delta_jiffies - other_jiffies);
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -735,16 +745,6 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_gen_account_irq_exit(struct task_struct *tsk)
-{
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		__vtime_account_system(tsk);
-	if (context_tracking_in_user())
-		tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
-}
-
 void vtime_account_user(struct task_struct *tsk)
 {
 	cputime_t delta_cpu;
-- 
2.5.5

* [PATCH 3/5] cputime: allow irq time accounting to be selected as an option
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
  2016-06-16 16:06 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:06 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
  2016-06-16 16:06 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  4 siblings, 0 replies; 20+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Allow CONFIG_IRQ_TIME_ACCOUNTING to be selected as an option, on top
of CONFIG_VIRT_CPU_ACCOUNTING_GEN (and potentially others?).

This allows for the irq time accounting code to be used with nohz_idle
CPUs, which is how several distributions ship their kernels. Using the
same code for several timer modes also allows us to drop duplicate code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 init/Kconfig | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 0dfd09d54c65..4c7ee4f136cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -375,9 +375,11 @@ config VIRT_CPU_ACCOUNTING_GEN
 
 	  If unsure, say N.
 
+endchoice
+
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
-	depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
+	depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
 	help
 	  Select this option to enable fine granularity task irq time
 	  accounting. This is done by reading a timestamp on each
@@ -386,8 +388,6 @@ config IRQ_TIME_ACCOUNTING
 
 	  If in doubt, say N here.
 
-endchoice
-
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
 	depends on MULTIUSER
-- 
2.5.5

* [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
                   ` (2 preceding siblings ...)
  2016-06-16 16:06 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
@ 2016-06-16 16:06 ` riel
  2016-06-16 16:06 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  4 siblings, 0 replies; 20+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter, and documentation, to irqtime_account_irq.
The parameter makes it possible to distinguish between transitioning
from process context to hardirq time, and from process context to
softirq time.

This is necessary to be able to remove the local_irq_save/restore
from irqtime_account_irq.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84ebdf..31e486d741b0 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 261c9002348a..07f847267e4e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.5.5

* [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
                   ` (3 preceding siblings ...)
  2016-06-16 16:06 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
@ 2016-06-16 16:06 ` riel
  2016-06-21 21:49   ` Peter Zijlstra
  4 siblings, 1 reply; 20+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Drop local_irq_save/restore from irqtime_account_irq.
Instead, have softirq and hardirq track their time spent
independently, with the softirq code subtracting hardirq
time that happened during the duration of the softirq run.

The softirq code can be interrupted by hardirq code at
any point in time, but it can check whether it got a
consistent snapshot of the timekeeping variables it wants,
and loop around in the unlikely case that it did not.
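
The retry boils down to the following pattern (a standalone sketch
with made-up names, not the code in this patch; the real loop below
operates on the per-cpu counters and re-reads the clock each pass):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t hardirq_total;	/* advanced by "hardirqs" */
static uint64_t softirq_start, hardirq_at_entry;

/*
 * Compute the softirq interval, minus the hardirq time that
 * interrupted it.  If a hardirq moved hardirq_total while we were
 * computing, the snapshot may be inconsistent; compute again.
 */
static uint64_t softirq_delta(uint64_t now)
{
	uint64_t hi_snap, delta;

	do {
		hi_snap = atomic_load(&hardirq_total);
		delta = (now - softirq_start) -
			(hi_snap - hardirq_at_entry);
	} while (hi_snap != atomic_load(&hardirq_total));

	return delta;
}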

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 63 ++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h   | 28 ++++++++++++++--------
 2 files changed, 70 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 07f847267e4e..ba18d51a091c 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -26,7 +26,9 @@
 DEFINE_PER_CPU(u64, cpu_hardirq_time);
 DEFINE_PER_CPU(u64, cpu_softirq_time);
 
-static DEFINE_PER_CPU(u64, irq_start_time);
+static DEFINE_PER_CPU(u64, hardirq_start_time);
+static DEFINE_PER_CPU(u64, softirq_start_time);
+static DEFINE_PER_CPU(u64, prev_hardirq_time);
 static int sched_clock_irqtime;
 
 void enable_sched_clock_irqtime(void)
@@ -41,6 +43,7 @@ void disable_sched_clock_irqtime(void)
 
 #ifndef CONFIG_64BIT
 DEFINE_PER_CPU(seqcount_t, irq_time_seq);
+DEFINE_PER_CPU(seqcount_t, softirq_time_seq);
 #endif /* CONFIG_64BIT */
 
 /*
@@ -53,36 +56,72 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  * softirq -> hardirq, hardirq -> softirq
  *
  * When exiting hardirq or softirq time, account the elapsed time.
+ *
+ * When exiting softirq time, subtract the amount of hardirq time that
+ * interrupted this softirq run, to avoid double accounting of that time.
  */
 void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
-	unsigned long flags;
-	s64 delta;
+	u64 prev_softirq_start;
+	u64 prev_hardirq;
+	u64 hardirq_time;
+	s64 delta = 0;
 	int cpu;
 
 	if (!sched_clock_irqtime)
 		return;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
-	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
-	__this_cpu_add(irq_start_time, delta);
+	/*
+	 * Softirq context may get interrupted by hardirq context,
+	 * on the same CPU. At softirq entry time the amount of time
+	 * spent in hardirq context is stored. At softirq exit time,
+	 * the time spent in hardirq context during the softirq is
+	 * subtracted.
+	 */
+	prev_hardirq = __this_cpu_read(prev_hardirq_time);
+	prev_softirq_start = __this_cpu_read(softirq_start_time);
+
+	if (irqtype == HARDIRQ_OFFSET) {
+		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
+		__this_cpu_add(hardirq_start_time, delta);
+	} else do {
+		u64 now = sched_clock_cpu(cpu);
+		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));
+
+		delta = now - prev_softirq_start;
+		if (in_serving_softirq()) {
+			/*
+			 * Leaving softirq context. Avoid double counting by
+			 * subtracting hardirq time from this interval.
+			 */
+			s64 hi_delta = hardirq_time - prev_hardirq;
+			delta -= hi_delta;
+		} else {
+			/* Entering softirq context. Note start times. */
+			__this_cpu_write(softirq_start_time, now);
+			__this_cpu_write(prev_hardirq_time, hardirq_time);
+		}
+		/*
+		 * If a hardirq happened during this calculation, it may not
+		 * have gotten a consistent snapshot. Try again.
+		 */
+	} while (hardirq_time != READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
 
-	irq_time_write_begin();
+	irq_time_write_begin(irqtype);
 	/*
 	 * We do not account for softirq time from ksoftirqd here.
 	 * We want to continue accounting softirq time to ksoftirqd thread
 	 * in that case, so as not to confuse scheduler with a special task
 	 * that do not consume any time, but still wants to run.
 	 */
-	if (hardirq_count())
+	if (irqtype == HARDIRQ_OFFSET && hardirq_count())
 		__this_cpu_add(cpu_hardirq_time, delta);
-	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+	else if (irqtype == SOFTIRQ_OFFSET && in_serving_softirq() &&
+				curr != this_cpu_ksoftirqd())
 		__this_cpu_add(cpu_softirq_time, delta);
 
-	irq_time_write_end();
-	local_irq_restore(flags);
+	irq_time_write_end(irqtype);
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d23527e..65d013447e0c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1752,38 +1752,48 @@ DECLARE_PER_CPU(u64, cpu_softirq_time);
 
 #ifndef CONFIG_64BIT
 DECLARE_PER_CPU(seqcount_t, irq_time_seq);
+DECLARE_PER_CPU(seqcount_t, softirq_time_seq);
 
-static inline void irq_time_write_begin(void)
+static inline void irq_time_write_begin(int irqtype)
 {
-	__this_cpu_inc(irq_time_seq.sequence);
+	if (irqtype == HARDIRQ_OFFSET)
+		__this_cpu_inc(irq_time_seq.sequence);
+	else
+		__this_cpu_inc(softirq_time_seq.sequence);
 	smp_wmb();
 }
 
-static inline void irq_time_write_end(void)
+static inline void irq_time_write_end(int irqtype)
 {
 	smp_wmb();
-	__this_cpu_inc(irq_time_seq.sequence);
+	if (irqtype == HARDIRQ_OFFSET)
+		__this_cpu_inc(irq_time_seq.sequence);
+	else
+		__this_cpu_inc(softirq_time_seq.sequence);
 }
 
 static inline u64 irq_time_read(int cpu)
 {
 	u64 irq_time;
-	unsigned seq;
+	unsigned hi_seq;
+	unsigned si_seq;
 
 	do {
-		seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		hi_seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		si_seq = read_seqcount_begin(&per_cpu(softirq_time_seq, cpu));
 		irq_time = per_cpu(cpu_softirq_time, cpu) +
 			   per_cpu(cpu_hardirq_time, cpu);
-	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), hi_seq) ||
+		 read_seqcount_retry(&per_cpu(softirq_time_seq, cpu), si_seq));
 
 	return irq_time;
 }
 #else /* CONFIG_64BIT */
-static inline void irq_time_write_begin(void)
+static inline void irq_time_write_begin(int irqtype)
 {
 }
 
-static inline void irq_time_write_end(void)
+static inline void irq_time_write_end(int irqtype)
 {
 }
 
-- 
2.5.5

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-16 16:22   ` kbuild test robot
  2016-06-21 21:21   ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2016-06-16 16:22 UTC (permalink / raw)
  To: riel
  Cc: kbuild-all, linux-kernel, mingo, pbonzini, peterz, fweisbec,
	wanpeng.li, efault, tglx, rkrcmar

Hi,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20160616]
[cannot apply to v4.7-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/sched-time-fix-irq-time-accounting-with-nohz_idle/20160617-001150
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

Note: the linux-review/riel-redhat-com/sched-time-fix-irq-time-accounting-with-nohz_idle/20160617-001150 HEAD cc827b6bc02ef052d79b5dbd2c355d9483a8ada7 builds fine.
      It only hurts bisectability.

All errors (new ones prefixed by >>):

   kernel/sched/cputime.c: In function 'account_other_ticks':
>> kernel/sched/cputime.c:299:16: error: implicit declaration of function 'irqtime_account_hi_update' [-Werror=implicit-function-declaration]
      accounted += irqtime_account_hi_update(max_jiffies - accounted);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/sched/cputime.c:302:16: error: implicit declaration of function 'irqtime_account_si_update' [-Werror=implicit-function-declaration]
      accounted += irqtime_account_si_update(max_jiffies - accounted);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/irqtime_account_hi_update +299 kernel/sched/cputime.c

   293	{
   294		unsigned long accounted;
   295	
   296		accounted = steal_account_process_tick(max_jiffies);
   297	
   298		if (accounted < max_jiffies)
 > 299			accounted += irqtime_account_hi_update(max_jiffies - accounted);
   300	
   301		if (accounted < max_jiffies)
 > 302			accounted += irqtime_account_si_update(max_jiffies - accounted);
   303	
   304		return accounted;
   305	}
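
For the record, the undeclared functions are the
!CONFIG_IRQ_TIME_ACCOUNTING stubs that patch 2 of this series adds;
folding them into this patch would close the bisection hole:

static unsigned long irqtime_account_hi_update(unsigned long dummy)
{
	return 0;
}

static unsigned long irqtime_account_si_update(unsigned long dummy)
{
	return 0;
}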

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
  2016-06-16 16:22   ` kbuild test robot
@ 2016-06-21 21:21   ` Peter Zijlstra
  2016-06-21 22:20     ` Rik van Riel
  2016-06-22 10:40     ` Paolo Bonzini
  1 sibling, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2016-06-21 21:21 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Thu, Jun 16, 2016 at 12:06:03PM -0400, riel@redhat.com wrote:
> +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long irq_jiffies;
>  	unsigned long flags;
> +	u64 irq;
>  
>  	local_irq_save(flags);
> +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
> +	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);

cputime_to_jiffies is a division, could we not avoid that by doing
something like:

	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
	while (irq_jiffies > cputime_one_jiffy) {
		irq_jiffies -= cputime_one_jiffy;
		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
	}

assuming that the loop is 'rare' etc.. If not, only do the division on
that same > cputime_one_jiffy condition.

> +	if (irq_jiffies)
> +		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
>  	local_irq_restore(flags);
> +	return irq_jiffies;
>  }

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-16 16:06 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
@ 2016-06-21 21:49   ` Peter Zijlstra
  2016-06-21 22:23     ` Rik van Riel
  2016-06-22 21:55     ` Rik van Riel
  0 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2016-06-21 21:49 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Thu, Jun 16, 2016 at 12:06:07PM -0400, riel@redhat.com wrote:
> @@ -53,36 +56,72 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
>   * softirq -> hardirq, hardirq -> softirq
>   *
>   * When exiting hardirq or softirq time, account the elapsed time.
> + *
> + * When exiting softirq time, subtract the amount of hardirq time that
> + * interrupted this softirq run, to avoid double accounting of that time.
>   */
>  void irqtime_account_irq(struct task_struct *curr, int irqtype)
>  {
> +	u64 prev_softirq_start;
> +	u64 prev_hardirq;
> +	u64 hardirq_time;
> +	s64 delta = 0;

We appear to always assign to delta, so this initialization seems
superfluous.

>  	int cpu;
>  
>  	if (!sched_clock_irqtime)
>  		return;
>  
>  	cpu = smp_processor_id();

Per this smp_processor_id() usage, preemption is disabled.

> +	/*
> +	 * Softirq context may get interrupted by hardirq context,
> +	 * on the same CPU. At softirq entry time the amount of time
> +	 * spent in hardirq context is stored. At softirq exit time,
> +	 * the time spent in hardirq context during the softirq is
> +	 * subtracted.
> +	 */
> +	prev_hardirq = __this_cpu_read(prev_hardirq_time);
> +	prev_softirq_start = __this_cpu_read(softirq_start_time);
> +
> +	if (irqtype == HARDIRQ_OFFSET) {
> +		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
> +		__this_cpu_add(hardirq_start_time, delta);
> +	} else do {
> +		u64 now = sched_clock_cpu(cpu);
> +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));

Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong with
__this_cpu_read() ?

> +
> +		delta = now - prev_softirq_start;
> +		if (in_serving_softirq()) {
> +			/*
> +			 * Leaving softirq context. Avoid double counting by
> +			 * subtracting hardirq time from this interval.
> +			 */
> +			s64 hi_delta = hardirq_time - prev_hardirq;
> +			delta -= hi_delta;
> +		} else {
> +			/* Entering softirq context. Note start times. */
> +			__this_cpu_write(softirq_start_time, now);
> +			__this_cpu_write(prev_hardirq_time, hardirq_time);
> +		}
> +		/*
> +		 * If a hardirq happened during this calculation, it may not
> +		 * have gotten a consistent snapshot. Try again.
> +		 */
> +	} while (hardirq_time != READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));

That whole thing is somewhat hard to read; but its far too late for me
to suggest anything more readable :/

> +	irq_time_write_begin(irqtype);
>  	/*
>  	 * We do not account for softirq time from ksoftirqd here.
>  	 * We want to continue accounting softirq time to ksoftirqd thread
>  	 * in that case, so as not to confuse scheduler with a special task
>  	 * that do not consume any time, but still wants to run.
>  	 */
> +	if (irqtype == HARDIRQ_OFFSET && hardirq_count())
>  		__this_cpu_add(cpu_hardirq_time, delta);
> +	else if (irqtype == SOFTIRQ_OFFSET && in_serving_softirq() &&
> +				curr != this_cpu_ksoftirqd())
>  		__this_cpu_add(cpu_softirq_time, delta);
>  
> +	irq_time_write_end(irqtype);

Maybe split the whole thing on irqtype at the very start, instead of the
endless repeated branches?

>  }
>  EXPORT_SYMBOL_GPL(irqtime_account_irq);

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-21 21:21   ` Peter Zijlstra
@ 2016-06-21 22:20     ` Rik van Riel
  2016-06-22 10:40     ` Paolo Bonzini
  1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2016-06-21 22:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Tue, 2016-06-21 at 23:21 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 12:06:03PM -0400, riel@redhat.com wrote:
> > 
> > +static unsigned long irqtime_account_hi_update(unsigned long
> > max_jiffies)
> >  {
> >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long irq_jiffies;
> >  	unsigned long flags;
> > +	u64 irq;
> >  
> >  	local_irq_save(flags);
> > +	irq = this_cpu_read(cpu_hardirq_time) -
> > cpustat[CPUTIME_IRQ];
> > +	irq_jiffies = min(cputime_to_jiffies(irq), max_jiffies);
> cputime_to_jiffies is a division, could we not avoid that by doing
> something like:
> 
> 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> 	while (irq_jiffies > cputime_one_jiffy) {
> 		irq_jiffies -= cputime_one_jiffy;
> 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> 	}
> 
> assuming that the loop is 'rare' etc.. If not, only do the division
> on
> that same > cputime_one_jiffy condition.

I suspect the loop is not rare on systems with nohz_idle,
where it may be quite a while before a timer tick happens
on an idle cpu.

I can certainly make sure the division is only done when
irq > 2*cputime_one_jiffy. I will do that in the next
version.
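
A rough sketch of that combination, reusing the names from the
quoted hunk (untested):

	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
	irq = min(irq, jiffies_to_cputime(max_jiffies));

	if (irq > 2 * cputime_one_jiffy) {
		/* Rare: a long interval, pay for one division. */
		irq_jiffies = cputime_to_jiffies(irq);
	} else {
		/* Common case: zero, one or two jiffies elapsed. */
		irq_jiffies = 0;
		while (irq >= cputime_one_jiffy) {
			irq -= cputime_one_jiffy;
			irq_jiffies++;
		}
	}

	if (irq_jiffies)
		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);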

-- 
All Rights Reversed.



* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-21 21:49   ` Peter Zijlstra
@ 2016-06-21 22:23     ` Rik van Riel
  2016-06-21 22:28       ` Peter Zijlstra
  2016-06-22 21:55     ` Rik van Riel
  1 sibling, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2016-06-21 22:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Tue, 2016-06-21 at 23:49 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 12:06:07PM -0400, riel@redhat.com wrote:
> > 
> > @@ -53,36 +56,72 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
> >   * softirq -> hardirq, hardirq -> softirq
> >   *
> >   * When exiting hardirq or softirq time, account the elapsed time.
> > + *
> > + * When exiting softirq time, subtract the amount of hardirq time
> > that
> > + * interrupted this softirq run, to avoid double accounting of
> > that time.
> >   */
> >  void irqtime_account_irq(struct task_struct *curr, int irqtype)
> >  {
> > +	u64 prev_softirq_start;
> > +	u64 prev_hardirq;
> > +	u64 hardirq_time;
> > +	s64 delta = 0;
> We appear to always assign to delta, so this initialization seems
> superfluous.
> 
> > 
> >  	int cpu;
> >  
> >  	if (!sched_clock_irqtime)
> >  		return;
> >  
> >  	cpu = smp_processor_id();
> Per this smp_processor_id() usage, preemption is disabled.

This code is called from the timer code. Surely preemption
is already disabled?

Should I change this into raw_smp_processor_id()?

> > 
> > +	/*
> > +	 * Softirq context may get interrupted by hardirq context,
> > +	 * on the same CPU. At softirq entry time the amount of
> > time
> > +	 * spent in hardirq context is stored. At softirq exit
> > time,
> > +	 * the time spent in hardirq context during the softirq is
> > +	 * subtracted.
> > +	 */
> > +	prev_hardirq = __this_cpu_read(prev_hardirq_time);
> > +	prev_softirq_start = __this_cpu_read(softirq_start_time);
> > +
> > +	if (irqtype == HARDIRQ_OFFSET) {
> > +		delta = sched_clock_cpu(cpu) -
> > __this_cpu_read(hardirq_start_time);
> > +		__this_cpu_add(hardirq_start_time, delta);
> > +	} else do {
> > +		u64 now = sched_clock_cpu(cpu);
> > +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time,
> > cpu));
> Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong
> with
> __this_cpu_read() ?

Is __this_cpu_read() as fast as per_cpu(,cpu) on all
architectures?

> > 
> > +
> > +		delta = now - prev_softirq_start;
> > +		if (in_serving_softirq()) {
> > +			/*
> > +			 * Leaving softirq context. Avoid double
> > counting by
> > +			 * subtracting hardirq time from this
> > interval.
> > +			 */
> > +			s64 hi_delta = hardirq_time -
> > prev_hardirq;
> > +			delta -= hi_delta;
> > +		} else {
> > +			/* Entering softirq context. Note start
> > times. */
> > +			__this_cpu_write(softirq_start_time, now);
> > +			__this_cpu_write(prev_hardirq_time,
> > hardirq_time);
> > +		}
> > +		/*
> > +		 * If a hardirq happened during this calculation,
> > it may not
> > +		 * have gotten a consistent snapshot. Try again.
> > +		 */
> > +	} while (hardirq_time !=
> > READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
> That whole thing is somewhat hard to read; but its far too late for
> me
> to suggest anything more readable :/

I only had 2 1/2 hours of sleep last night, so I will not
try to rewrite it now, but I will see if there is anything
I can do to make it more readable tomorrow.

If you have any ideas before then, please let me know :)

> > 
> > +	irq_time_write_begin(irqtype);
> >  	/*
> >  	 * We do not account for softirq time from ksoftirqd here.
> >  	 * We want to continue accounting softirq time to
> > ksoftirqd thread
> >  	 * in that case, so as not to confuse scheduler with a
> > special task
> >  	 * that do not consume any time, but still wants to run.
> >  	 */
> > +	if (irqtype == HARDIRQ_OFFSET && hardirq_count())
> >  		__this_cpu_add(cpu_hardirq_time, delta);
> > +	else if (irqtype == SOFTIRQ_OFFSET && in_serving_softirq()
> > &&
> > +				curr != this_cpu_ksoftirqd())
> >  		__this_cpu_add(cpu_softirq_time, delta);
> >  
> > +	irq_time_write_end(irqtype);
> Maybe split the whole thing on irqtype at the very start, instead of
> the
> endless repeated branches?

Let me try if I can make things more readable that way.

Thanks for the review!

Rik
-- 
All Rights Reversed.



* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-21 22:23     ` Rik van Riel
@ 2016-06-21 22:28       ` Peter Zijlstra
  2016-06-21 22:32         ` Rik van Riel
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2016-06-21 22:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Tue, Jun 21, 2016 at 06:23:34PM -0400, Rik van Riel wrote:
> > >  	cpu = smp_processor_id();
> > Per this smp_processor_id() usage, preemption is disabled.
> 
> This code is called from the timer code. Surely preemption
> is already disabled?

That's what I said.

> > > 
> > > +	/*
> > > +	 * Softirq context may get interrupted by hardirq context,
> > > +	 * on the same CPU. At softirq entry time the amount of
> > > time
> > > +	 * spent in hardirq context is stored. At softirq exit
> > > time,
> > > +	 * the time spent in hardirq context during the softirq is
> > > +	 * subtracted.
> > > +	 */
> > > +	prev_hardirq = __this_cpu_read(prev_hardirq_time);
> > > +	prev_softirq_start = __this_cpu_read(softirq_start_time);
> > > +
> > > +	if (irqtype == HARDIRQ_OFFSET) {
> > > +		delta = sched_clock_cpu(cpu) -
> > > __this_cpu_read(hardirq_start_time);
> > > +		__this_cpu_add(hardirq_start_time, delta);
> > > +	} else do {
> > > +		u64 now = sched_clock_cpu(cpu);
> > > +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time,
> > > cpu));
> > Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong
> > with
> > __this_cpu_read() ?
> 
> Is __this_cpu_read() as fast as per_cpu(,cpu) on all
> architectures?

Can't be slower. Don't get the argument though; you've used __this_cpu
stuff all over the place, and here you use a per_cpu() for no reason.

> > > 
> > > +
> > > +		delta = now - prev_softirq_start;
> > > +		if (in_serving_softirq()) {
> > > +			/*
> > > +			 * Leaving softirq context. Avoid double
> > > counting by
> > > +			 * subtracting hardirq time from this
> > > interval.
> > > +			 */
> > > +			s64 hi_delta = hardirq_time -
> > > prev_hardirq;
> > > +			delta -= hi_delta;
> > > +		} else {
> > > +			/* Entering softirq context. Note start
> > > times. */
> > > +			__this_cpu_write(softirq_start_time, now);
> > > +			__this_cpu_write(prev_hardirq_time,
> > > hardirq_time);
> > > +		}
> > > +		/*
> > > +		 * If a hardirq happened during this calculation,
> > > it may not
> > > +		 * have gotten a consistent snapshot. Try again.
> > > +		 */
> > > +	} while (hardirq_time !=
> > > READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
> > That whole thing is somewhat hard to read; but its far too late for
> > me
> > to suggest anything more readable :/
> 
> I only had 2 1/2 hours of sleep last night, so I will not
> try to rewrite it now, but I will see if there is anything
> I can do to make it more readable tomorrow.
> 
> If you have any ideas before then, please let me know :)

Heh, step away from the computer ... ;-)

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-21 22:28       ` Peter Zijlstra
@ 2016-06-21 22:32         ` Rik van Riel
  0 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2016-06-21 22:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Wed, 2016-06-22 at 00:28 +0200, Peter Zijlstra wrote:
> On Tue, Jun 21, 2016 at 06:23:34PM -0400, Rik van Riel wrote:
> > 
> > > > +	/*
> > > > +	 * Softirq context may get interrupted by hardirq
> > > > context,
> > > > +	 * on the same CPU. At softirq entry time the amount
> > > > of
> > > > time
> > > > +	 * spent in hardirq context is stored. At softirq exit
> > > > time,
> > > > +	 * the time spent in hardirq context during the
> > > > softirq is
> > > > +	 * subtracted.
> > > > +	 */
> > > > +	prev_hardirq = __this_cpu_read(prev_hardirq_time);
> > > > +	prev_softirq_start =
> > > > __this_cpu_read(softirq_start_time);
> > > > +
> > > > +	if (irqtype == HARDIRQ_OFFSET) {
> > > > +		delta = sched_clock_cpu(cpu) -
> > > > __this_cpu_read(hardirq_start_time);
> > > > +		__this_cpu_add(hardirq_start_time, delta);
> > > > +	} else do {
> > > > +		u64 now = sched_clock_cpu(cpu);
> > > > +		hardirq_time =
> > > > READ_ONCE(per_cpu(cpu_hardirq_time,
> > > > cpu));
> > > Which makes this per_cpu(,cpu) usage somewhat curious. What's
> > > wrong
> > > with
> > > __this_cpu_read() ?
> > Is __this_cpu_read() as fast as per_cpu(,cpu) on all
> > architectures?
> Can't be slower. Don't get the argument though; you've used
> __this_cpu
> stuff all over the place, and here you use a per_cpu() for no reason.
> 
Good point. I will use __this_cpu_read here.

> > > That whole thing is somewhat hard to read; but its far too late
> > > for
> > > me
> > > to suggest anything more readable :/
> > I only had 2 1/2 hours of sleep last night, so I will not
> > try to rewrite it now, but I will see if there is anything
> > I can do to make it more readable tomorrow.
> > 
> > If you have any ideas before then, please let me know :)
> Heh, step away from the computer ... ;-)

No worries, I have booze with me. Everything will be
just fine! ;)

-- 
All Rights Reversed.



* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-21 21:21   ` Peter Zijlstra
  2016-06-21 22:20     ` Rik van Riel
@ 2016-06-22 10:40     ` Paolo Bonzini
  2016-06-22 10:52       ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2016-06-22 10:40 UTC (permalink / raw)
  To: Peter Zijlstra, riel
  Cc: linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar



On 21/06/2016 23:21, Peter Zijlstra wrote:
> cputime_to_jiffies is a division, could we not avoid that by doing
> something like:
> 
> 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> 	while (irq_jiffies > cputime_one_jiffy) {
> 		irq_jiffies -= cputime_one_jiffy;
> 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> 	}
> 
> assuming that the loop is 'rare' etc.. If not, only do the division on
> that same > cputime_one_jiffy condition.

It's a division by a constant, so it ought to become a multiplication.
For 64-bit it will, and context tracking is only enabled for 64-bit.
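
For illustration, the classic multiply-and-shift form of a division
by a constant (a userspace sketch, not kernel code; the magic pair
below is the standard one for unsigned 32-bit division by ten):

#include <assert.h>
#include <stdint.h>

/*
 * x / 10 as a multiply and shift: 0xCCCCCCCD = ceil(2^35 / 10),
 * and the product fits in 64 bits for any 32-bit x.
 */
static uint32_t div10(uint32_t x)
{
	return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

int main(void)
{
	for (uint64_t x = 0; x <= UINT32_MAX; x += 65537)
		assert(div10((uint32_t)x) == (uint32_t)x / 10);
	return 0;
}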

BTW, for 32-bit there's a monster of a macro to turn do_div with
constant divisor into multiplications in include/asm-generic/div64.h.
However, x86-32 doesn't use it.

Paolo

* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-22 10:40     ` Paolo Bonzini
@ 2016-06-22 10:52       ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2016-06-22 10:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: riel, linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar

On Wed, Jun 22, 2016 at 12:40:31PM +0200, Paolo Bonzini wrote:
> 
> 
> On 21/06/2016 23:21, Peter Zijlstra wrote:
> > cputime_to_jiffies is a division, could we not avoid that by doing
> > something like:
> > 
> > 	irq_jiffies = min(irq, jiffies_to_cputime(max_jiffies));
> > 	while (irq_jiffies > cputime_one_jiffy) {
> > 		irq_jiffies -= cputime_one_jiffy;
> > 		cpustat[CPUTIME_IRQ] += cputime_one_jiffy;
> > 	}
> > 
> > assuming that the loop is 'rare' etc.. If not, only do the division on
> > that same > cputime_one_jiffy condition.
> 
> It's a division by a constant, it ought to become a multiplication.  For
> 64-bit it will, and context tracking is only enabled for 64-bit.

Right, not enabled on i386, however plenty 32bit archs (including ARM)
do have it enabled.

> BTW, for 32-bit there's a monster of a macro to turn do_div with
> constant divisor into multiplications in include/asm-generic/div64.h.
> However, x86-32 doesn't use it.

Right, but some ARM chips don't exactly have a fast multiplier either,
so avoiding even the 64bit mult (which ends up being 3 or so actual
mult instructions) is worthwhile I'd say.

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-21 21:49   ` Peter Zijlstra
  2016-06-21 22:23     ` Rik van Riel
@ 2016-06-22 21:55     ` Rik van Riel
  2016-06-23 13:52       ` Paolo Bonzini
  1 sibling, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2016-06-22 21:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, pbonzini, fweisbec, wanpeng.li, efault,
	tglx, rkrcmar

On Tue, 2016-06-21 at 23:49 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 12:06:07PM -0400, riel@redhat.com wrote:
> > 
> > @@ -53,36 +56,72 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
> >   * softirq -> hardirq, hardirq -> softirq
> >   *
> >   * When exiting hardirq or softirq time, account the elapsed time.
> > + *
> > + * When exiting softirq time, subtract the amount of hardirq time
> > that
> > + * interrupted this softirq run, to avoid double accounting of
> > that time.
> >   */
> >  void irqtime_account_irq(struct task_struct *curr, int irqtype)
> >  {
> > +	u64 prev_softirq_start;
> > +	u64 prev_hardirq;
> > +	u64 hardirq_time;
> > +	s64 delta = 0;
> We appear to always assign to delta, so this initialization seems
> superfluous.

It gets rid of a compiler warning, since gcc is not
smart enough to know that the result of in_softirq()
will be the same throughout the function.

Using a bool leaving_softirq = in_softirq() also
gets rid of the warning, and makes the function a
little more readable, so I am doing that.

> > +	if (irqtype == HARDIRQ_OFFSET) {
> > +		delta = sched_clock_cpu(cpu) -
> > __this_cpu_read(hardirq_start_time);
> > +		__this_cpu_add(hardirq_start_time, delta);
> > +	} else do {
> > +		u64 now = sched_clock_cpu(cpu);
> > +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time,
> > cpu));
> Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong
> with
> __this_cpu_read() ?

I played around with it a bit, and it seems that
__this_cpu_read does not want to nest inside
READ_ONCE.  Nobody else seems to be doing that,
either.

Back to READ_ONCE(per_cpu(,cpu)) it is...

> Maybe split the whole thing on irqtype at the very start, instead of
> the
> endless repeated branches?

I untangled the whole thing in the next version,
which I will post after testing.
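
One possible shape for that split (just a sketch; the helper names
irqtime_account_hardirq/_softirq are made up here, not the posted
v2 code):

void irqtime_account_irq(struct task_struct *curr, int irqtype)
{
	if (!sched_clock_irqtime)
		return;

	/* Branch once on the irq type, instead of throughout. */
	if (irqtype == HARDIRQ_OFFSET)
		irqtime_account_hardirq(curr);
	else
		irqtime_account_softirq(curr);
}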

-- 
All rights reversed


* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-22 21:55     ` Rik van Riel
@ 2016-06-23 13:52       ` Paolo Bonzini
  2016-06-23 15:24         ` Rik van Riel
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2016-06-23 13:52 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra
  Cc: linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar



On 22/06/2016 23:55, Rik van Riel wrote:
> > > +		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time,
> > > cpu));
> > Which makes this per_cpu(,cpu) usage somewhat curious. What's wrong
> > with __this_cpu_read() ?
>
> I played around with it a bit, and it seems that
> __this_cpu_read does not want to nest inside
> READ_ONCE.  Nobody else seems to be doing that,
> either.

According to arch/x86/include/asm/percpu.h, this_cpu_read always has 
READ_ONCE semantics, but I cannot find that in include/asm-generic
/percpu.h.  It probably just works because of all the layers of goo, but
something like this (101% untested) would make me feel safer:

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 4d9f233c4ba8..d057568f1926 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -67,7 +67,7 @@ extern void setup_per_cpu_areas(void);
 
 #define raw_cpu_generic_to_op(pcp, val, op)				\
 do {									\
-	*raw_cpu_ptr(&(pcp)) op val;					\
+	ACCESS_ONCE(*raw_cpu_ptr(&(pcp))) op val;			\
 } while (0)
 
 #define raw_cpu_generic_add_return(pcp, val)				\
@@ -109,7 +109,7 @@ do {
 ({									\
 	typeof(pcp) __ret;						\
 	preempt_disable();						\
-	__ret = *this_cpu_ptr(&(pcp));					\
+	__ret = READ_ONCE(this_cpu_ptr(&(pcp)));			\
 	preempt_enable();						\
 	__ret;								\
 })
@@ -118,7 +118,7 @@ do {
 do {									\
 	unsigned long __flags;						\
 	raw_local_irq_save(__flags);					\
-	*raw_cpu_ptr(&(pcp)) op val;					\
+	ACCESS_ONCE(*raw_cpu_ptr(&(pcp))) op val;			\
 	raw_local_irq_restore(__flags);					\
 } while (0)
 
@@ -168,16 +168,16 @@ do {
 })
 
 #ifndef raw_cpu_read_1
-#define raw_cpu_read_1(pcp)		(*raw_cpu_ptr(&(pcp)))
+#define raw_cpu_read_1(pcp)		READ_ONCE(raw_cpu_ptr(&(pcp)))
 #endif
 #ifndef raw_cpu_read_2
-#define raw_cpu_read_2(pcp)		(*raw_cpu_ptr(&(pcp)))
+#define raw_cpu_read_2(pcp)		READ_ONCE(raw_cpu_ptr(&(pcp)))
 #endif
 #ifndef raw_cpu_read_4
-#define raw_cpu_read_4(pcp)		(*raw_cpu_ptr(&(pcp)))
+#define raw_cpu_read_4(pcp)		READ_ONCE(raw_cpu_ptr(&(pcp)))
 #endif
 #ifndef raw_cpu_read_8
-#define raw_cpu_read_8(pcp)		(*raw_cpu_ptr(&(pcp)))
+#define raw_cpu_read_8(pcp)		READ_ONCE(raw_cpu_ptr(&(pcp)))
 #endif
 
 #ifndef raw_cpu_write_1


> Back to READ_ONCE(per_cpu(,cpu)) it is...

What about READ_ONCE(this_cpu_ptr())?

Paolo

* Re: [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-23 13:52       ` Paolo Bonzini
@ 2016-06-23 15:24         ` Rik van Riel
  0 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2016-06-23 15:24 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Zijlstra
  Cc: linux-kernel, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar

On Thu, 2016-06-23 at 15:52 +0200, Paolo Bonzini wrote:
> 
> On 22/06/2016 23:55, Rik van Riel wrote:
> > 
> > > 
> > > > 
> > > > +		hardirq_time =
> > > > READ_ONCE(per_cpu(cpu_hardirq_time,
> > > > cpu));
> > > Which makes this per_cpu(,cpu) usage somewhat curious. What's
> > > wrong
> > > with __this_cpu_read() ?
> > I played around with it a bit, and it seems that
> > __this_cpu_read does not want to nest inside
> > READ_ONCE.  Nobody else seems to be doing that,
> > either.
> According to arch/x86/include/asm/percpu.h, this_cpu_read always has 
> READ_ONCE semantics, but I cannot find that in include/asm-generic
> /percpu.h.  It probably just works because of all the layers of goo,
> but
> something like this (101% untested) would make me feel safer:
> 
Are READ_ONCE semantics desired for every
per-cpu read?

> > Back to READ_ONCE(per_cpu(,cpu)) it is...
> What about READ_ONCE(this_cpu_ptr())?

I tried that yesterday, because it looked like
it would work. Unfortunately, it does not.

  CC      kernel/sched/cputime.o
kernel/sched/cputime.c: In function ‘irqtime_account_irq’:
kernel/sched/cputime.c:107:71: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:298: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c:107:428: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:655: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c:107:749: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:976: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c:107:1087: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:1314: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c:107:1408: warning: initialization makes pointer
from integer without a cast [-Wint-conversion]
kernel/sched/cputime.c:107:1635: error: invalid type argument of unary
‘*’ (have ‘u64 {aka long long unsigned int}’)
kernel/sched/cputime.c: In function ‘irqtime_account_hi_update’:
kernel/sched/cputime.c:145:175: warning: comparison of distinct pointer
types lacks a cast
kernel/sched/cputime.c: In function ‘irqtime_account_si_update’:
kernel/sched/cputime.c:162:182: warning: comparison of distinct pointer
types lacks a cast
scripts/Makefile.build:291: recipe for target 'kernel/sched/cputime.o'
failed
make[1]: *** [kernel/sched/cputime.o] Error 1
Makefile:1594: recipe for target 'kernel/sched/' failed
make: *** [kernel/sched/] Error 2
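
For reference, a minimal sketch (assuming the v4.7-era READ_ONCE() and
percpu macros) of spellings that do expand cleanly; READ_ONCE() needs
the u64 lvalue itself, not the per-cpu pointer that this_cpu_ptr()
returns:

	/* ok: per_cpu() evaluates to the u64 lvalue */
	u64 a = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));

	/* also ok: dereference the per-cpu pointer inside READ_ONCE() */
	u64 b = READ_ONCE(*this_cpu_ptr(&cpu_hardirq_time));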

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-23  2:25 ` riel
  0 siblings, 0 replies; 20+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter and documentation to irqtime_account_irq.
This can be used to distinguish between transitioning from process
context to hardirq time, and from process context to softirq time.

This is necessary to be able to remove the local_irq_save/restore from
irqtime_account_irq.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84ebdf..31e486d741b0 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0280528800f2..fc4122afc022 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread
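
The hunks above only change the signatures; for how the callee can use
the extra argument, here is a hedged sketch (the helper name and body
are assumed for illustration, not taken from the series).  Because the
hooks run before preempt_count is modified in either direction, the
count for the section being exited is still non-zero at exit time:

	/*
	 * Sketch only: true when this call is the *exit* side of an
	 * irqtype section.  irqtime_account_irq() runs before
	 * preempt_count_add() on enter and before preempt_count_sub()
	 * on exit, so on exit the corresponding count is still set
	 * (assuming hardirqs do not nest).
	 */
	static bool leaving_irq_section(int irqtype)
	{
		if (irqtype == HARDIRQ_OFFSET)
			return hardirq_count() != 0;
		return in_serving_softirq();	/* SOFTIRQ_OFFSET */
	}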

* [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-08  2:29 [PATCH RFC 0/5] sched,time: make irq time accounting work for nohz_idle riel
@ 2016-06-08  2:30 ` riel
  0 siblings, 0 replies; 20+ messages in thread
From: riel @ 2016-06-08  2:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: kernellwp, mingo, peterz, tglx, fweisbec

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter and documentation to irqtime_account_irq.
This can be used to distinguish between transitioning from process
context to hardirq time, and from process context to softirq time.

This is necessary to be able to remove the local_irq_save/restore from
irqtime_account_irq.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3b384bf5ce1a..58f036f3ebea 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk)
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2f862dfdb520..e009077aeab6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-06-23 15:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
2016-06-16 16:06 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
2016-06-16 16:22   ` kbuild test robot
2016-06-21 21:21   ` Peter Zijlstra
2016-06-21 22:20     ` Rik van Riel
2016-06-22 10:40     ` Paolo Bonzini
2016-06-22 10:52       ` Peter Zijlstra
2016-06-16 16:06 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
2016-06-16 16:06 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
2016-06-16 16:06 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
2016-06-16 16:06 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
2016-06-21 21:49   ` Peter Zijlstra
2016-06-21 22:23     ` Rik van Riel
2016-06-21 22:28       ` Peter Zijlstra
2016-06-21 22:32         ` Rik van Riel
2016-06-22 21:55     ` Rik van Riel
2016-06-23 13:52       ` Paolo Bonzini
2016-06-23 15:24         ` Rik van Riel
  -- strict thread matches above, loose matches on Subject: below --
2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
2016-06-23  2:25 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
2016-06-08  2:29 [PATCH RFC 0/5] sched,time: make irq time accounting work for nohz_idle riel
2016-06-08  2:30 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
