* [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle
@ 2016-06-23  2:25 riel
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

Currently irq time accounting only works in these cases:
1) purely tick based accounting
2) nohz_full accounting, but only on housekeeping & nohz_full CPUs
3) architectures with native vtime accounting

On nohz_idle CPUs, which are probably the majority nowadays,
irq time accounting is currently broken. This leads to systems
reporting a dramatically lower amount of irq & softirq time than
is actually spent handling them, with all the time spent while the
system is in the idle task being accounted as idle.

This patch set seems to bring the amount of irq time reported by
top (and /proc/stat) roughly in line with that measured when I do
a "perf record -g -a" run to see what is using all that time.

The amount of irq time used, especially softirq, is shockingly high,
to the point of me thinking this patch set may be wrong, but the
numbers seem to match what perf is giving me...

These patches apply on top of Wanpeng Li's steal time patches.

CONFIG_IRQ_TIME_ACCOUNTING is now a config option that is available
as a separate choice from tick based / nohz_idle / nohz_full mode,
as suggested by Frederic Weisbecker.

Next up: look at the things that are using CPU time on an otherwise
idle system, and see if I can make those a little faster :)

v2: address Peterz's concerns, some more cleanups


* [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-23  2:25 ` riel
  2016-06-27 12:25   ` Frederic Weisbecker
  2016-06-23  2:25 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of 'ticks' jiffies was
actually spent handling irqs, and could conceivably mis-count
all of the ticks during a period as irq time, when there was
some irq and some softirq time.

This can actually happen when irqtime_account_process_tick
is called from account_idle_ticks, which can pass a larger
number of ticks down all at once.
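
For illustration, the call path in question looks roughly like this
(a simplified sketch, not the exact kernel code; the real functions
take additional arguments):

	/*
	 * Sketch: on a NO_HZ_IDLE CPU waking up after a long idle
	 * period, all missed ticks are handed to the accounting code
	 * in a single call.
	 */
	static void account_idle_ticks_sketch(unsigned long ticks)
	{
		/*
		 * Before this patch, one "is there pending irq (or
		 * softirq) time?" test decided how ALL 'ticks' jiffies
		 * were accounted, so a single pending irq could claim
		 * the whole batch as irq time.
		 */
		irqtime_account_process_tick(current, 0, ticks);
	}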

Fix this by changing irqtime_account_hi_update and
irqtime_account_si_update to round elapsed irq and softirq
time to jiffies, and return the number of jiffies spent in
each mode, similar to how steal time is handled.

Additionally, have irqtime_account_process_tick take into
account how much time was spent in each of steal, irq,
and softirq time.

The latter could help improve the accuracy of timekeeping
when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.
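
In sketch form, the resulting order of accounting looks like this
(simplified; the exact code is account_other_ticks() in the diff
below):

	/* Never account more than the 'max' jiffies that elapsed. */
	static unsigned long account_other_ticks_sketch(unsigned long max)
	{
		unsigned long accounted;

		accounted = steal_account_process_tick(max);
		if (accounted < max)
			accounted += irqtime_account_hi_update(max - accounted);
		if (accounted < max)
			accounted += irqtime_account_si_update(max - accounted);

		/* whatever is left over goes to idle, user, or system */
		return accounted;
	}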

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 60 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d76fdb..15b813c014be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long irq_jiffies = 0;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 irq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
+	if (irq > cputime_one_jiffy) {
+		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
+		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
+	}
 	local_irq_restore(flags);
-	return ret;
+	return irq_jiffies;
 }
 
-static int irqtime_account_si_update(void)
+static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long si_jiffies = 0;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	u64 softirq;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
+	if (softirq > cputime_one_jiffy) {
+		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
+		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
+	}
 	local_irq_restore(flags);
-	return ret;
+	return si_jiffies;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static unsigned long irqtime_account_hi_update(unsigned long dummy)
+{
+	return 0;
+}
+
+static unsigned long irqtime_account_si_update(unsigned long dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -283,6 +297,26 @@ static __always_inline unsigned long steal_account_process_tick(unsigned long ma
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ * Due to rounding errors, the calculated amount can sometimes exceed
+ * max_jiffies; be careful not to account more than max_jiffies.
+ */
+static inline int account_other_ticks(unsigned long max_jiffies)
+{
+	unsigned long accounted;
+
+	accounted = steal_account_process_tick(max_jiffies);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_hi_update(max_jiffies - accounted);
+
+	if (accounted < max_jiffies)
+		accounted += irqtime_account_si_update(max_jiffies - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 {
 	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	unsigned long other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_ticks(ticks);
+	if (other >= ticks)
 		return;
+	ticks -= other;
 
 	cputime *= ticks;
 	scaled *= ticks;
 
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-- 
2.5.5


* [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-23  2:25 ` riel
  2016-06-27 23:21   ` Frederic Weisbecker
  2016-06-23  2:25 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
appear to currently work right.

On CPUs that are nohz_full, people typically do not assign IRQs.
On the housekeeping CPU (when a system is booted up with nohz_full),
sampling should work ok to determine irq and softirq time use, but
that only covers the housekeeping CPU itself, not the other
non-nohz_full CPUs.

On CPUs that are nohz_idle (the typical way a distro kernel is
booted), irq time is not accounted at all while the CPU is idle,
due to the lack of timer ticks.

Remove the VTIME_GEN vtime irq time code. The next patch will
allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vtime.h  | 32 ++++++++++++++------------------
 kernel/sched/cputime.c | 16 +++-------------
 2 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index fa2196990f84..d1977d84ebdf 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,6 +14,18 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+
+#ifdef __ARCH_HAS_VTIME_ACCOUNT
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+#else
+extern void vtime_common_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_enter(struct task_struct *tsk)
+{
+	if (vtime_accounting_cpu_enabled())
+		vtime_common_account_irq_enter(tsk);
+}
+#endif /* __ARCH_HAS_VTIME_ACCOUNT */
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -64,17 +76,6 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline void vtime_task_switch(struct task_struct *prev) { }
@@ -85,13 +86,8 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_gen_account_irq_exit(struct task_struct *tsk);
-
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_gen_account_irq_exit(tsk);
-}
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
 
 extern void vtime_user_enter(struct task_struct *tsk);
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 15b813c014be..0280528800f2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -720,14 +720,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	unsigned long delta_jiffies, other_jiffies;
 
 	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	other_jiffies = account_other_ticks(delta_jiffies);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return jiffies_to_cputime(delta_jiffies - other_jiffies);
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -747,16 +747,6 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_gen_account_irq_exit(struct task_struct *tsk)
-{
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		__vtime_account_system(tsk);
-	if (context_tracking_in_user())
-		tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
-}
-
 void vtime_account_user(struct task_struct *tsk)
 {
 	cputime_t delta_cpu;
-- 
2.5.5


* [PATCH 3/5] cputime: allow irq time accounting to be selected as an option
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
  2016-06-23  2:25 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
@ 2016-06-23  2:25 ` riel
  2016-06-26  3:03   ` kbuild test robot
  2016-06-23  2:25 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
  2016-06-23  2:25 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  4 siblings, 1 reply; 15+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Allow CONFIG_IRQ_TIME_ACCOUNTING to be selected as an option, on top
of CONFIG_VIRT_CPU_ACCOUNTING_GEN (and potentially others?).

This allows for the irq time accounting code to be used with nohz_idle
CPUs, which is how several distributions ship their kernels. Using the
same code for several timer modes also allows us to drop duplicate code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 init/Kconfig | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 0dfd09d54c65..4c7ee4f136cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -375,9 +375,11 @@ config VIRT_CPU_ACCOUNTING_GEN
 
 	  If unsure, say N.
 
+endchoice
+
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
-	depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
+	depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
 	help
 	  Select this option to enable fine granularity task irq time
 	  accounting. This is done by reading a timestamp on each
@@ -386,8 +388,6 @@ config IRQ_TIME_ACCOUNTING
 
 	  If in doubt, say N here.
 
-endchoice
-
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
 	depends on MULTIUSER
-- 
2.5.5


* [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
                   ` (2 preceding siblings ...)
  2016-06-23  2:25 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
@ 2016-06-23  2:25 ` riel
  2016-06-23  2:25 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  4 siblings, 0 replies; 15+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter and documentation to irqtime_account_irq.
This can be used to distinguish between transitioning from process
context to hardirq time, and from process context to softirq time.

This is necessary to be able to remove the local_irq_save/restore
from irqtime_account_irq.
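
In sketch form, each call site then states which context it is
entering or leaving (the full hunks are in the diff below):

	/* hardirq entry and exit: __irq_enter() / __irq_exit() / irq_exit() */
	account_irq_enter_time(current, HARDIRQ_OFFSET);
	account_irq_exit_time(current, HARDIRQ_OFFSET);

	/* softirq entry and exit: __do_softirq() */
	account_irq_enter_time(current, SOFTIRQ_OFFSET);
	account_irq_exit_time(current, SOFTIRQ_OFFSET);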

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84ebdf..31e486d741b0 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0280528800f2..fc4122afc022 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.5.5


* [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
                   ` (3 preceding siblings ...)
  2016-06-23  2:25 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
@ 2016-06-23  2:25 ` riel
  4 siblings, 0 replies; 15+ messages in thread
From: riel @ 2016-06-23  2:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Drop local_irq_save/restore from irqtime_account_irq.
Instead, have softirq and hardirq track their time spent
independently, with the softirq code subtracting hardirq
time that happened during the duration of the softirq run.

The softirq code can be interrupted by hardirq code at
any point in time, but it can check whether it got a
consistent snapshot of the timekeeping variables it wants,
and loop around in the unlikely case that it did not.
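
The core of that retry, in sketch form (simplified from the diff
below):

	u64 hi;

	do {
		/* snapshot the hardirq time this CPU has accumulated */
		hi = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));

		/*
		 * ... compute the softirq delta, subtracting the
		 * hardirq time that ran inside this softirq ...
		 */
	} while (hi != READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));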

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 72 +++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h   | 38 +++++++++++++++++++++-----
 2 files changed, 90 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index fc4122afc022..e4c7b2d17141 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -26,7 +26,9 @@
 DEFINE_PER_CPU(u64, cpu_hardirq_time);
 DEFINE_PER_CPU(u64, cpu_softirq_time);
 
-static DEFINE_PER_CPU(u64, irq_start_time);
+static DEFINE_PER_CPU(u64, hardirq_start_time);
+static DEFINE_PER_CPU(u64, softirq_start_time);
+static DEFINE_PER_CPU(u64, prev_hardirq_time);
 static int sched_clock_irqtime;
 
 void enable_sched_clock_irqtime(void)
@@ -41,6 +43,7 @@ void disable_sched_clock_irqtime(void)
 
 #ifndef CONFIG_64BIT
 DEFINE_PER_CPU(seqcount_t, irq_time_seq);
+DEFINE_PER_CPU(seqcount_t, softirq_time_seq);
 #endif /* CONFIG_64BIT */
 
 /*
@@ -53,36 +56,79 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  * softirq -> hardirq, hardirq -> softirq
  *
  * When exiting hardirq or softirq time, account the elapsed time.
+ *
+ * When exiting softirq time, subtract the amount of hardirq time that
+ * interrupted this softirq run, to avoid double accounting of that time.
  */
 void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
-	unsigned long flags;
+	u64 prev_softirq_start;
+	bool leaving_softirq;
+	u64 prev_hardirq;
+	u64 hardirq_time;
 	s64 delta;
 	int cpu;
 
 	if (!sched_clock_irqtime)
 		return;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
-	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
-	__this_cpu_add(irq_start_time, delta);
 
-	irq_time_write_begin();
+	/*
+	 * Hardirq time accounting is pretty straightforward. If not in
+	 * hardirq context yet (entering hardirq), set the start time.
+	 * If already in hardirq context (leaving), account the elapsed time.
+	 */
+	if (irqtype == HARDIRQ_OFFSET) {
+		bool leaving_hardirq = hardirq_count();
+		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
+		__this_cpu_add(hardirq_start_time, delta);
+		if (leaving_hardirq) {
+			hardirq_time_write_begin();
+			__this_cpu_add(cpu_hardirq_time, delta);
+			hardirq_time_write_end();
+		}
+		return;
+	}
+
+	/*
+	 * Softirq context may get interrupted by hardirq context, on the
+	 * same CPU. At softirq entry time the amount of time this CPU spent
+	 * in hardirq context is stored. At softirq exit time, the time spent
+	 * in hardirq context during the softirq is subtracted.
+	 */
+	prev_softirq_start = __this_cpu_read(softirq_start_time);
+	prev_hardirq = __this_cpu_read(prev_hardirq_time);
+	leaving_softirq = in_serving_softirq();
+
+	do {
+		u64 now = sched_clock_cpu(cpu);
+
+		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));
+		__this_cpu_write(softirq_start_time, now);
+		__this_cpu_write(prev_hardirq_time, hardirq_time);
+
+		if (leaving_softirq) {
+			/*
+			 * Subtract hardirq time that happened during this
+			 * softirq.
+			 */
+			s64 hi_delta = hardirq_time - prev_hardirq;
+			delta = now - prev_softirq_start - hi_delta;
+		}
+		/* Loop around if interrupted by a hardirq. */
+	} while (hardirq_time != READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
+
 	/*
 	 * We do not account for softirq time from ksoftirqd here.
 	 * We want to continue accounting softirq time to ksoftirqd thread
 	 * in that case, so as not to confuse scheduler with a special task
 	 * that do not consume any time, but still wants to run.
 	 */
-	if (hardirq_count())
-		__this_cpu_add(cpu_hardirq_time, delta);
-	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+	softirq_time_write_begin();
+	if (leaving_softirq && curr != this_cpu_ksoftirqd())
 		__this_cpu_add(cpu_softirq_time, delta);
-
-	irq_time_write_end();
-	local_irq_restore(flags);
+	softirq_time_write_end();
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d23527e..cad4df9835f7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1752,38 +1752,62 @@ DECLARE_PER_CPU(u64, cpu_softirq_time);
 
 #ifndef CONFIG_64BIT
 DECLARE_PER_CPU(seqcount_t, irq_time_seq);
+DECLARE_PER_CPU(seqcount_t, softirq_time_seq);
 
-static inline void irq_time_write_begin(void)
+static inline void hardirq_time_write_begin(void)
 {
 	__this_cpu_inc(irq_time_seq.sequence);
 	smp_wmb();
 }
 
-static inline void irq_time_write_end(void)
+static inline void hardirq_time_write_end(void)
 {
 	smp_wmb();
 	__this_cpu_inc(irq_time_seq.sequence);
 }
 
+static inline void softirq_time_write_begin(void)
+{
+	__this_cpu_inc(softirq_time_seq.sequence);
+	smp_wmb();
+}
+
+static inline void softirq_time_write_end(void)
+{
+	smp_wmb();
+	__this_cpu_inc(softirq_time_seq.sequence);
+}
+
 static inline u64 irq_time_read(int cpu)
 {
 	u64 irq_time;
-	unsigned seq;
+	unsigned hi_seq;
+	unsigned si_seq;
 
 	do {
-		seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		hi_seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		si_seq = read_seqcount_begin(&per_cpu(softirq_time_seq, cpu));
 		irq_time = per_cpu(cpu_softirq_time, cpu) +
 			   per_cpu(cpu_hardirq_time, cpu);
-	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), hi_seq) ||
+		 read_seqcount_retry(&per_cpu(softirq_time_seq, cpu), si_seq));
 
 	return irq_time;
 }
 #else /* CONFIG_64BIT */
-static inline void irq_time_write_begin(void)
+static inline void hardirq_time_write_begin(void)
+{
+}
+
+static inline void hardirq_time_write_end(void)
+{
+}
+
+static inline void softirq_time_write_begin(void)
 {
 }
 
-static inline void irq_time_write_end(void)
+static inline void softirq_time_write_end(void)
 {
 }
 
-- 
2.5.5


* Re: [PATCH 3/5] cputime: allow irq time accounting to be selected as an option
  2016-06-23  2:25 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
@ 2016-06-26  3:03   ` kbuild test robot
  0 siblings, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2016-06-26  3:03 UTC (permalink / raw)
  To: riel
  Cc: kbuild-all, linux-kernel, peterz, mingo, pbonzini, fweisbec,
	wanpeng.li, efault, tglx, rkrcmar


Hi,

[auto build test WARNING on tip/sched/core]
[also build test WARNING on next-20160624]
[cannot apply to v4.7-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/sched-time-fix-irq-time-accounting-with-nohz_idle/20160623-102923
config: x86_64-randconfig-a0-06260951 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/sched.h:17:0,
                    from kernel/sched/cputime.c:2:
   kernel/sched/cputime.c: In function 'irqtime_account_hi_update':
   include/linux/kernel.h:742:17: warning: comparison of distinct pointer types lacks a cast
     (void) (&_min1 == &_min2);  \
                    ^
>> kernel/sched/cputime.c:92:17: note: in expansion of macro 'min'
      irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
                    ^~~
   kernel/sched/cputime.c: In function 'irqtime_account_si_update':
   include/linux/kernel.h:742:17: warning: comparison of distinct pointer types lacks a cast
     (void) (&_min1 == &_min2);  \
                    ^
   kernel/sched/cputime.c:109:16: note: in expansion of macro 'min'
      si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
                   ^~~

vim +/min +92 kernel/sched/cputime.c

73fbec604 Frederic Weisbecker 2012-06-16   1  #include <linux/export.h>
73fbec604 Frederic Weisbecker 2012-06-16  @2  #include <linux/sched.h>
73fbec604 Frederic Weisbecker 2012-06-16   3  #include <linux/tsacct_kern.h>
73fbec604 Frederic Weisbecker 2012-06-16   4  #include <linux/kernel_stat.h>
73fbec604 Frederic Weisbecker 2012-06-16   5  #include <linux/static_key.h>
abf917cd9 Frederic Weisbecker 2012-07-25   6  #include <linux/context_tracking.h>
73fbec604 Frederic Weisbecker 2012-06-16   7  #include "sched.h"
1fe7c4ef8 Stefano Stabellini  2015-11-10   8  #ifdef CONFIG_PARAVIRT
1fe7c4ef8 Stefano Stabellini  2015-11-10   9  #include <asm/paravirt.h>
1fe7c4ef8 Stefano Stabellini  2015-11-10  10  #endif
73fbec604 Frederic Weisbecker 2012-06-16  11  
73fbec604 Frederic Weisbecker 2012-06-16  12  
73fbec604 Frederic Weisbecker 2012-06-16  13  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
73fbec604 Frederic Weisbecker 2012-06-16  14  
73fbec604 Frederic Weisbecker 2012-06-16  15  /*
73fbec604 Frederic Weisbecker 2012-06-16  16   * There are no locks covering percpu hardirq/softirq time.
bf9fae9f5 Frederic Weisbecker 2012-09-08  17   * They are only modified in vtime_account, on corresponding CPU
73fbec604 Frederic Weisbecker 2012-06-16  18   * with interrupts disabled. So, writes are safe.
73fbec604 Frederic Weisbecker 2012-06-16  19   * They are read and saved off onto struct rq in update_rq_clock().
73fbec604 Frederic Weisbecker 2012-06-16  20   * This may result in other CPU reading this CPU's irq time and can
bf9fae9f5 Frederic Weisbecker 2012-09-08  21   * race with irq/vtime_account on this CPU. We would either get old
73fbec604 Frederic Weisbecker 2012-06-16  22   * or new value with a side effect of accounting a slice of irq time to wrong
73fbec604 Frederic Weisbecker 2012-06-16  23   * task when irq is in progress while we read rq->clock. That is a worthy
73fbec604 Frederic Weisbecker 2012-06-16  24   * compromise in place of having locks on each irq in account_system_time.
73fbec604 Frederic Weisbecker 2012-06-16  25   */
73fbec604 Frederic Weisbecker 2012-06-16  26  DEFINE_PER_CPU(u64, cpu_hardirq_time);
73fbec604 Frederic Weisbecker 2012-06-16  27  DEFINE_PER_CPU(u64, cpu_softirq_time);
73fbec604 Frederic Weisbecker 2012-06-16  28  
73fbec604 Frederic Weisbecker 2012-06-16  29  static DEFINE_PER_CPU(u64, irq_start_time);
73fbec604 Frederic Weisbecker 2012-06-16  30  static int sched_clock_irqtime;
73fbec604 Frederic Weisbecker 2012-06-16  31  
73fbec604 Frederic Weisbecker 2012-06-16  32  void enable_sched_clock_irqtime(void)
73fbec604 Frederic Weisbecker 2012-06-16  33  {
73fbec604 Frederic Weisbecker 2012-06-16  34  	sched_clock_irqtime = 1;
73fbec604 Frederic Weisbecker 2012-06-16  35  }
73fbec604 Frederic Weisbecker 2012-06-16  36  
73fbec604 Frederic Weisbecker 2012-06-16  37  void disable_sched_clock_irqtime(void)
73fbec604 Frederic Weisbecker 2012-06-16  38  {
73fbec604 Frederic Weisbecker 2012-06-16  39  	sched_clock_irqtime = 0;
73fbec604 Frederic Weisbecker 2012-06-16  40  }
73fbec604 Frederic Weisbecker 2012-06-16  41  
73fbec604 Frederic Weisbecker 2012-06-16  42  #ifndef CONFIG_64BIT
73fbec604 Frederic Weisbecker 2012-06-16  43  DEFINE_PER_CPU(seqcount_t, irq_time_seq);
73fbec604 Frederic Weisbecker 2012-06-16  44  #endif /* CONFIG_64BIT */
73fbec604 Frederic Weisbecker 2012-06-16  45  
73fbec604 Frederic Weisbecker 2012-06-16  46  /*
73fbec604 Frederic Weisbecker 2012-06-16  47   * Called before incrementing preempt_count on {soft,}irq_enter
73fbec604 Frederic Weisbecker 2012-06-16  48   * and before decrementing preempt_count on {soft,}irq_exit.
73fbec604 Frederic Weisbecker 2012-06-16  49   */
3e1df4f50 Frederic Weisbecker 2012-10-06  50  void irqtime_account_irq(struct task_struct *curr)
73fbec604 Frederic Weisbecker 2012-06-16  51  {
73fbec604 Frederic Weisbecker 2012-06-16  52  	unsigned long flags;
73fbec604 Frederic Weisbecker 2012-06-16  53  	s64 delta;
73fbec604 Frederic Weisbecker 2012-06-16  54  	int cpu;
73fbec604 Frederic Weisbecker 2012-06-16  55  
73fbec604 Frederic Weisbecker 2012-06-16  56  	if (!sched_clock_irqtime)
73fbec604 Frederic Weisbecker 2012-06-16  57  		return;
73fbec604 Frederic Weisbecker 2012-06-16  58  
73fbec604 Frederic Weisbecker 2012-06-16  59  	local_irq_save(flags);
73fbec604 Frederic Weisbecker 2012-06-16  60  
73fbec604 Frederic Weisbecker 2012-06-16  61  	cpu = smp_processor_id();
73fbec604 Frederic Weisbecker 2012-06-16  62  	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
73fbec604 Frederic Weisbecker 2012-06-16  63  	__this_cpu_add(irq_start_time, delta);
73fbec604 Frederic Weisbecker 2012-06-16  64  
73fbec604 Frederic Weisbecker 2012-06-16  65  	irq_time_write_begin();
73fbec604 Frederic Weisbecker 2012-06-16  66  	/*
73fbec604 Frederic Weisbecker 2012-06-16  67  	 * We do not account for softirq time from ksoftirqd here.
73fbec604 Frederic Weisbecker 2012-06-16  68  	 * We want to continue accounting softirq time to ksoftirqd thread
73fbec604 Frederic Weisbecker 2012-06-16  69  	 * in that case, so as not to confuse scheduler with a special task
73fbec604 Frederic Weisbecker 2012-06-16  70  	 * that do not consume any time, but still wants to run.
73fbec604 Frederic Weisbecker 2012-06-16  71  	 */
73fbec604 Frederic Weisbecker 2012-06-16  72  	if (hardirq_count())
73fbec604 Frederic Weisbecker 2012-06-16  73  		__this_cpu_add(cpu_hardirq_time, delta);
73fbec604 Frederic Weisbecker 2012-06-16  74  	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
73fbec604 Frederic Weisbecker 2012-06-16  75  		__this_cpu_add(cpu_softirq_time, delta);
73fbec604 Frederic Weisbecker 2012-06-16  76  
73fbec604 Frederic Weisbecker 2012-06-16  77  	irq_time_write_end();
73fbec604 Frederic Weisbecker 2012-06-16  78  	local_irq_restore(flags);
73fbec604 Frederic Weisbecker 2012-06-16  79  }
3e1df4f50 Frederic Weisbecker 2012-10-06  80  EXPORT_SYMBOL_GPL(irqtime_account_irq);
73fbec604 Frederic Weisbecker 2012-06-16  81  
ad7578c31 Rik van Riel        2016-06-22  82  static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
73fbec604 Frederic Weisbecker 2012-06-16  83  {
73fbec604 Frederic Weisbecker 2012-06-16  84  	u64 *cpustat = kcpustat_this_cpu->cpustat;
ad7578c31 Rik van Riel        2016-06-22  85  	unsigned long irq_jiffies = 0;
73fbec604 Frederic Weisbecker 2012-06-16  86  	unsigned long flags;
ad7578c31 Rik van Riel        2016-06-22  87  	u64 irq;
73fbec604 Frederic Weisbecker 2012-06-16  88  
73fbec604 Frederic Weisbecker 2012-06-16  89  	local_irq_save(flags);
ad7578c31 Rik van Riel        2016-06-22  90  	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
ad7578c31 Rik van Riel        2016-06-22  91  	if (irq > cputime_one_jiffy) {
ad7578c31 Rik van Riel        2016-06-22 @92  		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
ad7578c31 Rik van Riel        2016-06-22  93  		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
ad7578c31 Rik van Riel        2016-06-22  94  	}
73fbec604 Frederic Weisbecker 2012-06-16  95  	local_irq_restore(flags);

:::::: The code at line 92 was first introduced by commit
:::::: ad7578c312973ca898de22b799774bd08ffe819f sched,time: count actually elapsed irq & softirq time

:::::: TO: Rik van Riel <riel@redhat.com>
:::::: CC: 0day robot <fengguang.wu@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-27 12:25   ` Frederic Weisbecker
  2016-06-27 12:50     ` Rik van Riel
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2016-06-27 12:25 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period will be accounted as irq or softirq
> time.
> 
> This is inaccurate if only a subset of 'ticks' jiffies was
> actually spent handling irqs, and could conceivably mis-count
> all of the ticks during a period as irq time, when there was
> some irq and some softirq time.

Good catch!

Many comments following.

> 
> This can actually happen when irqtime_account_process_tick
> is called from account_idle_ticks, which can pass a larger
> number of ticks down all at once.
> 
> Fix this by changing irqtime_account_hi_update and
> irqtime_account_si_update to round elapsed irq and softirq
> time to jiffies, and return the number of jiffies spent in
> each mode, similar to how steal time is handled.
> 
> Additionally, have irqtime_account_process_tick take into
> account how much time was spent in each of steal, irq,
> and softirq time.
> 
> The latter could help improve the accuracy of timekeeping

Maybe you meant cputime? Timekeeping is rather about jiffies and GTOD.

> when returning from idle on a NO_HZ_IDLE CPU.
> 
> Properly accounting how much time was spent in hardirq and
> softirq time will also allow the NO_HZ_FULL code to re-use
> these same functions for hardirq and softirq accounting.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 60 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 3d60e5d76fdb..15b813c014be 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr)
>  }
>  EXPORT_SYMBOL_GPL(irqtime_account_irq);
>  
> -static int irqtime_account_hi_update(void)
> +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long irq_jiffies = 0;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	u64 irq;

Should be cputime_t? And perhaps renamed to irq_cputime to distinguish
it clearly from irq_jiffies?

>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_hardirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> -		ret = 1;
> +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];

cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in fact
even cputime64_t). So you need to convert cpu_hardirq_time before doing the
subtraction.

> +	if (irq > cputime_one_jiffy) {
> +		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
> +		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
> +	}
>  	local_irq_restore(flags);
> -	return ret;
> +	return irq_jiffies;

I think we shouldn't use jiffies at all here and only rely on cputime_t.
max_jiffies should be of type cputime_t and this function should return cputime_t.

The resulting code should be more simple and in fact more precise (more below on that).
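
I.e., roughly something like this (just a sketch of the suggested
shape, untested):

	static cputime_t irqtime_account_hi_update(cputime_t maxtime)
	{
		u64 *cpustat = kcpustat_this_cpu->cpustat;
		cputime_t irq_cputime;
		unsigned long flags;

		local_irq_save(flags);
		/* convert nsecs to cputime before comparing with cpustat */
		irq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)) -
			      cpustat[CPUTIME_IRQ];
		/* cap at the time that actually elapsed */
		irq_cputime = min(irq_cputime, maxtime);
		cpustat[CPUTIME_IRQ] += irq_cputime;
		local_irq_restore(flags);
		return irq_cputime;
	}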

>  }
>  
> -static int irqtime_account_si_update(void)
> +static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long si_jiffies = 0;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	u64 softirq;
>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_softirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
> -		ret = 1;
> +	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
> +	if (softirq > cputime_one_jiffy) {
> +		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
> +		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
> +	}
>  	local_irq_restore(flags);
> -	return ret;
> +	return si_jiffies;

So same comments apply here.

[...]
>   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
>   * tasks (sum on group iteration) belonging to @tsk's group.
>   */
> @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>  {
>  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
>  	u64 cputime = (__force u64) cputime_one_jiffy;
> -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long other;
>  
> -	if (steal_account_process_tick(ULONG_MAX))
> +	/*
> +	 * When returning from idle, many ticks can get accounted at
> +	 * once, including some ticks of steal, irq, and softirq time.
> +	 * Subtract those ticks from the amount of time accounted to
> +	 * idle, or potentially user or system time. Due to rounding,
> +	 * other time can exceed ticks occasionally.
> +	 */
> +	other = account_other_ticks(ticks);
> +	if (other >= ticks)
>  		return;
> +	ticks -= other;
>  
>  	cputime *= ticks;
>  	scaled *= ticks;

So instead of dealing with ticks here, I think you should rather use the above
cputime as both the limit and the remaining time to account after steal/irqs.

This should avoid some middle conversions and improve precision when cputime_t == nsecs granularity.

If we account 2 ticks to idle (let's say HZ=100) and the irq time to
account is 15 ms, then 2 ticks = 20 ms, so we have 5 ms left to account
to idle. With the jiffies granularity in this patch, we would account
one tick to irqtime (1 tick = 10 ms, leaving 5 ms to account back later)
and one tick to idle time, whereas if you deal with cputime_t, you are
going to account the correct amount of idle time.
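
Spelled out with the same numbers (HZ=100, so one tick is 10 ms):

	/*
	 * elapsed idle period = 2 ticks = 20 ms
	 * pending irq time    = 15 ms
	 *
	 * jiffies granularity (this patch):
	 *   irqtime += 1 tick (10 ms), 5 ms left to account back later
	 *   idle    += 1 tick (10 ms)  <- 5 ms too much
	 *
	 * cputime_t granularity (suggested):
	 *   irqtime += 15 ms
	 *   idle    +=  5 ms           <- exact
	 */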

Beyond the scope of this patchset, we should probably kill cputime_t and account everything in nsecs.
I have a patchset that did that 2 years ago; I should probably reiterate it.

Thanks.


* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-27 12:25   ` Frederic Weisbecker
@ 2016-06-27 12:50     ` Rik van Riel
  2016-06-28 20:56       ` Frederic Weisbecker
  0 siblings, 1 reply; 15+ messages in thread
From: Rik van Riel @ 2016-06-27 12:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar


On Mon, 2016-06-27 at 14:25 +0200, Frederic Weisbecker wrote:
> On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@redhat.com wrote:
> > 
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Currently, if there was any irq or softirq time during 'ticks'
> > jiffies, the entire period will be accounted as irq or softirq
> > time.
> > 
> > This is inaccurate if only a subset of 'ticks' jiffies was
> > actually spent handling irqs, and could conceivably mis-count
> > all of the ticks during a period as irq time, when there was
> > some irq and some softirq time.
> Good catch!
> 
> Many comments following.
> 
> > 
> > 
> > This can actually happen when irqtime_account_process_tick
> > is called from account_idle_ticks, which can pass a larger
> > number of ticks down all at once.
> > 
> > Fix this by changing irqtime_account_hi_update and
> > irqtime_account_si_update to round elapsed irq and softirq
> > time to jiffies, and return the number of jiffies spent in
> > each mode, similar to how steal time is handled.
> > 
> > Additionally, have irqtime_account_process_tick take into
> > account how much time was spent in each of steal, irq,
> > and softirq time.
> > 
> > The latter could help improve the accuracy of timekeeping
> Maybe you meant cputime? Timekeeping is rather about jiffies and GTOD.
> 
> > 
> > when returning from idle on a NO_HZ_IDLE CPU.
> > 
> > Properly accounting how much time was spent in hardirq and
> > softirq time will also allow the NO_HZ_FULL code to re-use
> > these same functions for hardirq and softirq accounting.
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > 
> >  	local_irq_save(flags);
> > -	latest_ns = this_cpu_read(cpu_hardirq_time);
> > -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> > -		ret = 1;
> > +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
> cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in fact
> even cputime64_t). So you need to convert cpu_hardirq_time before doing the
> subtraction.

Doh. Good catch!

> > -static int irqtime_account_si_update(void)
> > +static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
> >  {
> >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long si_jiffies = 0;
> >  	unsigned long flags;
> > -	u64 latest_ns;
> > -	int ret = 0;
> > +	u64 softirq;
> >  
> >  	local_irq_save(flags);
> > -	latest_ns = this_cpu_read(cpu_softirq_time);
> > -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
> > -		ret = 1;
> > +	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
> > +	if (softirq > cputime_one_jiffy) {
> > +		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
> > +		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
> > +	}
> >  	local_irq_restore(flags);
> > -	return ret;
> > +	return si_jiffies;
> So same comments apply here.
> 
> [...]
> > 
> >   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
> >   * tasks (sum on group iteration) belonging to @tsk's group.
> >   */
> > @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> >  {
> >  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
> >  	u64 cputime = (__force u64) cputime_one_jiffy;
> > -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > +	unsigned long other;
> >  
> > -	if (steal_account_process_tick(ULONG_MAX))
> > +	/*
> > +	 * When returning from idle, many ticks can get accounted at
> > +	 * once, including some ticks of steal, irq, and softirq time.
> > +	 * Subtract those ticks from the amount of time accounted to
> > +	 * idle, or potentially user or system time. Due to rounding,
> > +	 * other time can exceed ticks occasionally.
> > +	 */
> > +	other = account_other_ticks(ticks);
> > +	if (other >= ticks)
> >  		return;
> > +	ticks -= other;
> >  
> >  	cputime *= ticks;
> >  	scaled *= ticks;
> > So instead of dealing with ticks here, I think you should rather use the
> > above cputime as both the limit and the remaining time to account after
> > steal/irqs.
> 
> This should avoid some middle conversions and improve precision when
> cputime_t == nsecs granularity.
> 
> > If we account 2 ticks to idle (let's say HZ=100) and the irq time to
> > account is 15 ms, then 2 ticks = 20 ms, so we have 5 ms left to account
> > to idle. With the jiffies granularity in this patch, we would account
> > one tick to irqtime (1 tick = 10 ms, leaving 5 ms to account back later)
> > and one tick to idle time, whereas if you deal with cputime_t, you are
> > going to account the correct amount of idle time.
> 

Ahhh, so you want irqtime_account_process_tick to work with
and account fractional ticks when calling account_system_time,
account_user_time, account_idle_time, etc?

I guess that should work fine since we already pass cputime
values in, anyway.

I suppose we can do the same for get_vtime_delta, too.

They can both work with the actual remaining time (in cputime_t),
after the other time has been subtracted.

I can rework the series to do that.

-- 
All Rights Reversed.




* Re: [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-23  2:25 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
@ 2016-06-27 23:21   ` Frederic Weisbecker
  2016-06-27 23:31     ` Rik van Riel
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2016-06-27 23:21 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Wed, Jun 22, 2016 at 10:25:48PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
> appear to currently work right.
> 
> On CPUs that are nohz_full, people typically do not assign IRQs.

Right, but they can still fire. At least one tick per second, plus the
pinned timers, etc...

>
> On the housekeeping CPU (when a system is booted up with nohz_full),
> sampling should work ok to determine irq and softirq time use, but
> that only covers the housekeeping CPU itself, not the other
> non-nohz_full CPUs.

Hmm, every non-nohz_full CPU, including CPU 0, accounts the irqtime
the same way: through the tick (and therefore can't account much of it).

So I'm a bit confused by the above statements..

> 
> On CPUs that are nohz_idle (the typical way a distro kernel is
> booted), irq time is not accounted at all while the CPU is idle,
> due to the lack of timer ticks.

But as soon as a timer tick fires in idle or afterward, the pending
irqtime is accounted.

That said I don't see how it explains why we do the below:

> 
> Remove the VTIME_GEN vtime irq time code. The next patch will
> allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.

I don't get the reason why we are doing this. Now arguably the irqtime
accounting is probably not working as well as before since we switched to
jiffy clock. But I still see some hard irqs accounted when account_irq_exit()
is lucky enough to observe that jiffies changed since the beginning of
the interrupt.

So it's not entirely broken. I agree that we need to switch it to the
generic irqtime accounting code but breaking the code now to reactivate it
in a subsequent patch is prone to future bisection issues.

Thanks.


* Re: [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-27 23:21   ` Frederic Weisbecker
@ 2016-06-27 23:31     ` Rik van Riel
  2016-06-27 23:51       ` Frederic Weisbecker
  0 siblings, 1 reply; 15+ messages in thread
From: Rik van Riel @ 2016-06-27 23:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar


On Tue, 2016-06-28 at 01:21 +0200, Frederic Weisbecker wrote:
> On Wed, Jun 22, 2016 at 10:25:48PM -0400, riel@redhat.com wrote:
> > 
> > From: Rik van Riel <riel@redhat.com>
> > 
> > The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
> > appear to currently work right.
> > 
> > On CPUs that are nohz_full, people typically do not assign IRQs.
> Right, but they can still fire. At least one tick per second, plus the
> pinned timers, etc...
> 
> > 
> > 
> > On the housekeeping CPU (when a system is booted up with nohz_full),
> > sampling should work ok to determine irq and softirq time use, but
> > that only covers the housekeeping CPU itself, not the other
> > non-nohz_full CPUs.
> Hmm, every non-nohz_full CPU, including CPU 0, accounts the irqtime
> the same way: through the tick (and therefore can't account much of it).
> 
But it will be subtracted from the user time, rather
than the idle time during which the irqs happened.

Furthermore, we might well have 100 jiffies worth of
irq & softirq time on a CPU, and get just 1 jiffy
of userspace time, on systems acting like routers.

> > 
> > 
> > On CPUs that are nohz_idle (the typical way a distro kernel is
> > booted), irq time is not accounted at all while the CPU is idle,
> > due to the lack of timer ticks.
> But as soon as a timer tick fires in idle or afterward, the pending
> irqtime is accounted.
> 
> That said I don't see how it explains why we do the below:
> 
> > 
> > 
> > Remove the VTIME_GEN vtime irq time code. The next patch will
> > allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.
> I don't get the reason why we are doing this. Now arguably the irqtime
> accounting is probably not working as well as before since we switched to
> jiffy clock. But I still see some hard irqs accounted when account_irq_exit()
> is lucky enough to observe that jiffies changed since the beginning of
> the interrupt.
> 
> So it's not entirely broken. I agree that we need to switch it to the
> generic irqtime accounting code but breaking the code now to reactivate it
> in a subsequent patch is prone to future bisection issues.

Want me to merge patches 2 & 3 into one, so we immediately
start using the generic code and do not run into bisect
issues?

-- 
All Rights Reversed.




* Re: [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code
  2016-06-27 23:31     ` Rik van Riel
@ 2016-06-27 23:51       ` Frederic Weisbecker
  0 siblings, 0 replies; 15+ messages in thread
From: Frederic Weisbecker @ 2016-06-27 23:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Mon, Jun 27, 2016 at 07:31:29PM -0400, Rik van Riel wrote:
> On Tue, 2016-06-28 at 01:21 +0200, Frederic Weisbecker wrote:
> > On Wed, Jun 22, 2016 at 10:25:48PM -0400, riel@redhat.com wrote:
> > > 
> > > From: Rik van Riel <riel@redhat.com>
> > > 
> > > The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
> > > appear to currently work right.
> > > 
> > > On CPUs that are nohz_full, people typically do not assign IRQs.
> > Right, but they can still fire. At least one tick per second, plus the
> > pinned timers, etc...
> > 
> > > 
> > > 
> > > On the housekeeping CPU (when a system is booted up with nohz_full),
> > > sampling should work ok to determine irq and softirq time use, but
> > > that only covers the housekeeping CPU itself, not the other
> > > non-nohz_full CPUs.
> > Hmm, every non-nohz_full CPU, including CPU 0, accounts the irqtime
> > the same way: through the tick (and therefore can't account much of it).
> > 
> But it will be subtracted from the user time, rather
> than the idle time during which the irqs happened.
> 
> Furthermore, we might well have 100 jiffies worth of
> irq & softirq time on a CPU, and get just 1 jiffy
> of userspace time, on systems acting like routers.

Indeed.

> > > Remove the VTIME_GEN vtime irq time code. The next patch will
> > > allow NO_HZ_FULL kernels to use the IRQ_TIME_ACCOUNTING code.
> > I don't get the reason why we are doing this. Now arguably the irqtime
> > accounting is probably not working as well as before since we switched to
> > jiffy clock. But I still see some hard irqs accounted when account_irq_exit()
> > is lucky enough to observe that jiffies changed since the beginning of
> > the interrupt.
> > 
> > So it's not entirely broken. I agree that we need to switch it to the
> > generic irqtime accounting code but breaking the code now to reactivate it
> > in a subsequent patch is prone to future bisection issues.
> 
> Want me to merge patches 2 & 3 into one, so we immediately
> start using the generic code and do not run into bisect
> issues?

Yeah that would be better.

Thanks.


* Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
  2016-06-27 12:50     ` Rik van Riel
@ 2016-06-28 20:56       ` Frederic Weisbecker
  0 siblings, 0 replies; 15+ messages in thread
From: Frederic Weisbecker @ 2016-06-28 20:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Mon, Jun 27, 2016 at 08:50:06AM -0400, Rik van Riel wrote:
> On Mon, 2016-06-27 at 14:25 +0200, Frederic Weisbecker wrote:
> > > 
> > >   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
> > >   * tasks (sum on group iteration) belonging to @tsk's group.
> > >   */
> > > @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> > >  {
> > >  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
> > >  	u64 cputime = (__force u64) cputime_one_jiffy;
> > > -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > > +	unsigned long other;
> > >  
> > > -	if (steal_account_process_tick(ULONG_MAX))
> > > +	/*
> > > +	 * When returning from idle, many ticks can get accounted at
> > > +	 * once, including some ticks of steal, irq, and softirq time.
> > > +	 * Subtract those ticks from the amount of time accounted to
> > > +	 * idle, or potentially user or system time. Due to rounding,
> > > +	 * other time can exceed ticks occasionally.
> > > +	 */
> > > +	other = account_other_ticks(ticks);
> > > +	if (other >= ticks)
> > >  		return;
> > > +	ticks -= other;
> > >  
> > >  	cputime *= ticks;
> > >  	scaled *= ticks;
> > So instead of dealing with ticks here, I think you should rather
> > use the above cputime as both the limit and the remaining time to
> > account after steal/irqs.
> > 
> > This should avoid some middle conversions and improve precision
> > when cputime_t has nsecs granularity.
> > 
> > If we account 2 ticks to idle (let's say HZ=100) and the irq time
> > to account is 15 ms, then 2 ticks = 20 ms, so we have 5 ms left to
> > account to idle. With the jiffies granularity in this patch, we
> > would account one tick to irqtime (1 tick = 10 ms; there will be
> > 5 ms to account back later) and one tick to idle time, whereas if
> > you deal with cputime_t, you are going to account the correct
> > amount of idle time.
> > 
> 
> Ahhh, so you want irqtime_account_process_tick to work with, and
> account, fractional ticks when calling account_system_time,
> account_user_time, account_idle_time, etc?

Not exactly. This function is passed ticks but works with the
cputime-converted value of those ticks. It would be more precise to
work on top of cputime values without converting things to jiffies in
the middle, as we may lose precision on the way.
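
Toy illustration of the difference (userspace sketch, not kernel code;
the numbers match the example above, assuming HZ=100):

#include <stdio.h>

#define NSEC_PER_TICK 10000000ULL		/* one tick = 10 ms at HZ=100 */

int main(void)
{
	unsigned long long elapsed = 2 * NSEC_PER_TICK;	/* 2 idle ticks = 20 ms */
	unsigned long long irqtime = 15000000ULL;	/* 15 ms spent in irqs */

	/* jiffy granularity: irq time rounds down to whole ticks first */
	unsigned long long irq_ticks = irqtime / NSEC_PER_TICK;	/* 1 */
	printf("jiffies: irq=%llu ms, idle=%llu ms\n",
	       irq_ticks * 10, (2 - irq_ticks) * 10);

	/* cputime (ns) granularity: subtract the exact amount */
	printf("cputime: irq=%llu ms, idle=%llu ms\n",
	       irqtime / 1000000, (elapsed - irqtime) / 1000000);
	return 0;
}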

> 
> I guess that should work fine since we already pass cputime
> values in, anyway.
> 
> I suppose we can do the same for get_vtime_delta, too.

But get_vtime_delta() already returns cputime, right?

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-16 16:06 ` riel
  0 siblings, 0 replies; 15+ messages in thread
From: riel @ 2016-06-16 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, pbonzini, peterz, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter and documentation to irqtime_account_irq;
this can be used to distinguish between transitioning from process
context to hardirq time, and from process context to softirq time.
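
For example, mirroring the callers updated in the hunks below, the
hard and soft irq entry paths now pass the context being entered:

	account_irq_enter_time(current, HARDIRQ_OFFSET);	/* __irq_enter() */
	account_irq_enter_time(current, SOFTIRQ_OFFSET);	/* __do_softirq() */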

This is necessary to be able to remove the local_irq_disable from
irqtime_account_irq.
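
A rough sketch of where this is headed (hypothetical code; patch 5
carries the real change), using the locals already present in
irqtime_account_irq(): on exit, charge the elapsed delta only to the
context named by irqtype:

	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
	__this_cpu_add(irq_start_time, delta);

	if (irqtype == HARDIRQ_OFFSET && hardirq_count())
		__this_cpu_add(cpu_hardirq_time, delta);
	else if (irqtype == SOFTIRQ_OFFSET && in_serving_softirq() &&
		 curr != this_cpu_ksoftirqd())
		__this_cpu_add(cpu_softirq_time, delta);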

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84ebdf..31e486d741b0 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 261c9002348a..07f847267e4e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-08  2:29 [PATCH RFC 0/5] sched,time: make irq time accounting work for nohz_idle riel
@ 2016-06-08  2:30 ` riel
  0 siblings, 0 replies; 15+ messages in thread
From: riel @ 2016-06-08  2:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: kernellwp, mingo, peterz, tglx, fweisbec

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter and documentation to irqtime_account_irq;
this can be used to distinguish between transitioning from process
context to hardirq time, and from process context to softirq time.

This is necessary to be able to remove the local_irq_disable from
irqtime_account_irq.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3b384bf5ce1a..58f036f3ebea 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk)
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2f862dfdb520..e009077aeab6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-06-28 20:58 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-23  2:25 [PATCH v2 0/5] sched,time: fix irq time accounting with nohz_idle riel
2016-06-23  2:25 ` [PATCH 1/5] sched,time: count actually elapsed irq & softirq time riel
2016-06-27 12:25   ` Frederic Weisbecker
2016-06-27 12:50     ` Rik van Riel
2016-06-28 20:56       ` Frederic Weisbecker
2016-06-23  2:25 ` [PATCH 2/5] nohz,cputime: remove VTIME_GEN vtime irq time code riel
2016-06-27 23:21   ` Frederic Weisbecker
2016-06-27 23:31     ` Rik van Riel
2016-06-27 23:51       ` Frederic Weisbecker
2016-06-23  2:25 ` [PATCH 3/5] cputime: allow irq time accounting to be selected as an option riel
2016-06-26  3:03   ` kbuild test robot
2016-06-23  2:25 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
2016-06-23  2:25 ` [PATCH 5/5] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  -- strict thread matches above, loose matches on Subject: below --
2016-06-16 16:06 [PATCH 0/5] sched,time: fix irq time accounting with nohz_idle riel
2016-06-16 16:06 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel
2016-06-08  2:29 [PATCH RFC 0/5] sched,time: make irq time accounting work for nohz_idle riel
2016-06-08  2:30 ` [PATCH 4/5] irqtime: add irq type parameter to irqtime_account_irq riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).