* [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle
@ 2016-06-30 19:35 riel
  2016-06-30 19:35 ` [PATCH 1/4] sched,time: count actually elapsed irq & softirq time riel
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: riel @ 2016-06-30 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

Currently irq time accounting only works in these cases:
1) purely tick based accounting
2) nohz_full accounting, but only on housekeeping & nohz_full CPUs
3) architectures with native vtime accounting

On nohz_idle CPUs, which are probably the majority nowadays,
irq time accounting is currently broken. This leads to systems
reporting a dramatically lower amount of irq & softirq time than
is actually spent handling them, with all the time spent while the
system is in the idle task being accounted as idle.

This patch set seems to bring the amount of irq time reported by
top (and /proc/stat) roughly in line with that measured when I do
a "perf record -g -a" run to see what is using all that time.

The amount of irq time used, especially softirq, is shockingly high,
to the point that I thought this patch set might be wrong, but the
numbers seem to match what perf is giving me...

These patches apply on top of Wanpeng Li's steal time patches.

CONFIG_IRQ_TIME_ACCOUNTING is now a config option that is available
as a separate choice from tick based / nohz_idle / nohz_full mode,
as suggested by Frederic Weisbecker.

Next up: look at the things that are using CPU time on an otherwise
idle system, and see if I can make those a little faster :)

v2: address Peterz's concerns, some more cleanups
v3: rewrite the code along Frederic's suggestions, now cputime_t
    is used everywhere

* [PATCH 1/4] sched,time: count actually elapsed irq & softirq time
  2016-06-30 19:35 [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle riel
@ 2016-06-30 19:35 ` riel
  2016-07-05 12:40   ` Frederic Weisbecker
  2016-06-30 19:35 ` [PATCH 2/4] nohz,cputime: replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code riel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: riel @ 2016-06-30 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period is accounted as irq or softirq
time.

This is inaccurate if only a subset of the time was actually spent
handling irqs, and could conceivably mis-count all of the ticks during
a period as irq time, when there was some irq and some softirq time.

This can actually happen when irqtime_account_process_tick is called
from account_idle_ticks, which can pass a larger number of ticks down
all at once.

Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
and steal_account_process_tick to work with cputime_t time units, and
return the amount of time spent in each mode.

Rename steal_account_process_tick to steal_account_process_time, to
reflect that time is now accounted in cputime_t, instead of ticks.

Additionally, have irqtime_account_process_tick take into account how
much time was spent in each of steal, irq, and softirq time.

The latter could help improve the accuracy of cputime
accounting when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
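A toy userspace model of the clamping this introduces, with made-up
numbers (the real account_other_time() below does the same thing
against the per-cpu counters):

#include <stdio.h>

typedef unsigned long long u64;

static u64 min_u64(u64 a, u64 b) { return a < b ? a : b; }

/*
 * Each "other time" source consumes at most what is left of the
 * elapsed window, so steal + irq + softirq can never exceed it.
 */
static u64 account_other_time(u64 max, u64 steal, u64 irq, u64 softirq)
{
	u64 accounted = min_u64(steal, max);

	accounted += min_u64(irq, max - accounted);
	accounted += min_u64(softirq, max - accounted);
	return accounted;
}

int main(void)
{
	/* 10 ticks elapsed; 2 ticks of steal, 5 of irq, 6 of softirq */
	u64 other = account_other_time(10, 2, 5, 6);

	/* prints other=10 left=0: nothing left over for idle time,
	 * and nothing double accounted either */
	printf("other=%llu left=%llu\n", other, 10 - other);
	return 0;
}
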
 kernel/sched/cputime.c | 124 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 77 insertions(+), 47 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d76fdb..018bae2ada36 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static cputime_t irqtime_account_hi_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t irq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq_cputime = nsecs_to_cputime(this_cpu_read(cpu_hardirq_time)) -
+		      cpustat[CPUTIME_IRQ];
+	irq_cputime = min(irq_cputime, maxtime);
+	cpustat[CPUTIME_IRQ] += irq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return irq_cputime;
 }
 
-static int irqtime_account_si_update(void)
+static cputime_t irqtime_account_si_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t softirq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq_cputime = nsecs_to_cputime(this_cpu_read(cpu_softirq_time)) -
+			  cpustat[CPUTIME_SOFTIRQ];
+	softirq_cputime = min(softirq_cputime, maxtime);
+	cpustat[CPUTIME_SOFTIRQ] += softirq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return softirq_cputime;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static cputime_t irqtime_account_hi_update(cputime_t dummy)
+{
+	return 0;
+}
+
+static cputime_t irqtime_account_si_update(cputime_t dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -257,32 +267,45 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline unsigned long steal_account_process_tick(unsigned long max_jiffies)
+static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
+		cputime_t steal_cputime;
 		u64 steal;
-		unsigned long steal_jiffies;
 
 		steal = paravirt_steal_clock(smp_processor_id());
 		steal -= this_rq()->prev_steal_time;
+		this_rq()->prev_steal_time += steal;
 
-		/*
-		 * steal is in nsecs but our caller is expecting steal
-		 * time in jiffies. Lets cast the result to jiffies
-		 * granularity and account the rest on the next rounds.
-		 */
-		steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
-		this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);
+		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		account_steal_time(steal_cputime);
 
-		account_steal_time(jiffies_to_cputime(steal_jiffies));
-		return steal_jiffies;
+		return steal_cputime;
 	}
 #endif
 	return 0;
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ */
+static inline cputime_t account_other_time(cputime_t max)
+{
+	cputime_t accounted;
+
+	accounted = steal_account_process_time(max);
+
+	if (accounted < max)
+		accounted += irqtime_account_hi_update(max - accounted);
+
+	if (accounted < max)
+		accounted += irqtime_account_si_update(max - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -342,21 +365,23 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 					 struct rq *rq, int ticks)
 {
-	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
-	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	u64 cputime = (__force u64) cputime_one_jiffy * ticks;
+	cputime_t scaled, other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_time(cputime);
+	if (other >= cputime)
 		return;
+	cputime -= other;
+	scaled = cputime_to_scaled(cputime);
 
-	cputime *= ticks;
-	scaled *= ticks;
-
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
@@ -466,7 +491,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
  */
 void account_process_tick(struct task_struct *p, int user_tick)
 {
-	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
+	cputime_t cputime, scaled, steal;
 	struct rq *rq = this_rq();
 
 	if (vtime_accounting_cpu_enabled())
@@ -477,16 +502,21 @@ void account_process_tick(struct task_struct *p, int user_tick)
 		return;
 	}
 
-	if (steal_account_process_tick(ULONG_MAX))
+	cputime = cputime_one_jiffy;
+	steal = steal_account_process_time(cputime);
+
+	if (steal >= cputime)
 		return;
 
+	cputime -= steal;
+	scaled = cputime_to_scaled(cputime);
+
 	if (user_tick)
-		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
+		account_user_time(p, cputime, scaled);
 	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
-		account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
-				    one_jiffy_scaled);
+		account_system_time(p, HARDIRQ_OFFSET, cputime, scaled);
 	else
-		account_idle_time(cputime_one_jiffy);
+		account_idle_time(cputime);
 }
 
 /*
@@ -681,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	cputime_t delta, steal;
 
-	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	delta = jiffies_to_cputime(now - tsk->vtime_snap);
+	steal = steal_account_process_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return delta - steal;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
-- 
2.7.4

* [PATCH 2/4] nohz,cputime: replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
  2016-06-30 19:35 [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-30 19:35 ` [PATCH 1/4] sched,time: count actually elapsed irq & softirq time riel
@ 2016-06-30 19:35 ` riel
  2016-06-30 19:35 ` [PATCH 3/4] irqtime: add irq type parameter to irqtime_account_irq riel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: riel @ 2016-06-30 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
currently appear to work right.

On CPUs without nohz_full=, only tick based irq time sampling is
done, which breaks down when dealing with a nohz_idle CPU.

On firewalls and similar systems, no ticks may happen on a CPU for a
while, and the irq time spent may never get accounted properly. This
can cause issues with capacity planning and power saving, which use
the CPU statistics as inputs in decision making.

Remove the VTIME_GEN vtime irq time code, and replace it with the
IRQ_TIME_ACCOUNTING code, when that is selected as a config option
by the user.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vtime.h  | 32 ++++++++++++++------------------
 init/Kconfig           |  6 +++---
 kernel/sched/cputime.c | 16 +++-------------
 3 files changed, 20 insertions(+), 34 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index fa2196990f84..d1977d84ebdf 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,6 +14,18 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+
+#ifdef __ARCH_HAS_VTIME_ACCOUNT
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+#else
+extern void vtime_common_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_enter(struct task_struct *tsk)
+{
+	if (vtime_accounting_cpu_enabled())
+		vtime_common_account_irq_enter(tsk);
+}
+#endif /* __ARCH_HAS_VTIME_ACCOUNT */
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -64,17 +76,6 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline void vtime_task_switch(struct task_struct *prev) { }
@@ -85,13 +86,8 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_gen_account_irq_exit(struct task_struct *tsk);
-
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_gen_account_irq_exit(tsk);
-}
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
 
 extern void vtime_user_enter(struct task_struct *tsk);
 
diff --git a/init/Kconfig b/init/Kconfig
index 0dfd09d54c65..4c7ee4f136cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -375,9 +375,11 @@ config VIRT_CPU_ACCOUNTING_GEN
 
 	  If unsure, say N.
 
+endchoice
+
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
-	depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
+	depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
 	help
 	  Select this option to enable fine granularity task irq time
 	  accounting. This is done by reading a timestamp on each
@@ -386,8 +388,6 @@ config IRQ_TIME_ACCOUNTING
 
 	  If in doubt, say N here.
 
-endchoice
-
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
 	depends on MULTIUSER
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 018bae2ada36..c4349f2cacf6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -711,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	cputime_t delta, steal;
+	cputime_t delta, other;
 
 	delta = jiffies_to_cputime(now - tsk->vtime_snap);
-	steal = steal_account_process_time(delta);
+	other = account_other_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return delta - steal;
+	return delta - other;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -738,16 +738,6 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_gen_account_irq_exit(struct task_struct *tsk)
-{
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		__vtime_account_system(tsk);
-	if (context_tracking_in_user())
-		tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
-}
-
 void vtime_account_user(struct task_struct *tsk)
 {
 	cputime_t delta_cpu;
-- 
2.7.4

* [PATCH 3/4] irqtime: add irq type parameter to irqtime_account_irq
  2016-06-30 19:35 [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle riel
  2016-06-30 19:35 ` [PATCH 1/4] sched,time: count actually elapsed irq & softirq time riel
  2016-06-30 19:35 ` [PATCH 2/4] nohz,cputime: replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code riel
@ 2016-06-30 19:35 ` riel
  2016-06-30 19:35 ` [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
  2016-07-05 13:02 ` [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle Nikolay Borisov
  4 siblings, 0 replies; 18+ messages in thread
From: riel @ 2016-06-30 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Add an irq type parameter, and documentation, to irqtime_account_irq.
The parameter can be used to distinguish between transitioning from
process context to hardirq time, and from process context to softirq
time.

This is necessary to be able to remove the local_irq_save/restore
from irqtime_account_irq.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
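A toy model of why the hook needs no explicit enter/exit flag: it runs
before preempt_count is adjusted in both directions, so the old count
tells entering apart from leaving. The preempt_count here is faked,
purely for illustration:

#include <stdio.h>

#define HARDIRQ_OFFSET 0x10000

static int fake_preempt_count;

static void irqtime_account_irq(int irqtype)
{
	if (irqtype == HARDIRQ_OFFSET)
		printf("%s hardirq\n",
		       fake_preempt_count >= HARDIRQ_OFFSET ?
		       "leaving" : "entering");
}

static void __irq_enter(void)
{
	irqtime_account_irq(HARDIRQ_OFFSET);	/* sees old count: entering */
	fake_preempt_count += HARDIRQ_OFFSET;
}

static void __irq_exit(void)
{
	irqtime_account_irq(HARDIRQ_OFFSET);	/* sees old count: leaving */
	fake_preempt_count -= HARDIRQ_OFFSET;
}

int main(void)
{
	__irq_enter();	/* prints "entering hardirq" */
	__irq_exit();	/* prints "leaving hardirq" */
	return 0;
}
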
 include/linux/hardirq.h | 20 ++++++++++----------
 include/linux/vtime.h   | 12 ++++++------
 kernel/sched/cputime.c  |  9 ++++++++-
 kernel/softirq.c        |  6 +++---
 4 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6bc6f0..1ebb31f56285 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,11 +32,11 @@ extern void rcu_nmi_exit(void);
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()					\
-	do {						\
-		account_irq_enter_time(current);	\
-		preempt_count_add(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
+#define __irq_enter()							\
+	do {								\
+		account_irq_enter_time(current, HARDIRQ_OFFSET);	\
+		preempt_count_add(HARDIRQ_OFFSET);			\
+		trace_hardirq_enter();					\
 	} while (0)
 
 /*
@@ -47,11 +47,11 @@ extern void irq_enter(void);
 /*
  * Exit irq context without processing softirqs:
  */
-#define __irq_exit()					\
-	do {						\
-		trace_hardirq_exit();			\
-		account_irq_exit_time(current);		\
-		preempt_count_sub(HARDIRQ_OFFSET);	\
+#define __irq_exit()							\
+	do {								\
+		trace_hardirq_exit();					\
+		account_irq_exit_time(current, HARDIRQ_OFFSET);		\
+		preempt_count_sub(HARDIRQ_OFFSET);			\
 	} while (0)
 
 /*
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84ebdf..31e486d741b0 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -112,21 +112,21 @@ static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, int irqtype);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, int irqtype) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_enter(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, int irqtype)
 {
 	vtime_account_irq_exit(tsk);
-	irqtime_account_irq(tsk);
+	irqtime_account_irq(tsk, irqtype);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index c4349f2cacf6..a0aefd4c7ea6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,8 +46,15 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
 /*
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * There are six possible transitions:
+ * process -> softirq, softirq -> process
+ * process -> hardirq, hardirq -> process
+ * softirq -> hardirq, hardirq -> softirq
+ *
+ * When exiting hardirq or softirq time, account the elapsed time.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
 	unsigned long flags;
 	s64 delta;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a311c9622c86 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -245,7 +245,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	account_irq_enter_time(current);
+	account_irq_enter_time(current, SOFTIRQ_OFFSET);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
@@ -295,7 +295,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	}
 
 	lockdep_softirq_end(in_hardirq);
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, SOFTIRQ_OFFSET);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
@@ -385,7 +385,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
-	account_irq_exit_time(current);
+	account_irq_exit_time(current, HARDIRQ_OFFSET);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
-- 
2.7.4

* [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-30 19:35 [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle riel
                   ` (2 preceding siblings ...)
  2016-06-30 19:35 ` [PATCH 3/4] irqtime: add irq type parameter to irqtime_account_irq riel
@ 2016-06-30 19:35 ` riel
  2016-07-08 12:30   ` Frederic Weisbecker
  2016-07-05 13:02 ` [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle Nikolay Borisov
  4 siblings, 1 reply; 18+ messages in thread
From: riel @ 2016-06-30 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

From: Rik van Riel <riel@redhat.com>

Drop local_irq_save/restore from irqtime_account_irq.
Instead, have softirq and hardirq track their time spent
independently, with the softirq code subtracting hardirq
time that happened during the duration of the softirq run.

The softirq code can be interrupted by hardirq code at
any point in time, but it can check whether it got a
consistent snapshot of the timekeeping variables it wants,
and loop around in the unlikely case that it did not.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
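A toy userspace model of the retry loop; the fake hardirq below bumps
the counter between two reads to force exactly one retry, and all
numbers are made up:

#include <stdio.h>

typedef unsigned long long u64;

static u64 cpu_hardirq_time = 200;	/* hardirq time at softirq entry */
static int reads;

/* Stand-in for READ_ONCE(cpu_hardirq_time); pretend a hardirq bumps
 * the counter between the first and second read. */
static u64 read_hardirq_time(void)
{
	if (++reads == 2)
		cpu_hardirq_time += 100;
	return cpu_hardirq_time;
}

int main(void)
{
	u64 softirq_start = 1000, prev_hardirq = 200;
	u64 now = 5000;			/* sched_clock_cpu() stand-in */
	u64 hardirq_time, delta;

	do {
		hardirq_time = read_hardirq_time();
		/* time in this softirq, minus the hardirq time that
		 * interrupted it */
		delta = now - softirq_start - (hardirq_time - prev_hardirq);
	} while (hardirq_time != read_hardirq_time());

	printf("softirq time: %llu\n", delta);	/* 5000-1000-100 = 3900 */
	return 0;
}
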
 kernel/sched/cputime.c | 72 +++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h   | 38 +++++++++++++++++++++-----
 2 files changed, 90 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a0aefd4c7ea6..b78991fac228 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -26,7 +26,9 @@
 DEFINE_PER_CPU(u64, cpu_hardirq_time);
 DEFINE_PER_CPU(u64, cpu_softirq_time);
 
-static DEFINE_PER_CPU(u64, irq_start_time);
+static DEFINE_PER_CPU(u64, hardirq_start_time);
+static DEFINE_PER_CPU(u64, softirq_start_time);
+static DEFINE_PER_CPU(u64, prev_hardirq_time);
 static int sched_clock_irqtime;
 
 void enable_sched_clock_irqtime(void)
@@ -41,6 +43,7 @@ void disable_sched_clock_irqtime(void)
 
 #ifndef CONFIG_64BIT
 DEFINE_PER_CPU(seqcount_t, irq_time_seq);
+DEFINE_PER_CPU(seqcount_t, softirq_time_seq);
 #endif /* CONFIG_64BIT */
 
 /*
@@ -53,36 +56,79 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  * softirq -> hardirq, hardirq -> softirq
  *
  * When exiting hardirq or softirq time, account the elapsed time.
+ *
+ * When exiting softirq time, subtract the amount of hardirq time that
+ * interrupted this softirq run, to avoid double accounting of that time.
  */
 void irqtime_account_irq(struct task_struct *curr, int irqtype)
 {
-	unsigned long flags;
+	u64 prev_softirq_start;
+	bool leaving_softirq;
+	u64 prev_hardirq;
+	u64 hardirq_time;
 	s64 delta;
 	int cpu;
 
 	if (!sched_clock_irqtime)
 		return;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
-	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
-	__this_cpu_add(irq_start_time, delta);
 
-	irq_time_write_begin();
+	/*
+	 * Hardirq time accounting is pretty straightforward. If not in
+	 * hardirq context yet (entering hardirq), set the start time.
+	 * If already in hardirq context (leaving), account the elapsed time.
+	 */
+	if (irqtype == HARDIRQ_OFFSET) {
+		bool leaving_hardirq = hardirq_count();
+		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
+		__this_cpu_add(hardirq_start_time, delta);
+		if (leaving_hardirq) {
+			hardirq_time_write_begin();
+			__this_cpu_add(cpu_hardirq_time, delta);
+			hardirq_time_write_end();
+		}
+		return;
+	}
+
+	/*
+	 * Softirq context may get interrupted by hardirq context, on the
+	 * same CPU. At softirq entry time the amount of time this CPU spent
+	 * in hardirq context is stored. At softirq exit time, the time spent
+	 * in hardirq context during the softirq is subtracted.
+	 */
+	prev_softirq_start = __this_cpu_read(softirq_start_time);
+	prev_hardirq = __this_cpu_read(prev_hardirq_time);
+	leaving_softirq = in_serving_softirq();
+
+	do {
+		u64 now = sched_clock_cpu(cpu);
+
+		hardirq_time = READ_ONCE(per_cpu(cpu_hardirq_time, cpu));
+		__this_cpu_write(softirq_start_time, now);
+		__this_cpu_write(prev_hardirq_time, hardirq_time);
+
+		if (leaving_softirq) {
+			/*
+			 * Subtract hardirq time that happened during this
+			 * softirq.
+			 */
+			s64 hi_delta = hardirq_time - prev_hardirq;
+			delta = now - prev_softirq_start - hi_delta;
+		}
+		/* Loop around if interrupted by a hardirq. */
+	} while (hardirq_time != READ_ONCE(per_cpu(cpu_hardirq_time, cpu)));
+
 	/*
 	 * We do not account for softirq time from ksoftirqd here.
 	 * We want to continue accounting softirq time to ksoftirqd thread
 	 * in that case, so as not to confuse scheduler with a special task
 	 * that do not consume any time, but still wants to run.
 	 */
-	if (hardirq_count())
-		__this_cpu_add(cpu_hardirq_time, delta);
-	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+	softirq_time_write_begin();
+	if (leaving_softirq && curr != this_cpu_ksoftirqd())
 		__this_cpu_add(cpu_softirq_time, delta);
-
-	irq_time_write_end();
-	local_irq_restore(flags);
+	softirq_time_write_end();
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d23527e..cad4df9835f7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1752,38 +1752,62 @@ DECLARE_PER_CPU(u64, cpu_softirq_time);
 
 #ifndef CONFIG_64BIT
 DECLARE_PER_CPU(seqcount_t, irq_time_seq);
+DECLARE_PER_CPU(seqcount_t, softirq_time_seq);
 
-static inline void irq_time_write_begin(void)
+static inline void hardirq_time_write_begin(void)
 {
 	__this_cpu_inc(irq_time_seq.sequence);
 	smp_wmb();
 }
 
-static inline void irq_time_write_end(void)
+static inline void hardirq_time_write_end(void)
 {
 	smp_wmb();
 	__this_cpu_inc(irq_time_seq.sequence);
 }
 
+static inline void softirq_time_write_begin(void)
+{
+	__this_cpu_inc(softirq_time_seq.sequence);
+	smp_wmb();
+}
+
+static inline void softirq_time_write_end(void)
+{
+	smp_wmb();
+	__this_cpu_inc(softirq_time_seq.sequence);
+}
+
 static inline u64 irq_time_read(int cpu)
 {
 	u64 irq_time;
-	unsigned seq;
+	unsigned hi_seq;
+	unsigned si_seq;
 
 	do {
-		seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		hi_seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		si_seq = read_seqcount_begin(&per_cpu(softirq_time_seq, cpu));
 		irq_time = per_cpu(cpu_softirq_time, cpu) +
 			   per_cpu(cpu_hardirq_time, cpu);
-	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), hi_seq) ||
+		 read_seqcount_retry(&per_cpu(softirq_time_seq, cpu), si_seq));
 
 	return irq_time;
 }
 #else /* CONFIG_64BIT */
-static inline void irq_time_write_begin(void)
+static inline void hardirq_time_write_begin(void)
+{
+}
+
+static inline void hardirq_time_write_end(void)
+{
+}
+
+static inline void softirq_time_write_begin(void)
 {
 }
 
-static inline void irq_time_write_end(void)
+static inline void softirq_time_write_end(void)
 {
 }
 
-- 
2.7.4

* Re: [PATCH 1/4] sched,time: count actually elapsed irq & softirq time
  2016-06-30 19:35 ` [PATCH 1/4] sched,time: count actually elapsed irq & softirq time riel
@ 2016-07-05 12:40   ` Frederic Weisbecker
  2016-07-05 13:08     ` Rik van Riel
  2016-07-05 16:47     ` [PATCH v3 " Rik van Riel
  0 siblings, 2 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2016-07-05 12:40 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Thu, Jun 30, 2016 at 03:35:47PM -0400, riel@redhat.com wrote:
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 3d60e5d76fdb..018bae2ada36 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct *curr)
>  }
>  EXPORT_SYMBOL_GPL(irqtime_account_irq);
>  
> -static int irqtime_account_hi_update(void)
> +static cputime_t irqtime_account_hi_update(cputime_t maxtime)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	cputime_t irq_cputime;
>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_hardirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> -		ret = 1;
> +	irq_cputime = nsecs_to_cputime(this_cpu_read(cpu_hardirq_time)) -
> +		      cpustat[CPUTIME_IRQ];

We might want to keep nsecs_to_cputime64(). If cputime_t == jiffies_t == unsigned long,
we may have a problem after 49 days of interrupts. Arguably that's a lot of IRQs,
but let's be paranoid.
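
(Sanity check on the 49 days figure, assuming a 32 bit unsigned long
and HZ=1000:)

#include <stdio.h>

int main(void)
{
	/* 2^32 jiffies at HZ=1000, expressed in days */
	double days = 4294967296.0 / 1000.0 / 3600.0 / 24.0;

	printf("jiffies counter wraps after %.1f days\n", days);	/* ~49.7 */
	return 0;
}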

> +	irq_cputime = min(irq_cputime, maxtime);
> +	cpustat[CPUTIME_IRQ] += irq_cputime;
>  	local_irq_restore(flags);
> -	return ret;
> +	return irq_cputime;
>  }
>  
> -static int irqtime_account_si_update(void)
> +static cputime_t irqtime_account_si_update(cputime_t maxtime)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	cputime_t softirq_cputime;
>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_softirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
> -		ret = 1;
> +	softirq_cputime = nsecs_to_cputime(this_cpu_read(cpu_softirq_time)) -

Ditto.

> +			  cpustat[CPUTIME_SOFTIRQ];
> +	softirq_cputime = min(softirq_cputime, maxtime);
> +	cpustat[CPUTIME_SOFTIRQ] += softirq_cputime;
>  	local_irq_restore(flags);
> -	return ret;
> +	return softirq_cputime;
>  }
>  
>  #else /* CONFIG_IRQ_TIME_ACCOUNTING */
>  
>  #define sched_clock_irqtime	(0)
>  
> +static cputime_t irqtime_account_hi_update(cputime_t dummy)
> +{
> +	return 0;
> +}
> +
> +static cputime_t irqtime_account_si_update(cputime_t dummy)
> +{
> +	return 0;
> +}
> +
>  #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
>  
>  static inline void task_group_account_field(struct task_struct *p, int index,
> @@ -257,32 +267,45 @@ void account_idle_time(cputime_t cputime)
>  		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
>  }
>  
> -static __always_inline unsigned long steal_account_process_tick(unsigned long max_jiffies)
> +static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
>  {
>  #ifdef CONFIG_PARAVIRT
>  	if (static_key_false(&paravirt_steal_enabled)) {
> +		cputime_t steal_cputime;
>  		u64 steal;
> -		unsigned long steal_jiffies;
>  
>  		steal = paravirt_steal_clock(smp_processor_id());
>  		steal -= this_rq()->prev_steal_time;
> +		this_rq()->prev_steal_time += steal;

We are accounting steal_cputime but you make it remember steal_nsecs. This is
leaking quite some steal time along the way.

Imagine that cputime_t == jiffies_t and, to keep the numbers small, that one
jiffy accounts as 100 nsecs. paravirt_steal_clock() returns 199 nsecs, and
prev_steal_time has those 199 nsecs added. nsecs_to_cputime() returns 1 jiffy
(we are one nsec short of the next jiffy), so account_steal_time() accounts
1 jiffy and the 99 remaining nsecs are leaked. If some more steal time is to
be accounted on the next tick, the 99 previous nsecs are forgotten.

A non-leaking sequence would rather be:

	steal = paravirt_steal_clock(smp_processor_id());
	steal -= this_rq()->prev_steal_time;
	steal_cputime = min(nsecs_to_cputime(steal), maxtime);
	account_steal_time(steal_cputime);
	this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);

Thanks!
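
(A userspace toy showing the leak, using the made-up scale above where
one jiffy accounts as 100 nsecs:)

#include <stdio.h>

#define NSEC_PER_JIFFY 100ULL	/* made-up scale, small numbers only */

int main(void)
{
	unsigned long long steal_clock = 199;	/* paravirt_steal_clock() */
	unsigned long long prev, steal, jiffies;

	/* leaky version: prev remembers raw nsecs */
	prev = 0;
	steal = steal_clock - prev;
	prev += steal;
	printf("leaky: accounted %llu jiffy, %llu nsecs still pending\n",
	       steal / NSEC_PER_JIFFY, steal_clock - prev);	/* 0: leaked */

	/* fixed version: prev only advances by what was accounted */
	prev = 0;
	steal = steal_clock - prev;
	jiffies = steal / NSEC_PER_JIFFY;
	prev += jiffies * NSEC_PER_JIFFY;
	printf("fixed: accounted %llu jiffy, %llu nsecs still pending\n",
	       jiffies, steal_clock - prev);	/* 99: carried over */
	return 0;
}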

>  
> -		/*
> -		 * steal is in nsecs but our caller is expecting steal
> -		 * time in jiffies. Lets cast the result to jiffies
> -		 * granularity and account the rest on the next rounds.
> -		 */
> -		steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
> -		this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);
> +		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
> +		account_steal_time(steal_cputime);
>  
> -		account_steal_time(jiffies_to_cputime(steal_jiffies));
> -		return steal_jiffies;
> +		return steal_cputime;

* Re: [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle
  2016-06-30 19:35 [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle riel
                   ` (3 preceding siblings ...)
  2016-06-30 19:35 ` [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
@ 2016-07-05 13:02 ` Nikolay Borisov
  2016-07-05 13:09   ` Rik van Riel
  4 siblings, 1 reply; 18+ messages in thread
From: Nikolay Borisov @ 2016-07-05 13:02 UTC (permalink / raw)
  To: riel, linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar



On 06/30/2016 10:35 PM, riel@redhat.com wrote:
> Currently irq time accounting only works in these cases:
> 1) purely tick based accounting
> 2) nohz_full accounting, but only on housekeeping & nohz_full CPUs
> 3) architectures with native vtime accounting
> 
> On nohz_idle CPUs, which are probably the majority nowadays,
> irq time accounting is currently broken. This leads to systems
> reporting a dramatically lower amount of irq & softirq time than
> is actually spent handling them, with all the time spent while the
> system is in the idle task being accounted as idle.

Is this supposed to alleviate the issue reported here:
http://www.gossamer-threads.com/lists/linux/kernel/2169180

> 
> This patch set seems to bring the amount of irq time reported by
> top (and /proc/stat) roughly in line with that measured when I do
> a "perf record -g -a" run to see what is using all that time.
> 
> The amount of irq time used, especially softirq, is shockingly high,
> to the point that I thought this patch set might be wrong, but the
> numbers seem to match what perf is giving me...
> 
> These patches apply on top of Wanpeng Li's steal time patches.
> 
> CONFIG_IRQ_TIME_ACCOUNTING is now a config option that is available
> as a separate choice from tick based / nohz_idle / nohz_full mode,
> as suggested by Frederic Weisbecker.
> 
> Next up: look at the things that are using CPU time on an otherwise
> idle system, and see if I can make those a little faster :)
> 
> v2: address Peterz's concerns, some more cleanups
> v3: rewrite the code along Frederic's suggestions, now cputime_t
>     is used everywhere
> 

* Re: [PATCH 1/4] sched,time: count actually elapsed irq & softirq time
  2016-07-05 12:40   ` Frederic Weisbecker
@ 2016-07-05 13:08     ` Rik van Riel
  2016-07-05 14:00       ` Frederic Weisbecker
  2016-07-05 16:47     ` [PATCH v3 " Rik van Riel
  1 sibling, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2016-07-05 13:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Tue, 2016-07-05 at 14:40 +0200, Frederic Weisbecker wrote:
> On Thu, Jun 30, 2016 at 03:35:47PM -0400, riel@redhat.com wrote:
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 3d60e5d76fdb..018bae2ada36 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct
> > *curr)
> >  }
> >  EXPORT_SYMBOL_GPL(irqtime_account_irq);
> >  
> > -static int irqtime_account_hi_update(void)
> > +static cputime_t irqtime_account_hi_update(cputime_t maxtime)
> >  {
> >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> >  	unsigned long flags;
> > -	u64 latest_ns;
> > -	int ret = 0;
> > +	cputime_t irq_cputime;
> >  
> >  	local_irq_save(flags);
> > -	latest_ns = this_cpu_read(cpu_hardirq_time);
> > -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> > -		ret = 1;
> > +	irq_cputime =
> > nsecs_to_cputime(this_cpu_read(cpu_hardirq_time)) -
> > +		      cpustat[CPUTIME_IRQ];
> 
> We might want to keep nsecs_to_cputime64(). If cputime_t == jiffies_t
> == unsigned long,
> we may have a problem after 49 days of interrupts. Arguably that's a
> lot of IRQs
> but let's be paranoid.

The macro nsecs_to_cputime64 is only defined in
cputime_jiffies.h though, not in cputime_nsecs.h

Want me to add a #define to the second file?

> > +	irq_cputime = min(irq_cputime, maxtime);
> > +	cpustat[CPUTIME_IRQ] += irq_cputime;
> >  	local_irq_restore(flags);
> > -	return ret;
> > +	return irq_cputime;
> >  }
> >  
> > -static int irqtime_account_si_update(void)
> > +static cputime_t irqtime_account_si_update(cputime_t maxtime)
> >  {
> >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> >  	unsigned long flags;
> > -	u64 latest_ns;
> > -	int ret = 0;
> > +	cputime_t softirq_cputime;
> >  
> >  	local_irq_save(flags);
> > -	latest_ns = this_cpu_read(cpu_softirq_time);
> > -	if (nsecs_to_cputime64(latest_ns) >
> > cpustat[CPUTIME_SOFTIRQ])
> > -		ret = 1;
> > +	softirq_cputime =
> > nsecs_to_cputime(this_cpu_read(cpu_softirq_time)) -
> 
> Ditto.
> 
> > +			  cpustat[CPUTIME_SOFTIRQ];
> > +	softirq_cputime = min(softirq_cputime, maxtime);
> > +	cpustat[CPUTIME_SOFTIRQ] += softirq_cputime;
> >  	local_irq_restore(flags);
> > -	return ret;
> > +	return softirq_cputime;
> >  }
> >  
> >  #else /* CONFIG_IRQ_TIME_ACCOUNTING */
> >  
> >  #define sched_clock_irqtime	(0)
> >  
> > +static cputime_t irqtime_account_hi_update(cputime_t dummy)
> > +{
> > +	return 0;
> > +}
> > +
> > +static cputime_t irqtime_account_si_update(cputime_t dummy)
> > +{
> > +	return 0;
> > +}
> > +
> >  #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
> >  
> >  static inline void task_group_account_field(struct task_struct *p,
> > int index,
> > @@ -257,32 +267,45 @@ void account_idle_time(cputime_t cputime)
> >  		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
> >  }
> >  
> > -static __always_inline unsigned long
> > steal_account_process_tick(unsigned long max_jiffies)
> > +static __always_inline cputime_t
> > steal_account_process_time(cputime_t maxtime)
> >  {
> >  #ifdef CONFIG_PARAVIRT
> >  	if (static_key_false(&paravirt_steal_enabled)) {
> > +		cputime_t steal_cputime;
> >  		u64 steal;
> > -		unsigned long steal_jiffies;
> >  
> >  		steal = paravirt_steal_clock(smp_processor_id());
> >  		steal -= this_rq()->prev_steal_time;
> > +		this_rq()->prev_steal_time += steal;
> 
> We are accounting steal_cputime but you make it remember steal_nsecs.
> This is
> leaking quite some steal time along the way.
> 
> Imagine that cputime_t == jiffies_t and, to keep the numbers small,
> that one jiffy accounts as 100 nsecs. paravirt_steal_clock() returns
> 199 nsecs, and prev_steal_time has those 199 nsecs added.
> nsecs_to_cputime() returns 1 jiffy (we are one nsec short of the next
> jiffy), so account_steal_time() accounts 1 jiffy and the 99 remaining
> nsecs are leaked. If some more steal time is to be accounted on the
> next tick, the 99 previous nsecs are forgotten.
> 
> A non-leaking sequence would rather be:
> 
> 	steal = paravirt_steal_clock(smp_processor_id());
> 	steal -= this_rq()->prev_steal_time;
> 	steal_cputime = min(nsecs_to_cputime(steal), maxtime);
> 	account_steal_time(steal_cputime);
> 	this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
> 
> Thanks!

Good catch. I will fix this!

Thanks for reviewing.

-- 

All Rights Reversed.

* Re: [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle
  2016-07-05 13:02 ` [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle Nikolay Borisov
@ 2016-07-05 13:09   ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2016-07-05 13:09 UTC (permalink / raw)
  To: Nikolay Borisov, linux-kernel
  Cc: peterz, mingo, pbonzini, fweisbec, wanpeng.li, efault, tglx, rkrcmar

On Tue, 2016-07-05 at 16:02 +0300, Nikolay Borisov wrote:
> 
> On 06/30/2016 10:35 PM, riel@redhat.com wrote:
> > Currently irq time accounting only works in these cases:
> > 1) purely tick based accounting
> > 2) nohz_full accounting, but only on housekeeping & nohz_full CPUs
> > 3) architectures with native vtime accounting
> > 
> > On nohz_idle CPUs, which are probably the majority nowadays,
> > irq time accounting is currently broken. This leads to systems
> > reporting a dramatically lower amount of irq & softirq time than
> > is actually spent handling them, with all the time spent while the
> > system is in the idle task being accounted as idle.
> 
> Is this supposed to alleviate the issue reported here:
> http://www.gossamer-threads.com/lists/linux/kernel/2169180

Indeed, it is.

-- 

All Rights Reversed.

* Re: [PATCH 1/4] sched,time: count actually elapsed irq & softirq time
  2016-07-05 13:08     ` Rik van Riel
@ 2016-07-05 14:00       ` Frederic Weisbecker
  0 siblings, 0 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2016-07-05 14:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Tue, Jul 05, 2016 at 09:08:16AM -0400, Rik van Riel wrote:
> On Tue, 2016-07-05 at 14:40 +0200, Frederic Weisbecker wrote:
> > On Thu, Jun 30, 2016 at 03:35:47PM -0400, riel@redhat.com wrote:
> > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > > index 3d60e5d76fdb..018bae2ada36 100644
> > > --- a/kernel/sched/cputime.c
> > > +++ b/kernel/sched/cputime.c
> > > @@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct
> > > *curr)
> > >  }
> > >  EXPORT_SYMBOL_GPL(irqtime_account_irq);
> > >  
> > > -static int irqtime_account_hi_update(void)
> > > +static cputime_t irqtime_account_hi_update(cputime_t maxtime)
> > >  {
> > >  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> > >  	unsigned long flags;
> > > -	u64 latest_ns;
> > > -	int ret = 0;
> > > +	cputime_t irq_cputime;
> > >  
> > >  	local_irq_save(flags);
> > > -	latest_ns = this_cpu_read(cpu_hardirq_time);
> > > -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> > > -		ret = 1;
> > > +	irq_cputime =
> > > nsecs_to_cputime(this_cpu_read(cpu_hardirq_time)) -
> > > +		      cpustat[CPUTIME_IRQ];
> > 
> > We might want to keep nsecs_to_cputime64(). If cputime_t == jiffies_t
> > == unsigned long,
> > we may have a problem after 49 days of interrupts. Arguably that's a
> > lot of IRQs
> > but let's be paranoid.
> 
> The macro nsecs_to_cputime64 is only defined in
> cputime_jiffies.h though, not in cputime_nsecs.h

Ah indeed.

> Want me to add a #define to the second file?

Yes please. It should be a simple cast.

Thanks!

* [PATCH v3 1/4] sched,time: count actually elapsed irq & softirq time
  2016-07-05 12:40   ` Frederic Weisbecker
  2016-07-05 13:08     ` Rik van Riel
@ 2016-07-05 16:47     ` Rik van Riel
  2016-07-06 14:15       ` Frederic Weisbecker
  1 sibling, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2016-07-05 16:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar, kernel

On Tue, 5 Jul 2016 14:40:34 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> ...

Thanks for the review Frederic. Here is a fixed version of patch 1,
the other three still seem to apply cleanly.

---8<---

Subject: sched,time: count actually elapsed irq & softirq time

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period is accounted as irq or softirq
time.

This is inaccurate if only a subset of the time was actually spent
handling irqs, and could conceivably mis-count all of the ticks during
a period as irq time, when there was some irq and some softirq time.

This can actually happen when irqtime_account_process_tick is called
from account_idle_ticks, which can pass a larger number of ticks down
all at once.

Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
and steal_account_process_tick to work with cputime_t time units, and
return the amount of time spent in each mode.

Rename steal_account_process_tick to steal_account_process_time, to
reflect that time is now accounted in cputime_t, instead of ticks.

Additionally, have irqtime_account_process_tick take into account how
much time was spent in each of steal, irq, and softirq time.

The latter could help improve the accuracy of cputime
accounting when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/cputime_nsecs.h |   2 +
 kernel/sched/cputime.c              | 124 ++++++++++++++++++++++--------------
 2 files changed, 79 insertions(+), 47 deletions(-)

diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h
index 0f1c6f315cdc..918ebb01486c 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -50,6 +50,8 @@ typedef u64 __nocast cputime64_t;
 	(__force u64)(__ct)
 #define nsecs_to_cputime(__nsecs)	\
 	(__force cputime_t)(__nsecs)
+#define nsecs_to_cputime64(__nsecs)	\
+	(__force cputime_t)(__nsecs)
 
 
 /*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d76fdb..db82ae12cf01 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static cputime_t irqtime_account_hi_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t irq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)) -
+		      cpustat[CPUTIME_IRQ];
+	irq_cputime = min(irq_cputime, maxtime);
+	cpustat[CPUTIME_IRQ] += irq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return irq_cputime;
 }
 
-static int irqtime_account_si_update(void)
+static cputime_t irqtime_account_si_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t softirq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_softirq_time)) -
+			  cpustat[CPUTIME_SOFTIRQ];
+	softirq_cputime = min(softirq_cputime, maxtime);
+	cpustat[CPUTIME_SOFTIRQ] += softirq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return softirq_cputime;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static cputime_t irqtime_account_hi_update(cputime_t dummy)
+{
+	return 0;
+}
+
+static cputime_t irqtime_account_si_update(cputime_t dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -257,32 +267,45 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline unsigned long steal_account_process_tick(unsigned long max_jiffies)
+static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
+		cputime_t steal_cputime;
 		u64 steal;
-		unsigned long steal_jiffies;
 
 		steal = paravirt_steal_clock(smp_processor_id());
 		steal -= this_rq()->prev_steal_time;
 
-		/*
-		 * steal is in nsecs but our caller is expecting steal
-		 * time in jiffies. Lets cast the result to jiffies
-		 * granularity and account the rest on the next rounds.
-		 */
-		steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
-		this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);
+		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		account_steal_time(steal_cputime);
+		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
 
-		account_steal_time(jiffies_to_cputime(steal_jiffies));
-		return steal_jiffies;
+		return steal_cputime;
 	}
 #endif
 	return 0;
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ */
+static inline cputime_t account_other_time(cputime_t max)
+{
+	cputime_t accounted;
+
+	accounted = steal_account_process_time(max);
+
+	if (accounted < max)
+		accounted += irqtime_account_hi_update(max - accounted);
+
+	if (accounted < max)
+		accounted += irqtime_account_si_update(max - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -342,21 +365,23 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 					 struct rq *rq, int ticks)
 {
-	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
-	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	u64 cputime = (__force u64) cputime_one_jiffy * ticks;
+	cputime_t scaled, other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_time(cputime);
+	if (other >= cputime)
 		return;
+	cputime -= other;
+	scaled = cputime_to_scaled(cputime);
 
-	cputime *= ticks;
-	scaled *= ticks;
-
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
@@ -466,7 +491,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
  */
 void account_process_tick(struct task_struct *p, int user_tick)
 {
-	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
+	cputime_t cputime, scaled, steal;
 	struct rq *rq = this_rq();
 
 	if (vtime_accounting_cpu_enabled())
@@ -477,16 +502,21 @@ void account_process_tick(struct task_struct *p, int user_tick)
 		return;
 	}
 
-	if (steal_account_process_tick(ULONG_MAX))
+	cputime = cputime_one_jiffy;
+	steal = steal_account_process_time(cputime);
+
+	if (steal >= cputime)
 		return;
 
+	cputime -= steal;
+	scaled = cputime_to_scaled(cputime);
+
 	if (user_tick)
-		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
+		account_user_time(p, cputime, scaled);
 	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
-		account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
-				    one_jiffy_scaled);
+		account_system_time(p, HARDIRQ_OFFSET, cputime, scaled);
 	else
-		account_idle_time(cputime_one_jiffy);
+		account_idle_time(cputime);
 }
 
 /*
@@ -681,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	cputime_t delta, steal;
 
-	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	delta = jiffies_to_cputime(now - tsk->vtime_snap);
+	steal = steal_account_process_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return delta - steal;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)

* Re: [PATCH v3 1/4] sched,time: count actually elapsed irq & softirq time
  2016-07-05 16:47     ` [PATCH v3 " Rik van Riel
@ 2016-07-06 14:15       ` Frederic Weisbecker
  0 siblings, 0 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2016-07-06 14:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar, kernel

On Tue, Jul 05, 2016 at 12:47:07PM -0400, Rik van Riel wrote:
> On Tue, 5 Jul 2016 14:40:34 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > ...
> 
> Thanks for the review Frederic. Here is a fixed version of patch 1,
> the other three still seem to apply cleanly.
> 
> ---8<---
> 
> Subject: sched,time: count actually elapsed irq & softirq time
> 
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period is accounted as irq or softirq
> time.
> 
> This is inaccurate if only a subset of the time was actually spent
> handling irqs, and could conceivably mis-count all of the ticks during
> a period as irq time, when there was some irq and some softirq time.
> 
> This can actually happen when irqtime_account_process_tick is called
> from account_idle_ticks, which can pass a larger number of ticks down
> all at once.
> 
> Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
> and steal_account_process_tick to work with cputime_t time units, and
> return the amount of time spent in each mode.
> 
> Rename steal_account_process_tick to steal_account_process_time, to
> reflect that time is now accounted in cputime_t, instead of ticks.
> 
> Additionally, have irqtime_account_process_tick take into account how
> much time was spent in each of steal, irq, and softirq time.
> 
> The latter could help improve the accuracy of cputime
> accounting when returning from idle on a NO_HZ_IDLE CPU.
> 
> Properly accounting how much time was spent in hardirq and
> softirq time will also allow the NO_HZ_FULL code to re-use
> these same functions for hardirq and softirq accounting.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  include/asm-generic/cputime_nsecs.h |   2 +
>  kernel/sched/cputime.c              | 124 ++++++++++++++++++++++--------------
>  2 files changed, 79 insertions(+), 47 deletions(-)
> 
> diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h
> index 0f1c6f315cdc..918ebb01486c 100644
> --- a/include/asm-generic/cputime_nsecs.h
> +++ b/include/asm-generic/cputime_nsecs.h
> @@ -50,6 +50,8 @@ typedef u64 __nocast cputime64_t;
>  	(__force u64)(__ct)
>  #define nsecs_to_cputime(__nsecs)	\
>  	(__force cputime_t)(__nsecs)
> +#define nsecs_to_cputime64(__nsecs)	\
> +	(__force cputime_t)(__nsecs)

(__force cputime64_t) :-)

Although functionally that shouldn't matter for cputime_nsecs.h; you probably
won't get a warning either.
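
In other words, the intended definition, per the correction above, would be:

    #define nsecs_to_cputime64(__nsecs)	\
            (__force cputime64_t)(__nsecs)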

Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-06-30 19:35 ` [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
@ 2016-07-08 12:30   ` Frederic Weisbecker
  2016-07-08 13:19     ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Frederic Weisbecker @ 2016-07-08 12:30 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Thu, Jun 30, 2016 at 03:35:50PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Drop local_irq_save/restore from irqtime_account_irq.
> Instead, have softirq and hardirq track their time spent
> independently, with the softirq code subtracting hardirq
> time that happened during the duration of the softirq run.
> 
> The softirq code can be interrupted by hardirq code at
> any point in time, but it can check whether it got a
> consistent snapshot of the timekeeping variables it wants,
> and loop around in the unlikely case that it did not.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

So the purpose is to get rid of local_irq_save/restore()?
Is it really worth such complication?

> ---
>  kernel/sched/cputime.c | 72 +++++++++++++++++++++++++++++++++++++++++---------
>  kernel/sched/sched.h   | 38 +++++++++++++++++++++-----
>  2 files changed, 90 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index a0aefd4c7ea6..b78991fac228 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -26,7 +26,9 @@
>  DEFINE_PER_CPU(u64, cpu_hardirq_time);
>  DEFINE_PER_CPU(u64, cpu_softirq_time);
>  
> -static DEFINE_PER_CPU(u64, irq_start_time);
> +static DEFINE_PER_CPU(u64, hardirq_start_time);
> +static DEFINE_PER_CPU(u64, softirq_start_time);
> +static DEFINE_PER_CPU(u64, prev_hardirq_time);
>  static int sched_clock_irqtime;
>  
>  void enable_sched_clock_irqtime(void)
> @@ -41,6 +43,7 @@ void disable_sched_clock_irqtime(void)
>  
>  #ifndef CONFIG_64BIT
>  DEFINE_PER_CPU(seqcount_t, irq_time_seq);
> +DEFINE_PER_CPU(seqcount_t, softirq_time_seq);
>  #endif /* CONFIG_64BIT */
>  
>  /*
> @@ -53,36 +56,79 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
>   * softirq -> hardirq, hardirq -> softirq
>   *
>   * When exiting hardirq or softirq time, account the elapsed time.
> + *
> + * When exiting softirq time, subtract the amount of hardirq time that
> + * interrupted this softirq run, to avoid double accounting of that time.
>   */
>  void irqtime_account_irq(struct task_struct *curr, int irqtype)
>  {
> -	unsigned long flags;
> +	u64 prev_softirq_start;
> +	bool leaving_softirq;
> +	u64 prev_hardirq;
> +	u64 hardirq_time;
>  	s64 delta;
>  	int cpu;
>  
>  	if (!sched_clock_irqtime)
>  		return;
>  
> -	local_irq_save(flags);
> -
>  	cpu = smp_processor_id();
> -	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
> -	__this_cpu_add(irq_start_time, delta);
>  
> -	irq_time_write_begin();
> +	/*
> +	 * Hardirq time accounting is pretty straightforward. If not in
> +	 * hardirq context yet (entering hardirq), set the start time.
> +	 * If already in hardirq context (leaving), account the elapsed time.
> +	 */
> +	if (irqtype == HARDIRQ_OFFSET) {
> +		bool leaving_hardirq = hardirq_count();
> +		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
> +		__this_cpu_add(hardirq_start_time, delta);
> +		if (leaving_hardirq) {
> +			hardirq_time_write_begin();
> +			__this_cpu_add(cpu_hardirq_time, delta);
> +			hardirq_time_write_end();
> +		}

This doesn't seem to work with nesting hardirqs.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-07-08 12:30   ` Frederic Weisbecker
@ 2016-07-08 13:19     ` Rik van Riel
  2016-07-08 14:01       ` Frederic Weisbecker
  2016-07-08 14:34       ` Paolo Bonzini
  0 siblings, 2 replies; 18+ messages in thread
From: Rik van Riel @ 2016-07-08 13:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar


On Fri, 2016-07-08 at 14:30 +0200, Frederic Weisbecker wrote:
> On Thu, Jun 30, 2016 at 03:35:50PM -0400, riel@redhat.com wrote:
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Drop local_irq_save/restore from irqtime_account_irq.
> > Instead, have softirq and hardirq track their time spent
> > independently, with the softirq code subtracting hardirq
> > time that happened during the duration of the softirq run.
> > 
> > The softirq code can be interrupted by hardirq code at
> > any point in time, but it can check whether it got a
> > consistent snapshot of the timekeeping variables it wants,
> > and loop around in the unlikely case that it did not.
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> So the purpose is to get rid of local_irq_save/restore()?
> Is it really worth such complication?

local_irq_save/restore are quite slow, and look like the
largest source of overhead in irq time accounting.

However, I have not gotten numbers yet, and have no problem
with this patch being dropped for now.

> > ---
> >  kernel/sched/cputime.c | 72 +++++++++++++++++++++++++++++++++++++++++---------
> >  kernel/sched/sched.h   | 38 +++++++++++++++++++++-----
> >  2 files changed, 90 insertions(+), 20 deletions(-)
> > 
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index a0aefd4c7ea6..b78991fac228 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -26,7 +26,9 @@
> >  DEFINE_PER_CPU(u64, cpu_hardirq_time);
> >  DEFINE_PER_CPU(u64, cpu_softirq_time);
> >  
> > -static DEFINE_PER_CPU(u64, irq_start_time);
> > +static DEFINE_PER_CPU(u64, hardirq_start_time);
> > +static DEFINE_PER_CPU(u64, softirq_start_time);
> > +static DEFINE_PER_CPU(u64, prev_hardirq_time);
> >  static int sched_clock_irqtime;
> >  
> >  void enable_sched_clock_irqtime(void)
> > @@ -41,6 +43,7 @@ void disable_sched_clock_irqtime(void)
> >  
> >  #ifndef CONFIG_64BIT
> >  DEFINE_PER_CPU(seqcount_t, irq_time_seq);
> > +DEFINE_PER_CPU(seqcount_t, softirq_time_seq);
> >  #endif /* CONFIG_64BIT */
> >  
> >  /*
> > @@ -53,36 +56,79 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
> >   * softirq -> hardirq, hardirq -> softirq
> >   *
> >   * When exiting hardirq or softirq time, account the elapsed time.
> > + *
> > + * When exiting softirq time, subtract the amount of hardirq time that
> > + * interrupted this softirq run, to avoid double accounting of that time.
> >   */
> >  void irqtime_account_irq(struct task_struct *curr, int irqtype)
> >  {
> > -	unsigned long flags;
> > +	u64 prev_softirq_start;
> > +	bool leaving_softirq;
> > +	u64 prev_hardirq;
> > +	u64 hardirq_time;
> >  	s64 delta;
> >  	int cpu;
> >  
> >  	if (!sched_clock_irqtime)
> >  		return;
> >  
> > -	local_irq_save(flags);
> > -
> >  	cpu = smp_processor_id();
> > -	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
> > -	__this_cpu_add(irq_start_time, delta);
> >  
> > -	irq_time_write_begin();
> > +	/*
> > +	 * Hardirq time accounting is pretty straightforward. If not in
> > +	 * hardirq context yet (entering hardirq), set the start time.
> > +	 * If already in hardirq context (leaving), account the elapsed time.
> > +	 */
> > +	if (irqtype == HARDIRQ_OFFSET) {
> > +		bool leaving_hardirq = hardirq_count();
> > +		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
> > +		__this_cpu_add(hardirq_start_time, delta);
> > +		if (leaving_hardirq) {
> > +			hardirq_time_write_begin();
> > +			__this_cpu_add(cpu_hardirq_time, delta);
> > +			hardirq_time_write_end();
> > +		}
> 
> This doesn't seem to work with nesting hardirqs.
> 
> Thanks.

Where does it break?

enter hardirq A -> hardirq_start_time = now

enter hardirq B -> hardirq_start_time = now,
                   account already elapsed time

leave hardirq B -> account elapsed time, set
                   hardirq_start_time = now

leave hardirq A -> account elapsed time

What am I missing, except a softirq-style do-while
loop to account for hardirq A being interrupted by
hardirq B while updating the statistics?

-- 

All Rights Reversed.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-07-08 13:19     ` Rik van Riel
@ 2016-07-08 14:01       ` Frederic Weisbecker
  2016-07-08 14:34       ` Paolo Bonzini
  1 sibling, 0 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2016-07-08 14:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, mingo, pbonzini, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Fri, Jul 08, 2016 at 09:19:47AM -0400, Rik van Riel wrote:
> On Fri, 2016-07-08 at 14:30 +0200, Frederic Weisbecker wrote:
> > > +	if (irqtype == HARDIRQ_OFFSET) {
> > > +		bool leaving_hardirq = hardirq_count();
> > > +		delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
> > > +		__this_cpu_add(hardirq_start_time, delta);
> > > +		if (leaving_hardirq) {
> > > +			hardirq_time_write_begin();
> > > +			__this_cpu_add(cpu_hardirq_time, delta);
> > > +			hardirq_time_write_end();
> > > +		}
> > 
> > This doesn't seem to work with nesting hardirqs.
> > 
> > Thanks.
> 
> Where does it break?
> 
> enter hardirq A -> hardirq_start_time = now
> 
> enter hardirq B -> hardirq_start_time = now,
>                    account already elapsed time
> 
> leave hardirq B -> account elapsed time, set
>                    hardirq_start_time = now
> 
> leave hardirq A -> account elapsed time
> 
> What am I missing, except a softirq-style do-while
> loop to account for hardirq A being interrupted by
> hardirq B while updating the statistics?

How about:

enter hardirq A -> delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
   enter hardirq B -> delta = sched_clock_cpu(cpu) - __this_cpu_read(hardirq_start_time);
   __this_cpu_add(hardirq_start_time, delta);
   ...
   exit hardirq B
__this_cpu_add(hardirq_start_time, delta);

Also seqcount writers can't nest.
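
To spell out the corruption with hypothetical timestamps (an editorial
illustration of the interleaving above, not code from the patch; assume
hardirq_start_time starts at 100 and B fires between A's read and add):

    t=110  enter A    delta_A = 110 - 100 = 10   (A's add not yet executed)
    t=112  enter B    delta_B = 112 - 100 = 12   (same, now stale, base)
                      hardirq_start_time += 12   -> base = 112
    t=120  exit B     accounts 120 - 112 = 8,    base -> 120
           A resumes  hardirq_start_time += 10   -> base = 130, ahead of the clock
    t=125  exit A     delta = 125 - 130 = -5     (negative delta, accounting corrupt)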

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-07-08 13:19     ` Rik van Riel
  2016-07-08 14:01       ` Frederic Weisbecker
@ 2016-07-08 14:34       ` Paolo Bonzini
  2016-07-08 15:56         ` Rik van Riel
  1 sibling, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2016-07-08 14:34 UTC (permalink / raw)
  To: Rik van Riel, Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar


On 08/07/2016 15:19, Rik van Riel wrote:
> On Fri, 2016-07-08 at 14:30 +0200, Frederic Weisbecker wrote:
>> On Thu, Jun 30, 2016 at 03:35:50PM -0400, riel@redhat.com wrote:
>>> From: Rik van Riel <riel@redhat.com>
>>>
>>> Drop local_irq_save/restore from irqtime_account_irq.
>>> Instead, have softirq and hardirq track their time spent
>>> independently, with the softirq code subtracting hardirq
>>> time that happened during the duration of the softirq run.
>>>
>>> The softirq code can be interrupted by hardirq code at
>>> any point in time, but it can check whether it got a
>>> consistent snapshot of the timekeeping variables it wants,
>>> and loop around in the unlikely case that it did not.
>>>
>>> Signed-off-by: Rik van Riel <riel@redhat.com>
>>
>> So the purpose is to get rid of local_irq_save/restore()?
>> Is it really worth such complication?
> 
> local_irq_save/restore are quite slow, and look like the
> largest source of overhead in irq time accounting.

I'm looking at an upstream tree, without your patches applied,
but it seems to me that irqtime_account_irq is always called with
interrupts disabled:

irqtime_account_irq
	-> account_irq_enter_time
		-> __irq_enter
			-> HARDIRQ_ENTER	[1]
			-> irq_enter		[3]
		-> __do_softirq			[1]
	-> account_irq_exit_time
		-> __do_softirq			[1]
		-> __irq_exit
			-> HARDIRQ_EXIT		[1]
		-> irq_exit			[2]

	[1] = does local_irq_disable/enable
	[2] = contains WARN_ON_ONCE(!irqs_disabled())
	[3] = calls rcu_irq_enter(), which checks irqs_disabled()

I don't think your first two patches change this, so perhaps it's enough
to remove that local_irq_save/restore?  Either this, or ENEEDWEEKEND...

Paolo
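
For concreteness, if this analysis holds, the patch could shrink to deleting
the save/restore pair from the existing function. A minimal sketch, based on
the unpatched upstream irqtime_account_irq() of the time plus the assumption
that every caller runs with interrupts disabled (the WARN_ON_ONCE is an
editorial addition as a cheap guard, not part of any posted patch):

    void irqtime_account_irq(struct task_struct *curr)
    {
            s64 delta;
            int cpu;

            if (!sched_clock_irqtime)
                    return;

            /* Callers (irq_enter/irq_exit, __do_softirq) run with irqs off. */
            WARN_ON_ONCE(!irqs_disabled());

            cpu = smp_processor_id();
            delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
            __this_cpu_add(irq_start_time, delta);

            irq_time_write_begin();
            /* ksoftirqd's softirq time stays accounted to the task itself. */
            if (hardirq_count())
                    __this_cpu_add(cpu_hardirq_time, delta);
            else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
                    __this_cpu_add(cpu_softirq_time, delta);
            irq_time_write_end();
    }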



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-07-08 14:34       ` Paolo Bonzini
@ 2016-07-08 15:56         ` Rik van Riel
  2016-07-08 23:58           ` Frederic Weisbecker
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2016-07-08 15:56 UTC (permalink / raw)
  To: Paolo Bonzini, Frederic Weisbecker
  Cc: linux-kernel, peterz, mingo, fweisbec, wanpeng.li, efault, tglx, rkrcmar


On Fri, 2016-07-08 at 16:34 +0200, Paolo Bonzini wrote:
> 
> On 08/07/2016 15:19, Rik van Riel wrote:
> > On Fri, 2016-07-08 at 14:30 +0200, Frederic Weisbecker wrote:
> > > On Thu, Jun 30, 2016 at 03:35:50PM -0400, riel@redhat.com wrote:
> > > > From: Rik van Riel <riel@redhat.com>
> > > > 
> > > > Drop local_irq_save/restore from irqtime_account_irq.
> > > > Instead, have softirq and hardirq track their time spent
> > > > independently, with the softirq code subtracting hardirq
> > > > time that happened during the duration of the softirq run.
> > > > 
> > > > The softirq code can be interrupted by hardirq code at
> > > > any point in time, but it can check whether it got a
> > > > consistent snapshot of the timekeeping variables it wants,
> > > > and loop around in the unlikely case that it did not.
> > > > 
> > > > Signed-off-by: Rik van Riel <riel@redhat.com>
> > > 
> > > So the purpose is to get rid of local_irq_save/restore()?
> > > Is it really worth such complication?
> > 
> > local_irq_save/restore are quite slow, and look like the
> > largest source of overhead in irq time accounting.
> 
> I'm looking at an upstream tree, without your patches applied,
> but it seems to me that irqtime_account_irq is always called with
> interrupts disabled:
> 
> irqtime_account_irq
> 	-> account_irq_enter_time
> 		-> __irq_enter
> 			-> HARDIRQ_ENTER	[1]
> 			-> irq_enter		[3]
> 		-> __do_softirq			[1]
> 	-> account_irq_exit_time
> 		-> __do_softirq			[1]
> 		-> __irq_exit
> 			-> HARDIRQ_EXIT		[1]
> 		-> irq_exit			[2]
> 
> 	[1] = does local_irq_disable/enable
> 	[2] = contains WARN_ON_ONCE(!irqs_disabled())
> 	[3] = calls rcu_irq_enter(), which checks irqs_disabled()
> 
> I don't think your first two patches change this, so perhaps it's enough
> to remove that local_irq_save/restore?  Either this, or ENEEDWEEKEND...

I think you are right!

__do_softirq() calls account_irq_enter_time() with irqs
already disabled, and also has irqs disabled when it
calls account_irq_exit_time().

This appears to be true for both ksoftirqd and softirq
from irq context.

This could simplify my patch series a lot :)
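
The call sites in question, abridged from kernel/softirq.c of that era (an
editorial excerpt for context; "..." marks elided code, and exact lines may
differ between versions):

    asmlinkage __visible void __softirq_entry __do_softirq(void)
    {
            ...
            pending = local_softirq_pending();
            account_irq_enter_time(current);   /* irqs still disabled here */

            __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
    restart:
            set_softirq_pending(0);
            local_irq_enable();                /* only enabled afterwards */
            ...
            local_irq_disable();
            ...
            account_irq_exit_time(current);    /* disabled again on exit */
            __local_bh_enable(SOFTIRQ_OFFSET);
    }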

-- 

All Rights Reversed.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq
  2016-07-08 15:56         ` Rik van Riel
@ 2016-07-08 23:58           ` Frederic Weisbecker
  0 siblings, 0 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2016-07-08 23:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, linux-kernel, peterz, mingo, fweisbec, wanpeng.li,
	efault, tglx, rkrcmar

On Fri, Jul 08, 2016 at 11:56:49AM -0400, Rik van Riel wrote:
> On Fri, 2016-07-08 at 16:34 +0200, Paolo Bonzini wrote:
> > I'm looking at an upstream tree, without your patches applied,
> > but it seems to me that irqtime_account_irq is always called with
> > interrupts disabled:
> > 
> > irqtime_account_irq
> > 	-> account_irq_enter_time
> > 		-> __irq_enter
> > 			-> HARDIRQ_ENTER	[1]
> > 			-> irq_enter		[3]
> > 		-> __do_softirq			[1]
> > 	-> account_irq_exit_time
> > 		-> __do_softirq			[1]
> > 		-> __irq_exit
> > 			-> HARDIRQ_EXIT		[1]
> > 		-> irq_exit			[2]
> > 
> > 	[1] = does local_irq_disable/enable
> > 	[2] = contains WARN_ON_ONCE(!irqs_disabled())
> > 	[3] = calls rcu_irq_enter(), which checks irqs_disabled()
> > 
> > I don't think your first two patches change this, so perhaps it's enough
> > to remove that local_irq_save/restore?  Either this, or ENEEDWEEKEND...

Good catch Paolo!

> 
> I think you are right!
> 
> __do_softirq() calls account_irq_enter_time() with irqs
> already disabled, and also has irqs disabled when it
> calls account_irq_exit_time()
> 
> This appears to be true for both ksoftirqd and softirq
> from irq context.

Indeed! And irq_enter()/irq_exit() require irqs to be disabled.

The other users are __irq_enter() and __irq_exit(), called by the
lockdep selftests, which take care of that too.

I just did a boot test with a WARN_ON_ONCE(!irqs_disabled())
in account_irq_enter_time() and it did not trigger.
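
Presumably something like this (an editorial sketch of the described check,
against the unpatched include/linux/vtime.h; the exact body may differ):

    static inline void account_irq_enter_time(struct task_struct *tsk)
    {
            WARN_ON_ONCE(!irqs_disabled());    /* temporary debug check */
            vtime_account_irq_enter(tsk);
            irqtime_account_irq(tsk);
    }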

> 
> This could simplify my patch series a lot :)

Definitely! ;-)

Would you mind resending it?

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-07-08 23:58 UTC | newest]

Thread overview: 18+ messages
2016-06-30 19:35 [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle riel
2016-06-30 19:35 ` [PATCH 1/4] sched,time: count actually elapsed irq & softirq time riel
2016-07-05 12:40   ` Frederic Weisbecker
2016-07-05 13:08     ` Rik van Riel
2016-07-05 14:00       ` Frederic Weisbecker
2016-07-05 16:47     ` [PATCH v3 " Rik van Riel
2016-07-06 14:15       ` Frederic Weisbecker
2016-06-30 19:35 ` [PATCH 2/4] nohz,cputime: replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code riel
2016-06-30 19:35 ` [PATCH 3/4] irqtime: add irq type parameter to irqtime_account_irq riel
2016-06-30 19:35 ` [PATCH 4/4] irqtime: drop local_irq_save/restore from irqtime_account_irq riel
2016-07-08 12:30   ` Frederic Weisbecker
2016-07-08 13:19     ` Rik van Riel
2016-07-08 14:01       ` Frederic Weisbecker
2016-07-08 14:34       ` Paolo Bonzini
2016-07-08 15:56         ` Rik van Riel
2016-07-08 23:58           ` Frederic Weisbecker
2016-07-05 13:02 ` [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle Nikolay Borisov
2016-07-05 13:09   ` Rik van Riel
