* [PATCH 0/6] Proper kernel irq time accounting
@ 2010-09-17  1:56 Venkatesh Pallipadi
  2010-09-17  1:56 ` [PATCH 1/6] Consolidate account_system_vtime extern declaration Venkatesh Pallipadi
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-17  1:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner

Here is some background information about interrupt time accounting in the
kernel and the related problems.

Interrupts always run in the context of the currently running task. Softirqs
mostly run in the context of the currently running task as well, unless
ksoftirqd gets involved. /proc/interrupts and /proc/softirqs are the interfaces
that report the number of interrupts and softirqs per CPU since boot.
/proc/stat has fields that report per-CPU and system-wide interrupt and softirq
processing time in clock_t units.
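
For reference, here is a small user-space sketch (not part of this patchset)
that reads those tick-based per-CPU irq/softirq fields from /proc/stat and
converts them from clock_t (USER_HZ) units to seconds; the field positions
follow the documented /proc/stat layout:

/* Illustrative only: print tick-based irq/softirq time from /proc/stat. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char cpu[16];
	unsigned long long user, nice, sys, idle, iowait, irq, softirq;
	long hz = sysconf(_SC_CLK_TCK);
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;

	while (fscanf(f, "%15s %llu %llu %llu %llu %llu %llu %llu%*[^\n]",
		      cpu, &user, &nice, &sys, &idle, &iowait,
		      &irq, &softirq) == 8) {
		if (cpu[0] != 'c')	/* past the cpu/cpuN lines */
			break;
		printf("%s: irq %.2fs softirq %.2fs\n", cpu,
		       (double)irq / hz, (double)softirq / hz);
	}
	fclose(f);
	return 0;
}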


There are two problems with the way interrupts are accounted by the kernel.

(1) Coarse grained interrupt time reporting
On most archs (except s390, powerpc and ia64 with CONFIG_VIRT_CPU_ACCOUNTING),
the interrupt and softirq time reported in /proc/stat is tick-sample based.
The kernel looks at what it is doing at each CPU tick and accounts the entire
tick to that particular activity. This means the data in /proc/stat is
pretty coarse grained.

One related problem (at least on x86): with the recent
"Run irq handlers with interrupts disabled" change, the timer interrupt cannot
fire while another interrupt is being serviced [1]. As a result, the sampling
based hardirq time in /proc/stat cannot capture any hardirq time at all.

(2) Accounting irq processing time to the current task/taskgroup
Whenever irq processing happens, the kernel accounts that time to the currently
running task. The exec_runtime reported in /proc/<pid>/schedstat and
<cgroup>/cpuacct.usage* includes any irq processing that happened while
the task was running. The scheduler vruntime calculations
also account any irq processing to the current task. This means exec time
accounting during heavy irq processing is somewhat random, depending on
when the irq processing happens, which CPU it happens on, and which task
happened to be running on that CPU at the time.


The solution to (1) involves taking extra timestamps on irq entry/exit to
get the fine-grained information and then exporting it to userspace.
The following patchset addresses this problem in a way similar to [2][3].
It keeps most of the code that does the timing generic
(CONFIG_IRQ_TIME_ACCOUNTING), based off of sched_clock(), and adds support for
this on x86. The new fine-grained time information is exported in
/proc/interrupts and /proc/softirqs as a reference implementation. Whether
it actually belongs there or somewhere else is open for discussion.
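
As a usage illustration, a minimal sketch (again not part of the series) of
picking up the proposed per-CPU "Time:" line, assuming the format added in
patch 6/6 (values are clock_t in USER_HZ units):

/* Minimal sketch: print the proposed "Time:" line from /proc/softirqs. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];
	FILE *f = fopen("/proc/softirqs", "r");

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, "Time:");
		if (p)
			printf("per-CPU softirq time (USER_HZ):%s", p + 5);
	}
	fclose(f);
	return 0;
}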


One partial solution for (2) above, proposed in [2][3], was to capture this
interrupt time at the task/taskgroup level and let the user know how much irq
processing time the kernel charged to each task/taskgroup. But that solution
did not address the task timeslice itself still including irq processing time.
Peter Zijlstra and Martin Schwidefsky disagreed with that approach and
wanted to see a more complete solution that does not account irq processing
time to tasks at all.

The patchset below attempts this more complete solution, with two scheduler
related changes. First, take irq processing time out of the time the
scheduler accounts to the task. Second, adjust CPU power to account for
irq processing activity on the CPU. That in turn results in an irq-busy CPU
pulling only as many tasks as its non-irq processing bandwidth can
accommodate.
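
To make the CPU power idea concrete, here is a simplified, illustrative
sketch of the underlying arithmetic (the actual patch 5/6 feeds the irq time
into sched_rt_avg_update() rather than computing this directly; the function
name here is made up):

#include <stdio.h>
#include <stdint.h>

/* Scale a CPU's nominal power by the fraction of time not spent in irqs. */
static unsigned long scale_power_by_irq(unsigned long power,
					uint64_t irq_ns, uint64_t period_ns)
{
	if (irq_ns >= period_ns)
		return 1;	/* keep a minimal, non-zero capacity */

	return (unsigned long)(power * (period_ns - irq_ns) / period_ns);
}

int main(void)
{
	/* 75% of the last 10ms spent in irq processing -> ~25% power left */
	printf("scaled power: %lu of 1024\n",
	       scale_power_by_irq(1024, 7500000ULL, 10000000ULL));
	return 0;
}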

The changes here are only enabled for CONFIG_IRQ_TIME_ACCOUNTING as of now.
I haven't gotten enough testing cycles on this patchset yet, but here it is
for review/feedback.

Thanks,
Venki

References:

[1] http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00864.html
    lkml subject - "genirq: Run irq handlers with interrupts disabled"

[2] http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00411.html
    lkml subject - "Finer granularity and task/cgroup irq time accounting"

[3] http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/00987.html
    lkml subject - "Finer granularity and task/cgroup irq time accounting"



* [PATCH 1/6] Consolidate account_system_vtime extern declaration
  2010-09-17  1:56 [PATCH 0/6] Proper kernel irq time accounting Venkatesh Pallipadi
@ 2010-09-17  1:56 ` Venkatesh Pallipadi
  2010-09-17  1:56 ` [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time Venkatesh Pallipadi
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-17  1:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Venkatesh Pallipadi

Just a minor cleanup patch that makes things easier for the following patches.
No functional change in this patch.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/ia64/include/asm/system.h    |    4 ----
 arch/powerpc/include/asm/system.h |    4 ----
 arch/s390/include/asm/system.h    |    1 -
 include/linux/hardirq.h           |    2 ++
 4 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/ia64/include/asm/system.h b/arch/ia64/include/asm/system.h
index 9f342a5..dd028f2 100644
--- a/arch/ia64/include/asm/system.h
+++ b/arch/ia64/include/asm/system.h
@@ -272,10 +272,6 @@ void cpu_idle_wait(void);
 
 void default_idle(void);
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/system.h b/arch/powerpc/include/asm/system.h
index 6c294ac..9c3d160 100644
--- a/arch/powerpc/include/asm/system.h
+++ b/arch/powerpc/include/asm/system.h
@@ -542,10 +542,6 @@ extern void reloc_got2(unsigned long);
 
 #define PTRRELOC(x)	((typeof(x)) add_reloc_offset((unsigned long)(x)))
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
 extern struct dentry *powerpc_debugfs_root;
 
 #endif /* __KERNEL__ */
diff --git a/arch/s390/include/asm/system.h b/arch/s390/include/asm/system.h
index cef6621..38ddd8a 100644
--- a/arch/s390/include/asm/system.h
+++ b/arch/s390/include/asm/system.h
@@ -97,7 +97,6 @@ static inline void restore_access_regs(unsigned int *acrs)
 
 extern void account_vtime(struct task_struct *, struct task_struct *);
 extern void account_tick_vtime(struct task_struct *);
-extern void account_system_vtime(struct task_struct *);
 
 #ifdef CONFIG_PFAULT
 extern void pfault_irq_init(void);
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d5b3876..ce22d09 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -136,6 +136,8 @@ struct task_struct;
 static inline void account_system_vtime(struct task_struct *tsk)
 {
 }
+#else
+extern void account_system_vtime(struct task_struct *tsk);
 #endif
 
 #if defined(CONFIG_NO_HZ)
-- 
1.7.1



* [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-17  1:56 [PATCH 0/6] Proper kernel irq time accounting Venkatesh Pallipadi
  2010-09-17  1:56 ` [PATCH 1/6] Consolidate account_system_vtime extern declaration Venkatesh Pallipadi
@ 2010-09-17  1:56 ` Venkatesh Pallipadi
  2010-09-19 11:11   ` Peter Zijlstra
  2010-09-19 11:21   ` Peter Zijlstra
  2010-09-17  1:56 ` [PATCH 3/6] x86: Add IRQ_TIME_ACCOUNTING in x86 Venkatesh Pallipadi
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-17  1:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Venkatesh Pallipadi

s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does
fine-grained accounting of user, system, hardirq and softirq times.
Adding that option on archs like x86 will be challenging, however, given the
state of TSC reliability on various platforms and also the overhead it would
add on syscall entry/exit.

Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer-tick
based samples). This accounting is added under a new config option,
CONFIG_IRQ_TIME_ACCOUNTING, so that there is no overhead for users not
interested in paying the performance penalty.

This accounting is based on sched_clock, with the code being generic.
So, other archs may find it useful as well.

Note that the kstat_cpu irq times (and hence /proc/stat) are still based on
tick samples. The reason is that kstat also includes system time, and making
only the irq times there fine-grained could result in inconsistencies, such
as the sum of the kstat times adding up to more than 100%.

This patch just adds the core logic and does not enable it yet.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 include/linux/hardirq.h |    2 +-
 include/linux/sched.h   |   11 +++++++++++
 kernel/sched.c          |   38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index ce22d09..bfafd29 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -132,7 +132,7 @@ extern void synchronize_irq(unsigned int irq);
 
 struct task_struct;
 
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
 static inline void account_system_vtime(struct task_struct *tsk)
 {
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..dbb6808 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1826,6 +1826,17 @@ extern void sched_clock_idle_sleep_event(void);
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 #endif
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * An i/f to runtime opt-in for irq time accounting based off of sched_clock.
+ * The reason for this explicit opt-in is not to have perf penalty with
+ * slow sched_clocks.
+ */
+extern void enable_sched_clock_irqtime(void);
+#else
+static inline void enable_sched_clock_irqtime(void) {}
+#endif
+
 extern unsigned long long
 task_sched_runtime(struct task_struct *task);
 extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
diff --git a/kernel/sched.c b/kernel/sched.c
index ed09d4f..912d2de 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1917,6 +1917,44 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 	dec_nr_running(rq);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+
+static DEFINE_PER_CPU(u64, cpu_hardirq_time);
+static DEFINE_PER_CPU(u64, cpu_softirq_time);
+
+static DEFINE_PER_CPU(u64, irq_start_time);
+static int sched_clock_irqtime;
+
+void enable_sched_clock_irqtime(void)
+{
+       sched_clock_irqtime = 1;
+}
+
+void account_system_vtime(struct task_struct *tsk)
+{
+	unsigned long flags;
+	int cpu;
+	u64 now, delta;
+
+	if (!sched_clock_irqtime)
+		return;
+
+	local_irq_save(flags);
+
+	cpu = task_cpu(tsk);
+	now = sched_clock();
+	delta = now - per_cpu(irq_start_time, cpu);
+	per_cpu(irq_start_time, cpu) = now;
+	if (hardirq_count())
+		per_cpu(cpu_hardirq_time, cpu) += delta;
+	else if (softirq_count())
+		per_cpu(cpu_softirq_time, cpu) += delta;
+
+	local_irq_restore(flags);
+}
+
+#endif
+
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
-- 
1.7.1



* [PATCH 3/6] x86: Add IRQ_TIME_ACCOUNTING in x86
  2010-09-17  1:56 [PATCH 0/6] Proper kernel irq time accounting Venkatesh Pallipadi
  2010-09-17  1:56 ` [PATCH 1/6] Consolidate account_system_vtime extern declaration Venkatesh Pallipadi
  2010-09-17  1:56 ` [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time Venkatesh Pallipadi
@ 2010-09-17  1:56 ` Venkatesh Pallipadi
  2010-09-17  1:56 ` [PATCH 4/6] sched: Do not account irq time to current task Venkatesh Pallipadi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-17  1:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Venkatesh Pallipadi

This patch adds the IRQ_TIME_ACCOUNTING option on x86 and enables it at
runtime when the TSC is enabled.

This change just enables fine-grained irq time accounting; the data isn't
used yet. The following patches use it for different purposes.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/x86/Kconfig      |   11 +++++++++++
 arch/x86/kernel/tsc.c |    2 ++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cea0cd9..f4c70c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -795,6 +795,17 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config IRQ_TIME_ACCOUNTING
+	bool "Fine granularity task level IRQ time accounting"
+	default n
+	---help---
+	  Select this option to enable fine granularity task irq time
+	  accounting. This is done by reading a timestamp on each
+	  transition between softirq and hardirq state, so there can be a
+	  small performance impact.
+
+	  If in doubt, say N here.
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 26a863a..110d815 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -987,6 +987,8 @@ void __init tsc_init(void)
 	/* now allow native_sched_clock() to use rdtsc */
 	tsc_disabled = 0;
 
+	enable_sched_clock_irqtime();
+
 	lpj = ((u64)tsc_khz * 1000);
 	do_div(lpj, HZ);
 	lpj_fine = lpj;
-- 
1.7.1



* [PATCH 4/6] sched: Do not account irq time to current task
  2010-09-17  1:56 [PATCH 0/6] Proper kernel irq time accounting Venkatesh Pallipadi
                   ` (2 preceding siblings ...)
  2010-09-17  1:56 ` [PATCH 3/6] x86: Add IRQ_TIME_ACCOUNTING in x86 Venkatesh Pallipadi
@ 2010-09-17  1:56 ` Venkatesh Pallipadi
  2010-09-19 11:28   ` Peter Zijlstra
  2010-09-17  1:56 ` [PATCH 5/6] sched: Remove irq time from available CPU power Venkatesh Pallipadi
  2010-09-17  1:56 ` [PATCH 6/6] Export per cpu hardirq and softirq time in proc Venkatesh Pallipadi
  5 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-17  1:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Venkatesh Pallipadi

Both softirq and interrupt processing times get accounted to the currently
running task by the scheduler. This means that if the interrupt processing was
done on behalf of some other task in the system, the current task pays the
penalty, as it gets a shorter runtime than it otherwise would.

Change this to 'unaccount' irq processing time from the currently running
task. We do this in the per-class update_curr() functions, by adjusting
delta_exec.

Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but that's
for later.

This change will impact scheduling behavior in interrupt heavy conditions.

Tested on a 4-way system with eth0 handled by CPU 2 and a network-heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2 spends 75%+
of its time in irq processing and CPU 3 spends around 35% of its time running
this task. Now, if I run another CPU-intensive task on CPU 2: without this
change, /proc/<pid>/schedstat for the CPU-intensive task shows 100% of the
time accounted to it; with this change, it shows less than 25% accounted
to it.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 kernel/sched.c      |   91 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_fair.c |    6 +++-
 kernel/sched_rt.c   |   16 +++++++--
 3 files changed, 108 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 912d2de..f36697b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -370,6 +370,10 @@ struct cfs_rq {
 	unsigned long rq_weight;
 #endif
 #endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	u64 saved_irq_time;
+#endif
+
 };
 
 /* Real-Time classes' related field in a runqueue: */
@@ -403,6 +407,10 @@ struct rt_rq {
 	struct list_head leaf_rt_rq_list;
 	struct task_group *tg;
 #endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	u64 saved_irq_time;
+#endif
+
 };
 
 #ifdef CONFIG_SMP
@@ -532,6 +540,10 @@ struct rq {
 	struct hrtimer hrtick_timer;
 #endif
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	u64 total_irq_time;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	/* latency stats */
 	struct sched_info rq_sched_info;
@@ -643,10 +655,14 @@ static inline struct task_group *task_group(struct task_struct *p)
 
 #endif /* CONFIG_CGROUP_SCHED */
 
+static void save_rq_irq_time(int cpu, struct rq *rq);
+
 inline void update_rq_clock(struct rq *rq)
 {
-	if (!rq->skip_clock_update)
+	if (!rq->skip_clock_update) {
 		rq->clock = sched_clock_cpu(cpu_of(rq));
+		save_rq_irq_time(cpu_of(rq), rq);
+	}
 }
 
 /*
@@ -1919,6 +1935,17 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 
+/*
+ * There are no locks covering percpu hardirq/softirq time.
+ * They are only modified in account_system_vtime, on corresponding CPU
+ * with interrupts disabled. So, writes are safe.
+ * They are read and saved off onto rq->total_irq_time in update_rq_clock().
+ * This may result in other CPU reading this CPU's irq time and can
+ * race with irq/account_system_vtime on this CPU. We would either get old
+ * or new value with a side effect of accounting a slice of irq time to wrong
+ * task when irq is in progress while we read rq->clock. That is a worthy
+ * compromise in place of having locks on each irq in account_system_time.
+ */
 static DEFINE_PER_CPU(u64, cpu_hardirq_time);
 static DEFINE_PER_CPU(u64, cpu_softirq_time);
 
@@ -1930,6 +1957,13 @@ void enable_sched_clock_irqtime(void)
        sched_clock_irqtime = 1;
 }
 
+static void save_rq_irq_time(int cpu, struct rq *rq)
+{
+	if (sched_clock_irqtime)
+		rq->total_irq_time = per_cpu(cpu_softirq_time, cpu) +
+					per_cpu(cpu_hardirq_time, cpu);
+}
+
 void account_system_vtime(struct task_struct *tsk)
 {
 	unsigned long flags;
@@ -1953,6 +1987,61 @@ void account_system_vtime(struct task_struct *tsk)
 	local_irq_restore(flags);
 }
 
+/*
+ * saved_irq_time in cfs_rq, rt_rq caches the accounted irqtime and
+ * it is updated from update_curr while doing entity accounting and
+ * in update_irq_time while the task is first scheduled in.
+ */
+static void __update_irq_time(int cpu, u64 *saved_irq_time)
+{
+	if (sched_clock_irqtime)
+		*saved_irq_time = cpu_rq(cpu)->total_irq_time;
+}
+
+#define update_irq_time(cpu, crq) __update_irq_time(cpu, &(crq)->saved_irq_time)
+
+static u64 unaccount_irq_delta(u64 delta, int cpu, u64 *saved_irq_time)
+{
+	u64 curr_irq_time;
+
+	if (!sched_clock_irqtime)
+		return delta;
+
+	curr_irq_time = cpu_rq(cpu)->total_irq_time - *saved_irq_time;
+	*saved_irq_time = cpu_rq(cpu)->total_irq_time;
+
+	if (likely(delta > curr_irq_time))
+		delta -= curr_irq_time;
+	else
+		delta = 0;
+
+	return delta;
+}
+
+#define unaccount_irq_delta_fair(delta, cpu, class_rq)		 \
+		(unsigned long)unaccount_irq_delta((u64)delta,	 \
+					cpu, &(class_rq)->saved_irq_time)
+
+#define unaccount_irq_delta_rt(delta, cpu, class_rq)		 \
+		unaccount_irq_delta(delta, cpu, &(class_rq)->saved_irq_time)
+
+#else
+
+#define update_irq_time(cpu, crq)		do { } while (0)
+
+static void save_rq_irq_time(int cpu, struct rq *rq) { }
+
+static unsigned long unaccount_irq_delta_fair(unsigned long delta_exec,
+		int cpu, struct cfs_rq *cfs_rq)
+{
+	return delta_exec;
+}
+
+static u64 unaccount_irq_delta_rt(u64 delta_exec, int cpu, struct rt_rq *rt_rq)
+{
+	return delta_exec;
+}
+
 #endif
 
 #include "sched_idletask.c"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a171138..a64fdaf 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -521,6 +521,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	struct sched_entity *curr = cfs_rq->curr;
 	u64 now = rq_of(cfs_rq)->clock;
 	unsigned long delta_exec;
+	int cpu = cpu_of(rq_of(cfs_rq));
 
 	if (unlikely(!curr))
 		return;
@@ -531,11 +532,13 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	 * overflow on 32 bits):
 	 */
 	delta_exec = (unsigned long)(now - curr->exec_start);
+	curr->exec_start = now;
+	delta_exec = unaccount_irq_delta_fair(delta_exec, cpu, cfs_rq);
+
 	if (!delta_exec)
 		return;
 
 	__update_curr(cfs_rq, curr, delta_exec);
-	curr->exec_start = now;
 
 	if (entity_is_task(curr)) {
 		struct task_struct *curtask = task_of(curr);
@@ -603,6 +606,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 * We are starting a new run period:
 	 */
 	se->exec_start = rq_of(cfs_rq)->clock;
+	update_irq_time(cpu_of(rq_of(cfs_rq)), cfs_rq);
 }
 
 /**************************************************
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..000d402 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -605,6 +605,7 @@ static void update_curr_rt(struct rq *rq)
 	struct sched_rt_entity *rt_se = &curr->rt;
 	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 	u64 delta_exec;
+	int cpu = cpu_of(rq);
 
 	if (!task_has_rt_policy(curr))
 		return;
@@ -613,6 +614,14 @@ static void update_curr_rt(struct rq *rq)
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 
+	/*
+	 * rt_avg_update before removing irq_delta, to keep up with
+	 * current behavior.
+	 */
+	sched_rt_avg_update(rq, delta_exec);
+
+	delta_exec = unaccount_irq_delta_rt(delta_exec, cpu, rt_rq);
+
 	schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
@@ -621,8 +630,6 @@ static void update_curr_rt(struct rq *rq)
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
@@ -1057,9 +1064,10 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 {
 	struct sched_rt_entity *rt_se;
 	struct task_struct *p;
-	struct rt_rq *rt_rq;
+	struct rt_rq *rt_rq, *tmp_rt_rq;
 
 	rt_rq = &rq->rt;
+	tmp_rt_rq = rt_rq;
 
 	if (unlikely(!rt_rq->rt_nr_running))
 		return NULL;
@@ -1075,6 +1083,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 
 	p = rt_task_of(rt_se);
 	p->se.exec_start = rq->clock;
+	update_irq_time(cpu_of(rq), tmp_rt_rq);
 
 	return p;
 }
@@ -1710,6 +1719,7 @@ static void set_curr_task_rt(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq->clock;
+	update_irq_time(cpu_of(rq), &rq->rt);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
-- 
1.7.1



* [PATCH 5/6] sched: Remove irq time from available CPU power
  2010-09-17  1:56 [PATCH 0/6] Proper kernel irq time accounting Venkatesh Pallipadi
                   ` (3 preceding siblings ...)
  2010-09-17  1:56 ` [PATCH 4/6] sched: Do not account irq time to current task Venkatesh Pallipadi
@ 2010-09-17  1:56 ` Venkatesh Pallipadi
  2010-09-19 11:31   ` Peter Zijlstra
  2010-09-17  1:56 ` [PATCH 6/6] Export per cpu hardirq and softirq time in proc Venkatesh Pallipadi
  5 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-17  1:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Venkatesh Pallipadi

The idea was suggested by Peter Zijlstra here:
http://marc.info/?l=linux-kernel&m=127476934517534&w=2

irq time is technically not available to the tasks running on the CPU.
This patch removes irq time from CPU power, piggybacking on
sched_rt_avg_update().

Tested this by keeping CPU X busy with 75% irq processing (hard+soft) on
a 4-way system, and starting 7 cycle soakers on the system. Without this
change, there will be 2 tasks on each CPU. With this change, there is still
a single task on the irq-busy CPU and the remaining 7 tasks are spread among
the other 3 CPUs.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 kernel/sched.c          |   14 ++++++++++++++
 kernel/sched_fair.c     |    3 +++
 kernel/sched_features.h |    5 +++++
 3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f36697b..8ac5389 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2025,6 +2025,18 @@ static u64 unaccount_irq_delta(u64 delta, int cpu, u64 *saved_irq_time)
 #define unaccount_irq_delta_rt(delta, cpu, class_rq)		 \
 		unaccount_irq_delta(delta, cpu, &(class_rq)->saved_irq_time)
 
+static void sched_irq_power_update_fair(int cpu, struct cfs_rq *cfs_rq,
+			struct rq* rq)
+{
+	if (!sched_clock_irqtime)
+		return;
+
+	if (likely(rq->total_irq_time > cfs_rq->saved_irq_time)) {
+		sched_rt_avg_update(rq,
+				rq->total_irq_time - cfs_rq->saved_irq_time);
+	}
+}
+
 #else
 
 #define update_irq_time(cpu, crq)		do { } while (0)
@@ -2042,6 +2054,8 @@ static u64 unaccount_irq_delta_rt(u64 delta_exec, int cpu, struct rt_rq *rt_rq)
 	return delta_exec;
 }
 
+#define sched_irq_power_update_fair(cpu, crq, rq)	do { } while (0)
+
 #endif
 
 #include "sched_idletask.c"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a64fdaf..937fded 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -526,6 +526,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (unlikely(!curr))
 		return;
 
+	if (sched_feat(NONIRQ_POWER) && entity_is_task(curr))
+		sched_irq_power_update_fair(cpu, cfs_rq, rq_of(cfs_rq));
+
 	/*
 	 * Get the amount of time the current task was running
 	 * since the last time we changed load (this cannot
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 83c66e8..185f920 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,3 +61,8 @@ SCHED_FEAT(ASYM_EFF_LOAD, 1)
  * release the lock. Decreases scheduling overhead.
  */
 SCHED_FEAT(OWNER_SPIN, 1)
+
+/*
+ * Decrement CPU power based on irq activity
+ */
+SCHED_FEAT(NONIRQ_POWER, 1)
-- 
1.7.1



* [PATCH 6/6] Export per cpu hardirq and softirq time in proc
  2010-09-17  1:56 [PATCH 0/6] Proper kernel irq time accounting Venkatesh Pallipadi
                   ` (4 preceding siblings ...)
  2010-09-17  1:56 ` [PATCH 5/6] sched: Remove irq time from available CPU power Venkatesh Pallipadi
@ 2010-09-17  1:56 ` Venkatesh Pallipadi
  5 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-17  1:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Venkatesh Pallipadi

I can predict this change being debated.

There is already per-CPU and system-level irq time in /proc/stat, which
on archs like x86 is based on sampled data. An earlier patch adds a fine
grained irq time option for such archs, and exporting this fine-grained
irq time to userspace seems helpful.

How should it be exported though? I considered:
(1) Changing the currently exported info in /proc/stat. Doing that, though,
    would likely break the summary view for the user, as user/system and other
    times there are still sample based while only the irq time would be fine
    grained. So users may see the times not summing to 100% in top etc.
(2) Add a new interface in /proc. This implies an additional file read, buffer
    allocation, etc., which I want to avoid if possible.
(3) Don't export this info at all. I am OK with this as an alternative, but
    I needed this exported somewhere, at least for my testing.
(4) Piggyback on /proc/interrupts and /proc/softirqs. Assuming users interested
    in this kind of info are already looking at those files, we won't have the
    overhead of an additional file read. There is still a likelihood of
    breaking some apps which only expect interrupt counts in those files, but
    this seemed like a good option to me.

So, here is the patch that does (4).

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 Documentation/filesystems/proc.txt |    9 +++++++++
 fs/proc/interrupts.c               |   11 ++++++++++-
 fs/proc/softirqs.c                 |    8 ++++++++
 include/linux/sched.h              |    3 +++
 kernel/sched.c                     |   27 +++++++++++++++++++++++++++
 5 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a6aca87..4456011 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -536,6 +536,11 @@ the threshold vector does not exist on x86_64 platforms.  Others are
 suppressed when the system is a uniprocessor.  As of this writing, only
 i386 and x86_64 platforms support the new IRQ vector displays.
 
+Another addition to /proc/interrupts is a "Time:" line at the end, which
+displays the time spent by the corresponding CPU processing interrupts in USER_HZ units.
+This time is based on fine grained accounting when CONFIG_VIRT_CPU_ACCOUNTING
+or CONFIG_IRQ_TIME_ACCOUNTING is active, otherwise it is tick sample based.
+
 Of some interest is the introduction of the /proc/irq directory to 2.4.
 It could be used to set IRQ to CPU affinity, this means that you can "hook" an
 IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
@@ -824,6 +829,10 @@ Provides counts of softirq handlers serviced since boot time, for each cpu.
  HRTIMER:          0          0          0          0
      RCU:       1678       1769       2178       2250
 
+An addition to /proc/softirqs is a "Time:" line at the end, which
+displays the time spent by the corresponding CPU processing softirqs in USER_HZ units.
+This time is based on fine grained accounting when CONFIG_VIRT_CPU_ACCOUNTING
+or CONFIG_IRQ_TIME_ACCOUNTING is active, otherwise it is tick sample based.
 
 1.3 IDE devices in /proc/ide
 ----------------------------
diff --git a/fs/proc/interrupts.c b/fs/proc/interrupts.c
index 05029c0..66d913a 100644
--- a/fs/proc/interrupts.c
+++ b/fs/proc/interrupts.c
@@ -3,6 +3,7 @@
 #include <linux/interrupt.h>
 #include <linux/irqnr.h>
 #include <linux/proc_fs.h>
+#include <linux/sched.h>
 #include <linux/seq_file.h>
 
 /*
@@ -23,7 +24,15 @@ static void *int_seq_next(struct seq_file *f, void *v, loff_t *pos)
 
 static void int_seq_stop(struct seq_file *f, void *v)
 {
-	/* Nothing to do */
+	int j;
+
+	seq_printf(f, "\n");
+	seq_printf(f, "Time:");
+	for_each_possible_cpu(j)
+		seq_printf(f, " %10lu", (unsigned long)get_cpu_hardirq_time(j));
+	seq_printf(f, "  Interrupt Processing Time\n");
+	seq_printf(f, "\n");
+
 }
 
 static const struct seq_operations int_seq_ops = {
diff --git a/fs/proc/softirqs.c b/fs/proc/softirqs.c
index 1807c24..f028329 100644
--- a/fs/proc/softirqs.c
+++ b/fs/proc/softirqs.c
@@ -1,6 +1,7 @@
 #include <linux/init.h>
 #include <linux/kernel_stat.h>
 #include <linux/proc_fs.h>
+#include <linux/sched.h>
 #include <linux/seq_file.h>
 
 /*
@@ -21,6 +22,13 @@ static int show_softirqs(struct seq_file *p, void *v)
 			seq_printf(p, " %10u", kstat_softirqs_cpu(i, j));
 		seq_printf(p, "\n");
 	}
+
+	seq_printf(p, "\n");
+	seq_printf(p, "    Time:");
+	for_each_possible_cpu(j)
+		seq_printf(p, " %10lu", (unsigned long)get_cpu_softirq_time(j));
+	seq_printf(p, "\n");
+
 	return 0;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dbb6808..9033b21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1826,6 +1826,9 @@ extern void sched_clock_idle_sleep_event(void);
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 #endif
 
+extern clock_t get_cpu_hardirq_time(int cpu);
+extern clock_t get_cpu_softirq_time(int cpu);
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 /*
  * An i/f to runtime opt-in for irq time accounting based off of sched_clock.
diff --git a/kernel/sched.c b/kernel/sched.c
index 8ac5389..de63d2e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -73,6 +73,7 @@
 #include <linux/ftrace.h>
 #include <linux/slab.h>
 
+#include <asm/cputime.h>
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 
@@ -2037,6 +2038,22 @@ static void sched_irq_power_update_fair(int cpu, struct cfs_rq *cfs_rq,
 	}
 }
 
+clock_t get_cpu_hardirq_time(int cpu)
+{
+	if (!sched_clock_irqtime)
+		return cputime64_to_clock_t(kstat_cpu(cpu).cpustat.irq);
+
+	return nsec_to_clock_t(per_cpu(cpu_hardirq_time,(cpu)));
+}
+
+clock_t get_cpu_softirq_time(int cpu)
+{
+	if (!sched_clock_irqtime)
+		return cputime64_to_clock_t(kstat_cpu(cpu).cpustat.softirq);
+
+	return nsec_to_clock_t(per_cpu(cpu_softirq_time,(cpu)));
+}
+
 #else
 
 #define update_irq_time(cpu, crq)		do { } while (0)
@@ -2056,6 +2073,16 @@ static u64 unaccount_irq_delta_rt(u64 delta_exec, int cpu, struct rt_rq *rt_rq)
 
 #define sched_irq_power_update_fair(cpu, crq, rq)	do { } while (0)
 
+clock_t get_cpu_hardirq_time(int cpu)
+{
+	return cputime64_to_clock_t(kstat_cpu(cpu).cpustat.irq);
+}
+
+clock_t get_cpu_softirq_time(int cpu)
+{
+	return cputime64_to_clock_t(kstat_cpu(cpu).cpustat.softirq);
+}
+
 #endif
 
 #include "sched_idletask.c"
-- 
1.7.1



* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-17  1:56 ` [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time Venkatesh Pallipadi
@ 2010-09-19 11:11   ` Peter Zijlstra
  2010-09-20 17:13     ` Venkatesh Pallipadi
  2010-09-19 11:21   ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-19 11:11 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
> 
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
> ---
>  include/linux/hardirq.h |    2 +-
>  include/linux/sched.h   |   11 +++++++++++
>  kernel/sched.c          |   38 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 50 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index ce22d09..bfafd29 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -132,7 +132,7 @@ extern void synchronize_irq(unsigned int irq);
>  
>  struct task_struct;
>  
> -#ifndef CONFIG_VIRT_CPU_ACCOUNTING
> +#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
>  static inline void account_system_vtime(struct task_struct *tsk)
>  {
>  }
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1e2a6db..dbb6808 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1826,6 +1826,17 @@ extern void sched_clock_idle_sleep_event(void);
>  extern void sched_clock_idle_wakeup_event(u64 delta_ns);
>  #endif
>  
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +/*
> + * An i/f to runtime opt-in for irq time accounting based off of sched_clock.
> + * The reason for this explicit opt-in is not to have perf penalty with
> + * slow sched_clocks.
> + */
> +extern void enable_sched_clock_irqtime(void);
> +#else
> +static inline void enable_sched_clock_irqtime(void) {}
> +#endif
> +
>  extern unsigned long long
>  task_sched_runtime(struct task_struct *task);
>  extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index ed09d4f..912d2de 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1917,6 +1917,44 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>         dec_nr_running(rq);
>  }
>  
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +
> +static DEFINE_PER_CPU(u64, cpu_hardirq_time);
> +static DEFINE_PER_CPU(u64, cpu_softirq_time);
> +
> +static DEFINE_PER_CPU(u64, irq_start_time);
> +static int sched_clock_irqtime;
> +
> +void enable_sched_clock_irqtime(void)
> +{
> +       sched_clock_irqtime = 1;
> +}
> +
> +void account_system_vtime(struct task_struct *tsk)
> +{
> +       unsigned long flags;
> +       int cpu;
> +       u64 now, delta;
> +
> +       if (!sched_clock_irqtime)
> +               return;
> +
> +       local_irq_save(flags);
> +
> +       cpu = task_cpu(tsk);

Can this ever be anything other than smp_processor_id() and current?

> +       now = sched_clock();

this should be using one of the kernel/sched_clock.c thingies, probably
local_clock(), or sched_clock_cpu(cpu).

> +       delta = now - per_cpu(irq_start_time, cpu);
> +       per_cpu(irq_start_time, cpu) = now;
> +       if (hardirq_count())
> +               per_cpu(cpu_hardirq_time, cpu) += delta;
> +       else if (softirq_count())
> +               per_cpu(cpu_softirq_time, cpu) += delta;
> +
> +       local_irq_restore(flags);
> +}

Also, this isn't a complete API, it's very asymmetric, please cure that.


* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-17  1:56 ` [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time Venkatesh Pallipadi
  2010-09-19 11:11   ` Peter Zijlstra
@ 2010-09-19 11:21   ` Peter Zijlstra
  2010-09-19 11:42     ` Peter Zijlstra
  2010-09-19 12:01     ` Peter Zijlstra
  1 sibling, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-19 11:21 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
> +void account_system_vtime(struct task_struct *tsk)
> +{
> +       unsigned long flags;
> +       int cpu;
> +       u64 now, delta;
> +
> +       if (!sched_clock_irqtime)
> +               return;
> +
> +       local_irq_save(flags);
> +
> +       cpu = task_cpu(tsk);
> +       now = sched_clock();
> +       delta = now - per_cpu(irq_start_time, cpu);
> +       per_cpu(irq_start_time, cpu) = now;
> +       if (hardirq_count())
> +               per_cpu(cpu_hardirq_time, cpu) += delta;
> +       else if (softirq_count())
> +               per_cpu(cpu_softirq_time, cpu) += delta;
> +
> +       local_irq_restore(flags);
> +} 

This seems to suggest you count time double if a hardirq hits while
we're doing softirqs, but being as this is an incomplete api its very
hard to tell indeed.


* Re: [PATCH 4/6] sched: Do not account irq time to current task
  2010-09-17  1:56 ` [PATCH 4/6] sched: Do not account irq time to current task Venkatesh Pallipadi
@ 2010-09-19 11:28   ` Peter Zijlstra
  2010-09-20 17:33     ` Venkatesh Pallipadi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-19 11:28 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>

This patch makes my head hurt.

> ---
>  kernel/sched.c      |   91 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched_fair.c |    6 +++-
>  kernel/sched_rt.c   |   16 +++++++--
>  3 files changed, 108 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 912d2de..f36697b 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -370,6 +370,10 @@ struct cfs_rq {
>         unsigned long rq_weight;
>  #endif
>  #endif
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +       u64 saved_irq_time;
> +#endif
> +
>  };
>  
>  /* Real-Time classes' related field in a runqueue: */
> @@ -403,6 +407,10 @@ struct rt_rq {
>         struct list_head leaf_rt_rq_list;
>         struct task_group *tg;
>  #endif
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +       u64 saved_irq_time;
> +#endif
> +
>  };
>  
>  #ifdef CONFIG_SMP

Why do we care about irq_time for groups like this?

> @@ -532,6 +540,10 @@ struct rq {
>         struct hrtimer hrtick_timer;
>  #endif
>  
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +       u64 total_irq_time;
> +#endif
> +
>  #ifdef CONFIG_SCHEDSTATS
>         /* latency stats */
>         struct sched_info rq_sched_info;

Why do we track that here, we've got that information in the percpu
variables already, right?

> @@ -643,10 +655,14 @@ static inline struct task_group *task_group(struct task_struct *p)
>  
>  #endif /* CONFIG_CGROUP_SCHED */
>  
> +static void save_rq_irq_time(int cpu, struct rq *rq);
> +
>  inline void update_rq_clock(struct rq *rq)
>  {
> -       if (!rq->skip_clock_update)
> +       if (!rq->skip_clock_update) {
>                 rq->clock = sched_clock_cpu(cpu_of(rq));
> +               save_rq_irq_time(cpu_of(rq), rq);
> +       }
>  }
>  
>  /*

Something like: rq->clock_task = rq->clock - irq_time(cpu_of(rq)), would
make more sense -- that way the virt people can add a steal-time factor
and all would just work out for them too.
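
A minimal sketch of that direction, with hypothetical field/helper names
(irq_time_cpu(), rq->prev_irq_time) rather than code from this series:

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
	s64 irq_delta;

	irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time;
	if (irq_delta > delta)
		irq_delta = delta;	/* never let clock_task go backwards */

	rq->prev_irq_time += irq_delta;
	rq->clock_task += delta - irq_delta;
}

inline void update_rq_clock(struct rq *rq)
{
	s64 delta;

	if (rq->skip_clock_update)
		return;

	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
	rq->clock += delta;
	update_rq_clock_task(rq, delta);
}

Task and entity accounting would then read rq->clock_task everywhere it
currently reads rq->clock, and a steal-time term could be subtracted in the
same place.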

> @@ -1919,6 +1935,17 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>  
>  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>  
> +/*
> + * There are no locks covering percpu hardirq/softirq time.
> + * They are only modified in account_system_vtime, on corresponding CPU
> + * with interrupts disabled. So, writes are safe.
> + * They are read and saved off onto rq->total_irq_time in update_rq_clock().
> + * This may result in other CPU reading this CPU's irq time and can
> + * race with irq/account_system_vtime on this CPU. We would either get old
> + * or new value with a side effect of accounting a slice of irq time to wrong
> + * task when irq is in progress while we read rq->clock. That is a worthy
> + * compromise in place of having locks on each irq in account_system_time.
> + */

Except for 32bit systems, which can read half an updated u64.

>  static DEFINE_PER_CPU(u64, cpu_hardirq_time);
>  static DEFINE_PER_CPU(u64, cpu_softirq_time);
>  
> @@ -1930,6 +1957,13 @@ void enable_sched_clock_irqtime(void)
>         sched_clock_irqtime = 1;
>  }
>  
> +static void save_rq_irq_time(int cpu, struct rq *rq)
> +{
> +       if (sched_clock_irqtime)
> +               rq->total_irq_time = per_cpu(cpu_softirq_time, cpu) +
> +                                       per_cpu(cpu_hardirq_time, cpu);
> +}
> +
>  void account_system_vtime(struct task_struct *tsk)
>  {
>         unsigned long flags;
> @@ -1953,6 +1987,61 @@ void account_system_vtime(struct task_struct *tsk)
>         local_irq_restore(flags);
>  }
>  
> +/*
> + * saved_irq_time in cfs_rq, rt_rq caches the accounted irqtime and
> > + * it is updated from update_curr while doing entity accounting and
> + * in update_irq_time while the task is first scheduled in.
> + */
> +static void __update_irq_time(int cpu, u64 *saved_irq_time)
> +{
> +       if (sched_clock_irqtime)
> +               *saved_irq_time = cpu_rq(cpu)->total_irq_time;
> +}
> +
> +#define update_irq_time(cpu, crq) __update_irq_time(cpu, &(crq)->saved_irq_time)
> +
> +static u64 unaccount_irq_delta(u64 delta, int cpu, u64 *saved_irq_time)
> +{
> +       u64 curr_irq_time;
> +
> +       if (!sched_clock_irqtime)
> +               return delta;
> +
> +       curr_irq_time = cpu_rq(cpu)->total_irq_time - *saved_irq_time;
> +       *saved_irq_time = cpu_rq(cpu)->total_irq_time;
> +
> +       if (likely(delta > curr_irq_time))
> +               delta -= curr_irq_time;
> +       else
> +               delta = 0;
> +
> +       return delta;
> +}



> +#define unaccount_irq_delta_fair(delta, cpu, class_rq)          \
> +               (unsigned long)unaccount_irq_delta((u64)delta,   \
> +                                       cpu, &(class_rq)->saved_irq_time)
> +
> +#define unaccount_irq_delta_rt(delta, cpu, class_rq)            \
> +               unaccount_irq_delta(delta, cpu, &(class_rq)->saved_irq_time)

This is a consequence of sticking stuff in {cfs,rt}_rq, which afaict is
not needed at all.

> +#else
> +
> +#define update_irq_time(cpu, crq)              do { } while (0)
> +
> +static void save_rq_irq_time(int cpu, struct rq *rq) { }
> +
> +static unsigned long unaccount_irq_delta_fair(unsigned long delta_exec,
> +               int cpu, struct cfs_rq *cfs_rq)
> +{
> +       return delta_exec;
> +}
> +
> +static u64 unaccount_irq_delta_rt(u64 delta_exec, int cpu, struct rt_rq *rt_rq)
> +{
> +       return delta_exec;
> +}
> +
>  #endif
>  
>  #include "sched_idletask.c"
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index a171138..a64fdaf 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -521,6 +521,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>         struct sched_entity *curr = cfs_rq->curr;
>         u64 now = rq_of(cfs_rq)->clock;

  u64 now = rq_of(cfs_rq)->clock_task;

>         unsigned long delta_exec;
> +       int cpu = cpu_of(rq_of(cfs_rq));
>  
>         if (unlikely(!curr))
>                 return;
> @@ -531,11 +532,13 @@ static void update_curr(struct cfs_rq *cfs_rq)
>          * overflow on 32 bits):
>          */
>         delta_exec = (unsigned long)(now - curr->exec_start);
> +       curr->exec_start = now;
> +       delta_exec = unaccount_irq_delta_fair(delta_exec, cpu, cfs_rq);
> +
>         if (!delta_exec)
>                 return;
>  
>         __update_curr(cfs_rq, curr, delta_exec);
> -       curr->exec_start = now;
>  
>         if (entity_is_task(curr)) {
>                 struct task_struct *curtask = task_of(curr);
> @@ -603,6 +606,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>          * We are starting a new run period:
>          */
>         se->exec_start = rq_of(cfs_rq)->clock;
> +       update_irq_time(cpu_of(rq_of(cfs_rq)), cfs_rq);
>  }
>  
>  /**************************************************

se->exec_start = rq_of(cfs_rq)->clock_task;

And you don't need anything else, hmm?

> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index d10c80e..000d402 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -605,6 +605,7 @@ static void update_curr_rt(struct rq *rq)
>         struct sched_rt_entity *rt_se = &curr->rt;
>         struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
>         u64 delta_exec;
> +       int cpu = cpu_of(rq);
>  
>         if (!task_has_rt_policy(curr))
>                 return;
> @@ -613,6 +614,14 @@ static void update_curr_rt(struct rq *rq)
>         if (unlikely((s64)delta_exec < 0))
>                 delta_exec = 0;
>  
> +       /*
> +        * rt_avg_update before removing irq_delta, to keep up with
> +        * current behavior.
> +        */
> +       sched_rt_avg_update(rq, delta_exec);
> +
> +       delta_exec = unaccount_irq_delta_rt(delta_exec, cpu, rt_rq);
> +
>         schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec));
>  
>         curr->se.sum_exec_runtime += delta_exec;
> @@ -621,8 +630,6 @@ static void update_curr_rt(struct rq *rq)
>         curr->se.exec_start = rq->clock;
>         cpuacct_charge(curr, delta_exec);
>  
> -       sched_rt_avg_update(rq, delta_exec);
> -
>         if (!rt_bandwidth_enabled())
>                 return;

I think it would all greatly simplify if you would compute delta_exec
from rq->clock_task and add IRQ time to sched_rt_avg_update()
separately.
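
Roughly, and only as a sketch assuming an rq->clock_task maintained in
update_rq_clock() as suggested earlier in this thread (irq_delta() is a
made-up helper for the irq time that passed since the last update):

static void update_curr_rt(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	u64 delta_exec;

	if (!task_has_rt_policy(curr))
		return;

	/* clock_task already excludes irq time */
	delta_exec = rq->clock_task - curr->se.exec_start;
	if (unlikely((s64)delta_exec < 0))
		delta_exec = 0;

	curr->se.sum_exec_runtime += delta_exec;
	curr->se.exec_start = rq->clock_task;
	cpuacct_charge(curr, delta_exec);

	/* irq time still counts against the CPU's rt_avg/capacity */
	sched_rt_avg_update(rq, delta_exec + irq_delta(rq));
}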
 



* Re: [PATCH 5/6] sched: Remove irq time from available CPU power
  2010-09-17  1:56 ` [PATCH 5/6] sched: Remove irq time from available CPU power Venkatesh Pallipadi
@ 2010-09-19 11:31   ` Peter Zijlstra
  2010-09-20 17:38     ` Venkatesh Pallipadi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-19 11:31 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
> +++ b/kernel/sched.c
> @@ -2025,6 +2025,18 @@ static u64 unaccount_irq_delta(u64 delta, int cpu, u64 *saved_irq_time)
>  #define unaccount_irq_delta_rt(delta, cpu, class_rq)            \
>                 unaccount_irq_delta(delta, cpu, &(class_rq)->saved_irq_time)
>  
> +static void sched_irq_power_update_fair(int cpu, struct cfs_rq *cfs_rq,
> +                       struct rq* rq)
> +{
> +       if (!sched_clock_irqtime)
> +               return;
> +
> +       if (likely(rq->total_irq_time > cfs_rq->saved_irq_time)) {
> +               sched_rt_avg_update(rq,
> +                               rq->total_irq_time - cfs_rq->saved_irq_time);
> +       }
> +}
> +
>  #else
>  
>  #define update_irq_time(cpu, crq)              do { } while (0)
> @@ -2042,6 +2054,8 @@ static u64 unaccount_irq_delta_rt(u64 delta_exec, int cpu, struct rt_rq *rt_rq)
>         return delta_exec;
>  }
>  
> +#define sched_irq_power_update_fair(cpu, crq, rq)      do { } while (0)
> +
>  #endif
>  
>  #include "sched_idletask.c"
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index a64fdaf..937fded 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -526,6 +526,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
>         if (unlikely(!curr))
>                 return;
>  
> +       if (sched_feat(NONIRQ_POWER) && entity_is_task(curr))
> +               sched_irq_power_update_fair(cpu, cfs_rq, rq_of(cfs_rq));
> +
>         /*
>          * Get the amount of time the current task was running
>          * since the last time we changed load (this cannot 

This all looks very confusing to me... How about we simply fold the
delta between rq->clock and rq->clock_task into sched_rt_avg_update()
and be done with it?


* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-19 11:21   ` Peter Zijlstra
@ 2010-09-19 11:42     ` Peter Zijlstra
  2010-09-19 12:01     ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-19 11:42 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Sun, 2010-09-19 at 13:21 +0200, Peter Zijlstra wrote:
> On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
> > +void account_system_vtime(struct task_struct *tsk)
> > +{
> > +       unsigned long flags;
> > +       int cpu;
> > +       u64 now, delta;
> > +
> > +       if (!sched_clock_irqtime)
> > +               return;
> > +
> > +       local_irq_save(flags);
> > +
> > +       cpu = task_cpu(tsk);
> > +       now = sched_clock();
> > +       delta = now - per_cpu(irq_start_time, cpu);
> > +       per_cpu(irq_start_time, cpu) = now;
> > +       if (hardirq_count())
> > +               per_cpu(cpu_hardirq_time, cpu) += delta;
> > +       else if (softirq_count())
> > +               per_cpu(cpu_softirq_time, cpu) += delta;
> > +
> > +       local_irq_restore(flags);
> > +} 
> 
> This seems to suggest you count time double if a hardirq hits while
> we're doing softirqs, but being as this is an incomplete api its very
> hard to tell indeed.

Argh, so this function is called on both _enter and _exit, egads.

Imagine the following in task context:

  local_bh_disable();

  <IRQ>

  local_bh_enable();

That seems to incorrectly inflate softirq_time.




* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-19 11:21   ` Peter Zijlstra
  2010-09-19 11:42     ` Peter Zijlstra
@ 2010-09-19 12:01     ` Peter Zijlstra
  2010-09-20  7:27       ` Martin Schwidefsky
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-19 12:01 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Sun, 2010-09-19 at 13:21 +0200, Peter Zijlstra wrote:
> On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
> > +void account_system_vtime(struct task_struct *tsk)
> > +{
> > +       unsigned long flags;
> > +       int cpu;
> > +       u64 now, delta;
> > +
> > +       if (!sched_clock_irqtime)
> > +               return;
> > +
> > +       local_irq_save(flags);
> > +
> > +       cpu = task_cpu(tsk);
> > +       now = sched_clock();
> > +       delta = now - per_cpu(irq_start_time, cpu);
> > +       per_cpu(irq_start_time, cpu) = now;
> > +       if (hardirq_count())
> > +               per_cpu(cpu_hardirq_time, cpu) += delta;
> > +       else if (softirq_count())
> > +               per_cpu(cpu_softirq_time, cpu) += delta;
> > +
> > +       local_irq_restore(flags);
> > +} 
> 
> This seems to suggest you count time double if a hardirq hits while
> we're doing softirqs, but being as this is an incomplete api its very
> hard to tell indeed.

OK, so by virtue of calling the same function on _enter and _exit its
not incomplete, just weird.

And it won't account time double, since it uses irq_start_time to
compute deltas between invocations and will attribute that delta to only
one state.

You still do have the problem with local_bh_disable() though, since you
cannot distinguish between having bh disabled and processing softirq.

So a hardirq that hits while you have bh disabled will inflate your
softirq time.

A possible solution is to have local_bh_{disable,enable} {add,sub}
2*SOFTIRQ_OFFSET and have the processing use SOFTIRQ_OFFSET, will need a
bit of a code shuffle though.
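
Sketched out (in_serving_softirq() is a hypothetical helper here; the point
is that the low softirq bit of preempt_count would then mean "serving
softirq" rather than merely "bh disabled"):

/* local_bh_disable()/enable() would move preempt_count by 2*SOFTIRQ_OFFSET,
 * while the actual softirq processing itself adds SOFTIRQ_OFFSET. */
static inline int in_serving_softirq(void)
{
	return softirq_count() & SOFTIRQ_OFFSET;
}

/* and account_system_vtime() would then test: */
	if (hardirq_count())
		per_cpu(cpu_hardirq_time, cpu) += delta;
	else if (in_serving_softirq())	/* not just bh-disabled */
		per_cpu(cpu_softirq_time, cpu) += delta;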


* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-19 12:01     ` Peter Zijlstra
@ 2010-09-20  7:27       ` Martin Schwidefsky
  2010-09-20  9:27         ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2010-09-20  7:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Venkatesh Pallipadi, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Balbir Singh, linux-kernel, Paul Turner

On Sun, 19 Sep 2010 14:01:06 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, 2010-09-19 at 13:21 +0200, Peter Zijlstra wrote:
> > On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
> > > +void account_system_vtime(struct task_struct *tsk)
> > > +{
> > > +       unsigned long flags;
> > > +       int cpu;
> > > +       u64 now, delta;
> > > +
> > > +       if (!sched_clock_irqtime)
> > > +               return;
> > > +
> > > +       local_irq_save(flags);
> > > +
> > > +       cpu = task_cpu(tsk);
> > > +       now = sched_clock();
> > > +       delta = now - per_cpu(irq_start_time, cpu);
> > > +       per_cpu(irq_start_time, cpu) = now;
> > > +       if (hardirq_count())
> > > +               per_cpu(cpu_hardirq_time, cpu) += delta;
> > > +       else if (softirq_count())
> > > +               per_cpu(cpu_softirq_time, cpu) += delta;
> > > +
> > > +       local_irq_restore(flags);
> > > +} 
> > 
> > This seems to suggest you count time double if a hardirq hits while
> > we're doing softirqs, but being as this is an incomplete api its very
> > hard to tell indeed.
> 
> OK, so by virtue of calling the same function on _enter and _exit its
> not incomplete, just weird.

That is the same with CONFIG_VIRT_CPU_ACCOUNTING=y. irq_enter/irq_exit
call account_system_vtime, the function then uses the preempt/softirq/
hardirq counter to find out which context is currently active.

> And it won't account time double, since it uses irq_start_time to
> compute deltas between invocations and will attribute that delta to only
> one state.

irq_start_time is a bit misleading, it is a time stamp of the last update.
The confusing part (which deserves a comment) is the fact that the delta
is not added to anything if hardirq_count and softirq_count are zero.

> You still do have the problem with local_bh_disable() though, since you
> cannot distinguish between having bh disabled and processing softirq.
> 
> So a hardirq that hits while you have bh disabled will inflate your
> softirq time.
> 
> A possible solution is to have local_bh_{disable,enable} {add,sub}
> 2*SOFTIRQ_OFFSET and have the processing use SOFTIRQ_OFFSET, will need a
> bit of a code shuffle though.

Hmm, that bug is valid for CONFIG_VIRT_CPU_ACCOUNTING=y as well.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-20  7:27       ` Martin Schwidefsky
@ 2010-09-20  9:27         ` Peter Zijlstra
  2010-09-20 17:16           ` Venkatesh Pallipadi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-20  9:27 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Venkatesh Pallipadi, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Balbir Singh, linux-kernel, Paul Turner

On Mon, 2010-09-20 at 09:27 +0200, Martin Schwidefsky wrote:

> > OK, so by virtue of calling the same function on _enter and _exit its
> > not incomplete, just weird.
> 
> That is the same with CONFIG_VIRT_CPU_ACCOUNTING=y. irq_enter/irq_exit
> call account_system_vtime, the function then uses the preempt/softirq/
> hardirq counter to find out which context is currently active.

Yeah, I realized that eventually, I've so far been able to mostly ignore
all that VIRT_CPU_ACCOUNTING muck.

> > And it won't account time double, since it uses irq_start_time to
> > compute deltas between invocations and will attribute that delta to only
> > one state.
> 
> irq_start_time is a bit misleading, it is a time stamp of the last update.
> The confusing part (which deserves a comment) is the fact that the delta
> is not added to anything if hardirq_count and softirq_count are zero.

Yeah, the name didn't help either, but I really expected to see two
hooks: start/exit. I did eventually figure it all out, but it's a bit
daft.

If you had had 4 hooks, the problem below would have been fixable
within the implementation.

> > You still do have the problem with local_bh_disable() though, since you
> > cannot distinguish between having bh disabled and processing softirq.
> > 
> > So a hardirq that hits while you have bh disabled will inflate your
> > softirq time.
> > 
> > A possible solution is to have local_bh_{disable,enable} {add,sub}
> > 2*SOFTIRQ_OFFSET and have the processing use SOFTIRQ_OFFSET, will need a
> > bit of a code shuffle though.
> 
> Hmm, that bug is valid for CONFIG_VIRT_CPU_ACCOUNTING=y as well.

And nobody ever noticed?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-19 11:11   ` Peter Zijlstra
@ 2010-09-20 17:13     ` Venkatesh Pallipadi
  2010-09-20 17:23       ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-20 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Sun, Sep 19, 2010 at 4:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
>>
>> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
>> ---
>>  include/linux/hardirq.h |    2 +-
>>  include/linux/sched.h   |   11 +++++++++++
>>  kernel/sched.c          |   38 ++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 50 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
>> index ce22d09..bfafd29 100644
>> --- a/include/linux/hardirq.h
>> +++ b/include/linux/hardirq.h
>> @@ -132,7 +132,7 @@ extern void synchronize_irq(unsigned int irq);
>>
>>  struct task_struct;
>>
>> -#ifndef CONFIG_VIRT_CPU_ACCOUNTING
>> +#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
>>  static inline void account_system_vtime(struct task_struct *tsk)
>>  {
>>  }
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 1e2a6db..dbb6808 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1826,6 +1826,17 @@ extern void sched_clock_idle_sleep_event(void);
>>  extern void sched_clock_idle_wakeup_event(u64 delta_ns);
>>  #endif
>>
>> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> +/*
>> + * An i/f to runtime opt-in for irq time accounting based off of sched_clock.
>> + * The reason for this explicit opt-in is not to have perf penalty with
>> + * slow sched_clocks.
>> + */
>> +extern void enable_sched_clock_irqtime(void);
>> +#else
>> +static inline void enable_sched_clock_irqtime(void) {}
>> +#endif
>> +
>>  extern unsigned long long
>>  task_sched_runtime(struct task_struct *task);
>>  extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index ed09d4f..912d2de 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -1917,6 +1917,44 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>>         dec_nr_running(rq);
>>  }
>>
>> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> +
>> +static DEFINE_PER_CPU(u64, cpu_hardirq_time);
>> +static DEFINE_PER_CPU(u64, cpu_softirq_time);
>> +
>> +static DEFINE_PER_CPU(u64, irq_start_time);
>> +static int sched_clock_irqtime;
>> +
>> +void enable_sched_clock_irqtime(void)
>> +{
>> +       sched_clock_irqtime = 1;
>> +}
>> +
>> +void account_system_vtime(struct task_struct *tsk)
>> +{
>> +       unsigned long flags;
>> +       int cpu;
>> +       u64 now, delta;
>> +
>> +       if (!sched_clock_irqtime)
>> +               return;
>> +
>> +       local_irq_save(flags);
>> +
>> +       cpu = task_cpu(tsk);
>
> Can this ever be anything other than smp_processor_id() and current?
>
>> +       now = sched_clock();
>
> this should be using one of the kernel/sched_clock.c thingies, probably
> local_clock(), or sched_clock_cpu(cpu).

I don't really need the task there. I can use smp_processor_id() or
task_cpu(tsk), and I think the latter would be cheaper.

You mean sched_clock() is not the right interface to use here.
sched_clock_cpu() uses either sched_clock() or the remote-cpu stuff, which I
don't really need here, and local_clock() also does irq
disable/smp_processor_id() and calls sched_clock_cpu() in turn.
sched_clock() seemed more appropriate.

Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-20  9:27         ` Peter Zijlstra
@ 2010-09-20 17:16           ` Venkatesh Pallipadi
  2010-09-20 17:26             ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-20 17:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, linux-kernel, Paul Turner

On Mon, Sep 20, 2010 at 2:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2010-09-20 at 09:27 +0200, Martin Schwidefsky wrote:
>
>> > OK, so by virtue of calling the same function on _enter and _exit its
>> > not incomplete, just weird.
>>
>> That is the same with CONFIG_VIRT_CPU_ACCOUNTING=y. irq_enter/irq_exit
>> call account_system_vtime, the function then uses the preempt/softirq/
>> hardirq counter to find out which context is currently active.
>
> Yeah, I realized that eventually, I've so far been able to mostly ignore
> all that VIRT_CPU_ACCOUNTING muck.
>
>> > And it won't account time double, since it uses irq_start_time to
>> > compute deltas between invocations and will attribute that delta to only
>> > one state.
>>
>> irq_start_time is a bit misleading, it is a time stamp of the last update.
>> The confusing part (which deserves a comment) is the fact that the delta
>> is not added to anything if hardirq_count and softirq_count are zero.
>
> Yeah, the name didn't help either, but I really expected to see two
> hooks: start/exit. I did eventually figure it all out, but it's a bit
> daft.
>
> If you had had 4 hooks, the problem below would have been fixable
> within the implementation.
>
>> > You still do have the problem with local_bh_disable() though, since you
>> > cannot distinguish between having bh disabled and processing softirq.
>> >
>> > So a hardirq that hits while you have bh disabled will inflate your
>> > softirq time.
>> >
>> > A possible solution is to have local_bh_{disable,enable} {add,sub}
>> > 2*SOFTIRQ_OFFSET and have the processing use SOFTIRQ_OFFSET, will need a
>> > bit of a code shuffle though.
>>
>> Hmm, that bug is valid for CONFIG_VIRT_CPU_ACCOUNTING=y as well.
>
> And nobody ever noticed?
>

Yes. I inherited the API from VIRT_CPU_ACCOUNTING along with this
local_bh_disable bug. Agree that we need one extra bit to handle this
case. I will take a stab at fixing this along with refresh of this
patchset if no one else has beaten me to it until then.

Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-20 17:13     ` Venkatesh Pallipadi
@ 2010-09-20 17:23       ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-20 17:23 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Mon, 2010-09-20 at 10:13 -0700, Venkatesh Pallipadi wrote:

> >> +       local_irq_save(flags);
> >> +
> >> +       cpu = task_cpu(tsk);
> >
> > Can this ever be anything other than smp_processor_id() and current?
> >
> >> +       now = sched_clock();
> >
> > this should be using one of the kernel/sched_clock.c thingies, probably
> > local_clock(), or sched_clock_cpu(cpu).
> 
> I don't really need the task there. I can use smp_processor_id() or
> task_cpu(tsk), and I think the latter would be cheaper.

Not sure; task_cpu() gets the cpu number from the thread_info struct,
smp_processor_id() gets it from per-cpu storage, and both are a single
memory read.

But I think it's a tad confusing that this function has a task argument
at all; if it's always current, it would be slightly better to call
it 'curr' or something.

Also, local_irq_save() followed by smp_processor_id() is clearly local,
while task_cpu(tsk) can be anything.

> You mean sched_clock() is not the right interface to use here.
> sched_clock_cpu() uses either sched_clock() or the remote-cpu stuff, which I
> don't really need here, and local_clock() also does irq
> disable/smp_processor_id() and calls sched_clock_cpu() in turn.
> sched_clock() seemed more appropriate.

Yeah, use sched_clock_cpu(smp_processor_id()); sched_clock() can be
utter crap on x86, and the code in kernel/sched_clock.c tries its best to
sanitize the crap we get from the hardware.
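
So the timing part would become something like (sketch only):

	local_irq_save(flags);
	cpu = smp_processor_id();
	now = sched_clock_cpu(cpu);
	delta = now - per_cpu(irq_start_time, cpu);
	per_cpu(irq_start_time, cpu) = now;
	if (hardirq_count())
		per_cpu(cpu_hardirq_time, cpu) += delta;
	else if (softirq_count())
		per_cpu(cpu_softirq_time, cpu) += delta;
	local_irq_restore(flags);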




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time
  2010-09-20 17:16           ` Venkatesh Pallipadi
@ 2010-09-20 17:26             ` Peter Zijlstra
  2010-09-27 20:35               ` [PATCH] si time accounting accounts bh_disable'd time to si Venkatesh Pallipadi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-20 17:26 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Martin Schwidefsky, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, linux-kernel, Paul Turner

On Mon, 2010-09-20 at 10:16 -0700, Venkatesh Pallipadi wrote:
> >> > You still do have the problem with local_bh_disable() though, since you
> >> > cannot distinguish between having bh disabled and processing softirq.
> >> >
> >> > So a hardirq that hits while you have bh disabled will inflate your
> >> > softirq time.

> >> Hmm, that bug is valid for CONFIG_VIRT_CPU_ACCOUNTING=y as well.
> >
> > And nobody ever noticed?
> >
> 
> Yes. I inherited the API from VIRT_CPU_ACCOUNTING along with this
> local_bh_disable bug. Agree that we need one extra bit to handle this
> case. I will take a stab at fixing this along with refresh of this
> patchset if no one else has beaten me to it until then. 

Make sure to only fix the softirq processing on the hardirq tail, not
the ksoftirqd one :-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] sched: Do not account irq time to current task
  2010-09-19 11:28   ` Peter Zijlstra
@ 2010-09-20 17:33     ` Venkatesh Pallipadi
  2010-09-20 17:38       ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-20 17:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Sun, Sep 19, 2010 at 4:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
>> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
>
> This patch makes my head hurt.

This is what I meant when I said the code turned out to be "messy" here :-)
- http://lkml.indiana.edu/hypermail//linux/kernel/1008.3/00784.html

<snip> as all the comments were related..

>
>> +#else
>> +
>> +#define update_irq_time(cpu, crq)              do { } while (0)
>> +
>> +static void save_rq_irq_time(int cpu, struct rq *rq) { }
>> +
>> +static unsigned long unaccount_irq_delta_fair(unsigned long delta_exec,
>> +               int cpu, struct cfs_rq *cfs_rq)
>> +{
>> +       return delta_exec;
>> +}
>> +
>> +static u64 unaccount_irq_delta_rt(u64 delta_exec, int cpu, struct rt_rq *rt_rq)
>> +{
>> +       return delta_exec;
>> +}
>> +
>>  #endif
>>
>>  #include "sched_idletask.c"
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index a171138..a64fdaf 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -521,6 +521,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>         struct sched_entity *curr = cfs_rq->curr;
>>         u64 now = rq_of(cfs_rq)->clock;
>
>  u64 now = rq_of(cfs_rq)->clock_task;
>
>>         unsigned long delta_exec;
>> +       int cpu = cpu_of(rq_of(cfs_rq));
>>
>>         if (unlikely(!curr))
>>                 return;
>> @@ -531,11 +532,13 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>          * overflow on 32 bits):
>>          */
>>         delta_exec = (unsigned long)(now - curr->exec_start);
>> +       curr->exec_start = now;
>> +       delta_exec = unaccount_irq_delta_fair(delta_exec, cpu, cfs_rq);
>> +
>>         if (!delta_exec)
>>                 return;
>>
>>         __update_curr(cfs_rq, curr, delta_exec);
>> -       curr->exec_start = now;
>>
>>         if (entity_is_task(curr)) {
>>                 struct task_struct *curtask = task_of(curr);
>> @@ -603,6 +606,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>          * We are starting a new run period:
>>          */
>>         se->exec_start = rq_of(cfs_rq)->clock;
>> +       update_irq_time(cpu_of(rq_of(cfs_rq)), cfs_rq);
>>  }
>>
>>  /**************************************************
>
> se->exec_start = rq_of(cfs_rq)->clock_task;
>
> And you don't need anything else, hmm?
>
>> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
>> index d10c80e..000d402 100644
>> --- a/kernel/sched_rt.c
>> +++ b/kernel/sched_rt.c
>> @@ -605,6 +605,7 @@ static void update_curr_rt(struct rq *rq)
>>         struct sched_rt_entity *rt_se = &curr->rt;
>>         struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
>>         u64 delta_exec;
>> +       int cpu = cpu_of(rq);
>>
>>         if (!task_has_rt_policy(curr))
>>                 return;
>> @@ -613,6 +614,14 @@ static void update_curr_rt(struct rq *rq)
>>         if (unlikely((s64)delta_exec < 0))
>>                 delta_exec = 0;
>>
>> +       /*
>> +        * rt_avg_update before removing irq_delta, to keep up with
>> +        * current behavior.
>> +        */
>> +       sched_rt_avg_update(rq, delta_exec);
>> +
>> +       delta_exec = unaccount_irq_delta_rt(delta_exec, cpu, rt_rq);
>> +
>>         schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec));
>>
>>         curr->se.sum_exec_runtime += delta_exec;
>> @@ -621,8 +630,6 @@ static void update_curr_rt(struct rq *rq)
>>         curr->se.exec_start = rq->clock;
>>         cpuacct_charge(curr, delta_exec);
>>
>> -       sched_rt_avg_update(rq, delta_exec);
>> -
>>         if (!rt_bandwidth_enabled())
>>                 return;
>
> I think it would all greatly simplify if you would compute delta_exec
> from rq->clock_task and add IRQ time to sched_rt_avg_update()
> separately.
>

Yes. I like your idea of having separate rq->clock and rq->clock_task.
That will clean up this code a bit.
We will still need to keep track of "last accounted irq time" at the
task or rq level to account sched_rt_avg_update correctly. But I don't
have to play with cfs_rq and rt_rq as in this patch though.

Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] sched: Do not account irq time to current task
  2010-09-20 17:33     ` Venkatesh Pallipadi
@ 2010-09-20 17:38       ` Peter Zijlstra
  2010-09-20 17:40         ` Venkatesh Pallipadi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-20 17:38 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Mon, 2010-09-20 at 10:33 -0700, Venkatesh Pallipadi wrote:
> Yes. I like your idea of having separate rq->clock and rq->clock_task.
> That will clean up this code a bit.
> We will still need to keep track of "last accounted irq time" at the
> task or rq level to account sched_rt_avg_update correctly. But I don't
> have to play with cfs_rq and rt_rq as in this patch though. 

Ah, indeed. Ok so have rq->clock, rq->clock_task and have a
irq_time_stamp to fold stuff into sched_rt_avg_update(), then I think
you can isolate all the clock bits to update_rq_clock() and then use
->clock_task in update_curr{,_rt}().
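
Something along these lines (rough sketch; irq_time_cpu() and
rq->prev_irq_time are made-up names for the per-cpu irq time and the
fold stamp):

	static void update_rq_clock_task(struct rq *rq, s64 delta)
	{
		s64 irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time;

		/* clamp, in case irq time ran ahead of the rq clock delta */
		if (irq_delta > delta)
			irq_delta = delta;

		rq->prev_irq_time += irq_delta;
		rq->clock_task += delta - irq_delta;

		/* fold the irq time into the rt-avg so cpu power sees it */
		sched_rt_avg_update(rq, irq_delta);
	}

	static void update_rq_clock(struct rq *rq)
	{
		s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;

		rq->clock += delta;
		update_rq_clock_task(rq, delta);
	}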



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] sched: Remove irq time from available CPU power
  2010-09-19 11:31   ` Peter Zijlstra
@ 2010-09-20 17:38     ` Venkatesh Pallipadi
  0 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-20 17:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Sun, Sep 19, 2010 at 4:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-09-16 at 18:56 -0700, Venkatesh Pallipadi wrote:
>> +++ b/kernel/sched.c
>> @@ -2025,6 +2025,18 @@ static u64 unaccount_irq_delta(u64 delta, int cpu, u64 *saved_irq_time)
>>  #define unaccount_irq_delta_rt(delta, cpu, class_rq)            \
>>                 unaccount_irq_delta(delta, cpu, &(class_rq)->saved_irq_time)
>>
>> +static void sched_irq_power_update_fair(int cpu, struct cfs_rq *cfs_rq,
>> +                       struct rq* rq)
>> +{
>> +       if (!sched_clock_irqtime)
>> +               return;
>> +
>> +       if (likely(rq->total_irq_time > cfs_rq->saved_irq_time)) {
>> +               sched_rt_avg_update(rq,
>> +                               rq->total_irq_time - cfs_rq->saved_irq_time);
>> +       }
>> +}
>> +
>>  #else
>>
>>  #define update_irq_time(cpu, crq)              do { } while (0)
>> @@ -2042,6 +2054,8 @@ static u64 unaccount_irq_delta_rt(u64 delta_exec, int cpu, struct rt_rq *rt_rq)
>>         return delta_exec;
>>  }
>>
>> +#define sched_irq_power_update_fair(cpu, crq, rq)      do { } while (0)
>> +
>>  #endif
>>
>>  #include "sched_idletask.c"
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index a64fdaf..937fded 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -526,6 +526,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>         if (unlikely(!curr))
>>                 return;
>>
>> +       if (sched_feat(NONIRQ_POWER) && entity_is_task(curr))
>> +               sched_irq_power_update_fair(cpu, cfs_rq, rq_of(cfs_rq));
>> +
>>         /*
>>          * Get the amount of time the current task was running
>>          * since the last time we changed load (this cannot
>
> This all looks very confusing to me,.. How about we simply fold the
> delta between rq->clock and rq->clock_task into sched_rt_avg_update()
> and be done with it?
>

rq->clock and rq->clock_task: as I understood it, you mean having both
be continuous counters since boot. So we will still need some
"rt_avg_accounted irq_time" that gets updated when
sched_rt_avg_update() is done. Or am I missing something?

Thanks.
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] sched: Do not account irq time to current task
  2010-09-20 17:38       ` Peter Zijlstra
@ 2010-09-20 17:40         ` Venkatesh Pallipadi
  0 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-20 17:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner

On Mon, Sep 20, 2010 at 10:38 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2010-09-20 at 10:33 -0700, Venkatesh Pallipadi wrote:
>> Yes. I like your idea of having separate rq->clock and rq->clock_task.
>> That will clean up this code a bit.
>> We will still need to keep track of "last accounted irq time" at the
>> task or rq level to account sched_rt_avg_update correctly. But I don't
>> have to play with cfs_rq and rt_rq as in this patch though.
>
> Ah, indeed. Ok so have rq->clock, rq->clock_task and have a
> irq_time_stamp to fold stuff into sched_rt_avg_update(), then I think
> you can isolate all the clock bits to update_rq_clock() and then use
> ->clock_task in update_curr{,_rt}().
>
>

OK. Thanks for all the feedback. Will have a newer cleaner version of
this soon...

Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] si time accounting accounts bh_disable'd time to si
  2010-09-20 17:26             ` Peter Zijlstra
@ 2010-09-27 20:35               ` Venkatesh Pallipadi
  2010-09-27 20:53                 ` Eric Dumazet
  2010-09-30 11:17                 ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-27 20:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, linux-kernel, Paul Turner, Venkatesh Pallipadi

>> >> > You still do have the problem with local_bh_disable() though, since you
>> >> > cannot distinguish between having bh disabled and processing softirq.
>> >> >
>> >> > So a hardirq that hits while you have bh disabled will inflate your
>> >> > softirq time.

>> >> Hmm, that bug is valid for CONFIG_VIRT_CPU_ACCOUNTING=y as well.
>> >
>> > And nobody ever noticed?
>> >
>> Yes. I inherited the API from VIRT_CPU_ACCOUNTING along with this
>> local_bh_disable bug. Agree that we need one extra bit to handle this
>> case. I will take a stab at fixing this along with refresh of this
>> patchset if no one else has beaten me to it until then.
>
>Make sure to only fix the softirq processing on the hardirq tail, not
> the ksoftirqd one :-)

softirq processing from hardirq tail and ksoftirqd are currently
handled in the same way and I didn't see any issues changing both of
them. Am I missing something?

Here's the patch I have for this. 

[PATCH] si time accounting accounts bh_disable'd time to si

Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread.
http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

The problem is that softirq processing uses local_bh_disable internally, and
there is no way later in the flow to differentiate between softirq being
processed and bh merely having been disabled. So a hardirq that arrives while
bh is disabled results in time being wrongly accounted as softirq.

Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well, as account_system_time() in normal tick based accounting also uses
softirq_count, which will be set when we are not in softirq but have bh disabled.

Peter also suggested solution of using 2 * SOFTIRQ_OFFSET as irq count
for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
processing. The patch below does that and adds API in_serving_softirq() which
returns whether we are currently processing softirq or not.

Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.

Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 include/linux/hardirq.h |    3 ++
 include/linux/sched.h   |    2 +-
 kernel/sched.c          |    2 +-
 kernel/softirq.c        |   51 +++++++++++++++++++++++++++++++---------------
 net/sched/cls_cgroup.c  |    2 +-
 5 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d5b3876..1c736ae 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -82,10 +82,13 @@
 /*
  * Are we doing bottom half or hardware interrupt processing?
  * Are we in a softirq context? Interrupt context?
+ * in_softirq answers - are we currently processing softirq or have bh disabled?
+ * in_serving_softirq answers - are we currently processing softirq?
  */
 #define in_irq()		(hardirq_count())
 #define in_softirq()		(softirq_count())
 #define in_interrupt()		(irq_count())
+#define in_serving_softirq()	(softirq_count() == SOFTIRQ_OFFSET)
 
 /*
  * Are we in NMI context?
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..1c40289 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2368,7 +2368,7 @@ extern int __cond_resched_lock(spinlock_t *lock);
 extern int __cond_resched_softirq(void);
 
 #define cond_resched_softirq() ({				\
-	__might_sleep(__FILE__, __LINE__, SOFTIRQ_OFFSET);	\
+	__might_sleep(__FILE__, __LINE__, SOFTIRQ_OFFSET * 2);	\
 	__cond_resched_softirq();				\
 })
 
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..b6e714b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3397,7 +3397,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	tmp = cputime_to_cputime64(cputime);
 	if (hardirq_count() - hardirq_offset)
 		cpustat->irq = cputime64_add(cpustat->irq, tmp);
-	else if (softirq_count())
+	else if (in_serving_softirq())
 		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
 	else
 		cpustat->system = cputime64_add(cpustat->system, tmp);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 07b4f1b..7bfc67b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -77,11 +77,21 @@ void wakeup_softirqd(void)
 }
 
 /*
+ * preempt count and SOFTIRQ_OFFSET usage:
+ * - preempt count is changed by SOFTIRQ_OFFSET on entering or leaving
+ *   softirq processing.
+ * - preempt count is changed by 2 * SOFTIRQ_OFFSET on local_bh_disable or
+ *   local_bh_enable.
+ * This lets us distinguish between whether we are currently processing
+ * softirq and whether we have bh disabled.
+ */
+
+/*
  * This one is for softirq.c-internal use,
  * where hardirqs are disabled legitimately:
  */
 #ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip)
+static void __local_bh_disable(unsigned long ip, unsigned int cnt)
 {
 	unsigned long flags;
 
@@ -95,32 +105,43 @@ static void __local_bh_disable(unsigned long ip)
 	 * We must manually increment preempt_count here and manually
 	 * call the trace_preempt_off later.
 	 */
-	preempt_count() += SOFTIRQ_OFFSET;
+	preempt_count() += cnt;
 	/*
 	 * Were softirqs turned off above:
 	 */
-	if (softirq_count() == SOFTIRQ_OFFSET)
+	if (softirq_count() == cnt)
 		trace_softirqs_off(ip);
 	raw_local_irq_restore(flags);
 
-	if (preempt_count() == SOFTIRQ_OFFSET)
+	if (preempt_count() == cnt)
 		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
 #else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip)
+static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
 {
-	add_preempt_count(SOFTIRQ_OFFSET);
+	add_preempt_count(cnt);
 	barrier();
 }
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
 void local_bh_disable(void)
 {
-	__local_bh_disable((unsigned long)__builtin_return_address(0));
+	__local_bh_disable((unsigned long)__builtin_return_address(0),
+				2 * SOFTIRQ_OFFSET);
 }
 
 EXPORT_SYMBOL(local_bh_disable);
 
+static void __local_bh_enable(unsigned int cnt)
+{
+	WARN_ON_ONCE(in_irq());
+	WARN_ON_ONCE(!irqs_disabled());
+
+	if (softirq_count() == cnt)
+		trace_softirqs_on((unsigned long)__builtin_return_address(0));
+	sub_preempt_count(cnt);
+}
+
 /*
  * Special-case - softirqs can safely be enabled in
  * cond_resched_softirq(), or by __do_softirq(),
@@ -128,12 +149,7 @@ EXPORT_SYMBOL(local_bh_disable);
  */
 void _local_bh_enable(void)
 {
-	WARN_ON_ONCE(in_irq());
-	WARN_ON_ONCE(!irqs_disabled());
-
-	if (softirq_count() == SOFTIRQ_OFFSET)
-		trace_softirqs_on((unsigned long)__builtin_return_address(0));
-	sub_preempt_count(SOFTIRQ_OFFSET);
+	__local_bh_enable(2 * SOFTIRQ_OFFSET);
 }
 
 EXPORT_SYMBOL(_local_bh_enable);
@@ -147,13 +163,13 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 	/*
 	 * Are softirqs going to be turned on now:
 	 */
-	if (softirq_count() == SOFTIRQ_OFFSET)
+	if (softirq_count() == 2 * SOFTIRQ_OFFSET)
 		trace_softirqs_on(ip);
 	/*
 	 * Keep preemption disabled until we are done with
 	 * softirq processing:
  	 */
- 	sub_preempt_count(SOFTIRQ_OFFSET - 1);
+	sub_preempt_count(2 * SOFTIRQ_OFFSET - 1);
 
 	if (unlikely(!in_interrupt() && local_softirq_pending()))
 		do_softirq();
@@ -198,7 +214,8 @@ asmlinkage void __do_softirq(void)
 	pending = local_softirq_pending();
 	account_system_vtime(current);
 
-	__local_bh_disable((unsigned long)__builtin_return_address(0));
+	__local_bh_disable((unsigned long)__builtin_return_address(0),
+				SOFTIRQ_OFFSET);
 	lockdep_softirq_enter();
 
 	cpu = smp_processor_id();
@@ -245,7 +262,7 @@ restart:
 	lockdep_softirq_exit();
 
 	account_system_vtime(current);
-	_local_bh_enable();
+	__local_bh_enable(SOFTIRQ_OFFSET);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 78ef2c5..37dff78 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -123,7 +123,7 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (softirq_count() != SOFTIRQ_OFFSET) {
+	if (in_serving_softirq()) {
 		/* If there is an sk_classid we'll use that. */
 		if (!skb->sk)
 			return -1;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] si time accounting accounts bh_disable'd time to si
  2010-09-27 20:35               ` [PATCH] si time accounting accounts bh_disable'd time to si Venkatesh Pallipadi
@ 2010-09-27 20:53                 ` Eric Dumazet
  2010-09-27 21:11                   ` Venkatesh Pallipadi
  2010-09-30 11:17                 ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-09-27 20:53 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Martin Schwidefsky, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Balbir Singh, linux-kernel, Paul Turner

Le lundi 27 septembre 2010 à 13:35 -0700, Venkatesh Pallipadi a écrit :
> >> >> > You still do have the problem with local_bh_disable() though, since you
> >> >> > cannot distinguish between having bh disabled and processing softirq.
> >> >> >
> >> >> > So a hardirq that hits while you have bh disabled will inflate your
> >> >> > softirq time.
> 
> >> >> Hmm, that bug is valid for CONFIG_VIRT_CPU_ACCOUNTING=y as well.
> >> >
> >> > And nobody ever noticed?
> >> >
> >> Yes. I inherited the API from VIRT_CPU_ACCOUNTING along with this
> >> local_bh_disable bug. Agree that we need one extra bit to handle this
> >> case. I will take a stab at fixing this along with refresh of this
> >> patchset if no one else has beaten me to it until then.
> >
> >Make sure to only fix the softirq processing on the hardirq tail, not
> > the ksoftirqd one :-)
> 
> softirq processing from hardirq tail and ksoftirqd are currently
> handled in the same way and I didn't see any issues changing both of
> them. Am I missing something?
> 
> Here's the patch I have for this. 
> 
> [PATCH] si time accounting accounts bh_disable'd time to si
> 
> Peter Zijlstra found a bug in the way softirq time is accounted in
> VIRT_CPU_ACCOUNTING on this thread.
> http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
> 
> The problem is that softirq processing uses local_bh_disable internally, and
> there is no way later in the flow to differentiate between softirq being
> processed and bh merely having been disabled. So a hardirq that arrives while
> bh is disabled results in time being wrongly accounted as softirq.
> 
> Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
> as well, as account_system_time() in normal tick based accounting also uses
> softirq_count, which will be set when we are not in softirq but have bh disabled.
> 
> Peter also suggested solution of using 2 * SOFTIRQ_OFFSET as irq count
> for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
> processing. The patch below does that and adds API in_serving_softirq() which
> returns whether we are currently processing softirq or not.
> 
> Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
> to in_serving_softirq.
> 
> Looks like many usages of in_softirq really want in_serving_softirq. Those
> changes can be made individually on a case by case basis.
> 
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
> ---
>  include/linux/hardirq.h |    3 ++
>  include/linux/sched.h   |    2 +-
>  kernel/sched.c          |    2 +-
>  kernel/softirq.c        |   51 +++++++++++++++++++++++++++++++---------------
>  net/sched/cls_cgroup.c  |    2 +-
>  5 files changed, 40 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index d5b3876..1c736ae 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -82,10 +82,13 @@
>  /*
>   * Are we doing bottom half or hardware interrupt processing?
>   * Are we in a softirq context? Interrupt context?
> + * in_softirq answers - are we currently processing softirq or have bh disabled?
> + * in_serving_softirq answers - are we currently processing softirq?
>   */
>  #define in_irq()		(hardirq_count())
>  #define in_softirq()		(softirq_count())
>  #define in_interrupt()		(irq_count())
> +#define in_serving_softirq()	(softirq_count() == SOFTIRQ_OFFSET)
>  

But softirq handlers sometimes call functions that might disable bh
again. It would be good not to switch softirq time to system time ;)

Shouldn't we reserve a bit (high order bit out of the 8) instead?

 * PREEMPT_MASK:    0x000000ff
 * SOFTIRQ_MASK:    0x0000ff00
 * SERVING_SOFTIRQ: 0x00008000
 * HARDIRQ_MASK:    0x03ff0000
 *     NMI_MASK:    0x04000000




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] si time accounting accounts bh_disable'd time to si
  2010-09-27 20:53                 ` Eric Dumazet
@ 2010-09-27 21:11                   ` Venkatesh Pallipadi
  2010-09-27 21:16                     ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-27 21:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Martin Schwidefsky, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Balbir Singh, linux-kernel, Paul Turner

On Mon, Sep 27, 2010 at 1:53 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le lundi 27 septembre 2010 à 13:35 -0700, Venkatesh Pallipadi a écrit :
>> >> >> > You still do have the problem with local_bh_disable() though, since you
>> >> >> > cannot distinguish between having bh disabled and processing softirq.
>> >> >> >
>> >> >> > So a hardirq that hits while you have bh disabled will inflate your
>> >> >> > softirq time.
>>
>> >> >> Hmm, that bug is valid for CONFIG_VIRT_CPU_ACCOUNTING=y as well.
>> >> >
>> >> > And nobody ever noticed?
>> >> >
>> >> Yes. I inherited the API from VIRT_CPU_ACCOUNTING along with this
>> >> local_bh_disable bug. Agree that we need one extra bit to handle this
>> >> case. I will take a stab at fixing this along with refresh of this
>> >> patchset if no one else has beaten me to it until then.
>> >
>> >Make sure to only fix the softirq processing on the hardirq tail, not
>> > the ksoftirqd one :-)
>>
>> softirq processing from hardirq tail and ksoftirqd are currently
>> handled in the same way and I didn't see any issues changing both of
>> them. Am I missing something?
>>
>> Here's the patch I have for this.
>>
>> [PATCH] si time accounting accounts bh_disable'd time to si
>>
>> Peter Zijlstra found a bug in the way softirq time is accounted in
>> VIRT_CPU_ACCOUNTING on this thread.
>> http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
>>
>> The problem is that softirq processing uses local_bh_disable internally, and
>> there is no way later in the flow to differentiate between softirq being
>> processed and bh merely having been disabled. So a hardirq that arrives while
>> bh is disabled results in time being wrongly accounted as softirq.
>>
>> Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
>> as well, as account_system_time() in normal tick based accounting also uses
>> softirq_count, which will be set when we are not in softirq but have bh disabled.
>>
>> Peter also suggested solution of using 2 * SOFTIRQ_OFFSET as irq count
>> for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
>> processing. The patch below does that and adds API in_serving_softirq() which
>> returns whether we are currently processing softirq or not.
>>
>> Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
>> to in_serving_softirq.
>>
>> Looks like many usages of in_softirq really want in_serving_softirq. Those
>> changes can be made individually on a case by case basis.
>>
>> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
>> ---
>>  include/linux/hardirq.h |    3 ++
>>  include/linux/sched.h   |    2 +-
>>  kernel/sched.c          |    2 +-
>>  kernel/softirq.c        |   51 +++++++++++++++++++++++++++++++---------------
>>  net/sched/cls_cgroup.c  |    2 +-
>>  5 files changed, 40 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
>> index d5b3876..1c736ae 100644
>> --- a/include/linux/hardirq.h
>> +++ b/include/linux/hardirq.h
>> @@ -82,10 +82,13 @@
>>  /*
>>   * Are we doing bottom half or hardware interrupt processing?
>>   * Are we in a softirq context? Interrupt context?
>> + * in_softirq answers - are we currently processing softirq or have bh disabled?
>> + * in_serving_softirq answers - are we currently processing softirq?
>>   */
>>  #define in_irq()             (hardirq_count())
>>  #define in_softirq()         (softirq_count())
>>  #define in_interrupt()               (irq_count())
>> +#define in_serving_softirq() (softirq_count() == SOFTIRQ_OFFSET)
>>
>
> But softirq handlers sometimes call functions that might disable bh
> again. It would be good not to switch softirq time to system time ;)

Yes. Good point :-). I should rather have
+#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET)
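
For example, with a nested local_bh_disable() inside a softirq handler
(sketch of the counts, assuming the 2 * SOFTIRQ_OFFSET scheme from the
patch):

	__do_softirq();          /* softirq_count() == SOFTIRQ_OFFSET     */
	  local_bh_disable();    /* softirq_count() == 3 * SOFTIRQ_OFFSET */

	/*
	 * The '==' test would wrongly say we are not serving softirq here,
	 * while the '&' test still sees the SOFTIRQ_OFFSET bit.
	 */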

>
> Shouldn't we reserve a bit (high order bit out of the 8) instead?
>
>  * PREEMPT_MASK:    0x000000ff
>  * SOFTIRQ_MASK:    0x0000ff00
>  * SERVING_SOFTIRQ: 0x00008000
>  * HARDIRQ_MASK:    0x03ff0000
>  *     NMI_MASK:    0x04000000

Things will be very much the same using a higher order bit or a lower order
bit. Somehow I felt using the lower order bit was cleaner...

Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] si time accounting accounts bh_disable'd time to si
  2010-09-27 21:11                   ` Venkatesh Pallipadi
@ 2010-09-27 21:16                     ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-09-27 21:16 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Martin Schwidefsky, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Balbir Singh, linux-kernel, Paul Turner

Le lundi 27 septembre 2010 à 14:11 -0700, Venkatesh Pallipadi a écrit :
> On Mon, Sep 27, 2010 at 1:53 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > But softirq handlers sometimes call functions that might disable bh
> > again. It would be good not to switch softirq time to system time ;)
> 
> Yes. Good point :-). I should rather have
> +#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET)
> 
> >
> > Shouldn't we reserve a bit (high order bit out of the 8) instead?
> >
> >  * PREEMPT_MASK:    0x000000ff
> >  * SOFTIRQ_MASK:    0x0000ff00
> >  * SERVING_SOFTIRQ: 0x00008000
> >  * HARDIRQ_MASK:    0x03ff0000
> >  *     NMI_MASK:    0x04000000
> 
> Things will be very much the same using a higher order bit or a lower order
> bit. Somehow I felt using the lower order bit was cleaner...
> 

No problem, I now understand what you wanted to achieve ;)




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] si time accounting accounts bh_disable'd time to si
  2010-09-27 20:35               ` [PATCH] si time accounting accounts bh_disable'd time to si Venkatesh Pallipadi
  2010-09-27 20:53                 ` Eric Dumazet
@ 2010-09-30 11:17                 ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-30 11:17 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Martin Schwidefsky, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, linux-kernel, Paul Turner

On Mon, 2010-09-27 at 13:35 -0700, Venkatesh Pallipadi wrote:
> >> >> > You still do have the problem with local_bh_disable() though, since you
> >> >> > cannot distinguish between having bh disabled and processing softirq.
> >> >> >
> >> >> > So a hardirq that hits while you have bh disabled will inflate your
> >> >> > softirq time.
> 
> >> >> Hmm, that bug is valid for CONFIG_VIRT_CPU_ACCOUNTING=y as well.
> >> >
> >> > And nobody ever noticed?
> >> >
> >> Yes. I inherited the API from VIRT_CPU_ACCOUNTING along with this
> >> local_bh_disable bug. Agree that we need one extra bit to handle this
> >> case. I will take a stab at fixing this along with refresh of this
> >> patchset if no one else has beaten me to it until then.
> >
> >Make sure to only fix the softirq processing on the hardirq tail, not
> > the ksoftirqd one :-)
> 
> softirq processing from hardirq tail and ksoftirqd are currently
> handled in the same way and I didn't see any issues changing both of
> them. Am I missing something? 

Well, ksoftirqd is a task and its runtime should be accounted to that
task, not to softirq... otherwise softirq and task time get all funny. Makes
sense?

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-09-30 11:17 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-17  1:56 [PATCH 0/6] Proper kernel irq time accounting Venkatesh Pallipadi
2010-09-17  1:56 ` [PATCH 1/6] Consolidate account_system_vtime extern declaration Venkatesh Pallipadi
2010-09-17  1:56 ` [PATCH 2/6] Add IRQ_TIME_ACCOUNTING, finer accounting of CPU irq time Venkatesh Pallipadi
2010-09-19 11:11   ` Peter Zijlstra
2010-09-20 17:13     ` Venkatesh Pallipadi
2010-09-20 17:23       ` Peter Zijlstra
2010-09-19 11:21   ` Peter Zijlstra
2010-09-19 11:42     ` Peter Zijlstra
2010-09-19 12:01     ` Peter Zijlstra
2010-09-20  7:27       ` Martin Schwidefsky
2010-09-20  9:27         ` Peter Zijlstra
2010-09-20 17:16           ` Venkatesh Pallipadi
2010-09-20 17:26             ` Peter Zijlstra
2010-09-27 20:35               ` [PATCH] si time accounting accounts bh_disable'd time to si Venkatesh Pallipadi
2010-09-27 20:53                 ` Eric Dumazet
2010-09-27 21:11                   ` Venkatesh Pallipadi
2010-09-27 21:16                     ` Eric Dumazet
2010-09-30 11:17                 ` Peter Zijlstra
2010-09-17  1:56 ` [PATCH 3/6] x86: Add IRQ_TIME_ACCOUNTING in x86 Venkatesh Pallipadi
2010-09-17  1:56 ` [PATCH 4/6] sched: Do not account irq time to current task Venkatesh Pallipadi
2010-09-19 11:28   ` Peter Zijlstra
2010-09-20 17:33     ` Venkatesh Pallipadi
2010-09-20 17:38       ` Peter Zijlstra
2010-09-20 17:40         ` Venkatesh Pallipadi
2010-09-17  1:56 ` [PATCH 5/6] sched: Remove irq time from available CPU power Venkatesh Pallipadi
2010-09-19 11:31   ` Peter Zijlstra
2010-09-20 17:38     ` Venkatesh Pallipadi
2010-09-17  1:56 ` [PATCH 6/6] Export per cpu hardirq and softirq time in proc Venkatesh Pallipadi
