* [PATCH 0/5] Proper kernel irq time reporting -v0
From: Venkatesh Pallipadi @ 2010-10-20 22:48 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Eric Dumazet, Shaun Ruffell,
	Yong Zhang, Venkatesh Pallipadi

This is part 2 of
"Proper kernel irq time accounting -v4"
http://lkml.indiana.edu/hypermail//linux/kernel/1010.0/01175.html

and applies on top of those changes.

Part 1 addressed how irq time is accounted to the scheduler and to
tasks. This patchset addresses how irq times are reported in /proc/stat,
and stops irq time from being included in task->stime, etc.
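
For reference, the fields involved are the per-cpu irq and softirq
columns in /proc/stat; the line below is illustrative only, with a
legend added for this example:

cpu0 4705 150 1120 16250 520 29 370 0 0 0
     (user nice system idle iowait irq softirq steal guest guest_nice)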

Example:
Running a cpu-intensive loop and a network-intensive nc on a 4-CPU
system and looking at 'top' output.

With vanilla kernel:
Cpu0  :  0.0% us,  0.3% sy,  0.0% ni, 99.3% id,  0.0% wa,  0.0% hi,  0.3% si
Cpu1  : 100.0% us,  0.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu2  :  1.3% us, 27.2% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi, 71.4% si
Cpu3  :  1.6% us,  1.3% sy,  0.0% ni, 96.7% id,  0.0% wa,  0.0% hi,  0.3% si

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7555 root      20   0  1760  528  436 R  100  0.0   0:15.79 nc
 7563 root      20   0  3632  268  204 R  100  0.0   0:13.13 loop

Notes:
- Both tasks show 100% CPU, even though one of them is stuck on a CPU
  that's spending ~70% of its time in softirq.
- No hardirq time is reported.


With "part 1" patches:
Cpu0  :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu1  : 100.0% us,  0.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu2  :  2.0% us, 30.6% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi, 67.4% si
Cpu3  :  0.7% us,  0.7% sy,  0.3% ni, 98.3% id,  0.0% wa,  0.0% hi,  0.0% si

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6289 root      20   0  3632  268  204 R  100  0.0   2:18.67 loop
 5737 root      20   0  1760  528  436 R   33  0.0   0:26.72 nc

Notes:
- The tasks show 100% and 33% CPU, corresponding to their non-irq exec
  time.
- No hardirq time is reported.


With "part 1 + part 2" patches:
Cpu0  :  1.3% us,  1.0% sy,  0.3% ni, 97.0% id,  0.0% wa,  0.0% hi,  0.3% si
Cpu1  : 99.3% us,  0.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.7% hi,  0.0% si
Cpu2  :  1.3% us, 31.5% sy,  0.0% ni,  0.0% id,  0.0% wa,  8.3% hi, 58.9% si
Cpu3  :  1.0% us,  2.0% sy,  0.3% ni, 95.0% id,  0.0% wa,  0.7% hi,  1.0% si

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20929 root      20   0  3632  268  204 R   99  0.0   3:48.25 loop
20796 root      20   0  1760  528  436 R   33  0.0   2:38.65 nc

Notes:
- Both task exec time and hardirq time are reported correctly.
- hi and si times are based on fine-granularity info, not on samples.
- getrusage gives a proper utime/stime split that does not include irq
  time in the ratio (see the sketch below).
- Other places that report user/sys time, like cgroup cpuacct.stat,
  will now include only non-irq exec time.
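
To illustrate the getrusage point, a minimal userspace sketch (for
illustration only, not part of the patchset; the loop bound is
arbitrary):

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
	struct rusage ru;
	volatile unsigned long i;

	/* burn some cpu so there is something to report */
	for (i = 0; i < 400000000UL; i++)
		;

	if (getrusage(RUSAGE_SELF, &ru))
		return 1;

	/* with these patches, neither field includes hi/si time */
	printf("utime %ld.%06ld stime %ld.%06ld\n",
	       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
	       (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
	return 0;
}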

Thanks,
Venki

Signed-off-by: Venkatesh Pallipadi <venki@google.com>



* [PATCH 1/6] Free up pf flag PF_KSOFTIRQD
From: Venkatesh Pallipadi @ 2010-10-20 22:48 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Eric Dumazet, Shaun Ruffell,
	Yong Zhang, Venkatesh Pallipadi

Cleanup patch: free up PF_KSOFTIRQD and use
(current == __get_cpu_var(ksoftirqd)) instead, as suggested by
Eric Dumazet.

Also an unrelated one-line cleanup, changing hardirq_count() to
in_irq() in account_system_vtime.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/sched.h |    3 ++-
 kernel/sched.c        |    4 ++--
 kernel/softirq.c      |    6 +++++-
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0383601..6dd4894 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1691,7 +1691,6 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 /*
  * Per process flags
  */
-#define PF_KSOFTIRQD	0x00000001	/* I am ksoftirqd */
 #define PF_STARTING	0x00000002	/* being created */
 #define PF_EXITING	0x00000004	/* getting shut down */
 #define PF_EXITPIDONE	0x00000008	/* pi exit done on shut down */
@@ -1749,6 +1748,8 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+int is_ksoftirqd_context(void);
+
 #ifdef CONFIG_PREEMPT_RCU
 
 #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
diff --git a/kernel/sched.c b/kernel/sched.c
index abf8440..5d4afe9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1984,9 +1984,9 @@ void account_system_vtime(struct task_struct *curr)
 	 * in that case, so as not to confuse scheduler with a special task
 	 * that do not consume any time, but still wants to run.
 	 */
-	if (hardirq_count())
+	if (in_irq())
 		per_cpu(cpu_hardirq_time, cpu) += delta;
-	else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
+	else if (in_serving_softirq() && !(is_ksoftirqd_context()))
 		per_cpu(cpu_softirq_time, cpu) += delta;
 
 	local_irq_restore(flags);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index f02a9df..3ceb560 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,6 +56,11 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp
 
 static DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
 
+int is_ksoftirqd_context(void)
+{
+	return (current == __get_cpu_var(ksoftirqd));
+}
+
 char *softirq_to_name[NR_SOFTIRQS] = {
 	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
 	"TASKLET", "SCHED", "HRTIMER",	"RCU"
@@ -719,7 +724,6 @@ static int run_ksoftirqd(void * __bind_cpu)
 {
 	set_current_state(TASK_INTERRUPTIBLE);
 
-	current->flags |= PF_KSOFTIRQD;
 	while (!kthread_should_stop()) {
 		preempt_disable();
 		if (!local_softirq_pending()) {
-- 
1.7.1



* [PATCH 2/6] Add nsecs_to_cputime64 interface for asm-generic
From: Venkatesh Pallipadi @ 2010-10-20 22:48 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Eric Dumazet, Shaun Ruffell,
	Yong Zhang, Venkatesh Pallipadi

Add the nsecs_to_cputime64 interface. This is used by the following
patches, which update cpu irq stats based on ns-granularity info when
IRQ_TIME_ACCOUNTING is enabled.
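
For illustration, a minimal usage sketch (not from the patch itself;
assuming HZ=1000, so nsecs_to_cputime64 maps to nsecs_to_jiffies64 and
one jiffy is 10^6 ns):

	u64 irq_ns = 2500000;	/* 2.5 ms of accumulated irq time */
	cputime64_t ct = nsecs_to_cputime64(irq_ns);	/* -> 2 jiffies */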

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 include/asm-generic/cputime.h |    3 +++
 include/linux/jiffies.h       |    1 +
 kernel/time.c                 |   23 +++++++++++++++++++++--
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/cputime.h b/include/asm-generic/cputime.h
index ca0f239..c0f0da0 100644
--- a/include/asm-generic/cputime.h
+++ b/include/asm-generic/cputime.h
@@ -30,6 +30,9 @@ typedef u64 cputime64_t;
 #define cputime64_to_jiffies64(__ct)	(__ct)
 #define jiffies64_to_cputime64(__jif)	(__jif)
 #define cputime_to_cputime64(__ct)	((u64) __ct)
+#define cputime64_gt(__a, __b)		((__a) >  (__b))
+
+#define nsecs_to_cputime64(__ct)	nsecs_to_jiffies64(__ct)
 
 
 /*
diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 6811f4b..922aa31 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -307,6 +307,7 @@ extern clock_t jiffies_to_clock_t(long x);
 extern unsigned long clock_t_to_jiffies(unsigned long x);
 extern u64 jiffies_64_to_clock_t(u64 x);
 extern u64 nsec_to_clock_t(u64 x);
+extern u64 nsecs_to_jiffies64(u64 n);
 extern unsigned long nsecs_to_jiffies(u64 n);
 
 #define TIMESTAMP_SIZE	30
diff --git a/kernel/time.c b/kernel/time.c
index ba9b338..058d4be 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -645,7 +645,7 @@ u64 nsec_to_clock_t(u64 x)
 }
 
 /**
- * nsecs_to_jiffies - Convert nsecs in u64 to jiffies
+ * nsecs_to_jiffies64 - Convert nsecs in u64 to jiffies64
  *
  * @n:	nsecs in u64
  *
@@ -657,7 +657,7 @@ u64 nsec_to_clock_t(u64 x)
  *   NSEC_PER_SEC = 10^9 = (5^9 * 2^9) = (1953125 * 512)
  *   ULLONG_MAX ns = 18446744073.709551615 secs = about 584 years
  */
-unsigned long nsecs_to_jiffies(u64 n)
+u64 nsecs_to_jiffies64(u64 n)
 {
 #if (NSEC_PER_SEC % HZ) == 0
 	/* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */
@@ -673,6 +673,25 @@ unsigned long nsecs_to_jiffies(u64 n)
 	return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ);
 #endif
 }
+
+
+/**
+ * nsecs_to_jiffies - Convert nsecs in u64 to jiffies
+ *
+ * @n:	nsecs in u64
+ *
+ * Unlike {m,u}secs_to_jiffies, type of input is not unsigned int but u64.
+ * And this doesn't return MAX_JIFFY_OFFSET since this function is designed
+ * for scheduler, not for use in device drivers to calculate timeout value.
+ *
+ * note:
+ *   NSEC_PER_SEC = 10^9 = (5^9 * 2^9) = (1953125 * 512)
+ *   ULLONG_MAX ns = 18446744073.709551615 secs = about 584 years
+ */
+unsigned long nsecs_to_jiffies(u64 n)
+{
+	return (unsigned long)nsecs_to_jiffies64(n);
+}
 
 #if (BITS_PER_LONG < 64)
 u64 get_jiffies_64(void)
-- 
1.7.1



* [PATCH 3/6] Refactor account_system_time separating id and actual update
From: Venkatesh Pallipadi @ 2010-10-20 22:48 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Eric Dumazet, Shaun Ruffell,
	Yong Zhang, Venkatesh Pallipadi

Refactor account_system_time to separate the logic that identifies
which cpustat field needs updating from the code that does the actual
update.

This is used by the following patch for IRQ_TIME_ACCOUNTING, which has
different identification logic but the same update logic.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 kernel/sched.c |   46 +++++++++++++++++++++++++++++++---------------
 1 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 5d4afe9..8b97958 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3501,6 +3501,32 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 }
 
 /*
+ * Account system cpu time to a process and desired cpustat field
+ * @p: the process that the cpu time gets accounted to
+ * @cputime: the cpu time spent in kernel space since the last update
+ * @cputime_scaled: cputime scaled by cpu frequency
+ * @target_cputime64: pointer to cpustat field that needs updating
+ */
+static inline
+void __account_system_time(struct task_struct *p, cputime_t cputime,
+			cputime_t cputime_scaled, cputime64_t *target_cputime64)
+{
+	cputime64_t tmp = cputime_to_cputime64(cputime);
+
+	/* Add system time to process. */
+	p->stime = cputime_add(p->stime, cputime);
+	p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
+	account_group_system_time(p, cputime);
+
+	/* Add system time to cpustat. */
+	*target_cputime64 = cputime64_add(*target_cputime64, tmp);
+	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
+
+	/* Account for system time used */
+	acct_update_integrals(p);
+}
+
+/*
  * Account system cpu time to a process.
  * @p: the process that the cpu time gets accounted to
  * @hardirq_offset: the offset to subtract from hardirq_count()
@@ -3511,31 +3537,21 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t tmp;
+	cputime64_t *target_cputime64;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
 		return;
 	}
 
-	/* Add system time to process. */
-	p->stime = cputime_add(p->stime, cputime);
-	p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
-	account_group_system_time(p, cputime);
-
-	/* Add system time to cpustat. */
-	tmp = cputime_to_cputime64(cputime);
 	if (hardirq_count() - hardirq_offset)
-		cpustat->irq = cputime64_add(cpustat->irq, tmp);
+		target_cputime64 = &cpustat->irq;
 	else if (in_serving_softirq())
-		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+		target_cputime64 = &cpustat->softirq;
 	else
-		cpustat->system = cputime64_add(cpustat->system, tmp);
+		target_cputime64 = &cpustat->system;
 
-	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
-
-	/* Account for system time used */
-	acct_update_integrals(p);
+	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
 }
 
 /*
-- 
1.7.1



* [PATCH 4/6] Export ns irqtimes from IRQ_TIME_ACCOUNTING through /proc/stat
From: Venkatesh Pallipadi @ 2010-10-20 22:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Eric Dumazet, Shaun Ruffell,
	Yong Zhang, Venkatesh Pallipadi

CONFIG_IRQ_TIME_ACCOUNTING adds ns-granularity irq time accounting on
each CPU. This info is already used by the scheduler to do proper task
chargeback (earlier patches). This patch retrofits that ns-granularity
hardirq and softirq information into the /proc/stat irq and softirq
fields.

The update is still done on the timer tick, where we look at the
accumulated ns hardirq/softirq time and account the tick to
user/system/irq/softirq/guest/idle accordingly.

No new interface is added.

Earlier versions looked at adding this as new fields in some /proc
files. This approach seems to have the least impact on existing apps,
even though it carries somewhat more kernel code than earlier versions.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 kernel/sched.c |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 102 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8b97958..7e812a6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2002,8 +2002,40 @@ static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
 	}
 }
 
+static int irqtime_account_hi_update(void)
+{
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	unsigned long flags;
+	u64 latest_ns;
+	int ret = 0;
+
+	local_irq_save(flags);
+	latest_ns = __get_cpu_var(cpu_hardirq_time);
+	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
+		ret = 1;
+	local_irq_restore(flags);
+	return ret;
+}
+
+static int irqtime_account_si_update(void)
+{
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	unsigned long flags;
+	u64 latest_ns;
+	int ret = 0;
+
+	local_irq_save(flags);
+	latest_ns = __get_cpu_var(cpu_softirq_time);
+	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
+		ret = 1;
+	local_irq_restore(flags);
+	return ret;
+}
+
 #else
 
+#define sched_clock_irqtime	(0)
+
 static u64 irq_time_cpu(int cpu)
 {
 	return 0;
@@ -3554,6 +3586,65 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * Account a tick to a process and cpustat
+ * @p: the process that the cpu time gets accounted to
+ * @user_tick: is the tick from userspace
+ * @rq: the pointer to rq
+ *
+ * Tick demultiplexing follows the order
+ * - pending hardirq update
+ * - user_time
+ * - pending softirq update
+ * - idle_time
+ * - system time
+ *   - check for guest_time
+ *   - else account as system_time
+ *
+ * Check for hardirq is done both for system and user time as there is
+ * no timer going off while we are on hardirq and hence we may never get an
+ * oppurtunity to update it solely in system time.
+ * p->stime and friends are only updated on system time and not on irq
+ * softirq as those do not count in task exec_runtime any more.
+ */
+static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
+						struct rq *rq)
+{
+	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
+	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+
+	if (irqtime_account_hi_update()) {
+		cpustat->irq = cputime64_add(cpustat->irq, tmp);
+	} else if (user_tick) {
+		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
+	} else if (irqtime_account_si_update()) {
+		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+	} else if (p == rq->idle) {
+		account_idle_time(cputime_one_jiffy);
+	} else if (p->flags & PF_VCPU) { /* System time or guest time */
+		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
+	} else {
+		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
+					&cpustat->system);
+	}
+}
+
+static void irqtime_account_idle_ticks(int ticks)
+{
+	int i;
+	struct rq *rq = this_rq();
+
+	for (i = 0; i < ticks; i++)
+		irqtime_account_process_tick(current, 0, rq);
+}
+#else
+static void irqtime_account_idle_ticks(int ticks) {}
+static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
+						struct rq *rq) {}
+#endif
+
 /*
  * Account for involuntary wait time.
  * @steal: the cpu time spent in involuntary wait
@@ -3594,6 +3685,11 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
 	struct rq *rq = this_rq();
 
+	if (sched_clock_irqtime) {
+		irqtime_account_process_tick(p, user_tick, rq);
+		return;
+	}
+
 	if (user_tick)
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
@@ -3619,6 +3715,12 @@ void account_steal_ticks(unsigned long ticks)
  */
 void account_idle_ticks(unsigned long ticks)
 {
+
+	if (sched_clock_irqtime) {
+		irqtime_account_idle_ticks(ticks);
+		return;
+	}
+
 	account_idle_time(jiffies_to_cputime(ticks));
 }
 
-- 
1.7.1



* [PATCH 5/6] Account ksoftirq time as cpustat softirq
From: Venkatesh Pallipadi @ 2010-10-20 22:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Eric Dumazet, Shaun Ruffell,
	Yong Zhang, Venkatesh Pallipadi

Softirq time spent in ksoftirqd context is not accounted in the
ns-granularity per-cpu softirq stats, as we want that time to remain
part of ksoftirqd's exec_runtime.

Account it separately instead, and use it when reporting stats, so that
/proc/stat includes that time in the cpu si field.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 kernel/sched.c |   28 +++++++++++++++++++++-------
 1 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7e812a6..cba8090 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1941,6 +1941,7 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
  */
 static DEFINE_PER_CPU(u64, cpu_hardirq_time);
 static DEFINE_PER_CPU(u64, cpu_softirq_time);
+static DEFINE_PER_CPU(u64, cpu_ksoftirqd_time);
 
 static DEFINE_PER_CPU(u64, irq_start_time);
 static int sched_clock_irqtime;
@@ -1979,15 +1980,20 @@ void account_system_vtime(struct task_struct *curr)
 	delta = now - per_cpu(irq_start_time, cpu);
 	per_cpu(irq_start_time, cpu) = now;
 	/*
-	 * We do not account for softirq time from ksoftirqd here.
 	 * We want to continue accounting softirq time to ksoftirqd thread
 	 * in that case, so as not to confuse scheduler with a special task
 	 * that do not consume any time, but still wants to run.
+	 * So, softirq time in ksoftirqd is accounted separately and used
+	 * for softirq time reporting in /proc/stat.
 	 */
-	if (in_irq())
+	if (in_irq()) {
 		per_cpu(cpu_hardirq_time, cpu) += delta;
-	else if (in_serving_softirq() && !(is_ksoftirqd_context()))
-		per_cpu(cpu_softirq_time, cpu) += delta;
+	} else if (in_serving_softirq()) {
+		if (is_ksoftirqd_context())
+			per_cpu(cpu_ksoftirqd_time, cpu) += delta;
+		else
+			per_cpu(cpu_softirq_time, cpu) += delta;
+	}
 
 	local_irq_restore(flags);
 }
@@ -2025,7 +2031,9 @@ static int irqtime_account_si_update(void)
 	int ret = 0;
 
 	local_irq_save(flags);
-	latest_ns = __get_cpu_var(cpu_softirq_time);
+	latest_ns = __get_cpu_var(cpu_softirq_time) +
+			__get_cpu_var(cpu_ksoftirqd_time);
+
 	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
 		ret = 1;
 	local_irq_restore(flags);
@@ -3606,7 +3614,8 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
  * no timer going off while we are on hardirq and hence we may never get an
  * oppurtunity to update it solely in system time.
  * p->stime and friends are only updated on system time and not on irq
- * softirq as those do not count in task exec_runtime any more.
+ * softirq as those do not count in task exec_runtime any more (except
+ * for ksoftirqd).
  */
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 						struct rq *rq)
@@ -3620,7 +3629,12 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	} else if (user_tick) {
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else if (irqtime_account_si_update()) {
-		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+		if (is_ksoftirqd_context()) {
+			__account_system_time(p, cputime_one_jiffy,
+					one_jiffy_scaled, &cpustat->softirq);
+		} else {
+			cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+		}
 	} else if (p == rq->idle) {
 		account_idle_time(cputime_one_jiffy);
 	} else if (p->flags & PF_VCPU) { /* System time or guest time */
-- 
1.7.1



* Re: [PATCH 1/6] Free up pf flag PF_KSOFTIRQD
From: Eric Dumazet @ 2010-10-21  5:23 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
	Shaun Ruffell, Yong Zhang

On Wednesday 20 October 2010 at 15:48 -0700, Venkatesh Pallipadi wrote:
> +int is_ksoftirqd_context(void)
> +{
> +       return (current == __get_cpu_var(ksoftirqd));
> +} 

"return (X == Y);" should be "return X == Y;"

I believe this function should be inlined and should use this_cpu_read().
You could probably pass 'current' in as a parameter.

static inline bool is_ksoftirq(struct task_struct *p)
{
	return p == this_cpu_read(ksoftirqd);
}

Your version is a bit expensive:

<is_ksoftirqd_context>:
	55                   	push   %rbp
	48 89 e5             	mov    %rsp,%rbp
	48 83 ec 10          	sub    $0x10,%rsp
	48 89 1c 24          	mov    %rbx,(%rsp)
	4c 89 64 24 08       	mov    %r12,0x8(%rsp)
	48 c7 c3 40 03 01 00 	mov    $0x10340,%rbx
	e8 93 e5 24 00       	callq  ffffffff812aeab0 <debug_smp_processor_id>
	89 c0                	mov    %eax,%eax
	48 8b 04 c5 a0 b8 b5 	mov    -0x7e4a4760(,%rax,8),%rax
	81 
	65 4c 8b 24 25 00 cc 	mov    %gs:0xcc00,%r12
	00 00 
	4c 39 24 18          	cmp    %r12,(%rax,%rbx,1)
	48 8b 1c 24          	mov    (%rsp),%rbx
	4c 8b 64 24 08       	mov    0x8(%rsp),%r12
	c9                   	leaveq 
	0f 94 c0             	sete   %al
	0f b6 c0             	movzbl %al,%eax
	c3                   	retq   

While the alternate version is just:

				cmp    %gs:0x10340,%rdi

(So it should be as fast as the previous flag-based test.)

Thanks




* Re: [PATCH 1/6] Free up pf flag PF_KSOFTIRQD
From: Venkatesh Pallipadi @ 2010-10-21 14:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
	Shaun Ruffell, Yong Zhang

On Wed, Oct 20, 2010 at 10:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wednesday 20 October 2010 at 15:48 -0700, Venkatesh Pallipadi wrote:
>> +int is_ksoftirqd_context(void)
>> +{
>> +       return (current == __get_cpu_var(ksoftirqd));
>> +}
>
> "return (X == Y);" should be "return X == Y;"
>
> I believe this function should be inlined and should use this_cpu_read().
> You could probably pass 'current' in as a parameter.
>
> static inline bool is_ksoftirq(struct task_struct *p)
> {
>        return p == this_cpu_read(ksoftirqd);
> }

Yes. I thought about the static inline part. The reason I did not make
this static inline was that ksoftirqd is declared static in softirq.c
and this function gets called from sched.c.

I did not know that this_cpu_read() existed, though. I guess I should
look at using it elsewhere in the patchset too.

Also, part of the overhead you see below is, I think, coming from
DEBUG_PREEMPT. That would make every smp_processor_id() call more
expensive. No?

Thanks,
Venki

>
> Your version is a bit expensive:
>
> <is_ksoftirqd_context>:
>        55                      push   %rbp
>        48 89 e5                mov    %rsp,%rbp
>        48 83 ec 10             sub    $0x10,%rsp
>        48 89 1c 24             mov    %rbx,(%rsp)
>        4c 89 64 24 08          mov    %r12,0x8(%rsp)
>        48 c7 c3 40 03 01 00    mov    $0x10340,%rbx
>        e8 93 e5 24 00          callq  ffffffff812aeab0 <debug_smp_processor_id>
>        89 c0                   mov    %eax,%eax
>        48 8b 04 c5 a0 b8 b5    mov    -0x7e4a4760(,%rax,8),%rax
>        81
>        65 4c 8b 24 25 00 cc    mov    %gs:0xcc00,%r12
>        00 00
>        4c 39 24 18             cmp    %r12,(%rax,%rbx,1)
>        48 8b 1c 24             mov    (%rsp),%rbx
>        4c 8b 64 24 08          mov    0x8(%rsp),%r12
>        c9                      leaveq
>        0f 94 c0                sete   %al
>        0f b6 c0                movzbl %al,%eax
>        c3                      retq
>
> While the alternate version is just:
>
>                                cmp    %gs:0x10340,%rdi
>
> (So it should be as fast as the previous flag-based test.)
>
> Thanks
>
>
>


* Re: [PATCH 4/6] Export ns irqtimes from IRQ_TIME_ACCOUNTING through /proc/stat
From: Peter Zijlstra @ 2010-10-21 14:44 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
	Shaun Ruffell, Yong Zhang

On Wed, 2010-10-20 at 15:49 -0700, Venkatesh Pallipadi wrote:

> +static int irqtime_account_hi_update(void)
> +{
> +	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> +	unsigned long flags;
> +	u64 latest_ns;
> +	int ret = 0;
> +
> +	local_irq_save(flags);
> +	latest_ns = __get_cpu_var(cpu_hardirq_time);

I guess this_cpu_read() would again be an improvement; same for the SI
version.

> +	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
> +		ret = 1;
> +	local_irq_restore(flags);
> +	return ret;
> +}

> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +/*
> + * Account a tick to a process and cpustat
> + * @p: the process that the cpu time gets accounted to
> + * @user_tick: is the tick from userspace
> + * @rq: the pointer to rq
> + *
> + * Tick demultiplexing follows the order
> + * - pending hardirq update
> + * - user_time
> + * - pending softirq update
> + * - idle_time
> + * - system time
> + *   - check for guest_time
> + *   - else account as system_time
> + *
> + * Check for hardirq is done both for system and user time as there is
> + * no timer going off while we are on hardirq and hence we may never get an
> + * oppurtunity to update it solely in system time.

My mailer suggests you spell that as: opportunity :-)

> + * p->stime and friends are only updated on system time and not on irq
> + * softirq as those do not count in task exec_runtime any more.
> + */
> +static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> +						struct rq *rq)
> +{
> +	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
> +	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
> +	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> +
> +	if (irqtime_account_hi_update()) {
> +		cpustat->irq = cputime64_add(cpustat->irq, tmp);
> +	} else if (user_tick) {
> +		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
> +	} else if (irqtime_account_si_update()) {
> +		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
> +	} else if (p == rq->idle) {
> +		account_idle_time(cputime_one_jiffy);
> +	} else if (p->flags & PF_VCPU) { /* System time or guest time */
> +		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
> +	} else {
> +		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
> +					&cpustat->system);
> +	}
> +}

I'd do:

  - hardirq
  - softirq
  - user
  - system
     - guest
     - really system
  - idle

Since otherwise tiny slices of softirq would need to wait for a system
tick to happen before you fold them.

Also, it is possible that in a single tick multiple counters overflow
the jiffy boundary, so something like:

  if (irqtime_account_hi_update())
	cpustat->irq = ...

  if (irqtime_account_si_update())
	cpustat->softirq = ...

  if (user_tick) {
  } else if (...) {

  } else ...

would seem like the better approach.

>  /*
>   * Account for involuntary wait time.
>   * @steal: the cpu time spent in involuntary wait
> @@ -3594,6 +3685,11 @@ void account_process_tick(struct task_struct *p, int user_tick)
>  	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
>  	struct rq *rq = this_rq();
>  
> +	if (sched_clock_irqtime) {
> +		irqtime_account_process_tick(p, user_tick, rq);
> +		return;
> +	}
> +
>  	if (user_tick)
>  		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
>  	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))

mark_tsc_unstable() can disable sched_clock_irqtime at any time; the
accounting won't go funny due to that, right?


* Re: [PATCH 5/6] Account ksoftirq time as cpustat softirq
From: Peter Zijlstra @ 2010-10-21 14:53 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
	Shaun Ruffell, Yong Zhang

On Wed, 2010-10-20 at 15:49 -0700, Venkatesh Pallipadi wrote:
> @@ -1979,15 +1980,20 @@ void account_system_vtime(struct task_struct *curr)
>         delta = now - per_cpu(irq_start_time, cpu);
>         per_cpu(irq_start_time, cpu) = now;
>         /*
> -        * We do not account for softirq time from ksoftirqd here.
>          * We want to continue accounting softirq time to ksoftirqd thread
>          * in that case, so as not to confuse scheduler with a special task
>          * that do not consume any time, but still wants to run.
> +        * So, softirq time in ksoftirqd is accounted separately and used
> +        * for softirq time reporting in /proc/stat.
>          */
> -       if (in_irq())
> +       if (in_irq()) {
>                 per_cpu(cpu_hardirq_time, cpu) += delta;
> -       else if (in_serving_softirq() && !(is_ksoftirqd_context()))
> -               per_cpu(cpu_softirq_time, cpu) += delta;
> +       } else if (in_serving_softirq()) {
> +               if (is_ksoftirqd_context())
> +                       per_cpu(cpu_ksoftirqd_time, cpu) += delta;
> +               else
> +                       per_cpu(cpu_softirq_time, cpu) += delta;
> +       }
>  
>         local_irq_restore(flags);
>  }
> @@ -2025,7 +2031,9 @@ static int irqtime_account_si_update(void)
>         int ret = 0;
>  
>         local_irq_save(flags);
> -       latest_ns = __get_cpu_var(cpu_softirq_time);
> +       latest_ns = __get_cpu_var(cpu_softirq_time) +
> +                       __get_cpu_var(cpu_ksoftirqd_time); 

wouldn't something like:

	latest_ns = this_cpu_read(cpu_softirq_time) +
		this_softirq_task()->se.sum_exec_runtime;

be easier?




* Re: [PATCH 1/6] Free up pf flag PF_KSOFTIRQD
From: Eric Dumazet @ 2010-10-21 14:58 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
	Shaun Ruffell, Yong Zhang

On Thursday 21 October 2010 at 07:36 -0700, Venkatesh Pallipadi wrote:

> Yes. I thought about the static inline part. The reason I did not make
> this static inline was that ksoftirqd is declared static in softirq.c
> and this function gets called from sched.c.
> 

I believe you can remove the 'static' for this kind of thing.
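
Something like this minimal sketch, for instance (untested; exact
header placement is a judgment call):

/* kernel/softirq.c: drop the 'static' from the definition */
DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

/* in a header visible to sched.c: */
DECLARE_PER_CPU(struct task_struct *, ksoftirqd);

static inline bool is_ksoftirqd_context(void)
{
	return current == this_cpu_read(ksoftirqd);
}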

> I did not know that this_cpu_read() existed, though. I guess I should
> look at using it elsewhere in the patchset too.
> 

Sure :)

> Also, part of the overhead you see below is, I think, coming from
> DEBUG_PREEMPT. That would make every smp_processor_id() call more
> expensive. No?
> 

Right, but the point is that the this_cpu_read() version doesn't have
this overhead, even if DEBUG_PREEMPT is on, at least on x86.

BTW, I lied somewhat: because of the way this_cpu_read() is handled,
the following code:

	p == this_cpu_read(ksoftirqd);

generates:

mov     %gs:offset,%rax
cmp	%rdi,%rax

not:

cmp	%gs:offset,%rdi





* Re: [PATCH 1/6] Free up pf flag PF_KSOFTIRQD
From: Christoph Lameter @ 2010-10-21 15:13 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
	Eric Dumazet, Shaun Ruffell, Yong Zhang

On Wed, 20 Oct 2010, Venkatesh Pallipadi wrote:

> diff --git a/kernel/sched.c b/kernel/sched.c
> index abf8440..5d4afe9 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1984,9 +1984,9 @@ void account_system_vtime(struct task_struct *curr)
>  	 * in that case, so as not to confuse scheduler with a special task
>  	 * that do not consume any time, but still wants to run.
>  	 */
> -	if (hardirq_count())
> +	if (in_irq())
>  		per_cpu(cpu_hardirq_time, cpu) += delta;

If cpu is always the current cpu then you can optimize the code:

	this_cpu_add(cpu_hardirq_time, delta)

> -	else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
> +	else if (in_serving_softirq() && !(is_ksoftirqd_context()))
>  		per_cpu(cpu_softirq_time, cpu) += delta;

Same here.



* Re: [PATCH 1/6] Free up pf flag PF_KSOFTIRQD
From: Venkatesh Pallipadi @ 2010-10-21 17:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
	Shaun Ruffell, Yong Zhang

On Thu, Oct 21, 2010 at 7:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday 21 October 2010 at 07:36 -0700, Venkatesh Pallipadi wrote:
>
>> Yes. I thought about the static inline part. The reason I did not make
>> this static inline was that ksoftirqd is declared static in softirq.c
>> and this function gets called from sched.c.
>>
>
> I believe you can remove the 'static' for this kind of thing.
>
>> I did not know that this_cpu_read() existed, though. I guess I should
>> look at using it elsewhere in the patchset too.
>>
>
> Sure :)
>
>> Also, part of the overhead you see below is, I think, coming from
>> DEBUG_PREEMPT. That would make every smp_processor_id() call more
>> expensive. No?
>>
>
> Right, but the point is that the this_cpu_read() version doesn't have
> this overhead, even if DEBUG_PREEMPT is on, at least on x86.
>

Agreed. Will redo this when I refresh the patchset.

Thanks,
Venki

> BTW, I lied somewhat: because of the way this_cpu_read() is handled,
> the following code:
>
>        p == this_cpu_read(ksoftirqd);
>
> generates:
>
> mov     %gs:offset,%rax
> cmp     %rdi,%rax
>
> not:
>
> cmp     %gs:offset,%rdi
>
>
>
>


* Re: [PATCH 1/6] Free up pf flag PF_KSOFTIRQD
From: Venkatesh Pallipadi @ 2010-10-21 17:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
	Eric Dumazet, Shaun Ruffell, Yong Zhang

On Thu, Oct 21, 2010 at 8:13 AM, Christoph Lameter <cl@linux.com> wrote:
> On Wed, 20 Oct 2010, Venkatesh Pallipadi wrote:
>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index abf8440..5d4afe9 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -1984,9 +1984,9 @@ void account_system_vtime(struct task_struct *curr)
>>        * in that case, so as not to confuse scheduler with a special task
>>        * that do not consume any time, but still wants to run.
>>        */
>> -     if (hardirq_count())
>> +     if (in_irq())
>>               per_cpu(cpu_hardirq_time, cpu) += delta;
>
> If cpu is always the current cpu then you can optimize the code:
>
>        this_cpu_add(cpu_hardirq_time, delta)
>
>> -     else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
>> +     else if (in_serving_softirq() && !(is_ksoftirqd_context()))
>>               per_cpu(cpu_softirq_time, cpu) += delta;
>
> Same here.
>

Yes, the this_cpu_* variants were my learning of the day :-). I will
clean up the various per_cpu() and __get_cpu_var() uses when I refresh
this patchset.

Thanks,
Venki


* Re: [PATCH 0/5] Proper kernel irq time reporting -v0
From: Shaun Ruffell @ 2010-10-21 17:25 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
	Eric Dumazet, Yong Zhang

On 10/20/2010 05:48 PM, Venkatesh Pallipadi wrote:
> This is part 2 of
> "Proper kernel irq time accounting -v4"
> http://lkml.indiana.edu/hypermail//linux/kernel/1010.0/01175.html
> 
> and applies on top of those changes.
> 

I applied both sets on top of 2.6.36 and tested on x86 32-bit and 64-bit
machines with some telephony cards that interrupt at a 1000Hz rate. The
time reported in top matched the expected value as measured with the
function graph tracer; i.e., if /sys/kernel/debug/tracing/trace
indicated that the interrupt handler was averaging 13us, then top was
reporting 1.3% CPU time (13us x 1000 interrupts/sec = 13ms/sec, or 1.3%
of one CPU). I changed the smp_affinity and the "hi" time moved around
as I would have expected. I also scheduled work items on all the CPUs
that just disable interrupts and spin for a selectable period.
Scheduling 100ms of non-interruptible delay at 1-second intervals
resulted in 10% time in "hi" as expected.
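
Roughly, the spin work item looked like this (a simplified sketch, not
the exact test code; names here are illustrative):

static void irqsoff_spin(struct work_struct *work)
{
	local_irq_disable();
	mdelay(100);		/* 100ms of non-interruptible delay */
	local_irq_enable();
}
static DECLARE_WORK(irqsoff_work, irqsoff_spin);

/* scheduled once per second on each cpu, e.g.:
 *	schedule_work_on(cpu, &irqsoff_work);
 */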

Hopefully I can check it out on a Powermac G3 B&W next week sometime,
but FWIW:

Tested-by: Shaun Ruffell <sruffell@digium.com>


* Re: [PATCH 5/6] Account ksoftirq time as cpustat softirq
From: Venkatesh Pallipadi @ 2010-10-21 19:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
	Shaun Ruffell, Yong Zhang

On Thu, Oct 21, 2010 at 7:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-10-20 at 15:49 -0700, Venkatesh Pallipadi wrote:
>> @@ -1979,15 +1980,20 @@ void account_system_vtime(struct task_struct *curr)
>>         delta = now - per_cpu(irq_start_time, cpu);
>>         per_cpu(irq_start_time, cpu) = now;
>>         /*
>> -        * We do not account for softirq time from ksoftirqd here.
>>          * We want to continue accounting softirq time to ksoftirqd thread
>>          * in that case, so as not to confuse scheduler with a special task
>>          * that do not consume any time, but still wants to run.
>> +        * So, softirq time in ksoftirqd is accounted separately and used
>> +        * for softirq time reporting in /proc/stat.
>>          */
>> -       if (in_irq())
>> +       if (in_irq()) {
>>                 per_cpu(cpu_hardirq_time, cpu) += delta;
>> -       else if (in_serving_softirq() && !(is_ksoftirqd_context()))
>> -               per_cpu(cpu_softirq_time, cpu) += delta;
>> +       } else if (in_serving_softirq()) {
>> +               if (is_ksoftirqd_context())
>> +                       per_cpu(cpu_ksoftirqd_time, cpu) += delta;
>> +               else
>> +                       per_cpu(cpu_softirq_time, cpu) += delta;
>> +       }
>>
>>         local_irq_restore(flags);
>>  }
>> @@ -2025,7 +2031,9 @@ static int irqtime_account_si_update(void)
>>         int ret = 0;
>>
>>         local_irq_save(flags);
>> -       latest_ns = __get_cpu_var(cpu_softirq_time);
>> +       latest_ns = __get_cpu_var(cpu_softirq_time) +
>> +                       __get_cpu_var(cpu_ksoftirqd_time);
>
> wouldn't something like:
>
>        latest_ns = this_cpu_read(cpu_softirq_time) +
>                this_softirq_task()->se.sum_exec_runtime;
>
> be easier?
>

Yes. As we are using this ksoftirqd time in only one place, using
this_softirq_task()->se.sum_exec_runtime with a comment on why we are
doing that would be simpler. Will do.

Thanks,
Venki


* Re: [PATCH 4/6] Export ns irqtimes from IRQ_TIME_ACCOUNTING through /proc/stat
From: Venkatesh Pallipadi @ 2010-10-21 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
	Shaun Ruffell, Yong Zhang

On Thu, Oct 21, 2010 at 7:44 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-10-20 at 15:49 -0700, Venkatesh Pallipadi wrote:
>
>> +static int irqtime_account_hi_update(void)
>> +{
>> +     struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> +     unsigned long flags;
>> +     u64 latest_ns;
>> +     int ret = 0;
>> +
>> +     local_irq_save(flags);
>> +     latest_ns = __get_cpu_var(cpu_hardirq_time);
>
> I guess this_cpu_read() would again be an improvement; same for the SI
> version.
>

Yes.

>> +     if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
>> +             ret = 1;
>> +     local_irq_restore(flags);
>> +     return ret;
>> +}
>
>> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> +/*
>> + * Account a tick to a process and cpustat
>> + * @p: the process that the cpu time gets accounted to
>> + * @user_tick: is the tick from userspace
>> + * @rq: the pointer to rq
>> + *
>> + * Tick demultiplexing follows the order
>> + * - pending hardirq update
>> + * - user_time
>> + * - pending softirq update
>> + * - idle_time
>> + * - system time
>> + *   - check for guest_time
>> + *   - else account as system_time
>> + *
>> + * Check for hardirq is done both for system and user time as there is
>> + * no timer going off while we are on hardirq and hence we may never get an
>> + * oppurtunity to update it solely in system time.
>
> My mailer suggests you spell that as: opportunity :-)

Ah, I don't have a spell checker in my editor :-). Will change.

>> + * p->stime and friends are only updated on system time and not on irq
>> + * softirq as those do not count in task exec_runtime any more.
>> + */
>> +static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>> +                                             struct rq *rq)
>> +{
>> +     cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
>> +     cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
>> +     struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> +
>> +     if (irqtime_account_hi_update()) {
>> +             cpustat->irq = cputime64_add(cpustat->irq, tmp);
>> +     } else if (user_tick) {
>> +             account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
>> +     } else if (irqtime_account_si_update()) {
>> +             cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
>> +     } else if (p == rq->idle) {
>> +             account_idle_time(cputime_one_jiffy);
>> +     } else if (p->flags & PF_VCPU) { /* System time or guest time */
>> +             account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
>> +     } else {
>> +             __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
>> +                                     &cpustat->system);
>> +     }
>> +}
>
> I'd do:
>
>  - hardirq
>  - softirq
>  - user
>  - system
>     - guest
>     - really system
>  - idle
>
> Since otherwise tiny slices of softirq would need to wait for a system
> tick to happen before you fold them.
>
> Also, it is possible that in a single tick multiple counters overflow
> the jiffy boundary, so something like:
>
>  if (irqtime_account_hi_update())
>        cpustat->irq = ...
>
>  if (irqtime_account_si_update())
>        cpustat->softirq = ...
>
>  if (user_tick) {
>  } else if (...) {
>
>  } else ...
>
> would seem like the better approach.
>

I am not sure about checking for both si and hi. That would result in
double accounting a tick and have some side-effects.
Regarding moving si above user: yes, that seems good.
Moving idle after system may not make much of a difference, as there is
no special way to check for system time other than !idle.

>>  /*
>>   * Account for involuntary wait time.
>>   * @steal: the cpu time spent in involuntary wait
>> @@ -3594,6 +3685,11 @@ void account_process_tick(struct task_struct *p, int user_tick)
>>       cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
>>       struct rq *rq = this_rq();
>>
>> +     if (sched_clock_irqtime) {
>> +             irqtime_account_process_tick(p, user_tick, rq);
>> +             return;
>> +     }
>> +
>>       if (user_tick)
>>               account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
>>       else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
>
> mark_tsc_unstable() can disable sched_clock_irqtime at any time; the
> accounting won't go funny due to that, right?
>

That should be OK. We would just fall back to low-res tick-based accounting.

Thanks,
Venki


* Re: [PATCH 4/6] Export ns irqtimes from IRQ_TIME_ACCOUNTING through /proc/stat
From: Peter Zijlstra @ 2010-10-22 12:23 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
	Shaun Ruffell, Yong Zhang

On Thu, 2010-10-21 at 12:25 -0700, Venkatesh Pallipadi wrote:
> > I'd do:
> >
> >  - hardirq
> >  - softirq
> >  - user
> >  - system
> >     - guest
> >     - really system
> >  - idle
> >
> > Since otherwise tiny slices of softirq would need to wait for a system
> > tick to happen before you fold them.
> >
> > Also, it is possible that in a single tick multiple counters overflow
> > the jiffy boundary, so something like:
> >
> >  if (irqtime_account_hi_update())
> >        cpustat->irq = ...
> >
> >  if (irqtime_account_si_update())
> >        cpustat->softirq = ...
> >
> >  if (user_tick) {
> >  } else if (...) {
> >
> >  } else ...
> >
> > would seem like the better approach.
> >
> 
> I am not sure about checking for both si and hi. That would result in
> double accounting a tick and have some side-effects.

Depends on how you look at it, I guess; for this to occur, a previous
tick would have to go unreported. E.g. consider the case where during
two consecutive ticks the time is split 50% sirq and 50% hirq.

Then, after the first tick, nothing will have progressed because both
counters are at 50% of a tick; after the second tick, both will have
reached a full jiffy's worth of time and need to roll over.

In total two ticks happened and two ticks get accounted: {0,2}. Your
approach would make it look like {0,1,1}: two ticks' worth of work
happened and two ticks got accounted, but it takes 3 ticks for that to
happen.

> Regarding moving si above user: yes, that seems good.
> Moving idle after system may not make much of a difference, as there is
> no special way to check for system time other than !idle.

Right, so about user and system... we have a bit of a problem there.
There is overlap between si/hi and system. ksoftirqd time would be
accounted as system and si.

Then there is the whole issue of per-task accounting not actually using
the system/user ticks. They use the ticks as a ratio for
se.sum_exec_runtime.


* Re: [PATCH 4/6] Export ns irqtimes from IRQ_TIME_ACCOUNTING through /proc/stat
From: Venkatesh Pallipadi @ 2010-10-22 23:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
	Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
	Shaun Ruffell, Yong Zhang

On Fri, Oct 22, 2010 at 5:23 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-10-21 at 12:25 -0700, Venkatesh Pallipadi wrote:
>> > I'd do:
>> >
>> >  - hardirq
>> >  - softirq
>> >  - user
>> >  - system
>> >     - guest
>> >     - really system
>> >  - idle
>> >
>> > Since otherwise tiny slices of softirq would need to wait for a system
>> > tick to happen before you fold them.
>> >
>> > Also, it is possible that in a single tick multiple counters overflow
>> > the jiffy boundary, so something like:
>> >
>> >  if (irqtime_account_hi_update())
>> >        cpustat->irq = ...
>> >
>> >  if (irqtime_account_si_update())
>> >        cpustat->softirq = ...
>> >
>> >  if (user_tick) {
>> >  } else if (...) {
>> >
>> >  } else ...
>> >
>> > would seem like the better approach.
>> >
>>
>> I am not sure about checking for both si and hi. That would result in
>> double accounting a tick and have some side-effects.
>
> Depends on how you look at it, I guess; for this to occur, a previous
> tick would have to go unreported. E.g. consider the case where during
> two consecutive ticks the time is split 50% sirq and 50% hirq.
>
> Then, after the first tick, nothing will have progressed because both
> counters are at 50% of a tick; after the second tick, both will have
> reached a full jiffy's worth of time and need to roll over.
>
> In total two ticks happened and two ticks get accounted: {0,2}. Your
> approach would make it look like {0,1,1}: two ticks' worth of work
> happened and two ticks got accounted, but it takes 3 ticks for that to
> happen.

Yes. But the initial 50-50 tick would not be accounted to hirq or sirq,
but to system/idle/user depending on other conditions. So I think it is
better to keep accounting a tick to one bucket every time we are called.

>> Regarding moving si above user: yes, that seems good.
>> Moving idle after system may not make much of a difference, as there is
>> no special way to check for system time other than !idle.
>
> Right, so about user and system... we have a bit of a problem there.
> There is overlap between si/hi and system. ksoftirqd time would be
> accounted as system and si.

Yes. In general this cannot be accurate unless we have fine-granularity
user/system accounting. But that involves the system call path, where
any overhead is undesirable. If only we had hardware counters for
user/system on x86.

The overlap between ksoftirqd and si again depends on the sampling
freq. We can probably change the logic to not include ksoftirqd time in
the softirq folding. Instead, we can fall back to the sampling model:
if ksoftirqd is running when we get this tick, we account it as
softirq. This still has a problem in that, when hirq/sirq folding
happens on the same tick, ksoftirqd may not get its chance. Anyway,
this whole stat is sampling based; I think we have some flexibility to
hand-wave stuff :-).

> Then there is the whole issue of per-task accounting not actually using
> the system/user ticks. They use the ticks as a ratio for
> se.sum_exec_runtime.
>

Yes. That was the reason why I removed hirq/sirq changing p->stime, so
we have a clearer user/system split. But again, it's not accurate.

Thanks,
Venki

