* [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
@ 2010-07-19 23:57 Venkatesh Pallipadi
  2010-07-19 23:57 ` [PATCH 1/4] sched: Track and export per task [hard|soft]irq time Venkatesh Pallipadi
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-07-19 23:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh
  Cc: Venkatesh Pallipadi, Paul Menage, linux-kernel, Paul Turner,
	Martin Schwidefsky, Heiko Carstens, Paul Mackerras, Tony Luck


Earlier version of this patchset here -
lkml subject:
"[RFC PATCH 0/4] Finer granularity and task/cgroup irq time accounting"
http://marc.info/?l=linux-kernel&m=127474630527689&w=2

Currently, the softirq and hardirq time reporting is only done at the
CPU level. There are use cases where reporting this time against tasks,
task groups, or cgroups will be useful to the user/administrator
in terms of resource planning and utilization charging. Also, as the
accounting is already done at the CPU level, reporting the same at
the task level does not add any significant computational overhead
other than task level storage (patch 1).

The softirq/hardirq statistics are commonly done based on tick sampling,
though some archs have CONFIG_VIRT_CPU_ACCOUNTING based fine granularity
accounting. Having a similar mechanism to get fine granularity accounting
on x86 would be a major challenge, given the state of TSC reliability
on various platforms and also the overhead it may add in common paths
like syscall entry/exit.

An alternative is to have a generic (sched_clock based) and configurable
fine-granularity accounting of si and hi time which can be reported
over the /proc/<pid>/stat API (patch 2).

Patch 3 and 4 are exporting this info at the cgroup level.

Changes since the original RFC -
* General code cleanup and documentation for new APIs added.
* Handle notsc option by having a runtime flag sched_clock_irqtime, along
  with the original CONFIG_IRQ_TIME_ACCOUNTING option.
  Peter Zijlstra suggested the use of alternate instruction kind of mechanism
  here. But, that is mostly x86 specific and not generic. The irq time
  accounting code is mostly generic.
* Did performance runs with various systems with tsc based sched_clock -
  both with and without sched_clock_stable - running tbench, dbench, SPECjbb
  and did not notice any measurable slowness when this option is enabled.
Todo -
* Peter Zijlstra suggested modifying scale_rt_power to account for
  irq time. I have a patch for that and have been testing that right now.
  But, that change is not very pretty as yet and also will need some more
  testing. Feels better to make that a separate change. Will follow up
  on that soon.

Thanks,
Venki


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/4] sched: Track and export per task [hard|soft]irq time
  2010-07-19 23:57 [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Venkatesh Pallipadi
@ 2010-07-19 23:57 ` Venkatesh Pallipadi
  2010-07-19 23:57 ` [PATCH 2/4] x86: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time to task Venkatesh Pallipadi
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-07-19 23:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh
  Cc: Venkatesh Pallipadi, Paul Menage, linux-kernel, Paul Turner,
	Martin Schwidefsky, Heiko Carstens, Paul Mackerras, Tony Luck

Currently, the kernel does not account softirq and hardirq times at
the task level. There is irq time info in kstat_cpu, which is
accumulated at the cpu level.

Without the task level information, the non-irq run time of task(s) would
have to be guessed based on their exec time and the CPU on which they were
running recently, assuming that the CPU irq time reported is spread
across all the tasks running there. And this guess can be wildly off the mark.

Sample case, considering just the softirq:

If there are varied workloads running on a CPU, say a CPU bound task (loop)
and a network IO bound task (nc) along with the network softirq load,
there is no way for the administrator/user to know the non-irq runtime of
each of these tasks. The only information available is the total runtime of
each task and the kstat_cpu softirq time for the CPU.

In this example, considering a 10 second sample, both loop and nc would have
a total run time of ~5s each, and the kstat_cpu softirq count on this CPU
increased by 355 (~3.5s).

So, all the information the user gets is that both tasks are running for
roughly the same amount of time and that softirq load is around 35%. As a
result, the user may conclude that the irq overhead for both tasks is equal
(1.75s) and that the non-irq runtime of both tasks is around ~3.25s. (There
is also the system and user time reported for these tasks, which I am
ignoring here, as it is tough to correlate with irq time in cases where the
tasks have significant non-irq system time.)

This change adds tracking of softirq and hardirq time for each task and
thread group. This information is exported in /proc/<pid>/stat.

So, the user can get info like the below by looking at exec_time and si_time
in the appropriate /proc/<pid>/stat files.
(Taken over a 10s interval; exec_time and si_time in USER_HZ)
(loop)  (nc)
505 0   500 359
502 1   501 363
503 0   502 354
504 0   499 359
503 3   500 360

With this, the user can derive the non-irq run time as ~5s for loop and
~1.45s for nc.
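
As an illustration, here is a minimal userspace sketch that automates that
derivation. The field positions (45-47, appended after cgtime) and the
parsing are assumptions based on this patch, not something verified against
a running kernel:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], buf[2048];
	unsigned long long exec_time, hi_time, si_time;
	char *p;
	int i;
	FILE *f;

	if (argc < 2)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;

	/* comm may contain spaces; skip past its closing paren */
	p = strrchr(buf, ')') + 2;

	/* p points at field 3 (state); skip fields 3..44 */
	for (i = 0; i < 42; i++)
		p = strchr(p, ' ') + 1;

	/* assumed order: exec_time hi_time si_time, all in USER_HZ */
	if (sscanf(p, "%llu %llu %llu", &exec_time, &hi_time, &si_time) == 3)
		printf("non-irq runtime (USER_HZ): %llu\n",
		       exec_time - hi_time - si_time);
	fclose(f);
	return 0;
}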

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 Documentation/filesystems/proc.txt |    5 ++++-
 fs/proc/array.c                    |   17 +++++++++++++++--
 include/linux/sched.h              |    4 ++++
 kernel/exit.c                      |    2 ++
 kernel/sched.c                     |    9 ++++++---
 5 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 9fb6cbe..9e03468 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -255,7 +255,7 @@ Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
 ..............................................................................
 
 
-Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
+Table 1-4: Contents of the stat files
 ..............................................................................
  Field          Content
   pid           process id
@@ -303,6 +303,9 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
   blkio_ticks   time spent waiting for block IO
   gtime         guest time of the task in jiffies
   cgtime        guest time of the task children in jiffies
+  exec_time     execution time as accounted by scheduler
+  hi_time       hardirq time accounted to this process
+  si_time       softirq time accounted to this process
 ..............................................................................
 
 The /proc/PID/maps file containing the currently mapped memory regions and
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9b58d38..4555cfb 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -380,6 +380,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
+	cputime64_t exec_time = 0, si_time = 0, hi_time = 0;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -427,6 +428,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
 				gtime = cputime_add(gtime, t->gtime);
+				si_time = cputime64_add(si_time, t->si_time);
+				hi_time = cputime64_add(hi_time, t->hi_time);
+				exec_time += t->se.sum_exec_runtime;
 				t = next_thread(t);
 			} while (t != task);
 
@@ -434,6 +438,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			maj_flt += sig->maj_flt;
 			thread_group_times(task, &utime, &stime);
 			gtime = cputime_add(gtime, sig->gtime);
+			si_time = cputime64_add(si_time, sig->si_time);
+			hi_time = cputime64_add(hi_time, sig->hi_time);
+			exec_time += sig->sum_sched_runtime;
 		}
 
 		sid = task_session_nr_ns(task, ns);
@@ -448,6 +455,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	if (!whole) {
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
+		si_time = task->si_time;
+		hi_time = task->hi_time;
+		exec_time = task->se.sum_exec_runtime;
 		task_times(task, &utime, &stime);
 		gtime = task->gtime;
 	}
@@ -467,7 +477,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 
 	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %llu %llu %llu\n",
 		pid_nr_ns(pid, ns),
 		tcomm,
 		state,
@@ -514,7 +524,10 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		task->policy,
 		(unsigned long long)delayacct_blkio_ticks(task),
 		cputime_to_clock_t(gtime),
-		cputime_to_clock_t(cgtime));
+		cputime_to_clock_t(cgtime),
+		(unsigned long long)nsec_to_clock_t(exec_time),
+		(unsigned long long)cputime64_to_clock_t(hi_time),
+		(unsigned long long)cputime64_to_clock_t(si_time));
 	if (mm)
 		mmput(mm);
 	return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f118809..3ba8cb9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -593,6 +593,8 @@ struct signal_struct {
 	 */
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
+	cputime64_t si_time;
+	cputime64_t hi_time;
 	cputime_t cgtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	cputime_t prev_utime, prev_stime;
@@ -1284,6 +1286,8 @@ struct task_struct {
 
 	cputime_t utime, stime, utimescaled, stimescaled;
 	cputime_t gtime;
+	cputime64_t si_time;
+	cputime64_t hi_time;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	cputime_t prev_utime, prev_stime;
 #endif
diff --git a/kernel/exit.c b/kernel/exit.c
index ceffc67..f87f44c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -116,6 +116,8 @@ static void __exit_signal(struct task_struct *tsk)
 		sig->utime = cputime_add(sig->utime, tsk->utime);
 		sig->stime = cputime_add(sig->stime, tsk->stime);
 		sig->gtime = cputime_add(sig->gtime, tsk->gtime);
+		sig->si_time = cputime64_add(sig->si_time, tsk->si_time);
+		sig->hi_time = cputime64_add(sig->hi_time, tsk->hi_time);
 		sig->min_flt += tsk->min_flt;
 		sig->maj_flt += tsk->maj_flt;
 		sig->nvcsw += tsk->nvcsw;
diff --git a/kernel/sched.c b/kernel/sched.c
index f8b8996..3c246cb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3232,12 +3232,15 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 
 	/* Add system time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
-	if (hardirq_count() - hardirq_offset)
+	if (hardirq_count() - hardirq_offset) {
 		cpustat->irq = cputime64_add(cpustat->irq, tmp);
-	else if (softirq_count())
+		p->hi_time = cputime64_add(p->hi_time, tmp);
+	} else if (softirq_count()) {
 		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
-	else
+		p->si_time = cputime64_add(p->si_time, tmp);
+	} else {
 		cpustat->system = cputime64_add(cpustat->system, tmp);
+	}
 
 	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/4] x86: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time to task
  2010-07-19 23:57 [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Venkatesh Pallipadi
  2010-07-19 23:57 ` [PATCH 1/4] sched: Track and export per task [hard|soft]irq time Venkatesh Pallipadi
@ 2010-07-19 23:57 ` Venkatesh Pallipadi
  2010-07-19 23:57 ` [PATCH 3/4] sched: Generalize cpuacct usage tracking making it simpler to add new stats Venkatesh Pallipadi
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-07-19 23:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh
  Cc: Venkatesh Pallipadi, Paul Menage, linux-kernel, Paul Turner,
	Martin Schwidefsky, Heiko Carstens, Paul Mackerras, Tony Luck

s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does
fine granularity accounting of user, system, hardirq and softirq times.
Adding that option on archs like x86 may be challenging however, given the
state of TSC reliability on various platforms and also the overhead it may
add on syscall entry/exit.

Instead, add an option that only does finer accounting of hardirq/softirq
time, providing precise irq times (instead of timer tick based samples). This
accounting is added with a new config option, CONFIG_IRQ_TIME_ACCOUNTING,
so that there won't be any overhead for users not interested in paying the
perf penalty. And this accounting is based on sched_clock, so other archs
may find it useful as well.
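
To illustrate the idea, here is a simplified sketch of the mechanism (the
real code is account_system_vtime() in the diff below; the helper name
irqtime_mark is illustrative): a per-cpu timestamp is refreshed at every
transition into or out of hardirq/softirq context, and the elapsed delta is
charged according to the context the cpu was in since the last mark:

static DEFINE_PER_CPU(u64, irq_start_time);

/* illustrative helper; the patch does this in account_system_vtime() */
static void irqtime_mark(struct task_struct *tsk)
{
	int cpu = task_cpu(tsk);
	u64 now = sched_clock_cpu(cpu);
	u64 delta = now - per_cpu(irq_start_time, cpu);

	per_cpu(irq_start_time, cpu) = now;
	if (hardirq_count())
		tsk->hi_time += delta;	/* delta elapsed in hardirq context */
	else if (softirq_count())
		tsk->si_time += delta;	/* delta elapsed in softirq context */
	/* else: entering irq context; only the timestamp is refreshed */
}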

Note that the kstat_cpu irq times are still based on tick samples,
and only the task irq times report this new finer granularity irq time.
The reason is that kstat also includes system time, and changing only the
irq time to finer granularity can result in inconsistencies such as the
summed kstat times adding up to more than 100%.

Continuing with the example from the previous patch: without finer
granularity accounting, exec_time and si_time over 10s intervals were
(appropriate fields of /proc/<pid>/stat)
(loop)  (nc)
505 0   500 359
502 1   501 363
503 0   502 354
504 0   499 359
503 3   500 360

And with finer granularity accounting they were
(loop)  (nc)
503 9   502 301
502 8   502 303
502 9   501 302
502 8   502 302
503 9   501 302

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/ia64/include/asm/system.h    |    4 --
 arch/powerpc/include/asm/system.h |    4 --
 arch/s390/include/asm/system.h    |    1 -
 arch/x86/Kconfig                  |   11 +++++++
 arch/x86/kernel/tsc.c             |    2 +
 fs/proc/array.c                   |    4 +-
 include/linux/hardirq.h           |   15 +++++++++-
 include/linux/sched.h             |   13 ++++++++
 kernel/sched.c                    |   59 +++++++++++++++++++++++++++++++++++-
 9 files changed, 99 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/include/asm/system.h b/arch/ia64/include/asm/system.h
index 9f342a5..dd028f2 100644
--- a/arch/ia64/include/asm/system.h
+++ b/arch/ia64/include/asm/system.h
@@ -272,10 +272,6 @@ void cpu_idle_wait(void);
 
 void default_idle(void);
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/system.h b/arch/powerpc/include/asm/system.h
index a6297c6..880fb57 100644
--- a/arch/powerpc/include/asm/system.h
+++ b/arch/powerpc/include/asm/system.h
@@ -545,10 +545,6 @@ extern void reloc_got2(unsigned long);
 
 #define PTRRELOC(x)	((typeof(x)) add_reloc_offset((unsigned long)(x)))
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
 extern struct dentry *powerpc_debugfs_root;
 
 #endif /* __KERNEL__ */
diff --git a/arch/s390/include/asm/system.h b/arch/s390/include/asm/system.h
index cef6621..38ddd8a 100644
--- a/arch/s390/include/asm/system.h
+++ b/arch/s390/include/asm/system.h
@@ -97,7 +97,6 @@ static inline void restore_access_regs(unsigned int *acrs)
 
 extern void account_vtime(struct task_struct *, struct task_struct *);
 extern void account_tick_vtime(struct task_struct *);
-extern void account_system_vtime(struct task_struct *);
 
 #ifdef CONFIG_PFAULT
 extern void pfault_irq_init(void);
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dcb0593..ae6705d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -367,6 +367,17 @@ endif
 # This is an alphabetically sorted list of 64 bit extended platforms
 # Please maintain the alphabetic order if and when there are additions
 
+config IRQ_TIME_ACCOUNTING
+	bool "Fine granularity task level IRQ time accounting"
+	default n
+	help
+	  Select this option to enable fine granularity task irq time
+	  accounting. This is done by reading a timestamp on each
+	  transition between softirq and hardirq state, so there can be a
+	  small performance impact.
+
+	  If in doubt, say N here.
+
 config X86_VSMP
 	bool "ScaleMP vSMP"
 	select PARAVIRT
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9faf91a..5b5a213 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -952,6 +952,8 @@ void __init tsc_init(void)
 	/* now allow native_sched_clock() to use rdtsc */
 	tsc_disabled = 0;
 
+	enable_sched_clock_irqtime();
+
 	lpj = ((u64)tsc_khz * 1000);
 	do_div(lpj, HZ);
 	lpj_fine = lpj;
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 4555cfb..8316c96 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -526,8 +526,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		cputime_to_clock_t(gtime),
 		cputime_to_clock_t(cgtime),
 		(unsigned long long)nsec_to_clock_t(exec_time),
-		(unsigned long long)cputime64_to_clock_t(hi_time),
-		(unsigned long long)cputime64_to_clock_t(si_time));
+		(unsigned long long)irqtime_to_clock_t(hi_time),
+		(unsigned long long)irqtime_to_clock_t(si_time));
 	if (mm)
 		mmput(mm);
 	return 0;
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d5b3876..a342de5 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -132,10 +132,23 @@ extern void synchronize_irq(unsigned int irq);
 
 struct task_struct;
 
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
 static inline void account_system_vtime(struct task_struct *tsk)
 {
 }
+#else
+extern void account_system_vtime(struct task_struct *tsk);
+#endif
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+/*
+ * Task irqtime is tracked in cputime64_t units when VIRT_CPU_ACCOUNTING
+ * is enabled and in ns units when it is disabled, since jiffies is not
+ * fine-grained enough to track irqtime with IRQ_TIME_ACCOUNTING.
+ */
+#define irqtime_to_clock_t(irqtime)	cputime64_to_clock_t(irqtime)
+#else
+#define irqtime_to_clock_t(irqtime)	nsec_to_clock_t(irqtime)
 #endif
 
 #if defined(CONFIG_NO_HZ)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3ba8cb9..37798c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1794,6 +1794,19 @@ static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 #endif
 
 /*
+ * With CONFIG_IRQ_TIME_ACCOUNTING, archs can call this if they have a
+ * 'fast' sched_clock() and they want to account irqtime based off of
+ * sched_clock()
+ */
+#ifndef CONFIG_IRQ_TIME_ACCOUNTING
+static inline void enable_sched_clock_irqtime(void)
+{
+}
+#else
+extern void enable_sched_clock_irqtime(void);
+#endif
+
+/*
  * Architectures can set this to 1 if they have specified
  * CONFIG_HAVE_UNSTABLE_SCHED_CLOCK in their arch Kconfig,
  * but then during bootup it turns out that sched_clock()
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c246cb..f167fbb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3207,6 +3207,34 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 	}
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int sched_clock_irqtime;
+void enable_sched_clock_irqtime(void)
+{
+	sched_clock_irqtime = 1;
+}
+#else
+#define sched_clock_irqtime 0
+#endif
+
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING)
+static void account_task_irqtime(cputime64_t *task_irqtime, cputime64_t irqtime)
+{
+	*task_irqtime = cputime64_add(*task_irqtime, irqtime);
+}
+#else
+/*
+ * Called at tick when we are in si/hi.
+ * We handle !sched_clock_irqtime case here as when sched_clock_irqtime is set,
+ * this accounting is done in account_system_vtime() below.
+ */
+static void account_task_irqtime(cputime64_t *task_irqtime, cputime64_t irqtime)
+{
+	if (!sched_clock_irqtime)
+		*task_irqtime = cputime64_add(*task_irqtime, TICK_NSEC);
+}
+#endif
+
 /*
  * Account system cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -3234,10 +3262,10 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	tmp = cputime_to_cputime64(cputime);
 	if (hardirq_count() - hardirq_offset) {
 		cpustat->irq = cputime64_add(cpustat->irq, tmp);
-		p->hi_time = cputime64_add(p->hi_time, tmp);
+		account_task_irqtime(&p->hi_time, tmp);
 	} else if (softirq_count()) {
 		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
-		p->si_time = cputime64_add(p->si_time, tmp);
+		account_task_irqtime(&p->si_time, tmp);
 	} else {
 		cpustat->system = cputime64_add(cpustat->system, tmp);
 	}
@@ -8967,3 +8995,30 @@ void synchronize_sched_expedited(void)
 EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
 
 #endif /* #else #ifndef CONFIG_SMP */
+
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+
+static DEFINE_PER_CPU(u64, irq_start_time);
+
+void account_system_vtime(struct task_struct *tsk)
+{
+	unsigned long flags;
+	int cpu;
+	u64 now;
+
+	if (!sched_clock_irqtime)
+		return;
+
+	local_irq_save(flags);
+	cpu = task_cpu(tsk);
+	now = sched_clock_cpu(cpu);
+	if (hardirq_count())
+		tsk->hi_time += now - per_cpu(irq_start_time, cpu);
+	else if (softirq_count())
+		tsk->si_time += now - per_cpu(irq_start_time, cpu);
+
+	per_cpu(irq_start_time, cpu) = now;
+	local_irq_restore(flags);
+}
+
+#endif
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/4] sched: Generalize cpuacct usage tracking making it simpler to add new stats
  2010-07-19 23:57 [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Venkatesh Pallipadi
  2010-07-19 23:57 ` [PATCH 1/4] sched: Track and export per task [hard|soft]irq time Venkatesh Pallipadi
  2010-07-19 23:57 ` [PATCH 2/4] x86: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time to task Venkatesh Pallipadi
@ 2010-07-19 23:57 ` Venkatesh Pallipadi
  2010-07-19 23:57 ` [PATCH 4/4] sched: Export irq times through cpuacct cgroup Venkatesh Pallipadi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-07-19 23:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh
  Cc: Venkatesh Pallipadi, Paul Menage, linux-kernel, Paul Turner,
	Martin Schwidefsky, Heiko Carstens, Paul Mackerras, Tony Luck

Generalize cpuacct usage tracking, making it easier to add new stats in the
following patch.

Also add an alloc_percpu_array() interface in percpu.h.
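
As an illustration of the new interface (a hypothetical sketch; patch 4
uses this to keep one counter per charge type on every cpu):

/* allocate a per-cpu array of N u64 counters */
u64 __percpu *cpuusage = alloc_percpu_array(u64, CPUACCT_CHARGE_NCHARGES);

/* element idx of the array on a given cpu is then reached as */
u64 *p = per_cpu_ptr(cpuusage, cpu) + idx;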

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 include/linux/percpu.h |    4 ++++
 kernel/sched.c         |   39 ++++++++++++++++++++++++++-------------
 kernel/sched_fair.c    |    2 +-
 kernel/sched_rt.c      |    2 +-
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index d3a38d6..216f96a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -167,6 +167,10 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 #define alloc_percpu(type)	\
 	(typeof(type) __percpu *)__alloc_percpu(sizeof(type), __alignof__(type))
 
+#define alloc_percpu_array(type, size)	\
+	(typeof(type) __percpu *)__alloc_percpu(sizeof(type) * size, \
+						__alignof__(type))
+
 /*
  * Optional methods for optimized non-lvalue per-cpu variable access.
  *
diff --git a/kernel/sched.c b/kernel/sched.c
index f167fbb..c12c8ea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1396,12 +1396,20 @@ enum cpuacct_stat_index {
 	CPUACCT_STAT_NSTATS,
 };
 
+enum cpuacct_charge_index {
+	CPUACCT_CHARGE_USAGE,	/* ... execution time */
+
+	CPUACCT_CHARGE_NCHARGES,
+};
+
 #ifdef CONFIG_CGROUP_CPUACCT
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+static void cpuacct_charge(struct task_struct *tsk,
+		enum cpuacct_charge_index idx, u64 cputime);
 static void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val);
 #else
-static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+static inline void cpuacct_charge(struct task_struct *tsk,
+		enum cpuacct_charge_index idx, u64 cputime) {}
 static inline void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val) {}
 #endif
@@ -8661,7 +8669,7 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 /* track cpu usage of a group of tasks and its child groups */
 struct cpuacct {
 	struct cgroup_subsys_state css;
-	/* cpuusage holds pointer to a u64-type object on every cpu */
+	/* cpuusage holds pointer to a u64-type array object on every cpu */
 	u64 __percpu *cpuusage;
 	struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
 	struct cpuacct *parent;
@@ -8693,7 +8701,7 @@ static struct cgroup_subsys_state *cpuacct_create(
 	if (!ca)
 		goto out;
 
-	ca->cpuusage = alloc_percpu(u64);
+	ca->cpuusage = alloc_percpu_array(u64, CPUACCT_CHARGE_NCHARGES);
 	if (!ca->cpuusage)
 		goto out_free_ca;
 
@@ -8729,9 +8737,10 @@ cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca,
+		enum cpuacct_charge_index idx, int cpu)
 {
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu) + idx;
 	u64 data;
 
 #ifndef CONFIG_64BIT
@@ -8748,9 +8757,10 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
 	return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cpuacct *ca,
+		enum cpuacct_charge_index idx, int cpu, u64 val)
 {
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu) + idx;
 
 #ifndef CONFIG_64BIT
 	/*
@@ -8772,7 +8782,7 @@ static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
 	int i;
 
 	for_each_present_cpu(i)
-		totalcpuusage += cpuacct_cpuusage_read(ca, i);
+		totalcpuusage += cpuacct_cpuusage_read(ca, cft->private, i);
 
 	return totalcpuusage;
 }
@@ -8790,7 +8800,7 @@ static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
 	}
 
 	for_each_present_cpu(i)
-		cpuacct_cpuusage_write(ca, i, 0);
+		cpuacct_cpuusage_write(ca, cftype->private, i, 0);
 
 out:
 	return err;
@@ -8804,7 +8814,7 @@ static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
 	int i;
 
 	for_each_present_cpu(i) {
-		percpu = cpuacct_cpuusage_read(ca, i);
+		percpu = cpuacct_cpuusage_read(ca, cft->private, i);
 		seq_printf(m, "%llu ", (unsigned long long) percpu);
 	}
 	seq_printf(m, "\n");
@@ -8835,10 +8845,12 @@ static struct cftype files[] = {
 		.name = "usage",
 		.read_u64 = cpuusage_read,
 		.write_u64 = cpuusage_write,
+		.private = CPUACCT_CHARGE_USAGE,
 	},
 	{
 		.name = "usage_percpu",
 		.read_seq_string = cpuacct_percpu_seq_read,
+		.private = CPUACCT_CHARGE_USAGE,
 	},
 	{
 		.name = "stat",
@@ -8856,7 +8868,8 @@ static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
  *
  * called with rq->lock held.
  */
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
+static void cpuacct_charge(struct task_struct *tsk,
+		enum cpuacct_charge_index idx, u64 cputime)
 {
 	struct cpuacct *ca;
 	int cpu;
@@ -8871,7 +8884,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 	ca = task_ca(tsk);
 
 	for (; ca; ca = ca->parent) {
-		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu) + idx;
 		*cpuusage += cputime;
 	}
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index eed35ed..6177253 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -541,7 +541,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
-		cpuacct_charge(curtask, delta_exec);
+		cpuacct_charge(curtask, CPUACCT_CHARGE_USAGE, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
 }
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 8afb953..12adcfe 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -619,7 +619,7 @@ static void update_curr_rt(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock;
-	cpuacct_charge(curr, delta_exec);
+	cpuacct_charge(curr, CPUACCT_CHARGE_USAGE, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/4] sched: Export irq times through cpuacct cgroup
  2010-07-19 23:57 [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Venkatesh Pallipadi
                   ` (2 preceding siblings ...)
  2010-07-19 23:57 ` [PATCH 3/4] sched: Generalize cpuacct usage tracking making it simpler to add new stats Venkatesh Pallipadi
@ 2010-07-19 23:57 ` Venkatesh Pallipadi
  2010-07-20  7:55 ` [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Martin Schwidefsky
  2010-08-24  0:56 ` Venkatesh Pallipadi
  5 siblings, 0 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-07-19 23:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh
  Cc: Venkatesh Pallipadi, Paul Menage, linux-kernel, Paul Turner,
	Martin Schwidefsky, Heiko Carstens, Paul Mackerras, Tony Luck

Add hi_time, si_time, hi_time_percpu and si_time_percpu info to the
cpuacct cgroup.

The info will be fine granularity timings when either
CONFIG_IRQ_TIME_ACCOUNTING or CONFIG_VIRT_CPU_ACCOUNTING is enabled.
Otherwise the info will be based on tick samples.

I looked at adding this under cpuacct.stat, but this information is useful
to the administrator in percpu format, so that any hi or si activity
on a particular CPU can be noted and some resource reallocation
(move the irq away, assign a different CPU to this cgroup, etc.)
can be done based on that info.
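
As a usage illustration (the cgroup mount point and group name below are
hypothetical; they vary by system), an admin could poll the percpu file
like this:

#include <stdio.h>

int main(void)
{
	char buf[512];
	/* hypothetical path; adjust to where the cpuacct cgroup is mounted */
	FILE *f = fopen("/dev/cgroup/mygrp/cpuacct.si_time_percpu", "r");

	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("si_time per cpu (USER_HZ): %s", buf);
	fclose(f);
	return 0;
}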

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 Documentation/cgroups/cpuacct.txt |    5 +++
 kernel/sched.c                    |   73 +++++++++++++++++++++++++++++++------
 2 files changed, 66 insertions(+), 12 deletions(-)

diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 8b93094..817435e 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -48,3 +48,8 @@ system times. This has two side effects:
   against concurrent writes.
 - It is possible to see slightly outdated values for user and system times
   due to the batch processing nature of percpu_counter.
+
+cpuacct.hi_time and cpuacct.si_time provide information about the hardirq
+and softirq processing time that was accounted to this cgroup. There are
+also percpu variants of hi_time and si_time that split the info per CPU.
+All these times are in USER_HZ units.
diff --git a/kernel/sched.c b/kernel/sched.c
index c12c8ea..7198041 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1398,6 +1398,8 @@ enum cpuacct_stat_index {
 
 enum cpuacct_charge_index {
 	CPUACCT_CHARGE_USAGE,	/* ... execution time */
+	CPUACCT_CHARGE_SI_TIME,	/* ... softirq time */
+	CPUACCT_CHARGE_HI_TIME,	/* ... hardirq time */
 
 	CPUACCT_CHARGE_NCHARGES,
 };
@@ -3226,9 +3228,11 @@ void enable_sched_clock_irqtime(void)
 #endif
 
 #if defined(CONFIG_VIRT_CPU_ACCOUNTING)
-static void account_task_irqtime(cputime64_t *task_irqtime, cputime64_t irqtime)
+static void account_task_irqtime(struct task_struct *p,
+		cputime64_t *task_irqtime, int idx, cputime64_t irqtime)
 {
 	*task_irqtime = cputime64_add(*task_irqtime, irqtime);
+	cpuacct_charge(p, idx, irqtime);
 }
 #else
 /*
@@ -3236,10 +3240,13 @@ static void account_task_irqtime(cputime64_t *task_irqtime, cputime64_t irqtime)
  * We handle !sched_clock_irqtime case here as when sched_clock_irqtime is set,
  * this accounting is done in account_system_vtime() below.
  */
-static void account_task_irqtime(cputime64_t *task_irqtime, cputime64_t irqtime)
+static void account_task_irqtime(struct task_struct *p,
+		cputime64_t *task_irqtime, int idx, cputime64_t irqtime)
 {
-	if (!sched_clock_irqtime)
+	if (!sched_clock_irqtime) {
 		*task_irqtime = cputime64_add(*task_irqtime, TICK_NSEC);
+		cpuacct_charge(p, idx, TICK_NSEC);
+	}
 }
 #endif
 
@@ -3270,10 +3277,12 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	tmp = cputime_to_cputime64(cputime);
 	if (hardirq_count() - hardirq_offset) {
 		cpustat->irq = cputime64_add(cpustat->irq, tmp);
-		account_task_irqtime(&p->hi_time, tmp);
+		account_task_irqtime(p, &p->hi_time,
+					CPUACCT_CHARGE_HI_TIME, tmp);
 	} else if (softirq_count()) {
 		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
-		account_task_irqtime(&p->si_time, tmp);
+		account_task_irqtime(p, &p->si_time,
+					CPUACCT_CHARGE_SI_TIME, tmp);
 	} else {
 		cpustat->system = cputime64_add(cpustat->system, tmp);
 	}
@@ -8737,6 +8746,22 @@ cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	kfree(ca);
 }
 
+static u64 cpuacct_cpuusage_convert(u64 data, enum cpuacct_charge_index idx)
+{
+	switch (idx) {
+	case CPUACCT_CHARGE_SI_TIME:
+	case CPUACCT_CHARGE_HI_TIME:
+		/*
+		 * irqtime is stored either in ns or cputime64, depending
+		 * on CONFIG_VIRT_CPU_ACCOUNTING. Convert it to clock_t
+		 * before returning to user.
+		 */
+		return irqtime_to_clock_t(data);
+	default:
+		return data;
+	}
+}
+
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca,
 		enum cpuacct_charge_index idx, int cpu)
 {
@@ -8754,7 +8779,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca,
 	data = *cpuusage;
 #endif
 
-	return data;
+	return cpuacct_cpuusage_convert(data, idx);
 }
 
 static void cpuacct_cpuusage_write(struct cpuacct *ca,
@@ -8853,6 +8878,26 @@ static struct cftype files[] = {
 		.private = CPUACCT_CHARGE_USAGE,
 	},
 	{
+		.name = "si_time",
+		.read_u64 = cpuusage_read,
+		.private = CPUACCT_CHARGE_SI_TIME,
+	},
+	{
+		.name = "si_time_percpu",
+		.read_seq_string = cpuacct_percpu_seq_read,
+		.private = CPUACCT_CHARGE_SI_TIME,
+	},
+	{
+		.name = "hi_time",
+		.read_u64 = cpuusage_read,
+		.private = CPUACCT_CHARGE_HI_TIME,
+	},
+	{
+		.name = "hi_time_percpu",
+		.read_seq_string = cpuacct_percpu_seq_read,
+		.private = CPUACCT_CHARGE_HI_TIME,
+	},
+	{
 		.name = "stat",
 		.read_map = cpuacct_stats_show,
 	},
@@ -9017,7 +9062,7 @@ void account_system_vtime(struct task_struct *tsk)
 {
 	unsigned long flags;
 	int cpu;
-	u64 now;
+	u64 now, delta;
 
 	if (!sched_clock_irqtime)
 		return;
@@ -9025,12 +9070,16 @@ void account_system_vtime(struct task_struct *tsk)
 	local_irq_save(flags);
 	cpu = task_cpu(tsk);
 	now = sched_clock_cpu(cpu);
-	if (hardirq_count())
-		tsk->hi_time += now - per_cpu(irq_start_time, cpu);
-	else if (softirq_count())
-		tsk->si_time += now - per_cpu(irq_start_time, cpu);
-
+	delta = now - per_cpu(irq_start_time, cpu);
 	per_cpu(irq_start_time, cpu) = now;
+	if (hardirq_count()) {
+		tsk->hi_time += delta;
+		cpuacct_charge(tsk, CPUACCT_CHARGE_HI_TIME, delta);
+	} else if (softirq_count()) {
+		tsk->si_time += delta;
+		cpuacct_charge(tsk, CPUACCT_CHARGE_SI_TIME, delta);
+	}
+
 	local_irq_restore(flags);
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-07-19 23:57 [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Venkatesh Pallipadi
                   ` (3 preceding siblings ...)
  2010-07-19 23:57 ` [PATCH 4/4] sched: Export irq times through cpuacct cgroup Venkatesh Pallipadi
@ 2010-07-20  7:55 ` Martin Schwidefsky
  2010-07-20 16:55   ` Venkatesh Pallipadi
  2010-08-24  0:56 ` Venkatesh Pallipadi
  5 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2010-07-20  7:55 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Mon, 19 Jul 2010 16:57:11 -0700
Venkatesh Pallipadi <venki@google.com> wrote:

> Currently, the softirq and hardirq time reporting is only done at the
> CPU level. There are use cases where reporting this time against tasks,
> task groups, or cgroups will be useful to the user/administrator
> in terms of resource planning and utilization charging. Also, as the
> accounting is already done at the CPU level, reporting the same at
> the task level does not add any significant computational overhead
> other than task level storage (patch 1).

I never understood why the softirq and hardirq time gets accounted to a
task at all. Why is it that the poor task that is running gets charged
with the cpu time of an interrupt that has nothing to do with the task?
I consider this to be a bug, and now this gets formalized in the
taskstats interface? Imho not a good idea.

> The softirq/hardirq statistics are commonly done based on tick sampling,
> though some archs have CONFIG_VIRT_CPU_ACCOUNTING based fine granularity
> accounting. Having a similar mechanism to get fine granularity accounting
> on x86 would be a major challenge, given the state of TSC reliability
> on various platforms and also the overhead it may add in common paths
> like syscall entry/exit.
> 
> An alternative is to have a generic (sched_clock based) and configurable
> fine-granularity accounting of si and hi time which can be reported
> over the /proc/<pid>/stat API (patch 2).

To get fine granular accounting for interrupts you need to do a
sched_clock call on irq entry and another one on irq exit. Isn't that
too expensive on an x86 system? (I do think this is a good idea but
still there is the worry about the overhead).

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-07-20  7:55 ` [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Martin Schwidefsky
@ 2010-07-20 16:55   ` Venkatesh Pallipadi
  2010-07-22 11:12     ` Martin Schwidefsky
  0 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-07-20 16:55 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, Jul 20, 2010 at 12:55 AM, Martin Schwidefsky
<schwidefsky@de.ibm.com> wrote:
> On Mon, 19 Jul 2010 16:57:11 -0700
> Venkatesh Pallipadi <venki@google.com> wrote:
>
>> Currently, the softirq and hardirq time reporting is only done at the
>> CPU level. There are use cases where reporting this time against tasks,
>> task groups, or cgroups will be useful to the user/administrator
>> in terms of resource planning and utilization charging. Also, as the
>> accounting is already done at the CPU level, reporting the same at
>> the task level does not add any significant computational overhead
>> other than task level storage (patch 1).
>
> I never understood why the softirq and hardirq time gets accounted to a
> task at all. Why is it that the poor task that is running gets charged
> with the cpu time of an interrupt that has nothing to do with the task?
> I consider this to be a bug, and now this gets formalized in the
> taskstats interface? Imho not a good idea.

Agree that this is a bug. I started by looking at resolving that. But,
it was not exactly easy. Ideally we want irq times to be charged to the
right task as much as possible. With things like network rcv softirq,
for example, there is a task that is going to consume the packet
eventually that should be charged. If we can't find a suitable match we
may have to charge it to some system thread. Things like threaded
interrupts will mitigate this problem a bit. But, until we have a good
enough solution, this bug will be around with us.

This change takes a small step, giving a hint about this to the
user/administrator, who can take some corrective action based on it.
The next step is to give the CFS scheduler some info about this and I am
working on a patch for that. That will help in load balancing
decisions, with an irq-heavy CPU not trying to get equal weightage as
other CPUs. I don't think these interfaces are binding in any way. If
and when we have tasks not being charged for irq, we can simply report
"0" in these interfaces (there is some precedent for this in
/proc/<pid>/stat output already).

>> The softirq/hardirq statistics are commonly done based on tick sampling,
>> though some archs have CONFIG_VIRT_CPU_ACCOUNTING based fine granularity
>> accounting. Having a similar mechanism to get fine granularity accounting
>> on x86 would be a major challenge, given the state of TSC reliability
>> on various platforms and also the overhead it may add in common paths
>> like syscall entry/exit.
>>
>> An alternative is to have a generic (sched_clock based) and configurable
>> fine-granularity accounting of si and hi time which can be reported
>> over the /proc/<pid>/stat API (patch 2).
>
> To get fine granular accounting for interrupts you need to do a
> sched_clock call on irq entry and another one on irq exit. Isn't that
> too expensive on an x86 system? (I do think this is a good idea but
> still there is the worry about the overhead).

On x86: yes, overhead is a potential problem. That's the reason I had
this inside a CONFIG option. But, I have tested this with a few
workloads on different systems released in the past two years
and I did not see any measurable overhead. Note that this is used
only when sched_clock is based off of the TSC and not when it is based on
jiffies. The sched_clock overhead I measured on different platforms
was in the 30-150 cycles range, which probably isn't going to be highly
visible in generic workloads.
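
(For reference, a rough userspace approximation of that per-read cost.
It assumes an x86 machine with a usable TSC and treats a raw rdtsc as a
stand-in for a TSC-based sched_clock(); the kernel path does a little
more work, so treat the result as a lower bound.)

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
	const int iters = 1000000;
	volatile uint64_t sink = 0;
	uint64_t start, end;
	int i;

	start = __rdtsc();
	for (i = 0; i < iters; i++)
		sink += __rdtsc();	/* stand-in for a sched_clock() read */
	end = __rdtsc();

	printf("~%llu cycles per read\n",
	       (unsigned long long)((end - start) / iters));
	return 0;
}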
Archs like s390/powerpc/ia64 already do this kind of accounting with
VIRT_CPU_ACCOUNTING. So, this patch will give them task and cgroup
level info free of charge (other than potential bugs with this code
change :-)).

Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-07-20 16:55   ` Venkatesh Pallipadi
@ 2010-07-22 11:12     ` Martin Schwidefsky
  2010-07-23  2:12       ` Venkatesh Pallipadi
  0 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2010-07-22 11:12 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, 20 Jul 2010 09:55:29 -0700
Venkatesh Pallipadi <venki@google.com> wrote:

> On Tue, Jul 20, 2010 at 12:55 AM, Martin Schwidefsky
> <schwidefsky@de.ibm.com> wrote:
> > On Mon, 19 Jul 2010 16:57:11 -0700
> > Venkatesh Pallipadi <venki@google.com> wrote:
> >
> >> Currently, the softirq and hardirq time reporting is only done at the
> >> CPU level. There are use cases where reporting this time against tasks,
> >> task groups, or cgroups will be useful to the user/administrator
> >> in terms of resource planning and utilization charging. Also, as the
> >> accounting is already done at the CPU level, reporting the same at
> >> the task level does not add any significant computational overhead
> >> other than task level storage (patch 1).
> >
> > I never understood why the softirq and hardirq time gets accounted to a
> > task at all. Why is it that the poor task that is running gets charged
> > with the cpu time of an interrupt that has nothing to do with the task?
> > I consider this to be a bug, and now this gets formalized in the
> > taskstats interface? Imho not a good idea.
> 
> Agree that this is a bug. I started by looking at resolving that. But,
> it was not exactly easy. Ideally we want irq times to be charged to the
> right task as much as possible. With things like network rcv softirq,
> for example, there is a task that is going to consume the packet
> eventually that should be charged. If we can't find a suitable match we
> may have to charge it to some system thread. Things like threaded
> interrupts will mitigate this problem a bit. But, until we have a good
> enough solution, this bug will be around with us.

Yes, fixing that behavior will be tough. Just consider a standard page
cache I/O that gets merged with other I/O. You would need to "split" the
interrupt time for a block I/O among the processes that benefit from it. An
added twist is that there can be multiple processes that require the
page. Split the time even more to the different requesters of a page?
Then the order in which the requests come in suddenly gets important. Or
consider the IP packets in a network buffer; split the interrupt time
among the recipients?
The list goes on and on; my guess is that it will be next to impossible
to do it right. If the current situation is wrong because the irq
and softirq system time gets misaccounted, and the "correct" solution is
impossible, the only thing left to do is to stop accounting irq and
softirq time to processes.

> > To get fine granular accounting for interrupts you need to do a
> > sched_clock call on irq entry and another one on irq exit. Isn't that
> > too expensive on an x86 system? (I do think this is a good idea but
> > still there is the worry about the overhead).
> 
> On x86: yes, overhead is a potential problem. That's the reason I had
> this inside a CONFIG option. But, I have tested this with a few
> workloads on different systems released in the past two years
> and I did not see any measurable overhead. Note that this is used
> only when sched_clock is based off of the TSC and not when it is based on
> jiffies. The sched_clock overhead I measured on different platforms
> was in the 30-150 cycles range, which probably isn't going to be highly
> visible in generic workloads.

That makes sense to me, with a working TSC the overhead should be
small. But you will need to do a performance analysis to prove it.

> Archs like s390/powerpc/ia64 already do this kind of accounting with
> VIRT_CPU_ACCOUNTING. So, this patch will give them task and cgroup
> level info free of charge (other than potential bugs with this code
> change :-)).

Well, the task and cgroup information is there but what does it really
tell me? As long as the irq & softirq time can be caused by any other
process I don't see the value of this incorrect data point.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-07-22 11:12     ` Martin Schwidefsky
@ 2010-07-23  2:12       ` Venkatesh Pallipadi
  2010-08-24  7:51         ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-07-23  2:12 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Thu, Jul 22, 2010 at 4:12 AM, Martin Schwidefsky
<schwidefsky@de.ibm.com> wrote:
> On Tue, 20 Jul 2010 09:55:29 -0700
> Venkatesh Pallipadi <venki@google.com> wrote:
>
<snip>
>
>> Archs like s390/powerpc/ia64 already do this kind of accounting with
>> VIRT_CPU_ACCOUNTING. So, this patch will give them task and cgroup
>> level info free of charge (other than potential bugs with this code
>> change :-)).
>
> Well, the task and cgroup information is there but what does it really
> tell me? As long as the irq & softirq time can be caused by any other
> process I don't see the value of this incorrect data point.
>

Data point will be correct. How it gets used is a different qn. This
interface will be useful for Alert/Paranoid/Annoyed user/admin who
sees that the job exec_time is high but it is not doing any useful
work. With this additional info, he can probably choose to move the
job off to a different system. The user probably knows more about the job
characteristics and whether it is rightly or wrongly being charged.
Say, one task in the task group being charged for another task in the
same task group is probably OK as well. So, the user can look at this at a
different granularity than the kernel can.

Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-07-19 23:57 [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Venkatesh Pallipadi
                   ` (4 preceding siblings ...)
  2010-07-20  7:55 ` [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Martin Schwidefsky
@ 2010-08-24  0:56 ` Venkatesh Pallipadi
  2010-08-24  7:52   ` Peter Zijlstra
  5 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-08-24  0:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel

Peter,

Ping.
Does the patchset look sane?

Thanks,
Venki


On Mon, Jul 19, 2010 at 4:57 PM, Venkatesh Pallipadi <venki@google.com> wrote:
>
> Earlier version of this patchset here -
> lkml subject:
> "[RFC PATCH 0/4] Finer granularity and task/cgroup irq time accounting"
> http://marc.info/?l=linux-kernel&m=127474630527689&w=2
>
> Currently, the softirq and hardirq time reporting is only done at the
> CPU level. There are use cases where reporting this time against tasks,
> task groups, or cgroups will be useful to the user/administrator
> in terms of resource planning and utilization charging. Also, as the
> accounting is already done at the CPU level, reporting the same at
> the task level does not add any significant computational overhead
> other than task level storage (patch 1).
>
> The softirq/hardirq statistics are commonly done based on tick sampling,
> though some archs have CONFIG_VIRT_CPU_ACCOUNTING based fine granularity
> accounting. Having a similar mechanism to get fine granularity accounting
> on x86 would be a major challenge, given the state of TSC reliability
> on various platforms and also the overhead it may add in common paths
> like syscall entry/exit.
>
> An alternative is to have a generic (sched_clock based) and configurable
> fine-granularity accounting of si and hi time which can be reported
> over the /proc/<pid>/stat API (patch 2).
>
> Patch 3 and 4 are exporting this info at the cgroup level.
>
> Changes since the original RFC -
> * General code cleanup and documentation for new APIs added.
> * Handle notsc option by having a runtime flag sched_clock_irqtime, along
>  with the original CONFIG_IRQ_TIME_ACCOUNTING option.
>  Peter Zijlstra suggested the use of alternate instruction kind of mechanism
>  here. But, that is mostly x86 specific and not generic. The irq time
>  accounting code is mostly generic.
> * Did performance runs with various systems with tsc based sched_clock -
>  both with and without sched_clock_stable - running tbench, dbench, SPECjbb
>  and did not notice any measurable slowness when this option is enabled.
> Todo -
> * Peter Zijlstra suggested modifying scale_rt_power to account for
>  irq time. I have a patch for that and have been testing that right now.
>  But, that change is not very pretty as yet and also will need some more
>  testing. Feels better to make that a separate change. Will follow up
>  on that soon.
>
> Thanks,
> Venki
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-07-23  2:12       ` Venkatesh Pallipadi
@ 2010-08-24  7:51         ` Peter Zijlstra
  2010-08-24  8:05           ` Balbir Singh
  2010-08-24  8:14           ` Ingo Molnar
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24  7:51 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Martin Schwidefsky, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Thu, 2010-07-22 at 19:12 -0700, Venkatesh Pallipadi wrote:
> >
> > Well, the task and cgroup information is there but what does it really
> > tell me? As long as the irq & softirq time can be caused by any other
> > process I don't see the value of this incorrect data point.
> >
> 
> Data point will be correct. How it gets used is a different qn. This
> interface will be useful for Alert/Paranoid/Annoyed user/admin who
> sees that the job exec_time is high but it is not doing any useful
> work. 

I'm very sympathetic with Martin's POV. irq/softirq times per task don't
really make sense. In the case you provide above the solution would be
to subtract these times from the task execution time, not break it out.
In that case he would see his task not do much, and end up with the same
action list.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24  0:56 ` Venkatesh Pallipadi
@ 2010-08-24  7:52   ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24  7:52 UTC (permalink / raw)
  To: Venkatesh Pallipadi; +Cc: linux-kernel

On Mon, 2010-08-23 at 17:56 -0700, Venkatesh Pallipadi wrote:
> Does the patchset look sane?

Thanks for the prod, I should rm -rf my inbox and start over... it's
impossible to keep track of things :/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24  7:51         ` Peter Zijlstra
@ 2010-08-24  8:05           ` Balbir Singh
  2010-08-24  9:09             ` Peter Zijlstra
  2010-08-24  8:14           ` Ingo Molnar
  1 sibling, 1 reply; 28+ messages in thread
From: Balbir Singh @ 2010-08-24  8:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Venkatesh Pallipadi, Martin Schwidefsky, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Paul Menage, linux-kernel,
	Paul Turner, Heiko Carstens, Paul Mackerras, Tony Luck

* Peter Zijlstra <peterz@infradead.org> [2010-08-24 09:51:26]:

> On Thu, 2010-07-22 at 19:12 -0700, Venkatesh Pallipadi wrote:
> > >
> > > Well, the task and cgroup information is there but what does it really
> > > tell me? As long as the irq & softirq time can be caused by any other
> > > process I don't see the value of this incorrect data point.
> > >
> > 
> > Data point will be correct. How it gets used is a different qn. This
> > interface will be useful for Alert/Paranoid/Annoyed user/admin who
> > sees that the job exec_time is high but it is not doing any useful
> > work. 
> 
> I'm very sympathetic with Martin's POV. irq/softirq times per task don't
> really make sense. In the case you provide above the solution would be
> to subtract these times from the task execution time, not break it out.
> In that case he would see his task not do much, and end up with the same
> action list.
>

cgroup level info does make sense, assuming that tasks that share the
costs being mentioned here belong to the same cgroup. Though the data
is calculated per task, when accumulated per cgroup, it should be
close to being correct.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24  7:51         ` Peter Zijlstra
  2010-08-24  8:05           ` Balbir Singh
@ 2010-08-24  8:14           ` Ingo Molnar
  2010-08-24  8:49             ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2010-08-24  8:14 UTC (permalink / raw)
  To: Peter Zijlstra, Frédéric Weisbecker
  Cc: Venkatesh Pallipadi, Martin Schwidefsky, H. Peter Anvin,
	Thomas Gleixner, Balbir Singh, Paul Menage, linux-kernel,
	Paul Turner, Heiko Carstens, Paul Mackerras, Tony Luck


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, 2010-07-22 at 19:12 -0700, Venkatesh Pallipadi wrote:
> > >
> > > Well, the task and cgroup information is there but what does it really
> > > tell me? As long as the irq & softirq time can be caused by any other
> > > process I don't see the value of this incorrect data point.
> > >
> > 
> > Data point will be correct. How it gets used is a different qn. This
> > interface will be useful for Alert/Paranoid/Annoyed user/admin who
> > sees that the job exec_time is high but it is not doing any useful
> > work. 
> 
> I'm very sympathetic with Martin's POV. irq/softirq times per task 
> don't really make sense. In the case you provide above the solution 
> would be to subtract these times from the task execution time, not 
> break it out. In that case he would see his task not do much, and end 
> up with the same action list.

Right, and this connects to something Frederic sent a few RFC patches for
some time ago: fine-grained irq/softirq perf stat support. If we do
something in this area we need a facility that enables both types of 
statistics gathering.

Frederic's model is based on exclusion - so you could do a perf stat run 
that excluded softirq and hardirq execution from a workload's runtime. 
It's nifty, as it allows the reduction of measurement noise. (IRQ and 
softirq execution can be regarded as random noise added (or not added) 
to execution times)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24  8:14           ` Ingo Molnar
@ 2010-08-24  8:49             ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24  8:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Venkatesh Pallipadi,
	Martin Schwidefsky, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, 2010-08-24 at 10:14 +0200, Ingo Molnar wrote:
> Right, and this connects to something Frederic sent a few RFC patches for
> some time ago: fine-grained irq/softirq perf stat support. If we do
> something in this area we need a facility that enables both types of 
> statistics gathering. 

That facility is called irq_enter() and irq_exit() etc..
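
A minimal sketch of hanging fine-grained accounting off those hooks; the
irq_enter()/irq_exit() call points and sched_clock() are real, while the
per-cpu counters and helper names below are hypothetical illustration:

/*
 * Hypothetical sketch, not from the posted patches: time hardirq
 * execution from the existing irq_enter()/irq_exit() hooks using
 * sched_clock(). Only the hook points themselves exist in the kernel;
 * the per-cpu variables and helpers are invented for illustration.
 */
static DEFINE_PER_CPU(u64, hardirq_enter_ts);
static DEFINE_PER_CPU(u64, cpu_hardirq_time);

void account_hardirq_enter(void)	/* called from irq_enter() */
{
	__get_cpu_var(hardirq_enter_ts) = sched_clock();
}

void account_hardirq_exit(void)		/* called from irq_exit() */
{
	__get_cpu_var(cpu_hardirq_time) +=
			sched_clock() - __get_cpu_var(hardirq_enter_ts);
}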

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24  8:05           ` Balbir Singh
@ 2010-08-24  9:09             ` Peter Zijlstra
  2010-08-24 11:38               ` Balbir Singh
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24  9:09 UTC (permalink / raw)
  To: balbir
  Cc: Venkatesh Pallipadi, Martin Schwidefsky, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Paul Menage, linux-kernel,
	Paul Turner, Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, 2010-08-24 at 13:35 +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2010-08-24 09:51:26]:
> 
> > On Thu, 2010-07-22 at 19:12 -0700, Venkatesh Pallipadi wrote:
> > > >
> > > > Well, the task and cgroup information is there but what does it really
> > > > tell me? As long as the irq & softirq time can be caused by any other
> > > > process I don't see the value of this incorrect data point.
> > > >
> > > 
> > > Data point will be correct. How it gets used is a different question. This
> > > interface will be useful for Alert/Paranoid/Annoyed user/admin who
> > > sees that the job exec_time is high but it is not doing any useful
> > > work. 
> > 
> > I'm very sympathetic with Martin's POV. irq/softirq times per task don't
> > really make sense. In the case you provide above the solution would be
> > to subtract these times from the task execution time, not break it out.
> > In that case he would see his task not do much, and end up with the same
> > action list.
> >
> 
> cgroup level info does make sense, assuming that tasks that share the
> costs being mentioned here belong to the same cgroup. 

I don't think that's a valid assumption.

If it's not true for tasks, then it's not true for groups of tasks either.
It might be slightly less wrong due to the larger number of entities
reducing the error bounds, but it's still wrong in principle.

The whole attribution mess can only be solved by actually splitting out
the entities that do work, like per-cgroup workqueue threads and similar
things.

System-wide entities like IRQs are very hard to attribute correctly, as
Martin already argued, and I don't think it's worth doing.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24  9:09             ` Peter Zijlstra
@ 2010-08-24 11:38               ` Balbir Singh
  2010-08-24 11:49                 ` Peter Zijlstra
  2010-08-24 11:53                 ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Balbir Singh @ 2010-08-24 11:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Venkatesh Pallipadi, Martin Schwidefsky, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Paul Menage, linux-kernel,
	Paul Turner, Heiko Carstens, Paul Mackerras, Tony Luck

* Peter Zijlstra <peterz@infradead.org> [2010-08-24 11:09:13]:

> > 
> > cgroup level info does make sense, assuming that tasks that share the
> > costs being mentioned here belong to the same cgroup. 
> 
> I don't think that's a valid assumption.
> 
> If it's not true for tasks, then it's not true for groups of tasks either.
> It might be slightly less wrong due to the larger number of entities
> reducing the error bounds, but it's still wrong in principle.
>

The point is that for containers it is more likely to give the right answer,
and so on. Yes, the results are not 100% accurate.
 
> The whole attribution mess can only be solved by actually splitting out
> > the entities that do work, like per-cgroup workqueue threads and similar
> things.
> 
> > System-wide entities like IRQs are very hard to attribute correctly, as
> > Martin already argued, and I don't think it's worth doing.

I see Martin's viewpoint; is the suggestion then that we amortize
these costs across all tasks?


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 11:38               ` Balbir Singh
@ 2010-08-24 11:49                 ` Peter Zijlstra
  2010-08-24 11:53                 ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24 11:49 UTC (permalink / raw)
  To: balbir
  Cc: Venkatesh Pallipadi, Martin Schwidefsky, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Paul Menage, linux-kernel,
	Paul Turner, Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, 2010-08-24 at 17:08 +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2010-08-24 11:09:13]:

> > The whole attribution mess can only be solved by actually splitting out
> > the entities that do work, like per-cgroup workqueue threads and similar
> > things.
> > 
> > System-wide entities like IRQs are very hard to attribute correctly, as
> > Martin already argued, and I don't think it's worth doing.
> 
> I see Martin's viewpoint; is the suggestion then that we amortize
> these costs across all tasks?

I'm still not sure what you want them for, but if it's for wanting to
know wth the system is up to, simply account them on their own, and not
include them in any task stats.

That is, keep the existing hi/si interface and improve upon that, but
also subtract those times from the task execution times.

That way, if a cpu is like 80% hogged by IRQ action, you'll not see a
100% busy task, but only a 20% busy one.

At that point you can also feed the IRQ time back into
sched_rt_avg_update() (which strictly speaking isn't rt but !fair), and
the load-balancer will automagically try and move tasks away from that
cpu.
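
Roughly, that feedback path could look like the sketch below;
sched_rt_avg_update() is the real scheduler hook, while rq->prev_irq_time
and irq_time_cpu() are hypothetical bookkeeping:

/*
 * Hypothetical sketch: report accumulated irq time as !fair time so
 * scale_rt_power() sees less usable capacity on this CPU and the load
 * balancer steers fair tasks away from it. sched_rt_avg_update() is
 * the real hook; the irq-time fields are invented for illustration.
 */
static void sched_irq_time_avg_update(struct rq *rq)
{
	u64 irq_time = irq_time_cpu(cpu_of(rq));
	u64 delta = irq_time - rq->prev_irq_time;

	rq->prev_irq_time = irq_time;
	/* irq time is capacity lost to !fair work, much like rt time */
	sched_rt_avg_update(rq, delta);
}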

If you really want to account (and possibly control) all the work
belonging to a particular group you'll have to make sure work does
indeed stay within the group -- which is where per-cgroup workqueue
threads and per-cgroup softirq threads etc. come into play.

Lumping all work together and then trying to extract something again is
silly.

And hardirq time really is system time, not cgroup or task time.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 11:38               ` Balbir Singh
  2010-08-24 11:49                 ` Peter Zijlstra
@ 2010-08-24 11:53                 ` Peter Zijlstra
  2010-08-24 12:06                   ` Martin Schwidefsky
                                     ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24 11:53 UTC (permalink / raw)
  To: balbir
  Cc: Venkatesh Pallipadi, Martin Schwidefsky, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Paul Menage, linux-kernel,
	Paul Turner, Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, 2010-08-24 at 17:08 +0530, Balbir Singh wrote:
> 
> The point is that for containers it is more likely to give the right answer,
> and so on. Yes, the results are not 100% accurate. 

Consider one group heavily dirtying pages: it stuffs the IO queues full
and gets blocked on IO completion. Since the CPU is then free to
schedule something else, we start running things from another group;
those IO completions will come in while we run the other group and get
accounted to the other group -- FAIL.

s/group/task/ etc..

That just really doesn't work; accounting async work, especially stuff
that is not under software control, is very tricky indeed.

So what are you wanting to do, and why? Do you really need accounting
madness?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 11:53                 ` Peter Zijlstra
@ 2010-08-24 12:06                   ` Martin Schwidefsky
  2010-08-24 12:39                     ` Peter Zijlstra
  2010-08-24 12:47                   ` Balbir Singh
  2010-08-24 19:20                   ` Venkatesh Pallipadi
  2 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2010-08-24 12:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: balbir, Venkatesh Pallipadi, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, 24 Aug 2010 13:53:55 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2010-08-24 at 17:08 +0530, Balbir Singh wrote:
> > 
> > The point is that for containers it is more likely to give the right answer,
> > and so on. Yes, the results are not 100% accurate. 
> 
> Consider one group heavily dirtying pages: it stuffs the IO queues full
> and gets blocked on IO completion. Since the CPU is then free to
> schedule something else, we start running things from another group;
> those IO completions will come in while we run the other group and get
> accounted to the other group -- FAIL.
> 
> s/group/task/ etc..
> 
> That just really doesn't work; accounting async work, especially stuff
> that is not under software control, is very tricky indeed.
> 
> So what are you wanting to do, and why? Do you really need accounting
> madness?

Well, I sent a patch back in 2006 that stops adding the hardirq /
softirq time to the currently running process. See
http://lkml.org/lkml/2006/8/24/139
It did not get very far, so the answer to the question of whether we need
accounting madness seems to be yes ..

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 12:06                   ` Martin Schwidefsky
@ 2010-08-24 12:39                     ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24 12:39 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: balbir, Venkatesh Pallipadi, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, 2010-08-24 at 14:06 +0200, Martin Schwidefsky wrote:
> Well, I sent a patch back in 2006 that stops adding the hardirq /
> softirq time to the currently running process. See
> http://lkml.org/lkml/2006/8/24/139

Time to maybe revisit that ;-)

> It did not get very far, so the answer to the question of whether we need
> accounting madness seems to be yes .. 

Well, that's only a little accounting, but trying to infer which task
to account IRQ time to is going to become madness.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 11:53                 ` Peter Zijlstra
  2010-08-24 12:06                   ` Martin Schwidefsky
@ 2010-08-24 12:47                   ` Balbir Singh
  2010-08-24 13:08                     ` Peter Zijlstra
  2010-08-24 19:20                   ` Venkatesh Pallipadi
  2 siblings, 1 reply; 28+ messages in thread
From: Balbir Singh @ 2010-08-24 12:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Venkatesh Pallipadi, Martin Schwidefsky, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Paul Menage, linux-kernel,
	Paul Turner, Heiko Carstens, Paul Mackerras, Tony Luck

* Peter Zijlstra <peterz@infradead.org> [2010-08-24 13:53:55]:

> On Tue, 2010-08-24 at 17:08 +0530, Balbir Singh wrote:
> > 
> > The point is that for containers it is more likely to give the right answer,
> > and so on. Yes, the results are not 100% accurate. 
> 
> Consider one group heavily dirtying pages: it stuffs the IO queues full
> and gets blocked on IO completion. Since the CPU is then free to
> schedule something else, we start running things from another group;
> those IO completions will come in while we run the other group and get
> accounted to the other group -- FAIL.
> 
> s/group/task/ etc..
> 
> That just really doesn't work; accounting async work, especially stuff
> that is not under software control, is very tricky indeed.

Yes, we don't have sufficient context to charge the correct party. I
think openvz has some technology there, and we will too when we have I/O
cgroups, but the instances of such operations are too many to accurately
identify them all.

> 
> So what are you wanting to do, and why? Do you really need accounting
> madness?

I think Venki gave the answer in the posting

"There are usecases where reporting this time against task
or task groups or cgroups will be useful for user/administrator
in terms of resource planning and utilization charging"

I don't have any specific use cases; I was just reviewing the patchset
and trying to understand how to solve the problem.


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 12:47                   ` Balbir Singh
@ 2010-08-24 13:08                     ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24 13:08 UTC (permalink / raw)
  To: balbir
  Cc: Venkatesh Pallipadi, Martin Schwidefsky, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Paul Menage, linux-kernel,
	Paul Turner, Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, 2010-08-24 at 18:17 +0530, Balbir Singh wrote:
> 
> "There are usecases where reporting this time against task
> or task groups or cgroups will be useful for user/administrator
> in terms of resource planning and utilization charging"

Or confusing: what happens if you attribute the IRQ overhead of a
ping-flood to your tasks?

By not providing these numbers per task/group, people will have to
actually think about what is causing these high irq loads and have a
chance of actually doing better than random attribution.

So no, providing random numbers on the slight chance that they might
possibly make sense for your workload doesn't seem like a sound reason
to provide them.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 11:53                 ` Peter Zijlstra
  2010-08-24 12:06                   ` Martin Schwidefsky
  2010-08-24 12:47                   ` Balbir Singh
@ 2010-08-24 19:20                   ` Venkatesh Pallipadi
  2010-08-24 20:39                     ` Peter Zijlstra
  2 siblings, 1 reply; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-08-24 19:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: balbir, Martin Schwidefsky, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck

On Tue, Aug 24, 2010 at 4:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-08-24 at 17:08 +0530, Balbir Singh wrote:
>>
>> The point is that for containers it is more likely to give the right answer,
>> and so on. Yes, the results are not 100% accurate.
>
> Consider one group heavily dirtying pages: it stuffs the IO queues full
> and gets blocked on IO completion. Since the CPU is then free to
> schedule something else, we start running things from another group;
> those IO completions will come in while we run the other group and get
> accounted to the other group -- FAIL.
>
> s/group/task/ etc..
>
> That just really doesn't work; accounting async work, especially stuff
> that is not under software control, is very tricky indeed.
>
> So what are you wanting to do, and why? Do you really need accounting
> madness?

(long email alert)
I have two different answers for why we ended up with this madness.

My personal take on why we need this and the actual flow of how I ended
up with this patchset.

- Current /proc/stat hardirq and softirq time reporting is broken for
most archs as it does tick sampling. Hardirq time specifically is
further broken due to interrupts being disabled during irq -
http://kerneltrap.org/mailarchive/linux-kernel/2010/5/25/4574864
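
For reference, these are the counters in question; a small runnable
userspace snippet that samples the tick-based irq (field 6) and softirq
(field 7) columns from the aggregate cpu line of /proc/stat:

/* Sample the tick-sampled hi/si counters from /proc/stat; values are
 * in USER_HZ ticks, i.e. exactly the coarse data criticized above. */
#include <stdio.h>

int main(void)
{
	unsigned long long user, nice, sys, idle, iowait, irq, softirq;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;
	if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
		   &user, &nice, &sys, &idle, &iowait, &irq, &softirq) == 7)
		printf("hi=%llu si=%llu (ticks)\n", irq, softirq);
	fclose(f);
	return 0;
}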

OK. Let's fix /proc/stat. But that doesn't seem enough. We should also
not account this time to tasks themselves.

- I started looking at not accounting this time to tasks themselves.
This was really tricky as things are tightly tied to scheduler
vruntime to get it right. I am not even sure I got it totally right
:(, but I did play with the patch a bit. And noticed there were
multiple issues. 1) A silly case of two tasks on one CPU, one
task totally CPU bound and another task doing network recv. This is
how task and softirq time look for this (10s samples)
(loop)  (nc)
503 9   502 301
502 8   502 303
502 9   501 302
502 8   502 302
503 9   501 302
Now, when I did "not account si time to task", the loop task ended up
getting a lot less CPU time and doing less work as the nc task doing rcv
got more CPU share, which was not the right thing to do. IIRC, I had
something like <300 centiseconds for loop after the change (with si
activity increasing due to the higher runtime of the nc task).
2) Also, a minor problem of breaking the current userspace API, as
tasks/cgroup stats assume that irq times are included.

So, even though accounting irq time as "system time" seems
the right thing to do, it can break scheduling in many ways. Maybe
hardirq can be accounted as system time. But dealing with softirqs is
tricky as they can be related to the task.

Figuring out si time and accounting it to the right task is a non-starter.
There are so many different ways in which si will come into the picture;
finding and accounting it to the right task will be almost impossible.

So, why not do the simple things first. Do not disturb any existing
scheduling decisions; account accurate hi and si times system-wide,
per task, per cgroup (with as little overhead as possible). Give this
info to users and admin programs and they may make higher-level
sense of this. In the above silly example, the user will probably know
that loop is CPU bound whereas nc will have net rcv si's, and can
decide when to subtract si time and when not to.

That's how I ended up with this patchset.


The other point of view for why this is needed comes from management
apps: apps which do a higher-level task/task-group distribution across
different systems and manage them over time (monitoring, resource
allocation, migration, etc.).
There are times when a task/task group is getting some hi/si
"interference" from some other task/task group currently active on the
system, or even the -ping flood- kind of activity that you mentioned.
These problems are happening now, and tasks end up running slow when
such interference happens while exec run time still looks normal. So,
the management app has no clue what's going wrong. One can argue that
this is the same as interference from cache conflicts etc. But for the
hi/si case, we in the OS should be able to handle things better. We
should either signal such interference to the app, or reflect proper
exec time. That brings us back to the debate of whether we should report
them at all or transparently account them as system time and remove them
from all tasks.


Having looked at both the options, I feel having these exports is an
immediate first step. That helps users who are currently having this
problem and would hardly hurt the users who don't see any problem
(it is a CONFIG option). Hardware irqs are probably best accounted as
system time. But there are potential issues doing the same with
softirqs. Longer term we may be able to deal with them all in a clean
way. But that doesn't affect the short-term solution, as we will just
have these new exports end up as zero if and when we get to the full
solution.
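
To make the consumer side concrete: assuming, purely for illustration,
that the per-task counters were appended as the last two fields of
/proc/<pid>/stat (the real field layout is whatever the patches define),
a management app could poll them like this:

/* Hypothetical poller for per-task hi/si time; the "last two fields"
 * layout is an assumption for illustration, not taken from the patches. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], buf[4096];
	char *tok, *save, *last = NULL, *prev = NULL;
	unsigned long long hi, si;
	FILE *f;

	if (argc < 2)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;
	fclose(f);

	/* keep the last two whitespace-separated fields */
	for (tok = strtok_r(buf, " \n", &save); tok;
	     tok = strtok_r(NULL, " \n", &save)) {
		prev = last;
		last = tok;
	}
	if (prev && last &&
	    sscanf(prev, "%llu", &hi) == 1 && sscanf(last, "%llu", &si) == 1)
		printf("hi=%llu si=%llu\n", hi, si);
	return 0;
}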

Adding stuff to sched_rt_avg_update: yes, that's another step in that
direction, and I have the patch for it. That does take care of load
balancing. But, I think removing both hi and si from the task right away
will expose more oddities than it will solve....


Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 19:20                   ` Venkatesh Pallipadi
@ 2010-08-24 20:39                     ` Peter Zijlstra
  2010-08-25  2:02                       ` Venkatesh Pallipadi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-24 20:39 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: balbir, Martin Schwidefsky, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck, Mikael Starvik,
	dhowells

On Tue, 2010-08-24 at 12:20 -0700, Venkatesh Pallipadi wrote:

> (long email alert)
> I have two different answers for why we ended up with this madness.
> 
> My personal take on why we need this and the actual flow of how I ended
> up with this patchset.
> 
> - Current /proc/stat hardirq and softirq time reporting is broken for
> most archs as it does tick sampling. Hardirq time specifically is
> further broken due to interrupts being disabled during irq -
> http://kerneltrap.org/mailarchive/linux-kernel/2010/5/25/4574864

Yeah, architectures without a decent clock are a pain (x86 is still on
that list although nhm/wsm don't suck too bad), but it might be
worthwhile to look at what arch/$foo are strictly tick based.

A quick look suggests:

 alpha
 arm (some)
 avr32
 cris (it could remove its implementation, it's identical
       to the weak function provided by kernel/sched_clock.c)
 frv  (idem)
 h8300
 m32r
 m68k* (except nommu-coldfire)
 mips (except cavium-octeon)
 parisc
 score
 sh
 xtensa

which seems to mean too damn many, I bet we can't simply move those to
staging? :-)
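
For reference, the weak fallback those architectures end up with is the
jiffies-based one, approximately (paraphrased from kernel/sched_clock.c
of this era):

/*
 * Approximate reproduction of the weak jiffies-based sched_clock();
 * resolution is a full tick, so an arch stuck on it cannot do the
 * fine-grained irq time accounting being discussed.
 */
unsigned long long __attribute__((weak)) sched_clock(void)
{
	return (unsigned long long)(jiffies - INITIAL_JIFFIES)
					* (NSEC_PER_SEC / HZ);
}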

> OK. Let's fix /proc/stat. But that doesn't seem enough. We should also
> not account this time to tasks themselves.

Right

> - I started looking at not accounting this time to tasks themselves.
> This was really tricky as things are tightly tied to scheduler
> vruntime to get it right. 

I'm not exactly sure where that would get complicated; simply treat
interrupts the same as preemptions by other tasks and things should
basically work out rather straightforwardly from that.

> I am not even sure I got it totally right
> :(, but I did play with the patch a bit. And noticed there were
> multiple issues. 1) A silly case of two tasks on one CPU, one
> task totally CPU bound and another task doing network recv. This is
> how task and softirq time look for this (10s samples)
> (loop)  (nc)
> 503 9   502 301
> 502 8   502 303
> 502 9   501 302
> 502 8   502 302
> 503 9   501 302
> Now, when I did "not account si time to task", the loop task ended up
> getting a lot less CPU time and doing less work as the nc task doing rcv
> got more CPU share, which was not the right thing to do. IIRC, I had
> something like <300 centiseconds for loop after the change (with si
> activity increasing due to the higher runtime of the nc task).

Well, that actually makes sense and I wouldn't call it wrong.

> 2) Also, a minor problem of breaking the current userspace API, as
> tasks/cgroup stats assume that irq times are included.

Is that actually specified or simply assumed because our implementation
always had that bug? I would really call not accounting irq time to
tasks a bug-fix.

> So, even though accounting irq time as "system time" seems
> the right thing to do, it can break scheduling in many ways. Maybe
> hardirq can be accounted as system time. But dealing with softirqs is
> tricky as they can be related to the task.

I haven't yet seen any scheduler breakage here; it will divide time
differently, but not in a broken way. If the system consumes 1/3rd of
the time, there's only 2/3rd left to fairly distribute between tasks, so
something like 1/3-loop, 1/3-nc, 1/3-softirq makes perfect sense.

You'd get exactly the same kind of thing if you replace (soft)irq with a
FIFO task.

The whole schizo softirq infrastructure (hardirq tails and tasks) is a
pain though; I would really love to rid the kernel of it, but I've got
no idea how to do something like that, given that things like the whole
network subsystem are tightly woven into it.

> Figuring out si time and accounting it to the right task is a non-starter.
> There are so many different ways in which si will come into the picture;
> finding and accounting it to the right task will be almost impossible.

Agreed, hence:

> So, why not do the simple things first. Do not disturb any existing
> scheduling decisions; account accurate hi and si times system-wide,
> per task, per cgroup (with as little overhead as possible). Give this
> info to users and admin programs and they may make higher-level
> sense of this.
> 
> Having looked at both the options, I feel having these exports is an
> immediate first step. 

This is where I strongly disagree: providing an interface that cannot
possibly be implemented correctly just so you can fudge something from
userspace (still not sure what) seems a very bad idea indeed.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-24 20:39                     ` Peter Zijlstra
@ 2010-08-25  2:02                       ` Venkatesh Pallipadi
  2010-08-25  7:20                         ` Martin Schwidefsky
  2010-09-08 11:12                         ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Venkatesh Pallipadi @ 2010-08-25  2:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: balbir, Martin Schwidefsky, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck, Mikael Starvik,
	dhowells

On Tue, Aug 24, 2010 at 1:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-08-24 at 12:20 -0700, Venkatesh Pallipadi wrote:
>
>
>> - I started looking at not accounting this time to tasks themselves.
>> This was really tricky as things are tightly tied to scheduler
>> vruntime to get it right.
>
> I'm not exactly sure where that would get complicated; simply treat
> interrupts the same as preemptions by other tasks and things should
> basically work out rather straightforwardly from that.
>

At least the way I tried it turned out to be messy: keep track of time
at si and hi and remove it from the update_curr delta. I did it that
way as I didn't want to take the rq lock on the hi/si path. Doing it
as preemption with put_prev/pick_next would be expensive, no? Or did
you mean it some other way?

>> I am not even sure I got it totally right
>> :(, but I did play with the patch a bit. And noticed there were
>> multiple issues. 1) A silly case of two tasks on one CPU, one
>> task totally CPU bound and another task doing network recv. This is
>> how task and softirq time look for this (10s samples)
>> (loop)  (nc)
>> 503 9   502 301
>> 502 8   502 303
>> 502 9   501 302
>> 502 8   502 302
>> 503 9   501 302
>> Now, when I did "not account si time to task", the loop task ended up
>> getting a lot less CPU time and doing less work as the nc task doing rcv
>> got more CPU share, which was not the right thing to do. IIRC, I had
>> something like <300 centiseconds for loop after the change (with si
>> activity increasing due to the higher runtime of the nc task).
>
> Well, that actually makes sense and I wouldn't call it wrong.

I meant that it will make nc run for more than its fair share.

>
>> 2) Also, a minor problem of breaking the current userspace API, as
>> tasks/cgroup stats assume that irq times are included.
>
> Is that actually specified or simply assumed because our implementation
> always had that bug? I would really call not accounting irq time to
> tasks a bug-fix.

Agreed. This should not be a big deal either way.

>> So, even though accounting irq time as "system time" seems
>> the right thing to do, it can break scheduling in many ways. Maybe
>> hardirq can be accounted as system time. But dealing with softirqs is
>> tricky as they can be related to the task.
>
> I haven't yet seen any scheduler breakage here; it will divide time
> differently, but not in a broken way. If the system consumes 1/3rd of
> the time, there's only 2/3rd left to fairly distribute between tasks, so
> something like 1/3-loop, 1/3-nc, 1/3-softirq makes perfect sense.
>
> You'd get exactly the same kind of thing if you replace (soft)irq with a
> FIFO task.
>

But, FIFO in that case would be some unrelated task taking away CPU.
Here one task can take more than its share due to si.

Also, network RFS will try to get the softirq to the right CPU that's
running this task. So, this will be sort of a common case where the task
with softirq will run faster and other non-si tasks will run slower
with a change like this.

> The whole schizo softirq infrastructure (hardirq tails and tasks) is a
> pain though; I would really love to rid the kernel of it, but I've got
> no idea how to do something like that, given that things like the whole
> network subsystem are tightly woven into it.
>
>> Figuring out si time and accounting it to the right task is a non-starter.
>> There are so many different ways in which si will come into the picture;
>> finding and accounting it to the right task will be almost impossible.
>
> Agreed, hence:
>
>> So, why not do the simple things first. Do not disturb any existing
>> scheduling decisions; account accurate hi and si times system-wide,
>> per task, per cgroup (with as little overhead as possible). Give this
>> info to users and admin programs and they may make higher-level
>> sense of this.
>>
>> Having looked at both the options, I feel having these exports is an
>> immediate first step.
>
> > This is where I strongly disagree: providing an interface that cannot
> > possibly be implemented correctly just so you can fudge something from
> > userspace (still not sure what) seems a very bad idea indeed.
>

I don't think correctness is a problem. TSC is pretty good for this
purpose on current hardware. I agree that usability is debatable.

The use case I mentioned is some management application trying to find
interference/slowness for a task/task group because some other
si-intensive task or flood ping is on that CPU, getting to know that from
the si/hi time for the task and what it "expects it to be". Yes, this is
vague. But, I think you agree that the problem of si/hi interference on
unrelated tasks exists today. And providing this interface was the quick
way to give some hint to management apps about such problems. But the
other alternative, making si and hi time "system time", will help this
use case as well, as the user will notice lower exec_time in that
case.

If you strongly think that the right way is to make both si and hi
"system time" and that it will not cause unfairness and slowdown for some
unrelated tasks, I can try to clean up the patch I had for that and
send it out. I am afraid, though, that it will cause some regression and
we will end up back at square one after a month or so. :(

Thanks,
Venki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-25  2:02                       ` Venkatesh Pallipadi
@ 2010-08-25  7:20                         ` Martin Schwidefsky
  2010-09-08 11:12                         ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Martin Schwidefsky @ 2010-08-25  7:20 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, balbir, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck, Mikael Starvik,
	dhowells

On Tue, 24 Aug 2010 19:02:04 -0700
Venkatesh Pallipadi <venki@google.com> wrote:

> On Tue, Aug 24, 2010 at 1:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2010-08-24 at 12:20 -0700, Venkatesh Pallipadi wrote:
> >> So, why not do the simple things first. Do not disturb any existing
> >> scheduling decisions; account accurate hi and si times system-wide,
> >> per task, per cgroup (with as little overhead as possible). Give this
> >> info to users and admin programs and they may make higher-level
> >> sense of this.
> >>
> >> Having looked at both the options, I feel having these exports is an
> >> immediate first step.
> >
> > This is where I strongly disagree: providing an interface that cannot
> > possibly be implemented correctly just so you can fudge something from
> > userspace (still not sure what) seems a very bad idea indeed.
> >
> 
> I don't think correctness is a problem. TSC is pretty good for this
> purpose on current hardware. I agree that usability is debatable.
> 
> The use case I mentioned is some management application trying to find
> interference/slowness for a task/task group because some other
> si-intensive task or flood ping is on that CPU, getting to know that from
> the si/hi time for the task and what it "expects it to be". Yes, this is
> vague. But, I think you agree that the problem of si/hi interference on
> unrelated tasks exists today. And providing this interface was the quick
> way to give some hint to management apps about such problems. But the
> other alternative, making si and hi time "system time", will help this
> use case as well, as the user will notice lower exec_time in that
> case.
> 
> If you strongly think that the right way is to make both si and hi
> "system time" and that it will not cause unfairness and slowdown for some
> unrelated tasks, I can try to clean up the patch I had for that and
> send it out. I am afraid, though, that it will cause some regression and
> we will end up back at square one after a month or so. :(

But it is a correctness problem. It is wrong to account the si and hi
time to some random process. To base any kind of decision on wrong data
is asking for trouble. If we cannot correctly attribute the si and hi
time to the correct process (which we agree is next to impossible), then
the only thing left to do is to report the time on its own. You can
still pick a random process in your management application and add the
time in user space. As wrong as before, but some other application might
want to do smarter things with the data point.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting
  2010-08-25  2:02                       ` Venkatesh Pallipadi
  2010-08-25  7:20                         ` Martin Schwidefsky
@ 2010-09-08 11:12                         ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-09-08 11:12 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: balbir, Martin Schwidefsky, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Paul Menage, linux-kernel, Paul Turner,
	Heiko Carstens, Paul Mackerras, Tony Luck, Mikael Starvik,
	dhowells

On Tue, 2010-08-24 at 19:02 -0700, Venkatesh Pallipadi wrote:
> On Tue, Aug 24, 2010 at 1:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2010-08-24 at 12:20 -0700, Venkatesh Pallipadi wrote:
> >
> >
> >> - I started looking at not accounting this time to tasks themselves.
> >> This was really tricky as things are tightly tied to scheduler
> >> vruntime to get it right.
> >
> > I'm not exactly sure where that would get complicated; simply treat
> > interrupts the same as preemptions by other tasks and things should
> > basically work out rather straightforwardly from that.
> >
> 
> At least the way I tried it turned out to be messy: keep track of time
> at si and hi and remove it from the update_curr delta. I did it that
> way as I didn't want to take the rq lock on the hi/si path. Doing it
> as preemption with put_prev/pick_next would be expensive, no? Or did
> you mean it some other way?

No, removing it from update_curr()'s delta is exactly what I meant. That
gives the same end result as if the task were preempted (i.e. it ran
less).
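
Spelled out, that could look roughly like the sketch below against the
CFS update_curr() of this era; irq_time_read() and the irq_time_mark
field are hypothetical bookkeeping, only the update_curr() path itself
is real:

/*
 * Hypothetical sketch of "remove irq time from update_curr()'s delta":
 * hi/si time that elapsed while this task was nominally on the CPU is
 * discounted before exec time and vruntime are charged, so the task is
 * treated as if it had been preempted for that long.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	struct rq *rq = rq_of(cfs_rq);
	unsigned long delta_exec;
	u64 irq_delta;

	if (unlikely(!curr))
		return;

	delta_exec = (unsigned long)(rq->clock - curr->exec_start);

	/* time spent in hard/soft irq since exec_start is not task time */
	irq_delta = irq_time_read(cpu_of(rq)) - curr->irq_time_mark;
	curr->irq_time_mark += irq_delta;
	if (irq_delta < delta_exec)
		delta_exec -= irq_delta;
	else
		delta_exec = 0;

	__update_curr(cfs_rq, curr, delta_exec);
	curr->exec_start = rq->clock;
}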

> >> I am not even sure I got it totally right
> >> :(, but I did play with the patch a bit. And noticed there were
> >> multiple issues. 1) A silly case of two tasks on one CPU, one
> >> task totally CPU bound and another task doing network recv. This is
> >> how task and softirq time look for this (10s samples)
> >> (loop)  (nc)
> >> 503 9   502 301
> >> 502 8   502 303
> >> 502 9   501 302
> >> 502 8   502 302
> >> 503 9   501 302
> >> Now, when I did "not account si time to task", the loop task ended up
> >> getting a lot less CPU time and doing less work as the nc task doing rcv
> >> got more CPU share, which was not the right thing to do. IIRC, I had
> >> something like <300 centiseconds for loop after the change (with si
> >> activity increasing due to the higher runtime of the nc task).
> >
> > Well, that actually makes sense and I wouldn't call it wrong.
> 
> I meant that it will make nc run for more than its fair share.

No, it wouldn't; it would make nc run exactly its fair share. SoftIRQ
time isn't nc time, so by not accounting it to nc, nc gets to run more.

> >> So, even though accounting irq time as "system time" seems
> >> the right thing to do, it can break scheduling in many ways. Maybe
> >> hardirq can be accounted as system time. But dealing with softirqs is
> >> tricky as they can be related to the task.
> >
> > I haven't yet seen any scheduler breakage here; it will divide time
> > differently, but not in a broken way. If the system consumes 1/3rd of
> > the time, there's only 2/3rd left to fairly distribute between tasks, so
> > something like 1/3-loop, 1/3-nc, 1/3-softirq makes perfect sense.
> >
> > You'd get exactly the same kind of thing if you replace (soft)irq with a
> > FIFO task.
> >
> 
> But, FIFO in that case would be some unrelated task taking away CPU.
> Here one task can take more than its share due to si.

No, how is being preempted by an unrelated task (FIFO whatever)
different from being 'preempted' by unrelated SoftIRQ processing?


> > This is where I strongly disagree: providing an interface that cannot
> > possibly be implemented correctly just so you can fudge something from
> > userspace (still not sure what) seems a very bad idea indeed.
> >
> 
> I don't think correctness is a problem. TSC is pretty good for this
> purpose on current hardware. I agree that usability is debatable.

As already argued, it is a correctness issue; TSC is only an accuracy
issue, but since you cannot properly attribute this very accurate time
measurement, you're still totally incorrect.

> If you strongly think that the right way is to make both si and hi
> "system time" and that it will not cause unfairness and slowdown for some
> unrelated tasks, I can try to clean up the patch I had for that and
> send it out. I am afraid, though, that it will cause some regression and
> we will end up back at square one after a month or so. :(

Sure, it will affect some workloads, but it's also more correct. If
network workloads suffer we can look at sorting out some of those
problems. But really, SoftIRQ != task context and thus should not be
accounted to said task.

^ permalink raw reply	[flat|nested] 28+ messages in thread


Thread overview: 28+ messages
2010-07-19 23:57 [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Venkatesh Pallipadi
2010-07-19 23:57 ` [PATCH 1/4] sched: Track and export per task [hard|soft]irq time Venkatesh Pallipadi
2010-07-19 23:57 ` [PATCH 2/4] x86: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time to task Venkatesh Pallipadi
2010-07-19 23:57 ` [PATCH 3/4] sched: Generalize cpuacct usage tracking making it simpler to add new stats Venkatesh Pallipadi
2010-07-19 23:57 ` [PATCH 4/4] sched: Export irq times through cpuacct cgroup Venkatesh Pallipadi
2010-07-20  7:55 ` [PATCH 0/4] Finer granularity and task/cgroup irq time accounting Martin Schwidefsky
2010-07-20 16:55   ` Venkatesh Pallipadi
2010-07-22 11:12     ` Martin Schwidefsky
2010-07-23  2:12       ` Venkatesh Pallipadi
2010-08-24  7:51         ` Peter Zijlstra
2010-08-24  8:05           ` Balbir Singh
2010-08-24  9:09             ` Peter Zijlstra
2010-08-24 11:38               ` Balbir Singh
2010-08-24 11:49                 ` Peter Zijlstra
2010-08-24 11:53                 ` Peter Zijlstra
2010-08-24 12:06                   ` Martin Schwidefsky
2010-08-24 12:39                     ` Peter Zijlstra
2010-08-24 12:47                   ` Balbir Singh
2010-08-24 13:08                     ` Peter Zijlstra
2010-08-24 19:20                   ` Venkatesh Pallipadi
2010-08-24 20:39                     ` Peter Zijlstra
2010-08-25  2:02                       ` Venkatesh Pallipadi
2010-08-25  7:20                         ` Martin Schwidefsky
2010-09-08 11:12                         ` Peter Zijlstra
2010-08-24  8:14           ` Ingo Molnar
2010-08-24  8:49             ` Peter Zijlstra
2010-08-24  0:56 ` Venkatesh Pallipadi
2010-08-24  7:52   ` Peter Zijlstra
