linux-kernel.vger.kernel.org archive mirror
* [RFC 0/4] do not make cputime scaling in kernel
@ 2013-03-28 16:53 Stanislaw Gruszka
  2013-03-28 16:53 ` [RFC 1/4] cputime: change parameter of thread_group_cputime_adjusted Stanislaw Gruszka
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Stanislaw Gruszka @ 2013-03-28 16:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Peter Zijlstra, hpa, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel

This patch series removes cputime scaling from the kernel. It can easily be
done in user space using floating point if we provide sum_exec_runtime, which
is what patches 2/4 and 3/4 do. I have a procps patch which utilizes that:

http://people.redhat.com/sgruszka/procps-use-sum_exec_runtime.patch

I will post it if this patch set gets queued.

The change also affects the getrusage() and times() syscalls, but I don't
think the kernel gives any guarantees about utime/stime precision; as a
matter of fact, before commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa we did
not perform any scaling and provided raw cputime values to user space.

Providing sum_exec_runtime via proc is a countermeasure against malware that
utilizes a lot of CPU time but hides itself from the top program.

This affects kernels not compiled with CONFIG_VIRT_CPU_ACCOUNTING_{GEN,NATIVE};
if the user chooses to compile the kernel with one of those options, he/she
will have more precise cputime accounting, as documented in Kconfig.
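
For illustration, a minimal user-space sketch of such scaling could look like
the program below. It assumes that sum_exec_runtime and csum_sched_runtime
(both in nanoseconds) are appended as the final two fields of /proc/PID/stat,
as patches 2/4 and 3/4 propose; it is not the procps patch linked above, and
error handling is mostly omitted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], buf[4096], *p;
	unsigned long long utime, stime, runtime;
	double total, rtime;
	FILE *f;
	int i;

	snprintf(path, sizeof(path), "/proc/%s/stat", argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;
	fclose(f);

	/* skip "pid (comm) "; comm may itself contain spaces */
	p = strrchr(buf, ')') + 2;
	/* utime and stime are fields 14 and 15 of /proc/PID/stat */
	for (i = 0; i < 11; i++)
		p = strchr(p, ' ') + 1;
	sscanf(p, "%llu %llu", &utime, &stime);	/* raw, tick based values */

	/*
	 * With this series, sum_exec_runtime and csum_sched_runtime are the
	 * last two fields; take the second to last (the group's own time).
	 */
	p = strrchr(buf, ' ');
	*p = '\0';
	runtime = strtoull(strrchr(buf, ' ') + 1, NULL, 10);

	/* scale the tick based utime/stime split against the precise runtime */
	total = utime + stime;
	rtime = runtime / 1e9;
	if (total > 0)
		printf("utime %.2fs stime %.2fs\n",
		       rtime * utime / total, rtime * stime / total);
	else
		printf("utime 0.00s stime %.2fs\n", rtime);

	return 0;
}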


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 1/4] cputime: change parameter of thread_group_cputime_adjusted
  2013-03-28 16:53 [RFC 0/4] do not make cputime scaling in kernel Stanislaw Gruszka
@ 2013-03-28 16:53 ` Stanislaw Gruszka
  2013-03-28 16:53 ` [RFC 2/4] procfs: add sum_exec_runtime to /proc/PID/stat Stanislaw Gruszka
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Stanislaw Gruszka @ 2013-03-28 16:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Peter Zijlstra, hpa, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel, Stanislaw Gruszka

This is preparation for further changes in the cputime code.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 fs/proc/array.c        |    5 ++++-
 include/linux/sched.h  |    2 +-
 kernel/exit.c          |    8 ++++----
 kernel/sched/cputime.c |   18 ++++++------------
 kernel/sys.c           |   18 ++++++++++--------
 5 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index f7ed9ee..fef7850 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -446,6 +446,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		/* add up live thread stats at the group level */
 		if (whole) {
 			struct task_struct *t = task;
+			struct task_cputime cputime;
 			do {
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
@@ -455,7 +456,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
-			thread_group_cputime_adjusted(task, &utime, &stime);
+			thread_group_cputime_adjusted(task, &cputime);
+			utime = cputime.utime;
+			stime = cputime.stime;
 			gtime += sig->gtime;
 		}
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 10626e2..eb6005c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1595,7 +1595,7 @@ static inline cputime_t task_gtime(struct task_struct *t)
 }
 #endif
 extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
-extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *ct);
 
 /*
  * Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 51e485c..5a7023f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1067,7 +1067,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		struct signal_struct *psig;
 		struct signal_struct *sig;
 		unsigned long maxrss;
-		cputime_t tgutime, tgstime;
+		struct task_cputime tg_cputime;
 
 		/*
 		 * The resource counters for the group leader are in its
@@ -1088,12 +1088,12 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		 * group, which consolidates times for all threads in the
 		 * group including the group leader.
 		 */
-		thread_group_cputime_adjusted(p, &tgutime, &tgstime);
+		thread_group_cputime_adjusted(p, &tg_cputime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
-		psig->cutime += tgutime + sig->cutime;
-		psig->cstime += tgstime + sig->cstime;
+		psig->cutime += tg_cputime.utime + sig->cutime;
+		psig->cstime += tg_cputime.stime + sig->cstime;
 		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
 		psig->cmin_flt +=
 			p->min_flt + sig->min_flt + sig->cmin_flt;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 4006b42..a600f7f 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -456,14 +456,9 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	*st = p->stime;
 }
 
-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
 {
-	struct task_cputime cputime;
-
-	thread_group_cputime(p, &cputime);
-
-	*ut = cputime.utime;
-	*st = cputime.stime;
+	thread_group_cputime(p, cputime);
 }
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 /*
@@ -618,12 +613,11 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 /*
  * Must be called with siglock held.
  */
-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
 {
-	struct task_cputime cputime;
-
-	thread_group_cputime(p, &cputime);
-	cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
+	thread_group_cputime(p, cputime);
+	cputime_adjust(cputime, &p->signal->prev_cputime,
+		       &cputime->utime, &cputime->stime);
 }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 39c9c4a..2f555c1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1045,15 +1045,16 @@ change_okay:
 
 void do_sys_times(struct tms *tms)
 {
-	cputime_t tgutime, tgstime, cutime, cstime;
+	cputime_t cutime, cstime;
+	struct task_cputime tg_cputime;
 
 	spin_lock_irq(&current->sighand->siglock);
-	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
+	thread_group_cputime_adjusted(current, &tg_cputime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
 	spin_unlock_irq(&current->sighand->siglock);
-	tms->tms_utime = cputime_to_clock_t(tgutime);
-	tms->tms_stime = cputime_to_clock_t(tgstime);
+	tms->tms_utime = cputime_to_clock_t(tg_cputime.utime);
+	tms->tms_stime = cputime_to_clock_t(tg_cputime.stime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);
 	tms->tms_cstime = cputime_to_clock_t(cstime);
 }
@@ -1699,8 +1700,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 {
 	struct task_struct *t;
 	unsigned long flags;
-	cputime_t tgutime, tgstime, utime, stime;
+	cputime_t utime, stime;
 	unsigned long maxrss = 0;
+	struct task_cputime tg_cputime;
 
 	memset((char *) r, 0, sizeof *r);
 	utime = stime = 0;
@@ -1732,9 +1734,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 				break;
 
 		case RUSAGE_SELF:
-			thread_group_cputime_adjusted(p, &tgutime, &tgstime);
-			utime += tgutime;
-			stime += tgstime;
+			thread_group_cputime_adjusted(p, &tg_cputime);
+			utime += tg_cputime.utime;
+			stime += tg_cputime.stime;
 			r->ru_nvcsw += p->signal->nvcsw;
 			r->ru_nivcsw += p->signal->nivcsw;
 			r->ru_minflt += p->signal->min_flt;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC 2/4] procfs: add sum_exec_runtime to /proc/PID/stat
  2013-03-28 16:53 [RFC 0/4] do not make cputime scaling in kernel Stanislaw Gruszka
  2013-03-28 16:53 ` [RFC 1/4] cputime: change parameter of thread_group_cputime_adjusted Stanislaw Gruszka
@ 2013-03-28 16:53 ` Stanislaw Gruszka
  2013-03-28 16:53 ` [RFC 3/4] sched,proc: add csum_sched_runtime Stanislaw Gruszka
  2013-03-28 16:53 ` [RFC 4/4] cputime: remove scaling Stanislaw Gruszka
  3 siblings, 0 replies; 11+ messages in thread
From: Stanislaw Gruszka @ 2013-03-28 16:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Peter Zijlstra, hpa, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel, Stanislaw Gruszka

Allow procps programs (e.g. top) to read the precise CPU time usage of
processes.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 fs/proc/array.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index fef7850..ee47b29 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -403,6 +403,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
+	u64 sum_exec_runtime = 0;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -459,6 +460,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			thread_group_cputime_adjusted(task, &cputime);
 			utime = cputime.utime;
 			stime = cputime.stime;
+			sum_exec_runtime = cputime.sum_exec_runtime;
 			gtime += sig->gtime;
 		}
 
@@ -475,6 +477,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
 		task_cputime_adjusted(task, &utime, &stime);
+		sum_exec_runtime = task->se.sum_exec_runtime;
 		gtime = task_gtime(task);
 	}
 
@@ -554,6 +557,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	else
 		seq_put_decimal_ll(m, ' ', 0);
 
+	seq_put_decimal_ull(m, ' ', sum_exec_runtime);
+
 	seq_putc(m, '\n');
 	if (mm)
 		mmput(mm);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC 3/4] sched,proc: add csum_sched_runtime
  2013-03-28 16:53 [RFC 0/4] do not make cputime scaling in kernel Stanislaw Gruszka
  2013-03-28 16:53 ` [RFC 1/4] cputime: change parameter of thread_group_cputime_adjusted Stanislaw Gruszka
  2013-03-28 16:53 ` [RFC 2/4] procfs: add sum_exec_runtime to /proc/PID/stat Stanislaw Gruszka
@ 2013-03-28 16:53 ` Stanislaw Gruszka
  2013-03-28 16:53 ` [RFC 4/4] cputime: remove scaling Stanislaw Gruszka
  3 siblings, 0 replies; 11+ messages in thread
From: Stanislaw Gruszka @ 2013-03-28 16:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Peter Zijlstra, hpa, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel, Stanislaw Gruszka

Account the scheduler-calculated execution time of children and export it
via proc.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 fs/proc/array.c       |    3 +++
 include/linux/sched.h |    1 +
 kernel/exit.c         |    2 ++
 3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ee47b29..1444dc5 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -404,6 +404,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
 	u64 sum_exec_runtime = 0;
+	u64 csum_exec_runtime = 0;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -442,6 +443,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		cutime = sig->cutime;
 		cstime = sig->cstime;
 		cgtime = sig->cgtime;
+		csum_exec_runtime = sig->csum_sched_runtime;
 		rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
 
 		/* add up live thread stats at the group level */
@@ -558,6 +560,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		seq_put_decimal_ll(m, ' ', 0);
 
 	seq_put_decimal_ull(m, ' ', sum_exec_runtime);
+	seq_put_decimal_ull(m, ' ', csum_exec_runtime);
 
 	seq_putc(m, '\n');
 	if (mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index eb6005c..6673a59 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -574,6 +574,7 @@ struct signal_struct {
 	 * other than jiffies.)
 	 */
 	unsigned long long sum_sched_runtime;
+	unsigned long long csum_sched_runtime;
 
 	/*
 	 * We don't bother to synchronize most readers of this at all,
diff --git a/kernel/exit.c b/kernel/exit.c
index 5a7023f..eda4fa7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1094,6 +1094,8 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		sig = p->signal;
 		psig->cutime += tg_cputime.utime + sig->cutime;
 		psig->cstime += tg_cputime.stime + sig->cstime;
+		psig->csum_sched_runtime +=
+			tg_cputime.sum_exec_runtime + sig->csum_sched_runtime;
 		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
 		psig->cmin_flt +=
 			p->min_flt + sig->min_flt + sig->cmin_flt;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC 4/4] cputime: remove scaling
  2013-03-28 16:53 [RFC 0/4] do not make cputime scaling in kernel Stanislaw Gruszka
                   ` (2 preceding siblings ...)
  2013-03-28 16:53 ` [RFC 3/4] sched,proc: add csum_sched_runtime Stanislaw Gruszka
@ 2013-03-28 16:53 ` Stanislaw Gruszka
  2013-04-10 12:02   ` Ingo Molnar
  3 siblings, 1 reply; 11+ messages in thread
From: Stanislaw Gruszka @ 2013-03-28 16:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Peter Zijlstra, hpa, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel, Stanislaw Gruszka

Scaling cputime causes problems; a bunch of them have been fixed, but it is
still possible to hit a multiplication overflow issue, which makes the
{u,s}time values incorrect. This problem has no good solution in the kernel.

This patch removes the scaling code and exports the raw {u,s}time values.
Procps programs can use the newly introduced sum_exec_runtime to find out the
precisely calculated process CPU time and scale the utime, stime values
accordingly.

Unfortunately, the times(2) syscall has no such option.

This change affects kernels compiled without CONFIG_VIRT_CPU_ACCOUNTING_*.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 fs/proc/array.c        |    4 +-
 include/linux/sched.h  |   20 --------
 kernel/exit.c          |    2 +-
 kernel/fork.c          |    3 -
 kernel/sched/cputime.c |  117 +-----------------------------------------------
 kernel/sys.c           |    6 +-
 6 files changed, 7 insertions(+), 145 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 1444dc5..5feadc4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -459,7 +459,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
-			thread_group_cputime_adjusted(task, &cputime);
+			thread_group_cputime(task, &cputime);
 			utime = cputime.utime;
 			stime = cputime.stime;
 			sum_exec_runtime = cputime.sum_exec_runtime;
@@ -478,7 +478,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	if (!whole) {
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
-		task_cputime_adjusted(task, &utime, &stime);
+		task_cputime(task, &utime, &stime);
 		sum_exec_runtime = task->se.sum_exec_runtime;
 		gtime = task_gtime(task);
 	}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6673a59..26257a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -397,18 +397,6 @@ struct cpu_itimer {
 };
 
 /**
- * struct cputime - snaphsot of system and user cputime
- * @utime: time spent in user mode
- * @stime: time spent in system mode
- *
- * Gathers a generic snapshot of user and system time.
- */
-struct cputime {
-	cputime_t utime;
-	cputime_t stime;
-};
-
-/**
  * struct task_cputime - collected CPU time counts
  * @utime:		time spent in user mode, in &cputime_t units
  * @stime:		time spent in kernel mode, in &cputime_t units
@@ -558,9 +546,6 @@ struct signal_struct {
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-	struct cputime prev_cputime;
-#endif
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
 	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
 	unsigned long inblock, oublock, cinblock, coublock;
@@ -1159,9 +1144,6 @@ struct task_struct {
 
 	cputime_t utime, stime, utimescaled, stimescaled;
 	cputime_t gtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-	struct cputime prev_cputime;
-#endif
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqlock_t vtime_seqlock;
 	unsigned long long vtime_snap;
@@ -1595,8 +1577,6 @@ static inline cputime_t task_gtime(struct task_struct *t)
 	return t->gtime;
 }
 #endif
-extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
-extern void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *ct);
 
 /*
  * Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index eda4fa7..9a7ecbb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1088,7 +1088,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		 * group, which consolidates times for all threads in the
 		 * group including the group leader.
 		 */
-		thread_group_cputime_adjusted(p, &tg_cputime);
+		thread_group_cputime(p, &tg_cputime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
diff --git a/kernel/fork.c b/kernel/fork.c
index 339f60d..2ae1706 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1233,9 +1233,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	p->utime = p->stime = p->gtime = 0;
 	p->utimescaled = p->stimescaled = 0;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-	p->prev_cputime.utime = p->prev_cputime.stime = 0;
-#endif
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqlock_init(&p->vtime_seqlock);
 	p->vtime_snap = 0;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a600f7f..23df74b 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -448,19 +448,7 @@ EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
-	*ut = p->utime;
-	*st = p->stime;
-}
-
-void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
-{
-	thread_group_cputime(p, cputime);
-}
-#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /*
  * Account a single tick of cpu time.
  * @p: the process that the cpu time gets accounted to
@@ -516,109 +504,6 @@ void account_idle_ticks(unsigned long ticks)
 	account_idle_time(jiffies_to_cputime(ticks));
 }
 
-/*
- * Perform (stime * rtime) / total with reduced chances
- * of multiplication overflows by using smaller factors
- * like quotient and remainders of divisions between
- * rtime and total.
- */
-static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
-{
-	u64 rem, res, scaled;
-
-	if (rtime >= total) {
-		/*
-		 * Scale up to rtime / total then add
-		 * the remainder scaled to stime / total.
-		 */
-		res = div64_u64_rem(rtime, total, &rem);
-		scaled = stime * res;
-		scaled += div64_u64(stime * rem, total);
-	} else {
-		/*
-		 * Same in reverse: scale down to total / rtime
-		 * then substract that result scaled to
-		 * to the remaining part.
-		 */
-		res = div64_u64_rem(total, rtime, &rem);
-		scaled = div64_u64(stime, res);
-		scaled -= div64_u64(scaled * rem, total);
-	}
-
-	return (__force cputime_t) scaled;
-}
-
-/*
- * Adjust tick based cputime random precision against scheduler
- * runtime accounting.
- */
-static void cputime_adjust(struct task_cputime *curr,
-			   struct cputime *prev,
-			   cputime_t *ut, cputime_t *st)
-{
-	cputime_t rtime, stime, total;
-
-	if (vtime_accounting_enabled()) {
-		*ut = curr->utime;
-		*st = curr->stime;
-		return;
-	}
-
-	stime = curr->stime;
-	total = stime + curr->utime;
-
-	/*
-	 * Tick based cputime accounting depend on random scheduling
-	 * timeslices of a task to be interrupted or not by the timer.
-	 * Depending on these circumstances, the number of these interrupts
-	 * may be over or under-optimistic, matching the real user and system
-	 * cputime with a variable precision.
-	 *
-	 * Fix this by scaling these tick based values against the total
-	 * runtime accounted by the CFS scheduler.
-	 */
-	rtime = nsecs_to_cputime(curr->sum_exec_runtime);
-
-	if (!rtime) {
-		stime = 0;
-	} else if (!total) {
-		stime = rtime;
-	} else {
-		stime = scale_stime((__force u64)stime,
-				    (__force u64)rtime, (__force u64)total);
-	}
-
-	/*
-	 * If the tick based count grows faster than the scheduler one,
-	 * the result of the scaling may go backward.
-	 * Let's enforce monotonicity.
-	 */
-	prev->stime = max(prev->stime, stime);
-	prev->utime = max(prev->utime, rtime - prev->stime);
-
-	*ut = prev->utime;
-	*st = prev->stime;
-}
-
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
-	struct task_cputime cputime = {
-		.sum_exec_runtime = p->se.sum_exec_runtime,
-	};
-
-	task_cputime(p, &cputime.utime, &cputime.stime);
-	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
-}
-
-/*
- * Must be called with siglock held.
- */
-void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
-{
-	thread_group_cputime(p, cputime);
-	cputime_adjust(cputime, &p->signal->prev_cputime,
-		       &cputime->utime, &cputime->stime);
-}
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/sys.c b/kernel/sys.c
index 2f555c1..00f143e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1049,7 +1049,7 @@ void do_sys_times(struct tms *tms)
 	struct task_cputime tg_cputime;
 
 	spin_lock_irq(&current->sighand->siglock);
-	thread_group_cputime_adjusted(current, &tg_cputime);
+	thread_group_cputime(current, &tg_cputime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
 	spin_unlock_irq(&current->sighand->siglock);
@@ -1708,7 +1708,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 	utime = stime = 0;
 
 	if (who == RUSAGE_THREAD) {
-		task_cputime_adjusted(current, &utime, &stime);
+		task_cputime(current, &utime, &stime);
 		accumulate_thread_rusage(p, r);
 		maxrss = p->signal->maxrss;
 		goto out;
@@ -1734,7 +1734,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 				break;
 
 		case RUSAGE_SELF:
-			thread_group_cputime_adjusted(p, &tg_cputime);
+			thread_group_cputime(p, &tg_cputime);
 			utime += tg_cputime.utime;
 			stime += tg_cputime.stime;
 			r->ru_nvcsw += p->signal->nvcsw;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 4/4] cputime: remove scaling
  2013-03-28 16:53 ` [RFC 4/4] cputime: remove scaling Stanislaw Gruszka
@ 2013-04-10 12:02   ` Ingo Molnar
  2013-04-10 14:29     ` H. Peter Anvin
  2013-04-11  8:36     ` Stanislaw Gruszka
  0 siblings, 2 replies; 11+ messages in thread
From: Ingo Molnar @ 2013-04-10 12:02 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Frederic Weisbecker, Peter Zijlstra, hpa, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel


* Stanislaw Gruszka <sgruszka@redhat.com> wrote:

> Scaling cputime causes problems; a bunch of them have been fixed, but it is
> still possible to hit a multiplication overflow issue, which makes the
> {u,s}time values incorrect. This problem has no good solution in the kernel.

Wasn't 128-bit math a solution to the overflow problems? 128-bit math isn't nice, 
but at least for multiplication it's defensible.

> This patch removes the scaling code and exports the raw {u,s}time values.
> Procps programs can use the newly introduced sum_exec_runtime to find out
> the precisely calculated process CPU time and scale the utime, stime values
> accordingly.
> 
> Unfortunately, the times(2) syscall has no such option.
> 
> This change affects kernels compiled without CONFIG_VIRT_CPU_ACCOUNTING_*.

So, the concern here is that 'top hiding' code can now hide again. It's also that 
we are not really solving the problem, we are pushing it to user-space - which in 
the best case gets updated to solve the problem in some similar fashion - and in 
the worst case does not get updated or does it in a buggy way.

So while user-space has it a bit easier because it can do floating point math, is 
there really no workable solution to the current kernel side integer overflow bug? 
I really prefer robust kernel side accounting/instrumentation.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 4/4] cputime: remove scaling
  2013-04-10 12:02   ` Ingo Molnar
@ 2013-04-10 14:29     ` H. Peter Anvin
  2013-04-11  8:37       ` Stanislaw Gruszka
  2013-04-11  8:36     ` Stanislaw Gruszka
  1 sibling, 1 reply; 11+ messages in thread
From: H. Peter Anvin @ 2013-04-10 14:29 UTC (permalink / raw)
  To: Ingo Molnar, Stanislaw Gruszka
  Cc: Frederic Weisbecker, Peter Zijlstra, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel

I have a patch that does scaling by multiply for 64-bit architectures.  I probably should clean it up and send it in.  I need to see if it fixes this problem.

Ingo Molnar <mingo@kernel.org> wrote:

>
>* Stanislaw Gruszka <sgruszka@redhat.com> wrote:
>
>> Scaling cputime causes problems; a bunch of them have been fixed, but it
>> is still possible to hit a multiplication overflow issue, which makes the
>> {u,s}time values incorrect. This problem has no good solution in the
>> kernel.
>
>Wasn't 128-bit math a solution to the overflow problems? 128-bit math
>isn't nice, but at least for multiplication it's defensible.
>
>> This patch removes the scaling code and exports the raw {u,s}time
>> values. Procps programs can use the newly introduced sum_exec_runtime
>> to find out the precisely calculated process CPU time and scale the
>> utime, stime values accordingly.
>> 
>> Unfortunately, the times(2) syscall has no such option.
>> 
>> This change affects kernels compiled without
>> CONFIG_VIRT_CPU_ACCOUNTING_*.
>
>So, the concern here is that 'top hiding' code can now hide again. It's
>also that we are not really solving the problem, we are pushing it to
>user-space - which in the best case gets updated to solve the problem in
>some similar fashion - and in the worst case does not get updated or does
>it in a buggy way.
>
>So while user-space has it a bit easier because it can do floating point
>math, is there really no workable solution to the current kernel side
>integer overflow bug? I really prefer robust kernel side
>accounting/instrumentation.
>
>Thanks,
>
>	Ingo

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 4/4] cputime: remove scaling
  2013-04-10 12:02   ` Ingo Molnar
  2013-04-10 14:29     ` H. Peter Anvin
@ 2013-04-11  8:36     ` Stanislaw Gruszka
  2013-04-11 15:06       ` Frederic Weisbecker
  1 sibling, 1 reply; 11+ messages in thread
From: Stanislaw Gruszka @ 2013-04-11  8:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, Peter Zijlstra, hpa, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel

On Wed, Apr 10, 2013 at 02:02:28PM +0200, Ingo Molnar wrote:
> 
> * Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> 
> > Scaling cputime causes problems; a bunch of them have been fixed, but it is
> > still possible to hit a multiplication overflow issue, which makes the
> > {u,s}time values incorrect. This problem has no good solution in the kernel.
> 
> Wasn't 128-bit math a solution to the overflow problems? 128-bit math isn't nice, 
> but at least for multiplication it's defensible.

128-bit division is needed, unfortunately. Though in 99.9% of cases it will
go through a 64-bit fast path.
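
A rough sketch of what that could look like, assuming a 64-bit architecture
where the compiler provides unsigned __int128 (this is only an illustration,
not an existing kernel helper, and the caller is expected to check
total != 0, as cputime_adjust() already does):

static u64 scale_stime_128(u64 stime, u64 rtime, u64 total)
{
	unsigned __int128 prod = (unsigned __int128) stime * rtime;

	/* 64 bit fast path: the product still fits in 64 bits */
	if (prod <= (u64) -1)
		return div64_u64((u64) prod, total);

	/*
	 * Slow path: a full 128-by-64 bit division. Written with the
	 * compiler's __int128 division only for clarity; in the kernel
	 * this would need an explicit helper.
	 */
	return (u64) (prod / total);
}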

> > This patch removes the scaling code and exports the raw {u,s}time values.
> > Procps programs can use the newly introduced sum_exec_runtime to find out
> > the precisely calculated process CPU time and scale the utime, stime
> > values accordingly.
> > 
> > Unfortunately, the times(2) syscall has no such option.
> > 
> > This change affects kernels compiled without CONFIG_VIRT_CPU_ACCOUNTING_*.
> 
> So, the concern here is that 'top hiding' code can now hide again. It's also that 
> we are not really solving the problem, we are pushing it to user-space - which in 
> the best case gets updated to solve the problem in some similar fashion - and in 
> the worst case does not get updated or does it in a buggy way.
>
> So while user-space has it a bit easier because it can do floating point math, is 
> there really no workable solution to the current kernel side integer overflow bug? 

I do not see any. Basically, everything we have either makes the problem less
reproducible or just defers it. The best solution I found, short of full
128-bit math, is something like this (dropping precision if the values are
big and an overflow would happen):

u64 _scale_time(u64 rtime, u64 total, u64 time)
{
        const int zero_bits = clzll(time) + clzll(rtime);
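        /*
         * zero_bits < 64 means time * rtime may not fit in 64 bits
         * (clzll() is assumed here to be a count-leading-zeros helper
         * such as __builtin_clzll(), which is undefined for 0).
         */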
        u64 scaled;

        if (zero_bits < 64) {
                /* Drop precision */
                const int drop_bits = 64 - zero_bits;

                time >>= drop_bits;
                rtime >>= drop_bits;
                total >>= 2*drop_bits;

                if (total == 0)
                        return time;
        }

        scaled = (time * rtime) / total;

        return scaled;
}

It defers the problem for quite a long period. My testing script detects a failure at:

FAIL!
rtime: 1954463459156 <- 22621 days (one thread, CONFIG_HZ=1000) 
total: 1771603722423
stime: 354320744484
kernel: 391351504748 <- kernel value
python: 390892691830 <- correct value

For one thread this is fine, but for 512 threads the inaccuracy will show up
after only 40 days (due to dropping too many bits of the "total" variable).

> I really prefer robust kernel side accounting/instrumentation.

We have CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN.
Perhaps we can change to use one of those options by default. I wonder
if the additional performance cost related to them is really something
that we should care about. Are there any measurements that show they
will make performance worse?

Stanislaw

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 4/4] cputime: remove scaling
  2013-04-10 14:29     ` H. Peter Anvin
@ 2013-04-11  8:37       ` Stanislaw Gruszka
  2013-04-11 15:19         ` H. Peter Anvin
  0 siblings, 1 reply; 11+ messages in thread
From: Stanislaw Gruszka @ 2013-04-11  8:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Frederic Weisbecker, Peter Zijlstra, rostedt, akpm,
	tglx, Linus Torvalds, linux-kernel

On Wed, Apr 10, 2013 at 07:29:21AM -0700, H. Peter Anvin wrote:
> I have a patch that does scaling by multiply for 64-bit architectures.  I probably should clean it up and send it in.  I need to see if it fixes this problem.
Interesting. Could you attach the draft patch so I could look at it?

Stanislaw

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 4/4] cputime: remove scaling
  2013-04-11  8:36     ` Stanislaw Gruszka
@ 2013-04-11 15:06       ` Frederic Weisbecker
  0 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2013-04-11 15:06 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Ingo Molnar, Peter Zijlstra, hpa, rostedt, akpm, tglx,
	Linus Torvalds, linux-kernel

On Thu, Apr 11, 2013 at 10:36:35AM +0200, Stanislaw Gruszka wrote:
> > I really prefer robust kernel side accounting/instrumentation.
> 
> We have CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN.
> Perhaps we can change to use one of those options by default. I wonder
> if the additional performance cost related to them is really something
> that we should care about. Are there any measurements that show they
> will make performance worse?

CONFIG_IRQ_TIME_ACCOUNTING also makes use of scaling. And CONFIG_VIRT_CPU_ACCOUNTING_GEN
involves too much overhead on IO-bound workloads. It's mostly good for
undisturbed, userspace-bound workloads (few IRQs, few exceptions, few syscalls).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 4/4] cputime: remove scaling
  2013-04-11  8:37       ` Stanislaw Gruszka
@ 2013-04-11 15:19         ` H. Peter Anvin
  0 siblings, 0 replies; 11+ messages in thread
From: H. Peter Anvin @ 2013-04-11 15:19 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Ingo Molnar, Frederic Weisbecker, Peter Zijlstra, rostedt, akpm,
	tglx, Linus Torvalds, linux-kernel

On 04/11/2013 01:37 AM, Stanislaw Gruszka wrote:
> On Wed, Apr 10, 2013 at 07:29:21AM -0700, H. Peter Anvin wrote:
>> I have a patch that does scaling by multiply for 64-bit architectures.  I probably should clean it up and send it in.  I need to see if it fixes this problem.
> Interesting. Could you attach the draft patch so I could look at it?
> 

It's just the trivial extension of the code already in kernel/time.c to
the 64-bit case.

However, it looks like in your case the scaling factor stime/total is
dynamic.  This means that doing scaling by multiplication isn't practical.

I guess I don't understand what function the scaling is supposed to
serve there.
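
(For reference, the "scaling by multiplication" pattern discussed above is
sketched below; this is only an illustration with made-up helper names, not
the actual kernel/time.c code:)

/*
 * Scaling by a precomputed multiplier works when the ratio numer/denom is
 * fixed (e.g. HZ vs. USER_HZ), because mult and shift can be computed once.
 * A per-task stime/total ratio changes on every read, so there is nothing
 * stable to precompute.
 */
struct fixed_scale {
	u64 mult;
	u32 shift;
};

static void fixed_scale_init(struct fixed_scale *fs, u32 numer, u32 denom)
{
	fs->shift = 32;
	fs->mult = div64_u64((u64) numer << fs->shift, denom);
}

static u64 fixed_scale_apply(const struct fixed_scale *fs, u64 val)
{
	/* for large "val" the product itself needs a 128 bit intermediate */
	return (val * fs->mult) >> fs->shift;
}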

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-04-11 15:20 UTC | newest]

Thread overview: 11+ messages
2013-03-28 16:53 [RFC 0/4] do not make cputime scaling in kernel Stanislaw Gruszka
2013-03-28 16:53 ` [RFC 1/4] cputime: change parameter of thread_group_cputime_adjusted Stanislaw Gruszka
2013-03-28 16:53 ` [RFC 2/4] procfs: add sum_exec_runtime to /proc/PID/stat Stanislaw Gruszka
2013-03-28 16:53 ` [RFC 3/4] sched,proc: add csum_sched_runtime Stanislaw Gruszka
2013-03-28 16:53 ` [RFC 4/4] cputime: remove scaling Stanislaw Gruszka
2013-04-10 12:02   ` Ingo Molnar
2013-04-10 14:29     ` H. Peter Anvin
2013-04-11  8:37       ` Stanislaw Gruszka
2013-04-11 15:19         ` H. Peter Anvin
2013-04-11  8:36     ` Stanislaw Gruszka
2013-04-11 15:06       ` Frederic Weisbecker
