linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/cputime: make scale_stime() more precise
@ 2019-07-18 13:18 Oleg Nesterov
  2019-07-18 13:21 ` Oleg Nesterov
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Oleg Nesterov @ 2019-07-18 13:18 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner
  Cc: Andrew Fox, Stephen Johnston, linux-kernel

People report that utime and stime from /proc/<pid>/stat become very
inaccurate when the numbers are big enough. In particular, the monitored
application can run entirely in user-space, yet only stime grows.

This is because scale_stime() is very inaccurate: it tries to minimize the
relative error, but the absolute error can be huge.

Andrew wrote the test-case:

	/*
	 * Builds against cputime code copied out of the kernel:
	 * task_cputime, prev_cputime, cputime_adjust() and
	 * cputime_to_clock_t() come from there; SEC, STEP and NSTEPS
	 * are defined at build time.
	 */
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
	    struct task_cputime c;
	    struct prev_cputime p;
	    u64 st, pst, cst;
	    u64 ut, put, cut;
	    u64 x;
	    int i = -1; // one step not printed

	    if (argc != 2)
	    {
		printf("usage: %s <start_in_seconds>\n", argv[0]);
		return 1;
	    }
	    x = strtoull(argv[1], NULL, 0) * SEC;
	    printf("start=%lld\n", x);

	    p.stime = 0;
	    p.utime = 0;

	    while (i++ < NSTEPS)
	    {
		x += STEP;
		c.stime = x;
		c.utime = x;
		c.sum_exec_runtime = x + x;
		pst = cputime_to_clock_t(p.stime);
		put = cputime_to_clock_t(p.utime);
		cputime_adjust(&c, &p, &ut, &st);
		cst = cputime_to_clock_t(st);
		cut = cputime_to_clock_t(ut);
		if (i)
		    printf("ut(diff)/st(diff): %20lld (%4lld)  %20lld (%4lld)\n",
			cut, cut - put, cst, cst - pst);
	    }
	}

For example,

	$ ./stime 300000
	start=300000000000000
	ut(diff)/st(diff):            299994875 (   0)             300009124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300011124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300013124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300015124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300017124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300019124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300021124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300023124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300025124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300027124 (2000)
	ut(diff)/st(diff):            299994875 (   0)             300029124 (2000)
	ut(diff)/st(diff):            299996875 (2000)             300029124 (   0)
	ut(diff)/st(diff):            299998875 (2000)             300029124 (   0)
	ut(diff)/st(diff):            300000875 (2000)             300029124 (   0)
	ut(diff)/st(diff):            300002875 (2000)             300029124 (   0)
	ut(diff)/st(diff):            300004875 (2000)             300029124 (   0)
	ut(diff)/st(diff):            300006875 (2000)             300029124 (   0)
	ut(diff)/st(diff):            300008875 (2000)             300029124 (   0)
	ut(diff)/st(diff):            300010875 (2000)             300029124 (   0)
	ut(diff)/st(diff):            300012055 (1180)             300029944 ( 820)
	ut(diff)/st(diff):            300012055 (   0)             300031944 (2000)
	ut(diff)/st(diff):            300012055 (   0)             300033944 (2000)
	ut(diff)/st(diff):            300012055 (   0)             300035944 (2000)
	ut(diff)/st(diff):            300012055 (   0)             300037944 (2000)

This shows the problem even when sum_exec_runtime is not that big: 300000 seconds.

The new implementation of scale_stime() does an additional div64_u64_rem()
in a loop, but see the comment: as long as it is only used by
cputime_adjust(), this can happen at most once.

Reported-by: Andrew Fox <afox@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/cputime.c | 66 ++++++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2305ce8..ad055a3 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -525,47 +525,55 @@ void account_idle_ticks(unsigned long ticks)
 }
 
 /*
- * Perform (stime * rtime) / total, but avoid multiplication overflow by
- * losing precision when the numbers are big.
+ * Perform (stime * rtime) / total, but avoid multiplication overflow
+ * by losing precision when the numbers are big.
+ *
+ * NOTE! currently the only user is cputime_adjust() and thus
+ *
+ *	stime < total && rtime > total
+ *
+ * this means that the end result is always precise and the additional
+ * div64_u64_rem() inside the main loop is called at most once.
  */
 static u64 scale_stime(u64 stime, u64 rtime, u64 total)
 {
-	u64 scaled;
+	u64 res = 0, div, rem;
 
-	for (;;) {
-		/* Make sure "rtime" is the bigger of stime/rtime */
+	/* can stime * rtime overflow ? */
+	while (ilog2(stime) + ilog2(rtime) > 62) {
 		if (stime > rtime)
 			swap(rtime, stime);
 
-		/* Make sure 'total' fits in 32 bits */
-		if (total >> 32)
-			goto drop_precision;
-
-		/* Does rtime (and thus stime) fit in 32 bits? */
-		if (!(rtime >> 32))
-			break;
-
-		/* Can we just balance rtime/stime rather than dropping bits? */
-		if (stime >> 31)
-			goto drop_precision;
-
-		/* We can grow stime and shrink rtime and try to make them both fit */
-		stime <<= 1;
-		rtime >>= 1;
-		continue;
+		if (rtime >= total) {
+			/*
+			 * (rtime * stime) / total is equal to
+			 *
+			 *	(rtime / total) * stime +
+			 *	(rtime % total) * stime / total
+			 *
+			 * if nothing overflows. Can the 1st multiplication
+			 * overflow? Yes, but we do not care: this can only
+			 * happen if the end result can't fit in u64 anyway.
+			 *
+			 * So the code below does
+			 *
+			 *	res += (rtime / total) * stime;
+			 *	rtime = rtime % total;
+			 */
+			div = div64_u64_rem(rtime, total, &rem);
+			res += div * stime;
+			rtime = rem;
+			continue;
+		}
 
-drop_precision:
-		/* We drop from rtime, it has more bits than stime */
+		/* drop precision */
 		rtime >>= 1;
 		total >>= 1;
+		if (!total)
+			return res;
 	}
 
-	/*
-	 * Make sure gcc understands that this is a 32x32->64 multiply,
-	 * followed by a 64/32->64 divide.
-	 */
-	scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total);
-	return scaled;
+	return res + div64_u64(stime * rtime, total);
 }
 
 /*
-- 
2.5.0



* [PATCH] sched/cputime: make scale_stime() more precise
@ 2019-07-18 13:15 Oleg Nesterov
  0 siblings, 0 replies; 30+ messages in thread
From: Oleg Nesterov @ 2019-07-18 13:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner
  Cc: Andrew Fox, Stephen Johnston, linux-kernel


end of thread, other threads:[~2020-06-16 12:22 UTC | newest]

Thread overview: 30+ messages
2019-07-18 13:18 [PATCH] sched/cputime: make scale_stime() more precise Oleg Nesterov
2019-07-18 13:21 ` Oleg Nesterov
2019-07-18 14:55 ` Oleg Nesterov
2019-07-19 11:03 ` Peter Zijlstra
2019-07-19 13:47   ` Peter Zijlstra
2019-07-19 14:37     ` Oleg Nesterov
2019-07-22 19:56       ` Peter Zijlstra
2019-07-23 14:00         ` Oleg Nesterov
2019-07-23 14:29           ` Oleg Nesterov
2019-07-19 14:03   ` Oleg Nesterov
2019-07-22 19:45     ` Peter Zijlstra
2019-07-22 10:52   ` Stanislaw Gruszka
2019-07-22 20:00     ` Peter Zijlstra
2019-07-23  9:37       ` Stanislaw Gruszka
2020-01-22 16:46 ` Oleg Nesterov
2020-01-23 13:05   ` Oleg Nesterov
2020-01-24 15:42   ` Oleg Nesterov
2020-01-27 12:28 ` [PATCH v2] " Oleg Nesterov
2020-05-15 17:24   ` Oleg Nesterov
2020-05-19 17:25   ` Peter Zijlstra
2020-05-19 18:33     ` Linus Torvalds
2020-05-19 18:42       ` Peter Zijlstra
2020-05-19 19:11       ` Peter Zijlstra
2020-05-19 19:51         ` Linus Torvalds
2020-05-20 15:24     ` Oleg Nesterov
2020-05-20 15:36       ` Peter Zijlstra
2020-05-20 20:10         ` Peter Zijlstra
2020-05-21 13:26           ` Oleg Nesterov
2020-06-16 12:21     ` [tip: sched/core] sched/cputime: Improve cputime_adjust() tip-bot2 for Oleg Nesterov
  -- strict thread matches above, loose matches on Subject: below --
2019-07-18 13:15 [PATCH] sched/cputime: make scale_stime() more precise Oleg Nesterov
