From: Stanislaw Gruszka <sgruszka@redhat.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, hpa@zytor.com, rostedt@goodmis.org,
	akpm@linux-foundation.org, tglx@linutronix.de,
	linux-tip-commits@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow
Date: Thu, 11 Apr 2013 10:04:33 +0200	[thread overview]
Message-ID: <20130411080432.GA1380@redhat.com> (raw)
In-Reply-To: <20130410173219.GG21951@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]

On Wed, Apr 10, 2013 at 07:32:19PM +0200, Ingo Molnar wrote:
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > 2013/4/10 Ingo Molnar <mingo@kernel.org>:
> > >
> > > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > >
> > >> Of course 128 bits ops are very expensive, so to help you evaluating the
> > >> situation, this is going to happen on every call to task_cputime_adjusted() and
> > >> thread_group_adjusted(), namely:
> > >
> > > It's really only expensive for divisions. Addition and multiplication should be
> > > straightforward and relatively low overhead, especially on 64-bit platforms.
> > 
> > Ok, well we still have one division in the scaling path. I'm mostly
> > worried about the thread group exit that makes use of it through
> > threadgroup_cputime_adjusted(). Not sure if we can avoid that.
> 
> I see, scale_stime()'s use of div64_u64_rem(), right?
> 
> I swapped out the details already, is there a link or commit ID that explains 
> where we hit 64-bit multiplication overflow? It's due to accounting in nanosecs,

No, the values are converted to jiffies and then multiplied. CONFIG_HZ=1000
makes the issue happen ten times earlier than CONFIG_HZ=100, since jiffies
accumulate ten times faster.

> spread out across thousands of tasks potentially, right?

Thousands of tasks (in one process) running on a machine with thousands
of CPUs will make the problem reproducible in hours or maybe minutes.
 
> But even with nsecs, a 64-bit variable ought to be able to hold hundreds of years 
> worth of runtime. How do we overflow?

We do rtime * stime. If a process utilizes 100% CPU we have stime == rtime,
and the multiplication overflows once rtime >= sqrt(0xffffffffffffffff)
jiffies, i.e. rtime >= 49 days assuming CONFIG_HZ=1000. With 50 threads
running on a machine with more than 50 CPUs, that comes down to about
1 day. In a real application stime is never equal to rtime, but it can be
near half of it, so the problem is still easily achievable.
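
For illustration, here is a minimal standalone computation of that
threshold (my sketch, not kernel code; assumes CONFIG_HZ=1000):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* With stime == rtime, rtime * stime overflows u64 once
	 * rtime reaches sqrt(2^64) = 2^32 jiffies. */
	uint64_t threshold = 1ULL << 32;
	uint64_t secs = threshold / 1000;	/* jiffies -> seconds at HZ=1000 */

	printf("%llu jiffies = %llu days\n",
	       (unsigned long long)threshold,
	       (unsigned long long)(secs / (24 * 60 * 60)));
	return 0;
}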

Using the quotient and remainder (for rtime >= total: computing
stime * (rtime / total) + stime * (rtime % total) / total) makes the
problem less reproducible, but it still happens depending on the
rtime / total ratio.

I wrote a user space program that does the same calculations as the
kernel, and a python script that generates stime/rtime/total values
(both attached). They show the problem is still reproducible. For
example:

FAIL!
rtime: 25386774344 <- 293 days (one thread, HZ=1000)
total: 27958813690
stime: 8387644107
kernel: 8275815093 <- kernel value
python: 7616032303 <- correct value

FAIL!
rtime: 16877346691 <- 195 days (one thread, HZ=1000)
total: 12215365298
stime: 4886146119
kernel: 5240812402 <- kernel value
python: 6750938676 <- correct value
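
The first case can be traced by hand: rtime < total, so scale_stime()
takes the else branch; total / rtime gives quotient 1 and remainder
2572039346, and scaled * rem is about 2.16e19, which exceeds 2^64 - 1
(about 1.84e19) and wraps. A minimal standalone reproduction (my sketch,
using the same inputs as above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Inputs from the first FAIL case above. */
	uint64_t rtime = 25386774344ULL;
	uint64_t total = 27958813690ULL;
	uint64_t stime = 8387644107ULL;

	/* rtime < total, so scale_stime() takes the else branch. */
	uint64_t res = total / rtime;		/* 1 */
	uint64_t rem = total % rtime;		/* 2572039346 */
	uint64_t scaled = stime / res;		/* 8387644107 */

	/* scaled * rem ~= 2.16e19 > 2^64 - 1, so the product wraps... */
	printf("scaled * rem (wrapped): %llu\n",
	       (unsigned long long)(scaled * rem));

	/* ...and the final value comes out too high: 8275815093 */
	printf("result: %llu\n",
	       (unsigned long long)(scaled - (scaled * rem) / total));
	return 0;
}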

Stanislaw


[-- Attachment #2: scale_stime.c --]
[-- Type: text/plain, Size: 1029 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long u64;

u64 div64_u64_rem(u64 a, u64 b, u64 *rem)
{
	u64 res = a / b;
	*rem = a % b;

	return res;
}

u64 div64_u64(u64 a, u64 b)
{
	return a / b;
}

/* Mirrors the kernel's scale_stime(): compute stime * rtime / total
 * without a 128-bit multiplication. */
u64 scale_stime(u64 stime, u64 rtime, u64 total)
{
	u64 rem, res, scaled;

	if (rtime >= total) {
		/* Scale up: stime * (rtime / total) plus the remainder part. */
		res = div64_u64_rem(rtime, total, &rem);
		scaled = stime * res;
		scaled += div64_u64(stime * rem, total);
	} else {
		/* Same in reverse: scale down. */
		res = div64_u64_rem(total, rtime, &rem);
		scaled = div64_u64(stime, res);
		scaled -= div64_u64(scaled * rem, total);
	}

	return scaled;
}

int main(int argc, char *argv[])
{
	u64 rtime, total, stime, scaled;

	if (argc != 4)
		return -1;

	/* Arguments: rtime total stime (unsigned 64-bit values). */
	rtime = strtoull(argv[1], NULL, 10);
	total = strtoull(argv[2], NULL, 10);
	stime = strtoull(argv[3], NULL, 10);

	assert(total >= stime);

	scaled = scale_stime(stime, rtime, total);
	printf("%llu\n", scaled);

	return 0;
}
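
For reference, the program builds standalone; feeding it the first FAIL
case above reproduces the kernel-side value:

  $ gcc -o scale_stime scale_stime.c
  $ ./scale_stime 25386774344 27958813690 8387644107
  8275815093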

[-- Attachment #3: scale_stime_test.py --]
[-- Type: text/plain, Size: 889 bytes --]

#!/usr/bin/python
# Requires the compiled scale_stime binary (attachment #2) in the
# current directory.

import subprocess
import random

def kernel_scale (rtime, total, stime):
	# Run the C program, which mirrors the kernel's 64-bit arithmetic.
	p = subprocess.Popen("./scale_stime " + str(rtime) + " " + str(total) + " " + str(stime), shell=True, stdout=subprocess.PIPE)
	return int(p.stdout.read())

def python_scale (rtime, total, stime):
	# Reference result: Python integers are arbitrary precision,
	# so this cannot overflow.
	return (stime * rtime) / total


max_rtime = 364*24*60*60*1000  # ~1 year of jiffies at HZ=1000 (one thread)

fail=False

for i in range(0, 100):
	rtime = random.randrange(max_rtime)
	total = int(random.uniform(0.7, 1.3) * rtime)

	for n in range(1, 11):
		stime = (n * total / 10)
		r1 = kernel_scale(rtime, total, stime)
		r2 = python_scale(rtime, total, stime)
		if abs(r1 - r2) > 1:
			print "FAIL!"
			print "rtime: " + str(rtime)
			print "total: " + str(total)
			print "stime: " + str(stime)
			print "kernel: " + str(r1)
			print "python: " + str(r2)

			fail=True
			break
	if fail:
		break
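
(The script draws total from 0.7x-1.3x of rtime, presumably so that both
branches of scale_stime() -- rtime >= total and rtime < total -- get
exercised, and counts any result more than +/-1 off the exact Python
value as a failure.)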
