linux-kernel.vger.kernel.org archive mirror
* [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
@ 2014-05-13  3:34 Michael wang
  2014-05-13  9:47 ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-05-13  3:34 UTC (permalink / raw)
  To: LKML, Ingo Molnar, Peter Zijlstra, Mike Galbraith, Alex Shi,
	Paul Turner, Rik van Riel, Mel Gorman,
	Daniel Lezcano

During our testing we found that cpu.shares doesn't work as
expected. The test setup is:

X86 HOST:
	12 CPU
GUEST(KVM):
	6 VCPU

We create 3 GUESTs, each with 1024 shares; the workload inside them is:

GUEST_1:
	dbench 6
GUEST_2:
	stress -c 6
GUEST_3:
	stress -c 6

So in theory, each GUEST should get (1024 / (3 * 1024)) * 1200% == 400%
according to the group shares (the 3 groups are created by the virtual
manager at the same level, and they are the only groups running heavily
in the system).
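(Spelling that arithmetic out; the bc line is just illustrative:)

	# 12 host CPUs == 1200%; 3 groups with 1024 shares each
	echo "1024 * 1200 / (3 * 1024)" | bc	# -> 400 (% per guest)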

When only GUEST_1 is running, it gets 300% CPU, which is 1/4 of the
whole CPU resource.

So when all 3 GUESTs run concurrently, we expect:

		GUEST_1		GUEST_2		GUEST_3
CPU%		300%		450%		450%

That is, GUEST_1 gets the 300% it requires, and the unused 100% is
shared by the other groups.

But the result is:

		GUEST_1		GUEST_2		GUEST_3
CPU%		40%		580%		580%

GUEST_1 failed to gain the CPU it required, and the dbench inside it
lost a lot of performance.

So are these results expected (I really don't think so...)?

Or does this imply that cpu-cgroup has some issue to be fixed?

Any comments are welcomed :)

Regards,
Michael Wang



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-13  3:34 [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays? Michael wang
@ 2014-05-13  9:47 ` Peter Zijlstra
  2014-05-13 13:36   ` Rik van Riel
  2014-05-14  3:16   ` Michael wang
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-13  9:47 UTC (permalink / raw)
  To: Michael wang
  Cc: LKML, Ingo Molnar, Mike Galbraith, Alex Shi, Paul Turner,
	Rik van Riel, Mel Gorman, Daniel Lezcano

On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote:
> During our testing we found that cpu.shares doesn't work as
> expected. The test setup is:
> 

/me zaps all the kvm nonsense as that's non-reproducible and only serves
to annoy.

Pro-tip: never use kvm to report cpu-cgroup issues.

> So are these results expected (I really don't think so...)?
> 
> Or does this imply that cpu-cgroup has some issue to be fixed?

So what I did (WSM-EP 2x6x2):

mount none /cgroup -t cgroup -o cpu
mkdir -p /cgroup/a
mkdir -p /cgroup/b
mkdir -p /cgroup/c

echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done
echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done
echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done

echo 2048 > /cgroup/c/cpu.shares

Where [ABC].sh are spinners:

---
#!/bin/bash

while :; do :; done
---

for i in A B C ; do ps -deo pcpu,cmd | grep "${i}\.sh" | awk '{t += $1} END {print t}' ; done
639.7
629.8
1127.4

That is of course not perfect, but it's close enough.
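(For reference: WSM-EP 2x6x2 is 24 CPUs, i.e. 2400% in these units, and
the shares are 1024:1024:2048, so the expected split is roughly:)

	echo "2400 * 1024 / 4096" | bc	# a, b: expected  600 each
	echo "2400 * 2048 / 4096" | bc	# c:    expected 1200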

Now you again.. :-)


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-13  9:47 ` Peter Zijlstra
@ 2014-05-13 13:36   ` Rik van Riel
  2014-05-13 14:23     ` Peter Zijlstra
  2014-05-14  3:21     ` Michael wang
  2014-05-14  3:16   ` Michael wang
  1 sibling, 2 replies; 28+ messages in thread
From: Rik van Riel @ 2014-05-13 13:36 UTC (permalink / raw)
  To: Peter Zijlstra, Michael wang
  Cc: LKML, Ingo Molnar, Mike Galbraith, Alex Shi, Paul Turner,
	Mel Gorman, Daniel Lezcano

On 05/13/2014 05:47 AM, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote:
>> During our testing we found that cpu.shares doesn't work as
>> expected. The test setup is:
>>
> 
> /me zaps all the kvm nonsense as that's non-reproducible and only serves
> to annoy.
> 
> Pro-tip: never use kvm to report cpu-cgroup issues.
> 
>> So are these results expected (I really don't think so...)?
>>
>> Or does this imply that cpu-cgroup has some issue to be fixed?
> 
> So what I did (WSM-EP 2x6x2):
> 
> mount none /cgroup -t cgroup -o cpu
> mkdir -p /cgroup/a
> mkdir -p /cgroup/b
> mkdir -p /cgroup/c
> 
> echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done
> echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done
> echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done
> 
> echo 2048 > /cgroup/c/cpu.shares
> 
> Where [ABC].sh are spinners:

I suspect the "are spinners" is key.

Infinite loops can run all the time, while dbench spends a lot of
its time waiting for locks. That waiting may interfere with getting
as much CPU as it wants.

-- 
All rights reversed


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-13 13:36   ` Rik van Riel
@ 2014-05-13 14:23     ` Peter Zijlstra
  2014-05-14  3:27       ` Michael wang
  2014-05-14  7:36       ` Michael wang
  2014-05-14  3:21     ` Michael wang
  1 sibling, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-13 14:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michael wang, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Tue, May 13, 2014 at 09:36:20AM -0400, Rik van Riel wrote:
> On 05/13/2014 05:47 AM, Peter Zijlstra wrote:
> > On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote:
> >> During our testing we found that cpu.shares doesn't work as
> >> expected. The test setup is:
> >>
> > 
> > /me zaps all the kvm nonsense as that's non-reproducible and only serves
> > to annoy.
> > 
> > Pro-tip: never use kvm to report cpu-cgroup issues.
> > 
> >> So are these results expected (I really don't think so...)?
> >>
> >> Or does this imply that cpu-cgroup has some issue to be fixed?
> > 
> > So what I did (WSM-EP 2x6x2):
> > 
> > mount none /cgroup -t cgroup -o cpu
> > mkdir -p /cgroup/a
> > mkdir -p /cgroup/b
> > mkdir -p /cgroup/c
> > 
> > echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done
> > echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done
> > echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done
> > 
> > echo 2048 > /cgroup/c/cpu.shares
> > 
> > Where [ABC].sh are spinners:
> 
> I suspect the "are spinners" is key.
> 
> Infinite loops can run all the time, while dbench spends a lot of
> its time waiting for locks. That waiting may interfere with getting
> as much CPU as it wants.

At which point it becomes an entirely different problem and the weight
things become far more 'interesting'.

The point remains though, don't use massive and awkward software stacks
that are impossible to operate.

If you want to investigate !spinners, replace the ABC with slightly more
complex loads like: https://lkml.org/lkml/2012/6/18/212


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-13  9:47 ` Peter Zijlstra
  2014-05-13 13:36   ` Rik van Riel
@ 2014-05-14  3:16   ` Michael wang
  1 sibling, 0 replies; 28+ messages in thread
From: Michael wang @ 2014-05-14  3:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Mike Galbraith, Alex Shi, Paul Turner,
	Rik van Riel, Mel Gorman, Daniel Lezcano

On 05/13/2014 05:47 PM, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote:
>> During our testing we found that cpu.shares doesn't work as
>> expected. The test setup is:
>>
> 
> /me zaps all the kvm nonsense as that's non-reproducible and only serves
> to annoy.
> 
> Pro-tip: never use kvm to report cpu-cgroup issues.

Makes sense.

> 
[snip]
> for i in A B C ; do ps -deo pcpu,cmd | grep "${i}\.sh" | awk '{t += $1} END {print t}' ; done

Enjoyable :)

> 639.7
> 629.8
> 1127.4
> 
> That is of course not perfect, but it's close enough.

Yeah, for cpu-intensive workloads the shares do work very well; the
issue only appears when the workload starts to become some kind
of... sleepy.

I will use the tool you mentioned for the following investigation,
thanks for the suggestion.

> 
> Now you again.. :-)

And here I am ;-)

Regards,
Michael Wang

> 



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-13 13:36   ` Rik van Riel
  2014-05-13 14:23     ` Peter Zijlstra
@ 2014-05-14  3:21     ` Michael wang
  1 sibling, 0 replies; 28+ messages in thread
From: Michael wang @ 2014-05-14  3:21 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra
  Cc: LKML, Ingo Molnar, Mike Galbraith, Alex Shi, Paul Turner,
	Mel Gorman, Daniel Lezcano

On 05/13/2014 09:36 PM, Rik van Riel wrote:
[snip]
>>
>> echo 2048 > /cgroup/c/cpu.shares
>>
>> Where [ABC].sh are spinners:
> 
> I suspect the "are spinners" is key.
> 
> Infinite loops can run all the time, while dbench spends a lot of
> its time waiting for locks. That waiting may interfere with getting
> as much CPU as it wants.

That's what we are thinking. We also assume that the load decay
mechanism makes it harder for sleepy tasks to gain enough slice;
well, that's currently just imagination, more investigation is
needed ;-)

Regards,
Michael Wang

> 



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-13 14:23     ` Peter Zijlstra
@ 2014-05-14  3:27       ` Michael wang
  2014-05-14  7:36       ` Michael wang
  1 sibling, 0 replies; 28+ messages in thread
From: Michael wang @ 2014-05-14  3:27 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: LKML, Ingo Molnar, Mike Galbraith, Alex Shi, Paul Turner,
	Mel Gorman, Daniel Lezcano

On 05/13/2014 10:23 PM, Peter Zijlstra wrote:
[snip]
> 
> The point remains though, don't use massive and awkward software stacks
> that are impossible to operate.
> 
> If you want to investigate !spinners, replace the ABC with slightly more
> complex loads like: https://lkml.org/lkml/2012/6/18/212

That's what we need, maybe with a little rework to enable multiple
threads, or maybe adding some locks... anyway, we will redo the test
and see what we can find :)

Regards,
Michael Wang


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-13 14:23     ` Peter Zijlstra
  2014-05-14  3:27       ` Michael wang
@ 2014-05-14  7:36       ` Michael wang
  2014-05-14  9:44         ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-05-14  7:36 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: LKML, Ingo Molnar, Mike Galbraith, Alex Shi, Paul Turner,
	Mel Gorman, Daniel Lezcano

Hi, Peter

On 05/13/2014 10:23 PM, Peter Zijlstra wrote:
[snip]
> 
> If you want to investigate !spinners, replace the ABC with slightly more
> complex loads like: https://lkml.org/lkml/2012/6/18/212

I've done a little rework: enabled multiple threads and added a mutex,
please check the code below for details.

I built it by:
	gcc -o my_tool cgroup_tool.c -lpthread

The distro mounts the cpu subsystem under '/sys/fs/cgroup/cpu'; create groups like:
	mkdir /sys/fs/cgroup/cpu/A
	mkdir /sys/fs/cgroup/cpu/B
	mkdir /sys/fs/cgroup/cpu/C

and then:
	echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l
	echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l
	echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50

the results in top are around:

		A	B	C
	CPU%	550	550	100

When only './my_tool 50' was running, it required around 300%.
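(That matches the tool's parameters, assuming a 12-CPU box as in the
original setup: half the CPUs worth of threads, each burning 50% of a
100ms period:)

	echo "6 * 50" | bc	# 6 threads * 50% -> 300 (%)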

And this can also be reproduced by a dbench/stress combination like:
	echo $$ > /sys/fs/cgroup/cpu/A/tasks ; dbench 6
	echo $$ > /sys/fs/cgroup/cpu/B/tasks ; stress -c 6
	echo $$ > /sys/fs/cgroup/cpu/C/tasks ; stress -c 6

Now it seems more like a generic problem... will keep investigating;
please let me know if there are any suggestions :)

Regards,
Michael Wang



#include <sys/time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>	/* atoi() */
#include <string.h>	/* strcmp() */
#include <pthread.h>

pthread_mutex_t my_mutex;

unsigned long long stamp(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);

	return (unsigned long long)tv.tv_sec * 1000000 + tv.tv_usec;
}
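/*
 * Burn 'spin' usec out of every 'total' usec period and sleep for the
 * remainder; note that all threads serialize on the one global mutex.
 */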
void consume(int spin, int total)
{
	unsigned long long begin, now;
	begin = stamp();

	for (;;) {
		pthread_mutex_lock(&my_mutex);
		now = stamp();
		if ((long long)(now - begin) > spin) {
			pthread_mutex_unlock(&my_mutex);
			usleep(total - spin);
			pthread_mutex_lock(&my_mutex);
			begin += total;
		}
		pthread_mutex_unlock(&my_mutex);
	}
}

struct my_data {
	int spin;
	int total;
};

void *my_fn_sleepy(void *arg)
{
	struct my_data *data = (struct my_data *)arg;
	consume(data->spin, data->total);
	return NULL;
}

void *my_fn_loop(void *arg)
{
	while (1) {};
	return NULL;
}

int main(int argc, char **argv)
{
	int period = 100000; /* 100ms */
	int frac;
	struct my_data data;
	pthread_t last_thread;
	int thread_num = sysconf(_SC_NPROCESSORS_ONLN) / 2;
	void *(*my_fn)(void *arg) = &my_fn_sleepy;

	if (thread_num <= 0 || thread_num > 1024) {
		fprintf(stderr, "insane processor(half) size %d\n", thread_num);
		return -1;
	}

	if (argc == 2 && !strcmp(argv[1], "-l")) {
		my_fn = &my_fn_loop;
		printf("loop mode enabled\n");
		goto loop_mode;
	}

	if (argc < 2) {
		fprintf(stderr, "%s <frac> [<period>]\n"
				"  frac   -- [1-100] %% of time to burn\n"
				"  period -- [usec] period of burn/sleep cycle\n",
				argv[0]);
		return -1;
	}

	frac = atoi(argv[1]);
	if (argc > 2)
		period = atoi(argv[2]);
	if (frac > 100)
		frac = 100;
	if (frac < 1)
		frac = 1;

	data.spin = (period * frac) / 100;
	data.total = period;

loop_mode:
	pthread_mutex_init(&my_mutex, NULL);
	while (thread_num--) {
		if (pthread_create(&last_thread, NULL, my_fn, &data)) {
			fprintf(stderr, "Create thread failed\n");
			return -1;
		}
	}

	printf("Threads never stop, CTRL + C to terminate\n");

	pthread_join(last_thread, NULL);
	pthread_mutex_destroy(&my_mutex);	//won't happen
	return 0;
}


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-14  7:36       ` Michael wang
@ 2014-05-14  9:44         ` Peter Zijlstra
  2014-05-15  3:46           ` Michael wang
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-14  9:44 UTC (permalink / raw)
  To: Michael wang
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Wed, May 14, 2014 at 03:36:50PM +0800, Michael wang wrote:
> The distro mounts the cpu subsystem under '/sys/fs/cgroup/cpu'; create groups like:
> 	mkdir /sys/fs/cgroup/cpu/A
> 	mkdir /sys/fs/cgroup/cpu/B
> 	mkdir /sys/fs/cgroup/cpu/C

Yeah, distro is on crack, nobody sane mounts anything there.

> and then:
> 	echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l
> 	echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l
> 	echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50
> 
> the results in top are around:
> 
> 		A	B	C
> 	CPU%	550	550	100

top doesn't do per-cgroup accounting, so how do you get these numbers?
Per the above, all instances of the prog are also called the same,
further making it error-prone and difficult to get sane numbers.
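(One way to get per-group numbers is the cpuacct controller; a minimal
sketch, assuming cpuacct is mounted, e.g. co-mounted with cpu, and that
the A/B/C groups exist under it:)

	for g in A B C ; do
		u0=$(cat /sys/fs/cgroup/cpuacct/$g/cpuacct.usage)
		sleep 5
		u1=$(cat /sys/fs/cgroup/cpuacct/$g/cpuacct.usage)
		# cpuacct.usage is cumulative ns; 5s of one CPU == 5e9 ns
		echo "$g: $(( (u1 - u0) / 50000000 ))%"
	done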


> #include <sys/time.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <pthread.h>
> 
> pthread_mutex_t my_mutex;
> 
> unsigned long long stamp(void)
> {
> 	struct timeval tv;
> 	gettimeofday(&tv, NULL);
> 
> 	return (unsigned long long)tv.tv_sec * 1000000 + tv.tv_usec;
> }
> void consume(int spin, int total)
> {
> 	unsigned long long begin, now;
> 	begin = stamp();
> 
> 	for (;;) {
> 		pthread_mutex_lock(&my_mutex);
> 		now = stamp();
> 		if ((long long)(now - begin) > spin) {
> 			pthread_mutex_unlock(&my_mutex);
> 			usleep(total - spin);
> 			pthread_mutex_lock(&my_mutex);
> 			begin += total;
> 		}
> 		pthread_mutex_unlock(&my_mutex);
> 	}
> }

Uh,.. that's just insane.. what's the point of having a multi-threaded
program do busy-wait loops if you then serialize the lot on a global
mutex such that only 1 thread can run at any one time?

How can one such prog ever consume more than 100% cpu.


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-14  9:44         ` Peter Zijlstra
@ 2014-05-15  3:46           ` Michael wang
  2014-05-15  8:35             ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-05-15  3:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On 05/14/2014 05:44 PM, Peter Zijlstra wrote:
[snip]
>> and then:
>> 	echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l
>> 	echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l
>> 	echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50
>>
>> the results in top are around:
>>
>> 		A	B	C
>> 	CPU%	550	550	100
> 
> top doesn't do per-cgroup accounting, so how do you get these numbers?
> Per the above, all instances of the prog are also called the same,
> further making it error-prone and difficult to get sane numbers.

Oh, my bad for making it confusing; I was checking the PID of the
my_tool instance inside top, like:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
24968 root      20   0 55600  720  648 S 558.1  0.0   2:08.76 my_tool           
24984 root      20   0 55600  720  648 S 536.2  0.0   1:10.29 my_tool           
25001 root      20   0 55600  720  648 S 88.6  0.0   0:04.39 my_tool

By 'cat /sys/fs/cgroup/cpu/C/tasks' I got that the PID of './my_tool 50'
is 25001, and all of its pthreads' %CPU is counted in; could we check it
like that?
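(A per-thread cross-check: sum the thread %CPU for that PID with ps; an
illustrative one-liner:)

	ps -Lp 25001 -o pcpu= | awk '{ t += $1 } END { print t "%" }'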

> 
> 
[snip]
>> void consume(int spin, int total)
>> {
>> 	unsigned long long begin, now;
>> 	begin = stamp();
>>
>> 	for (;;) {
>> 		pthread_mutex_lock(&my_mutex);
>> 		now = stamp();
>> 		if ((long long)(now - begin) > spin) {
>> 			pthread_mutex_unlock(&my_mutex);
>> 			usleep(total - spin);
>> 			pthread_mutex_lock(&my_mutex);
>> 			begin += total;
>> 		}
>> 		pthread_mutex_unlock(&my_mutex);
>> 	}
>> }
> 
> Uh,.. that's just insane.. what's the point of having a multi-threaded
> program do busy-wait loops if you then serialize the lot on a global
> mutex such that only 1 thread can run at any one time?
> 
> How can one such prog ever consume more than 100% cpu.

That's a good point... however, top shows that when only './my_tool 50'
(25001) is running, it uses around 300%, like below:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
25001 root      20   0 55600  720  648 S 284.3  0.0   5:18.00 my_tool           
 2376 root      20   0  950m  85m  29m S  4.4  0.2 163:47.94 python             
 1658 root      20   0 1013m  19m  11m S  3.0  0.1  97:06.11 libvirtd

IMHO, if the pthread mutex behaved like the kernel one, it may not go
to sleep when it's the only thing running on the CPU.

Oh, I think we've got the reason here: when there are other tasks
running, the mutex goes to sleep and the %CPU drops to the serialized
case, which is around 100%.

But for the dbench/stress combination, that's not spin-wasted; dbench
throughput does drop, how could we explain that one?

Regards,
Michael Wang

> 



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-15  3:46           ` Michael wang
@ 2014-05-15  8:35             ` Peter Zijlstra
  2014-05-15  8:46               ` Michael wang
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-15  8:35 UTC (permalink / raw)
  To: Michael wang
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote:
> But for the dbench/stress combination, that's not spin-wasted; dbench
> throughput does drop, how could we explain that one?

I've no clue what dbench does.. At this point you'll have to
expose/trace the per-task runtime accounting for these tasks and ideally
also the things the cgroup code does with them to see if it still makes
sense.


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-15  8:35             ` Peter Zijlstra
@ 2014-05-15  8:46               ` Michael wang
  2014-05-15  9:06                 ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-05-15  8:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On 05/15/2014 04:35 PM, Peter Zijlstra wrote:
> On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote:
>> But for the dbench/stress combination, that's not spin-wasted; dbench
>> throughput does drop, how could we explain that one?
> 
> I've no clue what dbench does.. At this point you'll have to
> expose/trace the per-task runtime accounting for these tasks and ideally
> also the things the cgroup code does with them to see if it still makes
> sense.

I see :)

BTW, an interesting thing we found during the dbench/stress testing
is that by doing:

	echo 240000000 > /proc/sys/kernel/sched_latency_ns
        echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features

that is, with sched_latency_ns increased around 10 times and
GENTLE_FAIR_SLEEPERS disabled, dbench gets its CPU back.

However, when the group level is too deep, that doesn't work anymore...

I'm not sure, but it seems like 'deep group level' and 'vruntime bonus
for sleeper' are the key points here; will try to list the root cause
after more investigation. Thanks for the hints and suggestions, really
helpful ;-)

Regards,
Michael Wang

> 



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-15  8:46               ` Michael wang
@ 2014-05-15  9:06                 ` Peter Zijlstra
  2014-05-15  9:35                   ` Michael wang
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-15  9:06 UTC (permalink / raw)
  To: Michael wang
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Thu, May 15, 2014 at 04:46:28PM +0800, Michael wang wrote:
> On 05/15/2014 04:35 PM, Peter Zijlstra wrote:
> > On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote:
> >> But for the dbench, stress combination, that's not spin-wasted, dbench
> >> throughput do dropped, how could we explain that one?
> > 
> > I've no clue what dbench does.. At this point you'll have to
> > expose/trace the per-task runtime accounting for these tasks and ideally
> > also the things the cgroup code does with them to see if it still makes
> > sense.
> 
> I see :)
> 
> BTW, an interesting thing we found during the dbench/stress testing
> is that by doing:
> 
> 	echo 240000000 > /proc/sys/kernel/sched_latency_ns
>         echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features
> 
> that is, with sched_latency_ns increased around 10 times and
> GENTLE_FAIR_SLEEPERS disabled, dbench gets its CPU back.
> 
> However, when the group level is too deep, that doesn't work anymore...
> 
> I'm not sure, but it seems like 'deep group level' and 'vruntime bonus
> for sleeper' are the key points here; will try to list the root cause
> after more investigation. Thanks for the hints and suggestions, really
> helpful ;-)

How deep is deep? You run into numerical problems quite quickly, esp.
when you've got lots of CPUs. We've only got 64bit to play with, that
said there were some patches...

What happens if you do the below, Google has been running with that, and
nobody was ever able to reproduce the report that got it disabled.



diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2cbe81308af..e40819d39c69 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -40,7 +40,7 @@ extern void update_cpu_load_active(struct rq *this_rq);
  * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
  * increased costs.
  */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
+#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
 # define SCHED_LOAD_RESOLUTION	10
 # define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
 # define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-15  9:06                 ` Peter Zijlstra
@ 2014-05-15  9:35                   ` Michael wang
  2014-05-15 11:57                     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-05-15  9:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On 05/15/2014 05:06 PM, Peter Zijlstra wrote:
[snip]
>> However, when the group level is too deep, that doesn't work anymore...
>>
>> I'm not sure, but it seems like 'deep group level' and 'vruntime bonus
>> for sleeper' are the key points here; will try to list the root cause
>> after more investigation. Thanks for the hints and suggestions, really
>> helpful ;-)
> 
> How deep is deep? You run into numerical problems quite quickly, esp.
> when you've got lots of CPUs. We've only got 64bit to play with, that
> said there were some patches...

It's like:

	/cgroup/cpu/l1/l2/l3/l4/l5/l6/A

At about level 7, the issue cannot be solved anymore.
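(Such a topology can be reproduced by hand with a trivial loop; paths
illustrative:)

	d=/sys/fs/cgroup/cpu
	for l in l1 l2 l3 l4 l5 l6 A ; do
		d=$d/$l ; mkdir $d
	done
	echo $$ > $d/tasks	# then start the workload from this shell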

> 
> What happens if you do the below, Google has been running with that, and
> nobody was ever able to reproduce the report that got it disabled.
> 
> 
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b2cbe81308af..e40819d39c69 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -40,7 +40,7 @@ extern void update_cpu_load_active(struct rq *this_rq);
>   * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
>   * increased costs.
>   */
> -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> +#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */

That is trying to solve the load overflow issue, correct?

I'm not sure which accounting will turn out to be huge when the group
gets deeper; the load accumulation suffers a discount when passed up,
doesn't it?

Anyway, will give it a try and see what happens :)

Regards,
Michael Wang

>  # define SCHED_LOAD_RESOLUTION	10
>  # define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
>  # define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
> 



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-15  9:35                   ` Michael wang
@ 2014-05-15 11:57                     ` Peter Zijlstra
  2014-05-16  2:23                       ` Michael wang
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-15 11:57 UTC (permalink / raw)
  To: Michael wang
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Thu, May 15, 2014 at 05:35:25PM +0800, Michael wang wrote:
> On 05/15/2014 05:06 PM, Peter Zijlstra wrote:
> [snip]
> >> However, when the group level is too deep, that doesn't work anymore...
> >>
> >> I'm not sure, but it seems like 'deep group level' and 'vruntime bonus
> >> for sleeper' are the key points here; will try to list the root cause
> >> after more investigation. Thanks for the hints and suggestions, really
> >> helpful ;-)
> > 
> > How deep is deep? You run into numerical problems quite quickly, esp.
> > when you've got lots of CPUs. We've only got 64bit to play with, that
> > said there were some patches...
> 
> It's like:
> 
> 	/cgroup/cpu/l1/l2/l3/l4/l5/l6/A
> 
> At about level 7, the issue cannot be solved anymore.

That's pretty retarded and yeah, that's way past the point where things
make sense. You might be lucky and have l1-5 as empty/pointless
hierarchy so the effective depth is less and then things will work, but
*shees*..

> > -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> > +#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> 
> That is trying to solve the load overflow issue, correct?
> 
> I'm not sure which accounting will turn out to be huge when the group
> gets deeper; the load accumulation suffers a discount when passed up,
> doesn't it?
> 

It'll use 20 bits for precision instead of 10, so it gives a little more
'room' for deeper hierarchies/big cpu-count.
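(A rough feel for the numbers, assuming load gets diluted about 3x per
level as in the 3-sibling setups above; illustrative only:)

	echo "scale=3; 1024 / 3^7" | bc		# ~0.468: rounds to 0 at depth 7
	echo "(1024 * 2^10) / 3^7" | bc		# ~479: survives with 10 extra bits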

All assuming you're running 64bit kernels of course.


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-15 11:57                     ` Peter Zijlstra
@ 2014-05-16  2:23                       ` Michael wang
  2014-05-16  2:51                         ` Mike Galbraith
  2014-05-16  7:48                         ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Michael wang @ 2014-05-16  2:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On 05/15/2014 07:57 PM, Peter Zijlstra wrote:
[snip]
>>
>> It's like:
>>
>> 	/cgroup/cpu/l1/l2/l3/l4/l5/l6/A
>>
> >> At about level 7, the issue cannot be solved anymore.
> 
> That's pretty retarded and yeah, that's way past the point where things
> make sense. You might be lucky and have l1-5 as empty/pointless
> hierarchy so the effective depth is less and then things will work, but
> *shees*..

Exactly, that's a simulation of the cgroup topology set up by libvirt;
it really doesn't make sense... more torture than deployment, but they
do make things like that...

> 
[snip]
>> I'm not sure which accounting will turn out to be huge when the group
>> gets deeper; the load accumulation suffers a discount when passed up,
>> doesn't it?
>>
> 
> It'll use 20 bits for precision instead of 10, so it gives a little more
> 'room' for deeper hierarchies/big cpu-count.

Got it :)

> 
> All assuming you're running 64bit kernels of course.

Yes, it's 64bit. I tried the testing with this feature on; it seems it
hasn't addressed the issue...

But we found that one difference when the group gets deeper is that the
tasks of that group gather on the same CPU more often; sometimes all the
dbench instances were running on the same CPU. This doesn't happen for
the l1 group, which may explain why dbench could not get more than 100%
CPU anymore.

But why the gathering happens when the group gets deeper is unclear...
will try to figure it out :)

Regards,
Michael Wang

> 



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-16  2:23                       ` Michael wang
@ 2014-05-16  2:51                         ` Mike Galbraith
  2014-05-16  4:24                           ` Michael wang
  2014-05-16  7:48                         ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Galbraith @ 2014-05-16  2:51 UTC (permalink / raw)
  To: Michael wang
  Cc: Peter Zijlstra, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote:

> But we found that one difference when the group gets deeper is that the
> tasks of that group gather on the same CPU more often; sometimes all the
> dbench instances were running on the same CPU. This doesn't happen for
> the l1 group, which may explain why dbench could not get more than 100%
> CPU anymore.

Right.  I played a little (sane groups), saw load balancing as well.

-Mike



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-16  2:51                         ` Mike Galbraith
@ 2014-05-16  4:24                           ` Michael wang
  2014-05-16  7:54                             ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-05-16  4:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

Hey, Mike :)

On 05/16/2014 10:51 AM, Mike Galbraith wrote:
> On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote:
> 
>> But we found that one difference when the group gets deeper is that the
>> tasks of that group gather on the same CPU more often; sometimes all the
>> dbench instances were running on the same CPU. This doesn't happen for
>> the l1 group, which may explain why dbench could not get more than 100%
>> CPU anymore.
> 
> Right.  I played a little (sane groups), saw load balancing as well.

Yeah, now we found that even l2 groups face the same issue; allow
me to re-list the details here:

First, apply the workaround (10x latency):
	echo 240000000 > /proc/sys/kernel/sched_latency_ns
        echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features

This workaround may be related to another issue about the vruntime
bonus for sleepers, but let's set that aside for now and focus on the
gathering issue.

Create groups like:
	mkdir /sys/fs/cgroup/cpu/A
	mkdir /sys/fs/cgroup/cpu/B
	mkdir /sys/fs/cgroup/cpu/C

	mkdir /sys/fs/cgroup/cpu/l1
	mkdir /sys/fs/cgroup/cpu/l1/A
	mkdir /sys/fs/cgroup/cpu/l1/B
	mkdir /sys/fs/cgroup/cpu/l1/C

Run the workload like this (6 is half of the CPUs on my box):
	echo $$ > /sys/fs/cgroup/cpu/A/tasks ; dbench 6
	echo $$ > /sys/fs/cgroup/cpu/B/tasks ; stress -c 6
	echo $$ > /sys/fs/cgroup/cpu/C/tasks ; stress -c 6

Check top: each dbench instance gets around 45%, 270% in total. This
is close to the case when only dbench is running (300%) since we use
the workaround; otherwise we would see it around 100%, but that's
another issue...

By sampling /proc/sched_debug, we rarely see more than 2 dbench
instances on the same rq.

Now re-run the workload like this:
	echo $$ > /sys/fs/cgroup/cpu/l1/A/tasks ; dbench 6
	echo $$ > /sys/fs/cgroup/cpu/l1/B/tasks ; stress -c 6
	echo $$ > /sys/fs/cgroup/cpu/l1/C/tasks ; stress -c 6

Check top: each dbench instance gets around 20%, 120% in total,
sometimes dropping under 100%, and dbench throughput drops.

By sampling /proc/sched_debug, we frequently see 4 or 5 dbench
instances on the same rq.
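(A sketch of that sampling; /proc/sched_debug's exact format varies
across kernel versions, so the matching is only illustrative:)

	awk '/^cpu#/  { cpu = $1 }
	     /dbench/ { n[cpu]++ }
	     END { for (c in n) print c, n[c] }' /proc/sched_debug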

So, just one level deeper, from l1 to l2, and such a big difference;
groups with the same shares do not share the resources equally...

BTW, by binding each dbench instance to a different CPU, dbench in the
l2 group regains all its CPU%, which is 300%.
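(The binding can be done roughly like this; a sketch only, assuming the
dbench tasks are listed in the group's tasks file:)

	i=0
	for pid in $(cat /sys/fs/cgroup/cpu/l1/A/tasks) ; do
		taskset -p -c $i $pid
		i=$((i + 1))
	done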

I'll keep investigating and try to figure out why the l2 group's tasks
start to gather; please let me know if there are any suggestions ;-)

Regards,
Michael Wang

> 
> -Mike
> 

* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-16  2:23                       ` Michael wang
  2014-05-16  2:51                         ` Mike Galbraith
@ 2014-05-16  7:48                         ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-16  7:48 UTC (permalink / raw)
  To: Michael wang
  Cc: Rik van Riel, LKML, Ingo Molnar, Mike Galbraith, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Fri, May 16, 2014 at 10:23:11AM +0800, Michael wang wrote:
> On 05/15/2014 07:57 PM, Peter Zijlstra wrote:
> [snip]
> >>
> >> It's like:
> >>
> >> 	/cgroup/cpu/l1/l2/l3/l4/l5/l6/A
> >>
> >> about level 7, the issue can not be solved any more.
> > 
> > That's pretty retarded and yeah, that's way past the point where things
> > make sense. You might be lucky and have l1-5 as empty/pointless
> > hierarchy so the effective depth is less and then things will work, but
> > *shees*..
> 
> Exactly, that's a simulation of the cgroup topology set up by libvirt;
> it really doesn't make sense... more torture than deployment, but they
> do make things like that...

I'm calling it broken and unfit for purpose if it does crazy shit like
that.

There's really not much we can do to fix it either, barring softfloat in
the load-balancer and I'm sure everybody but virt wankers will complain
about _that_.


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-16  4:24                           ` Michael wang
@ 2014-05-16  7:54                             ` Peter Zijlstra
  2014-05-16  8:15                               ` Michael wang
  2014-06-10  8:56                               ` Michael wang
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-16  7:54 UTC (permalink / raw)
  To: Michael wang
  Cc: Mike Galbraith, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Fri, May 16, 2014 at 12:24:35PM +0800, Michael wang wrote:
> Hey, Mike :)
> 
> On 05/16/2014 10:51 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote:
> > 
> >> But we found that one difference when the group gets deeper is that the
> >> tasks of that group gather on the same CPU more often; sometimes all the
> >> dbench instances were running on the same CPU. This doesn't happen for
> >> the l1 group, which may explain why dbench could not get more than 100%
> >> CPU anymore.
> > 
> > Right.  I played a little (sane groups), saw load balancing as well.
> 
> Yeah, now we found that even l2 groups face the same issue; allow
> me to re-list the details here:

Hmm, that _should_ more or less work and does indeed suggest there's
something iffy.


* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-16  7:54                             ` Peter Zijlstra
@ 2014-05-16  8:15                               ` Michael wang
  2014-06-10  8:56                               ` Michael wang
  1 sibling, 0 replies; 28+ messages in thread
From: Michael wang @ 2014-05-16  8:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
[snip]
>>> Right.  I played a little (sane groups), saw load balancing as well.
>>
>> Yeah, now we found that even l2 groups face the same issue; allow
>> me to re-list the details here:
> 
> Hmm, that _should_ more or less work and does indeed suggest there's
> something iffy.

Yeah, a sane group topology has the issue too... besides the sleeper
bonus, it seems the root cause is the tasks starting to gather. I plan
to check the difference in task load between the two cases and see if
there is a good way to solve this problem :)

Regards,
Michael Wang

> 



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-05-16  7:54                             ` Peter Zijlstra
  2014-05-16  8:15                               ` Michael wang
@ 2014-06-10  8:56                               ` Michael wang
  2014-06-10 12:12                                 ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-06-10  8:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
[snip]
> 
> Hmm, that _should_ more or less work and does indeed suggest there's
> something iffy.
> 

I think we've located the reason why cpu-cgroup doesn't work well with
dbench now... finally.

I'd like to link to the way to reproduce the issue here, since a long
time has passed...

	https://lkml.org/lkml/2014/5/16/4

Now here is the analysis:

So our problem is that when we put tasks like dbench, which sleep and
wake each other up frequently, into a deep group, they gather on the
same CPU when a workload like stress is running, which leads to the
whole group gaining no more than one CPU.

Basically there are two key points here, load-balance and wake-affine.

Wake-affine certainly pulls tasks together for a workload like dbench;
what makes the difference when putting dbench into a group one level
deeper is the load-balance, which happens less.

Usually, when the system is busy and we cannot locate an idle cpu
during wakeup, we pick the search point instead, however busy it is,
since we count on the balance routine later to help balance the load.

However, in our case the load balance cannot help with that, since the
deeper the group is, the less its load means to the root group.

This means that even with the tasks of a deep group all gathered on one
CPU, the load can still look balanced from the view of the root group,
and the tasks lose the only chance (balance) to spread once they are
already on the same CPU...

Furthermore, for tasks that flip frequently like dbench, it becomes far
harder for load balance to help; it may even rarely catch them on a rq.

So in such cases, the only chance to balance these tasks is during the
wakeup; however, that will be expensive...

Thus the cheaper way is something just like select_idle_sibling(); the
only difference is that now we balance tasks inside the group to prevent
them from gathering.

The patch below has solved the problem during testing. I'd like to do
more testing on other benchmarks before sending out the formal patch;
any comments are welcome ;-)

Regards,
Michael Wang



diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..e1381cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static inline int tg_idle_cpu(struct task_group *tg, int cpu)
+{
+	return !tg->cfs_rq[cpu]->nr_running;
+}
+
+/*
+ * Try and locate an idle CPU in the sched_domain from tg's view.
+ *
+ * Although gathered on one CPU vs spread across CPUs may make no
+ * difference from the highest group's view, gathering will starve the
+ * tasks: even if they have enough shares to fight for CPU, they only
+ * get one battlefield, which means that no matter how big their weight
+ * is, they get one CPU at maximum.
+ *
+ * Thus when the system is busy, we filter out those tasks which can't
+ * gain help from the balance routine, and try to balance them
+ * internally with this function, so they stand a chance to show their
+ * power.
+ *
+ */
+static int tg_idle_sibling(struct task_struct *p, int target)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+	int i = task_cpu(p);
+	struct task_group *tg = task_group(p);
+
+	if (tg_idle_cpu(tg, target))
+		goto done;
+
+	sd = rcu_dereference(per_cpu(sd_llc, target));
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (i == target || !tg_idle_cpu(tg, i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
+	}
+
+done:
+
+	return target;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
+	struct sched_entity *se = task_group(p)->se[i];
 
 	if (idle_cpu(target))
 		return target;
@@ -4451,6 +4508,30 @@ next:
 		} while (sg != sd->groups);
 	}
 done:
+
+	if (!idle_cpu(target)) {
+		/*
+		 * No idle cpu located implies the system is somewhat busy;
+		 * usually we count on the load balance routine's help and
+		 * just pick the target however busy it is.
+		 *
+		 * However, when a task belongs to a deep group (harder to
+		 * make the root imbalanced) and flips frequently (harder
+		 * to be caught during balance), the load balance routine
+		 * may help nothing, and these tasks will eventually gather
+		 * on the same cpu when they wake each other up; that is,
+		 * the chance of gathering is far higher than the chance
+		 * of spreading.
+		 *
+		 * Thus we need to handle such tasks carefully during
+		 * wakeup, since it is their rare chance to spread.
+		 *
+		 */
+		if (se && se->depth &&
+				p->wakee_flips > this_cpu_read(sd_llc_size))
+			return tg_idle_sibling(p, target);
+	}
+
 	return target;
 }
 



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-06-10  8:56                               ` Michael wang
@ 2014-06-10 12:12                                 ` Peter Zijlstra
  2014-06-11  6:13                                   ` Michael wang
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-06-10 12:12 UTC (permalink / raw)
  To: Michael wang
  Cc: Mike Galbraith, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Tue, Jun 10, 2014 at 04:56:12PM +0800, Michael wang wrote:
> On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
> [snip]
> > 
> > Hmm, that _should_ more or less work and does indeed suggest there's
> > something iffy.
> > 
> 
> I think we've located the reason why cpu-cgroup doesn't work well with
> dbench now... finally.
> 
> I'd like to link to the way to reproduce the issue here, since a long
> time has passed...
> 
> 	https://lkml.org/lkml/2014/5/16/4
> 
> Now here is the analysis:
> 
> So our problem is that when we put tasks like dbench, which sleep and
> wake each other up frequently, into a deep group, they gather on the
> same CPU when a workload like stress is running, which leads to the
> whole group gaining no more than one CPU.
> 
> Basically there are two key points here, load-balance and wake-affine.
> 
> Wake-affine certainly pulls tasks together for a workload like dbench;
> what makes the difference when putting dbench into a group one level
> deeper is the load-balance, which happens less.

Do we load-balance less (frequently), or do we migrate fewer tasks due
to load-balancing?

> Usually, when the system is busy and we cannot locate an idle cpu
> during wakeup, we pick the search point instead, however busy it is,
> since we count on the balance routine later to help balance the load.

But above you said that dbench usually triggers the wake-affine logic,
but now you say it doesn't and we rely on select_idle_sibling?

Note that the comparison isn't fair, running dbench on an idle system vs
running dbench on a busy system is the first step.

The second is adding the cgroup crap on.

> However, in our case the load balance cannot help with that, since the
> deeper the group is, the less its load means to the root group.

But since all actual load is on the same depth, the relative threshold
(imbalance pct) should work the same, the size of the values don't
matter, the relative ratios do.

> This means that even with the tasks of a deep group all gathered on one
> CPU, the load can still look balanced from the view of the root group,
> and the tasks lose the only chance (balance) to spread once they are
> already on the same CPU...

Sure, but see above.

> Furthermore, for tasks that flip frequently like dbench, it becomes far
> harder for load balance to help; it may even rarely catch them on a rq.

And I suspect that is the main problem; so see what it does on a busy
system: !cgroup: nr_cpus busy loops + dbench, because that's your
benchmark for adding cgroups, the cgroup can only shift that behaviour
around.

> So in such cases, the only chance to balance these tasks is during the
> wakeup; however, that will be expensive...
> 
> Thus the cheaper way is something just like select_idle_sibling(); the
> only difference is that now we balance tasks inside the group to prevent
> them from gathering.
> 
> The patch below has solved the problem during testing. I'd like to do
> more testing on other benchmarks before sending out the formal patch;
> any comments are welcome ;-)

So I think that approach is wrong, select_idle_siblings() works because
we want to keep CPUs from being idle, but if they're not actually idle,
pretending like they are (in a cgroup) is actively wrong and can skew
load pretty bad.

Furthermore, if as I expect, dbench sucks on a busy system, then the
proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
alter behaviour like that.

More so, I suspect that patch will tend to overload cpu0 (and lower cpu
numbers in general -- because its scanning in the same direction for
each cgroup) for other workloads. You can't just go pile more and more
work on cpu0 just because there's nothing running in this particular
cgroup.

So dbench is very sensitive to queueing, and select_idle_siblings()
avoids a lot of queueing on an idle system. I don't think that's
something we should fix with cgroups.



* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-06-10 12:12                                 ` Peter Zijlstra
@ 2014-06-11  6:13                                   ` Michael wang
  2014-06-11  8:24                                     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-06-11  6:13 UTC (permalink / raw)
  To: Peter Zijlstra, Mike Galbraith, Rik van Riel, LKML, Ingo Molnar,
	Alex Shi, Paul Turner, Mel Gorman, Daniel Lezcano

Hi, Peter

Thanks for the reply :)

On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
[snip]
>> Wake-affine certainly pulls tasks together for a workload like dbench;
>> what makes the difference when putting dbench into a group one level
>> deeper is the load-balance, which happens less.
> 
> Do we load-balance less (frequently), or do we migrate fewer tasks due
> to load-balancing?

IMHO, when we put the tasks one group deeper, in other words when the
total weight of these tasks is 1024 (previously 3072), the load becomes
more balanced at root, which makes the lb-routine consider the system
balanced and thus migrate less.

> 
>> Usually, when the system is busy and we cannot locate an idle cpu
>> during wakeup, we pick the search point instead, however busy it is,
>> since we count on the balance routine later to help balance the load.
> 
> But above you said that dbench usually triggers the wake-affine logic,
> but now you say it doesn't and we rely on select_idle_sibling?

During wakeup it triggers wake-affine; after that, we go inside
select_idle_sibling() and find no idle cpu, then pick the search point
instead (curr cpu if wake-affine, prev cpu if not).

> 
> Note that the comparison isn't fair, running dbench on an idle system vs
> running dbench on a busy system is the first step.

Our comparison is based on the same busy system; both cases have the
same workload running, and the only difference is that we put the same
workload (dbench + stress) one group deeper. It's like:

Good case:
		root
		l1-A	l1-B	l1-C
		dbench	stress	stress

	results:
		dbench got around 300%
		each stress got around 450%

Bad case:
		root
		l1
		l2-A	l2-B	l2-C
		dbench	stress	stress

	results:
		dbench got around 100% (throughput dropped too)
		each stress got around 550%

Although the l1 group gains the same resources (1200%), they are not
assigned to l2-ABC correctly the way the root group did.

> 
> The second is adding the cgroup crap on.
> 
>> However, in our case the load balance cannot help with that, since the
>> deeper the group is, the less its load means to the root group.
> 
> But since all actual load is on the same depth, the relative threshold
> (imbalance pct) should work the same, the size of the values don't
> matter, the relative ratios do.

Exactly; however, when the group is deep, its chance of making the root
imbalanced is reduced. In the good case, gathering on a cpu means 1024
load, while in the bad case it drops to 1024/3 ideally, which makes it
harder to trigger an imbalance and gain help from the routine. Please
note that although dbench and stress are the only workload in the
system, there are still other tasks serving the system that need to be
woken up (some very actively because of the dbench...); compared to
them, a deep group's load means nothing...

> 
>> This means that even with the tasks of a deep group all gathered on one
>> CPU, the load can still look balanced from the view of the root group,
>> and the tasks lose the only chance (balance) to spread once they are
>> already on the same CPU...
> 
> Sure, but see above.

The lb-routine could not provide enough help for a deep group, since an
imbalance inside the group does not cause an imbalance at root: ideally
each l2-task contributes 1024/18 ~= 56 root-load, which is easily
ignored, while inside the l2-group the gathered case can already mean an
imbalance like (1024 * 5) : 1024.

> 
>> Furthermore, for tasks that flip on and off the rq frequently, like dbench,
>> it becomes far harder for load balance to help; it may even rarely catch
>> them on the rq.
> 
> And I suspect that is the main problem; so see what it does on a busy
> system: !cgroup: nr_cpus busy loops + dbench, because that's your
> benchmark for adding cgroups, the cgroup can only shift that behaviour
> around.

There are busy loops in the good case too, and the behaviour of dbench
in the l1-groups should not change after we put it into an l2-group;
what makes things worse is that the chance to spread after gathering
becomes smaller.

> 
[snip]
>> The patch below solved the problem during testing; I'd like to do more
>> testing on other benchmarks before sending out the formal patch, any
>> comments are welcome ;-)
> 
> So I think that approach is wrong, select_idle_sibling() works because
> we want to keep CPUs from being idle, but if they're not actually idle,
> pretending like they are (in a cgroup) is actively wrong and can skew
> load pretty bad.

We only choose that timing when no idle cpu could be located, the flips
are somewhat high, and the group is deep.

In such cases select_idle_sibling() doesn't work anyway; it returns the
target even if it is very busy. We just check twice to prevent it from
making some obviously bad decision ;-)

> 
> Furthermore, if as I expect, dbench sucks on a busy system, then the
> proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> alter behaviour like that.

That's true, and that's why we currently still need to turn off the
GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
solve later...

What we currently expect is that the cgroup assigns the resources
according to the shares; it works well in l1-groups, so we expect it to
work equally well in l2-groups...

> 
> More so, I suspect that patch will tend to overload cpu0 (and lower cpu
> numbers in general -- because it's scanning in the same direction for
> each cgroup) for other workloads. You can't just go pile more and more
> work on cpu0 just because there's nothing running in this particular
> cgroup.

That's a good point...

However, during the testing this didn't happen across the 3 groups;
tasks stayed on the high cpus as often as on the low ones. IMHO the key
point here is that the lb-routine still works, although much less than
before.

So the fix just makes the result of the lb-routine last longer, since
the higher cpu it picked is usually idle within the group (and so gets
picked directly later); in other words, tasks on a high cpu are harder
to pull back to a low cpu by wake-affine than before.

And when this applies to all the groups, each of them will be balanced
both internally and externally, and then we will see an equal number of
tasks on each cpu.

select_idle_sibling() does pick the low cpus more often, and combined
with wake-affine and not enough load-balance, the tasks will gather on
the low cpus more often; but our solution makes the rarer load-balance
more valuable (when it is needed). IMHO, it could even contribute to
the balance work in some cases...

> 
> So dbench is very sensitive to queueing, and select_idle_sibling()
> avoids a lot of queueing on an idle system. I don't think that's
> something we should fix with cgroups.

It has to queue after the wakeup anyway, doesn't it? We just want a good
candidate which won't make things too bad inside the group, and we only
do this when select_idle_sibling() gives up searching...

Regards,
Michael Wang

> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-06-11  6:13                                   ` Michael wang
@ 2014-06-11  8:24                                     ` Peter Zijlstra
  2014-06-11  9:18                                       ` Michael wang
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-06-11  8:24 UTC (permalink / raw)
  To: Michael wang
  Cc: Mike Galbraith, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

[-- Attachment #1: Type: text/plain, Size: 5082 bytes --]

On Wed, Jun 11, 2014 at 02:13:42PM +0800, Michael wang wrote:
> Hi, Peter
> 
> Thanks for the reply :)
> 
> On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
> [snip]
> >> Wake-affine certainly pulls tasks together for a workload like dbench;
> >> what makes the difference when dbench is put one group level deeper is
> >> the load-balance, which happens less often.
> > 
> > We load-balance less (frequently) or we migrate less tasks due to
> > load-balancing ?
> 
> IMHO, when we put tasks one group deeper, in other words the total
> weight of these tasks is 1024 (previously 3072), the load looks more
> balanced at root level, which makes the lb-routine consider the system
> balanced, so we migrate less in the lb-routine.

But how? The absolute value (1024 vs 3072) is of no effect to the
imbalance, the imbalance is computed from relative differences between
cpus.

> Our comparison is based on the same busy system: both cases run the
> same workload, the only difference being that we put the same workload
> (dbench + stress) one group deeper, like this:
> 
> Good case:
> 		root
> 		l1-A	l1-B	l1-C
> 		dbench	stress	stress
> 
> 	results:
> 		dbench got around 300%
> 		each stress got around 450%
> 
> Bad case:
> 		root
> 		l1
> 		l2-A	l2-B	l2-C
> 		dbench	stress	stress
> 
> 	results:
> 		dbench got around 100% (throughput dropped too)
> 		each stress got around 550%
> 
> Although the l1-group gains the same resources (1200%), it doesn't
> assign them to l2-ABC correctly the way the root-group did.

But in this case select_idle_sibling() should function identically, so
that cannot be the problem.

> > The second is adding the cgroup crap on.
> > 
> >> However, in our case the load balance could not help with that, since
> >> the deeper the group is, the less its load means to the root group.
> > 
> > But since all actual load is on the same depth, the relative threshold
> > (imbalance pct) should work the same, the size of the values don't
> > matter, the relative ratios do.
> 
> Exactly. However, when the group is deep, its chance of making the root
> imbalanced is reduced: in the good case, gathering on one cpu means 1024
> load, while in the bad case it drops to 1024/3 ideally, which makes it
> harder to trigger an imbalance and get help from the routine. Please
> note that although dbench and stress are the only workload in the
> system, there are still other tasks serving the system that need to be
> woken up (some very actively, because of the dbench...); compared to
> them, a deep group's load means nothing...

What tasks are these? And is it their interference that disturbs
load-balancing?

> >> By which means even when the tasks of a deep group all gather on one CPU,
> >> the load can still look balanced from the view of the root group, and the
> >> tasks lose the only chance (balance) to spread once they are already on
> >> the same CPU...
> > 
> > Sure, but see above.
> 
> The lb-routine could not provide enough help for a deep group, since an
> imbalance inside the group does not cause an imbalance at root: ideally
> each l2-task contributes 1024/18 ~= 56 root-load, which is easily
> ignored, while inside the l2-group the gathered case can already mean an
> imbalance like (1024 * 5) : 1024.

your explanation is not making sense, we have 3 cgroups, so the total
root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

And again, the absolute value doesn't matter, with (istr) 12 cpus the
avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
scale.

Same with l2, total weight of 1024, giving a per task weight of ~56 and
a per-cpu weight of ~85, which is again significant.

Also, you said load-balance doesn't usually participate much because
dbench is too fast, so please make up your mind, does it or doesn't it
matter?

> > So I think that approach is wrong, select_idle_sibling() works because
> > we want to keep CPUs from being idle, but if they're not actually idle,
> > pretending like they are (in a cgroup) is actively wrong and can skew
> > load pretty bad.
> 
> We only choose that timing when no idle cpu could be located, the flips
> are somewhat high, and the group is deep.

-enotmakingsense

> In such cases select_idle_sibling() doesn't work anyway; it returns the
> target even if it is very busy. We just check twice to prevent it from
> making some obviously bad decision ;-)

-emakinglesssense

> > Furthermore, if as I expect, dbench sucks on a busy system, then the
> > proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> > alter behaviour like that.
> 
> That's true, and that's why we currently still need to turn off the
> GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
> solve later...

more confusion..

> What we currently expect is that the cgroup assigns the resources
> according to the shares; it works well in l1-groups, so we expect it to
> work equally well in l2-groups...

Sure, but explain why it isn't? So far you're just saying words that
don't compute.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-06-11  8:24                                     ` Peter Zijlstra
@ 2014-06-11  9:18                                       ` Michael wang
  2014-06-23  9:42                                         ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Michael wang @ 2014-06-11  9:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On 06/11/2014 04:24 PM, Peter Zijlstra wrote:
[snip]
>>
>> IMHO, when we put tasks one group deeper, in other words the total
>> weight of these tasks is 1024 (previously 3072), the load looks more
>> balanced at root level, which makes the lb-routine consider the system
>> balanced, so we migrate less in the lb-routine.
> 
> But how? The absolute value (1024 vs 3072) is of no effect to the
> imbalance, the imbalance is computed from relative differences between
> cpus.

Ok, forgive me for the confusion; please allow me to explain things
again. For gathered cases like:

		cpu 0		cpu 1

		dbench		task_sys
		dbench		task_sys
		dbench
		dbench
		dbench
		dbench
		task_sys
		task_sys

task_sys are other tasks belonging to root at nice 0, so when the
dbench tasks are in l1:

		cpu 0			cpu 1
	load	1024 + 1024*2		1024*2

		3072: 2048	imbalance %150

now when they belong to l2:

		cpu 0			cpu 1
	load	1024/3 + 1024*2		1024*2

		2389 : 2048	imbalance %116

And it could be even less during my testing...

This is just to explain that as 'group_load : rq_load' becomes lower,
the group's influence on 'rq_load' becomes lower too; and if the system
is balanced by 'rq_load' alone, it will be considered balanced even when
the 'group_load' is gathered on one cpu.
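
A quick re-derivation of those two percentages (pure arithmetic on the
numbers above, nothing measured):

---
#!/bin/bash
# Re-derive the imbalance percentages from the load figures above.
awk 'BEGIN {
	l1  = 1024 + 1024*2;	# cpu0: dbench group at l1 + 2 task_sys
	l2  = 1024/3 + 1024*2;	# cpu0: same group one level deeper
	ref = 1024*2;		# cpu1: 2 task_sys
	printf "l1 case: %d : %d -> %d%%\n", l1, ref, 100*l1/ref;
	printf "l2 case: %d : %d -> %d%%\n", l2, ref, 100*l2/ref;
}'
---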

Please let me know if I missed something here...

> 
[snip]
>>
>> Although the l1-group gains the same resources (1200%), it doesn't
>> assign them to l2-ABC correctly the way the root-group did.
> 
> But in this case select_idle_sibling() should function identically, so
> that cannot be the problem.

Yes, that part is clean; select_idle_sibling() just returns curr or
prev cpu in this case.

> 
[snip]
>>
>> Exactly. However, when the group is deep, its chance of making the root
>> imbalanced is reduced: in the good case, gathering on one cpu means 1024
>> load, while in the bad case it drops to 1024/3 ideally, which makes it
>> harder to trigger an imbalance and get help from the routine. Please
>> note that although dbench and stress are the only workload in the
>> system, there are still other tasks serving the system that need to be
>> woken up (some very actively, because of the dbench...); compared to
>> them, a deep group's load means nothing...
> 
> What tasks are these? And is it their interference that disturbs
> load-balancing?

These are the dbench and stress tasks, which carry less root-load when
put into l2-groups; that makes it harder to trigger a root-group
imbalance as in the case above.

> 
>>>> By which means even when the tasks of a deep group all gather on one CPU,
>>>> the load can still look balanced from the view of the root group, and the
>>>> tasks lose the only chance (balance) to spread once they are already on
>>>> the same CPU...
>>>
>>> Sure, but see above.
>>
>> The lb-routine could not provide enough help for a deep group, since an
>> imbalance inside the group does not cause an imbalance at root: ideally
>> each l2-task contributes 1024/18 ~= 56 root-load, which is easily
>> ignored, while inside the l2-group the gathered case can already mean an
>> imbalance like (1024 * 5) : 1024.
> 
> your explanation is not making sense, we have 3 cgroups, so the total
> root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

I mean the l2-groups case here... since the l1 share is 1024, the total
load of the l2-groups will be 1024 in theory.

> 
> And again, the absolute value doesn't matter, with (istr) 12 cpus the
> avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
> scale.
> 
> Same with l2, total weight of 1024, giving a per task weight of ~56 and
> a per-cpu weight of ~85, which is again significant.

We have other tasks which have to run in the system in order to serve
dbench and the others, and that is also the case in the real world:
dbench and stress are not the only tasks on the rq from time to time.

Maybe we could focus on the case above first, and see if that makes
things clearer?

Regards,
Michael Wang

> 
> Also, you said load-balance doesn't usually participate much because
> dbench is too fast, so please make up your mind, does it or doesn't it
> matter?
> 
>>> So I think that approach is wrong, select_idle_sibling() works because
>>> we want to keep CPUs from being idle, but if they're not actually idle,
>>> pretending like they are (in a cgroup) is actively wrong and can skew
>>> load pretty bad.
>>
>> We only choose that timing when no idle cpu could be located, the flips
>> are somewhat high, and the group is deep.
> 
> -enotmakingsense
> 
>> In such cases select_idle_sibling() doesn't work anyway; it returns the
>> target even if it is very busy. We just check twice to prevent it from
>> making some obviously bad decision ;-)
> 
> -emakinglesssense
> 
>>> Furthermore, if as I expect, dbench sucks on a busy system, then the
>>> proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
>>> alter behaviour like that.
>>
>> That's true, and that's why we currently still need to turn off the
>> GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
>> solve later...
> 
> more confusion..
> 
>> What we currently expect is that the cgroup assigns the resources
>> according to the shares; it works well in l1-groups, so we expect it to
>> work equally well in l2-groups...
> 
> Sure, but explain why it isn't? So far you're just saying words that
> don't compute.
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-06-11  9:18                                       ` Michael wang
@ 2014-06-23  9:42                                         ` Peter Zijlstra
  2014-06-24  3:10                                           ` Michael wang
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-06-23  9:42 UTC (permalink / raw)
  To: Michael wang
  Cc: Mike Galbraith, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

On Wed, Jun 11, 2014 at 05:18:29PM +0800, Michael wang wrote:
> On 06/11/2014 04:24 PM, Peter Zijlstra wrote:
> [snip]
> >>
> >> IMHO, when we put tasks one group deeper, in other words the total
> >> weight of these tasks is 1024 (previously 3072), the load looks more
> >> balanced at root level, which makes the lb-routine consider the system
> >> balanced, so we migrate less in the lb-routine.
> > 
> > But how? The absolute value (1024 vs 3072) is of no effect to the
> > imbalance, the imbalance is computed from relative differences between
> > cpus.
> 
> Ok, forgive me for the confusion; please allow me to explain things
> again. For gathered cases like:
> 
> 		cpu 0		cpu 1
> 
> 		dbench		task_sys
> 		dbench		task_sys
> 		dbench
> 		dbench
> 		dbench
> 		dbench
> 		task_sys
> 		task_sys

It might help if you prefix each task with the cgroup they're in; but I
think I get it, it's like:

	cpu0

	A/dbench
	A/dbench
	A/dbench
	A/dbench
	A/dbench
	A/dbench
	/task_sys
	/task_sys

> task_sys are other tasks belonging to root at nice 0, so when the
> dbench tasks are in l1:
> 
> 		cpu 0			cpu 1
> 	load	1024 + 1024*2		1024*2
> 
> 		3072: 2048	imbalance %150
> 
> now when they belong to l2:

That would be:

	cpu0

	A/B/dbench
	A/B/dbench
	A/B/dbench
	A/B/dbench
	A/B/dbench
	A/B/dbench
	/task_sys
	/task_sys

Right?

> 		cpu 0			cpu 1
> 	load	1024/3 + 1024*2		1024*2
> 
> 		2389 : 2048	imbalance %116

Which should still end up with 3072, because A is still 1024 in total,
and all its member tasks run on the one CPU.

> And it could be even less during my testing...

Well, yes, up to 1024/nr_cpus I imagine.

> This is just to explain that as 'group_load : rq_load' becomes lower,
> the group's influence on 'rq_load' becomes lower too; and if the system
> is balanced by 'rq_load' alone, it will be considered balanced even when
> the 'group_load' is gathered on one cpu.
> 
> Please let me know if I missed something here...

Yeah, what other tasks are these task_sys things? workqueue crap?

> >> Exactly. However, when the group is deep, its chance of making the root
> >> imbalanced is reduced: in the good case, gathering on one cpu means 1024
> >> load, while in the bad case it drops to 1024/3 ideally, which makes it
> >> harder to trigger an imbalance and get help from the routine. Please
> >> note that although dbench and stress are the only workload in the
> >> system, there are still other tasks serving the system that need to be
> >> woken up (some very actively, because of the dbench...); compared to
> >> them, a deep group's load means nothing...
> > 
> > What tasks are these? And is it their interference that disturbs
> > load-balancing?
> 
> These are the dbench and stress tasks, which carry less root-load when
> put into l2-groups; that makes it harder to trigger a root-group
> imbalance as in the case above.

You're still not making sense here.. without the task_sys thingies in,
you get something like:

 cpu0		cpu1

 A/dbench	A/dbench
 B/stress	B/stress

And the total loads are: 512+512 vs 512+512.

> > Same with l2, total weight of 1024, giving a per task weight of ~56 and
> > a per-cpu weight of ~85, which is again significant.
> 
> We have other tasks which have to run in the system in order to serve
> dbench and the others, and that is also the case in the real world:
> dbench and stress are not the only tasks on the rq from time to time.
> 
> Maybe we could focus on the case above first, and see if that makes
> things clearer?

Well, this all smells like you need some cgroup affinity for whatever
system tasks are running. Not fuck up the scheduler for no sane reason.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
  2014-06-23  9:42                                         ` Peter Zijlstra
@ 2014-06-24  3:10                                           ` Michael wang
  0 siblings, 0 replies; 28+ messages in thread
From: Michael wang @ 2014-06-24  3:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Rik van Riel, LKML, Ingo Molnar, Alex Shi,
	Paul Turner, Mel Gorman, Daniel Lezcano

Hi, Peter

Thanks for the reply :)

On 06/23/2014 05:42 PM, Peter Zijlstra wrote:
[snip]
>>
>> 		cpu 0		cpu 1
>>
>> 		dbench		task_sys
>> 		dbench		task_sys
>> 		dbench
>> 		dbench
>> 		dbench
>> 		dbench
>> 		task_sys
>> 		task_sys
> 
> It might help if you prefix each task with the cgroup they're in;

My bad...

> but I
> think I get it, it's like:
> 
> 	cpu0
> 
> 	A/dbench
> 	A/dbench
> 	A/dbench
> 	A/dbench
> 	A/dbench
> 	A/dbench
> 	/task_sys
> 	/task_sys

Yeah, it's like that.

> 
[snip]
> 
> 	cpu0
> 
> 	A/B/dbench
> 	A/B/dbench
> 	A/B/dbench
> 	A/B/dbench
> 	A/B/dbench
> 	A/B/dbench
> 	/task_sys
> 	/task_sys
> 
> Right?

My bad, I missed the group level here... it's actually like:

	cpu0

	/l1/A/dbench
	/l1/A/dbench
	/l1/A/dbench
	/l1/A/dbench
	/l1/A/dbench
	/l1/A/dbench
	/task_sys
	/task_sys

And we also have six:

	/l1/B/stress

and six:

	/l1/C/stress

running in the system.

A, B and C are the child groups of l1.
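
(A quick way to double-check that layout, with hypothetical paths:)

---
# List what actually runs in each of the three groups.
for g in /cgroup/l1/A /cgroup/l1/B /cgroup/l1/C; do
	echo "$g:"
	ps -o comm= -p "$(paste -sd, $g/tasks)" | sort | uniq -c
done
---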

> 
>> 		cpu 0			cpu 1
>> 	load	1024/3 + 1024*2		1024*2
>>
>> 		2389 : 2048	imbalance %116
> 
> Which should still end up with 3072, because A is still 1024 in total,
> and all its member tasks run on the one CPU.

l1 has 3 child groups, each with 6 nice-0 tasks, so ideally each task
gets 1024/18 load; 6 dbench tasks then mean (1024/18)*6 == 1024/3.

Previously each of the 3 groups got 1024 shares; now they have to share
a single 1024, so each of them ends up with less.
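
(To make that concrete, this is what the knobs would look like; the
paths are assumptions for illustration:)

---
# The per-group knob itself never changes, only its effective root weight.
cat /cgroup/A/cpu.shares	# 1024, competing directly at root (good case)
cat /cgroup/l1/cpu.shares	# 1024 for the whole l1 subtree (bad case)
cat /cgroup/l1/A/cpu.shares	# still reads 1024, but A, B and C now
				# split l1's single 1024 between them
---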

> 
>> And it could be even less during my testing...
> 
> Well, yes, up to 1024/nr_cpus I imagine.
> 
>> This is just to explain that as 'group_load : rq_load' becomes lower,
>> the group's influence on 'rq_load' becomes lower too; and if the system
>> is balanced by 'rq_load' alone, it will be considered balanced even when
>> the 'group_load' is gathered on one cpu.
>>
>> Please let me know if I missed something here...
> 
> Yeah, what other tasks are these task_sys things? workqueue crap?

There are some other tasks, but the ones that mostly show up are the
kworkers, yes, the workqueue stuff.

They show up rapidly on each CPU, and in periods when they show up too
much they eat some CPU% too, though not very much.

> 
[snip]
>>
>> These are the dbench and stress tasks, which carry less root-load when
>> put into l2-groups; that makes it harder to trigger a root-group
>> imbalance as in the case above.
> 
> You're still not making sense here.. without the task_sys thingies in,
> you get something like:
> 
>  cpu0		cpu1
> 
>  A/dbench	A/dbench
>  B/stress	B/stress
> 
> And the total loads are: 512+512 vs 512+512.

Without the influence of other tasks I believe the balance would be
fine, but in our case at least these kworkers join the battle anyway...

> 
>>> Same with l2, total weight of 1024, giving a per task weight of ~56 and
>>> a per-cpu weight of ~85, which is again significant.
>>
>> We have other tasks which have to run in the system in order to serve
>> dbench and the others, and that is also the case in the real world:
>> dbench and stress are not the only tasks on the rq from time to time.
>>
>> Maybe we could focus on the case above first, and see if that makes
>> things clearer?
> 
> Well, this all smells like you need some cgroup affinity for whatever
> system tasks are running. Not fuck up the scheduler for no sane reason.

These kworkers are already bound to their CPUs; I don't know how to
handle them to prevent the issue. They just keep working on their CPUs,
and whenever they show up, dbench spreads out less actively...
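
(Here is roughly how we watch them; a sampling one-liner for
illustration, not from our original logs:)

---
# Count kworker threads per CPU; psr is the cpu a thread last ran on.
ps -eLo psr,comm | awk '/kworker/ { n[$1]++ }
	END { for (c in n) printf "cpu%s: %d kworkers\n", c, n[c] }'
---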

We just want a way that helps a workload like dbench work normally with
cpu-group when stress-like workloads are running in the system.

We want dbench to gain more CPU%, but cpu.shares doesn't work as
expected... dbench can get no more than 100% no matter how big its
group's shares are, and we consider cpu-group broken in this case...
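
(For reference, one way to read a group's CPU% over an interval,
assuming the cpuacct controller is co-mounted on the same hierarchy;
the path is illustrative:)

---
#!/bin/bash
# Sample cpuacct.usage (cumulative ns across all cpus) over 10 seconds.
G=/cgroup/l1/A		# hypothetical path of the dbench group
t0=$(cat $G/cpuacct.usage); sleep 10; t1=$(cat $G/cpuacct.usage)
echo "CPU%: $(( (t1 - t0) / 100000000 ))"	# delta_ns / (10s * 1e9) * 100
---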

I agree that this is not a generic requirement and the scheduler should
only be responsible for the general situation, but since it's really too
big a regression, could we at least provide some way to stop the damage?
After all, most of the cpu-group logic is inside the scheduler...

I'd like to list some real numbers in the patch thread; we really desire
some way to make cpu-group perform normally on workloads like dbench.
Actually, we found some transaction workloads suffering from this issue
too; in such cases, cpu-group simply fails at managing the CPU
resources...

Regards,
Michael Wang

> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-06-24  3:10 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-13  3:34 [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays? Michael wang
2014-05-13  9:47 ` Peter Zijlstra
2014-05-13 13:36   ` Rik van Riel
2014-05-13 14:23     ` Peter Zijlstra
2014-05-14  3:27       ` Michael wang
2014-05-14  7:36       ` Michael wang
2014-05-14  9:44         ` Peter Zijlstra
2014-05-15  3:46           ` Michael wang
2014-05-15  8:35             ` Peter Zijlstra
2014-05-15  8:46               ` Michael wang
2014-05-15  9:06                 ` Peter Zijlstra
2014-05-15  9:35                   ` Michael wang
2014-05-15 11:57                     ` Peter Zijlstra
2014-05-16  2:23                       ` Michael wang
2014-05-16  2:51                         ` Mike Galbraith
2014-05-16  4:24                           ` Michael wang
2014-05-16  7:54                             ` Peter Zijlstra
2014-05-16  8:15                               ` Michael wang
2014-06-10  8:56                               ` Michael wang
2014-06-10 12:12                                 ` Peter Zijlstra
2014-06-11  6:13                                   ` Michael wang
2014-06-11  8:24                                     ` Peter Zijlstra
2014-06-11  9:18                                       ` Michael wang
2014-06-23  9:42                                         ` Peter Zijlstra
2014-06-24  3:10                                           ` Michael wang
2014-05-16  7:48                         ` Peter Zijlstra
2014-05-14  3:21     ` Michael wang
2014-05-14  3:16   ` Michael wang
