All of lore.kernel.org
 help / color / mirror / Atom feed
* [git] CFS-devel, latest code
@ 2007-09-24 21:45 Ingo Molnar
  2007-09-24 21:55 ` Andrew Morton
                   ` (4 more replies)
  0 siblings, 5 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-09-24 21:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton


The latest sched-devel.git tree can be pulled from:
 
  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

Lots of scheduler updates in the past few days, done by many people. 
Most importantly, the SMP latency problems reported and debugged by Mike 
Galbraith should be fixed for good now.

I've also included the latest and greatest group-fairness scheduling 
patch from Srivatsa Vaddagiri, which can now be used without containers 
as well (in a simplified, each-uid-gets-its-fair-share mode). This 
feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.

Peter Zijlstra has been busy enhancing the math of the scheduler: we've 
got the new 'vslice' forked-task code that should enable snappier shell 
commands during load while still keeping kbuild workloads in check.

On my testsystems this codebase starts looking like something that could 
be merged into v2.6.24, so please give it a good workout and let us know 
if there's anything bad going on. (If this works out fine then i'll 
propagate these changes back into the CFS backport, for wider testing.)

	Ingo

----------------------------------------->
the shortlog relative to 2.6.23-rc7:

Dmitry Adamushko (8):
      sched: clean up struct load_stat
      sched: clean up schedstat block in dequeue_entity()
      sched: sched_setscheduler() fix
      sched: add set_curr_task() calls
      sched: do not keep current in the tree and get rid of sched_entity::fair_key
      sched: optimize task_new_fair()
      sched: simplify sched_class::yield_task()
      sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()

Ingo Molnar (41):
      sched: fix new-task method
      sched: resched task in task_new_fair()
      sched: small sched_debug cleanup
      sched: debug: track maximum 'slice'
      sched: uniform tunings
      sched: use constants if !CONFIG_SCHED_DEBUG
      sched: remove stat_gran
      sched: remove precise CPU load
      sched: remove precise CPU load calculations #2
      sched: track cfs_rq->curr on !group-scheduling too
      sched: cleanup: simplify cfs_rq_curr() methods
      sched: uninline __enqueue_entity()/__dequeue_entity()
      sched: speed up update_load_add/_sub()
      sched: clean up calc_weighted()
      sched: introduce se->vruntime
      sched: move sched_feat() definitions
      sched: optimize vruntime based scheduling
      sched: simplify check_preempt() methods
      sched: wakeup granularity fix
      sched: add se->vruntime debugging
      sched: add more vruntime statistics
      sched: debug: update exec_clock only when SCHED_DEBUG
      sched: remove wait_runtime limit
      sched: remove wait_runtime fields and features
      sched: x86: allow single-depth wchan output
      sched: fix delay accounting performance regression
      sched: prettify /proc/sched_debug output
      sched: enhance debug output
      sched: kernel/sched_fair.c whitespace cleanups
      sched: fair-group sched, cleanups
      sched: enable CONFIG_FAIR_GROUP_SCHED=y by default
      sched debug: BKL usage statistics
      sched: remove unneeded tunables
      sched debug: print settings
      sched debug: more width for parameter printouts
      sched: entity_key() fix
      sched: remove condition from set_task_cpu()
      sched: remove last_min_vruntime effect
      sched: undo some of the recent changes
      sched: fix place_entity()
      sched: fix sched_fork()

Matthias Kaehlcke (1):
      sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (2):
      sched: fix SMP migration latencies
      sched: fix formatting of /proc/sched_debug

Peter Zijlstra (10):
      sched: simplify SCHED_FEAT_* code
      sched: new task placement for vruntime
      sched: simplify adaptive latency
      sched: clean up new task placement
      sched: add tree based averages
      sched: handle vruntime overflow
      sched: better min_vruntime tracking
      sched: add vslice
      sched debug: check spread
      sched: max_vruntime() simplification

Srivatsa Vaddagiri (7):
      sched: group-scheduler core
      sched: revert recent removal of set_curr_task()
      sched: fix minor bug in yield
      sched: print nr_running and load in /proc/sched_debug
      sched: print &rq->cfs stats
      sched: clean up code under CONFIG_FAIR_GROUP_SCHED
      sched: add fair-user scheduler

 arch/i386/Kconfig       |   11 
 include/linux/sched.h   |   45 +--
 init/Kconfig            |   21 +
 kernel/sched.c          |  547 +++++++++++++++++++++++++------------
 kernel/sched_debug.c    |  248 +++++++++++------
 kernel/sched_fair.c     |  692 +++++++++++++++++-------------------------------
 kernel/sched_idletask.c |    5 
 kernel/sched_rt.c       |   12 
 kernel/sched_stats.h    |    4 
 kernel/sysctl.c         |   22 -
 kernel/user.c           |   43 ++
 11 files changed, 906 insertions(+), 744 deletions(-)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-24 21:45 [git] CFS-devel, latest code Ingo Molnar
@ 2007-09-24 21:55 ` Andrew Morton
  2007-09-24 21:59   ` Ingo Molnar
  2007-09-25  0:08 ` Daniel Walker
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2007-09-24 21:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko

On Mon, 24 Sep 2007 23:45:37 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> The latest sched-devel.git tree can be pulled from:
>  
>   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

I'm pulling linux-2.6-sched.git, and it's oopsing all over the place on
ia64, and Lee's observations about set_leftmost()'s weirdness are
pertinent.

Should I instead be pulling linux-2.6-sched-devel.git?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-24 21:55 ` Andrew Morton
@ 2007-09-24 21:59   ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-09-24 21:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 24 Sep 2007 23:45:37 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > The latest sched-devel.git tree can be pulled from:
> >  
> >   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
> 
> I'm pulling linux-2.6-sched.git, and it's oopsing all over the place 
> on ia64, and Lee's observations about set_leftmost()'s weirdness are 
> pertinent.
> 
> Should I instead be pulling linux-2.6-sched-devel.git?

yeah, please pull that one.

linux-2.6-sched.git by mistake contained an older sched-devel tree for 
about a day (where your scripts picked it up). I've restored that one to 
-rc7 meanwhile. It's only supposed to contain strict fixes for upstream. 
(none at the moment)

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-24 21:45 [git] CFS-devel, latest code Ingo Molnar
  2007-09-24 21:55 ` Andrew Morton
@ 2007-09-25  0:08 ` Daniel Walker
  2007-09-25  6:45   ` Ingo Molnar
  2007-09-25  6:10 ` Mike Galbraith
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 53+ messages in thread
From: Daniel Walker @ 2007-09-25  0:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko, Andrew Morton

On Mon, 2007-09-24 at 23:45 +0200, Ingo Molnar wrote:
> Lots of scheduler updates in the past few days, done by many people. 
> Most importantly, the SMP latency problems reported and debugged by
> Mike 
> Galbraith should be fixed for good now. 

Does this have anything to do with idle balancing ? I noticed some
fairly large latencies in that code in 2.6.23-rc's ..

Daniel


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-24 21:45 [git] CFS-devel, latest code Ingo Molnar
  2007-09-24 21:55 ` Andrew Morton
  2007-09-25  0:08 ` Daniel Walker
@ 2007-09-25  6:10 ` Mike Galbraith
  2007-09-25  7:35   ` Mike Galbraith
  2007-09-25  6:50 ` [git] CFS-devel, latest code S.Çağlar Onur
  2007-09-25  7:41 ` Andrew Morton
  4 siblings, 1 reply; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25  6:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Mon, 2007-09-24 at 23:45 +0200, Ingo Molnar wrote:

> Mike Galbraith (2):
>       sched: fix SMP migration latencies
>       sched: fix formatting of /proc/sched_debug

Off-by-one bug in attribution, rocks and sticks (down boy!) don't
count ;-)  I just built, and will spend the morning beating on it... no
news is good news.

	-Mike


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  0:08 ` Daniel Walker
@ 2007-09-25  6:45   ` Ingo Molnar
  2007-09-25 15:17     ` Daniel Walker
  0 siblings, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25  6:45 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko, Andrew Morton


* Daniel Walker <dwalker@mvista.com> wrote:

> On Mon, 2007-09-24 at 23:45 +0200, Ingo Molnar wrote:
> > Lots of scheduler updates in the past few days, done by many people. 
> > Most importantly, the SMP latency problems reported and debugged by
> > Mike 
> > Galbraith should be fixed for good now. 
> 
> Does this have anything to do with idle balancing ? I noticed some 
> fairly large latencies in that code in 2.6.23-rc's ..

any measurements?

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-24 21:45 [git] CFS-devel, latest code Ingo Molnar
                   ` (2 preceding siblings ...)
  2007-09-25  6:10 ` Mike Galbraith
@ 2007-09-25  6:50 ` S.Çağlar Onur
  2007-09-25  9:17   ` Ingo Molnar
  2007-09-25  7:41 ` Andrew Morton
  4 siblings, 1 reply; 53+ messages in thread
From: S.Çağlar Onur @ 2007-09-25  6:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko, Andrew Morton

Hi;

25 Eyl 2007 Sal tarihinde, Ingo Molnar şunları yazmıştı: 
> 
> The latest sched-devel.git tree can be pulled from:
>  
>   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
> 
> Lots of scheduler updates in the past few days, done by many people. 
> Most importantly, the SMP latency problems reported and debugged by Mike 
> Galbraith should be fixed for good now.
> 
> I've also included the latest and greatest group-fairness scheduling 
> patch from Srivatsa Vaddagiri, which can now be used without containers 
> as well (in a simplified, each-uid-gets-its-fair-share mode). This 
> feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.
> 
> Peter Zijlstra has been busy enhancing the math of the scheduler: we've 
> got the new 'vslice' forked-task code that should enable snappier shell 
> commands during load while still keeping kbuild workloads in check.
> 
> On my testsystems this codebase starts looking like something that could 
> be merged into v2.6.24, so please give it a good workout and let us know 
> if there's anything bad going on. (If this works out fine then i'll 
> propagate these changes back into the CFS backport, for wider testing.)

Seems like following trivial change needed to compile without CONFIG_SCHEDSTATS

caglar@zangetsu linux-2.6 $ LC_ALL=C make
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  CALL    scripts/checksyscalls.sh
  CHK     include/linux/compile.h
  CC      kernel/sched.o
In file included from kernel/sched.c:853:
kernel/sched_debug.c: In function `print_cfs_rq':
kernel/sched_debug.c:139: error: structure has no member named `bkl_cnt'
kernel/sched_debug.c:139: error: structure has no member named `bkl_cnt'
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2

Signed-off-by: S.Çağlar Onur <caglar@pardus.org.tr>

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index b68e593..4659c90 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -136,8 +136,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			SPLIT_NS(spread0));
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
+#ifdef CONFIG_SCHEDSTATS
 	SEQ_printf(m, "  .%-30s: %ld\n", "bkl_cnt",
 			rq->bkl_cnt);
+#endif
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 }


Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  6:10 ` Mike Galbraith
@ 2007-09-25  7:35   ` Mike Galbraith
  2007-09-25  8:33     ` Mike Galbraith
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25  7:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 810 bytes --]

On Tue, 2007-09-25 at 08:10 +0200, Mike Galbraith wrote:
>  no news is good news.

Darn, have news: latency thing isn't dead.  Two busy loops, one at nice
0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
latencies below for nice -5 Xorg.  Didn't kill the box though.

se.wait_max              :            10.068169
se.wait_max              :             7.465334
se.wait_max              :           135.501816
se.wait_max              :             0.884483
se.wait_max              :           144.218955
se.wait_max              :           128.578376
se.wait_max              :            93.975768
se.wait_max              :             4.965965
se.wait_max              :           113.655533
se.wait_max              :             4.301075

sched_debug (attached) is.. strange.

	-Mike

[-- Attachment #2: sched_debug.gz --]
[-- Type: application/x-gzip, Size: 9829 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-24 21:45 [git] CFS-devel, latest code Ingo Molnar
                   ` (3 preceding siblings ...)
  2007-09-25  6:50 ` [git] CFS-devel, latest code S.Çağlar Onur
@ 2007-09-25  7:41 ` Andrew Morton
  2007-09-25  8:43   ` Srivatsa Vaddagiri
  4 siblings, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2007-09-25  7:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko

On Mon, 24 Sep 2007 23:45:37 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> The latest sched-devel.git tree can be pulled from:
>  
>   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

This doornails the Vaio.  After grub handover the screen remains black
and the fan goes whir.

http://userweb.kernel.org/~akpm/config-sony.txt

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  7:35   ` Mike Galbraith
@ 2007-09-25  8:33     ` Mike Galbraith
  2007-09-25  8:53       ` Srivatsa Vaddagiri
  2007-09-25  9:13       ` Ingo Molnar
  0 siblings, 2 replies; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25  8:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 09:35 +0200, Mike Galbraith wrote:

> Darn, have news: latency thing isn't dead.  Two busy loops, one at nice
> 0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
> latencies below for nice -5 Xorg.  Didn't kill the box though.
> 
> se.wait_max              :            10.068169
> se.wait_max              :             7.465334
> se.wait_max              :           135.501816
> se.wait_max              :             0.884483
> se.wait_max              :           144.218955
> se.wait_max              :           128.578376
> se.wait_max              :            93.975768
> se.wait_max              :             4.965965
> se.wait_max              :           113.655533
> se.wait_max              :             4.301075
> 
> sched_debug (attached) is.. strange.

Disabling CONFIG_FAIR_GROUP_SCHED fixed both.  Latencies of up to 336ms
hit me during the recompile (make -j3), with nothing else running.
Since reboot, latencies are, so far, very very nice.  I'm leaving it
disabled for now.

	-Mike 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  7:41 ` Andrew Morton
@ 2007-09-25  8:43   ` Srivatsa Vaddagiri
  2007-09-25  8:48     ` Andrew Morton
  2007-09-25 11:00     ` Ingo Molnar
  0 siblings, 2 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-25  8:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Mike Galbraith,
	Dhaval Giani, Dmitry Adamushko

On Tue, Sep 25, 2007 at 12:41:20AM -0700, Andrew Morton wrote:
> This doornails the Vaio.  After grub handover the screen remains black
> and the fan goes whir.
> 
> http://userweb.kernel.org/~akpm/config-sony.txt

This seems to be UP regression. Sorry abt that. I could recreate 
the problem very easily with CONFIG_SMP turned off.

Can you check if this patch works? Works for me here.

--

Fix UP breakage.

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 kernel/sched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1029,8 +1029,8 @@ static inline void __set_task_cpu(struct
 {
 #ifdef CONFIG_SMP
 	task_thread_info(p)->cpu = cpu;
-	set_task_cfs_rq(p);
 #endif
+	set_task_cfs_rq(p);
 }
 
 #ifdef CONFIG_SMP

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  8:43   ` Srivatsa Vaddagiri
@ 2007-09-25  8:48     ` Andrew Morton
  2007-09-25 11:00     ` Ingo Molnar
  1 sibling, 0 replies; 53+ messages in thread
From: Andrew Morton @ 2007-09-25  8:48 UTC (permalink / raw)
  To: vatsa
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Mike Galbraith,
	Dhaval Giani, Dmitry Adamushko

On Tue, 25 Sep 2007 14:13:27 +0530 Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> On Tue, Sep 25, 2007 at 12:41:20AM -0700, Andrew Morton wrote:
> > This doornails the Vaio.  After grub handover the screen remains black
> > and the fan goes whir.
> > 
> > http://userweb.kernel.org/~akpm/config-sony.txt
> 
> This seems to be UP regression. Sorry abt that. I could recreate 
> the problem very easily with CONFIG_SMP turned off.
> 
> Can you check if this patch works? Works for me here.
> 
> --
> 
> Fix UP breakage.
> 
> Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> 
> 
> ---
>  kernel/sched.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: current/kernel/sched.c
> ===================================================================
> --- current.orig/kernel/sched.c
> +++ current/kernel/sched.c
> @@ -1029,8 +1029,8 @@ static inline void __set_task_cpu(struct
>  {
>  #ifdef CONFIG_SMP
>  	task_thread_info(p)->cpu = cpu;
> -	set_task_cfs_rq(p);
>  #endif
> +	set_task_cfs_rq(p);
>  }
>  
>  #ifdef CONFIG_SMP

yup, that's a fix.  It was 15 minutes too late for rc8-mm1 though :(

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  8:33     ` Mike Galbraith
@ 2007-09-25  8:53       ` Srivatsa Vaddagiri
  2007-09-25  9:11         ` Srivatsa Vaddagiri
  2007-09-25  9:12         ` Mike Galbraith
  2007-09-25  9:13       ` Ingo Molnar
  1 sibling, 2 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-25  8:53 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, Sep 25, 2007 at 10:33:27AM +0200, Mike Galbraith wrote:
> > Darn, have news: latency thing isn't dead.  Two busy loops, one at nice
> > 0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
> > latencies below for nice -5 Xorg.  Didn't kill the box though.
> > 
> > se.wait_max              :            10.068169
> > se.wait_max              :             7.465334
> > se.wait_max              :           135.501816
> > se.wait_max              :             0.884483
> > se.wait_max              :           144.218955
> > se.wait_max              :           128.578376
> > se.wait_max              :            93.975768
> > se.wait_max              :             4.965965
> > se.wait_max              :           113.655533
> > se.wait_max              :             4.301075
> > 
> > sched_debug (attached) is.. strange.

Mike,
	Do you have FAIR_USER_SCHED turned on as well? Can you send me
your .config pls?

Also how do you check se.wait_max?

> Disabling CONFIG_FAIR_GROUP_SCHED fixed both.  Latencies of up to 336ms
> hit me during the recompile (make -j3), with nothing else running.
> Since reboot, latencies are, so far, very very nice.  I'm leaving it
> disabled for now.

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  8:53       ` Srivatsa Vaddagiri
@ 2007-09-25  9:11         ` Srivatsa Vaddagiri
  2007-09-25  9:15           ` Mike Galbraith
  2007-09-25  9:12         ` Mike Galbraith
  1 sibling, 1 reply; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-25  9:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, Sep 25, 2007 at 02:23:29PM +0530, Srivatsa Vaddagiri wrote:
> On Tue, Sep 25, 2007 at 10:33:27AM +0200, Mike Galbraith wrote:
> > > Darn, have news: latency thing isn't dead.  Two busy loops, one at nice
> > > 0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
> > > latencies below for nice -5 Xorg.  Didn't kill the box though.

These busy loops - are they spawned by the same user? Is it the root
user? Also is this seen in UP mode also?

Can you also pls check if tuning root user's cpu share helps? Basically,

	# echo 4096 > /proc/root_user_share

[or any other higher value]

> Also how do you check se.wait_max?

Ok ..I see that it is in /proc/sched_debug.


-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  8:53       ` Srivatsa Vaddagiri
  2007-09-25  9:11         ` Srivatsa Vaddagiri
@ 2007-09-25  9:12         ` Mike Galbraith
  1 sibling, 0 replies; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25  9:12 UTC (permalink / raw)
  To: vatsa
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 342 bytes --]

On Tue, 2007-09-25 at 14:23 +0530, Srivatsa Vaddagiri wrote:

> Mike,
> 	Do you have FAIR_USER_SCHED turned on as well? Can you send me
> your .config pls?

I did have.  gzipped config attached.. this is current though, after
disabling groups.  I'm still beating on the basic changes (boy does it
ever feel nice [awaits other shoe]).

	-Mike

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 13280 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  8:33     ` Mike Galbraith
  2007-09-25  8:53       ` Srivatsa Vaddagiri
@ 2007-09-25  9:13       ` Ingo Molnar
  2007-09-25  9:17         ` Mike Galbraith
  2007-09-25  9:44         ` Srivatsa Vaddagiri
  1 sibling, 2 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25  9:13 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton


* Mike Galbraith <efault@gmx.de> wrote:

> > sched_debug (attached) is.. strange.
> 
> Disabling CONFIG_FAIR_GROUP_SCHED fixed both.  [...]

heh. Evil plan to enable the group scheduler by default worked out as 
planned! ;-) [guess how many container users would do ... interactivity 
tests like you do??]

> [...] Latencies of up to 336ms hit me during the recompile (make -j3), 
> with nothing else running. Since reboot, latencies are, so far, very 
> very nice. [...]

'very very nice' == 'best ever' ? :-)

> [...] I'm leaving it disabled for now.

ok, i'm too seeing some sort of latency weirdness with 
CONFIG_FAIR_GROUP_SCHED enabled, _if_ there's Xorg involved which runs 
under root uid on my box - and hence gets 50% of all CPU time.

Srivatsa, any ideas? It could either be an accounting buglet (less 
likely, seems like the group scheduling bits stick to the 50% splitup 
nicely), or a preemption buglet. One potential preemption buglet would 
be for the group scheduler to not properly preempt a running task when a 
task from another uid is woken?

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:11         ` Srivatsa Vaddagiri
@ 2007-09-25  9:15           ` Mike Galbraith
  0 siblings, 0 replies; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25  9:15 UTC (permalink / raw)
  To: vatsa
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 14:41 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Sep 25, 2007 at 02:23:29PM +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Sep 25, 2007 at 10:33:27AM +0200, Mike Galbraith wrote:
> > > > Darn, have news: latency thing isn't dead.  Two busy loops, one at nice
> > > > 0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
> > > > latencies below for nice -5 Xorg.  Didn't kill the box though.
> 
> These busy loops - are they spawned by the same user? Is it the root
> user? Also is this seen in UP mode also?
> 
> Can you also pls check if tuning root user's cpu share helps? Basically,
> 
> 	# echo 4096 > /proc/root_user_share
> 
> [or any other higher value]

I'll try these after I beat on the box some more.

	-Mike


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  6:50 ` [git] CFS-devel, latest code S.Çağlar Onur
@ 2007-09-25  9:17   ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25  9:17 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko, Andrew Morton


* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> Seems like following trivial change needed to compile without 
> CONFIG_SCHEDSTATS
> 
> caglar@zangetsu linux-2.6 $ LC_ALL=C make
>   CHK     include/linux/version.h
>   CHK     include/linux/utsrelease.h
>   CALL    scripts/checksyscalls.sh
>   CHK     include/linux/compile.h
>   CC      kernel/sched.o
> In file included from kernel/sched.c:853:
> kernel/sched_debug.c: In function `print_cfs_rq':
> kernel/sched_debug.c:139: error: structure has no member named `bkl_cnt'
> kernel/sched_debug.c:139: error: structure has no member named `bkl_cnt'
> make[1]: *** [kernel/sched.o] Error 1
> make: *** [kernel] Error 2
> 
> Signed-off-by: S.Çağlar Onur <caglar@pardus.org.tr>

thanks, applied!

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:13       ` Ingo Molnar
@ 2007-09-25  9:17         ` Mike Galbraith
  2007-09-25  9:47           ` Ingo Molnar
  2007-09-25  9:44         ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25  9:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 11:13 +0200, Ingo Molnar wrote:

> > [...] Latencies of up to 336ms hit me during the recompile (make -j3), 
> > with nothing else running. Since reboot, latencies are, so far, very 
> > very nice. [...]
> 
> 'very very nice' == 'best ever' ? :-)

Yes.  Very VERY nice feel.

	-Mike


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:44         ` Srivatsa Vaddagiri
@ 2007-09-25  9:40           ` Ingo Molnar
  2007-09-25 10:10             ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25  9:40 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> On Tue, Sep 25, 2007 at 11:13:31AM +0200, Ingo Molnar wrote:
> > ok, i'm too seeing some sort of latency weirdness with 
> > CONFIG_FAIR_GROUP_SCHED enabled, _if_ there's Xorg involved which runs 
> > under root uid on my box - and hence gets 50% of all CPU time.
> > 
> > Srivatsa, any ideas? It could either be an accounting buglet (less 
> > likely, seems like the group scheduling bits stick to the 50% splitup 
> > nicely), or a preemption buglet. One potential preemption buglet would 
> > be for the group scheduler to not properly preempt a running task when a 
> > task from another uid is woken?
> 
> Yep, I noticed that too.
> 
> check_preempt_wakeup()
> {
> 	...
> 
> 	if (is_same_group(curr, p)) {
> 	    ^^^^^^^^^^^^^
> 
> 		resched_task();
> 	}
> 
> }
> 
> Will try a fix to check for preemption at higher levels ..

i bet fixing this will increase precision of group scheduling as well. 
Those long latencies can be thought of as noise as well, and the 
fair-scheduling "engine" might not be capable to offset all sources of 
noise. So generally, while we allow a certain amount of lag in 
preemption decisions (wakeup-granularity, etc.), with which the fairness 
engine will cope just fine, we do not want to allow unlimited lag.

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:13       ` Ingo Molnar
  2007-09-25  9:17         ` Mike Galbraith
@ 2007-09-25  9:44         ` Srivatsa Vaddagiri
  2007-09-25  9:40           ` Ingo Molnar
  1 sibling, 1 reply; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-25  9:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, Sep 25, 2007 at 11:13:31AM +0200, Ingo Molnar wrote:
> ok, i'm too seeing some sort of latency weirdness with 
> CONFIG_FAIR_GROUP_SCHED enabled, _if_ there's Xorg involved which runs 
> under root uid on my box - and hence gets 50% of all CPU time.
> 
> Srivatsa, any ideas? It could either be an accounting buglet (less 
> likely, seems like the group scheduling bits stick to the 50% splitup 
> nicely), or a preemption buglet. One potential preemption buglet would 
> be for the group scheduler to not properly preempt a running task when a 
> task from another uid is woken?

Yep, I noticed that too.

check_preempt_wakeup()
{
	...

	if (is_same_group(curr, p)) {
	    ^^^^^^^^^^^^^

		resched_task();
	}

}

Will try a fix to check for preemption at higher levels ..

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:17         ` Mike Galbraith
@ 2007-09-25  9:47           ` Ingo Molnar
  2007-09-25 10:02             ` Mike Galbraith
                               ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25  9:47 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton


* Mike Galbraith <efault@gmx.de> wrote:

> On Tue, 2007-09-25 at 11:13 +0200, Ingo Molnar wrote:
> 
> > > [...] Latencies of up to 336ms hit me during the recompile (make -j3), 
> > > with nothing else running. Since reboot, latencies are, so far, very 
> > > very nice. [...]
> > 
> > 'very very nice' == 'best ever' ? :-)
> 
> Yes.  Very VERY nice feel.

cool :-)

Maybe there's more to come: if we can get CONFIG_FAIR_USER_SCHED to work 
properly then your Xorg will have a load-independent 50% of CPU time all 
to itself. (Group scheduling is quite impressive already: i can log in 
as root without feeling _any_ effect from a perpetual 'hackbench 100' 
running as uid mingo. Fork bombs no more.) Will the Amarok gforce plugin 
like that CPU time splitup? (or is most of the gforce overhead under 
your user uid?)

it could also work out negatively, _sometimes_ X does not like being too 
high prio. (weird as that might be.) So we'll see.

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:47           ` Ingo Molnar
@ 2007-09-25 10:02             ` Mike Galbraith
  2007-09-26  8:04             ` Mike Galbraith
  2007-09-28 21:46             ` Bill Davidsen
  2 siblings, 0 replies; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25 10:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 11:47 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Tue, 2007-09-25 at 11:13 +0200, Ingo Molnar wrote:
> > 
> > > > [...] Latencies of up to 336ms hit me during the recompile (make -j3), 
> > > > with nothing else running. Since reboot, latencies are, so far, very 
> > > > very nice. [...]
> > > 
> > > 'very very nice' == 'best ever' ? :-)
> > 
> > Yes.  Very VERY nice feel.
> 
> cool :-)
> 
> Maybe there's more to come: if we can get CONFIG_FAIR_USER_SCHED to work 
> properly then your Xorg will have a load-independent 50% of CPU time all 
> to itself. (Group scheduling is quite impressive already: i can log in 
> as root without feeling _any_ effect from a perpetual 'hackbench 100' 
> running as uid mingo. Fork bombs no more.) Will the Amarok gforce plugin 
> like that CPU time splitup? (or is most of the gforce overhead under 
> your user uid?)

I run everything as root (naughty me), so I'd have to change my evil
ways to reap the benefits.  (I'll do that to test, but it's unlikely to
ever become a permanent habit here)  Amarok/Gforce will definitely like
the user split as long as latency is low.  Visualizations are not only
bandwidth hungry, they're extremely latency sensitive.

	-Mike


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:40           ` Ingo Molnar
@ 2007-09-25 10:10             ` Ingo Molnar
  2007-09-25 10:28               ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25 10:10 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:
> 
> > On Tue, Sep 25, 2007 at 11:13:31AM +0200, Ingo Molnar wrote:
> > > ok, i'm too seeing some sort of latency weirdness with 
> > > CONFIG_FAIR_GROUP_SCHED enabled, _if_ there's Xorg involved which runs 
> > > under root uid on my box - and hence gets 50% of all CPU time.
> > > 
> > > Srivatsa, any ideas? It could either be an accounting buglet (less 
> > > likely, seems like the group scheduling bits stick to the 50% splitup 
> > > nicely), or a preemption buglet. One potential preemption buglet would 
> > > be for the group scheduler to not properly preempt a running task when a 
> > > task from another uid is woken?
> > 
> > Yep, I noticed that too.
> > 
> > check_preempt_wakeup()
> > {
> > 	...
> > 
> > 	if (is_same_group(curr, p)) {
> > 	    ^^^^^^^^^^^^^
> > 
> > 		resched_task();
> > 	}
> > 
> > }
> > 
> > Will try a fix to check for preemption at higher levels ..
> 
> i bet fixing this will increase precision of group scheduling as well. 
> Those long latencies can be thought of as noise as well, and the 
> fair-scheduling "engine" might not be capable to offset all sources of 
> noise. So generally, while we allow a certain amount of lag in 
> preemption decisions (wakeup-granularity, etc.), with which the 
> fairness engine will cope just fine, we do not want to allow unlimited 
> lag.

hm, i tried the naive patch. In theory the vruntime of all scheduling 
entities should be 'compatible' and comparable (that's the point behind 
using vruntime - the fairness engine drives each vruntime forward and 
tries to balance them).

So the patch below just removes the is_same_group() condition. But i can 
still see bad (and obvious) latencies with Mike's 2-hogs test:

 taskset 01 perl -e 'while (1) {}' &
 nice -19 taskset 02 perl -e 'while (1) {}' &

So something's amiss.

	Ingo

------------------->
Subject: sched: group scheduler wakeup latency fix
From: Ingo Molnar <mingo@elte.hu>

group scheduler wakeup latency fix.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -785,6 +785,7 @@ static void check_preempt_wakeup(struct 
 {
 	struct task_struct *curr = rq->curr;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	s64 delta;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -792,12 +793,10 @@ static void check_preempt_wakeup(struct 
 		resched_task(curr);
 		return;
 	}
-	if (is_same_group(curr, p)) {
-		s64 delta = curr->se.vruntime - p->se.vruntime;
+	delta = curr->se.vruntime - p->se.vruntime;
 
-		if (delta > (s64)sysctl_sched_wakeup_granularity)
-			resched_task(curr);
-	}
+	if (delta > (s64)sysctl_sched_wakeup_granularity)
+		resched_task(curr);
 }
 
 static struct task_struct *pick_next_task_fair(struct rq *rq)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 10:10             ` Ingo Molnar
@ 2007-09-25 10:28               ` Srivatsa Vaddagiri
  2007-09-25 10:36                 ` Ingo Molnar
  2007-09-25 12:28                 ` Mike Galbraith
  0 siblings, 2 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-25 10:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, Sep 25, 2007 at 12:10:44PM +0200, Ingo Molnar wrote:
> So the patch below just removes the is_same_group() condition. But i can 
> still see bad (and obvious) latencies with Mike's 2-hogs test:
> 
>  taskset 01 perl -e 'while (1) {}' &
>  nice -19 taskset 02 perl -e 'while (1) {}' &
> 
> So something's amiss.

While I try recreating this myself, I wonder if this patch helps?

---
 kernel/sched_fair.c |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -794,7 +794,8 @@ static void yield_task_fair(struct rq *r
 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
 {
 	struct task_struct *curr = rq->curr;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	struct cfs_rq *cfs_rq = task_cfs_rq(curr), *pcfs_rq;
+	struct sched_entity *se = &curr->se, *pse = &p->se;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -802,11 +803,19 @@ static void check_preempt_wakeup(struct 
 		resched_task(curr);
 		return;
 	}
-	if (is_same_group(curr, p)) {
-		s64 delta = curr->se.vruntime - p->se.vruntime;
 
-		if (delta > (s64)sysctl_sched_wakeup_granularity)
-			resched_task(curr);
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		pcfs_rq = cfs_rq_of(pse);
+
+		if (cfs_rq == pcfs_rq) {
+			s64 delta = se->vruntime - pse->vruntime;
+
+			if (delta > (s64)sysctl_sched_wakeup_granularity)
+				resched_task(curr);
+			break;
+		}
+		pse = pse->parent;
 	}
 }
 

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 10:28               ` Srivatsa Vaddagiri
@ 2007-09-25 10:36                 ` Ingo Molnar
  2007-09-25 11:33                   ` Ingo Molnar
  2007-09-25 12:51                   ` Srivatsa Vaddagiri
  2007-09-25 12:28                 ` Mike Galbraith
  1 sibling, 2 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25 10:36 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> On Tue, Sep 25, 2007 at 12:10:44PM +0200, Ingo Molnar wrote:
> > So the patch below just removes the is_same_group() condition. But i can 
> > still see bad (and obvious) latencies with Mike's 2-hogs test:
> > 
> >  taskset 01 perl -e 'while (1) {}' &
> >  nice -19 taskset 02 perl -e 'while (1) {}' &
> > 
> > So something's amiss.
> 
> While I try recreating this myself, I wonder if this patch helps?

you should be able to recreate this easily by booting with maxcpus=1 and 
the commands above - then run a few instances of chew-max (without them 
being bound to any particular CPUs) and the latencies should show up.

i have tried your patch and it does not solve the problem - i think 
there's a more fundamental bug lurking, besides the wakeup latency 
problem.

Find below a /proc/sched_debug output of a really large latency. The 
latency is caused by the _huge_ (~450 seconds!) vruntime offset that 
'loop_silent' and 'sshd' has:

            task   PID         tree-key  switches  prio     exec-runtime
     -------------------------------------------------------------------
     loop_silent  2391     55344.211189       203   120     55344.211189
            sshd  2440    513334.978030         4   120    513334.978030
R            cat  2496    513672.558835         4   120    513672.558835

hm. perhaps this fixup in kernel/sched.c:set_task_cpu():

        p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;

needs to become properly group-hierarchy aware?

	Ingo

------------------>
Sched Debug Version: v0.05-v20, 2.6.23-rc7 #89
now at 95878.065440 msecs
  .sysctl_sched_latency                    : 20.000000
  .sysctl_sched_min_granularity            : 2.000000
  .sysctl_sched_wakeup_granularity         : 2.000000
  .sysctl_sched_batch_wakeup_granularity   : 25.000000
  .sysctl_sched_child_runs_first           : 0.000001
  .sysctl_sched_features                   : 3

cpu#0, 1828.868 MHz
  .nr_running                    : 3
  .load                          : 3072
  .nr_switches                   : 32032
  .nr_load_updates               : 95906
  .nr_uninterruptible            : 4294967238
  .jiffies                       : 4294763202
  .next_balance                  : 4294.763420
  .curr->pid                     : 2496
  .clock                         : 95893.484495
  .idle_clock                    : 55385.089335
  .prev_clock_raw                : 84753.749367
  .clock_warps                   : 0
  .clock_overflows               : 1737
  .clock_deep_idle_events        : 71815
  .clock_max_delta               : 0.999843
  .cpu_load[0]                   : 3072
  .cpu_load[1]                   : 2560
  .cpu_load[2]                   : 2304
  .cpu_load[3]                   : 2176
  .cpu_load[4]                   : 2119

cfs_rq
  .exec_clock                    : 38202.223241
  .MIN_vruntime                  : 36334.281860
  .min_vruntime                  : 36334.279140
  .max_vruntime                  : 36334.281860
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 2
  .load                          : 3072
  .bkl_cnt                       : 3934
  .nr_spread_over                : 37

cfs_rq
  .exec_clock                    : 34769.316246
  .MIN_vruntime                  : 55344.211189
  .min_vruntime                  : 36334.279140
  .max_vruntime                  : 513334.978030
  .spread                        : 457990.766841
  .spread0                       : 0.000000
  .nr_running                    : 2
  .load                          : 2048
  .bkl_cnt                       : 3934
  .nr_spread_over                : 10

cfs_rq
  .exec_clock                    : 36.982394
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 36334.279140
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 0
  .load                          : 0
  .bkl_cnt                       : 3934
  .nr_spread_over                : 1

cfs_rq
  .exec_clock                    : 20.244893
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 36334.279140
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 0
  .load                          : 0
  .bkl_cnt                       : 3934
  .nr_spread_over                : 0

cfs_rq
  .exec_clock                    : 3305.155973
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 36334.279140
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 1
  .load                          : 1024
  .bkl_cnt                       : 3934
  .nr_spread_over                : 13

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
     loop_silent  2391     55344.211189       203   120     55344.211189     34689.036169        42.855690
            sshd  2440    513334.978030         4   120    513334.978030         0.726092         0.000000
R            cat  2496    513672.558835         4   120    513672.558835         0.656690        26.294621

cpu#1, 1828.868 MHz
  .nr_running                    : 3
  .load                          : 2063
  .nr_switches                   : 22792
  .nr_load_updates               : 95625
  .nr_uninterruptible            : 58
  .jiffies                       : 4294763202
  .next_balance                  : 4294.763333
  .curr->pid                     : 2427
  .clock                         : 95643.855219
  .idle_clock                    : 54735.436719
  .prev_clock_raw                : 84754.067800
  .clock_warps                   : 0
  .clock_overflows               : 2521
  .clock_deep_idle_events        : 120633
  .clock_max_delta               : 0.999843
  .cpu_load[0]                   : 2063
  .cpu_load[1]                   : 2063
  .cpu_load[2]                   : 2063
  .cpu_load[3]                   : 2063
  .cpu_load[4]                   : 2063

cfs_rq
  .exec_clock                    : 38457.557282
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 35360.227495
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -974.051645
  .nr_running                    : 1
  .load                          : 1024
  .bkl_cnt                       : 456
  .nr_spread_over                : 18

cfs_rq
  .exec_clock                    : 32536.311766
  .MIN_vruntime                  : 610653.297636
  .min_vruntime                  : 35360.227495
  .max_vruntime                  : 610702.945490
  .spread                        : 49.647854
  .spread0                       : -974.051645
  .nr_running                    : 3
  .load                          : 2063
  .bkl_cnt                       : 456
  .nr_spread_over                : 202

cfs_rq
  .exec_clock                    : 17.392835
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 35360.227495
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -974.051645
  .nr_running                    : 0
  .load                          : 0
  .bkl_cnt                       : 456
  .nr_spread_over                : 1

cfs_rq
  .exec_clock                    : 0.428251
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 35360.227495
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -974.051645
  .nr_running                    : 0
  .load                          : 0
  .bkl_cnt                       : 456
  .nr_spread_over                : 1

cfs_rq
  .exec_clock                    : 5812.752391
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 35360.227495
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -974.051645
  .nr_running                    : 0
  .load                          : 0
  .bkl_cnt                       : 456
  .nr_spread_over                : 11

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
     loop_silent  2400    610702.945490       859   139    610702.945490      8622.783425        13.951214
R       chew-max  2427    610644.849769      2057   120    610644.849769     13737.197972        10.079232
        chew-max  2433    610653.297636      1969   120    610653.297636      9679.778096        10.750704


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  8:43   ` Srivatsa Vaddagiri
  2007-09-25  8:48     ` Andrew Morton
@ 2007-09-25 11:00     ` Ingo Molnar
  1 sibling, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25 11:00 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Andrew Morton, linux-kernel, Peter Zijlstra, Mike Galbraith,
	Dhaval Giani, Dmitry Adamushko


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> On Tue, Sep 25, 2007 at 12:41:20AM -0700, Andrew Morton wrote:
> > This doornails the Vaio.  After grub handover the screen remains black
> > and the fan goes whir.
> > 
> > http://userweb.kernel.org/~akpm/config-sony.txt
> 
> This seems to be UP regression. Sorry abt that. I could recreate 
> the problem very easily with CONFIG_SMP turned off.
> 
> Can you check if this patch works? Works for me here.

thanks - i've put this fix into the core group-scheduling patch.

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 10:36                 ` Ingo Molnar
@ 2007-09-25 11:33                   ` Ingo Molnar
  2007-09-25 14:48                     ` Srivatsa Vaddagiri
  2007-09-25 12:51                   ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2007-09-25 11:33 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
> 
>         p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> 
> needs to become properly group-hierarchy aware?

a quick first stab like the one below does not appear to solve the 
problem.

	Ingo

------------------->
Subject: sched: group scheduler SMP migration fix
From: Ingo Molnar <mingo@elte.hu>

group scheduler SMP migration fix.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1039,7 +1039,8 @@ void set_task_cpu(struct task_struct *p,
 {
 	int old_cpu = task_cpu(p);
 	struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
-	u64 clock_offset;
+	struct sched_entity *se;
+	u64 clock_offset, voffset;
 
 	clock_offset = old_rq->clock - new_rq->clock;
 
@@ -1051,7 +1052,11 @@ void set_task_cpu(struct task_struct *p,
 	if (p->se.block_start)
 		p->se.block_start -= clock_offset;
 #endif
-	p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
+
+	se = &p->se;
+	voffset = old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
+	for_each_sched_entity(se)
+		se->vruntime -= voffset;
 
 	__set_task_cpu(p, new_cpu);
 }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 10:28               ` Srivatsa Vaddagiri
  2007-09-25 10:36                 ` Ingo Molnar
@ 2007-09-25 12:28                 ` Mike Galbraith
  2007-09-25 12:54                   ` Mike Galbraith
  1 sibling, 1 reply; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25 12:28 UTC (permalink / raw)
  To: vatsa
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 15:58 +0530, Srivatsa Vaddagiri wrote:

> While I try recreating this myself, I wonder if this patch helps?

It didn't here, nor did tweaking root's share.  Booting with maxcpus=1,
I was unable to produce large latencies, but didn't try very many
things.

	-Mike


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 10:36                 ` Ingo Molnar
  2007-09-25 11:33                   ` Ingo Molnar
@ 2007-09-25 12:51                   ` Srivatsa Vaddagiri
  2007-09-25 13:35                     ` Mike Galbraith
  1 sibling, 1 reply; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-25 12:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, Sep 25, 2007 at 12:36:17PM +0200, Ingo Molnar wrote:
> hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
> 
>         p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;

This definitely does need some fixup, even though I am not sure yet if
it will solve completely the latency issue.

I tried the following patch. I *think* I see some improvement, wrt
latency seen when I type on the shell. Before this patch, I noticed
oddities like "kill -9 chew-max-pid" wont kill chew-max (it is queued in
runqueue waiting for a looong time to run before it can acknowledge
signal and exit). With this patch, I don't see such oddities ..So I am hoping 
it fixes the latency problem you are seeing as well.



Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1039,6 +1039,8 @@ void set_task_cpu(struct task_struct *p,
 {
 	int old_cpu = task_cpu(p);
 	struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
+	struct cfs_rq *old_cfsrq = task_cfs_rq(p),
+		      *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
 	u64 clock_offset;
 
 	clock_offset = old_rq->clock - new_rq->clock;
@@ -1051,7 +1053,8 @@ void set_task_cpu(struct task_struct *p,
 	if (p->se.block_start)
 		p->se.block_start -= clock_offset;
 #endif
-	p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
+	p->se.vruntime -= old_cfsrq->min_vruntime -
+					 new_cfsrq->min_vruntime;
 
 	__set_task_cpu(p, new_cpu);
 }
 

--
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 12:28                 ` Mike Galbraith
@ 2007-09-25 12:54                   ` Mike Galbraith
       [not found]                     ` <20070925131717.GM26289@linux.vnet.ibm.com>
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25 12:54 UTC (permalink / raw)
  To: vatsa
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 14:28 +0200, Mike Galbraith wrote:
> On Tue, 2007-09-25 at 15:58 +0530, Srivatsa Vaddagiri wrote:
> 
> > While I try recreating this myself, I wonder if this patch helps?
> 
> It didn't here, nor did tweaking root's share.  Booting with maxcpus=1,
> I was unable to produce large latencies, but didn't try very many
> things.

Easy way to make it pretty bad: pin a nice 0 loop to CPU0, pin a nice 19
loop to CPU1, then start an unpinned make.. more Xorg bouncing back and
forth I suppose.

se.wait_max              :            14.105683
se.wait_max              :           316.943787
se.wait_max              :           692.884324
se.wait_max              :            38.165534
se.wait_max              :           732.883492
se.wait_max              :           127.059784
se.wait_max              :            63.403549
se.wait_max              :           372.933284

	-Mike


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 12:51                   ` Srivatsa Vaddagiri
@ 2007-09-25 13:35                     ` Mike Galbraith
  2007-09-25 14:07                       ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Galbraith @ 2007-09-25 13:35 UTC (permalink / raw)
  To: vatsa
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 18:21 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Sep 25, 2007 at 12:36:17PM +0200, Ingo Molnar wrote:
> > hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
> > 
> >         p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> 
> This definitely does need some fixup, even though I am not sure yet if
> it will solve completely the latency issue.
> 
> I tried the following patch. I *think* I see some improvement, wrt
> latency seen when I type on the shell. Before this patch, I noticed
> oddities like "kill -9 chew-max-pid" wont kill chew-max (it is queued in
> runqueue waiting for a looong time to run before it can acknowledge
> signal and exit). With this patch, I don't see such oddities ..So I am hoping 
> it fixes the latency problem you are seeing as well.

http://lkml.org/lkml/2007/9/25/117 plus the below seems to be the SIlver
Bullet for the latencies I was seeing.

> Index: current/kernel/sched.c
> ===================================================================
> --- current.orig/kernel/sched.c
> +++ current/kernel/sched.c
> @@ -1039,6 +1039,8 @@ void set_task_cpu(struct task_struct *p,
>  {
>  	int old_cpu = task_cpu(p);
>  	struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
> +	struct cfs_rq *old_cfsrq = task_cfs_rq(p),
> +		      *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
>  	u64 clock_offset;
>  
>  	clock_offset = old_rq->clock - new_rq->clock;
> @@ -1051,7 +1053,8 @@ void set_task_cpu(struct task_struct *p,
>  	if (p->se.block_start)
>  		p->se.block_start -= clock_offset;
>  #endif
> -	p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> +	p->se.vruntime -= old_cfsrq->min_vruntime -
> +					 new_cfsrq->min_vruntime;
>  
>  	__set_task_cpu(p, new_cpu);
>  }
>  
> 
> --
> Regards,
> vatsa


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 13:35                     ` Mike Galbraith
@ 2007-09-25 14:07                       ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-25 14:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, Sep 25, 2007 at 03:35:17PM +0200, Mike Galbraith wrote:
> > I tried the following patch. I *think* I see some improvement, wrt
> > latency seen when I type on the shell. Before this patch, I noticed
> > oddities like "kill -9 chew-max-pid" wont kill chew-max (it is queued in
> > runqueue waiting for a looong time to run before it can acknowledge
> > signal and exit). With this patch, I don't see such oddities ..So I am hoping 
> > it fixes the latency problem you are seeing as well.
> 
> http://lkml.org/lkml/2007/9/25/117 plus the below seems to be the SIlver
> Bullet for the latencies I was seeing.

Cool ..Thanks for the quick feedback.

Ingo, do the two patches fix the latency problems you were seeing as
well?

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25 11:33                   ` Ingo Molnar
@ 2007-09-25 14:48                     ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-25 14:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, Sep 25, 2007 at 01:33:06PM +0200, Ingo Molnar wrote:
> > hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
> > 
> >         p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> > 
> > needs to become properly group-hierarchy aware?

You seem to have hit the nerve for this problem. The two patches I sent:

	http://lkml.org/lkml/2007/9/25/117
	http://lkml.org/lkml/2007/9/25/168

partly help, but we can do better.

> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -1039,7 +1039,8 @@ void set_task_cpu(struct task_struct *p,
>  {
>  	int old_cpu = task_cpu(p);
>  	struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
> -	u64 clock_offset;
> +	struct sched_entity *se;
> +	u64 clock_offset, voffset;
> 
>  	clock_offset = old_rq->clock - new_rq->clock;
> 
> @@ -1051,7 +1052,11 @@ void set_task_cpu(struct task_struct *p,
>  	if (p->se.block_start)
>  		p->se.block_start -= clock_offset;
>  #endif
> -	p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> +
> +	se = &p->se;
> +	voffset = old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;

This one feels wrong, although I can't express my reaction correctly ..

> +	for_each_sched_entity(se)
> +		se->vruntime -= voffset;

Note that parent entities for a task is per-cpu. So if a task A
belonging to userid guest hops from CPU0 to CPU1, then it gets a new parent 
entity as well, which is different from its parent entity on CPU0.

Before:
	taskA->se.parent = guest's tg->se[0]

After:
	taskA->se.parent = guest's tg->se[1]

So walking up the entity hierarchy and fixing up (parent)se->vruntime will do
little good after the task has moved to a new cpu.

IMO, we need to be doing this :

	- For dequeue of higher level sched entities, simulate as if
	  they are going to "sleep" 
	- For enqueue of higher level entities, simulate as if they are
	  "waking up". This will cause enqueue_entity() to reset their
	  vruntime (to existing value for cfs_rq->min_vruntime) when they 
	  "wakeup".

If we don't do this, then lets say a group had only one task (A) and it
moves from CPU0 to CPU1. Then on CPU1, when group level entity for task
A is enqueued, it will have a very low vruntime (since it was never
running) and this will give task A unlimited cpu time, until its group
entity catches up with all the "sleep" time.

Let me try a fix for this next ..

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  6:45   ` Ingo Molnar
@ 2007-09-25 15:17     ` Daniel Walker
  0 siblings, 0 replies; 53+ messages in thread
From: Daniel Walker @ 2007-09-25 15:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 08:45 +0200, Ingo Molnar wrote:
> * Daniel Walker <dwalker@mvista.com> wrote:
> 
> > On Mon, 2007-09-24 at 23:45 +0200, Ingo Molnar wrote:
> > > Lots of scheduler updates in the past few days, done by many people. 
> > > Most importantly, the SMP latency problems reported and debugged by
> > > Mike 
> > > Galbraith should be fixed for good now. 
> > 
> > Does this have anything to do with idle balancing ? I noticed some 
> > fairly large latencies in that code in 2.6.23-rc's ..
> 
> any measurements?

Yes, I made this a while ago,

ftp://source.mvista.com/pub/dwalker/misc/long-cfs-load-balance-trace.txt

This was with PREEMPT_RT on btw, so it's not the most recent kernel. I
was able to reproduce it in all the -rc's I tried.

Daniel


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:47           ` Ingo Molnar
  2007-09-25 10:02             ` Mike Galbraith
@ 2007-09-26  8:04             ` Mike Galbraith
  2007-09-28 21:46             ` Bill Davidsen
  2 siblings, 0 replies; 53+ messages in thread
From: Mike Galbraith @ 2007-09-26  8:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko, Andrew Morton

On Tue, 2007-09-25 at 11:47 +0200, Ingo Molnar wrote:

> Maybe there's more to come: if we can get CONFIG_FAIR_USER_SCHED to work 
> properly then your Xorg will have a load-independent 50% of CPU time all 
> to itself. (Group scheduling is quite impressive already: i can log in 
> as root without feeling _any_ effect from a perpetual 'hackbench 100' 
> running as uid mingo. Fork bombs no more.) Will the Amarok gforce plugin 
> like that CPU time splitup? (or is most of the gforce overhead under 
> your user uid?)
> 
> it could also work out negatively, _sometimes_ X does not like being too 
> high prio. (weird as that might be.) So we'll see.

I piddled around with fair users this morning, and it worked well.  With
Xorg and Gforce as one user (X and Gforce are synchronous ATM), and a
make -j30 as another, I could barely tell the make was running.
Watching a dvd, I couldn't tell. Latencies were pretty darn good
throughout three hours of testing this and that.

	-Mike


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [git] CFS-devel, latest code
  2007-09-25  9:47           ` Ingo Molnar
  2007-09-25 10:02             ` Mike Galbraith
  2007-09-26  8:04             ` Mike Galbraith
@ 2007-09-28 21:46             ` Bill Davidsen
  2 siblings, 0 replies; 53+ messages in thread
From: Bill Davidsen @ 2007-09-28 21:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Srivatsa Vaddagiri,
	Dhaval Giani, Dmitry Adamushko, Andrew Morton

Ingo Molnar wrote:

> Maybe there's more to come: if we can get CONFIG_FAIR_USER_SCHED to work 
> properly then your Xorg will have a load-independent 50% of CPU time all 
> to itself.

It seems that perhaps that 50% makes more sense on a single/dual CPU 
system than on a more robust one, such as a four way dual core Xeon with 
HT or some such. With hotplug CPUs, and setups on various machines, 
perhaps some resource limit independent of the available resource would 
be useful.

Just throwing out the idea, in case it lands on fertile ground.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC/PATCH] Add sysfs control to modify a user's cpu share
       [not found]                                 ` <20070926210737.GA8663@elte.hu>
@ 2007-10-01 14:04                                   ` Dhaval Giani
  2007-10-01 14:44                                     ` Ingo Molnar
  2007-10-01 16:12                                     ` [RFC/PATCH] Add sysfs control to modify a user's cpu share Dave Jones
  0 siblings, 2 replies; 53+ messages in thread
From: Dhaval Giani @ 2007-10-01 14:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar

Hi Ingo,

Adds tunables in sysfs to modify a user's cpu share.

A directory is created in sysfs for each new user in the system.

	/sys/kernel/uids/<uid>/cpu_share

Reading this file returns the cpu shares granted for the user.
Writing into this file modifies the cpu share for the user. Only an 
administrator is allowed to modify a user's cpu share.

Ex: 
	# cd /sys/kernel/uids/
	# cat 512/cpu_share
	1024
	# echo 2048 > 512/cpu_share
	# cat 512/cpu_share
	2048
	#

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>

---
 include/linux/sched.h |   11 +++++
 kernel/ksysfs.c       |    4 +
 kernel/sched.c        |    5 ++
 kernel/user.c         |  110 +++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 129 insertions(+), 1 deletion(-)

Index: linux-2.6.23-rc8-sched-devel/include/linux/sched.h
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/include/linux/sched.h
+++ linux-2.6.23-rc8-sched-devel/include/linux/sched.h
@@ -86,6 +86,7 @@ struct sched_param {
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
 #include <linux/task_io_accounting.h>
+#include <linux/kobject.h>
 
 #include <asm/processor.h>
 
@@ -598,9 +599,18 @@ struct user_struct {
 
 #ifdef CONFIG_FAIR_USER_SCHED
 	struct task_group *tg;
+ 	struct kset kset;
+ 	struct subsys_attribute user_attr;
+ 	struct work_struct work;
 #endif
 };
 
+#ifdef CONFIG_FAIR_USER_SCHED
+extern int uids_kobject_init(void);
+#else
+static inline int uids_kobject_init(void) { return 0; }
+#endif
+
 extern struct user_struct *find_user(uid_t);
 
 extern struct user_struct root_user;
@@ -1846,6 +1856,7 @@ extern struct task_group *sched_create_g
 extern void sched_destroy_group(struct task_group *tg);
 extern void sched_move_task(struct task_struct *tsk);
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
+extern unsigned long sched_group_shares(struct task_group *tg);
 
 #endif
 
Index: linux-2.6.23-rc8-sched-devel/kernel/ksysfs.c
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/kernel/ksysfs.c
+++ linux-2.6.23-rc8-sched-devel/kernel/ksysfs.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/kexec.h>
+#include <linux/sched.h>
 
 #define KERNEL_ATTR_RO(_name) \
 static struct subsys_attribute _name##_attr = __ATTR_RO(_name)
@@ -116,6 +117,9 @@ static int __init ksysfs_init(void)
 					      &notes_attr);
 	}
 
+	if (!error)
+		error = uids_kobject_init();
+
 	return error;
 }
 
Index: linux-2.6.23-rc8-sched-devel/kernel/sched.c
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/kernel/sched.c
+++ linux-2.6.23-rc8-sched-devel/kernel/sched.c
@@ -6928,4 +6928,9 @@ int sched_group_set_shares(struct task_g
 	return 0;
 }
 
+unsigned long sched_group_shares(struct task_group *tg)
+{
+	return tg->shares;
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
Index: linux-2.6.23-rc8-sched-devel/kernel/user.c
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/kernel/user.c
+++ linux-2.6.23-rc8-sched-devel/kernel/user.c
@@ -56,6 +56,62 @@ struct user_struct root_user = {
 };
 
 #ifdef CONFIG_FAIR_USER_SCHED
+
+static struct kobject uids_kobject;
+
+ssize_t cpu_shares_show(struct kset *kset, char *buffer)
+{
+	struct user_struct *up = container_of(kset, struct user_struct, kset);
+
+	return sprintf(buffer, "%lu\n", sched_group_shares(up->tg));
+}
+
+ssize_t cpu_shares_store(struct kset *kset, const char *buffer, size_t size)
+{
+	struct user_struct *up = container_of(kset, struct user_struct, kset);
+	unsigned long shares;
+	int rc;
+
+	sscanf(buffer, "%lu", &shares);
+
+	rc = sched_group_set_shares(up->tg, shares);
+
+	return (rc ? rc : size);
+}
+
+static void user_attr_init(struct subsys_attribute *sa, char *name, int mode)
+{
+	sa->attr.name = name;
+	sa->attr.mode = mode;
+	sa->show = cpu_shares_show;
+	sa->store = cpu_shares_store;
+}
+
+static int user_kobject_create(struct user_struct *up)
+{
+	struct kset *kset = &up->kset;
+	struct kobject *kobj = &kset->kobj;
+	int error;
+
+	memset(kset, 0, sizeof(struct kset));
+	kobj->parent = &uids_kobject;
+	kobject_set_name(kobj, "%d", up->uid);
+	kset_init(kset);
+	user_attr_init(&up->user_attr, "cpu_share", 0644);
+
+	error = kobject_add(kobj);
+
+	if (error)
+		goto done;
+
+	error = sysfs_create_file(kobj, &up->user_attr.attr);
+	if (error)
+		kobject_del(kobj);
+
+done:
+	return error;
+}
+
 static void sched_destroy_user(struct user_struct *up)
 {
 	sched_destroy_group(up->tg);
@@ -77,11 +133,53 @@ static void sched_switch_user(struct tas
 	sched_move_task(p);
 }
 
+int __init uids_kobject_init(void)
+{
+	int error;
+
+	uids_kobject.parent = &kernel_subsys.kobj;
+	kobject_set_name(&uids_kobject, "uids");
+	kobject_init(&uids_kobject);
+	error = kobject_add(&uids_kobject);
+
+	if (!error)
+		error = user_kobject_create(&root_user);
+
+	return error;
+}
+
+static void remove_user_sysfs_dir(struct work_struct *w)
+{
+	struct user_struct *up = container_of(w, struct user_struct, work);
+	struct kobject *kobj = &up->kset.kobj;
+
+	sysfs_remove_file(kobj, &up->user_attr.attr);
+	kobject_del(kobj);
+	kmem_cache_free(uid_cachep, up);
+}
+
+static inline void free_user(struct user_struct *up)
+{
+	struct kobject *kobj = &up->kset.kobj;
+
+	if (!kobj->sd)
+		return;
+
+	INIT_WORK(&up->work, remove_user_sysfs_dir);
+	schedule_work(&up->work);
+}
+
 #else	/* CONFIG_FAIR_USER_SCHED */
 
 static void sched_destroy_user(struct user_struct *up) { }
 static int sched_create_user(struct user_struct *up) { return 0; }
 static void sched_switch_user(struct task_struct *p) { }
+static inline int user_kobject_create(struct user_struct *up) { return 0; }
+
+static inline void free_user(struct user_struct *up)
+{
+	kmem_cache_free(uid_cachep, up);
+}
 
 #endif	/* CONFIG_FAIR_USER_SCHED */
 
@@ -145,7 +243,7 @@ void free_uid(struct user_struct *up)
 		sched_destroy_user(up);
 		key_put(up->uid_keyring);
 		key_put(up->session_keyring);
-		kmem_cache_free(uid_cachep, up);
+		free_user(up);
 	} else {
 		local_irq_restore(flags);
 	}
@@ -155,6 +253,7 @@ struct user_struct * alloc_uid(struct us
 {
 	struct hlist_head *hashent = uidhashentry(ns, uid);
 	struct user_struct *up;
+	int create_sysfs_dir = 0;
 
 	spin_lock_irq(&uidhash_lock);
 	up = uid_hash_find(uid, hashent);
@@ -205,10 +304,19 @@ struct user_struct * alloc_uid(struct us
 		} else {
 			uid_hash_insert(new, hashent);
 			up = new;
+			create_sysfs_dir = 1;
 		}
 		spin_unlock_irq(&uidhash_lock);
 
 	}
+
+	if (create_sysfs_dir) {
+		if (user_kobject_create(up)) {
+			free_uid(up);
+			up = NULL;
+		}
+	}
+
 	return up;
 }
 
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share
  2007-10-01 14:04                                   ` [RFC/PATCH] Add sysfs control to modify a user's cpu share Dhaval Giani
@ 2007-10-01 14:44                                     ` Ingo Molnar
  2007-10-01 15:32                                       ` Srivatsa Vaddagiri
                                                         ` (2 more replies)
  2007-10-01 16:12                                     ` [RFC/PATCH] Add sysfs control to modify a user's cpu share Dave Jones
  1 sibling, 3 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-10-01 14:44 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar


hi Dhaval,

* Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:

> Hi Ingo,
> 
> Adds tunables in sysfs to modify a user's cpu share.
> 
> A directory is created in sysfs for each new user in the system.
> 
> 	/sys/kernel/uids/<uid>/cpu_share
> 
> Reading this file returns the cpu shares granted for the user.
> Writing into this file modifies the cpu share for the user. Only an 
> administrator is allowed to modify a user's cpu share.
> 
> Ex: 
> 	# cd /sys/kernel/uids/
> 	# cat 512/cpu_share
> 	1024
> 	# echo 2048 > 512/cpu_share
> 	# cat 512/cpu_share
> 	2048
> 	#

looks good to me! I think this API is pretty straightforward. I've put 
this into my tree and have updated the sched-devel git tree:

 git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

but there seems to be a bug in it, i get this:

 kobject_add failed for 500 with -EEXIST, don't try to register things with the same name in the same directory.
  [<c01e1730>] kobject_shadow_add+0x13b/0x169
  [<c01e1795>] kobject_set_name+0x28/0x91
  [<c01299d5>] user_kobject_create+0x6a/0x90
  [<c0129d3c>] alloc_uid+0x141/0x181
  [<c012d05c>] set_user+0x1c/0x91
  [<c012e4c7>] sys_setresuid+0xd9/0x17d
  [<c0103cf6>] sysenter_past_esp+0x5f/0x85
  =======================

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share
  2007-10-01 14:44                                     ` Ingo Molnar
@ 2007-10-01 15:32                                       ` Srivatsa Vaddagiri
  2007-10-02 22:12                                       ` Eric St-Laurent
  2007-10-03 17:10                                       ` [RFC/PATCH -v2] " Dhaval Giani
  2 siblings, 0 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-10-01 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dhaval Giani, Mike Galbraith, Peter Zijlstra, Dmitry Adamushko,
	lkml, maneesh, Andrew Morton, Sudhir Kumar

On Mon, Oct 01, 2007 at 04:44:02PM +0200, Ingo Molnar wrote:
> but there seems to be a bug in it, i get this:
> 
>  kobject_add failed for 500 with -EEXIST, don't try to register things with the same name in the same directory.

Looks like a remove_dir/create_dir race ..

	free_user()			alloc_uid()

	   uid_hash_remove(up);

					uid_hash_insert(new, hashent);

	   schedule_work()
					user_kobject_create();
					kobject_add();	-> Boom!

	   remove_user_sysfs_dir();

/me looks up sysfs programming examples ..

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share
  2007-10-01 14:04                                   ` [RFC/PATCH] Add sysfs control to modify a user's cpu share Dhaval Giani
  2007-10-01 14:44                                     ` Ingo Molnar
@ 2007-10-01 16:12                                     ` Dave Jones
  2007-10-01 16:37                                       ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 53+ messages in thread
From: Dave Jones @ 2007-10-01 16:12 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar

On Mon, Oct 01, 2007 at 07:34:54PM +0530, Dhaval Giani wrote:
 > Hi Ingo,
 > 
 > Adds tunables in sysfs to modify a user's cpu share.
 > 
 > A directory is created in sysfs for each new user in the system.
 > 
 > 	/sys/kernel/uids/<uid>/cpu_share
 > 
 > Reading this file returns the cpu shares granted for the user.
 > Writing into this file modifies the cpu share for the user. Only an 
 > administrator is allowed to modify a user's cpu share.
 > 
 > Ex: 
 > 	# cd /sys/kernel/uids/
 > 	# cat 512/cpu_share
 > 	1024
 > 	# echo 2048 > 512/cpu_share
 > 	# cat 512/cpu_share
 > 	2048
 > 	#

Can we start adding stuff to Documentation/ for new files created
in sysfs ?  There's so much undocumented stuff in there now that
it's unfunny.

A great start would be 'wtf is a cpu_share and why would I want to
change it' ?

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share
  2007-10-01 16:12                                     ` [RFC/PATCH] Add sysfs control to modify a user's cpu share Dave Jones
@ 2007-10-01 16:37                                       ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-10-01 16:37 UTC (permalink / raw)
  To: Dave Jones, Dhaval Giani, Ingo Molnar, Mike Galbraith,
	Peter Zijlstra, Dmitry Adamushko, lkml, maneesh, Andrew Morton,
	Sudhir Kumar

On Mon, Oct 01, 2007 at 12:12:21PM -0400, Dave Jones wrote:
> Can we start adding stuff to Documentation/ for new files created
> in sysfs ?  There's so much undocumented stuff in there now that
> it's unfunny.

Hi Dave,
	We will in the next version of the patch. At this point, the patch was 
sent out to get an idea of whether sysfs is the right place for this interface 
or not.

> A great start would be 'wtf is a cpu_share and why would I want to
> change it' ?

Quick answer before we release the next version : This has to do with
the CONFIG_FAIR_USER_SCHED feature which allows dividing CPU bandwidth
equally between all users of a system. Before this patch, CPU bandwidth was 
divided equally among all non-root users. After this patch, it will
be possible to control CPU bandwidth allocation to each user separately.

For ex: by setting user "vatsa"'s cpu_share to be twice that of user
"guest", it would be possible for user "vatsa" to get 2x CPU bandwidth that of 
user "guest"


--
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share
  2007-10-01 14:44                                     ` Ingo Molnar
  2007-10-01 15:32                                       ` Srivatsa Vaddagiri
@ 2007-10-02 22:12                                       ` Eric St-Laurent
  2007-10-03  4:09                                         ` Srivatsa Vaddagiri
  2007-10-03 17:10                                       ` [RFC/PATCH -v2] " Dhaval Giani
  2 siblings, 1 reply; 53+ messages in thread
From: Eric St-Laurent @ 2007-10-02 22:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dhaval Giani, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar


On Mon, 2007-10-01 at 16:44 +0200, Ingo Molnar wrote:
> > Adds tunables in sysfs to modify a user's cpu share.
> > 
> > A directory is created in sysfs for each new user in the system.
> > 
> > 	/sys/kernel/uids/<uid>/cpu_share
> > 
> > Reading this file returns the cpu shares granted for the user.
> > Writing into this file modifies the cpu share for the user. Only an 
> > administrator is allowed to modify a user's cpu share.
> > 
> > Ex: 
> > 	# cd /sys/kernel/uids/
> > 	# cat 512/cpu_share
> > 	1024
> > 	# echo 2048 > 512/cpu_share
> > 	# cat 512/cpu_share
> > 	2048
> > 	#
> 
> looks good to me! I think this API is pretty straightforward. I've put 
> this into my tree and have updated the sched-devel git tree:
> 

While a sysfs interface is OK and somewhat orthogonal to the interface
proposed the containers patches, I think maybe a new syscall should be
considered.

Since we now have a fair share cpu scheduler, maybe an interface to
specify the cpu share directly (alternatively to priority) make sense.

For processes, it may become more intuitive (and precise) to set the
processing share directly than setting a priority which is converted to
a share.

Maybe something similar to ioprio_set() and ioprio_get() syscalls:

- per user cpu share
- per user group cpu share
- per process cpu share
- per process group cpu share


Best regards,

- Eric



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share
  2007-10-02 22:12                                       ` Eric St-Laurent
@ 2007-10-03  4:09                                         ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-10-03  4:09 UTC (permalink / raw)
  To: Eric St-Laurent
  Cc: Ingo Molnar, Dhaval Giani, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar

On Tue, Oct 02, 2007 at 06:12:39PM -0400, Eric St-Laurent wrote:
> While a sysfs interface is OK and somewhat orthogonal to the interface
> proposed the containers patches, I think maybe a new syscall should be
> considered.

We had discussed syscall vs filesystem based interface for resource
management [1] and there was a heavy bias favoring filesystem based interface,
based on which the container (now "cgroup") filesystem evolved.

Where we already have one interface defined, I would be against adding 
an equivalent syscall interface.

Note that this "fair-user" scheduling can in theory be accomplished
using the same cgroup based interface, but requires some extra setup in
userspace (either to run a daemon which moves tasks to appropriate
control groups/containers upon their uid change OR to modify initrd to mount 
cgroup filesystem at early bootup time). I expect most distros to enable
CONFIG_FAIR_CGROUP_SCHED (control group based fair group scheduler) and not 
CONFIG_FAIR_USER_SHCED (user id based fair group scheduler). The only
reason why we are providing CONFIG_FAIR_USER_SCHED and the associated
sysfs interface is to help test group scheduler w/o requiring knowledge
of cgroup filesystem.

Reference:

1. http://marc.info/?l=linux-kernel&m=116231242201300&w=2

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share
  2007-10-01 14:44                                     ` Ingo Molnar
  2007-10-01 15:32                                       ` Srivatsa Vaddagiri
  2007-10-02 22:12                                       ` Eric St-Laurent
@ 2007-10-03 17:10                                       ` Dhaval Giani
  2007-10-04  7:57                                         ` Ingo Molnar
  2 siblings, 1 reply; 53+ messages in thread
From: Dhaval Giani @ 2007-10-03 17:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar

Hi Ingo,

Can you please drop commit b1add858a10cece3a68b2d8cb9e7350843700a58 (last
version of this patch) and try this instead?
---
Adds tunables in sysfs to modify a user's cpu share.

A directory is created in sysfs for each new user in the system.

	/sys/kernel/uids/<uid>/cpu_share

Reading this file returns the cpu shares granted for the user.
Writing into this file modifies the cpu share for the user. Only an
administrator is allowed to modify a user's cpu share.

Ex:
	# cd /sys/kernel/uids/
	# cat 512/cpu_share
	1024
	# echo 2048 > 512/cpu_share
	# cat 512/cpu_share
	2048
	#

Changelog since v1:
1. Added a mutex to serialize directory creation/destruction for a user in
   sysfs
2. Added a spinlock in the task_group structure to serialize writes to
   tg->shares.
3. Removed /proc/root_user_cpu_shares.
4. Added Documentation about the group scheduler.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>

---
 Documentation/sched-design-CFS.txt |   67 ++++++++++
 include/linux/sched.h              |   11 +
 kernel/ksysfs.c                    |    8 +
 kernel/sched.c                     |   14 +-
 kernel/sched_debug.c               |   48 -------
 kernel/user.c                      |  242 ++++++++++++++++++++++++++++++++-----
 6 files changed, 310 insertions(+), 80 deletions(-)

Index: linux-2.6.23-rc8-sched-devel/Documentation/sched-design-CFS.txt
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/Documentation/sched-design-CFS.txt
+++ linux-2.6.23-rc8-sched-devel/Documentation/sched-design-CFS.txt
@@ -117,3 +117,70 @@ Some implementation details:
    iterators of the scheduling modules are used. The balancing code got
    quite a bit simpler as a result.
 
+
+Group scheduler extension to CFS
+================================
+
+Normally the scheduler operates on individual tasks and strives to provide
+fair CPU time to each task. Sometimes, it may be desirable to group tasks
+and provide fair CPU time to each such task group. For example, it may
+be desirable to first provide fair CPU time to each user on the system
+and then to each task belonging to a user.
+
+CONFIG_FAIR_GROUP_SCHED strives to achieve exactly that. It lets
+SCHED_NORMAL/BATCH tasks be be grouped and divides CPU time fairly among such
+groups. At present, there are two (mutually exclusive) mechanisms to group
+tasks for CPU bandwidth control purpose:
+
+	- Based on user id (CONFIG_FAIR_USER_SCHED)
+		In this option, tasks are grouped according to their user id.
+	- Based on "cgroup" pseudo filesystem (CONFIG_FAIR_CGROUP_SCHED)
+		This options lets the administrator create arbitrary groups
+		of tasks, using the "cgroup" pseudo filesystem. See
+		Documentation/cgroups.txt for more information about this
+		filesystem.
+
+Only one of these options to group tasks can be chosen and not both.
+
+Group scheduler tunables:
+
+When CONFIG_FAIR_USER_SCHED is defined, a directory is created in sysfs for
+each new user and a "cpu_share" file is added in that directory.
+
+	# cd /sys/kernel/uids
+	# cat 512/cpu_share		# Display user 512's CPU share
+	1024
+	# echo 2048 > 512/cpu_share	# Modify user 512's CPU share
+	# cat 512/cpu_share		# Display user 512's CPU share
+	2048
+	#
+
+CPU bandwidth between two users are divided in the ratio of their CPU shares.
+For ex: if you would like user "root" to get twice the bandwidth of user
+"guest", then set the cpu_share for both the users such that "root"'s
+cpu_share is twice "guest"'s cpu_share
+
+
+When CONFIG_FAIR_CGROUP_SCHED is defined, a "cpu.shares" file is created
+for each group created using the pseudo filesystem. See example steps
+below to create task groups and modify their CPU share using the "cgroups"
+pseudo filesystem
+
+	# mkdir /dev/cpuctl
+	# mount -t cgroup -ocpu none /dev/cpuctl
+	# cd /dev/cpuctl
+
+	# mkdir multimedia	# create "multimedia" group of tasks
+	# mkdir browser		# create "browser" group of tasks
+
+	# #Configure the multimedia group to receive twice the CPU bandwidth
+	# #that of browser group
+
+	# echo 2048 > multimedia/cpu.shares
+	# echo 1024 > browser/cpu.shares
+
+	# firefox &	# Launch firefox and move it to "browser" group
+	# echo <firefox_pid> > browser/tasks
+
+	# #Launch gmplayer (or your favourite movie player)
+	# echo <movie_player_pid> > multimedia/tasks
Index: linux-2.6.23-rc8-sched-devel/include/linux/sched.h
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/include/linux/sched.h
+++ linux-2.6.23-rc8-sched-devel/include/linux/sched.h
@@ -86,6 +86,7 @@ struct sched_param {
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
 #include <linux/task_io_accounting.h>
+#include <linux/kobject.h>
 
 #include <asm/processor.h>
 
@@ -598,9 +599,18 @@ struct user_struct {
 
 #ifdef CONFIG_FAIR_USER_SCHED
 	struct task_group *tg;
+	struct kset kset;
+	struct subsys_attribute user_attr;
+	struct work_struct work;
 #endif
 };
 
+#ifdef CONFIG_FAIR_USER_SCHED
+extern int uids_kobject_init(void);
+#else
+static inline int uids_kobject_init(void) { return 0; }
+#endif
+
 extern struct user_struct *find_user(uid_t);
 
 extern struct user_struct root_user;
@@ -1846,6 +1856,7 @@ extern struct task_group *sched_create_g
 extern void sched_destroy_group(struct task_group *tg);
 extern void sched_move_task(struct task_struct *tsk);
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
+extern unsigned long sched_group_shares(struct task_group *tg);
 
 #endif
 
Index: linux-2.6.23-rc8-sched-devel/kernel/ksysfs.c
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/kernel/ksysfs.c
+++ linux-2.6.23-rc8-sched-devel/kernel/ksysfs.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/kexec.h>
+#include <linux/sched.h>
 
 #define KERNEL_ATTR_RO(_name) \
 static struct subsys_attribute _name##_attr = __ATTR_RO(_name)
@@ -116,6 +117,13 @@ static int __init ksysfs_init(void)
 					      &notes_attr);
 	}
 
+	/*
+	 * Create "/sys/kernel/uids" directory and corresponding root user's
+	 * directory under it.
+	 */
+	if (!error)
+		error = uids_kobject_init();
+
 	return error;
 }
 
Index: linux-2.6.23-rc8-sched-devel/kernel/sched.c
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/kernel/sched.c
+++ linux-2.6.23-rc8-sched-devel/kernel/sched.c
@@ -161,6 +161,8 @@ struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+	/* spinlock to serialize modification to shares */
+	spinlock_t lock;
 };
 
 /* Default task group's sched entity on each cpu */
@@ -6552,6 +6554,7 @@ void __init sched_init(void)
 			se->parent = NULL;
 		}
 		init_task_group.shares = init_task_group_load;
+		spin_lock_init(&init_task_group.lock);
 #endif
 
 		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -6796,6 +6799,7 @@ struct task_group *sched_create_group(vo
 	}
 
 	tg->shares = NICE_0_LOAD;
+	spin_lock_init(&tg->lock);
 
 	return tg;
 
@@ -6916,8 +6920,9 @@ int sched_group_set_shares(struct task_g
 {
 	int i;
 
+	spin_lock(&tg->lock);
 	if (tg->shares == shares)
-		return 0;
+		goto done;
 
 	/* return -EINVAL if the new value is not sane */
 
@@ -6925,7 +6930,14 @@ int sched_group_set_shares(struct task_g
 	for_each_possible_cpu(i)
 		set_se_shares(tg->se[i], shares);
 
+done:
+	spin_unlock(&tg->lock);
 	return 0;
 }
 
+unsigned long sched_group_shares(struct task_group *tg)
+{
+	return tg->shares;
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
Index: linux-2.6.23-rc8-sched-devel/kernel/sched_debug.c
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/kernel/sched_debug.c
+++ linux-2.6.23-rc8-sched-devel/kernel/sched_debug.c
@@ -231,45 +231,6 @@ static void sysrq_sched_debug_show(void)
 	sched_debug_show(NULL, NULL);
 }
 
-#ifdef CONFIG_FAIR_USER_SCHED
-
-static DEFINE_MUTEX(root_user_share_mutex);
-
-static int
-root_user_share_read_proc(char *page, char **start, off_t off, int count,
-				 int *eof, void *data)
-{
-	return sprintf(page, "%d\n", init_task_group_load);
-}
-
-static int
-root_user_share_write_proc(struct file *file, const char __user *buffer,
-				 unsigned long count, void *data)
-{
-	unsigned long shares;
-	char kbuf[sizeof(unsigned long)+1];
-	int rc = 0;
-
-	if (copy_from_user(kbuf, buffer, sizeof(kbuf)))
-		return -EFAULT;
-
-	shares = simple_strtoul(kbuf, NULL, 0);
-
-	if (!shares)
-		shares = NICE_0_LOAD;
-
-	mutex_lock(&root_user_share_mutex);
-
-	init_task_group_load = shares;
-	rc = sched_group_set_shares(&init_task_group, shares);
-
-	mutex_unlock(&root_user_share_mutex);
-
-	return (rc < 0 ? rc : count);
-}
-
-#endif	/* CONFIG_FAIR_USER_SCHED */
-
 static int sched_debug_open(struct inode *inode, struct file *filp)
 {
 	return single_open(filp, sched_debug_show, NULL);
@@ -292,15 +253,6 @@ static int __init init_sched_debug_procf
 
 	pe->proc_fops = &sched_debug_fops;
 
-#ifdef CONFIG_FAIR_USER_SCHED
-	pe = create_proc_entry("root_user_cpu_share", 0644, NULL);
-	if (!pe)
-		return -ENOMEM;
-
-	pe->read_proc = root_user_share_read_proc;
-	pe->write_proc = root_user_share_write_proc;
-#endif
-
 	return 0;
 }
 
Index: linux-2.6.23-rc8-sched-devel/kernel/user.c
===================================================================
--- linux-2.6.23-rc8-sched-devel.orig/kernel/user.c
+++ linux-2.6.23-rc8-sched-devel/kernel/user.c
@@ -55,7 +55,41 @@ struct user_struct root_user = {
 #endif
 };
 
+/*
+ * These routines must be called with the uidhash spinlock held!
+ */
+static inline void uid_hash_insert(struct user_struct *up,
+						struct hlist_head *hashent)
+{
+	hlist_add_head(&up->uidhash_node, hashent);
+}
+
+static inline void uid_hash_remove(struct user_struct *up)
+{
+	hlist_del_init(&up->uidhash_node);
+}
+
+static inline struct user_struct *uid_hash_find(uid_t uid,
+						struct hlist_head *hashent)
+{
+	struct user_struct *user;
+	struct hlist_node *h;
+
+	hlist_for_each_entry(user, h, hashent, uidhash_node) {
+		if (user->uid == uid) {
+			atomic_inc(&user->__count);
+			return user;
+		}
+	}
+
+	return NULL;
+}
+
 #ifdef CONFIG_FAIR_USER_SCHED
+
+static struct kobject uids_kobject; /* represents /sys/kernel/uids directory */
+static DEFINE_MUTEX(uids_mutex);
+
 static void sched_destroy_user(struct user_struct *up)
 {
 	sched_destroy_group(up->tg);
@@ -77,42 +111,173 @@ static void sched_switch_user(struct tas
 	sched_move_task(p);
 }
 
-#else	/* CONFIG_FAIR_USER_SCHED */
+static inline void uids_mutex_lock(void)
+{
+	mutex_lock(&uids_mutex);
+}
 
-static void sched_destroy_user(struct user_struct *up) { }
-static int sched_create_user(struct user_struct *up) { return 0; }
-static void sched_switch_user(struct task_struct *p) { }
+static inline void uids_mutex_unlock(void)
+{
+	mutex_unlock(&uids_mutex);
+}
 
-#endif	/* CONFIG_FAIR_USER_SCHED */
+/* return cpu shares held by the user */
+ssize_t cpu_shares_show(struct kset *kset, char *buffer)
+{
+	struct user_struct *up = container_of(kset, struct user_struct, kset);
 
-/*
- * These routines must be called with the uidhash spinlock held!
- */
-static inline void uid_hash_insert(struct user_struct *up, struct hlist_head *hashent)
+	return sprintf(buffer, "%lu\n", sched_group_shares(up->tg));
+}
+
+/* modify cpu shares held by the user */
+ssize_t cpu_shares_store(struct kset *kset, const char *buffer, size_t size)
 {
-	hlist_add_head(&up->uidhash_node, hashent);
+	struct user_struct *up = container_of(kset, struct user_struct, kset);
+	unsigned long shares;
+	int rc;
+
+	sscanf(buffer, "%lu", &shares);
+
+	rc = sched_group_set_shares(up->tg, shares);
+
+	return (rc ? rc : size);
 }
 
-static inline void uid_hash_remove(struct user_struct *up)
+static void user_attr_init(struct subsys_attribute *sa, char *name, int mode)
 {
-	hlist_del_init(&up->uidhash_node);
+	sa->attr.name = name;
+	sa->attr.mode = mode;
+	sa->show = cpu_shares_show;
+	sa->store = cpu_shares_store;
 }
 
-static inline struct user_struct *uid_hash_find(uid_t uid, struct hlist_head *hashent)
+/* Create "/sys/kernel/uids/<uid>" directory and
+ *  "/sys/kernel/uids/<uid>/cpu_share" file for this user.
+ */
+static int user_kobject_create(struct user_struct *up)
 {
-	struct user_struct *user;
-	struct hlist_node *h;
+	struct kset *kset = &up->kset;
+	struct kobject *kobj = &kset->kobj;
+	int error;
+
+	memset(kset, 0, sizeof(struct kset));
+	kobj->parent = &uids_kobject;	/* create under /sys/kernel/uids dir */
+	kobject_set_name(kobj, "%d", up->uid);
+	kset_init(kset);
+	user_attr_init(&up->user_attr, "cpu_share", 0644);
+
+	error = kobject_add(kobj);
+	if (error)
+		goto done;
+
+	error = sysfs_create_file(kobj, &up->user_attr.attr);
+	if (error)
+		kobject_del(kobj);
+
+done:
+	return error;
+}
+
+/* create these in sysfs filesystem:
+ * 	"/sys/kernel/uids" directory
+ * 	"/sys/kernel/uids/0" directory (for root user)
+ * 	"/sys/kernel/uids/0/cpu_share" file (for root user)
+ */
+int __init uids_kobject_init(void)
+{
+	int error;
 
-	hlist_for_each_entry(user, h, hashent, uidhash_node) {
-		if(user->uid == uid) {
-			atomic_inc(&user->__count);
-			return user;
-		}
+	/* create under /sys/kernel dir */
+	uids_kobject.parent = &kernel_subsys.kobj;
+	kobject_set_name(&uids_kobject, "uids");
+	kobject_init(&uids_kobject);
+
+	error = kobject_add(&uids_kobject);
+	if (!error)
+		error = user_kobject_create(&root_user);
+
+	return error;
+}
+
+/* work function to remove sysfs directory for a user and free up
+ * corresponding structures.
+ */
+static void remove_user_sysfs_dir(struct work_struct *w)
+{
+	struct user_struct *up = container_of(w, struct user_struct, work);
+	struct kobject *kobj = &up->kset.kobj;
+	unsigned long flags;
+	int remove_user = 0;
+
+	/* Make uid_hash_remove() + sysfs_remove_file() + kobject_del()
+	 * atomic.
+	 */
+	uids_mutex_lock();
+
+	local_irq_save(flags);
+
+	if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
+		uid_hash_remove(up);
+		remove_user = 1;
+		spin_unlock_irqrestore(&uidhash_lock, flags);
+	} else {
+		local_irq_restore(flags);
 	}
 
-	return NULL;
+	if (!remove_user)
+		goto done;
+
+	sysfs_remove_file(kobj, &up->user_attr.attr);
+	kobject_del(kobj);
+
+	sched_destroy_user(up);
+	key_put(up->uid_keyring);
+	key_put(up->session_keyring);
+	kmem_cache_free(uid_cachep, up);
+
+done:
+	uids_mutex_unlock();
+}
+
+/* IRQs are disabled and uidhash_lock is held upon function entry.
+ * IRQ state (as stored in flags) is restored and uidhash_lock released
+ * upon function exit.
+ */
+static inline void free_user(struct user_struct *up, unsigned long flags)
+{
+	/* restore back the count */
+	atomic_inc(&up->__count);
+	spin_unlock_irqrestore(&uidhash_lock, flags);
+
+	INIT_WORK(&up->work, remove_user_sysfs_dir);
+	schedule_work(&up->work);
+}
+
+#else	/* CONFIG_FAIR_USER_SCHED */
+
+static void sched_destroy_user(struct user_struct *up) { }
+static int sched_create_user(struct user_struct *up) { return 0; }
+static void sched_switch_user(struct task_struct *p) { }
+static inline int user_kobject_create(struct user_struct *up) { return 0; }
+static inline void uids_mutex_lock(void) { }
+static inline void uids_mutex_unlock(void) { }
+
+/* IRQs are disabled and uidhash_lock is held upon function entry.
+ * IRQ state (as stored in flags) is restored and uidhash_lock released
+ * upon function exit.
+ */
+static inline void free_user(struct user_struct *up, unsigned long flags)
+{
+	uid_hash_remove(up);
+	spin_unlock_irqrestore(&uidhash_lock, flags);
+	sched_destroy_user(up);
+	key_put(up->uid_keyring);
+	key_put(up->session_keyring);
+	kmem_cache_free(uid_cachep, up);
 }
 
+#endif	/* CONFIG_FAIR_USER_SCHED */
+
 /*
  * Locate the user_struct for the passed UID.  If found, take a ref on it.  The
  * caller must undo that ref with free_uid().
@@ -139,16 +304,10 @@ void free_uid(struct user_struct *up)
 		return;
 
 	local_irq_save(flags);
-	if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
-		uid_hash_remove(up);
-		spin_unlock_irqrestore(&uidhash_lock, flags);
-		sched_destroy_user(up);
-		key_put(up->uid_keyring);
-		key_put(up->session_keyring);
-		kmem_cache_free(uid_cachep, up);
-	} else {
+	if (atomic_dec_and_lock(&up->__count, &uidhash_lock))
+		free_user(up, flags);
+	else
 		local_irq_restore(flags);
-	}
 }
 
 struct user_struct * alloc_uid(struct user_namespace *ns, uid_t uid)
@@ -156,6 +315,11 @@ struct user_struct * alloc_uid(struct us
 	struct hlist_head *hashent = uidhashentry(ns, uid);
 	struct user_struct *up;
 
+	/* Make uid_hash_find() + user_kobject_create() + uid_hash_insert()
+	 * atomic.
+	 */
+	uids_mutex_lock();
+
 	spin_lock_irq(&uidhash_lock);
 	up = uid_hash_find(uid, hashent);
 	spin_unlock_irq(&uidhash_lock);
@@ -191,6 +355,15 @@ struct user_struct * alloc_uid(struct us
 			return NULL;
 		}
 
+		if (user_kobject_create(new)) {
+			sched_destroy_user(new);
+			key_put(new->uid_keyring);
+			key_put(new->session_keyring);
+			kmem_cache_free(uid_cachep, new);
+			uids_mutex_unlock();
+			return NULL;
+		}
+
 		/*
 		 * Before adding this, check whether we raced
 		 * on adding the same user already..
@@ -198,7 +371,11 @@ struct user_struct * alloc_uid(struct us
 		spin_lock_irq(&uidhash_lock);
 		up = uid_hash_find(uid, hashent);
 		if (up) {
-			sched_destroy_user(new);
+			/* This case is not possible when CONFIG_FAIR_USER_SCHED
+			 * is defined, since we serialize alloc_uid() using
+			 * uids_mutex. Hence no need to call
+			 * sched_destroy_user() or remove_user_sysfs_dir().
+			 */
 			key_put(new->uid_keyring);
 			key_put(new->session_keyring);
 			kmem_cache_free(uid_cachep, new);
@@ -209,6 +386,9 @@ struct user_struct * alloc_uid(struct us
 		spin_unlock_irq(&uidhash_lock);
 
 	}
+
+	uids_mutex_unlock();
+
 	return up;
 }
 
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share
  2007-10-03 17:10                                       ` [RFC/PATCH -v2] " Dhaval Giani
@ 2007-10-04  7:57                                         ` Ingo Molnar
  2007-10-04  8:54                                           ` Heiko Carstens
  0 siblings, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2007-10-04  7:57 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar


* Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:

> Hi Ingo,
> 
> Can you please drop commit b1add858a10cece3a68b2d8cb9e7350843700a58 (last
> version of this patch) and try this instead?

> Changelog since v1:
> 1. Added a mutex to serialize directory creation/destruction for a user in
>    sysfs
> 2. Added a spinlock in the task_group structure to serialize writes to
>    tg->shares.
> 3. Removed /proc/root_user_cpu_shares.
> 4. Added Documentation about the group scheduler.

thanks - I have added this to the queue.

i'm wondering about the following: could not (yet) existing UIDs be made 
configurable too? I.e. if i do this in a bootup script:

  echo 2048 > /sys/kernel/uids/500/cpu_share

this should just work too, regardless of there not being any UID 500 
tasks yet. Likewise, once configured, the /sys/kernel/uids/* directories 
(with the settings in them) should probably not go away either.

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share
  2007-10-04  7:57                                         ` Ingo Molnar
@ 2007-10-04  8:54                                           ` Heiko Carstens
  2007-10-04 16:02                                             ` Bill Davidsen
                                                               ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Heiko Carstens @ 2007-10-04  8:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dhaval Giani, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar

> > Changelog since v1:
> > 1. Added a mutex to serialize directory creation/destruction for a user in
> >    sysfs
> > 2. Added a spinlock in the task_group structure to serialize writes to
> >    tg->shares.
> > 3. Removed /proc/root_user_cpu_shares.
> > 4. Added Documentation about the group scheduler.
> 
> thanks - I have added this to the queue.
> 
> i'm wondering about the following: could not (yet) existing UIDs be made 
> configurable too? I.e. if i do this in a bootup script:
> 
>   echo 2048 > /sys/kernel/uids/500/cpu_share
> 
> this should just work too, regardless of there not being any UID 500 
> tasks yet. Likewise, once configured, the /sys/kernel/uids/* directories 
> (with the settings in them) should probably not go away either.

Shouldn't that be done via uevents? E.g. UID x gets added to the sysfs tree,
generates a uevent and a script then figures out the cpu_share and sets it.
That way you also don't need to keep the directories. No?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share
  2007-10-04  8:54                                           ` Heiko Carstens
@ 2007-10-04 16:02                                             ` Bill Davidsen
  2007-10-04 17:20                                               ` Srivatsa Vaddagiri
  2007-10-04 21:32                                             ` Valdis.Kletnieks
  2007-10-09 15:12                                             ` [PATCH sched-devel] Generate uevents for user creation/destruction Srivatsa Vaddagiri
  2 siblings, 1 reply; 53+ messages in thread
From: Bill Davidsen @ 2007-10-04 16:02 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ingo Molnar, Dhaval Giani, Srivatsa Vaddagiri, Mike Galbraith,
	Peter Zijlstra, Dmitry Adamushko, lkml, maneesh, Andrew Morton,
	Sudhir Kumar

Heiko Carstens wrote:
>>> Changelog since v1:
>>> 1. Added a mutex to serialize directory creation/destruction for a user in
>>>    sysfs
>>> 2. Added a spinlock in the task_group structure to serialize writes to
>>>    tg->shares.
>>> 3. Removed /proc/root_user_cpu_shares.
>>> 4. Added Documentation about the group scheduler.
>> thanks - I have added this to the queue.
>>
>> i'm wondering about the following: could not (yet) existing UIDs be made 
>> configurable too? I.e. if i do this in a bootup script:
>>
>>   echo 2048 > /sys/kernel/uids/500/cpu_share
>>
>> this should just work too, regardless of there not being any UID 500 
>> tasks yet. Likewise, once configured, the /sys/kernel/uids/* directories 
>> (with the settings in them) should probably not go away either.
> 
> Shouldn't that be done via uevents? E.g. UID x gets added to the sysfs tree,
> generates a uevent and a script then figures out the cpu_share and sets it.
> That way you also don't need to keep the directories. No?

That sounds complex administratively. It means that instead of setting a 
higher or lower than default once and having it persist until reboot I 
have to create a script, which *will* in some cases be left in place 
even after the need has gone.

I agree with Ingo, once it's done it should be persistent.

And as another administrative convenience I can look at that set of 
values and see what shares are being used, even when the user is not 
currently active.

Final question, how do setuid processes map into this implementation?

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share
  2007-10-04 16:02                                             ` Bill Davidsen
@ 2007-10-04 17:20                                               ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-10-04 17:20 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Heiko Carstens, Ingo Molnar, Dhaval Giani, Mike Galbraith,
	Peter Zijlstra, Dmitry Adamushko, lkml, maneesh, Andrew Morton,
	Sudhir Kumar

On Thu, Oct 04, 2007 at 12:02:01PM -0400, Bill Davidsen wrote:
> >>i'm wondering about the following: could not (yet) existing UIDs be made 
> >>configurable too? I.e. if i do this in a bootup script:
> >>
> >>  echo 2048 > /sys/kernel/uids/500/cpu_share
> >>
> >>this should just work too, regardless of there not being any UID 500 
> >>tasks yet. Likewise, once configured, the /sys/kernel/uids/* directories 
> >>(with the settings in them) should probably not go away either.
> >
> >Shouldn't that be done via uevents? E.g. UID x gets added to the sysfs 
> >tree,
> >generates a uevent and a script then figures out the cpu_share and sets it.
> >That way you also don't need to keep the directories. No?
> 
> That sounds complex administratively. It means that instead of setting a 
> higher or lower than default once and having it persist until reboot I 
> have to create a script, which *will* in some cases be left in place 
> even after the need has gone.
> 
> I agree with Ingo, once it's done it should be persistent.
> And as another administrative convenience I can look at that set of 
> values and see what shares are being used, even when the user is not 
> currently active.

Although the need seems very real, I am thinking about the implementation 
aspect of this in the kernel i.e how will this be implementable?

1. The current patch proposes a sysfs based interface, where a new
   directory is created for every new user created who logs into the 
   system. To meet the requirement Ingo suggested, it would require the
   ability to create directories in sysfs in advance of (user_struct) objects 
   that aren't yet there - which is not possible to implement in sysfs
   afaik

2. configfs seems to allow creation of directories (i.e kernel objects) from 
   userland. Every new directory created should translate to a
   user_struct object being created in the kernel (and inserted in
   uid_hash table). Would this be acceptable?

Also, IMHO, CONFIG_FAIR_USER_SCHED is there only as a toy, to test fair group 
scheduling and I expect distros to support CONFIG_FAIR_CGROUP_SCHED instead 
which allows "control group" (or process containers) based fair group 
scheduling.

Using CONFIG_FAIR_CGROUP_SCHED it is still possible to provide user-id based 
fair group scheduling, in two ways:

	1. Have a daemon which listens for UID change events
	   (PROC_EVENT_UID) and move the task to appropriate "control
	   groups" and set the "control group" shares
	2. Implement a "user" subsystem registered with "cgroup" core,
  	   which automatically creates new "control groups" whenever
 	   a new user is being added to the system. This is very similar
	   to "ns" subsystem (kernel/ns_cgroup.c in -mm tree).
 	   Thus in order to provide fair user scheduling with this option,
	   distro needs to modify initrd to:

		# mkdir /dev/usercpucontrol
		# mount -t cgroup -ouser,cpu none /dev/usercpucontrol

Using a combination of these two options and a /etc configuration file
which specifies the cpu shares to be given to a user, it should be
possible for distro to give a good fair-user based scheduler.

> Final question, how do setuid processes map into this implementation?

We seem to be going by the real uid of a task (which is what tsk->user
points at) to decide its CPU bandwidth. Is that a cause of concern?

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share
  2007-10-04  8:54                                           ` Heiko Carstens
  2007-10-04 16:02                                             ` Bill Davidsen
@ 2007-10-04 21:32                                             ` Valdis.Kletnieks
  2007-10-05  7:01                                               ` Srivatsa Vaddagiri
  2007-10-09 15:12                                             ` [PATCH sched-devel] Generate uevents for user creation/destruction Srivatsa Vaddagiri
  2 siblings, 1 reply; 53+ messages in thread
From: Valdis.Kletnieks @ 2007-10-04 21:32 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ingo Molnar, Dhaval Giani, Srivatsa Vaddagiri, Mike Galbraith,
	Peter Zijlstra, Dmitry Adamushko, lkml, maneesh, Andrew Morton,
	Sudhir Kumar

[-- Attachment #1: Type: text/plain, Size: 862 bytes --]

On Thu, 04 Oct 2007 10:54:51 +0200, Heiko Carstens said:
> >   echo 2048 > /sys/kernel/uids/500/cpu_share
> > 
> > this should just work too, regardless of there not being any UID 500 
> > tasks yet. Likewise, once configured, the /sys/kernel/uids/* directories 
> > (with the settings in them) should probably not go away either.
> 
> Shouldn't that be done via uevents? E.g. UID x gets added to the sysfs tree,
> generates a uevent and a script then figures out the cpu_share and sets it.

That would tend to be a tad racy - a site may want to set limits in the
hypothetical /sys/kernel/uids/NNN before the program has a chance to fork-bomb
or otherwise make it difficult to set a limitfrom within another userspace
process.  It's similar to why you want a process to be launched with all its
ulimit's set, rather than set them after the fork/exec happens...


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share
  2007-10-04 21:32                                             ` Valdis.Kletnieks
@ 2007-10-05  7:01                                               ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-10-05  7:01 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Heiko Carstens, Ingo Molnar, Dhaval Giani, Mike Galbraith,
	Peter Zijlstra, Dmitry Adamushko, lkml, maneesh, Andrew Morton,
	Sudhir Kumar

On Thu, Oct 04, 2007 at 05:32:17PM -0400, Valdis.Kletnieks@vt.edu wrote:
> > Shouldn't that be done via uevents? E.g. UID x gets added to the sysfs tree,
> > generates a uevent and a script then figures out the cpu_share and sets it.
> 
> That would tend to be a tad racy - a site may want to set limits in the
> hypothetical /sys/kernel/uids/NNN before the program has a chance to fork-bomb
> or otherwise make it difficult to set a limitfrom within another userspace

Note that there is a default limit enforced on every new user (1024
value for the user's cpu share). This limit should contain any fork-bomb that
the user may launch immediately.

> process.  It's similar to why you want a process to be launched with all its
> ulimit's set, rather than set them after the fork/exec happens...





-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH sched-devel] Generate uevents for user creation/destruction
  2007-10-04  8:54                                           ` Heiko Carstens
  2007-10-04 16:02                                             ` Bill Davidsen
  2007-10-04 21:32                                             ` Valdis.Kletnieks
@ 2007-10-09 15:12                                             ` Srivatsa Vaddagiri
  2007-10-10  7:42                                               ` Ingo Molnar
  2 siblings, 1 reply; 53+ messages in thread
From: Srivatsa Vaddagiri @ 2007-10-09 15:12 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ingo Molnar, Dhaval Giani, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar

[-- Attachment #1: Type: text/plain, Size: 2418 bytes --]

On Thu, Oct 04, 2007 at 10:54:51AM +0200, Heiko Carstens wrote:
> > i'm wondering about the following: could not (yet) existing UIDs be made 
> > configurable too? I.e. if i do this in a bootup script:
> > 
> >   echo 2048 > /sys/kernel/uids/500/cpu_share
> > 
> > this should just work too, regardless of there not being any UID 500 
> > tasks yet. Likewise, once configured, the /sys/kernel/uids/* directories 
> > (with the settings in them) should probably not go away either.
> 
> Shouldn't that be done via uevents? E.g. UID x gets added to the sysfs tree,
> generates a uevent and a script then figures out the cpu_share and sets it.
> That way you also don't need to keep the directories. No?

Heiko,
	Thanks for the hint. Here's a patch to enable generation of
uevents for user creation/deletion. These uevents can be handled in
userspace to configure a new user's cpu share.

Note : After bootup I could test that new user's cpu share is configured
as per a configuration file (/etc/user_cpu_share.conf). However this
mechanism didnt work for root user. Perhaps uevent for root user is
generated way too early?

A HOWTO text file is also attached explaining how to make use of these
uevents in userspace.

Ingo,
	This patch applies on top of latest sched-devel tree. Pls review
and apply ..


---

Generate uevents when a user is being created/destroyed. These events
could be used to configure cpu share of a new user.

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>


---
 kernel/user.c |    4 ++++
 1 files changed, 4 insertions(+)

Index: current/kernel/user.c
===================================================================
--- current.orig/kernel/user.c
+++ current/kernel/user.c
@@ -174,6 +174,8 @@ static int user_kobject_create(struct us
 	if (error)
 		kobject_del(kobj);
 
+	kobject_uevent(kobj, KOBJ_ADD);
+
 done:
 	return error;
 }
@@ -189,6 +191,7 @@ int __init uids_kobject_init(void)
 
 	/* create under /sys/kernel dir */
 	uids_kobject.parent = &kernel_subsys.kobj;
+	uids_kobject.kset = &kernel_subsys;
 	kobject_set_name(&uids_kobject, "uids");
 	kobject_init(&uids_kobject);
 
@@ -228,6 +231,7 @@ static void remove_user_sysfs_dir(struct
 		goto done;
 
 	sysfs_remove_file(kobj, &up->user_attr.attr);
+	kobject_uevent(kobj, KOBJ_REMOVE);
 	kobject_del(kobj);
 
 	sched_destroy_user(up);



-- 
Regards,
vatsa

[-- Attachment #2: HOWTO --]
[-- Type: text/plain, Size: 920 bytes --]

This HOWTO explains the steps required to configure a user's cpu share 
automatically when he/she logs in. This has been verified to work on
a Redhat distribution.

Note : This HOWTO is a *hack* to get things working quickly. In particular it 
doesn't follow standards like LSB.

1. Create a file /etc/user_cpu_share.conf with this format:

	[uid]	[cpu_share]

   Ex:

	512	1024	#user vatsa
	514	512	#user guest


2. Create a script, named "kernel.agent" in /etc/hotplug directory 
   as follows. Make that script executable and owned by root.

#!/bin/sh
#
# kernel hotplug agent for 2.6 kernels
#
#       ACTION=add
#       DEVPATH=/kernel/uids/[uid]
#

cd /etc/hotplug
. ./hotplug.functions

case $ACTION in

add)
	uid=${DEVPATH##*/}
	line=`grep -w $uid /etc/user_cpu_share.conf`
	if [ $? -eq 0 ]
	then
		cpu_share=`echo $line | awk '{print $2}'`
		echo $cpu_share > /sys/$DEVPATH/cpu_share
	fi
	;;

*)
	;;
esac


	



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH sched-devel] Generate uevents for user creation/destruction
  2007-10-09 15:12                                             ` [PATCH sched-devel] Generate uevents for user creation/destruction Srivatsa Vaddagiri
@ 2007-10-10  7:42                                               ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2007-10-10  7:42 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Heiko Carstens, Dhaval Giani, Mike Galbraith, Peter Zijlstra,
	Dmitry Adamushko, lkml, maneesh, Andrew Morton, Sudhir Kumar


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> > Shouldn't that be done via uevents? E.g. UID x gets added to the 
> > sysfs tree, generates a uevent and a script then figures out the 
> > cpu_share and sets it. That way you also don't need to keep the 
> > directories. No?
> 
> Heiko,
> 	Thanks for the hint. Here's a patch to enable generation of 
> uevents for user creation/deletion. These uevents can be handled in 
> userspace to configure a new user's cpu share.
> 
> Note : After bootup I could test that new user's cpu share is 
> configured as per a configuration file (/etc/user_cpu_share.conf). 
> However this mechanism didnt work for root user. Perhaps uevent for 
> root user is generated way too early?
> 
> A HOWTO text file is also attached explaining how to make use of these 
> uevents in userspace.
> 
> Ingo,
> 	This patch applies on top of latest sched-devel tree. Pls 
> review and apply ..

thanks, applied. This looks reassuringly simple and straightforward!

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2007-10-10  7:42 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-09-24 21:45 [git] CFS-devel, latest code Ingo Molnar
2007-09-24 21:55 ` Andrew Morton
2007-09-24 21:59   ` Ingo Molnar
2007-09-25  0:08 ` Daniel Walker
2007-09-25  6:45   ` Ingo Molnar
2007-09-25 15:17     ` Daniel Walker
2007-09-25  6:10 ` Mike Galbraith
2007-09-25  7:35   ` Mike Galbraith
2007-09-25  8:33     ` Mike Galbraith
2007-09-25  8:53       ` Srivatsa Vaddagiri
2007-09-25  9:11         ` Srivatsa Vaddagiri
2007-09-25  9:15           ` Mike Galbraith
2007-09-25  9:12         ` Mike Galbraith
2007-09-25  9:13       ` Ingo Molnar
2007-09-25  9:17         ` Mike Galbraith
2007-09-25  9:47           ` Ingo Molnar
2007-09-25 10:02             ` Mike Galbraith
2007-09-26  8:04             ` Mike Galbraith
2007-09-28 21:46             ` Bill Davidsen
2007-09-25  9:44         ` Srivatsa Vaddagiri
2007-09-25  9:40           ` Ingo Molnar
2007-09-25 10:10             ` Ingo Molnar
2007-09-25 10:28               ` Srivatsa Vaddagiri
2007-09-25 10:36                 ` Ingo Molnar
2007-09-25 11:33                   ` Ingo Molnar
2007-09-25 14:48                     ` Srivatsa Vaddagiri
2007-09-25 12:51                   ` Srivatsa Vaddagiri
2007-09-25 13:35                     ` Mike Galbraith
2007-09-25 14:07                       ` Srivatsa Vaddagiri
2007-09-25 12:28                 ` Mike Galbraith
2007-09-25 12:54                   ` Mike Galbraith
     [not found]                     ` <20070925131717.GM26289@linux.vnet.ibm.com>
     [not found]                       ` <1190725693.13716.10.camel@Homer.simpson.net>
     [not found]                         ` <20070925132528.GN26289@linux.vnet.ibm.com>
     [not found]                           ` <1190726682.11260.1.camel@Homer.simpson.net>
     [not found]                             ` <20070925140559.GB26310@linux.vnet.ibm.com>
     [not found]                               ` <20070925143755.GA15594@elte.hu>
     [not found]                                 ` <20070926210737.GA8663@elte.hu>
2007-10-01 14:04                                   ` [RFC/PATCH] Add sysfs control to modify a user's cpu share Dhaval Giani
2007-10-01 14:44                                     ` Ingo Molnar
2007-10-01 15:32                                       ` Srivatsa Vaddagiri
2007-10-02 22:12                                       ` Eric St-Laurent
2007-10-03  4:09                                         ` Srivatsa Vaddagiri
2007-10-03 17:10                                       ` [RFC/PATCH -v2] " Dhaval Giani
2007-10-04  7:57                                         ` Ingo Molnar
2007-10-04  8:54                                           ` Heiko Carstens
2007-10-04 16:02                                             ` Bill Davidsen
2007-10-04 17:20                                               ` Srivatsa Vaddagiri
2007-10-04 21:32                                             ` Valdis.Kletnieks
2007-10-05  7:01                                               ` Srivatsa Vaddagiri
2007-10-09 15:12                                             ` [PATCH sched-devel] Generate uevents for user creation/destruction Srivatsa Vaddagiri
2007-10-10  7:42                                               ` Ingo Molnar
2007-10-01 16:12                                     ` [RFC/PATCH] Add sysfs control to modify a user's cpu share Dave Jones
2007-10-01 16:37                                       ` Srivatsa Vaddagiri
2007-09-25  6:50 ` [git] CFS-devel, latest code S.Çağlar Onur
2007-09-25  9:17   ` Ingo Molnar
2007-09-25  7:41 ` Andrew Morton
2007-09-25  8:43   ` Srivatsa Vaddagiri
2007-09-25  8:48     ` Andrew Morton
2007-09-25 11:00     ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.