* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]   ` <dhcf5-263-13@gated-at.bofh.it>
@ 2009-09-06  2:32     ` Ani
  2009-09-06  6:32       ` Mike Galbraith
  0 siblings, 1 reply; 84+ messages in thread
From: Ani @ 2009-09-06  2:32 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: linux-kernel

On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
>
> Indeed. I've tested this same test program in a single core machine and it
> produces the expected behavior:
>
> rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> 95%                              4.48%
> 60%                              54.84%
> 50%                              86.03%
> 40%                              OTHER completed first
>

Hmm. This does seem to indicate that there is some kind of
relationship with SMP. So I wonder whether there is a way to turn this
'RT bandwidth accumulation' heuristic off. I did an
echo 0 > /proc/sys/kernel/sched_migration_cost
but results were identical to previous.

I figure that if I set it to zero, the regular sched-fair (non-RT)
tasks will be treated as not being cache hot and hence susceptible to
migration. From the code it looks like sched-rt tasks are always
treated as cache cold? Mind you though that I have not yet looked into
the code very rigorously. I knew the O(1) scheduler relatively well,
but I have only just begun digging into the new CFS scheduler code.

On a side note, why is there no documentation explaining the
sched_migration_cost tuning knob? It would be nice to have one - at
least where the sysctl variable is defined.
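
For reference, the test being discussed is roughly of the following
shape (my own sketch from the description, not Lucas's actual program):
pin a SCHED_FIFO busy loop and a SCHED_OTHER counting loop to the same
CPU and compare how many iterations each of them manages.

/* rt_vs_other.c - rough sketch of the test, not the original program.
 * Build: gcc -O2 -pthread rt_vs_other.c -o rt_vs_other
 * Run as root so sched_setscheduler(SCHED_FIFO) succeeds.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static volatile unsigned long rt_loops, other_loops;
static volatile int stop;

static void pin_self(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* 0 == this thread */
}

static void *rt_hog(void *arg)
{
	struct sched_param sp = { .sched_priority = 1 };

	pin_self(0);
	sched_setscheduler(0, SCHED_FIFO, &sp);
	while (!stop)
		rt_loops++;
	return NULL;
}

static void *other_counter(void *arg)
{
	pin_self(0);				/* stays SCHED_OTHER */
	while (!stop)
		other_loops++;
	return NULL;
}

int main(void)
{
	pthread_t rt, other;

	pthread_create(&rt, NULL, rt_hog, NULL);
	pthread_create(&other, NULL, other_counter, NULL);
	sleep(10);				/* measurement interval */
	stop = 1;
	pthread_join(rt, NULL);
	pthread_join(other, NULL);
	printf("SCHED_OTHER got %.2f%% of all iterations\n",
	       100.0 * other_loops / (rt_loops + other_loops));
	return 0;
}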

--Ani

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-06  2:32     ` question on sched-rt group allocation cap: sched_rt_runtime_us Ani
@ 2009-09-06  6:32       ` Mike Galbraith
  2009-09-06 10:18         ` Mike Galbraith
                           ` (3 more replies)
  0 siblings, 4 replies; 84+ messages in thread
From: Mike Galbraith @ 2009-09-06  6:32 UTC (permalink / raw)
  To: Ani; +Cc: Lucas De Marchi, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sat, 2009-09-05 at 19:32 -0700, Ani wrote: 
> On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> >
> > Indeed. I've tested this same test program in a single core machine and it
> > produces the expected behavior:
> >
> > rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> > 95%                              4.48%
> > 60%                              54.84%
> > 50%                              86.03%
> > 40%                              OTHER completed first
> >
> 
> Hmm. This does seem to indicate that there is some kind of
> relationship with SMP. So I wonder whether there is a way to turn this
> 'RT bandwidth accumulation' heuristic off.

No there isn't, but maybe there should be, since this isn't the first
time it's come up.  One pro argument is that pinned tasks are thoroughly
screwed when an RT hog lands on their runqueue.  On the con side, the
whole RT bandwidth restriction thing is intended (AFAIK) to allow an
admin to regain control should an RT app go insane, which the default 5%
aggregate accomplishes just fine.

Dunno.  Fly or die little patchlet (toss). 

sched: allow the user to disable RT bandwidth aggregation.
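
The knob lands under /proc/sys/kernel/, so runtime borrowing can be
switched off with "echo 0 > /proc/sys/kernel/sched_rt_bandwidth_aggregate"
and back on with "echo 1".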

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8736ba1..6e6d4c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1881,6 +1881,7 @@ static inline unsigned int get_sysctl_timer_migration(void)
 #endif
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
+extern int sysctl_sched_rt_bandwidth_aggregate;
 
 int sched_rt_handler(struct ctl_table *table, int write,
 		struct file *filp, void __user *buffer, size_t *lenp,
diff --git a/kernel/sched.c b/kernel/sched.c
index c512a02..ca6a378 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -864,6 +864,12 @@ static __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+/*
+ * aggregate bandwidth, ie allow borrowing from neighbors when
+ * bandwidth for an individual runqueue is exhausted.
+ */
+int sysctl_sched_rt_bandwidth_aggregate = 1;
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 2eb4bd6..75daf88 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -495,6 +495,9 @@ static int balance_runtime(struct rt_rq *rt_rq)
 {
 	int more = 0;
 
+	if (!sysctl_sched_rt_bandwidth_aggregate)
+		return 0;
+
 	if (rt_rq->rt_time > rt_rq->rt_runtime) {
 		spin_unlock(&rt_rq->rt_runtime_lock);
 		more = do_balance_runtime(rt_rq);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cdbe8d0..0ad08e5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,6 +368,14 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_rt_bandwidth_aggregate",
+		.data		= &sysctl_sched_rt_bandwidth_aggregate,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &sched_rt_handler,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_compat_yield",
 		.data		= &sysctl_sched_compat_yield,
 		.maxlen		= sizeof(unsigned int),



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-06  6:32       ` Mike Galbraith
@ 2009-09-06 10:18         ` Mike Galbraith
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC482@exchange.ZeugmaSystems.local>
       [not found]         ` <DDFD17CC94A9BD49A82147DDF7D545C54DC483@exchange.ZeugmaSystems.local>
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 84+ messages in thread
From: Mike Galbraith @ 2009-09-06 10:18 UTC (permalink / raw)
  To: Ani; +Cc: Lucas De Marchi, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote: 
> > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> > > 95%                              4.48%
> > > 60%                              54.84%
> > > 50%                              86.03%
> > > 40%                              OTHER completed first
> > >
> > 
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn this
> > 'RT bandwidth accumulation' heuristic off.
> 
> No there isn't, but maybe there should be, since this isn't the first
> time it's come up.  One pro argument is that pinned tasks are thoroughly
> screwed when an RT hog lands on their runqueue.  On the con side, the
> whole RT bandwidth restriction thing is intended (AFAIK) to allow an
> admin to regain control should an RT app go insane, which the default 5%
> aggregate accomplishes just fine.
> 
> Dunno.  Fly or die little patchlet (toss). 

btw, a _kinda sorta_ pro is that it can prevent IO lockups like the
below.  Seems kjournald can end up depending on kblockd/3, which ain't
going anywhere with that 100% RT hog in the way, so the whole box is
fairly hosed.  (much better would be to wake some other kblockd)

top - 12:01:49 up 56 min, 20 users,  load average: 8.01, 4.96, 2.39
Tasks: 304 total,   4 running, 300 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.8%us,  0.3%sy,  0.0%ni,  0.0%id, 73.7%wa,  0.3%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
13897 root      -2   0  7920  592  484 R  100  0.0   1:13.43 3 xx
12716 root      20   0  8868 1328  860 R    1  0.0   0:01.44 0 top
   14 root      15  -5     0    0    0 R    0  0.0   0:00.02 3 events/3
   94 root      15  -5     0    0    0 R    0  0.0   0:00.00 3 kblockd/3
 1212 root      15  -5     0    0    0 D    0  0.0   0:00.04 2 kjournald
14393 root      20   0  9848 2296  756 D    0  0.1   0:00.01 0 make
14404 root      20   0 38012  25m 5552 D    0  0.8   0:00.21 1 cc1
14405 root      20   0 20220 8852 2388 D    0  0.3   0:00.02 1 as
14437 root      20   0 24132  10m 2680 D    0  0.3   0:00.06 2 cc1
14448 root      20   0 18324 1724 1240 D    0  0.1   0:00.00 2 cc1
14452 root      20   0 12540  792  656 D    0  0.0   0:00.00 2 mv



^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC482@exchange.ZeugmaSystems.local>
@ 2009-09-06 15:09             ` Mike Galbraith
  2009-09-07  0:41               ` Anirban Sinha
       [not found]               ` <1252311463.7586.26.camel@marge.simson.net>
  0 siblings, 2 replies; 84+ messages in thread
From: Mike Galbraith @ 2009-09-06 15:09 UTC (permalink / raw)
  To: Anirban Sinha
  Cc: Lucas De Marchi, linux-kernel, Peter Zijlstra, Ingo Molnar, ani

On Sun, 2009-09-06 at 07:53 -0700, Anirban Sinha wrote:
> 
> 
> 
> > Seems kjournald can end up depending on kblockd/3, which ain't
> > going anywhere with that 100% RT hog in the way,
> 
> I think in the past AKPM's response to this has been "just don't do
> it", i.e, don't hog the CPU with an RT thread.

Oh yeah, sure.  Best to run RT oinkers on isolated cpus.  It just
surprised me that the 100% compute RT cpu became involved in IO.

	-Mike


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC485@exchange.ZeugmaSystems.local>
@ 2009-09-07  0:28             ` Anirban Sinha
  0 siblings, 0 replies; 84+ messages in thread
From: Anirban Sinha @ 2009-09-07  0:28 UTC (permalink / raw)
  To: Anirban Sinha, Dario Faggioli, a.p.zijlstra, Ingo Molnar,
	linux-kernel, Fabio Checconi
  Cc: Anirban Sinha


 > Dunno.  Fly or die little patchlet (toss).

 > sched: allow the user to disable RT bandwidth aggregation.

Hmm. Interesting. With this change, my results are as follows:

rt_runtime/rt_period   % of reg iterations

0.2                    100%
0.25                   100%
0.3                    100%
0.4                    100%
0.5                    82%
0.6                    66%
0.7                    54%
0.8                    46%
0.9                    38.5%
0.95                   32%


These results are on a quad-core blade. Do they still make sense though?
Can anyone else run the same tests on a quad-core with the latest kernel?
I will patch our 2.6.26 kernel with upstream fixes and rerun these
tests on Tuesday.

Ani





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-06 15:09             ` Mike Galbraith
@ 2009-09-07  0:41               ` Anirban Sinha
       [not found]               ` <1252311463.7586.26.camel@marge.simson.net>
  1 sibling, 0 replies; 84+ messages in thread
From: Anirban Sinha @ 2009-09-07  0:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Anirban Sinha, Lucas De Marchi, linux-kernel, Peter Zijlstra,
	Ingo Molnar


On 2009-09-06, at 8:09 AM, Mike Galbraith wrote:

> On Sun, 2009-09-06 at 07:53 -0700, Anirban Sinha wrote:
>>
>>
>>
>>> Seems kjournald can end up depending on kblockd/3, which ain't
>>> going anywhere with that 100% RT hog in the way,
>>
>> I think in the past AKPM's response to this has been "just don't do
>> it", i.e, don't hog the CPU with an RT thread.
>
> Oh yeah, sure.  Best to run RT oinkers on isolated cpus.

Correct. Unfortunately, in some places the application coders do
stupid things and then the onus falls on the kernel guys to make
things 'just work'.

I would not have any problem if such a cap mechanism did not exist at
all. However, since we do have such a tuning knob, I would say let's
make it do what it is supposed to do. The documentation says "0.05s to
be used by SCHED_OTHER", but it never hints that if your thread is tied
to the RT core, you are screwed. The bandwidth accumulation logic will
virtually kill all the remaining SCHED_OTHER threads well before that
95% cap is reached. That doesn't quite seem right. At the very least,
can we have this clearly written in sched-rt-group.txt?
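
Something along these lines would do (my rough suggestion only): "Realtime
runtime left unused on one CPU may be borrowed by other CPUs in the same
root domain, so a SCHED_OTHER task pinned to a CPU running a realtime hog
can receive far less than the nominal (rt_period - rt_runtime) / rt_period
share, possibly nothing at all."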

Cheers,

Ani

  

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]         ` <DDFD17CC94A9BD49A82147DDF7D545C54DC483@exchange.ZeugmaSystems.local>
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC485@exchange.ZeugmaSystems.local>
@ 2009-09-07  6:54           ` Mike Galbraith
       [not found]             ` <DDFD17CC94A9BD49A82147DDF7D545C54DC489@exchange.ZeugmaSystems.local>
  1 sibling, 1 reply; 84+ messages in thread
From: Mike Galbraith @ 2009-09-07  6:54 UTC (permalink / raw)
  To: Anirban Sinha; +Cc: Lucas De Marchi, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sun, 2009-09-06 at 17:18 -0700, Anirban Sinha wrote:
> 
> 
> > Dunno.  Fly or die little patchlet (toss).
> 
> > sched: allow the user to disable RT bandwidth aggregation.
> 
> Hmm. Interesting. With this change, my results are as follows:
> 
> rt_runtime/rt_period   % of reg iterations
> 
> 0.2                    100%
> 0.25                   100%
> 0.3                    100%
> 0.4                    100%
> 0.5                    82%
> 0.6                    66%
> 0.7                    54%
> 0.8                    46%
> 0.9                    38.5%
> 0.95                   32%
> 
> 
> These results are on a quad-core blade. Do they still make sense
> though?
> Can anyone else run the same tests on a quad-core with the latest
> kernel? I will patch our 2.6.26 kernel with upstream fixes and rerun
> these tests on Tuesday.

I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
measurement proglet on an isolated Q6600 core.
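
The proglet is nothing fancy, roughly along these lines (a sketch, not
the verbatim source): pin a SCHED_FIFO hog to the isolated core and
compare its thread cputime against walltime.

/* rt_util.c - sketch of the measurement, not the exact proglet.
 * Build: gcc -O2 rt_util.c -lrt -o rt_util ; run as root.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

static double delta(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 1 };
	struct timespec w0, w1, c0, c1;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(3, &set);		/* whichever core is isolated */
	sched_setaffinity(0, sizeof(set), &set);
	sched_setscheduler(0, SCHED_FIFO, &sp);

	clock_gettime(CLOCK_MONOTONIC, &w0);
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c0);
	do {				/* burn cpu for ~10s of walltime */
		clock_gettime(CLOCK_MONOTONIC, &w1);
	} while (delta(w0, w1) < 10.0);
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c1);

	printf("RT utilization: %.2f%%\n",
	       100.0 * delta(c0, c1) / delta(w0, w1));
	return 0;
}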

10s measurement interval results:

sched_rt_runtime_us  RT utilization
950000               94.99%
750000               75.00%
500000               50.04%
250000               25.02%
 50000                5.03%

Seems to work fine here.

	-Mike


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-06  6:32       ` Mike Galbraith
  2009-09-06 10:18         ` Mike Galbraith
       [not found]         ` <DDFD17CC94A9BD49A82147DDF7D545C54DC483@exchange.ZeugmaSystems.local>
@ 2009-09-07  7:59         ` Peter Zijlstra
  2009-09-07  8:24           ` Mike Galbraith
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC487@exchange.ZeugmaSystems.local>
       [not found]         ` <DDFD17CC94A9BD49A82147DDF7D545C54DC48B@exchange.ZeugmaSystems.local>
  3 siblings, 2 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-07  7:59 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ani, Lucas De Marchi, linux-kernel, Ingo Molnar

On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote: 
> > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> > > 95%                              4.48%
> > > 60%                              54.84%
> > > 50%                              86.03%
> > > 40%                              OTHER completed first
> > >
> > 
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn this
> > 'RT bandwidth accumulation' heuristic off.
> 
> No there isn't..

Actually there is, use cpusets to carve the system into partitions.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-07  7:59         ` Peter Zijlstra
@ 2009-09-07  8:24           ` Mike Galbraith
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC487@exchange.ZeugmaSystems.local>
  1 sibling, 0 replies; 84+ messages in thread
From: Mike Galbraith @ 2009-09-07  8:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ani, Lucas De Marchi, linux-kernel, Ingo Molnar

On Mon, 2009-09-07 at 09:59 +0200, Peter Zijlstra wrote:
> On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> > On Sat, 2009-09-05 at 19:32 -0700, Ani wrote: 
> > > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> > > >
> > > > Indeed. I've tested this same test program in a single core machine and it
> > > > produces the expected behavior:
> > > >
> > > > rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> > > > 95%                              4.48%
> > > > 60%                              54.84%
> > > > 50%                              86.03%
> > > > 40%                              OTHER completed first
> > > >
> > > 
> > > Hmm. This does seem to indicate that there is some kind of
> > > relationship with SMP. So I wonder whether there is a way to turn this
> > > 'RT bandwidth accumulation' heuristic off.
> > 
> > No there isn't..
> 
> Actually there is, use cpusets to carve the system into partitions.

Yeah, I stand corrected.  I tend to think in terms of the dirt simplest
configuration only.

	-Mike


^ permalink raw reply	[flat|nested] 84+ messages in thread

* [rfc] lru_add_drain_all() vs isolation
       [not found]               ` <1252311463.7586.26.camel@marge.simson.net>
@ 2009-09-07 11:06                   ` Peter Zijlstra
  0 siblings, 0 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-07 11:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, linux-mm, Christoph Lameter, Oleg Nesterov, lkml

On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:

> [  774.651779] SysRq : Show Blocked State
> [  774.655770]   task                        PC stack   pid father
> [  774.655770] evolution.bin D ffff8800bc1575f0     0  7349   6459 0x00000000
> [  774.676008]  ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> [  774.676008]  000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> [  774.676008]  00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> [  774.676008] Call Trace:
> [  774.676008]  [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> [  774.676008]  [<ffffffff812c4891>] wait_for_common+0xde/0x155
> [  774.676008]  [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> [  774.676008]  [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> [  774.676008]  [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> [  774.676008]  [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> [  774.676008]  [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> [  774.676008]  [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> [  774.676008]  [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> [  774.676008]  [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b

FWIW, something like the below (prone to explode since its utterly
untested) should (mostly) fix that one case. Something similar needs to
be done for pretty much all machine wide workqueue thingies, possibly
also flush_workqueue().

---
 include/linux/workqueue.h |    1 +
 kernel/workqueue.c        |   52 +++++++++++++++++++++++++++++++++++---------
 mm/swap.c                 |   14 ++++++++---
 3 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 6273fa9..95b1df2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -213,6 +213,7 @@ extern int schedule_work_on(int cpu, struct work_struct *work);
 extern int schedule_delayed_work(struct delayed_work *work, unsigned long delay);
 extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
 					unsigned long delay);
+extern int schedule_on_mask(const struct cpumask *mask, work_func_t func);
 extern int schedule_on_each_cpu(work_func_t func);
 extern int current_is_keventd(void);
 extern int keventd_up(void);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3c44b56..81456fc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -657,6 +657,23 @@ int schedule_delayed_work_on(int cpu,
 }
 EXPORT_SYMBOL(schedule_delayed_work_on);
 
+struct sched_work_struct {
+	struct work_struct work;
+	work_func_t func;
+	atomic_t *count;
+	struct completion *completion;
+};
+
+static void do_sched_work(struct work_struct *work)
+{
+	struct sched_work_struct *sws = container_of(work, struct sched_work_struct, work);
+
+	sws->func(NULL);
+
+	if (atomic_dec_and_test(sws->count))
+		complete(sws->completion);
+}
+
 /**
  * schedule_on_each_cpu - call a function on each online CPU from keventd
  * @func: the function to call
@@ -666,29 +683,42 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
  *
  * schedule_on_each_cpu() is very slow.
  */
-int schedule_on_each_cpu(work_func_t func)
+int schedule_on_mask(const struct cpumask *mask, work_func_t func)
 {
+	struct completion completion = COMPLETION_INITIALIZER_ONSTACK(completion);
+	atomic_t count = ATOMIC_INIT(cpumask_weight(mask));
+	struct sched_work_struct *works;
 	int cpu;
-	struct work_struct *works;
 
-	works = alloc_percpu(struct work_struct);
+	works = alloc_percpu(struct sched_work_struct);
 	if (!works)
 		return -ENOMEM;
 
-	get_online_cpus();
-	for_each_online_cpu(cpu) {
-		struct work_struct *work = per_cpu_ptr(works, cpu);
+	for_each_cpu(cpu, mask) {
+		struct sched_work_struct *work = per_cpu_ptr(works, cpu);
+		work->count = &count;
+		work->completion = &completion;
+		work->func = func;
 
-		INIT_WORK(work, func);
-		schedule_work_on(cpu, work);
+		INIT_WORK(&work->work, do_sched_work);
+		schedule_work_on(cpu, &work->work);
 	}
-	for_each_online_cpu(cpu)
-		flush_work(per_cpu_ptr(works, cpu));
-	put_online_cpus();
+	wait_for_completion(&completion);
 	free_percpu(works);
 	return 0;
 }
 
+int schedule_on_each_cpu(work_func_t func)
+{
+	int ret;
+
+	get_online_cpus();
+	ret = schedule_on_mask(cpu_online_mask, func);
+	put_online_cpus();
+
+	return ret;
+}
+
 void flush_scheduled_work(void)
 {
 	flush_workqueue(keventd_wq);
diff --git a/mm/swap.c b/mm/swap.c
index cb29ae5..11e4b1e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
+static cpumask_t lru_drain_mask;
 static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 
@@ -216,12 +217,15 @@ EXPORT_SYMBOL(mark_page_accessed);
 
 void __lru_cache_add(struct page *page, enum lru_list lru)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
+	int cpu = get_cpu();
+	struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu)[lru];
+
+	cpumask_set_cpu(cpu, lru_drain_mask);
 
 	page_cache_get(page);
 	if (!pagevec_add(pvec, page))
 		____pagevec_lru_add(pvec, lru);
-	put_cpu_var(lru_add_pvecs);
+	put_cpu();
 }
 
 /**
@@ -294,7 +298,9 @@ static void drain_cpu_pagevecs(int cpu)
 
 void lru_add_drain(void)
 {
-	drain_cpu_pagevecs(get_cpu());
+	int cpu = get_cpu();
+	cpumask_clear_cpu(cpu, lru_drain_mask);
+	drain_cpu_pagevecs(cpu);
 	put_cpu();
 }
 
@@ -308,7 +314,7 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
  */
 int lru_add_drain_all(void)
 {
-	return schedule_on_each_cpu(lru_add_drain_per_cpu);
+	return schedule_on_mask(lru_drain_mask, lru_add_drain_per_cpu);
 }
 
 /*



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-07 11:06                   ` Peter Zijlstra
@ 2009-09-07 13:35                     ` Oleg Nesterov
  -1 siblings, 0 replies; 84+ messages in thread
From: Oleg Nesterov @ 2009-09-07 13:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, linux-mm, Christoph Lameter, lkml

On 09/07, Peter Zijlstra wrote:
>
> On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
>
> > [  774.651779] SysRq : Show Blocked State
> > [  774.655770]   task                        PC stack   pid father
> > [  774.655770] evolution.bin D ffff8800bc1575f0     0  7349   6459 0x00000000
> > [  774.676008]  ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > [  774.676008]  000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > [  774.676008]  00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > [  774.676008] Call Trace:
> > [  774.676008]  [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > [  774.676008]  [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > [  774.676008]  [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [  774.676008]  [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > [  774.676008]  [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > [  774.676008]  [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > [  774.676008]  [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > [  774.676008]  [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > [  774.676008]  [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > [  774.676008]  [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
>
> FWIW, something like the below (prone to explode since its utterly
> untested) should (mostly) fix that one case. Something similar needs to
> be done for pretty much all machine wide workqueue thingies, possibly
> also flush_workqueue().

Failed to google the previous discussion. Could you please point me?
What is the problem?

> +struct sched_work_struct {
> +	struct work_struct work;
> +	work_func_t func;
> +	atomic_t *count;
> +	struct completion *completion;
> +};

(not that it matters, but perhaps sched_work_struct should have a single
 pointer to the struct which contains func, count, completion).
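
Say, roughly (just to illustrate the idea, not a patch):

struct sched_work_info {
	work_func_t		func;
	atomic_t		count;
	struct completion	completion;
};

struct sched_work_struct {
	struct work_struct	work;
	struct sched_work_info	*info;
};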

> -int schedule_on_each_cpu(work_func_t func)
> +int schedule_on_mask(const struct cpumask *mask, work_func_t func)

Looks like a useful helper. But,

> +	for_each_cpu(cpu, mask) {
> +		struct sched_work_struct *work = per_cpu_ptr(works, cpu);
> +		work->count = &count;
> +		work->completion = &completion;
> +		work->func = func;
>
> -		INIT_WORK(work, func);
> -		schedule_work_on(cpu, work);
> +		INIT_WORK(&work->work, do_sched_work);
> +		schedule_work_on(cpu, &work->work);

This means the caller must ensure the CPU is online and can't go away. Otherwise
we can hang forever.

schedule_on_each_cpu() is fine, it calls us under get_online_cpus().
But,

>  int lru_add_drain_all(void)
>  {
> -	return schedule_on_each_cpu(lru_add_drain_per_cpu);
> +	return schedule_on_mask(lru_drain_mask, lru_add_drain_per_cpu);
>  }

This doesn't look safe.

Looks like, schedule_on_mask() should take get_online_cpus(), do
cpus_and(mask, mask, online_cpus), then schedule works.

If we don't care the work can migrate to another CPU, schedule_on_mask()
can do put_online_cpus() before wait_for_completion().
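
IOW, something like this on top of your patch (untested sketch only, just
to show what I mean; sched_work_struct/do_sched_work as in your patch):

int schedule_on_mask(const struct cpumask *mask, work_func_t func)
{
	struct completion completion = COMPLETION_INITIALIZER_ONSTACK(completion);
	struct sched_work_struct *works;
	atomic_t count;
	int cpu, nr = 0;

	works = alloc_percpu(struct sched_work_struct);
	if (!works)
		return -ENOMEM;

	get_online_cpus();
	/* only poke cpus which are both in the mask and online */
	for_each_cpu_and(cpu, mask, cpu_online_mask)
		nr++;
	atomic_set(&count, nr);

	for_each_cpu_and(cpu, mask, cpu_online_mask) {
		struct sched_work_struct *work = per_cpu_ptr(works, cpu);

		work->count = &count;
		work->completion = &completion;
		work->func = func;

		INIT_WORK(&work->work, do_sched_work);
		schedule_work_on(cpu, &work->work);
	}
	if (nr)
		wait_for_completion(&completion);
	/*
	 * If we don't care that the works can migrate, put_online_cpus()
	 * could go before the wait instead.
	 */
	put_online_cpus();
	free_percpu(works);
	return 0;
}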

Oleg.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-07 13:35                     ` Oleg Nesterov
@ 2009-09-07 13:53                       ` Peter Zijlstra
  -1 siblings, 0 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-07 13:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, Ingo Molnar, linux-mm, Christoph Lameter, lkml

On Mon, 2009-09-07 at 15:35 +0200, Oleg Nesterov wrote:
> On 09/07, Peter Zijlstra wrote:
> >
> > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> >
> > > [  774.651779] SysRq : Show Blocked State
> > > [  774.655770]   task                        PC stack   pid father
> > > [  774.655770] evolution.bin D ffff8800bc1575f0     0  7349   6459 0x00000000
> > > [  774.676008]  ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > [  774.676008]  000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > [  774.676008]  00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > [  774.676008] Call Trace:
> > > [  774.676008]  [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > [  774.676008]  [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > [  774.676008]  [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > [  774.676008]  [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > [  774.676008]  [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > [  774.676008]  [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > [  774.676008]  [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > [  774.676008]  [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > [  774.676008]  [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > [  774.676008]  [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> >
> > FWIW, something like the below (prone to explode since its utterly
> > untested) should (mostly) fix that one case. Something similar needs to
> > be done for pretty much all machine wide workqueue thingies, possibly
> > also flush_workqueue().
> 
> Failed to google the previous discussion. Could you please point me?
> What is the problem?

Ah, the general problem is that when we carve up the machine into
partitions using cpusets, we still get machine wide tickles on all cpus
from workqueue stuff like schedule_on_each_cpu() and flush_workqueue(),
even if some cpus don't actually use their workqueue.

So the below limits lru_add_drain() activity to cpus that actually have
pages in their per-cpu lists.

flush_workqueue() could limit itself to cpus that had work queued since
the last flush_workqueue() invocation, etc.

This avoids un-needed disruption of these cpus.

Christoph wants this because he's running cpu-bound userspace and simply
doesn't care to donate a few cycles to the kernel maintenance when not
needed (every tiny bit helps in completing the HPC job sooner).

Mike ran into this because he's starving a partitioned cpu using an RT
task -- which currently starves the other cpus because the workqueues
don't get to run and everybody waits...

The lru_add_drain_all() thing is just one of the many cases, and the
below won't fully solve Mike's problem since the cpu could still have
pending work on the per-cpu list from starting the RT task... but it's
showing the direction of how to improve things.

> > +struct sched_work_struct {
> > +	struct work_struct work;
> > +	work_func_t func;
> > +	atomic_t *count;
> > +	struct completion *completion;
> > +};
> 
> (not that it matters, but perhaps sched_work_struct should have a single
>  pointer to the struct which contains func, count, completion).

Sure, it more-or-less grew while writing, I always forget completions
don't count.

> > -int schedule_on_each_cpu(work_func_t func)
> > +int schedule_on_mask(const struct cpumask *mask, work_func_t func)
> 
> Looks like a useful helper. But,
> 
> > +	for_each_cpu(cpu, mask) {
> > +		struct sched_work_struct *work = per_cpu_ptr(works, cpu);
> > +		work->count = &count;
> > +		work->completion = &completion;
> > +		work->func = func;
> >
> > -		INIT_WORK(work, func);
> > -		schedule_work_on(cpu, work);
> > +		INIT_WORK(&work->work, do_sched_work);
> > +		schedule_work_on(cpu, &work->work);
> 
> This means the caller must ensure the CPU is online and can't go away. Otherwise
> we can hang forever.
> 
> schedule_on_each_cpu() is fine, it calls us under get_online_cpus().
> But,
> 
> >  int lru_add_drain_all(void)
> >  {
> > -	return schedule_on_each_cpu(lru_add_drain_per_cpu);
> > +	return schedule_on_mask(lru_drain_mask, lru_add_drain_per_cpu);
> >  }
> 
> This doesn't look safe.
> 
> Looks like, schedule_on_mask() should take get_online_cpus(), do
> cpus_and(mask, mask, online_cpus), then schedule works.
> 
> If we don't care the work can migrate to another CPU, schedule_on_mask()
> can do put_online_cpus() before wait_for_completion().

Ah, right. Like I said, I only quickly hacked this up as an example of how
to improve isolation between cpus and limit unneeded work, in the hope that
someone would pick this up and maybe tackle other sites as well.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-07 13:53                       ` Peter Zijlstra
@ 2009-09-07 14:18                         ` Oleg Nesterov
  -1 siblings, 0 replies; 84+ messages in thread
From: Oleg Nesterov @ 2009-09-07 14:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, linux-mm, Christoph Lameter, lkml

On 09/07, Peter Zijlstra wrote:
>
> On Mon, 2009-09-07 at 15:35 +0200, Oleg Nesterov wrote:
> >
> > Failed to google the previous discussion. Could you please point me?
> > What is the problem?
>
> Ah, the general problem is that when we carve up the machine into
> partitions using cpusets, we still get machine wide tickles on all cpus
> from workqueue stuff like schedule_on_each_cpu() and flush_workqueue(),
> even if some cpus don't actually use their workqueue.
>
> So the below limits lru_add_drain() activity to cpus that actually have
> pages in their per-cpu lists.

Thanks Peter!

> flush_workqueue() could limit itself to cpus that had work queued since
> the last flush_workqueue() invocation, etc.

But "work queued since the last flush_workqueue() invocation" just means
"has work queued". Please note that flush_cpu_workqueue() does nothing
if there are no works, except it does lock/unlock of cwq->lock.

IIRC, flush_cpu_workqueue() has to lock/unlock to avoid the races with
CPU hotplug, but _perhaps_ flush_workqueue() can do the check lockless.

Afaics, we can add the workqueue_struct->cpu_map_has_works to help
flush_workqueue(), but this means we should complicate insert_work()
and run_workqueue() which should set/clear the bit. But given that
flush_workqueue() should be avoided anyway, I am not sure.

Oleg.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-07 14:18                         ` Oleg Nesterov
@ 2009-09-07 14:25                           ` Peter Zijlstra
  -1 siblings, 0 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-07 14:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, Ingo Molnar, linux-mm, Christoph Lameter, lkml

On Mon, 2009-09-07 at 16:18 +0200, Oleg Nesterov wrote:

> > flush_workqueue() could limit itself to cpus that had work queued since
> > the last flush_workqueue() invocation, etc.
> 
> But "work queued since the last flush_workqueue() invocation" just means
> "has work queued". Please note that flush_cpu_workqueue() does nothing
> if there are no works, except it does lock/unlock of cwq->lock.
> 
> IIRC, flush_cpu_workqueue() has to lock/unlock to avoid the races with
> CPU hotplug, but _perhaps_ flush_workqueue() can do the check lockless.
> 
> Afaics, we can add the workqueue_struct->cpu_map_has_works to help
> flush_workqueue(), but this means we should complicate insert_work()
> and run_workqueue() which should set/clear the bit. But given that
> flush_workqueue() should be avoided anyway, I am not sure.

Ah, indeed. Then nothing new would be needed here, since it will
not interrupt processing on the remote cpus that never queued any work.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-07 11:06                   ` Peter Zijlstra
@ 2009-09-07 23:56                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 84+ messages in thread
From: KOSAKI Motohiro @ 2009-09-07 23:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Christoph Lameter, Oleg Nesterov, lkml

Hi Peter,

> On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> 
> > [  774.651779] SysRq : Show Blocked State
> > [  774.655770]   task                        PC stack   pid father
> > [  774.655770] evolution.bin D ffff8800bc1575f0     0  7349   6459 0x00000000
> > [  774.676008]  ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > [  774.676008]  000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > [  774.676008]  00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > [  774.676008] Call Trace:
> > [  774.676008]  [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > [  774.676008]  [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > [  774.676008]  [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [  774.676008]  [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > [  774.676008]  [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > [  774.676008]  [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > [  774.676008]  [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > [  774.676008]  [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > [  774.676008]  [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > [  774.676008]  [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> 
> FWIW, something like the below (prone to explode since its utterly
> untested) should (mostly) fix that one case. Something similar needs to
> be done for pretty much all machine wide workqueue thingies, possibly
> also flush_workqueue().

Can you please explain how to reproduce this and describe the problem in
more detail?

AFAIK, mlock() calls lru_add_drain_all() _before_ grabbing the semaphore,
so it doesn't cause any deadlock.





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC487@exchange.ZeugmaSystems.local>
@ 2009-09-08  7:08             ` Anirban Sinha
  2009-09-08  8:42               ` Peter Zijlstra
  0 siblings, 1 reply; 84+ messages in thread
From: Anirban Sinha @ 2009-09-08  7:08 UTC (permalink / raw)
  To: Anirban Sinha, Peter Zijlstra, Mike Galbraith, linux-kernel,
	Ingo Molnar, Dario Faggioli
  Cc: Anirban Sinha


On 2009-09-07, at 9:42 AM, Anirban Sinha wrote:

>
>
>
> -----Original Message-----
> From: Peter Zijlstra [mailto:a.p.zijlstra@chello.nl]
> Sent: Mon 9/7/2009 12:59 AM
> To: Mike Galbraith
> Cc: Anirban Sinha; Lucas De Marchi; linux-kernel@vger.kernel.org;  
> Ingo Molnar
> Subject: Re: question on sched-rt group allocation cap:  
> sched_rt_runtime_us
>
> On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> > On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com>  
> wrote:
> > > >
> > > > Indeed. I've tested this same test program in a single core  
> machine and it
> > > > produces the expected behavior:
> > > >
> > > > rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> > > > 95%                              4.48%
> > > > 60%                              54.84%
> > > > 50%                              86.03%
> > > > 40%                              OTHER completed first
> > > >
> > >
> > > Hmm. This does seem to indicate that there is some kind of
> > > relationship with SMP. So I wonder whether there is a way to  
> turn this
> > > 'RT bandwidth accumulation' heuristic off.
> >
> > No there isn't..
>
> Actually there is, use cpusets to carve the system into partitions.

Hmm, OK. I looked at the code a little bit. It seems to me that the
'borrowing' of RT runtimes occurs only between rt runqueues belonging to
the same root domain, and that partition_sched_domains() is the only
external interface that can be used to create a root domain out of a CPU
set. But then I think it needs CGROUPS/USER groups enabled, right?

--Ani


>
>
>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]             ` <DDFD17CC94A9BD49A82147DDF7D545C54DC489@exchange.ZeugmaSystems.local>
@ 2009-09-08  7:10               ` Anirban Sinha
  2009-09-08  9:26                 ` Mike Galbraith
  0 siblings, 1 reply; 84+ messages in thread
From: Anirban Sinha @ 2009-09-08  7:10 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Dario Faggioli,
	Mike Galbraith
  Cc: Anirban Sinha, Anirban Sinha


On 2009-09-07, at 9:44 AM, Anirban Sinha wrote:

>
>
>
> -----Original Message-----
> From: Mike Galbraith [mailto:efault@gmx.de]
> Sent: Sun 9/6/2009 11:54 PM
> To: Anirban Sinha
> Cc: Lucas De Marchi; linux-kernel@vger.kernel.org; Peter Zijlstra;  
> Ingo Molnar
> Subject: RE: question on sched-rt group allocation cap:  
> sched_rt_runtime_us
>
> On Sun, 2009-09-06 at 17:18 -0700, Anirban Sinha wrote:
> >
> >
> > > Dunno.  Fly or die little patchlet (toss).
> >
> > > sched: allow the user to disable RT bandwidth aggregation.
> >
> > Hmm. Interesting. With this change, my results are as follows:
> >
> > rt_runtime/rt_period   % of reg iterations
> >
> > 0.2                    100%
> > 0.25                   100%
> > 0.3                    100%
> > 0.4                    100%
> > 0.5                    82%
> > 0.6                    66%
> > 0.7                    54%
> > 0.8                    46%
> > 0.9                    38.5%
> > 0.95                   32%
> >
> >
> > This results are on a quad core blade. Does it still makes sense
> > though?
> > Can anyone else run the same tests on a quadcore over the latest
> > kernel? I will patch our 2.6.26 kernel with upstream fixes and rerun
> > these tests on tuesday.
>
> I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
> measurement proglet on an isolated Q6600 core.


Thanks Mike. Is this on a single core machine (or one core carved out
of N)?  We may be missing some newer patches in our 2.6.26 kernel that
fix some accounting bugs. I will do a review and rerun the test after
applying the upstream patches.

Ani


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-07 23:56                     ` KOSAKI Motohiro
@ 2009-09-08  8:20                       ` Peter Zijlstra
  -1 siblings, 0 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-08  8:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mike Galbraith, Ingo Molnar, linux-mm, Christoph Lameter,
	Oleg Nesterov, lkml

On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> Hi Peter,
> 
> > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> > 
> > > [  774.651779] SysRq : Show Blocked State
> > > [  774.655770]   task                        PC stack   pid father
> > > [  774.655770] evolution.bin D ffff8800bc1575f0     0  7349   6459 0x00000000
> > > [  774.676008]  ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > [  774.676008]  000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > [  774.676008]  00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > [  774.676008] Call Trace:
> > > [  774.676008]  [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > [  774.676008]  [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > [  774.676008]  [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > [  774.676008]  [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > [  774.676008]  [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > [  774.676008]  [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > [  774.676008]  [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > [  774.676008]  [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > [  774.676008]  [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > [  774.676008]  [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> > 
> > FWIW, something like the below (prone to explode since its utterly
> > untested) should (mostly) fix that one case. Something similar needs to
> > be done for pretty much all machine wide workqueue thingies, possibly
> > also flush_workqueue().
> 
> Can you please explain reproduce way and problem detail?
> 
> AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> it doesn't cause any deadlock.

Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
cpu0 does mlock()->lru_add_drain_all(), which does
schedule_on_each_cpu(), which then waits for all cpus to complete the
work. Except that cpu1, which is busy with the RT task, will never run
keventd until the RT load goes away.

This is not so much an actual deadlock as a serious starvation case.
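
For illustration, here is a minimal userspace sketch of that scenario (my
own untested illustration, not code from the actual report). It assumes at
least two CPUs and enough privilege for SCHED_FIFO and mlock(); with the
default RT throttling the stall is bounded, with sched_rt_runtime_us set
to -1 it lasts as long as the hog keeps running:

/* starve-mlock.c: pin a SCHED_FIFO-99 hog on cpu1, then mlock() from cpu0 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set))
        perror("sched_setaffinity");
}

int main(void)
{
    struct sched_param sp = { .sched_priority = 99 };
    size_t len = 1 << 20;
    char *buf;
    pid_t pid;

    pid = fork();
    if (pid == 0) {
        /* cpu1: RT hog that never yields, so keventd on cpu1 never runs */
        pin_to_cpu(1);
        if (sched_setscheduler(0, SCHED_FIFO, &sp))
            perror("sched_setscheduler");
        for (;;)
            ;
    }

    /* cpu0: fault in some anonymous pages (fills the per-cpu lru pagevec),
     * then trigger mlock() -> lru_add_drain_all() -> schedule_on_each_cpu() */
    pin_to_cpu(0);
    sleep(1);
    buf = malloc(len);
    if (!buf)
        return 1;
    memset(buf, 0, len);
    printf("calling mlock()...\n");
    if (mlock(buf, len))
        perror("mlock");
    printf("mlock() returned\n");   /* delayed while the hog owns cpu1 */

    kill(pid, SIGKILL);
    free(buf);
    return 0;
}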


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-08  7:08             ` Anirban Sinha
@ 2009-09-08  8:42               ` Peter Zijlstra
  2009-09-08 14:41                 ` Anirban Sinha
  0 siblings, 1 reply; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-08  8:42 UTC (permalink / raw)
  To: Anirban Sinha
  Cc: Anirban Sinha, Mike Galbraith, linux-kernel, Ingo Molnar, Dario Faggioli

On Tue, 2009-09-08 at 00:08 -0700, Anirban Sinha wrote:

> > Actually there is, use cpusets to carve the system into partitions.
> 
> hmm. ok. I looked at the code a little bit. It seems to me that the  
> 'borrowing' of RT runtimes occurs only from rt runqueues belonging to  
> the same root domain. And partition_sched_domains() is the only  
> external interface that can be used to create root domain out of a CPU  
> set. But then I think it needs to have CGROUPS/USER groups enabled?  
> Right?

No, you need cpusets; you create a partition by disabling load-balancing
on the top set, thereby only allowing load-balancing within the
children.

The runtime sharing is a form of load-balancing.

CONFIG_CPUSETS=y

Documentation/cgroups/cpusets.txt
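
For what it's worth, a rough sketch of such a setup (my own untested
example). It assumes the cpuset hierarchy is mounted at /dev/cpuset with
the legacy unprefixed file names (with a cgroup-style mount they are
cpuset.cpus and so on), a single memory node, and that CPU 3 should become
its own partition:

/* carve-partition.c: split CPU 3 into its own sched/root domain */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0) {
        perror(path);
        return -1;
    }
    if (write(fd, val, strlen(val)) < 0) {
        perror(path);
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(void)
{
    /* children first: each child needs its own cpus and mems */
    mkdir("/dev/cpuset/general", 0755);
    write_str("/dev/cpuset/general/cpus", "0-2");
    write_str("/dev/cpuset/general/mems", "0");

    mkdir("/dev/cpuset/rt", 0755);
    write_str("/dev/cpuset/rt/cpus", "3");
    write_str("/dev/cpuset/rt/mems", "0");

    /*
     * Turning off load balancing in the top cpuset splits the sched
     * domains (and hence the root domains) along the children, so RT
     * runtime is no longer shared between /general and /rt.
     */
    return write_str("/dev/cpuset/sched_load_balance", "0");
}

Tasks still have to be moved into the child cpusets (via their tasks
files) for the partitioning to take effect.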


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-08  7:10               ` Anirban Sinha
@ 2009-09-08  9:26                 ` Mike Galbraith
  0 siblings, 0 replies; 84+ messages in thread
From: Mike Galbraith @ 2009-09-08  9:26 UTC (permalink / raw)
  To: Anirban Sinha
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Dario Faggioli, Anirban Sinha

On Tue, 2009-09-08 at 00:10 -0700, Anirban Sinha wrote:

> > I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
> > measurement proglet on an isolated Q6600 core.
> 
> 
> Thanks Mike. Is this on a single core machine (or one core carved out  
> of N)?  We may have some newer patches missing from the 2.6.26 kernel  
> that fixes some accounting bugs. I will do a review and rerun the test  
> after applying the upstream patches.

Q6600 is a quad; the test was 1 core carved out of 4 (thought I said that).

	-Mike


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08  8:20                       ` Peter Zijlstra
@ 2009-09-08 10:06                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 84+ messages in thread
From: KOSAKI Motohiro @ 2009-09-08 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Christoph Lameter, Oleg Nesterov, lkml

> On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> > Hi Peter,
> > 
> > > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> > > 
> > > > [  774.651779] SysRq : Show Blocked State
> > > > [  774.655770]   task                        PC stack   pid father
> > > > [  774.655770] evolution.bin D ffff8800bc1575f0     0  7349   6459 0x00000000
> > > > [  774.676008]  ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > > [  774.676008]  000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > > [  774.676008]  00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > > [  774.676008] Call Trace:
> > > > [  774.676008]  [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > > [  774.676008]  [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > > [  774.676008]  [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > [  774.676008]  [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > > [  774.676008]  [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > > [  774.676008]  [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > > [  774.676008]  [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > > [  774.676008]  [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > > [  774.676008]  [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > > [  774.676008]  [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> > > 
> > > FWIW, something like the below (prone to explode since its utterly
> > > untested) should (mostly) fix that one case. Something similar needs to
> > > be done for pretty much all machine wide workqueue thingies, possibly
> > > also flush_workqueue().
> > 
> > Can you please explain reproduce way and problem detail?
> > 
> > AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> > it doesn't cause any deadlock.
> 
> Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> cpu0 does mlock()->lru_add_drain_all(), which does
> schedule_on_each_cpu(), which then waits for all cpus to complete the
> work. Except that cpu1, which is busy with the RT task, will never run
> keventd until the RT load goes away.
> 
> This is not so much an actual deadlock as a serious starvation case.

This seems to be a flush_work vs RT-thread problem, not only an
lru_add_drain_all() one. Why don't other workqueue flushers hit this issue?





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 10:06                         ` KOSAKI Motohiro
@ 2009-09-08 10:20                           ` Peter Zijlstra
  -1 siblings, 0 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-08 10:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mike Galbraith, Ingo Molnar, linux-mm, Christoph Lameter,
	Oleg Nesterov, lkml

On Tue, 2009-09-08 at 19:06 +0900, KOSAKI Motohiro wrote:
> > On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> > > Hi Peter,
> > > 
> > > > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> > > > 
> > > > > [  774.651779] SysRq : Show Blocked State
> > > > > [  774.655770]   task                        PC stack   pid father
> > > > > [  774.655770] evolution.bin D ffff8800bc1575f0     0  7349   6459 0x00000000
> > > > > [  774.676008]  ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > > > [  774.676008]  000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > > > [  774.676008]  00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > > > [  774.676008] Call Trace:
> > > > > [  774.676008]  [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > > > [  774.676008]  [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > > > [  774.676008]  [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > [  774.676008]  [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > > > [  774.676008]  [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > > > [  774.676008]  [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > > > [  774.676008]  [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > > > [  774.676008]  [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > > > [  774.676008]  [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > > > [  774.676008]  [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> > > > 
> > > > FWIW, something like the below (prone to explode since its utterly
> > > > untested) should (mostly) fix that one case. Something similar needs to
> > > > be done for pretty much all machine wide workqueue thingies, possibly
> > > > also flush_workqueue().
> > > 
> > > Can you please explain reproduce way and problem detail?
> > > 
> > > AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> > > it doesn't cause any deadlock.
> > 
> > Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> > cpu0 does mlock()->lru_add_drain_all(), which does
> > schedule_on_each_cpu(), which then waits for all cpus to complete the
> > work. Except that cpu1, which is busy with the RT task, will never run
> > keventd until the RT load goes away.
> > 
> > This is not so much an actual deadlock as a serious starvation case.
> 
> This seems flush_work vs RT-thread problem, not only lru_add_drain_all().
> Why other workqueue flusher doesn't affect this issue?

flush_work() will only flush workqueues on which work has been enqueued,
as Oleg pointed out.

The problem is with lru_add_drain_all() enqueueing work on every cpu's
workqueue.

There is nothing that makes lru_add_drain_all() the only such site; it's
the one Mike posted to me, and my patch was a way to deal with that.

I also explained that it's not only RT-related, in that the HPC folks also
want to avoid unneeded work -- for them it's not starvation but a
performance issue.

In general we should avoid doing work when there is no work to be done.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 10:20                           ` Peter Zijlstra
@ 2009-09-08 11:41                             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 84+ messages in thread
From: KOSAKI Motohiro @ 2009-09-08 11:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Christoph Lameter, Oleg Nesterov, lkml

> On Tue, 2009-09-08 at 19:06 +0900, KOSAKI Motohiro wrote:
> > > On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> > > > Hi Peter,
> > > > 
> > > > > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> > > > > 
> > > > > > [  774.651779] SysRq : Show Blocked State
> > > > > > [  774.655770]   task                        PC stack   pid father
> > > > > > [  774.655770] evolution.bin D ffff8800bc1575f0     0  7349   6459 0x00000000
> > > > > > [  774.676008]  ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > > > > [  774.676008]  000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > > > > [  774.676008]  00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > > > > [  774.676008] Call Trace:
> > > > > > [  774.676008]  [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > > > > [  774.676008]  [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > > > > [  774.676008]  [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > > > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > > [  774.676008]  [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > > [  774.676008]  [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > > > > [  774.676008]  [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > > > > [  774.676008]  [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > > > > [  774.676008]  [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > > > > [  774.676008]  [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > > > > [  774.676008]  [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > > > > [  774.676008]  [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> > > > > 
> > > > > FWIW, something like the below (prone to explode since its utterly
> > > > > untested) should (mostly) fix that one case. Something similar needs to
> > > > > be done for pretty much all machine wide workqueue thingies, possibly
> > > > > also flush_workqueue().
> > > > 
> > > > Can you please explain reproduce way and problem detail?
> > > > 
> > > > AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> > > > it doesn't cause any deadlock.
> > > 
> > > Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> > > cpu0 does mlock()->lru_add_drain_all(), which does
> > > schedule_on_each_cpu(), which then waits for all cpus to complete the
> > > work. Except that cpu1, which is busy with the RT task, will never run
> > > keventd until the RT load goes away.
> > > 
> > > This is not so much an actual deadlock as a serious starvation case.
> > 
> > This seems flush_work vs RT-thread problem, not only lru_add_drain_all().
> > Why other workqueue flusher doesn't affect this issue?
> 
> flush_work() will only flush workqueues on which work has been enqueued
> as Oleg pointed out.
> 
> The problem is with lru_add_drain_all() enqueueing work on all
> workqueues.

Thank you for the kind explanation. I am gradually coming to understand this issue.
Yes, lru_add_drain_all() uses schedule_on_each_cpu(), which has the following code:

        for_each_online_cpu(cpu)
                flush_work(per_cpu_ptr(works, cpu));

However, I don't think your approach solves this issue.
lru_add_drain_all() flushes lru_add_pvecs and lru_rotate_pvecs.

lru_add_pvecs gains pages on
  - lru moves,
      e.g. read(2), write(2), page faults, vmscan, page migration, et al.

lru_rotate_pvecs gains pages on
  - page writeback

IOW, if an RT thread calls write(2) or takes a page fault, we face the same
problem. I don't think we can assume an RT thread never takes a page fault....

Hmm, this seems like a difficult problem. I guess it touches any mm code that
uses schedule_on_each_cpu(). I will continue to think about this issue for a
while.


> There is nothing that makes lru_add_drain_all() the only such site, its
> the one Mike posted to me, and my patch was a way to deal with that.

Well, schedule_on_each_cpu() is used in only a few places.
Practically, we can ignore the other callers.


> I also explained that its not only RT related in that the HPC folks also
> want to avoid unneeded work -- for them its not starvation but a
> performance issue.

I think you are talking about the OS jitter issue. If so, I don't think this
issue causes a serious problem.  OS jitter is mainly caused by periodic
activity (e.g. tick updates, timers, vmstat updates), because
	small-delay x many-times = large-delay

lru_add_drain_all() is called from very few places, e.g. mlock, shm-lock,
page migration and memory hotplug. None of these callers is periodic.


> In generic we should avoid doing work when there is no work to be done.

Probably, but I'm not sure ;)




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 11:41                             ` KOSAKI Motohiro
@ 2009-09-08 12:05                               ` Peter Zijlstra
  -1 siblings, 0 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-08 12:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mike Galbraith, Ingo Molnar, linux-mm, Christoph Lameter,
	Oleg Nesterov, lkml

On Tue, 2009-09-08 at 20:41 +0900, KOSAKI Motohiro wrote:

> Thank you for kindly explanation. I gradually become to understand this isssue.
> Yes, lru_add_drain_all() use schedule_on_each_cpu() and it have following code
> 
>         for_each_online_cpu(cpu)
>                 flush_work(per_cpu_ptr(works, cpu));
> 
> However, I don't think your approach solve this issue.
> lru_add_drain_all() flush lru_add_pvecs and lru_rotate_pvecs.
> 
> lru_add_pvecs is accounted when
>   - lru move
>       e.g. read(2), write(2), page fault, vmscan, page migration, et al
> 
> lru_rotate_pves is accounted when
>   - page writeback
> 
> IOW, if RT-thread call write(2) syscall or page fault, we face the same
> problem. I don't think we can assume RT-thread don't make page fault....
> 
> hmm, this seems difficult problem. I guess any mm code should use
> schedule_on_each_cpu(). I continue to think this issue awhile.

This is about avoiding work when there is none; clearly, when an
application does use the kernel it creates work.

But a purely userspace, cpu-bound process, a while(1) loop, should not get
interrupted by things like lru_add_drain() when it doesn't have any
pages to drain.

> > There is nothing that makes lru_add_drain_all() the only such site, its
> > the one Mike posted to me, and my patch was a way to deal with that.
> 
> Well, schedule_on_each_cpu() is very limited used function.
> Practically we can ignore other caller.

No, we need to inspect all callers; having only a few makes that easier.

> > I also explained that its not only RT related in that the HPC folks also
> > want to avoid unneeded work -- for them its not starvation but a
> > performance issue.
> 
> I think you talked about OS jitter issue. if so, I don't think this issue
> make serious problem.  OS jitter mainly be caused by periodic action
>  (e.g. tick update, timer, vmstat update). it's because
> 	little-delay x plenty-times = large-delay
> 
> lru_add_drain_all() is called from very limited point. e.g. mlock, shm-lock,
> page-migration, memory-hotplug. all caller is not periodic.

It doesn't matter; if you want to reduce it, you need to address all of
them. A process 4 nodes away calling mlock() simply needn't wake this
partition when it has been user-bound for the last hour or so and doesn't
have any lru pages.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 12:05                               ` Peter Zijlstra
@ 2009-09-08 14:03                                 ` Christoph Lameter
  -1 siblings, 0 replies; 84+ messages in thread
From: Christoph Lameter @ 2009-09-08 14:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Oleg Nesterov, lkml

On Tue, 8 Sep 2009, Peter Zijlstra wrote:

> This is about avoiding work when there is non, clearly when an
> application does use the kernel it creates work.

Hmmm. The lru draining in page migration is there to reduce the number of
pages that are not on the lru, to increase the chance that page migration
succeeds. A page still sitting on a per-cpu list cannot be isolated.

Reducing the number of cpus where we perform the drain increases the
likelihood that we cannot migrate a page because it is on the per-cpu
lists of a cpu not covered.

On the other hand, if a cpu is offline then we know that it has no per-cpu
pages. That is why I found the idea of the OFFLINE scheduler attractive.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 14:03                                 ` Christoph Lameter
@ 2009-09-08 14:20                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-08 14:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Oleg Nesterov, lkml

On Tue, 2009-09-08 at 10:03 -0400, Christoph Lameter wrote:
> On Tue, 8 Sep 2009, Peter Zijlstra wrote:
> 
> > This is about avoiding work when there is non, clearly when an
> > application does use the kernel it creates work.
> 
> Hmmm. The lru draining in page migration is to reduce the number of pages
> that are not on the lru to increase the chance of page migration to be
> successful. A page on a per cpu list cannot be drained.
> 
> Reducing the number of cpus where we perform the drain results in
> increased likelyhood that we cannot migrate a page because its on the per
> cpu lists of a cpu not covered.

Did you even read the patch?

There is _no_ functional difference between before and after, except
fewer wakeups on cpus that don't have any __lru_cache_add activity.

If there are pages on a cpu's lru_add_pvecs list, that cpu will be present
in the mask and will be sent a drain request. If there aren't, it won't
be sent one.
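
To make the mechanism concrete, this is roughly the shape of the idea (a
sketch in the style of mm/swap.c that I am spelling out for illustration,
not the actual patch; it ignores locking, mask allocation and CPU hotplug,
and lru_drain_mask and schedule_on_cpu_mask() are made-up names):

/* sketch only, not the actual patch */
static cpumask_var_t lru_drain_mask;    /* hypothetical, allocation at init omitted */

void __lru_cache_add(struct page *page, enum lru_list lru)
{
    struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];

    /* remember that this cpu now has pages sitting in a pagevec */
    cpumask_set_cpu(smp_processor_id(), lru_drain_mask);

    page_cache_get(page);
    if (!pagevec_add(pvec, page))
        ____pagevec_lru_add(pvec, lru);
    put_cpu_var(lru_add_pvecs);
}

int lru_add_drain_all(void)
{
    /*
     * schedule_on_cpu_mask() is hypothetical: like
     * schedule_on_each_cpu(), but it only queues (and waits for) work
     * on the cpus set in the mask, clearing each bit as it goes.
     */
    return schedule_on_cpu_mask(lru_add_drain_per_cpu, lru_drain_mask);
}

A cpu that never did any __lru_cache_add simply never shows up in the
mask, so it is never woken.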




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-08  8:42               ` Peter Zijlstra
@ 2009-09-08 14:41                 ` Anirban Sinha
  0 siblings, 0 replies; 84+ messages in thread
From: Anirban Sinha @ 2009-09-08 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anirban Sinha, Mike Galbraith, linux-kernel, Ingo Molnar, Dario Faggioli


On 2009-09-08, at 1:42 AM, Peter Zijlstra wrote:

> On Tue, 2009-09-08 at 00:08 -0700, Anirban Sinha wrote:
>
>>> Actually there is, use cpusets to carve the system into partitions.
>>
>> hmm. ok. I looked at the code a little bit. It seems to me that the
>> 'borrowing' of RT runtimes occurs only from rt runqueues belonging to
>> the same root domain. And partition_sched_domains() is the only
>> external interface that can be used to create root domain out of a  
>> CPU
>> set. But then I think it needs to have CGROUPS/USER groups enabled?
>> Right?
>
> No you need cpusets, you create a partition by disabling load- 
> balancing
> on the top set, thereby only allowing load-balancing withing the
> children.
>

Ah I see. Thanks for the clarification.


> The runtime sharing is a form of load-balancing.

sure.


>
> CONFIG_CPUSETS=y


Hmm, OK. I guess what I meant but did not articulate properly (because
I was thinking in terms of code) was that CPUSETS needs CGROUPS support:

config CPUSETS
	bool "Cpuset support"
	depends on CGROUPS


Anyway, that's fine. I'll dig around the code a little bit more.


>
> Documentation/cgroups/cpusets.txt

Thanks for the pointer. My bad, I did not bother to look at the docs. I
tend to ignore docs and read code instead. :D


>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 14:20                                   ` Peter Zijlstra
@ 2009-09-08 15:22                                     ` Christoph Lameter
  -1 siblings, 0 replies; 84+ messages in thread
From: Christoph Lameter @ 2009-09-08 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Oleg Nesterov, lkml

On Tue, 8 Sep 2009, Peter Zijlstra wrote:

> There is _no_ functional difference between before and after, except
> less wakeups on cpus that don't have any __lru_cache_add activity.
>
> If there's pages on the per cpu lru_add_pvecs list it will be present in
> the mask and will be send a drain request. If its not, then it won't be
> send.

Ok I see.

A global cpu mask like this will cause cacheline bouncing. After all, this
is a hot cpu path. Maybe do not set the bit if it is already set
(which may be very frequent)? Then add some benchmarks to show that it
does not cause a regression on a 16p box (Nehalem) or so?
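
I.e. something like this on the add path (untested sketch;
mark_cpu_for_drain() and the global mask are hypothetical names):

static inline void mark_cpu_for_drain(int cpu, struct cpumask *mask)
{
    /*
     * Read before write: do not dirty the shared cacheline when the
     * bit is already set, which should be the common case.
     */
    if (!cpumask_test_cpu(cpu, mask))
        cpumask_set_cpu(cpu, mask);
}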







^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 15:22                                     ` Christoph Lameter
@ 2009-09-08 15:27                                       ` Peter Zijlstra
  -1 siblings, 0 replies; 84+ messages in thread
From: Peter Zijlstra @ 2009-09-08 15:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Oleg Nesterov, lkml

On Tue, 2009-09-08 at 11:22 -0400, Christoph Lameter wrote:
> On Tue, 8 Sep 2009, Peter Zijlstra wrote:
> 
> > There is _no_ functional difference between before and after, except
> > less wakeups on cpus that don't have any __lru_cache_add activity.
> >
> > If there's pages on the per cpu lru_add_pvecs list it will be present in
> > the mask and will be send a drain request. If its not, then it won't be
> > send.
> 
> Ok I see.
> 
> A global cpu mask like this will cause cacheline bouncing. After all this
> is a hot cpu path. Maybe do not set the bit if its already set
> (which may be very frequent)? Then add some benchmarks to show that it
> does not cause a regression on a 16p box (Nehalem) or so?

Yeah, testing the bit before poking at it sounds like a good plan.

Unless someone feels inclined to finish this and audit the kernel for
more such places, I'll stick it on the ever-growing todo pile.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 15:22                                     ` Christoph Lameter
@ 2009-09-08 15:32                                       ` Christoph Lameter
  -1 siblings, 0 replies; 84+ messages in thread
From: Christoph Lameter @ 2009-09-08 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Oleg Nesterov, lkml

The usefulness of a scheme like this requires:

1. There are cpus that continually execute user space code
   without system interaction.

2. There are repeated VM activities that require page isolation /
   migration.

The first page isolation activity will then clear the lru caches of the
processes doing number crunching in user space (and therefore the first
isolation will still interrupt). The second and following isolation will
then no longer interrupt the processes.

2. is rare. So the question is if the additional code in the LRU handling
can be justified. If lru handling is not time sensitive then yes.





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]         ` <DDFD17CC94A9BD49A82147DDF7D545C54DC48B@exchange.ZeugmaSystems.local>
@ 2009-09-08 17:41           ` Anirban Sinha
  2009-09-08 19:06             ` Mike Galbraith
  0 siblings, 1 reply; 84+ messages in thread
From: Anirban Sinha @ 2009-09-08 17:41 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel, Peter Zijlstra, Mike Galbraith,
	Dario Faggioli
  Cc: Anirban Sinha, Anirban Sinha


On 2009-09-08, at 10:32 AM, Anirban Sinha wrote:

>
>
>
> -----Original Message-----
> From: Mike Galbraith [mailto:efault@gmx.de]
> Sent: Sat 9/5/2009 11:32 PM
> To: Anirban Sinha
> Cc: Lucas De Marchi; linux-kernel@vger.kernel.org; Peter Zijlstra;  
> Ingo Molnar
> Subject: Re: question on sched-rt group allocation cap:  
> sched_rt_runtime_us
>
> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com>  
> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core  
> machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> > > 95%                              4.48%
> > > 60%                              54.84%
> > > 50%                              86.03%
> > > 40%                              OTHER completed first
> > >
> >
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn  
> this
> > 'RT bandwidth accumulation' heuristic off.
>
> No there isn't, but maybe there should be, since this isn't the first
> time it's come up.  One pro argument is that pinned tasks are  
> thoroughly
> screwed when an RT hog lands on their runqueue.  On the con side, the
> whole RT bandwidth restriction thing is intended (AFAIK) to allow an
> admin to regain control should RT app go insane, which the default 5%
> aggregate accomplishes just fine.
>
> Dunno.  Fly or die little patchlet (toss).

So it would be nice to have a knob like this when CGROUPS is disabled
(it says 'say N when unsure' :)). CPUSETS depends on CGROUPS.


>
> sched: allow the user to disable RT bandwidth aggregation.
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>


Verified-by: Anirban Sinha <asinha@zeugmasystems.com>


> LKML-Reference: <new-submission>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8736ba1..6e6d4c7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1881,6 +1881,7 @@ static inline unsigned int  
> get_sysctl_timer_migration(void)
>  #endif
>  extern unsigned int sysctl_sched_rt_period;
>  extern int sysctl_sched_rt_runtime;
> +extern int sysctl_sched_rt_bandwidth_aggregate;
>
>  int sched_rt_handler(struct ctl_table *table, int write,
>                 struct file *filp, void __user *buffer, size_t *lenp,
> diff --git a/kernel/sched.c b/kernel/sched.c
> index c512a02..ca6a378 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -864,6 +864,12 @@ static __read_mostly int scheduler_running;
>   */
>  int sysctl_sched_rt_runtime = 950000;
>
> +/*
> + * aggregate bandwidth, ie allow borrowing from neighbors when
> + * bandwidth for an individual runqueue is exhausted.
> + */
> +int sysctl_sched_rt_bandwidth_aggregate = 1;
> +
>  static inline u64 global_rt_period(void)
>  {
>         return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 2eb4bd6..75daf88 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -495,6 +495,9 @@ static int balance_runtime(struct rt_rq *rt_rq)
>  {
>         int more = 0;
>
> +       if (!sysctl_sched_rt_bandwidth_aggregate)
> +               return 0;
> +
>         if (rt_rq->rt_time > rt_rq->rt_runtime) {
>                 spin_unlock(&rt_rq->rt_runtime_lock);
>                 more = do_balance_runtime(rt_rq);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index cdbe8d0..0ad08e5 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -368,6 +368,14 @@ static struct ctl_table kern_table[] = {
>         },
>         {
>                 .ctl_name       = CTL_UNNUMBERED,
> +               .procname       = "sched_rt_bandwidth_aggregate",
> +               .data           =  
> &sysctl_sched_rt_bandwidth_aggregate,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = &sched_rt_handler,
> +       },
> +       {
> +               .ctl_name       = CTL_UNNUMBERED,
>                 .procname       = "sched_compat_yield",
>                 .data           = &sysctl_sched_compat_yield,
>                 .maxlen         = sizeof(unsigned int),
>
>
>
>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-08 17:41           ` Anirban Sinha
@ 2009-09-08 19:06             ` Mike Galbraith
  2009-09-08 19:34               ` Anirban Sinha
  0 siblings, 1 reply; 84+ messages in thread
From: Mike Galbraith @ 2009-09-08 19:06 UTC (permalink / raw)
  To: Anirban Sinha
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dario Faggioli, Anirban Sinha

On Tue, 2009-09-08 at 10:41 -0700, Anirban Sinha wrote:
> On 2009-09-08, at 10:32 AM, Anirban Sinha wrote:
> 

> > Dunno.  Fly or die little patchlet (toss).
> 
> So it would be nice to have a knob like this when CGROUPS is disabled  
> (it say 'say N  when unsure' :)). CPUSETS depends on CGROUPS.

Maybe.  Short-term hack.  My current thoughts on the subject, after some
testing, are that the patchlet should just die, and pondering the larger
solution should happen.

	-Mike


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-08 19:06             ` Mike Galbraith
@ 2009-09-08 19:34               ` Anirban Sinha
  2009-09-09  4:10                 ` Mike Galbraith
  0 siblings, 1 reply; 84+ messages in thread
From: Anirban Sinha @ 2009-09-08 19:34 UTC (permalink / raw)
  To: Mike Galbraith, Anirban Sinha
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Dario Faggioli


>Maybe.  Short term hack.  My current thoughts on the subject, after
some
>testing, is that the patchlet should just die, and pondering the larger
>solution should happen.

Just curious, what is the larger solution? When everyone adopts
control groups?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 12:05                               ` Peter Zijlstra
@ 2009-09-09  2:06                                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 84+ messages in thread
From: KOSAKI Motohiro @ 2009-09-09  2:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Mike Galbraith, Ingo Molnar, linux-mm,
	Christoph Lameter, Oleg Nesterov, lkml

Hi

> > Thank you for kindly explanation. I gradually become to understand this isssue.
> > Yes, lru_add_drain_all() use schedule_on_each_cpu() and it have following code
> > 
> >         for_each_online_cpu(cpu)
> >                 flush_work(per_cpu_ptr(works, cpu));
> > 
> > However, I don't think your approach solve this issue.
> > lru_add_drain_all() flush lru_add_pvecs and lru_rotate_pvecs.
> > 
> > lru_add_pvecs is accounted when
> >   - lru move
> >       e.g. read(2), write(2), page fault, vmscan, page migration, et al
> > 
> > lru_rotate_pves is accounted when
> >   - page writeback
> > 
> > IOW, if RT-thread call write(2) syscall or page fault, we face the same
> > problem. I don't think we can assume RT-thread don't make page fault....
> > 
> > hmm, this seems difficult problem. I guess any mm code should use
> > schedule_on_each_cpu(). I continue to think this issue awhile.
> 
> This is about avoiding work when there is non, clearly when an
> application does use the kernel it creates work.
> 
> But a clearly userspace, cpu-bound process, while(1), should not get
> interrupted by things like lru_add_drain() when it doesn't have any
> pages to drain.

Yup, makes sense.
So, I think you mean you'd like to tackle this special case as a first step, right?
If yes, I agree.
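
A rough sketch of what that special case could look like, assuming
illustrative names (pvec_add, pvec_rotate and cpu_has_lru_work() are
stand-ins, not the real mm/swap.c symbols): only cpus whose per-cpu
pagevecs actually hold pages end up in the mask that is sent drain work.

#include <linux/cpumask.h>
#include <linux/pagevec.h>
#include <linux/percpu.h>

/* illustrative stand-ins for the per-cpu lru_add_pvecs / lru_rotate_pvecs */
static DEFINE_PER_CPU(struct pagevec, pvec_add);
static DEFINE_PER_CPU(struct pagevec, pvec_rotate);

static bool cpu_has_lru_work(int cpu)
{
	return pagevec_count(&per_cpu(pvec_add, cpu)) ||
	       pagevec_count(&per_cpu(pvec_rotate, cpu));
}

/* build the mask of cpus that actually need to be sent drain work */
static void fill_lru_drain_mask(struct cpumask *mask)
{
	int cpu;

	cpumask_clear(mask);
	for_each_online_cpu(cpu)
		if (cpu_has_lru_work(cpu))
			cpumask_set_cpu(cpu, mask);
}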


> > > There is nothing that makes lru_add_drain_all() the only such site, its
> > > the one Mike posted to me, and my patch was a way to deal with that.
> > 
> > Well, schedule_on_each_cpu() is very limited used function.
> > Practically we can ignore other caller.
> 
> No, we need to inspect all callers, having only a few makes that easier.

Sorry for my poor English. I meant that I don't oppose your patch approach. I don't oppose
additional work at all.


> 
> > > I also explained that its not only RT related in that the HPC folks also
> > > want to avoid unneeded work -- for them its not starvation but a
> > > performance issue.
> > 
> > I think you talked about OS jitter issue. if so, I don't think this issue
> > make serious problem.  OS jitter mainly be caused by periodic action
> >  (e.g. tick update, timer, vmstat update). it's because
> > 	little-delay x plenty-times = large-delay
> > 
> > lru_add_drain_all() is called from very limited point. e.g. mlock, shm-lock,
> > page-migration, memory-hotplug. all caller is not periodic.
> 
> Doesn't matter, if you want to reduce it, you need to address all of
> them, a process 4 nodes away calling mlock() while this partition has
> been user-bound for the last hour or so and doesn't have any lru pages
> simply needn't be woken.

Doesn't matter? You mean we can stop discussing the HPC performance issue
that Christoph pointed out?
Hmmm, sorry, I haven't caught your point.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-08 19:34               ` Anirban Sinha
@ 2009-09-09  4:10                 ` Mike Galbraith
  0 siblings, 0 replies; 84+ messages in thread
From: Mike Galbraith @ 2009-09-09  4:10 UTC (permalink / raw)
  To: Anirban Sinha
  Cc: Anirban Sinha, Ingo Molnar, linux-kernel, Peter Zijlstra, Dario Faggioli

On Tue, 2009-09-08 at 12:34 -0700, Anirban Sinha wrote:
> >Maybe.  Short term hack.  My current thoughts on the subject, after
> some
> >testing, is that the patchlet should just die, and pondering the larger
> >solution should happen.
> 
> Just curious, what is the larger solution?

That's what needs pondering :)

	-Mike


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-08 15:32                                       ` Christoph Lameter
@ 2009-09-09  4:27                                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 84+ messages in thread
From: KOSAKI Motohiro @ 2009-09-09  4:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	linux-mm, Oleg Nesterov, lkml

> The usefulness of a scheme like this requires:
> 
> 1. There are cpus that continually execute user space code
>    without system interaction.
> 
> 2. There are repeated VM activities that require page isolation /
>    migration.
> 
> The first page isolation activity will then clear the lru caches of the
> processes doing number crunching in user space (and therefore the first
> isolation will still interrupt). The second and following isolation will
> then no longer interrupt the processes.
> 
> 2. is rare. So the question is if the additional code in the LRU handling
> can be justified. If lru handling is not time sensitive then yes.

Christoph, I'd like to discuss a somewhat related (and almost unrelated) thing.
I think page migration doesn't need lru_add_drain_all() to be synchronous, because
page migration retries up to 10 times.

Then an asynchronous lru_add_drain_all() causes:

  - if the system isn't under heavy pressure, the retry succeeds.
  - if the system is under heavy pressure or an RT thread is running a busy loop, the retry fails.

I don't think this is problematic behavior. Also, mlock could use an asynchronous lru drain.

What do you think?
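
A minimal sketch of such an asynchronous variant (lru_add_drain_all_async()
is an assumed name, not existing kernel API): the per-cpu work is queued but
never flushed, so callers rely on their own retry loop instead of waiting.

#include <linux/cpu.h>
#include <linux/percpu.h>
#include <linux/swap.h>
#include <linux/workqueue.h>

static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
	lru_add_drain();		/* drain this cpu's pagevecs */
}

static DEFINE_PER_CPU(struct work_struct, lru_drain_work);

/* queue a drain on every online cpu, but do not wait for completion */
void lru_add_drain_all_async(void)
{
	int cpu;

	get_online_cpus();
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_drain_work, cpu);

		/*
		 * Re-initializing a possibly still-pending work item is
		 * glossed over here; a real version would have to avoid it.
		 */
		INIT_WORK(work, lru_add_drain_per_cpu);
		schedule_work_on(cpu, work);
	}
	put_online_cpus();
	/* no flush_work(): callers fall back on their own retries */
}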



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-09  4:27                                         ` KOSAKI Motohiro
@ 2009-09-09 14:08                                           ` Christoph Lameter
  -1 siblings, 0 replies; 84+ messages in thread
From: Christoph Lameter @ 2009-09-09 14:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, linux-mm,
	Oleg Nesterov, lkml

On Wed, 9 Sep 2009, KOSAKI Motohiro wrote:

> Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> I think page migration don't need lru_add_drain_all() as synchronous, because
> page migration have 10 times retry.

True, this is only an optimization that increases the chance of isolation
being successful. You don't need draining at all.

> Then asynchronous lru_add_drain_all() cause
>
>   - if system isn't under heavy pressure, retry succussfull.
>   - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
>
> I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
>
> What do you think?

The retries can be very fast if the migrate pages list is small. The
migrate attempts may be finished before the IPI can be processed by the
other cpus.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-09  4:27                                         ` KOSAKI Motohiro
@ 2009-09-09 15:39                                           ` Minchan Kim
  -1 siblings, 0 replies; 84+ messages in thread
From: Minchan Kim @ 2009-09-09 15:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	linux-mm, Oleg Nesterov, lkml

On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> The usefulness of a scheme like this requires:
>>
>> 1. There are cpus that continually execute user space code
>>    without system interaction.
>>
>> 2. There are repeated VM activities that require page isolation /
>>    migration.
>>
>> The first page isolation activity will then clear the lru caches of the
>> processes doing number crunching in user space (and therefore the first
>> isolation will still interrupt). The second and following isolation will
>> then no longer interrupt the processes.
>>
>> 2. is rare. So the question is if the additional code in the LRU handling
>> can be justified. If lru handling is not time sensitive then yes.
>
> Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> I think page migration don't need lru_add_drain_all() as synchronous, because
> page migration have 10 times retry.
>
> Then asynchronous lru_add_drain_all() cause
>
>  - if system isn't under heavy pressure, retry succussfull.
>  - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
>
> I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.

I think, more exactly, we don't have to drain lru pages for mlocking.
Mlocked pages will go into the unevictable lru via
try_to_unmap when lru shrinking happens.
How about removing the draining in the mlock case?
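
A simplified sketch of that lazy path, as I understand it (not the verbatim
mm/vmscan.c code; cull_if_mlocked() is a made-up helper, and the second
try_to_unmap() argument is the migration flag used by kernels of this era):

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/swap.h>

/* when reclaim later visits a page that turns out to be mlocked,
 * it is simply parked on the unevictable LRU instead of reclaimed */
static void cull_if_mlocked(struct page *page)
{
	if (try_to_unmap(page, 0) == SWAP_MLOCK)
		putback_lru_page(page);
}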

>
> What do you think?
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-09 15:39                                           ` Minchan Kim
@ 2009-09-09 16:18                                             ` Lee Schermerhorn
  -1 siblings, 0 replies; 84+ messages in thread
From: Lee Schermerhorn @ 2009-09-09 16:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Christoph Lameter, Peter Zijlstra,
	Mike Galbraith, Ingo Molnar, linux-mm, Oleg Nesterov, lkml

On Thu, 2009-09-10 at 00:39 +0900, Minchan Kim wrote:
> On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> The usefulness of a scheme like this requires:
> >>
> >> 1. There are cpus that continually execute user space code
> >>    without system interaction.
> >>
> >> 2. There are repeated VM activities that require page isolation /
> >>    migration.
> >>
> >> The first page isolation activity will then clear the lru caches of the
> >> processes doing number crunching in user space (and therefore the first
> >> isolation will still interrupt). The second and following isolation will
> >> then no longer interrupt the processes.
> >>
> >> 2. is rare. So the question is if the additional code in the LRU handling
> >> can be justified. If lru handling is not time sensitive then yes.
> >
> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > I think page migration don't need lru_add_drain_all() as synchronous, because
> > page migration have 10 times retry.
> >
> > Then asynchronous lru_add_drain_all() cause
> >
> >  - if system isn't under heavy pressure, retry succussfull.
> >  - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> >
> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> 
> I think, more exactly, we don't have to drain lru pages for mlocking.
> Mlocked pages will go into unevictable lru due to
> try_to_unmap when shrink of lru happens.
> How about removing draining in case of mlock?
> 
> >
> > What do you think?


Remember how the code works:  __mlock_vma_pages_range() loops calling
get_user_pages() to fault in batches of 16 pages and returns the page
pointers for mlocking.  Mlocking now requires isolation from the lru.
If you don't drain after each call to get_user_pages(), up to a
pagevec's worth of pages [~14] will likely still be in the pagevec and
won't be isolatable/mlockable.  We can end up with most of the pages
still on the normal lru lists.  If we want to move to an almost
exclusively lazy culling of mlocked pages to the unevictable list, then
we can remove the drain.  If we want to be more proactive in culling the
unevictable pages as we populate the vma, we'll want to keep the drain.

Lee
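
A rough sketch of the batching described above, under stated assumptions
(mlock_isolated_pages() is a hypothetical helper, and the get_user_pages()
call is the eight-argument form of kernels from this era; this is not the
verbatim mm/mlock.c code):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/swap.h>

static long mlock_fault_in_range(struct mm_struct *mm,
				 unsigned long addr, unsigned long end)
{
	struct page *pages[16];
	long nr = (end - addr) >> PAGE_SHIFT;

	while (nr > 0) {
		int batch = min_t(long, nr, 16);
		int got = get_user_pages(current, mm, addr, batch,
					 1 /* write */, 0 /* force */,
					 pages, NULL);
		if (got <= 0)
			break;

		/* flush this cpu's pagevec so the freshly faulted pages
		 * are on the LRU and can be isolated and mlocked */
		lru_add_drain();

		mlock_isolated_pages(pages, got);	/* hypothetical helper */

		addr += (unsigned long)got << PAGE_SHIFT;
		nr -= got;
	}
	return nr;	/* pages left unfaulted, 0 on full success */
}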


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-09 16:18                                             ` Lee Schermerhorn
@ 2009-09-09 16:46                                               ` Minchan Kim
  -1 siblings, 0 replies; 84+ messages in thread
From: Minchan Kim @ 2009-09-09 16:46 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: KOSAKI Motohiro, Christoph Lameter, Peter Zijlstra,
	Mike Galbraith, Ingo Molnar, linux-mm, Oleg Nesterov, lkml

Hi, Lee.
Long time no see. :)

On Thu, Sep 10, 2009 at 1:18 AM, Lee Schermerhorn
<Lee.Schermerhorn@hp.com> wrote:
> On Thu, 2009-09-10 at 00:39 +0900, Minchan Kim wrote:
>> On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >> The usefulness of a scheme like this requires:
>> >>
>> >> 1. There are cpus that continually execute user space code
>> >>    without system interaction.
>> >>
>> >> 2. There are repeated VM activities that require page isolation /
>> >>    migration.
>> >>
>> >> The first page isolation activity will then clear the lru caches of the
>> >> processes doing number crunching in user space (and therefore the first
>> >> isolation will still interrupt). The second and following isolation will
>> >> then no longer interrupt the processes.
>> >>
>> >> 2. is rare. So the question is if the additional code in the LRU handling
>> >> can be justified. If lru handling is not time sensitive then yes.
>> >
>> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
>> > I think page migration don't need lru_add_drain_all() as synchronous, because
>> > page migration have 10 times retry.
>> >
>> > Then asynchronous lru_add_drain_all() cause
>> >
>> >  - if system isn't under heavy pressure, retry succussfull.
>> >  - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
>> >
>> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
>>
>> I think, more exactly, we don't have to drain lru pages for mlocking.
>> Mlocked pages will go into unevictable lru due to
>> try_to_unmap when shrink of lru happens.
>> How about removing draining in case of mlock?
>>
>> >
>> > What do you think?
>
>
> Remember how the code works:  __mlock_vma_pages_range() loops calliing
> get_user_pages() to fault in batches of 16 pages and returns the page
> pointers for mlocking.  Mlocking now requires isolation from the lru.
> If you don't drain after each call to get_user_pages(), up to a
> pagevec's worth of pages [~14] will likely still be in the pagevec and
> won't be isolatable/mlockable().  We can end up with most of the pages

Sorry for the confusion.
I was talking about lru_add_drain_all, not lru_add_drain.
The problem now is schedule_on_each_cpu.

Anyway, in that case the pagevec's worth of pages will be
multiplied by the number of CPUs, as you pointed out.

> still on the normal lru lists.  If we want to move to an almost
> exclusively lazy culling of mlocked pages to the unevictable then we can
> remove the drain.  If we want to be more proactive in culling the
> unevictable pages as we populate the vma, we'll want to keep the drain.
>

It's not good if lazy culling of many pages causes high reclaim overhead.
But right now the lazy culling in reclaim happens only in shrink_page_list.
We could also do it in shrink_active_list's page_referenced so that we can spread
the cost of lazy culling.

> Lee
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-09 14:08                                           ` Christoph Lameter
@ 2009-09-09 23:43                                             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 84+ messages in thread
From: KOSAKI Motohiro @ 2009-09-09 23:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	linux-mm, Oleg Nesterov, lkml

> On Wed, 9 Sep 2009, KOSAKI Motohiro wrote:
> 
> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > I think page migration don't need lru_add_drain_all() as synchronous, because
> > page migration have 10 times retry.
> 
> True this is only an optimization that increases the chance of isolation
> being successful. You dont need draining at all.
> 
> > Then asynchronous lru_add_drain_all() cause
> >
> >   - if system isn't under heavy pressure, retry succussfull.
> >   - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> >
> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> >
> > What do you think?
> 
> The retries can be very fast if the migrate pages list is small. The
> migrate attempts may be finished before the IPI can be processed by the
> other cpus.

Ah, I see. Yes, my last proposal is not good; a small migration might fail.

How about this?
  - pass 1-2,  lru_add_drain_all_async()
  - pass 3-10, lru_add_drain_all()

This scheme might save the RT-thread case and never cause a regression (I think).

The last remaining problem is: if the pagevec of the cpu the RT thread is bound to
holds a migration-targeted page, migration still faces the same issue,
but we can't solve that...
The RT thread must use /proc/sys/vm/drop_caches properly.
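
A small sketch of how such a pass-dependent drain could sit in the migration
retry loop (lru_add_drain_all_async() is the assumed asynchronous variant
discussed above, and migrate_one_pass() is a hypothetical per-pass helper,
not the real migration API):

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/swap.h>

#define MIGRATE_PASSES	10

static int migrate_with_staged_drain(struct list_head *pages)
{
	int pass, ret = -EAGAIN;

	for (pass = 0; pass < MIGRATE_PASSES && ret; pass++) {
		if (pass < 2)
			lru_add_drain_all_async();	/* cheap: no waiting */
		else
			lru_add_drain_all();		/* guarantee the drain */

		ret = migrate_one_pass(pages);		/* hypothetical helper */
	}
	return ret;
}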




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-09 15:39                                           ` Minchan Kim
@ 2009-09-09 23:58                                             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 84+ messages in thread
From: KOSAKI Motohiro @ 2009-09-09 23:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Christoph Lameter, Peter Zijlstra,
	Mike Galbraith, Ingo Molnar, linux-mm, Oleg Nesterov, lkml

> On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> The usefulness of a scheme like this requires:
> >>
> >> 1. There are cpus that continually execute user space code
> >>    without system interaction.
> >>
> >> 2. There are repeated VM activities that require page isolation /
> >>    migration.
> >>
> >> The first page isolation activity will then clear the lru caches of the
> >> processes doing number crunching in user space (and therefore the first
> >> isolation will still interrupt). The second and following isolation will
> >> then no longer interrupt the processes.
> >>
> >> 2. is rare. So the question is if the additional code in the LRU handling
> >> can be justified. If lru handling is not time sensitive then yes.
> >
> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > I think page migration don't need lru_add_drain_all() as synchronous, because
> > page migration have 10 times retry.
> >
> > Then asynchronous lru_add_drain_all() cause
> >
> >  - if system isn't under heavy pressure, retry succussfull.
> >  - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> >
> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> 
> I think, more exactly, we don't have to drain lru pages for mlocking.
> Mlocked pages will go into unevictable lru due to
> try_to_unmap when shrink of lru happens.

Right.

> How about removing draining in case of mlock?

Umm, I don't like this, because doing no drain at all often makes for strange test results.
I mean /proc/meminfo::Mlocked might show an unexpected value. It is not a leak, it's only a lazy cull,
but many testers and administrators will think it's a bug... ;)

Practically, lru_add_drain_all() is nearly zero cost, because mlock's page faulting is a very
costly operation that hides the drain cost. Right now we only want to handle a corner-case issue.
I don't want a dramatic change.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-09 23:58                                             ` KOSAKI Motohiro
@ 2009-09-10  1:00                                               ` Minchan Kim
  -1 siblings, 0 replies; 84+ messages in thread
From: Minchan Kim @ 2009-09-10  1:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Christoph Lameter, Peter Zijlstra, Mike Galbraith,
	Ingo Molnar, linux-mm, Oleg Nesterov, lkml

On Thu, 10 Sep 2009 08:58:20 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > >> The usefulness of a scheme like this requires:
> > >>
> > >> 1. There are cpus that continually execute user space code
> > >>    without system interaction.
> > >>
> > >> 2. There are repeated VM activities that require page isolation /
> > >>    migration.
> > >>
> > >> The first page isolation activity will then clear the lru caches of the
> > >> processes doing number crunching in user space (and therefore the first
> > >> isolation will still interrupt). The second and following isolation will
> > >> then no longer interrupt the processes.
> > >>
> > >> 2. is rare. So the question is if the additional code in the LRU handling
> > >> can be justified. If lru handling is not time sensitive then yes.
> > >
> > > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > > I think page migration don't need lru_add_drain_all() as synchronous, because
> > > page migration have 10 times retry.
> > >
> > > Then asynchronous lru_add_drain_all() cause
> > >
> > >  - if system isn't under heavy pressure, retry succussfull.
> > >  - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> > >
> > > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> > 
> > I think, more exactly, we don't have to drain lru pages for mlocking.
> > Mlocked pages will go into unevictable lru due to
> > try_to_unmap when shrink of lru happens.
> 
> Right.
> 
> > How about removing draining in case of mlock?
> 
> Umm, I don't like this. because perfectly no drain often make strange test result.
> I mean /proc/meminfo::Mlock might be displayed unexpected value. it is not leak. it's only lazy cull.
> but many tester and administrator wiill think it's bug... ;)

I agree. I have no objection to your approach. :)

> Practically, lru_add_drain_all() is nearly zero cost. because mlock's page fault is very
> costly operation. it hide drain cost. now, we only want to treat corner case issue. 
> I don't hope dramatic change.

Another problem is as follows.

Although some CPUs don't have anything to do, we still do the work on them.
HPC guys don't want to consume CPU cycles, as Christoph pointed out.
I liked Peter's idea with regard to this.
My approach can solve it, too,
but I agree it would be a dramatic change.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-10  1:00                                               ` Minchan Kim
@ 2009-09-10  1:15                                                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 84+ messages in thread
From: KOSAKI Motohiro @ 2009-09-10  1:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Christoph Lameter, Peter Zijlstra,
	Mike Galbraith, Ingo Molnar, linux-mm, Oleg Nesterov, lkml

> On Thu, 10 Sep 2009 08:58:20 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> > > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > >> The usefulness of a scheme like this requires:
> > > >>
> > > >> 1. There are cpus that continually execute user space code
> > > >>    without system interaction.
> > > >>
> > > >> 2. There are repeated VM activities that require page isolation /
> > > >>    migration.
> > > >>
> > > >> The first page isolation activity will then clear the lru caches of the
> > > >> processes doing number crunching in user space (and therefore the first
> > > >> isolation will still interrupt). The second and following isolation will
> > > >> then no longer interrupt the processes.
> > > >>
> > > >> 2. is rare. So the question is if the additional code in the LRU handling
> > > >> can be justified. If lru handling is not time sensitive then yes.
> > > >
> > > > Christoph, I'd like to discuss a somewhat related (and almost unrelated) thing.
> > > > I think page migration doesn't need lru_add_drain_all() to be synchronous, because
> > > > page migration retries up to 10 times.
> > > >
> > > > Then an asynchronous lru_add_drain_all() causes:
> > > >
> > > >  - if the system isn't under heavy pressure, the retry is successful.
> > > >  - if the system is under heavy pressure or an RT thread runs a busy loop, the retry fails.
> > > >
> > > > I don't think this is problematic behavior. Also, mlock can use an asynchronous lru drain.
> > > 
> > > I think, more exactly, we don't have to drain lru pages for mlocking.
> > > Mlocked pages will go into the unevictable lru via
> > > try_to_unmap when lru shrinking happens.
> > 
> > Right.
> > 
> > > How about removing draining in case of mlock?
> > 
> > Umm, I don't like this, because never draining at all often makes for strange test results.
> > I mean /proc/meminfo::Mlock might show an unexpected value. It is not a leak, it's only a lazy cull,
> > but many testers and administrators will think it's a bug... ;)
> 
> I agree. I have no objection to your approach. :)
> 
> > Practically, lru_add_drain_all() is nearly zero cost, because mlock's page fault is a very
> > costly operation; it hides the drain cost. Right now we only want to treat a corner-case issue.
> > I don't want a dramatic change.
> 
> Another problem is as follows.
> 
> Even though some CPUs don't have anything to do, we make them do the drain.
> HPC guys don't want to consume CPU cycles, as Christoph pointed out.
> I liked Peter's idea with regard to this.
> My approach can solve it, too.
> But I agree it would be a dramatic change.

Is Peter's + my approach bad?

It means:

  - the cpu the RT thread is bound to is not holding the page
	-> mlock succeeds thanks to Peter's improvement
  - the cpu the RT thread is bound to is holding the page
	-> mlock succeeds thanks to my approach;
	   the page is culled later.





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-10  1:15                                                 ` KOSAKI Motohiro
@ 2009-09-10  1:23                                                   ` Minchan Kim
  0 siblings, 0 replies; 84+ messages in thread
From: Minchan Kim @ 2009-09-10  1:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Christoph Lameter, Peter Zijlstra, Mike Galbraith,
	Ingo Molnar, linux-mm, Oleg Nesterov, lkml

On Thu, 10 Sep 2009 10:15:07 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Thu, 10 Sep 2009 08:58:20 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> > > > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > >> The usefulness of a scheme like this requires:
> > > > >>
> > > > >> 1. There are cpus that continually execute user space code
> > > > >>    without system interaction.
> > > > >>
> > > > >> 2. There are repeated VM activities that require page isolation /
> > > > >>    migration.
> > > > >>
> > > > >> The first page isolation activity will then clear the lru caches of the
> > > > >> processes doing number crunching in user space (and therefore the first
> > > > >> isolation will still interrupt). The second and following isolation will
> > > > >> then no longer interrupt the processes.
> > > > >>
> > > > >> 2. is rare. So the question is if the additional code in the LRU handling
> > > > >> can be justified. If lru handling is not time sensitive then yes.
> > > > >
> > > > > Christoph, I'd like to discuss a somewhat related (and almost unrelated) thing.
> > > > > I think page migration doesn't need lru_add_drain_all() to be synchronous, because
> > > > > page migration retries up to 10 times.
> > > > >
> > > > > Then an asynchronous lru_add_drain_all() causes:
> > > > >
> > > > >  - if the system isn't under heavy pressure, the retry is successful.
> > > > >  - if the system is under heavy pressure or an RT thread runs a busy loop, the retry fails.
> > > > >
> > > > > I don't think this is problematic behavior. Also, mlock can use an asynchronous lru drain.
> > > > 
> > > > I think, more exactly, we don't have to drain lru pages for mlocking.
> > > > Mlocked pages will go into the unevictable lru via
> > > > try_to_unmap when lru shrinking happens.
> > > 
> > > Right.
> > > 
> > > > How about removing draining in case of mlock?
> > > 
> > > Umm, I don't like this, because never draining at all often makes for strange test results.
> > > I mean /proc/meminfo::Mlock might show an unexpected value. It is not a leak, it's only a lazy cull,
> > > but many testers and administrators will think it's a bug... ;)
> > 
> > I agree. I have no objection to your approach. :)
> > 
> > > Practically, lru_add_drain_all() is nearly zero cost, because mlock's page fault is a very
> > > costly operation; it hides the drain cost. Right now we only want to treat a corner-case issue.
> > > I don't want a dramatic change.
> > 
> > Another problem is as follows.
> > 
> > Even though some CPUs don't have anything to do, we make them do the drain.
> > HPC guys don't want to consume CPU cycles, as Christoph pointed out.
> > I liked Peter's idea with regard to this.
> > My approach can solve it, too.
> > But I agree it would be a dramatic change.
> 
> Is Peter's + my approach bad?

Sounds good to me! :)

> It means:
> 
>   - the cpu the RT thread is bound to is not holding the page
> 	-> mlock succeeds thanks to Peter's improvement
>   - the cpu the RT thread is bound to is holding the page
> 	-> mlock succeeds thanks to my approach;
> 	   the page is culled later.
> 
> 
> 
> 


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [rfc] lru_add_drain_all() vs isolation
  2009-09-09 23:43                                             ` KOSAKI Motohiro
@ 2009-09-10 18:03                                               ` Christoph Lameter
  0 siblings, 0 replies; 84+ messages in thread
From: Christoph Lameter @ 2009-09-10 18:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, linux-mm,
	Oleg Nesterov, lkml

On Thu, 10 Sep 2009, KOSAKI Motohiro wrote:

> How about this?
>   - pass 1-2,  lru_add_drain_all_async()
>   - pass 3-10, lru_add_drain_all()
>
> this scheme might save the RT-thread case and never cause a regression. (I think)

Sounds good.

> The last remaining problem is: if the pagevec of the cpu the RT thread is bound to
> holds a page targeted for migration, migration still faces the same issue.
> But we can't solve that...
> The RT thread must use /proc/sys/vm/drop_caches properly.

A system call "sys_os_quiet_down" may be useful. It would drain all
caches, fold counters etc etc so that there will be no OS activities
needed for those things later.
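
For illustration only, a minimal userspace sketch of that "quiet the OS
down before the time-critical phase" idea, using nothing but the existing
/proc/sys/vm/drop_caches knob plus mlockall(); the sys_os_quiet_down()
call itself is hypothetical:

/* Quiesce as much background VM work as we can before the RT loop.
 * Needs root for drop_caches; illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static int quiet_down(void)
{
	int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

	if (fd < 0) {
		perror("open drop_caches");
		return -1;
	}
	/* "3" drops the page cache plus dentry/inode slabs. */
	if (write(fd, "3", 1) != 1)
		perror("write drop_caches");
	close(fd);

	/* Keep our own pages resident so we don't fault later. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return -1;
	}
	return 0;
}

int main(void)
{
	if (quiet_down() != 0)
		return 1;
	/* ... switch to SCHED_FIFO and enter the time-critical loop ... */
	return 0;
}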


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-08 17:26           ` Anirban Sinha
@ 2009-09-08 21:37             ` Anirban Sinha
  0 siblings, 0 replies; 84+ messages in thread
From: Anirban Sinha @ 2009-09-08 21:37 UTC (permalink / raw)
  To: Anirban Sinha, linux-kernel; +Cc: Anirban Sinha


> Looking at the git history, there have been several bugfixes to the rt
> bandwidth code from 2.6.26, one of them seems to be strictly related  
> to
> runtime accounting with your setup:
>
>   commit f6121f4f8708195e88cbdf8dd8d171b226b3f858
>   Author: Dario Faggioli <raistlin@linux.it>
>   Date:   Fri Oct 3 17:40:46 2008 +0200
>
>       sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq

Hmm. Indeed, there do seem to have been quite a few fixes to the accounting
logic. I back-patched our 2.6.26 kernel with the upstream patches that
seemed relevant and my test code now yields reasonable results.
Applying the above patch did not fix it though, which kind of makes
sense, since from the commit log it seems that the patch fixed cases
where the RT task was getting *less* CPU than its bandwidth allocation,
as opposed to more, as in my case. I haven't bisected the patchset to
figure out exactly which one fixed it, but I intend to do that later just
for fun.

For completeness, these are the results after applying the upstream  
patches *and* disabling bandwidth borrowing logic on my 2.6.26 kernel  
running on a quad core blade with CONFIG_GROUP_SCHED turned off (100HZ  
jiffies):

rt_runtime/
rt_period       % of SCHED_OTHER iterations

.40                100%
.50                74%
.60                47%
.70                31%
.80                18%
.90                 8%
.95                4%

--Ani



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-06 13:21         ` Fabio Checconi
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC484@exchange.ZeugmaSystems.local>
@ 2009-09-08 17:26           ` Anirban Sinha
  2009-09-08 21:37             ` Anirban Sinha
  1 sibling, 1 reply; 84+ messages in thread
From: Anirban Sinha @ 2009-09-08 17:26 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: linux-kernel, Ingo Molnar, a.p.zijlstra, Anirban Sinha, Dario Faggioli



> Looking at the git history, there have been several bugfixes to the rt
> bandwidth code from 2.6.26, one of them seems to be strictly related  
> to
> runtime accounting with your setup:
>
>    commit f6121f4f8708195e88cbdf8dd8d171b226b3f858
>    Author: Dario Faggioli <raistlin@linux.it>
>    Date:   Fri Oct 3 17:40:46 2008 +0200
>
>        sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq

Hmm. Indeed, there do seem to have been quite a few fixes to the accounting
logic. I back-patched our 2.6.26 kernel with the upstream patches that
seemed relevant and my test code now yields reasonable results.
Applying the above patch did not fix it though, which kind of makes
sense, since from the commit log it seems that the patch fixed cases
where the RT task was getting *less* CPU than its bandwidth allocation,
as opposed to more, as in my case. I haven't bisected the patchset to
figure out exactly which one fixed it, but I intend to do that later just
for fun.

For completeness, these are the results after applying the upstream  
patches *and* disabling bandwidth borrowing logic on my 2.6.26 kernel  
running on a quad core blade with CONFIG_GROUP_SCHED turned off (100HZ  
jiffies):

rt_runtime/
rt_period       % of SCHED_OTHER iterations

.40                100%
.50                74%
.60                47%
.70                31%
.80                18%
.90                 8%
.95                4%

--Ani


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC484@exchange.ZeugmaSystems.local>
@ 2009-09-07  0:26             ` Anirban Sinha
  0 siblings, 0 replies; 84+ messages in thread
From: Anirban Sinha @ 2009-09-07  0:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fabio Checconi, Ingo Molnar, a.p.zijlstra, Dario Faggioli, Anirban Sinha



 > Running your program I'm unable to reproduce the same issue on a recent
 > kernel here; for 25ms over 100ms across several runs I get less than 2%.
 > This number increases, reaching your values, only when using short
 > periods (where the meaning for short depends on your HZ value),


In our kernel, the jiffies are configured as 100 HZ.

 > which
 > is something to be expected, due to the fact that rt throttling uses
 > the tick to charge runtimes to tasks.

Hmm. I see. I understand that.


 > Looking at the git history, there have been several bugfixes to the rt
 > bandwidth code from 2.6.26, one of them seems to be strictly related to
 > runtime accounting with your setup:

I will apply these patches on Tuesday and rerun the tests.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-06  0:47       ` Anirban Sinha
@ 2009-09-06 13:21         ` Fabio Checconi
       [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC484@exchange.ZeugmaSystems.local>
  2009-09-08 17:26           ` Anirban Sinha
  0 siblings, 2 replies; 84+ messages in thread
From: Fabio Checconi @ 2009-09-06 13:21 UTC (permalink / raw)
  To: Anirban Sinha
  Cc: linux-kernel, Ingo Molnar, a.p.zijlstra, Anirban Sinha, Dario Faggioli

> From: Anirban Sinha <ani@anirban.org>
> Date: Sat, Sep 05, 2009 05:47:39PM -0700
>
> > You say you pin the threads to a single core: how many cores does your
> > system have?
> 
> The results I sent you were on a dual core blade.
> 
> 
> > If this is the case, this behavior is the expected one, the scheduler
> > tries to reduce the number of migrations, concentrating the bandwidth
> > of rt tasks on a single core.  With your workload it doesn't work well
> > because runtime migration has freed the other core(s) from rt bandwidth,
> > so these cores are available to SCHED_OTHER ones, but your SCHED_OTHER
> > thread is pinned and cannot make use of them.
> 
> But, I ran the same routine on a quadcore blade and the results this  
> time were:
> 
> rt_runtime/rt_period  % of iterations of reg thrd against rt thrd
> 
> 0.20                  46%
> 0.25                  18%
> 0.26                  7%
> 0.3                   0%
> 0.4                   0%
> (rest of the cases)   0%
> 
> So if the scheduler is concentrating all rt bandwidth to one core, it  
> should be effectively 0.2 * 4 = 0.8 for this core. Hence,  we should  
> see the percentage closer to 20% but it seems that it's more than  
> double. At ~0.25, the regular thread should make no progress, but it  
> seems it does make a little progress.

So this can be a bug.  While it is possible that the kernel does
not succeed in migrating all the runtime (e.g., due to a (system) rt
task consuming some bandwidth on a remote cpu), 46% instead of 20%
is too much.

Running your program I'm unable to reproduce the same issue on a recent
kernel here; for 25ms over 100ms across several runs I get less than 2%.
This number increases, reaching your values, only when using short
periods (where the meaning for short depends on your HZ value), which
is something to be expected, due to the fact that rt throttling uses
the tick to charge runtimes to tasks.
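
A back-of-the-envelope illustration of that granularity (my own numbers,
and only a rough model: assume runtime is charged once per tick, so a
task can overrun its budget by up to roughly one tick per period):

/* Rough model of tick-granularity error in RT throttling. */
#include <stdio.h>

int main(void)
{
	const int hz = 100;			/* CONFIG_HZ on the reported system */
	const double tick_ms = 1000.0 / hz;	/* one accounting point every 10 ms */
	const double periods_ms[] = { 1000.0, 100.0, 20.0 };
	unsigned int i;

	for (i = 0; i < sizeof(periods_ms) / sizeof(periods_ms[0]); i++)
		printf("period %6.1f ms: up to ~%.1f%% overrun from tick granularity\n",
		       periods_ms[i], 100.0 * tick_ms / periods_ms[i]);
	return 0;
}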

Looking at the git history, there have been several bugfixes to the rt
bandwidth code from 2.6.26, one of them seems to be strictly related to
runtime accounting with your setup:

    commit f6121f4f8708195e88cbdf8dd8d171b226b3f858
    Author: Dario Faggioli <raistlin@linux.it>
    Date:   Fri Oct 3 17:40:46 2008 +0200
    
        sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
       [not found]     ` <DDFD17CC94A9BD49A82147DDF7D545C54DC481@exchange.ZeugmaSystems.local>
@ 2009-09-06  0:47       ` Anirban Sinha
  2009-09-06 13:21         ` Fabio Checconi
  0 siblings, 1 reply; 84+ messages in thread
From: Anirban Sinha @ 2009-09-06  0:47 UTC (permalink / raw)
  To: linux-kernel, fchecconi, Ingo Molnar, a.p.zijlstra
  Cc: Anirban Sinha, Anirban Sinha



 > You say you pin the threads to a single core: how many cores does your
 > system have?

The results I sent you were on a dual core blade.


 > If this is the case, this behavior is the expected one, the scheduler
 > tries to reduce the number of migrations, concentrating the bandwidth
 > of rt tasks on a single core.  With your workload it doesn't work well
 > because runtime migration has freed the other core(s) from rt bandwidth,
 > so these cores are available to SCHED_OTHER ones, but your SCHED_OTHER
 > thread is pinned and cannot make use of them.

But, I ran the same routine on a quadcore blade and the results this  
time were:

rt_runtime/rt_period  % of iterations of reg thrd against rt thrd

0.20                  46%
0.25                  18%
0.26                  7%
0.3                   0%
0.4                   0%
(rest of the cases)   0%

So if the scheduler is concentrating all rt bandwidth to one core, it  
should be effectively 0.2 * 4 = 0.8 for this core. Hence,  we should  
see the percentage closer to 20% but it seems that it's more than  
double. At ~0.25, the regular thread should make no progress, but it  
seems it does make a little progress.
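
For what it's worth, a toy model of that concentration effect, just to
make the arithmetic explicit (it assumes runtime migration can pull the
full RT bandwidth of all cores onto the pinned core and ignores
accounting granularity entirely):

/* Toy model: with N cores and rt_runtime/rt_period = r, the pinned core
 * can end up running RT for min(1, N * r) of the time, leaving the rest
 * for SCHED_OTHER.  Illustrative only.
 */
#include <stdio.h>

int main(void)
{
	const int ncores = 4;
	const double ratios[] = { 0.20, 0.25, 0.30, 0.40 };
	unsigned int i;

	for (i = 0; i < sizeof(ratios) / sizeof(ratios[0]); i++) {
		double rt = ncores * ratios[i];

		if (rt > 1.0)
			rt = 1.0;
		printf("rt_runtime/rt_period = %.2f -> ~%.0f%% of the pinned core left for SCHED_OTHER\n",
		       ratios[i], 100.0 * (1.0 - rt));
	}
	return 0;
}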

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-05 20:40   ` Fabio Checconi
@ 2009-09-05 22:40     ` Lucas De Marchi
       [not found]     ` <DDFD17CC94A9BD49A82147DDF7D545C54DC481@exchange.ZeugmaSystems.local>
  1 sibling, 0 replies; 84+ messages in thread
From: Lucas De Marchi @ 2009-09-05 22:40 UTC (permalink / raw)
  To: Fabio Checconi; +Cc: Anirban Sinha, linux-kernel, Ingo Molnar, Peter Zijlstra

> You say you pin the threads to a single core: how many cores does your
> system have?
>
> If this is the case, this behavior is the expected one, the scheduler
> tries to reduce the number of migrations, concentrating the bandwidth
> of rt tasks on a single core.  With your workload it doesn't work well
> because runtime migration has freed the other core(s) from rt bandwidth,
> so these cores are available to SCHED_OTHER ones, but your SCHED_OTHER
> thread is pinned and cannot make use of them.

Indeed. I've tested this same test program in a single core machine and it
produces the expected behavior:

rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
95%                              4.48%
60%                              54.84%
50%                              86.03%
40%                              OTHER completed first


Lucas De Marchi

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-05  0:55 ` Anirban Sinha
  2009-09-05 17:43   ` Lucas De Marchi
@ 2009-09-05 20:40   ` Fabio Checconi
  2009-09-05 22:40     ` Lucas De Marchi
       [not found]     ` <DDFD17CC94A9BD49A82147DDF7D545C54DC481@exchange.ZeugmaSystems.local>
  1 sibling, 2 replies; 84+ messages in thread
From: Fabio Checconi @ 2009-09-05 20:40 UTC (permalink / raw)
  To: Anirban Sinha; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra

> From: Anirban Sinha <ASinha@zeugmasystems.com>
> Date: Fri, Sep 04, 2009 05:55:15PM -0700
>
> Hi Ingo and rest:
> 
> I have been playing around with the sched_rt_runtime_us cap that can be
> used to limit the amount of CPU time allocated towards scheduling rt
> group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled (we
> use only the root user in our embedded setup). I have no other CPU
> intensive workloads (RT or otherwise) running on my system. I have
> changed no other scheduling parameters from /proc. 
> 
> I have written a small test program that:
> 
> (a) forks two threads, one SCHED_FIFO and one SCHED_OTHER (this thread
> is reniced to -20) and ties both of them to a specific core.
> (b) runs both the threads in a tight loop (same number of iterations for
> both threads) until the SCHED_FIFO thread terminates.
> (c) calculates the number of completed iterations of the regular
> SCHED_OTHER thread against the fixed number of iterations of the
> SCHED_FIFO thread. It then calculates a percentage based on that.
> 
> I am running the above workload against varying sched_rt_runtime_us
> values (200 ms to 700 ms) keeping the sched_rt_period_us constant at
> 1000 ms. I have also experimented a little bit by decreasing the value
> of sched_rt_period_us (thus increasing the sched granularity) with no
> apparent change in behavior. 
> 
> My observations are listed in tabular form: 
> 
> Ratio of                  # of completed iterations of reg thread /
> sched_rt_runtime_us /     # of iterations of RT thread (in %)
> sched_rt_period_us
> 
> 0.2                      100 % (regular thread completed all its
> iterations).
> 0.3                      73 %
> 0.4                      45 %
> 0.5                      17 %
> 0.6                      0 % (SCHED_OTHER thread completely throttled.
> Never ran)
> 0.7                      0 %
> 
> This result kind of baffles me. Even when we cap the RT group to a
> fraction of 0.6 of overall CPU time, the rest 0.4 \should\ still be
> available for running regular threads. So my SCHED_OTHER \should\ make
> some progress as opposed to being completely throttled. Similarly, with
> any fraction less than 0.5, the SCHED_OTHER should complete before
> SCHED_FIFO.
> 
> I do not have an easy way to verify my results over the latest kernel
> (2.6.31). Was there any regressions in the scheduling subsystem in
> 2.6.26? Can this behavior be explained? Do we need to tweak any other
> /proc parameters?
> 

You say you pin the threads to a single core: how many cores does your
system have?

I don't know if 2.6.26 had anything wrong (from a quick look the relevant
code seems similar to what we have now), but something like that can be
the consequence of the runtime migration logic moving bandwidth from a
second core to the one executing the two tasks.

If this is the case, this behavior is the expected one, the scheduler
tries to reduce the number of migrations, concentrating the bandwidth
of rt tasks on a single core.  With your workload it doesn't work well
because runtime migration has freed the other core(s) from rt bandwidth,
so these cores are available to SCHED_OTHER ones, but your SCHED_OTHER
thread is pinned and cannot make use of them.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: question on sched-rt group allocation cap: sched_rt_runtime_us
  2009-09-05  0:55 ` Anirban Sinha
@ 2009-09-05 17:43   ` Lucas De Marchi
  2009-09-05 20:40   ` Fabio Checconi
  1 sibling, 0 replies; 84+ messages in thread
From: Lucas De Marchi @ 2009-09-05 17:43 UTC (permalink / raw)
  To: Anirban Sinha; +Cc: linux-kernel, Ingo Molnar

On Sat, Sep 5, 2009 at 02:55, Anirban Sinha<ASinha@zeugmasystems.com> wrote:
> Hi Ingo and rest:
>
> I have been playing around with the sched_rt_runtime_us cap that can be
> used to limit the amount of CPU time allocated towards scheduling rt
> group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled (we
> use only the root user in our embedded setup). I have no other CPU
> intensive workloads (RT or otherwise) running on my system. I have
> changed no other scheduling parameters from /proc.
>
> I have written a small test program that:

Would you mind sending the source of this test?

Lucas De Marchi

^ permalink raw reply	[flat|nested] 84+ messages in thread

* re: question on sched-rt group allocation cap: sched_rt_runtime_us
@ 2009-09-05 17:13 Anirban Sinha
  0 siblings, 0 replies; 84+ messages in thread
From: Anirban Sinha @ 2009-09-05 17:13 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar; +Cc: Anirban Sinha, Anirban Sinha

Hi again:

I am copying my test code here. I am really hoping to get some answers/ 
pointers. If there are whitespace/formatting issues in this mail,  
please let me know. I am using an alternate mailer.

Cheers,

Ani


/* Test code to experiment with the CPU allocation cap for a FIFO RT thread
 * spinning on a tight loop. Yes, you read it right: an RT thread on a
 * tight loop.
 */
#define _GNU_SOURCE

#include <sched.h>
#include <pthread.h>
#include <time.h>
#include <utmpx.h>
#include <stdio.h>
#include <stdlib.h>     /* atoi() */
#include <string.h>
#include <unistd.h>     /* nice() */
#include <limits.h>
#include <assert.h>

/* Shared progress counter; volatile so the compiler cannot collapse the
 * counting loop into a single store. */
volatile unsigned long reg_count;

void *fifo_thread(void *arg)
{
     int core = (int)(long) arg;   /* core ID passed through the void pointer */
     int i, j;
     cpu_set_t cpuset;
     struct sched_param fifo_schedparam;
     int fifo_policy;
     unsigned long start, end;
     unsigned long fifo_count = 0;

     CPU_ZERO(&cpuset);
     CPU_SET(core, &cpuset);

     assert(sched_setaffinity(0, sizeof cpuset, &cpuset) == 0);

     /* RT priority 1 - lowest */
     fifo_schedparam.sched_priority = 1;
     assert(pthread_setschedparam(pthread_self(), SCHED_FIFO,  
&fifo_schedparam) == 0);
     start = reg_count;
     printf("start reg_count=%llu\n", start);

     for(i = 0; i < 5; i++) {
         for(j = 0; j < UINT_MAX/10; j++) {
	  fifo_count++;
         }
     }
     printf("\nRT thread has terminated\n");
     end = reg_count;
     printf("end reg_count=%llu\n", end);
     printf("delta reg count = %llu\n", end-start);
     printf("fifo count = %llu\n", fifo_count);
     printf("% = %f\n", ((float)(end-start)*100)/(float)fifo_count);

     return NULL;
}

void *reg_thread(void *arg)
{
     int core = (int)(long) arg;   /* core ID passed through the void pointer */
     int i, j;
     int new_nice;
     cpu_set_t cpuset;
     struct sched_param fifo_schedparam;
     int fifo_policy;
     /* let's renice it to highest priority level */
     new_nice = nice(-20);
     printf("new nice value for regular thread=%d\n", new_nice);
     printf("regular thread dispatch(%d)\n", core);

     CPU_ZERO(&cpuset);
     CPU_SET(core, &cpuset);

     assert(sched_setaffinity(0, sizeof cpuset, &cpuset) == 0);

     for(i = 0; i < 5; i++) {
       for(j = 0; j < UINT_MAX/10; j++) {
	reg_count++;
       }
     }
     printf("\nregular thread has terminated\n");

     return NULL;
}


int main(int argc, char *argv[])
{
     char *core_str = NULL;
     int core;
     pthread_t tid1, tid2;
     pthread_attr_t attr;

     if(argc != 2) {
         fprintf(stderr, "Usage: %s <core-ID>\n", argv[0]);
         return -1;
     }
     reg_count = 0;

     core = atoi(argv[1]);

     pthread_attr_init(&attr);
     assert(pthread_attr_setschedpolicy(&attr, SCHED_FIFO) == 0);
     assert(pthread_create(&tid1, &attr, fifo_thread, (void *)(long)core) == 0);

     assert(pthread_attr_setschedpolicy(&attr, SCHED_OTHER) == 0);
     assert(pthread_create(&tid2, &attr, reg_thread, (void *)(long)core) == 0);

     pthread_join(tid1, NULL);
     pthread_join(tid2, NULL);

     return 0;
}

-----

From: Anirban Sinha
Sent: Fri 9/4/2009 5:55 PM
To:
Subject: question on sched-rt group allocation cap: sched_rt_runtime_us

Hi Ingo and rest:

I have been playing around with the sched_rt_runtime_us cap that can  
be used to limit the amount of CPU time allocated towards scheduling  
rt group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled  
(we use only the root user in our embedded setup). I have no other CPU  
intensive workloads (RT or otherwise) running on my system. I have  
changed no other scheduling parameters from /proc.

I have written a small test program that:

(a) forks two threads, one SCHED_FIFO and one SCHED_OTHER (this thread  
is reniced to -20) and ties both of them to a specific core.
(b) runs both the threads in a tight loop (same number of iterations  
for both threads) until the SCHED_FIFO thread terminates.
(c) calculates the number of completed iterations of the regular  
SCHED_OTHER thread against the fixed number of iterations of the  
SCHED_FIFO thread. It then calculates a percentage based on that.

I am running the above workload against varying sched_rt_runtime_us  
values (200 ms to 700 ms) keeping the sched_rt_period_us constant at  
1000 ms. I have also experimented a little bit by decreasing the value  
of sched_rt_period_us (thus increasing the sched granularity) with no  
apparent change in behavior.

My observations are listed in tabular form. The numbers in the two  
columns are:

rt_runtime_us /
rt_period_us

     Vs

completed iterations of reg thr /
all iterations of RT thr (in %)


0.2   100 % (reg thread completed all its iterations).
0.3   73 %
0.4   45 %
0.5   17 %
0.6   0 % (reg thr completely throttled. Never ran)
0.7   0 %

This result kind of baffles me. Even when we cap the RT group to a  
fraction of 0.6 of overall CPU time, the rest 0.4 \should\ still be  
available for running regular threads. So my SCHED_OTHER \should\ make  
some progress as opposed to being completely throttled. Similarly,  
with any fraction less than 0.5, the SCHED_OTHER should complete  
before SCHED_FIFO.

I do not have an easy way to verify my results over the latest kernel  
(2.6.31). Was there any regressions in the scheduling subsystem in  
2.6.26? Can this behavior be explained? Do we need to tweak any other / 
proc parameters?

Cheers,

Ani


  
     

^ permalink raw reply	[flat|nested] 84+ messages in thread

* question on sched-rt group allocation cap: sched_rt_runtime_us
@ 2009-09-05  0:55 ` Anirban Sinha
  2009-09-05 17:43   ` Lucas De Marchi
  2009-09-05 20:40   ` Fabio Checconi
  0 siblings, 2 replies; 84+ messages in thread
From: Anirban Sinha @ 2009-09-05  0:55 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar

Hi Ingo and rest:

I have been playing around with the sched_rt_runtime_us cap that can be
used to limit the amount of CPU time allocated towards scheduling rt
group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled (we
use only the root user in our embedded setup). I have no other CPU
intensive workloads (RT or otherwise) running on my system. I have
changed no other scheduling parameters from /proc. 

I have written a small test program that:

(a) forks two threads, one SCHED_FIFO and one SCHED_OTHER (this thread
is reniced to -20) and ties both of them to a specific core.
(b) runs both the threads in a tight loop (same number of iterations for
both threads) until the SCHED_FIFO thread terminates.
(c) calculates the number of completed iterations of the regular
SCHED_OTHER thread against the fixed number of iterations of the
SCHED_FIFO thread. It then calculates a percentage based on that.

I am running the above workload against varying sched_rt_runtime_us
values (200 ms to 700 ms) keeping the sched_rt_period_us constant at
1000 ms. I have also experimented a little bit by decreasing the value
of sched_rt_period_us (thus increasing the sched granularity) with no
apparent change in behavior. 

My observations are listed in tabular form: 

Ratio of                  # of completed iterations of reg thread /
sched_rt_runtime_us /     # of iterations of RT thread (in %)
sched_rt_period_us

0.2                      100 % (regular thread completed all its
iterations).
0.3                      73 %
0.4                      45 %
0.5                      17 %
0.6                      0 % (SCHED_OTHER thread completely throttled.
Never ran)
0.7                      0 %

This result kind of baffles me. Even when we cap the RT group to a
fraction of 0.6 of overall CPU time, the rest 0.4 \should\ still be
available for running regular threads. So my SCHED_OTHER \should\ make
some progress as opposed to being completely throttled. Similarly, with
any fraction less than 0.5, the SCHED_OTHER should complete before
SCHED_FIFO.

I do not have an easy way to verify my results over the latest kernel
(2.6.31). Was there any regressions in the scheduling subsystem in
2.6.26? Can this behavior be explained? Do we need to tweak any other
/proc parameters?

Cheers,

Ani

 
      

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2009-09-10 18:05 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <dgRNo-3uc-5@gated-at.bofh.it>
     [not found] ` <dhb9j-1hp-5@gated-at.bofh.it>
     [not found]   ` <dhcf5-263-13@gated-at.bofh.it>
2009-09-06  2:32     ` question on sched-rt group allocation cap: sched_rt_runtime_us Ani
2009-09-06  6:32       ` Mike Galbraith
2009-09-06 10:18         ` Mike Galbraith
     [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC482@exchange.ZeugmaSystems.local>
2009-09-06 15:09             ` Mike Galbraith
2009-09-07  0:41               ` Anirban Sinha
     [not found]               ` <1252311463.7586.26.camel@marge.simson.net>
2009-09-07 11:06                 ` [rfc] lru_add_drain_all() vs isolation Peter Zijlstra
2009-09-07 11:06                   ` Peter Zijlstra
2009-09-07 13:35                   ` Oleg Nesterov
2009-09-07 13:35                     ` Oleg Nesterov
2009-09-07 13:53                     ` Peter Zijlstra
2009-09-07 13:53                       ` Peter Zijlstra
2009-09-07 14:18                       ` Oleg Nesterov
2009-09-07 14:18                         ` Oleg Nesterov
2009-09-07 14:25                         ` Peter Zijlstra
2009-09-07 14:25                           ` Peter Zijlstra
2009-09-07 23:56                   ` KOSAKI Motohiro
2009-09-07 23:56                     ` KOSAKI Motohiro
2009-09-08  8:20                     ` Peter Zijlstra
2009-09-08  8:20                       ` Peter Zijlstra
2009-09-08 10:06                       ` KOSAKI Motohiro
2009-09-08 10:06                         ` KOSAKI Motohiro
2009-09-08 10:20                         ` Peter Zijlstra
2009-09-08 10:20                           ` Peter Zijlstra
2009-09-08 11:41                           ` KOSAKI Motohiro
2009-09-08 11:41                             ` KOSAKI Motohiro
2009-09-08 12:05                             ` Peter Zijlstra
2009-09-08 12:05                               ` Peter Zijlstra
2009-09-08 14:03                               ` Christoph Lameter
2009-09-08 14:03                                 ` Christoph Lameter
2009-09-08 14:20                                 ` Peter Zijlstra
2009-09-08 14:20                                   ` Peter Zijlstra
2009-09-08 15:22                                   ` Christoph Lameter
2009-09-08 15:22                                     ` Christoph Lameter
2009-09-08 15:27                                     ` Peter Zijlstra
2009-09-08 15:27                                       ` Peter Zijlstra
2009-09-08 15:32                                     ` Christoph Lameter
2009-09-08 15:32                                       ` Christoph Lameter
2009-09-09  4:27                                       ` KOSAKI Motohiro
2009-09-09  4:27                                         ` KOSAKI Motohiro
2009-09-09 14:08                                         ` Christoph Lameter
2009-09-09 14:08                                           ` Christoph Lameter
2009-09-09 23:43                                           ` KOSAKI Motohiro
2009-09-09 23:43                                             ` KOSAKI Motohiro
2009-09-10 18:03                                             ` Christoph Lameter
2009-09-10 18:03                                               ` Christoph Lameter
2009-09-09 15:39                                         ` Minchan Kim
2009-09-09 15:39                                           ` Minchan Kim
2009-09-09 16:18                                           ` Lee Schermerhorn
2009-09-09 16:18                                             ` Lee Schermerhorn
2009-09-09 16:46                                             ` Minchan Kim
2009-09-09 16:46                                               ` Minchan Kim
2009-09-09 23:58                                           ` KOSAKI Motohiro
2009-09-09 23:58                                             ` KOSAKI Motohiro
2009-09-10  1:00                                             ` Minchan Kim
2009-09-10  1:00                                               ` Minchan Kim
2009-09-10  1:15                                               ` KOSAKI Motohiro
2009-09-10  1:15                                                 ` KOSAKI Motohiro
2009-09-10  1:23                                                 ` Minchan Kim
2009-09-10  1:23                                                   ` Minchan Kim
2009-09-09  2:06                               ` KOSAKI Motohiro
2009-09-09  2:06                                 ` KOSAKI Motohiro
     [not found]         ` <DDFD17CC94A9BD49A82147DDF7D545C54DC483@exchange.ZeugmaSystems.local>
     [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC485@exchange.ZeugmaSystems.local>
2009-09-07  0:28             ` question on sched-rt group allocation cap: sched_rt_runtime_us Anirban Sinha
2009-09-07  6:54           ` Mike Galbraith
     [not found]             ` <DDFD17CC94A9BD49A82147DDF7D545C54DC489@exchange.ZeugmaSystems.local>
2009-09-08  7:10               ` Anirban Sinha
2009-09-08  9:26                 ` Mike Galbraith
2009-09-07  7:59         ` Peter Zijlstra
2009-09-07  8:24           ` Mike Galbraith
     [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC487@exchange.ZeugmaSystems.local>
2009-09-08  7:08             ` Anirban Sinha
2009-09-08  8:42               ` Peter Zijlstra
2009-09-08 14:41                 ` Anirban Sinha
     [not found]         ` <DDFD17CC94A9BD49A82147DDF7D545C54DC48B@exchange.ZeugmaSystems.local>
2009-09-08 17:41           ` Anirban Sinha
2009-09-08 19:06             ` Mike Galbraith
2009-09-08 19:34               ` Anirban Sinha
2009-09-09  4:10                 ` Mike Galbraith
2009-09-05 17:13 Anirban Sinha
     [not found] <Acotv57vW6nkRxQOQLuBf8W5yfJIlwAAtMAw>
2009-09-05  0:55 ` Anirban Sinha
2009-09-05 17:43   ` Lucas De Marchi
2009-09-05 20:40   ` Fabio Checconi
2009-09-05 22:40     ` Lucas De Marchi
     [not found]     ` <DDFD17CC94A9BD49A82147DDF7D545C54DC481@exchange.ZeugmaSystems.local>
2009-09-06  0:47       ` Anirban Sinha
2009-09-06 13:21         ` Fabio Checconi
     [not found]           ` <DDFD17CC94A9BD49A82147DDF7D545C54DC484@exchange.ZeugmaSystems.local>
2009-09-07  0:26             ` Anirban Sinha
2009-09-08 17:26           ` Anirban Sinha
2009-09-08 21:37             ` Anirban Sinha
