linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] oom: create a resource limit for oom_adj
@ 2010-11-11  4:35 Mandeep Singh Baines
  2010-11-11  7:35 ` David Rientjes
  2010-11-14  5:07 ` [PATCH] oom: create a resource limit for oom_adj KOSAKI Motohiro
  0 siblings, 2 replies; 19+ messages in thread
From: Mandeep Singh Baines @ 2010-11-11  4:35 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han
  Cc: linux-kernel, gspencer, piman, wad, olofj

For ChromiumOS, we'd like to be able to oom_adj a process up/down
as its leaves/enters the foreground. Currently, it is not possible
to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
to RLIMIT_NICE. This allows a process's oom_adj to be lowered
without CAP_SYS_RESOURCE as long as the new value is greater
than the resource limit.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
---
 fs/proc/base.c                 |   12 ++++++++++--
 include/asm-generic/resource.h |    5 ++++-
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..4384013 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
 	[RLIMIT_NICE] = {"Max nice priority", NULL},
 	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
 	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
+	[RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
 };
 
 /* Display limits for a process */
@@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	}
 
 	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
+		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
+		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
+
+		if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
+			unlock_task_sighand(task, &flags);
+			put_task_struct(task);
+			err = -EACCES;
+			goto err_sighand;
+		}
 	}
 
 	if (oom_adjust != task->signal->oom_adj) {
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 587566f..a8640a4 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -45,7 +45,9 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+#define RLIMIT_OOMADJ		16	/* max oom_adj allowed to lower to
+					   0-32 for oom level 15 .. -17 */
+#define RLIM_NLIMITS		17
 
 /*
  * SuS says limits have to be unsigned.
@@ -86,6 +88,7 @@
 	[RLIMIT_MSGQUEUE]	= {   MQ_BYTES_MAX,   MQ_BYTES_MAX },	\
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
+	[RLIMIT_OOMADJ]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
 }
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11  4:35 [PATCH] oom: create a resource limit for oom_adj Mandeep Singh Baines
@ 2010-11-11  7:35 ` David Rientjes
  2010-11-11 18:30   ` Mandeep Singh Baines
  2010-11-14  5:07 ` [PATCH] oom: create a resource limit for oom_adj KOSAKI Motohiro
  1 sibling, 1 reply; 19+ messages in thread
From: David Rientjes @ 2010-11-11  7:35 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj

On Wed, 10 Nov 2010, Mandeep Singh Baines wrote:

> For ChromiumOS, we'd like to be able to oom_adj a process up/down
> as its leaves/enters the foreground. Currently, it is not possible
> to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> without CAP_SYS_RESOURCE as long as the new value is greater
> than the resource limit.
> 

First of all, oom_adj is deprecated and scheduled for removal in a couple 
of years (see Documentation/feature-removal-schedule.txt) so any work in 
this area should be targeting oom_score_adj instead.

What is the anticipated use case for this?  We know that you want to lower 
oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an 
app moves from foreground to background?  I assume it's something like 
having an oom_adj of 0 in the background and +15 in the foreground.  If 
so, does /proc/sys/vm/oom_kill_allocating_task get you most of what you're 
looking for?

I'm wondering if we can avoid yet another resource limit for something 
like this.

> Alternative considered:
> 
> * a setuid binary
> * a daemon with CAP_SYS_RESOURCE
> 
> Since you don't wan't all processes to be able to reduce their
> oom_adj, a setuid or daemon implementation would be complex. The
> alternatives also have much higher overhead.
> 

What do you anticipate will be writing to oom_score_adj with this patch, 
the app itself?

> Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
> ---
>  fs/proc/base.c                 |   12 ++++++++++--
>  include/asm-generic/resource.h |    5 ++++-
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..4384013 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
>  	[RLIMIT_NICE] = {"Max nice priority", NULL},
>  	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
>  	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> +	[RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},

s/Max/Min, right?

>  };
>  
>  /* Display limits for a process */
> @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	}
>  
>  	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> -		err = -EACCES;
> -		goto err_sighand;
> +		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
> +		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +

Ouch, that's a rather unfortunate mapping.

> +		if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> +			unlock_task_sighand(task, &flags);
> +			put_task_struct(task);
> +			err = -EACCES;
> +			goto err_sighand;

err_sighand has duplicate unlock_task_sighand() and put_task_struct(); 
since you're missing the task_unlock(task) here, just using goto 
err_sighand would suffice.

> +		}
>  	}
>  
>  	if (oom_adjust != task->signal->oom_adj) {

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11  7:35 ` David Rientjes
@ 2010-11-11 18:30   ` Mandeep Singh Baines
  2010-11-11 20:57     ` David Rientjes
  0 siblings, 1 reply; 19+ messages in thread
From: Mandeep Singh Baines @ 2010-11-11 18:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mandeep Singh Baines, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj

David Rientjes (rientjes@google.com) wrote:
> On Wed, 10 Nov 2010, Mandeep Singh Baines wrote:
> 
> > For ChromiumOS, we'd like to be able to oom_adj a process up/down
> > as its leaves/enters the foreground. Currently, it is not possible
> > to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> > resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> > to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> > without CAP_SYS_RESOURCE as long as the new value is greater
> > than the resource limit.
> > 
> 
> First of all, oom_adj is deprecated and scheduled for removal in a couple 
> of years (see Documentation/feature-removal-schedule.txt) so any work in 
> this area should be targeting oom_score_adj instead.
> 

Ah. Thanks for the pointer.

> What is the anticipated use case for this?  We know that you want to lower 
> oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an 
> app moves from foreground to background?  I assume it's something like 

The focus here is the web browser's tabs. In our case, each is a process. If
OOM is going to kill a process, you'd rather it kill the tab you looked at
hours ago instead of the one you're looking at now. So you'd like to have a
policy where the LRU tab gets killed first. We'd like to use oom_score_adj
as the mechanism to implement an LRU policy like this.

> having an oom_adj of 0 in the background and +15 in the foreground.  If 
> so, does /proc/sys/vm/oom_kill_allocating_task get you most of what you're 
> looking for?
> 

As explained above, oom_kill_allocating_task won't give us what we want.

> I'm wondering if we can avoid yet another resource limit for something 
> like this.
> 
> > Alternative considered:
> > 
> > * a setuid binary
> > * a daemon with CAP_SYS_RESOURCE
> > 
> > Since you don't wan't all processes to be able to reduce their
> > oom_adj, a setuid or daemon implementation would be complex. The
> > alternatives also have much higher overhead.
> > 
> 
> What do you anticipate will be writing to oom_score_adj with this patch, 
> the app itself?
> 

A process in the browser session will do the adusting. We'd rather not give
it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
and down within the bounds set by the administrator. Analagous to renice()
which we also do using a similar policy.

> > Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
> > ---
> >  fs/proc/base.c                 |   12 ++++++++++--
> >  include/asm-generic/resource.h |    5 ++++-
> >  2 files changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index f3d02ca..4384013 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> >  	[RLIMIT_NICE] = {"Max nice priority", NULL},
> >  	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> >  	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> > +	[RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
> 
> s/Max/Min, right?
> 

This is a MAX value because of how resource limits work. On the other hand,
it is really controlling the minimum oom_adj. So its a toss up for me.
More than happy to change if you prefer Min.

> >  };
> >  
> >  /* Display limits for a process */
> > @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
> >  	}
> >  
> >  	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> > -		err = -EACCES;
> > -		goto err_sighand;
> > +		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
> > +		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> > +
> 
> Ouch, that's a rather unfortunate mapping.
> 

Unfortunate but unavoidable. The resource limit code checks to see if the
new limit is greater than the limit. This code was based on the can_nice()
code in sched.c.

> > +		if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> > +			unlock_task_sighand(task, &flags);
> > +			put_task_struct(task);
> > +			err = -EACCES;
> > +			goto err_sighand;
> 
> err_sighand has duplicate unlock_task_sighand() and put_task_struct(); 
> since you're missing the task_unlock(task) here, just using goto 
> err_sighand would suffice.
> 

D'oh. Forward port error. I should be more careful. Thanks for catching:)

> > +		}
> >  	}
> >  
> >  	if (oom_adjust != task->signal->oom_adj) {

Thank you for reviewing this patch.

Should I send an updated oom_score_adj patch?

Regards,
Mandeep

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 18:30   ` Mandeep Singh Baines
@ 2010-11-11 20:57     ` David Rientjes
  2010-11-11 22:25       ` Mandeep Singh Baines
  0 siblings, 1 reply; 19+ messages in thread
From: David Rientjes @ 2010-11-11 20:57 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj

On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

> > What is the anticipated use case for this?  We know that you want to lower 
> > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an 
> > app moves from foreground to background?  I assume it's something like 
> 
> The focus here is the web browser's tabs. In our case, each is a process. If
> OOM is going to kill a process, you'd rather it kill the tab you looked at
> hours ago instead of the one you're looking at now. So you'd like to have a
> policy where the LRU tab gets killed first. We'd like to use oom_score_adj
> as the mechanism to implement an LRU policy like this.
> 

Hmm, at first glance that seems potentially dangerous if the current tab 
generates a burt of memory allocations and it ends up killing all other 
tabs before finally targeting the culprit whereas currently the heuristic 
should do a good job of finding this problematic tab and killing it 
instantly.

Perhaps that can't happen and it probably doesn't even matter: 
oom_score_adj allows users to determine which process to kill regardless 
of the underlying reason.

> > What do you anticipate will be writing to oom_score_adj with this patch, 
> > the app itself?
> 
> A process in the browser session will do the adusting. We'd rather not give
> it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
> and down within the bounds set by the administrator. Analagous to renice()
> which we also do using a similar policy.
> 

So as more and more tabs get used, the least recently used tab gets its 
oom_score_adj raised higher and higher until it is reused itself and then 
it gets reset back to 0 for the current tab?

Is there a reason you don't want to give the underlying browser session 
process CAP_SYS_RESOURCE?  Will it not be enforcing resource limits to 
ensure tabs don't deplete all memory when certain sites are opened?  Are 
you concerned that it may deplete all memory itself (for which case you 
could raise its own oom_score_adj, which is a proportion of available 
memory so you can define where that point of depletiong is)?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 20:57     ` David Rientjes
@ 2010-11-11 22:25       ` Mandeep Singh Baines
  2010-11-11 23:19         ` David Rientjes
  0 siblings, 1 reply; 19+ messages in thread
From: Mandeep Singh Baines @ 2010-11-11 22:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mandeep Singh Baines, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj

David Rientjes (rientjes@google.com) wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:
> 
> > > What is the anticipated use case for this?  We know that you want to lower 
> > > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an 
> > > app moves from foreground to background?  I assume it's something like 
> > 
> > The focus here is the web browser's tabs. In our case, each is a process. If
> > OOM is going to kill a process, you'd rather it kill the tab you looked at
> > hours ago instead of the one you're looking at now. So you'd like to have a
> > policy where the LRU tab gets killed first. We'd like to use oom_score_adj
> > as the mechanism to implement an LRU policy like this.
> > 
> 
> Hmm, at first glance that seems potentially dangerous if the current tab 
> generates a burt of memory allocations and it ends up killing all other 
> tabs before finally targeting the culprit whereas currently the heuristic 
> should do a good job of finding this problematic tab and killing it 
> instantly.
> 

If you're watching a movie, video chatting, playing a game, etc. What
would you rather have killed: the current tab you are interacting with or
some tab you opened a while back and are no longer interacting with.

> Perhaps that can't happen and it probably doesn't even matter: 
> oom_score_adj allows users to determine which process to kill regardless 
> of the underlying reason.
> 
> > > What do you anticipate will be writing to oom_score_adj with this patch, 
> > > the app itself?
> > 
> > A process in the browser session will do the adusting. We'd rather not give
> > it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
> > and down within the bounds set by the administrator. Analagous to renice()
> > which we also do using a similar policy.
> > 
> 
> So as more and more tabs get used, the least recently used tab gets its 
> oom_score_adj raised higher and higher until it is reused itself and then 
> it gets reset back to 0 for the current tab?
> 

Exactly.

> Is there a reason you don't want to give the underlying browser session 
> process CAP_SYS_RESOURCE?  Will it not be enforcing resource limits to 

Security. We want to use the least-privilege possible. We really want to
avoid giving special privileges to the browser. You shouldn't need
administrative privileges to run the browser. We'd like for oom_score_adj
to work the same as nice. An unprivileged user can nice up and down as
long as the new setting is within the administratively configured
resource limit: ulimit -e.

> ensure tabs don't deplete all memory when certain sites are opened?  Are 
> you concerned that it may deplete all memory itself (for which case you 
> could raise its own oom_score_adj, which is a proportion of available 
> memory so you can define where that point of depletiong is)?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 22:25       ` Mandeep Singh Baines
@ 2010-11-11 23:19         ` David Rientjes
  2010-11-11 23:56           ` Mandeep Singh Baines
  0 siblings, 1 reply; 19+ messages in thread
From: David Rientjes @ 2010-11-11 23:19 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj

On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

> > Hmm, at first glance that seems potentially dangerous if the current tab 
> > generates a burt of memory allocations and it ends up killing all other 
> > tabs before finally targeting the culprit whereas currently the heuristic 
> > should do a good job of finding this problematic tab and killing it 
> > instantly.
> > 
> 
> If you're watching a movie, video chatting, playing a game, etc. What
> would you rather have killed: the current tab you are interacting with or
> some tab you opened a while back and are no longer interacting with.
> 

Well, it's a tangential point, but I'd personally prefer that my existing 
tabs that I've decided to leave open are guaranteed to remain open 
regardless of where I'm browsing next (they could hold valuable data that 
I can't easily get back) and avoid having all of them sacrificed out from 
under me for the newly opened tab.  I can always go back and close those 
tabs for more memory if I know I don't need them anymore and then retry 
the failed allocation.

> > So as more and more tabs get used, the least recently used tab gets its 
> > oom_score_adj raised higher and higher until it is reused itself and then 
> > it gets reset back to 0 for the current tab?
> > 
> 
> Exactly.
> 

We don't necessarily want arbitrary tasks to be able to decrease their 
oom_score_adj back to 0 if a CAP_SYS_RESOURCE thread has elevated it, 
that's part of the reason for the restriction (in addition to decreasing 
your own oom_score_adj all the way to OOM_SCORE_ADJ_MIN).

Would it suffice to allow a task to decrease its oom_score_adj back to the 
highest value that a CAP_SYS_RESOURCE thread set it or its inherited value 
at fork?  Assuming the thread that has forked it has oom_score_adj of 0, 
each tab could decrease it back from 0 upon activation unless a 
CAP_SYS_RESOURCE thread elevated it to something higher.

To do this, we'd need to save the highest oom_score_adj set by a 
CAP_SYS_RESOURCE in struct signal_struct.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 23:19         ` David Rientjes
@ 2010-11-11 23:56           ` Mandeep Singh Baines
  2010-11-13  0:46             ` [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down Mandeep Singh Baines
  0 siblings, 1 reply; 19+ messages in thread
From: Mandeep Singh Baines @ 2010-11-11 23:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mandeep Singh Baines, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj

David Rientjes (rientjes@google.com) wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:
> 
> > > Hmm, at first glance that seems potentially dangerous if the current tab 
> > > generates a burt of memory allocations and it ends up killing all other 
> > > tabs before finally targeting the culprit whereas currently the heuristic 
> > > should do a good job of finding this problematic tab and killing it 
> > > instantly.
> > > 
> > 
> > If you're watching a movie, video chatting, playing a game, etc. What
> > would you rather have killed: the current tab you are interacting with or
> > some tab you opened a while back and are no longer interacting with.
> > 
> 
> Well, it's a tangential point, but I'd personally prefer that my existing 
> tabs that I've decided to leave open are guaranteed to remain open 
> regardless of where I'm browsing next (they could hold valuable data that 
> I can't easily get back) and avoid having all of them sacrificed out from 
> under me for the newly opened tab.  I can always go back and close those 
> tabs for more memory if I know I don't need them anymore and then retry 
> the failed allocation.
> 
> > > So as more and more tabs get used, the least recently used tab gets its 
> > > oom_score_adj raised higher and higher until it is reused itself and then 
> > > it gets reset back to 0 for the current tab?
> > > 
> > 
> > Exactly.
> > 
> 
> We don't necessarily want arbitrary tasks to be able to decrease their 
> oom_score_adj back to 0 if a CAP_SYS_RESOURCE thread has elevated it, 
> that's part of the reason for the restriction (in addition to decreasing 
> your own oom_score_adj all the way to OOM_SCORE_ADJ_MIN).
> 
> Would it suffice to allow a task to decrease its oom_score_adj back to the 
> highest value that a CAP_SYS_RESOURCE thread set it or its inherited value 
> at fork?  Assuming the thread that has forked it has oom_score_adj of 0, 
> each tab could decrease it back from 0 upon activation unless a 
> CAP_SYS_RESOURCE thread elevated it to something higher.
> 
> To do this, we'd need to save the highest oom_score_adj set by a 
> CAP_SYS_RESOURCE in struct signal_struct.

Sounds good to me. I'll start working on this patch.

Thanks,
Mandeep

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down
  2010-11-11 23:56           ` Mandeep Singh Baines
@ 2010-11-13  0:46             ` Mandeep Singh Baines
  2010-11-14  1:37               ` David Rientjes
  0 siblings, 1 reply; 19+ messages in thread
From: Mandeep Singh Baines @ 2010-11-13  0:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj, Bodo Eggert

We'd like to be able to oom_score_adj a process up/down as its
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease
its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
it or its inherited value at fork. Assuming the thread that has forked
it has oom_score_adj of 0, each tab could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

This patch updated based on feedback from
David Rientjes <rientjes@google.com>.

Change-Id: If8f52363fd6c156e1730f43148aee987260e9c72
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
---
 fs/proc/base.c        |    4 +++-
 include/linux/sched.h |    2 ++
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..e617413 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto err_task_lock;
 	}
 
-	if (oom_score_adj < task->signal->oom_score_adj &&
+	if (oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
 		goto err_sighand;
@@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 			atomic_dec(&task->mm->oom_disable_count);
 	}
 	task->signal->oom_score_adj = oom_score_adj;
+	if (capable(CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = oom_score_adj;
 	/*
 	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
 	 * always attainable.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f53cdf2..2a71ee0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -626,6 +626,8 @@ struct signal_struct {
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
+	int oom_score_adj_min;	/* OOM kill score adjustment minimum value.
+				 * Only settable by CAP_SYS_RESOURCE. */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down
  2010-11-13  0:46             ` [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down Mandeep Singh Baines
@ 2010-11-14  1:37               ` David Rientjes
  2010-11-15 22:01                 ` [PATCH v2] " Mandeep Singh Baines
  0 siblings, 1 reply; 19+ messages in thread
From: David Rientjes @ 2010-11-14  1:37 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj, Bodo Eggert

On Fri, 12 Nov 2010, Mandeep Singh Baines wrote:

> We'd like to be able to oom_score_adj a process up/down as its
> enters/leaves the foreground. Currently, it is not possible to oom_adj
> down without CAP_SYS_RESOURCE. This patch allows a task to decrease
> its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
> it or its inherited value at fork. Assuming the thread that has forked
> it has oom_score_adj of 0, each tab could decrease it back from 0 upon
> activation unless a CAP_SYS_RESOURCE thread elevated it to something
> higher.
> 

oom_score_adj_min doesn't appear to be inherited at fork in your patch.

> Alternative considered:
> 
> * a setuid binary
> * a daemon with CAP_SYS_RESOURCE
> 
> Since you don't wan't all processes to be able to reduce their
> oom_adj, a setuid or daemon implementation would be complex. The
> alternatives also have much higher overhead.
> 

This behavior should be documented in Documentation/filesystems/proc.txt.

> This patch updated based on feedback from
> David Rientjes <rientjes@google.com>.
> 
> Change-Id: If8f52363fd6c156e1730f43148aee987260e9c72

I know what a Change-Id is , but nobody else here does :)

> Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
> ---
>  fs/proc/base.c        |    4 +++-
>  include/linux/sched.h |    2 ++
>  2 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..e617413 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
>  		goto err_task_lock;
>  	}
>  
> -	if (oom_score_adj < task->signal->oom_score_adj &&
> +	if (oom_score_adj < task->signal->oom_score_adj_min &&
>  			!capable(CAP_SYS_RESOURCE)) {
>  		err = -EACCES;
>  		goto err_sighand;
> @@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
>  			atomic_dec(&task->mm->oom_disable_count);
>  	}
>  	task->signal->oom_score_adj = oom_score_adj;
> +	if (capable(CAP_SYS_RESOURCE))
> +		task->signal->oom_score_adj_min = oom_score_adj;
>  	/*
>  	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
>  	 * always attainable.
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f53cdf2..2a71ee0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -626,6 +626,8 @@ struct signal_struct {
>  
>  	int oom_adj;		/* OOM kill score adjustment (bit shift) */
>  	int oom_score_adj;	/* OOM kill score adjustment */
> +	int oom_score_adj_min;	/* OOM kill score adjustment minimum value.
> +				 * Only settable by CAP_SYS_RESOURCE. */
>  
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11  4:35 [PATCH] oom: create a resource limit for oom_adj Mandeep Singh Baines
  2010-11-11  7:35 ` David Rientjes
@ 2010-11-14  5:07 ` KOSAKI Motohiro
  1 sibling, 0 replies; 19+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14  5:07 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: kosaki.motohiro, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Ying Han, linux-kernel,
	gspencer, piman, wad, olofj

Hi Mandeep,

> For ChromiumOS, we'd like to be able to oom_adj a process up/down
> as its leaves/enters the foreground. Currently, it is not possible
> to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> without CAP_SYS_RESOURCE as long as the new value is greater
> than the resource limit.
> 
> Alternative considered:
> 
> * a setuid binary
> * a daemon with CAP_SYS_RESOURCE
> 
> Since you don't wan't all processes to be able to reduce their
> oom_adj, a setuid or daemon implementation would be complex. The
> alternatives also have much higher overhead.
> 
> Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
> ---
>  fs/proc/base.c                 |   12 ++++++++++--
>  include/asm-generic/resource.h |    5 ++++-
>  2 files changed, 14 insertions(+), 3 deletions(-)

This concept sound useful for embedeed. but I dislike this interface
a bit. Why don't you create /proc/{pid}/oom_adj_lower_bound or similar? 
It is more straight forward because oom_adj are already using /proc.

I also think 15..-17 to 0-32 convertion is a bit user unfriendly.


> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..4384013 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
>  	[RLIMIT_NICE] = {"Max nice priority", NULL},
>  	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
>  	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> +	[RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
>  };
>  
>  /* Display limits for a process */
> @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	}
>  
>  	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> -		err = -EACCES;
> -		goto err_sighand;
> +		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
> +		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +
> +		if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {

two points.

1) task->signal->rlim[RLIMIT_OOMADJ].rlim_cur is incorrect.
   please use task_rlimit().

2) If process has CAP_SYS_RESOURCE, we should ignore RLIMIT_OOMADJ for
   backword compatibility. CAP_NICE do so. (see below)

------------------------------------------------------------------
int can_nice(const struct task_struct *p, const int nice)
{
        /* convert nice value [19,-20] to rlimit style value [1,40] */
        int nice_rlim = 20 - nice;

        return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
                capable(CAP_SYS_NICE));
}
------------------------------------------------------------------



> +			unlock_task_sighand(task, &flags);
> +			put_task_struct(task);
> +			err = -EACCES;
> +			goto err_sighand;
> +		}
>  	}
>  
>  	if (oom_adjust != task->signal->oom_adj) {
> diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
> index 587566f..a8640a4 100644
> --- a/include/asm-generic/resource.h
> +++ b/include/asm-generic/resource.h
> @@ -45,7 +45,9 @@
>  					   0-39 for nice level 19 .. -20 */
>  #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
>  #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
> -#define RLIM_NLIMITS		16
> +#define RLIMIT_OOMADJ		16	/* max oom_adj allowed to lower to
> +					   0-32 for oom level 15 .. -17 */
> +#define RLIM_NLIMITS		17
>  
>  /*
>   * SuS says limits have to be unsigned.
> @@ -86,6 +88,7 @@
>  	[RLIMIT_MSGQUEUE]	= {   MQ_BYTES_MAX,   MQ_BYTES_MAX },	\
>  	[RLIMIT_NICE]		= { 0, 0 },				\
>  	[RLIMIT_RTPRIO]		= { 0, 0 },				\
> +	[RLIMIT_OOMADJ]		= { 0, 0 },				\

I don't think 0 is good initial value because 0 mean oom_adj==15.



>  	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
>  }
>  
> -- 
> 1.7.3.1
> 




^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down
  2010-11-14  1:37               ` David Rientjes
@ 2010-11-15 22:01                 ` Mandeep Singh Baines
  2010-11-15 22:06                   ` David Rientjes
  0 siblings, 1 reply; 19+ messages in thread
From: Mandeep Singh Baines @ 2010-11-15 22:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj, Bodo Eggert

Hi,

Attached is V2 of this patch. Since V1:

* Added documentation in Documentation/filesystems/proc.txt
* Copy oom_score_adj_min value across a fork

An alternative to this patch would be to expose oom_score_adj_min via
proc (suggested by Kosaki Motohiro). You could allow non-CAP_SYS_RESOURCE
processes to irreversibly increase the minimum. That would give you
similar semantics to a resource limit and would give you a bit more
control but it means adding another proc variable.

Regards,
Mandeep

---

We'd like to be able to oom_score_adj a process up/down as its
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease
its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
it or its inherited value at fork. Assuming the thread that has forked
it has oom_score_adj of 0, each tab process could decrease it back from
0 upon activation unless a CAP_SYS_RESOURCE thread elevated it to
something higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

This patch updated from original patch based on feedback from
David Rientjes <rientjes@google.com>.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
---
 Documentation/filesystems/proc.txt |    4 ++++
 fs/proc/base.c                     |    4 +++-
 include/linux/sched.h              |    2 ++
 kernel/fork.c                      |    1 +
 4 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e73df27..7139c50 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1296,6 +1296,10 @@ scaled linearly with /proc/<pid>/oom_score_adj.
 Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
 other with its scaled value.
 
+The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
+value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
+requires CAP_SYS_RESOURCE.
+
 NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
 Documentation/feature-removal-schedule.txt.
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..e617413 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto err_task_lock;
 	}
 
-	if (oom_score_adj < task->signal->oom_score_adj &&
+	if (oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
 		goto err_sighand;
@@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 			atomic_dec(&task->mm->oom_disable_count);
 	}
 	task->signal->oom_score_adj = oom_score_adj;
+	if (capable(CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = oom_score_adj;
 	/*
 	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
 	 * always attainable.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f53cdf2..2a71ee0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -626,6 +626,8 @@ struct signal_struct {
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
+	int oom_score_adj_min;	/* OOM kill score adjustment minimum value.
+				 * Only settable by CAP_SYS_RESOURCE. */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..0979527 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -907,6 +907,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
+	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down
  2010-11-15 22:01                 ` [PATCH v2] " Mandeep Singh Baines
@ 2010-11-15 22:06                   ` David Rientjes
  2010-11-16  0:03                     ` [PATCH v3] " Mandeep Singh Baines
  0 siblings, 1 reply; 19+ messages in thread
From: David Rientjes @ 2010-11-15 22:06 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj, Bodo Eggert

On Mon, 15 Nov 2010, Mandeep Singh Baines wrote:

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..e617413 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
>  		goto err_task_lock;
>  	}
>  
> -	if (oom_score_adj < task->signal->oom_score_adj &&
> +	if (oom_score_adj < task->signal->oom_score_adj_min &&
>  			!capable(CAP_SYS_RESOURCE)) {
>  		err = -EACCES;
>  		goto err_sighand;
> @@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
>  			atomic_dec(&task->mm->oom_disable_count);
>  	}
>  	task->signal->oom_score_adj = oom_score_adj;
> +	if (capable(CAP_SYS_RESOURCE))
> +		task->signal->oom_score_adj_min = oom_score_adj;
>  	/*
>  	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
>  	 * always attainable.

This should be has_capability_noaudit(current, CAP_SYS_RESOURCE) instead, 
we don't want an audit message to be emitted when checking if 
oom_score_adj_min should be set.

Other than that:

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v3] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down
  2010-11-15 22:06                   ` David Rientjes
@ 2010-11-16  0:03                     ` Mandeep Singh Baines
  0 siblings, 0 replies; 19+ messages in thread
From: Mandeep Singh Baines @ 2010-11-16  0:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mandeep Singh Baines, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj, Bodo Eggert

We'd like to be able to oom_score_adj a process up/down as its
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease
its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
it or its inherited value at fork. Assuming the thread that has forked
it has oom_score_adj of 0, each tab process could decrease it back from
0 upon activation unless a CAP_SYS_RESOURCE thread elevated it to
something higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

This patch updated from original patch based on feedback from
David Rientjes <rientjes@google.com>.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |    4 ++++
 fs/proc/base.c                     |    4 +++-
 include/linux/sched.h              |    2 ++
 kernel/fork.c                      |    1 +
 4 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e73df27..7139c50 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1296,6 +1296,10 @@ scaled linearly with /proc/<pid>/oom_score_adj.
 Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
 other with its scaled value.
 
+The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
+value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
+requires CAP_SYS_RESOURCE.
+
 NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
 Documentation/feature-removal-schedule.txt.
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..7b1a9df 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto err_task_lock;
 	}
 
-	if (oom_score_adj < task->signal->oom_score_adj &&
+	if (oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
 		goto err_sighand;
@@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 			atomic_dec(&task->mm->oom_disable_count);
 	}
 	task->signal->oom_score_adj = oom_score_adj;
+	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = oom_score_adj;
 	/*
 	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
 	 * always attainable.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f53cdf2..2a71ee0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -626,6 +626,8 @@ struct signal_struct {
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
+	int oom_score_adj_min;	/* OOM kill score adjustment minimum value.
+				 * Only settable by CAP_SYS_RESOURCE. */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..0979527 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -907,6 +907,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
+	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-14 21:42           ` David Rientjes
@ 2010-11-23  7:16             ` KOSAKI Motohiro
  0 siblings, 0 replies; 19+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23  7:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, 7eggert, Andrew Morton, KAMEZAWA Hiroyuki,
	Rik van Riel, Ying Han, linux-kernel, gspencer, piman, wad,
	olofj, Linus Torvalds

> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> 
> > I've sent revert patch to linus. We DON'T want to break any userland seriously. 
> > So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
> > later, but It must don't break current userland.
> > 
> 
> Nobody is going to stall their development waiting for your patches to be 
> merged (especially when you've proposed them four times before and they 
> haven't been), so patches in this area should be against Linus' latest 
> tree.

Nobody want to accept you break userland.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-14  5:07         ` KOSAKI Motohiro
@ 2010-11-14 21:42           ` David Rientjes
  2010-11-23  7:16             ` KOSAKI Motohiro
  0 siblings, 1 reply; 19+ messages in thread
From: David Rientjes @ 2010-11-14 21:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: 7eggert, Andrew Morton, KAMEZAWA Hiroyuki, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj,
	Linus Torvalds

On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:

> I've sent revert patch to linus. We DON'T want to break any userland seriously. 
> So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
> later, but It must don't break current userland.
> 

Nobody is going to stall their development waiting for your patches to be 
merged (especially when you've proposed them four times before and they 
haven't been), so patches in this area should be against Linus' latest 
tree.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 23:15       ` Bodo Eggert
  2010-11-11 23:21         ` David Rientjes
@ 2010-11-14  5:07         ` KOSAKI Motohiro
  2010-11-14 21:42           ` David Rientjes
  1 sibling, 1 reply; 19+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14  5:07 UTC (permalink / raw)
  To: 7eggert
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Rik van Riel, Ying Han, linux-kernel,
	gspencer, piman, wad, olofj, Linus Torvalds

> > Hmm, at first glance that seems potentially dangerous if the current tab
> > generates a burt of memory allocations and it ends up killing all other
> > tabs before finally targeting the culprit whereas currently the heuristic
> > should do a good job of finding this problematic tab and killing it
> > instantly.
> 
> The original oom_adj design would e.g. adjust all background tabs to seem
> twice as bad as the current tab, so a ever-growing current tab would
> only be able to get one or two tabs killed in the worst case:
> 
> System is near OOM while the current tab starts using a normal amount of
> mem and grows beyond limits. After it easily killed the first bg tab by
> allocating one byte, it needs to grow to twice the normal tab memsize to
> start the OOM killer again. By then, it's score will be equal to normal
> background tabs (half size, double score), and killing the second tab will
> be luck. Killing the third tab should be impossible.
> 
> (If you adjust the bg tabs to be four times as killable, you'll get what
>  you asked for.)

Sorry for the delay. 

I've sent revert patch to linus. We DON'T want to break any userland seriously. 
So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
later, but It must don't break current userland.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 23:15       ` Bodo Eggert
@ 2010-11-11 23:21         ` David Rientjes
  2010-11-14  5:07         ` KOSAKI Motohiro
  1 sibling, 0 replies; 19+ messages in thread
From: David Rientjes @ 2010-11-11 23:21 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj

On Fri, 12 Nov 2010, Bodo Eggert wrote:

> The original oom_adj design would e.g. adjust all background tabs to seem
> twice as bad as the current tab, so a ever-growing current tab would
> only be able to get one or two tabs killed in the worst case:
> 
> System is near OOM while the current tab starts using a normal amount of
> mem and grows beyond limits. After it easily killed the first bg tab by
> allocating one byte, it needs to grow to twice the normal tab memsize to
> start the OOM killer again. By then, it's score will be equal to normal
> background tabs (half size, double score), and killing the second tab will
> be luck. Killing the third tab should be impossible.
> 
> (If you adjust the bg tabs to be four times as killable, you'll get what
>  you asked for.)
> 
> 
> I don't know the current oom_score_adj, it should be able to do something
> similar? Or should there be a oom_score_mul?
> 

oom_score_adj is done on a linear scale instead of exponential so instead 
of increasing oom_adj by 1 for each hop to the current tab, you would need 
to double it.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
       [not found]     ` <fNMps-1S1-21@gated-at.bofh.it>
@ 2010-11-11 23:15       ` Bodo Eggert
  2010-11-11 23:21         ` David Rientjes
  2010-11-14  5:07         ` KOSAKI Motohiro
  0 siblings, 2 replies; 19+ messages in thread
From: Bodo Eggert @ 2010-11-11 23:15 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj

David Rientjes <rientjes@google.com> wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

>> > What is the anticipated use case for this?  We know that you want to lower
>> > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an
>> > app moves from foreground to background?  I assume it's something like
>> 
>> The focus here is the web browser's tabs. In our case, each is a process. If
>> OOM is going to kill a process, you'd rather it kill the tab you looked at
>> hours ago instead of the one you're looking at now. So you'd like to have a
>> policy where the LRU tab gets killed first. We'd like to use oom_score_adj
>> as the mechanism to implement an LRU policy like this.
>> 
> 
> Hmm, at first glance that seems potentially dangerous if the current tab
> generates a burt of memory allocations and it ends up killing all other
> tabs before finally targeting the culprit whereas currently the heuristic
> should do a good job of finding this problematic tab and killing it
> instantly.

The original oom_adj design would e.g. adjust all background tabs to seem
twice as bad as the current tab, so a ever-growing current tab would
only be able to get one or two tabs killed in the worst case:

System is near OOM while the current tab starts using a normal amount of
mem and grows beyond limits. After it easily killed the first bg tab by
allocating one byte, it needs to grow to twice the normal tab memsize to
start the OOM killer again. By then, it's score will be equal to normal
background tabs (half size, double score), and killing the second tab will
be luck. Killing the third tab should be impossible.

(If you adjust the bg tabs to be four times as killable, you'll get what
 you asked for.)


I don't know the current oom_score_adj, it should be able to do something
similar? Or should there be a oom_score_mul?

>> > What do you anticipate will be writing to oom_score_adj with this patch,
>> > the app itself?
>> 
>> A process in the browser session will do the adusting. We'd rather not give
>> it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
>> and down within the bounds set by the administrator. Analagous to renice()
>> which we also do using a similar policy.
>> 
> 
> So as more and more tabs get used, the least recently used tab gets its
> oom_score_adj raised higher and higher until it is reused itself and then
> it gets reset back to 0 for the current tab?

As far as I understand, a background tab should get a higher score if it
rests untouched for some time, and if it gets used again, it will jump to
neutral oom_adj.

> Is there a reason you don't want to give the underlying browser session
> process CAP_SYS_RESOURCE?

I should not have control over a CAP_SYS_RESOURCE process while being a user,
but I should be able to tell the system which process to kill. When I designed
the oom_adj, I did not think about one-process-per-tab, so I did not suggest
a soft/hard limit.

I think the concept of using an rlimit is the right thing to do, I did not yet
look/think about the mapping. Glancing at oom_score_adj does suggest a different
range ...


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
@ 2010-11-11  5:19 Figo.zhang
  0 siblings, 0 replies; 19+ messages in thread
From: Figo.zhang @ 2010-11-11  5:19 UTC (permalink / raw)
  To: linux-kernel


>   	if (oom_adjust<  task->signal->oom_adj&&  !capable(CAP_SYS_RESOURCE)) {
> -		err = -EACCES;
> -		goto err_sighand;
> +		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
> +		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +
> +		if (oom_rlim>  task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> +			unlock_task_sighand(task,&flags);
> +			put_task_struct(task);
> +			err = -EACCES;
> +			goto err_sighand;
> +		}
>   	}

=> Label "err_sighand" have do that, why are you do that on here?

+			err = -EACCES;
+			goto err_sighand;
+		}
 	}



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2010-11-23  7:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-11  4:35 [PATCH] oom: create a resource limit for oom_adj Mandeep Singh Baines
2010-11-11  7:35 ` David Rientjes
2010-11-11 18:30   ` Mandeep Singh Baines
2010-11-11 20:57     ` David Rientjes
2010-11-11 22:25       ` Mandeep Singh Baines
2010-11-11 23:19         ` David Rientjes
2010-11-11 23:56           ` Mandeep Singh Baines
2010-11-13  0:46             ` [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down Mandeep Singh Baines
2010-11-14  1:37               ` David Rientjes
2010-11-15 22:01                 ` [PATCH v2] " Mandeep Singh Baines
2010-11-15 22:06                   ` David Rientjes
2010-11-16  0:03                     ` [PATCH v3] " Mandeep Singh Baines
2010-11-14  5:07 ` [PATCH] oom: create a resource limit for oom_adj KOSAKI Motohiro
2010-11-11  5:19 Figo.zhang
     [not found] <fNx73-1cI-1@gated-at.bofh.it>
     [not found] ` <fNzVf-5UY-3@gated-at.bofh.it>
     [not found]   ` <fNKdY-6FU-11@gated-at.bofh.it>
     [not found]     ` <fNMps-1S1-21@gated-at.bofh.it>
2010-11-11 23:15       ` Bodo Eggert
2010-11-11 23:21         ` David Rientjes
2010-11-14  5:07         ` KOSAKI Motohiro
2010-11-14 21:42           ` David Rientjes
2010-11-23  7:16             ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).