linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] oom: create a resource limit for oom_adj
       [not found]     ` <fNMps-1S1-21@gated-at.bofh.it>
@ 2010-11-11 23:15       ` Bodo Eggert
  2010-11-11 23:21         ` David Rientjes
  2010-11-14  5:07         ` KOSAKI Motohiro
       [not found]       ` <fNNOx-4qf-1@gated-at.bofh.it>
  1 sibling, 2 replies; 15+ messages in thread
From: Bodo Eggert @ 2010-11-11 23:15 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj

David Rientjes <rientjes@google.com> wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

>> > What is the anticipated use case for this?  We know that you want to lower
>> > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an
>> > app moves from foreground to background?  I assume it's something like
>> 
>> The focus here is the web browser's tabs. In our case, each is a process. If
>> OOM is going to kill a process, you'd rather it kill the tab you looked at
>> hours ago instead of the one you're looking at now. So you'd like to have a
>> policy where the LRU tab gets killed first. We'd like to use oom_score_adj
>> as the mechanism to implement an LRU policy like this.
>> 
> 
> Hmm, at first glance that seems potentially dangerous if the current tab
> generates a burt of memory allocations and it ends up killing all other
> tabs before finally targeting the culprit whereas currently the heuristic
> should do a good job of finding this problematic tab and killing it
> instantly.

The original oom_adj design would e.g. adjust all background tabs to seem
twice as bad as the current tab, so a ever-growing current tab would
only be able to get one or two tabs killed in the worst case:

System is near OOM while the current tab starts using a normal amount of
mem and grows beyond limits. After it easily killed the first bg tab by
allocating one byte, it needs to grow to twice the normal tab memsize to
start the OOM killer again. By then, it's score will be equal to normal
background tabs (half size, double score), and killing the second tab will
be luck. Killing the third tab should be impossible.

(If you adjust the bg tabs to be four times as killable, you'll get what
 you asked for.)


I don't know the current oom_score_adj, it should be able to do something
similar? Or should there be a oom_score_mul?

>> > What do you anticipate will be writing to oom_score_adj with this patch,
>> > the app itself?
>> 
>> A process in the browser session will do the adusting. We'd rather not give
>> it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
>> and down within the bounds set by the administrator. Analagous to renice()
>> which we also do using a similar policy.
>> 
> 
> So as more and more tabs get used, the least recently used tab gets its
> oom_score_adj raised higher and higher until it is reused itself and then
> it gets reset back to 0 for the current tab?

As far as I understand, a background tab should get a higher score if it
rests untouched for some time, and if it gets used again, it will jump to
neutral oom_adj.

> Is there a reason you don't want to give the underlying browser session
> process CAP_SYS_RESOURCE?

I should not have control over a CAP_SYS_RESOURCE process while being a user,
but I should be able to tell the system which process to kill. When I designed
the oom_adj, I did not think about one-process-per-tab, so I did not suggest
a soft/hard limit.

I think the concept of using an rlimit is the right thing to do, I did not yet
look/think about the mapping. Glancing at oom_score_adj does suggest a different
range ...


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 23:15       ` [PATCH] oom: create a resource limit for oom_adj Bodo Eggert
@ 2010-11-11 23:21         ` David Rientjes
  2010-11-14  5:07         ` KOSAKI Motohiro
  1 sibling, 0 replies; 15+ messages in thread
From: David Rientjes @ 2010-11-11 23:21 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj

On Fri, 12 Nov 2010, Bodo Eggert wrote:

> The original oom_adj design would e.g. adjust all background tabs to seem
> twice as bad as the current tab, so a ever-growing current tab would
> only be able to get one or two tabs killed in the worst case:
> 
> System is near OOM while the current tab starts using a normal amount of
> mem and grows beyond limits. After it easily killed the first bg tab by
> allocating one byte, it needs to grow to twice the normal tab memsize to
> start the OOM killer again. By then, it's score will be equal to normal
> background tabs (half size, double score), and killing the second tab will
> be luck. Killing the third tab should be impossible.
> 
> (If you adjust the bg tabs to be four times as killable, you'll get what
>  you asked for.)
> 
> 
> I don't know the current oom_score_adj, it should be able to do something
> similar? Or should there be a oom_score_mul?
> 

oom_score_adj is done on a linear scale instead of exponential so instead 
of increasing oom_adj by 1 for each hop to the current tab, you would need 
to double it.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj  down
       [not found]             ` <fOctz-4o0-9@gated-at.bofh.it>
@ 2010-11-14  0:05               ` Bodo Eggert
  0 siblings, 0 replies; 15+ messages in thread
From: Bodo Eggert @ 2010-11-14  0:05 UTC (permalink / raw)
  To: Mandeep Singh Baines, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Ying Han,
	linux-kernel, gspencer, piman, wad, olofj, Bodo Eggert

Mandeep Singh Baines <msb@google.com> wrote:

> We'd like to be able to oom_score_adj a process up/down as its
> enters/leaves the foreground. Currently, it is not possible to oom_adj
> down without CAP_SYS_RESOURCE. This patch allows a task to decrease
> its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
> it or its inherited value at fork. Assuming the thread that has forked
> it has oom_score_adj of 0, each tab

process

> could decrease it back from 0 upon 
> activation unless a CAP_SYS_RESOURCE thread elevated it to something
> higher.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 23:15       ` [PATCH] oom: create a resource limit for oom_adj Bodo Eggert
  2010-11-11 23:21         ` David Rientjes
@ 2010-11-14  5:07         ` KOSAKI Motohiro
  2010-11-14 21:42           ` David Rientjes
  1 sibling, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14  5:07 UTC (permalink / raw)
  To: 7eggert
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Rik van Riel, Ying Han, linux-kernel,
	gspencer, piman, wad, olofj, Linus Torvalds

> > Hmm, at first glance that seems potentially dangerous if the current tab
> > generates a burt of memory allocations and it ends up killing all other
> > tabs before finally targeting the culprit whereas currently the heuristic
> > should do a good job of finding this problematic tab and killing it
> > instantly.
> 
> The original oom_adj design would e.g. adjust all background tabs to seem
> twice as bad as the current tab, so a ever-growing current tab would
> only be able to get one or two tabs killed in the worst case:
> 
> System is near OOM while the current tab starts using a normal amount of
> mem and grows beyond limits. After it easily killed the first bg tab by
> allocating one byte, it needs to grow to twice the normal tab memsize to
> start the OOM killer again. By then, it's score will be equal to normal
> background tabs (half size, double score), and killing the second tab will
> be luck. Killing the third tab should be impossible.
> 
> (If you adjust the bg tabs to be four times as killable, you'll get what
>  you asked for.)

Sorry for the delay. 

I've sent revert patch to linus. We DON'T want to break any userland seriously. 
So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
later, but It must don't break current userland.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-14  5:07         ` KOSAKI Motohiro
@ 2010-11-14 21:42           ` David Rientjes
  2010-11-23  7:16             ` KOSAKI Motohiro
  0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2010-11-14 21:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: 7eggert, Andrew Morton, KAMEZAWA Hiroyuki, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj,
	Linus Torvalds

On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:

> I've sent revert patch to linus. We DON'T want to break any userland seriously. 
> So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
> later, but It must don't break current userland.
> 

Nobody is going to stall their development waiting for your patches to be 
merged (especially when you've proposed them four times before and they 
haven't been), so patches in this area should be against Linus' latest 
tree.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-14 21:42           ` David Rientjes
@ 2010-11-23  7:16             ` KOSAKI Motohiro
  0 siblings, 0 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23  7:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, 7eggert, Andrew Morton, KAMEZAWA Hiroyuki,
	Rik van Riel, Ying Han, linux-kernel, gspencer, piman, wad,
	olofj, Linus Torvalds

> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> 
> > I've sent revert patch to linus. We DON'T want to break any userland seriously. 
> > So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
> > later, but It must don't break current userland.
> > 
> 
> Nobody is going to stall their development waiting for your patches to be 
> merged (especially when you've proposed them four times before and they 
> haven't been), so patches in this area should be against Linus' latest 
> tree.

Nobody want to accept you break userland.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11  4:35 Mandeep Singh Baines
  2010-11-11  7:35 ` David Rientjes
@ 2010-11-14  5:07 ` KOSAKI Motohiro
  1 sibling, 0 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14  5:07 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: kosaki.motohiro, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Ying Han, linux-kernel,
	gspencer, piman, wad, olofj

Hi Mandeep,

> For ChromiumOS, we'd like to be able to oom_adj a process up/down
> as its leaves/enters the foreground. Currently, it is not possible
> to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> without CAP_SYS_RESOURCE as long as the new value is greater
> than the resource limit.
> 
> Alternative considered:
> 
> * a setuid binary
> * a daemon with CAP_SYS_RESOURCE
> 
> Since you don't wan't all processes to be able to reduce their
> oom_adj, a setuid or daemon implementation would be complex. The
> alternatives also have much higher overhead.
> 
> Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
> ---
>  fs/proc/base.c                 |   12 ++++++++++--
>  include/asm-generic/resource.h |    5 ++++-
>  2 files changed, 14 insertions(+), 3 deletions(-)

This concept sound useful for embedeed. but I dislike this interface
a bit. Why don't you create /proc/{pid}/oom_adj_lower_bound or similar? 
It is more straight forward because oom_adj are already using /proc.

I also think 15..-17 to 0-32 convertion is a bit user unfriendly.


> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..4384013 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
>  	[RLIMIT_NICE] = {"Max nice priority", NULL},
>  	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
>  	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> +	[RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
>  };
>  
>  /* Display limits for a process */
> @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	}
>  
>  	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> -		err = -EACCES;
> -		goto err_sighand;
> +		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
> +		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +
> +		if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {

two points.

1) task->signal->rlim[RLIMIT_OOMADJ].rlim_cur is incorrect.
   please use task_rlimit().

2) If process has CAP_SYS_RESOURCE, we should ignore RLIMIT_OOMADJ for
   backword compatibility. CAP_NICE do so. (see below)

------------------------------------------------------------------
int can_nice(const struct task_struct *p, const int nice)
{
        /* convert nice value [19,-20] to rlimit style value [1,40] */
        int nice_rlim = 20 - nice;

        return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
                capable(CAP_SYS_NICE));
}
------------------------------------------------------------------



> +			unlock_task_sighand(task, &flags);
> +			put_task_struct(task);
> +			err = -EACCES;
> +			goto err_sighand;
> +		}
>  	}
>  
>  	if (oom_adjust != task->signal->oom_adj) {
> diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
> index 587566f..a8640a4 100644
> --- a/include/asm-generic/resource.h
> +++ b/include/asm-generic/resource.h
> @@ -45,7 +45,9 @@
>  					   0-39 for nice level 19 .. -20 */
>  #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
>  #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
> -#define RLIM_NLIMITS		16
> +#define RLIMIT_OOMADJ		16	/* max oom_adj allowed to lower to
> +					   0-32 for oom level 15 .. -17 */
> +#define RLIM_NLIMITS		17
>  
>  /*
>   * SuS says limits have to be unsigned.
> @@ -86,6 +88,7 @@
>  	[RLIMIT_MSGQUEUE]	= {   MQ_BYTES_MAX,   MQ_BYTES_MAX },	\
>  	[RLIMIT_NICE]		= { 0, 0 },				\
>  	[RLIMIT_RTPRIO]		= { 0, 0 },				\
> +	[RLIMIT_OOMADJ]		= { 0, 0 },				\

I don't think 0 is good initial value because 0 mean oom_adj==15.



>  	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
>  }
>  
> -- 
> 1.7.3.1
> 




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 23:19         ` David Rientjes
@ 2010-11-11 23:56           ` Mandeep Singh Baines
  0 siblings, 0 replies; 15+ messages in thread
From: Mandeep Singh Baines @ 2010-11-11 23:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mandeep Singh Baines, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj

David Rientjes (rientjes@google.com) wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:
> 
> > > Hmm, at first glance that seems potentially dangerous if the current tab 
> > > generates a burt of memory allocations and it ends up killing all other 
> > > tabs before finally targeting the culprit whereas currently the heuristic 
> > > should do a good job of finding this problematic tab and killing it 
> > > instantly.
> > > 
> > 
> > If you're watching a movie, video chatting, playing a game, etc. What
> > would you rather have killed: the current tab you are interacting with or
> > some tab you opened a while back and are no longer interacting with.
> > 
> 
> Well, it's a tangential point, but I'd personally prefer that my existing 
> tabs that I've decided to leave open are guaranteed to remain open 
> regardless of where I'm browsing next (they could hold valuable data that 
> I can't easily get back) and avoid having all of them sacrificed out from 
> under me for the newly opened tab.  I can always go back and close those 
> tabs for more memory if I know I don't need them anymore and then retry 
> the failed allocation.
> 
> > > So as more and more tabs get used, the least recently used tab gets its 
> > > oom_score_adj raised higher and higher until it is reused itself and then 
> > > it gets reset back to 0 for the current tab?
> > > 
> > 
> > Exactly.
> > 
> 
> We don't necessarily want arbitrary tasks to be able to decrease their 
> oom_score_adj back to 0 if a CAP_SYS_RESOURCE thread has elevated it, 
> that's part of the reason for the restriction (in addition to decreasing 
> your own oom_score_adj all the way to OOM_SCORE_ADJ_MIN).
> 
> Would it suffice to allow a task to decrease its oom_score_adj back to the 
> highest value that a CAP_SYS_RESOURCE thread set it or its inherited value 
> at fork?  Assuming the thread that has forked it has oom_score_adj of 0, 
> each tab could decrease it back from 0 upon activation unless a 
> CAP_SYS_RESOURCE thread elevated it to something higher.
> 
> To do this, we'd need to save the highest oom_score_adj set by a 
> CAP_SYS_RESOURCE in struct signal_struct.

Sounds good to me. I'll start working on this patch.

Thanks,
Mandeep

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 22:25       ` Mandeep Singh Baines
@ 2010-11-11 23:19         ` David Rientjes
  2010-11-11 23:56           ` Mandeep Singh Baines
  0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2010-11-11 23:19 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj

On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

> > Hmm, at first glance that seems potentially dangerous if the current tab 
> > generates a burt of memory allocations and it ends up killing all other 
> > tabs before finally targeting the culprit whereas currently the heuristic 
> > should do a good job of finding this problematic tab and killing it 
> > instantly.
> > 
> 
> If you're watching a movie, video chatting, playing a game, etc. What
> would you rather have killed: the current tab you are interacting with or
> some tab you opened a while back and are no longer interacting with.
> 

Well, it's a tangential point, but I'd personally prefer that my existing 
tabs that I've decided to leave open are guaranteed to remain open 
regardless of where I'm browsing next (they could hold valuable data that 
I can't easily get back) and avoid having all of them sacrificed out from 
under me for the newly opened tab.  I can always go back and close those 
tabs for more memory if I know I don't need them anymore and then retry 
the failed allocation.

> > So as more and more tabs get used, the least recently used tab gets its 
> > oom_score_adj raised higher and higher until it is reused itself and then 
> > it gets reset back to 0 for the current tab?
> > 
> 
> Exactly.
> 

We don't necessarily want arbitrary tasks to be able to decrease their 
oom_score_adj back to 0 if a CAP_SYS_RESOURCE thread has elevated it, 
that's part of the reason for the restriction (in addition to decreasing 
your own oom_score_adj all the way to OOM_SCORE_ADJ_MIN).

Would it suffice to allow a task to decrease its oom_score_adj back to the 
highest value that a CAP_SYS_RESOURCE thread set it or its inherited value 
at fork?  Assuming the thread that has forked it has oom_score_adj of 0, 
each tab could decrease it back from 0 upon activation unless a 
CAP_SYS_RESOURCE thread elevated it to something higher.

To do this, we'd need to save the highest oom_score_adj set by a 
CAP_SYS_RESOURCE in struct signal_struct.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 20:57     ` David Rientjes
@ 2010-11-11 22:25       ` Mandeep Singh Baines
  2010-11-11 23:19         ` David Rientjes
  0 siblings, 1 reply; 15+ messages in thread
From: Mandeep Singh Baines @ 2010-11-11 22:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mandeep Singh Baines, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj

David Rientjes (rientjes@google.com) wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:
> 
> > > What is the anticipated use case for this?  We know that you want to lower 
> > > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an 
> > > app moves from foreground to background?  I assume it's something like 
> > 
> > The focus here is the web browser's tabs. In our case, each is a process. If
> > OOM is going to kill a process, you'd rather it kill the tab you looked at
> > hours ago instead of the one you're looking at now. So you'd like to have a
> > policy where the LRU tab gets killed first. We'd like to use oom_score_adj
> > as the mechanism to implement an LRU policy like this.
> > 
> 
> Hmm, at first glance that seems potentially dangerous if the current tab 
> generates a burt of memory allocations and it ends up killing all other 
> tabs before finally targeting the culprit whereas currently the heuristic 
> should do a good job of finding this problematic tab and killing it 
> instantly.
> 

If you're watching a movie, video chatting, playing a game, etc. What
would you rather have killed: the current tab you are interacting with or
some tab you opened a while back and are no longer interacting with.

> Perhaps that can't happen and it probably doesn't even matter: 
> oom_score_adj allows users to determine which process to kill regardless 
> of the underlying reason.
> 
> > > What do you anticipate will be writing to oom_score_adj with this patch, 
> > > the app itself?
> > 
> > A process in the browser session will do the adusting. We'd rather not give
> > it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
> > and down within the bounds set by the administrator. Analagous to renice()
> > which we also do using a similar policy.
> > 
> 
> So as more and more tabs get used, the least recently used tab gets its 
> oom_score_adj raised higher and higher until it is reused itself and then 
> it gets reset back to 0 for the current tab?
> 

Exactly.

> Is there a reason you don't want to give the underlying browser session 
> process CAP_SYS_RESOURCE?  Will it not be enforcing resource limits to 

Security. We want to use the least-privilege possible. We really want to
avoid giving special privileges to the browser. You shouldn't need
administrative privileges to run the browser. We'd like for oom_score_adj
to work the same as nice. An unprivileged user can nice up and down as
long as the new setting is within the administratively configured
resource limit: ulimit -e.

> ensure tabs don't deplete all memory when certain sites are opened?  Are 
> you concerned that it may deplete all memory itself (for which case you 
> could raise its own oom_score_adj, which is a proportion of available 
> memory so you can define where that point of depletiong is)?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11 18:30   ` Mandeep Singh Baines
@ 2010-11-11 20:57     ` David Rientjes
  2010-11-11 22:25       ` Mandeep Singh Baines
  0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2010-11-11 20:57 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj

On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

> > What is the anticipated use case for this?  We know that you want to lower 
> > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an 
> > app moves from foreground to background?  I assume it's something like 
> 
> The focus here is the web browser's tabs. In our case, each is a process. If
> OOM is going to kill a process, you'd rather it kill the tab you looked at
> hours ago instead of the one you're looking at now. So you'd like to have a
> policy where the LRU tab gets killed first. We'd like to use oom_score_adj
> as the mechanism to implement an LRU policy like this.
> 

Hmm, at first glance that seems potentially dangerous if the current tab 
generates a burt of memory allocations and it ends up killing all other 
tabs before finally targeting the culprit whereas currently the heuristic 
should do a good job of finding this problematic tab and killing it 
instantly.

Perhaps that can't happen and it probably doesn't even matter: 
oom_score_adj allows users to determine which process to kill regardless 
of the underlying reason.

> > What do you anticipate will be writing to oom_score_adj with this patch, 
> > the app itself?
> 
> A process in the browser session will do the adusting. We'd rather not give
> it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
> and down within the bounds set by the administrator. Analagous to renice()
> which we also do using a similar policy.
> 

So as more and more tabs get used, the least recently used tab gets its 
oom_score_adj raised higher and higher until it is reused itself and then 
it gets reset back to 0 for the current tab?

Is there a reason you don't want to give the underlying browser session 
process CAP_SYS_RESOURCE?  Will it not be enforcing resource limits to 
ensure tabs don't deplete all memory when certain sites are opened?  Are 
you concerned that it may deplete all memory itself (for which case you 
could raise its own oom_score_adj, which is a proportion of available 
memory so you can define where that point of depletiong is)?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11  7:35 ` David Rientjes
@ 2010-11-11 18:30   ` Mandeep Singh Baines
  2010-11-11 20:57     ` David Rientjes
  0 siblings, 1 reply; 15+ messages in thread
From: Mandeep Singh Baines @ 2010-11-11 18:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mandeep Singh Baines, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han, linux-kernel, gspencer,
	piman, wad, olofj

David Rientjes (rientjes@google.com) wrote:
> On Wed, 10 Nov 2010, Mandeep Singh Baines wrote:
> 
> > For ChromiumOS, we'd like to be able to oom_adj a process up/down
> > as its leaves/enters the foreground. Currently, it is not possible
> > to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> > resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> > to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> > without CAP_SYS_RESOURCE as long as the new value is greater
> > than the resource limit.
> > 
> 
> First of all, oom_adj is deprecated and scheduled for removal in a couple 
> of years (see Documentation/feature-removal-schedule.txt) so any work in 
> this area should be targeting oom_score_adj instead.
> 

Ah. Thanks for the pointer.

> What is the anticipated use case for this?  We know that you want to lower 
> oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an 
> app moves from foreground to background?  I assume it's something like 

The focus here is the web browser's tabs. In our case, each is a process. If
OOM is going to kill a process, you'd rather it kill the tab you looked at
hours ago instead of the one you're looking at now. So you'd like to have a
policy where the LRU tab gets killed first. We'd like to use oom_score_adj
as the mechanism to implement an LRU policy like this.

> having an oom_adj of 0 in the background and +15 in the foreground.  If 
> so, does /proc/sys/vm/oom_kill_allocating_task get you most of what you're 
> looking for?
> 

As explained above, oom_kill_allocating_task won't give us what we want.

> I'm wondering if we can avoid yet another resource limit for something 
> like this.
> 
> > Alternative considered:
> > 
> > * a setuid binary
> > * a daemon with CAP_SYS_RESOURCE
> > 
> > Since you don't wan't all processes to be able to reduce their
> > oom_adj, a setuid or daemon implementation would be complex. The
> > alternatives also have much higher overhead.
> > 
> 
> What do you anticipate will be writing to oom_score_adj with this patch, 
> the app itself?
> 

A process in the browser session will do the adusting. We'd rather not give
it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
and down within the bounds set by the administrator. Analagous to renice()
which we also do using a similar policy.

> > Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
> > ---
> >  fs/proc/base.c                 |   12 ++++++++++--
> >  include/asm-generic/resource.h |    5 ++++-
> >  2 files changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index f3d02ca..4384013 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> >  	[RLIMIT_NICE] = {"Max nice priority", NULL},
> >  	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> >  	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> > +	[RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
> 
> s/Max/Min, right?
> 

This is a MAX value because of how resource limits work. On the other hand,
it is really controlling the minimum oom_adj. So its a toss up for me.
More than happy to change if you prefer Min.

> >  };
> >  
> >  /* Display limits for a process */
> > @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
> >  	}
> >  
> >  	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> > -		err = -EACCES;
> > -		goto err_sighand;
> > +		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
> > +		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> > +
> 
> Ouch, that's a rather unfortunate mapping.
> 

Unfortunate but unavoidable. The resource limit code checks to see if the
new limit is greater than the limit. This code was based on the can_nice()
code in sched.c.

> > +		if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> > +			unlock_task_sighand(task, &flags);
> > +			put_task_struct(task);
> > +			err = -EACCES;
> > +			goto err_sighand;
> 
> err_sighand has duplicate unlock_task_sighand() and put_task_struct(); 
> since you're missing the task_unlock(task) here, just using goto 
> err_sighand would suffice.
> 

D'oh. Forward port error. I should be more careful. Thanks for catching:)

> > +		}
> >  	}
> >  
> >  	if (oom_adjust != task->signal->oom_adj) {

Thank you for reviewing this patch.

Should I send an updated oom_score_adj patch?

Regards,
Mandeep

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
  2010-11-11  4:35 Mandeep Singh Baines
@ 2010-11-11  7:35 ` David Rientjes
  2010-11-11 18:30   ` Mandeep Singh Baines
  2010-11-14  5:07 ` KOSAKI Motohiro
  1 sibling, 1 reply; 15+ messages in thread
From: David Rientjes @ 2010-11-11  7:35 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Ying Han, linux-kernel, gspencer, piman, wad, olofj

On Wed, 10 Nov 2010, Mandeep Singh Baines wrote:

> For ChromiumOS, we'd like to be able to oom_adj a process up/down
> as its leaves/enters the foreground. Currently, it is not possible
> to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> without CAP_SYS_RESOURCE as long as the new value is greater
> than the resource limit.
> 

First of all, oom_adj is deprecated and scheduled for removal in a couple 
of years (see Documentation/feature-removal-schedule.txt) so any work in 
this area should be targeting oom_score_adj instead.

What is the anticipated use case for this?  We know that you want to lower 
oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an 
app moves from foreground to background?  I assume it's something like 
having an oom_adj of 0 in the background and +15 in the foreground.  If 
so, does /proc/sys/vm/oom_kill_allocating_task get you most of what you're 
looking for?

I'm wondering if we can avoid yet another resource limit for something 
like this.

> Alternative considered:
> 
> * a setuid binary
> * a daemon with CAP_SYS_RESOURCE
> 
> Since you don't wan't all processes to be able to reduce their
> oom_adj, a setuid or daemon implementation would be complex. The
> alternatives also have much higher overhead.
> 

What do you anticipate will be writing to oom_score_adj with this patch, 
the app itself?

> Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
> ---
>  fs/proc/base.c                 |   12 ++++++++++--
>  include/asm-generic/resource.h |    5 ++++-
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..4384013 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
>  	[RLIMIT_NICE] = {"Max nice priority", NULL},
>  	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
>  	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> +	[RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},

s/Max/Min, right?

>  };
>  
>  /* Display limits for a process */
> @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	}
>  
>  	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> -		err = -EACCES;
> -		goto err_sighand;
> +		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
> +		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +

Ouch, that's a rather unfortunate mapping.

> +		if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> +			unlock_task_sighand(task, &flags);
> +			put_task_struct(task);
> +			err = -EACCES;
> +			goto err_sighand;

err_sighand has duplicate unlock_task_sighand() and put_task_struct(); 
since you're missing the task_unlock(task) here, just using goto 
err_sighand would suffice.

> +		}
>  	}
>  
>  	if (oom_adjust != task->signal->oom_adj) {

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] oom: create a resource limit for oom_adj
@ 2010-11-11  5:19 Figo.zhang
  0 siblings, 0 replies; 15+ messages in thread
From: Figo.zhang @ 2010-11-11  5:19 UTC (permalink / raw)
  To: linux-kernel


>   	if (oom_adjust<  task->signal->oom_adj&&  !capable(CAP_SYS_RESOURCE)) {
> -		err = -EACCES;
> -		goto err_sighand;
> +		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
> +		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +
> +		if (oom_rlim>  task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> +			unlock_task_sighand(task,&flags);
> +			put_task_struct(task);
> +			err = -EACCES;
> +			goto err_sighand;
> +		}
>   	}

=> Label "err_sighand" have do that, why are you do that on here?

+			err = -EACCES;
+			goto err_sighand;
+		}
 	}



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] oom: create a resource limit for oom_adj
@ 2010-11-11  4:35 Mandeep Singh Baines
  2010-11-11  7:35 ` David Rientjes
  2010-11-14  5:07 ` KOSAKI Motohiro
  0 siblings, 2 replies; 15+ messages in thread
From: Mandeep Singh Baines @ 2010-11-11  4:35 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Ying Han
  Cc: linux-kernel, gspencer, piman, wad, olofj

For ChromiumOS, we'd like to be able to oom_adj a process up/down
as its leaves/enters the foreground. Currently, it is not possible
to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
to RLIMIT_NICE. This allows a process's oom_adj to be lowered
without CAP_SYS_RESOURCE as long as the new value is greater
than the resource limit.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
---
 fs/proc/base.c                 |   12 ++++++++++--
 include/asm-generic/resource.h |    5 ++++-
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..4384013 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
 	[RLIMIT_NICE] = {"Max nice priority", NULL},
 	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
 	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
+	[RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
 };
 
 /* Display limits for a process */
@@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	}
 
 	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
+		/* convert oom_adj [15,-17] to rlimit style value [1,33] */
+		long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
+
+		if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
+			unlock_task_sighand(task, &flags);
+			put_task_struct(task);
+			err = -EACCES;
+			goto err_sighand;
+		}
 	}
 
 	if (oom_adjust != task->signal->oom_adj) {
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 587566f..a8640a4 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -45,7 +45,9 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+#define RLIMIT_OOMADJ		16	/* max oom_adj allowed to lower to
+					   0-32 for oom level 15 .. -17 */
+#define RLIM_NLIMITS		17
 
 /*
  * SuS says limits have to be unsigned.
@@ -86,6 +88,7 @@
 	[RLIMIT_MSGQUEUE]	= {   MQ_BYTES_MAX,   MQ_BYTES_MAX },	\
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
+	[RLIMIT_OOMADJ]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
 }
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2010-11-23  7:17 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <fNx73-1cI-1@gated-at.bofh.it>
     [not found] ` <fNzVf-5UY-3@gated-at.bofh.it>
     [not found]   ` <fNKdY-6FU-11@gated-at.bofh.it>
     [not found]     ` <fNMps-1S1-21@gated-at.bofh.it>
2010-11-11 23:15       ` [PATCH] oom: create a resource limit for oom_adj Bodo Eggert
2010-11-11 23:21         ` David Rientjes
2010-11-14  5:07         ` KOSAKI Motohiro
2010-11-14 21:42           ` David Rientjes
2010-11-23  7:16             ` KOSAKI Motohiro
     [not found]       ` <fNNOx-4qf-1@gated-at.bofh.it>
     [not found]         ` <fNOAW-5Oi-35@gated-at.bofh.it>
     [not found]           ` <fNPdF-6Hu-33@gated-at.bofh.it>
     [not found]             ` <fOctz-4o0-9@gated-at.bofh.it>
2010-11-14  0:05               ` [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down Bodo Eggert
2010-11-11  5:19 [PATCH] oom: create a resource limit for oom_adj Figo.zhang
  -- strict thread matches above, loose matches on Subject: below --
2010-11-11  4:35 Mandeep Singh Baines
2010-11-11  7:35 ` David Rientjes
2010-11-11 18:30   ` Mandeep Singh Baines
2010-11-11 20:57     ` David Rientjes
2010-11-11 22:25       ` Mandeep Singh Baines
2010-11-11 23:19         ` David Rientjes
2010-11-11 23:56           ` Mandeep Singh Baines
2010-11-14  5:07 ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).