Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
@ 2010-01-29 16:11 KAMEZAWA Hiroyuki
  2010-01-29 16:21 ` Alan Cox
  0 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-29 16:11 UTC (permalink / raw)
  To: Alan Cox
  Cc: vedran.furac, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm,
	rientjes, minchan.kim, linux-kernel, balbir

Alan Cox wrote:
>> off by default. Problem is that it breaks java and some other stuff that
>> allocates much more memory than it needs. Very quickly Committed_AS hits
>> CommitLimit and one cannot allocate any more while there is plenty of
>> memory still unused.
>
> So how about you go and have a complain at the people who are causing
> your problem, rather than the kernel.
>
Alan, please allow me to talk about my concern.

At first, I think all OOM-killer are bad and there are no chance
to implement innocent, good OOM-Killer. The best way we can do is
"never cause OOM-Kill". But we're human being, OOM-Killer can happen
by _mistake_....

For example, a customer runs 1000+ process of Oracle without using
HugeTLB and the total size of page table goes up to 10GByes. Hahaha.
(Of course, We asked him  to use Hugetlb ;) We can't ask him to
 use overcommit memory if much proprietaty applications runs on it.)

So, I believe there is a cirtial situation OOM-Killer has to run even
if it's bad. Even in corner case.
Now, in OOM situaion, sshd or X-server or some task launcher is killed at
first if oom_adj is not tweaked. IIUC, OOM-Killer is for giving a chance
to administrator to recover his system, safe reboot. But if sshd/X is
kiiled, this is no help.

My first purpose was to prevent killing some daemons or task launchers.
The first patch was nacked ;).

On that way, I tried to add lowmem counting because it was also
my concern. This was nacked ;(

I stop this because of my personal reason. For my enviroment,
panic_on_oom=1 works enough well.For Vedran's, overcommit memory will work
well. But oom-killer kills very bad process if not tweaked.
So, I think some improvement should be done.

And we have memcg even if it's called as ugly workaround.
Sorry for all the noise.

Bye,
-Kame

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 16:11 [PATCH v3] oom-kill: add lowmem usage aware oom kill handling KAMEZAWA Hiroyuki
@ 2010-01-29 16:21 ` Alan Cox
  2010-01-29 16:25   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 35+ messages in thread
From: Alan Cox @ 2010-01-29 16:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: vedran.furac, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm,
	rientjes, minchan.kim, linux-kernel, balbir

> panic_on_oom=1 works enough well.For Vedran's, overcommit memory will work
> well. But oom-killer kills very bad process if not tweaked.
> So, I think some improvement should be done.

That is why we have the per process oom_adj values - because for nearly
fifteen years someone comes along and says "actually in my environment
the right choice is ..."

Ultimately it is policy. The kernel simply can't read minds.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 16:21 ` Alan Cox
@ 2010-01-29 16:25   ` KAMEZAWA Hiroyuki
  2010-01-29 16:30     ` Alan Cox
  2010-01-29 21:11     ` David Rientjes
  0 siblings, 2 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-29 16:25 UTC (permalink / raw)
  To: Alan Cox
  Cc: KAMEZAWA Hiroyuki, vedran.furac, Andrew Morton, linux-mm,
	rientjes, minchan.kim, linux-kernel, balbir

Alan Cox wrote:
>> panic_on_oom=1 works enough well.For Vedran's, overcommit memory will
>> work
>> well. But oom-killer kills very bad process if not tweaked.
>> So, I think some improvement should be done.
>
> That is why we have the per process oom_adj values - because for nearly
> fifteen years someone comes along and says "actually in my environment
> the right choice is ..."
>
> Ultimately it is policy. The kernel simply can't read minds.
>
If so, all heuristics other than vm_size should be purged, I think.
...Or victim should be just determined by the class of application
user sets. oom_adj other than OOM_DISABLE, searching victim process
by black magic are all garbage.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 16:25   ` KAMEZAWA Hiroyuki
@ 2010-01-29 16:30     ` Alan Cox
  2010-01-29 16:41       ` KAMEZAWA Hiroyuki
  2010-01-29 21:11     ` David Rientjes
  1 sibling, 1 reply; 35+ messages in thread
From: Alan Cox @ 2010-01-29 16:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: vedran.furac, Andrew Morton, linux-mm, rientjes, minchan.kim,
	linux-kernel, balbir

> > Ultimately it is policy. The kernel simply can't read minds.
> >
> If so, all heuristics other than vm_size should be purged, I think.
> ...Or victim should be just determined by the class of application
> user sets. oom_adj other than OOM_DISABLE, searching victim process
> by black magic are all garbage.

oom_adj by value makes sense as do some of the basic heuristics - but a
lot of the complexity I would agree is completely nonsensical.

There are folks who use oom_adj weightings to influence things (notably
embedded and desktop). The embedded world would actually benefit on the
whole if the oom_adj was an absolute value because they usually know
precisely what they want to die and in what order.

Alan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 16:30     ` Alan Cox
@ 2010-01-29 16:41       ` KAMEZAWA Hiroyuki
  2010-01-29 21:07         ` David Rientjes
  0 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-29 16:41 UTC (permalink / raw)
  To: Alan Cox
  Cc: KAMEZAWA Hiroyuki, vedran.furac, Andrew Morton, linux-mm,
	rientjes, minchan.kim, linux-kernel, balbir

Alan Cox wrote:
>> > Ultimately it is policy. The kernel simply can't read minds.
>> >
>> If so, all heuristics other than vm_size should be purged, I think.
>> ...Or victim should be just determined by the class of application
>> user sets. oom_adj other than OOM_DISABLE, searching victim process
>> by black magic are all garbage.
>
> oom_adj by value makes sense as do some of the basic heuristics - but a
> lot of the complexity I would agree is completely nonsensical.
>
> There are folks who use oom_adj weightings to influence things (notably
> embedded and desktop). The embedded world would actually benefit on the
> whole if the oom_adj was an absolute value because they usually know
> precisely what they want to die and in what order.
>
okay...I guess the cause of the problem Vedran met came from
this calculation.
==
 109         /*
 110          * Processes which fork a lot of child processes are likely
 111          * a good choice. We add half the vmsize of the children if they
 112          * have an own mm. This prevents forking servers to flood the
 113          * machine with an endless amount of children. In case a single
 114          * child is eating the vast majority of memory, adding only half
 115          * to the parents will make the child our kill candidate of
choice.
 116          */
 117         list_for_each_entry(child, &p->children, sibling) {
 118                 task_lock(child);
 119                 if (child->mm != mm && child->mm)
 120                         points += child->mm->total_vm/2 + 1;
 121                 task_unlock(child);
 122         }
 123
==
This makes task launcher(the fist child of some daemon.) first victim.
And...I wonder this is not good for oom_adj,
I think it's set per task with regard to personal memory usage.

But I'm not sure why this code is used now. Does anyone remember
history or the benefit of this calculation ?

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 16:41       ` KAMEZAWA Hiroyuki
@ 2010-01-29 21:07         ` David Rientjes
  2010-01-30 12:46           ` Vedran Furač
  2010-02-01  0:01           ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 35+ messages in thread
From: David Rientjes @ 2010-01-29 21:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Alan Cox, vedran.furac, Andrew Morton, linux-mm, minchan.kim,
	linux-kernel, balbir

On Sat, 30 Jan 2010, KAMEZAWA Hiroyuki wrote:

> okay...I guess the cause of the problem Vedran met came from
> this calculation.
> ==
>  109         /*
>  110          * Processes which fork a lot of child processes are likely
>  111          * a good choice. We add half the vmsize of the children if they
>  112          * have an own mm. This prevents forking servers to flood the
>  113          * machine with an endless amount of children. In case a single
>  114          * child is eating the vast majority of memory, adding only half
>  115          * to the parents will make the child our kill candidate of
> choice.
>  116          */
>  117         list_for_each_entry(child, &p->children, sibling) {
>  118                 task_lock(child);
>  119                 if (child->mm != mm && child->mm)
>  120                         points += child->mm->total_vm/2 + 1;
>  121                 task_unlock(child);
>  122         }
>  123
> ==
> This makes task launcher(the fist child of some daemon.) first victim.

That "victim", p, is passed to oom_kill_process() which does this:

	/* Try to kill a child first */
	list_for_each_entry(c, &p->children, sibling) {
		if (c->mm == p->mm)
			continue;
		if (!oom_kill_task(c))
			return 0;
	}
	return oom_kill_task(p);

which prevents your example of the task launcher from getting killed 
unless it itself is using such an egregious amount of memory that its VM 
size has caused the heuristic to select the daemon in the first place.  
We only look at a single level of children, and attempt to kill one of 
those children not sharing memory with the selected task first, so your 
example is exaggerated for dramatic value.

The oom killer has been doing this for years and I haven't noticed a huge 
surge in complaints about it killing X specifically because of that code 
in oom_kill_process().

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 16:25   ` KAMEZAWA Hiroyuki
  2010-01-29 16:30     ` Alan Cox
@ 2010-01-29 21:11     ` David Rientjes
  1 sibling, 0 replies; 35+ messages in thread
From: David Rientjes @ 2010-01-29 21:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Alan Cox, vedran.furac, Andrew Morton, linux-mm, minchan.kim,
	linux-kernel, balbir

On Sat, 30 Jan 2010, KAMEZAWA Hiroyuki wrote:

> If so, all heuristics other than vm_size should be purged, I think.

I don't recall anybody disagreeing about removing some of the current 
heuristics, but there is value to those beyond simply total_vm: we want to 
penalize tasks that do not share any mems_allowed with the triggering 
task, for example, otherwise it can lead to needless oom killing.  Many 
people believe we should keep the slight penalty for superuser tasks over 
regular user tasks, as well.

Auditing the badness() function is a worthwhile endeavor and I think you'd 
be most successful if you tweaked the various penalties (runtime, nice, 
capabilities, etc) to reflect how much each is valued in terms of VM size, 
the baseline.  I doubt anybody would defend simply dividing by 4 or 
multiplying by 2 being scientific.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 21:07         ` David Rientjes
@ 2010-01-30 12:46           ` Vedran Furač
  2010-01-30 22:53             ` David Rientjes
  2010-02-01  0:01           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 35+ messages in thread
From: Vedran Furač @ 2010-01-30 12:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Alan Cox, Andrew Morton, linux-mm,
	minchan.kim, linux-kernel, balbir

[-- Attachment #1: Type: text/plain, Size: 769 bytes --]

David Rientjes wrote:

> The oom killer has been doing this for years and I haven't noticed a huge 
> surge in complaints about it killing X specifically because of that code 
> in oom_kill_process().

Well you said it yourself, you won't see a surge because "oom killer has
been doing this *for years*". So you'll have a more/less constant number
of complains over the years. Just google for: linux, random, kill, memory;

What provoked me to start this discussions is that every few months on
our croatian linux newsgroup someone starts asking why is linux randomly
killing his processes. And at the end of discussion a few, mostly
aix/solaris sysadmins, conclude that linux is still a toy.

Regards,
Vedran

-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

[-- Attachment #2: vedran_furac.vcf --]
[-- Type: text/x-vcard, Size: 220 bytes --]

begin:vcard
fn;quoted-printable:Vedran Fura=C4=8D
n;quoted-printable:Fura=C4=8D;Vedran
adr:;;;;;;Croatia
email;internet:vedran.furac@gmail.com
x-mozilla-html:FALSE
url:http://vedranf.net
version:2.1
end:vcard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-30 12:46           ` Vedran Furač
@ 2010-01-30 22:53             ` David Rientjes
  2010-01-31 20:29               ` Vedran Furač
  0 siblings, 1 reply; 35+ messages in thread
From: David Rientjes @ 2010-01-30 22:53 UTC (permalink / raw)
  To: vedran.furac
  Cc: KAMEZAWA Hiroyuki, Alan Cox, Andrew Morton, minchan.kim,
	Balbir Singh, linux-kernel, linux-mm

On Sat, 30 Jan 2010, Vedran Furac wrote:

> > The oom killer has been doing this for years and I haven't noticed a huge 
> > surge in complaints about it killing X specifically because of that code 
> > in oom_kill_process().
> 
> Well you said it yourself, you won't see a surge because "oom killer has
> been doing this *for years*". So you'll have a more/less constant number
> of complains over the years. Just google for: linux, random, kill, memory;
> 

You snipped the code segment where I demonstrated that the selected task 
for oom kill is not necessarily the one chosen to die: if there is a child 
with disjoint memory that is killable, it will be selected instead.  If 
Xorg or sshd is being chosen for kill, then you should investigate why 
that is, but there is nothing random about how the oom killer chooses 
tasks to kill.

The facts that you're completely ignoring are that changing the heuristic 
baseline to rss is not going to prevent Xorg or sshd from being selected 
(in fact, I even showed that it makes Xorg _more_ preferrable when I 
reviewed the patch), and you have complete power of disabling oom killing 
for selected tasks and that trait is inheritable to children.

I agree that we can do a better job than needlessly killing innocent tasks 
when we have a lowmem oom.  I suggested killing current in such a scenario 
since ZONE_DMA memory was not reclaimable (and, soon, not migratable) and 
all memory is pinned for such purposes.  However, saying we need to change 
the baseline for that particular case and completely misinterpret the 
oom_adj values for all system-wide tasks is simply not an option.  And 
when that point is raised, it doesn't help for people to take their ball 
and go home if their motivation is to improve the oom killer.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-30 22:53             ` David Rientjes
@ 2010-01-31 20:29               ` Vedran Furač
  2010-02-01 10:33                 ` David Rientjes
  0 siblings, 1 reply; 35+ messages in thread
From: Vedran Furač @ 2010-01-31 20:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Alan Cox, Andrew Morton, minchan.kim,
	Balbir Singh, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1709 bytes --]

David Rientjes wrote:

> On Sat, 30 Jan 2010, Vedran Furac wrote:
> 
>>> The oom killer has been doing this for years and I haven't noticed a huge 
>>> surge in complaints about it killing X specifically because of that code 
>>> in oom_kill_process().
>> Well you said it yourself, you won't see a surge because "oom killer has
>> been doing this *for years*". So you'll have a more/less constant number
>> of complains over the years. Just google for: linux, random, kill, memory;
> 
> You snipped the code segment where I demonstrated that the selected task 
> for oom kill is not necessarily the one chosen to die: if there is a child 
> with disjoint memory that is killable, it will be selected instead.  If 
> Xorg or sshd is being chosen for kill, then you should investigate why 
> that is, but there is nothing random about how the oom killer chooses 
> tasks to kill.

I know that it isn't random, but it sure looks like that to the end user
and I use it to emphasize the problem. And about me investigating, that
simply not possible as I am not a kernel hacker who understands the code
beyond the syntax level. I can only point to the problem in hope that
someone will fix it.

> The facts that you're completely ignoring are that changing the heuristic 
> baseline to rss is not going to prevent Xorg or sshd from being selected 

In my tests a simple "ps -eo rss,command --sort rss" always showed the
cuprit, but OK, find another approach in fixing the problem in hope for
a positive review. Just... I feel everything will be put under the
carpet with fingers in ears while singing everything is fine. Prove me
wrong.

Regards,
Vedran


-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

[-- Attachment #2: vedran_furac.vcf --]
[-- Type: text/x-vcard, Size: 220 bytes --]

begin:vcard
fn;quoted-printable:Vedran Fura=C4=8D
n;quoted-printable:Fura=C4=8D;Vedran
adr:;;;;;;Croatia
email;internet:vedran.furac@gmail.com
x-mozilla-html:FALSE
url:http://vedranf.net
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 21:07         ` David Rientjes
  2010-01-30 12:46           ` Vedran Furač
@ 2010-02-01  0:01           ` KAMEZAWA Hiroyuki
  2010-02-01 10:28             ` David Rientjes
  1 sibling, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-01  0:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alan Cox, vedran.furac, Andrew Morton, linux-mm, minchan.kim,
	linux-kernel, balbir

On Fri, 29 Jan 2010 13:07:01 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Sat, 30 Jan 2010, KAMEZAWA Hiroyuki wrote:
> 
> > okay...I guess the cause of the problem Vedran met came from
> > this calculation.
> > ==
> >  109         /*
> >  110          * Processes which fork a lot of child processes are likely
> >  111          * a good choice. We add half the vmsize of the children if they
> >  112          * have an own mm. This prevents forking servers to flood the
> >  113          * machine with an endless amount of children. In case a single
> >  114          * child is eating the vast majority of memory, adding only half
> >  115          * to the parents will make the child our kill candidate of
> > choice.
> >  116          */
> >  117         list_for_each_entry(child, &p->children, sibling) {
> >  118                 task_lock(child);
> >  119                 if (child->mm != mm && child->mm)
> >  120                         points += child->mm->total_vm/2 + 1;
> >  121                 task_unlock(child);
> >  122         }
> >  123
> > ==
> > This makes task launcher(the fist child of some daemon.) first victim.
> 
> That "victim", p, is passed to oom_kill_process() which does this:
> 
> 	/* Try to kill a child first */
> 	list_for_each_entry(c, &p->children, sibling) {
> 		if (c->mm == p->mm)
> 			continue;
> 		if (!oom_kill_task(c))
> 			return 0;
> 	}
> 	return oom_kill_task(p);
> 

Then, finally, per-process oom_adj(!=OOM_DISABLE) control is ignored ?
Seems broken.

I think all this children-parent logic is bad.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-02-01  0:01           ` KAMEZAWA Hiroyuki
@ 2010-02-01 10:28             ` David Rientjes
  0 siblings, 0 replies; 35+ messages in thread
From: David Rientjes @ 2010-02-01 10:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Alan Cox, vedran.furac, Andrew Morton, minchan.kim, Balbir Singh,
	linux-kernel, linux-mm

On Mon, 1 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > >  109         /*
> > >  110          * Processes which fork a lot of child processes are likely
> > >  111          * a good choice. We add half the vmsize of the children if they
> > >  112          * have an own mm. This prevents forking servers to flood the
> > >  113          * machine with an endless amount of children. In case a single
> > >  114          * child is eating the vast majority of memory, adding only half
> > >  115          * to the parents will make the child our kill candidate of
> > > choice.
> > >  116          */
> > >  117         list_for_each_entry(child, &p->children, sibling) {
> > >  118                 task_lock(child);
> > >  119                 if (child->mm != mm && child->mm)
> > >  120                         points += child->mm->total_vm/2 + 1;
> > >  121                 task_unlock(child);
> > >  122         }
> > >  123
> > > ==
> > > This makes task launcher(the fist child of some daemon.) first victim.
> > 
> > That "victim", p, is passed to oom_kill_process() which does this:
> > 
> > 	/* Try to kill a child first */
> > 	list_for_each_entry(c, &p->children, sibling) {
> > 		if (c->mm == p->mm)
> > 			continue;
> > 		if (!oom_kill_task(c))
> > 			return 0;
> > 	}
> > 	return oom_kill_task(p);
> > 
> 
> Then, finally, per-process oom_adj(!=OOM_DISABLE) control is ignored ?
> Seems broken.
> 

No, oom_kill_task() returns 1 if the child has OOM_DISABLE set, meaning it 
never gets killed and we continue iterating through the child list.  If 
there are no children with seperate memory to kill, the selected task gets 
killed.  This prevents things from like sshd or bash from getting killed 
unless they are actually the memory leaker themselves.

It would naturally be better to select the child with the highest 
badness() score, but it only depends on the ordering of p->children at the 
moment.  That's because we only want to iterate through this potentially 
long list once, but improvements in this area (as well as sane tweaks to 
the heuristic) would certainly be welcome.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-31 20:29               ` Vedran Furač
@ 2010-02-01 10:33                 ` David Rientjes
  0 siblings, 0 replies; 35+ messages in thread
From: David Rientjes @ 2010-02-01 10:33 UTC (permalink / raw)
  To: Vedran Furac
  Cc: KAMEZAWA Hiroyuki, Alan Cox, Andrew Morton, minchan.kim,
	Balbir Singh, linux-kernel, linux-mm

On Sun, 31 Jan 2010, Vedran Furac wrote:

> > You snipped the code segment where I demonstrated that the selected task 
> > for oom kill is not necessarily the one chosen to die: if there is a child 
> > with disjoint memory that is killable, it will be selected instead.  If 
> > Xorg or sshd is being chosen for kill, then you should investigate why 
> > that is, but there is nothing random about how the oom killer chooses 
> > tasks to kill.
> 
> I know that it isn't random, but it sure looks like that to the end user
> and I use it to emphasize the problem. And about me investigating, that
> simply not possible as I am not a kernel hacker who understands the code
> beyond the syntax level. I can only point to the problem in hope that
> someone will fix it.
> 

Disregarding the opportunity that userspace has to influence the oom 
killer's selection for a moment, it really tends to favor killing tasks 
that are the largest in size.  Tasks that typically get the highest 
badness score are those that have the highest mm->total_vm, it's that 
simple.  There are definitely cornercases where the first generation 
children have a strong influence, but they are often killed either as a 
result of themselves being a thread group leader with seperate memory from 
the parent or as the result of the oom killer killing a task with seperate 
memory before the selected task.  It's completely natural for the oom 
killer to select bash, for example, when in actuality it will kill a 
memory leaker that has a high badness score as a result of the logic in 
oom_kill_process().

If you have specific logs that you'd like to show, please enable 
/proc/sys/vm/oom_dump_tasks and respond with them in another message with 
that data inline.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-30 17:45                             ` Alan Cox
@ 2010-01-30 18:17                               ` Vedran Furač
  0 siblings, 0 replies; 35+ messages in thread
From: Vedran Furač @ 2010-01-30 18:17 UTC (permalink / raw)
  To: Alan Cox
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

[-- Attachment #1: Type: text/plain, Size: 544 bytes --]

Alan Cox wrote:

>> http://vedranf.net/tmp/oom.ogv  (you can watch it using VLC for example)
>>
>> Actually anyone receiving this mail should see it. What do you think,
>> what will customers rather choose if they see this?
> 
> Address that to the distributions. Their customers. Systems I set up for
> people always have no overcommit enabled.

And distros say that's a bug in kernel. Result: nothing gets done and
users will continue to swear at linux after it kills their work...


-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

[-- Attachment #2: vedran_furac.vcf --]
[-- Type: text/x-vcard, Size: 220 bytes --]

begin:vcard
fn;quoted-printable:Vedran Fura=C4=8D
n;quoted-printable:Fura=C4=8D;Vedran
adr:;;;;;;Croatia
email;internet:vedran.furac@gmail.com
x-mozilla-html:FALSE
url:http://vedranf.net
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-30 17:30                           ` Vedran Furač
@ 2010-01-30 17:45                             ` Alan Cox
  2010-01-30 18:17                               ` Vedran Furač
  0 siblings, 1 reply; 35+ messages in thread
From: Alan Cox @ 2010-01-30 17:45 UTC (permalink / raw)
  To: vedran.furac
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

> I can't say I'm able to correctly read kernel code, but I believe
> default is set by:

It['s set by the distribution - the kerne has a default value but given
the same distributions are shipping the crap userspace that breaks with
no-overcommit do you think they'll set it to break their user apps ?

No.

> http://vedranf.net/tmp/oom.ogv  (you can watch it using VLC for example)
> 
> Actually anyone receiving this mail should see it. What do you think,
> what will customers rather choose if they see this?

Address that to the distributions. Their customers. Systems I set up for
people always have no overcommit enabled.

Alan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-30 12:59                         ` Alan Cox
@ 2010-01-30 17:30                           ` Vedran Furač
  2010-01-30 17:45                             ` Alan Cox
  0 siblings, 1 reply; 35+ messages in thread
From: Vedran Furač @ 2010-01-30 17:30 UTC (permalink / raw)
  To: Alan Cox
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

[-- Attachment #1: Type: text/plain, Size: 1560 bytes --]

Alan Cox wrote:

>>> So how about you go and have a complain at the people who are causing
>>> your problem, rather than the kernel.
>> That would pass completely unnoticed and ignored as long as overcommit
>> is enabled by default.
> 
> Defaults are set by the distributions. So you are still complaining to
> the wrong people.

I can't say I'm able to correctly read kernel code, but I believe
default is set by:

int sysctl_overcommit_memory = OVERCOMMIT_GUESS;  /* heuristic overcommit */
int sysctl_overcommit_ratio = 50;       /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;

in mmap.c.

>> So, if you don't want to change the OOM algorithm why not fixing this
>> bug then? And after that change the proc(5) manpage entry for
>> /proc/sys/vm/overcommit_memory into something like:
>>
>> 0: heuristic overcommit (enable this if you have memory problems with
>>                           some buggy software)
>> 1: always overcommit, never check
>> 2: always check, never overcommit (this is the default)
> 
> Because there are a lot of systems where heuristic overcommit makes
> sense ?

Ok, I won't argue any more. Just please watch this short (~1min)
screencast I made and tell me which behavior is good and which is bad
and should be fixed:

http://vedranf.net/tmp/oom.ogv  (you can watch it using VLC for example)

Actually anyone receiving this mail should see it. What do you think,
what will customers rather choose if they see this?

Regards,
Vedran


-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

[-- Attachment #2: vedran_furac.vcf --]
[-- Type: text/x-vcard, Size: 220 bytes --]

begin:vcard
fn;quoted-printable:Vedran Fura=C4=8D
n;quoted-printable:Fura=C4=8D;Vedran
adr:;;;;;;Croatia
email;internet:vedran.furac@gmail.com
x-mozilla-html:FALSE
url:http://vedranf.net
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-30 12:33                       ` Vedran Furač
@ 2010-01-30 12:59                         ` Alan Cox
  2010-01-30 17:30                           ` Vedran Furač
  0 siblings, 1 reply; 35+ messages in thread
From: Alan Cox @ 2010-01-30 12:59 UTC (permalink / raw)
  To: vedran.furac
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

> > So how about you go and have a complain at the people who are causing
> > your problem, rather than the kernel.
> 
> That would pass completely unnoticed and ignored as long as overcommit
> is enabled by default.

Defaults are set by the distributions. So you are still complaining to
the wrong people.

> So, if you don't want to change the OOM algorithm why not fixing this
> bug then? And after that change the proc(5) manpage entry for
> /proc/sys/vm/overcommit_memory into something like:
> 
> 0: heuristic overcommit (enable this if you have memory problems with
>                           some buggy software)
> 1: always overcommit, never check
> 2: always check, never overcommit (this is the default)

Because there are a lot of systems where heuristic overcommit makes
sense ?

Alan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29 11:03                     ` Alan Cox
@ 2010-01-30 12:33                       ` Vedran Furač
  2010-01-30 12:59                         ` Alan Cox
  0 siblings, 1 reply; 35+ messages in thread
From: Vedran Furač @ 2010-01-30 12:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

[-- Attachment #1: Type: text/plain, Size: 2557 bytes --]

Alan Cox wrote:

>> off by default. Problem is that it breaks java and some other stuff that
>> allocates much more memory than it needs. Very quickly Committed_AS hits
>> CommitLimit and one cannot allocate any more while there is plenty of
>> memory still unused.
> 
> So how about you go and have a complain at the people who are causing
> your problem, rather than the kernel.

That would pass completely unnoticed and ignored as long as overcommit
is enabled by default.

>>> theoretical limit, but you generally need more swap (it's one of the
>>> reasons why things like BSD historically have a '3 * memory' rule).
>> Say I have 8GB of memory and there's always some free, why would I need
>> swap?
> 
> So that all the applications that allocate tons of address space and
> don't use it can swap when you hit that corner case, and as a result you
> don't need to go OOM. You should only get an OOM when you run out of
> memory + swap.

Yes, but unfortunately using swap makes machine crawl with huge disk IO
every time you access some application you haven't been using for a few
hours. So recently more and more people are disabling it completely with
positive experience.

>>> So sounds to me like a problem between the keyboard and screen (coupled
>> Unfortunately it is not. Give me ssh access to your computer (leave
>> overcommit on) and I'll kill your X with anything running on it.
> 
> If you have overcommit on then you can cause stuff to get killed. Thats
> what the option enables.

s/stuff/wrong stuff/

> It's really very simple: overcommit off you must have enough RAM and swap
> to hold all allocations requested. Overcommit on - you don't need this
> but if you do use more than is available on the system something has to
> go.
> 
> It's kind of like banking  overcommit off is proper banking, overcommit
> on is modern western banking.

Hehe, yes and you know the consequences.

If you look at malloc(3) you would see this:

"This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available.  This is a really bad bug."

So, if you don't want to change the OOM algorithm why not fixing this
bug then? And after that change the proc(5) manpage entry for
/proc/sys/vm/overcommit_memory into something like:

0: heuristic overcommit (enable this if you have memory problems with
                          some buggy software)
1: always overcommit, never check
2: always check, never overcommit (this is the default)

Regards,
Vedran


-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

[-- Attachment #2: vedran_furac.vcf --]
[-- Type: text/x-vcard, Size: 220 bytes --]

begin:vcard
fn;quoted-printable:Vedran Fura=C4=8D
n;quoted-printable:Fura=C4=8D;Vedran
adr:;;;;;;Croatia
email;internet:vedran.furac@gmail.com
x-mozilla-html:FALSE
url:http://vedranf.net
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29  0:57                   ` Vedran Furač
@ 2010-01-29 11:03                     ` Alan Cox
  2010-01-30 12:33                       ` Vedran Furač
  0 siblings, 1 reply; 35+ messages in thread
From: Alan Cox @ 2010-01-29 11:03 UTC (permalink / raw)
  To: vedran.furac
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

> off by default. Problem is that it breaks java and some other stuff that
> allocates much more memory than it needs. Very quickly Committed_AS hits
> CommitLimit and one cannot allocate any more while there is plenty of
> memory still unused.

So how about you go and have a complain at the people who are causing
your problem, rather than the kernel.

> > theoretical limit, but you generally need more swap (it's one of the
> > reasons why things like BSD historically have a '3 * memory' rule).
> 
> Say I have 8GB of memory and there's always some free, why would I need
> swap?

So that all the applications that allocate tons of address space and
don't use it can swap when you hit that corner case, and as a result you
don't need to go OOM. You should only get an OOM when you run out of
memory + swap.

> > So sounds to me like a problem between the keyboard and screen (coupled
> 
> Unfortunately it is not. Give me ssh access to your computer (leave
> overcommit on) and I'll kill your X with anything running on it.

If you have overcommit on then you can cause stuff to get killed. Thats
what the option enables.

It's really very simple: overcommit off you must have enough RAM and swap
to hold all allocations requested. Overcommit on - you don't need this
but if you do use more than is available on the system something has to
go.

It's kind of like banking  overcommit off is proper banking, overcommit
on is modern western banking.

Alan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29  0:35                 ` Alan Cox
@ 2010-01-29  0:57                   ` Vedran Furač
  2010-01-29 11:03                     ` Alan Cox
  0 siblings, 1 reply; 35+ messages in thread
From: Vedran Furač @ 2010-01-29  0:57 UTC (permalink / raw)
  To: Alan Cox
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

[-- Attachment #1: Type: text/plain, Size: 1305 bytes --]

Alan Cox wrote:

> On Fri, 29 Jan 2010 01:25:18 +0100
> Vedran Furač <vedran.furac@gmail.com> wrote:
> 
>> Alan Cox wrote:
>>
>>> Am I missing something fundamental here ?
>> Yes, the fact linux mm currently sucks. How else would you explain
>> possibility of killing random (often root owned) processes using a 5
>> lines program started by an ordinary user? 
> 
> If you don't want to run with overcommit you turn it off. At that point
> processes get memory allocations refused if they can overrun the

I've started this discussion with question why overcommit isn't turned
off by default. Problem is that it breaks java and some other stuff that
allocates much more memory than it needs. Very quickly Committed_AS hits
CommitLimit and one cannot allocate any more while there is plenty of
memory still unused.

> theoretical limit, but you generally need more swap (it's one of the
> reasons why things like BSD historically have a '3 * memory' rule).

Say I have 8GB of memory and there's always some free, why would I need
swap?

> So sounds to me like a problem between the keyboard and screen (coupled

Unfortunately it is not. Give me ssh access to your computer (leave
overcommit on) and I'll kill your X with anything running on it.


-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

[-- Attachment #2: vedran_furac.vcf --]
[-- Type: text/x-vcard, Size: 220 bytes --]

begin:vcard
fn;quoted-printable:Vedran Fura=C4=8D
n;quoted-printable:Fura=C4=8D;Vedran
adr:;;;;;;Croatia
email;internet:vedran.furac@gmail.com
x-mozilla-html:FALSE
url:http://vedranf.net
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-29  0:25               ` Vedran Furač
@ 2010-01-29  0:35                 ` Alan Cox
  2010-01-29  0:57                   ` Vedran Furač
  0 siblings, 1 reply; 35+ messages in thread
From: Alan Cox @ 2010-01-29  0:35 UTC (permalink / raw)
  To: vedran.furac
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

On Fri, 29 Jan 2010 01:25:18 +0100
Vedran Furač <vedran.furac@gmail.com> wrote:

> Alan Cox wrote:
> 
> > Am I missing something fundamental here ?
> 
> Yes, the fact linux mm currently sucks. How else would you explain
> possibility of killing random (often root owned) processes using a 5
> lines program started by an ordinary user? 

If you don't want to run with overcommit you turn it off. At that point
processes get memory allocations refused if they can overrun the
theoretical limit, but you generally need more swap (it's one of the
reasons why things like BSD historically have a '3 * memory' rule).

So sounds to me like a problem between the keyboard and screen (coupled
with the fact far too few desktop vendors include tools to easily set
this stuff up)

Alan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-28  0:16             ` Alan Cox
  2010-01-28  0:26               ` KAMEZAWA Hiroyuki
  2010-01-28  0:59               ` David Rientjes
@ 2010-01-29  0:25               ` Vedran Furač
  2010-01-29  0:35                 ` Alan Cox
  2 siblings, 1 reply; 35+ messages in thread
From: Vedran Furač @ 2010-01-29  0:25 UTC (permalink / raw)
  To: Alan Cox
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, rientjes,
	minchan.kim, linux-kernel, balbir

Alan Cox wrote:

> Am I missing something fundamental here ?

Yes, the fact linux mm currently sucks. How else would you explain
possibility of killing random (often root owned) processes using a 5
lines program started by an ordinary user? Killed process could be an
apache web server or X server on a desktop. I demonstrated this flaw few
months ago here and only Kame tried to find a way to fix it but
encountered noncooperation.

I don't know what to say, really. Sad... Actually funny, when you know
that competition OS, often ridiculed by linux users, doesn't suffer any
consequences when that same 5 line program is run.

Regards,
Vedran

-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-28  0:16             ` Alan Cox
  2010-01-28  0:26               ` KAMEZAWA Hiroyuki
@ 2010-01-28  0:59               ` David Rientjes
  2010-01-29  0:25               ` Vedran Furač
  2 siblings, 0 replies; 35+ messages in thread
From: David Rientjes @ 2010-01-28  0:59 UTC (permalink / raw)
  To: Alan Cox
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, minchan.kim,
	linux-kernel, Balbir Singh, Mel Gorman

On Thu, 28 Jan 2010, Alan Cox wrote:

> > Now, /proc/<pid>/oom_score and /proc/<pid>/oom_adj are used by servers.
> 
> And embedded, and some desktops (including some neat experimental hacks
> where windows slowly get to be bigger bigger oom targes the longer
> they've been non-focussed)
> 

Right, oom_adj is used much more widely than described.

> I can't help feeling this is the wrong approach. IFF we are running out
> of low memory pages then killing stuff for that reason is wrong to begin
> with except in extreme cases and those extreme cases are probably also
> cases the kill won't help.
> 
> If we have a movable user page (even an mlocked one) then if there is
> space in other parts of memory (ie the OOM is due to a single zone
> problem) we should *never* be killing in the first place, we should be
> moving the page. The mlock case is a bit hairy but the non mlock case is
> exactly the same sequence of operations as a page out and page in
> somewhere else skipping the widdling on the disk bit in the middle.
> 

Mel Gorman's memory compaction patchset will preempt direct reclaim and 
the oom killer if it can defragment zones by page migration such that a 
higher order allocation would now succeed.

In this specific context, both compaction and direct reclaim will have 
failed so the oom killer is the only alternative.  For __GFP_NOFAIL, 
that's required.  However, there has been some long-standing debate (and 
not only for lowmem, but for all oom conditions) about when the page 
allocator should simply return NULL.  We've always killed something on 
blocking allocations to favor current at the expense of other memory hogs, 
but that may be changed soon: it may make sense to defer oom killing 
completely unless the badness() score reaches a certain threshold such 
that memory leakers really can be dealt with accordingly.

In the lowmem case, it certainly seems plausible to use the same behavior 
that we currently do for mempolicy-constrained ooms: kill current.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-28  0:16             ` Alan Cox
@ 2010-01-28  0:26               ` KAMEZAWA Hiroyuki
  2010-01-28  0:59               ` David Rientjes
  2010-01-29  0:25               ` Vedran Furač
  2 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-28  0:26 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, linux-mm, rientjes, minchan.kim, linux-kernel, balbir

Thank you for comment. But I stoppped this already....

On Thu, 28 Jan 2010 00:16:36 +0000
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > Now, /proc/<pid>/oom_score and /proc/<pid>/oom_adj are used by servers.
> 
> And embedded, and some desktops (including some neat experimental hacks
> where windows slowly get to be bigger bigger oom targes the longer
> they've been non-focussed)
> 
Sure.

> > For my customers, I don't like oom black magic. I'd like to recommend to
> > use memcg, of course ;) But lowmem oom cannot be handled by memcg, well.
> > So I started from this. 
> 
> I can't help feeling this is the wrong approach. IFF we are running out
> of low memory pages then killing stuff for that reason is wrong to begin
> with except in extreme cases and those extreme cases are probably also
> cases the kill won't help.
> 
> If we have a movable user page (even an mlocked one) then if there is
> space in other parts of memory (ie the OOM is due to a single zone
> problem) we should *never* be killing in the first place, we should be
> moving the page. The mlock case is a bit hairy but the non mlock case is
> exactly the same sequence of operations as a page out and page in
> somewhere else skipping the widdling on the disk bit in the middle.
> 
> There are cases we can't do that - eg if the kernel has it pinned for
> DMA, but in that case OOM isn't going to recover the page either - at
> least not until the DMA or whatever unpins it (at which point you could
> just move it).
> 
> Am I missing something fundamental here ?
> 

I just wanted to make oom-killer shouldn't kill sshd or X-serivce or
task launcher IOW, oom-killer shouldn't do not-reasonalble selection.

If lowmem user is killed, I'll be satisfied with the cace "Oh, the process
is killed because lowmem was in short and it used lowmem, Hmmm..." and
never be satisfied with the cace "Ohch!, F*cking OOM killer killed X-server
and 10s of innocent processes!!!".

But year, I stop this. For me, panic_on_oom=1 is all and enough.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-27  0:58           ` KAMEZAWA Hiroyuki
  2010-01-27 23:56             ` David Rientjes
@ 2010-01-28  0:16             ` Alan Cox
  2010-01-28  0:26               ` KAMEZAWA Hiroyuki
                                 ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Alan Cox @ 2010-01-28  0:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-mm, rientjes, minchan.kim, linux-kernel, balbir

> Now, /proc/<pid>/oom_score and /proc/<pid>/oom_adj are used by servers.

And embedded, and some desktops (including some neat experimental hacks
where windows slowly get to be bigger bigger oom targes the longer
they've been non-focussed)

> For my customers, I don't like oom black magic. I'd like to recommend to
> use memcg, of course ;) But lowmem oom cannot be handled by memcg, well.
> So I started from this. 

I can't help feeling this is the wrong approach. IFF we are running out
of low memory pages then killing stuff for that reason is wrong to begin
with except in extreme cases and those extreme cases are probably also
cases the kill won't help.

If we have a movable user page (even an mlocked one) then if there is
space in other parts of memory (ie the OOM is due to a single zone
problem) we should *never* be killing in the first place, we should be
moving the page. The mlock case is a bit hairy but the non mlock case is
exactly the same sequence of operations as a page out and page in
somewhere else skipping the widdling on the disk bit in the middle.

There are cases we can't do that - eg if the kernel has it pinned for
DMA, but in that case OOM isn't going to recover the page either - at
least not until the DMA or whatever unpins it (at which point you could
just move it).

Am I missing something fundamental here ?

Alan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-27  0:58           ` KAMEZAWA Hiroyuki
@ 2010-01-27 23:56             ` David Rientjes
  2010-01-28  0:16             ` Alan Cox
  1 sibling, 0 replies; 35+ messages in thread
From: David Rientjes @ 2010-01-27 23:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-mm, minchan.kim, linux-kernel, balbir

On Wed, 27 Jan 2010, KAMEZAWA Hiroyuki wrote:

> Now, /proc/<pid>/oom_score and /proc/<pid>/oom_adj are used by servers.

Nonsense, there are plenty of userspace applications such as udev that 
tune their own oom_adj value on their own!  oom_adj is used by anyone who 
wants to define oom killer priority by polarizing the badness heuristic 
for certain tasks to, for example, always prefer them or completely 
disable oom killing for them.

> After this patch, badness() returns different value based on given context.
> Changing format of them was an idea, but, as David said, using "RSS" values
> will show unstable oom_score. So, I didn't modify oom_score (for this time).
> 

That's a seperate issue: you cannot define the baseline of the heuristic 
in terms of rss because it does not allow userspace to define when a task 
has become "rogue", i.e. when it is consuming far more memory than 
expected, because it is a dynamic value that depends on the state of the 
VM at the time of oom.  That is one of the two most popular reasons for 
tuning oom_adj, the other aforementioned.

The issue with using lowmem rss for CONSTRAINT_LOWMEM is that it 
misinterprets oom_adj values given to tasks; users will tune their oom_adj 
based on global, system-wide ooms (or use /proc/pid/oom_score to reveal 
the priority) and will never understand how it affects the value of a 
resident page in lowmem for GFP_DMA allocations.

> To be honest, all my work are for guys who don't tweak oom_adj based on oom_score.
> IOW, this is for usual novice people. And I don't wan't to break servers which
> depends on oom black magic currently supported.
> 

Why can't you simply create your own heuristic, seperate from badness(), 
for CONSTRAINT_LOWMEM?  Define the criteria that you see as important in 
selecting a task in that scenario and then propose it as a seperate 
function, there is no requirement that we must have a single heuristic 
that works for all the various oom killer constraints.  It would be 
entirely appropriate to ignore oom_adj in that heuristic, as well, since 
its not defined for such oom conditions (OOM_DISABLE is already taken care 
of in the tasklist scan and needs no further support).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-26 23:53       ` KAMEZAWA Hiroyuki
  2010-01-27  0:19         ` Andrew Morton
@ 2010-01-27 23:46         ` David Rientjes
  1 sibling, 0 replies; 35+ messages in thread
From: David Rientjes @ 2010-01-27 23:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, minchan.kim, linux-kernel, linux-mm

On Wed, 27 Jan 2010, KAMEZAWA Hiroyuki wrote:

> Yes, I think you're right. But "breaking current behaviro of our servers!"
> arguments kills all proposal to this area and this oom-killer or vmscan is
> a feature should be tested by real users.

Nobody has said we should discount lowmem rss when dealing with a GFP_DMA 
allocation, it simply wasn't possible until the lowmem rss counters were 
introduced in -mm.  It would prevent the needless killing of innocent 
tasks which would not allow the page allocation to succeed, so it's a good 
feature to have.  It doesn't need to be configurable at all, we just need 
to find a way to introduce it into the heuristic without mangling oom_adj.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-25  6:15   ` [PATCH v3] " KAMEZAWA Hiroyuki
  2010-01-26 23:12     ` Andrew Morton
  2010-01-26 23:16     ` Andrew Morton
@ 2010-01-27 23:40     ` David Rientjes
  2 siblings, 0 replies; 35+ messages in thread
From: David Rientjes @ 2010-01-27 23:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, minchan.kim, linux-kernel, linux-mm

On Mon, 25 Jan 2010, KAMEZAWA Hiroyuki wrote:

> Default oom-killer uses badness calculation based on process's vm_size
> and some amounts of heuristics. Some users see proc->oom_score and
> proc->oom_adj to control oom-killed tendency under their server.
> 
> Now, we know oom-killer don't work ideally in some situaion, in PCs. Some
> enhancements are demanded. But such enhancements for oom-killer makes
> incomaptibility to oom-controls in enterprise world. So, this patch
> adds sysctl for extensions for oom-killer. Main purpose is for
> making a chance for wider test for new scheme.
> 

That's insufficient for inclusion in mainline, we don't add new sysctls so 
that new heuristics can be tried out.  It's fine to propose a new sysctl 
to define how the oom killer behaves, but the main purpose would not be 
for testing; rather, it would be to enable options that users within the 
minority would want to use.

I disagree that we should be doing this as a bitmask that defines certain 
oom killer options; we already have three seperate sysctls which also 
enable options: panic_on_oom, oom_kill_allocating_task, and 
oom_dump_tasks.  Either these existing sysctls need to be converted to the 
bitmask, breaking the long-standing legacy support, or you simply need to 
clutter procfs a little more.  I'm slightly biased toward the latter since 
it doesn't require any userspace change and tunables such as panic_on_oom 
have been around for a long time.

 [ Note: it may be possible to consolidate two of these existing sysctls
   down into one: oom_dump_tasks can be enabled by default if the tasklist
   is sufficiently short and the only use-case for oom_kill_allocating_task
   is for machines with enormously long tasklists to prevent unnecessary
   delays in selecting a bad process to kill.  Thus, we could probably
   consolidate these into one sysctl: oom_kill_quick, which would disable
   the tasklist dump and always kill current when invoked. ]

> One cause of OOM-Killer is memory shortage in lower zones.
> (If memory is enough, lowmem_reserve_ratio works well. but..)

I don't understand the reference to lowmem_reserve_ratio here, it may 
reserve lowmem from ~GFP_DMA requests but it does nothing to prevent oom 
conditions from excessive DMA page allocations.

> I saw lowmem-oom frequently on x86-32 and sometimes on ia64 in
> my cusotmer support jobs. If we just see process's vm_size at oom,
> we can never kill a process which has lowmem.

That's not always true, it may end up killing a large consumer of DMA 
memory by chance simply because the heuristics work out that way.  In 
other words, we can't say it will "never" work correctly as it is 
currently implemented.  I agree we can make it smarter, however.

> At last, there will be an oom-serial-killer.
> 

Heh.

> Now, we have per-mm lowmem usage counter. We can make use of it
> to select a good victim.
> 
> This patch does
>   - add sysctl for new bahavior.
>   - add CONSTRAINT_LOWMEM to oom's constraint type.
>   - pass constraint to __badness()

You mean badness()?  Passing the constraint works well for my 
CONSTRAINT_MEMPOLICY patch as well.

>   - change calculation based on constraint. If CONSTRAINT_LOWMEM,
>     use low_rss instead of vmsize.
> 

Nack, we can't simply use the lowmem rss as a baseline because 
/proc/pid/oom_adj, the single most powerful heuristic in badness(), is not 
defined for these dual scenarios.  There may only be a single baseline to 
define for oom_adj, otherwise it will have erradic results depending on 
the context in which the oom killer is called.  It can be used to polarize 
the heuristic depending on the total VM size which may be disadvantageous 
when using lowmem rss as the baseline.

I think the best alternative would be to strongly penalize the badness() 
points for tasks that do not have a lowmem rss when we are constrained by 
CONSTRAINT_LOWMEM, similar to how we penalize tasks not sharing current's 
mems_allowed since it (usually) doesn't help.  We do not necessarily 
always want to kill the task that is consuming the most lowmem for a 
single page allocation; we need to decide how valuable lowmem is in 
relation to overall VM size, however.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-27  0:19         ` Andrew Morton
@ 2010-01-27  0:58           ` KAMEZAWA Hiroyuki
  2010-01-27 23:56             ` David Rientjes
  2010-01-28  0:16             ` Alan Cox
  0 siblings, 2 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-27  0:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, rientjes, minchan.kim, linux-kernel, balbir

On Tue, 26 Jan 2010 16:19:52 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 27 Jan 2010 08:53:55 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > Hardly anyone will know to enable
> > > it so the feature won't get much testing and this binary decision
> > > fractures the testing effort.  It would be much better if we can get
> > > everyone running the same code.  I mean, if there are certain workloads
> > > on certain machines with which the oom-killer doesn't behave correctly
> > > then fix it!
> > Yes, I think you're right. But "breaking current behaviro of our servers!"
> > arguments kills all proposal to this area and this oom-killer or vmscan is
> > a feature should be tested by real users. (I'll write fork-bomb detector
> > and RSS based OOM again.)
> 
> Well don't break their servers then ;)
> 
> What I'm not understanding is: why is it not possible to improve the
> behaviour on the affected machines without affecting the behaviour on
> other machines?
> 

Now, /proc/<pid>/oom_score and /proc/<pid>/oom_adj are used by servers.
After this patch, badness() returns different value based on given context.
Changing format of them was an idea, but, as David said, using "RSS" values
will show unstable oom_score. So, I didn't modify oom_score (for this time).

To be honest, all my work are for guys who don't tweak oom_adj based on oom_score.
IOW, this is for usual novice people. And I don't wan't to break servers which
depends on oom black magic currently supported.

It may be better to show lowmem_rss via /proc/<pid>/statm or somewhere. But
I didn't do that because usual people doesn't check that in periodic and
tweak oom_adj.

For my customers, I don't like oom black magic. I'd like to recommend to
use memcg, of course ;) But lowmem oom cannot be handled by memcg, well.
So I started from this. 

> What are these "servers" to which you refer?
Almost all servers/PCs/laptops which have multiple zones in memory layout.

> x86_32 servers, I assume
> - the patch shouldn't affect 64-bit machines.  Why don't they also want
> this treatment and in what way does the patch "break" them?

Ah, explanation was not enough.

This patch depends on mm-add-lowmem-detection-logic.patch
The lowmem is
   - ZONE_NORMAL in x86-32 which has HIGHMEM
   - ZONE_DMA32  in x86-64 which has ZONE_NORMAL
   - ZONE_DMA    in x86-64 which doesn't have ZONE_NORMAL(memory < 4G)
   - ZONE_DMA    in ia64 which has ZONE_NORMAL(memory > 4G)
   - no zone     in ppc. all zone are DMA. (lowmem_zone=-1)

So, this affects x86-64 hosts, especially when it has 4 Gbytes of memory
and 32bit pci cards.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-26 23:53       ` KAMEZAWA Hiroyuki
@ 2010-01-27  0:19         ` Andrew Morton
  2010-01-27  0:58           ` KAMEZAWA Hiroyuki
  2010-01-27 23:46         ` David Rientjes
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2010-01-27  0:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, rientjes, minchan.kim, linux-kernel, balbir

On Wed, 27 Jan 2010 08:53:55 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > Hardly anyone will know to enable
> > it so the feature won't get much testing and this binary decision
> > fractures the testing effort.  It would be much better if we can get
> > everyone running the same code.  I mean, if there are certain workloads
> > on certain machines with which the oom-killer doesn't behave correctly
> > then fix it!
> Yes, I think you're right. But "breaking current behaviro of our servers!"
> arguments kills all proposal to this area and this oom-killer or vmscan is
> a feature should be tested by real users. (I'll write fork-bomb detector
> and RSS based OOM again.)

Well don't break their servers then ;)

What I'm not understanding is: why is it not possible to improve the
behaviour on the affected machines without affecting the behaviour on
other machines?

What are these "servers" to which you refer?  x86_32 servers, I assume
- the patch shouldn't affect 64-bit machines.  Why don't they also want
this treatment and in what way does the patch "break" them?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-26 23:12     ` Andrew Morton
@ 2010-01-26 23:53       ` KAMEZAWA Hiroyuki
  2010-01-27  0:19         ` Andrew Morton
  2010-01-27 23:46         ` David Rientjes
  0 siblings, 2 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-26 23:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, rientjes, minchan.kim, linux-kernel, balbir

On Tue, 26 Jan 2010 15:12:02 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 25 Jan 2010 15:15:03 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > This patch does
> >   - add sysctl for new bahavior.
> >   - add CONSTRAINT_LOWMEM to oom's constraint type.
> >   - pass constraint to __badness()
> >   - change calculation based on constraint. If CONSTRAINT_LOWMEM,
> >     use low_rss instead of vmsize.
> > 
> > Changelog 2010/01/25
> >  - showing extension_mask value in OOM kill main log header.
> > Changelog 2010/01/22:
> >  - added sysctl
> >  - fixed !CONFIG_MMU
> >  - fixed fs/proc/base.c breakacge.
> 
> It'd be nice to see some testing results for this.  Perhaps "here's a
> test case and here's the before-and-after behaviour".
> 
Hm. posting test case module is O.K ?
At leaset, I'll add what test was done and /var/log/message output to the log.


> I don't like the sysctl knob much. 
me, too.

> Hardly anyone will know to enable
> it so the feature won't get much testing and this binary decision
> fractures the testing effort.  It would be much better if we can get
> everyone running the same code.  I mean, if there are certain workloads
> on certain machines with which the oom-killer doesn't behave correctly
> then fix it!
Yes, I think you're right. But "breaking current behaviro of our servers!"
arguments kills all proposal to this area and this oom-killer or vmscan is
a feature should be tested by real users. (I'll write fork-bomb detector
and RSS based OOM again.)

Then, I'd like to use sysctl. Distro/users can select default value of this
by /etc/sysctl.conf file, at least.


> 
> Why was the '#include <linux/sysctl.h>" removed from sysctl.c?
> 
> The patch adds a random newline to sysctl.c.
> 
Sorry. my bad.


> It was never a good idea to add extern declarations to sysctl.c.  It's
> better to add them to a subsystem-specific header file (ie:
> mm-sysctl.h) and then include that file from the mm files that define
> or use sysctl_foo, and include it into sysctl.c.  Oh well.
> 
Hmm. Okay. I'll consider about that.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-26 23:16     ` Andrew Morton
@ 2010-01-26 23:44       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-26 23:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, rientjes, minchan.kim, linux-kernel, balbir

On Tue, 26 Jan 2010 15:16:06 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 25 Jan 2010 15:15:03 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > -unsigned long badness(struct task_struct *p, unsigned long uptime)
> > +unsigned long badness(struct task_struct *p, unsigned long uptime,
> > +			int constraint)
> 
> And badness() should be renamed to something better (eg, oom_badness), and
> the declaration should be placed in a mm-specific header file.
> 
> Yes, the code was already like that.  But please don't leave crappiness in
> place when you come across it - take the opportunity to fix it up.
> 
Sure. I'll write v4.

-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-25  6:15   ` [PATCH v3] " KAMEZAWA Hiroyuki
  2010-01-26 23:12     ` Andrew Morton
@ 2010-01-26 23:16     ` Andrew Morton
  2010-01-26 23:44       ` KAMEZAWA Hiroyuki
  2010-01-27 23:40     ` David Rientjes
  2 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2010-01-26 23:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, rientjes, minchan.kim, linux-kernel, balbir

On Mon, 25 Jan 2010 15:15:03 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> -unsigned long badness(struct task_struct *p, unsigned long uptime)
> +unsigned long badness(struct task_struct *p, unsigned long uptime,
> +			int constraint)

And badness() should be renamed to something better (eg, oom_badness), and
the declaration should be placed in a mm-specific header file.

Yes, the code was already like that.  But please don't leave crappiness in
place when you come across it - take the opportunity to fix it up.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-25  6:15   ` [PATCH v3] " KAMEZAWA Hiroyuki
@ 2010-01-26 23:12     ` Andrew Morton
  2010-01-26 23:53       ` KAMEZAWA Hiroyuki
  2010-01-26 23:16     ` Andrew Morton
  2010-01-27 23:40     ` David Rientjes
  2 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2010-01-26 23:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, rientjes, minchan.kim, linux-kernel, balbir

On Mon, 25 Jan 2010 15:15:03 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Did several tests on x86-32 and I felt that sysctl value should be
> printed on oom log... this is v3.
> 
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Default oom-killer uses badness calculation based on process's vm_size
> and some amounts of heuristics. Some users see proc->oom_score and
> proc->oom_adj to control oom-killed tendency under their server.
> 
> Now, we know oom-killer don't work ideally in some situaion, in PCs. Some
> enhancements are demanded. But such enhancements for oom-killer makes
> incomaptibility to oom-controls in enterprise world. So, this patch
> adds sysctl for extensions for oom-killer. Main purpose is for
> making a chance for wider test for new scheme.
> 
> One cause of OOM-Killer is memory shortage in lower zones.
> (If memory is enough, lowmem_reserve_ratio works well. but..)
> I saw lowmem-oom frequently on x86-32 and sometimes on ia64 in
> my cusotmer support jobs. If we just see process's vm_size at oom,
> we can never kill a process which has lowmem.
> At last, there will be an oom-serial-killer.
> 
> Now, we have per-mm lowmem usage counter. We can make use of it
> to select a good victim.
> 
> This patch does
>   - add sysctl for new bahavior.
>   - add CONSTRAINT_LOWMEM to oom's constraint type.
>   - pass constraint to __badness()
>   - change calculation based on constraint. If CONSTRAINT_LOWMEM,
>     use low_rss instead of vmsize.
> 
> Changelog 2010/01/25
>  - showing extension_mask value in OOM kill main log header.
> Changelog 2010/01/22:
>  - added sysctl
>  - fixed !CONFIG_MMU
>  - fixed fs/proc/base.c breakacge.

It'd be nice to see some testing results for this.  Perhaps "here's a
test case and here's the before-and-after behaviour".

I don't like the sysctl knob much.  Hardly anyone will know to enable
it so the feature won't get much testing and this binary decision
fractures the testing effort.  It would be much better if we can get
everyone running the same code.  I mean, if there are certain workloads
on certain machines with which the oom-killer doesn't behave correctly
then fix it!

Why was the '#include <linux/sysctl.h>" removed from sysctl.c?

The patch adds a random newline to sysctl.c.

It was never a good idea to add extern declarations to sysctl.c.  It's
better to add them to a subsystem-specific header file (ie:
mm-sysctl.h) and then include that file from the mm files that define
or use sysctl_foo, and include it into sysctl.c.  Oh well.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3] oom-kill: add lowmem usage aware oom kill handling
  2010-01-22  6:23 ` [PATCH v2] " KAMEZAWA Hiroyuki
@ 2010-01-25  6:15   ` KAMEZAWA Hiroyuki
  2010-01-26 23:12     ` Andrew Morton
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-25  6:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, akpm, rientjes, minchan.kim, linux-kernel, balbir

Did several tests on x86-32 and I felt that sysctl value should be
printed on oom log... this is v3.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Default oom-killer uses badness calculation based on process's vm_size
and some amounts of heuristics. Some users see proc->oom_score and
proc->oom_adj to control oom-killed tendency under their server.

Now, we know oom-killer don't work ideally in some situaion, in PCs. Some
enhancements are demanded. But such enhancements for oom-killer makes
incomaptibility to oom-controls in enterprise world. So, this patch
adds sysctl for extensions for oom-killer. Main purpose is for
making a chance for wider test for new scheme.

One cause of OOM-Killer is memory shortage in lower zones.
(If memory is enough, lowmem_reserve_ratio works well. but..)
I saw lowmem-oom frequently on x86-32 and sometimes on ia64 in
my cusotmer support jobs. If we just see process's vm_size at oom,
we can never kill a process which has lowmem.
At last, there will be an oom-serial-killer.

Now, we have per-mm lowmem usage counter. We can make use of it
to select a good victim.

This patch does
  - add sysctl for new bahavior.
  - add CONSTRAINT_LOWMEM to oom's constraint type.
  - pass constraint to __badness()
  - change calculation based on constraint. If CONSTRAINT_LOWMEM,
    use low_rss instead of vmsize.

Changelog 2010/01/25
 - showing extension_mask value in OOM kill main log header.
Changelog 2010/01/22:
 - added sysctl
 - fixed !CONFIG_MMU
 - fixed fs/proc/base.c breakacge.

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/sysctl/vm.txt |   16 +++++++
 fs/proc/base.c              |    5 +-
 include/linux/oom.h         |    1 
 kernel/sysctl.c             |   10 ++++
 mm/oom_kill.c               |   92 ++++++++++++++++++++++++++++++++------------
 5 files changed, 97 insertions(+), 27 deletions(-)

Index: mmotm-2.6.33-Jan15/include/linux/oom.h
===================================================================
--- mmotm-2.6.33-Jan15.orig/include/linux/oom.h
+++ mmotm-2.6.33-Jan15/include/linux/oom.h
@@ -20,6 +20,7 @@ struct notifier_block;
  */
 enum oom_constraint {
 	CONSTRAINT_NONE,
+	CONSTRAINT_LOWMEM,
 	CONSTRAINT_CPUSET,
 	CONSTRAINT_MEMORY_POLICY,
 };
Index: mmotm-2.6.33-Jan15/mm/oom_kill.c
===================================================================
--- mmotm-2.6.33-Jan15.orig/mm/oom_kill.c
+++ mmotm-2.6.33-Jan15/mm/oom_kill.c
@@ -34,6 +34,23 @@ int sysctl_oom_dump_tasks;
 static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
+int sysctl_oom_kill_extension_mask;
+enum {
+	EXT_LOWMEM_OOM,
+};
+
+#ifdef CONFIG_MMU
+static int oom_extension(int idx)
+{
+	return sysctl_oom_kill_extension_mask & (1 << idx);
+}
+#else
+static int oom_extension(int idx)
+{
+	return 0;
+}
+#endif
+
 /*
  * Is all threads of the target process nodes overlap ours?
  */
@@ -55,6 +72,7 @@ static int has_intersects_mems_allowed(s
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
  * @uptime: current uptime in seconds
+ * @constraint: context of badness calculation.
  *
  * The formula used is relatively simple and documented inline in the
  * function. The main rationale is that we want to select a good task
@@ -70,7 +88,8 @@ static int has_intersects_mems_allowed(s
  *    of least surprise ... (be careful when you change it)
  */
 
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long badness(struct task_struct *p, unsigned long uptime,
+			int constraint)
 {
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
@@ -89,11 +108,16 @@ unsigned long badness(struct task_struct
 		task_unlock(p);
 		return 0;
 	}
-
-	/*
-	 * The memory size of the process is the basis for the badness.
-	 */
-	points = mm->total_vm;
+	switch  (constraint) {
+	case CONSTRAINT_LOWMEM:
+		/* use lowmem usage as the basis for the badness */
+		points = get_low_rss(mm);
+		break;
+	default:
+		/* use virtual memory size as the basis for the badness */
+		points = mm->total_vm;
+		break;
+	}
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -113,12 +137,17 @@ unsigned long badness(struct task_struct
 	 * machine with an endless amount of children. In case a single
 	 * child is eating the vast majority of memory, adding only half
 	 * to the parents will make the child our kill candidate of choice.
+	 *
+	 * At lowmem shortage, this part is skipped because children's lowmem
+	 * usage is not related to its parent.
 	 */
-	list_for_each_entry(child, &p->children, sibling) {
-		task_lock(child);
-		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
-		task_unlock(child);
+	if (constraint != CONSTRAINT_LOWMEM) {
+		list_for_each_entry(child, &p->children, sibling) {
+			task_lock(child);
+			if (child->mm != mm && child->mm)
+				points += child->mm->total_vm/2 + 1;
+			task_unlock(child);
+		}
 	}
 
 	/*
@@ -212,6 +241,9 @@ static enum oom_constraint constrained_a
 	if (gfp_mask & __GFP_THISNODE)
 		return CONSTRAINT_NONE;
 
+	if (oom_extension(EXT_LOWMEM_OOM) && (high_zoneidx <= lowmem_zone))
+		return CONSTRAINT_LOWMEM;
+
 	/*
 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
@@ -233,6 +265,10 @@ static enum oom_constraint constrained_a
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	int zone_idx = gfp_zone(gfp_mask);
+
+	if (oom_extension(EXT_LOWMEM_OOM) && (zone_idx <= lowmem_zone))
+		return CONSTRAINT_LOWMEM;
 	return CONSTRAINT_NONE;
 }
 #endif
@@ -244,7 +280,7 @@ static enum oom_constraint constrained_a
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+				struct mem_cgroup *mem, int constraint)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -300,7 +336,7 @@ static struct task_struct *select_bad_pr
 		if (p->signal->oom_adj == OOM_DISABLE)
 			continue;
 
-		points = badness(p, uptime.tv_sec);
+		points = badness(p, uptime.tv_sec, constraint);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -360,8 +396,9 @@ static void dump_header(struct task_stru
 							struct mem_cgroup *mem)
 {
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d\n",
-		current->comm, gfp_mask, order, current->signal->oom_adj);
+		"oom_adj=%d extesion=%x\n",
+		current->comm, gfp_mask, order,
+		current->signal->oom_adj, sysctl_oom_kill_extension_mask);
 	task_lock(current);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
@@ -455,7 +492,7 @@ static int oom_kill_process(struct task_
 	}
 
 	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
-					message, task_pid_nr(p), p->comm, points);
+				message, task_pid_nr(p), p->comm, points);
 
 	/* Try to kill a child first */
 	list_for_each_entry(c, &p->children, sibling) {
@@ -475,7 +512,7 @@ void mem_cgroup_out_of_memory(struct mem
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, CONSTRAINT_NONE);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -557,7 +594,7 @@ void clear_zonelist_oom(struct zonelist 
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order, int constraint)
 {
 	struct task_struct *p;
 	unsigned long points;
@@ -571,7 +608,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, constraint);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -583,9 +620,16 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	switch (constraint) {
+	case CONSTRAINT_LOWMEM:
+		if (oom_kill_process(p, gfp_mask, order, points, NULL,
+			"Out of memory (in lowmem)"))
+			goto retry;
+	default:
+		if (oom_kill_process(p, gfp_mask, order, points, NULL,
 			     "Out of memory"))
-		goto retry;
+			goto retry;
+	}
 }
 
 /*
@@ -612,7 +656,7 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	__out_of_memory(0, 0, CONSTRAINT_NONE); /* unknown gfp_mask and order */
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -663,7 +707,7 @@ void out_of_memory(struct zonelist *zone
 		oom_kill_process(current, gfp_mask, order, 0, NULL,
 				"No available memory (MPOL_BIND)");
 		break;
-
+	case CONSTRAINT_LOWMEM:
 	case CONSTRAINT_NONE:
 		if (sysctl_panic_on_oom) {
 			dump_header(NULL, gfp_mask, order, NULL);
@@ -671,7 +715,7 @@ void out_of_memory(struct zonelist *zone
 		}
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
+		__out_of_memory(gfp_mask, order, constraint);
 		break;
 	}
 
Index: mmotm-2.6.33-Jan15/kernel/sysctl.c
===================================================================
--- mmotm-2.6.33-Jan15.orig/kernel/sysctl.c
+++ mmotm-2.6.33-Jan15/kernel/sysctl.c
@@ -22,7 +22,6 @@
 #include <linux/mm.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
-#include <linux/sysctl.h>
 #include <linux/proc_fs.h>
 #include <linux/security.h>
 #include <linux/ctype.h>
@@ -72,6 +71,7 @@ extern int sysctl_overcommit_ratio;
 extern int sysctl_panic_on_oom;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_extension_mask;
 extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
@@ -202,6 +202,7 @@ extern struct ctl_table epoll_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -1282,6 +1283,13 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "oom_kill_extension_mask",
+		.data		= &sysctl_oom_kill_extension_mask,
+		.maxlen		= sizeof(sysctl_oom_kill_extension_mask),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
Index: mmotm-2.6.33-Jan15/Documentation/sysctl/vm.txt
===================================================================
--- mmotm-2.6.33-Jan15.orig/Documentation/sysctl/vm.txt
+++ mmotm-2.6.33-Jan15/Documentation/sysctl/vm.txt
@@ -45,6 +45,7 @@ Currently, these files are in /proc/sys/
 - numa_zonelist_order
 - oom_dump_tasks
 - oom_kill_allocating_task
+- oom_kill_extension_mask
 - overcommit_memory
 - overcommit_ratio
 - page-cluster
@@ -511,6 +512,21 @@ The default value is 0.
 
 ==============================================================
 
+oom_kill_extension_mask:
+
+This is a mask for oom-killer extension features.
+Setting these flags may cause incompatibility for proc->oom_score and
+proc->oom_adj controls. So, please set carefully.
+
+bit 0....lowmem aware oom-killing.
+    If set, at lowmem shortage oom killing (for example, exhausting NORMAL_ZONE
+    under x86-32 HIGHMEM host), oom-killer will see lowmem rss usage of
+    processes instead of vmsize. Works only when CONFIG_MMU=y.
+
+The default value is 0
+
+==============================================================
+
 overcommit_memory:
 
 This value contains a flag that enables memory overcommitment.
Index: mmotm-2.6.33-Jan15/fs/proc/base.c
===================================================================
--- mmotm-2.6.33-Jan15.orig/fs/proc/base.c
+++ mmotm-2.6.33-Jan15/fs/proc/base.c
@@ -458,7 +458,8 @@ static const struct file_operations proc
 #endif
 
 /* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
+unsigned long badness(struct task_struct *p,
+	unsigned long uptime, int constraint);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points;
@@ -466,7 +467,7 @@ static int proc_oom_score(struct task_st
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
-	points = badness(task->group_leader, uptime.tv_sec);
+	points = badness(task->group_leader, uptime.tv_sec, CONSTRAINT_NONE);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2010-02-01 10:34 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-29 16:11 [PATCH v3] oom-kill: add lowmem usage aware oom kill handling KAMEZAWA Hiroyuki
2010-01-29 16:21 ` Alan Cox
2010-01-29 16:25   ` KAMEZAWA Hiroyuki
2010-01-29 16:30     ` Alan Cox
2010-01-29 16:41       ` KAMEZAWA Hiroyuki
2010-01-29 21:07         ` David Rientjes
2010-01-30 12:46           ` Vedran Furač
2010-01-30 22:53             ` David Rientjes
2010-01-31 20:29               ` Vedran Furač
2010-02-01 10:33                 ` David Rientjes
2010-02-01  0:01           ` KAMEZAWA Hiroyuki
2010-02-01 10:28             ` David Rientjes
2010-01-29 21:11     ` David Rientjes
  -- strict thread matches above, loose matches on Subject: below --
2010-01-21  5:59 [PATCH] " KAMEZAWA Hiroyuki
2010-01-22  6:23 ` [PATCH v2] " KAMEZAWA Hiroyuki
2010-01-25  6:15   ` [PATCH v3] " KAMEZAWA Hiroyuki
2010-01-26 23:12     ` Andrew Morton
2010-01-26 23:53       ` KAMEZAWA Hiroyuki
2010-01-27  0:19         ` Andrew Morton
2010-01-27  0:58           ` KAMEZAWA Hiroyuki
2010-01-27 23:56             ` David Rientjes
2010-01-28  0:16             ` Alan Cox
2010-01-28  0:26               ` KAMEZAWA Hiroyuki
2010-01-28  0:59               ` David Rientjes
2010-01-29  0:25               ` Vedran Furač
2010-01-29  0:35                 ` Alan Cox
2010-01-29  0:57                   ` Vedran Furač
2010-01-29 11:03                     ` Alan Cox
2010-01-30 12:33                       ` Vedran Furač
2010-01-30 12:59                         ` Alan Cox
2010-01-30 17:30                           ` Vedran Furač
2010-01-30 17:45                             ` Alan Cox
2010-01-30 18:17                               ` Vedran Furač
2010-01-27 23:46         ` David Rientjes
2010-01-26 23:16     ` Andrew Morton
2010-01-26 23:44       ` KAMEZAWA Hiroyuki
2010-01-27 23:40     ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).