linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
@ 2011-03-05 11:44 Andrey Vagin
  2011-03-05 15:20 ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: Andrey Vagin @ 2011-03-05 11:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, KOSAKI Motohiro, avagin, linux-mm, linux-kernel

Check zone->all_unreclaimable in all_unreclaimable(); otherwise the
kernel may hang, because shrink_zones() will do nothing while
all_unreclaimable() keeps saying that the zones have reclaimable pages.

do_try_to_free_pages()
	shrink_zones()
		 for_each_zone
			if (zone->all_unreclaimable)
				continue
	if !all_unreclaimable(zonelist, sc)
		return 1

__alloc_pages_slowpath()
retry:
	did_some_progress = do_try_to_free_pages(page)
	...
	if (!page && did_some_progress)
		retry;
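
In other words, sketching the page allocator logic of this era from
memory (simplified, not verbatim):

	/* __alloc_pages_slowpath(), heavily abridged */
rebalance:
	page = __alloc_pages_direct_reclaim(..., &did_some_progress);
	if (page)
		goto got_pg;
	if (!did_some_progress) {
		/* only when reclaim reports no progress is OOM considered */
		page = __alloc_pages_may_oom(...);
	}
	if (should_alloc_retry(gfp_mask, order, pages_reclaimed))
		goto rebalance;

So as long as do_try_to_free_pages() keeps reporting progress because
all_unreclaimable() returns false, the OOM killer is never invoked and
the allocation loops forever.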

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 mm/vmscan.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6771ea7..1c056f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			gfp_zone(sc->gfp_mask), sc->nodemask) {
+		if (zone->all_unreclaimable)
+			continue;
 		if (!populated_zone(zone))
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-- 
1.7.1



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-05 11:44 [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable() Andrey Vagin
@ 2011-03-05 15:20 ` Minchan Kim
  2011-03-05 15:34   ` Andrew Vagin
  0 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-03-05 15:20 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: Andrew Morton, Mel Gorman, KOSAKI Motohiro, linux-mm, linux-kernel

On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> kernel may hang up, because shrink_zones() will do nothing, but
> all_unreclaimable() will say, that zone has reclaimable pages.
> 
> do_try_to_free_pages()
> 	shrink_zones()
> 		 for_each_zone
> 			if (zone->all_unreclaimable)
> 				continue
> 	if !all_unreclaimable(zonelist, sc)
> 		return 1
> 
> __alloc_pages_slowpath()
> retry:
> 	did_some_progress = do_try_to_free_pages(page)
> 	...
> 	if (!page && did_some_progress)
> 		retry;
> 
> Signed-off-by: Andrey Vagin <avagin@openvz.org>
> ---
>  mm/vmscan.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6771ea7..1c056f7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>  
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  			gfp_zone(sc->gfp_mask), sc->nodemask) {
> +		if (zone->all_unreclaimable)
> +			continue;
>  		if (!populated_zone(zone))
>  			continue;
>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))


zone_reclaimable() checks it. Isn't that enough?
Does the hang really happen, or did you see it by code review?

> -- 
> 1.7.1
> 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-05 15:20 ` Minchan Kim
@ 2011-03-05 15:34   ` Andrew Vagin
  2011-03-05 15:53     ` Minchan Kim
  2011-05-04  1:38     ` CAI Qian
  0 siblings, 2 replies; 58+ messages in thread
From: Andrew Vagin @ 2011-03-05 15:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrey Vagin, Andrew Morton, Mel Gorman, KOSAKI Motohiro,
	linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1971 bytes --]

On 03/05/2011 06:20 PM, Minchan Kim wrote:
> On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
>> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
>> kernel may hang up, because shrink_zones() will do nothing, but
>> all_unreclaimable() will say, that zone has reclaimable pages.
>>
>> do_try_to_free_pages()
>> 	shrink_zones()
>> 		 for_each_zone
>> 			if (zone->all_unreclaimable)
>> 				continue
>> 	if !all_unreclaimable(zonelist, sc)
>> 		return 1
>>
>> __alloc_pages_slowpath()
>> retry:
>> 	did_some_progress = do_try_to_free_pages(page)
>> 	...
>> 	if (!page && did_some_progress)
>> 		retry;
>>
>> Signed-off-by: Andrey Vagin <avagin@openvz.org>
>> ---
>>   mm/vmscan.c |    2 ++
>>   1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 6771ea7..1c056f7 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>>
>>   	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>>   			gfp_zone(sc->gfp_mask), sc->nodemask) {
>> +		if (zone->all_unreclaimable)
>> +			continue;
>>   		if (!populated_zone(zone))
>>   			continue;
>>   		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>
> zone_reclaimable checks it. Isn't it enough?
I sent one more patch, [PATCH] mm: skip zombie in OOM-killer.
These two patches are enough.
> Does the hang up really happen or see it by code review?
Yes. You can reproduce it with the help of the attached python program
(memeater.py). It's not very clever:)
It performs the following actions in a loop:
1. fork
2. mmap
3. touch memory
4. read memory
5. munmap

>> -- 
>> 1.7.1
>>


[-- Attachment #2: memeater.py --]
[-- Type: text/plain, Size: 659 bytes --]

import sys, mmap, os
import random

# memeater.py: fork bomb plus memory hog. Every process forks in a
# loop; each parent maps a random amount of anonymous memory, writes a
# pattern into the first byte of every page and verifies it, while the
# child keeps forking.

def info(msg):
	pid = os.getpid()
	print >> sys.stderr, "%s: %s" % (pid, msg)
	sys.stderr.flush()

def memory_loop(cmd = "a"):
	"""
	cmd may be:
		c: check memory
		else: touch memory
	"""
	# mem_size and f are globals set up in the loop below.
	for j in xrange(0, mem_size):
		if cmd == "c":
			if f[j<<12] != chr(j % 255):
				info("Data corruption")
				sys.exit(1)
		else:
			f[j<<12] = chr(j % 255)

while True:
	pid = os.fork()
	if (pid != 0):
		# Parent: map up to 56*4096 pages (~900 MB) of anonymous
		# memory, touch and then check it; the child loops and
		# forks again. At least one page, since mmap(-1, 0) fails.
		mem_size = random.randint(1, 56 * 4096)
		f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
		memory_loop()
		memory_loop("c")
		f.close()


* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-05 15:34   ` Andrew Vagin
@ 2011-03-05 15:53     ` Minchan Kim
  2011-03-05 16:41       ` Andrew Vagin
  2011-05-04  1:38     ` CAI Qian
  1 sibling, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-03-05 15:53 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrey Vagin, Andrew Morton, Mel Gorman, KOSAKI Motohiro,
	linux-mm, linux-kernel

On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
> On 03/05/2011 06:20 PM, Minchan Kim wrote:
> >On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> >>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> >>kernel may hang up, because shrink_zones() will do nothing, but
> >>all_unreclaimable() will say, that zone has reclaimable pages.
> >>
> >>do_try_to_free_pages()
> >>	shrink_zones()
> >>		 for_each_zone
> >>			if (zone->all_unreclaimable)
> >>				continue
> >>	if !all_unreclaimable(zonelist, sc)
> >>		return 1
> >>
> >>__alloc_pages_slowpath()
> >>retry:
> >>	did_some_progress = do_try_to_free_pages(page)
> >>	...
> >>	if (!page && did_some_progress)
> >>		retry;
> >>
> >>Signed-off-by: Andrey Vagin <avagin@openvz.org>
> >>---
> >>  mm/vmscan.c |    2 ++
> >>  1 files changed, 2 insertions(+), 0 deletions(-)
> >>
> >>diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>index 6771ea7..1c056f7 100644
> >>--- a/mm/vmscan.c
> >>+++ b/mm/vmscan.c
> >>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> >>
> >>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >>  			gfp_zone(sc->gfp_mask), sc->nodemask) {
> >>+		if (zone->all_unreclaimable)
> >>+			continue;
> >>  		if (!populated_zone(zone))
> >>  			continue;
> >>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >
> >zone_reclaimable checks it. Isn't it enough?
> I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> This two patches are enough.

Sorry if I confused you.
I mean zone->all_unreclaimable becomes true if !zone_reclaimable in balance_pgdat.
zone_reclaimable compares the recent pages_scanned with the number of zone
LRU pages, so too much scanning against a small LRU turns the zone into an
unreclaimable zone.

In all_unreclaimable(), we call zone_reclaimable() to detect that.
It's the same thing as your patch.
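
For reference, the check I mean looks roughly like this in the vmscan.c
of this era (paraphrased from memory, not a verbatim quote):

	static bool zone_reclaimable(struct zone *zone)
	{
		/* a zone counts as reclaimable until its LRU has been
		 * scanned six times over without freeing anything */
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}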

> >Does the hang up really happen or see it by code review?
> Yes. You can reproduce it for help the attached python program. It's
> not very clever:)
> It make the following actions in loop:
> 1. fork
> 2. mmap
> 3. touch memory
> 4. read memory
> 5. munmmap

It seems the test program makes a fork bomb and hogs memory.
If you apply this patch, is the problem gone?

> 
> >>-- 
> >>1.7.1
> >>
> 

> import sys, time, mmap, os
> from subprocess import Popen, PIPE
> import random
> 
> global mem_size
> 
> def info(msg):
> 	pid = os.getpid()
> 	print >> sys.stderr, "%s: %s" % (pid, msg)
> 	sys.stderr.flush()
> 
> 
> 
> def memory_loop(cmd = "a"):
> 	"""
> 	cmd may be:
> 		c: check memory
> 		else: touch memory
> 	"""
> 	c = 0
> 	for j in xrange(0, mem_size):
> 		if cmd == "c":
> 			if f[j<<12] != chr(j % 255):
> 				info("Data corruption")
> 				sys.exit(1)
> 		else:
> 			f[j<<12] = chr(j % 255)
> 
> while True:
> 	pid = os.fork()
> 	if (pid != 0):
> 		mem_size = random.randint(0, 56 * 4096)
> 		f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
> 		memory_loop()
> 		memory_loop("c")
> 		f.close()


-- 
Kind regards,
Minchan Kim


* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-05 15:53     ` Minchan Kim
@ 2011-03-05 16:41       ` Andrew Vagin
  2011-03-05 17:07         ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Vagin @ 2011-03-05 16:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrey Vagin, Andrew Morton, Mel Gorman, KOSAKI Motohiro,
	linux-mm, linux-kernel

On 03/05/2011 06:53 PM, Minchan Kim wrote:
> On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
>> On 03/05/2011 06:20 PM, Minchan Kim wrote:
>>> On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
>>>> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
>>>> kernel may hang up, because shrink_zones() will do nothing, but
>>>> all_unreclaimable() will say, that zone has reclaimable pages.
>>>>
>>>> do_try_to_free_pages()
>>>> 	shrink_zones()
>>>> 		 for_each_zone
>>>> 			if (zone->all_unreclaimable)
>>>> 				continue
>>>> 	if !all_unreclaimable(zonelist, sc)
>>>> 		return 1
>>>>
>>>> __alloc_pages_slowpath()
>>>> retry:
>>>> 	did_some_progress = do_try_to_free_pages(page)
>>>> 	...
>>>> 	if (!page && did_some_progress)
>>>> 		retry;
>>>>
>>>> Signed-off-by: Andrey Vagin <avagin@openvz.org>
>>>> ---
>>>>   mm/vmscan.c |    2 ++
>>>>   1 files changed, 2 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 6771ea7..1c056f7 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>>>>
>>>>   	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>>>>   			gfp_zone(sc->gfp_mask), sc->nodemask) {
>>>> +		if (zone->all_unreclaimable)
>>>> +			continue;
>>>>   		if (!populated_zone(zone))
>>>>   			continue;
>>>>   		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>> zone_reclaimable checks it. Isn't it enough?
>> I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
>> This two patches are enough.
> Sorry if I confused you.
> I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
> zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
> So too many page scanning in small lru pages makes the zone to unreclaimable zone.
>
> In all_unreclaimable, we calls zone_reclaimable to detect it.
> It's the same thing with your patch.
balance_pgdat() sets zone->all_unreclaimable, but the problem is that it is
cleared late.

The problem is that zone->all_unreclaimable == True, but
zone_reclaimable() returns True too.

zone->all_unreclaimable will be cleared in free_*_pages(), but this may be
too late. Allocating a single page into the page cache is enough for
zone_reclaimable() to return True again while zone->all_unreclaimable
remains True.
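
As far as I can see, the flag is only cleared in the page-free path,
roughly like this (mm/page_alloc.c of this era, from memory, so treat it
as a sketch):

	/* in free_one_page() and free_pcppages_bulk(): */
	spin_lock(&zone->lock);
	zone->all_unreclaimable = 0;
	zone->pages_scanned = 0;
	...

So until something is actually freed into the zone, all_unreclaimable
stays set, while a single page added back to the LRU already makes
zone_reclaimable() return True again.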
>>> Does the hang up really happen or see it by code review?
>> Yes. You can reproduce it for help the attached python program. It's
>> not very clever:)
>> It make the following actions in loop:
>> 1. fork
>> 2. mmap
>> 3. touch memory
>> 4. read memory
>> 5. munmmap
> It seems the test program makes fork bombs and memory hogging.
> If you applied this patch, the problem is gone?
Yes.
>>>> -- 
>>>> 1.7.1
>>>>
>> import sys, time, mmap, os
>> from subprocess import Popen, PIPE
>> import random
>>
>> global mem_size
>>
>> def info(msg):
>> 	pid = os.getpid()
>> 	print >> sys.stderr, "%s: %s" % (pid, msg)
>> 	sys.stderr.flush()
>>
>>
>>
>> def memory_loop(cmd = "a"):
>> 	"""
>> 	cmd may be:
>> 		c: check memory
>> 		else: touch memory
>> 	"""
>> 	c = 0
>> 	for j in xrange(0, mem_size):
>> 		if cmd == "c":
>> 			if f[j<<12] != chr(j % 255):
>> 				info("Data corruption")
>> 				sys.exit(1)
>> 		else:
>> 			f[j<<12] = chr(j % 255)
>>
>> while True:
>> 	pid = os.fork()
>> 	if (pid != 0):
>> 		mem_size = random.randint(0, 56 * 4096)
>> 		f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
>> 		memory_loop()
>> 		memory_loop("c")
>> 		f.close()
>



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-05 16:41       ` Andrew Vagin
@ 2011-03-05 17:07         ` Minchan Kim
  2011-03-07 21:58           ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-03-05 17:07 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrey Vagin, Andrew Morton, Mel Gorman, KOSAKI Motohiro,
	linux-mm, linux-kernel

On Sat, Mar 05, 2011 at 07:41:26PM +0300, Andrew Vagin wrote:
> On 03/05/2011 06:53 PM, Minchan Kim wrote:
> >On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
> >>On 03/05/2011 06:20 PM, Minchan Kim wrote:
> >>>On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> >>>>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> >>>>kernel may hang up, because shrink_zones() will do nothing, but
> >>>>all_unreclaimable() will say, that zone has reclaimable pages.
> >>>>
> >>>>do_try_to_free_pages()
> >>>>	shrink_zones()
> >>>>		 for_each_zone
> >>>>			if (zone->all_unreclaimable)
> >>>>				continue
> >>>>	if !all_unreclaimable(zonelist, sc)
> >>>>		return 1
> >>>>
> >>>>__alloc_pages_slowpath()
> >>>>retry:
> >>>>	did_some_progress = do_try_to_free_pages(page)
> >>>>	...
> >>>>	if (!page && did_some_progress)
> >>>>		retry;
> >>>>
> >>>>Signed-off-by: Andrey Vagin <avagin@openvz.org>
> >>>>---
> >>>>  mm/vmscan.c |    2 ++
> >>>>  1 files changed, 2 insertions(+), 0 deletions(-)
> >>>>
> >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>index 6771ea7..1c056f7 100644
> >>>>--- a/mm/vmscan.c
> >>>>+++ b/mm/vmscan.c
> >>>>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> >>>>
> >>>>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >>>>  			gfp_zone(sc->gfp_mask), sc->nodemask) {
> >>>>+		if (zone->all_unreclaimable)
> >>>>+			continue;
> >>>>  		if (!populated_zone(zone))
> >>>>  			continue;
> >>>>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >>>zone_reclaimable checks it. Isn't it enough?
> >>I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> >>This two patches are enough.
> >Sorry if I confused you.
> >I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
> >zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
> >So too many page scanning in small lru pages makes the zone to unreclaimable zone.
> >
> >In all_unreclaimable, we calls zone_reclaimable to detect it.
> >It's the same thing with your patch.
> balance_pgdat set zone->all_unreclaimable, but the problem is that
> it is cleaned late.

Yes. It can be delayed by the pcp lists, so (zone->all_unreclaimable == true)
is a false alarm: the zone still has free pages, and they can be returned
to the free list by drain_all_pages() on the next turn.
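
The "next turn" I mean is the drain-and-retry in the direct reclaim path,
roughly (from memory of the current code, so only a sketch):

	/* in __alloc_pages_direct_reclaim(): */
retry:
	page = get_page_from_freelist(...);
	/*
	 * If the allocation failed after direct reclaim, the pages may
	 * be pinned on the per-cpu lists; drain them and try once more.
	 */
	if (!page && !drained) {
		drain_all_pages();
		drained = true;
		goto retry;
	}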

> 
> The problem is that zone->all_unreclaimable = True, but
> zone_reclaimable() returns True too.

Why is it a problem?
If zone->all_unreclaimable gives a false alarm, we do need to check
it again with a zone_reclaimable() call.

If we believe a false alarm and give up reclaim, we may end up doing an
unnecessary oom kill.

> 
> zone->all_unreclaimable will be cleaned in free_*_pages, but this
> may be late. It is enough allocate one page from page cache, that
> zone_reclaimable() returns True and zone->all_unreclaimable becomes
> True.
> >>>Does the hang up really happen or see it by code review?
> >>Yes. You can reproduce it for help the attached python program. It's
> >>not very clever:)
> >>It make the following actions in loop:
> >>1. fork
> >>2. mmap
> >>3. touch memory
> >>4. read memory
> >>5. munmmap
> >It seems the test program makes fork bombs and memory hogging.
> >If you applied this patch, the problem is gone?
> Yes.

Hmm.. Although it solves the problem, I don't think it's a good idea to
depend on a false alarm and give up the retry.


> >>>>-- 
> >>>>1.7.1
> >>>>
> >>import sys, time, mmap, os
> >>from subprocess import Popen, PIPE
> >>import random
> >>
> >>global mem_size
> >>
> >>def info(msg):
> >>	pid = os.getpid()
> >>	print >> sys.stderr, "%s: %s" % (pid, msg)
> >>	sys.stderr.flush()
> >>
> >>
> >>
> >>def memory_loop(cmd = "a"):
> >>	"""
> >>	cmd may be:
> >>		c: check memory
> >>		else: touch memory
> >>	"""
> >>	c = 0
> >>	for j in xrange(0, mem_size):
> >>		if cmd == "c":
> >>			if f[j<<12] != chr(j % 255):
> >>				info("Data corruption")
> >>				sys.exit(1)
> >>		else:
> >>			f[j<<12] = chr(j % 255)
> >>
> >>while True:
> >>	pid = os.fork()
> >>	if (pid != 0):
> >>		mem_size = random.randint(0, 56 * 4096)
> >>		f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
> >>		memory_loop()
> >>		memory_loop("c")
> >>		f.close()
> >
> 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-05 17:07         ` Minchan Kim
@ 2011-03-07 21:58           ` Andrew Morton
  2011-03-07 23:45             ` Minchan Kim
  2011-03-08  0:44             ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2011-03-07 21:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Vagin, Andrey Vagin, Mel Gorman, KOSAKI Motohiro,
	linux-mm, linux-kernel

On Sun, 6 Mar 2011 02:07:59 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Sat, Mar 05, 2011 at 07:41:26PM +0300, Andrew Vagin wrote:
> > On 03/05/2011 06:53 PM, Minchan Kim wrote:
> > >On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
> > >>On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > >>>On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> > >>>>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> > >>>>kernel may hang up, because shrink_zones() will do nothing, but
> > >>>>all_unreclaimable() will say, that zone has reclaimable pages.
> > >>>>
> > >>>>do_try_to_free_pages()
> > >>>>	shrink_zones()
> > >>>>		 for_each_zone
> > >>>>			if (zone->all_unreclaimable)
> > >>>>				continue
> > >>>>	if !all_unreclaimable(zonelist, sc)
> > >>>>		return 1
> > >>>>
> > >>>>__alloc_pages_slowpath()
> > >>>>retry:
> > >>>>	did_some_progress = do_try_to_free_pages(page)
> > >>>>	...
> > >>>>	if (!page && did_some_progress)
> > >>>>		retry;
> > >>>>
> > >>>>Signed-off-by: Andrey Vagin <avagin@openvz.org>
> > >>>>---
> > >>>>  mm/vmscan.c |    2 ++
> > >>>>  1 files changed, 2 insertions(+), 0 deletions(-)
> > >>>>
> > >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >>>>index 6771ea7..1c056f7 100644
> > >>>>--- a/mm/vmscan.c
> > >>>>+++ b/mm/vmscan.c
> > >>>>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > >>>>
> > >>>>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > >>>>  			gfp_zone(sc->gfp_mask), sc->nodemask) {
> > >>>>+		if (zone->all_unreclaimable)
> > >>>>+			continue;
> > >>>>  		if (!populated_zone(zone))
> > >>>>  			continue;
> > >>>>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > >>>zone_reclaimable checks it. Isn't it enough?
> > >>I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> > >>This two patches are enough.
> > >Sorry if I confused you.
> > >I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
> > >zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
> > >So too many page scanning in small lru pages makes the zone to unreclaimable zone.
> > >
> > >In all_unreclaimable, we calls zone_reclaimable to detect it.
> > >It's the same thing with your patch.
> > balance_pgdat set zone->all_unreclaimable, but the problem is that
> > it is cleaned late.
> 
> Yes. It can be delayed by pcp so (zone->all_unreclaimable = true) is
> a false alram since zone have a free page and it can be returned 
> to free list by drain_all_pages in next turn.
> 
> > 
> > The problem is that zone->all_unreclaimable = True, but
> > zone_reclaimable() returns True too.
> 
> Why is it a problem? 
> If zone->all_unreclaimable gives a false alram, we does need to check
> it again by zone_reclaimable call.
> 
> If we believe a false alarm and give up the reclaim, maybe we have to make
> unnecessary oom kill.
> 
> > 
> > zone->all_unreclaimable will be cleaned in free_*_pages, but this
> > may be late. It is enough allocate one page from page cache, that
> > zone_reclaimable() returns True and zone->all_unreclaimable becomes
> > True.
> > >>>Does the hang up really happen or see it by code review?
> > >>Yes. You can reproduce it for help the attached python program. It's
> > >>not very clever:)
> > >>It make the following actions in loop:
> > >>1. fork
> > >>2. mmap
> > >>3. touch memory
> > >>4. read memory
> > >>5. munmmap
> > >It seems the test program makes fork bombs and memory hogging.
> > >If you applied this patch, the problem is gone?
> > Yes.
> 
> Hmm.. Although it solves the problem, I think it's not a good idea that
> depends on false alram and give up the retry.

Any alternative proposals?  We should get the livelock fixed if possible..


* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-07 21:58           ` Andrew Morton
@ 2011-03-07 23:45             ` Minchan Kim
  2011-03-09  5:37               ` KAMEZAWA Hiroyuki
  2011-03-08  0:44             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-03-07 23:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrew Vagin, Andrey Vagin, Mel Gorman, KOSAKI Motohiro,
	linux-mm, linux-kernel

On Tue, Mar 8, 2011 at 6:58 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sun, 6 Mar 2011 02:07:59 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Sat, Mar 05, 2011 at 07:41:26PM +0300, Andrew Vagin wrote:
>> > On 03/05/2011 06:53 PM, Minchan Kim wrote:
>> > >On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
>> > >>On 03/05/2011 06:20 PM, Minchan Kim wrote:
>> > >>>On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
>> > >>>>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
>> > >>>>kernel may hang up, because shrink_zones() will do nothing, but
>> > >>>>all_unreclaimable() will say, that zone has reclaimable pages.
>> > >>>>
>> > >>>>do_try_to_free_pages()
>> > >>>>        shrink_zones()
>> > >>>>                 for_each_zone
>> > >>>>                        if (zone->all_unreclaimable)
>> > >>>>                                continue
>> > >>>>        if !all_unreclaimable(zonelist, sc)
>> > >>>>                return 1
>> > >>>>
>> > >>>>__alloc_pages_slowpath()
>> > >>>>retry:
>> > >>>>        did_some_progress = do_try_to_free_pages(page)
>> > >>>>        ...
>> > >>>>        if (!page && did_some_progress)
>> > >>>>                retry;
>> > >>>>
>> > >>>>Signed-off-by: Andrey Vagin <avagin@openvz.org>
>> > >>>>---
>> > >>>>  mm/vmscan.c |    2 ++
>> > >>>>  1 files changed, 2 insertions(+), 0 deletions(-)
>> > >>>>
>> > >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > >>>>index 6771ea7..1c056f7 100644
>> > >>>>--- a/mm/vmscan.c
>> > >>>>+++ b/mm/vmscan.c
>> > >>>>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>> > >>>>
>> > >>>>        for_each_zone_zonelist_nodemask(zone, z, zonelist,
>> > >>>>                        gfp_zone(sc->gfp_mask), sc->nodemask) {
>> > >>>>+               if (zone->all_unreclaimable)
>> > >>>>+                       continue;
>> > >>>>                if (!populated_zone(zone))
>> > >>>>                        continue;
>> > >>>>                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>> > >>>zone_reclaimable checks it. Isn't it enough?
>> > >>I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
>> > >>This two patches are enough.
>> > >Sorry if I confused you.
>> > >I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
>> > >zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
>> > >So too many page scanning in small lru pages makes the zone to unreclaimable zone.
>> > >
>> > >In all_unreclaimable, we calls zone_reclaimable to detect it.
>> > >It's the same thing with your patch.
>> > balance_pgdat set zone->all_unreclaimable, but the problem is that
>> > it is cleaned late.
>>
>> Yes. It can be delayed by pcp so (zone->all_unreclaimable = true) is
>> a false alram since zone have a free page and it can be returned
>> to free list by drain_all_pages in next turn.
>>
>> >
>> > The problem is that zone->all_unreclaimable = True, but
>> > zone_reclaimable() returns True too.
>>
>> Why is it a problem?
>> If zone->all_unreclaimable gives a false alram, we does need to check
>> it again by zone_reclaimable call.
>>
>> If we believe a false alarm and give up the reclaim, maybe we have to make
>> unnecessary oom kill.
>>
>> >
>> > zone->all_unreclaimable will be cleaned in free_*_pages, but this
>> > may be late. It is enough allocate one page from page cache, that
>> > zone_reclaimable() returns True and zone->all_unreclaimable becomes
>> > True.
>> > >>>Does the hang up really happen or see it by code review?
>> > >>Yes. You can reproduce it for help the attached python program. It's
>> > >>not very clever:)
>> > >>It make the following actions in loop:
>> > >>1. fork
>> > >>2. mmap
>> > >>3. touch memory
>> > >>4. read memory
>> > >>5. munmmap
>> > >It seems the test program makes fork bombs and memory hogging.
>> > >If you applied this patch, the problem is gone?
>> > Yes.
>>
>> Hmm.. Although it solves the problem, I think it's not a good idea that
>> depends on false alram and give up the retry.
>
> Any alternative proposals?  We should get the livelock fixed if possible..
>

And we should avoid unnecessary OOM kill if possible.

I think the problem is caused by (zone->pages_scanned <
zone_reclaimable_pages(zone) * 6). I am not sure (* 6) is the best value;
it would be rather big on recent big-DRAM machines.
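(As a rough example: with 8GB of reclaimable pages, i.e. about 2 million
4KB pages, a zone is only declared unreclaimable after some 12 million
pages have been scanned without a single page being freed.)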

I think it is a trade-off between latency and OOM kills.
If we decrease the magic value, we may prevent the near-livelock but cause
unnecessary OOM kills.

And I think zone_reclaimable() is not fair. For example, it takes a lot
of scanning, maybe a very long time, to move a zone from the reclaimable
state to the unreclaimable state, but freeing just a few pages moves it
from unreclaimable back to reclaimable very easily. So a lot of painful
reclaiming is needed before a zone finally changes from reclaimable to
unreclaimable, and that affects latency very much.

Maybe we need a smarter zone_reclaimable() which adapts to memory pressure.

-- 
Kind regards,
Minchan Kim


* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-07 21:58           ` Andrew Morton
  2011-03-07 23:45             ` Minchan Kim
@ 2011-03-08  0:44             ` KAMEZAWA Hiroyuki
  2011-03-08  3:06               ` KOSAKI Motohiro
  2011-03-08  8:12               ` Andrew Vagin
  1 sibling, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-08  0:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Andrew Vagin, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

On Mon, 7 Mar 2011 13:58:31 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sun, 6 Mar 2011 02:07:59 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > On Sat, Mar 05, 2011 at 07:41:26PM +0300, Andrew Vagin wrote:
> > > On 03/05/2011 06:53 PM, Minchan Kim wrote:
> > > >On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
> > > >>On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > > >>>On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> > > >>>>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> > > >>>>kernel may hang up, because shrink_zones() will do nothing, but
> > > >>>>all_unreclaimable() will say, that zone has reclaimable pages.
> > > >>>>
> > > >>>>do_try_to_free_pages()
> > > >>>>	shrink_zones()
> > > >>>>		 for_each_zone
> > > >>>>			if (zone->all_unreclaimable)
> > > >>>>				continue
> > > >>>>	if !all_unreclaimable(zonelist, sc)
> > > >>>>		return 1
> > > >>>>
> > > >>>>__alloc_pages_slowpath()
> > > >>>>retry:
> > > >>>>	did_some_progress = do_try_to_free_pages(page)
> > > >>>>	...
> > > >>>>	if (!page&&   did_some_progress)
> > > >>>>		retry;
> > > >>>>
> > > >>>>Signed-off-by: Andrey Vagin<avagin@openvz.org>
> > > >>>>---
> > > >>>>  mm/vmscan.c |    2 ++
> > > >>>>  1 files changed, 2 insertions(+), 0 deletions(-)
> > > >>>>
> > > >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > >>>>index 6771ea7..1c056f7 100644
> > > >>>>--- a/mm/vmscan.c
> > > >>>>+++ b/mm/vmscan.c
> > > >>>>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > > >>>>
> > > >>>>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > >>>>  			gfp_zone(sc->gfp_mask), sc->nodemask) {
> > > >>>>+		if (zone->all_unreclaimable)
> > > >>>>+			continue;
> > > >>>>  		if (!populated_zone(zone))
> > > >>>>  			continue;
> > > >>>>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > > >>>zone_reclaimable checks it. Isn't it enough?
> > > >>I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> > > >>This two patches are enough.
> > > >Sorry if I confused you.
> > > >I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
> > > >zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
> > > >So too many page scanning in small lru pages makes the zone to unreclaimable zone.
> > > >
> > > >In all_unreclaimable, we calls zone_reclaimable to detect it.
> > > >It's the same thing with your patch.
> > > balance_pgdat set zone->all_unreclaimable, but the problem is that
> > > it is cleaned late.
> > 
> > Yes. It can be delayed by pcp so (zone->all_unreclaimable = true) is
> > a false alram since zone have a free page and it can be returned 
> > to free list by drain_all_pages in next turn.
> > 
> > > 
> > > The problem is that zone->all_unreclaimable = True, but
> > > zone_reclaimable() returns True too.
> > 
> > Why is it a problem? 
> > If zone->all_unreclaimable gives a false alram, we does need to check
> > it again by zone_reclaimable call.
> > 
> > If we believe a false alarm and give up the reclaim, maybe we have to make
> > unnecessary oom kill.
> > 
> > > 
> > > zone->all_unreclaimable will be cleaned in free_*_pages, but this
> > > may be late. It is enough allocate one page from page cache, that
> > > zone_reclaimable() returns True and zone->all_unreclaimable becomes
> > > True.
> > > >>>Does the hang up really happen or see it by code review?
> > > >>Yes. You can reproduce it for help the attached python program. It's
> > > >>not very clever:)
> > > >>It make the following actions in loop:
> > > >>1. fork
> > > >>2. mmap
> > > >>3. touch memory
> > > >>4. read memory
> > > >>5. munmmap
> > > >It seems the test program makes fork bombs and memory hogging.
> > > >If you applied this patch, the problem is gone?
> > > Yes.
> > 
> > Hmm.. Although it solves the problem, I think it's not a good idea that
> > depends on false alram and give up the retry.
> 
> Any alternative proposals?  We should get the livelock fixed if possible..

I agree with Minchan and can't think this is a real fix....
Andrey, I'm now trying your fix, and it seems your fix for the oom-killer,
'skip-zombie-process', works well enough for my environment.

What is your environment? Number of CPUs? Architecture? Size of memory?



Thanks,
-Kame


* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-08  0:44             ` KAMEZAWA Hiroyuki
@ 2011-03-08  3:06               ` KOSAKI Motohiro
  2011-03-08 19:02                 ` avagin
  2011-03-08  8:12               ` Andrew Vagin
  1 sibling, 1 reply; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-03-08  3:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Andrew Morton, Minchan Kim, Andrew Vagin,
	Andrey Vagin, Mel Gorman, linux-mm, linux-kernel

> > > Hmm.. Although it solves the problem, I think it's not a good idea that
> > > depends on false alram and give up the retry.
> > 
> > Any alternative proposals?  We should get the livelock fixed if possible..
> 
> I agree with Minchan and can't think this is a real fix....
> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> 'skip-zombie-process' works enough good for my environ.
> 
> What is your enviroment ? number of cpus ? architecture ? size of memory ?

Me too. 'skip-zombie-process V1' works fine, and I didn't see this patch
improve the OOM situation.

Also, the test program is purely a fork bomb, and our oom-killer has not
been a silver bullet for fork bombs for a very long time. That is, the
oom-killer sends SIGKILL and starts to kill the victim process, but it
doesn't prevent new memory-hogging tasks from being created. Therefore we
have no guarantee of winning the race between process exit and process
creation.

*IF* we really need to care about the fork-bomb issue, we need to write a
completely new VM feature.




* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-08  0:44             ` KAMEZAWA Hiroyuki
  2011-03-08  3:06               ` KOSAKI Motohiro
@ 2011-03-08  8:12               ` Andrew Vagin
  2011-03-09  6:06                 ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 58+ messages in thread
From: Andrew Vagin @ 2011-03-08  8:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Minchan Kim, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

Hi, All
> I agree with Minchan and can't think this is a real fix....
> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> 'skip-zombie-process' works enough good for my environ.
>
> What is your enviroment ? number of cpus ? architecture ? size of memory ?
Processor: AMD Phenom(tm) II X6 1055T (six-core)
RAM: 8 GB
RHEL6, x86_64. This host doesn't have swap.

It hangs fast. Tomorrow I can send the state of the processes, if that
would be interesting for you. With my patch the kernel works fine. I added
debugging and found that it hangs in the case described above.
I suppose that my patch may be incorrect, but the problem exists and we
should do something.



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-08  3:06               ` KOSAKI Motohiro
@ 2011-03-08 19:02                 ` avagin
  2011-03-09  5:52                   ` KAMEZAWA Hiroyuki
  2011-03-09  6:17                   ` KOSAKI Motohiro
  0 siblings, 2 replies; 58+ messages in thread
From: avagin @ 2011-03-08 19:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Minchan Kim, Andrey Vagin,
	Mel Gorman, linux-mm, linux-kernel

On 03/08/2011 06:06 AM, KOSAKI Motohiro wrote:
>>>> Hmm.. Although it solves the problem, I think it's not a good idea that
>>>> depends on false alram and give up the retry.
>>>
>>> Any alternative proposals?  We should get the livelock fixed if possible..
>>
>> I agree with Minchan and can't think this is a real fix....
>> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
>> 'skip-zombie-process' works enough good for my environ.
>>
>> What is your enviroment ? number of cpus ? architecture ? size of memory ?
>
> me too. 'skip-zombie-process V1' work fine. and I didn't seen this patch
> improve oom situation.
>
> And, The test program is purely fork bomb. Our oom-killer is not silver
> bullet for fork bomb from very long time ago. That said, oom-killer send
> SIGKILL and start to kill the victim process. But, it doesn't prevent
> to be created new memory hogging tasks. Therefore we have no gurantee
> to win process exiting and creating race.

I think a livelock is a bug, even if it's provoked by fork bombs.

And now I want to say a few words about zone->all_unreclaimable. I think
this flag is "conservative": it is set when the situation is bad and it's
unset when the situation gets better. If we have only a small number of
reclaimable pages, the situation is still bad. What do you mean when you
say that the kernel is alive? If we have one reclaimable page, is the
kernel alive? Yes, it can work; it will generate many page faults and do
something, but anyone would say that it is more dead than alive.

Try to look at it from my point of view: with the patch, the behaviour
will be correct and the kernel will be more alive.

Excuse me if I'm mistaken...


>
> *IF* we really need to care fork bomb issue, we need to write completely
> new VM feature.
>
>



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-07 23:45             ` Minchan Kim
@ 2011-03-09  5:37               ` KAMEZAWA Hiroyuki
  2011-03-09  5:43                 ` KAMEZAWA Hiroyuki
  2011-03-10  6:58                 ` Minchan Kim
  0 siblings, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-09  5:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andrew Vagin, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

On Tue, 8 Mar 2011 08:45:51 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Tue, Mar 8, 2011 at 6:58 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Sun, 6 Mar 2011 02:07:59 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> > Any alternative proposals?  We should get the livelock fixed if possible..
> >
> 
> And we should avoid unnecessary OOM kill if possible.
> 
> I think the problem is caused by (zone->pages_scanned <
> zone_reclaimable_pages(zone) * 6). I am not sure (* 6) is a best. It
> would be rather big on recent big DRAM machines.
> 

It means three full scans from the highest priority to the lowest
without being able to free any pages. I think big-memory machines tend to
have more CPUs, so I don't think it's too big.
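(One sweep from DEF_PRIORITY=12 down to priority 0 scans roughly
lru/2^12 + lru/2^11 + ... + lru, i.e. about 2*lru pages, so the *6 cutoff
corresponds to about three complete sweeps; that is assuming my reading of
the per-priority scan targets is right.)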

> I think it is a trade-off between latency and OOM kill.
> If we decrease the magic value, maybe we should prevent the almost
> livelock but happens unnecessary OOM kill.
> 

Hmm, should I add a sacrifice feature to cgroup, where some signal (SIGINT?)
is sent by the kernel when it detects that system memory is running short?
(For example, when a full LRU scan has been done in a zone, a notifier
fires and SIGINT is sent.)

> And I think zone_reclaimable not fair.
> For example, too many scanning makes reclaimable state to
> unreclaimable state. Maybe it takes a very long time. But just some
> page free makes unreclaimable state to reclaimabe with very easy. So
> we need much painful reclaiming for changing reclaimable state with
> unreclaimabe state. it would affect latency very much.
> 
> Maybe we need more smart zone_reclaimabe which is adaptive with memory pressure.
> 
I agree.

Thanks,
-Kame



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-09  5:37               ` KAMEZAWA Hiroyuki
@ 2011-03-09  5:43                 ` KAMEZAWA Hiroyuki
  2011-03-10  6:58                 ` Minchan Kim
  1 sibling, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-09  5:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Andrew Morton, Andrew Vagin, Andrey Vagin,
	Mel Gorman, KOSAKI Motohiro, linux-mm, linux-kernel

On Wed, 9 Mar 2011 14:37:04 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 8 Mar 2011 08:45:51 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> Hmm, should I support a sacrifice feature 'some signal(SIGINT?) will be sent by
> the kernel when it detects system memory is in short' in cgroup ?
> (For example, if full LRU scan is done in a zone, notifier
>  works and SIGINT will be sent.)
> 

Sorry, this sounds like "mem_notify" ;), Kosaki-san's old work.

I think the "mem_notify" functionality itself will meet no objections, but
the implementation details are the problem.... Shouldn't we try it again?

Thanks,
-Kame



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-08 19:02                 ` avagin
@ 2011-03-09  5:52                   ` KAMEZAWA Hiroyuki
  2011-03-09  6:17                   ` KOSAKI Motohiro
  1 sibling, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-09  5:52 UTC (permalink / raw)
  To: avagin
  Cc: KOSAKI Motohiro, Andrew Morton, Minchan Kim, Andrey Vagin,
	Mel Gorman, linux-mm, linux-kernel

On Tue, 08 Mar 2011 22:02:27 +0300
"avagin@gmail.com" <avagin@gmail.com> wrote:

> On 03/08/2011 06:06 AM, KOSAKI Motohiro wrote:
> >>>> Hmm.. Although it solves the problem, I think it's not a good idea that
> >>>> depends on false alram and give up the retry.
> >>>
> >>> Any alternative proposals?  We should get the livelock fixed if possible..
> >>
> >> I agree with Minchan and can't think this is a real fix....
> >> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> >> 'skip-zombie-process' works enough good for my environ.
> >>
> >> What is your enviroment ? number of cpus ? architecture ? size of memory ?
> >
> > me too. 'skip-zombie-process V1' work fine. and I didn't seen this patch
> > improve oom situation.
> >
> > And, The test program is purely fork bomb. Our oom-killer is not silver
> > bullet for fork bomb from very long time ago. That said, oom-killer send
> > SIGKILL and start to kill the victim process. But, it doesn't prevent
> > to be created new memory hogging tasks. Therefore we have no gurantee
> > to win process exiting and creating race.
> 
> I think a live-lock is a bug, even if it's provoked by fork bomds.
> 

I tried to write a fork-bomb detector in the oom-kill layer, but now I
think it should cooperate with do_fork(). IOW, some fork() calls should
return -ENOMEM under OOM conditions.

I'd like to try something, but if you have an idea, please go ahead.
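
Purely as an illustration of the idea (nothing like this exists in the
tree; the predicate below is made up):

	/* hypothetical early check in copy_process(): */
	if (oom_killer_in_progress())	/* made-up predicate */
		return ERR_PTR(-ENOMEM);	/* refuse new tasks while the
						 * OOM killer is at work */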


> And now I want say some words about zone->all_unreclaimable. I think 
> this flag is "conservative". It is set when situation is bad and it's 
> unset when situation get better. If we have a small number of 
> reclaimable  pages, the situation is still bad. What do you mean, when 
> say that kernel is alive? If we have one reclaimable page, is the kernel 
> alive? Yes, it can work, it will generate many page faults and do 
> something, but anyone say that it is more dead than alive.
> 
> Try to look at it from my point of view. The patch will be correct and 
> the kernel will be more alive.
> 
> Excuse me, If I'm mistaken...
> 

Maybe some interface more casual than oom-kill should be provided.
I wonder whether I can add a memory-reclaim priority to the memory cgroup
and allow control of page-fault latency per application...
Maybe "soft_limit" for memcg, which is implemented now, works to some extent.

Thanks,
-Kame



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-08  8:12               ` Andrew Vagin
@ 2011-03-09  6:06                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-09  6:06 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrew Morton, Minchan Kim, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

On Tue, 08 Mar 2011 11:12:22 +0300
Andrew Vagin <avagin@gmail.com> wrote:

> Hi, All
> > I agree with Minchan and can't think this is a real fix....
> > Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> > 'skip-zombie-process' works enough good for my environ.
> >
> > What is your enviroment ? number of cpus ? architecture ? size of memory ?
> Processort: AMD Phenom(tm) II X6 1055T Processor (six-core)
> Ram: 8Gb
> RHEL6, x86_64. This host doesn't have swap.
> 
Ok, thanks. "NO SWAP" is a big piece of information ;)

> It hangs up fast. Tomorrow I will have to send a processes state, if it 
> will be interesting for you. With my patch the kernel work fine. I added 
> debug and found that it hangs up in the described case.
> I suppose that my patch may be incorrect, but the problem exists and we 
> should do something.
>

Thanks,
-Kame



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-08 19:02                 ` avagin
  2011-03-09  5:52                   ` KAMEZAWA Hiroyuki
@ 2011-03-09  6:17                   ` KOSAKI Motohiro
  2011-03-10 14:08                     ` KOSAKI Motohiro
  1 sibling, 1 reply; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-03-09  6:17 UTC (permalink / raw)
  To: avagin
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Minchan Kim,
	Andrey Vagin, Mel Gorman, linux-mm, linux-kernel

> On 03/08/2011 06:06 AM, KOSAKI Motohiro wrote:
> >>>> Hmm.. Although it solves the problem, I think it's not a good idea that
> >>>> depends on false alram and give up the retry.
> >>>
> >>> Any alternative proposals?  We should get the livelock fixed if possible..
> >>
> >> I agree with Minchan and can't think this is a real fix....
> >> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> >> 'skip-zombie-process' works enough good for my environ.
> >>
> >> What is your enviroment ? number of cpus ? architecture ? size of memory ?
> >
> > me too. 'skip-zombie-process V1' work fine. and I didn't seen this patch
> > improve oom situation.
> >
> > And, The test program is purely fork bomb. Our oom-killer is not silver
> > bullet for fork bomb from very long time ago. That said, oom-killer send
> > SIGKILL and start to kill the victim process. But, it doesn't prevent
> > to be created new memory hogging tasks. Therefore we have no gurantee
> > to win process exiting and creating race.
> 
> I think a live-lock is a bug, even if it's provoked by fork bomds.
> 
> And now I want say some words about zone->all_unreclaimable. I think 
> this flag is "conservative". It is set when situation is bad and it's 
> unset when situation get better. If we have a small number of 
> reclaimable  pages, the situation is still bad. What do you mean, when 
> say that kernel is alive? If we have one reclaimable page, is the kernel 
> alive? Yes, it can work, it will generate many page faults and do 
> something, but anyone say that it is more dead than alive.
> 
> Try to look at it from my point of view. The patch will be correct and 
> the kernel will be more alive.
> 
> Excuse me, If I'm mistaken...

Hi, 

Hmmm...
If I had observed the same behaviour with your patch, I would support your
opinion, but I didn't, so now I'm curious why we reached different
conclusions. Tomorrow I'll try to construct a test environment that
reproduces your system.

Unfortunately, zone->all_unreclaimable is an unreliable value while
hibernation is in progress (the flag is only set by kswapd, which is
frozen then), so I doubt your current patch is acceptable as-is. But I'm
not against making an alternative if we can observe the same phenomenon.

At minimum, I also dislike the kernel hang-up issue.

Thanks.



* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-09  5:37               ` KAMEZAWA Hiroyuki
  2011-03-09  5:43                 ` KAMEZAWA Hiroyuki
@ 2011-03-10  6:58                 ` Minchan Kim
  2011-03-10 23:58                   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-03-10  6:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrew Vagin, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

Hi Kame,

Sorry for the late response.
I only had time to test this issue briefly, because I am very busy these
days. The issue is interesting to me, so I hope to take enough time for
testing when I can. I should find out the root cause of the livelock.

I will answer your comment after that. :)
Thanks!

On Wed, Mar 9, 2011 at 2:37 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 8 Mar 2011 08:45:51 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Tue, Mar 8, 2011 at 6:58 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> > On Sun, 6 Mar 2011 02:07:59 +0900
>> > Minchan Kim <minchan.kim@gmail.com> wrote:
>> > Any alternative proposals?  We should get the livelock fixed if possible..
>> >
>>
>> And we should avoid unnecessary OOM kill if possible.
>>
>> I think the problem is caused by (zone->pages_scanned <
>> zone_reclaimable_pages(zone) * 6). I am not sure (* 6) is a best. It
>> would be rather big on recent big DRAM machines.
>>
>
> It means 3 times full-scan from the highest priority to the lowest
> and cannot freed any pages. I think big memory machine tend to have
> more cpus, so don't think it's big.
>
>> I think it is a trade-off between latency and OOM kill.
>> If we decrease the magic value, maybe we should prevent the almost
>> livelock but happens unnecessary OOM kill.
>>
>
> Hmm, should I support a sacrifice feature 'some signal(SIGINT?) will be sent by
> the kernel when it detects system memory is in short' in cgroup ?
> (For example, if full LRU scan is done in a zone, notifier
>  works and SIGINT will be sent.)
>
>> And I think zone_reclaimable not fair.
>> For example, too many scanning makes reclaimable state to
>> unreclaimable state. Maybe it takes a very long time. But just some
>> page free makes unreclaimable state to reclaimabe with very easy. So
>> we need much painful reclaiming for changing reclaimable state with
>> unreclaimabe state. it would affect latency very much.
>>
>> Maybe we need more smart zone_reclaimabe which is adaptive with memory pressure.
>>
> I agree.
>
> Thanks,
> -Kame
>
>



-- 
Kind regards,
Minchan Kim


* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-09  6:17                   ` KOSAKI Motohiro
@ 2011-03-10 14:08                     ` KOSAKI Motohiro
  0 siblings, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-03-10 14:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, avagin, KAMEZAWA Hiroyuki, Andrew Morton,
	Minchan Kim, Andrey Vagin, Mel Gorman, linux-mm, linux-kernel

> Hi, 
> 
> Hmmm...
> If I could observed your patch, I did support your opinion. but I didn't. so, now I'm 
> curious why we got the different conclusion. tommorow, I'll try to construct a test 
> environment to reproduce your system.

Hm,

the following two patches seem to have a bad interaction: the former makes
the dying task SCHED_FIFO at OOM time, the latter makes the CPU a 100%
occupied busy loop if the LRU is really tight.
Of course, I need to run many more tests. I'll dig into it more this
weekend (maybe).


commit 93b43fa55088fe977503a156d1097cc2055449a2
Author: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Date:   Mon Aug 9 17:19:41 2010 -0700

    oom: give the dying task a higher priority


commit 0e093d99763eb4cea09f8ca4f1d01f34e121d10b
Author: Mel Gorman <mel@csn.ul.ie>
Date:   Tue Oct 26 14:21:45 2010 -0700

    writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant conge





* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-10  6:58                 ` Minchan Kim
@ 2011-03-10 23:58                   ` KAMEZAWA Hiroyuki
  2011-03-11  0:18                     ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-10 23:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andrew Vagin, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

On Thu, 10 Mar 2011 15:58:29 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi Kame,
> 
> Sorry for late response.
> I had a time to test this issue shortly because these day I am very busy.
> This issue was interesting to me.
> So I hope taking a time for enough testing when I have a time.
> I should find out root cause of livelock.
> 

Thanks. Kosaki-san and I reproduced the bug on a swapless system.
Now Kosaki-san is digging; he found an issue with the scheduler boost at
OOM time and a lack of enough "wait" in vmscan.c.

I myself made a patch like the attached one. It works well for returning
TRUE from all_unreclaimable(), but the livelock (deadlock?) still happens.
I suspect vmscan itself isn't the key to fixing this issue.
So I'd like to wait for Kosaki-san's answer ;)

I'm now wondering how to catch a fork bomb and stop it (without using
cgroup). I think the problem is that the fork bomb is faster than killall...

Thanks,
-Kame
==

This is just a debug patch.

---
 mm/vmscan.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 4 deletions(-)

Index: mmotm-0303/mm/vmscan.c
===================================================================
--- mmotm-0303.orig/mm/vmscan.c
+++ mmotm-0303/mm/vmscan.c
@@ -1983,9 +1983,55 @@ static void shrink_zones(int priority, s
 	}
 }
 
-static bool zone_reclaimable(struct zone *zone)
+static bool zone_seems_empty(struct zone *zone, struct scan_control *sc)
 {
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+	unsigned long nr, wmark, free, isolated, lru;
+
+	/*
+	 * If scanned, zone->pages_scanned is incremented and this can
+ 	 * trigger OOM.
+ 	 */
+	if (sc->nr_scanned)
+		return false;
+
+	free = zone_page_state(zone, NR_FREE_PAGES);
+	isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+	if (nr_swap_pages)
+		isolated += zone_page_state(zone, NR_ISOLATED_ANON);
+
+	/* If we cannot scan, don't count LRU pages. */
+	if (!zone->all_unreclaimable) {
+		lru = zone_page_state(zone, NR_ACTIVE_FILE);
+		lru += zone_page_state(zone, NR_INACTIVE_FILE);
+		if (nr_swap_pages) {
+			lru += zone_page_state(zone, NR_ACTIVE_ANON);
+			lru += zone_page_state(zone, NR_INACTIVE_ANON);
+		}
+	} else
+		lru = 0;
+	nr = free + isolated + lru;
+	wmark = min_wmark_pages(zone);
+	wmark += zone->lowmem_reserve[gfp_zone(sc->gfp_mask)];
+	wmark += 1 << sc->order;
+	printk("thread %d/%ld all %d scanned %ld pages %ld/%ld/%ld/%ld/%ld/%ld\n",
+		current->pid, sc->nr_scanned, zone->all_unreclaimable,
+		zone->pages_scanned,
+		nr,free,isolated,lru,
+		zone_reclaimable_pages(zone), wmark);
+	/*
+	 * In some cases (especially noswap), almost all page cache has been
+	 * paged out and we'll see reclaimable+free pages smaller than
+	 * zone->min. In this case, we cannot expect any recovery other
+	 * than OOM-KILL. We can't reclaim enough memory for usual tasks.
+	 */
+
+	return nr <= wmark;
+}
+
+static bool zone_reclaimable(struct zone *zone, struct scan_control *sc)
+{
+	/* zone_reclaimable_pages() can return 0, we need <= */
+	return zone->pages_scanned <= zone_reclaimable_pages(zone) * 6;
 }
 
 /*
@@ -2006,11 +2052,15 @@ static bool all_unreclaimable(struct zon
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (zone_reclaimable(zone)) {
+		if (zone_seems_empty(zone, sc))
+			continue;
+		if (zone_reclaimable(zone, sc)) {
 			all_unreclaimable = false;
 			break;
 		}
 	}
+	if (all_unreclaimable)
+		printk("all_unreclaimable() returns TRUE\n");
 
 	return all_unreclaimable;
 }
@@ -2456,7 +2506,7 @@ loop_again:
 			if (zone->all_unreclaimable)
 				continue;
 			if (!compaction && nr_slab == 0 &&
-			    !zone_reclaimable(zone))
+			    !zone_reclaimable(zone, &sc))
 				zone->all_unreclaimable = 1;
 			/*
 			 * If we've done a decent amount of scanning and
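
As a worked example of the zone_seems_empty() threshold above (numbers are
illustrative assumptions, not taken from a real log): for an order-0
allocation against a zone with min_wmark_pages() == 128 and no
lowmem_reserve, wmark = 128 + 0 + 1 = 129 pages. If free + isolated + lru
adds up to only 100 pages, nr <= wmark holds, the zone is skipped, and it
no longer prevents all_unreclaimable() from returning TRUE.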


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-10 23:58                   ` KAMEZAWA Hiroyuki
@ 2011-03-11  0:18                     ` Minchan Kim
  2011-03-11  6:08                       ` avagin
  0 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-03-11  0:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrew Vagin, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

On Fri, Mar 11, 2011 at 8:58 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 10 Mar 2011 15:58:29 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi Kame,
>>
>> Sorry for late response.
>> I had a time to test this issue shortly because these day I am very busy.
>> This issue was interesting to me.
>> So I hope taking a time for enough testing when I have a time.
>> I should find out root cause of livelock.
>>
>
> Thanks. I and Kosaki-san reproduced the bug with swapless system.
> Now, Kosaki-san is digging and found some issue with scheduler boost at OOM
> and lack of enough "wait" in vmscan.c.
>
> I myself made patch like attached one. This works well for returning TRUE at
> all_unreclaimable() but livelock(deadlock?) still happens.

I saw the deadlock.
From my quick debugging it seems to be caused by the following code, but
I'm not sure. I need to investigate further but don't have time now. :(


                 * Note: this may have a chance of deadlock if it gets
                 * blocked waiting for another task which itself is waiting
                 * for memory. Is there a better alternative?
                 */
                if (test_tsk_thread_flag(p, TIF_MEMDIE))
                        return ERR_PTR(-1UL);
We would wait forever for that task to die, without selecting another victim.
If that's right, it's a known BUG and we have had no choice until now. Hmm.
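
Roughly, the cycle I suspect looks like this (a simplified sketch from my
reading of the code, not a verified trace):

out_of_memory()
	select_bad_process()
		if (test_tsk_thread_flag(p, TIF_MEMDIE))
			return ERR_PTR(-1UL);	/* keep waiting for the victim */
	/* -1UL: nobody new is killed, no progress is made */
__alloc_pages_slowpath()
	retry;	/* loops forever if the victim never exits */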

> I wonder vmscan itself isn't a key for fixing issue.

I agree.

> Then, I'd like to wait for Kosaki-san's answer ;)

Me, too. :)

>
> I'm now wondering how to catch fork-bomb and stop it (without using cgroup).

Yes. Fork throttling without cgroup is very important.
And as an off-topic note, the mem_notify without memcontrol that you
mentioned is important to embedded people, I guess.

> I think the problem is that fork-bomb is faster than killall...

And the deadlock problem I mentioned.

>
> Thanks,
> -Kame

Thanks for the investigation, Kame.

> ==
>
> This is just a debug patch.
>
> ---
>  mm/vmscan.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 54 insertions(+), 4 deletions(-)
>
> Index: mmotm-0303/mm/vmscan.c
> ===================================================================
> --- mmotm-0303.orig/mm/vmscan.c
> +++ mmotm-0303/mm/vmscan.c
> @@ -1983,9 +1983,55 @@ static void shrink_zones(int priority, s
>        }
>  }
>
> -static bool zone_reclaimable(struct zone *zone)
> +static bool zone_seems_empty(struct zone *zone, struct scan_control *sc)
>  {
> -       return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +       unsigned long nr, wmark, free, isolated, lru;
> +
> +       /*
> +        * If scanned, zone->pages_scanned is incremented and this can
> +        * trigger OOM.
> +        */
> +       if (sc->nr_scanned)
> +               return false;
> +
> +       free = zone_page_state(zone, NR_FREE_PAGES);
> +       isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +       if (nr_swap_pages)
> +               isolated += zone_page_state(zone, NR_ISOLATED_ANON);
> +
> +       /* In we cannot do scan, don't count LRU pages. */
> +       if (!zone->all_unreclaimable) {
> +               lru = zone_page_state(zone, NR_ACTIVE_FILE);
> +               lru += zone_page_state(zone, NR_INACTIVE_FILE);
> +               if (nr_swap_pages) {
> +                       lru += zone_page_state(zone, NR_ACTIVE_ANON);
> +                       lru += zone_page_state(zone, NR_INACTIVE_ANON);
> +               }
> +       } else
> +               lru = 0;
> +       nr = free + isolated + lru;
> +       wmark = min_wmark_pages(zone);
> +       wmark += zone->lowmem_reserve[gfp_zone(sc->gfp_mask)];
> +       wmark += 1 << sc->order;
> +       printk("thread %d/%ld all %d scanned %ld pages %ld/%ld/%ld/%ld/%ld/%ld\n",
> +               current->pid, sc->nr_scanned, zone->all_unreclaimable,
> +               zone->pages_scanned,
> +               nr,free,isolated,lru,
> +               zone_reclaimable_pages(zone), wmark);
> +       /*
> +        * In some case (especially noswap), almost all page cache are paged out
> +        * and we'll see the amount of reclaimable+free pages is smaller than
> +        * zone->min. In this case, we canoot expect any recovery other
> +        * than OOM-KILL. We can't reclaim memory enough for usual tasks.
> +        */
> +
> +       return nr <= wmark;
> +}
> +
> +static bool zone_reclaimable(struct zone *zone, struct scan_control *sc)
> +{
> +       /* zone_reclaimable_pages() can return 0, we need <= */
> +       return zone->pages_scanned <= zone_reclaimable_pages(zone) * 6;
>  }
>
>  /*
> @@ -2006,11 +2052,15 @@ static bool all_unreclaimable(struct zon
>                        continue;
>                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                        continue;
> -               if (zone_reclaimable(zone)) {
> +               if (zone_seems_empty(zone, sc))
> +                       continue;
> +               if (zone_reclaimable(zone, sc)) {
>                        all_unreclaimable = false;
>                        break;
>                }
>        }
> +       if (all_unreclaimable)
> +               printk("all_unreclaimable() returns TRUE\n");
>
>        return all_unreclaimable;
>  }
> @@ -2456,7 +2506,7 @@ loop_again:
>                        if (zone->all_unreclaimable)
>                                continue;
>                        if (!compaction && nr_slab == 0 &&
> -                           !zone_reclaimable(zone))
> +                           !zone_reclaimable(zone, &sc))
>                                zone->all_unreclaimable = 1;
>                        /*
>                         * If we've done a decent amount of scanning and
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-11  0:18                     ` Minchan Kim
@ 2011-03-11  6:08                       ` avagin
  2011-03-14  1:03                         ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: avagin @ 2011-03-11  6:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

On 03/11/2011 03:18 AM, Minchan Kim wrote:
> On Fri, Mar 11, 2011 at 8:58 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com>  wrote:
>> On Thu, 10 Mar 2011 15:58:29 +0900
>> Minchan Kim<minchan.kim@gmail.com>  wrote:
>>
>>> Hi Kame,
>>>
>>> Sorry for late response.
>>> I had a time to test this issue shortly because these day I am very busy.
>>> This issue was interesting to me.
>>> So I hope taking a time for enough testing when I have a time.
>>> I should find out root cause of livelock.
>>>
>>
>> Thanks. I and Kosaki-san reproduced the bug with swapless system.
>> Now, Kosaki-san is digging and found some issue with scheduler boost at OOM
>> and lack of enough "wait" in vmscan.c.
>>
>> I myself made patch like attached one. This works well for returning TRUE at
>> all_unreclaimable() but livelock(deadlock?) still happens.
>
> I saw the deadlock.
> It seems to happen by following code by my quick debug but not sure. I
> need to investigate further but don't have a time now. :(
>
>
>                   * Note: this may have a chance of deadlock if it gets
>                   * blocked waiting for another task which itself is waiting
>                   * for memory. Is there a better alternative?
>                   */
>                  if (test_tsk_thread_flag(p, TIF_MEMDIE))
>                          return ERR_PTR(-1UL);
> It would be wait to die the task forever without another victim selection.
> If it's right, It's a known BUG and we have no choice until now. Hmm.


I fixed this bug too and sent the patch "mm: skip zombie in OOM-killer".

http://groups.google.com/group/linux.kernel/browse_thread/thread/b9c6ddf34d1671ab/2941e1877ca4f626?lnk=raot&pli=1

-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+		if (test_tsk_thread_flag(p, TIF_MEMDIE) && p->mm)
   			return ERR_PTR(-1UL);

It is not committed yet, because David Rientjes and company are still
deciding what to do with "[patch] oom: prevent unnecessary oom kills or kernel panics.".
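
For background, here is the livelock scenario the p->mm check targets, as
I understand it (simplified sketch, not a trace):

	exit_mm()	/* victim releases its mm: p->mm = NULL, so it  */
			/* cannot free anything more, then lingers as  */
			/* a zombie until its parent reaps it          */
	select_bad_process()
		/* the zombie still has TIF_MEMDIE -> ERR_PTR(-1UL), */
		/* so no new victim is ever chosen and the allocator */
		/* keeps retrying                                    */

With the p->mm check, such a zombie is skipped and the next victim can be
selected.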
>
>> I wonder vmscan itself isn't a key for fixing issue.
>
> I agree.
>
>> Then, I'd like to wait for Kosaki-san's answer ;)
>
> Me, too. :)
>
>>
>> I'm now wondering how to catch fork-bomb and stop it (without using cgroup).
>
> Yes. Fork throttling without cgroup is very important.
> And as off-topic, mem_notify without memcontrol you mentioned is
> important to embedded people, I gues.
>
>> I think the problem is that fork-bomb is faster than killall...
>
> And deadlock problem I mentioned.
>
>>
>> Thanks,
>> -Kame
>
> Thanks for the investigation, Kame.
>
>> ==
>>
>> This is just a debug patch.
>>
>> ---
>>   mm/vmscan.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 54 insertions(+), 4 deletions(-)
>>
>> Index: mmotm-0303/mm/vmscan.c
>> ===================================================================
>> --- mmotm-0303.orig/mm/vmscan.c
>> +++ mmotm-0303/mm/vmscan.c
>> @@ -1983,9 +1983,55 @@ static void shrink_zones(int priority, s
>>         }
>>   }
>>
>> -static bool zone_reclaimable(struct zone *zone)
>> +static bool zone_seems_empty(struct zone *zone, struct scan_control *sc)
>>   {
>> -       return zone->pages_scanned<  zone_reclaimable_pages(zone) * 6;
>> +       unsigned long nr, wmark, free, isolated, lru;
>> +
>> +       /*
>> +        * If scanned, zone->pages_scanned is incremented and this can
>> +        * trigger OOM.
>> +        */
>> +       if (sc->nr_scanned)
>> +               return false;
>> +
>> +       free = zone_page_state(zone, NR_FREE_PAGES);
>> +       isolated = zone_page_state(zone, NR_ISOLATED_FILE);
>> +       if (nr_swap_pages)
>> +               isolated += zone_page_state(zone, NR_ISOLATED_ANON);
>> +
>> +       /* In we cannot do scan, don't count LRU pages. */
>> +       if (!zone->all_unreclaimable) {
>> +               lru = zone_page_state(zone, NR_ACTIVE_FILE);
>> +               lru += zone_page_state(zone, NR_INACTIVE_FILE);
>> +               if (nr_swap_pages) {
>> +                       lru += zone_page_state(zone, NR_ACTIVE_ANON);
>> +                       lru += zone_page_state(zone, NR_INACTIVE_ANON);
>> +               }
>> +       } else
>> +               lru = 0;
>> +       nr = free + isolated + lru;
>> +       wmark = min_wmark_pages(zone);
>> +       wmark += zone->lowmem_reserve[gfp_zone(sc->gfp_mask)];
>> +       wmark += 1<<  sc->order;
>> +       printk("thread %d/%ld all %d scanned %ld pages %ld/%ld/%ld/%ld/%ld/%ld\n",
>> +               current->pid, sc->nr_scanned, zone->all_unreclaimable,
>> +               zone->pages_scanned,
>> +               nr,free,isolated,lru,
>> +               zone_reclaimable_pages(zone), wmark);
>> +       /*
>> +        * In some case (especially noswap), almost all page cache are paged out
>> +        * and we'll see the amount of reclaimable+free pages is smaller than
>> +        * zone->min. In this case, we canoot expect any recovery other
>> +        * than OOM-KILL. We can't reclaim memory enough for usual tasks.
>> +        */
>> +
>> +       return nr<= wmark;
>> +}
>> +
>> +static bool zone_reclaimable(struct zone *zone, struct scan_control *sc)
>> +{
>> +       /* zone_reclaimable_pages() can return 0, we need<= */
>> +       return zone->pages_scanned<= zone_reclaimable_pages(zone) * 6;
>>   }
>>
>>   /*
>> @@ -2006,11 +2052,15 @@ static bool all_unreclaimable(struct zon
>>                         continue;
>>                 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>                         continue;
>> -               if (zone_reclaimable(zone)) {
>> +               if (zone_seems_empty(zone, sc))
>> +                       continue;
>> +               if (zone_reclaimable(zone, sc)) {
>>                         all_unreclaimable = false;
>>                         break;
>>                 }
>>         }
>> +       if (all_unreclaimable)
>> +               printk("all_unreclaimable() returns TRUE\n");
>>
>>         return all_unreclaimable;
>>   }
>> @@ -2456,7 +2506,7 @@ loop_again:
>>                         if (zone->all_unreclaimable)
>>                                 continue;
>>                         if (!compaction&&  nr_slab == 0&&
>> -                           !zone_reclaimable(zone))
>> +                           !zone_reclaimable(zone,&sc))
>>                                 zone->all_unreclaimable = 1;
>>                         /*
>>                          * If we've done a decent amount of scanning and
>>
>>
>
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-11  6:08                       ` avagin
@ 2011-03-14  1:03                         ` Minchan Kim
  0 siblings, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2011-03-14  1:03 UTC (permalink / raw)
  To: avagin
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrey Vagin, Mel Gorman,
	KOSAKI Motohiro, linux-mm, linux-kernel

On Fri, Mar 11, 2011 at 3:08 PM, avagin@gmail.com <avagin@gmail.com> wrote:
> On 03/11/2011 03:18 AM, Minchan Kim wrote:
>>
>> On Fri, Mar 11, 2011 at 8:58 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com>  wrote:
>>>
>>> On Thu, 10 Mar 2011 15:58:29 +0900
>>> Minchan Kim<minchan.kim@gmail.com>  wrote:
>>>
>>>> Hi Kame,
>>>>
>>>> Sorry for late response.
>>>> I had a time to test this issue shortly because these day I am very
>>>> busy.
>>>> This issue was interesting to me.
>>>> So I hope taking a time for enough testing when I have a time.
>>>> I should find out root cause of livelock.
>>>>
>>>
>>> Thanks. I and Kosaki-san reproduced the bug with swapless system.
>>> Now, Kosaki-san is digging and found some issue with scheduler boost at
>>> OOM
>>> and lack of enough "wait" in vmscan.c.
>>>
>>> I myself made patch like attached one. This works well for returning TRUE
>>> at
>>> all_unreclaimable() but livelock(deadlock?) still happens.
>>
>> I saw the deadlock.
>> It seems to happen by following code by my quick debug but not sure. I
>> need to investigate further but don't have a time now. :(
>>
>>
>>                  * Note: this may have a chance of deadlock if it gets
>>                  * blocked waiting for another task which itself is
>> waiting
>>                  * for memory. Is there a better alternative?
>>                  */
>>                 if (test_tsk_thread_flag(p, TIF_MEMDIE))
>>                         return ERR_PTR(-1UL);
>> It would be wait to die the task forever without another victim selection.
>> If it's right, It's a known BUG and we have no choice until now. Hmm.
>
>
> I fixed this bug too and sent patch "mm: skip zombie in OOM-killer".
>
> http://groups.google.com/group/linux.kernel/browse_thread/thread/b9c6ddf34d1671ab/2941e1877ca4f626?lnk=raot&pli=1
>
> -               if (test_tsk_thread_flag(p, TIF_MEMDIE))
> +               if (test_tsk_thread_flag(p, TIF_MEMDIE) && p->mm)
>                        return ERR_PTR(-1UL);
>
> It is not committed yet, because David Rientjes and company are still deciding
> what to do with "[patch] oom: prevent unnecessary oom kills or kernel panics.".

Thanks, Andrey.
The patch "mm: skip zombie in OOM-killer"  solves my livelock issue
but I didn't look effectiveness of "mm: check zone->all_unreclaimable
in all_unreclaimable". I have to look further.

But your patch  "mm: skip zombie in OOM-killer" is very controversial
because It breaks multi-thread case.
Since find_lock_task_mm is introduced, we have considered mt cases but
I think it doesn't cover completely all cases like discussing
TIF_MEMDIE now.

I will watch the discussion.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-03-05 15:34   ` Andrew Vagin
  2011-03-05 15:53     ` Minchan Kim
@ 2011-05-04  1:38     ` CAI Qian
  2011-05-09  6:54       ` KOSAKI Motohiro
  1 sibling, 1 reply; 58+ messages in thread
From: CAI Qian @ 2011-05-04  1:38 UTC (permalink / raw)
  To: avagin
  Cc: Andrey Vagin, Andrew Morton, Mel Gorman, KOSAKI Motohiro,
	linux-mm, linux-kernel, Minchan Kim

[-- Attachment #1: Type: text/plain, Size: 1999 bytes --]



----- Original Message -----
> On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> >> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> >> kernel may hang up, because shrink_zones() will do nothing, but
> >> all_unreclaimable() will say, that zone has reclaimable pages.
> >>
> >> do_try_to_free_pages()
> >> 	shrink_zones()
> >> 		 for_each_zone
> >> 			if (zone->all_unreclaimable)
> >> 				continue
> >> 	if !all_unreclaimable(zonelist, sc)
> >> 		return 1
> >>
> >> __alloc_pages_slowpath()
> >> retry:
> >> 	did_some_progress = do_try_to_free_pages(page)
> >> 	...
> >> 	if (!page&& did_some_progress)
> >> 		retry;
> >>
> >> Signed-off-by: Andrey Vagin<avagin@openvz.org>
> >> ---
> >>   mm/vmscan.c | 2 ++
> >>   1 files changed, 2 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 6771ea7..1c056f7 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist
> >> *zonelist,
> >>
> >>   	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >>   			gfp_zone(sc->gfp_mask), sc->nodemask) {
> >> +		if (zone->all_unreclaimable)
> >> +			continue;
> >>   		if (!populated_zone(zone))
> >>   			continue;
> >>   		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >
> > zone_reclaimable checks it. Isn't it enough?
> I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> This two patches are enough.
> > Does the hang up really happen or see it by code review?
> Yes. You can reproduce it for help the attached python program. It's
> not
> very clever:)
> It make the following actions in loop:
> 1. fork
> 2. mmap
> 3. touch memory
> 4. read memory
> 5. munmmap
> 
> >> --
> >> 1.7.1
I have tested this on the latest mainline kernel using the attached
reproducer; the system just hung or deadlocked after the OOM. The whole
OOM trace is here:
http://people.redhat.com/qcai/oom.log

Did I miss anything?

[-- Attachment #2: memeater.py --]
[-- Type: text/plain, Size: 695 bytes --]

import sys, mmap, os
import random

def info(msg):
	pid = os.getpid()
	print >> sys.stderr, "%s: %s" % (pid, msg)
	sys.stderr.flush()

def memory_loop(cmd = "a"):
	"""
	cmd may be:
		c: check memory
		else: touch memory
	"""
	# touch (or verify) one byte in every page of the mapping;
	# mem_size and f are module-level globals set in the loop below
	for j in xrange(0, mem_size):
		if cmd == "c":
			if f[j<<12] != chr(j % 255):
				info("Data corruption")
				sys.exit(1)
		else:
			f[j<<12] = chr(j % 255)

# both parent and child fall back into this loop, so the number of
# processes touching anonymous memory grows until the OOM killer fires
while True:
	pid = os.fork()
	if (pid != 0):
		mem_size = random.randint(0, 56 * 4096)
		f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
		memory_loop()
		memory_loop("c")
		f.close()
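
(To reproduce: run it as a plain Python 2 script and wait for the OOM
killer; each process maps up to 56 * 4096 pages, i.e. roughly 896MB, of
anonymous memory, and both parent and child keep forking, so memory
pressure builds quickly, especially on a swapless box.)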

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-05-04  1:38     ` CAI Qian
@ 2011-05-09  6:54       ` KOSAKI Motohiro
  2011-05-09  8:47         ` CAI Qian
  0 siblings, 1 reply; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  6:54 UTC (permalink / raw)
  To: CAI Qian
  Cc: kosaki.motohiro, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim

> 
> 
> ----- Original Message -----
> > On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > > On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> > >> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> > >> kernel may hang up, because shrink_zones() will do nothing, but
> > >> all_unreclaimable() will say, that zone has reclaimable pages.
> > >>
> > >> do_try_to_free_pages()
> > >> 	shrink_zones()
> > >> 		 for_each_zone
> > >> 			if (zone->all_unreclaimable)
> > >> 				continue
> > >> 	if !all_unreclaimable(zonelist, sc)
> > >> 		return 1
> > >>
> > >> __alloc_pages_slowpath()
> > >> retry:
> > >> 	did_some_progress = do_try_to_free_pages(page)
> > >> 	...
> > >> 	if (!page&& did_some_progress)
> > >> 		retry;
> > >>
> > >> Signed-off-by: Andrey Vagin<avagin@openvz.org>
> > >> ---
> > >>   mm/vmscan.c | 2 ++
> > >>   1 files changed, 2 insertions(+), 0 deletions(-)
> > >>
> > >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >> index 6771ea7..1c056f7 100644
> > >> --- a/mm/vmscan.c
> > >> +++ b/mm/vmscan.c
> > >> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist
> > >> *zonelist,
> > >>
> > >>   	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > >>   			gfp_zone(sc->gfp_mask), sc->nodemask) {
> > >> +		if (zone->all_unreclaimable)
> > >> +			continue;
> > >>   		if (!populated_zone(zone))
> > >>   			continue;
> > >>   		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > >
> > > zone_reclaimable checks it. Isn't it enough?
> > I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> > This two patches are enough.
> > > Does the hang up really happen or see it by code review?
> > Yes. You can reproduce it for help the attached python program. It's
> > not
> > very clever:)
> > It make the following actions in loop:
> > 1. fork
> > 2. mmap
> > 3. touch memory
> > 4. read memory
> > 5. munmmap
> > 
> > >> --
> > >> 1.7.1
> I have tested this for the latest mainline kernel using the reproducer
> attached, the system just hung or deadlock after oom. The whole oom
> trace is here.
> http://people.redhat.com/qcai/oom.log
> 
> Did I miss anything?

Can you please try commit 929bea7c714220fc76ce3f75bef9056477c28e74?





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-05-09  6:54       ` KOSAKI Motohiro
@ 2011-05-09  8:47         ` CAI Qian
  2011-05-09  9:19           ` KOSAKI Motohiro
  0 siblings, 1 reply; 58+ messages in thread
From: CAI Qian @ 2011-05-09  8:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: avagin, Andrey Vagin, Andrew Morton, Mel Gorman, linux-mm,
	linux-kernel, Minchan Kim



----- Original Message -----
> >
> >
> > ----- Original Message -----
> > > On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > > > On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> > > >> Check zone->all_unreclaimable in all_unreclaimable(), otherwise
> > > >> the
> > > >> kernel may hang up, because shrink_zones() will do nothing, but
> > > >> all_unreclaimable() will say, that zone has reclaimable pages.
> > > >>
> > > >> do_try_to_free_pages()
> > > >> 	shrink_zones()
> > > >> 		 for_each_zone
> > > >> 			if (zone->all_unreclaimable)
> > > >> 				continue
> > > >> 	if !all_unreclaimable(zonelist, sc)
> > > >> 		return 1
> > > >>
> > > >> __alloc_pages_slowpath()
> > > >> retry:
> > > >> 	did_some_progress = do_try_to_free_pages(page)
> > > >> 	...
> > > >> 	if (!page&& did_some_progress)
> > > >> 		retry;
> > > >>
> > > >> Signed-off-by: Andrey Vagin<avagin@openvz.org>
> > > >> ---
> > > >>   mm/vmscan.c | 2 ++
> > > >>   1 files changed, 2 insertions(+), 0 deletions(-)
> > > >>
> > > >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > >> index 6771ea7..1c056f7 100644
> > > >> --- a/mm/vmscan.c
> > > >> +++ b/mm/vmscan.c
> > > >> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct
> > > >> zonelist
> > > >> *zonelist,
> > > >>
> > > >>   	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > >>   			gfp_zone(sc->gfp_mask), sc->nodemask) {
> > > >> +		if (zone->all_unreclaimable)
> > > >> +			continue;
> > > >>   		if (!populated_zone(zone))
> > > >>   			continue;
> > > >>   		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > > >
> > > > zone_reclaimable checks it. Isn't it enough?
> > > I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> > > This two patches are enough.
> > > > Does the hang up really happen or see it by code review?
> > > Yes. You can reproduce it for help the attached python program.
> > > It's
> > > not
> > > very clever:)
> > > It make the following actions in loop:
> > > 1. fork
> > > 2. mmap
> > > 3. touch memory
> > > 4. read memory
> > > 5. munmmap
> > >
> > > >> --
> > > >> 1.7.1
> > I have tested this for the latest mainline kernel using the
> > reproducer
> > attached, the system just hung or deadlock after oom. The whole oom
> > trace is here.
> > http://people.redhat.com/qcai/oom.log
> >
> > Did I miss anything?
> 
> Can you please try commit 929bea7c714220fc76ce3f75bef9056477c28e74?
As I mentioned, I have tested the latest mainline, which already
includes that fix. Also, does this problem occur only on x86? The
testing was done using x86_64. Not sure if that would be a problem.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()
  2011-05-09  8:47         ` CAI Qian
@ 2011-05-09  9:19           ` KOSAKI Motohiro
  2011-05-10  8:11             ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) KOSAKI Motohiro
  0 siblings, 1 reply; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  9:19 UTC (permalink / raw)
  To: CAI Qian
  Cc: kosaki.motohiro, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim

> > > I have tested this for the latest mainline kernel using the
> > > reproducer
> > > attached, the system just hung or deadlock after oom. The whole oom
> > > trace is here.
> > > http://people.redhat.com/qcai/oom.log
> > >
> > > Did I miss anything?
> > 
> > Can you please try commit 929bea7c714220fc76ce3f75bef9056477c28e74?
> As I mentioned, I have tested the latest mainline, which already
> includes that fix. Also, does this problem occur only on x86? The
> testing was done using x86_64. Not sure if that would be a problem.

No. I'm also using x86_64 and my machine works fine on the current
latest Linus tree. I confirmed it today.





^ permalink raw reply	[flat|nested] 58+ messages in thread

* OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-09  9:19           ` KOSAKI Motohiro
@ 2011-05-10  8:11             ` KOSAKI Motohiro
  2011-05-10  8:14               ` [PATCH 1/4] oom: improve dump_tasks() show items KOSAKI Motohiro
                                 ` (5 more replies)
  0 siblings, 6 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-10  8:11 UTC (permalink / raw)
  To: CAI Qian
  Cc: kosaki.motohiro, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, David Rientjes,
	Hugh Dickins, Oleg Nesterov

(cc'ing people interested in oom)

> > > > I have tested this for the latest mainline kernel using the
> > > > reproducer
> > > > attached, the system just hung or deadlock after oom. The whole oom
> > > > trace is here.
> > > > http://people.redhat.com/qcai/oom.log
> > > >
> > > > Did I miss anything?
> > > 
> > > Can you please try commit 929bea7c714220fc76ce3f75bef9056477c28e74?
> > As I mentioned, I have tested the latest mainline, which already
> > includes that fix. Also, does this problem occur only on x86? The
> > testing was done using x86_64. Not sure if that would be a problem.
> 
> No. I'm also using x86_64 and my machine works fine on the current
> latest Linus tree. I confirmed it today.

> 4194288 pages RAM

You have 16GB RAM.

> Out of memory: Kill process 1175 (dhclient) score 1 or sacrifice child
> Out of memory: Kill process 1247 (rsyslogd) score 1 or sacrifice child
> Out of memory: Kill process 1284 (irqbalance) score 1 or sacrifice child
> Out of memory: Kill process 1303 (rpcbind) score 1 or sacrifice child
> Out of memory: Kill process 1321 (rpc.statd) score 1 or sacrifice child
> Out of memory: Kill process 1333 (mdadm) score 1 or sacrifice child
> Out of memory: Kill process 1365 (rpc.idmapd) score 1 or sacrifice child
> Out of memory: Kill process 1403 (dbus-daemon) score 1 or sacrifice child
> Out of memory: Kill process 1438 (acpid) score 1 or sacrifice child
> Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> Out of memory: Kill process 1487 (hald-addon-inpu) score 1 or sacrifice child
> Out of memory: Kill process 1488 (hald-addon-acpi) score 1 or sacrifice child
> Out of memory: Kill process 1507 (automount) score 1 or sacrifice child

Oops.

OK. That's a known issue. The current OOM logic doesn't work if you have
gigabytes of RAM, because _all_ processes have exactly the same score (=1);
the oom killer then just falls back to being a random process killer. This
was introduced by commit a63d83f427 (oom: badness heuristic rewrite). I
pointed it out at least three times. You have to blame the Google folks. :-/


The problems are three.

1) If two processes have the same oom score, we should kill the younger one,
   but the current logic kills the older. The oldest processes are typically
   system daemons.
2) The current logic uses 'unsigned int' for the internal score calculation
   (more precisely, it only uses values 0-1000). This very low precision
   produces many identical oom scores and can kill an ineligible process.
3) The current logic gives a 3% of system RAM bonus to root processes, which
   is obviously too big if you have plenty of memory. Now your fork-bomb
   processes have a 500MB OOM-immunity bonus, so the fork-bomb is never
   killed. (A worked example follows below.)
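
A quick worked example of (2) and (3), using the current heuristic (the
rss figure is an illustrative assumption based on CAI's log): on a 16GB
box, totalpages is about 4194304. A root daemon with ~336MB of rss is
~86016 pages, so points = 86016 * 1000 / 4194304 = 20; the 3% root bonus
then subtracts 30 and the result is clamped to the minimum score of 1.
Every root process below ~3% of RAM (~500MB here) therefore scores
exactly 1, and select_bad_process() cannot tell them apart.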


CAI-san: I've made patches to fix this. Can you please try them?




^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 1/4] oom: improve dump_tasks() show items
  2011-05-10  8:11             ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) KOSAKI Motohiro
@ 2011-05-10  8:14               ` KOSAKI Motohiro
  2011-05-10 23:29                 ` David Rientjes
  2011-05-10  8:15               ` [PATCH 2/4] oom: kill younger process first KOSAKI Motohiro
                                 ` (4 subsequent siblings)
  5 siblings, 1 reply; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-10  8:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Minchan Kim, David Rientjes,
	Hugh Dickins, Oleg Nesterov

Recently, the oom internal logic was dramatically changed. Thus
dump_tasks() is no longer useful: it has some meaningless
items and lacks some oom-score-related items.

This patch adapts the displayed fields to the new oom logic.

details
==========
removed: pid (we always kill a whole process; we don't need the thread id),
         mm->total_vm (we no longer use virtual memory size)
         signal->oom_adj (we no longer use it internally)
added: ppid (we often kill a sacrificed child process)
modified: RSS (account for mm->nr_ptes too)
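
For illustration, one line of the new output would look roughly like this
(made-up values, following the new format string):

[   pid]   ppid   uid      rss  cpu score_adj name
[  1234]      1     0    86016    3         0 rsyslogd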

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---

Strictly speaking, this is NOT part of the oom fixing patches, but it's
necessary for me to parse CAI's test result.


 mm/oom_kill.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f52e85c..118d958 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -355,7 +355,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
 	struct task_struct *p;
 	struct task_struct *task;
 
-	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
+	pr_info("[   pid]   ppid   uid      rss  cpu score_adj name\n");
 	for_each_process(p) {
 		if (oom_unkillable_task(p, mem, nodemask))
 			continue;
@@ -370,11 +370,13 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
 			continue;
 		}
 
-		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
-			task->pid, task_uid(task), task->tgid,
-			task->mm->total_vm, get_mm_rss(task->mm),
-			task_cpu(task), task->signal->oom_adj,
-			task->signal->oom_score_adj, task->comm);
+		pr_info("[%6d] %6d %5d %8lu %4u %9d %s\n",
+			task_tgid_nr(task), task_tgid_nr(task->real_parent),
+			task_uid(task),
+			get_mm_rss(task->mm) + task->mm->nr_ptes,
+			task_cpu(task),
+			task->signal->oom_score_adj,
+			task->comm);
 		task_unlock(task);
 	}
 }
-- 
1.7.3.1




^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 2/4] oom: kill younger process first
  2011-05-10  8:11             ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) KOSAKI Motohiro
  2011-05-10  8:14               ` [PATCH 1/4] oom: improve dump_tasks() show items KOSAKI Motohiro
@ 2011-05-10  8:15               ` KOSAKI Motohiro
  2011-05-10 23:31                 ` David Rientjes
                                   ` (2 more replies)
  2011-05-10  8:15               ` [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally KOSAKI Motohiro
                                 ` (3 subsequent siblings)
  5 siblings, 3 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-10  8:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Minchan Kim, David Rientjes,
	Hugh Dickins, Oleg Nesterov

This patch introduces do_each_thread_reverse() and uses it in
select_bad_process(). The benefits are two:
1) the oom-killer can kill a younger process rather than an older
one when they have the same oom score; usually the younger process
is less important. 2) younger tasks often have PF_EXITING set,
because shell scripts create many short-lived processes, and a
reverse-order search can detect them faster.
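
As a usage sketch (illustrative only, mirroring how do_each_thread() is
normally paired with while_each_thread() under tasklist_lock):

	struct task_struct *g, *t;

	read_lock(&tasklist_lock);
	do_each_thread_reverse(g, t) {
		/* visits threads newest-first */
	} while_each_thread(g, t);
	read_unlock(&tasklist_lock);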

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/sched.h |    6 ++++++
 mm/oom_kill.c         |    2 +-
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 013314a..a0a8339 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2194,6 +2194,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
 #define next_task(p) \
 	list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
 
+#define prev_task(p) \
+	list_entry_rcu((p)->tasks.prev, struct task_struct, tasks)
+
 #define for_each_process(p) \
 	for (p = &init_task ; (p = next_task(p)) != &init_task ; )
 
@@ -2206,6 +2209,9 @@ extern bool current_is_single_threaded(void);
 #define do_each_thread(g, t) \
 	for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
 
+#define do_each_thread_reverse(g, t) \
+	for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
+
 #define while_each_thread(g, t) \
 	while ((t = next_thread(t)) != g)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 118d958..0cf5091 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -282,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	struct task_struct *chosen = NULL;
 	*ppoints = 0;
 
-	do_each_thread(g, p) {
+	do_each_thread_reverse(g, p) {
 		unsigned int points;
 
 		if (!p->mm)
-- 
1.7.3.1




^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally
  2011-05-10  8:11             ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) KOSAKI Motohiro
  2011-05-10  8:14               ` [PATCH 1/4] oom: improve dump_tasks() show items KOSAKI Motohiro
  2011-05-10  8:15               ` [PATCH 2/4] oom: kill younger process first KOSAKI Motohiro
@ 2011-05-10  8:15               ` KOSAKI Motohiro
  2011-05-10 23:40                 ` David Rientjes
  2011-05-10  8:16               ` [PATCH 4/4] oom: don't kill random process KOSAKI Motohiro
                                 ` (2 subsequent siblings)
  5 siblings, 1 reply; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-10  8:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Minchan Kim, David Rientjes,
	Hugh Dickins, Oleg Nesterov

CAI Qian reported that his kernel hung up when he ran a fork-intensive
workload and then invoked the oom-killer.

The problem is that the current oom calculation uses a 0-1000 normalized
value (the unit is a permillage of system RAM). Its low precision produces
a lot of identical oom scores. IOW, in his case, all processes have an
oom score < 1, and the internal integer calculation rounds it up to 1.
Thus the oom-killer kills an ineligible process. This regression was
caused by commit a63d83f427 (oom: badness heuristic rewrite).

The solution is for the internal calculation to use the number of pages
instead of a permillage of system RAM, and to convert to a permillage
value only at display time.

This patch doesn't change any ABI (including /proc/<pid>/oom_score_adj)
even though the current logic has a lot that I dislike.
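
As a worked example of the new scaling (my own numbers, not from the
patch): with totalpages = 4194304 (16GB), a task with 86016 pages of
rss+ptes now keeps points = 86016 instead of rounding down to 20, and an
oom_score_adj of +100 adds 100 * (4194304 / 1000) = about 419400 pages,
so nearby tasks no longer collapse onto the same score.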

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c      |   13 ++++++----
 include/linux/oom.h |    7 +----
 mm/oom_kill.c       |   60 +++++++++++++++++++++++++++++++++-----------------
 3 files changed, 49 insertions(+), 31 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..d6b0424 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -476,14 +476,17 @@ static const struct file_operations proc_lstats_operations = {
 
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
-	unsigned long points = 0;
+	unsigned long points;
+	unsigned long ratio = 0;
+	unsigned long totalpages = totalram_pages + total_swap_pages + 1;
 
 	read_lock(&tasklist_lock);
-	if (pid_alive(task))
-		points = oom_badness(task, NULL, NULL,
-					totalram_pages + total_swap_pages);
+	if (pid_alive(task)) {
+		points = oom_badness(task, NULL, NULL, totalpages);
+		ratio = points * 1000 / totalpages;
+	}
 	read_unlock(&tasklist_lock);
-	return sprintf(buffer, "%lu\n", points);
+	return sprintf(buffer, "%lu\n", ratio);
 }
 
 struct limit_names {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..0f5b588 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -40,7 +40,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+/* The badness from the OOM killer */
+extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 			const nodemask_t *nodemask, unsigned long totalpages);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
@@ -62,10 +63,6 @@ static inline void oom_killer_enable(void)
 	oom_killer_disabled = false;
 }
 
-/* The badness from the OOM killer */
-extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long uptime);
-
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
 /* sysctls */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0cf5091..ba95870 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -132,10 +132,12 @@ static bool oom_unkillable_task(struct task_struct *p,
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 		      const nodemask_t *nodemask, unsigned long totalpages)
 {
-	int points;
+	unsigned long points;
+	unsigned long score_adj = 0;
+
 
 	if (oom_unkillable_task(p, mem, nodemask))
 		return 0;
@@ -160,7 +162,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	if (p->flags & PF_OOM_ORIGIN) {
 		task_unlock(p);
-		return 1000;
+		return ULONG_MAX;
 	}
 
 	/*
@@ -176,33 +178,49 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	points = get_mm_rss(p->mm) + p->mm->nr_ptes;
 	points += get_mm_counter(p->mm, MM_SWAPENTS);
-
-	points *= 1000;
-	points /= totalpages;
 	task_unlock(p);
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
 	 * implementation used by LSMs.
+	 *
+	 * XXX: Too large a bonus, e.g. if the system has terabytes of memory...
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
-		points -= 30;
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
+		if (points >= totalpages / 32)
+			points -= totalpages / 32;
+		else
+			points = 0;
+	}
 
 	/*
 	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
 	 * either completely disable oom killing or always prefer a certain
 	 * task.
 	 */
-	points += p->signal->oom_score_adj;
+	if (p->signal->oom_score_adj >= 0) {
+		score_adj = p->signal->oom_score_adj * (totalpages / 1000);
+		if (ULONG_MAX - points >= score_adj)
+			points += score_adj;
+		else
+			points = ULONG_MAX;
+	} else {
+		score_adj = -p->signal->oom_score_adj * (totalpages / 1000);
+		if (points >= score_adj)
+			points -= score_adj;
+		else
+			points = 0;
+	}
 
 	/*
 	 * Never return 0 for an eligible task that may be killed since it's
 	 * possible that no single user task uses more than 0.1% of memory and
 	 * no single admin tasks uses more than 3.0%.
 	 */
-	if (points <= 0)
-		return 1;
-	return (points < 1000) ? points : 1000;
+	if (!points)
+		points = 1;
+
+	return points;
 }
 
 /*
@@ -274,7 +292,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned int *ppoints,
+static struct task_struct *select_bad_process(unsigned long *ppoints,
 		unsigned long totalpages, struct mem_cgroup *mem,
 		const nodemask_t *nodemask)
 {
@@ -283,7 +301,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	*ppoints = 0;
 
 	do_each_thread_reverse(g, p) {
-		unsigned int points;
+		unsigned long points;
 
 		if (!p->mm)
 			continue;
@@ -314,7 +332,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			 */
 			if (p == current) {
 				chosen = p;
-				*ppoints = 1000;
+				*ppoints = ULONG_MAX;
 			} else {
 				/*
 				 * If this task is not being ptraced on exit,
@@ -444,14 +462,14 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 #undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned int points, unsigned long totalpages,
+			    unsigned long points, unsigned long totalpages,
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
 	struct task_struct *t = p;
-	unsigned int victim_points = 0;
+	unsigned long victim_points = 0;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);
@@ -466,7 +484,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}
 
 	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);
 
@@ -478,7 +496,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
+			unsigned long child_points;
 
 			if (child->mm == p->mm)
 				continue;
@@ -525,7 +543,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
 	unsigned long limit;
-	unsigned int points = 0;
+	unsigned long points = 0;
 	struct task_struct *p;
 
 	/*
@@ -674,7 +692,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	struct task_struct *p;
 	unsigned long totalpages;
 	unsigned long freed = 0;
-	unsigned int points;
+	unsigned long points;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 	int killed = 0;
 
-- 
1.7.3.1




^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 4/4] oom: don't kill random process
  2011-05-10  8:11             ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) KOSAKI Motohiro
                                 ` (2 preceding siblings ...)
  2011-05-10  8:15               ` [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally KOSAKI Motohiro
@ 2011-05-10  8:16               ` KOSAKI Motohiro
  2011-05-10 23:41                 ` David Rientjes
  2011-05-10 23:22               ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) David Rientjes
  2011-05-11  2:30               ` CAI Qian
  5 siblings, 1 reply; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-10  8:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Minchan Kim, David Rientjes,
	Hugh Dickins, Oleg Nesterov

CAI Qian reported that the oom-killer killed all system daemons in his
system first when he ran a fork bomb as root. The problem is that the
current logic gives them a bonus of 3% of system RAM. For example, on
his 16GB machine, root processes get ~500MB of oom immunity. This brings
us a crazy bad result: _all_ processes have oom-score=1, and the oom
killer then ignores process memory usage and kills a random process.
This regression was caused by commit a63d83f427 (oom: badness heuristic
rewrite).

This patch changes select_bad_process() slightly. If oom points == 1,
it's a sign that the system has only root-privileged processes or
similar. In that case, select_bad_process() recalculates oom badness
without the root bonus and selects an eligible process.
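
A short example of the intended effect (reusing the illustrative numbers
from the analysis above): on the 16GB box every root process scores 1
while the bonus is applied, so select_bad_process() retries with
protect_root = 0; on the second pass a ~336MB daemon scores ~86016 pages
while small daemons score only a few thousand, and the largest consumer
is selected instead of a random process.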

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c      |    2 +-
 include/linux/oom.h |    3 ++-
 mm/oom_kill.c       |   19 +++++++++++++++----
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d6b0424..b608b69 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -482,7 +482,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
 
 	read_lock(&tasklist_lock);
 	if (pid_alive(task)) {
-		points = oom_badness(task, NULL, NULL, totalpages);
+		points = oom_badness(task, NULL, NULL, totalpages, 1);
 		ratio = points * 1000 / totalpages;
 	}
 	read_unlock(&tasklist_lock);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 0f5b588..3dd3669 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -42,7 +42,8 @@ enum oom_constraint {
 
 /* The badness from the OOM killer */
 extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-			const nodemask_t *nodemask, unsigned long totalpages);
+			const nodemask_t *nodemask, unsigned long totalpages,
+			int protect_root);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ba95870..525e1d2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -133,7 +133,8 @@ static bool oom_unkillable_task(struct task_struct *p,
  * task consuming the most memory to avoid subsequent oom failures.
  */
 unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long totalpages)
+			 const nodemask_t *nodemask, unsigned long totalpages,
+			 int protect_root)
 {
 	unsigned long points;
 	unsigned long score_adj = 0;
@@ -186,7 +187,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 *
 	 * XXX: Too large a bonus, e.g. if the system has terabytes of memory...
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
+	if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) {
 		if (points >= totalpages / 32)
 			points -= totalpages / 32;
 		else
@@ -298,8 +299,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
+	int protect_root = 1;
 	*ppoints = 0;
 
+ retry:
 	do_each_thread_reverse(g, p) {
 		unsigned long points;
 
@@ -345,13 +348,18 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			}
 		}
 
-		points = oom_badness(p, mem, nodemask, totalpages);
+		points = oom_badness(p, mem, nodemask, totalpages, protect_root);
 		if (points > *ppoints) {
 			chosen = p;
 			*ppoints = points;
 		}
 	} while_each_thread(g, p);
 
+	if (protect_root && (*ppoints == 1)) {
+		protect_root = 0;
+		goto retry;
+	}
+
 	return chosen;
 }
 
@@ -470,6 +478,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	struct task_struct *child;
 	struct task_struct *t = p;
 	unsigned long victim_points = 0;
+	int admin;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);
@@ -494,6 +503,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	admin = has_capability_noaudit(victim, CAP_SYS_ADMIN);
+	victim_points = oom_badness(victim, mem, nodemask, totalpages, !admin);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned long child_points;
@@ -504,7 +515,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */
 			child_points = oom_badness(child, mem, nodemask,
-								totalpages);
+						   totalpages, !admin);
 			if (child_points > victim_points) {
 				victim = child;
 				victim_points = child_points;
-- 
1.7.3.1




^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-10  8:11             ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) KOSAKI Motohiro
                                 ` (3 preceding siblings ...)
  2011-05-10  8:16               ` [PATCH 4/4] oom: don't kill random process KOSAKI Motohiro
@ 2011-05-10 23:22               ` David Rientjes
  2011-05-11  2:30               ` CAI Qian
  5 siblings, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-05-10 23:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> OK. That's a known issue. The current OOM logic doesn't work if you have
> gigabytes of RAM, because _all_ processes have exactly the same score (=1);
> the oom killer then just falls back to being a random process killer. This
> was introduced by commit a63d83f427 (oom: badness heuristic rewrite). I
> pointed it out at least three times. You have to blame the Google folks. :-/
> 

If all threads have the same badness score, which by definition must be 1 
since that is the lowest badness score possible for an eligible thread, 
then each thread is using < 0.2% of RAM.

The granularity of the badness score doesn't differentiate between threads
within 0.1% of RAM of each other in terms of kill priority (in this case,
16MB).  The
largest consumers of memory from CAI's log have an rss of 336MB, which is 
~2% of system RAM.  The problem is that these are forked by root and 
therefore get a 3% bonus, making their badness score 1 instead of 2.

 [ You also don't have to blame "Google folks," I rewrote the oom
   killer. ]

> 
> The problems are three.
> 
> 1) If two processes have the same oom score, we should kill the younger one,
>    but the current logic kills the older. The oldest processes are typically
>    system daemons.

Agreed, that seems advantageous to prefer killing threads that have done 
the least amount of work (defined as those with the least runtime compared 
to others in the tasklist order) over others.

> 2) The current logic uses 'unsigned int' for the internal score calculation
>    (more precisely, it only uses values 0-1000). This very low precision
>    produces many identical oom scores and can kill an ineligible process.

The range of 0-1000 allows us to differentiate tasks up to 0.1% of system 
RAM from each other when making oom kill decisions.  If we really want to 
increase this granularity, we could increase the value to 10000 and then 
multiply oom_score_adj values by 10.
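
(For scale, assuming the same 16GB machine: one point of the current
0-1000 range is 0.1% of RAM, i.e. ~16MB; with a 0-10000 range one point
would be 0.01%, i.e. ~1.6MB.)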

> 3) The current logic gives a 3% of system RAM bonus to root processes, which
>    is obviously too big if you have plenty of memory. Now your fork-bomb
>    processes have a 500MB OOM-immunity bonus, so the fork-bomb is never killed.
> 

I agree that a constant proportion for root processes is probably not 
ideal, especially in situations where there are many small threads that 
only use about 1% of system RAM, such as in CAI's case.  I don't agree 
that we need to guard against forkbombs created by root, however.  The 
worst case scenario is that the continuous killing of non-root threads 
will allow the admin to fix his or her error.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] oom: improve dump_tasks() show items
  2011-05-10  8:14               ` [PATCH 1/4] oom: improve dump_tasks() show items KOSAKI Motohiro
@ 2011-05-10 23:29                 ` David Rientjes
  2011-05-13 10:14                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-05-10 23:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> Recently, the oom internal logic was dramatically changed. Thus
> dump_tasks() is no longer useful. It has some meaningless
> items and lacks some oom-score-related items.
> 

This changelog is inaccurate.

dump_tasks() is actually useful as it currently stands; there are things 
that you may add or remove but saying that it is "no longer useful" is an 
exaggeration.

> This patch adapts the displayed fields to the new oom logic.
> 
> details
> ==========
> removed: pid (we always kill whole processes; we don't need the thread id),
>          mm->total_vm (we no longer use virtual memory size)

Showing mm->total_vm is still interesting to know what the old heuristic 
would have used compared to the new one; I'd prefer that we keep it.

>          signal->oom_adj (we no longer use it internally)
> added: ppid (we often kill a sacrificed child process)
> modified: RSS (accounts for mm->nr_ptes too)

I'd prefer that ptes be shown independently from rss instead of being 
added to the thread's true rss usage and represented as such.

I think the cpu should also be removed.

For the next version, could you show the old output and a comparison to 
the new output in the changelog?

> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
> 
> Strictly speaking, this is NOT part of the oom fixing patches, but it's
> necessary for me to parse CAI's test results.
> 
> 
>  mm/oom_kill.c |   14 ++++++++------
>  1 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f52e85c..118d958 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -355,7 +355,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
>  	struct task_struct *p;
>  	struct task_struct *task;
>  
> -	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
> +	pr_info("[   pid]   ppid   uid      rss  cpu score_adj name\n");
>  	for_each_process(p) {
>  		if (oom_unkillable_task(p, mem, nodemask))
>  			continue;
> @@ -370,11 +370,13 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
>  			continue;
>  		}
>  
> -		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
> -			task->pid, task_uid(task), task->tgid,
> -			task->mm->total_vm, get_mm_rss(task->mm),
> -			task_cpu(task), task->signal->oom_adj,
> -			task->signal->oom_score_adj, task->comm);
> +		pr_info("[%6d] %6d %5d %8lu %4u %9d %s\n",
> +			task_tgid_nr(task), task_tgid_nr(task->real_parent),
> +			task_uid(task),
> +			get_mm_rss(task->mm) + p->mm->nr_ptes,
> +			task_cpu(task),
> +			task->signal->oom_score_adj,
> +			task->comm);
>  		task_unlock(task);
>  	}
>  }
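
For reference, a hypothetical sample of the proposed new output,
constructed from the header and pr_info() format string in the patch
above (the values are made up and the alignment approximate):

	[   pid]   ppid   uid      rss  cpu score_adj name
	[  1175]      1     0     1892    3         0 dhclient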

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-10  8:15               ` [PATCH 2/4] oom: kill younger process first KOSAKI Motohiro
@ 2011-05-10 23:31                 ` David Rientjes
  2011-05-13 10:15                   ` KOSAKI Motohiro
  2011-05-11 23:33                 ` Minchan Kim
  2011-05-12  0:52                 ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-05-10 23:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> This patch introduces do_each_thread_reverse() and
> select_bad_process() uses it. The benefits are two:
> 1) the oom-killer can kill a younger process rather than
> an older one if they have the same oom score. Usually a younger
> process is less important. 2) younger tasks often have PF_EXITING
> because shell scripts create a lot of short-lived processes.
> A reverse-order search can detect them faster.
> 

I like this change, thanks!  I'm surprised we haven't needed a 
do_each_thread_reverse() in the past somewhere else in the kernel.

Could you update the comment about do_each_thread() not being break-safe 
in the second version, though?

After that:

	Acked-by: David Rientjes <rientjes@google.com>

> Reported-by: CAI Qian <caiqian@redhat.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  include/linux/sched.h |    6 ++++++
>  mm/oom_kill.c         |    2 +-
>  2 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 013314a..a0a8339 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2194,6 +2194,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
>  #define next_task(p) \
>  	list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
>  
> +#define prev_task(p) \
> +	list_entry_rcu((p)->tasks.prev, struct task_struct, tasks)
> +
>  #define for_each_process(p) \
>  	for (p = &init_task ; (p = next_task(p)) != &init_task ; )
>  
> @@ -2206,6 +2209,9 @@ extern bool current_is_single_threaded(void);
>  #define do_each_thread(g, t) \
>  	for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
>  
> +#define do_each_thread_reverse(g, t) \
> +	for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
> +
>  #define while_each_thread(g, t) \
>  	while ((t = next_thread(t)) != g)
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 118d958..0cf5091 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -282,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  	struct task_struct *chosen = NULL;
>  	*ppoints = 0;
>  
> -	do_each_thread(g, p) {
> +	do_each_thread_reverse(g, p) {
>  		unsigned int points;
>  
>  		if (!p->mm)
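
For anyone following along, a minimal usage sketch of the new macro (it 
pairs with the existing while_each_thread(); as discussed later in this 
thread, the reverse walk needs tasklist_lock rather than just 
rcu_read_lock()):

	struct task_struct *g, *p;

	read_lock(&tasklist_lock);
	do_each_thread_reverse(g, p) {
		/* visits every thread, youngest process first */
	} while_each_thread(g, p);
	read_unlock(&tasklist_lock);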

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally
  2011-05-10  8:15               ` [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally KOSAKI Motohiro
@ 2011-05-10 23:40                 ` David Rientjes
  2011-05-13 10:30                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-05-10 23:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> CAI Qian reported that his kernel hung up when he ran a fork-intensive
> workload and then invoked the oom-killer.
> 
> The problem is that the current oom calculation uses a 0-1000 normalized value
> (the unit is a permillage of system RAM). Its low precision makes
> a lot of identical oom scores. IOW, in his case, all processes have a <1
> oom score and the internal integer calculation rounds it to 1. Thus the
> oom-killer kills ineligible processes. This regression was caused by
> commit a63d83f427 (oom: badness heuristic rewrite).
> 
> The solution is for the internal calculation to just use the number of pages
> instead of a permillage of system RAM, and to convert it to a permillage
> value at display time.
> 
> This patch doesn't change any ABI (including /proc/<pid>/oom_score_adj)
> even though the current logic has a lot of things I dislike.
> 

s/permillage/proportion/

This is unacceptable: it does not allow users to tune oom_score_adj 
appropriately based on the scores exported by /proc/pid/oom_score to 
discount an amount of RAM from a thread's memory usage in systemwide, 
memory controller, cpuset, or mempolicy contexts.  This is only possible 
because the oom score is normalized.

What would be acceptable would be to increase the granularity of the score 
to 10000 or 100000 to differentiate threads using 0.01% or 0.001% of RAM 
from each other, respectively.  The range of oom_score_adj would 
remain the same, however, and be multiplied by 10 or 100, respectively, 
when factored into the badness score baseline.  I don't believe userspace 
cares to differentiate at a granularity finer than 0.1% of available memory.

The other issue that this patch addresses is the bonus given to root 
processes.  I agree that a root process using 4% of RAM should not be 
scored equal to all other threads using 1%.  I do believe that a root 
process using 60% of RAM should be of equal priority to a thread using 
57%, however.  Perhaps a compromise would be to give root processes a 
bonus of 1% for every 30% of RAM they consume?
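
A quick sketch of how that proposed discount would behave on the existing 
0-1000 scale (my arithmetic, for illustration only):

	/* 1% of the scale (10 points) per full 30% of RAM (300 points) used */
	static unsigned int root_bonus(unsigned int points)
	{
		return 10 * (points / 300);
	}
	/* root task at  4% of RAM: points =  40, bonus =  0 -> stays 40       */
	/* root task at 60% of RAM: points = 600, bonus = 20 -> 580, i.e. ~58% */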

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 4/4] oom: don't kill random process
  2011-05-10  8:16               ` [PATCH 4/4] oom: don't kill random process KOSAKI Motohiro
@ 2011-05-10 23:41                 ` David Rientjes
  0 siblings, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-05-10 23:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> CAI Qian reported that the oom-killer killed all the system daemons in his
> system first when he ran a fork bomb as root. The problem is that the
> current logic gives them a bonus of 3% of system RAM. For example,
> on his 16GB machine, root processes get ~500MB of oom
> immunity. That brings us a crazy bad result: _all_ processes have
> oom-score=1, and then the oom killer ignores process memory usage
> and kills a random process. This regression was caused by commit
> a63d83f427 (oom: badness heuristic rewrite).
> 
> This patch changes select_bad_process() slightly. If oom points == 1,
> it's a sign that the system has only root-privileged processes or
> similar. Thus, select_bad_process() calculates the oom badness without
> the root bonus and selects an eligible process.
> 

This second (and very costly) iteration is unnecessary if the range of 
oom scores is increased from 1000 to 10000 or 100000 as suggested in the 
previous patch.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-10  8:11             ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) KOSAKI Motohiro
                                 ` (4 preceding siblings ...)
  2011-05-10 23:22               ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) David Rientjes
@ 2011-05-11  2:30               ` CAI Qian
  2011-05-11 20:34                 ` David Rientjes
  5 siblings, 1 reply; 58+ messages in thread
From: CAI Qian @ 2011-05-11  2:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: avagin, Andrey Vagin, Andrew Morton, Mel Gorman, linux-mm,
	linux-kernel, Minchan Kim, David Rientjes, Hugh Dickins,
	Oleg Nesterov



----- Original Message -----
> (cc to oom interested people)
> 
> > > > > I have tested this for the latest mainline kernel using the
> > > > > reproducer
> > > > > attached; the system just hung or deadlocked after oom. The
> > > > > whole oom
> > > > > trace is here.
> > > > > http://people.redhat.com/qcai/oom.log
> > > > >
> > > > > Did I miss anything?
> > > >
> > > > Can you please try commit
> > > > 929bea7c714220fc76ce3f75bef9056477c28e74?
> > > As I have mentioned, I have tested the latest mainline, which
> > > has
> > > already included that fix. Also, does this problem exist only for x86?
> > > The
> > > testing was done using x86_64. Not sure if that would be a
> > > problem.
> >
> > No. I'm also using x86_64 and my machine works completely fine on the
> > current latest Linus tree. I confirmed it today.
> 
> > 4194288 pages RAM
> 
> You have 16GB RAM.
> 
> > Out of memory: Kill process 1175 (dhclient) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1247 (rsyslogd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1284 (irqbalance) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1303 (rpcbind) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1321 (rpc.statd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1333 (mdadm) score 1 or sacrifice child
> > Out of memory: Kill process 1365 (rpc.idmapd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1403 (dbus-daemon) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1438 (acpid) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1487 (hald-addon-inpu) score 1 or
> > sacrifice child
> > Out of memory: Kill process 1488 (hald-addon-acpi) score 1 or
> > sacrifice child
> > Out of memory: Kill process 1507 (automount) score 1 or sacrifice
> > child
> 
> Oops.
> 
> OK. That's a known issue. Current OOM logic doesn't work if you have
> gigabytes of RAM, because _all_ processes have exactly the same score (=1);
> the oom killer then just falls back to being a random process killer. This
> was introduced by commit a63d83f427 (oom: badness heuristic rewrite). I
> pointed it out at least three times. You have to blame Google folks. :-/
> 
> 
> The problems are three.
> 
> 1) if two processes have the same oom score, we should kill the younger
> process, but the current logic kills the older one. The oldest processes
> are typically system daemons.
> 2) Current logic uses 'unsigned int' for the internal score calculation
> (strictly speaking, it only uses the values 0-1000). Its very low precision
> calculation makes a lot of identical oom scores and kills ineligible
> processes.
> 3) Current logic gives 3% of system RAM to root processes. That is
> obviously too big if you have plenty of memory. Now, your fork-bomb
> processes have a 500MB OOM-immunity bonus, so your fork bomb will never
> ever be killed.
> 
> 
> CAI-san: I've made patches to fix this. Can you please try them?
Sure, I saw there was some discussion going on between you and David
about your patches. Does it make more sense for me to test them after
you have settled the technical arguments?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-11  2:30               ` CAI Qian
@ 2011-05-11 20:34                 ` David Rientjes
  2011-05-12  0:13                   ` Minchan Kim
  2011-05-13  6:53                   ` CAI Qian
  0 siblings, 2 replies; 58+ messages in thread
From: David Rientjes @ 2011-05-11 20:34 UTC (permalink / raw)
  To: CAI Qian
  Cc: KOSAKI Motohiro, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

On Tue, 10 May 2011, CAI Qian wrote:

> Sure, I saw there was some discussion going on between you and David
> about your patches. Does it make more sense for me to test them after
> you have settled the technical arguments?
> 

Something like the following (untested) patch should fix the issue by 
simply increasing the range of a task's badness from 0-1000 to 0-10000.

There are other things to fix like the tasklist dump output and 
documentation, but this shows how easy it is to increase the resolution of 
the scoring.  (This patch also includes a change to only give root 
processes a 1% bonus for every 30% of memory they use as proposed 
earlier.)


diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	if (p->flags & PF_OOM_ORIGIN) {
 		task_unlock(p);
-		return 1000;
+		return 10000;
 	}
 
 	/*
@@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	points = get_mm_rss(p->mm) + p->mm->nr_ptes;
 	points += get_mm_counter(p->mm, MM_SWAPENTS);
 
-	points *= 1000;
+	points *= 10000;
 	points /= totalpages;
 	task_unlock(p);
 
 	/*
-	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-	 * implementation used by LSMs.
+	 * Root processes get 1% bonus per 30% memory used for a total of 3%
+	 * possible just like LSMs.
 	 */
 	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
-		points -= 30;
+		points -= 100 * (points / 3000);
 
 	/*
 	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
 	 * either completely disable oom killing or always prefer a certain
 	 * task.
 	 */
-	points += p->signal->oom_score_adj;
+	points += p->signal->oom_score_adj * 10;
 
 	/*
 	 * Never return 0 for an eligible task that may be killed since it's
-	 * possible that no single user task uses more than 0.1% of memory and
+	 * possible that no single user task uses more than 0.01% of memory and
 	 * no single admin tasks uses more than 3.0%.
 	 */
 	if (points <= 0)
 		return 1;
-	return (points < 1000) ? points : 1000;
+	return (points < 10000) ? points : 10000;
 }
 
 /*
@@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			 */
 			if (p == current) {
 				chosen = p;
-				*ppoints = 1000;
+				*ppoints = 10000;
 			} else {
 				/*
 				 * If this task is not being ptraced on exit,
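
To see what this does to the scores from CAI's report, a sketch (the 
4194288-page total is from his log; the rss values are illustrative):

	/* the same tasks under the proposed 0-10000 scale */
	static unsigned int sketch_badness_10k(unsigned long rss_pages, int is_root)
	{
		unsigned long totalpages = 4194288;	/* 16GB, from CAI's log */
		long points = rss_pages * 10000 / totalpages;

		if (is_root)
			points -= 100 * (points / 3000);  /* 1% per 30% used */
		return points <= 0 ? 1 : points;
	}
	/* sketch_badness_10k(2048, 0)  ->   4  (8MB daemon)                   */
	/* sketch_badness_10k(86016, 1) -> 205  (336MB root task, no discount) */

so the large consumers now clearly outscore the small daemons instead of 
everything collapsing to a score of 1.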

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-10  8:15               ` [PATCH 2/4] oom: kill younger process first KOSAKI Motohiro
  2011-05-10 23:31                 ` David Rientjes
@ 2011-05-11 23:33                 ` Minchan Kim
  2011-05-12  0:52                 ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2011-05-11 23:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, David Rientjes, Hugh Dickins,
	Oleg Nesterov

On Tue, May 10, 2011 at 5:15 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> This patch introduces do_each_thread_reverse() and
> select_bad_process() uses it. The benefits are two:
> 1) the oom-killer can kill a younger process rather than
> an older one if they have the same oom score. Usually a younger
> process is less important. 2) younger tasks often have PF_EXITING
> because shell scripts create a lot of short-lived processes.
> A reverse-order search can detect them faster.
>
> Reported-by: CAI Qian <caiqian@redhat.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-11 20:34                 ` David Rientjes
@ 2011-05-12  0:13                   ` Minchan Kim
  2011-05-12 19:38                     ` David Rientjes
  2011-05-13  6:53                   ` CAI Qian
  1 sibling, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-05-12  0:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: CAI Qian, KOSAKI Motohiro, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Hugh Dickins, Oleg Nesterov

On Thu, May 12, 2011 at 5:34 AM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 10 May 2011, CAI Qian wrote:
>
>> Sure, I saw there was some discussion going on between you and David
>> about your patches. Does it make more sense for me to test them after
>> you have settled the technical arguments?
>>
>
> Something like the following (untested) patch should fix the issue by
> simply increasing the range of a task's badness from 0-1000 to 0-10000.
>
> There are other things to fix like the tasklist dump output and
> documentation, but this shows how easy it is to increase the resolution of
> the scoring.  (This patch also includes a change to only give root

It does make sense.
I think raising the resolution should be an easy way to fix the problem.

> processes a 1% bonus for every 30% of memory they use as proposed
> earlier.)

I didn't follow your earlier suggestion.
But this isn't a formal patch, so I expect that when you send a formal
patch to merge, you will write down the rationale.

>
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>         */
>        if (p->flags & PF_OOM_ORIGIN) {
>                task_unlock(p);
> -               return 1000;
> +               return 10000;
>        }
>
>        /*
> @@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>        points = get_mm_rss(p->mm) + p->mm->nr_ptes;
>        points += get_mm_counter(p->mm, MM_SWAPENTS);
>
> -       points *= 1000;
> +       points *= 10000;
>        points /= totalpages;
>        task_unlock(p);
>
>        /*
> -        * Root processes get 3% bonus, just like the __vm_enough_memory()
> -        * implementation used by LSMs.
> +        * Root processes get 1% bonus per 30% memory used for a total of 3%
> +        * possible just like LSMs.
>         */
>        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> -               points -= 30;
> +               points -= 100 * (points / 3000);
>
>        /*
>         * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
>         * either completely disable oom killing or always prefer a certain
>         * task.
>         */
> -       points += p->signal->oom_score_adj;
> +       points += p->signal->oom_score_adj * 10;
>
>        /*
>         * Never return 0 for an eligible task that may be killed since it's
> -        * possible that no single user task uses more than 0.1% of memory and
> +        * possible that no single user task uses more than 0.01% of memory and
>         * no single admin tasks uses more than 3.0%.
>         */
>        if (points <= 0)
>                return 1;
> -       return (points < 1000) ? points : 1000;
> +       return (points < 10000) ? points : 10000;
>  }
>
>  /*
> @@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>                         */
>                        if (p == current) {
>                                chosen = p;
> -                               *ppoints = 1000;
> +                               *ppoints = 10000;

Scattering constant values around isn't good, as you are proving right now.
I think you did it since this is not a formal patch;
I expect you will define a new constant (e.g., OOM_INTERNAL_MAX_SCORE or whatever).


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-10  8:15               ` [PATCH 2/4] oom: kill younger process first KOSAKI Motohiro
  2011-05-10 23:31                 ` David Rientjes
  2011-05-11 23:33                 ` Minchan Kim
@ 2011-05-12  0:52                 ` KAMEZAWA Hiroyuki
  2011-05-12  1:30                   ` Minchan Kim
  2011-05-13 10:18                   ` KOSAKI Motohiro
  2 siblings, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-12  0:52 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, David Rientjes,
	Hugh Dickins, Oleg Nesterov

On Tue, 10 May 2011 17:15:01 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> This patch introduces do_each_thread_reverse() and
> select_bad_process() uses it. The benefits are two:
> 1) the oom-killer can kill a younger process rather than
> an older one if they have the same oom score. Usually a younger
> process is less important. 2) younger tasks often have PF_EXITING
> because shell scripts create a lot of short-lived processes.
> A reverse-order search can detect them faster.
> 
> Reported-by: CAI Qian <caiqian@redhat.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

IIUC, for_each_thread() can be called under rcu_read_lock() but 
for_each_thread_reverse() must be under tasklist_lock.

Could you add a comment?  Also, prev_task() should use list_entry(),
not list_entry_rcu().
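
i.e., something like this (a sketch of the suggested change):

	#define prev_task(p) \
		list_entry((p)->tasks.prev, struct task_struct, tasks)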

Thanks,
-Kame

> ---
>  include/linux/sched.h |    6 ++++++
>  mm/oom_kill.c         |    2 +-
>  2 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 013314a..a0a8339 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2194,6 +2194,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
>  #define next_task(p) \
>  	list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
>  
> +#define prev_task(p) \
> +	list_entry_rcu((p)->tasks.prev, struct task_struct, tasks)
> +
>  #define for_each_process(p) \
>  	for (p = &init_task ; (p = next_task(p)) != &init_task ; )
>  
> @@ -2206,6 +2209,9 @@ extern bool current_is_single_threaded(void);
>  #define do_each_thread(g, t) \
>  	for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
>  
> +#define do_each_thread_reverse(g, t) \
> +	for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
> +
>  #define while_each_thread(g, t) \
>  	while ((t = next_thread(t)) != g)
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 118d958..0cf5091 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -282,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  	struct task_struct *chosen = NULL;
>  	*ppoints = 0;
>  
> -	do_each_thread(g, p) {
> +	do_each_thread_reverse(g, p) {
>  		unsigned int points;
>  
>  		if (!p->mm)
> -- 
> 1.7.3.1
> 
> 
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-12  0:52                 ` KAMEZAWA Hiroyuki
@ 2011-05-12  1:30                   ` Minchan Kim
  2011-05-12  1:53                     ` KAMEZAWA Hiroyuki
  2011-05-13 10:18                   ` KOSAKI Motohiro
  1 sibling, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-05-12  1:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, David Rientjes, Hugh Dickins,
	Oleg Nesterov

Hi Kame,

On Thu, May 12, 2011 at 9:52 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 10 May 2011 17:15:01 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
>> This patch introduces do_each_thread_reverse() and
>> select_bad_process() uses it. The benefits are two:
>> 1) the oom-killer can kill a younger process rather than
>> an older one if they have the same oom score. Usually a younger
>> process is less important. 2) younger tasks often have PF_EXITING
>> because shell scripts create a lot of short-lived processes.
>> A reverse-order search can detect them faster.
>>
>> Reported-by: CAI Qian <caiqian@redhat.com>
>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> IIUC, for_each_thread() can be called under rcu_read_lock() but
> for_each_thread_reverse() must be under tasklist_lock.

Just out of curiosity: you mentioned this when I sent the forkbomb
killer patch. :)
Since then, I haven't understood why we need to hold
tasklist_lock rather than rcu_read_lock. Sorry for the dumb question.

At present, it seems that some code uses tasklist_lock and other code uses
rcu_read_lock, but I can't find any rule for that.

Could you elaborate, please?
Shouldn't this be documented?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-12  1:30                   ` Minchan Kim
@ 2011-05-12  1:53                     ` KAMEZAWA Hiroyuki
  2011-05-12  2:23                       ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-12  1:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, David Rientjes, Hugh Dickins,
	Oleg Nesterov

On Thu, 12 May 2011 10:30:45 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi Kame,
> 
> On Thu, May 12, 2011 at 9:52 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 10 May 2011 17:15:01 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> >
> >> This patch introduces do_each_thread_reverse() and
> >> select_bad_process() uses it. The benefits are two:
> >> 1) the oom-killer can kill a younger process rather than
> >> an older one if they have the same oom score. Usually a younger
> >> process is less important. 2) younger tasks often have PF_EXITING
> >> because shell scripts create a lot of short-lived processes.
> >> A reverse-order search can detect them faster.
> >>
> >> Reported-by: CAI Qian <caiqian@redhat.com>
> >> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >
> > IIUC, for_each_thread() can be called under rcu_read_lock() but
> > for_each_thread_reverse() must be under tasklist_lock.
> 
> Just out of curiosity: you mentioned this when I sent the forkbomb
> killer patch. :)
> Since then, I haven't understood why we need to hold
> tasklist_lock rather than rcu_read_lock. Sorry for the dumb question.
> 
> At present, it seems that some code uses tasklist_lock and other code uses
> rcu_read_lock, but I can't find any rule for that.
> 

for_each_list_rcu() makes use of the RCU list's characteristics and allows
walking a list under rcu_read_lock() without taking any atomic locks.

list_del() on an RCU list works as follows.

==
 1) assume A, B, C are linked in the list.
	(head)<->(A) <-> (B)  <-> (C)

 2) remove B.
	(head)<->(A) <-> (C)
		        /
                     (B)

 Because (B)'s next points to (C) even after (B) is removed, (B)->next
 points to a live object. Even if (C) is removed at the same time,
 (C) is not freed until an RCU grace period passes, and (C)'s next points to (head).

So for_each_list_rcu() works well under rcu_read_lock(); it will visit
only live objects (though they may no longer be on the list).

==

please see include/linux/rculist.h and check list_add_rcu() ;)
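
For reference, this is roughly what list_del_rcu() does (paraphrased from
include/linux/rculist.h of this era):

	static inline void list_del_rcu(struct list_head *entry)
	{
		__list_del(entry->prev, entry->next);	/* unlink from neighbours */
		entry->prev = LIST_POISON2;	/* ->prev poisoned, ->next left intact */
	}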

As the above implies, the (B)->prev pointer is an invalid pointer after list_del().
So a reverse walk with for_each_list_reverse under rcu_read_lock() would
race with list modification.

So, the cases where you need to take an atomic lock (as tasklist_lock is) are...

 1) You can't tell whether 'entry' is valid or not...
    In the for_each_list_rcu() above, you may visit an object which is in the
    middle of being removed. You need some flag or check to see whether the
    object is valid or not.

 2) You want to use list_for_each_safe().
    You can't list_del() an object which is already being removed...

 3) You want to walk the list in reverse.

 4) Some other reasons. For example, you'll access an object pointed to by
    'entry' and the object is not rcu-safe.

Make sense?

Thanks,
-Kame


> Could you elaborate, please?
> Shouldn't this be documented?
> 
> -- 
> Kind regards,
> Minchan Kim
> 
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-12  1:53                     ` KAMEZAWA Hiroyuki
@ 2011-05-12  2:23                       ` Minchan Kim
  2011-05-12  3:39                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-05-12  2:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, David Rientjes, Hugh Dickins,
	Oleg Nesterov

On Thu, May 12, 2011 at 10:53 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 12 May 2011 10:30:45 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi Kame,
>>
>> On Thu, May 12, 2011 at 9:52 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Tue, 10 May 2011 17:15:01 +0900 (JST)
>> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >
>> >> This patch introduces do_each_thread_reverse() and
>> >> select_bad_process() uses it. The benefits are two:
>> >> 1) the oom-killer can kill a younger process rather than
>> >> an older one if they have the same oom score. Usually a younger
>> >> process is less important. 2) younger tasks often have PF_EXITING
>> >> because shell scripts create a lot of short-lived processes.
>> >> A reverse-order search can detect them faster.
>> >>
>> >> Reported-by: CAI Qian <caiqian@redhat.com>
>> >> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> >
>> > IIUC, for_each_thread() can be called under rcu_read_lock() but
>> > for_each_thread_reverse() must be under tasklist_lock.
>>
>> Just out of curiosity: you mentioned this when I sent the forkbomb
>> killer patch. :)
>> Since then, I haven't understood why we need to hold
>> tasklist_lock rather than rcu_read_lock. Sorry for the dumb question.
>>
>> At present, it seems that some code uses tasklist_lock and other code uses
>> rcu_read_lock, but I can't find any rule for that.
>>
>
> for_each_list_rcu() makes use of the RCU list's characteristics and allows
> walking a list under rcu_read_lock() without taking any atomic locks.
>
> list_del() on an RCU list works as follows.
>
> ==
>  1) assume  A, B, C, are linked in the list.
>        (head)<->(A) <-> (B)  <-> (C)
>
>  2) remove B.
>        (head)<->(A) <-> (C)
>                        /
>                     (B)
>
>  Because (B)'s next points to (C) even after (B) is removed, (B)->next
>  points to a live object. Even if (C) is removed at the same time,
>  (C) is not freed until an RCU grace period passes, and (C)'s next points to (head).
>
> So for_each_list_rcu() works well under rcu_read_lock(); it will visit
> only live objects (though they may no longer be on the list).
>
> ==
>
> please see include/linux/rculist.h and check list_add_rcu() ;)
>
> As the above implies, the (B)->prev pointer is an invalid pointer after list_del().
> So a reverse walk with for_each_list_reverse under rcu_read_lock() would
> race with list modification.
>
> So, the cases where you need to take an atomic lock (as tasklist_lock is) are...
>
>  1) You can't tell whether 'entry' is valid or not...
>    In the for_each_list_rcu() above, you may visit an object which is in the
>    middle of being removed. You need some flag or check to see whether the
>    object is valid or not.
>
>  2) You want to use list_for_each_safe().
>    You can't list_del() an object which is already being removed...
>
>  3) You want to walk the list in reverse.
>
>  4) Some other reasons. For example, you'll access an object pointed to by
>    'entry' and the object is not rcu-safe.
>
> Make sense?

Yes. Thanks, Kame.
It seems it is caused by the prev poisoning in list_del_rcu().
If we removed that, wouldn't it be possible to traverse in reverse without an atomic lock?



>
> Thanks,
> -Kame
>
>
>> Could you elaborate, please?
>> Shouldn't this be documented?
>>
>> --
>> Kind regards,
>> Minchan Kim
>>
>>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-12  2:23                       ` Minchan Kim
@ 2011-05-12  3:39                         ` KAMEZAWA Hiroyuki
  2011-05-12  4:17                           ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-12  3:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, David Rientjes, Hugh Dickins,
	Oleg Nesterov

On Thu, 12 May 2011 11:23:38 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Thu, May 12, 2011 at 10:53 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 12 May 2011 10:30:45 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:

> > As the above implies, the (B)->prev pointer is an invalid pointer after list_del().
> > So a reverse walk with for_each_list_reverse under rcu_read_lock() would
> > race with list modification.
> >
> > So, the cases where you need to take an atomic lock (as tasklist_lock is) are...
> >
> >  1) You can't tell whether 'entry' is valid or not...
> >    In the for_each_list_rcu() above, you may visit an object which is in the
> >    middle of being removed. You need some flag or check to see whether the
> >    object is valid or not.
> >
> >  2) You want to use list_for_each_safe().
> >    You can't list_del() an object which is already being removed...
> >
> >  3) You want to walk the list in reverse.
> >
> >  4) Some other reasons. For example, you'll access an object pointed to by
> >    'entry' and the object is not rcu-safe.
> > 
> > Make sense?
> 
> Yes. Thanks, Kame.
> It seems it is caused by the prev poisoning in list_del_rcu().
> If we removed that, wouldn't it be possible to traverse in reverse without an atomic lock?
> 

IIUC, it's possible (fix me if I'm wrong) but I don't like it, for two reasons.

1. LIST_POISON is very important information for debugging.

2. If we don't clear the prev pointer, OK, we'd allow a two-directional walk
   of the list under RCU.
   But consider the following case:
   1. you are now at (C); you'll visit (C)->next... (D)
   2. you are now at (D); you want to go back to (C) via (D)->prev.
   3. But (D)->prev points to (B).

  It's not a two-directional list; it's something else, or a broken one.
  So the rcu list is one-directional in nature, I think.

So, without a very, very big reason, we should keep the POISON.

Thanks,
-Kame











^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-12  3:39                         ` KAMEZAWA Hiroyuki
@ 2011-05-12  4:17                           ` Minchan Kim
  2011-05-12 14:38                             ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-05-12  4:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, David Rientjes, Hugh Dickins,
	Oleg Nesterov

On Thu, May 12, 2011 at 12:39 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 12 May 2011 11:23:38 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Thu, May 12, 2011 at 10:53 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 12 May 2011 10:30:45 +0900
>> > Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> > As the above implies, the (B)->prev pointer is an invalid pointer after list_del().
>> > So a reverse walk with for_each_list_reverse under rcu_read_lock() would
>> > race with list modification.
>> >
>> > So, the cases where you need to take an atomic lock (as tasklist_lock is) are...
>> >
>> >  1) You can't tell whether 'entry' is valid or not...
>> >    In the for_each_list_rcu() above, you may visit an object which is in the
>> >    middle of being removed. You need some flag or check to see whether the
>> >    object is valid or not.
>> >
>> >  2) You want to use list_for_each_safe().
>> >    You can't list_del() an object which is already being removed...
>> >
>> >  3) You want to walk the list in reverse.
>> >
>> >  4) Some other reasons. For example, you'll access an object pointed to by
>> >    'entry' and the object is not rcu-safe.
>> >
>> > Make sense?
>>
>> Yes. Thanks, Kame.
>> It seems it is caused by the prev poisoning in list_del_rcu().
>> If we removed that, wouldn't it be possible to traverse in reverse without an atomic lock?
>>
>
> IIUC, it's possible (fix me if I'm wrong) but I don't like it, for two reasons.
>
> 1. LIST_POISON is very important information for debugging.

Indeed.
But if we gained something better in exchange for losing the debug
facility, I think it would be okay.

>
> 2. If we don't clear the prev pointer, OK, we'd allow a two-directional walk
>   of the list under RCU.
>   But consider the following case:
>   1. you are now at (C); you'll visit (C)->next... (D)
>   2. you are now at (D); you want to go back to (C) via (D)->prev.
>   3. But (D)->prev points to (B).
>
>  It's not a two-directional list; it's something else, or a broken one.

Yes, but it shouldn't be a problem under RCU semantics.
If you need such consistency, you should use a lock.

I recall an old thread about it.
In http://lwn.net/Articles/262464/, mmutz and Paul already discussed
it. :)

>  So the rcu list is one-directional in nature, I think.

Yes. But the reason the RCU list became one-directional is that we couldn't find a useful use case.

>
> So, without a very, very big reason, we should keep the POISON.

Agreed.
I won't insist on it, as there isn't a use case compelling enough to
persuade Paul; this is not a hot path, after all.

It started from simple curiosity.
Thanks very much for clarifying that, Kame!

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-12  4:17                           ` Minchan Kim
@ 2011-05-12 14:38                             ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2011-05-12 14:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, CAI Qian, avagin,
	Andrey Vagin, Andrew Morton, Mel Gorman, linux-mm, linux-kernel,
	David Rientjes, Hugh Dickins, Oleg Nesterov

On Thu, May 12, 2011 at 01:17:13PM +0900, Minchan Kim wrote:
> On Thu, May 12, 2011 at 12:39 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 12 May 2011 11:23:38 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Thu, May 12, 2011 at 10:53 AM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > On Thu, 12 May 2011 10:30:45 +0900
> >> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> > As the above implies, the (B)->prev pointer is an invalid pointer after list_del().
> >> > So a reverse walk with for_each_list_reverse under rcu_read_lock() would
> >> > race with list modification.
> >> >
> >> > So, the cases where you need to take an atomic lock (as tasklist_lock is) are...
> >> >
> >> >  1) You can't tell whether 'entry' is valid or not...
> >> >    In the for_each_list_rcu() above, you may visit an object which is in the
> >> >    middle of being removed. You need some flag or check to see whether the
> >> >    object is valid or not.
> >> >
> >> >  2) You want to use list_for_each_safe().
> >> >    You can't list_del() an object which is already being removed...
> >> >
> >> >  3) You want to walk the list in reverse.
> >> >
> >> >  4) Some other reasons. For example, you'll access an object pointed to by
> >> >    'entry' and the object is not rcu-safe.
> >> >
> >> > Make sense?
> >>
> >> Yes. Thanks, Kame.
> >> It seems it is caused by the prev poisoning in list_del_rcu().
> >> If we removed that, wouldn't it be possible to traverse in reverse without an atomic lock?
> >>
> >
> > IIUC, it's possible (fix me if I'm wrong) but I don't like it, for two reasons.
> >
> > 1. LIST_POISON is very important information for debugging.
> 
> Indeed.
> But if we gained something better in exchange for losing the debug
> facility, I think it would be okay.
> 
> >
> > 2. If we don't clear the prev pointer, OK, we'd allow a two-directional walk
> >   of the list under RCU.
> >   But consider the following case:
> >   1. you are now at (C); you'll visit (C)->next... (D)
> >   2. you are now at (D); you want to go back to (C) via (D)->prev.
> >   3. But (D)->prev points to (B).
> >
> >  It's not a two-directional list; it's something else, or a broken one.
> 
> Yes, but it shouldn't be a problem under RCU semantics.
> If you need such consistency, you should use a lock.
> 
> I recall an old thread about it.
> In http://lwn.net/Articles/262464/, mmutz and Paul already discussed
> it. :)
> 
> >  So the rcu list is one-directional in nature, I think.
> 
> Yes. But the reason the RCU list became one-directional is that we couldn't find a useful use case.
> 
> >
> > So, without a very, very big reason, we should keep the POISON.
> 
> Agreed.
> I won't insist on it, as there isn't a use case compelling enough to
> persuade Paul; this is not a hot path, after all.
> 
> It started from simple curiosity.
> Thanks very much for clarifying that, Kame!

Indeed, we would need a large performance/scalability/simplicity advantage
to put up with such a loss of debugging information.  If it turns out
that you really need this, please let me know, but please also provide
data supporting your need.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-12  0:13                   ` Minchan Kim
@ 2011-05-12 19:38                     ` David Rientjes
  2011-05-13  4:16                       ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-05-12 19:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: CAI Qian, KOSAKI Motohiro, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Hugh Dickins, Oleg Nesterov

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4738 bytes --]

On Thu, 12 May 2011, Minchan Kim wrote:

> > processes a 1% bonus for every 30% of memory they use as proposed
> > earlier.)
> 
> I didn't follow your earlier suggestion.
> But this isn't a formal patch, so I expect that when you send a formal
> patch to merge, you will write down the rationale.
> 

Yes, I'm sure we'll still have additional discussion when KOSAKI-san 
replies to my review of his patchset, so this quick patch was written only 
for CAI's testing at this point.

In reference to the above, I think that giving root processes a 3% bonus 
at all times may be a bit aggressive.  As mentioned before, I don't think 
that all root processes using 4% of memory and the remainder of the 
system's threads using 1% should be considered equal.  At the same time, I 
do not believe that two threads using 50% of memory should be considered 
equal if one is root and one is not.  So my idea was to discount 1% for 
every 30% of memory that a root process uses rather than a strict 3%.

That change can be debated, and I think we'll probably settle on something 
more aggressive like 1% for every 10% of memory used, since oom scores are 
only useful in comparison to other oom scores: in the above scenario where 
there are two threads, one owned by root and one not, using 50% of 
memory each, I think it would be legitimate to give the root task a 5% 
bonus so that it would only be selected if no other threads used more than 
44% of memory (even though the root thread is truly using 50%).

This is a heuristic within the oom killer badness scoring that can always 
be debated back and forth, but I think a 1% bonus for root processes for 
every 10% of memory used is plausible.
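
A quick sketch of that 1%-per-10% variant on the 10000-point scale from 
the earlier patch (my arithmetic, for illustration only):

	static unsigned int root_bonus_10pct(unsigned int points)
	{
		return 100 * (points / 1000);	/* 1% of scale per 10% of RAM */
	}
	/* root task at 50% of RAM: points = 5000, bonus = 500 -> 4500,
	 * so it competes as if it used ~45% of memory */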

Comments?

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> >         */
> >        if (p->flags & PF_OOM_ORIGIN) {
> >                task_unlock(p);
> > -               return 1000;
> > +               return 10000;
> >        }
> >
> >        /*
> > @@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> >        points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> >        points += get_mm_counter(p->mm, MM_SWAPENTS);
> >
> > -       points *= 1000;
> > +       points *= 10000;
> >        points /= totalpages;
> >        task_unlock(p);
> >
> >        /*
> > -        * Root processes get 3% bonus, just like the __vm_enough_memory()
> > -        * implementation used by LSMs.
> > +        * Root processes get 1% bonus per 30% memory used for a total of 3%
> > +        * possible just like LSMs.
> >         */
> >        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> > -               points -= 30;
> > +               points -= 100 * (points / 3000);
> >
> >        /*
> >         * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> >         * either completely disable oom killing or always prefer a certain
> >         * task.
> >         */
> > -       points += p->signal->oom_score_adj;
> > +       points += p->signal->oom_score_adj * 10;
> >
> >        /*
> >         * Never return 0 for an eligible task that may be killed since it's
> > -        * possible that no single user task uses more than 0.1% of memory and
> > +        * possible that no single user task uses more than 0.01% of memory and
> >         * no single admin tasks uses more than 3.0%.
> >         */
> >        if (points <= 0)
> >                return 1;
> > -       return (points < 1000) ? points : 1000;
> > +       return (points < 10000) ? points : 10000;
> >  }
> >
> >  /*
> > @@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> >                         */
> >                        if (p == current) {
> >                                chosen = p;
> > -                               *ppoints = 1000;
> > +                               *ppoints = 10000;
> 
> Scattering constant values around isn't good, as you are proving right now.
> I think you did it since this is not a formal patch;
> I expect you will define a new constant (e.g., OOM_INTERNAL_MAX_SCORE or whatever).
> 

Right, we could probably do something like

	#define OOM_SCORE_MAX_FACTOR	10
	#define OOM_SCORE_MAX		(OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)

in mm/oom_kill.c, which would then be used to replace all of the constants 
above since OOM_SCORE_ADJ_MAX is already defined to be 1000 in 
include/linux/oom.h.
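
A sketch of how those definitions would then replace the literals in the 
earlier patch (hypothetical, just following the suggestion above):

	#define OOM_SCORE_MAX_FACTOR	10
	#define OOM_SCORE_MAX	(OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)

	points *= OOM_SCORE_MAX;				   /* was: *= 10000 */
	points += p->signal->oom_score_adj * OOM_SCORE_MAX_FACTOR; /* was: * 10 */
	return (points < OOM_SCORE_MAX) ? points : OOM_SCORE_MAX;  /* was: 10000 */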

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-12 19:38                     ` David Rientjes
@ 2011-05-13  4:16                       ` Minchan Kim
  2011-05-13 11:04                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-05-13  4:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: CAI Qian, KOSAKI Motohiro, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Hugh Dickins, Oleg Nesterov

On Fri, May 13, 2011 at 4:38 AM, David Rientjes <rientjes@google.com> wrote:
> On Thu, 12 May 2011, Minchan Kim wrote:
>
>> > processes a 1% bonus for every 30% of memory they use as proposed
>> > earlier.)
>>
>> I didn't follow your earlier suggestion.
>> But this isn't a formal patch, so I expect that when you send a formal
>> patch to merge, you will write down the rationale.
>>
>
> Yes, I'm sure we'll still have additional discussion when KOSAKI-san
> replies to my review of his patchset, so this quick patch was written only
> for CAI's testing at this point.
>
> In reference to the above, I think that giving root processes a 3% bonus
> at all times may be a bit aggressive.  As mentioned before, I don't think
> that all root processes using 4% of memory and the remainder of the system's
> threads using 1% should be considered equal.  At the same time, I
> do not believe that two threads using 50% of memory should be considered
> equal if one is root and one is not.  So my idea was to discount 1% for
> every 30% of memory that a root process uses rather than a strict 3%.
>
> That change can be debated and I think we'll probably settle on something
> more aggressive like 1% for every 10% of memory used since oom scores are
> only useful in comparison to other oom scores: in the above scenario where
> there are two threads, one by root and one not by root, using 50% of
> memory each, I think it would be legitimate to give the root task a 5%
> bonus so that it would only be selected if no other threads used more than
> 44% of memory (even though the root thread is truly using 50%).
>
> This is a heuristic within the oom killer badness scoring that can always
> be debated back and forth, but I think a 1% bonus for root processes for
> every 10% of memory used is plausible.
>
> Comments?

Yes, I tend to agree.
Apparently, the absolute 3% bonus is a problem in CAI's case.

Your approach, which makes the bonus a function of rss, is consistent
with the current OOM heuristic.
So from a consistency POV, I like it, as it could help make the OOM policy
deterministic.

About the 30% versus 10% question, I think it's hard to define an ideal
magic value that handles all workloads. It would be very arguable. So we
might need some standard method to measure it, or input from the Red
Hat/SUSE people. Anyway, I don't want to argue about it until we get numbers.

>
>> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> > --- a/mm/oom_kill.c
>> > +++ b/mm/oom_kill.c
>> > @@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>> >         */
>> >        if (p->flags & PF_OOM_ORIGIN) {
>> >                task_unlock(p);
>> > -               return 1000;
>> > +               return 10000;
>> >        }
>> >
>> >        /*
>> > @@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>> >        points = get_mm_rss(p->mm) + p->mm->nr_ptes;
>> >        points += get_mm_counter(p->mm, MM_SWAPENTS);
>> >
>> > -       points *= 1000;
>> > +       points *= 10000;
>> >        points /= totalpages;
>> >        task_unlock(p);
>> >
>> >        /*
>> > -        * Root processes get 3% bonus, just like the __vm_enough_memory()
>> > -        * implementation used by LSMs.
>> > +        * Root processes get 1% bonus per 30% memory used for a total of 3%
>> > +        * possible just like LSMs.
>> >         */
>> >        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>> > -               points -= 30;
>> > +               points -= 100 * (points / 3000);
>> >
>> >        /*
>> >         * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
>> >         * either completely disable oom killing or always prefer a certain
>> >         * task.
>> >         */
>> > -       points += p->signal->oom_score_adj;
>> > +       points += p->signal->oom_score_adj * 10;
>> >
>> >        /*
>> >         * Never return 0 for an eligible task that may be killed since it's
>> > -        * possible that no single user task uses more than 0.1% of memory and
>> > +        * possible that no single user task uses more than 0.01% of memory and
>> >         * no single admin tasks uses more than 3.0%.
>> >         */
>> >        if (points <= 0)
>> >                return 1;
>> > -       return (points < 1000) ? points : 1000;
>> > +       return (points < 10000) ? points : 10000;
>> >  }
>> >
>> >  /*
>> > @@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>> >                         */
>> >                        if (p == current) {
>> >                                chosen = p;
>> > -                               *ppoints = 1000;
>> > +                               *ppoints = 10000;
>>
>> Scattering constant values around isn't good, as you are proving right now.
>> I think you did it since this is not a formal patch;
>> I expect you will define a new constant (e.g., OOM_INTERNAL_MAX_SCORE or whatever).
>>
>
> Right, we could probably do something like
>
>        #define OOM_SCORE_MAX_FACTOR    10
>        #define OOM_SCORE_MAX           (OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)
>
> in mm/oom_kill.c, which would then be used to replace all of the constants
> above since OOM_SCORE_ADJ_MAX is already defined to be 1000 in
> include/linux/oom.h.

Looks good to me.
Let's wait for KOSAKI's opinion and CAI's test results.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-11 20:34                 ` David Rientjes
  2011-05-12  0:13                   ` Minchan Kim
@ 2011-05-13  6:53                   ` CAI Qian
  2011-05-16 20:46                     ` David Rientjes
  1 sibling, 1 reply; 58+ messages in thread
From: CAI Qian @ 2011-05-13  6:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov



----- Original Message -----
> On Tue, 10 May 2011, CAI Qian wrote:
> 
> > Sure, I saw there was some discussion going on between you and
> > David about your patches. Does it make more sense for me to test
> > them after you have settled the technical arguments?
> >
> 
> Something like the following (untested) patch should fix the issue by
> simply increasing the range of a task's badness from 0-1000 to
> 0-10000.
> 
> There are other things to fix like the tasklist dump output and
> documentation, but this shows how easy it is to increase the
> resolution of
> the scoring. (This patch also includes a change to only give root
> processes a 1% bonus for every 30% of memory they use as proposed
> earlier.)
I have had a chance to test this patch after applying it manually (it
did not apply cleanly) on top of the mainline kernel. The test is still
running because the oom killer keeps killing tons of python processes
instead of the parent. Isn't there a way for the oom killer to be smart
enough to do "killall python"?
> 
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>  	 */
>  	if (p->flags & PF_OOM_ORIGIN) {
>  		task_unlock(p);
> -		return 1000;
> +		return 10000;
>  	}
> 
>  	/*
> @@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>  	points = get_mm_rss(p->mm) + p->mm->nr_ptes;
>  	points += get_mm_counter(p->mm, MM_SWAPENTS);
> 
> -	points *= 1000;
> +	points *= 10000;
>  	points /= totalpages;
>  	task_unlock(p);
> 
>  	/*
> -	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> -	 * implementation used by LSMs.
> +	 * Root processes get 1% bonus per 30% memory used for a total of 3%
> +	 * possible just like LSMs.
>  	 */
>  	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> -		points -= 30;
> +		points -= 100 * (points / 3000);
> 
>  	/*
>  	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
>  	 * either completely disable oom killing or always prefer a certain
>  	 * task.
>  	 */
> -	points += p->signal->oom_score_adj;
> +	points += p->signal->oom_score_adj * 10;
> 
>  	/*
>  	 * Never return 0 for an eligible task that may be killed since it's
> -	 * possible that no single user task uses more than 0.1% of memory and
> +	 * possible that no single user task uses more than 0.01% of memory and
>  	 * no single admin tasks uses more than 3.0%.
>  	 */
>  	if (points <= 0)
>  		return 1;
> -	return (points < 1000) ? points : 1000;
> +	return (points < 10000) ? points : 10000;
>  }
> 
>  /*
> @@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  			 */
>  			if (p == current) {
>  				chosen = p;
> -				*ppoints = 1000;
> +				*ppoints = 10000;
>  			} else {
>  				/*
>  				 * If this task is not being ptraced on exit,

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] oom: improve dump_tasks() show items
  2011-05-10 23:29                 ` David Rientjes
@ 2011-05-13 10:14                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-13 10:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

Hi

Sorry for the delay. My machine crashed this week and I lost a lot of
e-mail.


> On Tue, 10 May 2011, KOSAKI Motohiro wrote:
>
>> Recently, the oom internal logic was dramatically changed. Thus
>> dump_tasks() is no longer useful: it shows some meaningless items
>> and lacks some oom-score-related items.
>>
>
> This changelog is inaccurate.
>
> dump_tasks() is actually useful as it currently stands; there are things
> that you may add or remove but saying that it is "no longer useful" is an
> exaggeration.

Hm. OK.

>
>> This patch adapts the displayed fields to the new oom logic.
>>
>> details
>> ==========
>> removed: pid (we always kill whole processes; no need for a thread id),
>>           mm->total_vm (we no longer use the virtual memory size)
>
> Showing mm->total_vm is still interesting to know what the old heuristic
> would have used rather than the new one; I'd prefer that we keep it.

OK, reasonable.



>
>>           signal->oom_adj (we no longer use it internally)
>> added: ppid (we often sacrifice a child process)
>> modified: RSS (now accounts for mm->nr_ptes too)
>
> I'd prefer that ptes be shown independently of rss instead of being added
> to the thread's true rss usage and represented as such.

No, nr_ptes should always be accounted as rss. I plan to change RLIMIT_RSS
in the same way, because when end users set RLIMIT_RSS they intend to limit
physical memory usage; the current total_rss = anon-rss + file-rss covers
only a subset of that and is just an implementation artifact. On the other
hand, if we made separate limits, RLIMIT_RSS and RLIMIT_PTE, then
RLIMIT_RSS alone wouldn't prevent a zero-page DoS attack. That's not
optimal.


> I think the cpu should also be removed.

ok.


> For the next version, could you show the old output and a comparison with
> the new output in the changelog?

Will do.
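
For reference, the header-line change, taken directly from the two
pr_info() format strings in the patch below:

	old: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
	new: [   pid]   ppid   uid      rss  cpu score_adj name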


>
>>
>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> ---
>>
>> Strictly speaking, this is NOT part of the oom fixing patches, but it's
>> necessary for parsing CAI's test results.
>>
>>
>>   mm/oom_kill.c |   14 ++++++++------
>>   1 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index f52e85c..118d958 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -355,7 +355,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
>>   	struct task_struct *p;
>>   	struct task_struct *task;
>>
>> -	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
>> +	pr_info("[   pid]   ppid   uid      rss  cpu score_adj name\n");
>>   	for_each_process(p) {
>>   		if (oom_unkillable_task(p, mem, nodemask))
>>   			continue;
>> @@ -370,11 +370,13 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
>>   			continue;
>>   		}
>>
>> -		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
>> -			task->pid, task_uid(task), task->tgid,
>> -			task->mm->total_vm, get_mm_rss(task->mm),
>> -			task_cpu(task), task->signal->oom_adj,
>> -			task->signal->oom_score_adj, task->comm);
>> +		pr_info("[%6d] %6d %5d %8lu %4u %9d %s\n",
>> +			task_tgid_nr(task), task_tgid_nr(task->real_parent),
>> +			task_uid(task),
>> +			get_mm_rss(task->mm) + task->mm->nr_ptes,
>> +			task_cpu(task),
>> +			task->signal->oom_score_adj,
>> +			task->comm);
>>   		task_unlock(task);
>>   	}
>>   }
>



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-10 23:31                 ` David Rientjes
@ 2011-05-13 10:15                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-13 10:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

(2011/05/11 8:31), David Rientjes wrote:
> On Tue, 10 May 2011, KOSAKI Motohiro wrote:
>
>> This patch introduces do_each_thread_reverse() and makes
>> select_bad_process() use it. The benefits are twofold:
>> 1) the oom-killer can kill a younger process in preference
>> to an older one when they have the same oom score; usually
>> the younger process is less important. 2) younger tasks
>> often have PF_EXITING set, because shell scripts create a
>> lot of short-lived processes, and a reverse-order search
>> detects them faster.
>>
>
> I like this change, thanks!  I'm surprised we haven't needed a
> do_each_thread_reverse() somewhere else in the kernel in the past.
>
> Could you update the comment about do_each_thread() not being break-safe
> in the second version, though?

ok.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] oom: kill younger process first
  2011-05-12  0:52                 ` KAMEZAWA Hiroyuki
  2011-05-12  1:30                   ` Minchan Kim
@ 2011-05-13 10:18                   ` KOSAKI Motohiro
  1 sibling, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-13 10:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, David Rientjes,
	Hugh Dickins, Oleg Nesterov

(2011/05/12 9:52), KAMEZAWA Hiroyuki wrote:
> On Tue, 10 May 2011 17:15:01 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
>> This patch introduces do_each_thread_reverse() and makes
>> select_bad_process() use it. The benefits are twofold:
>> 1) the oom-killer can kill a younger process in preference
>> to an older one when they have the same oom score; usually
>> the younger process is less important. 2) younger tasks
>> often have PF_EXITING set, because shell scripts create a
>> lot of short-lived processes, and a reverse-order search
>> detects them faster.
>>
>> Reported-by: CAI Qian <caiqian@redhat.com>
>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> IIUC, for_each_thread() can be called under rcu_read_lock(), but
> for_each_thread_reverse() must be called under tasklist_lock.
>
> Could you add a comment? Also, prev_task() should use list_entry(),
> not list_entry_rcu().

Will fix. thanks.
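
For reference, a sketch of what the fixed helpers might look like; this is
a guess at the respin under the constraints above, not the actual merged
code:

	/* Reverse counterpart of next_task(). Plain list_entry(): unlike
	 * the forward walk, the caller must hold tasklist_lock. */
	#define prev_task(p) \
		list_entry((p)->tasks.prev, struct task_struct, tasks)

	/* Walk the process list newest-first; pairs with the existing
	 * while_each_thread() and, like do_each_thread(), is a double
	 * loop that is not break-safe. */
	#define do_each_thread_reverse(g, t) \
		for (g = t = &init_task; (g = t = prev_task(g)) != &init_task; ) do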



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally
  2011-05-10 23:40                 ` David Rientjes
@ 2011-05-13 10:30                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-13 10:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: CAI Qian, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

(2011/05/11 8:40), David Rientjes wrote:
> On Tue, 10 May 2011, KOSAKI Motohiro wrote:
>
>> CAI Qian reported that his kernel hung up when he ran a fork-intensive
>> workload that then invoked the oom-killer.
>>
>> The problem is that the current oom calculation uses a 0-1000
>> normalized value (the unit is a permillage of system-ram). Its low
>> precision produces a lot of identical oom scores. IOW, in his case
>> all processes had an oom score of <1, and the internal integer
>> calculation rounded it up to 1, so the oom-killer killed ineligible
>> processes. This regression was caused by commit a63d83f427 (oom:
>> badness heuristic rewrite).
>>
>> The solution is to make the internal calculation use the number of
>> pages instead of a permillage of system-ram, and to convert it to a
>> permillage value at display time.
>>
>> This patch doesn't change any ABI (including /proc/<pid>/oom_score_adj)
>> even though the current logic has a lot of things I dislike.
>>
>
> s/permillage/proportion/
>
> This is unacceptable, it does not allow users to tune oom_score_adj
> appropriately based on the scores exported by /proc/pid/oom_score to
> discount an amount of RAM from a thread's memory usage in systemwide,
> memory controller, cpuset, or mempolicy contexts.  This is only possible
> because the oom score is normalized.

You misunderstand the code. The patch doesn't change oom_score; it
changes fs/proc too, so the displayed value stays the same.

>
> What would be acceptable would be to increase the granularity of the score
> to 10000 or 100000 to differentiate between threads using 0.01% or 0.001%
> of RAM from each other, respectively.  The range of oom_score_adj would
> remain the same, however, and be multiplied by 10 or 100, respectively,
> when factored into the badness score baseline.  I don't believe userspace
> cares to differentiate between more than 0.1% of available memory.

Currently, SGI sells 16TB machines, and 16TB x 0.1% = 16GB. I don't think
your fork-bomb processes use more than 16GB each, thus your patch is
unacceptable.

So please read the code again, or run it.

> The other issue that this patch addresses is the bonus given to root
> processes.  I agree that a root process using 4% of RAM should not be
> considered equal to other threads using 1%.  I do believe, however, that
> a root process using 60% of RAM should be equal in priority to a thread
> using 57%.  Perhaps a compromise would be to give root processes a
> bonus of 1% for every 30% of RAM they consume?

I think you are talking about patch [4/4], right? Patches [3/4] and [4/4]
attack different issues: the big-machine issue and the root-user issue,
respectively.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-13  4:16                       ` Minchan Kim
@ 2011-05-13 11:04                         ` KOSAKI Motohiro
  2011-05-16 20:42                           ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: KOSAKI Motohiro @ 2011-05-13 11:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Hugh Dickins, Oleg Nesterov

(2011/05/13 13:16), Minchan Kim wrote:
> On Fri, May 13, 2011 at 4:38 AM, David Rientjes <rientjes@google.com> wrote:
>> On Thu, 12 May 2011, Minchan Kim wrote:
>>
>>>> processes a 1% bonus for every 30% of memory they use as proposed
>>>> earlier.)
>>>
>>> I didn't follow your earlier suggestion.
>>> But it's not a formal patch, so I expect that if you send a formal
>>> patch to merge, you will write down the rationale.
>>>
>>
>> Yes, I'm sure we'll still have additional discussion when KOSAKI-san
>> replies to my review of his patchset, so this quick patch was written only
>> for CAI's testing at this point.
>>
>> In reference to the above, I think that giving root processes a 3% bonus
>> at all times may be a bit aggressive.  As mentioned before, I don't think
>> that a root process using 4% of memory and the remaining system threads
>> using 1% should all be considered equal.  At the same time, I
>> do not believe that two threads using 50% of memory should be considered
>> equal if one is root and one is not.  So my idea was to discount 1% for
>> every 30% of memory that a root process uses rather than a strict 3%.
>>
>> That change can be debated and I think we'll probably settle on something
>> more aggressive like 1% for every 10% of memory used since oom scores are
>> only useful in comparison to other oom scores: in the above scenario where
>> there are two threads, one by root and one not by root, using 50% of
>> memory each, I think it would be legitimate to give the root task a 5%
>> bonus so that it would only be selected if no other threads used more than
>> 44% of memory (even though the root thread is truly using 50%).
>>
>> This is a heuristic within the oom killer badness scoring that can always
>> be debated back and forth, but I think a 1% bonus for root processes for
>> every 10% of memory used is plausible.
>>
>> Comments?
>
> Yes, I tend to agree.
> Apparently the absolute 3% bonus is a problem in CAI's case.
>
> Your approach, which makes the bonus a function of rss, is consistent
> with the current OOM heuristic.
> So from a consistency POV I like it, as it could help make the OOM
> policy deterministic.
>
> About the 30%-vs-10% question, I think it's hard to define an ideal
> magic value that handles all workloads.
> It would be very arguable, so we might need some standard way to
> measure it, or input from the Red Hat/SUSE people. Anyway, I don't
> want to argue about it until we get a number.

I have a few small comments. 1) A typical system has some small system
daemons. 2) David's "points -= 100 * (points / 3000);" line doesn't give
any bonus if points is less than 3000. Is a zero root bonus really
desired? It may lead to killing a system daemon first. 3) If we change
the minimum bonus from 0% to 1%, we will face exactly the same problem
when every process uses less than 1% of memory, which is not rare if the
system has plenty of memory. So my recalculation logic (patch [4/4]) is
necessary anyway.

However, a proportional 1% - 10% bonus seems like a considerably good idea.
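
To make the step behaviour concrete, a small userspace check of the two
bonus formulas under discussion (my sketch, assuming the 0-10000 scale
from David's test patch):

	#include <stdio.h>

	/* 1% (100 points) of bonus per full 30% (3000 points) used */
	static int bonus_per_30(int points) { return 100 * (points / 3000); }

	/* the 1%-per-10% variant: 100 points per full 1000 points used */
	static int bonus_per_10(int points) { return 100 * (points / 1000); }

	int main(void)
	{
		int pts[] = { 500, 2999, 3000, 9000 };

		/* 2999 points (just under 30% of memory) still gets bonus 0
		 * under the per-30% rule, which is comment 2) above */
		for (unsigned int i = 0; i < sizeof(pts) / sizeof(pts[0]); i++)
			printf("points=%4d  per-30%%: %3d  per-10%%: %3d\n",
			       pts[i], bonus_per_30(pts[i]), bonus_per_10(pts[i]));
		return 0;
	}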



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-13 11:04                         ` KOSAKI Motohiro
@ 2011-05-16 20:42                           ` David Rientjes
  0 siblings, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-05-16 20:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, CAI Qian, avagin, Andrey Vagin, Andrew Morton,
	Mel Gorman, linux-mm, linux-kernel, Hugh Dickins, Oleg Nesterov

On Fri, 13 May 2011, KOSAKI Motohiro wrote:

> > > Yes, I'm sure we'll still have additional discussion when KOSAKI-san
> > > replies to my review of his patchset, so this quick patch was written only
> > > for CAI's testing at this point.
> > > 
> > > In reference to the above, I think that giving root processes a 3% bonus
> > > at all times may be a bit aggressive.  As mentioned before, I don't think
> > > that a root process using 4% of memory and the remaining system threads
> > > using 1% should all be considered equal.  At the same time, I
> > > do not believe that two threads using 50% of memory should be considered
> > > equal if one is root and one is not.  So my idea was to discount 1% for
> > > every 30% of memory that a root process uses rather than a strict 3%.
> > > 
> > > That change can be debated and I think we'll probably settle on something
> > > more aggressive like 1% for every 10% of memory used since oom scores are
> > > only useful in comparison to other oom scores: in the above scenario where
> > > there are two threads, one by root and one not by root, using 50% of
> > > memory each, I think it would be legitimate to give the root task a 5%
> > > bonus so that it would only be selected if no other threads used more than
> > > 44% of memory (even though the root thread is truly using 50%).
> > > 
> > > This is a heuristic within the oom killer badness scoring that can always
> > > be debated back and forth, but I think a 1% bonus for root processes for
> > > every 10% of memory used is plausible.
> > > 
> > > Comments?
> > 
> > Yes, I tend to agree.
> > Apparently the absolute 3% bonus is a problem in CAI's case.
> > 
> > Your approach, which makes the bonus a function of rss, is consistent
> > with the current OOM heuristic.
> > So from a consistency POV I like it, as it could help make the OOM
> > policy deterministic.
> > 
> > About the 30%-vs-10% question, I think it's hard to define an ideal
> > magic value that handles all workloads.
> > It would be very arguable, so we might need some standard way to
> > measure it, or input from the Red Hat/SUSE people. Anyway, I don't
> > want to argue about it until we get a number.
> 
> I have a few small comments. 1) A typical system has some small system
> daemons. 2) David's "points -= 100 * (points / 3000);" line doesn't give
> any bonus if points is less than 3000.

With the 1% bonus per 10% memory consumption, it would be

	points -= 100 * (points / 1000);

instead.  So, yes, this wouldn't give any bonus for root tasks that use 
10% of allowed memory or less.

> Is a zero root bonus really desired? It may lead to killing a system
> daemon first.

I would think of it this way: if a root task is using 9% of available 
memory and that happens to be the largest consumer of memory, then it 
makes sense to kill it instead of killing other smaller non-root tasks.  
The 3% bonus would have killed the root task only if all other threads 
were using 6% or less; this just raises that threshold a little.

On the other hand, if a root task is using 50% of available memory, then a 
45% non-root task would be sacrificed instead.

Perhaps we need to be more aggressive and give more of a bonus to root 
tasks?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())
  2011-05-13  6:53                   ` CAI Qian
@ 2011-05-16 20:46                     ` David Rientjes
  0 siblings, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-05-16 20:46 UTC (permalink / raw)
  To: CAI Qian
  Cc: KOSAKI Motohiro, avagin, Andrey Vagin, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Oleg Nesterov

On Fri, 13 May 2011, CAI Qian wrote:

> I have had a chance to test this patch after applying it manually
> (it did not apply cleanly) on top of the mainline kernel. The test is
> still running because the oom killer keeps killing tons of python
> processes instead of the parent. Isn't there a way for the oom killer
> to be smart enough to do "killall python"?

Not without userspace doing that explicitly; the oom killer attempts to 
kill only the most memory-hogging task to free memory.  If your test is 
constantly forking new processes which allocate memory, then the oom 
killer will just keep killing those children whenever it is out of memory.
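
A minimal reproducer in the spirit of what CAI describes; a hypothetical
sketch, not his actual test, and the per-child allocation size is
arbitrary:

	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	int main(void)
	{
		for (;;) {
			pid_t pid = fork();

			if (pid == 0) {
				size_t sz = 64UL << 20;	/* 64MB per child */
				char *buf = malloc(sz);

				if (buf)
					memset(buf, 1, sz);	/* fault pages in */
				pause();	/* hold memory until killed */
				_exit(0);
			}
			while (waitpid(-1, NULL, WNOHANG) > 0)
				;	/* reap children the oom killer shot */
			usleep(10000);
		}
	}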

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2011-05-16 20:46 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-05 11:44 [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable() Andrey Vagin
2011-03-05 15:20 ` Minchan Kim
2011-03-05 15:34   ` Andrew Vagin
2011-03-05 15:53     ` Minchan Kim
2011-03-05 16:41       ` Andrew Vagin
2011-03-05 17:07         ` Minchan Kim
2011-03-07 21:58           ` Andrew Morton
2011-03-07 23:45             ` Minchan Kim
2011-03-09  5:37               ` KAMEZAWA Hiroyuki
2011-03-09  5:43                 ` KAMEZAWA Hiroyuki
2011-03-10  6:58                 ` Minchan Kim
2011-03-10 23:58                   ` KAMEZAWA Hiroyuki
2011-03-11  0:18                     ` Minchan Kim
2011-03-11  6:08                       ` avagin
2011-03-14  1:03                         ` Minchan Kim
2011-03-08  0:44             ` KAMEZAWA Hiroyuki
2011-03-08  3:06               ` KOSAKI Motohiro
2011-03-08 19:02                 ` avagin
2011-03-09  5:52                   ` KAMEZAWA Hiroyuki
2011-03-09  6:17                   ` KOSAKI Motohiro
2011-03-10 14:08                     ` KOSAKI Motohiro
2011-03-08  8:12               ` Andrew Vagin
2011-03-09  6:06                 ` KAMEZAWA Hiroyuki
2011-05-04  1:38     ` CAI Qian
2011-05-09  6:54       ` KOSAKI Motohiro
2011-05-09  8:47         ` CAI Qian
2011-05-09  9:19           ` KOSAKI Motohiro
2011-05-10  8:11             ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) KOSAKI Motohiro
2011-05-10  8:14               ` [PATCH 1/4] oom: improve dump_tasks() show items KOSAKI Motohiro
2011-05-10 23:29                 ` David Rientjes
2011-05-13 10:14                   ` KOSAKI Motohiro
2011-05-10  8:15               ` [PATCH 2/4] oom: kill younger process first KOSAKI Motohiro
2011-05-10 23:31                 ` David Rientjes
2011-05-13 10:15                   ` KOSAKI Motohiro
2011-05-11 23:33                 ` Minchan Kim
2011-05-12  0:52                 ` KAMEZAWA Hiroyuki
2011-05-12  1:30                   ` Minchan Kim
2011-05-12  1:53                     ` KAMEZAWA Hiroyuki
2011-05-12  2:23                       ` Minchan Kim
2011-05-12  3:39                         ` KAMEZAWA Hiroyuki
2011-05-12  4:17                           ` Minchan Kim
2011-05-12 14:38                             ` Paul E. McKenney
2011-05-13 10:18                   ` KOSAKI Motohiro
2011-05-10  8:15               ` [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally KOSAKI Motohiro
2011-05-10 23:40                 ` David Rientjes
2011-05-13 10:30                   ` KOSAKI Motohiro
2011-05-10  8:16               ` [PATCH 4/4] oom: don't kill random process KOSAKI Motohiro
2011-05-10 23:41                 ` David Rientjes
2011-05-10 23:22               ` OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()) David Rientjes
2011-05-11  2:30               ` CAI Qian
2011-05-11 20:34                 ` David Rientjes
2011-05-12  0:13                   ` Minchan Kim
2011-05-12 19:38                     ` David Rientjes
2011-05-13  4:16                       ` Minchan Kim
2011-05-13 11:04                         ` KOSAKI Motohiro
2011-05-16 20:42                           ` David Rientjes
2011-05-13  6:53                   ` CAI Qian
2011-05-16 20:46                     ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).