* [RFC PATCH] do_try_to_free_pages() might enter infinite loop
@ 2012-04-23 20:56 Ying Han
  2012-04-23 22:20 ` KOSAKI Motohiro
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Ying Han @ 2012-04-23 20:56 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	Nick Piggin, Andrew Morton
  Cc: linux-mm

This is not a patch intended to be merged at all, but an attempt to understand
some logic in global direct reclaim.

There is logic in global direct reclaim where, if reclaim fails at priority 0
and zone->all_unreclaimable is not set, direct reclaim starts over from
DEF_PRIORITY. In some extreme cases, we've seen the system hang, which is very
likely caused by direct reclaim entering an infinite loop.
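
(For reference, a heavily abridged, paraphrased sketch of the allocator path
involved, based on __alloc_pages_slowpath() of this era -- not verbatim code.
As long as do_try_to_free_pages() claims progress, the allocator skips the
OOM path and loops back, so direct reclaim restarts from DEF_PRIORITY:)

	page = __alloc_pages_direct_reclaim(gfp_mask, order, zonelist,
					    high_zoneidx, nodemask, alloc_flags,
					    preferred_zone, migratetype,
					    &did_some_progress);
	if (page)
		goto got_pg;

	if (!did_some_progress) {
		/* only here do we consider __alloc_pages_may_oom() */
		...
	}

	pages_reclaimed += did_some_progress;
	if (should_alloc_retry(gfp_mask, order, pages_reclaimed))
		goto rebalance;		/* reclaim again from DEF_PRIORITY */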

There have been several patches trying to fix similar issues, and the latest
one has a good summary of all the efforts:

commit 929bea7c714220fc76ce3f75bef9056477c28e74
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Thu Apr 14 15:22:12 2011 -0700

    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name

Kosaki explained the problem triggered by the asynchrony between
zone->all_unreclaimable and zone->pages_scanned, where the latter was being
checked by direct reclaim. However, after the patch the problem remains: the
setting of zone->all_unreclaimable is asynchronous with whether the zone is
actually reclaimable or not.

The zone->all_unreclaimable flag is set by kswapd, based on checking
zone->pages_scanned in zone_reclaimable(). Is it possible to have
zone->all_unreclaimable == false while the zone is actually unreclaimable?

1. While kswapd is in the reclaim priority loop, someone frees a page on the
zone, which ends up resetting pages_scanned.

2. kswapd is frozen for whatever reason. I noticed Kosaki covered the
hibernation case by checking oom_killer_disabled, but I am not sure that is
everything we need to worry about. The key point here is that direct reclaim
relies on a flag which is set by kswapd asynchronously; that doesn't sound
safe. (A sketch of how these pieces fit together follows.)
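
(To make the asynchrony concrete, here is a paraphrased, abridged sketch of
the kswapd side, based on mm/vmscan.c of this era -- not verbatim, and the
surrounding kswapd code is elided:)

	static bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}

	/* in balance_pgdat(), kswapd's priority loop: */
	if (nr_slab == 0 && !zone_reclaimable(zone))
		zone->all_unreclaimable = 1;

Direct reclaim only ever reads zone->all_unreclaimable; nothing on the direct
reclaim path sets it.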

Instead of continuing to fix the problem, I am wondering why we have the logic
"not OOM, but keep retrying reclaim after a priority-0 reclaim failure" in the
first place:

Here is the patch that introduced the logic initially:

commit 408d85441cd5a9bd6bc851d677a10c605ed8db5f
Author: Nick Piggin <npiggin@suse.de>
Date:   Mon Sep 25 23:31:27 2006 -0700

    [PATCH] oom: use unreclaimable info

However, I didn't find a detailed description of what problem the commit was
trying to fix, and I wonder whether that problem still exists after 5 years. I
would be happy to see the latter case, where we could consider reverting the
initial patch.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/vmscan.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a51868..c7de242 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2420,8 +2420,8 @@ out:
 		return 1;
 
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
-		return 1;
+//	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
+//		return 1;
 
 	return 0;
 }
-- 
1.7.7.3


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-23 20:56 [RFC PATCH] do_try_to_free_pages() might enter infinite loop Ying Han
@ 2012-04-23 22:20 ` KOSAKI Motohiro
  2012-04-23 23:18   ` Ying Han
  2012-04-24  5:36 ` Nick Piggin
  2012-06-11 23:33 ` KOSAKI Motohiro
  2 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2012-04-23 22:20 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, Nick Piggin,
	Andrew Morton, linux-mm

On Mon, Apr 23, 2012 at 4:56 PM, Ying Han <yinghan@google.com> wrote:
> [...]
>
> 2. kswapd is frozen for whatever reason. I noticed Kosaki covered the
> hibernation case by checking oom_killer_disabled, but I am not sure that is
> everything we need to worry about. The key point here is that direct reclaim
> relies on a flag which is set by kswapd asynchronously; that doesn't sound
> safe.

If kswapd can be frozen outside of hibernation, why don't you add a frozen
check instead of a hibernation check? And when and why does that happen?


>
> Instead of continuing to fix the problem, I am wondering why we have the logic
> "not OOM, but keep retrying reclaim after a priority-0 reclaim failure" in the
> first place:
> [...]

This patch fixed one of the false OOM issues. Consider:

1. thread-a reaches priority 0.
2. thread-b exits and frees a lot of pages.
3. thread-a calls out_of_memory().

This is not very good, because we now have enough memory....


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-23 22:20 ` KOSAKI Motohiro
@ 2012-04-23 23:18   ` Ying Han
  2012-04-23 23:19     ` Ying Han
  2012-04-24  1:31     ` Minchan Kim
  0 siblings, 2 replies; 22+ messages in thread
From: Ying Han @ 2012-04-23 23:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, Nick Piggin,
	Andrew Morton, linux-mm

On Mon, Apr 23, 2012 at 3:20 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> [...]
>
> If kswapd can be frozen outside of hibernation, why don't you add a frozen
> check instead of a hibernation check? And when and why does that happen?

I haven't tried to reproduce the issue, so everything is based on eyeballing
the code. The problem is that we have a potential infinite loop in direct
reclaim, where it keeps retrying as long as !zone->all_unreclaimable.

The flag is only set by kswapd, and kswapd skips setting it while the
following condition is true:

zone->pages_scanned < zone_reclaimable_pages(zone) * 6

In a few-pages-on-LRU condition, zone->pages_scanned easily remains 0, and it
is also reset to 0 every time a page is freed. That will cause global direct
reclaim to enter an infinite loop.
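
(To illustrate the arithmetic, here is a hypothetical, self-contained
userspace toy -- not kernel code; the names mirror the kernel's, but
everything here is a simulation under the assumption of a 2-page LRU:)

	#include <stdio.h>
	#include <stdbool.h>

	static unsigned long pages_scanned;
	static const unsigned long reclaimable_pages = 2; /* tiny LRU */

	static bool zone_reclaimable(void)
	{
		/* the same inequality the kernel checks */
		return pages_scanned < reclaimable_pages * 6;
	}

	int main(void)
	{
		for (int pass = 1; pass <= 8; pass++) {
			pages_scanned += reclaimable_pages; /* one full sweep */
			if (pass % 3 == 0)
				pages_scanned = 0; /* a free resets the count */
			printf("pass %d: scanned=%lu reclaimable=%d\n",
			       pass, pages_scanned, zone_reclaimable());
		}
		return 0;
	}

With a reset every few passes, zone_reclaimable() never goes false, so
all_unreclaimable is never set and direct reclaim never gives up.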


>> [...]
>
> This patch fixed one of the false OOM issues. Consider:
>
> 1. thread-a reaches priority 0.
> 2. thread-b exits and frees a lot of pages.
> 3. thread-a calls out_of_memory().
>
> This is not very good, because we now have enough memory....

Isn't that already covered by the following in __alloc_pages_may_oom()?

	/*
	 * Go through the zonelist yet one more time, keep very high watermark
	 * here, this is only to catch a parallel oom killing, we must fail if
	 * we're still under heavy pressure.
	 */
	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
			order, zonelist, high_zoneidx,
			ALLOC_WMARK_HIGH|ALLOC_CPUSET,
			preferred_zone, migratetype);

Thanks

--Ying


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-23 23:18   ` Ying Han
@ 2012-04-23 23:19     ` Ying Han
  2012-04-24  1:31     ` Minchan Kim
  1 sibling, 0 replies; 22+ messages in thread
From: Ying Han @ 2012-04-23 23:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, Nick Piggin,
	Andrew Morton, linux-mm

++cc Nick on the right email address...

--Ying

On Mon, Apr 23, 2012 at 4:18 PM, Ying Han <yinghan@google.com> wrote:
> [...]


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-23 23:18   ` Ying Han
  2012-04-23 23:19     ` Ying Han
@ 2012-04-24  1:31     ` Minchan Kim
  2012-04-24  2:06       ` Ying Han
  2012-04-24 16:36       ` Ying Han
  1 sibling, 2 replies; 22+ messages in thread
From: Minchan Kim @ 2012-04-24  1:31 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Michal Hocko, Johannes Weiner, Mel Gorman,
	KAMEZAWA Hiroyuki, Rik van Riel, Minchan Kim, Hugh Dickins,
	Nick Piggin, Andrew Morton, linux-mm

Hi Ying,

On 04/24/2012 08:18 AM, Ying Han wrote:

> [...]
>
> The flag is only set by kswapd, and kswapd skips setting it while the
> following condition is true:
>
> zone->pages_scanned < zone_reclaimable_pages(zone) * 6
>
> In a few-pages-on-LRU condition, zone->pages_scanned easily remains 0, and it
> is also reset to 0 every time a page is freed. That will cause global direct
> reclaim to enter an infinite loop.
> 


How does zone->pages_scanned become 0 easily in global reclaim?
Once the VM has pages on the LRU, it wouldn't be zero; look at
isolate_lru_pages(). The problem would be get_scan_count(), which can prevent
scanning of the LRU list, but it works well now: if the priority isn't zero
and there are few pages on the LRU, the scan count can be zero, but when the
priority drops to zero the VM is allowed to scan even fewer than
SWAP_CLUSTER_MAX pages. So pages_scanned would be increased.
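
(Roughly, paraphrased and abridged from get_scan_count() of this era -- not
verbatim, and the force_scan handling is simplified:)

	scan = zone_nr_lru_pages(zone, sc, lru);
	if (priority || noswap) {
		scan >>= priority;	/* rounds down to 0 for a tiny LRU... */
		if (!scan && force_scan)
			scan = SWAP_CLUSTER_MAX;
	}
	/* ...but at priority 0 even a 2-page LRU yields scan == 2 */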

I think the problem is a livelock, as follows:

    A			kswapd				B

direct reclaim
reclaim a page
                    	pages_scanned check <- skip

							steal the page reclaimed by A
							use the page for user memory
alloc failed
retry

In this scenario, process A would be livelocked.
Does this match the infinite loop case you mentioned?


-- 
Kind regards,
Minchan Kim


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-24  1:31     ` Minchan Kim
@ 2012-04-24  2:06       ` Ying Han
  2012-04-24 16:36       ` Ying Han
  1 sibling, 0 replies; 22+ messages in thread
From: Ying Han @ 2012-04-24  2:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Michal Hocko, Johannes Weiner, Mel Gorman,
	KAMEZAWA Hiroyuki, Rik van Riel, Minchan Kim, Hugh Dickins,
	Nick Piggin, Andrew Morton, linux-mm

On Mon, Apr 23, 2012 at 6:31 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hi Ying,
> [...]
>
> How does zone->pages_scanned become 0 easily in global reclaim?
> Once the VM has pages on the LRU, it wouldn't be zero; look at
> isolate_lru_pages(). The problem would be get_scan_count(), which can prevent
> scanning of the LRU list, but it works well now: if the priority isn't zero
> and there are few pages on the LRU, the scan count can be zero, but when the
> priority drops to zero the VM is allowed to scan even fewer than
> SWAP_CLUSTER_MAX pages. So pages_scanned would be increased.

Yes, that is true. But pages_scanned is reset on freeing a page, and that can
happen asynchronously. For example, say I have only 2 pages on the file LRU
(w/o swap); here is what is supposed to happen:

A                                   kswapd                                  B

direct reclaim

                                    priority DEF_PRIORITY down to 0

                                    zone->pages_scanned = 3

                                    zone_reclaimable() == true

                                    zone->all_unreclaimable == 0

nr_reclaimed == 0 && !zone->all_unreclaimable
retry

                                    priority DEF_PRIORITY down to 0

                                    zone->pages_scanned = 6

                                    zone_reclaimable() == true

                                    zone->all_unreclaimable == 0

nr_reclaimed == 0 && !zone->all_unreclaimable
retry

                                    repeat the above; eventually
                                    zone->pages_scanned grows to 12

                                    zone_reclaimable() == false

                                    zone->all_unreclaimable == 1

nr_reclaimed == 0 && zone->all_unreclaimable
oom

However, if B frees a page every time before pages_scanned reaches that
point, then we never set zone->all_unreclaimable at all. If so, we reach a
livelock here...

>
> I think the problem is a livelock, as follows:
> [...]
>
> In this scenario, process A would be livelocked.
> Does this match the infinite loop case you mentioned?

Maybe, but that needs to be verified. The problem is that we cannot
distinguish this case from the case I listed above just by seeing
do_try_to_free_pages() always return 1. AFAIK, we do see
zone->pages_scanned == 0 in some of the cases after instrumenting the kernel.

Overall, having direct reclaim in an infinite loop keyed off the
zone->all_unreclaimable flag looks scary.
--Ying

>

* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-23 20:56 [RFC PATCH] do_try_to_free_pages() might enter infinite loop Ying Han
  2012-04-23 22:20 ` KOSAKI Motohiro
@ 2012-04-24  5:36 ` Nick Piggin
  2012-04-24 18:37   ` Ying Han
  2012-06-11 23:33 ` KOSAKI Motohiro
  2 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2012-04-24  5:36 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	Nick Piggin, Andrew Morton, linux-mm

On 24 April 2012 06:56, Ying Han <yinghan@google.com> wrote:
> This is not a patch intended to be merged at all, but an attempt to
> understand some logic in global direct reclaim.
>
> There is logic in global direct reclaim where, if reclaim fails at priority 0
> and zone->all_unreclaimable is not set, direct reclaim starts over from
> DEF_PRIORITY. In some extreme cases, we've seen the system hang, which is
> very likely caused by direct reclaim entering an infinite loop.

Very likely, or definitely? Can you reproduce it? What workload?

> [...]
>
> However, I didn't find a detailed description of what problem the commit was
> trying to fix, and I wonder whether that problem still exists after 5 years.
> I would be happy to see the latter case, where we could consider reverting
> the initial patch.

The problem we were having was that processes would be killed at seemingly
random points in time, under heavy swapping, but long before all swap was
used.

The particular problem IIRC was related to testing a lot of guests on an s390
machine. I'm ashamed to have not included more information in the changelog --
I suspect it was probably in a small batch of patches with a description in
the introductory mail that never got properly placed into the patches :(

There are certainly a lot of changes in the area since then, so I couldn't be
sure what would happen by taking this out.

I don't think the page allocator's "try harder" logic was enough to solve the
problem, and I think it was around in some form even back then.

The biggest problem is that it's not an exact science. It will never do the
right thing for everybody, sadly. Even if the system is able to allocate pages
at a very slow rate, that is effectively as good as a hang for some users;
others want to be able to manually intervene before anything is killed.

Sorry if this isn't too helpful! Any ideas would be good. We possibly need a
way to describe these behaviours in an abstract way (i.e., not just magic
numbers) and allow the user to tune them.
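
(A purely hypothetical sketch of what such a knob could look like -- nothing
like this exists in the tree; the name vm.oom_kill_eagerness and its wiring
are invented here for illustration:)

	/* 0 = try to never kill, 1 = default, 2 = kill quickly */
	int sysctl_oom_kill_eagerness = 1;

	static struct ctl_table vm_table[] = {
		{
			.procname	= "oom_kill_eagerness",
			.data		= &sysctl_oom_kill_eagerness,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec_minmax,
			.extra1		= &zero,
			.extra2		= &two,
		},
	};

do_try_to_free_pages() could then scale how long it keeps retrying before
admitting defeat based on this value, instead of a hard-coded heuristic.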

Thanks,
Nick


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-24  1:31     ` Minchan Kim
  2012-04-24  2:06       ` Ying Han
@ 2012-04-24 16:36       ` Ying Han
  2012-04-24 16:38         ` Rik van Riel
  1 sibling, 1 reply; 22+ messages in thread
From: Ying Han @ 2012-04-24 16:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Michal Hocko, Johannes Weiner, Mel Gorman,
	KAMEZAWA Hiroyuki, Rik van Riel, Minchan Kim, Hugh Dickins,
	Nick Piggin, Andrew Morton, linux-mm

Sorry about the word-wrap in the last email; here I resend it, hopefully
better looking:

On Mon, Apr 23, 2012 at 6:31 PM, Minchan Kim <minchan@kernel.org> wrote:
> [...]
>
> How does zone->pages_scanned become 0 easily in global reclaim?
> [...] So pages_scanned would be increased.

Yes, that is true. But pages_scanned is reset on freeing a page, and that can
happen asynchronously. For example, say I have only 2 pages on the file LRU
(w/o swap); here is what is supposed to happen:

A                                    kswapd                                   B

direct reclaim

                                     priority DEF_PRIORITY down to 0

                                     zone->pages_scanned = 3

                                     zone_reclaimable() == true

                                     zone->all_unreclaimable == 0

nr_reclaimed == 0 && !zone->all_unreclaimable
retry

                                     priority DEF_PRIORITY down to 0

                                     zone->pages_scanned = 6

                                     zone_reclaimable() == true

                                     zone->all_unreclaimable == 0

nr_reclaimed == 0 && !zone->all_unreclaimable
retry

                                     repeat the above; eventually
                                     zone->pages_scanned grows to 12

                                     zone_reclaimable() == false

                                     zone->all_unreclaimable == 1

nr_reclaimed == 0 && zone->all_unreclaimable
oom

However, if B frees a page every time before pages_scanned reaches that
point, then we never set zone->all_unreclaimable at all. If so, we reach a
livelock here...

--Ying

>
> I think the problem is a livelock, as follows:
> [...]


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-24 16:36       ` Ying Han
@ 2012-04-24 16:38         ` Rik van Riel
  2012-04-24 16:45           ` KOSAKI Motohiro
  2012-04-24 17:17           ` Ying Han
  0 siblings, 2 replies; 22+ messages in thread
From: Rik van Riel @ 2012-04-24 16:38 UTC (permalink / raw)
  To: Ying Han
  Cc: Minchan Kim, KOSAKI Motohiro, Michal Hocko, Johannes Weiner,
	Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim, Hugh Dickins,
	Nick Piggin, Andrew Morton, linux-mm

On 04/24/2012 12:36 PM, Ying Han wrote:

> However, if B frees a page every time before pages_scanned reaches that
> point, then we never set zone->all_unreclaimable at all. If so, we reach a
> livelock here...

If B keeps freeing pages, surely A will get a successful
allocation and there will not be a livelock?

-- 
All rights reversed


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-24 16:38         ` Rik van Riel
@ 2012-04-24 16:45           ` KOSAKI Motohiro
  2012-04-24 17:22             ` Ying Han
  2012-04-24 17:17           ` Ying Han
  1 sibling, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2012-04-24 16:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ying Han, Minchan Kim, KOSAKI Motohiro, Michal Hocko,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim,
	Hugh Dickins, Nick Piggin, Andrew Morton, linux-mm,
	kosaki.motohiro

(4/24/12 12:38 PM), Rik van Riel wrote:
> On 04/24/2012 12:36 PM, Ying Han wrote:
>
>> However, if B frees a page every time before pages_scanned reaches that
>> point, then we never set zone->all_unreclaimable at all. If so, we reach a
>> livelock here...
>
> If B keeps freeing pages, surely A will get a successful
> allocation and there will not be a livelock?

And I hope we distinguish true livelock from pseudo livelock first.
Nick's patch definitely makes the kernel slow down in an OOM situation. That
is intentional: we thought a slowdown was better than a false positive OOM,
even though the slowdown is extremely slow and looks like livelock.

Ying, which problem do you want to discuss? a) the current kernel has a true
livelock, or b) the current OOM detection is too slow and livelock-like, and
that is not acceptable to you.


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-24 16:38         ` Rik van Riel
  2012-04-24 16:45           ` KOSAKI Motohiro
@ 2012-04-24 17:17           ` Ying Han
  1 sibling, 0 replies; 22+ messages in thread
From: Ying Han @ 2012-04-24 17:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, KOSAKI Motohiro, Michal Hocko, Johannes Weiner,
	Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim, Hugh Dickins,
	Nick Piggin, Andrew Morton, linux-mm

On Tue, Apr 24, 2012 at 9:38 AM, Rik van Riel <riel@redhat.com> wrote:
> On 04/24/2012 12:36 PM, Ying Han wrote:
>
>> However, if B frees a page every time before pages_scanned reaches that
>> point, then we never set zone->all_unreclaimable at all. If so, we reach a
>> livelock here...
>
>
> If B keeps freeing pages, surely A will get a successful
> allocation and there will not be a livelock?

Ah, that is another piece of the puzzle. We suspect the zone is below the min
watermark due to a previous ALLOC_NO_WATERMARKS allocation, and B frees a page
while the zone is still below min, so nothing becomes allocatable by A.

Now, we reset zone->pages_scanned on freeing a page regardless of the
watermarks, so it is possible for the zone to be below the min watermark but
still have !zone->all_unreclaimable.
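
(For reference, the unconditional reset on the free path -- paraphrased and
abridged from mm/page_alloc.c of this era, not verbatim:)

	static void free_one_page(struct zone *zone, struct page *page,
				  int order, int migratetype)
	{
		spin_lock(&zone->lock);
		zone->all_unreclaimable = 0;
		zone->pages_scanned = 0;	/* no watermark check here */

		__free_one_page(page, zone, order, migratetype);
		spin_unlock(&zone->lock);
	}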

--Ying

>
> --
> All rights reversed


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-24 16:45           ` KOSAKI Motohiro
@ 2012-04-24 17:22             ` Ying Han
  0 siblings, 0 replies; 22+ messages in thread
From: Ying Han @ 2012-04-24 17:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Michal Hocko,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim,
	Hugh Dickins, Nick Piggin, Andrew Morton, linux-mm

On Tue, Apr 24, 2012 at 9:45 AM, KOSAKI Motohiro
<kosaki.motohiro@gmail.com> wrote:
> (4/24/12 12:38 PM), Rik van Riel wrote:
>> [...]
>
> And I hope we distinguish true livelock from pseudo livelock first.
> Nick's patch definitely makes the kernel slow down in an OOM situation. That
> is intentional: we thought a slowdown was better than a false positive OOM,
> even though the slowdown is extremely slow and looks like livelock.

I get the false positive OOM part from Minchan's reply, but I am wondering
whether the current code already covers some of that problem:

__alloc_pages_may_oom() {

	/*
	 * Go through the zonelist yet one more time, keep very high watermark
	 * here, this is only to catch a parallel oom killing, we must fail if
	 * we're still under heavy pressure.
	 */
	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
			order, zonelist, high_zoneidx,
			ALLOC_WMARK_HIGH|ALLOC_CPUSET,
			preferred_zone, migratetype);

}


>
> Ying, which problem do you want to discuss? a) the current kernel has a true
> livelock, or b) the current OOM detection is too slow and livelock-like, and
> that is not acceptable to you.

The first one; however, I don't have a test program to reproduce it yet.
Everything so far is based on the watchdog-timeout core dump, and that's why I
prefer to understand the code before jumping to a fix :)

--Ying


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-24  5:36 ` Nick Piggin
@ 2012-04-24 18:37   ` Ying Han
  2012-05-01  3:34     ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Ying Han @ 2012-04-24 18:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	Nick Piggin, Andrew Morton, linux-mm

On Mon, Apr 23, 2012 at 10:36 PM, Nick Piggin <npiggin@gmail.com> wrote:
> [...]
>
> Sorry if this isn't too helpful! Any ideas would be good. We possibly need a
> way to describe these behaviours in an abstract way (i.e., not just magic
> numbers) and allow the user to tune them.

Thank you Nick, this is helpful. I looked up the patches you mentioned, and I
can see what problem they were trying to solve at the time. However, things
have changed a lot since then, and it is hard to tell whether the problem
still exists in the current kernel. Going through them one by one, I see that
each patch has either been replaced by different logic, or the same logic has
been implemented differently.

For this particular patch, we now have code which retries the page allocation
before entering OOM. So I am wondering whether that would have helped the OOM
situation of that time.
--Ying

> Thanks,
> Nick


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-24 18:37   ` Ying Han
@ 2012-05-01  3:34     ` Nick Piggin
  2012-05-01 16:18       ` Ying Han
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2012-05-01  3:34 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	Nick Piggin, Andrew Morton, linux-mm

On 25 April 2012 04:37, Ying Han <yinghan@google.com> wrote:
> [...]
>
> For this particular patch, we now have code which retries the page allocation
> before entering OOM. So I am wondering whether that would have helped the OOM
> situation of that time.

Well it's not doing exactly the same thing, actually. And note that the
problem was not about parallel OOM-killing. The fact that the page
reclaim has not made any progress when we last called in does not
actually mean that it cannot make _any_ progress.

My patch is more about detecting the latter case. I don't see that there
is equivalent logic in the page allocator to replace it.

But again: this is not a question of correct or incorrect as far as I
can see, simply a matter of where you define "hopeless"! I could
easily see the need for a way to bias that (kill quickly, medium, try to
never kill).


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-05-01  3:34     ` Nick Piggin
@ 2012-05-01 16:18       ` Ying Han
  2012-05-01 16:20         ` Ying Han
  2012-05-01 17:06         ` Rik van Riel
  0 siblings, 2 replies; 22+ messages in thread
From: Ying Han @ 2012-05-01 16:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	Nick Piggin, Andrew Morton, linux-mm

On Mon, Apr 30, 2012 at 8:34 PM, Nick Piggin <npiggin@gmail.com> wrote:
> On 25 April 2012 04:37, Ying Han <yinghan@google.com> wrote:
>> On Mon, Apr 23, 2012 at 10:36 PM, Nick Piggin <npiggin@gmail.com> wrote:
>>> On 24 April 2012 06:56, Ying Han <yinghan@google.com> wrote:
>>>> This is not a patch targeted to be merged at all, but trying to understand
>>>> a logic in global direct reclaim.
>>>>
>>>> There is a logic in global direct reclaim where reclaim fails on priority 0
>>>> and zone->all_unreclaimable is not set, it will cause the direct to start over
>>>> from DEF_PRIORITY. In some extreme cases, we've seen the system hang which is
>>>> very likely caused by direct reclaim enters infinite loop.
>>>
>>> Very likely, or definitely? Can you reproduce it? What workload?
>>
>> No, we don't have a reproducible workload for that yet. Everything is based
>> on the watchdog dump file :(
>>
>>>
>>>>
>>>> There have been serious patches trying to fix similar issue and the latest
>>>> patch has good summary of all the efforts:
>>>>
>>>> commit 929bea7c714220fc76ce3f75bef9056477c28e74
>>>> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>> Date:   Thu Apr 14 15:22:12 2011 -0700
>>>>
>>>>    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>>>>
>>>> Kosaki explained the problem triggered by async zone->all_unreclaimable and
>>>> zone->pages_scanned where the later one was being checked by direct reclaim.
>>>> However, after the patch, the problem remains where the setting of
>>>> zone->all_unreclaimable is asynchronous with zone is actually reclaimable or not.
>>>>
>>>> The zone->all_unreclaimable flag is set by kswapd by checking zone->pages_scanned in
>>>> zone_reclaimable(). Is that possible to have zone->all_unreclaimable == false while
>>>> the zone is actually unreclaimable?
>>>>
>>>> 1. while kswapd in reclaim priority loop, someone frees a page on the zone. It
>>>> will end up resetting the pages_scanned.
>>>>
>>>> 2. kswapd is frozen for whatever reason. I noticed Kosaki's covered the
>>>> hibernation case by checking oom_killer_disabled, but not sure if that is
>>>> everything we need to worry about. The key point here is that direct reclaim
>>>> relies on a flag which is set by kswapd asynchronously, that doesn't sound safe.
>>>>
>>>> Instead of keep fixing the problem, I am wondering why we have the logic
>>>> "not oom but keep trying reclaim w/ priority 0 reclaim failure" at the first place:
>>>>
>>>> Here is the patch introduced the logic initially:
>>>>
>>>> commit 408d85441cd5a9bd6bc851d677a10c605ed8db5f
>>>> Author: Nick Piggin <npiggin@suse.de>
>>>> Date:   Mon Sep 25 23:31:27 2006 -0700
>>>>
>>>>    [PATCH] oom: use unreclaimable info
>>>>
>>>> However, I didn't find detailed description of what problem the commit trying
>>>> to fix and wondering if the problem still exist after 5 years. I would be happy
>>>> to see the later case where we can consider to revert the initial patch.
>>>
>>> The problem we were having is that processes would be killed at seemingly
>>> random points in time, under heavy swapping, but long before all swap was
>>> used.
>>>
>>> The particular problem IIRC was related to testing a lot of guests on an s390
>>> machine. I'm ashamed to have not included more information in the
>>> changelog -- I suspect it was probably in a small batch of patches with a
>>> description in the introductory mail and not properly placed into patches :(
>>>
>>> There have certainly been a lot of changes in the area since then, so I
>>> couldn't be sure of what would happen by taking this out.
>>>
>>> I don't think the page allocator "try harder" logic was enough to solve the
>>> problem, and I think it was around in some form even back then.
>>>
>>> The biggest problem is that it's not an exact science. It will never do the
>>> right thing for everybody, sadly. Even if it is able to allocate pages at a
>>> very slow rate, this is effectively as good as a hang for some users.
>>> Others want to be able to manually intervene before anything is killed.
>>>
>>> Sorry if this isn't too helpful! Any ideas would be good. We possibly need
>>> a way to describe these behaviours in an abstract way (i.e., not just magic
>>> numbers), and allow the user to tune it.
>>
>> Thank you Nick, this is helpful. I looked up the patches you
>> mentioned, and I can see what problem they were trying to solve at
>> the time. However, things have changed a lot, and it is hard to
>> tell whether the problem still remains in the current kernel. Going
>> through them one by one, I see that each patch has either been
>> replaced by different logic or had the same logic implemented differently.
>>
>> For this particular patch, we now have code which retries the page
>> allocation before entering OOM. So I am wondering whether that now
>> covers the OOM situation from back then.
>
> Well it's not doing exactly the same thing, actually. And note that the
> problem was not about parallel OOM-killing. The fact that the page
> reclaim has not made any progress when we last called in does not
> actually mean that it cannot make _any_ progress.
>
> My patch is more about detecting the latter case. I don't see that there
> is equivalent logic in the page allocator to replace it.
>
> But again: this is not a question of correct or incorrect as far as I
> can see, simply a matter of where you define "hopeless"! I could
> easily see the need for a way to bias that (kill quickly, medium, try to
> never kill).

That is right. We (at Google) seem to be on the other end of the
bias, where we prefer to OOM kill instead of hopelessly looping in the
reclaim path. Normally, if an application gets into that state, its
performance is already terrible and it might prefer to be
restarted :)

Unfortunately, we haven't been able to reproduce the issue with a
synthetic workload. So far it only happens in production with a
particular workload when memory runs really, really tight.

The current logic seems to prefer reclaiming over going to OOM kill,
and that might not fit every user's expectation. However, I guess it is
hard to argue for any change, since different users have different
biases, as you said....

Thanks

--Ying


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-05-01 16:18       ` Ying Han
@ 2012-05-01 16:20         ` Ying Han
  2012-05-01 17:06         ` Rik van Riel
  1 sibling, 0 replies; 22+ messages in thread
From: Ying Han @ 2012-05-01 16:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	Andrew Morton, linux-mm

Taking Nick's old e-mail address off the CC :)

--Ying

On Tue, May 1, 2012 at 9:18 AM, Ying Han <yinghan@google.com> wrote:
> On Mon, Apr 30, 2012 at 8:34 PM, Nick Piggin <npiggin@gmail.com> wrote:
>> On 25 April 2012 04:37, Ying Han <yinghan@google.com> wrote:
>>> On Mon, Apr 23, 2012 at 10:36 PM, Nick Piggin <npiggin@gmail.com> wrote:
>>>> On 24 April 2012 06:56, Ying Han <yinghan@google.com> wrote:
>>>>> This is not a patch targeted to be merged at all, but trying to understand
>>>>> a logic in global direct reclaim.
>>>>>
>>>>> There is a logic in global direct reclaim where reclaim fails on priority 0
>>>>> and zone->all_unreclaimable is not set, it will cause the direct to start over
>>>>> from DEF_PRIORITY. In some extreme cases, we've seen the system hang which is
>>>>> very likely caused by direct reclaim enters infinite loop.
>>>>
>>>> Very likely, or definitely? Can you reproduce it? What workload?
>>>
>>> No, we don't have a reproducible workload for that yet. Everything is based
>>> on the watchdog dump file :(
>>>
>>>>
>>>>>
>>>>> There have been serious patches trying to fix similar issue and the latest
>>>>> patch has good summary of all the efforts:
>>>>>
>>>>> commit 929bea7c714220fc76ce3f75bef9056477c28e74
>>>>> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>> Date:   Thu Apr 14 15:22:12 2011 -0700
>>>>>
>>>>>    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>>>>>
>>>>> Kosaki explained the problem triggered by async zone->all_unreclaimable and
>>>>> zone->pages_scanned where the later one was being checked by direct reclaim.
>>>>> However, after the patch, the problem remains where the setting of
>>>>> zone->all_unreclaimable is asynchronous with zone is actually reclaimable or not.
>>>>>
>>>>> The zone->all_unreclaimable flag is set by kswapd by checking zone->pages_scanned in
>>>>> zone_reclaimable(). Is that possible to have zone->all_unreclaimable == false while
>>>>> the zone is actually unreclaimable?
>>>>>
>>>>> 1. while kswapd in reclaim priority loop, someone frees a page on the zone. It
>>>>> will end up resetting the pages_scanned.
>>>>>
>>>>> 2. kswapd is frozen for whatever reason. I noticed Kosaki's covered the
>>>>> hibernation case by checking oom_killer_disabled, but not sure if that is
>>>>> everything we need to worry about. The key point here is that direct reclaim
>>>>> relies on a flag which is set by kswapd asynchronously, that doesn't sound safe.
>>>>>
>>>>> Instead of keep fixing the problem, I am wondering why we have the logic
>>>>> "not oom but keep trying reclaim w/ priority 0 reclaim failure" at the first place:
>>>>>
>>>>> Here is the patch introduced the logic initially:
>>>>>
>>>>> commit 408d85441cd5a9bd6bc851d677a10c605ed8db5f
>>>>> Author: Nick Piggin <npiggin@suse.de>
>>>>> Date:   Mon Sep 25 23:31:27 2006 -0700
>>>>>
>>>>>    [PATCH] oom: use unreclaimable info
>>>>>
>>>>> However, I didn't find detailed description of what problem the commit trying
>>>>> to fix and wondering if the problem still exist after 5 years. I would be happy
>>>>> to see the later case where we can consider to revert the initial patch.
>>>>
>>>> The problem we were having is that processes would be killed at seemingly
>>>> random points in time, under heavy swapping, but long before all swap was
>>>> used.
>>>>
>>>> The particular problem IIRC was related to testing a lot of guests on an s390
>>>> machine. I'm ashamed to have not included more information in the
>>>> changelog -- I suspect it was probably in a small batch of patches with a
>>>> description in the introductory mail and not properly placed into patches :(
>>>>
>>>> There have certainly been a lot of changes in the area since then, so I
>>>> couldn't be sure of what would happen by taking this out.
>>>>
>>>> I don't think the page allocator "try harder" logic was enough to solve the
>>>> problem, and I think it was around in some form even back then.
>>>>
>>>> The biggest problem is that it's not an exact science. It will never do the
>>>> right thing for everybody, sadly. Even if it is able to allocate pages at a
>>>> very slow rate, this is effectively as good as a hang for some users.
>>>> Others want to be able to manually intervene before anything is killed.
>>>>
>>>> Sorry if this isn't too helpful! Any ideas would be good. We possibly need
>>>> a way to describe these behaviours in an abstract way (i.e., not just magic
>>>> numbers), and allow the user to tune it.
>>>
>>> Thank you Nick, this is helpful. I looked up the patches you
>>> mentioned, and I can see what problem they were trying to solve at
>>> the time. However, things have changed a lot, and it is hard to
>>> tell whether the problem still remains in the current kernel. Going
>>> through them one by one, I see that each patch has either been
>>> replaced by different logic or had the same logic implemented differently.
>>>
>>> For this particular patch, we now have code which retries the page
>>> allocation before entering OOM. So I am wondering whether that now
>>> covers the OOM situation from back then.
>>
>> Well it's not doing exactly the same thing, actually. And note that the
>> problem was not about parallel OOM-killing. The fact that the page
>> reclaim has not made any progress when we last called in does not
>> actually mean that it cannot make _any_ progress.
>>
>> My patch is more about detecting the latter case. I don't see that there
>> is equivalent logic in the page allocator to replace it.
>>
>> But again: this is not a question of correct or incorrect as far as I
>> can see, simply a matter of where you define "hopeless"! I could
>> easily see the need for a way to bias that (kill quickly, medium, try to
>> never kill).
>
> That is right. We (at Google) seem to be on the other end of the
> bias, where we prefer to OOM kill instead of hopelessly looping in the
> reclaim path. Normally, if an application gets into that state, its
> performance is already terrible and it might prefer to be
> restarted :)
>
> Unfortunately, we haven't been able to reproduce the issue with a
> synthetic workload. So far it only happens in production with a
> particular workload when memory runs really, really tight.
>
> The current logic seems to prefer reclaiming over going to OOM kill,
> and that might not fit every user's expectation. However, I guess it is
> hard to argue for any change, since different users have different
> biases, as you said....
>
> Thanks
>
> --Ying


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-05-01 16:18       ` Ying Han
  2012-05-01 16:20         ` Ying Han
@ 2012-05-01 17:06         ` Rik van Riel
  2012-05-02  3:25           ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2012-05-01 17:06 UTC (permalink / raw)
  To: Ying Han
  Cc: Nick Piggin, Michal Hocko, Johannes Weiner, Mel Gorman,
	KAMEZAWA Hiroyuki, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	Nick Piggin, Andrew Morton, linux-mm

On 05/01/2012 12:18 PM, Ying Han wrote:

> The current logic seems to prefer reclaiming over going to OOM kill,
> and that might not fit every user's expectation. However, I guess it is
> hard to argue for any change, since different users have different
> biases, as you said....

However, it is a sure thing that desktop users and smartphone
users do want an earlier OOM kill.

I wonder if doing an OOM kill when the number of free pages
plus the number of file lru pages in every zone is below
pages_high and there is no more swap available might work?
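
A minimal sketch of that check, assuming the helpers of the time
(zone_page_state(), high_wmark_pages(), the global nr_swap_pages);
the function name and where it would be called from are illustrative,
not a tested patch:

/*
 * Sketch: OOM-kill early once swap is exhausted and no zone has
 * free + file LRU pages above its high watermark, i.e. reclaim
 * has nothing left to work with.
 */
static bool oom_kill_early(struct zonelist *zonelist)
{
	struct zoneref *z;
	struct zone *zone;

	if (nr_swap_pages > 0)
		return false;		/* still room to swap */

	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
		unsigned long pages;

		pages = zone_page_state(zone, NR_FREE_PAGES) +
			zone_page_state(zone, NR_ACTIVE_FILE) +
			zone_page_state(zone, NR_INACTIVE_FILE);

		/* one zone with headroom is enough to keep reclaiming */
		if (pages >= high_wmark_pages(zone))
			return false;
	}
	return true;
}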

On the other hand, that still leaves us with cgroups. What could
be appropriate there?

-- 
All rights reversed


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-05-01 17:06         ` Rik van Riel
@ 2012-05-02  3:25           ` Nick Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2012-05-02  3:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ying Han, Michal Hocko, Johannes Weiner, Mel Gorman,
	KAMEZAWA Hiroyuki, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	Andrew Morton, linux-mm

On 2 May 2012 03:06, Rik van Riel <riel@redhat.com> wrote:
> On 05/01/2012 12:18 PM, Ying Han wrote:
>
>> The current logic seems to prefer reclaiming over going to OOM kill,
>> and that might not fit every user's expectation. However, I guess it is
>> hard to argue for any change, since different users have different
>> biases, as you said....
>
>
> However, it is a sure thing that desktop users and smartphone
> users do want an earlier OOM kill.
>
> I wonder if doing an OOM kill when the number of free pages
> plus the number of file lru pages in every zone is below
> pages_high and there is no more swap available might work?

This patch was required in the first place to even reach the
condition that "no more swap is available" on the virtual
machine server workload I was vaguely remembering.

Without this logic, it would be OOM killed before such a condition
is hit, because we may need to go around the LRU a few times
to free enough pages. You can imagine this with concurrent threads
touching ptes and possibly allocating memory themselves.


> On the other hand, that still leaves us with cgroups. What could
> be appropriate there?

We always seem to end up with a tangle of mysterious dials
and knobs deep in the heart of the implementation :(

I wonder if there is some other way to approach it, like a
QoS from the point of view of the caller? That does not seem so
far removed from the "end result requirements".

e.g., if short term average page allocation latency for a task
exceeds {10us, 100us, 1ms, 10ms}, then start shooting.
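
A rough sketch of that idea; everything here is hypothetical (the
task_struct fields, the 1/8 EWMA weight, the limit), just to make it
concrete:

/*
 * Hypothetical per-task allocation-latency QoS. Keep a short-term
 * exponential moving average of time spent in the allocator slow
 * path; once it exceeds the task's configured limit, give up on
 * reclaim and go to the OOM killer.
 */
static bool alloc_latency_exceeded(struct task_struct *tsk, u64 delta_ns)
{
	/* avg_alloc_lat_ns / oom_lat_limit_ns would be new fields,
	 * tunable per task or per cgroup (e.g. 10us .. 10ms) */
	tsk->avg_alloc_lat_ns -= tsk->avg_alloc_lat_ns >> 3;
	tsk->avg_alloc_lat_ns += delta_ns >> 3;

	return tsk->oom_lat_limit_ns &&
	       tsk->avg_alloc_lat_ns > tsk->oom_lat_limit_ns;
}

The allocator slow path would time each reclaim round (e.g. with
sched_clock()) and feed the delta in, shooting once this returns true.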

This does not catch the theoretical corner case where memory
is reclaimed very quickly, but just reused for the same thing
because a task is thrashing. But in practice, I think thrashing
workloads quickly lead to a lot of write IOs and major allocation
slowdowns anyway, so I think it could work.

And it could be adjusted per task, per cgroup, etc. and would
not depend on reclaim/allocator implementation details.

I'm sure someone will tell me how horribly flawed the idea is
though :(


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-04-23 20:56 [RFC PATCH] do_try_to_free_pages() might enter infinite loop Ying Han
  2012-04-23 22:20 ` KOSAKI Motohiro
  2012-04-24  5:36 ` Nick Piggin
@ 2012-06-11 23:33 ` KOSAKI Motohiro
  2012-06-11 23:37   ` KOSAKI Motohiro
  2012-06-12  0:53   ` Rik van Riel
  2 siblings, 2 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2012-06-11 23:33 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, Nick Piggin,
	Andrew Morton, linux-mm

On Mon, Apr 23, 2012 at 4:56 PM, Ying Han <yinghan@google.com> wrote:
> This is not a patch targeted to be merged at all, but trying to understand
> a logic in global direct reclaim.
>
> There is a logic in global direct reclaim where reclaim fails on priority 0
> and zone->all_unreclaimable is not set, it will cause the direct to start over
> from DEF_PRIORITY. In some extreme cases, we've seen the system hang which is
> very likely caused by direct reclaim enters infinite loop.
>
> There have been serious patches trying to fix similar issue and the latest
> patch has good summary of all the efforts:
>
> commit 929bea7c714220fc76ce3f75bef9056477c28e74
> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date:   Thu Apr 14 15:22:12 2011 -0700
>
>    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>
> Kosaki explained the problem triggered by async zone->all_unreclaimable and
> zone->pages_scanned where the later one was being checked by direct reclaim.
> However, after the patch, the problem remains where the setting of
> zone->all_unreclaimable is asynchronous with zone is actually reclaimable or not.
>
> The zone->all_unreclaimable flag is set by kswapd by checking zone->pages_scanned in
> zone_reclaimable(). Is that possible to have zone->all_unreclaimable == false while
> the zone is actually unreclaimable?

I'm coming back to this very old thread. :-(
I could reproduce this issue by using memory hotplug. Can anyone
review the following patch?


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-06-11 23:33 ` KOSAKI Motohiro
@ 2012-06-11 23:37   ` KOSAKI Motohiro
  2012-06-14  5:25     ` Ying Han
  2012-06-12  0:53   ` Rik van Riel
  1 sibling, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2012-06-11 23:37 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, Andrew Morton, linux-mm,
	Nick Piggin

Sigh, fix Nick's e-mail address.


<full quote intentionally>

> On Mon, Apr 23, 2012 at 4:56 PM, Ying Han <yinghan@google.com> wrote:
>> This is not a patch targeted to be merged at all, but trying to understand
>> a logic in global direct reclaim.
>>
>> There is a logic in global direct reclaim where reclaim fails on priority 0
>> and zone->all_unreclaimable is not set, it will cause the direct to start over
>> from DEF_PRIORITY. In some extreme cases, we've seen the system hang which is
>> very likely caused by direct reclaim enters infinite loop.
>>
>> There have been serious patches trying to fix similar issue and the latest
>> patch has good summary of all the efforts:
>>
>> commit 929bea7c714220fc76ce3f75bef9056477c28e74
>> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> Date:   Thu Apr 14 15:22:12 2011 -0700
>>
>>    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>>
>> Kosaki explained the problem triggered by async zone->all_unreclaimable and
>> zone->pages_scanned where the later one was being checked by direct reclaim.
>> However, after the patch, the problem remains where the setting of
>> zone->all_unreclaimable is asynchronous with zone is actually reclaimable or not.
>>
>> The zone->all_unreclaimable flag is set by kswapd by checking zone->pages_scanned in
>> zone_reclaimable(). Is that possible to have zone->all_unreclaimable == false while
>> the zone is actually unreclaimable?
>
> I'm coming back to this very old thread. :-(
> I could reproduce this issue by using memory hotplug. Can anyone
> review the following patch?
>
>
> From 767b9ff5b53a34cb95e59a7c230aef3fda07be49 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Mon, 11 Jun 2012 18:48:03 -0400
> Subject: [PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
>
> Currently, do_try_to_free_pages() can enter a livelock, because
> vmscan now has two conflicting policies:
>
> 1) kswapd sleeps when it cannot reclaim any page even after
>   reaching priority 0. This is to avoid an infinite loop in
>   kswapd(). That said, kswapd assumes that direct reclaim makes
>   enough free pages, via either regular page reclaim or the
>   oom-killer. This logic creates a kswapd -> direct-reclaim
>   dependency.
> 2) direct reclaim continues to reclaim without the oom-killer until
>   kswapd turns on zone->all_unreclaimable. This is to avoid a
>   too-early oom-kill. This logic creates a direct-reclaim -> kswapd
>   dependency.
>
> In the worst case, direct reclaim may continue page reclaim forever
> when kswapd is asleep and no other thread wakes it up.
>
> We can't turn zone->all_unreclaimable on from direct reclaim, because
> that is racy; the direct reclaim path doesn't take any lock. Thus
> this patch removes the zone->all_unreclaimable field completely and
> recalculates the zone's reclaimability every time.
>
> Note: we can't take the approach of having direct reclaim check
> zone->pages_scanned directly while kswapd continues to use
> zone->all_unreclaimable, because that is racy. Commit 929bea7c71
> (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
> describes the details.
>
> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Reported-by: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  include/linux/mm_inline.h |   19 +++++++++++++++++
>  include/linux/mmzone.h    |    2 +-
>  include/linux/vmstat.h    |    1 -
>  mm/page-writeback.c       |    2 +
>  mm/page_alloc.c           |    5 +--
>  mm/vmscan.c               |   48 ++++++++++++--------------------------------
>  mm/vmstat.c               |    3 +-
>  7 files changed, 39 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 1397ccf..04f32e1 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -2,6 +2,7 @@
>  #define LINUX_MM_INLINE_H
>
>  #include <linux/huge_mm.h>
> +#include <linux/swap.h>
>
>  /**
>  * page_is_file_cache - should the page be on a file LRU or anon LRU?
> @@ -99,4 +100,22 @@ static __always_inline enum lru_list
> page_lru(struct page *page)
>        return lru;
>  }
>
> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> +{
> +       int nr;
> +
> +       nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> +            zone_page_state(zone, NR_INACTIVE_FILE);
> +
> +       if (nr_swap_pages > 0)
> +               nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> +                     zone_page_state(zone, NR_INACTIVE_ANON);
> +
> +       return nr;
> +}
> +
> +static inline bool zone_reclaimable(struct zone *zone)
> +{
> +       return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +}
>  #endif
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2427706..9d2a720 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -368,7 +368,7 @@ struct zone {
>         * free areas of different sizes
>         */
>        spinlock_t              lock;
> -       int                     all_unreclaimable; /* All pages pinned */
> +
>  #ifdef CONFIG_MEMORY_HOTPLUG
>        /* see spanned/present_pages for more description */
>        seqlock_t               span_seqlock;
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 65efb92..9607256 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -140,7 +140,6 @@ static inline unsigned long
> zone_page_state_snapshot(struct zone *zone,
>  }
>
>  extern unsigned long global_reclaimable_pages(void);
> -extern unsigned long zone_reclaimable_pages(struct zone *zone);
>
>  #ifdef CONFIG_NUMA
>  /*
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 93d8d2f..d2d957f 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -34,6 +34,8 @@
>  #include <linux/syscalls.h>
>  #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
>  #include <linux/pagevec.h>
> +#include <linux/mm_inline.h>
> +
>  #include <trace/events/writeback.h>
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..5716b00 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -59,6 +59,7 @@
>  #include <linux/prefetch.h>
>  #include <linux/migrate.h>
>  #include <linux/page-debug-flags.h>
> +#include <linux/mm_inline.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -638,7 +639,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>        int to_free = count;
>
>        spin_lock(&zone->lock);
> -       zone->all_unreclaimable = 0;
>        zone->pages_scanned = 0;
>
>        while (to_free) {
> @@ -680,7 +680,6 @@ static void free_one_page(struct zone *zone,
> struct page *page, int order,
>                                int migratetype)
>  {
>        spin_lock(&zone->lock);
> -       zone->all_unreclaimable = 0;
>        zone->pages_scanned = 0;
>
>        __free_one_page(page, zone, order, migratetype);
> @@ -2870,7 +2869,7 @@ void show_free_areas(unsigned int filter)
>                        K(zone_page_state(zone, NR_BOUNCE)),
>                        K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
>                        zone->pages_scanned,
> -                       (zone->all_unreclaimable ? "yes" : "no")
> +                      (!zone_reclaimable(zone) ? "yes" : "no")
>                        );
>                printk("lowmem_reserve[]:");
>                for (i = 0; i < MAX_NR_ZONES; i++)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..033671c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1592,7 +1592,7 @@ static void get_scan_count(struct lruvec
> *lruvec, struct scan_control *sc,
>         * latencies, so it's better to scan a minimum amount there as
>         * well.
>         */
> -       if (current_is_kswapd() && zone->all_unreclaimable)
> +       if (current_is_kswapd() && !zone_reclaimable(zone))
>                force_scan = true;
>        if (!global_reclaim(sc))
>                force_scan = true;
> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist
> *zonelist, struct scan_control *sc)
>                if (global_reclaim(sc)) {
>                        if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                                continue;
> -                       if (zone->all_unreclaimable &&
> -                                       sc->priority != DEF_PRIORITY)
> +                       if (!zone_reclaimable(zone) &&
> +                           sc->priority != DEF_PRIORITY)
>                                continue;       /* Let kswapd poll it */
>                        if (COMPACTION_BUILD) {
>                                /*
> @@ -1975,11 +1975,6 @@ static bool shrink_zones(struct zonelist
> *zonelist, struct scan_control *sc)
>        return aborted_reclaim;
>  }
>
> -static bool zone_reclaimable(struct zone *zone)
> -{
> -       return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> -}
> -
>  /* All zones in zonelist are unreclaimable? */
>  static bool all_unreclaimable(struct zonelist *zonelist,
>                struct scan_control *sc)
> @@ -1993,7 +1988,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                        continue;
>                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                        continue;
> -               if (!zone->all_unreclaimable)
> +               if (zone_reclaimable(zone))
>                        return false;
>        }
>
> @@ -2299,7 +2294,7 @@ static bool sleeping_prematurely(pg_data_t
> *pgdat, int order, long remaining,
>                 * they must be considered balanced here as well if kswapd
>                 * is to sleep
>                 */
> -               if (zone->all_unreclaimable) {
> +               if (!zone_reclaimable(zone)) {
>                        balanced += zone->present_pages;
>                        continue;
>                }
> @@ -2393,8 +2388,7 @@ loop_again:
>                        if (!populated_zone(zone))
>                                continue;
>
> -                       if (zone->all_unreclaimable &&
> -                           sc.priority != DEF_PRIORITY)
> +                       if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>                                continue;
>
>                        /*
> @@ -2443,14 +2437,13 @@ loop_again:
>                 */
>                for (i = 0; i <= end_zone; i++) {
>                        struct zone *zone = pgdat->node_zones + i;
> -                       int nr_slab, testorder;
> +                       int testorder;
>                        unsigned long balance_gap;
>
>                        if (!populated_zone(zone))
>                                continue;
>
> -                       if (zone->all_unreclaimable &&
> -                           sc.priority != DEF_PRIORITY)
> +                       if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>                                continue;
>
>                        sc.nr_scanned = 0;
> @@ -2497,12 +2490,11 @@ loop_again:
>                                shrink_zone(zone, &sc);
>
>                                reclaim_state->reclaimed_slab = 0;
> -                               nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> +                               shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>                                sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>                                total_scanned += sc.nr_scanned;
>
> -                               if (nr_slab == 0 && !zone_reclaimable(zone))
> -                                       zone->all_unreclaimable = 1;
> +
>                        }
>
>                        /*
> @@ -2514,7 +2506,7 @@ loop_again:
>                            total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>                                sc.may_writepage = 1;
>
> -                       if (zone->all_unreclaimable) {
> +                       if (!zone_reclaimable(zone)) {
>                                if (end_zone && end_zone == i)
>                                        end_zone--;
>                                continue;
> @@ -2616,7 +2608,7 @@ out:
>                        if (!populated_zone(zone))
>                                continue;
>
> -                       if (zone->all_unreclaimable &&
> +                       if (!zone_reclaimable(zone) &&
>                            sc.priority != DEF_PRIORITY)
>                                continue;
>
> @@ -2850,20 +2842,6 @@ unsigned long global_reclaimable_pages(void)
>        return nr;
>  }
>
> -unsigned long zone_reclaimable_pages(struct zone *zone)
> -{
> -       int nr;
> -
> -       nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -            zone_page_state(zone, NR_INACTIVE_FILE);
> -
> -       if (nr_swap_pages > 0)
> -               nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -                     zone_page_state(zone, NR_INACTIVE_ANON);
> -
> -       return nr;
> -}
> -
>  #ifdef CONFIG_HIBERNATION
>  /*
>  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
> @@ -3158,7 +3136,7 @@ int zone_reclaim(struct zone *zone, gfp_t
> gfp_mask, unsigned int order)
>            zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>                return ZONE_RECLAIM_FULL;
>
> -       if (zone->all_unreclaimable)
> +       if (!zone_reclaimable(zone))
>                return ZONE_RECLAIM_FULL;
>
>        /*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1bbbbd9..94b9d4c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -19,6 +19,7 @@
>  #include <linux/math64.h>
>  #include <linux/writeback.h>
>  #include <linux/compaction.h>
> +#include <linux/mm_inline.h>
>
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
> @@ -1022,7 +1023,7 @@ static void zoneinfo_show_print(struct seq_file
> *m, pg_data_t *pgdat,
>                   "\n  all_unreclaimable: %u"
>                   "\n  start_pfn:         %lu"
>                   "\n  inactive_ratio:    %u",
> -                  zone->all_unreclaimable,
> +                  !zone_reclaimable(zone),
>                   zone->zone_start_pfn,
>                   zone->inactive_ratio);
>        seq_putc(m, '\n');
> --
> 1.7.1
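
To make the recalculated check concrete: zone_reclaimable() above
treats a zone as reclaimable while pages_scanned is below six times
the zone's reclaimable LRU pages. For example, a zone with 1000
reclaimable pages only counts as unreclaimable once 6000 pages have
been scanned with no page freed in between (freeing a page resets
pages_scanned to 0, as the page_alloc.c hunks show). This is the same
threshold the old zone_reclaimable() used; only the cached
zone->all_unreclaimable flag is gone.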


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-06-11 23:33 ` KOSAKI Motohiro
  2012-06-11 23:37   ` KOSAKI Motohiro
@ 2012-06-12  0:53   ` Rik van Riel
  1 sibling, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2012-06-12  0:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ying Han, Michal Hocko, Johannes Weiner, Mel Gorman,
	KAMEZAWA Hiroyuki, Minchan Kim, Hugh Dickins, Nick Piggin,
	Andrew Morton, linux-mm

On 06/11/2012 07:33 PM, KOSAKI Motohiro wrote:
> On Mon, Apr 23, 2012 at 4:56 PM, Ying Han<yinghan@google.com>  wrote:
>> This is not a patch targeted to be merged at all, but trying to understand
>> a logic in global direct reclaim.
>>
>> There is a logic in global direct reclaim where reclaim fails on priority 0
>> and zone->all_unreclaimable is not set, it will cause the direct to start over
>> from DEF_PRIORITY. In some extreme cases, we've seen the system hang which is
>> very likely caused by direct reclaim enters infinite loop.
>>
>> There have been serious patches trying to fix similar issue and the latest
>> patch has good summary of all the efforts:
>>
>> commit 929bea7c714220fc76ce3f75bef9056477c28e74
>> Author: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
>> Date:   Thu Apr 14 15:22:12 2011 -0700
>>
>>     vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>>
>> Kosaki explained the problem triggered by async zone->all_unreclaimable and
>> zone->pages_scanned where the later one was being checked by direct reclaim.
>> However, after the patch, the problem remains where the setting of
>> zone->all_unreclaimable is asynchronous with zone is actually reclaimable or not.
>>
>> The zone->all_unreclaimable flag is set by kswapd by checking zone->pages_scanned in
>> zone_reclaimable(). Is that possible to have zone->all_unreclaimable == false while
>> the zone is actually unreclaimable?
>
> I'm coming back to this very old thread. :-(
> I could reproduce this issue by using memory hotplug. Can anyone
> review the following patch?

Looks like a sane approach to me.

> Reported-by: Aaditya Kumar<aaditya.kumar.30@gmail.com>
> Reported-by: Ying Han<yinghan@google.com>
> Cc: Nick Piggin<npiggin@gmail.com>
> Cc: Rik van Riel<riel@redhat.com>
> Cc: Michal Hocko<mhocko@suse.cz>
> Cc: Johannes Weiner<hannes@cmpxchg.org>
> Cc: Mel Gorman<mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim<minchan.kim@gmail.com>
> Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


* Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop
  2012-06-11 23:37   ` KOSAKI Motohiro
@ 2012-06-14  5:25     ` Ying Han
  0 siblings, 0 replies; 22+ messages in thread
From: Ying Han @ 2012-06-14  5:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Minchan Kim, Hugh Dickins, Andrew Morton, linux-mm,
	Nick Piggin

On Mon, Jun 11, 2012 at 4:37 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Sigh, fix Nick's e-mail address.
>
>
> <full quote intentionally>
>
>> On Mon, Apr 23, 2012 at 4:56 PM, Ying Han <yinghan@google.com> wrote:
>>> This is not a patch targeted to be merged at all, but trying to understand
>>> a logic in global direct reclaim.
>>>
>>> There is a logic in global direct reclaim where reclaim fails on priority 0
>>> and zone->all_unreclaimable is not set, it will cause the direct to start over
>>> from DEF_PRIORITY. In some extreme cases, we've seen the system hang which is
>>> very likely caused by direct reclaim enters infinite loop.
>>>
>>> There have been serious patches trying to fix similar issue and the latest
>>> patch has good summary of all the efforts:
>>>
>>> commit 929bea7c714220fc76ce3f75bef9056477c28e74
>>> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> Date:   Thu Apr 14 15:22:12 2011 -0700
>>>
>>>    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>>>
>>> Kosaki explained the problem triggered by async zone->all_unreclaimable and
>>> zone->pages_scanned where the later one was being checked by direct reclaim.
>>> However, after the patch, the problem remains where the setting of
>>> zone->all_unreclaimable is asynchronous with zone is actually reclaimable or not.
>>>
>>> The zone->all_unreclaimable flag is set by kswapd by checking zone->pages_scanned in
>>> zone_reclaimable(). Is that possible to have zone->all_unreclaimable == false while
>>> the zone is actually unreclaimable?
>>
>> I'm coming back to this very old thread. :-(
>> I could reproduce this issue by using memory hotplug. Can anyone
>> review the following patch?
>>
>>
>> From 767b9ff5b53a34cb95e59a7c230aef3fda07be49 Mon Sep 17 00:00:00 2001
>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> Date: Mon, 11 Jun 2012 18:48:03 -0400
>> Subject: [PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
>>
>> Currently, do_try_to_free_pages() can enter a livelock, because
>> vmscan now has two conflicting policies:
>>
>> 1) kswapd sleeps when it cannot reclaim any page even after
>>   reaching priority 0. This is to avoid an infinite loop in
>>   kswapd(). That said, kswapd assumes that direct reclaim makes
>>   enough free pages, via either regular page reclaim or the
>>   oom-killer. This logic creates a kswapd -> direct-reclaim
>>   dependency.
>> 2) direct reclaim continues to reclaim without the oom-killer until
>>   kswapd turns on zone->all_unreclaimable. This is to avoid a
>>   too-early oom-kill. This logic creates a direct-reclaim -> kswapd
>>   dependency.
>>
>> In the worst case, direct reclaim may continue page reclaim forever
>> when kswapd is asleep and no other thread wakes it up.
>>
>> We can't turn zone->all_unreclaimable on from direct reclaim, because
>> that is racy; the direct reclaim path doesn't take any lock. Thus
>> this patch removes the zone->all_unreclaimable field completely and
>> recalculates the zone's reclaimability every time.
>>
>> Note: we can't take the approach of having direct reclaim check
>> zone->pages_scanned directly while kswapd continues to use
>> zone->all_unreclaimable, because that is racy. Commit 929bea7c71
>> (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
>> describes the details.
>>
>> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
>> Reported-by: Ying Han <yinghan@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Cc: Minchan Kim <minchan.kim@gmail.com>
>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> ---
>>  include/linux/mm_inline.h |   19 +++++++++++++++++
>>  include/linux/mmzone.h    |    2 +-
>>  include/linux/vmstat.h    |    1 -
>>  mm/page-writeback.c       |    2 +
>>  mm/page_alloc.c           |    5 +--
>>  mm/vmscan.c               |   48 ++++++++++++--------------------------------
>>  mm/vmstat.c               |    3 +-
>>  7 files changed, 39 insertions(+), 41 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index 1397ccf..04f32e1 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -2,6 +2,7 @@
>>  #define LINUX_MM_INLINE_H
>>
>>  #include <linux/huge_mm.h>
>> +#include <linux/swap.h>
>>
>>  /**
>>  * page_is_file_cache - should the page be on a file LRU or anon LRU?
>> @@ -99,4 +100,22 @@ static __always_inline enum lru_list
>> page_lru(struct page *page)
>>        return lru;
>>  }
>>
>> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
>> +{
>> +       int nr;
>> +
>> +       nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>> +            zone_page_state(zone, NR_INACTIVE_FILE);
>> +
>> +       if (nr_swap_pages > 0)
>> +               nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>> +                     zone_page_state(zone, NR_INACTIVE_ANON);
>> +
>> +       return nr;
>> +}
>> +
>> +static inline bool zone_reclaimable(struct zone *zone)
>> +{
>> +       return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> +}
>>  #endif
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 2427706..9d2a720 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -368,7 +368,7 @@ struct zone {
>>         * free areas of different sizes
>>         */
>>        spinlock_t              lock;
>> -       int                     all_unreclaimable; /* All pages pinned */
>> +
>>  #ifdef CONFIG_MEMORY_HOTPLUG
>>        /* see spanned/present_pages for more description */
>>        seqlock_t               span_seqlock;
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 65efb92..9607256 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -140,7 +140,6 @@ static inline unsigned long
>> zone_page_state_snapshot(struct zone *zone,
>>  }
>>
>>  extern unsigned long global_reclaimable_pages(void);
>> -extern unsigned long zone_reclaimable_pages(struct zone *zone);
>>
>>  #ifdef CONFIG_NUMA
>>  /*
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 93d8d2f..d2d957f 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -34,6 +34,8 @@
>>  #include <linux/syscalls.h>
>>  #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
>>  #include <linux/pagevec.h>
>> +#include <linux/mm_inline.h>
>> +
>>  #include <trace/events/writeback.h>
>>
>>  /*
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 4403009..5716b00 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -59,6 +59,7 @@
>>  #include <linux/prefetch.h>
>>  #include <linux/migrate.h>
>>  #include <linux/page-debug-flags.h>
>> +#include <linux/mm_inline.h>
>>
>>  #include <asm/tlbflush.h>
>>  #include <asm/div64.h>
>> @@ -638,7 +639,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>        int to_free = count;
>>
>>        spin_lock(&zone->lock);
>> -       zone->all_unreclaimable = 0;
>>        zone->pages_scanned = 0;
>>
>>        while (to_free) {
>> @@ -680,7 +680,6 @@ static void free_one_page(struct zone *zone,
>> struct page *page, int order,
>>                                int migratetype)
>>  {
>>        spin_lock(&zone->lock);
>> -       zone->all_unreclaimable = 0;
>>        zone->pages_scanned = 0;
>>
>>        __free_one_page(page, zone, order, migratetype);
>> @@ -2870,7 +2869,7 @@ void show_free_areas(unsigned int filter)
>>                        K(zone_page_state(zone, NR_BOUNCE)),
>>                        K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
>>                        zone->pages_scanned,
>> -                       (zone->all_unreclaimable ? "yes" : "no")
>> +                      (!zone_reclaimable(zone) ? "yes" : "no")
>>                        );
>>                printk("lowmem_reserve[]:");
>>                for (i = 0; i < MAX_NR_ZONES; i++)
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index eeb3bc9..033671c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1592,7 +1592,7 @@ static void get_scan_count(struct lruvec
>> *lruvec, struct scan_control *sc,
>>         * latencies, so it's better to scan a minimum amount there as
>>         * well.
>>         */
>> -       if (current_is_kswapd() && zone->all_unreclaimable)
>> +       if (current_is_kswapd() && !zone_reclaimable(zone))
>>                force_scan = true;
>>        if (!global_reclaim(sc))
>>                force_scan = true;
>> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist
>> *zonelist, struct scan_control *sc)
>>                if (global_reclaim(sc)) {
>>                        if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>                                continue;
>> -                       if (zone->all_unreclaimable &&
>> -                                       sc->priority != DEF_PRIORITY)
>> +                       if (!zone_reclaimable(zone) &&
>> +                           sc->priority != DEF_PRIORITY)
>>                                continue;       /* Let kswapd poll it */
>>                        if (COMPACTION_BUILD) {
>>                                /*
>> @@ -1975,11 +1975,6 @@ static bool shrink_zones(struct zonelist
>> *zonelist, struct scan_control *sc)
>>        return aborted_reclaim;
>>  }
>>
>> -static bool zone_reclaimable(struct zone *zone)
>> -{
>> -       return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> -}
>> -
>>  /* All zones in zonelist are unreclaimable? */
>>  static bool all_unreclaimable(struct zonelist *zonelist,
>>                struct scan_control *sc)
>> @@ -1993,7 +1988,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>>                        continue;
>>                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>                        continue;
>> -               if (!zone->all_unreclaimable)
>> +               if (zone_reclaimable(zone))
>>                        return false;
>>        }
>>
>> @@ -2299,7 +2294,7 @@ static bool sleeping_prematurely(pg_data_t
>> *pgdat, int order, long remaining,
>>                 * they must be considered balanced here as well if kswapd
>>                 * is to sleep
>>                 */
>> -               if (zone->all_unreclaimable) {
>> +               if (!zone_reclaimable(zone)) {
>>                        balanced += zone->present_pages;
>>                        continue;
>>                }
>> @@ -2393,8 +2388,7 @@ loop_again:
>>                        if (!populated_zone(zone))
>>                                continue;
>>
>> -                       if (zone->all_unreclaimable &&
>> -                           sc.priority != DEF_PRIORITY)
>> +                       if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>>                                continue;
>>
>>                        /*
>> @@ -2443,14 +2437,13 @@ loop_again:
>>                 */
>>                for (i = 0; i <= end_zone; i++) {
>>                        struct zone *zone = pgdat->node_zones + i;
>> -                       int nr_slab, testorder;
>> +                       int testorder;
>>                        unsigned long balance_gap;
>>
>>                        if (!populated_zone(zone))
>>                                continue;
>>
>> -                       if (zone->all_unreclaimable &&
>> -                           sc.priority != DEF_PRIORITY)
>> +                       if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>>                                continue;
>>
>>                        sc.nr_scanned = 0;
>> @@ -2497,12 +2490,11 @@ loop_again:
>>                                shrink_zone(zone, &sc);
>>
>>                                reclaim_state->reclaimed_slab = 0;
>> -                               nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>> +                               shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>>                                sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>>                                total_scanned += sc.nr_scanned;
>>
>> -                               if (nr_slab == 0 && !zone_reclaimable(zone))
>> -                                       zone->all_unreclaimable = 1;
>> +
>>                        }
>>
>>                        /*
>> @@ -2514,7 +2506,7 @@ loop_again:
>>                            total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>>                                sc.may_writepage = 1;
>>
>> -                       if (zone->all_unreclaimable) {
>> +                       if (!zone_reclaimable(zone)) {
>>                                if (end_zone && end_zone == i)
>>                                        end_zone--;
>>                                continue;
>> @@ -2616,7 +2608,7 @@ out:
>>                        if (!populated_zone(zone))
>>                                continue;
>>
>> -                       if (zone->all_unreclaimable &&
>> +                       if (!zone_reclaimable(zone) &&
>>                            sc.priority != DEF_PRIORITY)
>>                                continue;
>>
>> @@ -2850,20 +2842,6 @@ unsigned long global_reclaimable_pages(void)
>>        return nr;
>>  }
>>
>> -unsigned long zone_reclaimable_pages(struct zone *zone)
>> -{
>> -       int nr;
>> -
>> -       nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>> -            zone_page_state(zone, NR_INACTIVE_FILE);
>> -
>> -       if (nr_swap_pages > 0)
>> -               nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>> -                     zone_page_state(zone, NR_INACTIVE_ANON);
>> -
>> -       return nr;
>> -}
>> -
>>  #ifdef CONFIG_HIBERNATION
>>  /*
>>  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
>> @@ -3158,7 +3136,7 @@ int zone_reclaim(struct zone *zone, gfp_t
>> gfp_mask, unsigned int order)
>>            zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>>                return ZONE_RECLAIM_FULL;
>>
>> -       if (zone->all_unreclaimable)
>> +       if (!zone_reclaimable(zone))
>>                return ZONE_RECLAIM_FULL;
>>
>>        /*
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 1bbbbd9..94b9d4c 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/math64.h>
>>  #include <linux/writeback.h>
>>  #include <linux/compaction.h>
>> +#include <linux/mm_inline.h>
>>
>>  #ifdef CONFIG_VM_EVENT_COUNTERS
>>  DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
>> @@ -1022,7 +1023,7 @@ static void zoneinfo_show_print(struct seq_file
>> *m, pg_data_t *pgdat,
>>                   "\n  all_unreclaimable: %u"
>>                   "\n  start_pfn:         %lu"
>>                   "\n  inactive_ratio:    %u",
>> -                  zone->all_unreclaimable,
>> +                  !zone_reclaimable(zone),
>>                   zone->zone_start_pfn,
>>                   zone->inactive_ratio);
>>        seq_putc(m, '\n');
>> --
>> 1.7.1

Thank you for looking into this, and I like the idea of removing the
zone->all_unreclaimable flag. We've seen the same problem triggered in
production recently again, but I don't have a simple workload to
reproduce it manually.
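
For reference, the memory-hotplug reproduction KOSAKI mentions can be
driven from userspace through the standard sysfs interface. A minimal
sketch is below; the block number is an arbitrary example, not
necessarily his exact setup:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* offline one memory block (requires CONFIG_MEMORY_HOTPLUG) */
	const char *path = "/sys/devices/system/memory/memory32/state";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* the kernel migrates the block's pages away, or the write fails */
	if (write(fd, "offline", 7) != 7)
		perror("write");
	close(fd);
	return 0;
}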

--Ying


end of thread

Thread overview: 22+ messages
2012-04-23 20:56 [RFC PATCH] do_try_to_free_pages() might enter infinite loop Ying Han
2012-04-23 22:20 ` KOSAKI Motohiro
2012-04-23 23:18   ` Ying Han
2012-04-23 23:19     ` Ying Han
2012-04-24  1:31     ` Minchan Kim
2012-04-24  2:06       ` Ying Han
2012-04-24 16:36       ` Ying Han
2012-04-24 16:38         ` Rik van Riel
2012-04-24 16:45           ` KOSAKI Motohiro
2012-04-24 17:22             ` Ying Han
2012-04-24 17:17           ` Ying Han
2012-04-24  5:36 ` Nick Piggin
2012-04-24 18:37   ` Ying Han
2012-05-01  3:34     ` Nick Piggin
2012-05-01 16:18       ` Ying Han
2012-05-01 16:20         ` Ying Han
2012-05-01 17:06         ` Rik van Riel
2012-05-02  3:25           ` Nick Piggin
2012-06-11 23:33 ` KOSAKI Motohiro
2012-06-11 23:37   ` KOSAKI Motohiro
2012-06-14  5:25     ` Ying Han
2012-06-12  0:53   ` Rik van Riel
