* Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
@ 2011-05-13 21:31 Andrew Barry
  2011-05-17 10:34 ` Minchan Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Barry @ 2011-05-13 21:31 UTC (permalink / raw)
  To: linux-mm

I believe I found a problem in __alloc_pages_slowpath, which allows a process to
get stuck endlessly looping, even when lots of memory is available.

Running an I/O and memory intensive stress-test, I see an order-0 page allocation
with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
Right about the same time that the stress-test gets killed by the OOM-killer,
the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
though most of the system's memory was freed by the oom-kill of the stress-test.

The utility ends up looping from the rebalance label down through
wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
skips the call to get_page_from_freelist. Because all of the reclaimable memory
on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
__alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
jumps back to rebalance without ever calling get_page_from_freelist. This loop
repeats infinitely.
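
For reference, the control flow in question looks roughly like this (a
simplified sketch of this era's __alloc_pages_slowpath; arguments and several
unrelated checks are elided, so it is not the verbatim kernel code):

restart:
	wake_all_kswapd(order, zonelist, high_zoneidx, ...);

	/* This is the last chance, in general, before the goto nopage. */
	page = get_page_from_freelist(gfp_mask, nodemask, order, ...);
	if (page)
		goto got_pg;

rebalance:
	/* Allocate without watermarks if the context allows */
	if (alloc_flags & ALLOC_NO_WATERMARKS) { ... }

	/* Returns NULL without trying the freelist when order == 0 */
	page = __alloc_pages_direct_compact(..., order, ...);
	if (page)
		goto got_pg;

	/* Tries the freelist only when reclaim made progress */
	page = __alloc_pages_direct_reclaim(..., &did_some_progress);
	if (page)
		goto got_pg;

	if (!did_some_progress) {
		/* The OOM killer is only an option for __GFP_FS callers */
		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
			page = __alloc_pages_may_oom(...);
			if (page)
				goto got_pg;
			goto restart;
		}
	}

	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
		/* Wait for some write requests to complete then retry */
		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
		goto rebalance;	/* loops back without a freelist retry */
	}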

Is there a reason that this loop is set up this way for order-0 allocations? I
applied the patch below, and the problem corrects itself. Does anyone have any
thoughts on the patch, or on a better way to address this situation?

The test case is pretty pathological. Running a mix of I/O stress-tests that do
a lot of fork() and consume all of the system memory, I can pretty reliably hit
this on 600 nodes (32GB/node) in about 12 hours.

Thanks
Andrew Barry

---
 mm/page_alloc.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..c719664 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2158,7 +2158,10 @@ rebalance:
        if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
                /* Wait for some write requests to complete then retry */
                wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
-               goto rebalance;
+               if (did_some_progress)
+                       goto rebalance;
+               else
+                       goto restart;
        } else {
                /*
                 * High-order allocations do not necessarily loop after


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-13 21:31 Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch Andrew Barry
@ 2011-05-17 10:34 ` Minchan Kim
  2011-05-17 11:34   ` Mel Gorman
  2011-05-17 15:49   ` Andrew Barry
  0 siblings, 2 replies; 39+ messages in thread
From: Minchan Kim @ 2011-05-17 10:34 UTC (permalink / raw)
  To: Andrew Barry
  Cc: linux-mm, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Andrew Morton, Johannes Weiner

On Sat, May 14, 2011 at 6:31 AM, Andrew Barry <abarry@cray.com> wrote:
> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> get stuck endlessly looping, even when lots of memory is available.
>
> Running an I/O and memory intensive stress-test, I see an order-0 page allocation
> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
> Right about the same time that the stress-test gets killed by the OOM-killer,
> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> though most of the system's memory was freed by the oom-kill of the stress-test.
>
> The utility ends up looping from the rebalance label down through
> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
> skips the call to get_page_from_freelist. Because all of the reclaimable memory
> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
> __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
> jumps back to rebalance without ever calling get_page_from_freelist. This loop
> repeats infinitely.
>
> Is there a reason that this loop is set up this way for order-0 allocations? I
> applied the patch below, and the problem corrects itself. Does anyone have any
> thoughts on the patch, or on a better way to address this situation?
>
> The test case is pretty pathological. Running a mix of I/O stress-tests that do
> a lot of fork() and consume all of the system memory, I can pretty reliably hit
> this on 600 nodes (32GB/node) in about 12 hours.
>

It's amazing.
I think it's _very_ rare, but it's possible if the test program killed by
the OOM-killer held mostly anonymous pages and the allocating task tries
to allocate an order-0 page with GFP_NOFS.

When the [in]active lists suddenly become empty (though I am not sure how
that situation comes about) and we are reclaiming an order-0 page, neither
compaction nor __alloc_pages_direct_reclaim helps. Compaction doesn't help
because this is order-0 reclaim. __alloc_pages_direct_reclaim would help
only if there were LRU pages on the [in]active lists, but there are none.
So the last resort is the following code in do_try_to_free_pages:

        /* top priority shrink_zones still had more to do? don't OOM, then */
        if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
                return 1;

But it has a problem, too. all_unreclaimable checks zone->all_unreclaimable,
and kswapd sets zone->all_unreclaimable only once the condition below stops
holding:

zone->pages_scanned < zone_reclaimable_pages(zone) * 6

If the LRU lists are completely empty, shrink_zone does nothing, so
zone->pages_scanned stays zero. But as we know, zone_page_state isn't exact
because of the per-CPU pagesets, so zone_reclaimable_pages may report a small
positive value. As a result, zone_reclaimable always returns true, which means
kswapd never sets zone->all_unreclaimable, and the last resort becomes a no-op.
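
For context, the two checks involved look roughly like this (a simplified
paraphrase of mm/vmscan.c from this era, not the verbatim source):

static bool zone_reclaimable(struct zone *zone)
{
	/*
	 * kswapd marks a zone all_unreclaimable only once this stops
	 * being true; with an empty LRU, pages_scanned stays zero
	 * while zone_reclaimable_pages() may still read positive.
	 */
	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}

static bool all_unreclaimable(struct zonelist *zonelist,
				struct scan_control *sc)
{
	struct zoneref *z;
	struct zone *zone;
	bool all_unreclaimable = true;

	for_each_zone_zonelist_nodemask(zone, z, zonelist,
			gfp_zone(sc->gfp_mask), sc->nodemask) {
		if (!populated_zone(zone))
			continue;
		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
			continue;
		/* one seemingly reclaimable zone defeats the last resort */
		if (zone_reclaimable(zone)) {
			all_unreclaimable = false;
			break;
		}
	}

	return all_unreclaimable;
}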

In this case, the current allocation never gets a chance to call
get_page_from_freelist, as Andrew Barry said.

Does that make sense? If so, how about this?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebc7faa..4f64355 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2105,6 +2105,7 @@ restart:
                first_zones_zonelist(zonelist, high_zoneidx, NULL,
                                        &preferred_zone);

+rebalance:
        /* This is the last chance, in general, before the goto nopage. */
        page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
                        high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -2112,7 +2113,6 @@ restart:
        if (page)
                goto got_pg;

-rebalance:
        /* Allocate without watermarks if the context allows */
        if (alloc_flags & ALLOC_NO_WATERMARKS) {
                page = __alloc_pages_high_priority(gfp_mask, order,


> Thanks
> Andrew Barry
>
> ---
>  mm/page_alloc.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9f8a97b..c719664 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2158,7 +2158,10 @@ rebalance:
>        if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
>                /* Wait for some write requests to complete then retry */
>                wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
> -               goto rebalance;
> +               if (did_some_progress)
> +                       goto rebalance;
> +               else
> +                       goto restart;
>        } else {
>                /*
>                 * High-order allocations do not necessarily loop after
>



-- 
Kind regards,
Minchan Kim


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-17 10:34 ` Minchan Kim
@ 2011-05-17 11:34   ` Mel Gorman
  2011-05-17 15:49   ` Andrew Barry
  1 sibling, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2011-05-17 11:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Barry, linux-mm, Rik van Riel, KOSAKI Motohiro,
	Andrew Morton, Johannes Weiner

On Tue, May 17, 2011 at 07:34:47PM +0900, Minchan Kim wrote:
> On Sat, May 14, 2011 at 6:31 AM, Andrew Barry <abarry@cray.com> wrote:
> > I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> > get stuck endlessly looping, even when lots of memory is available.
> >
> > Running an I/O and memory intensive stress-test, I see an order-0 page allocation
> > with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
> > Right about the same time that the stress-test gets killed by the OOM-killer,
> > the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> > though most of the system's memory was freed by the oom-kill of the stress-test.
> >
> > The utility ends up looping from the rebalance label down through
> > wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
> > skips the call to get_page_from_freelist. Because all of the reclaimable memory
> > on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
> > call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
> > __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
> > jumps back to rebalance without ever calling get_page_from_freelist. This loop
> > repeats infinitely.
> >
> > Is there a reason that this loop is set up this way for order-0 allocations? I
> > applied the patch below, and the problem corrects itself. Does anyone have any
> > thoughts on the patch, or on a better way to address this situation?
> >
> > The test case is pretty pathological. Running a mix of I/O stress-tests that do
> > a lot of fork() and consume all of the system memory, I can pretty reliably hit
> > this on 600 nodes (32GB/node) in about 12 hours.
> >
> 
> It's amazing.
> I think it's _very_ rare, but it's possible if the test program killed by
> the OOM-killer held mostly anonymous pages and the allocating task tries
> to allocate an order-0 page with GFP_NOFS.
>
> When the [in]active lists suddenly become empty (though I am not sure how
> that situation comes about)

Maybe because the stress test consumed almost all, if not all, of the
LRU and then got oom-killed, emptying the lists.

> and we are reclaiming an order-0 page, neither compaction nor
> __alloc_pages_direct_reclaim helps. Compaction doesn't help because this is
> order-0 reclaim. __alloc_pages_direct_reclaim would help only if there were
> LRU pages on the [in]active lists, but there are none.
> So the last resort is the following code in do_try_to_free_pages:
> 
>         /* top priority shrink_zones still had more to do? don't OOM, then */
>         if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
>                 return 1;
> 
> But it has a problem, too. all_unreclaimable checks zone->all_unreclaimable,
> and kswapd sets zone->all_unreclaimable only once the condition below stops
> holding:
>
> zone->pages_scanned < zone_reclaimable_pages(zone) * 6
> 
> If the LRU lists are completely empty, shrink_zone does nothing, so
> zone->pages_scanned stays zero. But as we know, zone_page_state isn't exact
> because of the per-CPU pagesets, so zone_reclaimable_pages may report a small
> positive value. As a result, zone_reclaimable always returns true, which means
> kswapd never sets zone->all_unreclaimable, and the last resort becomes a no-op.
> 
> In this case, the current allocation never gets a chance to call
> get_page_from_freelist, as Andrew Barry said.
>
> Does that make sense? If so, how about this?
> 

This looks like a better fix. The alternative fix continually wakes
kswapd and takes additional unnecessary steps.

Thanks.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-17 10:34 ` Minchan Kim
  2011-05-17 11:34   ` Mel Gorman
@ 2011-05-17 15:49   ` Andrew Barry
  2011-05-18 22:29     ` Minchan Kim
  1 sibling, 1 reply; 39+ messages in thread
From: Andrew Barry @ 2011-05-17 15:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Andrew Morton, Johannes Weiner

On 05/17/2011 05:34 AM, Minchan Kim wrote:
> On Sat, May 14, 2011 at 6:31 AM, Andrew Barry <abarry@cray.com> wrote:
>> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
>> get stuck endlessly looping, even when lots of memory is available.
>>
>> Running an I/O and memory intensive stress-test, I see an order-0 page allocation
>> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
>> Right about the same time that the stress-test gets killed by the OOM-killer,
>> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
>> though most of the system's memory was freed by the oom-kill of the stress-test.
>>
>> The utility ends up looping from the rebalance label down through
>> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
>> skips the call to get_page_from_freelist. Because all of the reclaimable memory
>> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
>> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
>> __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
>> jumps back to rebalance without ever calling get_page_from_freelist. This loop
>> repeats infinitely.
>>
>> Is there a reason that this loop is set up this way for order-0 allocations? I
>> applied the patch below, and the problem corrects itself. Does anyone have any
>> thoughts on the patch, or on a better way to address this situation?
>>
>> The test case is pretty pathological. Running a mix of I/O stress-tests that do
>> a lot of fork() and consume all of the system memory, I can pretty reliably hit
>> this on 600 nodes (32GB/node) in about 12 hours.
>>
> 
> It's amazing.
> I think it's _very_ rare, but it's possible if the test program killed by
> the OOM-killer held mostly anonymous pages and the allocating task tries
> to allocate an order-0 page with GFP_NOFS.

Unfortunately, "very rare" is a subjective thing. We have been hitting it a
couple of times a week in our test lab.

> When the [in]active lists suddenly become empty (though I am not sure how
> that situation comes about) and we are reclaiming an order-0 page, neither
> compaction nor __alloc_pages_direct_reclaim helps. Compaction doesn't help
> because this is order-0 reclaim. __alloc_pages_direct_reclaim would help
> only if there were LRU pages on the [in]active lists, but there are none.
> So the last resort is the following code in do_try_to_free_pages:
> 
>         /* top priority shrink_zones still had more to do? don't OOM, then */
>         if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
>                 return 1;
> 
> But it has a problem, too. all_unreclaimable checks zone->all_unreclaimable,
> and kswapd sets zone->all_unreclaimable only once the condition below stops
> holding:
>
> zone->pages_scanned < zone_reclaimable_pages(zone) * 6
> 
> If the LRU lists are completely empty, shrink_zone does nothing, so
> zone->pages_scanned stays zero. But as we know, zone_page_state isn't exact
> because of the per-CPU pagesets, so zone_reclaimable_pages may report a small
> positive value. As a result, zone_reclaimable always returns true, which means
> kswapd never sets zone->all_unreclaimable, and the last resort becomes a no-op.
> 
> In this case, the current allocation never gets a chance to call
> get_page_from_freelist, as Andrew Barry said.
>
> Does that make sense? If so, how about this?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ebc7faa..4f64355 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2105,6 +2105,7 @@ restart:
>                 first_zones_zonelist(zonelist, high_zoneidx, NULL,
>                                         &preferred_zone);
> 
> +rebalance:
>         /* This is the last chance, in general, before the goto nopage. */
>         page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>                         high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> @@ -2112,7 +2113,6 @@ restart:
>         if (page)
>                 goto got_pg;
> 
> -rebalance:
>         /* Allocate without watermarks if the context allows */
>         if (alloc_flags & ALLOC_NO_WATERMARKS) {
>                 page = __alloc_pages_high_priority(gfp_mask, order,

I think your solution is simpler than my patch.
Thanks very much.
-Andrew






^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-17 15:49   ` Andrew Barry
@ 2011-05-18 22:29     ` Minchan Kim
  2011-05-20 16:49         ` Minchan Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2011-05-18 22:29 UTC (permalink / raw)
  To: Andrew Barry
  Cc: linux-mm, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Andrew Morton, Johannes Weiner

Hi Andrew,

On Wed, May 18, 2011 at 12:49 AM, Andrew Barry <abarry@cray.com> wrote:
> On 05/17/2011 05:34 AM, Minchan Kim wrote:
>> On Sat, May 14, 2011 at 6:31 AM, Andrew Barry <abarry@cray.com> wrote:
>>> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
>>> get stuck endlessly looping, even when lots of memory is available.
>>>
>>> Running an I/O and memory intensive stress-test, I see an order-0 page allocation
>>> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
>>> Right about the same time that the stress-test gets killed by the OOM-killer,
>>> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
>>> though most of the system's memory was freed by the oom-kill of the stress-test.
>>>
>>> The utility ends up looping from the rebalance label down through
>>> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
>>> skips the call to get_page_from_freelist. Because all of the reclaimable memory
>>> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
>>> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
>>> __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
>>> jumps back to rebalance without ever calling get_page_from_freelist. This loop
>>> repeats infinitely.
>>>
>>> Is there a reason that this loop is set up this way for order-0 allocations? I
>>> applied the patch below, and the problem corrects itself. Does anyone have any
>>> thoughts on the patch, or on a better way to address this situation?
>>>
>>> The test case is pretty pathological. Running a mix of I/O stress-tests that do
>>> a lot of fork() and consume all of the system memory, I can pretty reliably hit
>>> this on 600 nodes (32GB/node) in about 12 hours.
>>>
>>
>> It's amazing.
>> I think it's _very_ rare, but it's possible if the test program killed by
>> the OOM-killer held mostly anonymous pages and the allocating task tries
>> to allocate an order-0 page with GFP_NOFS.
>
> Unfortunately, "very rare" is a subjective thing. We have been hitting it a
> couple of times a week in our test lab.

Okay.

>
>> When the [in]active lists suddenly become empty (though I am not sure how
>> that situation comes about) and we are reclaiming an order-0 page, neither
>> compaction nor __alloc_pages_direct_reclaim helps. Compaction doesn't help
>> because this is order-0 reclaim. __alloc_pages_direct_reclaim would help
>> only if there were LRU pages on the [in]active lists, but there are none.
>> So the last resort is the following code in do_try_to_free_pages:
>>
>>         /* top priority shrink_zones still had more to do? don't OOM, then */
>>         if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
>>                 return 1;
>>
>> But it has a problem, too. all_unreclaimable checks zone->all_unreclaimable,
>> and kswapd sets zone->all_unreclaimable only once the condition below stops
>> holding:
>>
>> zone->pages_scanned < zone_reclaimable_pages(zone) * 6
>>
>> If the LRU lists are completely empty, shrink_zone does nothing, so
>> zone->pages_scanned stays zero. But as we know, zone_page_state isn't exact
>> because of the per-CPU pagesets, so zone_reclaimable_pages may report a small
>> positive value. As a result, zone_reclaimable always returns true, which means
>> kswapd never sets zone->all_unreclaimable, and the last resort becomes a no-op.
>>
>> In this case, the current allocation never gets a chance to call
>> get_page_from_freelist, as Andrew Barry said.
>>
>> Does that make sense? If so, how about this?
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ebc7faa..4f64355 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2105,6 +2105,7 @@ restart:
>>                 first_zones_zonelist(zonelist, high_zoneidx, NULL,
>>                                         &preferred_zone);
>>
>> +rebalance:
>>         /* This is the last chance, in general, before the goto nopage. */
>>         page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>>                         high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
>> @@ -2112,7 +2113,6 @@ restart:
>>         if (page)
>>                 goto got_pg;
>>
>> -rebalance:
>>         /* Allocate without watermarks if the context allows */
>>         if (alloc_flags & ALLOC_NO_WATERMARKS) {
>>                 page = __alloc_pages_high_priority(gfp_mask, order,
>
> I think your solution is simpler than my patch.
> Thanks very much.

You found the problem, and finding it was harder than fixing it, I think.
So you should get the credit.

Could you send the patch to akpm with Mel and me Cc'ed?
(It may also be a candidate for stable.)
You can add my Reviewed-by.

Thanks for the good bug report.

> -Andrew
>
>
>
>
>
>



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-18 22:29     ` Minchan Kim
@ 2011-05-20 16:49         ` Minchan Kim
  0 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2011-05-20 16:49 UTC (permalink / raw)
  To: Andrew Barry, Andrew Morton
  Cc: linux-mm, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Johannes Weiner, linux kernel mailing list

On Thu, May 19, 2011 at 07:29:01AM +0900, Minchan Kim wrote:
> Hi Andrew,
> 
> On Wed, May 18, 2011 at 12:49 AM, Andrew Barry <abarry@cray.com> wrote:
> > On 05/17/2011 05:34 AM, Minchan Kim wrote:
> >> On Sat, May 14, 2011 at 6:31 AM, Andrew Barry <abarry@cray.com> wrote:
> >>> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> >>> get stuck endlessly looping, even when lots of memory is available.
> >>>
> >>> Running an I/O and memory intensive stress-test, I see an order-0 page allocation
> >>> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
> >>> Right about the same time that the stress-test gets killed by the OOM-killer,
> >>> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> >>> though most of the system's memory was freed by the oom-kill of the stress-test.
> >>>
> >>> The utility ends up looping from the rebalance label down through
> >>> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
> >>> skips the call to get_page_from_freelist. Because all of the reclaimable memory
> >>> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
> >>> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
> >>> __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
> >>> jumps back to rebalance without ever calling get_page_from_freelist. This loop
> >>> repeats infinitely.
> >>>
> >>> Is there a reason that this loop is set up this way for order-0 allocations? I
> >>> applied the patch below, and the problem corrects itself. Does anyone have any
> >>> thoughts on the patch, or on a better way to address this situation?
> >>>
> >>> The test case is pretty pathological. Running a mix of I/O stress-tests that do
> >>> a lot of fork() and consume all of the system memory, I can pretty reliably hit
> >>> this on 600 nodes (32GB/node) in about 12 hours.
> >>>
> >>
> >> It's amazing.
> >> I think it's _very_ rare, but it's possible if the test program killed by
> >> the OOM-killer held mostly anonymous pages and the allocating task tries
> >> to allocate an order-0 page with GFP_NOFS.
> >
> > Unfortunately, "very rare" is a subjective thing. We have been hitting it a
> > couple of times a week in our test lab.
> 
> Okay.
> 
> >
> >> When the [in]active lists suddenly become empty (though I am not sure how
> >> that situation comes about) and we are reclaiming an order-0 page, neither
> >> compaction nor __alloc_pages_direct_reclaim helps. Compaction doesn't help
> >> because this is order-0 reclaim. __alloc_pages_direct_reclaim would help
> >> only if there were LRU pages on the [in]active lists, but there are none.
> >> So the last resort is the following code in do_try_to_free_pages:
> >>
> >>         /* top priority shrink_zones still had more to do? don't OOM, then */
> >>         if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> >>                 return 1;
> >>
> >> But it has a problem, too. all_unreclaimable checks zone->all_unreclaimable,
> >> and kswapd sets zone->all_unreclaimable only once the condition below stops
> >> holding:
> >>
> >> zone->pages_scanned < zone_reclaimable_pages(zone) * 6
> >>
> >> If the LRU lists are completely empty, shrink_zone does nothing, so
> >> zone->pages_scanned stays zero. But as we know, zone_page_state isn't exact
> >> because of the per-CPU pagesets, so zone_reclaimable_pages may report a small
> >> positive value. As a result, zone_reclaimable always returns true, which means
> >> kswapd never sets zone->all_unreclaimable, and the last resort becomes a no-op.
> >>
> >> In this case, the current allocation never gets a chance to call
> >> get_page_from_freelist, as Andrew Barry said.
> >>
> >> Does that make sense? If so, how about this?
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index ebc7faa..4f64355 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -2105,6 +2105,7 @@ restart:
> >>                 first_zones_zonelist(zonelist, high_zoneidx, NULL,
> >>                                         &preferred_zone);
> >>
> >> +rebalance:
> >>         /* This is the last chance, in general, before the goto nopage. */
> >>         page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> >>                         high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> >> @@ -2112,7 +2113,6 @@ restart:
> >>         if (page)
> >>                 goto got_pg;
> >>
> >> -rebalance:
> >>         /* Allocate without watermarks if the context allows */
> >>         if (alloc_flags & ALLOC_NO_WATERMARKS) {
> >>                 page = __alloc_pages_high_priority(gfp_mask, order,
> >
> > I think your solution is simpler than my patch.
> > Thanks very much.
> 
> You found the problem, and finding it was harder than fixing it, I think.
> So you should get the credit.
>
> Could you send the patch to akpm with Mel and me Cc'ed?
> (It may also be a candidate for stable.)
> You can add my Reviewed-by.
>
> Thanks for the good bug report.

From 8bd3f16736548375238161d1bd85f7d7c381031f Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan.kim@gmail.com>
Date: Sat, 21 May 2011 01:37:41 +0900
Subject: [PATCH] Prevent unending loop in __alloc_pages_slowpath

From: Andrew Barry <abarry@cray.com>

I believe I found a problem in __alloc_pages_slowpath, which allows a process to
get stuck endlessly looping, even when lots of memory is available.

Running an I/O and memory intensive stress-test, I see an order-0 page allocation
with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
Right about the same time that the stress-test gets killed by the OOM-killer,
the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
though most of the system's memory was freed by the oom-kill of the stress-test.

The utility ends up looping from the rebalance label down through
wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
skips the call to get_page_from_freelist. Because all of the reclaimable memory
on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
__alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
jumps back to rebalance without ever calling get_page_from_freelist. This loop
repeats infinitely.

The test case is pretty pathological. Running a mix of I/O stress-tests that do
a lot of fork() and consume all of the system memory, I can pretty reliably hit
this on 600 nodes (32GB/node) in about 12 hours.

Signed-off-by: Andrew Barry <abarry@cray.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f8bce2..e78b324 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2064,6 +2064,7 @@ restart:
 		first_zones_zonelist(zonelist, high_zoneidx, NULL,
 					&preferred_zone);
 
+rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -2071,7 +2072,6 @@ restart:
 	if (page)
 		goto got_pg;
 
-rebalance:
 	/* Allocate without watermarks if the context allows */
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 		page = __alloc_pages_high_priority(gfp_mask, order,
-- 
1.7.1


> 
> > -Andrew
> >
> >
> >
> >
> >
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-20 16:49         ` Minchan Kim
@ 2011-05-20 17:16           ` Rik van Riel
  0 siblings, 0 replies; 39+ messages in thread
From: Rik van Riel @ 2011-05-20 17:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Barry, Andrew Morton, linux-mm, Mel Gorman,
	KOSAKI Motohiro, Johannes Weiner, linux kernel mailing list

On 05/20/2011 12:49 PM, Minchan Kim wrote:

> From: Andrew Barry<abarry@cray.com>
>
> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> get stuck endlessly looping, even when lots of memory is available.

> Signed-off-by: Andrew Barry<abarry@cray.com>
> Reviewed-by: Minchan Kim<minchan.kim@gmail.com>
> Cc: Mel Gorman<mgorman@suse.de>

Reviewed-by: Rik van Riel<riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-20 16:49         ` Minchan Kim
@ 2011-05-20 17:23           ` Mel Gorman
  0 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2011-05-20 17:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Barry, Andrew Morton, linux-mm, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, linux kernel mailing list

On Sat, May 21, 2011 at 01:49:24AM +0900, Minchan Kim wrote:
> <SNIP>
> 
> From 8bd3f16736548375238161d1bd85f7d7c381031f Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Sat, 21 May 2011 01:37:41 +0900
> Subject: [PATCH] Prevent unending loop in __alloc_pages_slowpath
> 
> From: Andrew Barry <abarry@cray.com>
> 
> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> get stuck endlessly looping, even when lots of memory is available.
> 
> Running an I/O and memory intensive stress-test, I see an order-0 page allocation
> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
> Right about the same time that the stress-test gets killed by the OOM-killer,
> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> though most of the system's memory was freed by the oom-kill of the stress-test.
>
> The utility ends up looping from the rebalance label down through
> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
> skips the call to get_page_from_freelist. Because all of the reclaimable memory
> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
> __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
> jumps back to rebalance without ever calling get_page_from_freelist. This loop
> repeats infinitely.
>
> The test case is pretty pathological. Running a mix of I/O stress-tests that do
> a lot of fork() and consume all of the system memory, I can pretty reliably hit
> this on 600 nodes (32GB/node) in about 12 hours.
> 
> Signed-off-by: Andrew Barry <abarry@cray.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Cc: Mel Gorman <mgorman@suse.de>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-20 16:49         ` Minchan Kim
@ 2011-05-24  4:54           ` KOSAKI Motohiro
  0 siblings, 0 replies; 39+ messages in thread
From: KOSAKI Motohiro @ 2011-05-24  4:54 UTC (permalink / raw)
  To: minchan.kim; +Cc: abarry, akpm, linux-mm, mgorman, riel, hannes, linux-kernel

> From 8bd3f16736548375238161d1bd85f7d7c381031f Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Sat, 21 May 2011 01:37:41 +0900
> Subject: [PATCH] Prevent unending loop in __alloc_pages_slowpath
> 
> From: Andrew Barry <abarry@cray.com>
> 
> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> get stuck endlessly looping, even when lots of memory is available.
> 
> Running an I/O and memory intensive stress-test, I see an order-0 page allocation
> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
> Right about the same time that the stress-test gets killed by the OOM-killer,
> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> though most of the system's memory was freed by the oom-kill of the stress-test.
>
> The utility ends up looping from the rebalance label down through
> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
> skips the call to get_page_from_freelist. Because all of the reclaimable memory
> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
> __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
> jumps back to rebalance without ever calling get_page_from_freelist. This loop
> repeats infinitely.
>
> The test case is pretty pathological. Running a mix of I/O stress-tests that do
> a lot of fork() and consume all of the system memory, I can pretty reliably hit
> this on 600 nodes (32GB/node) in about 12 hours.
> 
> Signed-off-by: Andrew Barry <abarry@cray.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Cc: Mel Gorman <mgorman@suse.de>
> ---
>  mm/page_alloc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3f8bce2..e78b324 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2064,6 +2064,7 @@ restart:
>  		first_zones_zonelist(zonelist, high_zoneidx, NULL,
>  					&preferred_zone);
>  
> +rebalance:
>  	/* This is the last chance, in general, before the goto nopage. */
>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>  			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> @@ -2071,7 +2072,6 @@ restart:
>  	if (page)
>  		goto got_pg;
>  
> -rebalance:
>  	/* Allocate without watermarks if the context allows */
>  	if (alloc_flags & ALLOC_NO_WATERMARKS) {
>  		page = __alloc_pages_high_priority(gfp_mask, order,

I'm sorry I missed this thread for so long.

In this case, I think we should call drain_all_pages(), so the following
patch is better. However, I also think your patch is valuable: while the
task is sleeping in wait_iff_congested(), another task may free some pages,
so the rebalance path should retry the freelist. In other words, your
change makes sense.

So, I'd like to propose merging both your patch and mine.

Thanks.


From 2e77784668f6ca53d88ecb46aa6b99d9d0f33ffa Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Tue, 24 May 2011 13:41:57 +0900
Subject: [PATCH] vmscan: remove painful micro optimization

Currently, __alloc_pages_direct_reclaim() calls get_page_from_freelist()
only if try_to_free_pages() returns !0.

This is an unnecessary micro-optimization, because "return 0" means vmscan
reached priority 0 without getting any pages, i.e., it is already a really
slow path. But it also has a bad side effect: if we don't call
drain_all_pages(), we can end up in an infinite loop.

This patch removes this bad and meaningless micro-optimization.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1572079..c41d488 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1950,9 +1950,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,

 	cond_resched();

-	if (unlikely(!(*did_some_progress)))
-		return NULL;
-
 retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
-- 
1.7.3.1
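
For reference, with the early return gone, the tail of
__alloc_pages_direct_reclaim() always retries the freelist, draining the
per-CPU lists once if the first attempt fails. Roughly (a simplified sketch
of the resulting code, not the verbatim patched source):

	bool drained = false;

	/* ... try_to_free_pages() has run and set *did_some_progress ... */
	cond_resched();

retry:
	page = get_page_from_freelist(gfp_mask, nodemask, order,
					zonelist, high_zoneidx,
					alloc_flags, preferred_zone,
					migratetype);

	/*
	 * If the allocation failed, pages may still be pinned on the
	 * per-cpu lists; drain them once and try the freelist again.
	 */
	if (!page && !drained) {
		drain_all_pages();
		drained = true;
		goto retry;
	}

	return page;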





^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  4:54           ` KOSAKI Motohiro
@ 2011-05-24  5:45             ` KOSAKI Motohiro
  0 siblings, 0 replies; 39+ messages in thread
From: KOSAKI Motohiro @ 2011-05-24  5:45 UTC (permalink / raw)
  To: minchan.kim; +Cc: abarry, akpm, linux-mm, mgorman, riel, hannes, linux-kernel

(2011/05/24 13:54), KOSAKI Motohiro wrote:
>> From 8bd3f16736548375238161d1bd85f7d7c381031f Mon Sep 17 00:00:00 2001
>> From: Minchan Kim <minchan.kim@gmail.com>
>> Date: Sat, 21 May 2011 01:37:41 +0900
>> Subject: [PATCH] Prevent unending loop in __alloc_pages_slowpath
>>
>> From: Andrew Barry <abarry@cray.com>
>>
>> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
>> get stuck endlessly looping, even when lots of memory is available.
>>
>> Running an I/O and memory intensive stress-test, I see an order-0 page allocation
>> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
>> Right about the same time that the stress-test gets killed by the OOM-killer,
>> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
>> though most of the system's memory was freed by the oom-kill of the stress-test.
>>
>> The utility ends up looping from the rebalance label down through
>> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
>> skips the call to get_page_from_freelist. Because all of the reclaimable memory
>> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
>> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
>> __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then
>> jumps back to rebalance without ever calling get_page_from_freelist. This loop
>> repeats infinitely.
>>
>> The test case is pretty pathological. Running a mix of I/O stress-tests that do
>> a lot of fork() and consume all of the system memory, I can pretty reliably hit
>> this on 600 nodes (32GB/node) in about 12 hours.
>>
>> Signed-off-by: Andrew Barry <abarry@cray.com>
>> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> ---
>>  mm/page_alloc.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3f8bce2..e78b324 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2064,6 +2064,7 @@ restart:
>>  		first_zones_zonelist(zonelist, high_zoneidx, NULL,
>>  					&preferred_zone);
>>  
>> +rebalance:
>>  	/* This is the last chance, in general, before the goto nopage. */
>>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>>  			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
>> @@ -2071,7 +2072,6 @@ restart:
>>  	if (page)
>>  		goto got_pg;
>>  
>> -rebalance:
>>  	/* Allocate without watermarks if the context allows */
>>  	if (alloc_flags & ALLOC_NO_WATERMARKS) {
>>  		page = __alloc_pages_high_priority(gfp_mask, order,
> 
> I'm sorry I missed this thread for so long.
>
> In this case, I think we should call drain_all_pages(), so the following
> patch is better. However, I also think your patch is valuable: while the
> task is sleeping in wait_iff_congested(), another task may free some pages,
> so the rebalance path should retry the freelist. In other words, your
> change makes sense.
>
> So, I'd like to propose merging both your patch and mine.

I forgot to mention an important thing: your patch looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  4:54           ` KOSAKI Motohiro
@ 2011-05-24  8:30             ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2011-05-24  8:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: minchan.kim, abarry, akpm, linux-mm, riel, hannes, linux-kernel

On Tue, May 24, 2011 at 01:54:54PM +0900, KOSAKI Motohiro wrote:
> >>From 8bd3f16736548375238161d1bd85f7d7c381031f Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan.kim@gmail.com>
> > Date: Sat, 21 May 2011 01:37:41 +0900
> > Subject: [PATCH] Prevent unending loop in __alloc_pages_slowpath
> > 
> > From: Andrew Barry <abarry@cray.com>
> > 
> > I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> > get stuck endlessly looping, even when lots of memory is available.
> > 
> > Running an I/O and memory intensive stress-test I see a 0-order page allocation
> > with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
> > Right about the same time that the stress-test gets killed by the OOM-killer,
> > the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> > though most of the system's memory was freed by the oom-kill of the stress-test.
> > 
> > The utility ends up looping from the rebalance label down through the
> > wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
> > skips the call to get_page_from_freelist. Because all of the reclaimable memory
> > on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
> > call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
> > __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then
> > jumps back to rebalance without ever trying to get_page_from_freelist. This loop
> > repeats infinitely.
> > 
> > The test case is pretty pathological. Running a mix of I/O stress-tests that do
> > a lot of fork() and consume all of the system memory, I can pretty reliably hit
> > this on 600 nodes, in about 12 hours. 32GB/node.
> > 
> > Signed-off-by: Andrew Barry <abarry@cray.com>
> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/page_alloc.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3f8bce2..e78b324 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2064,6 +2064,7 @@ restart:
> >  		first_zones_zonelist(zonelist, high_zoneidx, NULL,
> >  					&preferred_zone);
> >  
> > +rebalance:
> >  	/* This is the last chance, in general, before the goto nopage. */
> >  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> >  			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> > @@ -2071,7 +2072,6 @@ restart:
> >  	if (page)
> >  		goto got_pg;
> >  
> > -rebalance:
> >  	/* Allocate without watermarks if the context allows */
> >  	if (alloc_flags & ALLOC_NO_WATERMARKS) {
> >  		page = __alloc_pages_high_priority(gfp_mask, order,
> 
> I'm sorry I missed this thread for so long.
> 
> In this case, I think we should call drain_all_pages().

Why?

If the direct reclaimer failed to reclaim any pages on its own, the call
to get_page_from_freelist() is going to be useless and there is
no guarantee that any other CPU managed to reclaim pages either. All
this ends up doing is sending an IPI which, if it's very lucky, will take
a page from another CPU's free list.

> then the following
> patch is better.
> However, I also think your patch is valuable, because while the task is
> sleeping in wait_iff_congested(), another task may free some pages.
> Thus, the rebalance path should try to get free pages. IOW, your patch
> makes sense.
> 
> So, I'd like to propose merging both your patch and mine.
> 
> Thanks.
> 
> 
> From 2e77784668f6ca53d88ecb46aa6b99d9d0f33ffa Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Tue, 24 May 2011 13:41:57 +0900
> Subject: [PATCH] vmscan: remove painful micro optimization
> 
> Currently, __alloc_pages_direct_reclaim() calls get_page_from_freelist()
> only if try_to_free_pages() returns !0.
> 
> This is an unnecessary micro optimization, because "return 0" means vmscan
> reached priority 0 and didn't get any pages; iow, it's already a really slow
> path. But it also has a bad side effect: if we don't call drain_all_pages(),
> we have a chance of hitting an infinite loop.
> 

With the "rebalance" patch, where is the infinite loop?

> This patch removes this bad and meaningless micro optimization.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/page_alloc.c |    3 ---
>  1 files changed, 0 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1572079..c41d488 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1950,9 +1950,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> 
>  	cond_resched();
> 
> -	if (unlikely(!(*did_some_progress)))
> -		return NULL;
> -
>  retry:
>  	page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx,
> -- 
> 1.7.3.1
> 
> 
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread
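
For context on the IPI cost being debated here: circa 2.6.39,
drain_all_pages() broadcast work to every online CPU. A simplified sketch
(not the exact kernel source):

	/* Run drain_local_pages() on every online CPU via IPI and wait;
	 * this broadcast is the cost at issue on large machines. */
	void drain_all_pages(void)
	{
		on_each_cpu(drain_local_pages, NULL, 1);
	}

	/* Return this CPU's per-cpu (PCP) cached pages to the buddy
	 * freelists, where get_page_from_freelist() can find them. */
	static void drain_local_pages(void *arg)
	{
		drain_pages(smp_processor_id());
	}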


* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  4:54           ` KOSAKI Motohiro
@ 2011-05-24  8:34             ` Minchan Kim
  -1 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2011-05-24  8:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: abarry, akpm, linux-mm, mgorman, riel, hannes, linux-kernel,
	Wu Fengguang

On Tue, May 24, 2011 at 1:54 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>>>From 8bd3f16736548375238161d1bd85f7d7c381031f Mon Sep 17 00:00:00 2001
>> From: Minchan Kim <minchan.kim@gmail.com>
>> Date: Sat, 21 May 2011 01:37:41 +0900
>> Subject: [PATCH] Prevent unending loop in __alloc_pages_slowpath
>>
>> From: Andrew Barry <abarry@cray.com>
>>
>> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
>> get stuck endlessly looping, even when lots of memory is available.
>>
>> Running an I/O and memory intensive stress-test I see a 0-order page allocation
>> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
>> Right about the same time that the stress-test gets killed by the OOM-killer,
>> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
>> though most of the system's memory was freed by the oom-kill of the stress-test.
>>
>> The utility ends up looping from the rebalance label down through the
>> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
>> skips the call to get_page_from_freelist. Because all of the reclaimable memory
>> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
>> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
>> __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then
>> jumps back to rebalance without ever trying to get_page_from_freelist. This loop
>> repeats infinitely.
>>
>> The test case is pretty pathological. Running a mix of I/O stress-tests that do
>> a lot of fork() and consume all of the system memory, I can pretty reliably hit
>> this on 600 nodes, in about 12 hours. 32GB/node.
>>
>> Signed-off-by: Andrew Barry <abarry@cray.com>
>> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> ---
>>  mm/page_alloc.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3f8bce2..e78b324 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2064,6 +2064,7 @@ restart:
>>               first_zones_zonelist(zonelist, high_zoneidx, NULL,
>>                                       &preferred_zone);
>>
>> +rebalance:
>>       /* This is the last chance, in general, before the goto nopage. */
>>       page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>>                       high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
>> @@ -2071,7 +2072,6 @@ restart:
>>       if (page)
>>               goto got_pg;
>>
>> -rebalance:
>>       /* Allocate without watermarks if the context allows */
>>       if (alloc_flags & ALLOC_NO_WATERMARKS) {
>>               page = __alloc_pages_high_priority(gfp_mask, order,
>
> I'm sorry I missed this thread for so long.

No problem. A late review is better than no review.

>
> In this case, I think we should call drain_all_pages(); then the following
> patch is better.

Strictly speaking, this problem isn't related to drain_all_pages.
This problem is caused by the LRU lists being empty, but I admit it could
work well if your patch were applied.
So yours could help, too.

> However, I also think your patch is valuable, because while the task is
> sleeping in wait_iff_congested(), another task may free some pages.
> Thus, the rebalance path should try to get free pages. IOW, your patch makes sense.

Yes.
Off-topic:
I would like to move cond_resched() below get_page_from_freelist() in
__alloc_pages_direct_reclaim(). Otherwise, it is likely that the pages we
just reclaimed can be stolen by other processes.
One more benefit is that on the apparent-OOM path (ie,
did_some_progress == 0), we can reduce OOM-kill latency by removing an
unnecessary cond_resched. A rough sketch follows below.
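
A fragment-level sketch of that suggestion against the 2.6.39-era
__alloc_pages_direct_reclaim() (a hypothetical rearrangement, not a merged
patch):

	*did_some_progress = try_to_free_pages(zonelist, order,
						gfp_mask, nodemask);
retry:
	page = get_page_from_freelist(gfp_mask, nodemask, order,
					zonelist, high_zoneidx,
					alloc_flags, preferred_zone,
					migratetype);

	/* Moved below the freelist attempt: only yield the CPU after we
	 * have tried to grab the pages we just reclaimed, rather than
	 * before, where another task can race in and steal them. */
	cond_resched();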

>
> So, I'd like to propose merging both your patch and mine.

Recently, there was a discussion about drain_all_pages() with Wu.
He saw considerable overhead on an 8-core system, AFAIR.
I have Cc'ed Wu.

How about checking the local per-cpu lists before calling drain_all_pages(),
rather than calling it unconditionally?
if (per_cpu_ptr(zone->pageset, smp_processor_id())->pcp.count)
    drain_all_pages();

Of course, it can miss free pages sitting on other CPUs' lists. But the
routine above targets the case where local direct reclaim succeeded yet the
allocation still failed because the pages are stuck in per-cpu lists. So I
think it works.

Thanks for the good suggestion and the Reviewed-by, KOSAKI.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  8:30             ` Mel Gorman
@ 2011-05-24  8:36               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 39+ messages in thread
From: KOSAKI Motohiro @ 2011-05-24  8:36 UTC (permalink / raw)
  To: mgorman; +Cc: minchan.kim, abarry, akpm, linux-mm, riel, hannes, linux-kernel

(2011/05/24 17:30), Mel Gorman wrote:
> On Tue, May 24, 2011 at 01:54:54PM +0900, KOSAKI Motohiro wrote:
>>> >From 8bd3f16736548375238161d1bd85f7d7c381031f Mon Sep 17 00:00:00 2001
>>> From: Minchan Kim <minchan.kim@gmail.com>
>>> Date: Sat, 21 May 2011 01:37:41 +0900
>>> Subject: [PATCH] Prevent unending loop in __alloc_pages_slowpath
>>>
>>> From: Andrew Barry <abarry@cray.com>
>>>
>>> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
>>> get stuck endlessly looping, even when lots of memory is available.
>>>
>>> Running an I/O and memory intensive stress-test I see a 0-order page allocation
>>> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
>>> Right about the same time that the stress-test gets killed by the OOM-killer,
>>> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> >>> though most of the system's memory was freed by the oom-kill of the stress-test.
>>>
>>> The utility ends up looping from the rebalance label down through the
> >>> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
>>> skips the call to get_page_from_freelist. Because all of the reclaimable memory
>>> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
>>> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
>>> __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then
>>> jumps back to rebalance without ever trying to get_page_from_freelist. This loop
>>> repeats infinitely.
>>>
>>> The test case is pretty pathological. Running a mix of I/O stress-tests that do
>>> a lot of fork() and consume all of the system memory, I can pretty reliably hit
>>> this on 600 nodes, in about 12 hours. 32GB/node.
>>>
>>> Signed-off-by: Andrew Barry <abarry@cray.com>
>>> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> ---
>>>  mm/page_alloc.c |    2 +-
>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3f8bce2..e78b324 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2064,6 +2064,7 @@ restart:
>>>  		first_zones_zonelist(zonelist, high_zoneidx, NULL,
>>>  					&preferred_zone);
>>>  
>>> +rebalance:
>>>  	/* This is the last chance, in general, before the goto nopage. */
>>>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>>>  			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
>>> @@ -2071,7 +2072,6 @@ restart:
>>>  	if (page)
>>>  		goto got_pg;
>>>  
>>> -rebalance:
>>>  	/* Allocate without watermarks if the context allows */
>>>  	if (alloc_flags & ALLOC_NO_WATERMARKS) {
>>>  		page = __alloc_pages_high_priority(gfp_mask, order,
>>
> >> I'm sorry I missed this thread for so long.
>>
>> In this case, I think we should call drain_all_pages().
> 
> Why?

Otherwise, we don't have a good trigger for dropping the PCP caches. A big
machine might have a big PCP cache.


> If the direct reclaimer failed to reclaim any pages on its own, the call
> to get_page_from_freelist() is going to be useless and there is
> no guarantee that any other CPU managed to reclaim pages either. All
> this ends up doing is sending an IPI which, if it's very lucky, will take
> a page from another CPU's free list.

It doesn't matter, because did_some_progress==0 means vmscan failed to reclaim
any pages even after reaching priority==0. Thus, it is obviously a slow path
(see the sketch below).
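
For reference, the priority walk referred to here looks roughly like this
(an abridged sketch of the 2.6.39-era do_try_to_free_pages() loop, not the
exact source):

	/* did_some_progress == 0 means this loop ran all the way from
	 * DEF_PRIORITY (12) down to priority 0 without reclaiming a
	 * single page, i.e. the allocator is already deep in the slow
	 * path. */
	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		sc->nr_scanned = 0;
		shrink_zones(priority, zonelist, sc);
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			return sc->nr_reclaimed;	/* made progress */
	}
	return 0;	/* nothing reclaimed even at priority 0 */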


> 
> >> then the following
> >> patch is better.
> >> However, I also think your patch is valuable, because while the task is
> >> sleeping in wait_iff_congested(), another task may free some pages.
> >> Thus, the rebalance path should try to get free pages. IOW, your patch
> >> makes sense.
> >>
> >> So, I'd like to propose merging both your patch and mine.
>>
>> Thanks.
>>
>>
>> From 2e77784668f6ca53d88ecb46aa6b99d9d0f33ffa Mon Sep 17 00:00:00 2001
>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> Date: Tue, 24 May 2011 13:41:57 +0900
>> Subject: [PATCH] vmscan: remove painful micro optimization
>>
> >> Currently, __alloc_pages_direct_reclaim() calls get_page_from_freelist()
> >> only if try_to_free_pages() returns !0.
> >> 
> >> This is an unnecessary micro optimization, because "return 0" means vmscan
> >> reached priority 0 and didn't get any pages; iow, it's already a really slow
> >> path. But it also has a bad side effect: if we don't call drain_all_pages(),
> >> we have a chance of hitting an infinite loop.
>>
> 
> With the "rebalance" patch, where is the infinite loop?

I explained that above.



^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  8:34             ` Minchan Kim
@ 2011-05-24  8:41               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 39+ messages in thread
From: KOSAKI Motohiro @ 2011-05-24  8:41 UTC (permalink / raw)
  To: minchan.kim
  Cc: abarry, akpm, linux-mm, mgorman, riel, hannes, linux-kernel,
	fengguang.wu

>> I'm sorry I missed this thread for so long.
> 
> No problem. A late review is better than no review.

thx.


>> In this case, I think we should call drain_all_pages(); then the following
>> patch is better.
> 
> Strictly speaking, this problem isn't related to drain_all_pages.
> This problem is caused by the LRU lists being empty, but I admit it could
> work well if your patch were applied.
> So yours could help, too.
> 
>> However, I also think your patch is valuable, because while the task is
>> sleeping in wait_iff_congested(), another task may free some pages.
>> Thus, the rebalance path should try to get free pages. IOW, your patch makes sense.
> 
> Yes.
> Off-topic:
> I would like to move cond_resched() below get_page_from_freelist() in
> __alloc_pages_direct_reclaim(). Otherwise, it is likely that the pages we
> just reclaimed can be stolen by other processes.
> One more benefit is that on the apparent-OOM path (ie,
> did_some_progress == 0), we can reduce OOM-kill latency by removing an
> unnecessary cond_resched.

I agree. Would you mind sending a patch?


>> So, I'd like to propose merging both your patch and mine.
> 
> Recently, there was a discussion about drain_all_pages() with Wu.
> He saw considerable overhead on an 8-core system, AFAIR.
> I have Cc'ed Wu.
> 
> How about checking the local per-cpu lists before calling drain_all_pages(),
> rather than calling it unconditionally?
> if (per_cpu_ptr(zone->pageset, smp_processor_id())->pcp.count)
>     drain_all_pages();
> 
> Of course, it can miss free pages sitting on other CPUs' lists. But the
> routine above targets the case where local direct reclaim succeeded yet the
> allocation still failed because the pages are stuck in per-cpu lists. So I
> think it works.

Can you please point me to the previous discussion URL or mail subject?
I mean, if it is costly and a performance regression risk, we don't have to
take my idea.

Thanks.


> 
> Thanks for the good suggestion and the Reviewed-by, KOSAKI.



^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  8:36               ` KOSAKI Motohiro
@ 2011-05-24  8:49                 ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2011-05-24  8:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: minchan.kim, abarry, akpm, linux-mm, riel, hannes, linux-kernel

On Tue, May 24, 2011 at 05:36:06PM +0900, KOSAKI Motohiro wrote:
> (2011/05/24 17:30), Mel Gorman wrote:
> > On Tue, May 24, 2011 at 01:54:54PM +0900, KOSAKI Motohiro wrote:
> >>> >From 8bd3f16736548375238161d1bd85f7d7c381031f Mon Sep 17 00:00:00 2001
> >>> From: Minchan Kim <minchan.kim@gmail.com>
> >>> Date: Sat, 21 May 2011 01:37:41 +0900
> >>> Subject: [PATCH] Prevent unending loop in __alloc_pages_slowpath
> >>>
> >>> From: Andrew Barry <abarry@cray.com>
> >>>
> >>> I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> >>> get stuck endlessly looping, even when lots of memory is available.
> >>>
> >>> Running an I/O and memory intensive stress-test I see a 0-order page allocation
> >>> with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
> >>> Right about the same time that the stress-test gets killed by the OOM-killer,
> >>> the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> >>> though most of the system's memory was freed by the oom-kill of the stress-test.
> >>>
> >>> The utility ends up looping from the rebalance label down through the
> >>> wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
> >>> skips the call to get_page_from_freelist. Because all of the reclaimable memory
> >>> on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
> >>> call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
> >>> __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then
> >>> jumps back to rebalance without ever trying to get_page_from_freelist. This loop
> >>> repeats infinitely.
> >>>
> >>> The test case is pretty pathological. Running a mix of I/O stress-tests that do
> >>> a lot of fork() and consume all of the system memory, I can pretty reliably hit
> >>> this on 600 nodes, in about 12 hours. 32GB/node.
> >>>
> >>> Signed-off-by: Andrew Barry <abarry@cray.com>
> >>> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> >>> Cc: Mel Gorman <mgorman@suse.de>
> >>> ---
> >>>  mm/page_alloc.c |    2 +-
> >>>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 3f8bce2..e78b324 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -2064,6 +2064,7 @@ restart:
> >>>  		first_zones_zonelist(zonelist, high_zoneidx, NULL,
> >>>  					&preferred_zone);
> >>>  
> >>> +rebalance:
> >>>  	/* This is the last chance, in general, before the goto nopage. */
> >>>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> >>>  			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> >>> @@ -2071,7 +2072,6 @@ restart:
> >>>  	if (page)
> >>>  		goto got_pg;
> >>>  
> >>> -rebalance:
> >>>  	/* Allocate without watermarks if the context allows */
> >>>  	if (alloc_flags & ALLOC_NO_WATERMARKS) {
> >>>  		page = __alloc_pages_high_priority(gfp_mask, order,
> >>
> >> I'm sorry I missed this thread for so long.
> >>
> >> In this case, I think we should call drain_all_pages().
> > 
> > Why?
> 
> Otherwise, we don't have a good trigger for dropping the PCP caches. A big
> machine might have a big PCP cache.
> 

Big machines also have a large cost for sending IPIs.

> 
> > If the direct reclaimer failed to reclaim any pages on its own, the call
> > to get_page_from_freelist() is going to be useless and there is
> > no guarantee that any other CPU managed to reclaim pages either. All
> > this ends up doing is sending an IPI which, if it's very lucky, will take
> > a page from another CPU's free list.
> 
> It doesn't matter, because did_some_progress==0 means vmscan failed to
> reclaim any pages even after reaching priority==0. Thus, it is obviously a
> slow path.
> 

Maybe, but that still is no reason to send an IPI that probably isn't
going to help but will incur a high cost on large machines (we've had bugs
related to excessive IPI usage before). As it is, a failure to reclaim
will fall through and, assuming it has the right flags, it will wait
for congestion to clear before retrying direct reclaim. When it starts
to make progress, the pages will get drained at a time when it'll help.

> >> then the following
> >> patch is better.
> >> However, I also think your patch is valuable, because while the task is
> >> sleeping in wait_iff_congested(), another task may free some pages.
> >> Thus, the rebalance path should try to get free pages. IOW, your patch
> >> makes sense.
> >>
> >> So, I'd like to propose merging both your patch and mine.
> >>
> >> Thanks.
> >>
> >>
> >> From 2e77784668f6ca53d88ecb46aa6b99d9d0f33ffa Mon Sep 17 00:00:00 2001
> >> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> Date: Tue, 24 May 2011 13:41:57 +0900
> >> Subject: [PATCH] vmscan: remove painful micro optimization
> >>
> >> Currently, __alloc_pages_direct_reclaim() calls get_page_from_freelist()
> >> only if try_to_free_pages() returns !0.
> >>
> >> This is an unnecessary micro optimization, because "return 0" means vmscan
> >> reached priority 0 and didn't get any pages; iow, it's already a really slow
> >> path. But it also has a bad side effect: if we don't call drain_all_pages(),
> >> we have a chance of hitting an infinite loop.
> >>
> > 
> > With the "rebalance" patch, where is the infinite loop?
> 
> I explained that above.
> 

Where? Failing to call drain_all_pages() if reclaim fails is not an
infinite loop. It'll wait on congestion and retry until some progress
is made on reclaim, and even then it'll only drain the pages if the
subsequent allocation failed. That is not an infinite loop unless the
machine is wedged so badly it cannot make any progress on reclaim, in
which case the machine is in serious trouble and an IPI isn't going
to fix things.

Hence, I'm failing to see why avoiding expensive IPI calls is a painful
micro-optimisation.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread
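
The retry/drain logic described here lives in __alloc_pages_direct_reclaim().
Abridged (not the exact 2.6.39 source), its tail reads roughly:

	if (unlikely(!(*did_some_progress)))
		return NULL;		/* the check KOSAKI's patch removes */

retry:
	page = get_page_from_freelist(gfp_mask, nodemask, order,
					zonelist, high_zoneidx,
					alloc_flags, preferred_zone,
					migratetype);

	/* If the allocation failed even though reclaim made progress, the
	 * freed pages may be pinned on per-cpu lists: drain them (one IPI
	 * broadcast) and try once more. This is the only place the drain
	 * fires, which is the point being made above. */
	if (!page && !drained) {
		drain_all_pages();
		drained = true;
		goto retry;
	}
	return page;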


* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  8:41               ` KOSAKI Motohiro
@ 2011-05-24  8:57                 ` Minchan Kim
  -1 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2011-05-24  8:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: abarry, akpm, linux-mm, mgorman, riel, hannes, linux-kernel,
	fengguang.wu

On Tue, May 24, 2011 at 5:41 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>>> I'm sorry I missed this thread for so long.
>>
>> No problem. A late review is better than no review.
>
> thx.
>
>
>>> In this case, I think we should call drain_all_pages(); then the following
>>> patch is better.
>>
>> Strictly speaking, this problem isn't related to drain_all_pages.
>> This problem is caused by the LRU lists being empty, but I admit it could
>> work well if your patch were applied.
>> So yours could help, too.
>>
>>> However, I also think your patch is valuable, because while the task is
>>> sleeping in wait_iff_congested(), another task may free some pages.
>>> Thus, the rebalance path should try to get free pages. IOW, your patch makes sense.
>>
>> Yes.
>> Off-topic:
>> I would like to move cond_resched() below get_page_from_freelist() in
>> __alloc_pages_direct_reclaim(). Otherwise, it is likely that the pages we
>> just reclaimed can be stolen by other processes.
>> One more benefit is that on the apparent-OOM path (ie,
>> did_some_progress == 0), we can reduce OOM-kill latency by removing an
>> unnecessary cond_resched.
>
> I agree. Would you mind sending a patch?

I did, but at that time Andrew had a concern.
I will resend it when I have time. Let's discuss it again.

>
>
>>> So, I'd like to propose merging both your patch and mine.
>>
>> Recently, there was a discussion about drain_all_pages() with Wu.
>> He saw considerable overhead on an 8-core system, AFAIR.
>> I have Cc'ed Wu.
>>
>> How about checking the local per-cpu lists before calling drain_all_pages(),
>> rather than calling it unconditionally?
>> if (per_cpu_ptr(zone->pageset, smp_processor_id())->pcp.count)
>>     drain_all_pages();
>>
>> Of course, it can miss free pages sitting on other CPUs' lists. But the
>> routine above targets the case where local direct reclaim succeeded yet the
>> allocation still failed because the pages are stuck in per-cpu lists. So I
>> think it works.
>
> Can you please point me to the previous discussion URL or mail subject?
> I mean, if it is costly and a performance regression risk, we don't have to
> take my idea.

Yes. You could see it by https://lkml.org/lkml/2011/4/30/81.

>
> Thanks.
>
>
>>
>> Thanks for the good suggestion and the Reviewed-by, KOSAKI.
>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  8:49                 ` Mel Gorman
@ 2011-05-24  9:05                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 39+ messages in thread
From: KOSAKI Motohiro @ 2011-05-24  9:05 UTC (permalink / raw)
  To: mgorman; +Cc: minchan.kim, abarry, akpm, linux-mm, riel, hannes, linux-kernel

>>> Why?
>>
>> Otherwise, we don't have a good trigger for dropping the PCP lists. A big
>> machine might have a big pcp cache.
>>
> 
> Big machines also have a large cost for sending IPIs.

Yes. But that only matters if the IPIs happen frequently. And
drain_all_pages() is NOT the only IPI source; some vmscan functions (e.g.
try_to_unmap) generate a lot of IPIs.
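
For scale: the drain itself is one IPI per online CPU. As far as I
remember, the current implementation is essentially this (a sketch of
the 2.6.39-era mm/page_alloc.c, details elided):

	/* Runs on each CPU via IPI: spill that CPU's pcp lists back
	 * into the buddy allocator. */
	void drain_local_pages(void *arg)
	{
		drain_pages(smp_processor_id());
	}

	void drain_all_pages(void)
	{
		/* One IPI per online CPU - the cost in question. */
		on_each_cpu(drain_local_pages, NULL, 1);
	}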

So, _relatively_ speaking, it's not costly. I have a question: which
operation are you comparing drain_all_pages() against? IOW, in which
scenario do you suspect it is "costly"?





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  9:05                   ` KOSAKI Motohiro
@ 2011-05-24  9:16                     ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2011-05-24  9:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: minchan.kim, abarry, akpm, linux-mm, riel, hannes, linux-kernel

On Tue, May 24, 2011 at 06:05:59PM +0900, KOSAKI Motohiro wrote:
> >>> Why?
> >>
> >> Otherwise, we don't have a good trigger for dropping the PCP lists. A big
> >> machine might have a big pcp cache.
> >>
> > 
> > Big machines also have a large cost for sending IPIs.
> 
> Yes. But that only matters if the IPIs happen frequently. And
> drain_all_pages() is NOT the only IPI source; some vmscan functions (e.g.
> try_to_unmap) generate a lot of IPIs.
> 
> So, _relatively_ speaking, it's not costly. I have a question: which
> operation are you comparing drain_all_pages() against? IOW, in which
> scenario do you suspect it is "costly"?
> 

I am concerned that if the machine gets into trouble and we are failing
to reclaim, sending more IPIs is not going to help. There is no
evidence at the moment that sending extra IPIs here will help anything.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  8:57                 ` Minchan Kim
@ 2011-05-24  9:36                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 39+ messages in thread
From: KOSAKI Motohiro @ 2011-05-24  9:36 UTC (permalink / raw)
  To: minchan.kim
  Cc: abarry, akpm, linux-mm, mgorman, riel, hannes, linux-kernel,
	fengguang.wu

>> Can you please tell me the URL or mail subject of the previous
>> discussion? I mean, if it is costly and carries a performance-regression
>> risk, we don't have to take my idea.
> 
> Yes. You can see it at https://lkml.org/lkml/2011/4/30/81.

I think Wu pointed out the "lightweight vmscan could reclaim pages, but
they were stolen by another task" case. That is very different from the
"even the most heavyweight vmscan still failed to reclaim any pages" case.
The point is that the IPI cost depends on the frequency: stealing occurs
frequently with the current logic, but does the vmscan priority==0 case?
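
To illustrate the steal window (simplified from the 2.6.39-era
__alloc_pages_direct_reclaim(); a sketch, not the exact code):

	/* Reclaim refills the freelists... */
	*did_some_progress = try_to_free_pages(zonelist, order,
					       gfp_mask, nodemask);
	/* ...but cond_resched() may schedule away, and another
	 * allocator can empty those freelists before we retry. */
	cond_resched();
	if (unlikely(!(*did_some_progress)))
		return NULL;
retry:
	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
				      high_zoneidx, alloc_flags,
				      preferred_zone, migratetype);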




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  9:16                     ` Mel Gorman
@ 2011-05-24  9:40                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 39+ messages in thread
From: KOSAKI Motohiro @ 2011-05-24  9:40 UTC (permalink / raw)
  To: mgorman; +Cc: minchan.kim, abarry, akpm, linux-mm, riel, hannes, linux-kernel

(2011/05/24 18:16), Mel Gorman wrote:
> On Tue, May 24, 2011 at 06:05:59PM +0900, KOSAKI Motohiro wrote:
>>>>> Why?
>>>>
>>>> Otherwise, we don't have a good trigger for dropping the PCP lists. A big
>>>> machine might have a big pcp cache.
>>>>
>>>
>>> Big machines also have a large cost for sending IPIs.
>>
>> Yes. But that only matters if the IPIs happen frequently. And
>> drain_all_pages() is NOT the only IPI source; some vmscan functions (e.g.
>> try_to_unmap) generate a lot of IPIs.
>>
>> So, _relatively_ speaking, it's not costly. I have a question: which
>> operation are you comparing drain_all_pages() against? IOW, in which
>> scenario do you suspect it is "costly"?
>>
> 
> I am concerned that if the machine gets into trouble and we are failing
> to reclaim, sending more IPIs is not going to help. There is no
> evidence at the moment that sending extra IPIs here will help anything.

In the old days, we always called drain_all_pages() when
did_some_progress != 0. But the current kernel only calls it when
get_page_from_freelist() fails. So wait_iff_congested() may help, but
there is no guarantee that it will.
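
That is, the current trigger is roughly this (a sketch of the 2.6.39-era
__alloc_pages_direct_reclaim() retry, from memory):

	if (!page && !drained) {
		/* Drain only after reclaim made progress yet the
		 * allocation attempt still failed, and only once per
		 * direct-reclaim invocation. */
		drain_all_pages();
		drained = true;
		goto retry;
	}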

If you still strongly worry about the IPI cost, I would consider moving
drain_all_pages() to a less frequent point; but ignoring the pcp lists
makes less sense, IMHO.




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24  9:40                       ` KOSAKI Motohiro
@ 2011-05-24 10:57                         ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2011-05-24 10:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: minchan.kim, abarry, akpm, linux-mm, riel, hannes, linux-kernel

On Tue, May 24, 2011 at 06:40:31PM +0900, KOSAKI Motohiro wrote:
> (2011/05/24 18:16), Mel Gorman wrote:
> > On Tue, May 24, 2011 at 06:05:59PM +0900, KOSAKI Motohiro wrote:
> >>>>> Why?
> >>>>
> >>>> Otherwise, we don't have a good trigger for dropping the PCP lists. A big
> >>>> machine might have a big pcp cache.
> >>>>
> >>>
> >>> Big machines also have a large cost for sending IPIs.
> >>
> >> Yes. But that only matters if the IPIs happen frequently. And
> >> drain_all_pages() is NOT the only IPI source; some vmscan functions (e.g.
> >> try_to_unmap) generate a lot of IPIs.
> >>
> >> So, _relatively_ speaking, it's not costly. I have a question: which
> >> operation are you comparing drain_all_pages() against? IOW, in which
> >> scenario do you suspect it is "costly"?
> >>
> > 
> > I am concerned that if the machine gets into trouble and we are failing
> > to reclaim, sending more IPIs is not going to help. There is no
> > evidence at the moment that sending extra IPIs here will help anything.
> 
> In the old days, we always called drain_all_pages() when
> did_some_progress != 0. But the current kernel only calls it when
> get_page_from_freelist() fails. So wait_iff_congested() may help, but
> there is no guarantee that it will.
> 
> If you still strongly worry about the IPI cost, I would consider moving
> drain_all_pages() to a less frequent point; but ignoring the pcp lists
> makes less sense, IMHO.
> 

Yes, I'm worried about it because excessive time spent in
drain_all_pages() has come up in the past:
http://lkml.org/lkml/2010/8/23/81 . The PCP lists are not being
ignored at the moment: they are drained when direct reclaim makes
forward progress but still fails to allocate a page.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
  2011-05-24 10:57                         ` Mel Gorman
@ 2011-05-24 23:53                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 39+ messages in thread
From: KOSAKI Motohiro @ 2011-05-24 23:53 UTC (permalink / raw)
  To: mgorman; +Cc: minchan.kim, abarry, akpm, linux-mm, riel, hannes, linux-kernel

>> In the old days, we always called drain_all_pages() when
>> did_some_progress != 0. But the current kernel only calls it when
>> get_page_from_freelist() fails. So wait_iff_congested() may help, but
>> there is no guarantee that it will.
>>
>> If you still strongly worry about the IPI cost, I would consider moving
>> drain_all_pages() to a less frequent point; but ignoring the pcp lists
>> makes less sense, IMHO.
>>
> 
> Yes, I'm worried about it because excessive time spent in
> drain_all_pages() has come up in the past:
> http://lkml.org/lkml/2010/8/23/81 . The PCP lists are not being
> ignored at the moment: they are drained when direct reclaim makes
> forward progress but still fails to allocate a page.

Well, that is not the priority==0 case. That's my point.



^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2011-05-24 23:53 UTC | newest]

Thread overview: 39+ messages
-- links below jump to the message on this page --
2011-05-13 21:31 Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch Andrew Barry
2011-05-17 10:34 ` Minchan Kim
2011-05-17 11:34   ` Mel Gorman
2011-05-17 15:49   ` Andrew Barry
2011-05-18 22:29     ` Minchan Kim
2011-05-20 16:49       ` Minchan Kim
2011-05-20 17:16         ` Rik van Riel
2011-05-20 17:23         ` Mel Gorman
2011-05-24  4:54         ` KOSAKI Motohiro
2011-05-24  5:45           ` KOSAKI Motohiro
2011-05-24  8:30           ` Mel Gorman
2011-05-24  8:36             ` KOSAKI Motohiro
2011-05-24  8:49               ` Mel Gorman
2011-05-24  9:05                 ` KOSAKI Motohiro
2011-05-24  9:16                   ` Mel Gorman
2011-05-24  9:40                     ` KOSAKI Motohiro
2011-05-24 10:57                       ` Mel Gorman
2011-05-24 23:53                         ` KOSAKI Motohiro
2011-05-24  8:34           ` Minchan Kim
2011-05-24  8:41             ` KOSAKI Motohiro
2011-05-24  8:57               ` Minchan Kim
2011-05-24  9:36                 ` KOSAKI Motohiro
