* Possible deadloop in direct reclaim?
@ 2013-07-23  4:58 Lisa Du
  2013-07-23 20:28 ` Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Lisa Du @ 2013-07-23  4:58 UTC (permalink / raw)
  To: linux-mm


Dear Sir:
Currently I am hitting a possible endless loop in direct reclaim. After running many applications, the system reaches a state where memory is heavily fragmented; only order-0 and order-1 blocks are left free.
Then one process requested an order-2 buffer and entered an endless direct reclaim loop. From my trace log I can see this loop has already iterated over 200,000 times. Kswapd was woken first and then went back to sleep, as it cannot rebalance memory at that order. But zone->all_unreclaimable remains 1.
Although direct_reclaim returns no pages every time, because zone->all_unreclaimable = 1 it loops again and again, even as zone->pages_scanned becomes very large. It blocks the process for a long time, until some watchdog thread detects this and kills the process. It is in __alloc_pages_slowpath, but that is far too slow, right? It may cost over 50 seconds or even more.
I think this is not as expected, right? Can we add the check below in the function all_unreclaimable() to terminate this loop?

@@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
                        continue;
                if (!zone->all_unreclaimable)
                        return false;
+               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
+                       return true;
        }
BTW: I'm using kernel 3.4; I also searched kernel 3.9 and didn't see a possible fix for this issue. Has anyone else met such an issue before? Any comments are welcome; looking forward to your reply!

Thanks!

Best Regards
Lisa Du



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-23  4:58 Possible deadloop in direct reclaim? Lisa Du
@ 2013-07-23 20:28 ` Christoph Lameter
  2013-07-24  1:21   ` Lisa Du
  2013-07-24  1:18 ` Bob Liu
  2013-08-01  5:43 ` Minchan Kim
  2 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2013-07-23 20:28 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, Mel Gorman

On Mon, 22 Jul 2013, Lisa Du wrote:

> Currently I met a possible deadloop in direct reclaim. After run plenty of the application, system run into a status that system memory is very fragmentized. Like only order-0 and order-1 memory left.

Can you verify that by doing a

 cat /proc/buddyinfo

?

> Then one process required a order-2 buffer but it enter an endless
> direct reclaim. From my trace log, I can see this loop already over
> 200,000 times. Kswapd was first wake up and then go back to sleep as it
> cannot rebalance this order's memory. But zone->all_unreclaimable
> remains 1. Though direct_reclaim every time returns no pages, but as
> zone->all_unreclaimable = 1, so it loop again and again. Even when
> zone->pages_scanned also becomes very large. It will block the process
> for long time, until some watchdog thread detect this and kill this
> process. Though it's in __alloc_pages_slowpath, but it's too slow right?
> Maybe cost over 50 seconds or even more.

> I think it's not as expected right?  Can we also add below check in the
> function all_unreclaimable() to terminate this loop?
>
> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                         continue;
>                 if (!zone->all_unreclaimable)
>                         return false;
> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
> +                       return true;
>         }

Mel?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-23  4:58 Possible deadloop in direct reclaim? Lisa Du
  2013-07-23 20:28 ` Christoph Lameter
@ 2013-07-24  1:18 ` Bob Liu
  2013-07-24  1:31   ` Lisa Du
                     ` (2 more replies)
  2013-08-01  5:43 ` Minchan Kim
  2 siblings, 3 replies; 36+ messages in thread
From: Bob Liu @ 2013-07-24  1:18 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, Christoph Lameter, Mel Gorman

On Tue, Jul 23, 2013 at 12:58 PM, Lisa Du <cldu@marvell.com> wrote:
> Dear Sir:
>
> Currently I met a possible deadloop in direct reclaim. After run plenty of
> the application, system run into a status that system memory is very
> fragmentized. Like only order-0 and order-1 memory left.
>
> Then one process required a order-2 buffer but it enter an endless direct
> reclaim. From my trace log, I can see this loop already over 200,000 times.
> Kswapd was first wake up and then go back to sleep as it cannot rebalance
> this order’s memory. But zone->all_unreclaimable remains 1.
>
> Though direct_reclaim every time returns no pages, but as
> zone->all_unreclaimable = 1, so it loop again and again. Even when
> zone->pages_scanned also becomes very large. It will block the process for
> long time, until some watchdog thread detect this and kill this process.
> Though it’s in __alloc_pages_slowpath, but it’s too slow right? Maybe cost
> over 50 seconds or even more.

You must mean zone->all_unreclaimable = 0?

>
> I think it’s not as expected right?  Can we also add below check in the
> function all_unreclaimable() to terminate this loop?
>
>
>
> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
> *zonelist,
>
>                         continue;
>
>                 if (!zone->all_unreclaimable)
>
>                         return false;
>
> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>
> +                       return true;
>

How about replacing the check in kswapd_shrink_zone() instead?

@@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
        /* Account for the number of pages attempted to reclaim */
        *nr_attempted += sc->nr_to_reclaim;

-       if (nr_slab == 0 && !zone_reclaimable(zone))
+       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
                zone->all_unreclaimable = 1;

        zone_clear_flag(zone, ZONE_WRITEBACK);


I think the current check is wrong: having reclaimed a slab object
doesn't mean having reclaimed a page.

-- 
Regards,
--Bob


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-07-23 20:28 ` Christoph Lameter
@ 2013-07-24  1:21   ` Lisa Du
  2013-07-25 18:19     ` KOSAKI Motohiro
  0 siblings, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-07-24  1:21 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Mel Gorman, Bob Liu

Dear Christoph
   Thanks a lot for your comment. When this issue happened, I triggered a kernel panic and captured a kdump.
From the kdump I read the global variable pg_data_t contig_page_data. From this structure I can see that in the normal zone only order-0 has nr_free = 18442 and order-1 has nr_free = 367; every other order's nr_free is 0.

Thanks!

Best Regards
Lisa Du


-----Original Message-----
From: Christoph Lameter [mailto:cl@linux.com] 
Sent: July 24, 2013 4:29
To: Lisa Du
Cc: linux-mm@kvack.org; Mel Gorman
Subject: Re: Possible deadloop in direct reclaim?

On Mon, 22 Jul 2013, Lisa Du wrote:

> Currently I met a possible deadloop in direct reclaim. After run plenty of the application, system run into a status that system memory is very fragmentized. Like only order-0 and order-1 memory left.

Can you verify that by doing a

 cat /proc/buddyinfo

?

> Then one process required a order-2 buffer but it enter an endless
> direct reclaim. From my trace log, I can see this loop already over
> 200,000 times. Kswapd was first wake up and then go back to sleep as it
> cannot rebalance this order's memory. But zone->all_unreclaimable
> remains 1. Though direct_reclaim every time returns no pages, but as
> zone->all_unreclaimable = 1, so it loop again and again. Even when
> zone->pages_scanned also becomes very large. It will block the process
> for long time, until some watchdog thread detect this and kill this
> process. Though it's in __alloc_pages_slowpath, but it's too slow right?
> Maybe cost over 50 seconds or even more.

> I think it's not as expected right?  Can we also add below check in the
> function all_unreclaimable() to terminate this loop?
>
> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                         continue;
>                 if (!zone->all_unreclaimable)
>                         return false;
> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
> +                       return true;
>         }

Mel?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-07-24  1:18 ` Bob Liu
@ 2013-07-24  1:31   ` Lisa Du
  2013-07-24  2:23   ` Lisa Du
  2013-07-25 18:14   ` KOSAKI Motohiro
  2 siblings, 0 replies; 36+ messages in thread
From: Lisa Du @ 2013-07-24  1:31 UTC (permalink / raw)
  To: Bob Liu; +Cc: linux-mm, Christoph Lameter, Mel Gorman

Dear Bob
    Thank you so much for the careful review. Yes, it's a typo; I meant zone->all_unreclaimable = 0.
    You mentioned adding the check in kswapd_shrink_zone(), but sorry, I couldn't find that function in kernel 3.4 or kernel 3.9.
    Is this function called in direct reclaim?
    As I mentioned, this issue happened after the kswapd thread went to sleep; if the function is only called from kswapd, then I think it can't help.

Thanks!

Best Regards
Lisa Du


-----Original Message-----
From: Bob Liu [mailto:lliubbo@gmail.com] 
Sent: July 24, 2013 9:18
To: Lisa Du
Cc: linux-mm@kvack.org; Christoph Lameter; Mel Gorman
Subject: Re: Possible deadloop in direct reclaim?

On Tue, Jul 23, 2013 at 12:58 PM, Lisa Du <cldu@marvell.com> wrote:
> Dear Sir:
>
> Currently I met a possible deadloop in direct reclaim. After run plenty of
> the application, system run into a status that system memory is very
> fragmentized. Like only order-0 and order-1 memory left.
>
> Then one process required a order-2 buffer but it enter an endless direct
> reclaim. From my trace log, I can see this loop already over 200,000 times.
> Kswapd was first wake up and then go back to sleep as it cannot rebalance
> this order’s memory. But zone->all_unreclaimable remains 1.
>
> Though direct_reclaim every time returns no pages, but as
> zone->all_unreclaimable = 1, so it loop again and again. Even when
> zone->pages_scanned also becomes very large. It will block the process for
> long time, until some watchdog thread detect this and kill this process.
> Though it’s in __alloc_pages_slowpath, but it’s too slow right? Maybe cost
> over 50 seconds or even more.

You must be mean zone->all_unreclaimable = 0?

>
> I think it’s not as expected right?  Can we also add below check in the
> function all_unreclaimable() to terminate this loop?
>
>
>
> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
> *zonelist,
>
>                         continue;
>
>                 if (!zone->all_unreclaimable)
>
>                         return false;
>
> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>
> +                       return true;
>

How about replace the checking in kswapd_shrink_zone()?

@@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
        /* Account for the number of pages attempted to reclaim */
        *nr_attempted += sc->nr_to_reclaim;

-       if (nr_slab == 0 && !zone_reclaimable(zone))
+       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
                zone->all_unreclaimable = 1;

        zone_clear_flag(zone, ZONE_WRITEBACK);


I think the current check is wrong, reclaimed a slab doesn't mean
reclaimed a page.

-- 
Regards,
--Bob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-07-24  1:18 ` Bob Liu
  2013-07-24  1:31   ` Lisa Du
@ 2013-07-24  2:23   ` Lisa Du
  2013-07-24  3:38     ` Bob Liu
  2013-07-25 18:14   ` KOSAKI Motohiro
  2 siblings, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-07-24  2:23 UTC (permalink / raw)
  To: Lisa Du, Bob Liu; +Cc: linux-mm, Christoph Lameter, Mel Gorman

Dear Bob
   Also, from my check before kswapd slept: although nr_slab = 0, zone_reclaimable(zone) returned true, so zone->all_unreclaimable could not be changed to 1. So even changing nr_slab to sc->nr_reclaimed can't help.

Thanks!

Best Regards
Lisa Du


-----Original Message-----
From: Lisa Du 
Sent: July 24, 2013 9:31
To: 'Bob Liu'
Cc: linux-mm@kvack.org; Christoph Lameter; Mel Gorman
Subject: RE: Possible deadloop in direct reclaim?

Dear Bob
    Thank you so much for the careful review, Yes, it's a typo, I mean zone->all_unreclaimable = 0.
    You mentioned add the check in kswapd_shrink_zone(), sorry that I didn't find this function in kernel3.4 or kernel3.9.
    Is this function called in direct_reclaim? 
    As I mentioned this issue happened after kswapd thread sleep, if it only called in kswapd, then I think it can't help.

Thanks!

Best Regards
Lisa Du


-----Original Message-----
From: Bob Liu [mailto:lliubbo@gmail.com] 
Sent: July 24, 2013 9:18
To: Lisa Du
Cc: linux-mm@kvack.org; Christoph Lameter; Mel Gorman
Subject: Re: Possible deadloop in direct reclaim?

On Tue, Jul 23, 2013 at 12:58 PM, Lisa Du <cldu@marvell.com> wrote:
> Dear Sir:
>
> Currently I met a possible deadloop in direct reclaim. After run plenty of
> the application, system run into a status that system memory is very
> fragmentized. Like only order-0 and order-1 memory left.
>
> Then one process required a order-2 buffer but it enter an endless direct
> reclaim. From my trace log, I can see this loop already over 200,000 times.
> Kswapd was first wake up and then go back to sleep as it cannot rebalance
> this order’s memory. But zone->all_unreclaimable remains 1.
>
> Though direct_reclaim every time returns no pages, but as
> zone->all_unreclaimable = 1, so it loop again and again. Even when
> zone->pages_scanned also becomes very large. It will block the process for
> long time, until some watchdog thread detect this and kill this process.
> Though it’s in __alloc_pages_slowpath, but it’s too slow right? Maybe cost
> over 50 seconds or even more.

You must be mean zone->all_unreclaimable = 0?

>
> I think it’s not as expected right?  Can we also add below check in the
> function all_unreclaimable() to terminate this loop?
>
>
>
> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
> *zonelist,
>
>                         continue;
>
>                 if (!zone->all_unreclaimable)
>
>                         return false;
>
> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>
> +                       return true;
>

How about replace the checking in kswapd_shrink_zone()?

@@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
        /* Account for the number of pages attempted to reclaim */
        *nr_attempted += sc->nr_to_reclaim;

-       if (nr_slab == 0 && !zone_reclaimable(zone))
+       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
                zone->all_unreclaimable = 1;

        zone_clear_flag(zone, ZONE_WRITEBACK);


I think the current check is wrong, reclaimed a slab doesn't mean
reclaimed a page.

-- 
Regards,
--Bob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-24  2:23   ` Lisa Du
@ 2013-07-24  3:38     ` Bob Liu
  2013-07-24  5:58       ` Lisa Du
  0 siblings, 1 reply; 36+ messages in thread
From: Bob Liu @ 2013-07-24  3:38 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, Christoph Lameter, Mel Gorman

On Wed, Jul 24, 2013 at 10:23 AM, Lisa Du <cldu@marvell.com> wrote:
> Dear Bob
>    Also from my check before kswapd sleep, though nr_slab = 0 but zone_reclaimable(zone) returns true, so zone->all_unreclaimable can't be changed to 1; So even when change the nr_slab to sc->nr_reclaimed, it can't help.
>

Then the other fix might be to also set zone->all_unreclaimable in the
direct reclaim path, like:

@@ -2278,6 +2278,8 @@ static bool shrink_zones(struct zonelist
*zonelist, struct scan_control *sc)
                }

                shrink_zone(zone, sc);
+               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
+                       zone->all_unreclaimable = 1;
        }

> Thanks!
>
> Best Regards
> Lisa Du
>
>
> -----Original Message-----
> From: Lisa Du
> Sent: July 24, 2013 9:31
> To: 'Bob Liu'
> Cc: linux-mm@kvack.org; Christoph Lameter; Mel Gorman
> Subject: RE: Possible deadloop in direct reclaim?
>
> Dear Bob
>     Thank you so much for the careful review, Yes, it's a typo, I mean zone->all_unreclaimable = 0.
>     You mentioned add the check in kswapd_shrink_zone(), sorry that I didn't find this function in kernel3.4 or kernel3.9.
>     Is this function called in direct_reclaim?
>     As I mentioned this issue happened after kswapd thread sleep, if it only called in kswapd, then I think it can't help.
>
> Thanks!
>
> Best Regards
> Lisa Du
>
>
> -----Original Message-----
> From: Bob Liu [mailto:lliubbo@gmail.com]
> Sent: July 24, 2013 9:18
> To: Lisa Du
> Cc: linux-mm@kvack.org; Christoph Lameter; Mel Gorman
> Subject: Re: Possible deadloop in direct reclaim?
>
> On Tue, Jul 23, 2013 at 12:58 PM, Lisa Du <cldu@marvell.com> wrote:
>> Dear Sir:
>>
>> Currently I met a possible deadloop in direct reclaim. After run plenty of
>> the application, system run into a status that system memory is very
>> fragmentized. Like only order-0 and order-1 memory left.
>>
>> Then one process required a order-2 buffer but it enter an endless direct
>> reclaim. From my trace log, I can see this loop already over 200,000 times.
>> Kswapd was first wake up and then go back to sleep as it cannot rebalance
>> this order’s memory. But zone->all_unreclaimable remains 1.
>>
>> Though direct_reclaim every time returns no pages, but as
>> zone->all_unreclaimable = 1, so it loop again and again. Even when
>> zone->pages_scanned also becomes very large. It will block the process for
>> long time, until some watchdog thread detect this and kill this process.
>> Though it’s in __alloc_pages_slowpath, but it’s too slow right? Maybe cost
>> over 50 seconds or even more.
>
> You must be mean zone->all_unreclaimable = 0?
>
>>
>> I think it’s not as expected right?  Can we also add below check in the
>> function all_unreclaimable() to terminate this loop?
>>
>>
>>
>> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
>> *zonelist,
>>
>>                         continue;
>>
>>                 if (!zone->all_unreclaimable)
>>
>>                         return false;
>>
>> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>>
>> +                       return true;
>>
>
> How about replace the checking in kswapd_shrink_zone()?
>
> @@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>         /* Account for the number of pages attempted to reclaim */
>         *nr_attempted += sc->nr_to_reclaim;
>
> -       if (nr_slab == 0 && !zone_reclaimable(zone))
> +       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>                 zone->all_unreclaimable = 1;
>
>         zone_clear_flag(zone, ZONE_WRITEBACK);
>
>
> I think the current check is wrong, reclaimed a slab doesn't mean
> reclaimed a page.
>
> --
> Regards,
> --Bob



-- 
Regards,
--Bob


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-07-24  3:38     ` Bob Liu
@ 2013-07-24  5:58       ` Lisa Du
  0 siblings, 0 replies; 36+ messages in thread
From: Lisa Du @ 2013-07-24  5:58 UTC (permalink / raw)
  To: Bob Liu; +Cc: linux-mm, Christoph Lameter, Mel Gorman, kosaki.motohiro

Dear Bob
   I really appreciate your review and suggestions!
   Yes, your suggestion ends my infinite loop in direct_reclaim. This change makes it easier than before to mark a zone as unreclaimable, right? Will it have any other side effects?

   I reviewed the mainline patch list and found that the patch below addresses a similar case to mine: there kswapd was frozen, while in my case kswapd went to sleep.
From d1908362ae0b97374eb8328fbb471576332f9fb1 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan.kim@gmail.com>
Date: Wed, 22 Sep 2010 13:05:01 -0700
Subject: [PATCH] vmscan: check all_unreclaimable in direct reclaim path

  But the later patch below changed that logic and instead checks the flag oom_killer_disabled, which seems to be set only during hibernation, and so my issue appeared.

From 929bea7c714220fc76ce3f75bef9056477c28e74 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Thu, 14 Apr 2011 15:22:12 -0700
Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
@@ -2006,13 +2002,11 @@ static bool all_unreclaimable(struct zonelist *zonelist,
                        continue;
                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
                        continue;
-               if (zone_reclaimable(zone)) {
-                       all_unreclaimable = false;
-                       break;
-               }
+               if (!zone->all_unreclaimable)
+                       return false;
        }

-       return all_unreclaimable;
+       return true;
 }

 /*
@@ -2108,6 +2102,14 @@ out:
        if (sc->nr_reclaimed)
                return sc->nr_reclaimed;

+       /*
+        * As hibernation is going on, kswapd is freezed so that it can't mark
+        * the zone into all_unreclaimable. Thus bypassing all_unreclaimable
+        * check.
+        */
+       if (oom_killer_disabled)
+               return 0;

Thanks!

Best Regards
Lisa Du


-----Original Message-----
From: Bob Liu [mailto:lliubbo@gmail.com] 
Sent: July 24, 2013 11:39
To: Lisa Du
Cc: linux-mm@kvack.org; Christoph Lameter; Mel Gorman
Subject: Re: Possible deadloop in direct reclaim?

On Wed, Jul 24, 2013 at 10:23 AM, Lisa Du <cldu@marvell.com> wrote:
> Dear Bob
>    Also from my check before kswapd sleep, though nr_slab = 0 but zone_reclaimable(zone) returns true, so zone->all_unreclaimable can't be changed to 1; So even when change the nr_slab to sc->nr_reclaimed, it can't help.
>

Then the other fix might be set zone->all_unreclaimable in direct
reclaim path also, like:

@@ -2278,6 +2278,8 @@ static bool shrink_zones(struct zonelist
*zonelist, struct scan_control *sc)
                }

                shrink_zone(zone, sc);
+               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
+                       zone->all_unreclaimable = 1;
        }

> Thanks!
>
> Best Regards
> Lisa Du
>
>
> -----Original Message-----
> From: Lisa Du
> Sent: July 24, 2013 9:31
> To: 'Bob Liu'
> Cc: linux-mm@kvack.org; Christoph Lameter; Mel Gorman
> Subject: RE: Possible deadloop in direct reclaim?
>
> Dear Bob
>     Thank you so much for the careful review, Yes, it's a typo, I mean zone->all_unreclaimable = 0.
>     You mentioned add the check in kswapd_shrink_zone(), sorry that I didn't find this function in kernel3.4 or kernel3.9.
>     Is this function called in direct_reclaim?
>     As I mentioned this issue happened after kswapd thread sleep, if it only called in kswapd, then I think it can't help.
>
> Thanks!
>
> Best Regards
> Lisa Du
>
>
> -----Original Message-----
> From: Bob Liu [mailto:lliubbo@gmail.com]
> Sent: July 24, 2013 9:18
> To: Lisa Du
> Cc: linux-mm@kvack.org; Christoph Lameter; Mel Gorman
> Subject: Re: Possible deadloop in direct reclaim?
>
> On Tue, Jul 23, 2013 at 12:58 PM, Lisa Du <cldu@marvell.com> wrote:
>> Dear Sir:
>>
>> Currently I met a possible deadloop in direct reclaim. After run plenty of
>> the application, system run into a status that system memory is very
>> fragmentized. Like only order-0 and order-1 memory left.
>>
>> Then one process required a order-2 buffer but it enter an endless direct
>> reclaim. From my trace log, I can see this loop already over 200,000 times.
>> Kswapd was first wake up and then go back to sleep as it cannot rebalance
>> this order’s memory. But zone->all_unreclaimable remains 1.
>>
>> Though direct_reclaim every time returns no pages, but as
>> zone->all_unreclaimable = 1, so it loop again and again. Even when
>> zone->pages_scanned also becomes very large. It will block the process for
>> long time, until some watchdog thread detect this and kill this process.
>> Though it’s in __alloc_pages_slowpath, but it’s too slow right? Maybe cost
>> over 50 seconds or even more.
>
> You must be mean zone->all_unreclaimable = 0?
>
>>
>> I think it’s not as expected right?  Can we also add below check in the
>> function all_unreclaimable() to terminate this loop?
>>
>>
>>
>> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
>> *zonelist,
>>
>>                         continue;
>>
>>                 if (!zone->all_unreclaimable)
>>
>>                         return false;
>>
>> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>>
>> +                       return true;
>>
>
> How about replace the checking in kswapd_shrink_zone()?
>
> @@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>         /* Account for the number of pages attempted to reclaim */
>         *nr_attempted += sc->nr_to_reclaim;
>
> -       if (nr_slab == 0 && !zone_reclaimable(zone))
> +       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>                 zone->all_unreclaimable = 1;
>
>         zone_clear_flag(zone, ZONE_WRITEBACK);
>
>
> I think the current check is wrong, reclaimed a slab doesn't mean
> reclaimed a page.
>
> --
> Regards,
> --Bob



-- 
Regards,
--Bob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-24  1:18 ` Bob Liu
  2013-07-24  1:31   ` Lisa Du
  2013-07-24  2:23   ` Lisa Du
@ 2013-07-25 18:14   ` KOSAKI Motohiro
  2013-07-26  1:22     ` Bob Liu
  2 siblings, 1 reply; 36+ messages in thread
From: KOSAKI Motohiro @ 2013-07-25 18:14 UTC (permalink / raw)
  To: Bob Liu; +Cc: Lisa Du, linux-mm, Christoph Lameter, Mel Gorman

> How about replace the checking in kswapd_shrink_zone()?
>
> @@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>         /* Account for the number of pages attempted to reclaim */
>         *nr_attempted += sc->nr_to_reclaim;
>
> -       if (nr_slab == 0 && !zone_reclaimable(zone))
> +       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>                 zone->all_unreclaimable = 1;
>
>         zone_clear_flag(zone, ZONE_WRITEBACK);
>
>
> I think the current check is wrong, reclaimed a slab doesn't mean
> reclaimed a page.

The code is correct; at least, it works as intended. Page reclaim
status is checked by zone_reclaimable(), and slab shrinking status is
checked by nr_slab.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-24  1:21   ` Lisa Du
@ 2013-07-25 18:19     ` KOSAKI Motohiro
  2013-07-26  1:11       ` Lisa Du
  2013-07-29  1:32       ` Lisa Du
  0 siblings, 2 replies; 36+ messages in thread
From: KOSAKI Motohiro @ 2013-07-25 18:19 UTC (permalink / raw)
  To: Lisa Du; +Cc: Christoph Lameter, linux-mm, Mel Gorman, Bob Liu

On Tue, Jul 23, 2013 at 9:21 PM, Lisa Du <cldu@marvell.com> wrote:
> Dear Christoph
>    Thanks a lot for your comment. When this issue happen I just trigger a kernel panic and got the kdump.
> From the kdump, I got the global variable pg_data_t congit_page_data. From this structure, I can see in normal zone, only order-0's nr_free = 18442, order-1's nr_free = 367, all the other order's nr_free is 0.

Don't you use compaction? Or if you do, please capture a log via tracepoints.
We need to know why it doesn't work.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-07-25 18:19     ` KOSAKI Motohiro
@ 2013-07-26  1:11       ` Lisa Du
  2013-07-29 16:44         ` KOSAKI Motohiro
  2013-07-29  1:32       ` Lisa Du
  1 sibling, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-07-26  1:11 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Christoph Lameter, linux-mm, Mel Gorman, Bob Liu


Dear KOSAKI
   In my test I didn't enable compaction. Maybe compaction would help avoid this issue; I can try it later.
   In my mind CONFIG_COMPACTION is an optional configuration, right?
   If we don't use it and meet such an issue, how should we deal with this infinite loop?

   I made a change in the function all_unreclaimable() and it passed overnight tests; please help review. Thanks in advance!
@@ -2353,7 +2353,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
                        continue;
                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
                        continue;
-               if (!zone->all_unreclaimable)
+               if (zone->all_unreclaimable)
+                       continue;
+               if (zone_reclaimable(zone))
                        return false;
        }

Thanks!

Best Regards
Lisa Du


-----Original Message-----
From: KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com] 
Sent: July 26, 2013 2:19
To: Lisa Du
Cc: Christoph Lameter; linux-mm@kvack.org; Mel Gorman; Bob Liu
Subject: Re: Possible deadloop in direct reclaim?

On Tue, Jul 23, 2013 at 9:21 PM, Lisa Du <cldu@marvell.com> wrote:
> Dear Christoph
>    Thanks a lot for your comment. When this issue happen I just trigger a kernel panic and got the kdump.
> From the kdump, I got the global variable pg_data_t congit_page_data. From this structure, I can see in normal zone, only order-0's nr_free = 18442, order-1's nr_free = 367, all the other order's nr_free is 0.

Don't you use compaction? Of if use, please get a log by tracepoints.
We need to know why it doesn't work.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-25 18:14   ` KOSAKI Motohiro
@ 2013-07-26  1:22     ` Bob Liu
  2013-07-29 16:46       ` KOSAKI Motohiro
  0 siblings, 1 reply; 36+ messages in thread
From: Bob Liu @ 2013-07-26  1:22 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Lisa Du, linux-mm, Christoph Lameter, Mel Gorman

Hi Kosaki,

On Fri, Jul 26, 2013 at 2:14 AM, KOSAKI Motohiro
<kosaki.motohiro@gmail.com> wrote:
>> How about replace the checking in kswapd_shrink_zone()?
>>
>> @@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>>         /* Account for the number of pages attempted to reclaim */
>>         *nr_attempted += sc->nr_to_reclaim;
>>
>> -       if (nr_slab == 0 && !zone_reclaimable(zone))
>> +       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>>                 zone->all_unreclaimable = 1;
>>
>>         zone_clear_flag(zone, ZONE_WRITEBACK);
>>
>>
>> I think the current check is wrong, reclaimed a slab doesn't mean
>> reclaimed a page.
>
> The code is correct; at least, it works as intended. Page reclaim
> status is checked by zone_reclaimable() and slab shrinking status is
> checked by nr_slab.

I'm afraid that in some special cases nr_slab may be 1 or some other
small number, meaning we reclaimed a few slab objects.
Then we don't set zone->all_unreclaimable = 1.

But even though we reclaimed some slab objects, there may be no pages freed,
because one page may contain several objects.

If we reclaimed some slab objects but no actual pages, we still need to
set zone->all_unreclaimable = 1!
So I think we should check sc->nr_reclaimed == 0 instead of nr_slab == 0.

-- 
Regards,
--Bob

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-07-25 18:19     ` KOSAKI Motohiro
  2013-07-26  1:11       ` Lisa Du
@ 2013-07-29  1:32       ` Lisa Du
  1 sibling, 0 replies; 36+ messages in thread
From: Lisa Du @ 2013-07-29  1:32 UTC (permalink / raw)
  To: Lisa Du, KOSAKI Motohiro; +Cc: Christoph Lameter, linux-mm, Mel Gorman, Bob Liu

Dear Kosaki
   Have you had a chance to review my change in the function all_unreclaimable()?
@@ -2353,7 +2353,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
                        continue;
                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
                        continue;
-               if (!zone->all_unreclaimable)
+               if (zone->all_unreclaimable)
+                       continue;
+               if (zone_reclaimable(zone))
                        return false;
        }
   In my test, it helped to avoid the infinite loop in the direct reclaim path, and I think it should also avoid the kernel hang issue you mentioned in commit 929bea7c714220.
   In a word, I think neither checking zone->all_unreclaimable nor checking zone_reclaimable() alone is enough in all_unreclaimable(), so shall we check both to confirm whether a zone is truly unreclaimable?

Thanks!

Best Regards
Lisa Du

-----Original Message-----
From: Lisa Du 
Sent: July 26, 2013 9:11
To: 'KOSAKI Motohiro'
Cc: Christoph Lameter; linux-mm@kvack.org; Mel Gorman; Bob Liu
Subject: RE: Possible deadloop in direct reclaim?

Dear KOSAKI
   In my test, I didn't enable compaction. Maybe compaction would help avoid this issue; I can try it later.
   In my mind, CONFIG_COMPACTION is an optional configuration, right?
   If we don't use it and hit such an issue, how should we deal with the infinite loop?

   I made a change in the all_unreclaimable() function that passed overnight tests; please help review, thanks in advance!
@@ -2353,7 +2353,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
                        continue;
                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
                        continue;
-               if (!zone->all_unreclaimable)
+               if (zone->all_unreclaimable)
+                       continue;
+               if (zone_reclaimable(zone))
                        return false;
        }

Thanks!

Best Regards
Lisa Du


-----Original Message-----
From: KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com] 
Sent: July 26, 2013 2:19
To: Lisa Du
Cc: Christoph Lameter; linux-mm@kvack.org; Mel Gorman; Bob Liu
Subject: Re: Possible deadloop in direct reclaim?

On Tue, Jul 23, 2013 at 9:21 PM, Lisa Du <cldu@marvell.com> wrote:
> Dear Christoph
>    Thanks a lot for your comment. When this issue happened, I triggered a kernel panic and got a kdump.
> From the kdump, I got the global variable pg_data_t contig_page_data. From this structure, I can see that in the normal zone only order-0's nr_free = 18442 and order-1's nr_free = 367; all the other orders' nr_free is 0.

Don't you use compaction? Or if you do, please collect a log via tracepoints.
We need to know why it doesn't work.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-26  1:11       ` Lisa Du
@ 2013-07-29 16:44         ` KOSAKI Motohiro
  2013-07-30  1:27           ` Lisa Du
  2013-08-01  2:24           ` Lisa Du
  0 siblings, 2 replies; 36+ messages in thread
From: KOSAKI Motohiro @ 2013-07-29 16:44 UTC (permalink / raw)
  To: Lisa Du; +Cc: KOSAKI Motohiro, Christoph Lameter, linux-mm, Mel Gorman, Bob Liu

(7/25/13 9:11 PM), Lisa Du wrote:
> Dear KOSAKI
>     In my test, I didn't set compaction. Maybe compaction is helpful to avoid this issue. I can have try later.
>     In my mind CONFIG_COMPACTION is an optional configuration right?

Right. But if you don't set it, applications must NOT use order >1 allocations. They simply don't work, and that is the expected result.
That's your application's mistake.

>     If we don't use, and met such an issue, how should we deal with such infinite loop?
> 
>     I made a change in all_reclaimable() function, passed overnight tests, please help review, thanks in advance!
> @@ -2353,7 +2353,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                          continue;
>                  if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                          continue;
> -               if (!zone->all_unreclaimable)
> +               if (zone->all_unreclaimable)
> +                       continue;
> +               if (zone_reclaimable(zone))
>                          return false;

Please tell me why you changed this.




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-26  1:22     ` Bob Liu
@ 2013-07-29 16:46       ` KOSAKI Motohiro
  0 siblings, 0 replies; 36+ messages in thread
From: KOSAKI Motohiro @ 2013-07-29 16:46 UTC (permalink / raw)
  To: Bob Liu; +Cc: KOSAKI Motohiro, Lisa Du, linux-mm, Christoph Lameter, Mel Gorman

(7/25/13 9:22 PM), Bob Liu wrote:
> Hi Kosaki,
>
> On Fri, Jul 26, 2013 at 2:14 AM, KOSAKI Motohiro
> <kosaki.motohiro@gmail.com> wrote:
>>> How about replace the checking in kswapd_shrink_zone()?
>>>
>>> @@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>>>          /* Account for the number of pages attempted to reclaim */
>>>          *nr_attempted += sc->nr_to_reclaim;
>>>
>>> -       if (nr_slab == 0 && !zone_reclaimable(zone))
>>> +       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>>>                  zone->all_unreclaimable = 1;
>>>
>>>          zone_clear_flag(zone, ZONE_WRITEBACK);
>>>
>>>
>>> I think the current check is wrong, reclaimed a slab doesn't mean
>>> reclaimed a page.
>>
>> The code is correct; at least, it works as intended. Page reclaim
>> status is checked by zone_reclaimable() and slab shrinking status is
>> checked by nr_slab.
>
> I'm afraid that in some special cases nr_slab may be 1 or some other
> small number, meaning we reclaimed a few slab objects.
> Then we don't set zone->all_unreclaimable = 1.
>
> But even though we reclaimed some slab objects, there may be no pages freed,
> because one page may contain several objects.

Right. This is a limitation of the current slab shrinker's implementation.
Contributions in this area are welcome.


> If we reclaimed some slab objects but no actual pages, we still need to
> set zone->all_unreclaimable = 1!
> So I think we should check sc->nr_reclaimed == 0 instead of nr_slab == 0.

sc->nr_reclaimed doesn't track how many pages were freed from slab.





^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-07-29 16:44         ` KOSAKI Motohiro
@ 2013-07-30  1:27           ` Lisa Du
  2013-08-01  2:24           ` Lisa Du
  1 sibling, 0 replies; 36+ messages in thread
From: Lisa Du @ 2013-07-30  1:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, linux-mm, Mel Gorman, Bob Liu, Neil Zhang

-----Original Message-----
From: KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com] 
Sent: July 30, 2013 0:44
To: Lisa Du
Cc: KOSAKI Motohiro; Christoph Lameter; linux-mm@kvack.org; Mel Gorman; Bob Liu
Subject: Re: Possible deadloop in direct reclaim?

(7/25/13 9:11 PM), Lisa Du wrote:
> Dear KOSAKI
>     In my test, I didn't set compaction. Maybe compaction is helpful to avoid this issue. I can have try later.
>     In my mind CONFIG_COMPACTION is an optional configuration right?

Right. But if you don't set it, applications must NOT use order >1 allocations. They simply don't work, and that is the expected result.
That's your application's mistake.

Dear Kosaki, I have two questions about your explanation:
a) You said that if CONFIG_COMPACTION is not set, applications must NOT use order >1 allocations; is there any documentation for this rule?
b) My order-2 allocation does not come from an application but from do_fork, which is in kernel space. In my mind, when a parent process forks a child process, it needs to allocate order-2 memory. If a) is right, then CONFIG_COMPACTION should be a MUST configuration for the Linux kernel, not optional?
>     If we don't use, and met such an issue, how should we deal with such infinite loop?
> 
>     I made a change in all_reclaimable() function, passed overnight tests, please help review, thanks in advance!
> @@ -2353,7 +2353,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                          continue;
>                  if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                          continue;
> -               if (!zone->all_unreclaimable)
> +               if (zone->all_unreclaimable)
> +                       continue;
> +               if (zone_reclaimable(zone))
>                          return false;

Please tell me why you changed this.

The original check was: once it finds zone->all_unreclaimable is false, it returns false, which makes did_some_progress non-zero.
Then another round of direct reclaim is performed. But I think zone->all_unreclaimable is not always reliable; in my case kswapd went back to sleep and nothing ever changed this flag. We should also check zone_reclaimable(zone) when zone->all_unreclaimable = 0 to double-check whether a zone is reclaimable.
This change also avoids the issue you described in the commit below:
commit 929bea7c714220fc76ce3f75bef9056477c28e74
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Thu Apr 14 15:22:12 2011 -0700

    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-07-29 16:44         ` KOSAKI Motohiro
  2013-07-30  1:27           ` Lisa Du
@ 2013-08-01  2:24           ` Lisa Du
  2013-08-01  2:45             ` KOSAKI Motohiro
  1 sibling, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-08-01  2:24 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Christoph Lameter, linux-mm, Mel Gorman, Bob Liu

Dear Kosaki
   Would you please help to check my comment as below:
>(7/25/13 9:11 PM), Lisa Du wrote:
>> Dear KOSAKI
>>     In my test, I didn't set compaction. Maybe compaction is helpful to
>avoid this issue. I can have try later.
>>     In my mind CONFIG_COMPACTION is an optional configuration
>right?
>
>Right. But if you don't set it, application must NOT use >1 order allocations.
>It doesn't work and it is expected
>result.
>That's your application mistake.
Dear Kosaki, I have two questions about your explanation:
a) You said that if CONFIG_COMPACTION is not set, applications must NOT use order >1 allocations; is there any documentation for this rule?
b) My order-2 allocation does not come from an application but from do_fork, which is in kernel space. In my mind, when a parent process forks a child process, it needs to allocate order-2 memory. If a) is right, then CONFIG_COMPACTION should be a MUST configuration for the Linux kernel, not optional?
>
>>     If we don't use, and met such an issue, how should we deal with
>such infinite loop?
>>
>>     I made a change in all_reclaimable() function, passed overnight tests,
>please help review, thanks in advance!
>> @@ -2353,7 +2353,9 @@ static bool all_unreclaimable(struct zonelist
>*zonelist,
>>                          continue;
>>                  if (!cpuset_zone_allowed_hardwall(zone,
>GFP_KERNEL))
>>                          continue;
>> -               if (!zone->all_unreclaimable)
>> +               if (zone->all_unreclaimable)
>> +                       continue;
>> +               if (zone_reclaimable(zone))
>>                          return false;
>
>Please tell me why you changed this.
The original check was: once it finds zone->all_unreclaimable is false, it returns false, which makes did_some_progress non-zero. Then another round of direct reclaim is performed. But I think zone->all_unreclaimable is not always reliable; in my case kswapd went back to sleep and nothing ever changed this flag. We should also check zone_reclaimable(zone) when zone->all_unreclaimable = 0 to double-check whether a zone is reclaimable. This change also avoids the issue you described in the commit below:
commit 929bea7c714220fc76ce3f75bef9056477c28e74
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Thu Apr 14 15:22:12 2011 -0700
    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-08-01  2:24           ` Lisa Du
@ 2013-08-01  2:45             ` KOSAKI Motohiro
  2013-08-01  4:21               ` Bob Liu
  2013-08-01  5:19               ` Lisa Du
  0 siblings, 2 replies; 36+ messages in thread
From: KOSAKI Motohiro @ 2013-08-01  2:45 UTC (permalink / raw)
  To: Lisa Du; +Cc: KOSAKI Motohiro, Christoph Lameter, linux-mm, Mel Gorman, Bob Liu

(7/31/13 10:24 PM), Lisa Du wrote:
> Dear Kosaki
>     Would you please help to check my comment as below:
>> (7/25/13 9:11 PM), Lisa Du wrote:
>>> Dear KOSAKI
>>>      In my test, I didn't set compaction. Maybe compaction is helpful to
>> avoid this issue. I can have try later.
>>>      In my mind CONFIG_COMPACTION is an optional configuration
>> right?
>>
>> Right. But if you don't set it, application must NOT use >1 order allocations.
>> It doesn't work and it is expected
>> result.
>> That's your application mistake.
> Dear Kosaki, I have two questions on your explanation:
> a) you said if don't set CONFIG_COMPATION, application must NOT use >1 order allocations, is there any documentation
   for this theory?

Sorry, I don't understand what "this" means. I mean: even on a desktop or server machine, a kernel without compaction easily ends up with no free order-2 pages.
That is why our in-kernel subsystems avoid order-2 allocations as far as possible.


> b) My order-2 allocation not comes from application, but from do_fork which is in kernel space,
    in my mind when a parent process forks a child process, it need to allocate a order-2 memory,
   if a) is right, then CONFIG_COMPATION should be a MUST configuration for linux kernel but not optional?

???
fork allocates order-1 memory for the stack. Where and why would it allocate order-2? If it is arch-specific code, please
contact the arch maintainer.



>>
>>>      If we don't use, and met such an issue, how should we deal with
>> such infinite loop?
>>>
>>>      I made a change in all_reclaimable() function, passed overnight tests,
>> please help review, thanks in advance!
>>> @@ -2353,7 +2353,9 @@ static bool all_unreclaimable(struct zonelist
>> *zonelist,
>>>                           continue;
>>>                   if (!cpuset_zone_allowed_hardwall(zone,
>> GFP_KERNEL))
>>>                           continue;
>>> -               if (!zone->all_unreclaimable)
>>> +               if (zone->all_unreclaimable)
>>> +                       continue;
>>> +               if (zone_reclaimable(zone))
>>>                           return false;
>>
>> Please tell me why you chaned here.
> The original check is once found zone->all_unreclaimable is false, it will return false, then
>it will set did_some_progress non-zero. Then another loop of direct_reclaimed performed.
>  But I think zone->all_unreclaimable is not always reliable such as in my case, kswapd go to
>  sleep and no one will change this flag. We should also check zone_reclaimalbe(zone) if
>  zone->all_unreclaimalbe = 0 to double confirm if a zone is reclaimable; This change also
>  avoid the issue you described in below commit:

Please read the older code. The code you pointed at was a temporary change, and I changed it back to fix bugs.
If you look at the state in the middle of direct reclaim, we can't avoid race conditions between multiple direct reclaimers. Moreover, if kswapd isn't awake, that is a problem. This is why the current code behaves as you described.
I agree we should fix your issue as far as possible, but I can't agree with your analysis.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-08-01  2:45             ` KOSAKI Motohiro
@ 2013-08-01  4:21               ` Bob Liu
  2013-08-03 21:22                 ` KOSAKI Motohiro
  2013-08-01  5:19               ` Lisa Du
  1 sibling, 1 reply; 36+ messages in thread
From: Bob Liu @ 2013-08-01  4:21 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Lisa Du, Christoph Lameter, linux-mm, Mel Gorman, Bob Liu

Hi KOSAKI,

On 08/01/2013 10:45 AM, KOSAKI Motohiro wrote:

> 
> Please read the older code. The code you pointed at was a temporary
> change, and I changed it back to fix bugs.
> If you look at the state in the middle of direct reclaim, we can't
> avoid race conditions between multiple direct reclaimers. Moreover,
> if kswapd isn't awake, that is a problem. This is why the current
> code behaves as you described.
> I agree we should fix your issue as far as possible, but I can't
> agree with your analysis.
> 

I found this thread:
mm, vmscan: fix do_try_to_free_pages() livelock
https://lkml.org/lkml/2012/6/14/74

I think that's the same issue Lisa met.

But I couldn't find out why your patch didn't get merged;
there were already many acks.

-- 
Regards,
-Bob


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-08-01  2:45             ` KOSAKI Motohiro
  2013-08-01  4:21               ` Bob Liu
@ 2013-08-01  5:19               ` Lisa Du
  2013-08-01  8:56                 ` Russell King - ARM Linux
  1 sibling, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-08-01  5:19 UTC (permalink / raw)
  To: KOSAKI Motohiro, linux
  Cc: Christoph Lameter, linux-mm, Mel Gorman, Bob Liu, Neil Zhang

Looping in Russell King.
Would you please help comment on the questions below that Mr. Motohiro asked about fork allocating order-2 memory? Thanks in advance!
>(7/31/13 10:24 PM), Lisa Du wrote:
>> Dear Kosaki
>>     Would you please help to check my comment as below:
>>> (7/25/13 9:11 PM), Lisa Du wrote:
>>>> Dear KOSAKI
>>>>      In my test, I didn't set compaction. Maybe compaction is helpful
>to
>>> avoid this issue. I can have try later.
>>>>      In my mind CONFIG_COMPACTION is an optional configuration
>>> right?
>>>
>>> Right. But if you don't set it, application must NOT use >1 order
>allocations.
>>> It doesn't work and it is expected
>>> result.
>>> That's your application mistake.
>> Dear Kosaki, I have two questions on your explanation:
>> a) you said if don't set CONFIG_COMPATION, application must NOT use >1
>order allocations, is there any documentation
>   for this theory?
>
>Sorry I don't understand what "this" mean. I mean, Even though you use
>desktop or server machine, no compaction kernel
>easily makes no order-2 situations.
>Then, our in-kernel subsystems don't use order-2 allocations as far as
>possible.
Thanks, now I got your point. 
>
>
>> b) My order-2 allocation not comes from application, but from do_fork
>which is in kernel space,
>    in my mind when a parent process forks a child process, it need to
>allocate a order-2 memory,
>   if a) is right, then CONFIG_COMPATION should be a MUST configuration
>for linux kernel but not optional?
>
>???
>fork alloc order-1 memory for stack. Where and why alloc order-2? If it is
>arch specific code, please
>contact arch maintainer.
Yes, our arch's do_fork allocates order-2 memory in copy_process.
Hi Russell,
What's your opinion on this question?
If we really need order-2 memory for fork, then we'd better set CONFIG_COMPACTION, right?
>
>
>
>>>
>>>>      If we don't use, and met such an issue, how should we deal with
>>> such infinite loop?
>>>>
>>>>      I made a change in all_reclaimable() function, passed overnight
>tests,
>>> please help review, thanks in advance!
>>>> @@ -2353,7 +2353,9 @@ static bool all_unreclaimable(struct zonelist
>>> *zonelist,
>>>>                           continue;
>>>>                   if (!cpuset_zone_allowed_hardwall(zone,
>>> GFP_KERNEL))
>>>>                           continue;
>>>> -               if (!zone->all_unreclaimable)
>>>> +               if (zone->all_unreclaimable)
>>>> +                       continue;
>>>> +               if (zone_reclaimable(zone))
>>>>                           return false;
>>>
>>> Please tell me why you chaned here.
>> The original check is once found zone->all_unreclaimable is false, it will
>return false, then
>>it will set did_some_progress non-zero. Then another loop of
>direct_reclaimed performed.
>>  But I think zone->all_unreclaimable is not always reliable such as in my
>case, kswapd go to
>>  sleep and no one will change this flag. We should also check
>zone_reclaimalbe(zone) if
>>  zone->all_unreclaimalbe = 0 to double confirm if a zone is reclaimable;
>This change also
>>  avoid the issue you described in below commit:
>
>Please read more older code. Your pointed code is temporary change and I
>changed back for fixing
>bugs.
>If you look at the status in middle direct reclaim, we can't avoid race
>condition from multi direct
>reclaim issues. Moreover, if kswapd doesn't awaken, it is a problem. This is
>a reason why current code
>behave as you described.
>I agree we should fix your issue as far as possible. But I can't agree your
>analysis.
I read the code you modified, which checks zone->all_unreclaimable instead of zone_reclaimable(zone)
(in commit 929bea7c714, "vmscan: all_unreclaimable() use zone->all_unreclaimable as a name").
Your patch fixed the case where zone->all_unreclaimable = 1 but zone->pages_scanned = 0, which made all_unreclaimable() return false.
Is there anything else I missed or misunderstood?
In my change, I first check zone->all_unreclaimable; if it is 1, I don't look at zone->pages_scanned at all.
My point is that zone->all_unreclaimable = 0 doesn't mean the zone is always reclaimable, because zone->all_unreclaimable can only be set by kswapd.
And kswapd had already fully scanned all zones and still couldn't rebalance the system for high-order allocations. Instead it rechecked all watermarks at order-0, and since those watermarks were fine, kswapd went back to sleep. Unfortunately kswapd was never woken again, because for a long time no higher-order allocation woke it, yet this process kept direct reclaiming again and again as zone->all_unreclaimable remained 0.
So I also check zone->pages_scanned when zone->all_unreclaimable = 0; if zone_reclaimable() returns true, the zone really is reclaimable for a direct reclaimer. This change wouldn't break your bug fix, right?

Thanks to Bob's finding, I read through the thread below, and the patch you were trying to merge addresses the same issue as mine:
mm, vmscan: fix do_try_to_free_pages() livelock
https://lkml.org/lkml/2012/6/14/74
I have the same question as Bob: you already found this issue, so why wasn't the patch merged?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-07-23  4:58 Possible deadloop in direct reclaim? Lisa Du
  2013-07-23 20:28 ` Christoph Lameter
  2013-07-24  1:18 ` Bob Liu
@ 2013-08-01  5:43 ` Minchan Kim
  2013-08-01  6:13   ` Lisa Du
  2 siblings, 1 reply; 36+ messages in thread
From: Minchan Kim @ 2013-08-01  5:43 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm

Hello,

On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
> Dear Sir:
> Currently I met a possible deadloop in direct reclaim. After run plenty of the application, system run into a status that system memory is very fragmentized. Like only order-0 and order-1 memory left.
> Then one process required a order-2 buffer but it enter an endless direct reclaim. From my trace log, I can see this loop already over 200,000 times. Kswapd was first wake up and then go back to sleep as it cannot rebalance this order's memory. But zone->all_unreclaimable remains 1.
> Though direct_reclaim every time returns no pages, but as zone->all_unreclaimable = 1, so it loop again and again. Even when zone->pages_scanned also becomes very large. It will block the process for long time, until some watchdog thread detect this and kill this process. Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe cost over 50 seconds or even more.
> I think it's not as expected right?  Can we also add below check in the function all_unreclaimable() to terminate this loop?
> 
> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                         continue;
>                 if (!zone->all_unreclaimable)
>                         return false;
> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
> +                       return true;
>         }
>          BTW: I'm using kernel3.4, I also try to search in the kernel3.9, didn't see a possible fix for such issue. Or is anyone also met such issue before? Any comment will be welcomed, looking forward to your reply!
> 
> Thanks!

I'd like to ask some things.

1. Do you have swap enabled?
2. Did you enable CONFIG_COMPACTION?
3. Could we get your zoneinfo via cat /proc/zoneinfo?
4. If you disable the watchdog thread, do you eventually see an OOM,
   even if it takes a very long time?


> 
> Best Regards
> Lisa Du
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-08-01  5:43 ` Minchan Kim
@ 2013-08-01  6:13   ` Lisa Du
  2013-08-01  7:33     ` Minchan Kim
  0 siblings, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-08-01  6:13 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm

>On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
>> Dear Sir:
>> Currently I met a possible deadloop in direct reclaim. After run plenty of
>the application, system run into a status that system memory is very
>fragmentized. Like only order-0 and order-1 memory left.
>> Then one process required a order-2 buffer but it enter an endless direct
>reclaim. From my trace log, I can see this loop already over 200,000 times.
>Kswapd was first wake up and then go back to sleep as it cannot rebalance
>this order's memory. But zone->all_unreclaimable remains 1.
>> Though direct_reclaim every time returns no pages, but as
>zone->all_unreclaimable = 1, so it loop again and again. Even when
>zone->pages_scanned also becomes very large. It will block the process for
>long time, until some watchdog thread detect this and kill this process.
>Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe cost
>over 50 seconds or even more.
>> I think it's not as expected right?  Can we also add below check in the
>function all_unreclaimable() to terminate this loop?
>>
>> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
>*zonelist,
>>                         continue;
>>                 if (!zone->all_unreclaimable)
>>                         return false;
>> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>> +                       return true;
>>         }
>>          BTW: I'm using kernel3.4, I also try to search in the kernel3.9,
>didn't see a possible fix for such issue. Or is anyone also met such issue
>before? Any comment will be welcomed, looking forward to your reply!
>>
>> Thanks!
>
>I'd like to ask somethigs.
>
>1. Do you have enabled swap?
I set CONFIG_SWAP=y, but I don't actually have a swap partition, so my swap space is 0.
>2. Do you enable CONFIG_COMPACTION?
No, I didn't enable it.
>3. Could we get your zoneinfo via cat /proc/zoneinfo?
I dumped some info from the ramdump; please review:
crash> kmem -z
NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
  SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
  VM_STAT:
          NR_FREE_PAGES: 16092
       NR_INACTIVE_ANON: 17
         NR_ACTIVE_ANON: 55091
       NR_INACTIVE_FILE: 17
         NR_ACTIVE_FILE: 17
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 55077
         NR_FILE_MAPPED: 42
          NR_FILE_PAGES: 69
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 1226
  NR_SLAB_UNRECLAIMABLE: 9373
           NR_PAGETABLE: 2776
        NR_KERNEL_STACK: 798
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 91
    NR_VMSCAN_IMMEDIATE: 115381
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 0
               NR_SHMEM: 31
             NR_DIRTIED: 15256
             NR_WRITTEN: 11981
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
  SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
  VM_STAT:
          NR_FREE_PAGES: 161
       NR_INACTIVE_ANON: 104
         NR_ACTIVE_ANON: 46114
       NR_INACTIVE_FILE: 9722
         NR_ACTIVE_FILE: 12263
         NR_UNEVICTABLE: 168
               NR_MLOCK: 0
          NR_ANON_PAGES: 46102
         NR_FILE_MAPPED: 12227
          NR_FILE_PAGES: 22270
          NR_FILE_DIRTY: 1
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 0
  NR_SLAB_UNRECLAIMABLE: 0
           NR_PAGETABLE: 0
        NR_KERNEL_STACK: 0
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 0
    NR_VMSCAN_IMMEDIATE: 0
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 0
               NR_SHMEM: 117
             NR_DIRTIED: 7364
             NR_WRITTEN: 6989
NR_ANON_TRANSPARENT_HUGEPAGES: 0

ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR  START_MAPNR
  0   Normal    192512   16092  c1200000       0            0     
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k      c08460f0           3      3
  0       4k      c08460f8         436    436
  0       4k      c0846100       15237  15237
  0       4k      c0846108           0      0
  0       4k      c0846110           0      0
  1       8k      c084611c          39     78
  1       8k      c0846124           0      0
  1       8k      c084612c         169    338
  1       8k      c0846134           0      0
  1       8k      c084613c           0      0
  2      16k      c0846148           0      0
  2      16k      c0846150           0      0
  2      16k      c0846158           0      0
---------Normal zone: all orders > 1 have no free pages
ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR  START_MAPNR
  1   HighMem    69632     161  c17e0000    2f000000      192512  
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k      c08464f0          12     12
  0       4k      c08464f8           0      0
  0       4k      c0846500          14     14
  0       4k      c0846508           3      3
  0       4k      c0846510           0      0
  1       8k      c084651c           0      0
  1       8k      c0846524           0      0
  1       8k      c084652c           0      0
  2      16k      c0846548           0      0
  2      16k      c0846550           0      0
  2      16k      c0846558           0      0
  2      16k      c0846560           1      4
  2      16k      c0846568           0      0
  5     128k      c08465cc           0      0
  5     128k      c08465d4           0      0
  5     128k      c08465dc           0      0
  5     128k      c08465e4           4    128
  5     128k      c08465ec           0      0
------All other areas are zero

Some other zone information I dumped from pglist_data:
{
	watermark = {853, 1066, 1279}, 
      percpu_drift_mark = 0, 
      lowmem_reserve = {0, 2159, 2159}, 
      dirty_balance_reserve = 3438, 
      pageset = 0xc07f6144, 
      lock = {
        {
          rlock = {
            raw_lock = {
              lock = 0
            }, 
            break_lock = 0
          }
        }
      },       
	all_unreclaimable = 0,
      reclaim_stat = {
        recent_rotated = {903355, 960912}, 
        recent_scanned = {932404, 2462017}
      }, 
      pages_scanned = 84231,
inactive_ratio = 1, 
      _pad2_ = {
        x = 0xc0846480 "@"
      }, 
      wait_table = 0xc1a00040, 
      wait_table_hash_nr_entries = 1024, 
      wait_table_bits = 10, 
      zone_pgdat = 0xc08460c0, 
      zone_start_pfn = 0, 
      spanned_pages = 192512, 
      present_pages = 182304, 
      name = 0xc06f1d46 "Normal"
}
{  watermark = {67, 147, 228}, 
      percpu_drift_mark = 0, 
      lowmem_reserve = {0, 0, 0}, 
      dirty_balance_reserve = 228, 
      pageset = 0xc07f6184, 
      lock = {
        {
          rlock = {
            raw_lock = {
              lock = 0
            }, 
            break_lock = 0
          }
        }
      }, 
      all_unreclaimable = 0,
     reclaim_stat = {
        recent_rotated = {272514, 28087}, 
        recent_scanned = {287521, 110478}
      }, 
      pages_scanned = 0, 
      flags = 0,
}
kswapd = 0xe02d4f00, 
  kswapd_max_order = 0, 
  classzone_idx = ZONE_HIGHMEM

>4. If you disabled watchdog thread, you could see OOM sometime
>   although it takes very long time?
I haven't tried disabling the watchdog; in my case, when the watchdog triggered, it meant the process had already been blocked for over 60s.
And during those 60s, I didn't see an OOM happen.
>
>
>>
>> Best Regards
>> Lisa Du
>>
>
>--
>Kind regards,
>Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-08-01  6:13   ` Lisa Du
@ 2013-08-01  7:33     ` Minchan Kim
  2013-08-01  8:20       ` Lisa Du
  0 siblings, 1 reply; 36+ messages in thread
From: Minchan Kim @ 2013-08-01  7:33 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, KOSAKI Motohiro

On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
> >> Dear Sir:
> >> Currently I met a possible deadloop in direct reclaim. After run plenty of
> >the application, system run into a status that system memory is very
> >fragmentized. Like only order-0 and order-1 memory left.
> >> Then one process required a order-2 buffer but it enter an endless direct
> >reclaim. From my trace log, I can see this loop already over 200,000 times.
> >Kswapd was first wake up and then go back to sleep as it cannot rebalance
> >this order's memory. But zone->all_unreclaimable remains 1.
> >> Though direct_reclaim every time returns no pages, but as
> >zone->all_unreclaimable = 1, so it loop again and again. Even when
> >zone->pages_scanned also becomes very large. It will block the process for
> >long time, until some watchdog thread detect this and kill this process.
> >Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe cost
> >over 50 seconds or even more.
> >> I think it's not as expected right?  Can we also add below check in the
> >function all_unreclaimable() to terminate this loop?
> >>
> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
> >*zonelist,
> >>                         continue;
> >>                 if (!zone->all_unreclaimable)
> >>                         return false;
> >> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
> >> +                       return true;
> >>         }
> >>          BTW: I'm using kernel3.4, I also try to search in the kernel3.9,
> >didn't see a possible fix for such issue. Or is anyone also met such issue
> >before? Any comment will be welcomed, looking forward to your reply!
> >>
> >> Thanks!
> >
> >I'd like to ask somethigs.
> >
> >1. Do you have enabled swap?
> I set CONFIG_SWAP=y, but I didn't really have a swap partition, that means my swap buffer size is 0;
> >2. Do you enable CONFIG_COMPACTION?
> No, I didn't enable;
> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
> I dump some info from ramdump, please review:

Thanks for the information.
You said the order-2 allocation failed, so I will assume the preferred zone
is the Normal zone, not HighMem, because high-order allocations on the
kernel side don't come from the HighMem zone.

> crash> kmem -z
> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279

712M normal memory.

>   VM_STAT:
>           NR_FREE_PAGES: 16092

There are plenty of free pages over the high watermark, but the zone is
heavily fragmented, as the information below shows.

So kswapd stops scanning this zone once the loop iteration done with
order-2 finishes. I mean, kswapd will rescan this zone with order-0 after
the first iteration is done, via

        order = sc.order = 0;

        goto loop_again;

But this time, zone_watermark_ok_safe with testorder = 0 on the Normal zone
is always true, so scanning of the zone will be skipped. It means kswapd
never sets zone->all_unreclaimable to 1.

>        NR_INACTIVE_ANON: 17
>          NR_ACTIVE_ANON: 55091
>        NR_INACTIVE_FILE: 17
>          NR_ACTIVE_FILE: 17
>          NR_UNEVICTABLE: 0
>                NR_MLOCK: 0
>           NR_ANON_PAGES: 55077

There are about 200M of anon pages and few file pages.
You don't have swap, so the reclaimer couldn't get far.

>          NR_FILE_MAPPED: 42
>           NR_FILE_PAGES: 69
>           NR_FILE_DIRTY: 0
>            NR_WRITEBACK: 0
>     NR_SLAB_RECLAIMABLE: 1226
>   NR_SLAB_UNRECLAIMABLE: 9373
>            NR_PAGETABLE: 2776
>         NR_KERNEL_STACK: 798
>         NR_UNSTABLE_NFS: 0
>               NR_BOUNCE: 0
>         NR_VMSCAN_WRITE: 91
>     NR_VMSCAN_IMMEDIATE: 115381
>       NR_WRITEBACK_TEMP: 0
>        NR_ISOLATED_ANON: 0
>        NR_ISOLATED_FILE: 0
>                NR_SHMEM: 31
>              NR_DIRTIED: 15256
>              NR_WRITTEN: 11981
> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> 
> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
>   VM_STAT:
>           NR_FREE_PAGES: 161

Reclaimer should reclaim this zone.

>        NR_INACTIVE_ANON: 104
>          NR_ACTIVE_ANON: 46114
>        NR_INACTIVE_FILE: 9722
>          NR_ACTIVE_FILE: 12263

It seems there is plenty of room to evict file pages.

>          NR_UNEVICTABLE: 168
>                NR_MLOCK: 0
>           NR_ANON_PAGES: 46102
>          NR_FILE_MAPPED: 12227
>           NR_FILE_PAGES: 22270
>           NR_FILE_DIRTY: 1
>            NR_WRITEBACK: 0
>     NR_SLAB_RECLAIMABLE: 0
>   NR_SLAB_UNRECLAIMABLE: 0
>            NR_PAGETABLE: 0
>         NR_KERNEL_STACK: 0
>         NR_UNSTABLE_NFS: 0
>               NR_BOUNCE: 0
>         NR_VMSCAN_WRITE: 0
>     NR_VMSCAN_IMMEDIATE: 0
>       NR_WRITEBACK_TEMP: 0
>        NR_ISOLATED_ANON: 0
>        NR_ISOLATED_FILE: 0
>                NR_SHMEM: 117
>              NR_DIRTIED: 7364
>              NR_WRITTEN: 6989
> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> 
> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR  START_MAPNR
>   0   Normal    192512   16092  c1200000       0            0     
> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
>   0       4k      c08460f0           3      3
>   0       4k      c08460f8         436    436
>   0       4k      c0846100       15237  15237
>   0       4k      c0846108           0      0
>   0       4k      c0846110           0      0
>   1       8k      c084611c          39     78
>   1       8k      c0846124           0      0
>   1       8k      c084612c         169    338
>   1       8k      c0846134           0      0
>   1       8k      c084613c           0      0
>   2      16k      c0846148           0      0
>   2      16k      c0846150           0      0
>   2      16k      c0846158           0      0
> ---------Normal zone all order > 1 has no free pages
> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR  START_MAPNR
>   1   HighMem    69632     161  c17e0000    2f000000      192512  
> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
>   0       4k      c08464f0          12     12
>   0       4k      c08464f8           0      0
>   0       4k      c0846500          14     14
>   0       4k      c0846508           3      3
>   0       4k      c0846510           0      0
>   1       8k      c084651c           0      0
>   1       8k      c0846524           0      0
>   1       8k      c084652c           0      0
>   2      16k      c0846548           0      0
>   2      16k      c0846550           0      0
>   2      16k      c0846558           0      0
>   2      16k      c0846560           1      4
>   2      16k      c0846568           0      0
>   5     128k      c08465cc           0      0
>   5     128k      c08465d4           0      0
>   5     128k      c08465dc           0      0
>   5     128k      c08465e4           4    128
>   5     128k      c08465ec           0      0
> ------Other's all zero
> 
> Some other zone information I dump from pglist_data
> {
> 	watermark = {853, 1066, 1279}, 
>       percpu_drift_mark = 0, 
>       lowmem_reserve = {0, 2159, 2159}, 
>       dirty_balance_reserve = 3438, 
>       pageset = 0xc07f6144, 
>       lock = {
>         {
>           rlock = {
>             raw_lock = {
>               lock = 0
>             }, 
>             break_lock = 0
>           }
>         }
>       },       
> 	all_unreclaimable = 0,
>       reclaim_stat = {
>         recent_rotated = {903355, 960912}, 
>         recent_scanned = {932404, 2462017}
>       }, 
>       pages_scanned = 84231,

Most of the scanning happens in the direct reclaim path, I guess, but
direct reclaim couldn't reclaim any pages due to the lack of a swap device.

It means we have to set zone->all_unreclaimable in the direct reclaim path, too.
Does the patch below fix your problem?


* RE: Possible deadloop in direct reclaim?
  2013-08-01  7:33     ` Minchan Kim
@ 2013-08-01  8:20       ` Lisa Du
  2013-08-01  8:42         ` Minchan Kim
  0 siblings, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-08-01  8:20 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, KOSAKI Motohiro

>-----Original Message-----
>From: Minchan Kim [mailto:minchan@kernel.org]
>Sent: August 1, 2013 15:34
>To: Lisa Du
>Cc: linux-mm@kvack.org; KOSAKI Motohiro
>Subject: Re: Possible deadloop in direct reclaim?
>
>On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
>> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
>> >> Dear Sir:
>> >> Currently I met a possible deadloop in direct reclaim. After run plenty
>of
>> >the application, system run into a status that system memory is very
>> >fragmentized. Like only order-0 and order-1 memory left.
>> >> Then one process required a order-2 buffer but it enter an endless
>direct
>> >reclaim. From my trace log, I can see this loop already over 200,000
>times.
>> >Kswapd was first wake up and then go back to sleep as it cannot
>rebalance
>> >this order's memory. But zone->all_unreclaimable remains 1.
>> >> Though direct_reclaim every time returns no pages, but as
>> >zone->all_unreclaimable = 1, so it loop again and again. Even when
>> >zone->pages_scanned also becomes very large. It will block the process
>for
>> >long time, until some watchdog thread detect this and kill this process.
>> >Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe
>cost
>> >over 50 seconds or even more.
>> >> I think it's not as expected right?  Can we also add below check in the
>> >function all_unreclaimable() to terminate this loop?
>> >>
>> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
>> >*zonelist,
>> >>                         continue;
>> >>                 if (!zone->all_unreclaimable)
>> >>                         return false;
>> >> +               if (sc->nr_reclaimed == 0
>&& !zone_reclaimable(zone))
>> >> +                       return true;
>> >>         }
>> >>          BTW: I'm using kernel3.4, I also try to search in the
>kernel3.9,
>> >didn't see a possible fix for such issue. Or is anyone also met such issue
>> >before? Any comment will be welcomed, looking forward to your reply!
>> >>
>> >> Thanks!
>> >
>> >I'd like to ask somethigs.
>> >
>> >1. Do you have enabled swap?
>> I set CONFIG_SWAP=y, but I didn't really have a swap partition, that
>means my swap buffer size is 0;
>> >2. Do you enable CONFIG_COMPACTION?
>> No, I didn't enable;
>> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
>> I dump some info from ramdump, please review:
>
>Thanks for the information.
>You said order-2 allocation was failed so I will assume preferred zone
>is normal zone, not high zone because high order allocation in kernel side
>isn't from high zone.
Yes, that's right!
>
>> crash> kmem -z
>> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
>>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
>
>712M normal memory.
>
>>   VM_STAT:
>>           NR_FREE_PAGES: 16092
>
>There are plenty of free pages over high watermark but there are heavy
>fragmentation as I see below information.
>
>So, kswapd doesn't scan this zone loop iteration is done with order-2.
>I mean kswapd will scan this zone with order-0 if first iteration is
>done by this
>
>        order = sc.order = 0;
>
>        goto loop_again;
>
>But this time, zone_watermark_ok_safe with testorder = 0 on normal zone
>is always true so that scanning of zone will be skipped. It means kswapd
>never set zone->unreclaimable to 1.
Yes, definitely!
>
>>        NR_INACTIVE_ANON: 17
>>          NR_ACTIVE_ANON: 55091
>>        NR_INACTIVE_FILE: 17
>>          NR_ACTIVE_FILE: 17
>>          NR_UNEVICTABLE: 0
>>                NR_MLOCK: 0
>>           NR_ANON_PAGES: 55077
>
>There are about 200M anon pages and few file pages.
>You don't have swap so that reclaimer couldn't go far.
>
>>          NR_FILE_MAPPED: 42
>>           NR_FILE_PAGES: 69
>>           NR_FILE_DIRTY: 0
>>            NR_WRITEBACK: 0
>>     NR_SLAB_RECLAIMABLE: 1226
>>   NR_SLAB_UNRECLAIMABLE: 9373
>>            NR_PAGETABLE: 2776
>>         NR_KERNEL_STACK: 798
>>         NR_UNSTABLE_NFS: 0
>>               NR_BOUNCE: 0
>>         NR_VMSCAN_WRITE: 91
>>     NR_VMSCAN_IMMEDIATE: 115381
>>       NR_WRITEBACK_TEMP: 0
>>        NR_ISOLATED_ANON: 0
>>        NR_ISOLATED_FILE: 0
>>                NR_SHMEM: 31
>>              NR_DIRTIED: 15256
>>              NR_WRITTEN: 11981
>> NR_ANON_TRANSPARENT_HUGEPAGES: 0
>>
>> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
>>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
>>   VM_STAT:
>>           NR_FREE_PAGES: 161
>
>Reclaimer should reclaim this zone.
>
>>        NR_INACTIVE_ANON: 104
>>          NR_ACTIVE_ANON: 46114
>>        NR_INACTIVE_FILE: 9722
>>          NR_ACTIVE_FILE: 12263
>
>It seems there are lots of room to evict file pages.
>
>>          NR_UNEVICTABLE: 168
>>                NR_MLOCK: 0
>>           NR_ANON_PAGES: 46102
>>          NR_FILE_MAPPED: 12227
>>           NR_FILE_PAGES: 22270
>>           NR_FILE_DIRTY: 1
>>            NR_WRITEBACK: 0
>>     NR_SLAB_RECLAIMABLE: 0
>>   NR_SLAB_UNRECLAIMABLE: 0
>>            NR_PAGETABLE: 0
>>         NR_KERNEL_STACK: 0
>>         NR_UNSTABLE_NFS: 0
>>               NR_BOUNCE: 0
>>         NR_VMSCAN_WRITE: 0
>>     NR_VMSCAN_IMMEDIATE: 0
>>       NR_WRITEBACK_TEMP: 0
>>        NR_ISOLATED_ANON: 0
>>        NR_ISOLATED_FILE: 0
>>                NR_SHMEM: 117
>>              NR_DIRTIED: 7364
>>              NR_WRITTEN: 6989
>> NR_ANON_TRANSPARENT_HUGEPAGES: 0
>>
>> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
>START_MAPNR
>>   0   Normal    192512   16092  c1200000       0            0
>> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
>>   0       4k      c08460f0           3      3
>>   0       4k      c08460f8         436    436
>>   0       4k      c0846100       15237  15237
>>   0       4k      c0846108           0      0
>>   0       4k      c0846110           0      0
>>   1       8k      c084611c          39     78
>>   1       8k      c0846124           0      0
>>   1       8k      c084612c         169    338
>>   1       8k      c0846134           0      0
>>   1       8k      c084613c           0      0
>>   2      16k      c0846148           0      0
>>   2      16k      c0846150           0      0
>>   2      16k      c0846158           0      0
>> ---------Normal zone all order > 1 has no free pages
>> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
>START_MAPNR
>>   1   HighMem    69632     161  c17e0000    2f000000
>192512
>> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
>>   0       4k      c08464f0          12     12
>>   0       4k      c08464f8           0      0
>>   0       4k      c0846500          14     14
>>   0       4k      c0846508           3      3
>>   0       4k      c0846510           0      0
>>   1       8k      c084651c           0      0
>>   1       8k      c0846524           0      0
>>   1       8k      c084652c           0      0
>>   2      16k      c0846548           0      0
>>   2      16k      c0846550           0      0
>>   2      16k      c0846558           0      0
>>   2      16k      c0846560           1      4
>>   2      16k      c0846568           0      0
>>   5     128k      c08465cc           0      0
>>   5     128k      c08465d4           0      0
>>   5     128k      c08465dc           0      0
>>   5     128k      c08465e4           4    128
>>   5     128k      c08465ec           0      0
>> ------Other's all zero
>>
>> Some other zone information I dump from pglist_data
>> {
>> 	watermark = {853, 1066, 1279},
>>       percpu_drift_mark = 0,
>>       lowmem_reserve = {0, 2159, 2159},
>>       dirty_balance_reserve = 3438,
>>       pageset = 0xc07f6144,
>>       lock = {
>>         {
>>           rlock = {
>>             raw_lock = {
>>               lock = 0
>>             },
>>             break_lock = 0
>>           }
>>         }
>>       },
>> 	all_unreclaimable = 0,
>>       reclaim_stat = {
>>         recent_rotated = {903355, 960912},
>>         recent_scanned = {932404, 2462017}
>>       },
>>       pages_scanned = 84231,
>
>Most of scan happens in direct reclaim path, I guess
>but direct reclaim couldn't reclaim any pages due to lack of swap device.
>
>It means we have to set zone->all_unreclaimable in direct reclaim path,
>too.
>Below patch fix your problem?
Yes, your patch should fix my problem!
Actually I also wrote another patch which, after testing, also fixes my issue,
but instead of setting zone->all_unreclaimable in the direct reclaim path as
you did, it just double-checks the zone_reclaimable() status in the
all_unreclaimable() function. Maybe your patch is better!

commit 26d2b60d06234683a81666da55129f9c982271a5
Author: Lisa Du <cldu@marvell.com>
Date:   Thu Aug 1 10:16:32 2013 +0800

    mm: fix infinite direct_reclaim when memory is heavily fragmented
    
    The latest all_unreclaimable check in direct reclaim came in the following
    commit:
    2011 Apr 14; commit 929bea7c; vmscan:  all_unreclaimable() use
                                zone->all_unreclaimable as a name
    which, in addition, added an oom_killer_disabled check to avoid
    reintroducing the issue of commit d1908362 ("vmscan: check
    all_unreclaimable in direct reclaim path").
    
    But besides the hibernation case, in which kswapd is frozen, there is
    another case which can lead to an infinite loop in direct reclaim. In a
    real test, the direct reclaimer performed over 200,000 rebalances in
    __alloc_pages_slowpath(), so the process was blocked until a watchdog
    detected and killed it. The root cause is as follows:
    
    If system memory is heavily fragmented, with only order-0 and order-1
    pages left, kswapd will go to sleep because the system can't be rebalanced
    for high-order allocations. But direct_reclaim still works for
    higher-order requests, so a zone can reach a state where
    zone->all_unreclaimable = 0 but
    zone->pages_scanned > zone_reclaimable_pages(zone) * 6. In this case, if a
    process such as do_fork tries to allocate order-2 memory, which is not a
    costly order, direct_reclaim always reports did_some_progress, so it
    rebalances again and again in __alloc_pages_slowpath(). This issue easily
    happens in a no-swap, no-compaction environment.
    
    So add a further check in all_unreclaimable() to avoid such a case.
    
    Change-Id: Id3266b47c63f5b96aab466fd9f1f44d37e16cdcb
    Signed-off-by: Lisa Du <cldu@marvell.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cff0d4..34582d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2301,7 +2301,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
                        continue;
                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
                        continue;
-               if (!zone->all_unreclaimable)
+               if (zone->all_unreclaimable)
+                       continue;
+               if (zone_reclaimable(zone))
                        return false;
        }
>
>From a5d82159b98f3d90c2f9ff9e486699fb4c67cced Mon Sep 17 00:00:00
>2001
>From: Minchan Kim <minchan@kernel.org>
>Date: Thu, 1 Aug 2013 16:18:00 +0900
>Subject:[PATCH] mm: set zone->all_unreclaimable in direct reclaim
> path
>
>Lisa reported that there are lots of free pages in a zone, but most of
>them are order-0 pages, which means the zone is heavily fragmented.
>A high-order allocation can then cause a long stall (e.g., 50 seconds)
>in the direct reclaim path in a no-swap, no-compaction environment.
>
>The reason is that kswapd can skip scanning the zone because the zone
>has lots of free pages, and kswapd changes the scanning order from
>high-order to order-0 after its first iteration is done, because kswapd
>thinks order-0 allocation is the most important.
>Look at 73ce02e9 in detail.
>
>The problem from that is that only kswapd can set zone->all_unreclaimable
>to 1 at the moment, so the direct reclaim path loops forever until a ghost
>sets zone->all_unreclaimable to 1.
>
>This patch makes the direct reclaim path set zone->all_unreclaimable
>to avoid the infinite loop. So now we don't need a ghost.
>
>Reported-by: Lisa Du <cldu@marvell.com>
>Signed-off-by: Minchan Kim <minchan@kernel.org>
>---
> mm/vmscan.c |   29 ++++++++++++++++++++++++++++-
> 1 file changed, 28 insertions(+), 1 deletion(-)
>
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 33dc256..f957e87 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2317,6 +2317,23 @@ static bool all_unreclaimable(struct zonelist
>*zonelist,
> 	return true;
> }
>
>+static void check_zones_unreclaimable(struct zonelist *zonelist,
>+					struct scan_control *sc)
>+{
>+	struct zoneref *z;
>+	struct zone *zone;
>+
>+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>+			gfp_zone(sc->gfp_mask), sc->nodemask) {
>+		if (!populated_zone(zone))
>+			continue;
>+		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>+			continue;
>+		if (!zone_reclaimable(zone))
>+			zone->all_unreclaimable = 1;
>+	}
>+}
>+
> /*
>  * This is the main entry point to direct page reclaim.
>  *
>@@ -2370,7 +2387,17 @@ static unsigned long
>do_try_to_free_pages(struct zonelist *zonelist,
> 				lru_pages += zone_reclaimable_pages(zone);
> 			}
>
>-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
>+			/*
>+			 * When a zone has enough order-0 free memory but
>+			 * zone is heavily fragmented and we need high order
>+			 * page from the zone, kswapd could skip the zone
>+			 * after first iteration with high order. So, kswapd
>+			 * never set the zone->all_unreclaimable to 1 so
>+			 * direct reclaim path needs the check.
>+			 */
>+			if (!shrink_slab(shrink, sc->nr_scanned, lru_pages))
>+				check_zones_unreclaimable(zonelist, sc);
>+
> 			if (reclaim_state) {
> 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> 				reclaim_state->reclaimed_slab = 0;
>--
>1.7.9.5
>
>--
>Kind regards,
>Minchan Kim


* Re: Possible deadloop in direct reclaim?
  2013-08-01  8:20       ` Lisa Du
@ 2013-08-01  8:42         ` Minchan Kim
  2013-08-02  1:03           ` Lisa Du
  2013-08-02  2:26           ` Minchan Kim
  0 siblings, 2 replies; 36+ messages in thread
From: Minchan Kim @ 2013-08-01  8:42 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, KOSAKI Motohiro

On Thu, Aug 01, 2013 at 01:20:34AM -0700, Lisa Du wrote:
> >-----Original Message-----
> >From: Minchan Kim [mailto:minchan@kernel.org]
> >Sent: August 1, 2013 15:34
> >To: Lisa Du
> >Cc: linux-mm@kvack.org; KOSAKI Motohiro
> >Subject: Re: Possible deadloop in direct reclaim?
> >
> >On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
> >> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
> >> >> Dear Sir:
> >> >> Currently I met a possible deadloop in direct reclaim. After run plenty
> >of
> >> >the application, system run into a status that system memory is very
> >> >fragmentized. Like only order-0 and order-1 memory left.
> >> >> Then one process required a order-2 buffer but it enter an endless
> >direct
> >> >reclaim. From my trace log, I can see this loop already over 200,000
> >times.
> >> >Kswapd was first wake up and then go back to sleep as it cannot
> >rebalance
> >> >this order's memory. But zone->all_unreclaimable remains 1.
> >> >> Though direct_reclaim every time returns no pages, but as
> >> >zone->all_unreclaimable = 1, so it loop again and again. Even when
> >> >zone->pages_scanned also becomes very large. It will block the process
> >for
> >> >long time, until some watchdog thread detect this and kill this process.
> >> >Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe
> >cost
> >> >over 50 seconds or even more.
> >> >> I think it's not as expected right?  Can we also add below check in the
> >> >function all_unreclaimable() to terminate this loop?
> >> >>
> >> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
> >> >*zonelist,
> >> >>                         continue;
> >> >>                 if (!zone->all_unreclaimable)
> >> >>                         return false;
> >> >> +               if (sc->nr_reclaimed == 0
> >&& !zone_reclaimable(zone))
> >> >> +                       return true;
> >> >>         }
> >> >>          BTW: I'm using kernel3.4, I also try to search in the
> >kernel3.9,
> >> >didn't see a possible fix for such issue. Or is anyone also met such issue
> >> >before? Any comment will be welcomed, looking forward to your reply!
> >> >>
> >> >> Thanks!
> >> >
> >> >I'd like to ask somethigs.
> >> >
> >> >1. Do you have enabled swap?
> >> I set CONFIG_SWAP=y, but I didn't really have a swap partition, that
> >means my swap buffer size is 0;
> >> >2. Do you enable CONFIG_COMPACTION?
> >> No, I didn't enable;
> >> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
> >> I dump some info from ramdump, please review:
> >
> >Thanks for the information.
> >You said order-2 allocation was failed so I will assume preferred zone
> >is normal zone, not high zone because high order allocation in kernel side
> >isn't from high zone.
> Yes, that's right!
> >
> >> crash> kmem -z
> >> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
> >>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
> >
> >712M normal memory.
> >
> >>   VM_STAT:
> >>           NR_FREE_PAGES: 16092
> >
> >There are plenty of free pages over high watermark but there are heavy
> >fragmentation as I see below information.
> >
> >So, kswapd doesn't scan this zone loop iteration is done with order-2.
> >I mean kswapd will scan this zone with order-0 if first iteration is
> >done by this
> >
> >        order = sc.order = 0;
> >
> >        goto loop_again;
> >
> >But this time, zone_watermark_ok_safe with testorder = 0 on normal zone
> >is always true so that scanning of zone will be skipped. It means kswapd
> >never set zone->unreclaimable to 1.
> Yes, definitely!
> >
> >>        NR_INACTIVE_ANON: 17
> >>          NR_ACTIVE_ANON: 55091
> >>        NR_INACTIVE_FILE: 17
> >>          NR_ACTIVE_FILE: 17
> >>          NR_UNEVICTABLE: 0
> >>                NR_MLOCK: 0
> >>           NR_ANON_PAGES: 55077
> >
> >There are about 200M anon pages and few file pages.
> >You don't have swap so that reclaimer couldn't go far.
> >
> >>          NR_FILE_MAPPED: 42
> >>           NR_FILE_PAGES: 69
> >>           NR_FILE_DIRTY: 0
> >>            NR_WRITEBACK: 0
> >>     NR_SLAB_RECLAIMABLE: 1226
> >>   NR_SLAB_UNRECLAIMABLE: 9373
> >>            NR_PAGETABLE: 2776
> >>         NR_KERNEL_STACK: 798
> >>         NR_UNSTABLE_NFS: 0
> >>               NR_BOUNCE: 0
> >>         NR_VMSCAN_WRITE: 91
> >>     NR_VMSCAN_IMMEDIATE: 115381
> >>       NR_WRITEBACK_TEMP: 0
> >>        NR_ISOLATED_ANON: 0
> >>        NR_ISOLATED_FILE: 0
> >>                NR_SHMEM: 31
> >>              NR_DIRTIED: 15256
> >>              NR_WRITTEN: 11981
> >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> >>
> >> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
> >>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
> >>   VM_STAT:
> >>           NR_FREE_PAGES: 161
> >
> >Reclaimer should reclaim this zone.
> >
> >>        NR_INACTIVE_ANON: 104
> >>          NR_ACTIVE_ANON: 46114
> >>        NR_INACTIVE_FILE: 9722
> >>          NR_ACTIVE_FILE: 12263
> >
> >It seems there are lots of room to evict file pages.
> >
> >>          NR_UNEVICTABLE: 168
> >>                NR_MLOCK: 0
> >>           NR_ANON_PAGES: 46102
> >>          NR_FILE_MAPPED: 12227
> >>           NR_FILE_PAGES: 22270
> >>           NR_FILE_DIRTY: 1
> >>            NR_WRITEBACK: 0
> >>     NR_SLAB_RECLAIMABLE: 0
> >>   NR_SLAB_UNRECLAIMABLE: 0
> >>            NR_PAGETABLE: 0
> >>         NR_KERNEL_STACK: 0
> >>         NR_UNSTABLE_NFS: 0
> >>               NR_BOUNCE: 0
> >>         NR_VMSCAN_WRITE: 0
> >>     NR_VMSCAN_IMMEDIATE: 0
> >>       NR_WRITEBACK_TEMP: 0
> >>        NR_ISOLATED_ANON: 0
> >>        NR_ISOLATED_FILE: 0
> >>                NR_SHMEM: 117
> >>              NR_DIRTIED: 7364
> >>              NR_WRITTEN: 6989
> >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> >>
> >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
> >START_MAPNR
> >>   0   Normal    192512   16092  c1200000       0            0
> >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> >>   0       4k      c08460f0           3      3
> >>   0       4k      c08460f8         436    436
> >>   0       4k      c0846100       15237  15237
> >>   0       4k      c0846108           0      0
> >>   0       4k      c0846110           0      0
> >>   1       8k      c084611c          39     78
> >>   1       8k      c0846124           0      0
> >>   1       8k      c084612c         169    338
> >>   1       8k      c0846134           0      0
> >>   1       8k      c084613c           0      0
> >>   2      16k      c0846148           0      0
> >>   2      16k      c0846150           0      0
> >>   2      16k      c0846158           0      0
> >> ---------Normal zone all order > 1 has no free pages
> >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
> >START_MAPNR
> >>   1   HighMem    69632     161  c17e0000    2f000000
> >192512
> >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> >>   0       4k      c08464f0          12     12
> >>   0       4k      c08464f8           0      0
> >>   0       4k      c0846500          14     14
> >>   0       4k      c0846508           3      3
> >>   0       4k      c0846510           0      0
> >>   1       8k      c084651c           0      0
> >>   1       8k      c0846524           0      0
> >>   1       8k      c084652c           0      0
> >>   2      16k      c0846548           0      0
> >>   2      16k      c0846550           0      0
> >>   2      16k      c0846558           0      0
> >>   2      16k      c0846560           1      4
> >>   2      16k      c0846568           0      0
> >>   5     128k      c08465cc           0      0
> >>   5     128k      c08465d4           0      0
> >>   5     128k      c08465dc           0      0
> >>   5     128k      c08465e4           4    128
> >>   5     128k      c08465ec           0      0
> >> ------Other's all zero
> >>
> >> Some other zone information I dump from pglist_data
> >> {
> >> 	watermark = {853, 1066, 1279},
> >>       percpu_drift_mark = 0,
> >>       lowmem_reserve = {0, 2159, 2159},
> >>       dirty_balance_reserve = 3438,
> >>       pageset = 0xc07f6144,
> >>       lock = {
> >>         {
> >>           rlock = {
> >>             raw_lock = {
> >>               lock = 0
> >>             },
> >>             break_lock = 0
> >>           }
> >>         }
> >>       },
> >> 	all_unreclaimable = 0,
> >>       reclaim_stat = {
> >>         recent_rotated = {903355, 960912},
> >>         recent_scanned = {932404, 2462017}
> >>       },
> >>       pages_scanned = 84231,
> >
> >Most of scan happens in direct reclaim path, I guess
> >but direct reclaim couldn't reclaim any pages due to lack of swap device.
> >
> >It means we have to set zone->all_unreclaimable in direct reclaim path,
> >too.
> >Below patch fix your problem?
> Yes, your patch should fix my problem! 
> Actually I also did another patch which, after testing, also fixes my issue,
> but I didn't set zone->all_unreclaimable in the direct reclaim path as you did;
> I just double-check the zone_reclaimable() status in the all_unreclaimable() function.
> Maybe your patch is better!

Nope. I think your patch is better. :)
The remaining work is the analysis of the problem and the description; I think
we could do better there, but unfortunately I don't have enough time today, so
I will look at it tomorrow.
Just a nitpick below.

Thanks.

> 
> commit 26d2b60d06234683a81666da55129f9c982271a5
> Author: Lisa Du <cldu@marvell.com>
> Date:   Thu Aug 1 10:16:32 2013 +0800
> 
>     mm: fix infinite direct_reclaim when memory is heavily fragmented
>     
>     The latest all_unreclaimable check in direct reclaim was introduced by
>     commit 929bea7c (2011 Apr 14; "vmscan: all_unreclaimable() use
>     zone->all_unreclaimable as a name"), which in addition added an
>     oom_killer_disabled check to avoid reintroducing the issue of commit
>     d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
>     
>     But besides the hibernation case, in which kswapd is frozen, there is
>     another case which may lead to an infinite loop in direct reclaim. In a
>     real test, a direct reclaimer did over 200,000 rebalance iterations in
>     __alloc_pages_slowpath(), so the process was blocked until a watchdog
>     detected and killed it. The root cause is as below:
>     
>     If system memory is heavily fragmented, with only order-0 and order-1
>     pages left, kswapd will go to sleep because the system cannot be
>     rebalanced for high-order allocations. But direct reclaim still runs
>     for higher-order requests, so a zone can end up in a state where
>     zone->all_unreclaimable = 0 but zone->pages_scanned > zone_reclaimable_pages(zone) * 6.
>     In this case, if a process such as do_fork tries to allocate order-2
>     memory, which is below COSTLY_ORDER, direct reclaim always reports
>     did_some_progress, so __alloc_pages_slowpath() rebalances again and
>     again. This issue easily happens in a no-swap, no-compaction
>     environment.
>     
>     So add a further check in all_unreclaimable() to avoid such a case.
>     
>     Change-Id: Id3266b47c63f5b96aab466fd9f1f44d37e16cdcb
>     Signed-off-by: Lisa Du <cldu@marvell.com>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2cff0d4..34582d9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2301,7 +2301,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                         continue;
>                 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                         continue;
> -               if (!zone->all_unreclaimable)
> +               if (zone->all_unreclaimable)
> +                       continue;

Nitpick: If we use zone_reclaimable(), the above check is redundant, and its
gain is very tiny because this path is already slow.

> +               if (zone_reclaimable(zone))
>                         return false;
>         }
> >
> >From a5d82159b98f3d90c2f9ff9e486699fb4c67cced Mon Sep 17 00:00:00
> >2001
> >From: Minchan Kim <minchan@kernel.org>
> >Date: Thu, 1 Aug 2013 16:18:00 +0900
> >Subject:[PATCH] mm: set zone->all_unreclaimable in direct reclaim
> > path
> >
> >Lisa reported that there are lots of free pages in a zone, but most of
> >them are order-0 pages, which means the zone is heavily fragmented.
> >A high-order allocation could then make the direct reclaim path stall
> >for a long time (e.g., 50 seconds) in a no-swap, no-compaction
> >environment.
> >
> >The reason is that kswapd can skip scanning the zone because the zone
> >has lots of free pages, and kswapd changes the scanning order from
> >high-order to order-0 after its first iteration is done, because kswapd
> >thinks order-0 allocations are the most important.
> >Look at 73ce02e9 for detail.
> >
> >The problem is that only kswapd can set zone->all_unreclaimable to 1 at
> >the moment, so the direct reclaim path loops forever until a ghost can
> >set zone->all_unreclaimable to 1.
> >
> >This patch makes the direct reclaim path set zone->all_unreclaimable to
> >avoid the infinite loop. So now we don't need a ghost.
> >
> >Reported-by: Lisa Du <cldu@marvell.com>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> > mm/vmscan.c |   29 ++++++++++++++++++++++++++++-
> > 1 file changed, 28 insertions(+), 1 deletion(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index 33dc256..f957e87 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -2317,6 +2317,23 @@ static bool all_unreclaimable(struct zonelist
> >*zonelist,
> > 	return true;
> > }
> >
> >+static void check_zones_unreclaimable(struct zonelist *zonelist,
> >+					struct scan_control *sc)
> >+{
> >+	struct zoneref *z;
> >+	struct zone *zone;
> >+
> >+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >+			gfp_zone(sc->gfp_mask), sc->nodemask) {
> >+		if (!populated_zone(zone))
> >+			continue;
> >+		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >+			continue;
> >+		if (!zone_reclaimable(zone))
> >+			zone->all_unreclaimable = 1;
> >+	}
> >+}
> >+
> > /*
> >  * This is the main entry point to direct page reclaim.
> >  *
> >@@ -2370,7 +2387,17 @@ static unsigned long
> >do_try_to_free_pages(struct zonelist *zonelist,
> > 				lru_pages += zone_reclaimable_pages(zone);
> > 			}
> >
> >-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
> >+			/*
> >+			 * When a zone has enough order-0 free memory but
> >+			 * zone is heavily fragmented and we need high order
> >+			 * page from the zone, kswapd could skip the zone
> >+			 * after first iteration with high order. So, kswapd
> >+			 * never set the zone->all_unreclaimable to 1 so
> >+			 * direct reclaim path needs the check.
> >+			 */
> >+			if (!shrink_slab(shrink, sc->nr_scanned, lru_pages))
> >+				check_zones_unreclaimable(zonelist, sc);
> >+
> > 			if (reclaim_state) {
> > 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > 				reclaim_state->reclaimed_slab = 0;
> >--
> >1.7.9.5
> >
> >--
> >Kind regards,
> >Minchan Kim

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-08-01  5:19               ` Lisa Du
@ 2013-08-01  8:56                 ` Russell King - ARM Linux
  2013-08-02  1:18                   ` Lisa Du
  0 siblings, 1 reply; 36+ messages in thread
From: Russell King - ARM Linux @ 2013-08-01  8:56 UTC (permalink / raw)
  To: Lisa Du
  Cc: KOSAKI Motohiro, Christoph Lameter, linux-mm, Mel Gorman,
	Bob Liu, Neil Zhang

On Wed, Jul 31, 2013 at 10:19:53PM -0700, Lisa Du wrote:
> >fork alloc order-1 memory for stack. Where and why alloc order-2? If it is
> >arch specific code, please
> >contact arch maintainer.
> Yes arch do_fork allocate order-2 memory when copy_process. 
> Hi, Russel
> What's your opinion about this question?  
> If we really need order-2 memory for fork, then we'd better set
> CONFIG_COMPATION right?

Well, I gave up trying to read the original messages because the quoting
style is a total mess, so I don't have a full understanding of what the
issue is.

However, we have always required order-2 memory for fork, going back to
the 1.x kernel days - it's fundamental to ARM to have that.  The order-2
allocation is for the 1st level page table.  No order-2 allocation, no
page tables for the new thread.

Looking at this commit:

commit 05106e6a54aed321191b4bb5c9ee09538cbad3b1
Author: Rik van Riel <riel@redhat.com>
Date:   Mon Oct 8 16:33:03 2012 -0700

    mm: enable CONFIG_COMPACTION by default

    Now that lumpy reclaim has been removed, compaction is the only way left
    to free up contiguous memory areas.  It is time to just enable
    CONFIG_COMPACTION by default.

it seems to indicate that everyone should have this enabled - however,
the way the change has been done, anyone building from defconfigs before
that change will not have that option enabled.

So yes, this option should be turned on.
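For anyone building from an old defconfig, the change on the configuration side is simply enabling compaction in the kernel .config (option names here are from mainline Kconfig; CONFIG_MIGRATION is normally selected automatically by CONFIG_COMPACTION):

```text
# .config fragment: let the allocator defragment memory instead of
# looping in direct reclaim for high-order requests.
CONFIG_COMPACTION=y
# page migration, selected by COMPACTION on mainline kernels:
CONFIG_MIGRATION=y
```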


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-08-01  8:42         ` Minchan Kim
@ 2013-08-02  1:03           ` Lisa Du
  2013-08-02  2:26           ` Minchan Kim
  1 sibling, 0 replies; 36+ messages in thread
From: Lisa Du @ 2013-08-02  1:03 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, KOSAKI Motohiro

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 16575 bytes --]

>-----Original Message-----
>From: Minchan Kim [mailto:minchan@kernel.org]
>Sent: 2013年8月1日 16:43
>To: Lisa Du
>Cc: linux-mm@kvack.org; KOSAKI Motohiro
>Subject: Re: Possible deadloop in direct reclaim?
>
>On Thu, Aug 01, 2013 at 01:20:34AM -0700, Lisa Du wrote:
>> >-----Original Message-----
>> >From: Minchan Kim [mailto:minchan@kernel.org]
>> >Sent: 2013年8月1日 15:34
>> >To: Lisa Du
>> >Cc: linux-mm@kvack.org; KOSAKI Motohiro
>> >Subject: Re: Possible deadloop in direct reclaim?
>> >
>> >On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
>> >> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
>> >> >> Dear Sir:
>> >> >> Currently I met a possible deadloop in direct reclaim. After run
>plenty
>> >of
>> >> >the application, system run into a status that system memory is very
>> >> >fragmentized. Like only order-0 and order-1 memory left.
>> >> >> Then one process required a order-2 buffer but it enter an endless
>> >direct
>> >> >reclaim. From my trace log, I can see this loop already over 200,000
>> >times.
>> >> >Kswapd was first wake up and then go back to sleep as it cannot
>> >rebalance
>> >> >this order's memory. But zone->all_unreclaimable remains 1.
>> >> >> Though direct_reclaim every time returns no pages, but as
>> >> >zone->all_unreclaimable = 1, so it loop again and again. Even when
>> >> >zone->pages_scanned also becomes very large. It will block the
>process
>> >for
>> >> >long time, until some watchdog thread detect this and kill this
>process.
>> >> >Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe
>> >cost
>> >> >over 50 seconds or even more.
>> >> >> I think it's not as expected right?  Can we also add below check in
>the
>> >> >function all_unreclaimable() to terminate this loop?
>> >> >>
>> >> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct
>zonelist
>> >> >*zonelist,
>> >> >>                         continue;
>> >> >>                 if (!zone->all_unreclaimable)
>> >> >>                         return false;
>> >> >> +               if (sc->nr_reclaimed == 0
>> >&& !zone_reclaimable(zone))
>> >> >> +                       return true;
>> >> >>         }
>> >> >>          BTW: I'm using kernel3.4, I also try to search in the
>> >kernel3.9,
>> >> >didn't see a possible fix for such issue. Or is anyone also met such
>issue
>> >> >before? Any comment will be welcomed, looking forward to your
>reply!
>> >> >>
>> >> >> Thanks!
>> >> >
>> >> >I'd like to ask somethigs.
>> >> >
>> >> >1. Do you have enabled swap?
>> >> I set CONFIG_SWAP=y, but I didn't really have a swap partition, that
>> >means my swap buffer size is 0;
>> >> >2. Do you enable CONFIG_COMPACTION?
>> >> No, I didn't enable;
>> >> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
>> >> I dump some info from ramdump, please review:
>> >
>> >Thanks for the information.
>> >You said order-2 allocation was failed so I will assume preferred zone
>> >is normal zone, not high zone because high order allocation in kernel
>side
>> >isn't from high zone.
>> Yes, that's right!
>> >
>> >> crash> kmem -z
>> >> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
>> >>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
>> >
>> >712M normal memory.
>> >
>> >>   VM_STAT:
>> >>           NR_FREE_PAGES: 16092
>> >
>> >There are plenty of free pages over high watermark but there are heavy
>> >fragmentation as I see below information.
>> >
>> >So, kswapd doesn't scan this zone loop iteration is done with order-2.
>> >I mean kswapd will scan this zone with order-0 if first iteration is
>> >done by this
>> >
>> >        order = sc.order = 0;
>> >
>> >        goto loop_again;
>> >
>> >But this time, zone_watermark_ok_safe with testorder = 0 on normal
>zone
>> >is always true so that scanning of zone will be skipped. It means kswapd
>> >never set zone->unreclaimable to 1.
>> Yes, definitely!
>> >
>> >>        NR_INACTIVE_ANON: 17
>> >>          NR_ACTIVE_ANON: 55091
>> >>        NR_INACTIVE_FILE: 17
>> >>          NR_ACTIVE_FILE: 17
>> >>          NR_UNEVICTABLE: 0
>> >>                NR_MLOCK: 0
>> >>           NR_ANON_PAGES: 55077
>> >
>> >There are about 200M anon pages and few file pages.
>> >You don't have swap so that reclaimer couldn't go far.
>> >
>> >>          NR_FILE_MAPPED: 42
>> >>           NR_FILE_PAGES: 69
>> >>           NR_FILE_DIRTY: 0
>> >>            NR_WRITEBACK: 0
>> >>     NR_SLAB_RECLAIMABLE: 1226
>> >>   NR_SLAB_UNRECLAIMABLE: 9373
>> >>            NR_PAGETABLE: 2776
>> >>         NR_KERNEL_STACK: 798
>> >>         NR_UNSTABLE_NFS: 0
>> >>               NR_BOUNCE: 0
>> >>         NR_VMSCAN_WRITE: 91
>> >>     NR_VMSCAN_IMMEDIATE: 115381
>> >>       NR_WRITEBACK_TEMP: 0
>> >>        NR_ISOLATED_ANON: 0
>> >>        NR_ISOLATED_FILE: 0
>> >>                NR_SHMEM: 31
>> >>              NR_DIRTIED: 15256
>> >>              NR_WRITTEN: 11981
>> >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
>> >>
>> >> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
>> >>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
>> >>   VM_STAT:
>> >>           NR_FREE_PAGES: 161
>> >
>> >Reclaimer should reclaim this zone.
>> >
>> >>        NR_INACTIVE_ANON: 104
>> >>          NR_ACTIVE_ANON: 46114
>> >>        NR_INACTIVE_FILE: 9722
>> >>          NR_ACTIVE_FILE: 12263
>> >
>> >It seems there are lots of room to evict file pages.
>> >
>> >>          NR_UNEVICTABLE: 168
>> >>                NR_MLOCK: 0
>> >>           NR_ANON_PAGES: 46102
>> >>          NR_FILE_MAPPED: 12227
>> >>           NR_FILE_PAGES: 22270
>> >>           NR_FILE_DIRTY: 1
>> >>            NR_WRITEBACK: 0
>> >>     NR_SLAB_RECLAIMABLE: 0
>> >>   NR_SLAB_UNRECLAIMABLE: 0
>> >>            NR_PAGETABLE: 0
>> >>         NR_KERNEL_STACK: 0
>> >>         NR_UNSTABLE_NFS: 0
>> >>               NR_BOUNCE: 0
>> >>         NR_VMSCAN_WRITE: 0
>> >>     NR_VMSCAN_IMMEDIATE: 0
>> >>       NR_WRITEBACK_TEMP: 0
>> >>        NR_ISOLATED_ANON: 0
>> >>        NR_ISOLATED_FILE: 0
>> >>                NR_SHMEM: 117
>> >>              NR_DIRTIED: 7364
>> >>              NR_WRITTEN: 6989
>> >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
>> >>
>> >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
>> >START_MAPNR
>> >>   0   Normal    192512   16092  c1200000       0
>0
>> >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
>> >>   0       4k      c08460f0           3      3
>> >>   0       4k      c08460f8         436    436
>> >>   0       4k      c0846100       15237  15237
>> >>   0       4k      c0846108           0      0
>> >>   0       4k      c0846110           0      0
>> >>   1       8k      c084611c          39     78
>> >>   1       8k      c0846124           0      0
>> >>   1       8k      c084612c         169    338
>> >>   1       8k      c0846134           0      0
>> >>   1       8k      c084613c           0      0
>> >>   2      16k      c0846148           0      0
>> >>   2      16k      c0846150           0      0
>> >>   2      16k      c0846158           0      0
>> >> ---------Normal zone all order > 1 has no free pages
>> >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
>> >START_MAPNR
>> >>   1   HighMem    69632     161  c17e0000    2f000000
>> >192512
>> >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
>> >>   0       4k      c08464f0          12     12
>> >>   0       4k      c08464f8           0      0
>> >>   0       4k      c0846500          14     14
>> >>   0       4k      c0846508           3      3
>> >>   0       4k      c0846510           0      0
>> >>   1       8k      c084651c           0      0
>> >>   1       8k      c0846524           0      0
>> >>   1       8k      c084652c           0      0
>> >>   2      16k      c0846548           0      0
>> >>   2      16k      c0846550           0      0
>> >>   2      16k      c0846558           0      0
>> >>   2      16k      c0846560           1      4
>> >>   2      16k      c0846568           0      0
>> >>   5     128k      c08465cc           0      0
>> >>   5     128k      c08465d4           0      0
>> >>   5     128k      c08465dc           0      0
>> >>   5     128k      c08465e4           4    128
>> >>   5     128k      c08465ec           0      0
>> >> ------Other's all zero
>> >>
>> >> Some other zone information I dump from pglist_data
>> >> {
>> >>   watermark = {853, 1066, 1279},
>> >>       percpu_drift_mark = 0,
>> >>       lowmem_reserve = {0, 2159, 2159},
>> >>       dirty_balance_reserve = 3438,
>> >>       pageset = 0xc07f6144,
>> >>       lock = {
>> >>         {
>> >>           rlock = {
>> >>             raw_lock = {
>> >>               lock = 0
>> >>             },
>> >>             break_lock = 0
>> >>           }
>> >>         }
>> >>       },
>> >>   all_unreclaimable = 0,
>> >>       reclaim_stat = {
>> >>         recent_rotated = {903355, 960912},
>> >>         recent_scanned = {932404, 2462017}
>> >>       },
>> >>       pages_scanned = 84231,
>> >
>> >Most of scan happens in direct reclaim path, I guess
>> >but direct reclaim couldn't reclaim any pages due to lack of swap device.
>> >
>> >It means we have to set zone->all_unreclaimable in direct reclaim path,
>> >too.
>> >Below patch fix your problem?
>> Yes, your patch should fix my problem!
>> Actually I also did another patch, after test, should also fix my issue,
>> but I didn't set zone->all_unreclaimable in direct reclaim path as you,
>> just double check zone_reclaimable() status in all_unreclaimable()
>function.
>> Maybe your patch is better!
>
>Nope. I think your patch is better. :)
>Just thing is anlaysis of the problem and description and I think we could
>do
>better but unfortunately, I don't have enough time today so I will see
>tomorrow.
>Just nitpick below.
>
>Thanks.
>
>>
>> commit 26d2b60d06234683a81666da55129f9c982271a5
>> Author: Lisa Du <cldu@marvell.com>
>> Date:   Thu Aug 1 10:16:32 2013 +0800
>>
>>     mm: fix infinite direct_reclaim when memory is very fragmentized
>>
>>     latest all_unreclaimable check in direct reclaim is the following
>commit.
>>     2011 Apr 14; commit 929bea7c; vmscan:  all_unreclaimable() use
>>                                 zone->all_unreclaimable as a name
>>     and in addition, add oom_killer_disabled check to avoid reintroduce
>the
>>     issue of commit d1908362 ("vmscan: check all_unreclaimable in
>direct reclaim path").
>>
>>     But except the hibernation case in which kswapd is freezed, there's
>also other case
>>     which may lead infinite loop in direct relaim. In a real test,
>direct_relaimer did
>>     over 200000 times rebalance in __alloc_pages_slowpath(), so this
>process will be
>>     blocked until watchdog detect and kill it. The root cause is as below:
>>
>>     If system memory is very fragmentized like only order-0 and order-1
>left,
>>     kswapd will go to sleep as system cann't rebalanced for high-order
>allocations.
>>     But direct_reclaim still works for higher order request. So zones can
>become a state
>>     zone->all_unreclaimable = 0 but zone->pages_scanned >
>zone_reclaimable_pages(zone) * 6.
>>     In this case if a process like do_fork try to allocate an order-2
>memory which is not
>>     a COSTLY_ORDER, as direct_reclaim always said it
>did_some_progress, so rebalance again
>>     and again in __alloc_pages_slowpath(). This issue is easily happen in
>no swap and no
>>     compaction enviroment.
>>
>>     So add furthur check in all_unreclaimable() to avoid such case.
>>
>>     Change-Id: Id3266b47c63f5b96aab466fd9f1f44d37e16cdcb
>>     Signed-off-by: Lisa Du <cldu@marvell.com>
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cff0d4..34582d9 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2301,7 +2301,9 @@ static bool all_unreclaimable(struct zonelist
>*zonelist,
>>                         continue;
>>                 if (!cpuset_zone_allowed_hardwall(zone,
>GFP_KERNEL))
>>                         continue;
>> -               if (!zone->all_unreclaimable)
>> +               if (zone->all_unreclaimable)
>> +                       continue;
>
>Nitpick: If we use zone_reclaimable(), above check is redundant and
>gain is very tiny because this path is already slow.
Yes, I agree. I added the above check just to avoid the issue KOSAKI met, which was fixed by commit 929bea7c.
In short, it guards against the case where zone->all_unreclaimable = 1 but zone->pages_scanned = 0, so checking zone_reclaimable() alone is not enough.
>
>> +               if (zone_reclaimable(zone))
>>                         return false;
>>         }
>> >
>> >From a5d82159b98f3d90c2f9ff9e486699fb4c67cced Mon Sep 17 00:00:00
>> >2001
>> >From: Minchan Kim <minchan@kernel.org>
>> >Date: Thu, 1 Aug 2013 16:18:00 +0900
>> >Subject:[PATCH] mm: set zone->all_unreclaimable in direct reclaim
>> > path
>> >
>> >Lisa reported there are lots of free pages in a zone but most of them
>> >is order-0 pages so it means the zone is heavily fragemented.
>> >Then, high order allocation could make direct reclaim path'slong stall(
>> >ex, 50 second) in no swap and no compaction environment.
>> >
>> >The reason is kswapd can skip the zone's scanning because the zone
>> >is lots of free pages and kswapd changes scanning order from high-order
>> >to 0-order after his first iteration is done because kswapd think
>> >order-0 allocation is the most important.
>> >Look at 73ce02e9 in detail.
>> >
>> >The problem from that is that only kswapd can set
>zone->all_unreclaimable
>> >to 1 at the moment so direct reclaim path should loop forever until a
>ghost
>> >can set the zone->all_unreclaimable to 1.
>> >
>> >This patch makes direct reclaim path to set zone->all_unreclaimable
>> >to avoid infinite loop. So now we don't need a ghost.
>> >
>> >Reported-by: Lisa Du <cldu@marvell.com>
>> >Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >---
>> > mm/vmscan.c |   29 ++++++++++++++++++++++++++++-
>> > 1 file changed, 28 insertions(+), 1 deletion(-)
>> >
>> >diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >index 33dc256..f957e87 100644
>> >--- a/mm/vmscan.c
>> >+++ b/mm/vmscan.c
>> >@@ -2317,6 +2317,23 @@ static bool all_unreclaimable(struct zonelist
>> >*zonelist,
>> >    return true;
>> > }
>> >
>> >+static void check_zones_unreclaimable(struct zonelist *zonelist,
>> >+                                   struct scan_control *sc)
>> >+{
>> >+   struct zoneref *z;
>> >+   struct zone *zone;
>> >+
>> >+   for_each_zone_zonelist_nodemask(zone, z, zonelist,
>> >+                   gfp_zone(sc->gfp_mask), sc->nodemask) {
>> >+           if (!populated_zone(zone))
>> >+                   continue;
>> >+           if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>> >+                   continue;
>> >+           if (!zone_reclaimable(zone))
>> >+                   zone->all_unreclaimable = 1;
>> >+   }
>> >+}
>> >+
>> > /*
>> >  * This is the main entry point to direct page reclaim.
>> >  *
>> >@@ -2370,7 +2387,17 @@ static unsigned long
>> >do_try_to_free_pages(struct zonelist *zonelist,
>> >                            lru_pages += zone_reclaimable_pages(zone);
>> >                    }
>> >
>> >-                   shrink_slab(shrink, sc->nr_scanned, lru_pages);
>> >+                   /*
>> >+                    * When a zone has enough order-0 free memory but
>> >+                    * zone is heavily fragmented and we need high order
>> >+                    * page from the zone, kswapd could skip the zone
>> >+                    * after first iteration with high order. So, kswapd
>> >+                    * never set the zone->all_unreclaimable to 1 so
>> >+                    * direct reclaim path needs the check.
>> >+                    */
>> >+                   if (!shrink_slab(shrink, sc->nr_scanned, lru_pages))
>> >+                           check_zones_unreclaimable(zonelist, sc);
>> >+
>> >                    if (reclaim_state) {
>> >                            sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>> >                            reclaim_state->reclaimed_slab = 0;
>> >--
>> >1.7.9.5
>> >
>> >--
>> >Kind regards,
>> >Minchan Kim
>
>--
>Kind regards,
>Minchan Kim

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Possible deadloop in direct reclaim?
  2013-08-01  8:56                 ` Russell King - ARM Linux
@ 2013-08-02  1:18                   ` Lisa Du
  0 siblings, 0 replies; 36+ messages in thread
From: Lisa Du @ 2013-08-02  1:18 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: KOSAKI Motohiro, Christoph Lameter, linux-mm, Mel Gorman,
	Bob Liu, Neil Zhang

>-----Original Message-----
>From: Russell King - ARM Linux [mailto:linux@arm.linux.org.uk]
>Sent: 2013年8月1日 16:57
>To: Lisa Du
>Cc: KOSAKI Motohiro; Christoph Lameter; linux-mm@kvack.org; Mel
>Gorman; Bob Liu; Neil Zhang
>Subject: Re: Possible deadloop in direct reclaim?
>
>On Wed, Jul 31, 2013 at 10:19:53PM -0700, Lisa Du wrote:
>> >fork alloc order-1 memory for stack. Where and why alloc order-2? If it is
>> >arch specific code, please
>> >contact arch maintainer.
>> Yes arch do_fork allocate order-2 memory when copy_process.
>> Hi, Russel
>> What's your opinion about this question?
>> If we really need order-2 memory for fork, then we'd better set
>> CONFIG_COMPATION right?
>
>Well, I gave up trying to read the original messages because the quoting
>style is a total mess, so I don't have a full understanding of what the
>issue is.
I'm really sorry for my quoting style; I'll avoid such issues in the future!
>
>However, we have always required order-2 memory for fork, going back to
>the 1.x kernel days - it's fundamental to ARM to have that.  The order-2
>allocation os for the 1st level page table.  No order-2 allocation, no
>page tables for the new thread.
>
>Looking at this commit:
>
>commit 05106e6a54aed321191b4bb5c9ee09538cbad3b1
>Author: Rik van Riel <riel@redhat.com>
>Date:   Mon Oct 8 16:33:03 2012 -0700
>
>    mm: enable CONFIG_COMPACTION by default
>
>    Now that lumpy reclaim has been removed, compaction is the only
>way left
>    to free up contiguous memory areas.  It is time to just enable
>    CONFIG_COMPACTION by default.
>
>it seems to indicate that everyone should have this enabled - however,
>the way the change has been done, anyone building from defconfigs before
>that change will not have that option enabled.
>
>So yes, this option should be turned on.
Thanks, Russell!
I think I have got the information I wanted. I really appreciate your explanation!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Possible deadloop in direct reclaim?
  2013-08-01  8:42         ` Minchan Kim
  2013-08-02  1:03           ` Lisa Du
@ 2013-08-02  2:26           ` Minchan Kim
  2013-08-02  2:33             ` Minchan Kim
  2013-08-02  3:17             ` Lisa Du
  1 sibling, 2 replies; 36+ messages in thread
From: Minchan Kim @ 2013-08-02  2:26 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, KOSAKI Motohiro

Hello Lisa and KOSAKI,

Lisa's quoting style is very hard to follow, so I'd like to reply at the bottom
rather than line by line.

Lisa, please correct your MUA.

On Thu, Aug 01, 2013 at 05:42:59PM +0900, Minchan Kim wrote:
> On Thu, Aug 01, 2013 at 01:20:34AM -0700, Lisa Du wrote:
> > >-----Original Message-----
> > >From: Minchan Kim [mailto:minchan@kernel.org]
> > >Sent: 2013年8月1日 15:34
> > >To: Lisa Du
> > >Cc: linux-mm@kvack.org; KOSAKI Motohiro
> > >Subject: Re: Possible deadloop in direct reclaim?
> > >
> > >On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
> > >> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
> > >> >> Dear Sir:
> > >> >> Currently I met a possible deadloop in direct reclaim. After run plenty
> > >of
> > >> >the application, system run into a status that system memory is very
> > >> >fragmentized. Like only order-0 and order-1 memory left.
> > >> >> Then one process required a order-2 buffer but it enter an endless
> > >direct
> > >> >reclaim. From my trace log, I can see this loop already over 200,000
> > >times.
> > >> >Kswapd was first wake up and then go back to sleep as it cannot
> > >rebalance
> > >> >this order's memory. But zone->all_unreclaimable remains 1.
> > >> >> Though direct_reclaim every time returns no pages, but as
> > >> >zone->all_unreclaimable = 1, so it loop again and again. Even when
> > >> >zone->pages_scanned also becomes very large. It will block the process
> > >for
> > >> >long time, until some watchdog thread detect this and kill this process.
> > >> >Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe
> > >cost
> > >> >over 50 seconds or even more.
> > >> >> I think it's not as expected right?  Can we also add below check in the
> > >> >function all_unreclaimable() to terminate this loop?
> > >> >>
> > >> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
> > >> >*zonelist,
> > >> >>                         continue;
> > >> >>                 if (!zone->all_unreclaimable)
> > >> >>                         return false;
> > >> >> +               if (sc->nr_reclaimed == 0
> > >&& !zone_reclaimable(zone))
> > >> >> +                       return true;
> > >> >>         }
> > >> >>          BTW: I'm using kernel3.4, I also try to search in the
> > >kernel3.9,
> > >> >didn't see a possible fix for such issue. Or is anyone also met such issue
> > >> >before? Any comment will be welcomed, looking forward to your reply!
> > >> >>
> > >> >> Thanks!
> > >> >
> > >> >I'd like to ask some things.
> > >> >
> > >> >1. Do you have enabled swap?
> > >> I set CONFIG_SWAP=y, but I didn't really have a swap partition, that
> > >means my swap buffer size is 0;
> > >> >2. Do you enable CONFIG_COMPACTION?
> > >> No, I didn't enable;
> > >> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
> > >> I dump some info from ramdump, please review:
> > >
> > >Thanks for the information.
> > >You said the order-2 allocation failed, so I will assume the preferred zone
> > >is the normal zone, not the highmem zone, because kernel-side high-order
> > >allocations aren't served from highmem.
> > Yes, that's right!
> > >
> > >> crash> kmem -z
> > >> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
> > >>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
> > >
> > >712M normal memory.
> > >
> > >>   VM_STAT:
> > >>           NR_FREE_PAGES: 16092
> > >
> > >There are plenty of free pages over high watermark but there are heavy
> > >fragmentation as I see below information.
> > >
> > >So, kswapd doesn't scan this zone while the loop iteration is done with
> > >order-2. I mean kswapd will scan this zone with order-0 once the first
> > >iteration is done, via this:
> > >
> > >        order = sc.order = 0;
> > >
> > >        goto loop_again;
> > >
> > >But this time, zone_watermark_ok_safe with testorder = 0 on the normal zone
> > >is always true, so scanning of the zone will be skipped. It means kswapd
> > >never sets zone->all_unreclaimable to 1.
> > Yes, definitely!
> > >
> > >>        NR_INACTIVE_ANON: 17
> > >>          NR_ACTIVE_ANON: 55091
> > >>        NR_INACTIVE_FILE: 17
> > >>          NR_ACTIVE_FILE: 17
> > >>          NR_UNEVICTABLE: 0
> > >>                NR_MLOCK: 0
> > >>           NR_ANON_PAGES: 55077
> > >
> > >There are about 200M anon pages and few file pages.
> > >You don't have swap so that reclaimer couldn't go far.
> > >
> > >>          NR_FILE_MAPPED: 42
> > >>           NR_FILE_PAGES: 69
> > >>           NR_FILE_DIRTY: 0
> > >>            NR_WRITEBACK: 0
> > >>     NR_SLAB_RECLAIMABLE: 1226
> > >>   NR_SLAB_UNRECLAIMABLE: 9373
> > >>            NR_PAGETABLE: 2776
> > >>         NR_KERNEL_STACK: 798
> > >>         NR_UNSTABLE_NFS: 0
> > >>               NR_BOUNCE: 0
> > >>         NR_VMSCAN_WRITE: 91
> > >>     NR_VMSCAN_IMMEDIATE: 115381
> > >>       NR_WRITEBACK_TEMP: 0
> > >>        NR_ISOLATED_ANON: 0
> > >>        NR_ISOLATED_FILE: 0
> > >>                NR_SHMEM: 31
> > >>              NR_DIRTIED: 15256
> > >>              NR_WRITTEN: 11981
> > >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> > >>
> > >> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
> > >>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
> > >>   VM_STAT:
> > >>           NR_FREE_PAGES: 161
> > >
> > >Reclaimer should reclaim this zone.
> > >
> > >>        NR_INACTIVE_ANON: 104
> > >>          NR_ACTIVE_ANON: 46114
> > >>        NR_INACTIVE_FILE: 9722
> > >>          NR_ACTIVE_FILE: 12263
> > >
> > >It seems there are lots of room to evict file pages.
> > >
> > >>          NR_UNEVICTABLE: 168
> > >>                NR_MLOCK: 0
> > >>           NR_ANON_PAGES: 46102
> > >>          NR_FILE_MAPPED: 12227
> > >>           NR_FILE_PAGES: 22270
> > >>           NR_FILE_DIRTY: 1
> > >>            NR_WRITEBACK: 0
> > >>     NR_SLAB_RECLAIMABLE: 0
> > >>   NR_SLAB_UNRECLAIMABLE: 0
> > >>            NR_PAGETABLE: 0
> > >>         NR_KERNEL_STACK: 0
> > >>         NR_UNSTABLE_NFS: 0
> > >>               NR_BOUNCE: 0
> > >>         NR_VMSCAN_WRITE: 0
> > >>     NR_VMSCAN_IMMEDIATE: 0
> > >>       NR_WRITEBACK_TEMP: 0
> > >>        NR_ISOLATED_ANON: 0
> > >>        NR_ISOLATED_FILE: 0
> > >>                NR_SHMEM: 117
> > >>              NR_DIRTIED: 7364
> > >>              NR_WRITTEN: 6989
> > >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> > >>
> > >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
> > >START_MAPNR
> > >>   0   Normal    192512   16092  c1200000       0            0
> > >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> > >>   0       4k      c08460f0           3      3
> > >>   0       4k      c08460f8         436    436
> > >>   0       4k      c0846100       15237  15237
> > >>   0       4k      c0846108           0      0
> > >>   0       4k      c0846110           0      0
> > >>   1       8k      c084611c          39     78
> > >>   1       8k      c0846124           0      0
> > >>   1       8k      c084612c         169    338
> > >>   1       8k      c0846134           0      0
> > >>   1       8k      c084613c           0      0
> > >>   2      16k      c0846148           0      0
> > >>   2      16k      c0846150           0      0
> > >>   2      16k      c0846158           0      0
> > >> ---------Normal zone all order > 1 has no free pages
> > >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
> > >START_MAPNR
> > >>   1   HighMem    69632     161  c17e0000    2f000000
> > >192512
> > >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> > >>   0       4k      c08464f0          12     12
> > >>   0       4k      c08464f8           0      0
> > >>   0       4k      c0846500          14     14
> > >>   0       4k      c0846508           3      3
> > >>   0       4k      c0846510           0      0
> > >>   1       8k      c084651c           0      0
> > >>   1       8k      c0846524           0      0
> > >>   1       8k      c084652c           0      0
> > >>   2      16k      c0846548           0      0
> > >>   2      16k      c0846550           0      0
> > >>   2      16k      c0846558           0      0
> > >>   2      16k      c0846560           1      4
> > >>   2      16k      c0846568           0      0
> > >>   5     128k      c08465cc           0      0
> > >>   5     128k      c08465d4           0      0
> > >>   5     128k      c08465dc           0      0
> > >>   5     128k      c08465e4           4    128
> > >>   5     128k      c08465ec           0      0
> > >> ------Other's all zero
> > >>
> > >> Some other zone information I dump from pglist_data
> > >> {
> > >> 	watermark = {853, 1066, 1279},
> > >>       percpu_drift_mark = 0,
> > >>       lowmem_reserve = {0, 2159, 2159},
> > >>       dirty_balance_reserve = 3438,
> > >>       pageset = 0xc07f6144,
> > >>       lock = {
> > >>         {
> > >>           rlock = {
> > >>             raw_lock = {
> > >>               lock = 0
> > >>             },
> > >>             break_lock = 0
> > >>           }
> > >>         }
> > >>       },
> > >> 	all_unreclaimable = 0,
> > >>       reclaim_stat = {
> > >>         recent_rotated = {903355, 960912},
> > >>         recent_scanned = {932404, 2462017}
> > >>       },
> > >>       pages_scanned = 84231,
> > >
> > >Most of the scanning happens in the direct reclaim path, I guess,
> > >but direct reclaim couldn't reclaim any pages due to the lack of a swap device.
> > >
> > >It means we have to set zone->all_unreclaimable in the direct reclaim path,
> > >too.
> > >Does the patch below fix your problem?
> > Yes, your patch should fix my problem!
> > Actually I also did another patch which, after testing, should also fix my issue,
> > but I didn't set zone->all_unreclaimable in the direct reclaim path as you did;
> > I just double-checked the zone_reclaimable() status in the all_unreclaimable() function.
> > Maybe your patch is better!
> 
> Nope. I think your patch is better. :)
> The only thing left is the analysis of the problem and its description; I think we
> could do better, but unfortunately I don't have enough time today, so I will look
> at it tomorrow. Just a nitpick below.
> 
> Thanks.
> 
> > 
> > commit 26d2b60d06234683a81666da55129f9c982271a5
> > Author: Lisa Du <cldu@marvell.com>
> > Date:   Thu Aug 1 10:16:32 2013 +0800
> > 
> >     mm: fix infinite direct_reclaim when memory is very fragmented
> >
> >     The latest all_unreclaimable check in direct reclaim came in the following commit:
> >     2011 Apr 14; commit 929bea7c; vmscan:  all_unreclaimable() use
> >                                 zone->all_unreclaimable as a name
> >     which, in addition, added an oom_killer_disabled check to avoid reintroducing the
> >     issue of commit d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
> >
> >     But besides the hibernation case, in which kswapd is frozen, there is another case
> >     which may lead to an infinite loop in direct reclaim. In a real test, a direct
> >     reclaimer did over 200,000 rebalances in __alloc_pages_slowpath(), so the process
> >     was blocked until a watchdog detected and killed it. The root cause is as below:
> >
> >     If system memory is very fragmented, e.g. only order-0 and order-1 pages are left,
> >     kswapd will go to sleep as the system can't be rebalanced for high-order allocations.
> >     But direct reclaim still works for higher-order requests. So zones can reach a state
> >     where zone->all_unreclaimable = 0 but zone->pages_scanned > zone_reclaimable_pages(zone) * 6.
> >     In this case, if a process like do_fork tries to allocate order-2 memory, which is not
> >     a COSTLY_ORDER, direct reclaim always reports did_some_progress, so it rebalances
> >     again and again in __alloc_pages_slowpath(). This issue easily happens in a no-swap,
> >     no-compaction environment.
> >
> >     So add a further check in all_unreclaimable() to avoid such a case.
> >     
> >     Change-Id: Id3266b47c63f5b96aab466fd9f1f44d37e16cdcb
> >     Signed-off-by: Lisa Du <cldu@marvell.com>
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2cff0d4..34582d9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2301,7 +2301,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> >                         continue;
> >                 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >                         continue;
> > -               if (!zone->all_unreclaimable)
> > +               if (zone->all_unreclaimable)
> > +                       continue;
> 
> Nitpick: If we use zone_reclaimable(), above check is redundant and
> gain is very tiny because this path is already slow.
> 
> > +               if (zone_reclaimable(zone))
> >                         return false;
> >         }
> > >
> > >From a5d82159b98f3d90c2f9ff9e486699fb4c67cced Mon Sep 17 00:00:00
> > >2001
> > >From: Minchan Kim <minchan@kernel.org>
> > >Date: Thu, 1 Aug 2013 16:18:00 +0900
> > >Subject:[PATCH] mm: set zone->all_unreclaimable in direct reclaim
> > > path
> > >
> > >Lisa reported there are lots of free pages in a zone but most of them
> > >are order-0 pages, which means the zone is heavily fragmented.
> > >Then, a high-order allocation could cause a long stall in the direct reclaim
> > >path (e.g., 50 seconds) in a no-swap and no-compaction environment.
> > >
> > >The reason is that kswapd can skip scanning the zone because the zone
> > >has lots of free pages, and kswapd changes the scanning order from high-order
> > >to 0-order after its first iteration is done, because kswapd thinks
> > >order-0 allocations are the most important.
> > >Look at 73ce02e9 for details.
> > >
> > >The problem is that only kswapd can set zone->all_unreclaimable
> > >to 1 at the moment, so the direct reclaim path would loop forever until a
> > >ghost set zone->all_unreclaimable to 1.
> > >
> > >This patch makes the direct reclaim path set zone->all_unreclaimable
> > >to avoid the infinite loop. So now we don't need a ghost.
> > >
> > >Reported-by: Lisa Du <cldu@marvell.com>
> > >Signed-off-by: Minchan Kim <minchan@kernel.org>
> > >---
> > > mm/vmscan.c |   29 ++++++++++++++++++++++++++++-
> > > 1 file changed, 28 insertions(+), 1 deletion(-)
> > >
> > >diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >index 33dc256..f957e87 100644
> > >--- a/mm/vmscan.c
> > >+++ b/mm/vmscan.c
> > >@@ -2317,6 +2317,23 @@ static bool all_unreclaimable(struct zonelist
> > >*zonelist,
> > > 	return true;
> > > }
> > >
> > >+static void check_zones_unreclaimable(struct zonelist *zonelist,
> > >+					struct scan_control *sc)
> > >+{
> > >+	struct zoneref *z;
> > >+	struct zone *zone;
> > >+
> > >+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > >+			gfp_zone(sc->gfp_mask), sc->nodemask) {
> > >+		if (!populated_zone(zone))
> > >+			continue;
> > >+		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > >+			continue;
> > >+		if (!zone_reclaimable(zone))
> > >+			zone->all_unreclaimable = 1;
> > >+	}
> > >+}
> > >+
> > > /*
> > >  * This is the main entry point to direct page reclaim.
> > >  *
> > >@@ -2370,7 +2387,17 @@ static unsigned long
> > >do_try_to_free_pages(struct zonelist *zonelist,
> > > 				lru_pages += zone_reclaimable_pages(zone);
> > > 			}
> > >
> > >-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
> > >+			/*
> > >+			 * When a zone has enough order-0 free memory but
> > >+			 * zone is heavily fragmented and we need high order
> > >+			 * page from the zone, kswapd could skip the zone
> > >+			 * after first iteration with high order. So, kswapd
> > >+			 * never set the zone->all_unreclaimable to 1 so
> > >+			 * direct reclaim path needs the check.
> > >+			 */
> > >+			if (!shrink_slab(shrink, sc->nr_scanned, lru_pages))
> > >+				check_zones_unreclaimable(zonelist, sc);
> > >+
> > > 			if (reclaim_state) {
> > > 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > 				reclaim_state->reclaimed_slab = 0;
> > >--
> > >1.7.9.5
> > >
> > >--
> > >Kind regards,
> > >Minchan Kim
> 

I reviewed the current mmotm (because Mel recently changed kswapd a lot) and
the all_unreclaimable patch history today.
What I see is that recent mmotm has the same problem, too, if the system has no
swap and no compaction. Of course, compaction is a default-yes option, so we could
recommend enabling it if the system works well, but that's up to the user, and we
should avoid a direct reclaim hang even when the user disables compaction.

Looking at the patch history, the real culprit is 929bea7c.

"  zone->all_unreclaimable and zone->pages_scanned are neigher atomic
    variables nor protected by lock.  Therefore zones can become a state of
    zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
    all_unreclaimable() return false even though zone->all_unreclaimabe=1."

I understand the problem, but apparently it is what causes Lisa's problem:
kswapd can give up balancing when a high-order allocation happens, to prevent
excessive reclaim, on the assumption that the process which requested the
high-order allocation can do direct reclaim/compaction itself. But what if the
process can't reclaim because there is no swap but lots of anon pages, and
can't compact because of !CONFIG_COMPACTION?

In such a system, an OOM kill is natural, but a hang is not.
So, a simple fix is to introduce the zone_reclaimable() check again in
all_unreclaimable(), like this.

What do you think about it?

It's the same patch Lisa posted, so we should give the credit
to her/him (sorry, I'm not sure) if we agree on this approach.

Lisa, if KOSAKI agrees with this, could you resend this patch with your SOB?

Thanks.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a3bf7fd..78f46d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2367,7 +2367,15 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (!zone->all_unreclaimable)
+		/*
+		 * zone->pages_scanned is racy (not lock-protected), so
+		 * double check via zone->all_unreclaimable. Moreover,
+		 * kswapd can skip setting zone->all_unreclaimable = 1
+		 * if the zone is heavily fragmented but has enough free
+		 * pages to meet the high watermark; then kswapd never
+		 * sets it, so we need the zone_reclaimable() check, too.
+		 */
+		if (!zone->all_unreclaimable || zone_reclaimable(zone))
 			return false;
 	}
 


-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: Possible deadloop in direct reclaim?
  2013-08-02  2:26           ` Minchan Kim
@ 2013-08-02  2:33             ` Minchan Kim
  2013-08-02  3:17             ` Lisa Du
  1 sibling, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2013-08-02  2:33 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, KOSAKI Motohiro

On Fri, Aug 02, 2013 at 11:26:28AM +0900, Minchan Kim wrote:
> Hello Lisa and KOSAKI,
> 
> Lisa's quote style is very hard to follow, so I'd like to write at the bottom,
> ignoring the line-by-line rule.
> 
> Lisa, please correct your MUA.
> 
> On Thu, Aug 01, 2013 at 05:42:59PM +0900, Minchan Kim wrote:
> > On Thu, Aug 01, 2013 at 01:20:34AM -0700, Lisa Du wrote:
> > > >-----Original Message-----
> > > >From: Minchan Kim [mailto:minchan@kernel.org]
> > > >Sent: 2013-08-01 15:34
> > > >To: Lisa Du
> > > >Cc: linux-mm@kvack.org; KOSAKI Motohiro
> > > >Subject: Re: Possible deadloop in direct reclaim?
> > > >
> > > >On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
> > > >> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
> > > >> >> Dear Sir:
> > > >> >> Currently I met a possible deadloop in direct reclaim. After run plenty
> > > >of
> > > >> >the application, system run into a status that system memory is very
> > > >> >fragmentized. Like only order-0 and order-1 memory left.
> > > >> >> Then one process required a order-2 buffer but it enter an endless
> > > >direct
> > > >> >reclaim. From my trace log, I can see this loop already over 200,000
> > > >times.
> > > >> >Kswapd was first wake up and then go back to sleep as it cannot
> > > >rebalance
> > > >> >this order's memory. But zone->all_unreclaimable remains 1.
> > > >> >> Though direct_reclaim every time returns no pages, but as
> > > >> >zone->all_unreclaimable = 1, so it loop again and again. Even when
> > > >> >zone->pages_scanned also becomes very large. It will block the process
> > > >for
> > > >> >long time, until some watchdog thread detect this and kill this process.
> > > >> >Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe
> > > >cost
> > > >> >over 50 seconds or even more.
> > > >> >> I think it's not as expected right?  Can we also add below check in the
> > > >> >function all_unreclaimable() to terminate this loop?
> > > >> >>
> > > >> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
> > > >> >*zonelist,
> > > >> >>                         continue;
> > > >> >>                 if (!zone->all_unreclaimable)
> > > >> >>                         return false;
> > > >> >> +               if (sc->nr_reclaimed == 0
> > > >&& !zone_reclaimable(zone))
> > > >> >> +                       return true;
> > > >> >>         }
> > > >> >>          BTW: I'm using kernel3.4, I also try to search in the
> > > >kernel3.9,
> > > >> >didn't see a possible fix for such issue. Or is anyone also met such issue
> > > >> >before? Any comment will be welcomed, looking forward to your reply!
> > > >> >>
> > > >> >> Thanks!
> > > >> >
> > > >> >I'd like to ask some things.
> > > >> >
> > > >> >1. Do you have enabled swap?
> > > >> I set CONFIG_SWAP=y, but I didn't really have a swap partition, that
> > > >means my swap buffer size is 0;
> > > >> >2. Do you enable CONFIG_COMPACTION?
> > > >> No, I didn't enable;
> > > >> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
> > > >> I dump some info from ramdump, please review:
> > > >
> > > >Thanks for the information.
> > > >You said the order-2 allocation failed, so I will assume the preferred zone
> > > >is the normal zone, not the highmem zone, because kernel-side high-order
> > > >allocations aren't served from highmem.
> > > Yes, that's right!
> > > >
> > > >> crash> kmem -z
> > > >> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
> > > >>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
> > > >
> > > >712M normal memory.
> > > >
> > > >>   VM_STAT:
> > > >>           NR_FREE_PAGES: 16092
> > > >
> > > >There are plenty of free pages over high watermark but there are heavy
> > > >fragmentation as I see below information.
> > > >
> > > >So, kswapd doesn't scan this zone while the loop iteration is done with
> > > >order-2. I mean kswapd will scan this zone with order-0 once the first
> > > >iteration is done, via this:
> > > >
> > > >        order = sc.order = 0;
> > > >
> > > >        goto loop_again;
> > > >
> > > >But this time, zone_watermark_ok_safe with testorder = 0 on the normal zone
> > > >is always true, so scanning of the zone will be skipped. It means kswapd
> > > >never sets zone->all_unreclaimable to 1.
> > > Yes, definitely!
> > > >
> > > >>        NR_INACTIVE_ANON: 17
> > > >>          NR_ACTIVE_ANON: 55091
> > > >>        NR_INACTIVE_FILE: 17
> > > >>          NR_ACTIVE_FILE: 17
> > > >>          NR_UNEVICTABLE: 0
> > > >>                NR_MLOCK: 0
> > > >>           NR_ANON_PAGES: 55077
> > > >
> > > >There are about 200M anon pages and few file pages.
> > > >You don't have swap so that reclaimer couldn't go far.
> > > >
> > > >>          NR_FILE_MAPPED: 42
> > > >>           NR_FILE_PAGES: 69
> > > >>           NR_FILE_DIRTY: 0
> > > >>            NR_WRITEBACK: 0
> > > >>     NR_SLAB_RECLAIMABLE: 1226
> > > >>   NR_SLAB_UNRECLAIMABLE: 9373
> > > >>            NR_PAGETABLE: 2776
> > > >>         NR_KERNEL_STACK: 798
> > > >>         NR_UNSTABLE_NFS: 0
> > > >>               NR_BOUNCE: 0
> > > >>         NR_VMSCAN_WRITE: 91
> > > >>     NR_VMSCAN_IMMEDIATE: 115381
> > > >>       NR_WRITEBACK_TEMP: 0
> > > >>        NR_ISOLATED_ANON: 0
> > > >>        NR_ISOLATED_FILE: 0
> > > >>                NR_SHMEM: 31
> > > >>              NR_DIRTIED: 15256
> > > >>              NR_WRITTEN: 11981
> > > >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> > > >>
> > > >> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
> > > >>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
> > > >>   VM_STAT:
> > > >>           NR_FREE_PAGES: 161
> > > >
> > > >Reclaimer should reclaim this zone.
> > > >
> > > >>        NR_INACTIVE_ANON: 104
> > > >>          NR_ACTIVE_ANON: 46114
> > > >>        NR_INACTIVE_FILE: 9722
> > > >>          NR_ACTIVE_FILE: 12263
> > > >
> > > >It seems there are lots of room to evict file pages.
> > > >
> > > >>          NR_UNEVICTABLE: 168
> > > >>                NR_MLOCK: 0
> > > >>           NR_ANON_PAGES: 46102
> > > >>          NR_FILE_MAPPED: 12227
> > > >>           NR_FILE_PAGES: 22270
> > > >>           NR_FILE_DIRTY: 1
> > > >>            NR_WRITEBACK: 0
> > > >>     NR_SLAB_RECLAIMABLE: 0
> > > >>   NR_SLAB_UNRECLAIMABLE: 0
> > > >>            NR_PAGETABLE: 0
> > > >>         NR_KERNEL_STACK: 0
> > > >>         NR_UNSTABLE_NFS: 0
> > > >>               NR_BOUNCE: 0
> > > >>         NR_VMSCAN_WRITE: 0
> > > >>     NR_VMSCAN_IMMEDIATE: 0
> > > >>       NR_WRITEBACK_TEMP: 0
> > > >>        NR_ISOLATED_ANON: 0
> > > >>        NR_ISOLATED_FILE: 0
> > > >>                NR_SHMEM: 117
> > > >>              NR_DIRTIED: 7364
> > > >>              NR_WRITTEN: 6989
> > > >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> > > >>
> > > >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
> > > >START_MAPNR
> > > >>   0   Normal    192512   16092  c1200000       0            0
> > > >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> > > >>   0       4k      c08460f0           3      3
> > > >>   0       4k      c08460f8         436    436
> > > >>   0       4k      c0846100       15237  15237
> > > >>   0       4k      c0846108           0      0
> > > >>   0       4k      c0846110           0      0
> > > >>   1       8k      c084611c          39     78
> > > >>   1       8k      c0846124           0      0
> > > >>   1       8k      c084612c         169    338
> > > >>   1       8k      c0846134           0      0
> > > >>   1       8k      c084613c           0      0
> > > >>   2      16k      c0846148           0      0
> > > >>   2      16k      c0846150           0      0
> > > >>   2      16k      c0846158           0      0
> > > >> ---------Normal zone all order > 1 has no free pages
> > > >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
> > > >START_MAPNR
> > > >>   1   HighMem    69632     161  c17e0000    2f000000
> > > >192512
> > > >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> > > >>   0       4k      c08464f0          12     12
> > > >>   0       4k      c08464f8           0      0
> > > >>   0       4k      c0846500          14     14
> > > >>   0       4k      c0846508           3      3
> > > >>   0       4k      c0846510           0      0
> > > >>   1       8k      c084651c           0      0
> > > >>   1       8k      c0846524           0      0
> > > >>   1       8k      c084652c           0      0
> > > >>   2      16k      c0846548           0      0
> > > >>   2      16k      c0846550           0      0
> > > >>   2      16k      c0846558           0      0
> > > >>   2      16k      c0846560           1      4
> > > >>   2      16k      c0846568           0      0
> > > >>   5     128k      c08465cc           0      0
> > > >>   5     128k      c08465d4           0      0
> > > >>   5     128k      c08465dc           0      0
> > > >>   5     128k      c08465e4           4    128
> > > >>   5     128k      c08465ec           0      0
> > > >> ------Other's all zero
> > > >>
> > > >> Some other zone information I dump from pglist_data
> > > >> {
> > > >> 	watermark = {853, 1066, 1279},
> > > >>       percpu_drift_mark = 0,
> > > >>       lowmem_reserve = {0, 2159, 2159},
> > > >>       dirty_balance_reserve = 3438,
> > > >>       pageset = 0xc07f6144,
> > > >>       lock = {
> > > >>         {
> > > >>           rlock = {
> > > >>             raw_lock = {
> > > >>               lock = 0
> > > >>             },
> > > >>             break_lock = 0
> > > >>           }
> > > >>         }
> > > >>       },
> > > >> 	all_unreclaimable = 0,
> > > >>       reclaim_stat = {
> > > >>         recent_rotated = {903355, 960912},
> > > >>         recent_scanned = {932404, 2462017}
> > > >>       },
> > > >>       pages_scanned = 84231,
> > > >
> > > >Most of the scanning happens in the direct reclaim path, I guess,
> > > >but direct reclaim couldn't reclaim any pages due to the lack of a swap device.
> > > >
> > > >It means we have to set zone->all_unreclaimable in the direct reclaim path,
> > > >too.
> > > >Does the patch below fix your problem?
> > > Yes, your patch should fix my problem!
> > > Actually I also did another patch which, after testing, should also fix my issue,
> > > but I didn't set zone->all_unreclaimable in the direct reclaim path as you did;
> > > I just double-checked the zone_reclaimable() status in the all_unreclaimable() function.
> > > Maybe your patch is better!
> > 
> > Nope. I think your patch is better. :)
> > The only thing left is the analysis of the problem and its description; I think we
> > could do better, but unfortunately I don't have enough time today, so I will look
> > at it tomorrow. Just a nitpick below.
> > 
> > Thanks.
> > 
> > > 
> > > commit 26d2b60d06234683a81666da55129f9c982271a5
> > > Author: Lisa Du <cldu@marvell.com>
> > > Date:   Thu Aug 1 10:16:32 2013 +0800
> > > 
> > >     mm: fix infinite direct_reclaim when memory is very fragmented
> > >
> > >     The latest all_unreclaimable check in direct reclaim came in the following commit:
> > >     2011 Apr 14; commit 929bea7c; vmscan:  all_unreclaimable() use
> > >                                 zone->all_unreclaimable as a name
> > >     which, in addition, added an oom_killer_disabled check to avoid reintroducing the
> > >     issue of commit d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
> > >
> > >     But besides the hibernation case, in which kswapd is frozen, there is another case
> > >     which may lead to an infinite loop in direct reclaim. In a real test, a direct
> > >     reclaimer did over 200,000 rebalances in __alloc_pages_slowpath(), so the process
> > >     was blocked until a watchdog detected and killed it. The root cause is as below:
> > >
> > >     If system memory is very fragmented, e.g. only order-0 and order-1 pages are left,
> > >     kswapd will go to sleep as the system can't be rebalanced for high-order allocations.
> > >     But direct reclaim still works for higher-order requests. So zones can reach a state
> > >     where zone->all_unreclaimable = 0 but zone->pages_scanned > zone_reclaimable_pages(zone) * 6.
> > >     In this case, if a process like do_fork tries to allocate order-2 memory, which is not
> > >     a COSTLY_ORDER, direct reclaim always reports did_some_progress, so it rebalances
> > >     again and again in __alloc_pages_slowpath(). This issue easily happens in a no-swap,
> > >     no-compaction environment.
> > >
> > >     So add a further check in all_unreclaimable() to avoid such a case.
> > >     
> > >     Change-Id: Id3266b47c63f5b96aab466fd9f1f44d37e16cdcb
> > >     Signed-off-by: Lisa Du <cldu@marvell.com>
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 2cff0d4..34582d9 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2301,7 +2301,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > >                         continue;
> > >                 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > >                         continue;
> > > -               if (!zone->all_unreclaimable)
> > > +               if (zone->all_unreclaimable)
> > > +                       continue;
> > 
> > Nitpick: If we use zone_reclaimable(), above check is redundant and
> > gain is very tiny because this path is already slow.
> > 
> > > +               if (zone_reclaimable(zone))
> > >                         return false;
> > >         }
> > > >
> > > >From a5d82159b98f3d90c2f9ff9e486699fb4c67cced Mon Sep 17 00:00:00
> > > >2001
> > > >From: Minchan Kim <minchan@kernel.org>
> > > >Date: Thu, 1 Aug 2013 16:18:00 +0900
> > > >Subject:[PATCH] mm: set zone->all_unreclaimable in direct reclaim
> > > > path
> > > >
> > > >Lisa reported there are lots of free pages in a zone but most of them
> > > >are order-0 pages, which means the zone is heavily fragmented.
> > > >Then, a high-order allocation could cause a long stall in the direct reclaim
> > > >path (e.g., 50 seconds) in a no-swap and no-compaction environment.
> > > >
> > > >The reason is that kswapd can skip scanning the zone because the zone
> > > >has lots of free pages, and kswapd changes the scanning order from high-order
> > > >to 0-order after its first iteration is done, because kswapd thinks
> > > >order-0 allocations are the most important.
> > > >Look at 73ce02e9 for details.
> > > >
> > > >The problem is that only kswapd can set zone->all_unreclaimable
> > > >to 1 at the moment, so the direct reclaim path would loop forever until a
> > > >ghost set zone->all_unreclaimable to 1.
> > > >
> > > >This patch makes the direct reclaim path set zone->all_unreclaimable
> > > >to avoid the infinite loop. So now we don't need a ghost.
> > > >
> > > >Reported-by: Lisa Du <cldu@marvell.com>
> > > >Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > >---
> > > > mm/vmscan.c |   29 ++++++++++++++++++++++++++++-
> > > > 1 file changed, 28 insertions(+), 1 deletion(-)
> > > >
> > > >diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > >index 33dc256..f957e87 100644
> > > >--- a/mm/vmscan.c
> > > >+++ b/mm/vmscan.c
> > > >@@ -2317,6 +2317,23 @@ static bool all_unreclaimable(struct zonelist
> > > >*zonelist,
> > > > 	return true;
> > > > }
> > > >
> > > >+static void check_zones_unreclaimable(struct zonelist *zonelist,
> > > >+					struct scan_control *sc)
> > > >+{
> > > >+	struct zoneref *z;
> > > >+	struct zone *zone;
> > > >+
> > > >+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > >+			gfp_zone(sc->gfp_mask), sc->nodemask) {
> > > >+		if (!populated_zone(zone))
> > > >+			continue;
> > > >+		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > > >+			continue;
> > > >+		if (!zone_reclaimable(zone))
> > > >+			zone->all_unreclaimable = 1;
> > > >+	}
> > > >+}
> > > >+
> > > > /*
> > > >  * This is the main entry point to direct page reclaim.
> > > >  *
> > > >@@ -2370,7 +2387,17 @@ static unsigned long
> > > >do_try_to_free_pages(struct zonelist *zonelist,
> > > > 				lru_pages += zone_reclaimable_pages(zone);
> > > > 			}
> > > >
> > > >-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
> > > >+			/*
> > > >+			 * When a zone has enough order-0 free memory but
> > > >+			 * zone is heavily fragmented and we need high order
> > > >+			 * page from the zone, kswapd could skip the zone
> > > >+			 * after first iteration with high order. So, kswapd
> > > >+			 * never set the zone->all_unreclaimable to 1 so
> > > >+			 * direct reclaim path needs the check.
> > > >+			 */
> > > >+			if (!shrink_slab(shrink, sc->nr_scanned, lru_pages))
> > > >+				check_zones_unreclaimable(zonelist, sc);
> > > >+
> > > > 			if (reclaim_state) {
> > > > 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > > 				reclaim_state->reclaimed_slab = 0;
> > > >--
> > > >1.7.9.5
> > > >
> > > >--
> > > >Kind regards,
> > > >Minchan Kim
> > 
> 
> I reviewed current mmotm because recently Mel changed kswapd a lot and
> all_unreclaimable patch history today.
> What I see is recent mmotm has a same problem, too if system have no swap
> and no compaction. Of course, compaction is default yes option so we could
> recommend to enable if system works well but it's up to user and we should
                                ^^^ typo
                      if system want work well

> avoid direct reclaim hang although user disable compaction.
> 
> When I see the patch history, real culprit is 929bea7c.
> 
> "  zone->all_unreclaimable and zone->pages_scanned are neigher atomic
>     variables nor protected by lock.  Therefore zones can become a state of
>     zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
>     all_unreclaimable() return false even though zone->all_unreclaimabe=1."
> 
> I understand the problem but apparently, it makes Lisa's problem because
> kswapd can give up balancing when high order allocation happens to prevent
> excessive reclaim with assuming the process requested high order allocation
> can do direct reclaim/compaction. But what if the process can't reclaim
> by no swap but lots of anon pages and can't compact by !CONFIG_COMPACTION?
> 
> In such a system, an OOM kill is natural, but a hang is not.
> So a simple fix is to introduce the zone_reclaimable() check again in
> all_unreclaimable() like this.
> 
> What do you think about it?
> 
> It's the same patch Lisa posted, so we should give credit
> to her/him (sorry, I'm not sure which) if we agree with this approach.
> 
> Lisa, if KOSAKI agrees with this, could you resend this patch with your SOB?
> 
> Thanks.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a3bf7fd..78f46d8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2367,7 +2367,15 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>  			continue;
>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  			continue;
> -		if (!zone->all_unreclaimable)
> +		/*
> +		 * zone->page_scanned and could be raced so we need
                                      ^^^
                                      typo: please remove it.

> +		 * dobule check by zone->all_unreclaimable. Morever, kswapd
> +		 * could skip (zone->all_unreclaimable = 1) if the zone
> +		 * is heavily fragmented but enough free pages to meet
> +		 * high watermark. In such case, kswapd never set
> +		 * all_unreclaimable to 1 so we need zone_reclaimable, too.
> +		 */
> +		if (!zone->all_unreclaimable || zone_reclaimable(zone))
>  			return false;
>  	}
>  
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: dont@kvack.org

-- 
Kind regards,
Minchan Kim



* RE: Possible deadloop in direct reclaim?
  2013-08-02  2:26           ` Minchan Kim
  2013-08-02  2:33             ` Minchan Kim
@ 2013-08-02  3:17             ` Lisa Du
  2013-08-02  3:53               ` Minchan Kim
  1 sibling, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-08-02  3:17 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, KOSAKI Motohiro, Bob Liu

>-----Original Message-----
>From: Minchan Kim [mailto:minchan@kernel.org]
>Sent: 2013年8月2日 10:26
>To: Lisa Du
>Cc: linux-mm@kvack.org; KOSAKI Motohiro
>Subject: Re: Possible deadloop in direct reclaim?
>
>Hello Lisa and KOSAKI,
>
>Lisa's quote style is very hard to follow, so I'd like to write at the
>bottom, ignoring the line-by-line rule.
>
>Lisa, please correct your MUA.
I'm really sorry for my quote style, will improve it in my following mails.
>
>
>I reviewed current mmotm because recently Mel changed kswapd a lot and
>all_unreclaimable patch history today.
>What I see is recent mmotm has a same problem, too if system have no swap
>and no compaction. Of course, compaction is default yes option so we could
>recommend to enable if system works well but it's up to user and we should
>avoid direct reclaim hang although user disable compaction.
>
>When I see the patch history, real culprit is 929bea7c.
>
>"  zone->all_unreclaimable and zone->pages_scanned are neigher atomic
>    variables nor protected by lock.  Therefore zones can become a state of
>    zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
>    all_unreclaimable() return false even though zone->all_unreclaimabe=1."
>
>I understand the problem but apparently, it makes Lisa's problem because
>kswapd can give up balancing when high order allocation happens to prevent
>excessive reclaim with assuming the process requested high order allocation
>can do direct reclaim/compaction. But what if the process can't reclaim
>by no swap but lots of anon pages and can't compact by !CONFIG_COMPACTION?
>
>In such a system, an OOM kill is natural, but a hang is not.
>So a simple fix is to introduce the zone_reclaimable() check again in
>all_unreclaimable() like this.
>
>What do you think about it?
>
>It's the same patch Lisa posted, so we should give credit
>to her/him (sorry, I'm not sure which) if we agree with this approach.
>
>Lisa, if KOSAKI agrees with this, could you resend this patch with your SOB?
>
>Thanks.
>
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index a3bf7fd..78f46d8 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2367,7 +2367,15 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> 			continue;
> 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> 			continue;
>-		if (!zone->all_unreclaimable)
>+		/*
>+		 * zone->page_scanned and could be raced so we need
>+		 * dobule check by zone->all_unreclaimable. Morever, kswapd
>+		 * could skip (zone->all_unreclaimable = 1) if the zone
>+		 * is heavily fragmented but enough free pages to meet
>+		 * high watermark. In such case, kswapd never set
>+		 * all_unreclaimable to 1 so we need zone_reclaimable, too.
>+		 */
>+		if (!zone->all_unreclaimable || zone_reclaimable(zone))
> 			return false;
> 	}
   I'm afraid this patch can't help.
   zone->all_unreclaimable = 0 will always result in a false return,
   so the zone_reclaimable(zone) check wouldn't take effect no matter
   whether it's true or false, right?

Also Bob found below thread, seems Kosaki also found same issue:
mm, vmscan: fix do_try_to_free_pages() livelock
https://lkml.org/lkml/2012/6/14/74

>
>
>
>--
>Kind regards,
>Minchan Kim


* Re: Possible deadloop in direct reclaim?
  2013-08-02  3:17             ` Lisa Du
@ 2013-08-02  3:53               ` Minchan Kim
  2013-08-02  8:08                 ` Lisa Du
  0 siblings, 1 reply; 36+ messages in thread
From: Minchan Kim @ 2013-08-02  3:53 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, KOSAKI Motohiro, Bob Liu

On Thu, Aug 01, 2013 at 08:17:56PM -0700, Lisa Du wrote:
> >-----Original Message-----
> >From: Minchan Kim [mailto:minchan@kernel.org]
> >Sent: 2013年8月2日 10:26
> >To: Lisa Du
> >Cc: linux-mm@kvack.org; KOSAKI Motohiro
> >Subject: Re: Possible deadloop in direct reclaim?
> >
> >Hello Lisa and KOSAKI,
> >
> >Lisa's quote style is very hard to follow, so I'd like to write at the
> >bottom, ignoring the line-by-line rule.
> >
> >Lisa, please correct your MUA.
> I'm really sorry for my quote style, will improve it in my following mails.
> >
> >
> >I reviewed current mmotm because recently Mel changed kswapd a lot and
> >all_unreclaimable patch history today.
> >What I see is recent mmotm has a same problem, too if system have no swap
> >and no compaction. Of course, compaction is default yes option so we could
> >recommend to enable if system works well but it's up to user and we should
> >avoid direct reclaim hang although user disable compaction.
> >
> >When I see the patch history, real culprit is 929bea7c.
> >
> >"  zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> >    variables nor protected by lock.  Therefore zones can become a state of
> >    zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
> >    all_unreclaimable() return false even though zone->all_unreclaimabe=1."
> >
> >I understand the problem but apparently, it makes Lisa's problem because
> >kswapd can give up balancing when high order allocation happens to prevent
> >excessive reclaim with assuming the process requested high order allocation
> >can do direct reclaim/compaction. But what if the process can't reclaim
> >by no swap but lots of anon pages and can't compact by !CONFIG_COMPACTION?
> >
> >In such a system, an OOM kill is natural, but a hang is not.
> >So a simple fix is to introduce the zone_reclaimable() check again in
> >all_unreclaimable() like this.
> >
> >What do you think about it?
> >
> >It's the same patch Lisa posted, so we should give credit
> >to her/him (sorry, I'm not sure which) if we agree with this approach.
> >
> >Lisa, if KOSAKI agrees with this, could you resend this patch with your SOB?
> >
> >Thanks.
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index a3bf7fd..78f46d8 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -2367,7 +2367,15 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > 			continue;
> > 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > 			continue;
> >-		if (!zone->all_unreclaimable)
> >+		/*
> >+		 * zone->page_scanned and could be raced so we need
> >+		 * dobule check by zone->all_unreclaimable. Morever, kswapd
> >+		 * could skip (zone->all_unreclaimable = 1) if the zone
> >+		 * is heavily fragmented but enough free pages to meet
> >+		 * high watermark. In such case, kswapd never set
> >+		 * all_unreclaimable to 1 so we need zone_reclaimable, too.
> >+		 */
> >+		if (!zone->all_unreclaimable || zone_reclaimable(zone))
> > 			return false;
> > 	}
>    I'm afraid this patch can't help.
>    zone->all_unreclaimable = 0 will always result in a false return,
>    so the zone_reclaimable(zone) check wouldn't take effect no matter
>    whether it's true or false, right?

You're right. It was not what I wanted; I meant to check both conditions.

> 
> Also Bob found below thread, seems Kosaki also found same issue:
> mm, vmscan: fix do_try_to_free_pages() livelock
> https://lkml.org/lkml/2012/6/14/74

I remember it, and AFAIR I had a concern because the description was
too vague, without a detailed example, and I fixed Aaditya's problem with
another approach. That's why it wasn't merged at that time.

Now we have a real problem and analysis, so KOSAKI's patch looks
perfect to me.

Lisa, could you resend KOSAKI's patch with a more detailed description?

> 
> >
> >
> >
> >--
> >Kind regards,
> >Minchan Kim

-- 
Kind regards,
Minchan Kim



* RE: Possible deadloop in direct reclaim?
  2013-08-02  3:53               ` Minchan Kim
@ 2013-08-02  8:08                 ` Lisa Du
  2013-08-04 23:47                   ` Minchan Kim
  0 siblings, 1 reply; 36+ messages in thread
From: Lisa Du @ 2013-08-02  8:08 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, KOSAKI Motohiro, Bob Liu

>-----Original Message-----
>From: Minchan Kim [mailto:minchan@kernel.org]
>Sent: 2013年8月2日 11:54
>To: Lisa Du
>Cc: linux-mm@kvack.org; KOSAKI Motohiro; Bob Liu
>Subject: Re: Possible deadloop in direct reclaim?
>
>On Thu, Aug 01, 2013 at 08:17:56PM -0700, Lisa Du wrote:
>> >-----Original Message-----
>> >From: Minchan Kim [mailto:minchan@kernel.org]
>> >Sent: 2013年8月2日 10:26
>> >To: Lisa Du
>> >Cc: linux-mm@kvack.org; KOSAKI Motohiro
>> >Subject: Re: Possible deadloop in direct reclaim?


>> >I reviewed current mmotm because recently Mel changed kswapd a lot and
>> >all_unreclaimable patch history today.
>> >What I see is recent mmotm has a same problem, too if system have no swap
>> >and no compaction. Of course, compaction is default yes option so we could
>> >recommend to enable if system works well but it's up to user and we should
>> >avoid direct reclaim hang although user disable compaction.
>> >
>> >When I see the patch history, real culprit is 929bea7c.
>> >
>> >"  zone->all_unreclaimable and zone->pages_scanned are neigher atomic
>> >    variables nor protected by lock.  Therefore zones can become a state of
>> >    zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
>> >    all_unreclaimable() return false even though zone->all_unreclaimabe=1."
>> >
>> >I understand the problem but apparently, it makes Lisa's problem because
>> >kswapd can give up balancing when high order allocation happens to prevent
>> >excessive reclaim with assuming the process requested high order allocation
>> >can do direct reclaim/compaction. But what if the process can't reclaim
>> >by no swap but lots of anon pages and can't compact by !CONFIG_COMPACTION?
>> >
>>
>> Also Bob found below thread, seems Kosaki also found same issue:
>> mm, vmscan: fix do_try_to_free_pages() livelock
>> https://lkml.org/lkml/2012/6/14/74
>
>I remember it and AFAIRC, I had a concern because description was
>too vague without detailed example and I fixed Aaditya's problem with
>another approach. That's why it wasn't merged at that time.
>
>Now, we have a real problem and analysis so I think KOSAKI's patch makes
>perfect to me.
>
>Lisa, Could you resend KOSAKI's patch with more detailed description?

Hi Minchan and KOSAKI,
Would you please help check the below patch, which I resend based on KOSAKI's previous patch?
I'm not sure whether the description is clear enough; please let me know if you have any comments.
Many thanks!
From 2dfe137665a694dcc74ae9c8a27641b06190f344 Mon Sep 17 00:00:00 2001
From: Lisa Du <cldu@marvell.com>
Date: Fri, 2 Aug 2013 14:37:31 +0800
Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock

Currently, the system can enter a state in which there are lots
of free pages in a zone but only order-0 and order-1 pages, which
means the zone is heavily fragmented; then a high-order allocation
could cause a long stall (ex, 60 seconds) in the direct reclaim path,
especially in a no-swap and no-compaction environment.

The reason is that do_try_to_free_pages() enters a livelock:

kswapd will go to sleep if the zones have been fully scanned
and are still not balanced, as kswapd thinks there's little point
in trying all over again to avoid an infinite loop. Instead it changes
order from high-order to order-0 because kswapd thinks order-0 is the
most important. Look at 73ce02e9 in detail. If watermarks are ok,
kswapd will go back to sleep and may leave zone->all_unreclaimable = 0.
It assumes high-order users can still perform direct reclaim if they wish.

Direct reclaim continues to reclaim for a high order which is not a
COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
zone->all_unreclaimable. This is to avoid a too-early oom-kill.
So it means direct reclaim depends on kswapd to break this loop.

In the worst case, direct reclaim may continue page reclaim forever
while kswapd sleeps forever, until someone like a watchdog detects it
and finally kills the process.

We can't turn on zone->all_unreclaimable from the direct reclaim path
because the direct reclaim path doesn't take any lock and this way is racy.
Thus this patch removes the zone->all_unreclaimable field completely and
recalculates the zone reclaimable state every time.

Note: we can't take the approach where direct reclaim checks
zone->pages_scanned directly while kswapd continues to use
zone->all_unreclaimable, because it is racy. commit 929bea7c71 (vmscan:
all_unreclaimable() use zone->all_unreclaimable as a name) describes
the detail.

Change-Id: If3b44e33e400c1db0e42a5e2fc9ebc7a265f2aae
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
---
 include/linux/mm_inline.h |   20 ++++++++++++++++++++
 include/linux/mmzone.h    |    1 -
 include/linux/vmstat.h    |    1 -
 mm/page-writeback.c       |    1 +
 mm/page_alloc.c           |    5 ++---
 mm/vmscan.c               |   43 +++++++++----------------------------------
 mm/vmstat.c               |    3 ++-
 7 files changed, 34 insertions(+), 40 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 1397ccf..e212fae 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -2,6 +2,7 @@
 #define LINUX_MM_INLINE_H
 
 #include <linux/huge_mm.h>
+#include <linux/swap.h>
 
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
@@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
 	return lru;
 }
 
+static inline unsigned long zone_reclaimable_pages(struct zone *zone)
+{
+	int nr;
+
+	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
+	     zone_page_state(zone, NR_INACTIVE_FILE);
+
+	if (get_nr_swap_pages() > 0)
+		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
+		      zone_page_state(zone, NR_INACTIVE_ANON);
+
+	return nr;
+}
+
+static inline bool zone_reclaimable(struct zone *zone)
+{
+	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+}
+
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index af4a3b7..e835974 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -352,7 +352,6 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 	/* Set to true when the PG_migrate_skip bits should be cleared */
 	bool			compact_blockskip_flush;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index c586679..6fff004 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -143,7 +143,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 }
 
 extern unsigned long global_reclaimable_pages(void);
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 
 #ifdef CONFIG_NUMA
 /*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..62bfd92 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,6 +36,7 @@
 #include <linux/pagevec.h>
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
+#include <linux/mm_inline.h>
 #include <trace/events/writeback.h>
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..19a18c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
+#include <linux/mm_inline.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -647,7 +648,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int to_free = count;
 
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	while (to_free) {
@@ -696,7 +696,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	__free_one_page(page, zone, order, migratetype);
@@ -3095,7 +3094,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
 			zone->pages_scanned,
-			(zone->all_unreclaimable ? "yes" : "no")
+			(!zone_reclaimable(zone) ? "yes" : "no")
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34582d9..7501d1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1789,7 +1789,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && zone->all_unreclaimable)
+	if (current_is_kswapd() && !zone_reclaimable(zone))
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
@@ -2244,8 +2244,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			if (zone->all_unreclaimable &&
-					sc->priority != DEF_PRIORITY)
+			if (!zone_reclaimable(zone) &&
+			    sc->priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
 			if (IS_ENABLED(CONFIG_COMPACTION)) {
 				/*
@@ -2283,11 +2283,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	return aborted_reclaim;
 }
 
-static bool zone_reclaimable(struct zone *zone)
-{
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
-}
-
 /* All zones in zonelist are unreclaimable? */
 static bool all_unreclaimable(struct zonelist *zonelist,
 		struct scan_control *sc)
@@ -2301,8 +2296,6 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (zone->all_unreclaimable)
-			continue;
 		if (zone_reclaimable(zone))
 			return false;
 	}
@@ -2714,7 +2707,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 		 * DEF_PRIORITY. Effectively, it considers them balanced so
 		 * they must be considered balanced here as well!
 		 */
-		if (zone->all_unreclaimable) {
+		if (!zone_reclaimable(zone)) {
 			balanced_pages += zone->managed_pages;
 			continue;
 		}
@@ -2775,7 +2768,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 			       unsigned long lru_pages,
 			       unsigned long *nr_attempted)
 {
-	unsigned long nr_slab;
 	int testorder = sc->order;
 	unsigned long balance_gap;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2820,15 +2812,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	shrink_zone(zone, sc);
 
 	reclaim_state->reclaimed_slab = 0;
-	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+	shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 
 	/* Account for the number of pages attempted to reclaim */
 	*nr_attempted += sc->nr_to_reclaim;
 
-	if (nr_slab == 0 && !zone_reclaimable(zone))
-		zone->all_unreclaimable = 1;
-
 	zone_clear_flag(zone, ZONE_WRITEBACK);
 
 	/*
@@ -2837,7 +2826,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * BDIs but as pressure is relieved, speculatively avoid congestion
 	 * waits.
 	 */
-	if (!zone->all_unreclaimable &&
+	if (zone_reclaimable(zone) &&
 	    zone_balanced(zone, testorder, 0, classzone_idx)) {
 		zone_clear_flag(zone, ZONE_CONGESTED);
 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
@@ -2903,7 +2892,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
+			if (!zone_reclaimable(zone) &&
 			    sc.priority != DEF_PRIORITY)
 				continue;
 
@@ -2982,7 +2971,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
+			if (!zone_reclaimable(zone) &&
 			    sc.priority != DEF_PRIORITY)
 				continue;
 
@@ -3267,20 +3256,6 @@ unsigned long global_reclaimable_pages(void)
 	return nr;
 }
 
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-	int nr;
-
-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
-	     zone_page_state(zone, NR_INACTIVE_FILE);
-
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON);
-
-	return nr;
-}
-
 #ifdef CONFIG_HIBERNATION
 /*
  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
@@ -3578,7 +3553,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return ZONE_RECLAIM_FULL;
 
-	if (zone->all_unreclaimable)
+	if (!zone_reclaimable(zone))
 		return ZONE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c2ef4..c48f75b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -19,6 +19,7 @@
 #include <linux/math64.h>
 #include <linux/writeback.h>
 #include <linux/compaction.h>
+#include <linux/mm_inline.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -1052,7 +1053,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  all_unreclaimable: %u"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
-		   zone->all_unreclaimable,
+		   !zone_reclaimable(zone),
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.0.4

>
>>
>> >
>> >
>> >
>> >--
>> >Kind regards,
>> >Minchan Kim
>
>--
>Kind regards,
>Minchan Kim


* Re: Possible deadloop in direct reclaim?
  2013-08-01  4:21               ` Bob Liu
@ 2013-08-03 21:22                 ` KOSAKI Motohiro
  2013-08-04 23:50                   ` Minchan Kim
  0 siblings, 1 reply; 36+ messages in thread
From: KOSAKI Motohiro @ 2013-08-03 21:22 UTC (permalink / raw)
  To: Bob Liu
  Cc: KOSAKI Motohiro, Lisa Du, Christoph Lameter, linux-mm,
	Mel Gorman, Bob Liu

(8/1/13 12:21 AM), Bob Liu wrote:
> Hi KOSAKI,
>
> On 08/01/2013 10:45 AM, KOSAKI Motohiro wrote:
>
>>
>> Please read the older code. The code you pointed at was a temporary
>> change, and I changed it back to fix bugs.
>> If you look at the state in the middle of direct reclaim, we can't
>> avoid race conditions from multiple direct reclaimers. Moreover, if
>> kswapd doesn't awaken, it is a problem. This is the reason why the
>> current code behaves as you described.
>> I agree we should fix your issue as far as possible. But I can't agree
>> with your analysis.
>>
>
> I found this thread:
> mm, vmscan: fix do_try_to_free_pages() livelock
> https://lkml.org/lkml/2012/6/14/74
>
> I think that's the same issue Lisa met.
>
> But I couldn't find out why your patch didn't get merged.
> There were already many acks.

Just because I misunderstood and thought the patch had already been
merged. OK, I'll resend this.




* Re: Possible deadloop in direct reclaim?
  2013-08-02  8:08                 ` Lisa Du
@ 2013-08-04 23:47                   ` Minchan Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2013-08-04 23:47 UTC (permalink / raw)
  To: Lisa Du; +Cc: linux-mm, KOSAKI Motohiro, Bob Liu

Hello,

On Fri, Aug 02, 2013 at 01:08:50AM -0700, Lisa Du wrote:
> >-----Original Message-----
> >From: Minchan Kim [mailto:minchan@kernel.org]
> >Sent: 2013年8月2日 11:54
> >To: Lisa Du
> >Cc: linux-mm@kvack.org; KOSAKI Motohiro; Bob Liu
> >Subject: Re: Possible deadloop in direct reclaim?
> >
> >On Thu, Aug 01, 2013 at 08:17:56PM -0700, Lisa Du wrote:
> >> >-----Original Message-----
> >> >From: Minchan Kim [mailto:minchan@kernel.org]
> >> >Sent: 2013年8月2日 10:26
> >> >To: Lisa Du
> >> >Cc: linux-mm@kvack.org; KOSAKI Motohiro
> >> >Subject: Re: Possible deadloop in direct reclaim?
> 
> 
> >> >I reviewed current mmotm because recently Mel changed kswapd a lot and
> >> >all_unreclaimable patch history today.
> >> >What I see is recent mmotm has a same problem, too if system have no swap
> >> >and no compaction. Of course, compaction is default yes option so we could
> >> >recommend to enable if system works well but it's up to user and we should
> >> >avoid direct reclaim hang although user disable compaction.
> >> >
> >> >When I see the patch history, real culprit is 929bea7c.
> >> >
> >> >"  zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> >> >    variables nor protected by lock.  Therefore zones can become a state of
> >> >    zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
> >> >    all_unreclaimable() return false even though zone->all_unreclaimabe=1."
> >> >
> >> >I understand the problem but apparently, it makes Lisa's problem because
> >> >kswapd can give up balancing when high order allocation happens to prevent
> >> >excessive reclaim with assuming the process requested high order allocation
> >> >can do direct reclaim/compaction. But what if the process can't reclaim
> >> >by no swap but lots of anon pages and can't compact by !CONFIG_COMPACTION?
> >> >
> >>
> >> Also Bob found below thread, seems Kosaki also found same issue:
> >> mm, vmscan: fix do_try_to_free_pages() livelock
> >> https://lkml.org/lkml/2012/6/14/74
> >
> >I remember it and AFAIR, I had a concern because the description was
> >too vague, without a detailed example, and I fixed Aaditya's problem with
> >another approach. That's why it wasn't merged at that time.
> >
> >Now we have a real problem and analysis, so KOSAKI's patch makes
> >perfect sense to me.
> >
> >Lisa, could you resend KOSAKI's patch with a more detailed description?
> 
> Hi, Minchan and Kosaki
> Would you please help check the below patch, which I resent based on KOSAKI's previous patch?
> I'm not sure if the description is clear enough; please let me know if you have any comments.
> Many thanks!
> From 2dfe137665a694dcc74ae9c8a27641b06190f344 Mon Sep 17 00:00:00 2001
> From: Lisa Du <cldu@marvell.com>
> Date: Fri, 2 Aug 2013 14:37:31 +0800
> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
> 
> Currently, I found the system can enter a state where there are lots
> of free pages in a zone but only order-0 and order-1 pages, which
> means the zone is heavily fragmented. A high-order allocation can then
> stall for a long time in the direct reclaim path (e.g. 60 seconds),
> especially in a no-swap and no-compaction environment.
> 
> The reason is that do_try_to_free_pages() enters a livelock:
> 
> kswapd will go to sleep if the zones have been fully scanned
> and are still not balanced, since kswapd thinks there's little point
> in trying all over again, and it wants to avoid an infinite loop.
> Instead it changes the order from high-order to order-0, because
> kswapd considers order-0 the most important; see commit 73ce02e9 for
> details. If the watermarks are OK, kswapd goes back to sleep and may
> leave zone->all_unreclaimable = 0, assuming high-order users can still
> perform direct reclaim if they wish.
> 
> Direct reclaim continues to reclaim for a high order which is not a
> costly order, without invoking the OOM killer, until kswapd turns on
> zone->all_unreclaimable. This is to avoid a too-early OOM kill. So it
> means direct reclaim depends on kswapd to break this loop.
> 
> In the worst case, direct reclaim may continue page reclaim forever
> while kswapd sleeps forever, until someone like a watchdog detects the
> stall and finally kills the process.
> 
> We can't turn on zone->all_unreclaimable from the direct reclaim path,
> because the direct reclaim path doesn't take any lock, so doing that
> would be racy. Thus this patch removes the zone->all_unreclaimable
> field completely and recalculates the zone's reclaimable state every time.
> 
> Note: we can't take the approach where direct reclaim checks
> zone->pages_scanned directly while kswapd continues to use
> zone->all_unreclaimable, because that is racy. Commit 929bea7c71
> (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
> describes the details.
> 
> Change-Id: If3b44e33e400c1db0e42a5e2fc9ebc7a265f2aae

Remove the Change-Id.
And please write down explicitly: "It's based on KOSAKI's work and I rewrote
the description."
In addition, write down: "The problem happened on v3.4 but it seems the problem
still lives in the current tree because ..."

Otherwise, looks good to me.
If you respin, please send the mail as a new thread so that akpm is not confused.

Thanks!

> Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Cc: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Bob Liu <lliubbo@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Lisa Du <cldu@marvell.com>
> ---
>  include/linux/mm_inline.h |   20 ++++++++++++++++++++
>  include/linux/mmzone.h    |    1 -
>  include/linux/vmstat.h    |    1 -
>  mm/page-writeback.c       |    1 +
>  mm/page_alloc.c           |    5 ++---
>  mm/vmscan.c               |   43 +++++++++----------------------------------
>  mm/vmstat.c               |    3 ++-
>  7 files changed, 34 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 1397ccf..e212fae 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -2,6 +2,7 @@
>  #define LINUX_MM_INLINE_H
>  
>  #include <linux/huge_mm.h>
> +#include <linux/swap.h>
>  
>  /**
>   * page_is_file_cache - should the page be on a file LRU or anon LRU?
> @@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
>  	return lru;
>  }
>  
> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> +{
> +	int nr;
> +
> +	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> +	     zone_page_state(zone, NR_INACTIVE_FILE);
> +
> +	if (get_nr_swap_pages() > 0)
> +		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> +		      zone_page_state(zone, NR_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
> +static inline bool zone_reclaimable(struct zone *zone)
> +{
> +	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +}
> +
>  #endif
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index af4a3b7..e835974 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -352,7 +352,6 @@ struct zone {
>  	 * free areas of different sizes
>  	 */
>  	spinlock_t		lock;
> -	int                     all_unreclaimable; /* All pages pinned */
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  	/* Set to true when the PG_migrate_skip bits should be cleared */
>  	bool			compact_blockskip_flush;
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index c586679..6fff004 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -143,7 +143,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>  }
>  
>  extern unsigned long global_reclaimable_pages(void);
> -extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  
>  #ifdef CONFIG_NUMA
>  /*
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 3f0c895..62bfd92 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -36,6 +36,7 @@
>  #include <linux/pagevec.h>
>  #include <linux/timer.h>
>  #include <linux/sched/rt.h>
> +#include <linux/mm_inline.h>
>  #include <trace/events/writeback.h>
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b100255..19a18c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -60,6 +60,7 @@
>  #include <linux/page-debug-flags.h>
>  #include <linux/hugetlb.h>
>  #include <linux/sched/rt.h>
> +#include <linux/mm_inline.h>
>  
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -647,7 +648,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	int to_free = count;
>  
>  	spin_lock(&zone->lock);
> -	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
>  
>  	while (to_free) {
> @@ -696,7 +696,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
>  				int migratetype)
>  {
>  	spin_lock(&zone->lock);
> -	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
>  
>  	__free_one_page(page, zone, order, migratetype);
> @@ -3095,7 +3094,7 @@ void show_free_areas(unsigned int filter)
>  			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
>  			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
>  			zone->pages_scanned,
> -			(zone->all_unreclaimable ? "yes" : "no")
> +			(!zone_reclaimable(zone) ? "yes" : "no")
>  			);
>  		printk("lowmem_reserve[]:");
>  		for (i = 0; i < MAX_NR_ZONES; i++)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 34582d9..7501d1e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1789,7 +1789,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	 * latencies, so it's better to scan a minimum amount there as
>  	 * well.
>  	 */
> -	if (current_is_kswapd() && zone->all_unreclaimable)
> +	if (current_is_kswapd() && !zone_reclaimable(zone))
>  		force_scan = true;
>  	if (!global_reclaim(sc))
>  		force_scan = true;
> @@ -2244,8 +2244,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		if (global_reclaim(sc)) {
>  			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  				continue;
> -			if (zone->all_unreclaimable &&
> -					sc->priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) &&
> +			    sc->priority != DEF_PRIORITY)
>  				continue;	/* Let kswapd poll it */
>  			if (IS_ENABLED(CONFIG_COMPACTION)) {
>  				/*
> @@ -2283,11 +2283,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	return aborted_reclaim;
>  }
>  
> -static bool zone_reclaimable(struct zone *zone)
> -{
> -	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> -}
> -
>  /* All zones in zonelist are unreclaimable? */
>  static bool all_unreclaimable(struct zonelist *zonelist,
>  		struct scan_control *sc)
> @@ -2301,8 +2296,6 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>  			continue;
>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  			continue;
> -		if (zone->all_unreclaimable)
> -			continue;
>  		if (zone_reclaimable(zone))
>  			return false;
>  	}
> @@ -2714,7 +2707,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * DEF_PRIORITY. Effectively, it considers them balanced so
>  		 * they must be considered balanced here as well!
>  		 */
> -		if (zone->all_unreclaimable) {
> +		if (!zone_reclaimable(zone)) {
>  			balanced_pages += zone->managed_pages;
>  			continue;
>  		}
> @@ -2775,7 +2768,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  			       unsigned long lru_pages,
>  			       unsigned long *nr_attempted)
>  {
> -	unsigned long nr_slab;
>  	int testorder = sc->order;
>  	unsigned long balance_gap;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -2820,15 +2812,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  	shrink_zone(zone, sc);
>  
>  	reclaim_state->reclaimed_slab = 0;
> -	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> +	shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>  	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>  
>  	/* Account for the number of pages attempted to reclaim */
>  	*nr_attempted += sc->nr_to_reclaim;
>  
> -	if (nr_slab == 0 && !zone_reclaimable(zone))
> -		zone->all_unreclaimable = 1;
> -
>  	zone_clear_flag(zone, ZONE_WRITEBACK);
>  
>  	/*
> @@ -2837,7 +2826,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  	 * BDIs but as pressure is relieved, speculatively avoid congestion
>  	 * waits.
>  	 */
> -	if (!zone->all_unreclaimable &&
> +	if (zone_reclaimable(zone) &&
>  	    zone_balanced(zone, testorder, 0, classzone_idx)) {
>  		zone_clear_flag(zone, ZONE_CONGESTED);
>  		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
> @@ -2903,7 +2892,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> +			if (!zone_reclaimable(zone) &&
>  			    sc.priority != DEF_PRIORITY)
>  				continue;
>  
> @@ -2982,7 +2971,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> +			if (!zone_reclaimable(zone) &&
>  			    sc.priority != DEF_PRIORITY)
>  				continue;
>  
> @@ -3267,20 +3256,6 @@ unsigned long global_reclaimable_pages(void)
>  	return nr;
>  }
>  
> -unsigned long zone_reclaimable_pages(struct zone *zone)
> -{
> -	int nr;
> -
> -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -	     zone_page_state(zone, NR_INACTIVE_FILE);
> -
> -	if (get_nr_swap_pages() > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON);
> -
> -	return nr;
> -}
> -
>  #ifdef CONFIG_HIBERNATION
>  /*
>   * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
> @@ -3578,7 +3553,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>  		return ZONE_RECLAIM_FULL;
>  
> -	if (zone->all_unreclaimable)
> +	if (!zone_reclaimable(zone))
>  		return ZONE_RECLAIM_FULL;
>  
>  	/*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 20c2ef4..c48f75b 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -19,6 +19,7 @@
>  #include <linux/math64.h>
>  #include <linux/writeback.h>
>  #include <linux/compaction.h>
> +#include <linux/mm_inline.h>
>  
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
> @@ -1052,7 +1053,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  all_unreclaimable: %u"
>  		   "\n  start_pfn:         %lu"
>  		   "\n  inactive_ratio:    %u",
> -		   zone->all_unreclaimable,
> +		   !zone_reclaimable(zone),
>  		   zone->zone_start_pfn,
>  		   zone->inactive_ratio);
>  	seq_putc(m, '\n');
> -- 
> 1.7.0.4

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* Re: Possible deadloop in direct reclaim?
  2013-08-03 21:22                 ` KOSAKI Motohiro
@ 2013-08-04 23:50                   ` Minchan Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2013-08-04 23:50 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Bob Liu, Lisa Du, Christoph Lameter, linux-mm, Mel Gorman, Bob Liu

Hi KOSAKI,

On Sat, Aug 03, 2013 at 05:22:11PM -0400, KOSAKI Motohiro wrote:
> (8/1/13 12:21 AM), Bob Liu wrote:
> >Hi KOSAKI,
> >
> >On 08/01/2013 10:45 AM, KOSAKI Motohiro wrote:
> >
> >>
> >>Please read the older code. The code you pointed at was a temporary
> >>change and I changed it back to fix bugs.
> >>If you look at the state in the middle of direct reclaim, we can't
> >>avoid a race condition among multiple direct reclaim issuers.
> >>Moreover, if kswapd doesn't awaken, it is a problem. This is the
> >>reason why the current code behaves as you described.
> >>I agree we should fix your issue as far as possible. But I can't
> >>agree with your analysis.
> >>
> >
> >I found this thread:
> >mm, vmscan: fix do_try_to_free_pages() livelock
> >https://lkml.org/lkml/2012/6/14/74
> >
> >I think that's the same issue Lisa met.
> >
> >But I couldn't find out why your patch didn't get merged;
> >there were already many acks.
> 
> Just because I misunderstood and thought the patch had already been merged.
> OK, I'll resend it.

Just FYI,
Now Lisa is working on it and plans to resend it with a more concrete
description based on your old version.

Thanks.

-- 
Kind regards,
Minchan Kim



end of thread, other threads:[~2013-08-04 23:49 UTC | newest]

Thread overview: 36+ messages
2013-07-23  4:58 Possible deadloop in direct reclaim? Lisa Du
2013-07-23 20:28 ` Christoph Lameter
2013-07-24  1:21   ` Lisa Du
2013-07-25 18:19     ` KOSAKI Motohiro
2013-07-26  1:11       ` Lisa Du
2013-07-29 16:44         ` KOSAKI Motohiro
2013-07-30  1:27           ` Lisa Du
2013-08-01  2:24           ` Lisa Du
2013-08-01  2:45             ` KOSAKI Motohiro
2013-08-01  4:21               ` Bob Liu
2013-08-03 21:22                 ` KOSAKI Motohiro
2013-08-04 23:50                   ` Minchan Kim
2013-08-01  5:19               ` Lisa Du
2013-08-01  8:56                 ` Russell King - ARM Linux
2013-08-02  1:18                   ` Lisa Du
2013-07-29  1:32       ` Lisa Du
2013-07-24  1:18 ` Bob Liu
2013-07-24  1:31   ` Lisa Du
2013-07-24  2:23   ` Lisa Du
2013-07-24  3:38     ` Bob Liu
2013-07-24  5:58       ` Lisa Du
2013-07-25 18:14   ` KOSAKI Motohiro
2013-07-26  1:22     ` Bob Liu
2013-07-29 16:46       ` KOSAKI Motohiro
2013-08-01  5:43 ` Minchan Kim
2013-08-01  6:13   ` Lisa Du
2013-08-01  7:33     ` Minchan Kim
2013-08-01  8:20       ` Lisa Du
2013-08-01  8:42         ` Minchan Kim
2013-08-02  1:03           ` Lisa Du
2013-08-02  2:26           ` Minchan Kim
2013-08-02  2:33             ` Minchan Kim
2013-08-02  3:17             ` Lisa Du
2013-08-02  3:53               ` Minchan Kim
2013-08-02  8:08                 ` Lisa Du
2013-08-04 23:47                   ` Minchan Kim
