* [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
@ 2013-08-05  2:26 Lisa Du
  2013-08-05  2:56 ` Minchan Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Lisa Du @ 2013-08-05  2:26 UTC (permalink / raw)
  To: linux-mm, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang
  Cc: Russell King - ARM Linux

From: Lisa Du <cldu@marvell.com>
Date: Mon, 5 Aug 2013 09:26:57 +0800
Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock

This patch is based on KOSAKI's work and I add a little more
description; please refer to https://lkml.org/lkml/2012/6/14/74.

I found that the system can enter a state where a zone has lots of
free pages but only order-0 and order-1 ones, which means the zone is
heavily fragmented. A high-order allocation can then stall the direct
reclaim path for a long time (e.g. 60 seconds), especially in an
environment with no swap and no compaction. This problem happened on
v3.4, but the issue seems to still live in the current tree; the
reason is that do_try_to_free_pages() enters a livelock:

kswapd will go to sleep if the zones have been fully scanned and are
still not balanced, as kswapd thinks there's little point in trying
all over again and wants to avoid an infinite loop. Instead it drops
the reclaim order from high-order to order-0, because kswapd considers
order-0 the most important; see commit 73ce02e9 for details. If the
order-0 watermarks are OK, kswapd goes back to sleep and may leave
zone->all_unreclaimable == 0. It assumes high-order users can still
perform direct reclaim if they wish.

Direct reclaim keeps reclaiming for a high order which is not a
COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
zone->all_unreclaimable; this is to avoid a too-early oom-kill. So
direct reclaim depends on kswapd to break this loop.

In the worst case, while kswapd sleeps forever, direct reclaim may
continue page reclaim forever, until something like a watchdog detects
it and finally kills the process. As described in:
http://thread.gmane.org/gmane.linux.kernel.mm/103737
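
As an illustration, here is a toy userspace model of the livelock (a
simplified sketch, not kernel code; all names and numbers below are
made up): direct reclaim only gives up on a !costly order once
all_unreclaimable is set, but only kswapd ever sets that flag, and
kswapd is asleep.

  #include <stdbool.h>
  #include <stdio.h>

  static bool zone_all_unreclaimable;  /* only "kswapd" would set this */

  static unsigned long direct_reclaim(void)
  {
          return 0;  /* heavily fragmented zone: no progress, ever */
  }

  int main(void)
  {
          unsigned long attempts = 0;

          /* kswapd saw good order-0 watermarks and went back to
           * sleep, leaving the flag at 0 forever. */
          while (!zone_all_unreclaimable) {
                  if (direct_reclaim() > 0)
                          break;  /* progress: retry the allocation */
                  if (++attempts > 1000000) {  /* bounded for the demo */
                          puts("livelocked: nothing will set the flag");
                          return 1;
                  }
          }
          return 0;
  }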

We can't turn on zone->all_unreclaimable from the direct reclaim path
because direct reclaim doesn't take any lock, so doing it there would
be racy. Thus this patch removes the zone->all_unreclaimable field
completely and recalculates the zone's reclaimable state every time.

Note: we can't take the approach of having direct reclaim look at
zone->pages_scanned directly while kswapd continues to use
zone->all_unreclaimable, because that is racy too. Commit 929bea7c71
(vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
describes the details.
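
As a worked example of the replacement heuristic (a userspace sketch
of the same arithmetic, not the kernel function; the numbers are made
up): with 1000 reclaimable pages on the LRUs, a zone counts as
reclaimable until pages_scanned reaches six times that, i.e. 6000.
Because the state is recomputed on every check, freeing pages (which
resets zone->pages_scanned) immediately makes the zone reclaimable
again, with no flag for anyone to forget to clear.

  #include <stdbool.h>
  #include <stdio.h>

  static bool zone_reclaimable(unsigned long pages_scanned,
                               unsigned long reclaimable_pages)
  {
          /* "give up after scanning everything six times over" */
          return pages_scanned < reclaimable_pages * 6;
  }

  int main(void)
  {
          printf("%d\n", zone_reclaimable(5999, 1000));  /* 1: keep going */
          printf("%d\n", zone_reclaimable(6000, 1000));  /* 0: hopeless   */
          return 0;
  }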

Change-Id: If3b44e33e400c1db0e42a5e2fc9ebc7a265f2aae
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Neil Zhang <zhangwm@marvell.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
---
 include/linux/mm_inline.h |   20 ++++++++++++++++++++
 include/linux/mmzone.h    |    1 -
 include/linux/vmstat.h    |    1 -
 mm/page-writeback.c       |    1 +
 mm/page_alloc.c           |    5 ++---
 mm/vmscan.c               |   43 ++++++++++---------------------------------
 mm/vmstat.c               |    3 ++-
 7 files changed, 35 insertions(+), 39 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 1397ccf..e212fae 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -2,6 +2,7 @@
 #define LINUX_MM_INLINE_H
 
 #include <linux/huge_mm.h>
+#include <linux/swap.h>
 
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
@@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
 	return lru;
 }
 
+static inline unsigned long zone_reclaimable_pages(struct zone *zone)
+{
+	int nr;
+
+	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
+	     zone_page_state(zone, NR_INACTIVE_FILE);
+
+	if (get_nr_swap_pages() > 0)
+		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
+		      zone_page_state(zone, NR_INACTIVE_ANON);
+
+	return nr;
+}
+
+static inline bool zone_reclaimable(struct zone *zone)
+{
+	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+}
+
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index af4a3b7..e835974 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -352,7 +352,6 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 	/* Set to true when the PG_migrate_skip bits should be cleared */
 	bool			compact_blockskip_flush;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index c586679..6fff004 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -143,7 +143,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 }
 
 extern unsigned long global_reclaimable_pages(void);
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 
 #ifdef CONFIG_NUMA
 /*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..62bfd92 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,6 +36,7 @@
 #include <linux/pagevec.h>
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
+#include <linux/mm_inline.h>
 #include <trace/events/writeback.h>
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..19a18c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
+#include <linux/mm_inline.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -647,7 +648,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int to_free = count;
 
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	while (to_free) {
@@ -696,7 +696,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	__free_one_page(page, zone, order, migratetype);
@@ -3095,7 +3094,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
 			zone->pages_scanned,
-			(zone->all_unreclaimable ? "yes" : "no")
+			(!zone_reclaimable(zone) ? "yes" : "no")
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cff0d4..7501d1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1789,7 +1789,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && zone->all_unreclaimable)
+	if (current_is_kswapd() && !zone_reclaimable(zone))
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
@@ -2244,8 +2244,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			if (zone->all_unreclaimable &&
-					sc->priority != DEF_PRIORITY)
+			if (!zone_reclaimable(zone) &&
+			    sc->priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
 			if (IS_ENABLED(CONFIG_COMPACTION)) {
 				/*
@@ -2283,11 +2283,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	return aborted_reclaim;
 }
 
-static bool zone_reclaimable(struct zone *zone)
-{
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
-}
-
 /* All zones in zonelist are unreclaimable? */
 static bool all_unreclaimable(struct zonelist *zonelist,
 		struct scan_control *sc)
@@ -2301,7 +2296,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (!zone->all_unreclaimable)
+		if (zone_reclaimable(zone))
 			return false;
 	}
 
@@ -2712,7 +2707,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 		 * DEF_PRIORITY. Effectively, it considers them balanced so
 		 * they must be considered balanced here as well!
 		 */
-		if (zone->all_unreclaimable) {
+		if (!zone_reclaimable(zone)) {
 			balanced_pages += zone->managed_pages;
 			continue;
 		}
@@ -2773,7 +2768,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 			       unsigned long lru_pages,
 			       unsigned long *nr_attempted)
 {
-	unsigned long nr_slab;
 	int testorder = sc->order;
 	unsigned long balance_gap;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2818,15 +2812,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	shrink_zone(zone, sc);
 
 	reclaim_state->reclaimed_slab = 0;
-	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+	shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 
 	/* Account for the number of pages attempted to reclaim */
 	*nr_attempted += sc->nr_to_reclaim;
 
-	if (nr_slab == 0 && !zone_reclaimable(zone))
-		zone->all_unreclaimable = 1;
-
 	zone_clear_flag(zone, ZONE_WRITEBACK);
 
 	/*
@@ -2835,7 +2826,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * BDIs but as pressure is relieved, speculatively avoid congestion
 	 * waits.
 	 */
-	if (!zone->all_unreclaimable &&
+	if (zone_reclaimable(zone) &&
 	    zone_balanced(zone, testorder, 0, classzone_idx)) {
 		zone_clear_flag(zone, ZONE_CONGESTED);
 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
@@ -2901,7 +2892,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
+			if (!zone_reclaimable(zone) &&
 			    sc.priority != DEF_PRIORITY)
 				continue;
 
@@ -2980,7 +2971,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
+			if (!zone_reclaimable(zone) &&
 			    sc.priority != DEF_PRIORITY)
 				continue;
 
@@ -3265,20 +3256,6 @@ unsigned long global_reclaimable_pages(void)
 	return nr;
 }
 
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-	int nr;
-
-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
-	     zone_page_state(zone, NR_INACTIVE_FILE);
-
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON);
-
-	return nr;
-}
-
 #ifdef CONFIG_HIBERNATION
 /*
  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
@@ -3576,7 +3553,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return ZONE_RECLAIM_FULL;
 
-	if (zone->all_unreclaimable)
+	if (!zone_reclaimable(zone))
 		return ZONE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c2ef4..c48f75b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -19,6 +19,7 @@
 #include <linux/math64.h>
 #include <linux/writeback.h>
 #include <linux/compaction.h>
+#include <linux/mm_inline.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -1052,7 +1053,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  all_unreclaimable: %u"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
-		   zone->all_unreclaimable,
+		   !zone_reclaimable(zone),
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.0.4


Thanks!

Best Regards
Lisa Du


* Re: [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-05  2:26 [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock Lisa Du
@ 2013-08-05  2:56 ` Minchan Kim
  2013-08-05  4:53 ` Johannes Weiner
  2013-08-05  7:41 ` Michal Hocko
  2 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2013-08-05  2:56 UTC (permalink / raw)
  To: Lisa Du
  Cc: linux-mm, KOSAKI Motohiro, Mel Gorman, Christoph Lameter,
	Bob Liu, Neil Zhang, Russell King - ARM Linux

On Sun, Aug 04, 2013 at 07:26:38PM -0700, Lisa Du wrote:
> [... full changelog snipped ...]
> 
> Change-Id: If3b44e33e400c1db0e42a5e2fc9ebc7a265f2aae

Please remove this line. And it seems this should go to stable, if
others agree. Otherwise, looks good to me.

Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim


* Re: [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-05  2:26 [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock Lisa Du
  2013-08-05  2:56 ` Minchan Kim
@ 2013-08-05  4:53 ` Johannes Weiner
  2013-08-05  5:02   ` Minchan Kim
  2013-08-05  7:41 ` Michal Hocko
  2 siblings, 1 reply; 51+ messages in thread
From: Johannes Weiner @ 2013-08-05  4:53 UTC (permalink / raw)
  To: Lisa Du
  Cc: linux-mm, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux

On Sun, Aug 04, 2013 at 07:26:38PM -0700, Lisa Du wrote:
> [... changelog and tags snipped ...]
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Lisa Du <cldu@marvell.com>

Wow, the original patch is over a year old.  As before:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

One comment:

> @@ -2244,8 +2244,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		if (global_reclaim(sc)) {
>  			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  				continue;
> -			if (zone->all_unreclaimable &&
> -					sc->priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) &&
> +			    sc->priority != DEF_PRIORITY)
>  				continue;	/* Let kswapd poll it */
>  			if (IS_ENABLED(CONFIG_COMPACTION)) {
>  				/*

As Michal pointed out last time, it would make sense to reorder these
checks because the priority test is much lighter than calculating the
reclaimable pages.  Would make DEF_PRIORITY cycles slightly lighter.

It's not necessarily about the performance but if we leave it like
this there will be boring patches in the future that change it to do
the light-weight check first, claiming it will improve performance,
and then somebody else will ask them for benchmark results and they
will ask how page reclaim is usually benchmarked and everybody will
shrug their shoulders and go "good question" until somebody blames
memory cgroups.

So, please, save us from all this drama and reorder the checks.
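
I.e. something like this (untested):

	if (sc->priority != DEF_PRIORITY &&
	    !zone_reclaimable(zone))
		continue;	/* Let kswapd poll it */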

Thanks!
Johannes


* Re: [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-05  4:53 ` Johannes Weiner
@ 2013-08-05  5:02   ` Minchan Kim
  0 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2013-08-05  5:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Lisa Du, linux-mm, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux

On Mon, Aug 05, 2013 at 12:53:43AM -0400, Johannes Weiner wrote:
> On Sun, Aug 04, 2013 at 07:26:38PM -0700, Lisa Du wrote:
> > [... full patch snipped ...]
> 
> Wow, the original patch is over a year old.  As before:
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> One comment:
> 
> > [... shrink_zones() hunk snipped ...]
> 
> As Michal pointed out last time, it would make sense to reorder these
> checks because the priority test is much lighter than calculating the
> reclaimable pages.  Would make DEF_PRIORITY cycles slightly lighter.
> 
> It's not necessarily about the performance but if we leave it like
> this there will be boring patches in the future that change it to do
> the light-weight check first, claiming it will improve performance,
> and then somebody else will ask them for benchmark results and they
> will ask how page reclaim is usually benchmarked and everybody will
> shrug their shoulders and go "good question" until somebody blames
> memory cgroups.
> 
> So, please, save us from all this drama and reorder the checks.

+1

I don't want to pay my money for a soap opera.

-- 
Kind regards,
Minchan Kim


* Re: [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-05  2:26 [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock Lisa Du
  2013-08-05  2:56 ` Minchan Kim
  2013-08-05  4:53 ` Johannes Weiner
@ 2013-08-05  7:41 ` Michal Hocko
  2013-08-06  9:23   ` [resend] [PATCH V2] " Lisa Du
  2 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2013-08-05  7:41 UTC (permalink / raw)
  To: Lisa Du
  Cc: linux-mm, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux

It would help to CC all people mentioned in Cc: ;)

On Sun 04-08-13 19:26:38, Lisa Du wrote:
> From: Lisa Du <cldu@marvell.com>
> Date: Mon, 5 Aug 2013 09:26:57 +0800
> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
> 
> This patch is based on KOSAKI's work and I add a little more
> description, please refer https://lkml.org/lkml/2012/6/14/74.
> 
> Currently, I found system can enter a state that there are lots
> of free pages in a zone but only order-0 and order-1 pages which
> means the zone is heavily fragmented, then high order allocation
> could make direct reclaim path's long stall(ex, 60 seconds)
> especially in no swap and no compaciton enviroment. This problem
> happened on v3.4, but it seems issue still lives in current tree,
> the reason is do_try_to_free_pages enter live lock:
> 
> kswapd will go to sleep if the zones have been fully scanned
> and are still not balanced. As kswapd thinks there's little point
> trying all over again to avoid infinite loop. Instead it changes
> order from high-order to 0-order because kswapd think order-0 is the
> most important. Look at 73ce02e9 in detail. If watermarks are ok,
> kswapd will go back to sleep and may leave zone->all_unreclaimable = 0.
> It assume high-order users can still perform direct reclaim if they wish.
> 
> Direct reclaim continue to reclaim for a high order which is not a
> COSTLY_ORDER without oom-killer until kswapd turn on zone->all_unreclaimble.
> This is because to avoid too early oom-kill. So it means direct_reclaim
> depends on kswapd to break this loop.
> 
> In worst case, direct-reclaim may continue to page reclaim forever
> when kswapd sleeps forever until someone like watchdog detect and finally
> kill the process. As described in:
> http://thread.gmane.org/gmane.linux.kernel.mm/103737
> 
> We can't turn on zone->all_unreclaimable from direct reclaim path
> because direct reclaim path don't take any lock and this way is racy.
> Thus this patch removes zone->all_unreclaimable field completely and
> recalculates zone reclaimable state every time.
> 
> Note: we can't take the idea that direct-reclaim see zone->pages_scanned
> directly and kswapd continue to use zone->all_unreclaimable. Because, it
> is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
> zone->all_unreclaimable as a name) describes the detail.
> 
> Change-Id: If3b44e33e400c1db0e42a5e2fc9ebc7a265f2aae
> Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Cc: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Bob Liu <lliubbo@gmail.com>
> Cc: Neil Zhang <zhangwm@marvell.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Lisa Du <cldu@marvell.com>

all_unreclaimable was just a bad idea from the very beginning

Same as the last time
Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> [... diffstat and full patch snipped ...]

-- 
Michal Hocko
SUSE Labs


* [resend] [PATCH V2] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-05  7:41 ` Michal Hocko
@ 2013-08-06  9:23   ` Lisa Du
  2013-08-06 10:35     ` Michal Hocko
  0 siblings, 1 reply; 51+ messages in thread
From: Lisa Du @ 2013-08-06  9:23 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux
  Cc: Aaditya Kumar, yinghan, npiggin, riel, hannes, kamezawa.hiroyu

In this version:
Removed the Change-Id according to Minchan's comment;
reordered the check in shrink_zones() according to Johannes's comment.
Also CC'ed more people.


* Re: [resend] [PATCH V2] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-06  9:23   ` [resend] [PATCH V2] " Lisa Du
@ 2013-08-06 10:35     ` Michal Hocko
  2013-08-07  1:42       ` Lisa Du
  0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2013-08-06 10:35 UTC (permalink / raw)
  To: Lisa Du
  Cc: linux-mm, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux,
	Aaditya Kumar, yinghan, npiggin, riel, hannes, kamezawa.hiroyu

On Tue 06-08-13 02:23:44, Lisa Du wrote:
> In this version:
> Removed the Change-Id according to Minchan's comment;
> reordered the check in shrink_zones() according to Johannes's comment.
> Also CC'ed more people.
> 
> From d0c9da37bb3dfd271513cfde8bca59ee1a7d4c44 Mon Sep 17 00:00:00 2001
> From: Lisa Du <cldu@marvell.com>
> Date: Mon, 5 Aug 2013 09:26:57 +0800
> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
> 
> [... changelog unchanged from v1, snipped ...]
> 
> Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Cc: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>

Seems you forgot to add my Reviewed-by ;)

> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Bob Liu <lliubbo@gmail.com>
> Cc: Neil Zhang <zhangwm@marvell.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Lisa Du <cldu@marvell.com>
> [... diffstat and diff snipped; identical to v1 except this hunk in
> shrink_zones(), where the checks are now reordered: ...]
> 
> -			if (zone->all_unreclaimable &&
> -					sc->priority != DEF_PRIORITY)
> +			if (sc->priority != DEF_PRIORITY &&
> +			    !zone_reclaimable(zone))
>  				continue;	/* Let kswapd poll it */

-- 
Michal Hocko
SUSE Labs


* RE: [resend] [PATCH V2] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-06 10:35     ` Michal Hocko
@ 2013-08-07  1:42       ` Lisa Du
  2013-08-08 18:14         ` Johannes Weiner
  0 siblings, 1 reply; 51+ messages in thread
From: Lisa Du @ 2013-08-07  1:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux,
	Aaditya Kumar, yinghan, npiggin, riel, hannes, kamezawa.hiroyu

Hi, Michal and all,
   I'm so sorry; I've now updated the review tag list.
   Thank you all for the reviews and comments!

In this version:
Removed the Change-Id according to Minchan's comment;
reordered the check in shrink_zones() according to Johannes's comment.
Also CC'ed more people.


* Re: [resend] [PATCH V2] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-07  1:42       ` Lisa Du
@ 2013-08-08 18:14         ` Johannes Weiner
  2013-08-12  1:46           ` [resend] [PATCH V3] " Lisa Du
  2013-08-19  8:19           ` Lisa Du
  0 siblings, 2 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-08 18:14 UTC (permalink / raw)
  To: Lisa Du
  Cc: Michal Hocko, linux-mm, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux,
	Aaditya Kumar, yinghan, npiggin, riel, kamezawa.hiroyu

On Tue, Aug 06, 2013 at 06:42:06PM -0700, Lisa Du wrote:
> Hi, Michal and all
>    I'm so sorry, now I updated the review list.
>    Thank you all for the review and comments!
> 
> In this version:
> Removed the Change-Id according to Minchan's comment;
> reordered the check in shrink_zones() according to Johannes's comment.
> Also CC'ed more people.
> 
> From ff345123117394c6b9b838bf54eb935bfa879f32 Mon Sep 17 00:00:00 2001
> From: Lisa Du <cldu@marvell.com>
> Date: Mon, 5 Aug 2013 09:26:57 +0800
> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
> 
> [... changelog snipped ...]
> 
> Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Cc: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Bob Liu <lliubbo@gmail.com>
> Cc: Neil Zhang <zhangwm@marvell.com>
> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

I sent an Acked-by; please keep these tags verbatim.

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Lisa Du <cldu@marvell.com>

> @@ -2244,8 +2244,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		if (global_reclaim(sc)) {
>  			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  				continue;
> -			if (zone->all_unreclaimable &&
> -					sc->priority != DEF_PRIORITY)
> +			if (sc->priority != DEF_PRIORITY &&
> +			    !zone_reclaimable(zone))
>  				continue;	/* Let kswapd poll it */
>  			if (IS_ENABLED(CONFIG_COMPACTION)) {
>  				/*

Yes, this is better, thanks.
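
The reordering matters because the two tests have very different costs;
the sketch below annotates the hunk above for illustration (the comments
are editorial, not part of the patch):

	/*
	 * The priority test is a single integer compare, while
	 * zone_reclaimable() sums several per-zone vmstat counters.
	 * With &&'s short-circuit evaluation, the common first pass
	 * (sc->priority == DEF_PRIORITY) never pays for the sums.
	 */
	if (sc->priority != DEF_PRIORITY &&	/* cheap compare */
	    !zone_reclaimable(zone))		/* costly counter sums */
		continue;	/* Let kswapd poll it */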

> @@ -2901,7 +2892,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> +			if (!zone_reclaimable(zone) &&
>  			    sc.priority != DEF_PRIORITY)
>  				continue;
>  

> @@ -2980,7 +2971,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> +			if (!zone_reclaimable(zone) &&
>  			    sc.priority != DEF_PRIORITY)
>  				continue;
>  

What about those?


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-08 18:14         ` Johannes Weiner
@ 2013-08-12  1:46           ` Lisa Du
  2013-08-20 22:16             ` Andrew Morton
  2013-08-27 19:43             ` Andrew Morton
  2013-08-19  8:19           ` Lisa Du
  1 sibling, 2 replies; 51+ messages in thread
From: Lisa Du @ 2013-08-12  1:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, Minchan Kim, KOSAKI Motohiro, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux,
	Aaditya Kumar, yinghan, npiggin, riel, kamezawa.hiroyu

In this version:
Reordered the check in pgdat_balanced() according to Johannes's comment.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-08 18:14         ` Johannes Weiner
  2013-08-12  1:46           ` [resend] [PATCH V3] " Lisa Du
@ 2013-08-19  8:19           ` Lisa Du
  1 sibling, 0 replies; 51+ messages in thread
From: Lisa Du @ 2013-08-19  8:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman, Christoph Lameter, Bob Liu,
	Neil Zhang, Russell King - ARM Linux, Aaditya Kumar, yinghan,
	npiggin, riel, Lisa Du, kamezawa.hiroyu

Hi Andrew,
Would you please have a look at the patch below and give your comments, if any?
Thanks a lot!

Best Regards
Lisa Du


>-----Original Message-----
>From: Lisa Du
>Sent: 12 August 2013 9:46
>To: 'Johannes Weiner'
>Cc: Michal Hocko; linux-mm@kvack.org; Minchan Kim; KOSAKI Motohiro; Mel Gorman; Christoph Lameter; Bob Liu; Neil Zhang;
>Russell King - ARM Linux; Aaditya Kumar; yinghan@google.com; npiggin@gmail.com; riel@redhat.com;
>kamezawa.hiroyu@jp.fujitsu.com
>Subject: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
>
>In this version:
>Reordered the check in pgdat_balanced() according to Johannes's comment.
>
>From 66a98566792b954e187dca251fbe3819aeb977b9 Mon Sep 17 00:00:00 2001
>From: Lisa Du <cldu@marvell.com>
>Date: Mon, 5 Aug 2013 09:26:57 +0800
>Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
>
>This patch is based on KOSAKI's work, with a little more description
>added; please refer to https://lkml.org/lkml/2012/6/14/74.
>
>I found that the system can enter a state where a zone has lots of
>free pages, but only order-0 and order-1 pages, which means the zone
>is heavily fragmented. A high-order allocation can then cause a long
>stall (e.g., 60 seconds) in the direct reclaim path, especially in an
>environment with no swap and no compaction. This problem happened on
>v3.4, but the issue still seems to live in the current tree. The
>reason is that do_try_to_free_pages() enters a livelock:
>
>kswapd will go to sleep if the zones have been fully scanned and are
>still not balanced, as kswapd thinks there is little point in trying
>all over again and wants to avoid an infinite loop. Instead it changes
>the order from high-order to order-0, because kswapd thinks order-0 is
>the most important; see commit 73ce02e9 for details. If the watermarks
>are OK, kswapd goes back to sleep and may leave
>zone->all_unreclaimable = 0. It assumes high-order users can still
>perform direct reclaim if they wish.
>
>Direct reclaim keeps reclaiming for a high order which is not a
>COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
>zone->all_unreclaimable. This is to avoid a premature oom-kill. So it
>means direct reclaim depends on kswapd to break this loop.
>
>In the worst case, direct reclaim may continue page reclaim forever
>while kswapd sleeps forever, until someone like a watchdog detects it
>and finally kills the process. As described in:
>http://thread.gmane.org/gmane.linux.kernel.mm/103737
>
>We can't turn on zone->all_unreclaimable from the direct reclaim path,
>because direct reclaim doesn't take any lock, so doing it that way
>would be racy. Thus this patch removes the zone->all_unreclaimable
>field completely and recalculates the zone's reclaimable state every
>time.
>
>Note: we can't take the approach of having direct reclaim look at
>zone->pages_scanned directly while kswapd continues to use
>zone->all_unreclaimable, because that is racy. Commit 929bea7c71
>(vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
>describes the details.
>
>Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
>Cc: Ying Han <yinghan@google.com>
>Cc: Nick Piggin <npiggin@gmail.com>
>Acked-by: Rik van Riel <riel@redhat.com>
>Cc: Mel Gorman <mel@csn.ul.ie>
>Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>Cc: Christoph Lameter <cl@linux.com>
>Cc: Bob Liu <lliubbo@gmail.com>
>Cc: Neil Zhang <zhangwm@marvell.com>
>Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
>Reviewed-by: Michal Hocko <mhocko@suse.cz>
>Acked-by: Minchan Kim <minchan@kernel.org>
>Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>Signed-off-by: Lisa Du <cldu@marvell.com>
>---
> include/linux/mm_inline.h |   20 +++++++++++++++++++
> include/linux/mmzone.h    |    1 -
> include/linux/vmstat.h    |    1 -
> mm/page-writeback.c       |    1 +
> mm/page_alloc.c           |    5 +--
> mm/vmscan.c               |   47 +++++++++++---------------------------------
> mm/vmstat.c               |    3 +-
> 7 files changed, 37 insertions(+), 41 deletions(-)
>
>diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>index 1397ccf..e212fae 100644
>--- a/include/linux/mm_inline.h
>+++ b/include/linux/mm_inline.h
>@@ -2,6 +2,7 @@
> #define LINUX_MM_INLINE_H
>
> #include <linux/huge_mm.h>
>+#include <linux/swap.h>
>
> /**
>  * page_is_file_cache - should the page be on a file LRU or anon LRU?
>@@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
> 	return lru;
> }
>
>+static inline unsigned long zone_reclaimable_pages(struct zone *zone)
>+{
>+	int nr;
>+
>+	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>+	     zone_page_state(zone, NR_INACTIVE_FILE);
>+
>+	if (get_nr_swap_pages() > 0)
>+		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>+		      zone_page_state(zone, NR_INACTIVE_ANON);
>+
>+	return nr;
>+}
>+
>+static inline bool zone_reclaimable(struct zone *zone)
>+{
>+	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>+}
>+
> #endif
>diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>index af4a3b7..e835974 100644
>--- a/include/linux/mmzone.h
>+++ b/include/linux/mmzone.h
>@@ -352,7 +352,6 @@ struct zone {
> 	 * free areas of different sizes
> 	 */
> 	spinlock_t		lock;
>-	int                     all_unreclaimable; /* All pages pinned */
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> 	/* Set to true when the PG_migrate_skip bits should be cleared */
> 	bool			compact_blockskip_flush;
>diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>index c586679..6fff004 100644
>--- a/include/linux/vmstat.h
>+++ b/include/linux/vmstat.h
>@@ -143,7 +143,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
> }
>
> extern unsigned long global_reclaimable_pages(void);
>-extern unsigned long zone_reclaimable_pages(struct zone *zone);
>
> #ifdef CONFIG_NUMA
> /*
>diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>index 3f0c895..62bfd92 100644
>--- a/mm/page-writeback.c
>+++ b/mm/page-writeback.c
>@@ -36,6 +36,7 @@
> #include <linux/pagevec.h>
> #include <linux/timer.h>
> #include <linux/sched/rt.h>
>+#include <linux/mm_inline.h>
> #include <trace/events/writeback.h>
>
> /*
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index b100255..19a18c0 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -60,6 +60,7 @@
> #include <linux/page-debug-flags.h>
> #include <linux/hugetlb.h>
> #include <linux/sched/rt.h>
>+#include <linux/mm_inline.h>
>
> #include <asm/sections.h>
> #include <asm/tlbflush.h>
>@@ -647,7 +648,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> 	int to_free = count;
>
> 	spin_lock(&zone->lock);
>-	zone->all_unreclaimable = 0;
> 	zone->pages_scanned = 0;
>
> 	while (to_free) {
>@@ -696,7 +696,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> 				int migratetype)
> {
> 	spin_lock(&zone->lock);
>-	zone->all_unreclaimable = 0;
> 	zone->pages_scanned = 0;
>
> 	__free_one_page(page, zone, order, migratetype);
>@@ -3095,7 +3094,7 @@ void show_free_areas(unsigned int filter)
> 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
> 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> 			zone->pages_scanned,
>-			(zone->all_unreclaimable ? "yes" : "no")
>+			(!zone_reclaimable(zone) ? "yes" : "no")
> 			);
> 		printk("lowmem_reserve[]:");
> 		for (i = 0; i < MAX_NR_ZONES; i++)
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 2cff0d4..3fe3d5d 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -1789,7 +1789,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> 	 * latencies, so it's better to scan a minimum amount there as
> 	 * well.
> 	 */
>-	if (current_is_kswapd() && zone->all_unreclaimable)
>+	if (current_is_kswapd() && !zone_reclaimable(zone))
> 		force_scan = true;
> 	if (!global_reclaim(sc))
> 		force_scan = true;
>@@ -2244,8 +2244,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> 		if (global_reclaim(sc)) {
> 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> 				continue;
>-			if (zone->all_unreclaimable &&
>-					sc->priority != DEF_PRIORITY)
>+			if (sc->priority != DEF_PRIORITY &&
>+			    !zone_reclaimable(zone))
> 				continue;	/* Let kswapd poll it */
> 			if (IS_ENABLED(CONFIG_COMPACTION)) {
> 				/*
>@@ -2283,11 +2283,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> 	return aborted_reclaim;
> }
>
>-static bool zone_reclaimable(struct zone *zone)
>-{
>-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>-}
>-
> /* All zones in zonelist are unreclaimable? */
> static bool all_unreclaimable(struct zonelist *zonelist,
> 		struct scan_control *sc)
>@@ -2301,7 +2296,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> 			continue;
> 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> 			continue;
>-		if (!zone->all_unreclaimable)
>+		if (zone_reclaimable(zone))
> 			return false;
> 	}
>
>@@ -2712,7 +2707,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
> 		 * DEF_PRIORITY. Effectively, it considers them balanced so
> 		 * they must be considered balanced here as well!
> 		 */
>-		if (zone->all_unreclaimable) {
>+		if (!zone_reclaimable(zone)) {
> 			balanced_pages += zone->managed_pages;
> 			continue;
> 		}
>@@ -2773,7 +2768,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
> 			       unsigned long lru_pages,
> 			       unsigned long *nr_attempted)
> {
>-	unsigned long nr_slab;
> 	int testorder = sc->order;
> 	unsigned long balance_gap;
> 	struct reclaim_state *reclaim_state = current->reclaim_state;
>@@ -2818,15 +2812,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
> 	shrink_zone(zone, sc);
>
> 	reclaim_state->reclaimed_slab = 0;
>-	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>+	shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> 	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>
> 	/* Account for the number of pages attempted to reclaim */
> 	*nr_attempted += sc->nr_to_reclaim;
>
>-	if (nr_slab == 0 && !zone_reclaimable(zone))
>-		zone->all_unreclaimable = 1;
>-
> 	zone_clear_flag(zone, ZONE_WRITEBACK);
>
> 	/*
>@@ -2835,7 +2826,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
> 	 * BDIs but as pressure is relieved, speculatively avoid congestion
> 	 * waits.
> 	 */
>-	if (!zone->all_unreclaimable &&
>+	if (zone_reclaimable(zone) &&
> 	    zone_balanced(zone, testorder, 0, classzone_idx)) {
> 		zone_clear_flag(zone, ZONE_CONGESTED);
> 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
>@@ -2901,8 +2892,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> 			if (!populated_zone(zone))
> 				continue;
>
>-			if (zone->all_unreclaimable &&
>-			    sc.priority != DEF_PRIORITY)
>+			if (sc.priority != DEF_PRIORITY &&
>+			    !zone_reclaimable(zone))
> 				continue;
>
> 			/*
>@@ -2980,8 +2971,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> 			if (!populated_zone(zone))
> 				continue;
>
>-			if (zone->all_unreclaimable &&
>-			    sc.priority != DEF_PRIORITY)
>+			if (sc.priority != DEF_PRIORITY &&
>+			    !zone_reclaimable(zone))
> 				continue;
>
> 			sc.nr_scanned = 0;
>@@ -3265,20 +3256,6 @@ unsigned long global_reclaimable_pages(void)
> 	return nr;
> }
>
>-unsigned long zone_reclaimable_pages(struct zone *zone)
>-{
>-	int nr;
>-
>-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>-	     zone_page_state(zone, NR_INACTIVE_FILE);
>-
>-	if (get_nr_swap_pages() > 0)
>-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>-		      zone_page_state(zone, NR_INACTIVE_ANON);
>-
>-	return nr;
>-}
>-
> #ifdef CONFIG_HIBERNATION
> /*
>  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
>@@ -3576,7 +3553,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> 		return ZONE_RECLAIM_FULL;
>
>-	if (zone->all_unreclaimable)
>+	if (!zone_reclaimable(zone))
> 		return ZONE_RECLAIM_FULL;
>
> 	/*
>diff --git a/mm/vmstat.c b/mm/vmstat.c
>index 20c2ef4..c48f75b 100644
>--- a/mm/vmstat.c
>+++ b/mm/vmstat.c
>@@ -19,6 +19,7 @@
> #include <linux/math64.h>
> #include <linux/writeback.h>
> #include <linux/compaction.h>
>+#include <linux/mm_inline.h>
>
> #ifdef CONFIG_VM_EVENT_COUNTERS
> DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
>@@ -1052,7 +1053,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> 		   "\n  all_unreclaimable: %u"
> 		   "\n  start_pfn:         %lu"
> 		   "\n  inactive_ratio:    %u",
>-		   zone->all_unreclaimable,
>+		   !zone_reclaimable(zone),
> 		   zone->zone_start_pfn,
> 		   zone->inactive_ratio);
> 	seq_putc(m, '\n');
>--
>1.7.0.4
>
>
>Thanks!
>
>Best Regards
>Lisa Du

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-12  1:46           ` [resend] [PATCH V3] " Lisa Du
@ 2013-08-20 22:16             ` Andrew Morton
  2013-08-22  5:24               ` Lisa Du
  2013-08-27 19:43             ` Andrew Morton
  1 sibling, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2013-08-20 22:16 UTC (permalink / raw)
  To: Lisa Du
  Cc: Johannes Weiner, Michal Hocko, linux-mm, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman, Christoph Lameter, Bob Liu,
	Neil Zhang, Russell King - ARM Linux, Aaditya Kumar, yinghan,
	npiggin, riel, kamezawa.hiroyu

On Sun, 11 Aug 2013 18:46:08 -0700 Lisa Du <cldu@marvell.com> wrote:

> In this version:
> Reordered the check in pgdat_balanced() according to Johannes's comment.
> 
> >From 66a98566792b954e187dca251fbe3819aeb977b9 Mon Sep 17 00:00:00 2001
> From: Lisa Du <cldu@marvell.com>
> Date: Mon, 5 Aug 2013 09:26:57 +0800
> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
> 
> [...]
> 
> We can't turn on zone->all_unreclaimable from the direct reclaim path,
> because direct reclaim doesn't take any lock, so doing it that way
> would be racy.

I don't see that this is correct.  Page reclaim does racy things quite
often, in the knowledge that the effects of a race are recoverable and
small.

> [...]
> 
> @@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
>  	return lru;
>  }
>  
> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> +{
> +	int nr;
> +
> +	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> +	     zone_page_state(zone, NR_INACTIVE_FILE);
> +
> +	if (get_nr_swap_pages() > 0)
> +		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> +		      zone_page_state(zone, NR_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
> +static inline bool zone_reclaimable(struct zone *zone)
> +{
> +	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +}

Inlining is often wrong.  Uninlining just these two functions saves
several hundred bytes of text in mm/.  That's three of someone else's
cachelines which we didn't need to evict.

And what the heck is up with that magical "6"?  Why not "7"?  "42"?

At a minimum it needs extensive documentation which describes why "6"
is the optimum value for all machines and workloads (good luck with
that) and which describes the effects of altering this number and which
helps people understand why we didn't make it a runtime tunable.

I'll merge it for some testing (the lack of Tested-by's is conspicuous)
but I don't want to put that random "6" into Linux core MM in its
current state.
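
To make the heuristic concrete (an illustrative example, not from the
thread): a zone stops being treated as reclaimable once the cumulative
scan count reaches six times its reclaimable pages, i.e. every
reclaimable page has been looked at roughly six times without enough
progress. A standalone toy model of just that test:

	#include <stdbool.h>
	#include <stdio.h>

	/* Toy model of the thread's zone_reclaimable() heuristic. */
	static bool zone_reclaimable(unsigned long pages_scanned,
				     unsigned long reclaimable_pages)
	{
		return pages_scanned < reclaimable_pages * 6;
	}

	int main(void)
	{
		unsigned long reclaimable = 1000;

		printf("%d\n", zone_reclaimable(5999, reclaimable)); /* 1: keep trying */
		printf("%d\n", zone_reclaimable(6000, reclaimable)); /* 0: give up     */
		return 0;
	}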


From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-vmscan-fix-do_try_to_free_pages-livelock-fix

uninline zone_reclaimable_pages() and zone_reclaimable().

Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lisa Du <cldu@marvell.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Neil Zhang <zhangwm@marvell.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm_inline.h |   19 -------------------
 mm/internal.h             |    2 ++
 mm/page-writeback.c       |    2 ++
 mm/vmscan.c               |   19 +++++++++++++++++++
 mm/vmstat.c               |    2 ++
 5 files changed, 25 insertions(+), 19 deletions(-)

diff -puN include/linux/mm_inline.h~mm-vmscan-fix-do_try_to_free_pages-livelock-fix include/linux/mm_inline.h
--- a/include/linux/mm_inline.h~mm-vmscan-fix-do_try_to_free_pages-livelock-fix
+++ a/include/linux/mm_inline.h
@@ -100,23 +100,4 @@ static __always_inline enum lru_list pag
 	return lru;
 }
 
-static inline unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-	int nr;
-
-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
-	     zone_page_state(zone, NR_INACTIVE_FILE);
-
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON);
-
-	return nr;
-}
-
-static inline bool zone_reclaimable(struct zone *zone)
-{
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
-}
-
 #endif
diff -puN include/linux/mmzone.h~mm-vmscan-fix-do_try_to_free_pages-livelock-fix include/linux/mmzone.h
diff -puN include/linux/vmstat.h~mm-vmscan-fix-do_try_to_free_pages-livelock-fix include/linux/vmstat.h
diff -puN mm/page-writeback.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix mm/page-writeback.c
--- a/mm/page-writeback.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix
+++ a/mm/page-writeback.c
@@ -39,6 +39,8 @@
 #include <linux/mm_inline.h>
 #include <trace/events/writeback.h>
 
+#include "internal.h"
+
 /*
  * Sleep at most 200ms at a time in balance_dirty_pages().
  */
diff -puN mm/page_alloc.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix mm/page_alloc.c
diff -puN mm/vmscan.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix
+++ a/mm/vmscan.c
@@ -146,6 +146,25 @@ static bool global_reclaim(struct scan_c
 }
 #endif
 
+unsigned long zone_reclaimable_pages(struct zone *zone)
+{
+	int nr;
+
+	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
+	     zone_page_state(zone, NR_INACTIVE_FILE);
+
+	if (get_nr_swap_pages() > 0)
+		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
+		      zone_page_state(zone, NR_INACTIVE_ANON);
+
+	return nr;
+}
+
+bool zone_reclaimable(struct zone *zone)
+{
+	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+}
+
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
 	if (!mem_cgroup_disabled())
diff -puN mm/vmstat.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix mm/vmstat.c
--- a/mm/vmstat.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix
+++ a/mm/vmstat.c
@@ -21,6 +21,8 @@
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
 
+#include "internal.h"
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
 EXPORT_PER_CPU_SYMBOL(vm_event_states);
diff -puN mm/internal.h~mm-vmscan-fix-do_try_to_free_pages-livelock-fix mm/internal.h
--- a/mm/internal.h~mm-vmscan-fix-do_try_to_free_pages-livelock-fix
+++ a/mm/internal.h
@@ -85,6 +85,8 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
+extern bool zone_reclaimable(struct zone *zone);
 
 /*
  * in mm/rmap.c:
_


^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-20 22:16             ` Andrew Morton
@ 2013-08-22  5:24               ` Lisa Du
  2013-08-22  6:24                 ` Minchan Kim
  0 siblings, 1 reply; 51+ messages in thread
From: Lisa Du @ 2013-08-22  5:24 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim, KOSAKI Motohiro
  Cc: Johannes Weiner, Michal Hocko, linux-mm, Mel Gorman,
	Christoph Lameter, Bob Liu, Neil Zhang, Russell King - ARM Linux,
	Aaditya Kumar, yinghan, npiggin, riel, kamezawa.hiroyu,
	chunlingdu1

>-----Original Message-----
>From: Andrew Morton [mailto:akpm@linux-foundation.org]
>Sent: 21 August 2013 6:17
>To: Lisa Du
>Cc: Johannes Weiner; Michal Hocko; linux-mm@kvack.org; Minchan Kim; KOSAKI Motohiro; Mel Gorman; Christoph Lameter; Bob Liu;
>Neil Zhang; Russell King - ARM Linux; Aaditya Kumar; yinghan@google.com; npiggin@gmail.com; riel@redhat.com;
>kamezawa.hiroyu@jp.fujitsu.com
>Subject: Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
>
>On Sun, 11 Aug 2013 18:46:08 -0700 Lisa Du <cldu@marvell.com> wrote:
>
>> In this version:
>> Reordered the check in pgdat_balanced() according to Johannes's comment.
>>
>> >From 66a98566792b954e187dca251fbe3819aeb977b9 Mon Sep 17 00:00:00 2001
>> From: Lisa Du <cldu@marvell.com>
>> Date: Mon, 5 Aug 2013 09:26:57 +0800
>> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
>>
>> [...]
>>
>> We can't turn on zone->all_unreclaimable from the direct reclaim path,
>> because direct reclaim doesn't take any lock, so doing it that way
>> would be racy.
>
>I don't see that this is correct.  Page reclaim does racy things quite often, in the knowledge that the effects of a race are
>recoverable and small.
Maybe KOSAKI can give some comments; I think the main reason may be that direct reclaim doesn't take any lock.
>
>> [...]
>>
>> @@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
>>  	return lru;
>>  }
>>
>> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
>> +{
>> +	int nr;
>> +
>> +	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>> +	     zone_page_state(zone, NR_INACTIVE_FILE);
>> +
>> +	if (get_nr_swap_pages() > 0)
>> +		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>> +		      zone_page_state(zone, NR_INACTIVE_ANON);
>> +
>> +	return nr;
>> +}
>> +
>> +static inline bool zone_reclaimable(struct zone *zone)
>> +{
>> +	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> +}
>
>Inlining is often wrong.  Uninlining just these two functions saves several hundred bytes of text in mm/.  That's three of someone
>else's cachelines which we didn't need to evict.
Would you explain more about why "inline is often wrong"? Thanks a lot!
>
>And what the heck is up with that magical "6"?  Why not "7"?  "42"?
This magical number "6" was first defined in commit d1908362ae0.
Hi, Minchan, do you remember why we set this number? Thanks!
>
>At a minimum it needs extensive documentation which describes why "6"
>is the optimum value for all machines and workloads (good luck with
>that) and which describes the effects of altering this number and which helps people understand why we didn't make it a runtime
>tunable.
>
>I'll merge it for some testing (the lack of Tested-by's is conspicuous) but I don't want to put that random "6" into Linux core MM in
>its current state.
I tested on kernel v3.4; it works fine and solves the endless loop in the direct reclaim path, but I have not tested with the latest kernel version.
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-22  5:24               ` Lisa Du
@ 2013-08-22  6:24                 ` Minchan Kim
  2013-08-22  7:14                   ` Lisa Du
  0 siblings, 1 reply; 51+ messages in thread
From: Minchan Kim @ 2013-08-22  6:24 UTC (permalink / raw)
  To: Lisa Du
  Cc: Andrew Morton, KOSAKI Motohiro, Johannes Weiner, Michal Hocko,
	linux-mm, Mel Gorman, Christoph Lameter, Bob Liu, Neil Zhang,
	Russell King - ARM Linux, Aaditya Kumar, yinghan, npiggin, riel,
	kamezawa.hiroyu, chunlingdu1

Hello Lisa,

Please fix your mail client.

On Wed, Aug 21, 2013 at 10:24:07PM -0700, Lisa Du wrote:
> >-----Original Message-----
> >From: Andrew Morton [mailto:akpm@linux-foundation.org]
> >Sent: 21 August 2013 6:17
> >To: Lisa Du
> >Cc: Johannes Weiner; Michal Hocko; linux-mm@kvack.org; Minchan Kim; KOSAKI Motohiro; Mel Gorman; Christoph Lameter; Bob Liu;
> >Neil Zhang; Russell King - ARM Linux; Aaditya Kumar; yinghan@google.com; npiggin@gmail.com; riel@redhat.com;
> >kamezawa.hiroyu@jp.fujitsu.com
> >Subject: Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
> >
> >On Sun, 11 Aug 2013 18:46:08 -0700 Lisa Du <cldu@marvell.com> wrote:
> >
> >> In this version:
> >> Reordered the check in pgdat_balanced() according to Johannes's comment.
> >>
> >> >From 66a98566792b954e187dca251fbe3819aeb977b9 Mon Sep 17 00:00:00 2001
> >> From: Lisa Du <cldu@marvell.com>
> >> Date: Mon, 5 Aug 2013 09:26:57 +0800
> >> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
> >>
> >> [...]
> >>
> >> We can't turn on zone->all_unreclaimable from the direct reclaim path,
> >> because direct reclaim doesn't take any lock, so doing it that way
> >> would be racy.
> >
> >I don't see that this is correct.  Page reclaim does racy things quite often, in the knowledge that the effects of a race are
> >recoverable and small.
> Maybe KOSAKI can give some comments; I think the main reason may be that direct reclaim doesn't take any lock.

You have to write the problem out in detail.
It is not related to locking or races.
The problem is that kswapd can sleep in the highly-fragmented case
if it was woken up by a high-order allocation: kswapd doesn't want
to reclaim too many pages for a high-order allocation, so once it
has reclaimed enough for the high-order request it just rechecks
the order-0 watermarks.

         * Fragmentation may mean that the system cannot be rebalanced
         * for high-order allocations in all zones. If twice the
         * allocation size has been reclaimed and the zones are still
         * not balanced then recheck the watermarks at order-0 to
         * prevent kswapd reclaiming excessively. Assume that a
         * process requested a high-order can direct reclaim/compact.
         */
        if (order && sc.nr_reclaimed >= 2UL << order)
                order = sc.order = 0;

But direct reclaim cannot meet kswapd's expectation: although the
zone has lots of free pages, direct reclaim cannot reclaim anything
at all, because there is no swap device but lots of anon pages. So it
should stop scanning and kill someone, and that decision depends on
this logic:

	/* top priority shrink_zones still had more to do? don't OOM, then */
	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
		return 1;

In addition, all_unreclaimable() just checks zone->all_unreclaimable.
Then, who sets it? kswapd. And what is kswapd doing? Sleeping.
So do_try_to_free_pages() returns 1; in turn, the direct reclaimer
thinks there is progress and does not kill anything.

That is just one example from your log; we can see more potential
bugs, and I agree with Michal that "all_unreclaimable was just a bad
idea from the very beginning". It's very subtle and error-prone.

> >
> >> [...]
> >>
> >> @@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
> >>  	return lru;
> >>  }
> >>
> >> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> >> +{
> >> +	int nr;
> >> +
> >> +	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> >> +	     zone_page_state(zone, NR_INACTIVE_FILE);
> >> +
> >> +	if (get_nr_swap_pages() > 0)
> >> +		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> >> +		      zone_page_state(zone, NR_INACTIVE_ANON);
> >> +
> >> +	return nr;
> >> +}
> >> +
> >> +static inline bool zone_reclaimable(struct zone *zone)
> >> +{
> >> +	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> >> +}
> >
> >Inlining is often wrong.  Uninlining just these two functions saves several hundred bytes of text in mm/.  That's three of someone
> >else's cachelines which we didn't need to evict.
> Would you explain more about why "inline is often wrong"? Thanks a lot!

Andrew said it increases the binary size by several hundred bytes.

Reclaim is entirely a slow path in MM, so normally we don't buy such
a bad deal (code size vs. speed).
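
To illustrate the size argument (a sketch, not from the thread): a
non-trivial helper defined static inline in a shared header gets a
private copy of its body compiled into every object file that calls
it, while an out-of-line definition is emitted exactly once. This is
the shape of Andrew's -fix patch earlier in the thread:

	/* In a header: each of the N calling .c files carries a copy. */
	static inline bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}

	/* Out of line: declared once (mm/internal.h) ... */
	extern bool zone_reclaimable(struct zone *zone);

	/* ... defined once (mm/vmscan.c); callers emit a single call. */
	bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}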

> >
> >And what the heck is up with that magical "6"?  Why not "7"?  "42"?
> This magical number "6" was first defined in commit d1908362ae0.
> Hi, Minchan, do you remember why we set this number? Thanks!

Not me. It was introduced by [1] a long time ago, and back then it
was just 4. I don't know who picked the value the first time.
Anyway, I agree it's very heuristic, but I have no better idea:
some systems might want to delay the OOM kill as long as possible,
because killing even one process is critical for them, while a
system like Android wants the OOM kill to happen as soon as possible,
because it prefers killing background processes to system slowness.
So, IMHO, we would need a knob to tune it.

[1] 4ff1ffb4870b007, [PATCH] oom: reclaim_mapped on oom.
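
Such a knob might look like the sketch below; everything here,
including the vm.reclaimable_scan_factor name, is hypothetical,
invented for illustration, and not part of any posted patch:

	/* mm/vmscan.c (hypothetical): replace the hard-coded 6. */
	int sysctl_reclaimable_scan_factor __read_mostly = 6;

	bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned <
		       zone_reclaimable_pages(zone) *
		       sysctl_reclaimable_scan_factor;
	}

	/* kernel/sysctl.c (hypothetical ctl_table entry; assumes the
	 * usual "static int one = 1;" minimum used there): */
	{
		.procname	= "reclaimable_scan_factor",
		.data		= &sysctl_reclaimable_scan_factor,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &one,		/* floor of 1 */
	},

A small factor would fire the OOM killer sooner (the Android-style
preference); a large one would delay it.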


> >
> >At a minimum it needs extensive documentation which describes why "6"
> >is the optimum value for all machines and workloads (good luck with
> >that) and which describes the effects of altering this number and which helps people understand why we didn't make it a runtime
> >tunable.
> >
> >I'll merge it for some testing (the lack of Tested-by's is conspicuous) but I don't want to put that random "6" into Linux core MM in
> >its current state.
> I did the test in kernel v3.4, it works fine and solve the endless loop in direct reclaim path, but not test with latest kernel version.
> >

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-22  6:24                 ` Minchan Kim
@ 2013-08-22  7:14                   ` Lisa Du
  0 siblings, 0 replies; 51+ messages in thread
From: Lisa Du @ 2013-08-22  7:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Lisa Du, Andrew Morton, KOSAKI Motohiro, Johannes Weiner,
	Michal Hocko, linux-mm, Mel Gorman, Christoph Lameter, Bob Liu,
	Neil Zhang, Russell King - ARM Linux, Aaditya Kumar, yinghan,
	npiggin, riel, kamezawa.hiroyu

On Thu, Aug 22, 2013 at 2:24 PM, Minchan Kim <minchan@kernel.org> wrote:
>
> Hello Lisa,
>
> Please fix your mail client.

I apologize for this. I tried to fix it, but it seems there is still
an issue, so I have changed to Gmail.

If there is still a problem, please let me know, thanks!
>
>
> On Wed, Aug 21, 2013 at 10:24:07PM -0700, Lisa Du wrote:
> > >-----Original Message-----
> > >From: Andrew Morton [mailto:akpm@linux-foundation.org]
> > >Sent: 21 August 2013 6:17
> > >To: Lisa Du
> > >Cc: Johannes Weiner; Michal Hocko; linux-mm@kvack.org; Minchan Kim; KOSAKI Motohiro; Mel Gorman; Christoph Lameter; Bob Liu;
> > >Neil Zhang; Russell King - ARM Linux; Aaditya Kumar; yinghan@google.com; npiggin@gmail.com; riel@redhat.com;
> > >kamezawa.hiroyu@jp.fujitsu.com
> > >Subject: Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
> > >
> > >On Sun, 11 Aug 2013 18:46:08 -0700 Lisa Du <cldu@marvell.com> wrote:
> > >
> > >> In this version:
> > >> Reordered the check in pgdat_balanced() according to Johannes's comment.
> > >>
> > >> >From 66a98566792b954e187dca251fbe3819aeb977b9 Mon Sep 17 00:00:00 2001
> > >> From: Lisa Du <cldu@marvell.com>
> > >> Date: Mon, 5 Aug 2013 09:26:57 +0800
> > >> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
> > >>
> > >> [...]
> > >>
> > >> We can't turn on zone->all_unreclaimable from the direct reclaim path,
> > >> because direct reclaim doesn't take any lock, so doing it that way
> > >> would be racy.
> > >
> > >I don't see that this is correct.  Page reclaim does racy things quite often, in the knowledge that the effects of a race are
> > >recoverable and small.
> > Maybe KOSAKI can give some comments; I think the main reason may be that direct reclaim doesn't take any lock.
>
> You have to write the problem out in detail.
> It is not related to locking or races.
> The problem is that kswapd can sleep in the highly-fragmented case
> if it was woken up by a high-order allocation: kswapd doesn't want
> to reclaim too many pages for a high-order allocation, so once it
> has reclaimed enough for the high-order request it just rechecks
> the order-0 watermarks.
>
>          * Fragmentation may mean that the system cannot be rebalanced
>          * for high-order allocations in all zones. If twice the
>          * allocation size has been reclaimed and the zones are still
>          * not balanced then recheck the watermarks at order-0 to
>          * prevent kswapd reclaiming excessively. Assume that a
>          * process requested a high-order can direct reclaim/compact.
>          */
>         if (order && sc.nr_reclaimed >= 2UL << order)
>                 order = sc.order = 0;
>
> But direct reclaim cannot meet kswapd's expectation: although the
> zone has lots of free pages, direct reclaim cannot reclaim anything
> at all, because there is no swap device but lots of anon pages. So it
> should stop scanning and kill someone, and that decision depends on
> this logic:
>
>         /* top priority shrink_zones still had more to do? don't OOM, then */
>         if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
>                 return 1;
>
> In addition, all_unreclaimable() just checks zone->all_unreclaimable.
> Then, who sets it? kswapd. And what is kswapd doing? Sleeping.
> So do_try_to_free_pages() returns 1; in turn, the direct reclaimer
> thinks there is progress and does not kill anything.

Thanks for the detailed description of this issue! That is exactly
the issue.

I thought Andrew was asking about the statement "We can't turn on
zone->all_unreclaimable from the direct reclaim path, because direct
reclaim doesn't take any lock, so doing it that way would be racy."

Actually, I'm not very clear on why turning on zone->all_unreclaimable
in the direct reclaim path is not good; I just copied this statement
from KOSAKI's original patch.
>
>
> It's just an example from your log and we can see more potential bugs
> and I agree with Michal that "all_unreclaimable was just a bad idea
> from the very beginning". It's very subtle and error-prone.
>
> > >
> > >> [...]
> > >>
> > >> @@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
> > >>    return lru;
> > >>  }
> > >>
> > >> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> > >> +{
> > >> +  int nr;
> > >> +
> > >> +  nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> > >> +       zone_page_state(zone, NR_INACTIVE_FILE);
> > >> +
> > >> +  if (get_nr_swap_pages() > 0)
> > >> +          nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> > >> +                zone_page_state(zone, NR_INACTIVE_ANON);
> > >> +
> > >> +  return nr;
> > >> +}
> > >> +
> > >> +static inline bool zone_reclaimable(struct zone *zone)
> > >> +{
> > >> +  return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> > >> +}
> > >
> > >Inlining is often wrong.  Uninlining just these two functions saves several hundred bytes of text in mm/.  That's three of someone
> > >else's cachelines which we didn't need to evict.
> > Would you explain more about why "inline is often wrong"? Thanks a lot!
>
> Andrew said it increases the binary size by several hundred bytes.
>
> Reclaim is entirely a slow path in MM, so normally we don't buy such
> a bad deal (code size vs. speed).
>
> > >
> > >And what the heck is up with that magical "6"?  Why not "7"?  "42"?
> > This magical number "6" was first defined in commit d1908362ae0.
> > Hi, Minchan, do you remember why we set this number? Thanks!
>
> Not me. It was introduced by [1] a long time ago, and back then it
> was just 4. I don't know who picked the value the first time.
> Anyway, I agree it's very heuristic, but I have no better idea:
> some systems might want to delay the OOM kill as long as possible,
> because killing even one process is critical for them, while a
> system like Android wants the OOM kill to happen as soon as possible,
> because it prefers killing background processes to system slowness.
> So, IMHO, we would need a knob to tune it.
>
> [1] 4ff1ffb4870b007, [PATCH] oom: reclaim_mapped on oom.

Thanks for the explanation!
>
>
>
> > >
> > >At a minimum it needs extensive documentation which describes why "6"
> > >is the optimum value for all machines and workloads (good luck with
> > >that) and which describes the effects of altering this number and which helps people understand why we didn't make it a runtime
> > >tunable.
> > >
> > >I'll merge it for some testing (the lack of Tested-by's is conspicuous) but I don't want to put that random "6" into Linux core MM in
> > >its current state.
> > I tested on kernel v3.4; it works fine and solves the endless loop in the direct reclaim path, but I have not tested with the latest kernel version.
> > >
>
> --
> Kind regards,
> Minchan Kim


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-12  1:46           ` [resend] [PATCH V3] " Lisa Du
  2013-08-20 22:16             ` Andrew Morton
@ 2013-08-27 19:43             ` Andrew Morton
  2013-08-28  1:58               ` Lisa Du
  1 sibling, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2013-08-27 19:43 UTC (permalink / raw)
  To: Lisa Du
  Cc: Johannes Weiner, Michal Hocko, linux-mm, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman, Christoph Lameter, Bob Liu,
	Neil Zhang, Russell King - ARM Linux, Aaditya Kumar, yinghan,
	npiggin, riel, kamezawa.hiroyu

On Sun, 11 Aug 2013 18:46:08 -0700 Lisa Du <cldu@marvell.com> wrote:

> [...]

I did this to fix the build:

--- a/mm/migrate.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix-2
+++ a/mm/migrate.c
@@ -1471,7 +1471,7 @@ static bool migrate_balanced_pgdat(struc
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone->all_unreclaimable)
+		if (!zone_reclaimable(zone))
 			continue;
 
 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */

Could you please review and runtime-test it?


^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
  2013-08-27 19:43             ` Andrew Morton
@ 2013-08-28  1:58               ` Lisa Du
  0 siblings, 0 replies; 51+ messages in thread
From: Lisa Du @ 2013-08-28  1:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman, Christoph Lameter, Bob Liu,
	Neil Zhang, Russell King - ARM Linux, Aaditya Kumar, yinghan,
	npiggin, riel, kamezawa.hiroyu

>-----Original Message-----
>From: Andrew Morton [mailto:akpm@linux-foundation.org]
>Sent: 28 August 2013 3:43
>To: Lisa Du
>Cc: Johannes Weiner; Michal Hocko; linux-mm@kvack.org; Minchan Kim; KOSAKI Motohiro; Mel Gorman; Christoph Lameter; Bob Liu;
>Neil Zhang; Russell King - ARM Linux; Aaditya Kumar; yinghan@google.com; npiggin@gmail.com; riel@redhat.com;
>kamezawa.hiroyu@jp.fujitsu.com
>Subject: Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
>
>On Sun, 11 Aug 2013 18:46:08 -0700 Lisa Du <cldu@marvell.com> wrote:
>
>> [...]
>
>I did this to fix the build:
>
>--- a/mm/migrate.c~mm-vmscan-fix-do_try_to_free_pages-livelock-fix-2
>+++ a/mm/migrate.c
>@@ -1471,7 +1471,7 @@ static bool migrate_balanced_pgdat(struc
> 		if (!populated_zone(zone))
> 			continue;
>
>-		if (zone->all_unreclaimable)
>+		if (!zone_reclaimable(zone))
> 			continue;
>
> 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
>
>Please review and runtime test it?
This looks reasonable. I'm sorry, but I only have the v3.4 environment,
and v3.4 doesn't have this function.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-19 22:17             ` KOSAKI Motohiro
@ 2012-06-20  6:18               ` Minchan Kim
  -1 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2012-06-20  6:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Aaditya Kumar, linux-kernel, linux-mm, akpm, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Minchan Kim, frank.rowand, tim.bird, takuzo.ohara, kan.iibuchi

On 06/20/2012 07:17 AM, KOSAKI Motohiro wrote:

> (6/17/12 8:43 PM), Minchan Kim wrote:
>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>
>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@kernel.org> wrote:
>>>
>>>>>
>>>>> pgdat_balanced() doesn't recognize zones. Therefore kswapd may sleep
>>>>> if a node has multiple zones. Hm, OK, I realize my description was
>>>>> slightly misleading. Priority 0 is not needed. balance_pgdat() calls
>>>>> pgdat_balanced() every priority. The easiest case is: the movable zone
>>>>> has a lot of free pages and the normal zone has no reclaimable page.
>>>>>
>>>>> BTW, the current pgdat_balanced() logic seems incorrect. kswapd should
>>>>> sleep only if every zone has more free pages than the high watermark
>>>>> _and_ 25% of the present pages in the node are free.
>>>>>
>>>>
>>>>
>>>> Sorry, I can't understand your point.
>>>> The current kswapd doesn't sleep if the relevant zones don't have free pages above the high watermark.
>>>> It seems I am missing your point.
>>>> Please, anybody, correct me.
>>>
>>> Since direct reclaim is currently given up based on the
>>> zone->all_unreclaimable flag, consider, for example, one scenario:
>>>
>>> Let's say the system has one node with two zones (NORMAL and MOVABLE) and we
>>> hot-remove all the pages of the MOVABLE zone.
>>>
>>> While migrating pages during memory hot-unplugging, the allocation function
>>> (for the new page to which a page in the MOVABLE zone would be moved) can end up
>>> looping in the direct reclaim path forever.
>>>
>>> This is so because when most of the pages in the MOVABLE zone have
>>> been migrated, the zone now contains lots of free memory (basically
>>> above the low watermark), BUT all of it is on the MIGRATE_ISOLATE list
>>> of the buddy allocator.
>>>
>>> So kswapd would not balance this zone, as free pages are above the low
>>> watermark (but all of them are on the isolate list). So the
>>> zone->all_unreclaimable flag would never be set for this zone, and the
>>> allocation function would end up looping forever (assuming the NORMAL
>>> zone is left with no reclaimable memory).
>>>
>>
>>
>> Thanks a lot, Aaditya! The scenario you mentioned makes perfect sense.
>> But I don't see it as a problem of kswapd.
>>
>> a5d76b54 made the new migration type 'MIGRATE_ISOLATE', which is a very ironic
>> type, because there are many free pages on the free list but we can't allocate
>> them. :(
>> It doesn't keep NR_FREE_PAGES accurate, while many places in the kernel use
>> NR_FREE_PAGES to trigger some operation. kswapd is just one of the users it
>> confuses.
>> As the right fix for this problem, we should fix the hotplug code, which IMHO
>> can fix CMA, too.
>>
>> This patch could create an inconsistency between NR_FREE_PAGES and
>> SumOf[free_area[order].nr_free], and it could confuse __zone_watermark_ok, so
>> we might need to fix move_freepages_block itself to reflect
>> free_area[order].nr_free exactly.
>>
>> Any thoughts?
>>
>> Side note: I still need KOSAKI's patch with a fixed description regardless of
>> this problem, because having only kswapd set zone->all_unreclaimable is very
>> fragile.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 4403009..19de56c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>>  
>>  out:
>>         if (!ret) {
>> +               int pages_moved;
>>                 set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>> -               move_freepages_block(zone, page, MIGRATE_ISOLATE);
>> +               pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
>> +               __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
>>         }   
>>  
>>         spin_unlock_irqrestore(&zone->lock, flags);
>> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>>  {
>>         struct zone *zone;
>>         unsigned long flags;
>> +       int pages_moved;
>>         zone = page_zone(page);
>>         spin_lock_irqsave(&zone->lock, flags);
>>         if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>>                 goto out;
>>         set_pageblock_migratetype(page, migratetype);
>> -       move_freepages_block(zone, page, migratetype);
>> +       pages_moved = move_freepages_block(zone, page, migratetype);
>> +       __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
>>  out:
>>         spin_unlock_irqrestore(&zone->lock, flags);
>>  }
> 
> Unfortunately, this doesn't work. There are two reasons. 1) When memory
> hotplug occurs, we have two scenarios: a) a free page -> the pageblock
> changes into isolate; b) the pageblock changes into isolate -> a free page.
> The above patch only cares about scenario (a). Thus it leads to a confusing
> NR_FREE_PAGES value. _If_ we put a new branch in the free-page hotpath, we
> can solve scenario (b), but I don't like it, because zero hotpath overhead
> is one of the memory hotplug design principles. 2) Even if we can solve the
> above issue,


Yep, Aaditya already pointed that out,
and I just sent another patch.
Let's talk about this problem in another thread, because it's not a direct/background reclaim problem:
http://lkml.org/lkml/2012/6/20/30


> the all_unreclaimable logic is still broken, because __alloc_pages_slowpath()
> wakes up kswapd only once and doesn't wake it up on the "goto rebalance" path.
> Moreover, wake_all_kswapd() is racy, with no guarantee of actually waking
> kswapd. This means direct reclaim should work fine without background reclaim.
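
(To make the wakeup gap concrete, a heavily condensed sketch of the
v3.4-era __alloc_pages_slowpath() control flow; only the placement of the
wakeup relative to the retry label matters here:)

	/* __alloc_pages_slowpath(), condensed: */
	wake_all_kswapd(order, zonelist, high_zoneidx, zone_idx(preferred_zone));
rebalance:
	page = get_page_from_freelist(gfp_mask, nodemask, order, ...);
	if (page)
		goto got_pg;
	/* ... direct compaction, direct reclaim, OOM checks ... */
	if (should_alloc_retry(gfp_mask, order, did_some_progress, pages_reclaimed))
		goto rebalance;		/* retries never wake kswapd again */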


We can fix that easily in the direct reclaim path, but I think your approach
still makes sense, because the current scheme of setting zone->all_unreclaimable
is very fragile against livelock.
So if you send your patch again with a rewritten description, I have no objection.

Thanks.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-18  0:43           ` Minchan Kim
@ 2012-06-19 22:17             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 51+ messages in thread
From: KOSAKI Motohiro @ 2012-06-19 22:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Aaditya Kumar, KOSAKI Motohiro, linux-kernel, linux-mm, akpm,
	Nick Piggin, Michal Hocko, Johannes Weiner, Mel Gorman,
	KAMEZAWA Hiroyuki, Minchan Kim, frank.rowand, tim.bird,
	takuzo.ohara, kan.iibuchi

(6/17/12 8:43 PM), Minchan Kim wrote:
> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
> 
>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@kernel.org> wrote:
>>
>>>>
>>>> pgdat_balanced() doesn't recognize zones. Therefore kswapd may sleep
>>>> if a node has multiple zones. Hm, OK, I realize my description was
>>>> slightly misleading. Priority 0 is not needed. balance_pgdat() calls
>>>> pgdat_balanced() every priority. The easiest case is: the movable zone
>>>> has a lot of free pages and the normal zone has no reclaimable page.
>>>>
>>>> BTW, the current pgdat_balanced() logic seems incorrect. kswapd should
>>>> sleep only if every zone has more free pages than the high watermark
>>>> _and_ 25% of the present pages in the node are free.
>>>>
>>>
>>>
>>> Sorry, I can't understand your point.
>>> The current kswapd doesn't sleep if the relevant zones don't have free pages above the high watermark.
>>> It seems I am missing your point.
>>> Please, anybody, correct me.
>>
>> Since direct reclaim is currently given up based on the
>> zone->all_unreclaimable flag, consider, for example, one scenario:
>>
>> Let's say the system has one node with two zones (NORMAL and MOVABLE) and we
>> hot-remove all the pages of the MOVABLE zone.
>>
>> While migrating pages during memory hot-unplugging, the allocation function
>> (for the new page to which a page in the MOVABLE zone would be moved) can end up
>> looping in the direct reclaim path forever.
>>
>> This is so because when most of the pages in the MOVABLE zone have
>> been migrated, the zone now contains lots of free memory (basically
>> above the low watermark), BUT all of it is on the MIGRATE_ISOLATE list
>> of the buddy allocator.
>>
>> So kswapd would not balance this zone, as free pages are above the low
>> watermark (but all of them are on the isolate list). So the
>> zone->all_unreclaimable flag would never be set for this zone, and the
>> allocation function would end up looping forever (assuming the NORMAL
>> zone is left with no reclaimable memory).
>>
> 
> 
> Thanks a lot, Aaditya! The scenario you mentioned makes perfect sense.
> But I don't see it as a problem of kswapd.
>
> a5d76b54 made the new migration type 'MIGRATE_ISOLATE', which is a very ironic
> type, because there are many free pages on the free list but we can't allocate
> them. :(
> It doesn't keep NR_FREE_PAGES accurate, while many places in the kernel use
> NR_FREE_PAGES to trigger some operation. kswapd is just one of the users it
> confuses.
> As the right fix for this problem, we should fix the hotplug code, which IMHO
> can fix CMA, too.
>
> This patch could create an inconsistency between NR_FREE_PAGES and
> SumOf[free_area[order].nr_free], and it could confuse __zone_watermark_ok, so
> we might need to fix move_freepages_block itself to reflect
> free_area[order].nr_free exactly.
>
> Any thoughts?
>
> Side note: I still need KOSAKI's patch with a fixed description regardless of
> this problem, because having only kswapd set zone->all_unreclaimable is very
> fragile.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..19de56c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>  
>  out:
>         if (!ret) {
> +               int pages_moved;
>                 set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> -               move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +               pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +               __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
>         }   
>  
>         spin_unlock_irqrestore(&zone->lock, flags);
> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>  {
>         struct zone *zone;
>         unsigned long flags;
> +       int pages_moved;
>         zone = page_zone(page);
>         spin_lock_irqsave(&zone->lock, flags);
>         if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>                 goto out;
>         set_pageblock_migratetype(page, migratetype);
> -       move_freepages_block(zone, page, migratetype);
> +       pages_moved = move_freepages_block(zone, page, migratetype);
> +       __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
>  out:
>         spin_unlock_irqrestore(&zone->lock, flags);
>  }

Unfortunately, this doesn't work. There are two reasons. 1) When memory hotplug
occurs, we have two scenarios: a) a free page -> the pageblock changes into
isolate; b) the pageblock changes into isolate -> a free page. The above patch
only cares about scenario (a). Thus it leads to a confusing NR_FREE_PAGES value.
_If_ we put a new branch in the free-page hotpath, we can solve scenario (b),
but I don't like it, because zero hotpath overhead is one of the memory hotplug
design principles. 2) Even if we can solve the above issue, the all_unreclaimable
logic is still broken, because __alloc_pages_slowpath() wakes up kswapd only once
and doesn't wake it up on the "goto rebalance" path. Moreover, wake_all_kswapd()
is racy, with no guarantee of actually waking kswapd. This means direct reclaim
should work fine without background reclaim.
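
To make scenario (b) concrete: the free hotpath bumps NR_FREE_PAGES
unconditionally, so a page freed after its pageblock was isolated is counted
as free even though it lands on the MIGRATE_ISOLATE free list. A sketch,
simplified from the v3.4-era free path in mm/page_alloc.c:

	static void free_one_page(struct zone *zone, struct page *page,
				  int order, int migratetype)
	{
		spin_lock(&zone->lock);
		__free_one_page(page, zone, order, migratetype);
		/*
		 * Bumped regardless of migratetype, including
		 * MIGRATE_ISOLATE -- so a one-shot adjustment made at
		 * isolation time cannot account for pages freed later.
		 */
		__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
		spin_unlock(&zone->lock);
	}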



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-15 12:31         ` Hillf Danton
@ 2012-06-19 21:17           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 51+ messages in thread
From: KOSAKI Motohiro @ 2012-06-19 21:17 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Minchan Kim, KOSAKI Motohiro, linux-kernel, linux-mm, akpm,
	Nick Piggin, Michal Hocko, Johannes Weiner, Mel Gorman,
	KAMEZAWA Hiroyuki

> Who left the comment about unreclaimable zones there, and why?
> 		/*
> 		 * balance_pgdat() skips over all_unreclaimable after
> 		 * DEF_PRIORITY. Effectively, it considers them balanced so
> 		 * they must be considered balanced here as well if kswapd
> 		 * is to sleep
> 		 */

Thank you for finding this! I'll fix it.



> BTW, are you still using prefetch_prev_lru_page?

No, we should kill it. ;-)


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-18  0:43           ` Minchan Kim
@ 2012-06-19 13:18             ` Aaditya Kumar
  -1 siblings, 0 replies; 51+ messages in thread
From: Aaditya Kumar @ 2012-06-19 13:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Minchan Kim, frank.rowand, tim.bird, takuzo.ohara, kan.iibuchi

On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan@kernel.org> wrote:
> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>
>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@kernel.org> wrote:
>>
>>>>
>>>> pgdat_balanced() doesn't recognize zones. Therefore kswapd may sleep
>>>> if a node has multiple zones. Hm, OK, I realize my description was
>>>> slightly misleading. Priority 0 is not needed. balance_pgdat() calls
>>>> pgdat_balanced() every priority. The easiest case is: the movable zone
>>>> has a lot of free pages and the normal zone has no reclaimable page.
>>>>
>>>> BTW, the current pgdat_balanced() logic seems incorrect. kswapd should
>>>> sleep only if every zone has more free pages than the high watermark
>>>> _and_ 25% of the present pages in the node are free.
>>>>
>>>
>>>
>>> Sorry, I can't understand your point.
>>> The current kswapd doesn't sleep if the relevant zones don't have free pages above the high watermark.
>>> It seems I am missing your point.
>>> Please, anybody, correct me.
>>
>> Since direct reclaim is currently given up based on the
>> zone->all_unreclaimable flag, consider, for example, one scenario:
>>
>> Let's say the system has one node with two zones (NORMAL and MOVABLE) and we
>> hot-remove all the pages of the MOVABLE zone.
>>
>> While migrating pages during memory hot-unplugging, the allocation function
>> (for the new page to which a page in the MOVABLE zone would be moved) can end up
>> looping in the direct reclaim path forever.
>>
>> This is so because when most of the pages in the MOVABLE zone have
>> been migrated, the zone now contains lots of free memory (basically
>> above the low watermark), BUT all of it is on the MIGRATE_ISOLATE list
>> of the buddy allocator.
>>
>> So kswapd would not balance this zone, as free pages are above the low
>> watermark (but all of them are on the isolate list). So the
>> zone->all_unreclaimable flag would never be set for this zone, and the
>> allocation function would end up looping forever (assuming the NORMAL
>> zone is left with no reclaimable memory).
>>
>
>
> Thanks a lot, Aaditya! The scenario you mentioned makes perfect sense.
> But I don't see it as a problem of kswapd.

Hi Kim,

Yes I agree it is not a problem of kswapd.

> a5d76b54 made the new migration type 'MIGRATE_ISOLATE', which is a very ironic
> type, because there are many free pages on the free list but we can't allocate
> them. :(
> It doesn't keep NR_FREE_PAGES accurate, while many places in the kernel use
> NR_FREE_PAGES to trigger some operation. kswapd is just one of the users it
> confuses.
> As the right fix for this problem, we should fix the hotplug code, which IMHO
> can fix CMA, too.
>
> This patch could create an inconsistency between NR_FREE_PAGES and
> SumOf[free_area[order].nr_free]


I assume that by the inconsistency you mention above, you mean a
temporary inconsistency.

Sorry, but IMHO, as far as memory hotplug is concerned, the main issue with
this patch is that the inconsistency you mentioned above would NOT be a
temporary inconsistency.

Every time, say, 'x' page frames are offlined, they will introduce a
difference of 'x' pages between
NR_FREE_PAGES and SumOf[free_area[order].nr_free].
(So, for example, frequent offline/online cycles will make
NR_FREE_PAGES negative.)

This is so because unset_migratetype_isolate() is called from the offlining
code (to set the migrate type of offlined pages back to MIGRATE_MOVABLE)
after the pages have been offlined and removed from the buddy list.
Since the pages for which unset_migratetype_isolate() is called are not
buddy pages, move_freepages_block() does not move any page, thus
introducing a permanent inconsistency.
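
A condensed sketch of the ordering described above (names approximate,
simplified from the v3.4-era __offline_pages() flow in mm/memory_hotplug.c):

	/* __offline_pages(), heavily condensed: */
	start_isolate_page_range(start_pfn, end_pfn);
	/* -> set_migratetype_isolate(): with the patch above,
	 *    NR_FREE_PAGES -= pages_moved */

	/* ... migrate every movable page out of the range ... */

	offline_isolated_pages(start_pfn, end_pfn);
	/* the isolated free pages leave the buddy lists here */

	undo_isolate_page_range(start_pfn, end_pfn);
	/* -> unset_migratetype_isolate(): move_freepages_block() now
	 *    finds no buddy pages and moves 0, so the earlier
	 *    subtraction is never given back: a permanent 'x' skew */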

> and it could confuse __zone_watermark_ok, so we might need to fix
> move_freepages_block itself to reflect free_area[order].nr_free exactly.
>
> Any thoughts?

As for fixing move_freepages_block(): at least for memory hotplug, the
pages stay on the MIGRATE_ISOLATE list only for the duration of the
offline_pages() function, I mean only temporarily. Since fixing
move_freepages_block() will introduce some overhead, I am not very sure
whether that overhead is justified for a temporary condition. What do
you think?


> Side note: I still need KOSAKI's patch with a fixed description regardless of this problem, because having only kswapd set zone->all_unreclaimable is very fragile.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..19de56c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>
>  out:
>        if (!ret) {
> +               int pages_moved;
>                set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> -               move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +               pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +               __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
>        }
>
>        spin_unlock_irqrestore(&zone->lock, flags);
> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>  {
>        struct zone *zone;
>        unsigned long flags;
> +       int pages_moved;
>        zone = page_zone(page);
>        spin_lock_irqsave(&zone->lock, flags);
>        if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>                goto out;
>        set_pageblock_migratetype(page, migratetype);
> -       move_freepages_block(zone, page, migratetype);
> +       pages_moved = move_freepages_block(zone, page, migratetype);
> +       __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
>  out:
>        spin_unlock_irqrestore(&zone->lock, flags);
>  }
>
>
>>
>> Regards,
>> Aaditya Kumar
>> Sony India Software Centre,
>> Bangalore.
>>
>
>
>
> --
> Kind regards,
> Minchan Kim

Regards,
Aaditya Kumar
Sony India Software Centre,
Bangalore.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-18  0:43           ` Minchan Kim
@ 2012-06-18  0:52             ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 51+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-18  0:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Aaditya Kumar, KOSAKI Motohiro, linux-kernel, linux-mm, akpm,
	Nick Piggin, Michal Hocko, Johannes Weiner, Mel Gorman,
	Minchan Kim, frank.rowand, tim.bird, takuzo.ohara, kan.iibuchi,
	Yasuaki ISIMATU

(2012/06/18 9:43), Minchan Kim wrote:
> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>
>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim<minchan@kernel.org>  wrote:
>>
>>>>
>>>> pgdat_balanced() doesn't recognize zones. Therefore kswapd may sleep
>>>> if a node has multiple zones. Hm, OK, I realize my description was
>>>> slightly misleading. Priority 0 is not needed. balance_pgdat() calls
>>>> pgdat_balanced() every priority. The easiest case is: the movable zone
>>>> has a lot of free pages and the normal zone has no reclaimable page.
>>>>
>>>> BTW, the current pgdat_balanced() logic seems incorrect. kswapd should
>>>> sleep only if every zone has more free pages than the high watermark
>>>> _and_ 25% of the present pages in the node are free.
>>>>
>>>
>>>
>>> Sorry, I can't understand your point.
>>> The current kswapd doesn't sleep if the relevant zones don't have free pages above the high watermark.
>>> It seems I am missing your point.
>>> Please, anybody, correct me.
>>
>> Since direct reclaim is currently given up based on the
>> zone->all_unreclaimable flag, consider, for example, one scenario:
>>
>> Let's say the system has one node with two zones (NORMAL and MOVABLE) and we
>> hot-remove all the pages of the MOVABLE zone.
>>
>> While migrating pages during memory hot-unplugging, the allocation function
>> (for the new page to which a page in the MOVABLE zone would be moved) can end up
>> looping in the direct reclaim path forever.
>>
>> This is so because when most of the pages in the MOVABLE zone have
>> been migrated, the zone now contains lots of free memory (basically
>> above the low watermark), BUT all of it is on the MIGRATE_ISOLATE list
>> of the buddy allocator.
>>
>> So kswapd would not balance this zone, as free pages are above the low
>> watermark (but all of them are on the isolate list). So the
>> zone->all_unreclaimable flag would never be set for this zone, and the
>> allocation function would end up looping forever (assuming the NORMAL
>> zone is left with no reclaimable memory).
>>
>
>
> Thanks a lot, Aaditya! The scenario you mentioned makes perfect sense.
> But I don't see it as a problem of kswapd.
>
> a5d76b54 made the new migration type 'MIGRATE_ISOLATE', which is a very ironic
> type, because there are many free pages on the free list but we can't allocate
> them. :(
> It doesn't keep NR_FREE_PAGES accurate, while many places in the kernel use
> NR_FREE_PAGES to trigger some operation. kswapd is just one of the users it
> confuses.
> As the right fix for this problem, we should fix the hotplug code, which IMHO
> can fix CMA, too.
>
> This patch could create an inconsistency between NR_FREE_PAGES and
> SumOf[free_area[order].nr_free], and it could confuse __zone_watermark_ok, so
> we might need to fix move_freepages_block itself to reflect
> free_area[order].nr_free exactly.
>
> Any thoughts?
>
> Side note: I still need KOSAKI's patch with a fixed description regardless of
> this problem, because having only kswapd set zone->all_unreclaimable is very
> fragile.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..19de56c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>
>   out:
>          if (!ret) {
> +               int pages_moved;
>                  set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> -               move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +               pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +               __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
>          }
>
>          spin_unlock_irqrestore(&zone->lock, flags);
> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>   {
>          struct zone *zone;
>          unsigned long flags;
> +       int pages_moved;
>          zone = page_zone(page);
>          spin_lock_irqsave(&zone->lock, flags);
>          if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>                  goto out;
>          set_pageblock_migratetype(page, migratetype);
> -       move_freepages_block(zone, page, migratetype);
> +       pages_moved = move_freepages_block(zone, page, migratetype);
> +       __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
>   out:
>          spin_unlock_irqrestore(&zone->lock, flags);
>   }
>


I think this patch is very good.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-16 17:48         ` Aaditya Kumar
@ 2012-06-18  0:43           ` Minchan Kim
  -1 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2012-06-18  0:43 UTC (permalink / raw)
  To: Aaditya Kumar
  Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Minchan Kim, frank.rowand, tim.bird, takuzo.ohara, kan.iibuchi

On 06/17/2012 02:48 AM, Aaditya Kumar wrote:

> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> 
>>>
>>> pgdat_balanced() doesn't recognize zones. Therefore kswapd may sleep
>>> if a node has multiple zones. Hm, OK, I realize my description was
>>> slightly misleading. Priority 0 is not needed. balance_pgdat() calls
>>> pgdat_balanced() every priority. The easiest case is: the movable zone
>>> has a lot of free pages and the normal zone has no reclaimable page.
>>>
>>> BTW, the current pgdat_balanced() logic seems incorrect. kswapd should
>>> sleep only if every zone has more free pages than the high watermark
>>> _and_ 25% of the present pages in the node are free.
>>>
>>
>>
>> Sorry, I can't understand your point.
>> The current kswapd doesn't sleep if the relevant zones don't have free pages above the high watermark.
>> It seems I am missing your point.
>> Please, anybody, correct me.
> 
> Since direct reclaim is currently given up based on the
> zone->all_unreclaimable flag, consider, for example, one scenario:
>
> Let's say the system has one node with two zones (NORMAL and MOVABLE) and we
> hot-remove all the pages of the MOVABLE zone.
>
> While migrating pages during memory hot-unplugging, the allocation function
> (for the new page to which a page in the MOVABLE zone would be moved) can end up
> looping in the direct reclaim path forever.
>
> This is so because when most of the pages in the MOVABLE zone have
> been migrated, the zone now contains lots of free memory (basically
> above the low watermark), BUT all of it is on the MIGRATE_ISOLATE list
> of the buddy allocator.
>
> So kswapd would not balance this zone, as free pages are above the low
> watermark (but all of them are on the isolate list). So the
> zone->all_unreclaimable flag would never be set for this zone, and the
> allocation function would end up looping forever (assuming the NORMAL
> zone is left with no reclaimable memory).
>


Thanks a lot, Aaditya! The scenario you mentioned makes perfect sense.
But I don't see it as a problem of kswapd.

a5d76b54 made the new migration type 'MIGRATE_ISOLATE', which is a very ironic
type, because there are many free pages on the free list but we can't allocate
them. :(
It doesn't keep NR_FREE_PAGES accurate, while many places in the kernel use
NR_FREE_PAGES to trigger some operation. kswapd is just one of the users it
confuses.
As the right fix for this problem, we should fix the hotplug code, which IMHO
can fix CMA, too.

This patch could create an inconsistency between NR_FREE_PAGES and
SumOf[free_area[order].nr_free], and it could confuse __zone_watermark_ok, so
we might need to fix move_freepages_block itself to reflect
free_area[order].nr_free exactly.

Any thoughts?

Side note: I still need KOSAKI's patch with a fixed description regardless of
this problem, because having only kswapd set zone->all_unreclaimable is very
fragile.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..19de56c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
 
 out:
        if (!ret) {
+               int pages_moved;
                set_pageblock_migratetype(page, MIGRATE_ISOLATE);
-               move_freepages_block(zone, page, MIGRATE_ISOLATE);
+               pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
+               __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
        }   
 
        spin_unlock_irqrestore(&zone->lock, flags);
@@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 {
        struct zone *zone;
        unsigned long flags;
+       int pages_moved;
        zone = page_zone(page);
        spin_lock_irqsave(&zone->lock, flags);
        if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
                goto out;
        set_pageblock_migratetype(page, migratetype);
-       move_freepages_block(zone, page, migratetype);
+       pages_moved = move_freepages_block(zone, page, migratetype);
+       __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
 out:
        spin_unlock_irqrestore(&zone->lock, flags);
 }
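
To see why the skew confuses the watermark check: __zone_watermark_ok()
mixes the NR_FREE_PAGES counter (passed in as free_pages) with the
per-order free_area[].nr_free counts. A condensed sketch of the relevant
loop from that era's mm/page_alloc.c:

	/*
	 * free_pages comes from zone_page_state(z, NR_FREE_PAGES), which
	 * the patch above decrements at isolation time.  free_area[o].nr_free
	 * still counts the isolated pages, so the subtraction below can
	 * underestimate what is actually usable at this order.
	 */
	for (o = 0; o < order; o++) {
		free_pages -= z->free_area[o].nr_free << o;
		min >>= 1;
		if (free_pages <= min)
			return false;
	}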


> 
> Regards,
> Aaditya Kumar
> Sony India Software Centre,
> Bangalore.
> 
> 



-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
@ 2012-06-18  0:43           ` Minchan Kim
  0 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2012-06-18  0:43 UTC (permalink / raw)
  To: Aaditya Kumar
  Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Minchan Kim, frank.rowand, tim.bird, takuzo.ohara, kan.iibuchi

On 06/17/2012 02:48 AM, Aaditya Kumar wrote:

> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> 
>>>
>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>> pgdat_balanced()
>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>> normal zone has no reclaimable page.
>>>
>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>> sleep only if every zones have much free pages than high water mark
>>> _and_ 25% of present pages in node are free.
>>>
>>
>>
>> Sorry. I can't understand your point.
>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>> It seems I am missing your point.
>> Please anybody correct me.
> 
> Since currently direct reclaim is given up based on
> zone->all_unreclaimable flag,
> so for e.g in one of the scenarios:
> 
> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
> hot-remove the all the pages of the MOVABLE zone.
> 
> While migrating pages during memory hot-unplugging, the allocation function
> (for new page to which the page in MOVABLE zone would be moved)  can end up
> looping in direct reclaim path for ever.
> 
> This is so because when most of the pages in the MOVABLE zone have
> been migrated,
> the zone now contains lots of free memory (basically above low watermark)
> BUT all are in MIGRATE_ISOLATE list of the buddy list.
> 
> So kswapd() would not balance this zone as free pages are above low watermark
> (but all are in isolate list). So zone->all_unreclaimable flag would
> never be set for this zone
> and allocation function would end up looping forever. (assuming the
> zone NORMAL is
> left with no reclaimable memory)
>


Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
But I don't see it's a problem of kswapd.

a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
but we can't allocate it. :(
It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
Kswapd is just one of them confused.
As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too. 

This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
free_area[order].nr_free exactly. 

Any thought?

Side Note: I still need KOSAKI's patch with fixed description regardless of this problem because set zone->all_unreclaimable of only kswapd is very fragile.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..19de56c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
 
 out:
        if (!ret) {
+               int pages_moved;
                set_pageblock_migratetype(page, MIGRATE_ISOLATE);
-               move_freepages_block(zone, page, MIGRATE_ISOLATE);
+               pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
+               __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
        }   
 
        spin_unlock_irqrestore(&zone->lock, flags);
@@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 {
        struct zone *zone;
        unsigned long flags;
+       int pages_moved;
        zone = page_zone(page);
        spin_lock_irqsave(&zone->lock, flags);
        if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
                goto out;
        set_pageblock_migratetype(page, migratetype);
-       move_freepages_block(zone, page, migratetype);
+       pages_moved = move_freepages_block(zone, page, migratetype);
+       __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
 out:
        spin_unlock_irqrestore(&zone->lock, flags);
 }
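
(For clarity: both hunks above adjust NR_FREE_PAGES under zone->lock by
the count move_freepages_block() reports as actually moved, so isolated
free pages are subtracted from the counter on isolation and added back
on un-isolation.)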


> 
> Regards,
> Aaditya Kumar
> Sony India Software Centre,
> Bangalore.
> 



-- 
Kind regards,
Minchan Kim


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-15  7:27       ` Minchan Kim
@ 2012-06-16 17:48         ` Aaditya Kumar
  -1 siblings, 0 replies; 51+ messages in thread
From: Aaditya Kumar @ 2012-06-16 17:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Minchan Kim, frank.rowand, tim.bird, takuzo.ohara, kan.iibuchi

On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@kernel.org> wrote:

>>
>> pgdat_balanced() isn't zone-aware; therefore kswapd may sleep if a
>> node has multiple zones. Hm, ok, I realize my description was
>> slightly misleading: priority 0 is not needed. balance_pgdat() calls
>> pgdat_balanced() at every priority. The easiest case is when the
>> movable zone has a lot of free pages and the normal zone has no
>> reclaimable pages.
>>
>> BTW, the current pgdat_balanced() logic seems incorrect. kswapd
>> should sleep only if every zone has more free pages than the high
>> watermark _and_ 25% of the node's present pages are free.
>>
>
>
> Sorry, I can't understand your point.
> The current kswapd doesn't sleep if the relevant zones don't have
> free pages above the high watermark.
> It seems I am missing your point.
> Anybody, please correct me.

Since direct reclaim currently gives up based on the
zone->all_unreclaimable flag, consider for example this scenario:

Let's say the system has one node with two zones (NORMAL and MOVABLE)
and we hot-remove all the pages of the MOVABLE zone.

While migrating pages during memory hot-unplugging, the allocation
function (for the new page to which a page in the MOVABLE zone would
be moved) can end up looping in the direct reclaim path forever.

This is because once most of the pages in the MOVABLE zone have been
migrated, the zone contains lots of free memory (basically above the
low watermark), BUT all of it sits on the MIGRATE_ISOLATE list of the
buddy allocator.

So kswapd() would not balance this zone, as its free pages are above
the low watermark (though all on the isolate list). The
zone->all_unreclaimable flag would therefore never be set for this
zone, and the allocation function would end up looping forever
(assuming the NORMAL zone is left with no reclaimable memory).
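
To make the failure mode concrete, here is a minimal user-space model of
the accounting problem (a sketch for illustration only; all types, names,
and numbers are hypothetical simplifications, not the kernel's code):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified zone accounting. */
struct zone_model {
	long nr_free;          /* models NR_FREE_PAGES */
	long nr_free_isolated; /* free pages stuck on MIGRATE_ISOLATE */
	long low_watermark;
};

/* Models the watermark check: it sees only nr_free, which still
 * counts the isolated pages. */
static bool watermark_ok(const struct zone_model *z)
{
	return z->nr_free >= z->low_watermark;
}

/* Models an allocation attempt: isolated pages are unusable. */
static bool try_alloc(const struct zone_model *z)
{
	return z->nr_free - z->nr_free_isolated > 0;
}

int main(void)
{
	/* The MOVABLE zone after hot-unplug migration: plenty of "free"
	 * pages, every one of them isolated. */
	struct zone_model movable = {
		.nr_free = 1024,
		.nr_free_isolated = 1024,
		.low_watermark = 128,
	};

	/* Prints "watermark ok: 1, allocation succeeds: 0": the zone
	 * looks balanced, so kswapd never sets all_unreclaimable, and
	 * direct reclaim retries forever. */
	printf("watermark ok: %d, allocation succeeds: %d\n",
	       watermark_ok(&movable), try_alloc(&movable));
	return 0;
}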


Regards,
Aaditya Kumar
Sony India Software Centre,
Bangalore.

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-15  7:27       ` Minchan Kim
@ 2012-06-15 12:31         ` Hillf Danton
  -1 siblings, 0 replies; 51+ messages in thread
From: Hillf Danton @ 2012-06-15 12:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki

Hi Minchan and KOSAKI

On Fri, Jun 15, 2012 at 3:27 PM, Minchan Kim <minchan@kernel.org> wrote:
> On 06/15/2012 01:10 AM, KOSAKI Motohiro wrote:
>
>> On Thu, Jun 14, 2012 at 10:57 AM, Minchan Kim <minchan@kernel.org> wrote:
>>> Hi KOSAKI,
>>>
>>> Sorry for the late response.
>>> Let me ask a question about the description.
>>>
>>> On Thu, Jun 14, 2012 at 04:13:12AM -0400, kosaki.motohiro@gmail.com wrote:
>>>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>
>>>> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
>>>> now has two conflicting policies:
>>>>
>>>> 1) kswapd sleeps when it couldn't reclaim any pages after reaching
>>>>    priority 0. This is to avoid an infinite loop in kswapd().
>>>>    That is, kswapd assumes direct reclaim will make enough free
>>>>    pages via either regular page reclaim or the oom-killer.
>>>>    This logic creates a kswapd -> direct-reclaim dependency.
>>>> 2) Direct reclaim continues to reclaim, without invoking the
>>>>    oom-killer, until kswapd turns on zone->all_unreclaimable.
>>>>    This is to avoid too-early oom-kills.
>>>>    This logic creates a direct-reclaim -> kswapd dependency.
>>>>
>>>> In the worst case, direct reclaim may continue page reclaim forever
>>>> while kswapd sleeps forever.
>>>
>>> I have tried to reproduce the scenario you mention above at the code
>>> level, but unfortunately I failed.
>>> If kswapd can't meet the high watermark at order-0, it doesn't
>>> sleep, unless I am missing something.
>>
>> pgdat_balanced() isn't zone-aware; therefore kswapd may sleep if a
>> node has multiple zones. Hm, ok, I realize my description was
>> slightly misleading: priority 0 is not needed. balance_pgdat() calls
>> pgdat_balanced() at every priority. The easiest case is when the
>> movable zone has a lot of free pages and the normal zone has no
>> reclaimable pages.
>>
>> BTW, the current pgdat_balanced() logic seems incorrect. kswapd
>> should sleep only if every zone has more free pages than the high
>> watermark _and_ 25% of the node's present pages are free.
>>
>
>
> Sorry, I can't understand your point.
> The current kswapd doesn't sleep if the relevant zones don't have
> free pages above the high watermark.
> It seems I am missing your point.
> Anybody, please correct me.
>

Who left the comment about unreclaimable zones there, and why?
		/*
		 * balance_pgdat() skips over all_unreclaimable after
		 * DEF_PRIORITY. Effectively, it considers them balanced so
		 * they must be considered balanced here as well if kswapd
		 * is to sleep
		 */

BTW, are you still using prefetch_prev_lru_page?

Good Weekend
Hillf

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-14  8:13 ` kosaki.motohiro
@ 2012-06-15 10:45   ` Mel Gorman
  -1 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2012-06-15 10:45 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-kernel, linux-mm, akpm, KOSAKI Motohiro, Nick Piggin,
	Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, Minchan Kim

On Thu, Jun 14, 2012 at 04:13:12AM -0400, kosaki.motohiro@gmail.com wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
> now has two conflicting policies:
> 
> 1) kswapd sleeps when it couldn't reclaim any pages after reaching
>    priority 0. This is to avoid an infinite loop in kswapd().
>    That is, kswapd assumes direct reclaim will make enough free
>    pages via either regular page reclaim or the oom-killer.
>    This logic creates a kswapd -> direct-reclaim dependency.
> 2) Direct reclaim continues to reclaim, without invoking the
>    oom-killer, until kswapd turns on zone->all_unreclaimable.
>    This is to avoid too-early oom-kills.
>    This logic creates a direct-reclaim -> kswapd dependency.
> 
> In the worst case, direct reclaim may continue page reclaim forever
> while kswapd sleeps forever.
> 
> We can't turn on zone->all_unreclaimable from the direct reclaim path,
> because the direct reclaim path doesn't take any lock, so doing it
> there would be racy.
> 
> Thus this patch removes the zone->all_unreclaimable field completely
> and recalculates the zone's reclaimable state every time.
> 
> Note: we can't have direct reclaim look at zone->pages_scanned
> directly while kswapd continues to use zone->all_unreclaimable,
> because that is racy. Commit 929bea7c71 (vmscan: all_unreclaimable()
> use zone->all_unreclaimable as a name) describes the details.
> 
> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Reported-by: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-14 16:10     ` KOSAKI Motohiro
@ 2012-06-15  7:27       ` Minchan Kim
  -1 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2012-06-15  7:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, linux-mm, akpm, Nick Piggin, Michal Hocko,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim

On 06/15/2012 01:10 AM, KOSAKI Motohiro wrote:

> On Thu, Jun 14, 2012 at 10:57 AM, Minchan Kim <minchan@kernel.org> wrote:
>> Hi KOSAKI,
>>
>> Sorry for the late response.
>> Let me ask a question about the description.
>>
>> On Thu, Jun 14, 2012 at 04:13:12AM -0400, kosaki.motohiro@gmail.com wrote:
>>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>
>>> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
>>> now has two conflicting policies:
>>>
>>> 1) kswapd sleeps when it couldn't reclaim any pages after reaching
>>>    priority 0. This is to avoid an infinite loop in kswapd().
>>>    That is, kswapd assumes direct reclaim will make enough free
>>>    pages via either regular page reclaim or the oom-killer.
>>>    This logic creates a kswapd -> direct-reclaim dependency.
>>> 2) Direct reclaim continues to reclaim, without invoking the
>>>    oom-killer, until kswapd turns on zone->all_unreclaimable.
>>>    This is to avoid too-early oom-kills.
>>>    This logic creates a direct-reclaim -> kswapd dependency.
>>>
>>> In the worst case, direct reclaim may continue page reclaim forever
>>> while kswapd sleeps forever.
>>
>> I have tried to reproduce the scenario you mention above at the code
>> level, but unfortunately I failed.
>> If kswapd can't meet the high watermark at order-0, it doesn't
>> sleep, unless I am missing something.
> 
> pgdat_balanced() isn't zone-aware; therefore kswapd may sleep if a
> node has multiple zones. Hm, ok, I realize my description was
> slightly misleading: priority 0 is not needed. balance_pgdat() calls
> pgdat_balanced() at every priority. The easiest case is when the
> movable zone has a lot of free pages and the normal zone has no
> reclaimable pages.
> 
> BTW, the current pgdat_balanced() logic seems incorrect. kswapd
> should sleep only if every zone has more free pages than the high
> watermark _and_ 25% of the node's present pages are free.
> 


Sorry, I can't understand your point.
The current kswapd doesn't sleep if the relevant zones don't have free
pages above the high watermark.
It seems I am missing your point.
Anybody, please correct me.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-14 14:57   ` Minchan Kim
@ 2012-06-14 16:10     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 51+ messages in thread
From: KOSAKI Motohiro @ 2012-06-14 16:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, akpm, Nick Piggin, Michal Hocko,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim

On Thu, Jun 14, 2012 at 10:57 AM, Minchan Kim <minchan@kernel.org> wrote:
> Hi KOSAKI,
>
> Sorry for the late response.
> Let me ask a question about the description.
>
> On Thu, Jun 14, 2012 at 04:13:12AM -0400, kosaki.motohiro@gmail.com wrote:
>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>
>> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
>> now has two conflicting policies:
>>
>> 1) kswapd sleeps when it couldn't reclaim any pages after reaching
>>    priority 0. This is to avoid an infinite loop in kswapd().
>>    That is, kswapd assumes direct reclaim will make enough free
>>    pages via either regular page reclaim or the oom-killer.
>>    This logic creates a kswapd -> direct-reclaim dependency.
>> 2) Direct reclaim continues to reclaim, without invoking the
>>    oom-killer, until kswapd turns on zone->all_unreclaimable.
>>    This is to avoid too-early oom-kills.
>>    This logic creates a direct-reclaim -> kswapd dependency.
>>
>> In the worst case, direct reclaim may continue page reclaim forever
>> while kswapd sleeps forever.
>
> I have tried to reproduce the scenario you mention above at the code
> level, but unfortunately I failed.
> If kswapd can't meet the high watermark at order-0, it doesn't
> sleep, unless I am missing something.

pgdat_balanced() isn't zone-aware; therefore kswapd may sleep if a
node has multiple zones. Hm, ok, I realize my description was
slightly misleading: priority 0 is not needed. balance_pgdat() calls
pgdat_balanced() at every priority. The easiest case is when the
movable zone has a lot of free pages and the normal zone has no
reclaimable pages.

BTW, the current pgdat_balanced() logic seems incorrect. kswapd
should sleep only if every zone has more free pages than the high
watermark _and_ 25% of the node's present pages are free.
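
For context, here is a rough user-space sketch of the node-level balance
test under discussion (field names and structure are my assumptions for
illustration, not the real kernel code): a zone counts as balanced if it
is over its high watermark or marked all_unreclaimable, and the node
counts as balanced once 25% of its present pages sit in balanced zones.

/* Sketch of a pgdat_balanced()-style check; hypothetical types. */
struct zone_model {
	unsigned long present_pages;
	unsigned long free_pages;
	unsigned long high_wmark;
	int all_unreclaimable;
};

struct node_model {
	struct zone_model *zones;
	int nr_zones;
	unsigned long present_pages;
};

static int node_balanced_model(const struct node_model *node)
{
	unsigned long balanced = 0;
	int i;

	for (i = 0; i < node->nr_zones; i++) {
		const struct zone_model *z = &node->zones[i];

		/* Unreclaimable zones are treated as balanced, as are
		 * zones above their high watermark. */
		if (z->all_unreclaimable || z->free_pages > z->high_wmark)
			balanced += z->present_pages;
	}
	/* 25% of present pages in balanced zones is enough, so one big,
	 * mostly-free movable zone can let kswapd sleep while a small
	 * normal zone still fails direct reclaim. */
	return balanced >= node->present_pages / 4;
}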



> So if kswapd sleeps, it means we already have enough order-0 free pages.
> Hmm, could you describe the scenario you found in detail, at the code level?
> 
> Anyway, as I look at your patch, I can't find any problems.
> I just want to understand the scenario you mention completely in my head.
> Maybe that can help make the description clearer.
>

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-14 15:25   ` Michal Hocko
@ 2012-06-14 15:46     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 51+ messages in thread
From: KOSAKI Motohiro @ 2012-06-14 15:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, akpm, Nick Piggin, Johannes Weiner,
	Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim

On Thu, Jun 14, 2012 at 11:25 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 14-06-12 04:13:12, kosaki.motohiro@gmail.com wrote:
>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>
>> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
>> now has two conflicting policies:
>>
>> 1) kswapd sleeps when it couldn't reclaim any pages after reaching
>>    priority 0. This is to avoid an infinite loop in kswapd().
>>    That is, kswapd assumes direct reclaim will make enough free
>>    pages via either regular page reclaim or the oom-killer.
>>    This logic creates a kswapd -> direct-reclaim dependency.
>> 2) Direct reclaim continues to reclaim, without invoking the
>>    oom-killer, until kswapd turns on zone->all_unreclaimable.
>>    This is to avoid too-early oom-kills.
>>    This logic creates a direct-reclaim -> kswapd dependency.
>>
>> In the worst case, direct reclaim may continue page reclaim forever
>> while kswapd sleeps forever.
>>
>> We can't turn on zone->all_unreclaimable from the direct reclaim path,
>> because the direct reclaim path doesn't take any lock, so doing it
>> there would be racy.
>>
>> Thus this patch removes the zone->all_unreclaimable field completely
>> and recalculates the zone's reclaimable state every time.
>>
>> Note: we can't have direct reclaim look at zone->pages_scanned
>> directly while kswapd continues to use zone->all_unreclaimable,
>> because that is racy. Commit 929bea7c71 (vmscan: all_unreclaimable()
>> use zone->all_unreclaimable as a name) describes the details.
>>
>> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
>> Reported-by: Ying Han <yinghan@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Acked-by: Rik van Riel <riel@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Cc: Minchan Kim <minchan.kim@gmail.com>
>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> Looks good, just one comment below:
>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
>
> [...]
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index eeb3bc9..033671c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
> [...]
>> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>>               if (global_reclaim(sc)) {
>>                       if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>                               continue;
>> -                     if (zone->all_unreclaimable &&
>> -                                     sc->priority != DEF_PRIORITY)
>> +                     if (!zone_reclaimable(zone) &&
>> +                         sc->priority != DEF_PRIORITY)
>
> Not exactly a hot path, but it would still be nice to test the priority
> first, as that test is cheaper (maybe the compiler is clever enough to
> reorder this, since both expressions are independent and free of side
> effects, but...).

ok, will fix.

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-14  8:13 ` kosaki.motohiro
@ 2012-06-14 15:25   ` Michal Hocko
  -1 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2012-06-14 15:25 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-kernel, linux-mm, akpm, KOSAKI Motohiro, Nick Piggin,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim

On Thu 14-06-12 04:13:12, kosaki.motohiro@gmail.com wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
> now has two conflicting policies:
> 
> 1) kswapd sleeps when it couldn't reclaim any pages after reaching
>    priority 0. This is to avoid an infinite loop in kswapd().
>    That is, kswapd assumes direct reclaim will make enough free
>    pages via either regular page reclaim or the oom-killer.
>    This logic creates a kswapd -> direct-reclaim dependency.
> 2) Direct reclaim continues to reclaim, without invoking the
>    oom-killer, until kswapd turns on zone->all_unreclaimable.
>    This is to avoid too-early oom-kills.
>    This logic creates a direct-reclaim -> kswapd dependency.
> 
> In the worst case, direct reclaim may continue page reclaim forever
> while kswapd sleeps forever.
> 
> We can't turn on zone->all_unreclaimable from the direct reclaim path,
> because the direct reclaim path doesn't take any lock, so doing it
> there would be racy.
> 
> Thus this patch removes the zone->all_unreclaimable field completely
> and recalculates the zone's reclaimable state every time.
> 
> Note: we can't have direct reclaim look at zone->pages_scanned
> directly while kswapd continues to use zone->all_unreclaimable,
> because that is racy. Commit 929bea7c71 (vmscan: all_unreclaimable()
> use zone->all_unreclaimable as a name) describes the details.
> 
> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Reported-by: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Looks good, just one comment below:

Reviewed-by: Michal Hocko <mhocko@suse.cz>

[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..033671c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[...]
> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		if (global_reclaim(sc)) {
>  			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  				continue;
> -			if (zone->all_unreclaimable &&
> -					sc->priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) &&
> +			    sc->priority != DEF_PRIORITY)

Not exactly a hot path, but it would still be nice to test the priority
first, as that test is cheaper (maybe the compiler is clever enough to
reorder this, since both expressions are independent and free of side
effects, but...).
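
For illustration, the reordered test would look like this (a sketch of
the suggestion, not committed code):

	if (sc->priority != DEF_PRIORITY && !zone_reclaimable(zone))
		continue;	/* Let kswapd poll it */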

[...]
> @@ -2393,8 +2388,7 @@ loop_again:
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>  				continue;

Same here

>  
>  			/*
> @@ -2443,14 +2437,13 @@ loop_again:
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab, testorder;
> +			int testorder;
>  			unsigned long balance_gap;
>  
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>  				continue;

Same here

[...]
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-14  8:13 ` kosaki.motohiro
@ 2012-06-14 14:57   ` Minchan Kim
  -1 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2012-06-14 14:57 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-kernel, linux-mm, akpm, KOSAKI Motohiro, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Minchan Kim

Hi KOSAKI,

Sorry for the late response.
Let me ask a question about the description.

On Thu, Jun 14, 2012 at 04:13:12AM -0400, kosaki.motohiro@gmail.com wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
> now has two conflicting policies:
> 
> 1) kswapd sleeps when it couldn't reclaim any pages after reaching
>    priority 0. This is to avoid an infinite loop in kswapd().
>    That is, kswapd assumes direct reclaim will make enough free
>    pages via either regular page reclaim or the oom-killer.
>    This logic creates a kswapd -> direct-reclaim dependency.
> 2) Direct reclaim continues to reclaim, without invoking the
>    oom-killer, until kswapd turns on zone->all_unreclaimable.
>    This is to avoid too-early oom-kills.
>    This logic creates a direct-reclaim -> kswapd dependency.
> 
> In the worst case, direct reclaim may continue page reclaim forever
> while kswapd sleeps forever.

I have tried to reproduce the scenario you mention above at the code
level, but unfortunately I failed.
If kswapd can't meet the high watermark at order-0, it doesn't sleep,
unless I am missing something.
So if kswapd sleeps, it means we already have enough order-0 free pages.
Hmm, could you describe the scenario you found in detail, at the code level?

Anyway, as I look at your patch, I can't find any problems.
I just want to understand the scenario you mention completely in my head.
Maybe that can help make the description clearer.

Thanks.
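
For reference, the core of the long patch quoted below is this
recomputed check (lifted from the patch itself):

	static inline bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}

That is, a zone stops counting as reclaimable once it has been scanned
six times over its reclaimable pages without anything being freed
(pages_scanned is reset whenever pages are freed back to the zone), so
direct reclaim can give up without depending on kswapd to set a flag.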

> 
> We can't turn on zone->all_unreclaimable from the direct reclaim path,
> because the direct reclaim path doesn't take any lock, so doing it
> there would be racy.
> 
> Thus this patch removes the zone->all_unreclaimable field completely
> and recalculates the zone's reclaimable state every time.
> 
> Note: we can't have direct reclaim look at zone->pages_scanned
> directly while kswapd continues to use zone->all_unreclaimable,
> because that is racy. Commit 929bea7c71 (vmscan: all_unreclaimable()
> use zone->all_unreclaimable as a name) describes the details.
> 
> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Reported-by: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  include/linux/mm_inline.h |   19 +++++++++++++++++
>  include/linux/mmzone.h    |    2 +-
>  include/linux/vmstat.h    |    1 -
>  mm/page-writeback.c       |    2 +
>  mm/page_alloc.c           |    5 +--
>  mm/vmscan.c               |   48 ++++++++++++--------------------------------
>  mm/vmstat.c               |    3 +-
>  7 files changed, 39 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 1397ccf..04f32e1 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -2,6 +2,7 @@
>  #define LINUX_MM_INLINE_H
>  
>  #include <linux/huge_mm.h>
> +#include <linux/swap.h>
>  
>  /**
>   * page_is_file_cache - should the page be on a file LRU or anon LRU?
> @@ -99,4 +100,22 @@ static __always_inline enum lru_list page_lru(struct page *page)
>  	return lru;
>  }
>  
> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> +{
> +	int nr;
> +
> +	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> +	     zone_page_state(zone, NR_INACTIVE_FILE);
> +
> +	if (nr_swap_pages > 0)
> +		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> +		      zone_page_state(zone, NR_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
> +static inline bool zone_reclaimable(struct zone *zone)
> +{
> +	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +}
>  #endif
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2427706..9d2a720 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -368,7 +368,7 @@ struct zone {
>  	 * free areas of different sizes
>  	 */
>  	spinlock_t		lock;
> -	int                     all_unreclaimable; /* All pages pinned */
> +
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  	/* see spanned/present_pages for more description */
>  	seqlock_t		span_seqlock;
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 65efb92..9607256 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -140,7 +140,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>  }
>  
>  extern unsigned long global_reclaimable_pages(void);
> -extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  
>  #ifdef CONFIG_NUMA
>  /*
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 93d8d2f..d2d957f 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -34,6 +34,8 @@
>  #include <linux/syscalls.h>
>  #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
>  #include <linux/pagevec.h>
> +#include <linux/mm_inline.h>
> +
>  #include <trace/events/writeback.h>
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..5716b00 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -59,6 +59,7 @@
>  #include <linux/prefetch.h>
>  #include <linux/migrate.h>
>  #include <linux/page-debug-flags.h>
> +#include <linux/mm_inline.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -638,7 +639,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	int to_free = count;
>  
>  	spin_lock(&zone->lock);
> -	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
>  
>  	while (to_free) {
> @@ -680,7 +680,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
>  				int migratetype)
>  {
>  	spin_lock(&zone->lock);
> -	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
>  
>  	__free_one_page(page, zone, order, migratetype);
> @@ -2870,7 +2869,7 @@ void show_free_areas(unsigned int filter)
>  			K(zone_page_state(zone, NR_BOUNCE)),
>  			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
>  			zone->pages_scanned,
> -			(zone->all_unreclaimable ? "yes" : "no")
> +			(!zone_reclaimable(zone) ? "yes" : "no")
>  			);
>  		printk("lowmem_reserve[]:");
>  		for (i = 0; i < MAX_NR_ZONES; i++)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..033671c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1592,7 +1592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	 * latencies, so it's better to scan a minimum amount there as
>  	 * well.
>  	 */
> -	if (current_is_kswapd() && zone->all_unreclaimable)
> +	if (current_is_kswapd() && !zone_reclaimable(zone))
>  		force_scan = true;
>  	if (!global_reclaim(sc))
>  		force_scan = true;
> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		if (global_reclaim(sc)) {
>  			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  				continue;
> -			if (zone->all_unreclaimable &&
> -					sc->priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) &&
> +			    sc->priority != DEF_PRIORITY)
>  				continue;	/* Let kswapd poll it */
>  			if (COMPACTION_BUILD) {
>  				/*
> @@ -1975,11 +1975,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	return aborted_reclaim;
>  }
>  
> -static bool zone_reclaimable(struct zone *zone)
> -{
> -	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> -}
> -
>  /* All zones in zonelist are unreclaimable? */
>  static bool all_unreclaimable(struct zonelist *zonelist,
>  		struct scan_control *sc)
> @@ -1993,7 +1988,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>  			continue;
>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  			continue;
> -		if (!zone->all_unreclaimable)
> +		if (zone_reclaimable(zone))
>  			return false;
>  	}
>  
> @@ -2299,7 +2294,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  		 * they must be considered balanced here as well if kswapd
>  		 * is to sleep
>  		 */
> -		if (zone->all_unreclaimable) {
> +		if (!zone_reclaimable(zone)) {
>  			balanced += zone->present_pages;
>  			continue;
>  		}
> @@ -2393,8 +2388,7 @@ loop_again:
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>  				continue;
>  
>  			/*
> @@ -2443,14 +2437,13 @@ loop_again:
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab, testorder;
> +			int testorder;
>  			unsigned long balance_gap;
>  
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>  				continue;
>  
>  			sc.nr_scanned = 0;
> @@ -2497,12 +2490,11 @@ loop_again:
>  				shrink_zone(zone, &sc);
>  
>  				reclaim_state->reclaimed_slab = 0;
> -				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> +				shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  				total_scanned += sc.nr_scanned;
>  
> -				if (nr_slab == 0 && !zone_reclaimable(zone))
> -					zone->all_unreclaimable = 1;
> +
>  			}
>  
>  			/*
> @@ -2514,7 +2506,7 @@ loop_again:
>  			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>  				sc.may_writepage = 1;
>  
> -			if (zone->all_unreclaimable) {
> +			if (!zone_reclaimable(zone)) {
>  				if (end_zone && end_zone == i)
>  					end_zone--;
>  				continue;
> @@ -2616,7 +2608,7 @@ out:
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> +			if (!zone_reclaimable(zone) &&
>  			    sc.priority != DEF_PRIORITY)
>  				continue;
>  
> @@ -2850,20 +2842,6 @@ unsigned long global_reclaimable_pages(void)
>  	return nr;
>  }
>  
> -unsigned long zone_reclaimable_pages(struct zone *zone)
> -{
> -	int nr;
> -
> -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -	     zone_page_state(zone, NR_INACTIVE_FILE);
> -
> -	if (nr_swap_pages > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON);
> -
> -	return nr;
> -}
> -
>  #ifdef CONFIG_HIBERNATION
>  /*
>   * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
> @@ -3158,7 +3136,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>  		return ZONE_RECLAIM_FULL;
>  
> -	if (zone->all_unreclaimable)
> +	if (!zone_reclaimable(zone))
>  		return ZONE_RECLAIM_FULL;
>  
>  	/*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1bbbbd9..94b9d4c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -19,6 +19,7 @@
>  #include <linux/math64.h>
>  #include <linux/writeback.h>
>  #include <linux/compaction.h>
> +#include <linux/mm_inline.h>
>  
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
> @@ -1022,7 +1023,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  all_unreclaimable: %u"
>  		   "\n  start_pfn:         %lu"
>  		   "\n  inactive_ratio:    %u",
> -		   zone->all_unreclaimable,
> +		   !zone_reclaimable(zone),
>  		   zone->zone_start_pfn,
>  		   zone->inactive_ratio);
>  	seq_putc(m, '\n');
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
@ 2012-06-14 14:57   ` Minchan Kim
  0 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2012-06-14 14:57 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-kernel, linux-mm, akpm, KOSAKI Motohiro, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Minchan Kim

Hi KOSAKI,

Sorry for late response.
Let me ask a question about description.

On Thu, Jun 14, 2012 at 04:13:12AM -0400, kosaki.motohiro@gmail.com wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
> now has two conflicting policies:
> 
> 1) kswapd sleeps when it couldn't reclaim any page after reaching
>    priority 0, to avoid an infinite loop in kswapd(). That said,
>    kswapd assumes that direct reclaim will make enough free pages,
>    via either regular page reclaim or the oom-killer.
>    This logic creates a kswapd -> direct-reclaim dependency.
> 2) direct reclaim continues to reclaim without invoking the oom-killer
>    until kswapd turns on zone->all_unreclaimable, to avoid a too-early
>    oom-kill.
>    This logic creates a direct-reclaim -> kswapd dependency.
> 
> In the worst case, direct reclaim may continue page reclaim forever
> while kswapd sleeps forever.

I have tried to reproduce the scenario you mention above at the code level,
but unfortunately I failed.
If kswapd can't meet the high watermark on order-0, it doesn't sleep, unless
I'm missing something. So if kswapd sleeps, it means we already have enough
order-0 free pages.
Hmm, could you describe the scenario you found in detail, at the code level?
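
For reference, this is the check I have in mind, as a toy userspace model
(made-up numbers and helper names, not the actual balance_pgdat() code):

    #include <stdbool.h>
    #include <stdio.h>

    /* kswapd may sleep only once order-0 free pages meet the high watermark */
    static bool kswapd_may_sleep(unsigned long nr_free, unsigned long high_wmark)
    {
            return nr_free >= high_wmark;
    }

    int main(void)
    {
            const unsigned long high_wmark = 1000;  /* made-up threshold */

            printf("%d\n", kswapd_may_sleep(800, high_wmark));   /* 0: keep scanning */
            printf("%d\n", kswapd_may_sleep(1200, high_wmark));  /* 1: order-0 is fine */
            return 0;
    }

So, by this reading, "kswapd is asleep" already implies the order-0
watermark was met at some point.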

Anyway, looking at your patch, I can't find any problem with it.
I just want to understand the scenario you mention completely in my head.
Maybe that can help make the description clearer.

Thanks.

> 
> We can't turn on zone->all_unreclaimable from the direct reclaim path
> because the direct reclaim path doesn't take any lock, so that way
> would be racy.
> 
> Thus this patch removes the zone->all_unreclaimable field completely
> and recalculates the zone's reclaimable state every time.
> 
> Note: we can't take the approach of having direct reclaim look at
> zone->pages_scanned directly while kswapd continues to use
> zone->all_unreclaimable, because that is racy. Commit 929bea7c71
> (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
> describes the details.
> 
> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Reported-by: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  include/linux/mm_inline.h |   19 +++++++++++++++++
>  include/linux/mmzone.h    |    2 +-
>  include/linux/vmstat.h    |    1 -
>  mm/page-writeback.c       |    2 +
>  mm/page_alloc.c           |    5 +--
>  mm/vmscan.c               |   48 ++++++++++++--------------------------------
>  mm/vmstat.c               |    3 +-
>  7 files changed, 39 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 1397ccf..04f32e1 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -2,6 +2,7 @@
>  #define LINUX_MM_INLINE_H
>  
>  #include <linux/huge_mm.h>
> +#include <linux/swap.h>
>  
>  /**
>   * page_is_file_cache - should the page be on a file LRU or anon LRU?
> @@ -99,4 +100,22 @@ static __always_inline enum lru_list page_lru(struct page *page)
>  	return lru;
>  }
>  
> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> +{
> +	int nr;
> +
> +	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> +	     zone_page_state(zone, NR_INACTIVE_FILE);
> +
> +	if (nr_swap_pages > 0)
> +		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> +		      zone_page_state(zone, NR_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
> +static inline bool zone_reclaimable(struct zone *zone)
> +{
> +	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +}
>  #endif
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2427706..9d2a720 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -368,7 +368,7 @@ struct zone {
>  	 * free areas of different sizes
>  	 */
>  	spinlock_t		lock;
> -	int                     all_unreclaimable; /* All pages pinned */
> +
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  	/* see spanned/present_pages for more description */
>  	seqlock_t		span_seqlock;
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 65efb92..9607256 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -140,7 +140,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>  }
>  
>  extern unsigned long global_reclaimable_pages(void);
> -extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  
>  #ifdef CONFIG_NUMA
>  /*
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 93d8d2f..d2d957f 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -34,6 +34,8 @@
>  #include <linux/syscalls.h>
>  #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
>  #include <linux/pagevec.h>
> +#include <linux/mm_inline.h>
> +
>  #include <trace/events/writeback.h>
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..5716b00 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -59,6 +59,7 @@
>  #include <linux/prefetch.h>
>  #include <linux/migrate.h>
>  #include <linux/page-debug-flags.h>
> +#include <linux/mm_inline.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -638,7 +639,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	int to_free = count;
>  
>  	spin_lock(&zone->lock);
> -	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
>  
>  	while (to_free) {
> @@ -680,7 +680,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
>  				int migratetype)
>  {
>  	spin_lock(&zone->lock);
> -	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
>  
>  	__free_one_page(page, zone, order, migratetype);
> @@ -2870,7 +2869,7 @@ void show_free_areas(unsigned int filter)
>  			K(zone_page_state(zone, NR_BOUNCE)),
>  			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
>  			zone->pages_scanned,
> -			(zone->all_unreclaimable ? "yes" : "no")
> +		       (zone_reclaimable(zone) ? "yes" : "no")
>  			);
>  		printk("lowmem_reserve[]:");
>  		for (i = 0; i < MAX_NR_ZONES; i++)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..033671c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1592,7 +1592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	 * latencies, so it's better to scan a minimum amount there as
>  	 * well.
>  	 */
> -	if (current_is_kswapd() && zone->all_unreclaimable)
> +	if (current_is_kswapd() && !zone_reclaimable(zone))
>  		force_scan = true;
>  	if (!global_reclaim(sc))
>  		force_scan = true;
> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		if (global_reclaim(sc)) {
>  			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  				continue;
> -			if (zone->all_unreclaimable &&
> -					sc->priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) &&
> +			    sc->priority != DEF_PRIORITY)
>  				continue;	/* Let kswapd poll it */
>  			if (COMPACTION_BUILD) {
>  				/*
> @@ -1975,11 +1975,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	return aborted_reclaim;
>  }
>  
> -static bool zone_reclaimable(struct zone *zone)
> -{
> -	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> -}
> -
>  /* All zones in zonelist are unreclaimable? */
>  static bool all_unreclaimable(struct zonelist *zonelist,
>  		struct scan_control *sc)
> @@ -1993,7 +1988,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>  			continue;
>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  			continue;
> -		if (!zone->all_unreclaimable)
> +		if (zone_reclaimable(zone))
>  			return false;
>  	}
>  
> @@ -2299,7 +2294,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  		 * they must be considered balanced here as well if kswapd
>  		 * is to sleep
>  		 */
> -		if (zone->all_unreclaimable) {
> +		if (zone_reclaimable(zone)) {
>  			balanced += zone->present_pages;
>  			continue;
>  		}
> @@ -2393,8 +2388,7 @@ loop_again:
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>  				continue;
>  
>  			/*
> @@ -2443,14 +2437,13 @@ loop_again:
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab, testorder;
> +			int testorder;
>  			unsigned long balance_gap;
>  
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> +			if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
>  				continue;
>  
>  			sc.nr_scanned = 0;
> @@ -2497,12 +2490,11 @@ loop_again:
>  				shrink_zone(zone, &sc);
>  
>  				reclaim_state->reclaimed_slab = 0;
> -				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> +				shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  				total_scanned += sc.nr_scanned;
>  
> -				if (nr_slab == 0 && !zone_reclaimable(zone))
> -					zone->all_unreclaimable = 1;
> +
>  			}
>  
>  			/*
> @@ -2514,7 +2506,7 @@ loop_again:
>  			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>  				sc.may_writepage = 1;
>  
> -			if (zone->all_unreclaimable) {
> +			if (!zone_reclaimable(zone)) {
>  				if (end_zone && end_zone == i)
>  					end_zone--;
>  				continue;
> @@ -2616,7 +2608,7 @@ out:
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> +			if (!zone_reclaimable(zone) &&
>  			    sc.priority != DEF_PRIORITY)
>  				continue;
>  
> @@ -2850,20 +2842,6 @@ unsigned long global_reclaimable_pages(void)
>  	return nr;
>  }
>  
> -unsigned long zone_reclaimable_pages(struct zone *zone)
> -{
> -	int nr;
> -
> -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -	     zone_page_state(zone, NR_INACTIVE_FILE);
> -
> -	if (nr_swap_pages > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON);
> -
> -	return nr;
> -}
> -
>  #ifdef CONFIG_HIBERNATION
>  /*
>   * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
> @@ -3158,7 +3136,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>  		return ZONE_RECLAIM_FULL;
>  
> -	if (zone->all_unreclaimable)
> +	if (!zone_reclaimable(zone))
>  		return ZONE_RECLAIM_FULL;
>  
>  	/*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1bbbbd9..94b9d4c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -19,6 +19,7 @@
>  #include <linux/math64.h>
>  #include <linux/writeback.h>
>  #include <linux/compaction.h>
> +#include <linux/mm_inline.h>
>  
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
> @@ -1022,7 +1023,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  all_unreclaimable: %u"
>  		   "\n  start_pfn:         %lu"
>  		   "\n  inactive_ratio:    %u",
> -		   zone->all_unreclaimable,
> +		   !zone_reclaimable(zone),
>  		   zone->zone_start_pfn,
>  		   zone->inactive_ratio);
>  	seq_putc(m, '\n');
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-14  8:13 ` kosaki.motohiro
@ 2012-06-14  8:51   ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 51+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-14  8:51 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-kernel, linux-mm, akpm, KOSAKI Motohiro, Nick Piggin,
	Michal Hocko, Johannes Weiner, Mel Gorman, Minchan Kim

(2012/06/14 17:13), kosaki.motohiro@gmail.com wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
> now has two conflicting policies:
> 
> 1) kswapd sleeps when it couldn't reclaim any page after reaching
>    priority 0, to avoid an infinite loop in kswapd(). That said,
>    kswapd assumes that direct reclaim will make enough free pages,
>    via either regular page reclaim or the oom-killer.
>    This logic creates a kswapd -> direct-reclaim dependency.
> 2) direct reclaim continues to reclaim without invoking the oom-killer
>    until kswapd turns on zone->all_unreclaimable, to avoid a too-early
>    oom-kill.
>    This logic creates a direct-reclaim -> kswapd dependency.
> 
> In the worst case, direct reclaim may continue page reclaim forever
> while kswapd sleeps forever.
> 
> We can't turn on zone->all_unreclaimable from the direct reclaim path
> because the direct reclaim path doesn't take any lock, so that way
> would be racy.
> 
> Thus this patch removes the zone->all_unreclaimable field completely
> and recalculates the zone's reclaimable state every time.
> 
> Note: we can't take the approach of having direct reclaim look at
> zone->pages_scanned directly while kswapd continues to use
> zone->all_unreclaimable, because that is racy. Commit 929bea7c71
> (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
> describes the details.
> 
> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Reported-by: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

I like this.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
  2012-06-14  8:13 ` kosaki.motohiro
@ 2012-06-14  8:43   ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2012-06-14  8:43 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-kernel, linux-mm, akpm, KOSAKI Motohiro, Nick Piggin,
	Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim

On Thu, Jun 14, 2012 at 04:13:12AM -0400, kosaki.motohiro@gmail.com wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Currently, do_try_to_free_pages() can enter a livelock, because vmscan
> now has two conflicting policies:
> 
> 1) kswapd sleeps when it couldn't reclaim any page after reaching
>    priority 0, to avoid an infinite loop in kswapd(). That said,
>    kswapd assumes that direct reclaim will make enough free pages,
>    via either regular page reclaim or the oom-killer.
>    This logic creates a kswapd -> direct-reclaim dependency.
> 2) direct reclaim continues to reclaim without invoking the oom-killer
>    until kswapd turns on zone->all_unreclaimable, to avoid a too-early
>    oom-kill.
>    This logic creates a direct-reclaim -> kswapd dependency.
> 
> In the worst case, direct reclaim may continue page reclaim forever
> while kswapd sleeps forever.
> 
> We can't turn on zone->all_unreclaimable from the direct reclaim path
> because the direct reclaim path doesn't take any lock, so that way
> would be racy.
> 
> Thus this patch removes the zone->all_unreclaimable field completely
> and recalculates the zone's reclaimable state every time.
> 
> Note: we can't take the approach of having direct reclaim look at
> zone->pages_scanned directly while kswapd continues to use
> zone->all_unreclaimable, because that is racy. Commit 929bea7c71
> (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
> describes the details.
> 
> Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
> Reported-by: Ying Han <yinghan@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> @@ -2497,12 +2490,11 @@ loop_again:
>  				shrink_zone(zone, &sc);
>  
>  				reclaim_state->reclaimed_slab = 0;
> -				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> +				shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  				total_scanned += sc.nr_scanned;
>  
> -				if (nr_slab == 0 && !zone_reclaimable(zone))
> -					zone->all_unreclaimable = 1;
> +

That IS a slight change in behaviour.  But then, if you scanned 6
times the amount of reclaimable pages without freeing a single slab
page, it's probably not worth going on.
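
Concretely, the zone_reclaimable() test gives up once six times the zone's
reclaimable pages have been scanned; here is a toy userspace model with
made-up numbers (not the kernel code itself):

    #include <stdbool.h>
    #include <stdio.h>

    /* the same comparison zone_reclaimable() does, on plain numbers */
    static bool zone_reclaimable(unsigned long pages_scanned,
                                 unsigned long reclaimable_pages)
    {
            return pages_scanned < reclaimable_pages * 6;
    }

    int main(void)
    {
            const unsigned long reclaimable = 1000;  /* hypothetical zone */

            printf("%d\n", zone_reclaimable(5999, reclaimable));  /* 1: keep trying */
            printf("%d\n", zone_reclaimable(6000, reclaimable));  /* 0: give up */
            return 0;
    }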

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock
@ 2012-06-14  8:13 ` kosaki.motohiro
  0 siblings, 0 replies; 51+ messages in thread
From: kosaki.motohiro @ 2012-06-14  8:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, akpm, KOSAKI Motohiro, Nick Piggin, Michal Hocko,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Currently, do_try_to_free_pages() can enter a livelock, because vmscan
now has two conflicting policies:

1) kswapd sleeps when it couldn't reclaim any page after reaching
   priority 0, to avoid an infinite loop in kswapd(). That said,
   kswapd assumes that direct reclaim will make enough free pages,
   via either regular page reclaim or the oom-killer.
   This logic creates a kswapd -> direct-reclaim dependency.
2) direct reclaim continues to reclaim without invoking the oom-killer
   until kswapd turns on zone->all_unreclaimable, to avoid a too-early
   oom-kill.
   This logic creates a direct-reclaim -> kswapd dependency.

In the worst case, direct reclaim may continue page reclaim forever
while kswapd sleeps forever.

We can't turn on zone->all_unreclaimable from the direct reclaim path
because the direct reclaim path doesn't take any lock, so that way
would be racy.

Thus this patch removes the zone->all_unreclaimable field completely
and recalculates the zone's reclaimable state every time.

Note: we can't take the approach of having direct reclaim look at
zone->pages_scanned directly while kswapd continues to use
zone->all_unreclaimable, because that is racy. Commit 929bea7c71
(vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
describes the details.
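
As a toy userspace model of this loop (illustrative only, not the actual
mm/vmscan.c code):

    #include <stdbool.h>
    #include <stdio.h>

    static bool all_unreclaimable;  /* only "kswapd" ever sets this */
    static bool kswapd_running;     /* false: kswapd sleeps forever */

    int main(void)
    {
            unsigned long retries = 0;

            /* direct reclaim keeps retrying until the flag turns on */
            while (!all_unreclaimable) {
                    if (kswapd_running)
                            all_unreclaimable = true;  /* never happens here */
                    if (++retries == 100000000UL) {
                            puts("livelock: no progress and no oom-kill");
                            break;
                    }
            }
            return 0;
    }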

Reported-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Reported-by: Ying Han <yinghan@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mm_inline.h |   19 +++++++++++++++++
 include/linux/mmzone.h    |    2 +-
 include/linux/vmstat.h    |    1 -
 mm/page-writeback.c       |    2 +
 mm/page_alloc.c           |    5 +--
 mm/vmscan.c               |   48 ++++++++++++--------------------------------
 mm/vmstat.c               |    3 +-
 7 files changed, 39 insertions(+), 41 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 1397ccf..04f32e1 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -2,6 +2,7 @@
 #define LINUX_MM_INLINE_H
 
 #include <linux/huge_mm.h>
+#include <linux/swap.h>
 
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
@@ -99,4 +100,22 @@ static __always_inline enum lru_list page_lru(struct page *page)
 	return lru;
 }
 
+static inline unsigned long zone_reclaimable_pages(struct zone *zone)
+{
+	int nr;
+
+	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
+	     zone_page_state(zone, NR_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
+		      zone_page_state(zone, NR_INACTIVE_ANON);
+
+	return nr;
+}
+
+static inline bool zone_reclaimable(struct zone *zone)
+{
+	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+}
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2427706..9d2a720 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,7 +368,7 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	int                     all_unreclaimable; /* All pages pinned */
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 65efb92..9607256 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -140,7 +140,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 }
 
 extern unsigned long global_reclaimable_pages(void);
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 
 #ifdef CONFIG_NUMA
 /*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 93d8d2f..d2d957f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,8 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
 #include <linux/pagevec.h>
+#include <linux/mm_inline.h>
+
 #include <trace/events/writeback.h>
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..5716b00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/prefetch.h>
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -638,7 +639,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int to_free = count;
 
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	while (to_free) {
@@ -680,7 +680,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	__free_one_page(page, zone, order, migratetype);
@@ -2870,7 +2869,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
 			zone->pages_scanned,
-			(zone->all_unreclaimable ? "yes" : "no")
+		       (zone_reclaimable(zone) ? "yes" : "no")
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeb3bc9..033671c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1592,7 +1592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && zone->all_unreclaimable)
+	if (current_is_kswapd() && !zone_reclaimable(zone))
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
@@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			if (zone->all_unreclaimable &&
-					sc->priority != DEF_PRIORITY)
+			if (!zone_reclaimable(zone) &&
+			    sc->priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
 			if (COMPACTION_BUILD) {
 				/*
@@ -1975,11 +1975,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	return aborted_reclaim;
 }
 
-static bool zone_reclaimable(struct zone *zone)
-{
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
-}
-
 /* All zones in zonelist are unreclaimable? */
 static bool all_unreclaimable(struct zonelist *zonelist,
 		struct scan_control *sc)
@@ -1993,7 +1988,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (!zone->all_unreclaimable)
+		if (zone_reclaimable(zone))
 			return false;
 	}
 
@@ -2299,7 +2294,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 		 * they must be considered balanced here as well if kswapd
 		 * is to sleep
 		 */
-		if (zone->all_unreclaimable) {
+		if (zone_reclaimable(zone)) {
 			balanced += zone->present_pages;
 			continue;
 		}
@@ -2393,8 +2388,7 @@ loop_again:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
+			if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
 				continue;
 
 			/*
@@ -2443,14 +2437,13 @@ loop_again:
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab, testorder;
+			int testorder;
 			unsigned long balance_gap;
 
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
+			if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
 				continue;
 
 			sc.nr_scanned = 0;
@@ -2497,12 +2490,11 @@ loop_again:
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
-				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				shrink_slab(&shrink, sc.nr_scanned, lru_pages);
 				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 				total_scanned += sc.nr_scanned;
 
-				if (nr_slab == 0 && !zone_reclaimable(zone))
-					zone->all_unreclaimable = 1;
+
 			}
 
 			/*
@@ -2514,7 +2506,7 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
-			if (zone->all_unreclaimable) {
+			if (!zone_reclaimable(zone)) {
 				if (end_zone && end_zone == i)
 					end_zone--;
 				continue;
@@ -2616,7 +2608,7 @@ out:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
+			if (!zone_reclaimable(zone) &&
 			    sc.priority != DEF_PRIORITY)
 				continue;
 
@@ -2850,20 +2842,6 @@ unsigned long global_reclaimable_pages(void)
 	return nr;
 }
 
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-	int nr;
-
-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
-	     zone_page_state(zone, NR_INACTIVE_FILE);
-
-	if (nr_swap_pages > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON);
-
-	return nr;
-}
-
 #ifdef CONFIG_HIBERNATION
 /*
  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
@@ -3158,7 +3136,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return ZONE_RECLAIM_FULL;
 
-	if (zone->all_unreclaimable)
+	if (!zone_reclaimable(zone))
 		return ZONE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1bbbbd9..94b9d4c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -19,6 +19,7 @@
 #include <linux/math64.h>
 #include <linux/writeback.h>
 #include <linux/compaction.h>
+#include <linux/mm_inline.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -1022,7 +1023,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  all_unreclaimable: %u"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
-		   zone->all_unreclaimable,
+		   !zone_reclaimable(zone),
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2013-08-28  2:01 UTC | newest]

Thread overview: 51+ messages
2013-08-05  2:26 [resend] [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock Lisa Du
2013-08-05  2:56 ` Minchan Kim
2013-08-05  4:53 ` Johannes Weiner
2013-08-05  5:02   ` Minchan Kim
2013-08-05  7:41 ` Michal Hocko
2013-08-06  9:23   ` [resend] [PATCH V2] " Lisa Du
2013-08-06 10:35     ` Michal Hocko
2013-08-07  1:42       ` Lisa Du
2013-08-08 18:14         ` Johannes Weiner
2013-08-12  1:46           ` [resend] [PATCH V3] " Lisa Du
2013-08-20 22:16             ` Andrew Morton
2013-08-22  5:24               ` Lisa Du
2013-08-22  6:24                 ` Minchan Kim
2013-08-22  7:14                   ` Lisa Du
2013-08-27 19:43             ` Andrew Morton
2013-08-28  1:58               ` Lisa Du
2013-08-19  8:19           ` Lisa Du
  -- strict thread matches above, loose matches on Subject: below --
2012-06-14  8:13 [resend][PATCH] mm, " kosaki.motohiro
2012-06-14  8:13 ` kosaki.motohiro
2012-06-14  8:43 ` Johannes Weiner
2012-06-14  8:43   ` Johannes Weiner
2012-06-14  8:51 ` Kamezawa Hiroyuki
2012-06-14  8:51   ` Kamezawa Hiroyuki
2012-06-14 14:57 ` Minchan Kim
2012-06-14 14:57   ` Minchan Kim
2012-06-14 16:10   ` KOSAKI Motohiro
2012-06-14 16:10     ` KOSAKI Motohiro
2012-06-15  7:27     ` Minchan Kim
2012-06-15  7:27       ` Minchan Kim
2012-06-15 12:31       ` Hillf Danton
2012-06-15 12:31         ` Hillf Danton
2012-06-19 21:17         ` KOSAKI Motohiro
2012-06-19 21:17           ` KOSAKI Motohiro
2012-06-16 17:48       ` Aaditya Kumar
2012-06-16 17:48         ` Aaditya Kumar
2012-06-18  0:43         ` Minchan Kim
2012-06-18  0:43           ` Minchan Kim
2012-06-18  0:52           ` Kamezawa Hiroyuki
2012-06-18  0:52             ` Kamezawa Hiroyuki
2012-06-19 13:18           ` Aaditya Kumar
2012-06-19 13:18             ` Aaditya Kumar
2012-06-19 22:17           ` KOSAKI Motohiro
2012-06-19 22:17             ` KOSAKI Motohiro
2012-06-20  6:18             ` Minchan Kim
2012-06-20  6:18               ` Minchan Kim
2012-06-14 15:25 ` Michal Hocko
2012-06-14 15:25   ` Michal Hocko
2012-06-14 15:46   ` KOSAKI Motohiro
2012-06-14 15:46     ` KOSAKI Motohiro
2012-06-15 10:45 ` Mel Gorman
2012-06-15 10:45   ` Mel Gorman
