* [patch]vmscan: make kswapd use a correct order
@ 2010-12-01  3:08 Shaohua Li
  2010-12-01  4:21 ` Minchan Kim
  2010-12-01  9:44 ` KOSAKI Motohiro
  0 siblings, 2 replies; 40+ messages in thread
From: Shaohua Li @ 2010-12-01  3:08 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton

T0: Task1 calls wakeup_kswapd(order=3)
T1: kswapd enters balance_pgdat
T2: Task2 calls wakeup_kswapd(order=2), because pages reclaimed by kswapd are
consumed quickly
T3: kswapd exits balance_pgdat and does its recheck. Now new_order=2 and
pgdat->kswapd_max_order becomes 0, but order=3. If sleeping_prematurely(),
order becomes pgdat->kswapd_max_order (0), while at this time the order
should be 2.
This isn't a big deal, but we do have a small window where the order is wrong.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..15cd0d2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2450,7 +2450,7 @@ static int kswapd(void *p)
 				}
 			}
 
-			order = pgdat->kswapd_max_order;
+			order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
 		}
 		finish_wait(&pgdat->kswapd_wait, &wait);
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-01  3:08 [patch]vmscan: make kswapd use a correct order Shaohua Li
@ 2010-12-01  4:21 ` Minchan Kim
  2010-12-01  5:42   ` Shaohua Li
  2010-12-01  9:44 ` KOSAKI Motohiro
  1 sibling, 1 reply; 40+ messages in thread
From: Minchan Kim @ 2010-12-01  4:21 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, Andrew Morton

On Wed, Dec 1, 2010 at 12:08 PM, Shaohua Li <shaohua.li@intel.com> wrote:
> T0: Task1 wakeup_kswapd(order=3)
> T1: kswapd enters balance_pgdat
> T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> quickly
> T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> then order will become pgdat->kswapd_max_order(0), while at this time the
> order should 2
> This isn't a big deal, but we do have a small window the order is wrong.
>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

But the description needs to be easier to follow.

Let me try it.

T0: Task 1 wakes up kswapd with order-3
T1: So kswapd starts to reclaim pages using balance_pgdat
T2: Task 2 wakes up kswapd with order-2 because pages reclaimed by T1
are consumed quickly.
T3: kswapd exits balance_pgdat and will recheck the remaining work
T4-1: At the beginning of kswapd's loop, pgdat->kswapd_max_order is
reset to zero.
T4-2: If the previous balance_pgdat couldn't meet the high-watermark
requirement for order-3 free pages, kswapd starts reclaiming again.
T4-3: Unfortunately, balance_pgdat's argument _order_ is
pgdat->kswapd_max_order, which was zero. It should have been 2.

Regardless of my suggestion, I will add my Reviewed-by.

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d31d7ce..15cd0d2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
>                                }
>                        }
>
> -                       order = pgdat->kswapd_max_order;
> +                       order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
>                }
>                finish_wait(&pgdat->kswapd_wait, &wait);
>
>
>



-- 
Kind regards,
Minchan Kim


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-01  4:21 ` Minchan Kim
@ 2010-12-01  5:42   ` Shaohua Li
  0 siblings, 0 replies; 40+ messages in thread
From: Shaohua Li @ 2010-12-01  5:42 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Andrew Morton

On Wed, 2010-12-01 at 12:21 +0800, Minchan Kim wrote:
> On Wed, Dec 1, 2010 at 12:08 PM, Shaohua Li <shaohua.li@intel.com> wrote:
> > T0: Task1 wakeup_kswapd(order=3)
> > T1: kswapd enters balance_pgdat
> > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> > quickly
> > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> > then order will become pgdat->kswapd_max_order(0), while at this time the
> > order should 2
> > This isn't a big deal, but we do have a small window the order is wrong.
> >
> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmai.com>
> 
> But you need the description more easily.
> 
> I try it.
Thanks, I've changed the description.


T0: Task 1 wakes up kswapd with order-3
T1: So kswapd starts to reclaim pages using balance_pgdat
T2: Task 2 wakes up kswapd with order-2 because pages reclaimed by T1
are consumed quickly.
T3: kswapd exits balance_pgdat and will do the following:
T4-1: At the beginning of kswapd's loop, pgdat->kswapd_max_order is
reset to zero.
T4-2: order is set to pgdat->kswapd_max_order (0), since kswapd takes the
false branch of 'if (order (3) < new_order (2))'
T4-3: If the previous balance_pgdat couldn't meet the high-watermark
requirement for order-2 free pages, kswapd starts reclaiming again. So
balance_pgdat will reclaim at order-0, while at this time it really
should use order-2.

This isn't a big deal, but we do have a small window where the order is wrong.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..c630349 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2450,7 +2450,8 @@ static int kswapd(void *p)
 				}
 			}
 
-			order = pgdat->kswapd_max_order;
+			order = max_t(unsigned long, new_order,
+				pgdat->kswapd_max_order);
 		}
 		finish_wait(&pgdat->kswapd_wait, &wait);
 



* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-01  3:08 [patch]vmscan: make kswapd use a correct order Shaohua Li
  2010-12-01  4:21 ` Minchan Kim
@ 2010-12-01  9:44 ` KOSAKI Motohiro
  2010-12-01 15:58   ` Minchan Kim
  1 sibling, 1 reply; 40+ messages in thread
From: KOSAKI Motohiro @ 2010-12-01  9:44 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kosaki.motohiro, linux-mm, Andrew Morton, Minchan Kim, Mel Gorman

> T0: Task1 wakeup_kswapd(order=3)
> T1: kswapd enters balance_pgdat
> T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> quickly
> T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> then order will become pgdat->kswapd_max_order(0), while at this time the
> order should 2
> This isn't a big deal, but we do have a small window the order is wrong.
> 
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d31d7ce..15cd0d2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
>  				}
>  			}
>  
> -			order = pgdat->kswapd_max_order;
> +			order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
>  		}
>  		finish_wait(&pgdat->kswapd_wait, &wait);

Good catch!

But unfortunately, the code is not correct. At least, it doesn't fit the
current design.

1) if the "order < new_order" condition is false, we already decided not to
   use new_order, so we shouldn't use new_order after kswapd_try_to_sleep()
2) if sleeping_prematurely() returns false, it probably means
   zone_watermark_ok_safe(zone, order, high_wmark) returned false.
   Therefore, we have to retry reclaim using the old 'order' parameter.


new patch is here.



From 8f436224219a1da01985fd9644e1307e7c4cb8c3 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Sun, 26 Dec 2010 21:10:55 +0900
Subject: [PATCH] vmscan: make kswapd use a correct order

If sleeping_prematurely() returns false, it's a sign that reclaim should be
retried. So we shouldn't drop the old order value.

Reported-by: Shaohua Li <shaohua.li@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1fcadaf..f052a1a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2364,13 +2364,13 @@ out:
 	return sc.nr_reclaimed;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
+static int kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
 
 	if (freezing(current) || kthread_should_stop())
-		return;
+		return 0;
 
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
@@ -2399,13 +2399,17 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 		schedule();
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+		order = pgdat->kswapd_max_order;
 	} else {
 		if (remaining)
 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 		else
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
+		order = max(order, pgdat->kswapd_max_order);
 	}
 	finish_wait(&pgdat->kswapd_wait, &wait);
+
+	return order;
 }
 
 /*
@@ -2467,8 +2471,7 @@ static int kswapd(void *p)
 			 */
 			order = new_order;
 		} else {
-			kswapd_try_to_sleep(pgdat, order);
-			order = pgdat->kswapd_max_order;
+			order = kswapd_try_to_sleep(pgdat, order);
 		}
 
 		ret = try_to_freeze();
-- 
1.6.5.2




* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-01  9:44 ` KOSAKI Motohiro
@ 2010-12-01 15:58   ` Minchan Kim
  2010-12-02  0:09     ` KOSAKI Motohiro
                       ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Minchan Kim @ 2010-12-01 15:58 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Shaohua Li, linux-mm, Andrew Morton, Mel Gorman

On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
> > T0: Task1 wakeup_kswapd(order=3)
> > T1: kswapd enters balance_pgdat
> > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> > quickly
> > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> > then order will become pgdat->kswapd_max_order(0), while at this time the
> > order should 2
> > This isn't a big deal, but we do have a small window the order is wrong.
> > 
> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d31d7ce..15cd0d2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
> >  				}
> >  			}
> >  
> > -			order = pgdat->kswapd_max_order;
> > +			order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
> >  		}
> >  		finish_wait(&pgdat->kswapd_wait, &wait);
> 
> Good catch!
> 
> But unfortunatelly, the code is not correct. At least, don't fit corrent
> design.
> 
> 1) if "order < new_order" condition is false, we already decided to don't
>    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
> 2) if sleeping_prematurely() return false, it probably mean
>    zone_watermark_ok_safe(zone, order, high_wmark) return false.
>    therefore, we have to retry reclaim by using old 'order' parameter.

Good catch, too.

In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
there is no problem.
But if Task1 doesn't get the order-3 page because others used the order-3 page
meant for Task1, kswapd has to reclaim order-3 for Task1 again.
In addition, the new order is always less than the old order in that context,
so big-order page reclaim is much safer for low-order pages.

> 
> new patch is here.
> 
> 
> 
> From 8f436224219a1da01985fd9644e1307e7c4cb8c3 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Sun, 26 Dec 2010 21:10:55 +0900
> Subject: [PATCH] vmscan: make kswapd use a correct order
> 
> If sleeping_prematurely() return false, It's a sign of retrying reclaim.
> So, we don't have to drop old order value.

I think this description isn't enough.

> 
> Reported-by: Shaohua Li <shaohua.li@intel.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
But could you look at my suggestion below?

> ---
>  mm/vmscan.c |   11 +++++++----
>  1 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1fcadaf..f052a1a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2364,13 +2364,13 @@ out:
>  	return sc.nr_reclaimed;
>  }
>  
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> +static int kswapd_try_to_sleep(pg_data_t *pgdat, int order)
>  {
>  	long remaining = 0;
>  	DEFINE_WAIT(wait);
>  
>  	if (freezing(current) || kthread_should_stop())
> -		return;
> +		return 0;
>  
>  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  
> @@ -2399,13 +2399,17 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
>  		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
>  		schedule();
>  		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
> +		order = pgdat->kswapd_max_order;
>  	} else {
>  		if (remaining)
>  			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>  		else
>  			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> +		order = max(order, pgdat->kswapd_max_order);
>  	}
>  	finish_wait(&pgdat->kswapd_wait, &wait);
> +
> +	return order;
>  }
>  
>  /*
> @@ -2467,8 +2471,7 @@ static int kswapd(void *p)
>  			 */
>  			order = new_order;
>  		} else {
> -			kswapd_try_to_sleep(pgdat, order);
> -			order = pgdat->kswapd_max_order;
> +			order = kswapd_try_to_sleep(pgdat, order);
>  		}
>  
>  		ret = try_to_freeze();
> -- 
> 1.6.5.2
> 
> 

It might work well, but I don't like coding where kswapd_try_to_sleep's
return value is the order. It doesn't look good to me, and there isn't even
a comment. Hmm..

How about this?
If you want it, feel free to use it.
If you insist on your coding style, I don't have any objection;
then add my Reviewed-by.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/vmscan.c |   21 +++++++++++++++++----
 1 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e48a612 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2447,13 +2447,18 @@ out:
 	return sc.nr_reclaimed;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
+/*
+ * Return true if we slept enough. Otherwise, return false.
+ */
+static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 {
 	long remaining = 0;
+	bool sleep = false;
+
 	DEFINE_WAIT(wait);
 
 	if (freezing(current) || kthread_should_stop())
-		return;
+		return sleep;
 
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
@@ -2482,6 +2487,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 		schedule();
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+		sleep = true;
 	} else {
 		if (remaining)
 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2489,6 +2495,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
 	finish_wait(&pgdat->kswapd_wait, &wait);
+
+	return sleep;
 }
 
 /*
@@ -2550,8 +2558,13 @@ static int kswapd(void *p)
 			 */
 			order = new_order;
 		} else {
-			kswapd_try_to_sleep(pgdat, order);
-			order = pgdat->kswapd_max_order;
+			/*
+			 * If we wake up after enough sleeping, let's
+			 * start new order. Otherwise, it was a premature
+			 * sleep so we keep going on.
+			 */
+			if (kswapd_try_to_sleep(pgdat, order))
+				order = pgdat->kswapd_max_order;
 		}
 
 		ret = try_to_freeze();
-- 
1.7.0.4

-- 
Kind regards,
Minchan Kim


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-01 15:58   ` Minchan Kim
@ 2010-12-02  0:09     ` KOSAKI Motohiro
  2010-12-02  0:29       ` KOSAKI Motohiro
  2010-12-02  0:19     ` Andrew Morton
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: KOSAKI Motohiro @ 2010-12-02  0:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, Andrew Morton, Mel Gorman

> It might work well. but I don't like such a coding that kswapd_try_to_sleep's
> eturn value is order. It doesn't look good to me and even no comment. Hmm..
> 
> How about this?
> If you want it, feel free to use it.
> If you insist on your coding style, I don't have any objection.
> Then add My Reviewed-by.
> 
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>

I'm OK with this.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


Thanks.


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-01 15:58   ` Minchan Kim
  2010-12-02  0:09     ` KOSAKI Motohiro
@ 2010-12-02  0:19     ` Andrew Morton
  2010-12-02  9:40       ` Mel Gorman
  2010-12-02  0:29     ` Shaohua Li
  2010-12-02 10:12     ` Mel Gorman
  3 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2010-12-02  0:19 UTC (permalink / raw)
  To: Minchan Kim; +Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Mel Gorman


Paging Mel Gorman.  This fix looks pretty thoroughly related to your
"[RFC PATCH 0/3] Prevent kswapd dumping excessive amounts of memory in
response to high-order allocations"?


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-01 15:58   ` Minchan Kim
  2010-12-02  0:09     ` KOSAKI Motohiro
  2010-12-02  0:19     ` Andrew Morton
@ 2010-12-02  0:29     ` Shaohua Li
  2010-12-02  0:54       ` Minchan Kim
  2010-12-02  1:28       ` KOSAKI Motohiro
  2010-12-02 10:12     ` Mel Gorman
  3 siblings, 2 replies; 40+ messages in thread
From: Shaohua Li @ 2010-12-02  0:29 UTC (permalink / raw)
  To: Minchan Kim; +Cc: KOSAKI Motohiro, linux-mm, Andrew Morton, Mel Gorman

On Wed, 2010-12-01 at 23:58 +0800, Minchan Kim wrote:
> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
> > > T0: Task1 wakeup_kswapd(order=3)
> > > T1: kswapd enters balance_pgdat
> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> > > quickly
> > > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> > > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> > > then order will become pgdat->kswapd_max_order(0), while at this time the
> > > order should 2
> > > This isn't a big deal, but we do have a small window the order is wrong.
> > > 
> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d31d7ce..15cd0d2 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
> > >  				}
> > >  			}
> > >  
> > > -			order = pgdat->kswapd_max_order;
> > > +			order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
> > >  		}
> > >  		finish_wait(&pgdat->kswapd_wait, &wait);
> > 
> > Good catch!
> > 
> > But unfortunatelly, the code is not correct. At least, don't fit corrent
> > design.
> > 
> > 1) if "order < new_order" condition is false, we already decided to don't
> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
> > 2) if sleeping_prematurely() return false, it probably mean
> >    zone_watermark_ok_safe(zone, order, high_wmark) return false.
> >    therefore, we have to retry reclaim by using old 'order' parameter.
> 
> Good catch, too.
> 
> In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
> it's no problem.
> But if Task1 doesn't get the order-3 page and others used the order-3 page for Task1,
> Kswapd have to reclaim order-3 for Task1, again.
why? It's just a possibility. Task1 might get its pages too. If Task1
doesn't get its pages, it will wake up kswapd again with its order.

> In addtion, new order is always less than old order in that context. 
> so big order page reclaim makes much safe for low order pages.
big-order page reclaim means we have more chances to reclaim useful
pages via lumpy reclaim, so why is it safe?

Thanks,
Shaohua


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  0:09     ` KOSAKI Motohiro
@ 2010-12-02  0:29       ` KOSAKI Motohiro
  2010-12-02  0:58         ` Minchan Kim
  0 siblings, 1 reply; 40+ messages in thread
From: KOSAKI Motohiro @ 2010-12-02  0:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, Andrew Morton, Mel Gorman

> > It might work well. but I don't like such a coding that kswapd_try_to_sleep's
> > eturn value is order. It doesn't look good to me and even no comment. Hmm..
> > 
> > How about this?
> > If you want it, feel free to use it.
> > If you insist on your coding style, I don't have any objection.
> > Then add My Reviewed-by.
> > 
> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> 
> I'm ok this.
> 
> 	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> 
> Thanks.
> 

Please consider resending the patch with a full patch description. Of course,
you need to rebase this on top of Mel's patch.

Plus, please don't remove Shaohua's Reported-by tag. It's a more important
line than my code. Please respect the good bug finder.

Thanks.




* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  0:29     ` Shaohua Li
@ 2010-12-02  0:54       ` Minchan Kim
  2010-12-02  1:05         ` Shaohua Li
  2010-12-02  1:28       ` KOSAKI Motohiro
  1 sibling, 1 reply; 40+ messages in thread
From: Minchan Kim @ 2010-12-02  0:54 UTC (permalink / raw)
  To: Shaohua Li; +Cc: KOSAKI Motohiro, linux-mm, Andrew Morton, Mel Gorman

On Thu, Dec 2, 2010 at 9:29 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Wed, 2010-12-01 at 23:58 +0800, Minchan Kim wrote:
>> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
>> > > T0: Task1 wakeup_kswapd(order=3)
>> > > T1: kswapd enters balance_pgdat
>> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
>> > > quickly
>> > > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
>> > > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
>> > > then order will become pgdat->kswapd_max_order(0), while at this time the
>> > > order should 2
>> > > This isn't a big deal, but we do have a small window the order is wrong.
>> > >
>> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
>> > >
>> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > > index d31d7ce..15cd0d2 100644
>> > > --- a/mm/vmscan.c
>> > > +++ b/mm/vmscan.c
>> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
>> > >                           }
>> > >                   }
>> > >
>> > > -                 order = pgdat->kswapd_max_order;
>> > > +                 order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
>> > >           }
>> > >           finish_wait(&pgdat->kswapd_wait, &wait);
>> >
>> > Good catch!
>> >
>> > But unfortunatelly, the code is not correct. At least, don't fit corrent
>> > design.
>> >
>> > 1) if "order < new_order" condition is false, we already decided to don't
>> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
>> > 2) if sleeping_prematurely() return false, it probably mean
>> >    zone_watermark_ok_safe(zone, order, high_wmark) return false.
>> >    therefore, we have to retry reclaim by using old 'order' parameter.
>>
>> Good catch, too.
>>
>> In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
>> it's no problem.
>> But if Task1 doesn't get the order-3 page and others used the order-3 page for Task1,
>> Kswapd have to reclaim order-3 for Task1, again.
> why? it's just a possibility. Task1 might get its pages too. If Task1
> doesn't get its pages, it will wakeup kswapd too with its order.
>
>> In addtion, new order is always less than old order in that context.
>> so big order page reclaim makes much safe for low order pages.
> big order page reclaim makes we have more chances to reclaim useful
> pages by lumpy, why it's safe?

For example, assume that Task1 continues to fail to get an order-3
GFP_ATOMIC page because other tasks keep allocating order-2 pages and
stealing pages from it. Then your patch keeps reclaiming at order-2 in
this scenario, and Task1 never gets an order-3 page unless it gets
lucky with merging. It's a kind of livelock (but that's just an
unlikely theory). KOSAKI's approach makes sure we reclaim order-3
pages, so it can meet the requirements of both order-2 and order-3. In
this context, I said _safety_.

Of course, it could discard useful pages if Task1 does get its pages.
I think it's a trade-off.
We should determine the policy.
I'm biased toward the safety of GFP_ATOMIC.

> Thanks,
> Shaohua
>
>



-- 
Kind regards,
Minchan Kim


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  0:29       ` KOSAKI Motohiro
@ 2010-12-02  0:58         ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2010-12-02  0:58 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Shaohua Li, linux-mm, Andrew Morton, Mel Gorman

On Thu, Dec 2, 2010 at 9:29 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> > It might work well. but I don't like such a coding that kswapd_try_to_sleep's
>> > eturn value is order. It doesn't look good to me and even no comment. Hmm..
>> >
>> > How about this?
>> > If you want it, feel free to use it.
>> > If you insist on your coding style, I don't have any objection.
>> > Then add My Reviewed-by.
>> >
>> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
>>
>> I'm ok this.
>>
>>       Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>
>>
>> Thanks.
>>
>
> Please consider rensend a patch with full patch description. Of cource,
> you need to rebase this on top Mel's patch.
>
> Plus, please don't remove Shaohua's reported-by tag. It's important line
> than my code. Please respect good bug finder.

I have no intention of taking over Shaohua's or your credit.
It was just a review, so I hoped you would send the patch with my
Signed-off-by or Reviewed-by added.

Okay, I will resend it with a full description and your Signed-off-bys.
But before that, we have to discuss Shaohua's argument about _safety_.

> Thanks.
>
>
>
>



-- 
Kind regards,
Minchan Kim


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  0:54       ` Minchan Kim
@ 2010-12-02  1:05         ` Shaohua Li
  2010-12-02  1:23           ` Minchan Kim
  0 siblings, 1 reply; 40+ messages in thread
From: Shaohua Li @ 2010-12-02  1:05 UTC (permalink / raw)
  To: Minchan Kim; +Cc: KOSAKI Motohiro, linux-mm, Andrew Morton, Mel Gorman

On Thu, 2010-12-02 at 08:54 +0800, Minchan Kim wrote:
> On Thu, Dec 2, 2010 at 9:29 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Wed, 2010-12-01 at 23:58 +0800, Minchan Kim wrote:
> >> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
> >> > > T0: Task1 wakeup_kswapd(order=3)
> >> > > T1: kswapd enters balance_pgdat
> >> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> >> > > quickly
> >> > > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> >> > > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> >> > > then order will become pgdat->kswapd_max_order(0), while at this time the
> >> > > order should 2
> >> > > This isn't a big deal, but we do have a small window the order is wrong.
> >> > >
> >> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> >> > >
> >> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > > index d31d7ce..15cd0d2 100644
> >> > > --- a/mm/vmscan.c
> >> > > +++ b/mm/vmscan.c
> >> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
> >> > >                           }
> >> > >                   }
> >> > >
> >> > > -                 order = pgdat->kswapd_max_order;
> >> > > +                 order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
> >> > >           }
> >> > >           finish_wait(&pgdat->kswapd_wait, &wait);
> >> >
> >> > Good catch!
> >> >
> >> > But unfortunatelly, the code is not correct. At least, don't fit corrent
> >> > design.
> >> >
> >> > 1) if "order < new_order" condition is false, we already decided to don't
> >> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
> >> > 2) if sleeping_prematurely() return false, it probably mean
> >> >    zone_watermark_ok_safe(zone, order, high_wmark) return false.
> >> >    therefore, we have to retry reclaim by using old 'order' parameter.
> >>
> >> Good catch, too.
> >>
> >> In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
> >> it's no problem.
> >> But if Task1 doesn't get the order-3 page and others used the order-3 page for Task1,
> >> Kswapd have to reclaim order-3 for Task1, again.
> > why? it's just a possibility. Task1 might get its pages too. If Task1
> > doesn't get its pages, it will wakeup kswapd too with its order.
> >
> >> In addtion, new order is always less than old order in that context.
> >> so big order page reclaim makes much safe for low order pages.
> > big order page reclaim makes we have more chances to reclaim useful
> > pages by lumpy, why it's safe?
> 
> For example, It assume tat Task1 continues to fail get the order-3
> page of GFP_ATOMIC since other tasks continues to allocate order-2
> pages so that they steal pages. 
but even if you reclaim order-3, you can't guarantee Task1 gets the
pages either; an order-3 page can be stolen by an order-2 allocation.

> Then, your patch makes continue to
> reclaim order-2 page in this scenario. Task1 never get the order-3
> pages if it doesn't have a merge luck.
Task1 will wake up kswapd again for order-3, so kswapd will reclaim
order-3 very soon after the order-2 reclaim.


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  1:05         ` Shaohua Li
@ 2010-12-02  1:23           ` Minchan Kim
  2010-12-02  1:36             ` Minchan Kim
  2010-12-02  2:39             ` Shaohua Li
  0 siblings, 2 replies; 40+ messages in thread
From: Minchan Kim @ 2010-12-02  1:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: KOSAKI Motohiro, linux-mm, Andrew Morton, Mel Gorman

On Thu, Dec 2, 2010 at 10:05 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Thu, 2010-12-02 at 08:54 +0800, Minchan Kim wrote:
>> On Thu, Dec 2, 2010 at 9:29 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > On Wed, 2010-12-01 at 23:58 +0800, Minchan Kim wrote:
>> >> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
>> >> > > T0: Task1 wakeup_kswapd(order=3)
>> >> > > T1: kswapd enters balance_pgdat
>> >> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
>> >> > > quickly
>> >> > > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
>> >> > > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
>> >> > > then order will become pgdat->kswapd_max_order(0), while at this time the
>> >> > > order should 2
>> >> > > This isn't a big deal, but we do have a small window the order is wrong.
>> >> > >
>> >> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
>> >> > >
>> >> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> > > index d31d7ce..15cd0d2 100644
>> >> > > --- a/mm/vmscan.c
>> >> > > +++ b/mm/vmscan.c
>> >> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
>> >> > >                           }
>> >> > >                   }
>> >> > >
>> >> > > -                 order = pgdat->kswapd_max_order;
>> >> > > +                 order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
>> >> > >           }
>> >> > >           finish_wait(&pgdat->kswapd_wait, &wait);
>> >> >
>> >> > Good catch!
>> >> >
>> >> > But unfortunatelly, the code is not correct. At least, don't fit corrent
>> >> > design.
>> >> >
>> >> > 1) if "order < new_order" condition is false, we already decided to don't
>> >> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
>> >> > 2) if sleeping_prematurely() return false, it probably mean
>> >> >    zone_watermark_ok_safe(zone, order, high_wmark) return false


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  0:29     ` Shaohua Li
  2010-12-02  0:54       ` Minchan Kim
@ 2010-12-02  1:28       ` KOSAKI Motohiro
  1 sibling, 0 replies; 40+ messages in thread
From: KOSAKI Motohiro @ 2010-12-02  1:28 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kosaki.motohiro, Minchan Kim, linux-mm, Andrew Morton, Mel Gorman

> > In addtion, new order is always less than old order in that context. 
> > so big order page reclaim makes much safe for low order pages.
> big order page reclaim makes we have more chances to reclaim useful
> pages by lumpy, why it's safe?

Because some crappy drivers try high-order GFP_ATOMIC allocations and
often don't have adequate failure handling.

Even when they do have good allocation-failure handling, network packet
loss (wireless drivers are among the biggest high-order GFP_ATOMIC users)
usually has a bigger impact than page cache drops/re-reads.

Thanks.



* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  1:23           ` Minchan Kim
@ 2010-12-02  1:36             ` Minchan Kim
  2010-12-02  9:42               ` Mel Gorman
  2010-12-02  2:39             ` Shaohua Li
  1 sibling, 1 reply; 40+ messages in thread
From: Minchan Kim @ 2010-12-02  1:36 UTC (permalink / raw)
  To: Shaohua Li; +Cc: KOSAKI Motohiro, linux-mm, Andrew Morton, Mel Gorman

Where is my mail?
I will resend lost content.

On Thu, Dec 2, 2010 at 10:23 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, Dec 2, 2010 at 10:05 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> On Thu, 2010-12-02 at 08:54 +0800, Minchan Kim wrote:
>>> On Thu, Dec 2, 2010 at 9:29 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>>> > On Wed, 2010-12-01 at 23:58 +0800, Minchan Kim wrote:
>>> >> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
>>> >> > > T0: Task1 wakeup_kswapd(order=3)
>>> >> > > T1: kswapd enters balance_pgdat
>>> >> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
>>> >> > > quickly
>>> >> > > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
>>> >> > > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
>>> >> > > then order will become pgdat->kswapd_max_order(0), while at this time the
>>> >> > > order should 2
>>> >> > > This isn't a big deal, but we do have a small window the order is wrong.
>>> >> > >
>>> >> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
>>> >> > >
>>> >> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> >> > > index d31d7ce..15cd0d2 100644
>>> >> > > --- a/mm/vmscan.c
>>> >> > > +++ b/mm/vmscan.c
>>> >> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
>>> >> > >                           }
>>> >> > >                   }
>>> >> > >
>>> >> > > -                 order = pgdat->kswapd_max_order;
>>> >> > > +                 order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
>>> >> > >           }
>>> >> > >           finish_wait(&pgdat->kswapd_wait, &wait);
>>> >> >
>>> >> > Good catch!
>>> >> >
>>> >> > But unfortunatelly, the code is not correct. At least, don't fit corrent
>>> >> > design.
>>> >> >
>>> >> > 1) if "order < new_order" condition is false, we already decided to don't
>>> >> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
>>> >> > 2) if sleeping_prematurely() return false, it probably mean
>>> >> >    zone_watermark_ok_safe(zone, order, high_wmark) return false.
>>> >> >    therefore, we have to retry reclaim by using old 'order' parameter.
>>> >>
>>> >> Good catch, too.
>>> >>
>>> >> In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
>>> >> it's no problem.
>>> >> But if Task1 doesn't get the order-3 page and others used the order-3 page for Task1,
>>> >> Kswapd have to reclaim order-3 for Task1, again.
>>> > why? it's just a possibility. Task1 might get its pages too. If Task1
>>> > doesn't get its pages, it will wakeup kswapd too with its order.
>>> >
>>> >> In addtion, new order is always less than old order in that context.
>>> >> so big order page reclaim makes much safe for low order pages.
>>> > big order page reclaim makes we have more chances to reclaim useful
>>> > pages by lumpy, why it's safe?
>>>
>>> For example, It assume tat Task1 continues to fail get the order-3
>>> page of GFP_ATOMIC since other tasks continues to allocate order-2
>>> pages so that they steal pages.
>> but even you reclaim order-3, you can't guarantee task1 can get the
>> pages too. order-3 page can be steal by order-2 allocation
>
> But at least, it has a high possibility to allocate order-3 page than
> reclaim order-2 pages.
>
>>
>>> Then, your patch makes continue to
>>> reclaim order-2 page in this scenario. Task1 never get the order-3
>>> pages if it doesn't have a merge luck.
>> Task1 will wakeup kswapd again for order-3, so kswapd will reclaim
>> order-3 very soon after the order-2 reclaim.
>
> GFP_ATOMIC case doesn't wakeup kswapd.
> When kswapd wakeup by order-3 depends on caller's retry.
> And this situation can be repeated in next turn.
>
> We can't guarantee 100% Task-1's order-3 pages so I hope we should go
> way to reduce the problem as possible as we can.
>
>>
>>
>
>
>
> --
> Kind regards,
> Minchan Kim
>



-- 
Kind regards,
Minchan Kim


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  1:23           ` Minchan Kim
  2010-12-02  1:36             ` Minchan Kim
@ 2010-12-02  2:39             ` Shaohua Li
  1 sibling, 0 replies; 40+ messages in thread
From: Shaohua Li @ 2010-12-02  2:39 UTC (permalink / raw)
  To: Minchan Kim; +Cc: KOSAKI Motohiro, linux-mm, Andrew Morton, Mel Gorman

On Thu, 2010-12-02 at 09:23 +0800, Minchan Kim wrote:
> On Thu, Dec 2, 2010 at 10:05 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Thu, 2010-12-02 at 08:54 +0800, Minchan Kim wrote:
> >> On Thu, Dec 2, 2010 at 9:29 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > On Wed, 2010-12-01 at 23:58 +0800, Minchan Kim wrote:
> >> >> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
> >> >> > > T0: Task1 wakeup_kswapd(order=3)
> >> >> > > T1: kswapd enters balance_pgdat
> >> >> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> >> >> > > quickly
> >> >> > > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> >> >> > > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> >> >> > > then order will become pgdat->kswapd_max_order(0), while at this time the
> >> >> > > order should 2
> >> >> > > This isn't a big deal, but we do have a small window the order is wrong.
> >> >> > >
> >> >> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> >> >> > >
> >> >> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> >> > > index d31d7ce..15cd0d2 100644
> >> >> > > --- a/mm/vmscan.c
> >> >> > > +++ b/mm/vmscan.c
> >> >> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
> >> >> > >                           }
> >> >> > >                   }
> >> >> > >
> >> >> > > -                 order = pgdat->kswapd_max_order;
> >> >> > > +                 order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
> >> >> > >           }
> >> >> > >           finish_wait(&pgdat->kswapd_wait, &wait);
> >> >> >
> >> >> > Good catch!
> >> >> >
> >> >> > But unfortunatelly, the code is not correct. At least, don't fit corrent
> >> >> > design.
> >> >> >
> >> >> > 1) if "order < new_order" condition is false, we already decided to don't
> >> >> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
> >> >> > 2) if sleeping_prematurely() return false, it probably mean
> >> >> >    zone_watermark_ok_safe(zone, order, high_wmark) return false.
> >> >> >    therefore, we have to retry reclaim by using old 'order' parameter.
> >> >>
> >> >> Good catch, too.
> >> >>
> >> >> In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
> >> >> it's no problem.
> >> >> But if Task1 doesn't get the order-3 page and others used the order-3 page for Task1,
> >> >> Kswapd have to reclaim order-3 for Task1, again.
> >> > why? it's just a possibility. Task1 might get its pages too. If Task1
> >> > doesn't get its pages, it will wakeup kswapd too with its order.
> >> >
> >> >> In addtion, new order is always less than old order in that context.
> >> >> so big order page reclaim makes much safe for low order pages.
> >> > big order page reclaim makes we have more chances to reclaim useful
> >> > pages by lumpy, why it's safe?
> >>
> >> For example, It assume tat Task1 continues to fail get the order-3
> >> page of GFP_ATOMIC since other tasks continues to allocate order-2
> >> pages so that they steal pages.
> > but even you reclaim order-3, you can't guarantee task1 can get the
> > pages too. order-3 page can be steal by order-2 allocation
> 
> But at least, it has a high possibility to allocate order-3 page than
> reclaim order-2 pages.
> 
> >
> >> Then, your patch makes continue to
> >> reclaim order-2 page in this scenario. Task1 never get the order-3
> >> pages if it doesn't have a merge luck.
> > Task1 will wakeup kswapd again for order-3, so kswapd will reclaim
> > order-3 very soon after the order-2 reclaim.
> 
> GFP_ATOMIC case doesn't wakeup kswapd.
> When kswapd wakeup by order-3 depends on caller's retry.
> And this situation can be repeated in next turn.
> 
> We can't guarantee 100% Task-1's order-3 pages so I hope we should go
> way to reduce the problem as possible as we can.
ok, I have no objection now.
Reviewed-by: Shaohua Li <shaohua.li@intel.com>


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  0:19     ` Andrew Morton
@ 2010-12-02  9:40       ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2010-12-02  9:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm

On Wed, Dec 01, 2010 at 04:19:54PM -0800, Andrew Morton wrote:
> 
> Paging Mel Gorman.  This fix looks pretty thoroughly related to your
> "[RFC PATCH 0/3] Prevent kswapd dumping excessive amounts of memory in
> response to high-order allocations"?
> 

It affects the same area and I'll need to rebase this patch on top of
my series for testing against Simon's "aggressive kswapd" problem. I
haven't finished reviewing the thread yet, but my initial impression is
that it won't fix Simon's problem even though it is a real issue.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  1:36             ` Minchan Kim
@ 2010-12-02  9:42               ` Mel Gorman
  2010-12-02 15:25                 ` Minchan Kim
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2010-12-02  9:42 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Shaohua Li, KOSAKI Motohiro, linux-mm, Andrew Morton

On Thu, Dec 02, 2010 at 10:36:27AM +0900, Minchan Kim wrote:
> Where is my mail?
> I will resend lost content.
> 
> On Thu, Dec 2, 2010 at 10:23 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > On Thu, Dec 2, 2010 at 10:05 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> On Thu, 2010-12-02 at 08:54 +0800, Minchan Kim wrote:
> >>> On Thu, Dec 2, 2010 at 9:29 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >>> > On Wed, 2010-12-01 at 23:58 +0800, Minchan Kim wrote:
> >>> >> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
> >>> >> > > T0: Task1 wakeup_kswapd(order=3)
> >>> >> > > T1: kswapd enters balance_pgdat
> >>> >> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> >>> >> > > quickly
> >>> >> > > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> >>> >> > > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> >>> >> > > then order will become pgdat->kswapd_max_order(0), while at this time the
> >>> >> > > order should 2
> >>> >> > > This isn't a big deal, but we do have a small window the order is wrong.
> >>> >> > >
> >>> >> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> >>> >> > >
> >>> >> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> >> > > index d31d7ce..15cd0d2 100644
> >>> >> > > --- a/mm/vmscan.c
> >>> >> > > +++ b/mm/vmscan.c
> >>> >> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
> >>> >> > >                           }
> >>> >> > >                   }
> >>> >> > >
> >>> >> > > -                 order = pgdat->kswapd_max_order;
> >>> >> > > +                 order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
> >>> >> > >           }
> >>> >> > >           finish_wait(&pgdat->kswapd_wait, &wait);
> >>> >> >
> >>> >> > Good catch!
> >>> >> >
> >>> >> > But unfortunatelly, the code is not correct. At least, don't fit corrent
> >>> >> > design.
> >>> >> >
> >>> >> > 1) if "order < new_order" condition is false, we already decided to don't
> >>> >> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
> >>> >> > 2) if sleeping_prematurely() return false, it probably mean
> >>> >> >    zone_watermark_ok_safe(zone, order, high_wmark) return false.
> >>> >> >    therefore, we have to retry reclaim by using old 'order' parameter.
> >>> >>
> >>> >> Good catch, too.
> >>> >>
> >>> >> In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
> >>> >> it's no problem.
> >>> >> But if Task1 doesn't get the order-3 page and others used the order-3 page for Task1,
> >>> >> Kswapd have to reclaim order-3 for Task1, again.
> >>> > why? it's just a possibility. Task1 might get its pages too. If Task1
> >>> > doesn't get its pages, it will wakeup kswapd too with its order.
> >>> >
> >>> >> In addtion, new order is always less than old order in that context.
> >>> >> so big order page reclaim makes much safe for low order pages.
> >>> > big order page reclaim makes we have more chances to reclaim useful
> >>> > pages by lumpy, why it's safe?
> >>>
> >>> For example, It assume tat Task1 continues to fail get the order-3
> >>> page of GFP_ATOMIC since other tasks continues to allocate order-2
> >>> pages so that they steal pages.
> >> but even you reclaim order-3, you can't guarantee task1 can get the
> >> pages too. order-3 page can be steal by order-2 allocation
> >
> > But at least, it has a high possibility to allocate order-3 page than
> > reclaim order-2 pages.
> >
> >>
> >>> Then, your patch makes continue to
> >>> reclaim order-2 page in this scenario. Task1 never get the order-3
> >>> pages if it doesn't have a merge luck.
> >> Task1 will wakeup kswapd again for order-3, so kswapd will reclaim
> >> order-3 very soon after the order-2 reclaim.
> >
> > GFP_ATOMIC case doesn't wakeup kswapd.
> > When kswapd wakeup by order-3 depends on caller's retry.
> > And this situation can be repeated in next turn.
> >

GFP_ATOMIC allocations do wake up kswapd; they just don't wait for
kswapd to do anything.

> > We can't guarantee 100% Task-1's order-3 pages so I hope we should go
> > way to reduce the problem as possible as we can.
> >
> >>
> >>
> >
> >
> >
> > --
> > Kind regards,
> > Minchan Kim
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-01 15:58   ` Minchan Kim
                       ` (2 preceding siblings ...)
  2010-12-02  0:29     ` Shaohua Li
@ 2010-12-02 10:12     ` Mel Gorman
  2010-12-02 15:35       ` Minchan Kim
  3 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2010-12-02 10:12 UTC (permalink / raw)
  To: Minchan Kim; +Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Thu, Dec 02, 2010 at 12:58:54AM +0900, Minchan Kim wrote:
> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
> > > T0: Task1 wakeup_kswapd(order=3)
> > > T1: kswapd enters balance_pgdat
> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> > > quickly
> > > T3: kswapd exits balance_pgdat. kswapd will do check. Now new order=2,
> > > pgdat->kswapd_max_order will become 0, but order=3, if sleeping_prematurely,
> > > then order will become pgdat->kswapd_max_order(0), while at this time the
> > > order should 2
> > > This isn't a big deal, but we do have a small window the order is wrong.
> > > 
> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d31d7ce..15cd0d2 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
> > >  				}
> > >  			}
> > >  
> > > -			order = pgdat->kswapd_max_order;
> > > +			order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
> > >  		}
> > >  		finish_wait(&pgdat->kswapd_wait, &wait);
> > 
> > Good catch!
> > 
> > But unfortunatelly, the code is not correct. At least, don't fit corrent
> > design.
> > 
> > 1) if "order < new_order" condition is false, we already decided to don't
> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
> > 2) if sleeping_prematurely() return false, it probably mean
> >    zone_watermark_ok_safe(zone, order, high_wmark) return false.
> >    therefore, we have to retry reclaim by using old 'order' parameter.
> 
> Good catch, too.
> 
> In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
> it's no problem.
> But if Task1 doesn't get the order-3 page and others used the order-3 page for Task1,
> Kswapd have to reclaim order-3 for Task1, again.
> In addtion, new order is always less than old order in that context. 
> so big order page reclaim makes much safe for low order pages.
> 
> > 
> > new patch is here.
> > 
> > 
> > 
> > From 8f436224219a1da01985fd9644e1307e7c4cb8c3 Mon Sep 17 00:00:00 2001
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Date: Sun, 26 Dec 2010 21:10:55 +0900
> > Subject: [PATCH] vmscan: make kswapd use a correct order
> > 
> > If sleeping_prematurely() return false, It's a sign of retrying reclaim.
> > So, we don't have to drop old order value.
> 
> I think this description isn't enough.
> 
> > 
> > Reported-by: Shaohua Li <shaohua.li@intel.com>
> > Cc: Minchan Kim <minchan.kim@gmail.com>
> > Cc: Mel Gorman <mel@csn.ul.ie>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> But could you see my below suggestion?
> 
> > ---
> >  mm/vmscan.c |   11 +++++++----
> >  1 files changed, 7 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 1fcadaf..f052a1a 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2364,13 +2364,13 @@ out:
> >  	return sc.nr_reclaimed;
> >  }
> >  
> > -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> > +static int kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> >  {
> >  	long remaining = 0;
> >  	DEFINE_WAIT(wait);
> >  
> >  	if (freezing(current) || kthread_should_stop())
> > -		return;
> > +		return 0;
> >  
> >  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >  
> > @@ -2399,13 +2399,17 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> >  		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
> >  		schedule();
> >  		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
> > +		order = pgdat->kswapd_max_order;
> >  	} else {
> >  		if (remaining)
> >  			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> >  		else
> >  			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> > +		order = max(order, pgdat->kswapd_max_order);
> >  	}
> >  	finish_wait(&pgdat->kswapd_wait, &wait);
> > +
> > +	return order;
> >  }
> >  
> >  /*
> > @@ -2467,8 +2471,7 @@ static int kswapd(void *p)
> >  			 */
> >  			order = new_order;
> >  		} else {
> > -			kswapd_try_to_sleep(pgdat, order);
> > -			order = pgdat->kswapd_max_order;
> > +			order = kswapd_try_to_sleep(pgdat, order);
> >  		}
> >  
> >  		ret = try_to_freeze();
> > -- 
> > 1.6.5.2
> > 
> > 
> 
> It might work well. but I don't like such a coding that kswapd_try_to_sleep's
> eturn value is order. It doesn't look good to me and even no comment. Hmm..
> 
> How about this?
> If you want it, feel free to use it.
> If you insist on your coding style, I don't have any objection.
> Then add My Reviewed-by.
> 
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> ---
>  mm/vmscan.c |   21 +++++++++++++++++----
>  1 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 42a4859..e48a612 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2447,13 +2447,18 @@ out:
>  	return sc.nr_reclaimed;
>  }
>  
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> +/*
> + * Return true if we sleep enough. Othrewise, return false
> + */

s/Othrewise/Otherwise/

Maybe

> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
>  {
>  	long remaining = 0;
> +	bool sleep = 0;
> +

sleep is a boolean, it's true or false, not 0 or !0

The term "sleep" implies present or future tense - i.e. I am going to sleep or
will go to sleep in the future.  The event this variable cares about is in the
past, so "slept" or "finished_sleeping" might be a more appropriate term. Sorry
to be picky about the English but there is an important distinction here.

>  	DEFINE_WAIT(wait);
>  
>  	if (freezing(current) || kthread_should_stop())
> -		return;
> +		return sleep;
>  
>  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  
> @@ -2482,6 +2487,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
>  		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
>  		schedule();
>  		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
> +		sleep = 1;
>  	} else {
>  		if (remaining)
>  			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> @@ -2489,6 +2495,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
>  			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>  	}
>  	finish_wait(&pgdat->kswapd_wait, &wait);
> +
> +	return sleep;
>  }
>  
>  /*
> @@ -2550,8 +2558,13 @@ static int kswapd(void *p)
>  			 */
>  			order = new_order;
>  		} else {
> -			kswapd_try_to_sleep(pgdat, order);
> -			order = pgdat->kswapd_max_order;
> +			/*
> +			 * If we wake up after enough sleeping, let's
> +			 * start new order. Otherwise, it was a premature
> +			 * sleep so we keep going on.
> +			 */
> +			if (kswapd_try_to_sleep(pgdat, order))
> +				order = pgdat->kswapd_max_order;

Ok, we lose the old order if we slept enough. That is fine because if we
slept enough it implies that reclaiming at that order was no longer
necessary.

This needs a repost with a full changelog explaining why order has to be
preserved if kswapd fails to go to sleep. There will be merge difficulties
with the series aimed at fixing Simon's problem but it's unavoidable.
Rebasing on top of my series isn't an option as I'm still patching
against mainline until that issue is resolved.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02  9:42               ` Mel Gorman
@ 2010-12-02 15:25                 ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2010-12-02 15:25 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, KOSAKI Motohiro, linux-mm, Andrew Morton

On Thu, Dec 02, 2010 at 09:42:42AM +0000, Mel Gorman wrote:
> On Thu, Dec 02, 2010 at 10:36:27AM +0900, Minchan Kim wrote:
> > Where is my mail?
> > I will resend lost content.
> > 
> > On Thu, Dec 2, 2010 at 10:23 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > > On Thu, Dec 2, 2010 at 10:05 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > >> On Thu, 2010-12-02 at 08:54 +0800, Minchan Kim wrote:
> > >>> On Thu, Dec 2, 2010 at 9:29 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > >>> > On Wed, 2010-12-01 at 23:58 +0800, Minchan Kim wrote:
> > >>> >> On Wed, Dec 01, 2010 at 06:44:27PM +0900, KOSAKI Motohiro wrote:
> > >>> >> > > T0: Task1 wakeup_kswapd(order=3)
> > >>> >> > > T1: kswapd enters balance_pgdat
> > >>> >> > > T2: Task2 wakeup_kswapd(order=2), because pages reclaimed by kswapd are used
> > >>> >> > > quickly
> > >>> >> > > T3: kswapd exits balance_pgdat and does its check. Now new_order=2,
> > >>> >> > > pgdat->kswapd_max_order becomes 0, but order=3; if sleeping_prematurely,
> > >>> >> > > then order will become pgdat->kswapd_max_order (0), while at this time the
> > >>> >> > > order should be 2.
> > >>> >> > > This isn't a big deal, but we do have a small window where the order is wrong.
> > >>> >> > >
> > >>> >> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > >>> >> > >
> > >>> >> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >>> >> > > index d31d7ce..15cd0d2 100644
> > >>> >> > > --- a/mm/vmscan.c
> > >>> >> > > +++ b/mm/vmscan.c
> > >>> >> > > @@ -2450,7 +2450,7 @@ static int kswapd(void *p)
> > >>> >> > >  				}
> > >>> >> > >  			}
> > >>> >> > >
> > >>> >> > > -			order = pgdat->kswapd_max_order;
> > >>> >> > > +			order = max_t(unsigned long, new_order, pgdat->kswapd_max_order);
> > >>> >> > >  		}
> > >>> >> > >  		finish_wait(&pgdat->kswapd_wait, &wait);
> > >>> >> >
> > >>> >> > Good catch!
> > >>> >> >
> > >>> >> > But unfortunately, the code is not correct. At least, it doesn't fit the
> > >>> >> > current design.
> > >>> >> >
> > >>> >> > 1) if the "order < new_order" condition is false, we already decided not to
> > >>> >> >    use new_order. So, we shouldn't use new_order after kswapd_try_to_sleep()
> > >>> >> > 2) if sleeping_prematurely() returns false, it probably means
> > >>> >> >    zone_watermark_ok_safe(zone, order, high_wmark) returned false.
> > >>> >> >    Therefore, we have to retry reclaim using the old 'order' parameter.
> > >>> >>
> > >>> >> Good catch, too.
> > >>> >>
> > >>> >> In Shaohua's scenario, if Task1 gets the order-3 page after kswapd's reclaiming,
> > >>> >> it's no problem.
> > >>> >> But if Task1 doesn't get the order-3 page because others used it first,
> > >>> >> kswapd has to reclaim at order-3 for Task1 again.
> > >>> > why? it's just a possibility. Task1 might get its pages too. If Task1
> > >>> > doesn't get its pages, it will wakeup kswapd too with its order.
> > >>> >
> > >>> >> In addition, the new order is always less than the old order in that context,
> > >>> >> so big-order page reclaim is much safer for low-order pages.
> > >>> > big-order page reclaim means we have more chances to reclaim useful
> > >>> > pages via lumpy reclaim, so why is it safe?
> > >>>
> > >>> For example, assume that Task1 continues to fail to get the order-3
> > >>> page with GFP_ATOMIC since other tasks continue to allocate order-2
> > >>> pages and steal them.
> > >> but even if you reclaim at order-3, you can't guarantee Task1 gets the
> > >> pages either; an order-3 page can be stolen by an order-2 allocation
> > >
> > > But at least, Task1 has a higher chance of allocating an order-3 page than
> > > if we only reclaim order-2 pages.
> > >
> > >>
> > >>> Then, your patch makes kswapd continue to
> > >>> reclaim order-2 pages in this scenario. Task1 never gets the order-3
> > >>> pages if it doesn't get lucky with merging.
> > >> Task1 will wakeup kswapd again for order-3, so kswapd will reclaim
> > >> order-3 very soon after the order-2 reclaim.
> > >
> > > The GFP_ATOMIC case doesn't wake up kswapd.
> > > When kswapd gets woken for order-3 depends on the caller's retry.
> > > And this situation can be repeated in the next turn.
> > >
> 
> GFP_ATOMIC does wakeup kswapd. It just doesn't wait on kswapd to do
> anything.

I mean a GFP_ATOMIC allocation doesn't restart in __alloc_pages_slowpath,
so it doesn't wake kswapd up again.
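
For what it's worth, the window Shaohua described can be reproduced in a tiny
user-space model of the kswapd() loop (Python, purely illustrative: the names
mirror mm/vmscan.c, but none of this is kernel code and all locking is ignored):

```python
# Illustrative user-space model of the 2.6.36-era kswapd() main loop.
# Only the single pgdat field the race is about is modeled.

class PgDat:
    def __init__(self):
        self.kswapd_max_order = 0

def wakeup_kswapd(pgdat, order):
    # wakeup_kswapd() only ever raises the pending order.
    if order > pgdat.kswapd_max_order:
        pgdat.kswapd_max_order = order

def read_request(pgdat):
    # kswapd() samples the pending order and clears it.
    order = pgdat.kswapd_max_order
    pgdat.kswapd_max_order = 0
    return order

def pick_order_old(pgdat, order, new_order):
    # Mainline: after a premature sleep, re-read kswapd_max_order,
    # which the sampling above already cleared to 0.
    return pgdat.kswapd_max_order

def pick_order_patched(pgdat, order, new_order):
    # Shaohua's proposal: never drop below the freshly sampled new_order.
    return max(new_order, pgdat.kswapd_max_order)

pgdat = PgDat()
wakeup_kswapd(pgdat, 3)          # T0: Task1 wants order 3
order = read_request(pgdat)      # T1: kswapd enters balance_pgdat(order=3)
wakeup_kswapd(pgdat, 2)          # T2: Task2 wants order 2 meanwhile
new_order = read_request(pgdat)  # T3: kswapd exits balance_pgdat, re-checks

# new_order (2) < order (3), so kswapd tries to sleep; suppose the sleep
# is premature and kswapd has to pick an order to continue with:
assert pick_order_old(pgdat, order, new_order) == 0      # Task2's request lost
assert pick_order_patched(pgdat, order, new_order) == 2  # Task2's request kept
```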

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02 10:12     ` Mel Gorman
@ 2010-12-02 15:35       ` Minchan Kim
  2010-12-02 15:42         ` Mel Gorman
  0 siblings, 1 reply; 40+ messages in thread
From: Minchan Kim @ 2010-12-02 15:35 UTC (permalink / raw)
  To: Mel Gorman; +Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Thu, Dec 02, 2010 at 10:12:34AM +0000, Mel Gorman wrote:
> On Thu, Dec 02, 2010 at 12:58:54AM +0900, Minchan Kim wrote:
> > How about this?
> > If you want it, feel free to use it.
> > If you insist on your coding style, I don't have any objection.
> > Then add My Reviewed-by.
> > 
> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > ---
> >  mm/vmscan.c |   21 +++++++++++++++++----
> >  1 files changed, 17 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 42a4859..e48a612 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2447,13 +2447,18 @@ out:
> >  	return sc.nr_reclaimed;
> >  }
> >  
> > -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> > +/*
> > + * Return true if we sleep enough. Othrewise, return false
> > + */
> 
> s/Othrewise/Otherwise/
> 
> Maybe

Will fix.

> 
> > +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> >  {
> >  	long remaining = 0;
> > +	bool sleep = 0;
> > +
> 
> sleep is a boolean, it's true or false, not 0 or !0

I was out of my mind. :(

> 
> The term "sleep" implies present or future tense - i.e. I am going to sleep or
> will go to sleep in the future.  The event this variable cares about is in the
> past, so "slept" or "finished_sleeping" might be a more appropriate term. Sorry
> to be picky about the English but there is an important distinction here.

Never mind. You pointed out a very important thing.
Non-native speakers like me always struggle with writing comments or naming
variables, so a point like that helps very much.

It's very desirable, I think.

> 
> >  	DEFINE_WAIT(wait);
> >  
> >  	if (freezing(current) || kthread_should_stop())
> > -		return;
> > +		return sleep;
> >  
> >  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >  
> > @@ -2482,6 +2487,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> >  		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
> >  		schedule();
> >  		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
> > +		sleep = 1;
> >  	} else {
> >  		if (remaining)
> >  			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > @@ -2489,6 +2495,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> >  			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> >  	}
> >  	finish_wait(&pgdat->kswapd_wait, &wait);
> > +
> > +	return sleep;
> >  }
> >  
> >  /*
> > @@ -2550,8 +2558,13 @@ static int kswapd(void *p)
> >  			 */
> >  			order = new_order;
> >  		} else {
> > -			kswapd_try_to_sleep(pgdat, order);
> > -			order = pgdat->kswapd_max_order;
> > +			/*
> > +			 * If we wake up after enough sleeping, let's
> > +			 * start new order. Otherwise, it was a premature
> > +			 * sleep so we keep going on.
> > +			 */
> > +			if (kswapd_try_to_sleep(pgdat, order))
> > +				order = pgdat->kswapd_max_order;
> 
> Ok, we lose the old order if we slept enough. That is fine because if we
> slept enough it implies that reclaiming at that order was no longer
> necessary.
> 
> This needs a repost with a full changelog explaining why order has to be
> preserved if kswapd fails to go to sleep. There will be merge difficulties
> with the series aimed at fixing Simon's problem but it's unavoidable.
> Rebasing on top of my series isn't an option as I'm still patching
> against mainline until that issue is resolved.

So what's your point? Do you want me to send this patch alone
regardless of your series for Simon's problem?

Thanks for the review, Mel.

> 
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02 15:35       ` Minchan Kim
@ 2010-12-02 15:42         ` Mel Gorman
  2010-12-02 20:53           ` Simon Kirby
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2010-12-02 15:42 UTC (permalink / raw)
  To: Minchan Kim; +Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Fri, Dec 03, 2010 at 12:35:26AM +0900, Minchan Kim wrote:
> > > @@ -2550,8 +2558,13 @@ static int kswapd(void *p)
> > >  			 */
> > >  			order = new_order;
> > >  		} else {
> > > -			kswapd_try_to_sleep(pgdat, order);
> > > -			order = pgdat->kswapd_max_order;
> > > +			/*
> > > +			 * If we wake up after enough sleeping, let's
> > > +			 * start new order. Otherwise, it was a premature
> > > +			 * sleep so we keep going on.
> > > +			 */
> > > +			if (kswapd_try_to_sleep(pgdat, order))
> > > +				order = pgdat->kswapd_max_order;
> > 
> > Ok, we lose the old order if we slept enough. That is fine because if we
> > slept enough it implies that reclaiming at that order was no longer
> > necessary.
> > 
> > This needs a repost with a full changelog explaining why order has to be
> > preserved if kswapd fails to go to sleep. There will be merge difficulties
> > with the series aimed at fixing Simon's problem but it's unavoidable.
> > Rebasing on top of my series isn't an option as I'm still patching
> > against mainline until that issue is resolved.
> 
> So what's your point?

Only point was to comment "I think this part of the patch is fine".

> Do you want me to send this patch alone
> regardless of your series for Simon's problem?
> 

Yes, because I do not believe the problems are directly related. When/if
I get something working with Simon, I'll backport your patch on top of it
for testing by him just in case but I don't think it'll affect him.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02 15:42         ` Mel Gorman
@ 2010-12-02 20:53           ` Simon Kirby
  2010-12-03 12:00             ` Mel Gorman
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Kirby @ 2010-12-02 20:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Thu, Dec 02, 2010 at 03:42:35PM +0000, Mel Gorman wrote:

> On Fri, Dec 03, 2010 at 12:35:26AM +0900, Minchan Kim wrote:
> > > > @@ -2550,8 +2558,13 @@ static int kswapd(void *p)
> > > >  			 */
> > > >  			order = new_order;
> > > >  		} else {
> > > > -			kswapd_try_to_sleep(pgdat, order);
> > > > -			order = pgdat->kswapd_max_order;
> > > > +			/*
> > > > +			 * If we wake up after enough sleeping, let's
> > > > +			 * start new order. Otherwise, it was a premature
> > > > +			 * sleep so we keep going on.
> > > > +			 */
> > > > +			if (kswapd_try_to_sleep(pgdat, order))
> > > > +				order = pgdat->kswapd_max_order;
> > > 
> > > Ok, we lose the old order if we slept enough. That is fine because if we
> > > slept enough it implies that reclaiming at that order was no longer
> > > necessary.
> > > 
> > > This needs a repost with a full changelog explaining why order has to be
> > > preserved if kswapd fails to go to sleep. There will be merge difficulties
> > > with the series aimed at fixing Simon's problem but it's unavoidable.
> > > Rebasing on top of my series isn't an option as I'm still patching
> > > against mainline until that issue is resolved.
> > 
> > So what's your point?
> 
> Only point was to comment "I think this part of the patch is fine".
> 
> > Do you want me to send this patch alone
> > regardless of your series for Simon's problem?
> > 
> 
> Yes, because I do not believe the problems are directly related. When/if
> I get something working with Simon, I'll backport your patch on top of it
> for testing by him just in case but I don't think it'll affect him.

We could test this and your patch together, no?  Your patch definitely
fixed the case for us where kswapd would just run all day long, throwing
out everything while trying to reach the order-3 watermark in zone Normal
while order-0 page cache allocations were splitting it back out again.

However, the subject of my original post was to do with too much free
memory and swap, which is still occurring:

	http://0x.ca/sim/ref/2.6.36/memory_mel_patch_week.png

But this is still occurring even if I tell slub to use only order-0 and
order-1, and disable jumbo frames (which I did on another box, not this
one).  It may not be quite as bad, but I think the increase in free
memory is just based on fragmentation that builds up over time.  I don't
have any long-running graphs of this yet, but I can see that pretty much
all of the free memory is always order-0, and even a "while true; do
sleep .01; done" is enough to make it throw out more order-0 while trying
to make room for order-1 task_struct allocations.
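
For anyone wanting to watch this happen live, a rough sketch pairing the fork
loop with /proc/buddyinfo (assumes the stock procfs layout, where the columns
after each zone name are free-block counts at orders 0 through 10):

```shell
# Kick off the fork-heavy no-op loop described above; on these kernels each
# fork needs an order-1 allocation for the new task's stack/task_struct.
( while true; do sleep .01; done ) &
LOOP=$!

# Snapshot per-order free counts before and after a few seconds of forking;
# compare the order-0 column (first number after the zone name) between runs.
cat /proc/buddyinfo
sleep 3
cat /proc/buddyinfo

kill $LOOP 2>/dev/null || true
```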

Maybe some pattern in the way that pages are reclaimed while they are
being allocated is resulting in increasing fragmentation?  All the boxes
I see this on start out fine, but after a day or week they end up in swap
and with lots of free memory.

Simon-


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-02 20:53           ` Simon Kirby
@ 2010-12-03 12:00             ` Mel Gorman
  2010-12-04 12:07               ` Simon Kirby
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2010-12-03 12:00 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Thu, Dec 02, 2010 at 12:53:42PM -0800, Simon Kirby wrote:
> On Thu, Dec 02, 2010 at 03:42:35PM +0000, Mel Gorman wrote:
> 
> > On Fri, Dec 03, 2010 at 12:35:26AM +0900, Minchan Kim wrote:
> > > > > @@ -2550,8 +2558,13 @@ static int kswapd(void *p)
> > > > >  			 */
> > > > >  			order = new_order;
> > > > >  		} else {
> > > > > -			kswapd_try_to_sleep(pgdat, order);
> > > > > -			order = pgdat->kswapd_max_order;
> > > > > +			/*
> > > > > +			 * If we wake up after enough sleeping, let's
> > > > > +			 * start new order. Otherwise, it was a premature
> > > > > +			 * sleep so we keep going on.
> > > > > +			 */
> > > > > +			if (kswapd_try_to_sleep(pgdat, order))
> > > > > +				order = pgdat->kswapd_max_order;
> > > > 
> > > > Ok, we lose the old order if we slept enough. That is fine because if we
> > > > slept enough it implies that reclaiming at that order was no longer
> > > > necessary.
> > > > 
> > > > This needs a repost with a full changelog explaining why order has to be
> > > > preserved if kswapd fails to go to sleep. There will be merge difficulties
> > > > with the series aimed at fixing Simon's problem but it's unavoidable.
> > > > Rebasing on top of my series isn't an option as I'm still patching
> > > > against mainline until that issue is resolved.
> > > 
> > > So what's your point?
> > 
> > Only point was to comment "I think this part of the patch is fine".
> > 
> > > Do you want me to send this patch alone
> > > regardless of your series for Simon's problem?
> > > 
> > 
> > Yes, because I do not believe the problems are directly related. When/if
> > I get something working with Simon, I'll backport your patch on top of it
> > for testing by him just in case but I don't think it'll affect him.
> 
> We could test this and your patch together, no? 
> Your patch definitely
> fixed the case for us where kswapd would just run all day long, throwing
> out everything while trying to reach the order-3 watermark in zone Normal
> while order-0 page cache allocations were splitting it back out again.
> 

Ideally they would ultimately be tested together, but I'd really like to
hear if the 5 patch series I posted still prevents kswapd going crazy
and if the "too much free memory" problem is affected. Minimally, fixing
kswapd being awake is worthwhile.

> However, the subject of my original post was to do with too much free
> memory and swap, which is still occurring:
> 
> 	http://0x.ca/sim/ref/2.6.36/memory_mel_patch_week.png
> 

Ok, we had been working on the assumption that kswapd staying awake was
responsible for too much memory being free. If after the series is applied and
working there is still too much free memory, we know there is an additional
part to the problem.

> But this is still occurring even if I tell slub to use only order-0 and
> order-1, and disable jumbo frames (which I did on another box, not this
> one).  It may not be quite as bad, but I think the increase in free
> memory is just based on fragmentation that builds up over time. 

Before you said SLUB was using only order-0 and order-1, I would have
suspected lumpy reclaim. Without high-order allocations, fragmentation
is not a problem and shouldn't be triggering a mass freeing of memory.
Can you confirm with perf that there is no other constant source of
high-order allocations?
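
Something along these lines should do it, using the kmem:mm_page_alloc
tracepoint; verify the event name and its 'order' field with `perf list`
first, as tracepoints vary by kernel version:

```shell
# Sample all page allocations above order 0, system-wide, for 30 seconds.
# Requires root; the kmem:mm_page_alloc tracepoint exposes an 'order' field.
perf record -a -e kmem:mm_page_alloc --filter 'order > 0' -- sleep 30

# Summarize which tasks are doing the high-order allocations.
perf script | awk '{print $1}' | sort | uniq -c | sort -rn | head
```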

> I don't
> have any long-running graphs of this yet, but I can see that pretty much
> all of the free memory always is order-0, and even a "while true; do
> sleep .01; done" is enough to make it throw out more order-0 while trying
> to make room for order-1 task_struct allocations.
> 

It would be semi-normal to throw out a few pages for order-1 task_struct
allocations. Is your server fork-heavy? I would have guessed "no" as you
are forcing a large number of forks with the while loop.

> Maybe some pattern in the way that pages are reclaimed while they are
> being allocated is resulting in increasing fragmentation?  All the boxes
> I see this on start out fine, but after a day or week they end up in swap
> and with lots of free memory.
> 

Is there something like a big weekly backup task running that would be
responsible for pushing a large amount of memory to swap that is never
faulted back in again because it's unused?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-03 12:00             ` Mel Gorman
@ 2010-12-04 12:07               ` Simon Kirby
  2010-12-06 12:03                 ` Mel Gorman
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Kirby @ 2010-12-04 12:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Fri, Dec 03, 2010 at 12:00:54PM +0000, Mel Gorman wrote:

> On Thu, Dec 02, 2010 at 12:53:42PM -0800, Simon Kirby wrote:
> > On Thu, Dec 02, 2010 at 03:42:35PM +0000, Mel Gorman wrote:
> > 
> > > On Fri, Dec 03, 2010 at 12:35:26AM +0900, Minchan Kim wrote:
> > > 
> > > Only point was to comment "I think this part of the patch is fine".
> > > 
> > > > Do you want me to send this patch alone
> > > > regardless of your series for Simon's problem?
> > > 
> > > Yes, because I do not believe the problems are directly related. When/if
> > > I get something working with Simon, I'll backport your patch on top of it
> > > for testing by him just in case but I don't think it'll affect him.
> > 
> > We could test this and your patch together, no? 
> > Your patch definitely
> > fixed the case for us where kswapd would just run all day long, throwing
> > out everything while trying to reach the order-3 watermark in zone Normal
> > while order-0 page cache allocations were splitting it back out again.
> > 
> 
> Ideally they would ultimately be tested together, but I'd really like to
> hear if the 5 patch series I posted still prevents kswapd going crazy
> and if the "too much free memory" problem is affected. Minimally, fixing
> kswapd being awake is worthwhile.

Ok, we will try this version of your patches and see if anything changes.
The previous version did stop kswapd from running continuously during
daytime load, and made our SSD server useful, so I definitely like it. :)

> > However, the subject of my original post was to do with too much free
> > memory and swap, which is still occurring:
> > 
> > 	http://0x.ca/sim/ref/2.6.36/memory_mel_patch_week.png
> 
> Ok, we had been working on the assumption that kswapd staying awake was
> responsible for too much memory being free. If after the series is applied and
> working there is still too much free memory, we know there is an additional
> part to the problem.

This was part of the problem.  kswapd was throwing so much out while
trying to meet the watermark in zone Normal that the daemons had to keep
being read back in from /dev/sda (non-ssd), and this ended up causing
degraded performance.

> > But this is still occurring even if I tell slub to use only order-0 and
> > order-1, and disable jumbo frames (which I did on another box, not this
> > one).  It may not be quite as bad, but I think the increase in free
> > memory is just based on fragmentation that builds up over time. 
> 
> Before you said SLUB was using only order-0 and order-1, I would have
> suspected lumpy reclaim. Without high-order allocations, fragmentation
> is not a problem and shouldn't be triggering a mass freeing of memory.
> can you confirm with perf that there is no other constant source of
> high-order allocations?

Let me clarify: On _another_ box, with 2.6.36 but without your patches
and without as much load or SSD devices, I forced slub to use order-0
except where order-1 was absolutely necessary (objects > 4096 bytes),
just to see what impact this had on free memory.  There was a change,
but still lots of memory left free.  I was trying to avoid confusion by
posting graphs from different machines, but here is that one just as a
reference: http://0x.ca/sim/ref/2.6.36/memory_stor25r_week.png
(I made the slub order adjustment on Tuesday, November 30th.)
The spikes are actually from mail nightly expunge/purge runs.  It seems
that minimizing the slub orders did remove the large free spike that
was happening during mailbox compaction runs (nightly), and overall there
was a bit more memory used on average, but it definitely didn't "fix" it. 

The original server I was posting graphs for has had no other vm tweaks,
and so slub is still doing order-3 GFP_ATOMIC allocations from skb
allocations.

By the way, I noticed slub seems to choose different maximum orders based
on the memory size.  You may be able to get your test box to issue the
same GFP_ATOMIC order-3 allocations from skb allocations by making your
sysfs files match these values:

[/sys/kernel/slab]# grep . kmalloc-??{,?,??}/order
kmalloc-16/order:0
kmalloc-32/order:0
kmalloc-64/order:0
kmalloc-96/order:0
kmalloc-128/order:0
kmalloc-192/order:0
kmalloc-256/order:1
kmalloc-512/order:2
kmalloc-1024/order:3
kmalloc-2048/order:3
kmalloc-4096/order:3
kmalloc-8192/order:3

I suspect your kmalloc-1024 and kmalloc-2048 orders are less than 3 now?
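
For reference, the order files are writable on SLUB kernels, so matching the
values above should just be (a sketch, assuming the same kmalloc cache names
exist on the test box; run as root):

```shell
# Force the slab orders from the listing above onto a test machine.
cd /sys/kernel/slab
echo 1 > kmalloc-256/order
echo 2 > kmalloc-512/order
for cache in kmalloc-1024 kmalloc-2048 kmalloc-4096 kmalloc-8192; do
	echo 3 > "$cache/order"
done
grep . kmalloc-*/order   # verify the new settings took effect
```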

> > I don't
> > have any long-running graphs of this yet, but I can see that pretty much
> > all of the free memory always is order-0, and even a "while true; do
> > sleep .01; done" is enough to make it throw out more order-0 while trying
> > to make room for order-1 task_struct allocations.
> > 
> 
> It would be semi-normal to throw out a few pages for order-1 task_struct
> allocations. Is your server fork-heavy? I would have guessed "no" as you
> are forcing a large number of forks with the while loop.

No, the only things that cause forks on these servers usually are monitoring
processes.  According to munin, it averages under 3 forks per second.

> > Maybe some pattern in the way that pages are reclaimed while they are
> > being allocated is resulting in increasing fragmentation?  All the boxes
> > I see this on start out fine, but after a day or week they end up in swap
> > and with lots of free memory.
> 
> Is there something like a big weekly backup task running that would be
> responsible for pushing a large amount of memory to swap that is never
> faulted back in again because it's unused?

There are definitely pages that are leaking from dovecot or similar which
can be swapped out and not swapped in again (you can see "apps" growing),
but there are no tasks I can think of that would ever cause the system to
be starved.  The calls to pageout() seem to happen if sc.may_writepage is
set, which seems to happen when it thinks it has scanned enough without
making enough progress.  Could this happen just from too much
fragmentation?

The swapping seems to be at a slow but constant rate, so maybe it's
happening just due to the way the types of allocations are biasing to
Normal instead of DMA32, or vice-versa.  Check out the latest memory
graphs for the server running your original patch:

http://0x.ca/sim/ref/2.6.36/memory_mel_patch_dec4.png
http://0x.ca/sim/ref/2.6.36/zoneinfo_mel_patch_dec4
http://0x.ca/sim/ref/2.6.36/pagetypeinfo_mel_patch_dec4

Hmm, pagetypeinfo shows none or only a few of the pages in Normal are
considered reclaimable...

Simon-


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-04 12:07               ` Simon Kirby
@ 2010-12-06 12:03                 ` Mel Gorman
  2010-12-09 23:44                   ` Simon Kirby
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2010-12-06 12:03 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Sat, Dec 04, 2010 at 04:07:26AM -0800, Simon Kirby wrote:
> On Fri, Dec 03, 2010 at 12:00:54PM +0000, Mel Gorman wrote:
> 
> > On Thu, Dec 02, 2010 at 12:53:42PM -0800, Simon Kirby wrote:
> > > On Thu, Dec 02, 2010 at 03:42:35PM +0000, Mel Gorman wrote:
> > > 
> > > > On Fri, Dec 03, 2010 at 12:35:26AM +0900, Minchan Kim wrote:
> > > > 
> > > > Only point was to comment "I think this part of the patch is fine".
> > > > 
> > > > > Do you want me to send this patch alone
> > > > > regardless of your series for Simon's problem?
> > > > 
> > > > Yes, because I do not believe the problems are directly related. When/if
> > > > I get something working with Simon, I'll backport your patch on top of it
> > > > for testing by him just in case but I don't think it'll affect him.
> > > 
> > > We could test this and your patch together, no? 
> > > Your patch definitely
> > > fixed the case for us where kswapd would just run all day long, throwing
> > > out everything while trying to reach the order-3 watermark in zone Normal
> > > while order-0 page cache allocations were splitting it back out again.
> > > 
> > 
> > Ideally they would ultimately be tested together, but I'd really like to
> > hear if the 5 patch series I posted still prevents kswapd going crazy
> > and if the "too much free memory" problem is affected. Minimally, fixing
> > kswapd being awake is worthwhile.
> 
> Ok, we will try this version of your patches and see if anything changes.
> The previous version did stop kswapd from running continuously during
> daytime load, and made our SSD server useful, so I definitely like it. :)
> 

Good.

> > > However, the subject of my original post was to do with too much free
> > > memory and swap, which is still occurring:
> > > 
> > > 	http://0x.ca/sim/ref/2.6.36/memory_mel_patch_week.png
> > 
> > Ok, we had been working on the assumption that kswapd staying awake was
> > responsible for too much memory being free. If after the series is applied and
> > working there is still too much free memory, we know there is an additional
> > part to the problem.
> 
> This was part of the problem.  kswapd was throwing so much out while
> trying to meet the watermark in zone Normal that the daemons had to keep
> being read back in from /dev/sda (non-ssd), and this ended up causing
> degraded performance.
> 

But there are still potentially two problems here. The first was kswapd
throwing out everything in zone Normal. Even with that fixed, there are
potentially still too many pages being thrown out. The situation might
be improved but not repaired.

> > > But this is still occurring even if I tell slub to use only order-0 and
> > > order-1, and disable jumbo frames (which I did on another box, not this
> > > one).  It may not be quite as bad, but I think the increase in free
> > > memory is just based on fragmentation that builds up over time. 
> > 
> > Before you said SLUB was using only order-0 and order-1, I would have
> > suspected lumpy reclaim. Without high-order allocations, fragmentation
> > is not a problem and shouldn't be triggering a mass freeing of memory.
> > can you confirm with perf that there is no other constant source of
> > high-order allocations?
> 
> Let me clarify: On _another_ box, with 2.6.36 but without your patches
> and without as much load or SSD devices, I forced slub to use order-0
> except where order-1 was absolutely necessary (objects > 4096 bytes),
> just to see what impact this had on free memory.  There was a change,
> but still lots of memory left free.  I was trying to avoid confusion by
> posting graphs from different machines, but here is that one just as a
> reference: http://0x.ca/sim/ref/2.6.36/memory_stor25r_week.png
> (I made the slub order adjustment on Tuesday, November 30th.)
> The spikes are actually from mail nightly expunge/purge runs.  It seems
> that minimizing the slub orders did remove the large free spike that
> was happening during mailbox compaction runs (nightly), and overall there
> was a bit more memory used on average, but it definitely didn't "fix" it. 
> 

Ok, but it's further evidence that lumpy reclaim is still the problem here. This
should be "fixed" by reclaim/compaction which has less impact and frees
fewer pages than lumpy reclaim. If necessary, I can backport this to 2.6.36
for you to verify. There is little chance the series would be accepted into
-stable but you'd at least know that 2.6.37 or 2.6.38 would behave as expected.

> The original server I was posting graphs for has had no other vm tweaks,
> and so slub is still doing order-3 GFP_ATOMIC allocations from skb
> allocations.
> 
> By the way, I noticed slub seems to choose different maximum orders based
> on the memory size.  You may be able to get your test box to issue the
> same GFP_ATOMIC order-3 allocations from skb allocations by making your
> sysfs files match these values:
> 
> [/sys/kernel/slab]# grep . kmalloc-??{,?,??}/order
> kmalloc-16/order:0
> kmalloc-32/order:0
> kmalloc-64/order:0
> kmalloc-96/order:0
> kmalloc-128/order:0
> kmalloc-192/order:0
> kmalloc-256/order:1
> kmalloc-512/order:2
> kmalloc-1024/order:3
> kmalloc-2048/order:3
> kmalloc-4096/order:3
> kmalloc-8192/order:3
> 
> I suspect your kmalloc-1024 and kmalloc-2048 orders are less than 3 now?
> 

I'm using slub_min_order=3 to force the larger allocations.

> > > I don't
> > > have any long-running graphs of this yet, but I can see that pretty much
> > > all of the free memory always is order-0, and even a "while true; do
> > > sleep .01; done" is enough to make it throw out more order-0 while trying
> > > to make room for order-1 task_struct allocations.
> > > 
> > 
> > It would be semi-normal to throw out a few pages for order-1 task_struct
> > allocations. Is your server fork-heavy? I would have guessed "no" as you
> > are forcing a large number of forks with the while loop.
> 
> No, the only things that cause forks on these servers usually are monitoring
> processes.  According to munin, it averages under 3 forks per second.
> 

Ok.

> > > Maybe some pattern in the way that pages are reclaimed while they are
> > > being allocated is resulting in increasing fragmentation?  All the boxes
> > > I see this on start out fine, but after a day or week they end up in swap
> > > and with lots of free memory.
> > 
> > Is there something like a big weekly backup task running that would be
> > responsible for pushing a large amount of memory to swap that is never
> > faulted back in again because it's unused?
> 
> There are definitely pages that are leaking from dovecot or similar which
> can be swapped out and not swapped in again (you can see "apps" growing),
> but there are no tasks I can think of that would ever cause the system to
> be starved. 

So dovecot has a memory leak? As you say, this shouldn't starve the system
but it's inevitable that swap usage will grow over time.

> The calls to pageout() seem to happen if sc.may_writepage is
> set, which seems to happen when it thinks it has scanned enough without
> making enough progress.  Could this happen just from too much
> fragmentation?
> 

Not on its own but if too many pages have to be scanned due to
fragmentation, it can get set.

> The swapping seems to be at a slow but constant rate, so maybe it's

I assume you mean swap usage is growing at a slow but constant rate?

> happening just due to the way the types of allocations are biasing to
> Normal instead of DMA32, or vice-versa. 
> Check out the latest memory
> graphs for the server running your original patch:
> 
> http://0x.ca/sim/ref/2.6.36/memory_mel_patch_dec4.png

Do you think the growth in swap usage is due to dovecot leaking?

> http://0x.ca/sim/ref/2.6.36/zoneinfo_mel_patch_dec4
> http://0x.ca/sim/ref/2.6.36/pagetypeinfo_mel_patch_dec4
> 
> Hmm, pagetypeinfo shows none or only a few of the pages in Normal are
> considered reclaimable...
> 

Reclaimable in the context of pagetypeinfo means slab-reclaimable. The
results imply that very few slab allocations are being satisfied from
the Normal zone or at least very few have been released recently.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-06 12:03                 ` Mel Gorman
@ 2010-12-09 23:44                   ` Simon Kirby
  2010-12-10 11:32                     ` Mel Gorman
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Kirby @ 2010-12-09 23:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Mon, Dec 06, 2010 at 12:03:42PM +0000, Mel Gorman wrote:

> > This was part of the problem.  kswapd was throwing so much out while
> > trying to meet the watermark in zone Normal that the daemons had to keep
> > being read back in from /dev/sda (non-ssd), and this ended up causing
> > degraded performance.
> 
> But there are still potentially two problems here. The first was kswapd
> throwing out everything in zone normal. Even when fixed, there is
> potentially still too many pages being thrown out. The situation might
> be improved but not repaired.

Yes.

> > > Before you said SLUB was using only order-0 and order-1, I would have
> > > suspected lumpy reclaim. Without high-order allocations, fragmentation
> > > is not a problem and shouldn't be triggering a mass freeing of memory.
> > > can you confirm with perf that there is no other constant source of
> > > high-order allocations?
> > 
> > Let me clarify: On _another_ box, with 2.6.36 but without your patches
> > and without as much load or SSD devices, I forced slub to use order-0
> > except where order-1 was absolutely necessary (objects > 4096 bytes),
> > just to see what impact this had on free memory.  There was a change,
> > but still lots of memory left free.  I was trying to avoid confusion by
> > posting graphs from different machines, but here is that one just as a
> > reference: http://0x.ca/sim/ref/2.6.36/memory_stor25r_week.png
> > (I made the slub order adjustment on Tuesday, November 30th.)
> > The spikes are actually from mail nightly expunge/purge runs.  It seems
> > that minimizing the slub orders did remove the large free spike that
> > was happening during mailbox compaction runs (nightly), and overall there
> > was a bit more memory used on average, but it definitely didn't "fix" it. 
> 
> Ok, but it's still evidence that lumpy reclaim is the problem here. This
> should be "fixed" by reclaim/compaction which has less impact and frees
> fewer pages than lumpy reclaim. If necessary, I can backport this to 2.6.36
> for you to verify. There is little chance the series would be accepted into
> -stable but you'd at least know that 2.6.37 or 2.6.38 would behave as expected.

Isn't lumpy reclaim supposed to _improve_ this situation by trying to
free contiguous stuff rather than shooting aimlessly until contiguous
pages appear?  Or is there some other point to it?  If this is the case,
maybe the issue is that lumpy reclaim isn't happening soon enough, so it
shoots around too much before it tries to look for lumpy stuff.  In
2.6.3[67], set_lumpy_reclaim_mode() only sets lumpy mode if sc->order >
PAGE_ALLOC_COSTLY_ORDER (>= 4), or if priority < DEF_PRIORITY - 2.

Also, try_to_compact_pages() bails without doing anything when order <=
PAGE_ALLOC_COSTLY_ORDER, which is the order I'm seeing problems at.  So,
without further changes, I don't see how CONFIG_COMPACTION or 2.6.37 will
make any difference, unless I'm missing some related 2.6.37 changes.

> > There are definitely pages that are leaking from dovecot or similar which
> > can be swapped out and not swapped in again (you can see "apps" growing),
> > but there are no tasks I can think of that would ever cause the system to
> > be starved. 
> 
> So dovecot has a memory leak? As you say, this shouldn't starve the system
> but it's inevitable that swap usage will grow over time.

Yeah, we just squashed what seemed to be the biggest leak in dovecot, so
this should stop happening once we rebuild and restart everything.

> > The calls to pageout() seem to happen if sc.may_writepage is
> > set, which seems to happen when it thinks it has scanned enough without
> > making enough progress.  Could this happen just from too much
> > fragmentation?
> > 
> 
> Not on its own but if too many pages have to be scanned due to
> fragmentation, it can get set.
> 
> > The swapping seems to be at a slow but constant rate, so maybe it's
> 
> I assume you mean swap usage is growing at a slow but constant rate?

Yes.

> > happening just due to the way the types of allocations are biasing to
> > Normal instead of DMA32, or vice-versa. 
> > Check out the latest memory
> > graphs for the server running your original patch:
> > 
> > http://0x.ca/sim/ref/2.6.36/memory_mel_patch_dec4.png
> 
> Do you think the growth in swap usage is due to dovecot leaking?

I guess we'll find out shortly, with dovecot being fixed. :)

> > http://0x.ca/sim/ref/2.6.36/zoneinfo_mel_patch_dec4
> > http://0x.ca/sim/ref/2.6.36/pagetypeinfo_mel_patch_dec4
> > 
> > Hmm, pagetypeinfo shows none or only a few of the pages in Normal are
> > considered reclaimable...
> > 
> 
> Reclaimable in the context of pagetypeinfo means slab-reclaimable. The
> results imply that very few slab allocations are being satisfied from
> the Normal zone or at least very few have been released recently.

Hmm, ok.

Simon-


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-09 23:44                   ` Simon Kirby
@ 2010-12-10 11:32                     ` Mel Gorman
  2010-12-10 23:42                       ` Simon Kirby
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2010-12-10 11:32 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Thu, Dec 09, 2010 at 03:44:52PM -0800, Simon Kirby wrote:
> On Mon, Dec 06, 2010 at 12:03:42PM +0000, Mel Gorman wrote:
> 
> > > This was part of the problem.  kswapd was throwing so much out while
> > > trying to meet the watermark in zone Normal that the daemons had to keep
> > > being read back in from /dev/sda (non-ssd), and this ended up causing
> > > degraded performance.
> > 
> > But there are still potentially two problems here. The first was kswapd
> > throwing out everything in zone normal. Even when fixed, there is
> > potentially still too many pages being thrown out. The situation might
> > be improved but not repaired.
> 
> Yes.
> 
> > > > Before you said SLUB was using only order-0 and order-1, I would have
> > > > suspected lumpy reclaim. Without high-order allocations, fragmentation
> > > > is not a problem and shouldn't be triggering a mass freeing of memory.
> > > > can you confirm with perf that there is no other constant source of
> > > > high-order allocations?
> > > 
> > > Let me clarify: On _another_ box, with 2.6.36 but without your patches
> > > and without as much load or SSD devices, I forced slub to use order-0
> > > except where order-1 was absolutely necessary (objects > 4096 bytes),
> > > just to see what impact this had on free memory.  There was a change,
> > > but still lots of memory left free.  I was trying to avoid confusion by
> > > posting graphs from different machines, but here is that one just as a
> > > reference: http://0x.ca/sim/ref/2.6.36/memory_stor25r_week.png
> > > (I made the slub order adjustment on Tuesday, November 30th.)
> > > The spikes are actually from mail nightly expunge/purge runs.  It seems
> > > that minimizing the slub orders did remove the large free spike that
> > > was happening during mailbox compaction runs (nightly), and overall there
> > > was a bit more memory used on average, but it definitely didn't "fix" it. 
> > 
> > Ok, but it's still evidence that lumpy reclaim is the problem here. This
> > should be "fixed" by reclaim/compaction which has less impact and frees
> > fewer pages than lumpy reclaim. If necessary, I can backport this to 2.6.36
> > for you to verify. There is little chance the series would be accepted into
> > -stable but you'd at least know that 2.6.37 or 2.6.38 would behave as expected.
> 
> Isn't lumpy reclaim supposed to _improve_ this situation by trying to
> free contiguous stuff rather than shooting aimlessly until contiguous
> pages appear? 

For lower orders like order-1 and order-2, it reclaims randomly before
falling back to lumpy reclaim, as the assumption is that pages at these
lower orders become free naturally.

> Or is there some other point to it?  If this is the case,
> maybe the issue is that lumpy reclaim isn't happening soon enough, so it
> shoots around too much before it tries to look for lumpy stuff. 

It used to happen sooner but it ran into latency problems.

> In
> 2.6.3[67], set_lumpy_reclaim_mode() only sets lumpy mode if sc->order >
> PAGE_ALLOC_COSTLY_ORDER (>= 4), or if priority < DEF_PRIORITY - 2.
> 
> Also, try_to_compact_pages() bails without doing anything when order <=
> PAGE_ALLOC_COSTLY_ORDER, which is the order I'm seeing problems at.  So,
> > without further changes, I don't see how CONFIG_COMPACTION or 2.6.37 will
> make any difference, unless I'm missing some related 2.6.37 changes.
> 

There is increasing pressure to use compaction for the lower orders as
well. This problem is going to be added to the list of justifications :/

> > > There are definitely pages that are leaking from dovecot or similar which
> > > can be swapped out and not swapped in again (you can see "apps" growing),
> > > but there are no tasks I can think of that would ever cause the system to
> > > be starved. 
> > 
> > So dovecot has a memory leak? As you say, this shouldn't starve the system
> > but it's inevitable that swap usage will grow over time.
> 
> Yeah, we just squashed what seemed to be the biggest leak in dovecot, so
> this should stop happening once we rebuild and restart everything.
> 

Ok.

> > > The calls to pageout() seem to happen if sc.may_writepage is
> > > set, which seems to happen when it thinks it has scanned enough without
> > > making enough progress.  Could this happen just from too much
> > > fragmentation?
> > > 
> > 
> > Not on its own but if too many pages have to be scanned due to
> > fragmentation, it can get set.
> > 
> > > The swapping seems to be at a slow but constant rate, so maybe it's
> > 
> > I assume you mean swap usage is growing at a slow but constant rate?
> 
> Yes.
> 
> > > happening just due to the way the types of allocations are biasing to
> > > Normal instead of DMA32, or vice-versa. 
> > > Check out the latest memory
> > > graphs for the server running your original patch:
> > > 
> > > http://0x.ca/sim/ref/2.6.36/memory_mel_patch_dec4.png
> > 
> > Do you think the growth in swap usage is due to dovecot leaking?
> 
> I guess we'll find out shortly, with dovecot being fixed. :)
> 

Great.

> > > http://0x.ca/sim/ref/2.6.36/zoneinfo_mel_patch_dec4
> > > http://0x.ca/sim/ref/2.6.36/pagetypeinfo_mel_patch_dec4
> > > 
> > > Hmm, pagetypeinfo shows none or only a few of the pages in Normal are
> > > considered reclaimable...
> > > 
> > 
> > Reclaimable in the context of pagetypeinfo means slab-reclaimable. The
> > results imply that very few slab allocations are being satisfied from
> > the Normal zone or at least very few have been released recently.
> 
> Hmm, ok.
> 
> Simon-
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-10 11:32                     ` Mel Gorman
@ 2010-12-10 23:42                       ` Simon Kirby
  2010-12-14  9:52                         ` Mel Gorman
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Kirby @ 2010-12-10 23:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Fri, Dec 10, 2010 at 11:32:45AM +0000, Mel Gorman wrote:

> On Thu, Dec 09, 2010 at 03:44:52PM -0800, Simon Kirby wrote:
> > On Mon, Dec 06, 2010 at 12:03:42PM +0000, Mel Gorman wrote:
> > 
> > > But there are still potentially two problems here. The first was kswapd
> > > throwing out everything in zone normal. Even when fixed, there is
> > > potentially still too many pages being thrown out. The situation might
> > > be improved but not repaired.
> > 
> > Yes.
> > 
> > > > Let me clarify: On _another_ box, with 2.6.36 but without your patches
> > > > and without as much load or SSD devices, I forced slub to use order-0
> > > > except where order-1 was absolutely necessary (objects > 4096 bytes),
> > > > just to see what impact this had on free memory.  There was a change,
> > > > but still lots of memory left free.  I was trying to avoid confusion by
> > > > posting graphs from different machines, but here is that one just as a
> > > > reference: http://0x.ca/sim/ref/2.6.36/memory_stor25r_week.png
> > > > (I made the slub order adjustment on Tuesday, November 30th.)
> > > > The spikes are actually from mail nightly expunge/purge runs.  It seems
> > > > that minimizing the slub orders did remove the large free spike that
> > > > was happening during mailbox compaction runs (nightly), and overall there
> > > > was a bit more memory used on average, but it definitely didn't "fix" it. 
> > > 
> > > Ok, but it's still evidence that lumpy reclaim is the problem here. This
> > > should be "fixed" by reclaim/compaction which has less impact and frees
> > > fewer pages than lumpy reclaim. If necessary, I can backport this to 2.6.36
> > > for you to verify. There is little chance the series would be accepted into
> > > -stable but you'd at least know that 2.6.37 or 2.6.38 would behave as expected.
> > 
> > Isn't lumpy reclaim supposed to _improve_ this situation by trying to
> > free contiguous stuff rather than shooting aimlessly until contiguous
> > pages appear? 
> 
> For lower orders like order-1 and order-2, it reclaims randomly before
> using lumpy reclaim as the assumption is that these lower pages free
> naturally.

Hmm.. We were looking at some other servers' munin graphs,
and I seem to notice a correlation between high allocation rates (eg,
heavily loaded servers) and more memory being free.  I am wondering if
the problem isn't the choice of how to reclaim, but more an issue from
concurrent allocation calls.  Because (direct) reclaim isn't protected
from other allocations, it can fight with allocations that split back up
the orders, which might be increasing fragmentation.

The fragmentation and reaching watermarks does seem to be what is causing
a larger amount to stay free, once it _gets_ fragmented...

I was thinking of ways that it could hold pages while reclaiming, and
then free them all and allocate the request under a lock to avoid
colliding with other allocations.  I see shrink_active_list() almost
seems to have something like this with l_hold, but nothing cares about
watermarks down at that level.  The inactive list goes through a separate
routine, and only inactive uses lumpy reclaim.

> > Or is there some other point to it?  If this is the case,
> > maybe the issue is that lumpy reclaim isn't happening soon enough, so it
> > shoots around too much before it tries to look for lumpy stuff. 
> 
> It used to happen sooner but it ran into latency problems.

Latency from writeback or something?  I wonder if it would be worth
trying a cheap patch to try lumpy mode immediately, just to see how
things change.

> > In
> > 2.6.3[67], set_lumpy_reclaim_mode() only sets lumpy mode if sc->order >
> > PAGE_ALLOC_COSTLY_ORDER (>= 4), or if priority < DEF_PRIORITY - 2.
> > 
> > Also, try_to_compact_pages() bails without doing anything when order <=
> > PAGE_ALLOC_COSTLY_ORDER, which is the order I'm seeing problems at.  So,
> > > without further changes, I don't see how CONFIG_COMPACTION or 2.6.37 will
> > make any difference, unless I'm missing some related 2.6.37 changes.
> 
> There is increasing pressure to use compaction for the lower orders as
> well. This problem is going to be added to the list of justifications :/

I figured perhaps this was skipped due to being expensive, or else why
wouldn't it just always happen for non-zero orders.  I do see a lot of
servers with 200,000 free order-0 pages and almost no order-1 or anything
bigger, so maybe this could help.  I could try to modify the test in
try_to_compact_pages() to if (!order || !may_enter_fs || !may_perform_io).

Simon-


* Re: [patch]vmscan: make kswapd use a correct order
  2010-12-10 23:42                       ` Simon Kirby
@ 2010-12-14  9:52                         ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2010-12-14  9:52 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Minchan Kim, KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton

On Fri, Dec 10, 2010 at 03:42:06PM -0800, Simon Kirby wrote:
> On Fri, Dec 10, 2010 at 11:32:45AM +0000, Mel Gorman wrote:
> 
> > On Thu, Dec 09, 2010 at 03:44:52PM -0800, Simon Kirby wrote:
> > > On Mon, Dec 06, 2010 at 12:03:42PM +0000, Mel Gorman wrote:
> > > 
> > > > But there are still potentially two problems here. The first was kswapd
> > > > throwing out everything in zone normal. Even when fixed, there is
> > > > potentially still too many pages being thrown out. The situation might
> > > > be improved but not repaired.
> > > 
> > > Yes.
> > > 
> > > > > Let me clarify: On _another_ box, with 2.6.36 but without your patches
> > > > > and without as much load or SSD devices, I forced slub to use order-0
> > > > > except where order-1 was absolutely necessary (objects > 4096 bytes),
> > > > > just to see what impact this had on free memory.  There was a change,
> > > > > but still lots of memory left free.  I was trying to avoid confusion by
> > > > > posting graphs from different machines, but here is that one just as a
> > > > > reference: http://0x.ca/sim/ref/2.6.36/memory_stor25r_week.png
> > > > > (I made the slub order adjustment on Tuesday, November 30th.)
> > > > > The spikes are actually from mail nightly expunge/purge runs.  It seems
> > > > > that minimizing the slub orders did remove the large free spike that
> > > > > was happening during mailbox compaction runs (nightly), and overall there
> > > > > was a bit more memory used on average, but it definitely didn't "fix" it. 
> > > > 
> > > > Ok, but it's still evidence that lumpy reclaim is the problem here. This
> > > > should be "fixed" by reclaim/compaction which has less impact and frees
> > > > fewer pages than lumpy reclaim. If necessary, I can backport this to 2.6.36
> > > > for you to verify. There is little chance the series would be accepted into
> > > > -stable but you'd at least know that 2.6.37 or 2.6.38 would behave as expected.
> > > 
> > > Isn't lumpy reclaim supposed to _improve_ this situation by trying to
> > > free contiguous stuff rather than shooting aimlessly until contiguous
> > > pages appear? 
> > 
> > For lower orders like order-1 and order-2, it reclaims randomly before
> > using lumpy reclaim as the assumption is that these lower pages free
> > naturally.
> 
> Hmm.. We were looking at some other servers' munin graphs,
> and I seem to notice a correlation between high allocation rates (eg,
> heavily loaded servers) and more memory being free.  I am wondering if
> the problem isn't the choice of how to reclaim, but more an issue from
> concurrent allocation calls. 

It could be both. With many parallel allocators reclaiming order-0
pages, significantly more memory would be freed than necessary.

> Because (direct) reclaim isn't protected
> from other allocations, it can fight with allocations that split back up
> the orders, which might be increasing fragmentation.
> 

It might, but the dominant problem is likely to be a larger number of order-0
pages being reclaimed rather than this known-to-exist race occurring.

> The fragmentation and reaching watermarks does seem to be what is causing
> a larger amount to stay free, once it _gets_ fragmented...
> 
> I was thinking of ways that it could hold pages while reclaiming, and
> then free them all and allocate the request under a lock to avoid
> colliding with other allocations. 

This has been tried a few times in different implementations, but it almost
always ended up increasing the number of pages reclaimed (because more
pages get isolated) and increasing allocation latency. It could be tried
again, of course.

> I see shrink_active_list() almost
> seems to have something like this with l_hold, but nothing cares about
> watermarks down at that level.  The inactive list goes through a separate
> routine, and only inactive uses lumpy reclaim.
> 
> > > Or is there some other point to it?  If this is the case,
> > > maybe the issue is that lumpy reclaim isn't happening soon enough, so it
> > > shoots around too much before it tries to look for lumpy stuff. 
> > 
> > It used to happen sooner but it ran into latency problems.
> 
> Latency from writeback or something? 

Writeback and deactivating active pages.

> I wonder if it would be worth
> trying a cheap patch to try lumpy mode immediately, just to see how
> things change.
> 

If the problem persists, it can be considered but lumpy mode is delayed
on purpose. Maybe that decision was wrong.

> > > In
> > > 2.6.3[67], set_lumpy_reclaim_mode() only sets lumpy mode if sc->order >
> > > PAGE_ALLOC_COSTLY_ORDER (>= 4), or if priority < DEF_PRIORITY - 2.
> > > 
> > > Also, try_to_compact_pages() bails without doing anything when order <=
> > > PAGE_ALLOC_COSTLY_ORDER, which is the order I'm seeing problems at.  So,
> > > without further changes, I don't see how CONFIG_COMPACTION or 2.6.37 will
> > > make any difference, unless I'm missing some related 2.6.37 changes.
> > 
> > There is increasing pressure to use compaction for the lower orders as
> > well. This problem is going to be added to the list of justifications :/
> 
> I figured perhaps this was skipped due to being expensive, or else why
> wouldn't it just always happen for non-zero orders. 

No, it was skipped because it was a lot of new code and a number of bugs in
page migration had to be ironed out. If it was only going to be activated
for huge page allocation to start with, people were reasonably happy. It is
a bit expensive as well, as the scanning rates for it are astonishingly high,
but it should still be cheaper than reclaiming and paging back in.

> I do see a lot of
> servers with 200,000 free order-0 pages and almost no order-1 or anything
> bigger, so maybe this could help.  I could try to modify the test in
> try_to_compact_pages() to if (!order || !may_enter_fs || !may_perform_io).
> 

You could but I reckon the reclaim/compaction patches that are currently
in mmotm in combination with allowing compaction for lower orders will
be what happens longer term.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] vmscan: make kswapd use a correct order
  2010-12-09 22:13   ` Andrew Morton
@ 2010-12-10 11:17     ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2010-12-10 11:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Minchan Kim, linux-mm, LKML, Shaohua Li, KOSAKI Motohiro

On Thu, Dec 09, 2010 at 02:13:17PM -0800, Andrew Morton wrote:
> On Fri,  3 Dec 2010 01:00:49 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
> 
> OT: kswapd_try_to_sleep() does a
> trace_mm_vmscan_kswapd_sleep(pgdat->node_id) if it sleeps for a long
> time, but doesn't trace anything at all if it does a short sleep. 
> Where's the sense in that?
> 

The tracepoint is to mark when kswapd is going fully to sleep and being
inactive because all its work is done. The tracepoint's name might be
unfortunate because it's really used to track whether kswapd is active or
inactive rather than sleeping.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



* Re: [PATCH] vmscan: make kswapd use a correct order
  2010-12-09 22:13   ` Andrew Morton
@ 2010-12-10  3:53     ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2010-12-10  3:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, LKML, Shaohua Li, KOSAKI Motohiro, Mel Gorman

On Fri, Dec 10, 2010 at 7:13 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri,  3 Dec 2010 01:00:49 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
>
> OT: kswapd_try_to_sleep() does a
> trace_mm_vmscan_kswapd_sleep(pgdat->node_id) if it sleeps for a long
> time, but doesn't trace anything at all if it does a short sleep.
> Where's the sense in that?
>

AFAIU, a short sleep is still a _sleep_, but that trace's goal is to count
only long sleeps. In addition, a short sleep just means kswapd is about to
resume work, so I don't think we need a short-sleep trace.
And to count short sleeps, we can use
KSWAPD_{LOW|HIGH}_WMARK_HIT_QUICKLY.

-- 
Kind regards,
Minchan Kim



* Re: [PATCH] vmscan: make kswapd use a correct order
  2010-12-02 16:00 ` Minchan Kim
@ 2010-12-09 22:13   ` Andrew Morton
  -1 siblings, 0 replies; 40+ messages in thread
From: Andrew Morton @ 2010-12-09 22:13 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, LKML, Shaohua Li, KOSAKI Motohiro, Mel Gorman

On Fri,  3 Dec 2010 01:00:49 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)

OT: kswapd_try_to_sleep() does a
trace_mm_vmscan_kswapd_sleep(pgdat->node_id) if it sleeps for a long
time, but doesn't trace anything at all if it does a short sleep. 
Where's the sense in that?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] vmscan: make kswapd use a correct order
  2010-12-02 16:00 ` Minchan Kim
@ 2010-12-03 12:11   ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2010-12-03 12:11 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-mm, LKML, Shaohua Li, KOSAKI Motohiro

On Fri, Dec 03, 2010 at 01:00:49AM +0900, Minchan Kim wrote:
> If we wake up prematurely, it means we should keep going on
> reclaiming not new order page but at old order page.
> Sometime new order can be smaller than old order by below
> race so it could make failure of old order page reclaiming.
> 
> T0: Task 1 wakes up kswapd with order-3
> T1: So, kswapd starts to reclaim pages using balance_pgdat
> T2: Task 2 wakes up kswapd with order-2 because pages reclaimed
> 	by T1 are consumed quickly.
> T3: kswapd exits balance_pgdat and will do following:
> T4-1: In beginning of kswapd's loop, pgdat->kswapd_max_order will
> 	be reset with zero.
> T4-2: 'order' will be set to pgdat->kswapd_max_order(0), since it
>         enters the false branch of 'if (order (3) < new_order (2))'
> T4-3: If previous balance_pgdat can't meet requirement of order-2
> 	free pages by high watermark, it will start reclaiming again.
>         So balance_pgdat will use order-0 to do reclaim while it
> 	really should use order-2 at the moment.
> T4-4: At last, Task 1 can't get the any page if it wanted with
> 	GFP_ATOMIC.
> 
> Reported-by: Shaohua Li <shaohua.li@intel.com>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Shaohua Li <shaohua.li@intel.com>
> Cc: Mel Gorman <mel@csn.ul.ie>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] vmscan: make kswapd use a correct order
@ 2010-12-02 16:00 ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2010-12-02 16:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, Shaohua Li, KOSAKI Motohiro, Minchan Kim, Mel Gorman

If kswapd wakes up prematurely, it should keep reclaiming at the
old order rather than switch to the new order. Due to the race
below, the new order can sometimes be smaller than the old order,
which can make reclaim at the old order fail.

T0: Task 1 wakes up kswapd with order-3
T1: So, kswapd starts to reclaim pages using balance_pgdat
T2: Task 2 wakes up kswapd with order-2 because pages reclaimed
	by T1 are consumed quickly.
T3: kswapd exits balance_pgdat and will do following:
T4-1: In beginning of kswapd's loop, pgdat->kswapd_max_order will
	be reset with zero.
T4-2: 'order' will be set to pgdat->kswapd_max_order(0), since it
        enters the false branch of 'if (order (3) < new_order (2))'
T4-3: If previous balance_pgdat can't meet requirement of order-2
	free pages by high watermark, it will start reclaiming again.
        So balance_pgdat will use order-0 to do reclaim while it
	really should use order-2 at the moment.
T4-4: In the end, Task 1 may not get the page it wanted
	with GFP_ATOMIC.

Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Shaohua Li <shaohua.li@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..27d0839 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2447,13 +2447,18 @@ out:
 	return sc.nr_reclaimed;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
+/*
+ * Return true if we slept enough. Otherwise, return false
+ */
+static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 {
 	long remaining = 0;
+	bool slept = false;
+
 	DEFINE_WAIT(wait);
 
 	if (freezing(current) || kthread_should_stop())
-		return;
+		return slept;
 
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
@@ -2482,6 +2487,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 		schedule();
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+		slept = true;
 	} else {
 		if (remaining)
 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2489,6 +2495,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
 	finish_wait(&pgdat->kswapd_wait, &wait);
+
+	return slept;
 }
 
 /*
@@ -2550,8 +2558,15 @@ static int kswapd(void *p)
 			 */
 			order = new_order;
 		} else {
-			kswapd_try_to_sleep(pgdat, order);
-			order = pgdat->kswapd_max_order;
+			/*
+			 * If we slept long enough, we reclaimed enough
+			 * pages at the old order, so start reclaiming at
+			 * the new order this time. Otherwise the sleep
+			 * was premature and we should keep reclaiming
+			 * at the old order.
+			 */
+			if (kswapd_try_to_sleep(pgdat, order))
+				order = pgdat->kswapd_max_order;
 		}
 
 		ret = try_to_freeze();
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2010-12-14  9:52 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-01  3:08 [patch]vmscan: make kswapd use a correct order Shaohua Li
2010-12-01  4:21 ` Minchan Kim
2010-12-01  5:42   ` Shaohua Li
2010-12-01  9:44 ` KOSAKI Motohiro
2010-12-01 15:58   ` Minchan Kim
2010-12-02  0:09     ` KOSAKI Motohiro
2010-12-02  0:29       ` KOSAKI Motohiro
2010-12-02  0:58         ` Minchan Kim
2010-12-02  0:19     ` Andrew Morton
2010-12-02  9:40       ` Mel Gorman
2010-12-02  0:29     ` Shaohua Li
2010-12-02  0:54       ` Minchan Kim
2010-12-02  1:05         ` Shaohua Li
2010-12-02  1:23           ` Minchan Kim
2010-12-02  1:36             ` Minchan Kim
2010-12-02  9:42               ` Mel Gorman
2010-12-02 15:25                 ` Minchan Kim
2010-12-02  2:39             ` Shaohua Li
2010-12-02  1:28       ` KOSAKI Motohiro
2010-12-02 10:12     ` Mel Gorman
2010-12-02 15:35       ` Minchan Kim
2010-12-02 15:42         ` Mel Gorman
2010-12-02 20:53           ` Simon Kirby
2010-12-03 12:00             ` Mel Gorman
2010-12-04 12:07               ` Simon Kirby
2010-12-06 12:03                 ` Mel Gorman
2010-12-09 23:44                   ` Simon Kirby
2010-12-10 11:32                     ` Mel Gorman
2010-12-10 23:42                       ` Simon Kirby
2010-12-14  9:52                         ` Mel Gorman
2010-12-02 16:00 [PATCH] vmscan: " Minchan Kim
2010-12-02 16:00 ` Minchan Kim
2010-12-03 12:11 ` Mel Gorman
2010-12-03 12:11   ` Mel Gorman
2010-12-09 22:13 ` Andrew Morton
2010-12-09 22:13   ` Andrew Morton
2010-12-10  3:53   ` Minchan Kim
2010-12-10  3:53     ` Minchan Kim
2010-12-10 11:17   ` Mel Gorman
2010-12-10 11:17     ` Mel Gorman
