linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
@ 2009-12-28  7:47 KOSAKI Motohiro
  2009-12-28  7:48 ` [PATCH 2/4] vmscan: get_scan_ratio cleanup KOSAKI Motohiro
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-12-28  7:47 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman,
	KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro

The zone->lock is one of the performance-critical locks, so it shouldn't
be held for a long time. Currently, we have four walk_zones_in_node()
users, and most of them don't need to hold zone->lock.

Thus, this patch moves the locking responsibility from walk_zones_in_node()
to its callback functions, and kills the unnecessary zone->lock taking.
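
The shape of the change can be illustrated with a minimal userland sketch
(hypothetical names, a toy lock instead of the real spinlock API): the walker
no longer brackets every callback with the lock; the one callback that reads
lock-protected state takes it only around the read, not around the slow output.

```c
#include <assert.h>
#include <stdio.h>

struct fake_zone {
	int lock_held;		/* toy stand-in for zone->lock */
	int free_count;		/* stands in for the free-list state */
};

static void zone_lock(struct fake_zone *z)
{
	assert(!z->lock_held);	/* no recursion in this toy model */
	z->lock_held = 1;
}

static void zone_unlock(struct fake_zone *z)
{
	assert(z->lock_held);
	z->lock_held = 0;
}

/* After the patch: the walker itself takes no lock... */
static void walk_zones(struct fake_zone *zones, int n,
		       void (*print)(struct fake_zone *))
{
	for (int i = 0; i < n; i++)
		print(&zones[i]);
}

/* ...and the one callback that needs consistency locks only around the
 * read, doing the slow printf() without the lock held. */
static void showfree_print(struct fake_zone *z)
{
	int count;

	zone_lock(z);
	count = z->free_count;	/* stands in for list_for_each() counting */
	zone_unlock(z);

	printf("free: %d\n", count);
}
```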

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmstat.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6051fba..a5d45bc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
 {
 	struct zone *zone;
 	struct zone *node_zones = pgdat->node_zones;
-	unsigned long flags;
 
 	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
 		if (!populated_zone(zone))
 			continue;
 
-		spin_lock_irqsave(&zone->lock, flags);
 		print(m, pgdat, zone);
-		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 }
 
@@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
 	int order, mtype;
+	unsigned long flags;
 
 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
 		seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 
 			area = &(zone->free_area[order]);
 
+			spin_lock_irqsave(&zone->lock, flags);
 			list_for_each(curr, &area->free_list[mtype])
 				freecount++;
+			spin_unlock_irqrestore(&zone->lock, flags);
+
 			seq_printf(m, "%6lu ", freecount);
 		}
 		seq_putc(m, '\n');
@@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 							struct zone *zone)
 {
 	int i;
+
 	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
 	seq_printf(m,
 		   "\n  pages free     %lu"
-- 
1.6.5.2





^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] vmscan: get_scan_ratio cleanup
  2009-12-28  7:47 [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node KOSAKI Motohiro
@ 2009-12-28  7:48 ` KOSAKI Motohiro
  2009-12-29  7:34   ` Balbir Singh
                     ` (2 more replies)
  2009-12-28  7:48 ` [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo KOSAKI Motohiro
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-12-28  7:48 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman,
	KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro

get_scan_ratio() should contain all scan-ratio related calculations.
Thus, this patch moves the remaining calculation into get_scan_ratio().
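
The idea can be sketched in a small userland function (illustrative names,
with a fixed split standing in for the real arithmetic): the "no swap"
special case now lives inside the helper, so callers need no else branch.

```c
/* Sketch of the cleanup: all scan-ratio policy is in one place. */
static void get_scan_ratio_sketch(int may_swap, long nr_swap_pages,
				  unsigned long percent[2])
{
	/* If we have no swap space, do not bother scanning anon pages. */
	if (!may_swap || nr_swap_pages <= 0) {
		percent[0] = 0;		/* anon */
		percent[1] = 100;	/* file */
		return;
	}

	/* The real recent_scanned/recent_rotated arithmetic would go
	 * here; this fixed split merely stands in for it. */
	percent[0] = 50;
	percent[1] = 50;
}
```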

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bbee91..640486b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1501,6 +1501,13 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 	unsigned long ap, fp;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
+	/* If we have no swap space, do not bother scanning anon pages. */
+	if (!sc->may_swap || (nr_swap_pages <= 0)) {
+		percent[0] = 0;
+		percent[1] = 100;
+		return;
+	}
+
 	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
 		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
 	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
@@ -1598,22 +1605,20 @@ static void shrink_zone(int priority, struct zone *zone,
 	unsigned long nr_reclaimed = sc->nr_reclaimed;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-	int noswap = 0;
 
-	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || (nr_swap_pages <= 0)) {
-		noswap = 1;
-		percent[0] = 0;
-		percent[1] = 100;
-	} else
-		get_scan_ratio(zone, sc, percent);
+	get_scan_ratio(zone, sc, percent);
 
 	for_each_evictable_lru(l) {
 		int file = is_file_lru(l);
 		unsigned long scan;
 
+		if (percent[file] == 0) {
+			nr[l] = 0;
+			continue;
+		}
+
 		scan = zone_nr_lru_pages(zone, sc, l);
-		if (priority || noswap) {
+		if (priority) {
 			scan >>= priority;
 			scan = (scan * percent[file]) / 100;
 		}
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo
  2009-12-28  7:47 [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node KOSAKI Motohiro
  2009-12-28  7:48 ` [PATCH 2/4] vmscan: get_scan_ratio cleanup KOSAKI Motohiro
@ 2009-12-28  7:48 ` KOSAKI Motohiro
  2009-12-29 14:08   ` Balbir Singh
                     ` (2 more replies)
  2009-12-28  7:49 ` [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file KOSAKI Motohiro
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-12-28  7:48 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman,
	KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro

Vmscan folks have been asked "why does my system swap out so much?"
on LKML several times. Each time, I had to make a debug patch to show
the recent_anon_{scanned/rotated} parameters; I have done so at least
three times now.

Thus, these parameters should be shown in /proc/zoneinfo. It helps
vmscan folks with debugging.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/swap.h |    2 ++
 mm/vmscan.c          |   15 +++++++++++++++
 mm/vmstat.c          |    7 +++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..e95d7ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -280,6 +280,8 @@ extern void scan_unevictable_unregister_node(struct node *node);
 extern int kswapd_run(int nid);
 extern void kswapd_stop(int nid);
 
+unsigned long get_anon_scan_ratio(struct zone *zone, struct mem_cgroup *memcg, int swappiness);
+
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 640486b..1c39a74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1572,6 +1572,21 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 	percent[1] = 100 - percent[0];
 }
 
+unsigned long get_anon_scan_ratio(struct zone *zone, struct mem_cgroup *memcg, int swappiness)
+{
+	unsigned long percent[2];
+	struct scan_control sc = {
+		.may_swap = 1,
+		.swappiness = swappiness,
+		.mem_cgroup = memcg,
+	};
+
+	get_scan_ratio(zone, &sc, percent);
+
+	return percent[0];
+}
+
+
 /*
  * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
  * until we collected @swap_cluster_max pages to scan.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a5d45bc..24383b4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -15,6 +15,7 @@
 #include <linux/cpu.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
+#include <linux/swap.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -762,11 +763,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  all_unreclaimable: %u"
 		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
-		   "\n  inactive_ratio:    %u",
+		   "\n  inactive_ratio:    %u"
+		   "\n  anon_scan_ratio:   %lu",
 			   zone_is_all_unreclaimable(zone),
 		   zone->prev_priority,
 		   zone->zone_start_pfn,
-		   zone->inactive_ratio);
+		   zone->inactive_ratio,
+		   get_anon_scan_ratio(zone, NULL, vm_swappiness));
 	seq_putc(m, '\n');
 }
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file
  2009-12-28  7:47 [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node KOSAKI Motohiro
  2009-12-28  7:48 ` [PATCH 2/4] vmscan: get_scan_ratio cleanup KOSAKI Motohiro
  2009-12-28  7:48 ` [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo KOSAKI Motohiro
@ 2009-12-28  7:49 ` KOSAKI Motohiro
  2009-12-29 14:09   ` Balbir Singh
  2009-12-29  4:05 ` [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node Minchan Kim
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-12-28  7:49 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman,
	KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro

The anon_scan_ratio feature is not only useful for global VM pressure
analysis; it is also useful for memcg memory pressure analysis.

Therefore, this patch adds an anon_scan_ratio field to the memory.stat
file too.

In exchange, the following debug statistics are removed, as they are not
very user and/or developer friendly:

	- recent_rotated_anon
	- recent_rotated_file
	- recent_scanned_anon
	- recent_scanned_file

This removal doesn't cause an ABI issue, because these were enclosed in
CONFIG_DEBUG_VM.
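
The aggregation the patch performs is a weighted average of the per-zone
ratios, weighted by each zone's anon LRU size. A userland sketch (hypothetical
names, not the kernel API; the real loop lives in mem_control_stat_show()):

```c
/* Illustrative per-zone sample. */
struct zone_sample {
	unsigned long anon_pages;	/* active + inactive anon LRU size */
	unsigned long anon_ratio;	/* get_anon_scan_ratio() result, 0..100 */
};

/* memcg-wide ratio: average of per-zone ratios, weighted by each
 * zone's anon LRU size. */
static unsigned long memcg_anon_scan_ratio(const struct zone_sample *z, int n)
{
	unsigned long total_anon = 0, total_scan_anon = 0;
	int i;

	for (i = 0; i < n; i++) {
		total_anon      += z[i].anon_pages;
		total_scan_anon += z[i].anon_pages * z[i].anon_ratio;
	}
	if (!total_anon)	/* guard the 0/0 case */
		return 0;
	return total_scan_anon / total_anon;
}
```

Note that the sketch guards against a memcg with no anon pages at all;
the patch as posted divides unconditionally.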

Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/memcontrol.c |   43 +++++++++++++++++++------------------------
 1 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 325df12..daa027c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2950,6 +2950,9 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
 {
 	struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont);
 	struct mcs_total_stat mystat;
+	struct zone *zone;
+	unsigned long total_anon = 0;
+	unsigned long total_scan_anon = 0;
 	int i;
 
 	memset(&mystat, 0, sizeof(mystat));
@@ -2978,34 +2981,26 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
 		cb->fill(cb, memcg_stat_strings[i].total_name, mystat.stat[i]);
 	}
 
-#ifdef CONFIG_DEBUG_VM
 	cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));
 
-	{
-		int nid, zid;
+	for_each_populated_zone(zone) {
+		int nid = zone->zone_pgdat->node_id;
+		int zid = zone_idx(zone);
 		struct mem_cgroup_per_zone *mz;
-		unsigned long recent_rotated[2] = {0, 0};
-		unsigned long recent_scanned[2] = {0, 0};
-
-		for_each_online_node(nid)
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-
-				recent_rotated[0] +=
-					mz->reclaim_stat.recent_rotated[0];
-				recent_rotated[1] +=
-					mz->reclaim_stat.recent_rotated[1];
-				recent_scanned[0] +=
-					mz->reclaim_stat.recent_scanned[0];
-				recent_scanned[1] +=
-					mz->reclaim_stat.recent_scanned[1];
-			}
-		cb->fill(cb, "recent_rotated_anon", recent_rotated[0]);
-		cb->fill(cb, "recent_rotated_file", recent_rotated[1]);
-		cb->fill(cb, "recent_scanned_anon", recent_scanned[0]);
-		cb->fill(cb, "recent_scanned_file", recent_scanned[1]);
+		unsigned long anon;
+		unsigned long ratio;
+
+		mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
+
+		anon = MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+		anon += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON);
+
+		ratio = get_anon_scan_ratio(zone, mem_cont, mem_cont->swappiness);
+
+		total_anon += anon;
+		total_scan_anon += anon * ratio;
 	}
-#endif
+	cb->fill(cb, "anon_scan_ratio", total_scan_anon / total_anon);
 
 	return 0;
 }
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
  2009-12-28  7:47 [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node KOSAKI Motohiro
                   ` (2 preceding siblings ...)
  2009-12-28  7:49 ` [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file KOSAKI Motohiro
@ 2009-12-29  4:05 ` Minchan Kim
  2009-12-29  6:34 ` Balbir Singh
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Minchan Kim @ 2009-12-29  4:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman,
	KAMEZAWA Hiroyuki

On Mon, Dec 28, 2009 at 4:47 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> The zone->lock is one of performance critical locks. Then, it shouldn't
> be hold for long time. Currently, we have four walk_zones_in_node()
> usage and almost use-case don't need to hold zone->lock.

I agree.

We can use walk_zones_in_node() freely to show zone-related information:

- frag_show_print: the number of free pages per order
- pagetypeinfo_showfree_print: the number of free pages per migration type
- pagetypeinfo_showblockcount_print: the number of pages in the zone per
migration type
- zoneinfo_show_print: various information about the zone

Do we want to show exact values? No.
Even if we did, zone->lock alone wouldn't be enough.
After all, all of this information is transient anyway.

>
> Thus, this patch move locking responsibility from walk_zones_in_node
> to its sub function. Also this patch kill unnecessary zone->lock taking.
>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
  2009-12-28  7:47 [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node KOSAKI Motohiro
                   ` (3 preceding siblings ...)
  2009-12-29  4:05 ` [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node Minchan Kim
@ 2009-12-29  6:34 ` Balbir Singh
  2009-12-30 12:49   ` KOSAKI Motohiro
  2010-01-03 18:59 ` Mel Gorman
  2010-01-03 23:48 ` KAMEZAWA Hiroyuki
  6 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-12-29  6:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-12-28 16:47:22]:

> The zone->lock is one of performance critical locks. Then, it shouldn't
> be hold for long time. Currently, we have four walk_zones_in_node()
> usage and almost use-case don't need to hold zone->lock.
> 
> Thus, this patch move locking responsibility from walk_zones_in_node
> to its sub function. Also this patch kill unnecessary zone->lock taking.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/vmstat.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6051fba..a5d45bc 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
>  {
>  	struct zone *zone;
>  	struct zone *node_zones = pgdat->node_zones;
> -	unsigned long flags;
> 
>  	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
>  		if (!populated_zone(zone))
>  			continue;
> 
> -		spin_lock_irqsave(&zone->lock, flags);
>  		print(m, pgdat, zone);
> -		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
>  }
>

> @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	unsigned long flags;
> 
>  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> 
>  			area = &(zone->free_area[order]);
> 
> +			spin_lock_irqsave(&zone->lock, flags);
>  			list_for_each(curr, &area->free_list[mtype])
>  				freecount++;
> +			spin_unlock_irqrestore(&zone->lock, flags);
> +
>  			seq_printf(m, "%6lu ", freecount);
>  		}
>  		seq_putc(m, '\n');
> @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  							struct zone *zone)
>  {
>  	int i;
> +
>  	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
>  	seq_printf(m,
>  		   "\n  pages free     %lu"

While this is a noble cause, is printing all this information correct
without the lock? I am not worried about stale
information, but about incorrect data. Should the read side be RCU-protected?
Is just removing the lock and accessing the data safe across architectures?

-- 
	Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] vmscan: get_scan_ratio cleanup
  2009-12-28  7:48 ` [PATCH 2/4] vmscan: get_scan_ratio cleanup KOSAKI Motohiro
@ 2009-12-29  7:34   ` Balbir Singh
  2009-12-30 13:06     ` KOSAKI Motohiro
  2010-01-01 14:55   ` Rik van Riel
  2010-01-03 23:45   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-12-29  7:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-12-28 16:48:06]:

> The get_scan_ratio() should have all scan-ratio related calculations.
> Thus, this patch move some calculation into get_scan_ratio.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/vmscan.c |   23 ++++++++++++++---------
>  1 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..640486b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1501,6 +1501,13 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
>  	unsigned long ap, fp;
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> 
> +	/* If we have no swap space, do not bother scanning anon pages. */
> +	if (!sc->may_swap || (nr_swap_pages <= 0)) {
> +		percent[0] = 0;
> +		percent[1] = 100;
> +		return;
> +	}
> +


>  	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
>  		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
>  	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
> @@ -1598,22 +1605,20 @@ static void shrink_zone(int priority, struct zone *zone,
>  	unsigned long nr_reclaimed = sc->nr_reclaimed;
>  	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> -	int noswap = 0;
> 
> -	/* If we have no swap space, do not bother scanning anon pages. */
> -	if (!sc->may_swap || (nr_swap_pages <= 0)) {
> -		noswap = 1;
> -		percent[0] = 0;
> -		percent[1] = 100;
> -	} else
> -		get_scan_ratio(zone, sc, percent);
> +	get_scan_ratio(zone, sc, percent);

Where do we set noswap? Is percent[0] == 0 used to indicate noswap =
1?

> 
>  	for_each_evictable_lru(l) {
>  		int file = is_file_lru(l);
>  		unsigned long scan;
> 
> +		if (percent[file] == 0) {
> +			nr[l] = 0;
> +			continue;
> +		}
> +

Is this really needed? Won't nr_scan_try_batch handle it correctly?

>  		scan = zone_nr_lru_pages(zone, sc, l);
> -		if (priority || noswap) {
> +		if (priority) {
>  			scan >>= priority;
>  			scan = (scan * percent[file]) / 100;
>  		}

-- 
	Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo
  2009-12-28  7:48 ` [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo KOSAKI Motohiro
@ 2009-12-29 14:08   ` Balbir Singh
  2009-12-30 13:13     ` KOSAKI Motohiro
  2010-01-01 14:59   ` Rik van Riel
  2010-01-03 23:47   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-12-29 14:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-12-28 16:48:51]:

> Vmscan folks was asked "why does my system makes so much swap-out?"
> in lkml at several times.
> At that time, I made the debug patch to show recent_anon_{scanned/rorated}
> parameter at least three times.
> 
> Thus, its parameter should be showed on /proc/zoneinfo. It help
> vmscan folks debugging.
>

Hmmm.. I think this should come under DEBUG_VM, a lot of tools use
/proc/zoneinfo, the additional overhead may be high.. no? Also,
I would recommend adding the additional details to the end, so
as to not break existing tools (specifically dump line # based
tools).
 
-- 
	Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file
  2009-12-28  7:49 ` [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file KOSAKI Motohiro
@ 2009-12-29 14:09   ` Balbir Singh
  2009-12-30 12:53     ` KOSAKI Motohiro
  0 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-12-29 14:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-12-28 16:49:27]:

> anon_scan_ratio feature doesn't only useful for global VM pressure
> analysis, but it also useful for memcg memroy pressure analysis.
> 
> Then, this patch add anon_scan_ratio field to memory.stat file too.
> 
> Instead, following debug statistics was removed. It isn't so user and/or
> developer friendly.
> 
> 	- recent_rotated_anon
> 	- recent_rotated_file
> 	- recent_scanned_anon
> 	- recent_scanned_file
>

I've been using these to look at statistics - specifically reclaim
data on my developer kernels.
 
-- 
	Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
  2009-12-29  6:34 ` Balbir Singh
@ 2009-12-30 12:49   ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-12-30 12:49 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-12-28 16:47:22]:
> 
> > The zone->lock is one of performance critical locks. Then, it shouldn't
> > be hold for long time. Currently, we have four walk_zones_in_node()
> > usage and almost use-case don't need to hold zone->lock.
> > 
> > Thus, this patch move locking responsibility from walk_zones_in_node
> > to its sub function. Also this patch kill unnecessary zone->lock taking.
> > 
> > Cc: Mel Gorman <mel@csn.ul.ie>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  mm/vmstat.c |    8 +++++---
> >  1 files changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6051fba..a5d45bc 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> >  {
> >  	struct zone *zone;
> >  	struct zone *node_zones = pgdat->node_zones;
> > -	unsigned long flags;
> > 
> >  	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> >  		if (!populated_zone(zone))
> >  			continue;
> > 
> > -		spin_lock_irqsave(&zone->lock, flags);
> >  		print(m, pgdat, zone);
> > -		spin_unlock_irqrestore(&zone->lock, flags);
> >  	}
> >  }
> >
> 
> > @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  					pg_data_t *pgdat, struct zone *zone)
> >  {
> >  	int order, mtype;
> > +	unsigned long flags;
> > 
> >  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> >  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > 
> >  			area = &(zone->free_area[order]);
> > 
> > +			spin_lock_irqsave(&zone->lock, flags);
> >  			list_for_each(curr, &area->free_list[mtype])
> >  				freecount++;
> > +			spin_unlock_irqrestore(&zone->lock, flags);
> > +
> >  			seq_printf(m, "%6lu ", freecount);
> >  		}
> >  		seq_putc(m, '\n');
> > @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> >  							struct zone *zone)
> >  {
> >  	int i;
> > +
> >  	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
> >  	seq_printf(m,
> >  		   "\n  pages free     %lu"
> 
> While this is a noble cause, is printing all this information correct
> without the lock, I am not worried about old 
> information, but incorrect data. Should the read side be rcu'ed. Is
> just removing the lock and accessing data safe across architectures?

Hm.
Actually, the current memory-hotplug implementation changes various zone data
without zone->lock, so my patch doesn't cause a regression. I'm not sure
whether RCU protection is worthwhile or not, but I think that can be a
separate patch.





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file
  2009-12-29 14:09   ` Balbir Singh
@ 2009-12-30 12:53     ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-12-30 12:53 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-12-28 16:49:27]:
> 
> > anon_scan_ratio feature doesn't only useful for global VM pressure
> > analysis, but it also useful for memcg memroy pressure analysis.
> > 
> > Then, this patch add anon_scan_ratio field to memory.stat file too.
> > 
> > Instead, following debug statistics was removed. It isn't so user and/or
> > developer friendly.
> > 
> > 	- recent_rotated_anon
> > 	- recent_rotated_file
> > 	- recent_scanned_anon
> > 	- recent_scanned_file
> 
> I've been using these to look at statistics - specifically reclaim
> data on my developer kernels.

OK, I'll drop the removal part and just add anon_scan_ratio.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] vmscan: get_scan_ratio cleanup
  2009-12-29  7:34   ` Balbir Singh
@ 2009-12-30 13:06     ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-12-30 13:06 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-12-28 16:48:06]:
> 
> > The get_scan_ratio() should have all scan-ratio related calculations.
> > Thus, this patch move some calculation into get_scan_ratio.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  mm/vmscan.c |   23 ++++++++++++++---------
> >  1 files changed, 14 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2bbee91..640486b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1501,6 +1501,13 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
> >  	unsigned long ap, fp;
> >  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > 
> > +	/* If we have no swap space, do not bother scanning anon pages. */
> > +	if (!sc->may_swap || (nr_swap_pages <= 0)) {
> > +		percent[0] = 0;
> > +		percent[1] = 100;
> > +		return;
> > +	}
> > +
> 
> 
> >  	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
> >  		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
> >  	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
> > @@ -1598,22 +1605,20 @@ static void shrink_zone(int priority, struct zone *zone,
> >  	unsigned long nr_reclaimed = sc->nr_reclaimed;
> >  	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> >  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > -	int noswap = 0;
> > 
> > -	/* If we have no swap space, do not bother scanning anon pages. */
> > -	if (!sc->may_swap || (nr_swap_pages <= 0)) {
> > -		noswap = 1;
> > -		percent[0] = 0;
> > -		percent[1] = 100;
> > -	} else
> > -		get_scan_ratio(zone, sc, percent);
> > +	get_scan_ratio(zone, sc, percent);
> 
> Where do we set noswap? Is percent[0] == 0 used to indicate noswap =
> 1?

Yes, that was my intention. I guess your question can be rephrased as:

	the following case produces a different result; is that intentional?
	  - there is free swap
	  - sc->may_swap == 1
	  - priority == 0
	  - percent[0] == 0

My answer is that this doesn't happen in practical workloads. If priority
reaches 0, vmscan always scans and reclaims some anon pages (recall that
vmscan now automatically enables lumpy reclaim if priority < 10), so
recent_rotated_anon is practically never 0.

Do you think this is a wrong assumption?
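
The equivalence argued here can be checked with a small userland sketch
(hypothetical name, not the kernel function): percent == 0 now plays the
role of the removed noswap flag.

```c
/* Sketch of the per-LRU scan computation in shrink_zone() after the
 * patch; lru_pages stands in for zone_nr_lru_pages(). */
static unsigned long scan_target(unsigned long lru_pages, int priority,
				 unsigned long percent)
{
	unsigned long scan;

	if (percent == 0)	/* replaces the old noswap short-cut */
		return 0;

	scan = lru_pages;
	if (priority) {
		scan >>= priority;
		scan = scan * percent / 100;
	}
	return scan;
}
```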


> >  	for_each_evictable_lru(l) {
> >  		int file = is_file_lru(l);
> >  		unsigned long scan;
> > 
> > +		if (percent[file] == 0) {
> > +			nr[l] = 0;
> > +			continue;
> > +		}
> > +
> 
> Is this really needed? Won't nr_scan_try_batch handle it correctly?

These two if branches are nicer than "priority || noswap", I think;
they clearly explain what the code does.

> 
> >  		scan = zone_nr_lru_pages(zone, sc, l);
> > -		if (priority || noswap) {
> > +		if (priority) {
> >  			scan >>= priority;
> >  			scan = (scan * percent[file]) / 100;
> >  		}
> 
> -- 
> 	Balbir




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo
  2009-12-29 14:08   ` Balbir Singh
@ 2009-12-30 13:13     ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-12-30 13:13 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-12-28 16:48:51]:
> 
> > Vmscan folks was asked "why does my system makes so much swap-out?"
> > in lkml at several times.
> > At that time, I made the debug patch to show recent_anon_{scanned/rorated}
> > parameter at least three times.
> > 
> > Thus, its parameter should be showed on /proc/zoneinfo. It help
> > vmscan folks debugging.
> >
> 
> Hmmm.. I think this should come under DEBUG_VM, a lot of tools use
> /proc/zoneinfo, the additional overhead may be high.. no? Also,
> I would recommend adding the additional details to the end, so
> as to not break existing tools (specifically dump line # based
> tools).

Thanks. I have three answers. 1) I really hope not to enclose this in DEBUG_VM;
otherwise my problem isn't solved. 2) But your performance worry is fair enough;
I plan to stop grabbing zone->lru_lock when reading /proc/zoneinfo.
3) Appending a new line doesn't break existing tools: zoneinfo shows vmstat
counters, and we have often appended new ones in past years, yet I have never
seen a zoneinfo breakage bug report. Because the zoneinfo file shows
information for multiple zones, nobody accesses it by line number.






^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] vmscan: get_scan_ratio cleanup
  2009-12-28  7:48 ` [PATCH 2/4] vmscan: get_scan_ratio cleanup KOSAKI Motohiro
  2009-12-29  7:34   ` Balbir Singh
@ 2010-01-01 14:55   ` Rik van Riel
  2010-01-03 23:45   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2010-01-01 14:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman,
	KAMEZAWA Hiroyuki

On 12/28/2009 02:48 AM, KOSAKI Motohiro wrote:
> The get_scan_ratio() should have all scan-ratio related calculations.
> Thus, this patch move some calculation into get_scan_ratio.
>
> Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo
  2009-12-28  7:48 ` [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo KOSAKI Motohiro
  2009-12-29 14:08   ` Balbir Singh
@ 2010-01-01 14:59   ` Rik van Riel
  2010-01-03 23:47   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2010-01-01 14:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman,
	KAMEZAWA Hiroyuki

On 12/28/2009 02:48 AM, KOSAKI Motohiro wrote:
> Vmscan folks were asked "why does my system do so much swap-out?"
> on lkml several times.
> Each time, I made a debug patch to show the recent_anon_{scanned/rotated}
> parameters; I have done so at least three times.
>
> Thus, these parameters should be shown in /proc/zoneinfo. That helps
> vmscan folks with debugging.
>
> Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
  2009-12-28  7:47 [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node KOSAKI Motohiro
                   ` (4 preceding siblings ...)
  2009-12-29  6:34 ` Balbir Singh
@ 2010-01-03 18:59 ` Mel Gorman
  2010-01-05  2:04   ` KOSAKI Motohiro
  2010-01-03 23:48 ` KAMEZAWA Hiroyuki
  6 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2010-01-03 18:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki

On Mon, Dec 28, 2009 at 04:47:22PM +0900, KOSAKI Motohiro wrote:
> The zone->lock is one of performance critical locks. Then, it shouldn't
> be hold for long time. Currently, we have four walk_zones_in_node()
> usage and almost use-case don't need to hold zone->lock.
> 
> Thus, this patch move locking responsibility from walk_zones_in_node
> to its sub function. Also this patch kill unnecessary zone->lock taking.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/vmstat.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6051fba..a5d45bc 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
>  {
>  	struct zone *zone;
>  	struct zone *node_zones = pgdat->node_zones;
> -	unsigned long flags;
>  
>  	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
>  		if (!populated_zone(zone))
>  			continue;
>  
> -		spin_lock_irqsave(&zone->lock, flags);
>  		print(m, pgdat, zone);
> -		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
>  }
>  
> @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	unsigned long flags;
>  
>  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  
>  			area = &(zone->free_area[order]);
>  
> +			spin_lock_irqsave(&zone->lock, flags);
>  			list_for_each(curr, &area->free_list[mtype])
>  				freecount++;
> +			spin_unlock_irqrestore(&zone->lock, flags);
> +

It's not clear why you feel this information requires the lock and the
others do not.

For the most part, I agree that the accuracy of the information is
not critical. Assuming partial writes of the data are not a problem,
the information is not going to go so badly out of sync that it would be
noticeable, even if the information is out of date within the zone.

However, inconsistent reads in zoneinfo really could be a problem. I am
concerned that under heavy allocation load "pages free" would not match
"nr_free_pages", for example. Another example: adding all the counters
together may or may not equal the total number of pages in the zone.

Let's say for example there was a subtle bug related to __inc_zone_page_state()
that meant that counters were getting slightly out of sync but it was very
marginal and/or difficult to reproduce. With this patch applied, we could
not be absolutely sure the counters were correct because it could always have
raced with someone holding the zone->lock.

Minimally, I think zoneinfo should be taking the zone lock.

Secondly, has increased zone->lock contention due to reading /proc
really been shown to be a problem? The only situation that I can think
of is a badly-written monitor program that is copying all of /proc
instead of the files of interest. If a monitor program is doing
something like that, it's likely to be incurring performance problems in
a large number of different areas. If that is not the trigger case, what
is?

Thanks

>  			seq_printf(m, "%6lu ", freecount);
>  		}
>  		seq_putc(m, '\n');
> @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  							struct zone *zone)
>  {
>  	int i;
> +

Unnecessary whitespace change.

>  	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
>  	seq_printf(m,
>  		   "\n  pages free     %lu"
> -- 
> 1.6.5.2
> 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/4] vmscan: get_scan_ratio cleanup
  2009-12-28  7:48 ` [PATCH 2/4] vmscan: get_scan_ratio cleanup KOSAKI Motohiro
  2009-12-29  7:34   ` Balbir Singh
  2010-01-01 14:55   ` Rik van Riel
@ 2010-01-03 23:45   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-03 23:45 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman

On Mon, 28 Dec 2009 16:48:06 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> The get_scan_ratio() should contain all scan-ratio related calculations.
> Thus, this patch moves some calculations into get_scan_ratio().
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



* Re: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo
  2009-12-28  7:48 ` [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo KOSAKI Motohiro
  2009-12-29 14:08   ` Balbir Singh
  2010-01-01 14:59   ` Rik van Riel
@ 2010-01-03 23:47   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-03 23:47 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman

On Mon, 28 Dec 2009 16:48:51 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Vmscan folks were asked "why does my system do so much swap-out?"
> on lkml several times.
> Each time, I made a debug patch to show the recent_anon_{scanned/rotated}
> parameters; I have done so at least three times.
>
> Thus, these parameters should be shown in /proc/zoneinfo. That helps
> vmscan folks with debugging.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



* Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
  2009-12-28  7:47 [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node KOSAKI Motohiro
                   ` (5 preceding siblings ...)
  2010-01-03 18:59 ` Mel Gorman
@ 2010-01-03 23:48 ` KAMEZAWA Hiroyuki
  6 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-03 23:48 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Mel Gorman

On Mon, 28 Dec 2009 16:47:22 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> The zone->lock is one of performance critical locks. Then, it shouldn't
> be hold for long time. Currently, we have four walk_zones_in_node()
> usage and almost use-case don't need to hold zone->lock.
> 
> Thus, this patch move locking responsibility from walk_zones_in_node
> to its sub function. Also this patch kill unnecessary zone->lock taking.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



* Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
  2010-01-03 18:59 ` Mel Gorman
@ 2010-01-05  2:04   ` KOSAKI Motohiro
  2010-01-05 10:18     ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2010-01-05  2:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Balbir Singh,
	KAMEZAWA Hiroyuki

Hi

> On Mon, Dec 28, 2009 at 04:47:22PM +0900, KOSAKI Motohiro wrote:
> > The zone->lock is one of performance critical locks. Then, it shouldn't
> > be hold for long time. Currently, we have four walk_zones_in_node()
> > usage and almost use-case don't need to hold zone->lock.
> > 
> > Thus, this patch move locking responsibility from walk_zones_in_node
> > to its sub function. Also this patch kill unnecessary zone->lock taking.
> > 
> > Cc: Mel Gorman <mel@csn.ul.ie>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  mm/vmstat.c |    8 +++++---
> >  1 files changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6051fba..a5d45bc 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> >  {
> >  	struct zone *zone;
> >  	struct zone *node_zones = pgdat->node_zones;
> > -	unsigned long flags;
> >  
> >  	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> >  		if (!populated_zone(zone))
> >  			continue;
> >  
> > -		spin_lock_irqsave(&zone->lock, flags);
> >  		print(m, pgdat, zone);
> > -		spin_unlock_irqrestore(&zone->lock, flags);
> >  	}
> >  }
> >  
> > @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  					pg_data_t *pgdat, struct zone *zone)
> >  {
> >  	int order, mtype;
> > +	unsigned long flags;
> >  
> >  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> >  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  
> >  			area = &(zone->free_area[order]);
> >  
> > +			spin_lock_irqsave(&zone->lock, flags);
> >  			list_for_each(curr, &area->free_list[mtype])
> >  				freecount++;
> > +			spin_unlock_irqrestore(&zone->lock, flags);
> > +
> 
> It's not clear why you feel this information requires the lock and the
> others do not.

I think the above list operation requires the lock to prevent a NULL pointer
access, but the other parts don't protect anything, because memory hotplug
changes them without the zone lock.


> For the most part, I agree that the accuracy of the information is
> not critical. Assuming partial writes of the data are not a problem,
> the information is not going to go so badly out of sync that it would be
> noticeable, even if the information is out of date within the zone.
> 
> However, inconsistent reads in zoneinfo really could be a problem. I am
> concerned that under heavy allocation load "pages free" would not match
> "nr_free_pages", for example. Another example: adding all the counters
> together may or may not equal the total number of pages in the zone.
> 
> Let's say for example there was a subtle bug related to __inc_zone_page_state()
> that meant that counters were getting slightly out of sync but it was very
> marginal and/or difficult to reproduce. With this patch applied, we could
> not be absolutely sure the counters were correct because it could always have
> raced with someone holding the zone->lock.
> 
> Minimally, I think zoneinfo should be taking the zone lock.

Thanks for the many comments.
Hmm, I'd like to clarify your point. My point is that memory hotplug doesn't take
the zone lock, so the zone lock doesn't protect anything. So we have two options:

1) Add the zone lock to memory hotplug
2) Remove the zone lock from zoneinfo

I thought (2) was sufficient. Do you mean you prefer (1)? Or do you prefer to
ignore the rare event (of course, memory hotplug is rare)?


> Secondly, has increased zone->lock contention due to reading /proc
> really been shown to be a problem? The only situation that I can think
> of is a badly-written monitor program that is copying all of /proc
> instead of the files of interest. If a monitor program is doing
> something like that, it's likely to be incurring performance problems in
> a large number of different areas. If that is not the trigger case, what
> is?

Ah, no. I haven't observed such an issue. My point is removing a meaningless lock.


> >  			seq_printf(m, "%6lu ", freecount);
> >  		}
> >  		seq_putc(m, '\n');
> > @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> >  							struct zone *zone)
> >  {
> >  	int i;
> > +
> 
> Unnecessary whitespace change.

Ugh, thanks. That one is my fault.





* Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
  2010-01-05  2:04   ` KOSAKI Motohiro
@ 2010-01-05 10:18     ` Mel Gorman
  2010-01-06  0:33       ` KOSAKI Motohiro
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2010-01-05 10:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki

On Tue, Jan 05, 2010 at 11:04:58AM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > On Mon, Dec 28, 2009 at 04:47:22PM +0900, KOSAKI Motohiro wrote:
> > > The zone->lock is one of performance critical locks. Then, it shouldn't
> > > be hold for long time. Currently, we have four walk_zones_in_node()
> > > usage and almost use-case don't need to hold zone->lock.
> > > 
> > > Thus, this patch move locking responsibility from walk_zones_in_node
> > > to its sub function. Also this patch kill unnecessary zone->lock taking.
> > > 
> > > Cc: Mel Gorman <mel@csn.ul.ie>
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > ---
> > >  mm/vmstat.c |    8 +++++---
> > >  1 files changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 6051fba..a5d45bc 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> > >  {
> > >  	struct zone *zone;
> > >  	struct zone *node_zones = pgdat->node_zones;
> > > -	unsigned long flags;
> > >  
> > >  	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> > >  		if (!populated_zone(zone))
> > >  			continue;
> > >  
> > > -		spin_lock_irqsave(&zone->lock, flags);
> > >  		print(m, pgdat, zone);
> > > -		spin_unlock_irqrestore(&zone->lock, flags);
> > >  	}
> > >  }
> > >  
> > > @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > >  					pg_data_t *pgdat, struct zone *zone)
> > >  {
> > >  	int order, mtype;
> > > +	unsigned long flags;
> > >  
> > >  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > >  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > >  
> > >  			area = &(zone->free_area[order]);
> > >  
> > > +			spin_lock_irqsave(&zone->lock, flags);
> > >  			list_for_each(curr, &area->free_list[mtype])
> > >  				freecount++;
> > > +			spin_unlock_irqrestore(&zone->lock, flags);
> > > +
> > 
> > It's not clear why you feel this information requires the lock and the
> > others do not.
> 
> I think the above list operation requires the lock to prevent a NULL pointer
> access, but the other parts don't protect anything, because memory hotplug
> changes them without the zone lock.
> 

True. Add a comment explaining that. I considered list_for_each_safe()
but it wouldn't work in all cases.

> 
> > For the most part, I agree that the accuracy of the information is
> > not critical. Assuming partial writes of the data are not a problem,
> > the information is not going to go so badly out of sync that it would be
> > noticeable, even if the information is out of date within the zone.
> > 
> > However, inconsistent reads in zoneinfo really could be a problem. I am
> > concerned that under heavy allocation load "pages free" would not match
> > "nr_free_pages", for example. Another example: adding all the counters
> > together may or may not equal the total number of pages in the zone.
> > 
> > Let's say for example there was a subtle bug related to __inc_zone_page_state()
> > that meant that counters were getting slightly out of sync but it was very
> > marginal and/or difficult to reproduce. With this patch applied, we could
> > not be absolutely sure the counters were correct because it could always have
> > raced with someone holding the zone->lock.
> > 
> > Minimally, I think zoneinfo should be taking the zone lock.
> 
> Thanks for the many comments.
> Hmm, I'd like to clarify your point. My point is that memory hotplug doesn't take
> the zone lock, so the zone lock doesn't protect anything. So we have two options:
>
> 1) Add the zone lock to memory hotplug
> 2) Remove the zone lock from zoneinfo
>
> I thought (2) was sufficient. Do you mean you prefer (1)? Or do you prefer to
> ignore the rare event (of course, memory hotplug is rare)?
> 

I think (2) will make zoneinfo harder to use for examining all the counters
properly as I explained above. I haven't looked at memory-hotplug in a
while but IIRC, fields like present_pages should be protected by a lock on
the pgdat and a seq lock on the zone. If this is not true at the moment,
it is a problem.

For the free lists, memory hotplug should be taking the zone->lock properly as
the final stage of onlining memory is to walk the sections being hot-added,
init the memmap and then __free_page() each page individually - i.e. the
normal free path.

So, if memory hotplug is not protected by proper locking, it's not intentional.

> 
> > Secondly, has increased zone->lock contention due to reading /proc
> > really been shown to be a problem? The only situation that I can think
> > of is a badly-written monitor program that is copying all of /proc
> > instead of the files of interest. If a monitor program is doing
> > something like that, it's likely to be incurring performance problems in
> > a large number of different areas. If that is not the trigger case, what
> > is?
> 
> Ah, no. I haven't observed such an issue. My point is removing a meaningless lock.
> 

Then I believe the zone->lock should be preserved so that all entries in
/proc/zoneinfo are consistent.

> 
> > >  			seq_printf(m, "%6lu ", freecount);
> > >  		}
> > >  		seq_putc(m, '\n');
> > > @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> > >  							struct zone *zone)
> > >  {
> > >  	int i;
> > > +
> > 
> > Unnecessary whitespace change.
> 
> Ugh, thanks. That one is my fault.
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
  2010-01-05 10:18     ` Mel Gorman
@ 2010-01-06  0:33       ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2010-01-06  0:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Balbir Singh,
	KAMEZAWA Hiroyuki

> > Thanks for the many comments.
> > Hmm, I'd like to clarify your point. My point is that memory hotplug doesn't take
> > the zone lock, so the zone lock doesn't protect anything. So we have two options:
> >
> > 1) Add the zone lock to memory hotplug
> > 2) Remove the zone lock from zoneinfo
> >
> > I thought (2) was sufficient. Do you mean you prefer (1)? Or do you prefer to
> > ignore the rare event (of course, memory hotplug is rare)?
> > 
> 
> I think (2) will make zoneinfo harder to use for examining all the counters
> properly as I explained above. I haven't looked at memory-hotplug in a
> while but IIRC, fields like present_pages should be protected by a lock on
> the pgdat and a seq lock on the zone. If this is not true at the moment,
> it is a problem.
> 
> For the free lists, memory hotplug should be taking the zone->lock properly as
> the final stage of onlining memory is to walk the sections being hot-added,
> init the memmap and then __free_page() each page individually - i.e. the
> normal free path.
> 
> So, if memory hotplug is not protected by proper locking, it's not intentional.

OK, I'll drop this patch. Thanks.





end of thread, other threads:[~2010-01-06  0:33 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-28  7:47 [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node KOSAKI Motohiro
2009-12-28  7:48 ` [PATCH 2/4] vmscan: get_scan_ratio cleanup KOSAKI Motohiro
2009-12-29  7:34   ` Balbir Singh
2009-12-30 13:06     ` KOSAKI Motohiro
2010-01-01 14:55   ` Rik van Riel
2010-01-03 23:45   ` KAMEZAWA Hiroyuki
2009-12-28  7:48 ` [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo KOSAKI Motohiro
2009-12-29 14:08   ` Balbir Singh
2009-12-30 13:13     ` KOSAKI Motohiro
2010-01-01 14:59   ` Rik van Riel
2010-01-03 23:47   ` KAMEZAWA Hiroyuki
2009-12-28  7:49 ` [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file KOSAKI Motohiro
2009-12-29 14:09   ` Balbir Singh
2009-12-30 12:53     ` KOSAKI Motohiro
2009-12-29  4:05 ` [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node Minchan Kim
2009-12-29  6:34 ` Balbir Singh
2009-12-30 12:49   ` KOSAKI Motohiro
2010-01-03 18:59 ` Mel Gorman
2010-01-05  2:04   ` KOSAKI Motohiro
2010-01-05 10:18     ` Mel Gorman
2010-01-06  0:33       ` KOSAKI Motohiro
2010-01-03 23:48 ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).