* [PATCH v5 0/6]  memg: better numa scanning
@ 2011-08-09 10:04 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:04 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, Michal Hocko, hannes, nishimura


No major updates since the last version, which I posted on 27/Jul.
The patch set is rebased onto mmotm-Aug3.

This patch set implements victim node selection logic and some
behavior fixes in vmscan.c for memcg.
The logic calculates a 'weight' for each node, and a victim node
is selected by comparing weights in a fair manner.
The core question is how to calculate the 'weight'; this series
implements a calculation which makes use of recent rotation statistics
and the amount of file cache and inactive anon pages.
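
Roughly, the idea is the following (pseudo-code only; the names here are
illustrative, the real implementation is in patches 4/6 and 5/6):

    /* per-node weight from LRU amounts and recent reclaim stats */
    weight[nid] = (anon[nid] * anon_prio + file[nid] * file_prio) / 200;
    total_weight = sum of weight[nid] over scan_nodes;
    /* a victim node is then drawn with probability weight[nid] / total_weight */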

I'll be absent from 12/Aug to 17/Aug, so I'm sorry if my responses are
delayed.

This time, I ran a 'kernel make' test as follows:
==
#!/bin/bash -x

# limit memcg A to 500M
cgset -r memory.limit_in_bytes=500M A

# start from a clean tree with cold caches
make -j 4 clean
sync
sync
sync
echo 3 > /proc/sys/vm/drop_caches
sleep 1
# reset the per-memcg reclaim statistics
echo 0 > /cgroup/memory/A/memory.vmscan_stat
# build under memcg A and cpuset A
cgexec -g memory:A -g cpuset:A time make -j 8
==

This was run on an 8-cpu, 4-node fake-NUMA box (each node has 2 cpus).

Under the 500M limit, 'make' needs memory to be scanned and reclaimed,
so this test shows how vmscan behaves.
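The per-memcg counters shown below are read from
/cgroup/memory/A/memory.vmscan_stat after each run (the same file the
script resets to zero before the build).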

With cpuset.memory_spread_page==0:

[Before patch]
773.07user 305.45system 4:09.64elapsed 432%CPU (0avgtext+0avgdata 1456576maxresident)k
4397944inputs+5093232outputs (9688major+35689066minor)pagefaults 0swaps
scanned_pages_by_limit 3867645
scanned_anon_pages_by_limit 1518266
scanned_file_pages_by_limit 2349379
rotated_pages_by_limit 1502640
rotated_anon_pages_by_limit 1416627
rotated_file_pages_by_limit 86013
freed_pages_by_limit 1005141
freed_anon_pages_by_limit 24577
freed_file_pages_by_limit 980564
elapsed_ns_by_limit 82833866094

[Patched]
773.73user 305.09system 3:51.28elapsed 466%CPU (0avgtext+0avgdata 1458464maxresident)k
4400264inputs+4797056outputs (5578major+35690202minor)pagefaults 0swaps

scanned_pages_by_limit 4326462
scanned_anon_pages_by_limit 1310619
scanned_file_pages_by_limit 3015843
rotated_pages_by_limit 1264223
rotated_anon_pages_by_limit 1247180
rotated_file_pages_by_limit 17043
freed_pages_by_limit 1003434
freed_anon_pages_by_limit 20599
freed_file_pages_by_limit 982835
elapsed_ns_by_limit 42495200307

Elapsed time for vmscan and the number of major page faults are reduced.
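(elapsed_ns_by_limit drops from about 82.8s to 42.5s, roughly a 49%
reduction, and major faults go from 9688 to 5578.)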


With cpuset.memory_spread_page==1, file cache is spread across all nodes
in round-robin fashion.
==
[Before Patch + cpuset spread=1]
773.23user 309.55system 4:26.83elapsed 405%CPU (0avgtext+0avgdata 1457696maxresident)k
5400928inputs+5105368outputs (17344major+35735886minor)pagefaults 0swaps

scanned_pages_by_limit 3731787
scanned_anon_pages_by_limit 1374310
scanned_file_pages_by_limit 2357477
rotated_pages_by_limit 1403160
rotated_anon_pages_by_limit 1293568
rotated_file_pages_by_limit 109592
freed_pages_by_limit 1120828
freed_anon_pages_by_limit 20076
freed_file_pages_by_limit 1100752
elapsed_ns_by_limit 82458981267


[Patched + cpuset spread=1]
773.56user 306.02system 3:52.28elapsed 464%CPU (0avgtext+0avgdata 1458160maxresident)k
4173504inputs+4783688outputs (5971major+35666498minor)pagefaults 0swaps

scanned_pages_by_limit 2672392
scanned_anon_pages_by_limit 1140069
scanned_file_pages_by_limit 1532323
rotated_pages_by_limit 1108124
rotated_anon_pages_by_limit 1088982
rotated_file_pages_by_limit 19142
freed_pages_by_limit 975653
freed_anon_pages_by_limit 12578
freed_file_pages_by_limit 963075
elapsed_ns_by_limit 46482588602

Elapsed time for vmscan and the number of major page faults are reduced.
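(elapsed_ns_by_limit drops from about 82.5s to 46.5s, roughly a 44%
reduction, and major faults go from 17344 to 5971.)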

Thanks,
-Kame







* [PATCH v5 1/6]  memg: better numa scanning
  2011-08-09 10:04 ` KAMEZAWA Hiroyuki
@ 2011-08-09 10:08   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, akpm, Michal Hocko, hannes, nishimura


Make memcg's NUMA scanning information update run via schedule_work().

Currently, memcg's NUMA information is updated by a thread that is doing
memory reclaim. That is not very heavyweight yet, but upcoming changes
around NUMA scanning will add more work. This patch moves the update into
schedule_work() and reduces the latency caused by these updates.
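
Note that the memcg is pinned across the deferred update: css_get() is
called before schedule_work() and the work handler drops the reference
with css_put() when it finishes, so the memcg cannot go away while the
update is pending.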

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

Index: mmotm-Aug3/mm/memcontrol.c
===================================================================
--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c
@@ -285,6 +285,7 @@ struct mem_cgroup {
 	nodemask_t	scan_nodes;
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
+	struct work_struct	numainfo_update_work;
 #endif
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
@@ -1567,6 +1568,23 @@ static bool test_mem_cgroup_node_reclaim
 }
 #if MAX_NUMNODES > 1
 
+static void mem_cgroup_numainfo_update_work(struct work_struct *work)
+{
+	struct mem_cgroup *memcg;
+	int nid;
+
+	memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
+
+	memcg->scan_nodes = node_states[N_HIGH_MEMORY];
+	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+		if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
+			node_clear(nid, memcg->scan_nodes);
+	}
+	atomic_set(&memcg->numainfo_updating, 0);
+	css_put(&memcg->css);
+}
+
+
 /*
  * Always updating the nodemask is not very good - even if we have an empty
  * list or the wrong list here, we can start from some node and traverse all
@@ -1575,7 +1593,6 @@ static bool test_mem_cgroup_node_reclaim
  */
 static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
 {
-	int nid;
 	/*
 	 * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
 	 * pagein/pageout changes since the last update.
@@ -1584,18 +1601,9 @@ static void mem_cgroup_may_update_nodema
 		return;
 	if (atomic_inc_return(&mem->numainfo_updating) > 1)
 		return;
-
-	/* make a nodemask where this memcg uses memory from */
-	mem->scan_nodes = node_states[N_HIGH_MEMORY];
-
-	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
-
-		if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
-			node_clear(nid, mem->scan_nodes);
-	}
-
 	atomic_set(&mem->numainfo_events, 0);
-	atomic_set(&mem->numainfo_updating, 0);
+	css_get(&mem->css);
+	schedule_work(&mem->numainfo_update_work);
 }
 
 /*
@@ -1668,6 +1676,12 @@ bool mem_cgroup_reclaimable(struct mem_c
 	return false;
 }
 
+static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+{
+	INIT_WORK(&memcg->numainfo_update_work,
+		mem_cgroup_numainfo_update_work);
+}
+
 #else
 int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
 {
@@ -1678,6 +1692,9 @@ bool mem_cgroup_reclaimable(struct mem_c
 {
 	return test_mem_cgroup_node_reclaimable(mem, 0, noswap);
 }
+static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 static void __mem_cgroup_record_scanstat(unsigned long *stats,
@@ -5097,6 +5114,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
 	spin_lock_init(&mem->scanstat.lock);
+	mem_cgroup_numascan_init(mem);
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);



* [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-09 10:04 ` KAMEZAWA Hiroyuki
@ 2011-08-09 10:09   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, akpm, Michal Hocko, hannes, nishimura

memcg: avoid node fallback scan if possible.

Currently, try_to_free_pages() scans all zonelists because the page
allocator should visit all zonelists... but that behavior is harmful for
memcg. Memcg scans memory only because it hit its limit; there is no
memory shortage in the passed zonelist.

For example, with the following unbalanced nodes:

     Node 0    Node 1
File 1G        0
Anon 200M      200M

memcg will cause swap-out from Node 1 at every vmscan.
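This happens because vmscan falls back through the whole zonelist even
though Node 0 alone has plenty of reclaimable file cache; every pass also
visits Node 1, where the only reclaimable pages are anon, hence the
swap-out.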

As another example, assume a 1024-node system. There, memcg will visit
pages on all 1024 nodes per vmscan... This is overkill.

This is why memcg's victim node selection logic doesn't work
as expected.

This patch helps stop vmscan once enough pages have been reclaimed.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/vmscan.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: mmotm-Aug3/mm/vmscan.c
===================================================================
--- mmotm-Aug3.orig/mm/vmscan.c
+++ mmotm-Aug3/mm/vmscan.c
@@ -2124,6 +2124,16 @@ static void shrink_zones(int priority, s
 		}
 
 		shrink_zone(priority, zone, sc);
+		if (!scanning_global_lru(sc)) {
+			/*
+			 * When we do scan for memcg's limit, it's bad to do
+			 * When we scan for memcg's limit, it's bad to fall
+			 * back into more nodes/zones because there is no
+			 * memory shortage. We quit as soon as possible once
+			 * we reach the target.
+			if (sc->nr_to_reclaim <= sc->nr_reclaimed)
+				break;
+		}
 	}
 }
 



* [PATCH v5 3/6]  memg: vmscan pass nodemask
  2011-08-09 10:04 ` KAMEZAWA Hiroyuki
@ 2011-08-09 10:10   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, akpm, Michal Hocko, hannes, nishimura


Pass memcg's nodemask to try_to_free_pages().

try_to_free_pages() can take a nodemask as an argument, but memcg doesn't
pass one. Considering that memcg can be used together with cpuset on big
NUMA machines, memcg should pass the nodemask when it is available.

Memcg now maintains the nodemask with periodic updates, so pass it.
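
With this change, mem_cgroup_select_victim_node() also hands back a
pointer to memcg->scan_nodes through its new second argument; when the
mask is not usable (the empty-scan_nodes fallback to the current node),
sc.nodemask stays NULL and placement is unrestricted, as before.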

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Changelog:
 - fixed bugs to pass nodemask.
---
 include/linux/memcontrol.h |    2 +-
 mm/memcontrol.c            |    8 ++++++--
 mm/vmscan.c                |    4 ++--
 3 files changed, 9 insertions(+), 5 deletions(-)

Index: mmotm-Aug3/include/linux/memcontrol.h
===================================================================
--- mmotm-Aug3.orig/include/linux/memcontrol.h
+++ mmotm-Aug3/include/linux/memcontrol.h
@@ -118,7 +118,7 @@ extern void mem_cgroup_end_migration(str
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
Index: mmotm-Aug3/mm/memcontrol.c
===================================================================
--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c
@@ -1618,10 +1618,11 @@ static void mem_cgroup_may_update_nodema
  *
  * Now, we use round-robin. Better algorithm is welcomed.
  */
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
 {
 	int node;
 
+	*mask = NULL;
 	mem_cgroup_may_update_nodemask(mem);
 	node = mem->last_scanned_node;
 
@@ -1636,6 +1637,8 @@ int mem_cgroup_select_victim_node(struct
 	 */
 	if (unlikely(node == MAX_NUMNODES))
 		node = numa_node_id();
+	else
+		*mask = &mem->scan_nodes;
 
 	mem->last_scanned_node = node;
 	return node;
@@ -1683,8 +1686,9 @@ static void mem_cgroup_numascan_init(str
 }
 
 #else
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
 {
+	*mask = NULL;
 	return 0;
 }
 
Index: mmotm-Aug3/mm/vmscan.c
===================================================================
--- mmotm-Aug3.orig/mm/vmscan.c
+++ mmotm-Aug3/mm/vmscan.c
@@ -2354,7 +2354,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.memcg_record = rec,
-		.nodemask = NULL, /* we don't care the placement */
+		.nodemask = NULL,
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 	};
@@ -2368,7 +2368,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont);
+	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask);
 
 	zonelist = NODE_DATA(nid)->node_zonelists;
 



* [PATCH v5 4/6]  memg: calculate numa weight for vmscan
  2011-08-09 10:04 ` KAMEZAWA Hiroyuki
@ 2011-08-09 10:11   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, akpm, Michal Hocko, hannes, nishimura

Calculate node scan weight.

Currently, the memory cgroup selects a scan target node in round-robin
order. That's not very good... there is no scheduling based on page usage.

This patch calculates each node's weight for scanning.
If a node's weight is high, the node is worth scanning.

The weight is calculated based on the following concepts:

   - Make use of swappiness.
   - If inactive file is enough, ignore active file.
   - If file is enough (w.r.t. swappiness), ignore anon.
   - Make use of recent scanned/rotated reclaim stats.

Then, a node that contains many inactive file pages will be the first
victim. Node selection logic based on this weight comes in the next patch.
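
As a worked example with illustrative numbers (assuming the default
swappiness of 60, so anon_prio=61 and file_prio=140): a node with 100000
reclaimable file pages (recently scanned=20000, rotated=1000) and 50000
anon pages (scanned=10000, rotated=8000) has file scaled to about 95000
and anon to about 10000, giving a weight of
(10000*61 + 95000*140)/200 = 69550. A node whose pages are mostly
re-activated (rotated close to scanned) ends up with a much smaller weight.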

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  110 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 105 insertions(+), 5 deletions(-)

Index: mmotm-Aug3/mm/memcontrol.c
===================================================================
--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c
@@ -144,6 +144,7 @@ struct mem_cgroup_per_zone {
 
 struct mem_cgroup_per_node {
 	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+	unsigned long weight;
 };
 
 struct mem_cgroup_lru_info {
@@ -286,6 +287,7 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 	struct work_struct	numainfo_update_work;
+	unsigned long total_weight;
 #endif
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
@@ -1568,18 +1570,108 @@ static bool test_mem_cgroup_node_reclaim
 }
 #if MAX_NUMNODES > 1
 
+static unsigned long
+__mem_cgroup_calc_numascan_weight(struct mem_cgroup * memcg,
+				int nid,
+				unsigned long anon_prio,
+				unsigned long file_prio,
+				int lru_mask)
+{
+	u64 file, anon;
+	unsigned long weight;
+	unsigned long rotated[2], scanned[2];
+	int zid;
+
+	scanned[0] = 0;
+	scanned[1] = 0;
+	rotated[0] = 0;
+	rotated[1] = 0;
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		struct mem_cgroup_per_zone *mz;
+
+		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+		scanned[0] += mz->reclaim_stat.recent_scanned[0];
+		scanned[1] += mz->reclaim_stat.recent_scanned[1];
+		rotated[0] += mz->reclaim_stat.recent_rotated[0];
+		rotated[1] += mz->reclaim_stat.recent_rotated[1];
+	}
+	file = mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask & LRU_ALL_FILE);
+
+	if (total_swap_pages)
+		anon = mem_cgroup_node_nr_lru_pages(memcg,
+					nid, lru_mask & LRU_ALL_ANON);
+	else
+		anon = 0;
+	if (!(file + anon))
+		node_clear(nid, memcg->scan_nodes);
+
+	/* '(scanned - rotated) / scanned' is the ratio of pages found not active. */
+	anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
+	file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);
+
+	weight = (anon * anon_prio + file * file_prio) / 200;
+	return weight;
+}
+
+/*
+ * Calculate each NUMA node's scan weight. scan weight is determined by
+ * amount of pages and recent scan ratio, swappiness.
+ */
+static unsigned long
+mem_cgroup_calc_numascan_weight(struct mem_cgroup *memcg)
+{
+	unsigned long weight, total_weight;
+	u64 anon_prio, file_prio, nr_anon, nr_file;
+	int lru_mask;
+	int nid;
+
+	anon_prio = mem_cgroup_swappiness(memcg) + 1;
+	file_prio = 200 - anon_prio + 1;
+
+	lru_mask = BIT(LRU_INACTIVE_FILE);
+	if (mem_cgroup_inactive_file_is_low(memcg))
+		lru_mask |= BIT(LRU_ACTIVE_FILE);
+	/*
+	 * In vmscan.c, we'll scan anonymous pages with regard to memcg/zone's
+	 * amounts of file/anon pages and swappiness and reclaim_stat. Here,
+	 * we try to find a good node to be scanned. If the memcg contains enough
+	 * file caches, we'll ignore anon's weight.
+	 * (Note) scanning an anon-only node tends to be a waste of time.
+	 */
+
+	nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
+	nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
+
+	/* If file cache is small w.r.t swappiness, check anon page's weight */
+	if (nr_file * file_prio >= nr_anon * anon_prio)
+		lru_mask |= BIT(LRU_INACTIVE_ANON);
+
+	total_weight = 0;
+
+	for_each_node_state(nid, N_HIGH_MEMORY) {
+		weight = __mem_cgroup_calc_numascan_weight(memcg,
+				nid, anon_prio, file_prio, lru_mask);
+		memcg->info.nodeinfo[nid]->weight = weight;
+		total_weight += weight;
+	}
+
+	return total_weight;
+}
+
+/*
+ * Update all node's scan weight in background.
+ */
 static void mem_cgroup_numainfo_update_work(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
-	int nid;
 
 	memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
 
 	memcg->scan_nodes = node_states[N_HIGH_MEMORY];
-	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
-		if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
-			node_clear(nid, memcg->scan_nodes);
-	}
+
+	memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
+
 	atomic_set(&memcg->numainfo_updating, 0);
 	css_put(&memcg->css);
 }
@@ -4277,6 +4369,14 @@ static int mem_control_numa_stat_show(st
 		seq_printf(m, " N%d=%lu", nid, node_nr);
 	}
 	seq_putc(m, '\n');
+
+	seq_printf(m, "scan_weight=%lu", mem_cont->total_weight);
+	for_each_node_state(nid, N_HIGH_MEMORY) {
+		unsigned long weight;
+		weight = mem_cont->info.nodeinfo[nid]->weight;
+		seq_printf(m, " N%d=%lu", nid, weight);
+	}
+	seq_putc(m, '\n');
 	return 0;
 }
 #endif /* CONFIG_NUMA */



* [PATCH v5 5/6]  memg: vmscan select victim node by weight
  2011-08-09 10:04 ` KAMEZAWA Hiroyuki
@ 2011-08-09 10:12   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, akpm, Michal Hocko, hannes, nishimura


This patch implements node selection logic based on each node's weight.

This patch adds a new array, numascan_tickets[]. The array holds each
node's scan weight as a tuple of two values, conceptually:

    for (i = 0, total_weight = 0; i < nodes; i++) {
        weight = node->weight;
        numascan_tickets[i].start = total_weight;
        numascan_tickets[i].length = weight;
        total_weight += weight;
    }

After this, lottery logic draws a ticket (in the code, random32() masked
to the 16-bit ticket space) and bsearch(ticket, numascan_tickets[]) finds
the node whose [start, start + length) range contains the ticket.
(This is lottery scheduling.)

With this, a node is selected in a fair manner, in proportion to its
weight.
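
For example (illustrative numbers), with four nodes whose weights are
60000, 20000, 15000 and 5000 (total 100000), the 16-bit ticket space is
split into roughly 39300, 13100, 9800 and 3300 tickets, so
random32() & 0xffff picks node 0 about 60% of the time, node 1 about 20%,
and so on.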

This patch improves the scan time. The following is a test result of
kernel-make on 4-node fake-NUMA under a 500M limit, with 8 cpus
(2 cpus per node).

[Before patch]
 772.52user 305.67system 4:11.48elapsed 428%CPU
 (0avgtext+0avgdata 1457264maxresident)k
 4797592inputs+5483240outputs (12550major+35707629minor)pagefaults 0swaps

[After patch]
 773.73user 305.09system 3:51.28elapsed 466%CPU
 (0avgtext+0avgdata 1458464maxresident)k
 4400264inputs+4797056outputs (5578major+35690202minor)pagefaults 0swaps

Elapsed time and major faults are reduced.

Here, vmscan_stat shows

[Before patch]
  scanned_pages_by_limit 3926782
  scanned_anon_pages_by_limit 1511090
  scanned_file_pages_by_limit 2415692
  elapsed_ns_by_limit 69528714562

[After patch]
  scanned_pages_by_limit 4326462
  scanned_anon_pages_by_limit 1310619
  scanned_file_pages_by_limit 3015843
  elapsed_ns_by_limit 42495200307

This patch helps scan file caches rather than anon pages,
and the elapsed time is much reduced.
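(elapsed_ns_by_limit drops from about 69.5s to 42.5s, roughly a 39%
reduction, while major faults go from 12550 to 5578.)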

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    3 
 mm/memcontrol.c            |  150 ++++++++++++++++++++++++++++++++++++++-------
 mm/vmscan.c                |    4 -
 3 files changed, 131 insertions(+), 26 deletions(-)

Index: mmotm-Aug3/mm/memcontrol.c
===================================================================
--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c
@@ -49,6 +49,9 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/random.h>
+#include <linux/bsearch.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -151,6 +154,11 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+struct numascan_ticket {
+	int nid;
+	unsigned int start, tickets;
+};
+
 /*
  * Cgroups above their limits are maintained in a RB-Tree, independent of
  * their hierarchy representation
@@ -287,7 +295,10 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 	struct work_struct	numainfo_update_work;
-	unsigned long total_weight;
+	unsigned long	total_weight;
+	int		numascan_generation;
+	int		numascan_tickets_num[2];
+	struct numascan_ticket *numascan_tickets[2];
 #endif
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
@@ -1660,6 +1671,46 @@ mem_cgroup_calc_numascan_weight(struct m
 }
 
 /*
+ * For lottery scheduling, this routine distributes "tickets" for
+ * scanning to each node. The tickets are recorded into the numascan_ticket
+ * array, and this array will be used for scheduling later.
+ * To make the lottery fair, we limit the sum of tickets to almost 0xffff.
+ * Later, random32() & 0xffff will do a proportionally fair lottery.
+ */
+#define NUMA_TICKET_SHIFT	(16)
+#define NUMA_TICKET_FACTOR	((1 << NUMA_TICKET_SHIFT) - 1)
+static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
+{
+	struct numascan_ticket *nt;
+	unsigned int node_ticket, assigned_tickets;
+	u64 weight;
+	int nid, assigned_num, generation;
+
+	/* update ticket information by double buffering */
+	generation = memcg->numascan_generation ^ 0x1;
+
+	nt = memcg->numascan_tickets[generation];
+	assigned_tickets = 0;
+	assigned_num = 0;
+	for_each_node_mask(nid, memcg->scan_nodes) {
+		weight = memcg->info.nodeinfo[nid]->weight;
+		node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
+					memcg->total_weight + 1);
+		if (!node_ticket)
+			node_ticket = 1;
+		nt->nid = nid;
+		nt->start = assigned_tickets;
+		nt->tickets = node_ticket;
+		assigned_tickets += node_ticket;
+		nt++;
+		assigned_num++;
+	}
+	memcg->numascan_tickets_num[generation] = assigned_num;
+	smp_wmb();
+	memcg->numascan_generation = generation;
+}
+
+/*
  * Update all node's scan weight in background.
  */
 static void mem_cgroup_numainfo_update_work(struct work_struct *work)
@@ -1672,6 +1723,8 @@ static void mem_cgroup_numainfo_update_w
 
 	memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
 
+	synchronize_rcu();
+	mem_cgroup_update_numascan_tickets(memcg);
 	atomic_set(&memcg->numainfo_updating, 0);
 	css_put(&memcg->css);
 }
@@ -1698,6 +1751,18 @@ static void mem_cgroup_may_update_nodema
 	schedule_work(&mem->numainfo_update_work);
 }
 
+static int node_weight_compare(const void *key, const void *elt)
+{
+	unsigned long lottery = (unsigned long)key;
+	struct numascan_ticket *nt = (struct numascan_ticket *)elt;
+
+	if (lottery < nt->start)
+		return -1;
+	if (lottery > (nt->start + nt->tickets))
+		return 1;
+	return 0;
+}
+
 /*
  * Selecting a node where we start reclaim from. Because what we need is just
  * reducing usage counter, start from anywhere is O,K. Considering
@@ -1707,32 +1772,38 @@ static void mem_cgroup_may_update_nodema
  * we'll use or we've used. So, it may make LRU bad. And if several threads
  * hit limits, it will see a contention on a node. But freeing from remote
  * node means more costs for memory reclaim because of memory latency.
- *
- * Now, we use round-robin. Better algorithm is welcomed.
  */
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+				struct memcg_scanrecord *rec)
 {
-	int node;
+	int node = MAX_NUMNODES;
+	struct numascan_ticket *nt;
+	unsigned long lottery;
+	int generation;
 
+	if (rec->context == SCAN_BY_SHRINK)
+		goto out;
+
+	mem_cgroup_may_update_nodemask(memcg);
 	*mask = NULL;
-	mem_cgroup_may_update_nodemask(mem);
-	node = mem->last_scanned_node;
+	lottery = random32() & NUMA_TICKET_FACTOR;
 
-	node = next_node(node, mem->scan_nodes);
-	if (node == MAX_NUMNODES)
-		node = first_node(mem->scan_nodes);
-	/*
-	 * We call this when we hit limit, not when pages are added to LRU.
-	 * No LRU may hold pages because all pages are UNEVICTABLE or
-	 * memcg is too small and all pages are not on LRU. In that case,
-	 * we use curret node.
-	 */
-	if (unlikely(node == MAX_NUMNODES))
+	rcu_read_lock();
+	generation = memcg->numascan_generation;
+	nt = bsearch((void *)lottery,
+		memcg->numascan_tickets[generation],
+		memcg->numascan_tickets_num[generation],
+		sizeof(struct numascan_ticket), node_weight_compare);
+	rcu_read_unlock();
+	if (nt)
+		node = nt->nid;
+out:
+	if (unlikely(node == MAX_NUMNODES)) {
 		node = numa_node_id();
-	else
-		*mask = &mem->scan_nodes;
+		*mask = NULL;
+	} else
+		*mask = &memcg->scan_nodes;
 
-	mem->last_scanned_node = node;
 	return node;
 }
 
@@ -1771,14 +1842,42 @@ bool mem_cgroup_reclaimable(struct mem_c
 	return false;
 }
 
-static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+static bool mem_cgroup_numascan_init(struct mem_cgroup *memcg)
 {
+	struct numascan_ticket *nt;
+	int nr_nodes;
+
 	INIT_WORK(&memcg->numainfo_update_work,
 		mem_cgroup_numainfo_update_work);
+
+	nr_nodes = num_possible_nodes();
+	nt = kmalloc(sizeof(struct numascan_ticket) * nr_nodes,
+			GFP_KERNEL);
+	if (!nt)
+		return false;
+	memcg->numascan_tickets[0] = nt;
+	nt = kmalloc(sizeof(struct numascan_ticket) * nr_nodes,
+			GFP_KERNEL);
+	if (!nt) {
+		kfree(memcg->numascan_tickets[0]);
+		memcg->numascan_tickets[0] = NULL;
+		return false;
+	}
+	memcg->numascan_tickets[1] = nt;
+	memcg->numascan_tickets_num[0] = 0;
+	memcg->numascan_tickets_num[1] = 0;
+	return true;
+}
+
+static void mem_cgroup_numascan_free(struct mem_cgroup *memcg)
+{
+	kfree(memcg->numascan_tickets[0]);
+	kfree(memcg->numascan_tickets[1]);
 }
 
 #else
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask,
+				struct memcg_scanrecord *rec)
 {
 	*mask = NULL;
 	return 0;
@@ -1791,6 +1890,9 @@ bool mem_cgroup_reclaimable(struct mem_c
 static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
 {
 }
+static void mem_cgroup_numascan_free(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 static void __mem_cgroup_record_scanstat(unsigned long *stats,
@@ -5080,6 +5182,7 @@ static void __mem_cgroup_free(struct mem
 	int node;
 
 	mem_cgroup_remove_from_trees(mem);
+	mem_cgroup_numascan_free(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -5218,7 +5321,8 @@ mem_cgroup_create(struct cgroup_subsys *
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
 	spin_lock_init(&mem->scanstat.lock);
-	mem_cgroup_numascan_init(mem);
+	if (!mem_cgroup_numascan_init(mem))
+		goto free_out;
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
Index: mmotm-Aug3/mm/vmscan.c
===================================================================
--- mmotm-Aug3.orig/mm/vmscan.c
+++ mmotm-Aug3/mm/vmscan.c
@@ -2378,9 +2378,9 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask);
+	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask, rec);
 
-	zonelist = NODE_DATA(nid)->node_zonelists;
+	zonelist = &NODE_DATA(nid)->node_zonelists[0];
 
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
Index: mmotm-Aug3/include/linux/memcontrol.h
===================================================================
--- mmotm-Aug3.orig/include/linux/memcontrol.h
+++ mmotm-Aug3/include/linux/memcontrol.h
@@ -118,7 +118,8 @@ extern void mem_cgroup_end_migration(str
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask);
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+				struct memcg_scanrecord *rec);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH v5 5/6]  memg: vmscan select victim node by weight
@ 2011-08-09 10:12   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, akpm, Michal Hocko, hannes, nishimura


This patch implements a node selection logic based on each node's weight.

This patch adds a new array of nodescan_tickets[]. This array holds
each node's scan weight in a tuple of 2 values. as

    for (i = 0, total_weight = 0; i < nodes; i++) {
        weight = node->weight;
        nodescan_tickets[i].start = total_weight;
        nodescan_tickets[i].length = weight;
    }

After this, a lottery logic as 'ticket = random32()/total_weight'
will make a ticket and bserach(ticket, nodescan_tickets[])
will find a node which holds [start, length] contains ticket.
(This is a lottery scheduling.)

By this, node will be selected in fair manner proportinal to
its weight.

This patch improve the scan time. Following is a test result
ot kernel-make on 4-node fake-numa under 500M limit, with 8cpus.
2cpus per node.

[Before patch]
 772.52user 305.67system 4:11.48elapsed 428%CPU
 (0avgtext+0avgdata 1457264maxresident)k
 4797592inputs+5483240outputs (12550major+35707629minor)pagefaults 0swaps

[After patch]
 773.73user 305.09system 3:51.28elapsed 466%CPU
 (0avgtext+0avgdata 1458464maxresident)k
 4400264inputs+4797056outputs (5578major+35690202minor)pagefaults 0swaps

elapsed time and major faults are reduced.

Here, vmscan_stat shows

[Before patch]
  scanned_pages_by_limit 3926782
  scanned_anon_pages_by_limit 1511090
  scanned_file_pages_by_limit 2415692
  elapsed_ns_by_limit 69528714562

[After patch]
  scanned_pages_by_limit 4326462
  scanned_anon_pages_by_limit 1310619
  scanned_file_pages_by_limit 3015843
  elapsed_ns_by_limit 42495200307

This patch helps to scan file caches rather than scanning anon.
and elapsed time is much reduced.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    3 
 mm/memcontrol.c            |  150 ++++++++++++++++++++++++++++++++++++++-------
 mm/vmscan.c                |    4 -
 3 files changed, 131 insertions(+), 26 deletions(-)

Index: mmotm-Aug3/mm/memcontrol.c
===================================================================
--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c
@@ -49,6 +49,9 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/random.h>
+#include <linux/bsearch.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -151,6 +154,11 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+struct numascan_ticket {
+	int nid;
+	unsigned int start, tickets;
+};
+
 /*
  * Cgroups above their limits are maintained in a RB-Tree, independent of
  * their hierarchy representation
@@ -287,7 +295,10 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 	struct work_struct	numainfo_update_work;
-	unsigned long total_weight;
+	unsigned long	total_weight;
+	int		numascan_generation;
+	int		numascan_tickets_num[2];
+	struct numascan_ticket *numascan_tickets[2];
 #endif
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
@@ -1660,6 +1671,46 @@ mem_cgroup_calc_numascan_weight(struct m
 }
 
 /*
+ * For lottery scheduling, this routine disributes "ticket" for
+ * scanning to each node. ticket will be recored into numascan_ticket
+ * array and this array will be used for scheduling, lator.
+ * For make lottery wair, we limit the sum of tickets almost 0xffff.
+ * Later, random() & 0xffff will do proportional fair lottery.
+ */
+#define NUMA_TICKET_SHIFT	(16)
+#define NUMA_TICKET_FACTOR	((1 << NUMA_TICKET_SHIFT) - 1)
+static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
+{
+	struct numascan_ticket *nt;
+	unsigned int node_ticket, assigned_tickets;
+	u64 weight;
+	int nid, assigned_num, generation;
+
+	/* update ticket information by double buffering */
+	generation = memcg->numascan_generation ^ 0x1;
+
+	nt = memcg->numascan_tickets[generation];
+	assigned_tickets = 0;
+	assigned_num = 0;
+	for_each_node_mask(nid, memcg->scan_nodes) {
+		weight = memcg->info.nodeinfo[nid]->weight;
+		node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
+					memcg->total_weight + 1);
+		if (!node_ticket)
+			node_ticket = 1;
+		nt->nid = nid;
+		nt->start = assigned_tickets;
+		nt->tickets = node_ticket;
+		assigned_tickets += node_ticket;
+		nt++;
+		assigned_num++;
+	}
+	memcg->numascan_tickets_num[generation] = assigned_num;
+	smp_wmb();
+	memcg->numascan_generation = generation;
+}
+
+/*
  * Update all node's scan weight in background.
  */
 static void mem_cgroup_numainfo_update_work(struct work_struct *work)
@@ -1672,6 +1723,8 @@ static void mem_cgroup_numainfo_update_w
 
 	memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
 
+	synchronize_rcu();
+	mem_cgroup_update_numascan_tickets(memcg);
 	atomic_set(&memcg->numainfo_updating, 0);
 	css_put(&memcg->css);
 }
@@ -1698,6 +1751,18 @@ static void mem_cgroup_may_update_nodema
 	schedule_work(&mem->numainfo_update_work);
 }
 
+static int node_weight_compare(const void *key, const void *elt)
+{
+	unsigned long lottery = (unsigned long)key;
+	struct numascan_ticket *nt = (struct numascan_ticket *)elt;
+
+	if (lottery < nt->start)
+		return -1;
+	if (lottery > (nt->start + nt->tickets))
+		return 1;
+	return 0;
+}
+
 /*
  * Selecting a node where we start reclaim from. Because what we need is just
  * reducing usage counter, start from anywhere is O,K. Considering
@@ -1707,32 +1772,38 @@ static void mem_cgroup_may_update_nodema
  * we'll use or we've used. So, it may make LRU bad. And if several threads
  * hit limits, it will see a contention on a node. But freeing from remote
  * node means more costs for memory reclaim because of memory latency.
- *
- * Now, we use round-robin. Better algorithm is welcomed.
  */
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+				struct memcg_scanrecord *rec)
 {
-	int node;
+	int node = MAX_NUMNODES;
+	struct numascan_ticket *nt;
+	unsigned long lottery;
+	int generation;
 
+	if (rec->context == SCAN_BY_SHRINK)
+		goto out;
+
+	mem_cgroup_may_update_nodemask(memcg);
 	*mask = NULL;
-	mem_cgroup_may_update_nodemask(mem);
-	node = mem->last_scanned_node;
+	lottery = random32() & NUMA_TICKET_FACTOR;
 
-	node = next_node(node, mem->scan_nodes);
-	if (node == MAX_NUMNODES)
-		node = first_node(mem->scan_nodes);
-	/*
-	 * We call this when we hit limit, not when pages are added to LRU.
-	 * No LRU may hold pages because all pages are UNEVICTABLE or
-	 * memcg is too small and all pages are not on LRU. In that case,
-	 * we use curret node.
-	 */
-	if (unlikely(node == MAX_NUMNODES))
+	rcu_read_lock();
+	generation = memcg->numascan_generation;
+	nt = bsearch((void *)lottery,
+		memcg->numascan_tickets[generation],
+		memcg->numascan_tickets_num[generation],
+		sizeof(struct numascan_ticket), node_weight_compare);
+	rcu_read_unlock();
+	if (nt)
+		node = nt->nid;
+out:
+	if (unlikely(node == MAX_NUMNODES)) {
 		node = numa_node_id();
-	else
-		*mask = &mem->scan_nodes;
+		*mask = NULL;
+	} else
+		*mask = &memcg->scan_nodes;
 
-	mem->last_scanned_node = node;
 	return node;
 }
 
@@ -1771,14 +1842,42 @@ bool mem_cgroup_reclaimable(struct mem_c
 	return false;
 }
 
-static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+static bool mem_cgroup_numascan_init(struct mem_cgroup *memcg)
 {
+	struct numascan_ticket *nt;
+	int nr_nodes;
+
 	INIT_WORK(&memcg->numainfo_update_work,
 		mem_cgroup_numainfo_update_work);
+
+	nr_nodes = num_possible_nodes();
+	nt = kmalloc(sizeof(struct numascan_ticket) * nr_nodes,
+			GFP_KERNEL);
+	if (!nt)
+		return false;
+	memcg->numascan_tickets[0] = nt;
+	nt = kmalloc(sizeof(struct numascan_ticket) * nr_nodes,
+			GFP_KERNEL);
+	if (!nt) {
+		kfree(memcg->numascan_tickets[0]);
+		memcg->numascan_tickets[0] = NULL;
+		return false;
+	}
+	memcg->numascan_tickets[1] = nt;
+	memcg->numascan_tickets_num[0] = 0;
+	memcg->numascan_tickets_num[1] = 0;
+	return true;
+}
+
+static void mem_cgroup_numascan_free(struct mem_cgroup *memcg)
+{
+	kfree(memcg->numascan_tickets[0]);
+	kfree(memcg->numascan_tickets[1]);
 }
 
 #else
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask,
+				struct memcg_scanrecord *rec)
 {
 	*mask = NULL;
 	return 0;
@@ -1791,6 +1890,10 @@ bool mem_cgroup_reclaimable(struct mem_c
-static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+static bool mem_cgroup_numascan_init(struct mem_cgroup *memcg)
 {
+	return true;
 }
+static void mem_cgroup_numascan_free(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 static void __mem_cgroup_record_scanstat(unsigned long *stats,
@@ -5080,6 +5182,7 @@ static void __mem_cgroup_free(struct mem
 	int node;
 
 	mem_cgroup_remove_from_trees(mem);
+	mem_cgroup_numascan_free(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -5218,7 +5321,8 @@ mem_cgroup_create(struct cgroup_subsys *
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
 	spin_lock_init(&mem->scanstat.lock);
-	mem_cgroup_numascan_init(mem);
+	if (!mem_cgroup_numascan_init(mem))
+		goto free_out;
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
Index: mmotm-Aug3/mm/vmscan.c
===================================================================
--- mmotm-Aug3.orig/mm/vmscan.c
+++ mmotm-Aug3/mm/vmscan.c
@@ -2378,9 +2378,9 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask);
+	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask, rec);
 
-	zonelist = NODE_DATA(nid)->node_zonelists;
+	zonelist = &NODE_DATA(nid)->node_zonelists[0];
 
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
Index: mmotm-Aug3/include/linux/memcontrol.h
===================================================================
--- mmotm-Aug3.orig/include/linux/memcontrol.h
+++ mmotm-Aug3/include/linux/memcontrol.h
@@ -118,7 +118,8 @@ extern void mem_cgroup_end_migration(str
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask);
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+				struct memcg_scanrecord *rec);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
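
As a side note, here is a minimal user-space sketch of the lottery idea used
above. The weights and the main() driver are made up; only the 16-bit ticket
space and the bsearch() trick mirror the patch.
==
#include <stdio.h>
#include <stdlib.h>

#define TICKET_SHIFT	16
#define TICKET_FACTOR	((1UL << TICKET_SHIFT) - 1)

struct ticket {
	int nid;
	unsigned int start;
	unsigned int tickets;
};

/* give each node a ticket range proportional to its weight */
static void assign_tickets(const unsigned long long *weight, int nr_nodes,
			   struct ticket *nt)
{
	unsigned long long total = 0;
	unsigned int assigned = 0;
	int nid;

	for (nid = 0; nid < nr_nodes; nid++)
		total += weight[nid];

	for (nid = 0; nid < nr_nodes; nid++) {
		unsigned int t = (unsigned int)
			((weight[nid] << TICKET_SHIFT) / (total + 1));

		if (!t)
			t = 1;	/* every node keeps at least one ticket */
		nt[nid].nid = nid;
		nt[nid].start = assigned;
		nt[nid].tickets = t;
		assigned += t;
	}
}

/* same trick as the patch: the key passed to bsearch is the lottery itself */
static int ticket_cmp(const void *key, const void *elt)
{
	unsigned long lottery = (unsigned long)key;
	const struct ticket *nt = elt;

	if (lottery < nt->start)
		return -1;
	if (lottery > nt->start + nt->tickets)
		return 1;
	return 0;
}

int main(void)
{
	unsigned long long weight[4] = { 100, 300, 500, 100 };	/* made up */
	struct ticket nt[4];
	unsigned long lottery;
	struct ticket *winner;

	assign_tickets(weight, 4, nt);
	lottery = (unsigned long)rand() & TICKET_FACTOR;
	winner = bsearch((void *)lottery, nt, 4, sizeof(nt[0]), ticket_cmp);
	printf("lottery %lu -> node %d\n", lottery, winner ? winner->nid : -1);
	return 0;
}
==
With the made-up weights 100/300/500/100, node 2 holds roughly half of the
~65536 tickets and so is selected about half of the time; this proportional
but randomized selection is what replaces the old round-robin.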


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH v5 6/6]  memg: do target scan if unbalanced
  2011-08-09 10:04 ` KAMEZAWA Hiroyuki
@ 2011-08-09 10:13   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, akpm, Michal Hocko, hannes, nishimura


Because do_try_to_free_pages() scans nodes based on the zonelist,
we may scan other nodes even if we select a victim node.

When the nodes are balanced, that is fine because we quit the scan loop
before updating 'priority'. But when the nodes are unbalanced,
it forces scanning of very small nodes and causes swap-out when a node
doesn't contain enough file caches.

This patch selects which zonelist[] vmscan uses for memcg reclaim.
If the memcg is well balanced among nodes, the usual fallback zonelist
(and mask) is used. If not, it selects the node-local zonelist and does
targeted reclaim.

This reduces unnecessary (anon page) scans when the memcg is not balanced.

Now, memcg/NUMA is considered balanced when each node's weight is between
 80% and 120% of the average node weight.
 (*) These values are just magic numbers but work well in several tests.
     Further study to determine them is appreciated.
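
As a rough user-space illustration of the check (the weights below are made
up, and this ignores the +1 the kernel side adds to the node count,
presumably to guard against an empty scan_nodes mask):
==
#include <stdbool.h>
#include <stdio.h>

/* true when every node's weight is within [80%, 120%] of the average */
static bool weights_balanced(const unsigned long *weight, int nr_nodes)
{
	unsigned long total = 0, average, low, high;
	int nid;

	for (nid = 0; nid < nr_nodes; nid++)
		total += weight[nid];
	average = total / nr_nodes;
	low  =  80 * average / 100;
	high = 120 * average / 100;

	for (nid = 0; nid < nr_nodes; nid++)
		if (weight[nid] < low || weight[nid] > high)
			return false;
	return true;
}

int main(void)
{
	unsigned long even[4]   = {  950, 1000, 1050, 1000 };
	unsigned long skewed[4] = {  100, 1000, 1000, 1900 };

	/* prints "balanced" then "unbalanced" */
	printf("%s\n", weights_balanced(even, 4)   ? "balanced" : "unbalanced");
	printf("%s\n", weights_balanced(skewed, 4) ? "balanced" : "unbalanced");
	return 0;
}
==
A "balanced" result corresponds to keeping the usual fallback zonelist; an
"unbalanced" one makes vmscan use the node-local zonelist and do targeted
reclaim, as the hunks below show.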

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    2 +-
 mm/memcontrol.c            |   20 ++++++++++++++++++--
 mm/vmscan.c                |    9 +++++++--
 3 files changed, 26 insertions(+), 5 deletions(-)

Index: mmotm-Aug3/mm/memcontrol.c
===================================================================
--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c
@@ -296,6 +296,7 @@ struct mem_cgroup {
 	atomic_t	numainfo_updating;
 	struct work_struct	numainfo_update_work;
 	unsigned long	total_weight;
+	bool		numascan_balance;
 	int		numascan_generation;
 	int		numascan_tickets_num[2];
 	struct numascan_ticket *numascan_tickets[2];
@@ -1679,12 +1680,15 @@ mem_cgroup_calc_numascan_weight(struct m
  */
 #define NUMA_TICKET_SHIFT	(16)
 #define NUMA_TICKET_FACTOR	((1 << NUMA_TICKET_SHIFT) - 1)
+#define NUMA_BALANCE_RANGE_LOW	(80)
+#define NUMA_BALANCE_RANGE_HIGH	(120)
 static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
 {
 	struct numascan_ticket *nt;
 	unsigned int node_ticket, assigned_tickets;
 	u64 weight;
 	int nid, assigned_num, generation;
+	unsigned long average, balance_low, balance_high;
 
 	/* update ticket information by double buffering */
 	generation = memcg->numascan_generation ^ 0x1;
@@ -1692,6 +1696,11 @@ static void mem_cgroup_update_numascan_t
 	nt = memcg->numascan_tickets[generation];
 	assigned_tickets = 0;
 	assigned_num = 0;
+	average = memcg->total_weight / (nodes_weight(memcg->scan_nodes) + 1);
+	balance_low = NUMA_BALANCE_RANGE_LOW * average / 100;
+	balance_high = NUMA_BALANCE_RANGE_HIGH * average / 100;
+	memcg->numascan_balance = true;
+
 	for_each_node_mask(nid, memcg->scan_nodes) {
 		weight = memcg->info.nodeinfo[nid]->weight;
 		node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
@@ -1704,6 +1713,9 @@ static void mem_cgroup_update_numascan_t
 		assigned_tickets += node_ticket;
 		nt++;
 		assigned_num++;
+		if ((weight < balance_low) ||
+		    (weight > balance_high))
+			memcg->numascan_balance = false;
 	}
 	memcg->numascan_tickets_num[generation] = assigned_num;
 	smp_wmb();
@@ -1774,7 +1786,7 @@ static int node_weight_compare(const voi
  * node means more costs for memory reclaim because of memory latency.
  */
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
-				struct memcg_scanrecord *rec)
+				struct memcg_scanrecord *rec, bool *fallback)
 {
 	int node = MAX_NUMNODES;
 	struct numascan_ticket *nt;
@@ -1801,8 +1813,11 @@ out:
 	if (unlikely(node == MAX_NUMNODES)) {
 		node = numa_node_id();
 		*mask = NULL;
-	} else
+		*fallback = true;
+	} else {
 		*mask = &memcg->scan_nodes;
+		*fallback = memcg->numascan_balance;
+	}
 
 	return node;
 }
@@ -1880,6 +1895,7 @@ int mem_cgroup_select_victim_node(struct
 				struct memcg_scanrecord *rec)
 {
 	*mask = NULL;
+	*fallback = true;
 	return 0;
 }
 
Index: mmotm-Aug3/include/linux/memcontrol.h
===================================================================
--- mmotm-Aug3.orig/include/linux/memcontrol.h
+++ mmotm-Aug3/include/linux/memcontrol.h
@@ -119,7 +119,7 @@ extern void mem_cgroup_end_migration(str
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
-				struct memcg_scanrecord *rec);
+				struct memcg_scanrecord *rec, bool *fallback);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
Index: mmotm-Aug3/mm/vmscan.c
===================================================================
--- mmotm-Aug3.orig/mm/vmscan.c
+++ mmotm-Aug3/mm/vmscan.c
@@ -2355,6 +2355,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
 	ktime_t start, end;
+	bool fallback;
 	int nid;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
@@ -2378,9 +2379,13 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask, rec);
+	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask,
+				rec, &fallback);
 
-	zonelist = &NODE_DATA(nid)->node_zonelists[0];
+	if (fallback) /* memcg/NUMA is balanced and fallback works well */
+		zonelist = &NODE_DATA(nid)->node_zonelists[0];
+	else /* memcg/NUMA is not balanced, do target reclaim */
+		zonelist = &NODE_DATA(nid)->node_zonelists[1];
 
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 0/6]  memg: better numa scanning
  2011-08-09 10:04 ` KAMEZAWA Hiroyuki
@ 2011-08-09 14:33   ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-09 14:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Tue 09-08-11 19:04:50, KAMEZAWA Hiroyuki wrote:
> 
> No major update since the last version I posted 27/Jul.
> The patch is rebased onto mmotm-Aug3.
> 
> This patch set implements a victim node selection logic and some
> behavior fix in vmscan.c for memcg.
> The logic calculates 'weight' for each nodes and a victim node
> will be selected by comparing 'weight' in fair style.
> The core is how to calculate 'weight' and this patch implements
> a logic, which make use of recent lotation logic and the amount
> of file caches and inactive anon pages.
> 
> I'll be absent in 12/Aug - 17/Aug.
> I'm sorry if my response is delayed.
> 
> In this time, I did 'kernel make' test ...as
> ==
> #!/bin/bash -x
> 
> cgset -r memory.limit_in_bytes=500M A
> 
> make -j 4 clean
> sync
> sync
> sync
> echo 3 > /proc/sys/vm/drop_caches
> sleep 1
> echo 0 > /cgroup/memory/A/memory.vmscan_stat
> cgexec -g memory:A -g cpuset:A time make -j 8
> ==
> 
> On 8cpu, 4-node fake-numa box.

How big are those nodes? I assume that you haven't used any numa
policies, right?

> (each node has 2cpus.)
> 
> Under the limit of 500M, 'make' need to scan memory to reclaim.
> This tests see how vmscan works.
> 
> When cpuset.memory_spread_page==0.

> 
> [Before patch]
> 773.07user 305.45system 4:09.64elapsed 432%CPU (0avgtext+0avgdata 1456576maxresident)k
> 4397944inputs+5093232outputs (9688major+35689066minor)pagefaults 0swaps
> scanned_pages_by_limit 3867645
> scanned_anon_pages_by_limit 1518266
> scanned_file_pages_by_limit 2349379
> rotated_pages_by_limit 1502640
> rotated_anon_pages_by_limit 1416627
> rotated_file_pages_by_limit 86013
> freed_pages_by_limit 1005141
> freed_anon_pages_by_limit 24577
> freed_file_pages_by_limit 980564
> elapsed_ns_by_limit 82833866094
> 
> [Patched]
> 773.73user 305.09system 3:51.28elapsed 466%CPU (0avgtext+0avgdata 1458464maxresident)k
> 4400264inputs+4797056outputs (5578major+35690202minor)pagefaults 0swaps

Hmm, a 57% reduction in major page faults, which doesn't fit with the other
numbers. At least I do not see any correlation with them. Your workload
has freed more or less the same number of file pages (>1% less). Do you
have a theory for that?

Is it possible that this is caused by "memcg: stop vmscan when enough
done."?

> 
> scanned_pages_by_limit 4326462
> scanned_anon_pages_by_limit 1310619
> scanned_file_pages_by_limit 3015843
> rotated_pages_by_limit 1264223
> rotated_anon_pages_by_limit 1247180
> rotated_file_pages_by_limit 17043
> freed_pages_by_limit 1003434
> freed_anon_pages_by_limit 20599
> freed_file_pages_by_limit 982835
> elapsed_ns_by_limit 42495200307
> 
> elapsed time for vmscan and the number of page faults are reduced.
> 
> 
> When cpuset.memory_spread_page==1, in this case, file cache will be
> spread to the all nodes in round robin.
> ==
> [Before Patch + cpuset spread=1]
> 773.23user 309.55system 4:26.83elapsed 405%CPU (0avgtext+0avgdata 1457696maxresident)k
> 5400928inputs+5105368outputs (17344major+35735886minor)pagefaults 0swaps
> 
> scanned_pages_by_limit 3731787
> scanned_anon_pages_by_limit 1374310
> scanned_file_pages_by_limit 2357477
> rotated_pages_by_limit 1403160
> rotated_anon_pages_by_limit 1293568
> rotated_file_pages_by_limit 109592
> freed_pages_by_limit 1120828
> freed_anon_pages_by_limit 20076
> freed_file_pages_by_limit 1100752
> elapsed_ns_by_limit 82458981267
> 
> 
> [Patched + cpuset spread=1]
> 773.56user 306.02system 3:52.28elapsed 464%CPU (0avgtext+0avgdata 1458160maxresident)k
> 4173504inputs+4783688outputs (5971major+35666498minor)pagefaults 0swaps

Page faults and time seem to be consistent with the previous test, which
is really good.

> 
> scanned_pages_by_limit 2672392
> scanned_anon_pages_by_limit 1140069
> scanned_file_pages_by_limit 1532323
> rotated_pages_by_limit 1108124
> rotated_anon_pages_by_limit 1088982
> rotated_file_pages_by_limit 19142
> freed_pages_by_limit 975653
> freed_anon_pages_by_limit 12578
> freed_file_pages_by_limit 963075
> elapsed_ns_by_limit 46482588602
> 
> elapsed time for vmscan and the number of page faults are reduced.
> 
> Thanks,
> -Kame

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 0/6]  memg: better numa scanning
  2011-08-09 14:33   ` Michal Hocko
@ 2011-08-10  0:15     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-10  0:15 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Tue, 9 Aug 2011 16:33:14 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 09-08-11 19:04:50, KAMEZAWA Hiroyuki wrote:
> > 
> > No major update since the last version I posted 27/Jul.
> > The patch is rebased onto mmotm-Aug3.
> > 
> > This patch set implements a victim node selection logic and some
> > behavior fix in vmscan.c for memcg.
> > The logic calculates 'weight' for each nodes and a victim node
> > will be selected by comparing 'weight' in fair style.
> > The core is how to calculate 'weight' and this patch implements
> > a logic, which make use of recent lotation logic and the amount
> > of file caches and inactive anon pages.
> > 
> > I'll be absent in 12/Aug - 17/Aug.
> > I'm sorry if my response is delayed.
> > 
> > In this time, I did 'kernel make' test ...as
> > ==
> > #!/bin/bash -x
> > 
> > cgset -r memory.limit_in_bytes=500M A
> > 
> > make -j 4 clean
> > sync
> > sync
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> > sleep 1
> > echo 0 > /cgroup/memory/A/memory.vmscan_stat
> > cgexec -g memory:A -g cpuset:A time make -j 8
> > ==
> > 
> > On 8cpu, 4-node fake-numa box.
> 
> How big are those nodes? I assume that you haven't used any numa
> policies, right?
> 

This box has 24GB memory and fake numa creates 4 nodes of 6GB each.

[kamezawa@bluextal ~]$ grep MemTotal /sys/devices/system/node/node?/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemTotal:        6290360 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemTotal:        6291456 kB
/sys/devices/system/node/node2/meminfo:Node 2 MemTotal:        6291456 kB
/sys/devices/system/node/node3/meminfo:Node 3 MemTotal:        6291456 kB

2 cpus per node. (IIRC, hyperthreads.)

[kamezawa@bluextal ~]$ ls -d /sys/devices/system/node/node?/cpu?
/sys/devices/system/node/node0/cpu0  /sys/devices/system/node/node2/cpu2
/sys/devices/system/node/node0/cpu4  /sys/devices/system/node/node2/cpu6
/sys/devices/system/node/node1/cpu1  /sys/devices/system/node/node3/cpu3
/sys/devices/system/node/node1/cpu5  /sys/devices/system/node/node3/cpu7

And yes, I don't use any numa policy other than spread-page.



> > (each node has 2cpus.)
> > 
> > Under the limit of 500M, 'make' need to scan memory to reclaim.
> > This tests see how vmscan works.
> > 
> > When cpuset.memory_spread_page==0.
> 
> > 
> > [Before patch]
> > 773.07user 305.45system 4:09.64elapsed 432%CPU (0avgtext+0avgdata 1456576maxresident)k
> > 4397944inputs+5093232outputs (9688major+35689066minor)pagefaults 0swaps
> > scanned_pages_by_limit 3867645
> > scanned_anon_pages_by_limit 1518266
> > scanned_file_pages_by_limit 2349379
> > rotated_pages_by_limit 1502640
> > rotated_anon_pages_by_limit 1416627
> > rotated_file_pages_by_limit 86013
> > freed_pages_by_limit 1005141
> > freed_anon_pages_by_limit 24577
> > freed_file_pages_by_limit 980564
> > elapsed_ns_by_limit 82833866094
> > 
> > [Patched]
> > 773.73user 305.09system 3:51.28elapsed 466%CPU (0avgtext+0avgdata 1458464maxresident)k
> > 4400264inputs+4797056outputs (5578major+35690202minor)pagefaults 0swaps
> 
> Hmm, 57% reduction of major page faults which doesn't fit with other
> numbers. At least I do not see any corelation with them. Your workload
> has freed more or less the same number of file pages (>1% less). Do you
> have a theory for that?
> 
[Before] freed_anon_pages_by_limit 24577 
[After]  freed_anon_pages_by_limit 20599

That is 3978 fewer anon pages swapped out; the change in major faults is 4110.
I think this is the main reason for the reduction in major faults.

> Is it possible that this is caused by "memcg: stop vmscan when enough
> done."?
> 

That patch helps, but it is only part of it.

Assume the nodes are in the following state under limit=2000:
     
       Node0   Node1   Node2   Node3
File   250     250       0     250
Anon   250     250      500    250

If select_victim_node() selects Node0, vmscan will visit
Node0->Node1->Node2->Node3 in zonelist order and cause swap-out in Node2.
"memcg: stop vmscan when enough done." helps avoid scanning Node2
when Node0, Node1 or Node3 is selected.

And other patches will help not to select Node2.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 0/6]  memg: better numa scanning
  2011-08-10  0:15     ` KAMEZAWA Hiroyuki
@ 2011-08-10  6:03       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-10  6:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-mm, linux-kernel, akpm, hannes, nishimura

On Wed, 10 Aug 2011 09:15:44 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 9 Aug 2011 16:33:14 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Hmm, 57% reduction of major page faults which doesn't fit with other
> > numbers. At least I do not see any corelation with them. Your workload
> > has freed more or less the same number of file pages (>1% less). Do you
> > have a theory for that?
> > 
> > Is it possible that this is caused by "memcg: stop vmscan when enough
> > done."?
> > 
> 

I did more runs. This time I did a sequence of 3 runs per test, so the 2nd and 3rd
runs see some leftover file cache from the previous runs. cpuset is not used.

[Nopatch]
[1] 772.07user 308.73system 4:05.41elapsed 440%CPU (0avgtext+0avgdata 1458400maxresident)k
    4519512inputs+7485704outputs (8078major+35671016minor)pagefaults 0swaps
[2] 774.19user 306.19system 4:03.05elapsed 444%CPU (0avgtext+0avgdata 1455472maxresident)k
    4502272inputs+5168832outputs (7815major+35691489minor)pagefaults 0swaps
[3] 773.99user 310.71system 4:00.31elapsed 451%CPU (0avgtext+0avgdata 1458144maxresident)k
    4518448inputs+8695352outputs (7768major+35683064minor)pagefaults 0swaps

[Patch 1-3 applied]
[1] 771.75user 312.82system 4:09.55elapsed 434%CPU (0avgtext+0avgdata 1458320maxresident)k
    4413032inputs+7895152outputs (8793major+35691822minor)pagefaults 0swaps
[2] 772.66user 308.93system 4:15.22elapsed 423%CPU (0avgtext+0avgdata 1457504maxresident)k
    4469120inputs+12484960outputs (10952major+35702053minor)pagefaults 0swaps
[3] 771.83user 305.53system 3:57.63elapsed 453%CPU (0avgtext+0avgdata 1457856maxresident)k
    4355392inputs+5169560outputs (6985major+35680863minor)pagefaults 0swaps

[Full Patched]
[1] 771.19user 303.37system 3:49.47elapsed 468%CPU (0avgtext+0avgdata 1458400maxresident)k
    4260032inputs+4919824outputs (5496major+35672873minor)pagefaults 0swaps
[2] 772.51user 305.90system 3:56.89elapsed 455%CPU (0avgtext+0avgdata 1458416maxresident)k
    4463728inputs+4621496outputs (6301major+35671610minor)pagefaults 0swaps
[3] 773.14user 305.02system 3:55.09elapsed 458%CPU (0avgtext+0avgdata 1458240maxresident)k
    4447088inputs+5190792outputs (5106major+35699087minor)pagefaults 0swaps

Patch 3 is required for patch 5 to work correctly (without it, the node selection goes unused),
but it seems to be no more than that.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 1/6]  memg: better numa scanning
  2011-08-09 10:08   ` KAMEZAWA Hiroyuki
@ 2011-08-10 10:00     ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-10 10:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Tue 09-08-11 19:08:24, KAMEZAWA Hiroyuki wrote:
> 
> Making memcg numa's scanning information update by schedule_work().
> 
> Now, memcg's numa information is updated under a thread doing
> memory reclaim. It's not very heavy weight now. But upcoming updates
> around numa scanning will add more works. This patch makes
> the update be done by schedule_work() and reduce latency caused
> by this updates.

I am not sure whether this pays off. Anyway, I think it would be better
to place this patch somewhere at the end of the series so that we can
measure its impact separately.

> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Otherwise looks good to me.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

Just a minor nit below.

> ---
>  mm/memcontrol.c |   42 ++++++++++++++++++++++++++++++------------
>  1 file changed, 30 insertions(+), 12 deletions(-)
> 
> Index: mmotm-Aug3/mm/memcontrol.c
> ===================================================================
> --- mmotm-Aug3.orig/mm/memcontrol.c
> +++ mmotm-Aug3/mm/memcontrol.c
> @@ -285,6 +285,7 @@ struct mem_cgroup {
>  	nodemask_t	scan_nodes;
>  	atomic_t	numainfo_events;
>  	atomic_t	numainfo_updating;
> +	struct work_struct	numainfo_update_work;
>  #endif
>  	/*
>  	 * Should the accounting and control be hierarchical, per subtree?
> @@ -1567,6 +1568,23 @@ static bool test_mem_cgroup_node_reclaim
>  }
>  #if MAX_NUMNODES > 1
>  
> +static void mem_cgroup_numainfo_update_work(struct work_struct *work)
> +{
> +	struct mem_cgroup *memcg;
> +	int nid;
> +
> +	memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
> +
> +	memcg->scan_nodes = node_states[N_HIGH_MEMORY];
> +	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
> +		if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
> +			node_clear(nid, memcg->scan_nodes);
> +	}
> +	atomic_set(&memcg->numainfo_updating, 0);
> +	css_put(&memcg->css);
> +}
> +
> +
>  /*
>   * Always updating the nodemask is not very good - even if we have an empty
>   * list or the wrong list here, we can start from some node and traverse all
> @@ -1575,7 +1593,6 @@ static bool test_mem_cgroup_node_reclaim
>   */

It would be good to update the function comment as well (we still have the 10s
period mentioned there).

>  static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
>  {
> -	int nid;
>  	/*
>  	 * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
>  	 * pagein/pageout changes since the last update.
[...]

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 3/6]  memg: vmscan pass nodemask
  2011-08-09 10:10   ` KAMEZAWA Hiroyuki
@ 2011-08-10 11:19     ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-10 11:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Tue 09-08-11 19:10:18, KAMEZAWA Hiroyuki wrote:
> 
> pass memcg's nodemask to try_to_free_pages().
> 
> try_to_free_pages can take nodemask as its argument but memcg
> doesn't pass it. Considering memcg can be used with cpuset on
> big NUMA, memcg should pass nodemask if available.
> 
> Now, memcg maintain nodemask with periodic updates. pass it.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Changelog:
>  - fixed bugs to pass nodemask.

Yes, looks good now.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> Index: mmotm-Aug3/mm/vmscan.c
> ===================================================================
> --- mmotm-Aug3.orig/mm/vmscan.c
> +++ mmotm-Aug3/mm/vmscan.c
> @@ -2354,7 +2354,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  		.order = 0,
>  		.mem_cgroup = mem_cont,
>  		.memcg_record = rec,
> -		.nodemask = NULL, /* we don't care the placement */
> +		.nodemask = NULL,
>  		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>  				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
>  	};

We can remove the whole nodemask initialization.

> @@ -2368,7 +2368,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  	 * take care of from where we get pages. So the node where we start the
>  	 * scan does not need to be the current node.
>  	 */
> -	nid = mem_cgroup_select_victim_node(mem_cont);
> +	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask);
>  
>  	zonelist = NODE_DATA(nid)->node_zonelists;

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-09 10:09   ` KAMEZAWA Hiroyuki
@ 2011-08-10 14:14     ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-10 14:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Tue 09-08-11 19:09:33, KAMEZAWA Hiroyuki wrote:
> memcg :avoid node fallback scan if possible.
> 
> Now, try_to_free_pages() scans all zonelist because the page allocator
> should visit all zonelists...but that behavior is harmful for memcg.
> Memcg just scans memory because it hits limit...no memory shortage
> in pased zonelist.
> 
> For example, with following unbalanced nodes
> 
>      Node 0    Node 1
> File 1G        0
> Anon 200M      200M
> 
> memcg will cause swap-out from Node1 at every vmscan.
> 
> Another example, assume 1024 nodes system.
> With 1024 node system, memcg will visit 1024 nodes
> pages per vmscan... This is overkilling. 
> 
> This is why memcg's victim node selection logic doesn't work
> as expected.
> 
> This patch is a help for stopping vmscan when we scanned enough.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

OK, I see the point. At first I was afraid that we would put bigger
pressure on the node which triggered the reclaim, but as we are selecting
it dynamically (mem_cgroup_select_victim_node) - round robin at the
moment - it should be fair in the end. More targeted node selection
should be even more efficient.

I still have a concern about the resize_limit code path, though. It uses
memcg direct reclaim to get under the new limit (assuming it is lower
than the current one).
Currently we might reclaim nr_nodes * SWAP_CLUSTER_MAX pages per round, while
after your change we have it at SWAP_CLUSTER_MAX. This means that
mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines
(currently it does 5 rounds of reclaim before it gives up). I do not
consider this to be a blocker, but maybe we should enhance
mem_cgroup_hierarchical_reclaim with a nr_pages argument telling it how
much we want to reclaim (min(SWAP_CLUSTER_MAX, nr_pages)).
What do you think?
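
To put rough numbers on that (assuming SWAP_CLUSTER_MAX is 32 pages and a 4KB
page size, i.e. 128KB per SWAP_CLUSTER_MAX batch):

  before: up to nr_nodes * 128KB per round -> ~128MB per round, ~640MB over
          the 5 rounds on a 1024-node machine
  after:  up to 128KB per round            -> ~640KB before the resize path
          gives up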

> ---
>  mm/vmscan.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: mmotm-Aug3/mm/vmscan.c
> ===================================================================
> --- mmotm-Aug3.orig/mm/vmscan.c
> +++ mmotm-Aug3/mm/vmscan.c
> @@ -2124,6 +2124,16 @@ static void shrink_zones(int priority, s
>  		}
>  
>  		shrink_zone(priority, zone, sc);
> +		if (!scanning_global_lru(sc)) {
> +			/*
> +			 * When we do scan for memcg's limit, it's bad to do
> +			 * fallback into more node/zones because there is no
> +			 * memory shortage. We quit as much as possible when
> +			 * we reache target.
> +			 */
> +			if (sc->nr_to_reclaim <= sc->nr_reclaimed)
> +				break;
> +		}
>  	}
>  }

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 0/6]  memg: better numa scanning
  2011-08-10  0:15     ` KAMEZAWA Hiroyuki
@ 2011-08-10 14:20       ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-10 14:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Wed 10-08-11 09:15:44, KAMEZAWA Hiroyuki wrote:
> On Tue, 9 Aug 2011 16:33:14 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Tue 09-08-11 19:04:50, KAMEZAWA Hiroyuki wrote:
[...]
> > > #!/bin/bash -x
> > > 
> > > cgset -r memory.limit_in_bytes=500M A
> > > 
> > > make -j 4 clean
> > > sync
> > > sync
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > > sleep 1
> > > echo 0 > /cgroup/memory/A/memory.vmscan_stat
> > > cgexec -g memory:A -g cpuset:A time make -j 8
> > > ==
> > > 
> > > On 8cpu, 4-node fake-numa box.
> > 
> > How big are those nodes? I assume that you haven't used any numa
> > policies, right?
> > 
> 
> This box has 24GB memory and fake numa creates 6GB nodes x 4.
> 
> [kamezawa@bluextal ~]$ grep MemTotal /sys/devices/system/node/node?/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemTotal:        6290360 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemTotal:        6291456 kB
> /sys/devices/system/node/node2/meminfo:Node 2 MemTotal:        6291456 kB
> /sys/devices/system/node/node3/meminfo:Node 3 MemTotal:        6291456 kB
> 
> 2 cpus per node. (IIRC, hyperthread siblings)
> 
> [kamezawa@bluextal ~]$ ls -d /sys/devices/system/node/node?/cpu?
> /sys/devices/system/node/node0/cpu0  /sys/devices/system/node/node2/cpu2
> /sys/devices/system/node/node0/cpu4  /sys/devices/system/node/node2/cpu6
> /sys/devices/system/node/node1/cpu1  /sys/devices/system/node/node3/cpu3
> /sys/devices/system/node/node1/cpu5  /sys/devices/system/node/node3/cpu7
> 
> And yes, I don't use any numa policy other than spread-page.

OK, so the load should fit into a single node without spread-page.

> > > (each node has 2cpus.)
> > > 
> > > Under the limit of 500M, 'make' needs to scan memory to reclaim.
> > > This test shows how vmscan works.
> > > 
> > > When cpuset.memory_spread_page==0.
> > 
> > > 
> > > [Before patch]
> > > 773.07user 305.45system 4:09.64elapsed 432%CPU (0avgtext+0avgdata 1456576maxresident)k
> > > 4397944inputs+5093232outputs (9688major+35689066minor)pagefaults 0swaps
> > > scanned_pages_by_limit 3867645
> > > scanned_anon_pages_by_limit 1518266
> > > scanned_file_pages_by_limit 2349379
> > > rotated_pages_by_limit 1502640
> > > rotated_anon_pages_by_limit 1416627
> > > rotated_file_pages_by_limit 86013
> > > freed_pages_by_limit 1005141
> > > freed_anon_pages_by_limit 24577
> > > freed_file_pages_by_limit 980564
> > > elapsed_ns_by_limit 82833866094
> > > 
> > > [Patched]
> > > 773.73user 305.09system 3:51.28elapsed 466%CPU (0avgtext+0avgdata 1458464maxresident)k
> > > 4400264inputs+4797056outputs (5578major+35690202minor)pagefaults 0swaps
> > 
> > Hmm, a 57% reduction in major page faults, which doesn't fit with the other
> > numbers. At least I do not see any correlation with them. Your workload
> > has freed more or less the same number of file pages (>1% less). Do you
> > have a theory for that?
> > 
> [Before] freed_anon_pages_by_limit 24577 
> [After]  freed_anon_pages_by_limit 20599
> 
> This reduces swap-outs by 3978 (24577 - 20599). The change in major faults
> is 4110 (9688 - 5578). I think this is the main reason for the reduction
> in major faults.

Ahh, right you are.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 1/6]  memg: better numa scanning
  2011-08-10 10:00     ` Michal Hocko
@ 2011-08-10 23:30       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-10 23:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Wed, 10 Aug 2011 12:00:42 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 09-08-11 19:08:24, KAMEZAWA Hiroyuki wrote:
> > 
> > Make memcg's numa scanning information update via schedule_work().
> > 
> > Now, memcg's numa information is updated in the context of a thread
> > doing memory reclaim. It's not very heavyweight now, but upcoming
> > updates around numa scanning will add more work. This patch makes the
> > update happen via schedule_work() and reduces the latency caused by
> > these updates.
> 
> I am not sure whether this pays off. Anyway, I think it would be better
> to place this patch somewhere at the end of the series so that we can
> measure its impact separately.
> 

I'll consider reordering when I come back from vacation.

> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Otherwise looks good to me.
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> 

Thanks.

> Just a minor nit below.
> 
> > ---
> >  mm/memcontrol.c |   42 ++++++++++++++++++++++++++++++------------
> >  1 file changed, 30 insertions(+), 12 deletions(-)
> > 
> > Index: mmotm-Aug3/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-Aug3.orig/mm/memcontrol.c
> > +++ mmotm-Aug3/mm/memcontrol.c
> > @@ -285,6 +285,7 @@ struct mem_cgroup {
> >  	nodemask_t	scan_nodes;
> >  	atomic_t	numainfo_events;
> >  	atomic_t	numainfo_updating;
> > +	struct work_struct	numainfo_update_work;
> >  #endif
> >  	/*
> >  	 * Should the accounting and control be hierarchical, per subtree?
> > @@ -1567,6 +1568,23 @@ static bool test_mem_cgroup_node_reclaim
> >  }
> >  #if MAX_NUMNODES > 1
> >  
> > +static void mem_cgroup_numainfo_update_work(struct work_struct *work)
> > +{
> > +	struct mem_cgroup *memcg;
> > +	int nid;
> > +
> > +	memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
> > +
> > +	memcg->scan_nodes = node_states[N_HIGH_MEMORY];
> > +	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
> > +		if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
> > +			node_clear(nid, memcg->scan_nodes);
> > +	}
> > +	atomic_set(&memcg->numainfo_updating, 0);
> > +	css_put(&memcg->css);
> > +}
> > +
> > +
> >  /*
> >   * Always updating the nodemask is not very good - even if we have an empty
> >   * list or the wrong list here, we can start from some node and traverse all
> > @@ -1575,7 +1593,6 @@ static bool test_mem_cgroup_node_reclaim
> >   */
> 
> Would be good to update the function comment as well (we still have 10s
> period there).
> 

ok.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 3/6]  memg: vmscan pass nodemask
  2011-08-10 11:19     ` Michal Hocko
@ 2011-08-10 23:43       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-10 23:43 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Wed, 10 Aug 2011 13:19:58 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 09-08-11 19:10:18, KAMEZAWA Hiroyuki wrote:
> > 
> > Pass memcg's nodemask to try_to_free_pages().
> > 
> > try_to_free_pages() can take a nodemask as its argument but memcg
> > doesn't pass it. Considering memcg can be used with cpuset on
> > big NUMA machines, memcg should pass the nodemask if available.
> > 
> > Now, memcg maintains the nodemask with periodic updates. Pass it.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Changelog:
> >  - fixed bugs to pass nodemask.
> 
> Yes, looks good now.
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> 
Thanks.

> > Index: mmotm-Aug3/mm/vmscan.c
> > ===================================================================
> > --- mmotm-Aug3.orig/mm/vmscan.c
> > +++ mmotm-Aug3/mm/vmscan.c
> > @@ -2354,7 +2354,7 @@ unsigned long try_to_free_mem_cgroup_pag
> >  		.order = 0,
> >  		.mem_cgroup = mem_cont,
> >  		.memcg_record = rec,
> > -		.nodemask = NULL, /* we don't care the placement */
> > +		.nodemask = NULL,
> >  		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> >  				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> >  	};
> 
> We can remove the whole nodemask initialization.
> 

Ok, here
==

Pass memcg's nodemask to try_to_free_pages().

try_to_free_pages() can take a nodemask as its argument but memcg
doesn't pass it. Considering memcg can be used with cpuset on
big NUMA machines, memcg should pass the nodemask if available.

Now, memcg maintains the nodemask with periodic updates. Pass it.

Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Changelog:
 - removed unnecessary initialization of sc.nodemask.
Changelog:
 - fixed bugs to pass nodemask.
---
 include/linux/memcontrol.h |    2 +-
 mm/memcontrol.c            |    8 ++++++--
 mm/vmscan.c                |    3 +--
 3 files changed, 8 insertions(+), 5 deletions(-)

Index: mmotm-Aug3/include/linux/memcontrol.h
===================================================================
--- mmotm-Aug3.orig/include/linux/memcontrol.h
+++ mmotm-Aug3/include/linux/memcontrol.h
@@ -118,7 +118,7 @@ extern void mem_cgroup_end_migration(str
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
Index: mmotm-Aug3/mm/memcontrol.c
===================================================================
--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c
@@ -1615,10 +1615,11 @@ static void mem_cgroup_may_update_nodema
  *
  * Now, we use round-robin. Better algorithm is welcomed.
  */
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
 {
 	int node;
 
+	*mask = NULL;
 	mem_cgroup_may_update_nodemask(mem);
 	node = mem->last_scanned_node;
 
@@ -1633,6 +1634,8 @@ int mem_cgroup_select_victim_node(struct
 	 */
 	if (unlikely(node == MAX_NUMNODES))
 		node = numa_node_id();
+	else
+		*mask = &mem->scan_nodes;
 
 	mem->last_scanned_node = node;
 	return node;
@@ -1680,8 +1683,9 @@ static void mem_cgroup_numascan_init(str
 }
 
 #else
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
 {
+	*mask = NULL;
 	return 0;
 }
 
Index: mmotm-Aug3/mm/vmscan.c
===================================================================
--- mmotm-Aug3.orig/mm/vmscan.c
+++ mmotm-Aug3/mm/vmscan.c
@@ -2354,7 +2354,6 @@ unsigned long try_to_free_mem_cgroup_pag
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.memcg_record = rec,
-		.nodemask = NULL, /* we don't care the placement */
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 	};
@@ -2368,7 +2367,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont);
+	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask);
 
 	zonelist = NODE_DATA(nid)->node_zonelists;
 



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH] memcg: fix comment on update nodemask
  2011-08-10 23:30       ` KAMEZAWA Hiroyuki
@ 2011-08-10 23:44         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-10 23:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-mm, linux-kernel, akpm, hannes, nishimura


> > >  /*
> > >   * Always updating the nodemask is not very good - even if we have an empty
> > >   * list or the wrong list here, we can start from some node and traverse all
> > > @@ -1575,7 +1593,6 @@ static bool test_mem_cgroup_node_reclaim
> > >   */
> > 
> > Would be good to update the function comment as well (we still have 10s
> > period there).
> > 
> 
how about this ?
==

Update the function's comment. The behavior was changed by commit 453a9bf3

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Index: mmotm-Aug3/mm/memcontrol.c
===================================================================
--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c
@@ -1568,10 +1568,7 @@ static bool test_mem_cgroup_node_reclaim
 #if MAX_NUMNODES > 1
 
 /*
- * Always updating the nodemask is not very good - even if we have an empty
- * list or the wrong list here, we can start from some node and traverse all
- * nodes based on the zonelist. So update the list loosely once per 10 secs.
- *
+ * Update scan nodemask with memcg's event_counter(NUMAINFO_EVENTS_TARGET)
  */
 static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
 {


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-10 14:14     ` Michal Hocko
@ 2011-08-10 23:52       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-10 23:52 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Wed, 10 Aug 2011 16:14:25 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 09-08-11 19:09:33, KAMEZAWA Hiroyuki wrote:
> > memcg: avoid node fallback scan if possible.
> > 
> > Now, try_to_free_pages() scans all zonelists because the page allocator
> > should visit all zonelists...but that behavior is harmful for memcg.
> > Memcg just scans memory because it hits its limit...there is no memory
> > shortage in the passed zonelist.
> > 
> > For example, with the following unbalanced nodes
> > 
> >      Node 0    Node 1
> > File 1G        0
> > Anon 200M      200M
> > 
> > memcg will cause swap-out from Node1 at every vmscan.
> > 
> > Another example: assume a 1024-node system.
> > With 1024 nodes, memcg will visit pages on all 1024 nodes
> > per vmscan... This is overkill.
> > 
> > This is why memcg's victim node selection logic doesn't work
> > as expected.
> > 
> > This patch helps stop vmscan once we have scanned enough.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> > OK, I see the point. At first I was afraid that we would put bigger
> > pressure on the node which triggered the reclaim, but as we are selecting
> > it dynamically (mem_cgroup_select_victim_node) - round robin at the
> > moment - it should be fair in the end. More targeted node selection
> > should be even more efficient.
> 
> I still have a concern about resize_limit code path, though. It uses
> memcg direct reclaim to get under the new limit (assuming it is lower
> than the current one). 
> Currently we might reclaim nr_nodes * SWAP_CLUSTER_MAX while
> after your change we have it at SWAP_CLUSTER_MAX. This means that
> mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines
> (currently it is doing 5 rounds of reclaim before it gives up). I do not
> consider this to be blocker but maybe we should enhance
> mem_cgroup_hierarchical_reclaim with a nr_pages argument to tell it how
> much we want to reclaim (min(SWAP_CLUSTER_MAX, nr_pages)).
> What do you think?
> 

Hmm,

> mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines

mem_cgroup_resize_limit() just checks (curusage < prevusage), so I agree
that reducing the amount of scan/reclaim per round will cause that.
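
For context, the retry loop in question looks roughly like this (a
simplified sketch, not the exact mmotm code; the real retry count also
scales with the number of children):
==
	int retry_count = MEM_CGROUP_RECLAIM_RETRIES;	/* 5 */
	u64 oldusage, curusage;

	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
	while (retry_count) {
		/* res_counter_set_limit() succeeds once usage <= val */
		if (!res_counter_set_limit(&memcg->res, val))
			break;
		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
						MEM_CGROUP_RECLAIM_SHRINK,
						NULL);
		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
		if (curusage >= oldusage)
			retry_count--;	/* no visible progress, burn a retry */
		else
			oldusage = curusage;
	}
==
so if each round reclaims fewer pages, usage shrinks more slowly and the
retries run out sooner.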

I agree to pass nr_pages to try_to_free_mem_cgroup_pages().


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH] memcg: fix comment on update nodemask
  2011-08-10 23:44         ` KAMEZAWA Hiroyuki
@ 2011-08-11 13:25           ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-11 13:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu 11-08-11 08:44:56, KAMEZAWA Hiroyuki wrote:
> 
> > > >  /*
> > > >   * Always updating the nodemask is not very good - even if we have an empty
> > > >   * list or the wrong list here, we can start from some node and traverse all
> > > > @@ -1575,7 +1593,6 @@ static bool test_mem_cgroup_node_reclaim
> > > >   */
> > > 
> > > Would be good to update the function comment as well (we still have 10s
> > > period there).
> > > 
> > 
> how about this ?
> ==
> 
> Update the function's comment. The behavior was changed by commit 453a9bf3
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> ---
>  mm/memcontrol.c |    5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> Index: mmotm-Aug3/mm/memcontrol.c
> ===================================================================
> --- mmotm-Aug3.orig/mm/memcontrol.c
> +++ mmotm-Aug3/mm/memcontrol.c
> @@ -1568,10 +1568,7 @@ static bool test_mem_cgroup_node_reclaim
>  #if MAX_NUMNODES > 1
>  
>  /*
> - * Always updating the nodemask is not very good - even if we have an empty
> - * list or the wrong list here, we can start from some node and traverse all
> - * nodes based on the zonelist. So update the list loosely once per 10 secs.
> - *
> + * Update scan nodemask with memcg's event_counter(NUMAINFO_EVENTS_TARGET)
>   */
>  static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
>  {

I would keep the first part about the reasoning and just replace the part
about the 10 secs update.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-10 23:52       ` KAMEZAWA Hiroyuki
@ 2011-08-11 14:50         ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-11 14:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu 11-08-11 08:52:52, KAMEZAWA Hiroyuki wrote:
> On Wed, 10 Aug 2011 16:14:25 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Tue 09-08-11 19:09:33, KAMEZAWA Hiroyuki wrote:
> > > memcg: avoid node fallback scan if possible.
> > > 
> > > Now, try_to_free_pages() scans all zonelists because the page allocator
> > > should visit all zonelists...but that behavior is harmful for memcg.
> > > Memcg just scans memory because it hits its limit...there is no memory
> > > shortage in the passed zonelist.
> > > 
> > > For example, with the following unbalanced nodes
> > > 
> > >      Node 0    Node 1
> > > File 1G        0
> > > Anon 200M      200M
> > > 
> > > memcg will cause swap-out from Node1 at every vmscan.
> > > 
> > > Another example: assume a 1024-node system.
> > > With 1024 nodes, memcg will visit pages on all 1024 nodes
> > > per vmscan... This is overkill.
> > > 
> > > This is why memcg's victim node selection logic doesn't work
> > > as expected.
> > > 
> > > This patch helps stop vmscan once we have scanned enough.
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > OK, I see the point. At first I was afraid that we would put bigger
> > pressure on the node which triggered the reclaim, but as we are selecting
> > it dynamically (mem_cgroup_select_victim_node) - round robin at the
> > moment - it should be fair in the end. More targeted node selection
> > should be even more efficient.
> > 
> > I still have a concern about resize_limit code path, though. It uses
> > memcg direct reclaim to get under the new limit (assuming it is lower
> > than the current one). 
> > Currently we might reclaim nr_nodes * SWAP_CLUSTER_MAX while
> > after your change we have it at SWAP_CLUSTER_MAX. This means that
> > mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines
> > (currently it is doing 5 rounds of reclaim before it gives up). I do not
> > consider this to be blocker but maybe we should enhance
> > mem_cgroup_hierarchical_reclaim with a nr_pages argument to tell it how
> > much we want to reclaim (min(SWAP_CLUSTER_MAX, nr_pages)).
> > What do you think?
> > 
> 
> Hmm,
> 
> > mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines
> 
> mem_cgroup_resize_limit() just checks (curusage < prevusage), so I agree
> that reducing the amount of scan/reclaim per round will cause that.
> 
> I agree to pass nr_pages to try_to_free_mem_cgroup_pages().

What about this (just compile tested)?
--- 
From: Michal Hocko <mhocko@suse.cz>
Subject: memcg: add nr_pages argument for hierarchical reclaim

Now that we are doing memcg direct reclaim limited to nr_to_reclaim
pages (introduced by "memcg: stop vmscan when enough done.") we have to
be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
most callers but it might cause failures for limit resize or force_empty
code paths on big NUMA machines.

Previously we might have reclaimed up to nr_nodes * SWAP_CLUSTER_MAX
while now we have it at SWAP_CLUSTER_MAX. Both resize and force_empty rely
on reclaiming a certain amount of pages and retrying if their condition is
still not met.

Let's add nr_pages argument to mem_cgroup_hierarchical_reclaim which will
push it further to try_to_free_mem_cgroup_pages. We still fall back to
SWAP_CLUSTER_MAX for small requests so the standard code (hot) paths are not
affected by this.

Open questions:
- Should we care about the soft limit as well? Currently I am using the
  number of excess pages for the parameter so it can replace the direct
  query for the value in mem_cgroup_hierarchical_reclaim, but should we push it to
  mem_cgroup_shrink_node_zone?
  I do not think so because we should try to reclaim from more groups in the
  hierarchy and also it doesn't get to shrink_zones which has been modified
  by the previous patch.
- mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
  OK but will have to think about it some more.
- Aren't we going to reclaim too much when we hit the limit due to THP?
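
(On the THP point: with 2MB huge pages and a 4KB base page size, a THP
charge passes nr_pages = HPAGE_PMD_NR = 512, so nr_to_reclaim would jump
from SWAP_CLUSTER_MAX (32) to 512 for a single fault.)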

Signed-off-by: Michal Hocko <mhocko@suse.cz>

Index: linus_tree/include/linux/memcontrol.h
===================================================================
--- linus_tree.orig/include/linux/memcontrol.h	2011-08-11 15:44:43.000000000 +0200
+++ linus_tree/include/linux/memcontrol.h	2011-08-11 15:46:27.000000000 +0200
@@ -130,7 +130,8 @@ extern void mem_cgroup_print_oom_info(st
 
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
-						  struct memcg_scanrecord *rec);
+						  struct memcg_scanrecord *rec,
+						  unsigned long nr_pages);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
Index: linus_tree/mm/memcontrol.c
===================================================================
--- linus_tree.orig/mm/memcontrol.c	2011-08-11 15:36:15.000000000 +0200
+++ linus_tree/mm/memcontrol.c	2011-08-11 16:00:46.000000000 +0200
@@ -1729,12 +1729,15 @@ static void mem_cgroup_record_scanstat(s
  * (other groups can be removed while we're walking....)
  *
  * If shrink==true, for avoiding to free too much, this returns immedieately.
+ * The given nr_pages tells how many pages we are over the soft limit, or how
+ * many pages we want to reclaim in the direct reclaim mode.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 						struct zone *zone,
 						gfp_t gfp_mask,
 						unsigned long reclaim_options,
-						unsigned long *total_scanned)
+						unsigned long *total_scanned,
+						unsigned long nr_pages)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
@@ -1743,11 +1746,8 @@ static int mem_cgroup_hierarchical_recla
 	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
 	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	struct memcg_scanrecord rec;
-	unsigned long excess;
 	unsigned long scanned;
 
-	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
 	/* If memsw_is_minimum==1, swap-out is of-no-use. */
 	if (!check_soft && !shrink && root_mem->memsw_is_minimum)
 		noswap = true;
@@ -1785,11 +1785,11 @@ static int mem_cgroup_hierarchical_recla
 				}
 				/*
 				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
+				 * nr_pages >> 2 is not to excessive so as to
 				 * reclaim too much, nor too less that we keep
 				 * coming back to reclaim from this cgroup
 				 */
-				if (total >= (excess >> 2) ||
+				if (total >= (nr_pages >> 2) ||
 					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
 					css_put(&victim->css);
 					break;
@@ -1816,7 +1816,7 @@ static int mem_cgroup_hierarchical_recla
 			*total_scanned += scanned;
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, &rec);
+						noswap, &rec, nr_pages);
 		mem_cgroup_record_scanstat(&rec);
 		css_put(&victim->css);
 		/*
@@ -2332,7 +2332,8 @@ static int mem_cgroup_do_charge(struct m
 		return CHARGE_WOULDBLOCK;
 
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags, NULL);
+					      gfp_mask, flags, NULL,
+					      nr_pages);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3567,7 +3568,8 @@ static int mem_cgroup_resize_limit(struc
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3628,7 +3630,8 @@ static int mem_cgroup_resize_memsw_limit
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_NOSWAP |
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memswlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3671,10 +3674,12 @@ unsigned long mem_cgroup_soft_limit_recl
 			break;
 
 		nr_scanned = 0;
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
 		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
 						gfp_mask,
 						MEM_CGROUP_RECLAIM_SOFT,
-						&nr_scanned);
+						&nr_scanned,
+						excess >> PAGE_SHIFT);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock(&mctz->lock);
@@ -3871,7 +3876,8 @@ try_to_free:
 		rec.mem = mem;
 		rec.root = mem;
 		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
-						false, &rec);
+						false, &rec,
+						mem->res.usage >> PAGE_SHIFT);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
Index: linus_tree/mm/vmscan.c
===================================================================
--- linus_tree.orig/mm/vmscan.c	2011-08-11 15:44:43.000000000 +0200
+++ linus_tree/mm/vmscan.c	2011-08-11 16:41:22.000000000 +0200
@@ -2340,7 +2340,8 @@ unsigned long mem_cgroup_shrink_node_zon
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
 					   bool noswap,
-					   struct memcg_scanrecord *rec)
+					   struct memcg_scanrecord *rec,
+					   unsigned long nr_pages)
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
@@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.memcg_record = rec,
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
@ 2011-08-11 14:50         ` Michal Hocko
  0 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-11 14:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu 11-08-11 08:52:52, KAMEZAWA Hiroyuki wrote:
> On Wed, 10 Aug 2011 16:14:25 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Tue 09-08-11 19:09:33, KAMEZAWA Hiroyuki wrote:
> > > memcg :avoid node fallback scan if possible.
> > > 
> > > Now, try_to_free_pages() scans all zonelist because the page allocator
> > > should visit all zonelists...but that behavior is harmful for memcg.
> > > Memcg just scans memory because it hits limit...no memory shortage
> > > in pased zonelist.
> > > 
> > > For example, with following unbalanced nodes
> > > 
> > >      Node 0    Node 1
> > > File 1G        0
> > > Anon 200M      200M
> > > 
> > > memcg will cause swap-out from Node1 at every vmscan.
> > > 
> > > Another example, assume 1024 nodes system.
> > > With 1024 node system, memcg will visit 1024 nodes
> > > pages per vmscan... This is overkilling. 
> > > 
> > > This is why memcg's victim node selection logic doesn't work
> > > as expected.
> > > 
> > > This patch is a help for stopping vmscan when we scanned enough.
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > OK, I see the point. At first I was afraid that we would make a bigger
> > pressure on the node which triggered the reclaim but as we are selecting
> > t dynamically (mem_cgroup_select_victim_node) - round robin at the
> > moment - it should be fair in the end. More targeted node selection
> > should be even more efficient.
> > 
> > I still have a concern about resize_limit code path, though. It uses
> > memcg direct reclaim to get under the new limit (assuming it is lower
> > than the current one). 
> > Currently we might reclaim nr_nodes * SWAP_CLUSTER_MAX while
> > after your change we have it at SWAP_CLUSTER_MAX. This means that
> > mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines
> > (currently it is doing 5 rounds of reclaim before it gives up). I do not
> > consider this to be blocker but maybe we should enhance
> > mem_cgroup_hierarchical_reclaim with a nr_pages argument to tell it how
> > much we want to reclaim (min(SWAP_CLUSTER_MAX, nr_pages)).
> > What do you think?
> > 
> 
> Hmm,
> 
> > mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines
> 
> mem_cgroup_resize_limit() just checks (curusage < prevusage), then, 
> I agree reducing the number of scan/reclaim will cause that.
> 
> I agree to pass nr_pages to try_to_free_mem_cgroup_pages().

What about this (just compile tested)?
--- 
From: Michal Hocko <mhocko@suse.cz>
Subject: memcg: add nr_pages argument for hierarchical reclaim

Now that we are doing memcg direct reclaim limited to nr_to_reclaim
pages (introduced by "memcg: stop vmscan when enough done.") we have to
be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
most callers but it might cause failures for limit resize or force_empty
code paths on big NUMA machines.

Previously we might have reclaimed up to nr_nodes * SWAP_CLUSTER_MAX
while now we have it at SWAP_CLUSTER_MAX. Both resize and force_empty rely
on reclaiming a certain amount of pages and retrying if their condition is
still not met.

Let's add nr_pages argument to mem_cgroup_hierarchical_reclaim which will
push it further to try_to_free_mem_cgroup_pages. We still fall back to
SWAP_CLUSTER_MAX for small requests so the standard code (hot) paths are not
affected by this.

Open questions:
- Should we care about soft limit as well? Currently I am using excess
  number of pages for the parameter so it can replace direct query for
  the value in mem_cgroup_hierarchical_reclaim but should we push it to
  mem_cgroup_shrink_node_zone?
  I do not think so because we should try to reclaim from more groups in the
  hierarchy and also it doesn't get to shrink_zones which has been modified
  by the previous patch.
- mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
  OK but will have to think about it some more.
- Aren't we going to reclaim too much when we hit the limit due to THP?

Signed-off-by: Michal Hocko <mhocko@suse.cz>

Index: linus_tree/include/linux/memcontrol.h
===================================================================
--- linus_tree.orig/include/linux/memcontrol.h	2011-08-11 15:44:43.000000000 +0200
+++ linus_tree/include/linux/memcontrol.h	2011-08-11 15:46:27.000000000 +0200
@@ -130,7 +130,8 @@ extern void mem_cgroup_print_oom_info(st
 
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
-						  struct memcg_scanrecord *rec);
+						  struct memcg_scanrecord *rec,
+						  unsigned long nr_pages);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
Index: linus_tree/mm/memcontrol.c
===================================================================
--- linus_tree.orig/mm/memcontrol.c	2011-08-11 15:36:15.000000000 +0200
+++ linus_tree/mm/memcontrol.c	2011-08-11 16:00:46.000000000 +0200
@@ -1729,12 +1729,15 @@ static void mem_cgroup_record_scanstat(s
  * (other groups can be removed while we're walking....)
  *
  * If shrink==true, for avoiding to free too much, this returns immedieately.
+ * Given nr_pages tells how many pages are we over the soft limit or how many
+ * pages do we want to reclaim in the direct reclaim mode.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 						struct zone *zone,
 						gfp_t gfp_mask,
 						unsigned long reclaim_options,
-						unsigned long *total_scanned)
+						unsigned long *total_scanned,
+						unsigned long nr_pages)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
@@ -1743,11 +1746,8 @@ static int mem_cgroup_hierarchical_recla
 	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
 	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	struct memcg_scanrecord rec;
-	unsigned long excess;
 	unsigned long scanned;
 
-	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
 	/* If memsw_is_minimum==1, swap-out is of-no-use. */
 	if (!check_soft && !shrink && root_mem->memsw_is_minimum)
 		noswap = true;
@@ -1785,11 +1785,11 @@ static int mem_cgroup_hierarchical_recla
 				}
 				/*
 				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
+				 * nr_pages >> 2 is not to excessive so as to
 				 * reclaim too much, nor too less that we keep
 				 * coming back to reclaim from this cgroup
 				 */
-				if (total >= (excess >> 2) ||
+				if (total >= (nr_pages >> 2) ||
 					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
 					css_put(&victim->css);
 					break;
@@ -1816,7 +1816,7 @@ static int mem_cgroup_hierarchical_recla
 			*total_scanned += scanned;
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, &rec);
+						noswap, &rec, nr_pages);
 		mem_cgroup_record_scanstat(&rec);
 		css_put(&victim->css);
 		/*
@@ -2332,7 +2332,8 @@ static int mem_cgroup_do_charge(struct m
 		return CHARGE_WOULDBLOCK;
 
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags, NULL);
+					      gfp_mask, flags, NULL,
+					      nr_pages);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3567,7 +3568,8 @@ static int mem_cgroup_resize_limit(struc
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3628,7 +3630,8 @@ static int mem_cgroup_resize_memsw_limit
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_NOSWAP |
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memswlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3671,10 +3674,12 @@ unsigned long mem_cgroup_soft_limit_recl
 			break;
 
 		nr_scanned = 0;
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
 		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
 						gfp_mask,
 						MEM_CGROUP_RECLAIM_SOFT,
-						&nr_scanned);
+						&nr_scanned,
+						excess >> PAGE_SHIFT);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock(&mctz->lock);
@@ -3871,7 +3876,8 @@ try_to_free:
 		rec.mem = mem;
 		rec.root = mem;
 		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
-						false, &rec);
+						false, &rec,
+						mem->res.usage >> PAGE_SHIFT);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
Index: linus_tree/mm/vmscan.c
===================================================================
--- linus_tree.orig/mm/vmscan.c	2011-08-11 15:44:43.000000000 +0200
+++ linus_tree/mm/vmscan.c	2011-08-11 16:41:22.000000000 +0200
@@ -2340,7 +2340,8 @@ unsigned long mem_cgroup_shrink_node_zon
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
 					   bool noswap,
-					   struct memcg_scanrecord *rec)
+					   struct memcg_scanrecord *rec,
+					   unsigned long nr_pages)
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
@@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.memcg_record = rec,
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH] memcg: add nr_pages argument for hierarchical reclaim
  2011-08-11 14:50         ` Michal Hocko
@ 2011-08-12 12:44           ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-12 12:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu 11-08-11 16:50:55, Michal Hocko wrote:
> On Thu 11-08-11 08:52:52, KAMEZAWA Hiroyuki wrote:
> > On Wed, 10 Aug 2011 16:14:25 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Tue 09-08-11 19:09:33, KAMEZAWA Hiroyuki wrote:
> > > > memcg :avoid node fallback scan if possible.
> > > > 
> > > > Now, try_to_free_pages() scans all zonelists because the page allocator
> > > > should visit all zonelists...but that behavior is harmful for memcg.
> > > > Memcg just scans memory because it hits its limit...there is no memory
> > > > shortage in the passed zonelist.
> > > > 
> > > > For example, with following unbalanced nodes
> > > > 
> > > >      Node 0    Node 1
> > > > File 1G        0
> > > > Anon 200M      200M
> > > > 
> > > > memcg will cause swap-out from Node1 at every vmscan.
> > > > 
> > > > Another example, assume 1024 nodes system.
> > > > With a 1024 node system, memcg will visit pages on 1024 nodes
> > > > per vmscan... This is overkill.
> > > > 
> > > > This is why memcg's victim node selection logic doesn't work
> > > > as expected.
> > > > 
> > > > This patch is a help for stopping vmscan when we scanned enough.
> > > > 
> > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > OK, I see the point. At first I was afraid that we would make a bigger
> > > pressure on the node which triggered the reclaim but as we are selecting
> > > it dynamically (mem_cgroup_select_victim_node) - round robin at the
> > > moment - it should be fair in the end. More targeted node selection
> > > should be even more efficient.
> > > 
> > > I still have a concern about resize_limit code path, though. It uses
> > > memcg direct reclaim to get under the new limit (assuming it is lower
> > > than the current one). 
> > > Currently we might reclaim nr_nodes * SWAP_CLUSTER_MAX while
> > > after your change we have it at SWAP_CLUSTER_MAX. This means that
> > > mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines
> > > (currently it is doing 5 rounds of reclaim before it gives up). I do not
> > > consider this to be a blocker but maybe we should enhance
> > > mem_cgroup_hierarchical_reclaim with a nr_pages argument to tell it how
> > > much we want to reclaim (min(SWAP_CLUSTER_MAX, nr_pages)).
> > > What do you think?
> > > 
> > 
> > Hmm,
> > 
> > > mem_cgroup_resize_mem_limit might fail sooner on large NUMA machines
> > 
> > mem_cgroup_resize_limit() just checks (curusage < prevusage), so
> > I agree that reducing the number of scanned/reclaimed pages will cause that.
> > 
> > I agree to pass nr_pages to try_to_free_mem_cgroup_pages().

This is another version which prevents excessive reclaim due to
THP.
---
From: Michal Hocko <mhocko@suse.cz>
Subject: memcg: add nr_pages argument for hierarchical reclaim

Now that we are doing memcg direct reclaim limited to nr_to_reclaim
pages (introduced by "memcg: stop vmscan when enough done.") we have to
be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
most callers but it might cause failures for limit resize or force_empty
code paths on big NUMA machines.

Previously we might have reclaimed up to nr_nodes * SWAP_CLUSTER_MAX
while now we have it at SWAP_CLUSTER_MAX. Both resize and force_empty rely
on reclaiming a certain amount of pages and retrying if their condition is
still not met.
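
(For scale: with SWAP_CLUSTER_MAX at its usual value of 32, a 64-node
machine could previously reclaim up to 64 * 32 = 2048 pages per round,
i.e. 8 MiB with 4 KiB pages, while a single round is now capped at 32
pages.)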

Let's add nr_pages argument to mem_cgroup_hierarchical_reclaim which will
push it further to try_to_free_mem_cgroup_pages. We still fall back to
SWAP_CLUSTER_MAX for small requests so the standard code (hot) paths are not
affected by this.
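
A minimal userspace sketch of the intended clamping (not the kernel code
itself; the helper name is made up here and SWAP_CLUSTER_MAX is assumed
to be its usual value of 32):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

/* mirrors .nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX) */
static unsigned long reclaim_target(unsigned long nr_pages)
{
	return nr_pages > SWAP_CLUSTER_MAX ? nr_pages : SWAP_CLUSTER_MAX;
}

int main(void)
{
	printf("%lu\n", reclaim_target(1));	/* charge path -> 32 */
	printf("%lu\n", reclaim_target(25600));	/* 100M resize, 4K pages -> 25600 */
	return 0;
}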

We have to be careful in mem_cgroup_do_charge and not provide the given
nr_pages, because we would reclaim too much for THP, which can safely
fall back to single page allocations.

Open questions:
- Should we care about soft limit as well? Currently I am using excess
  number of pages for the parameter so it can replace direct query for
  the value in mem_cgroup_hierarchical_reclaim but should we push it to
  mem_cgroup_shrink_node_zone?
  I do not think so because we should try to reclaim from more groups in the
  hierarchy and also it doesn't get to shrink_zones which has been modified
  by the previous patch.
- mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
  OK but will have to think about it some more.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Index: linus_tree/include/linux/memcontrol.h
===================================================================
--- linus_tree.orig/include/linux/memcontrol.h	2011-08-11 15:44:43.000000000 +0200
+++ linus_tree/include/linux/memcontrol.h	2011-08-11 15:46:27.000000000 +0200
@@ -130,7 +130,8 @@ extern void mem_cgroup_print_oom_info(st
 
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
-						  struct memcg_scanrecord *rec);
+						  struct memcg_scanrecord *rec,
+						  unsigned long nr_pages);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
Index: linus_tree/mm/memcontrol.c
===================================================================
--- linus_tree.orig/mm/memcontrol.c	2011-08-11 15:36:15.000000000 +0200
+++ linus_tree/mm/memcontrol.c	2011-08-11 18:10:52.000000000 +0200
@@ -1729,12 +1729,15 @@ static void mem_cgroup_record_scanstat(s
  * (other groups can be removed while we're walking....)
  *
  * If shrink==true, for avoiding to free too much, this returns immedieately.
+ * Given nr_pages tells how many pages are we over the soft limit or how many
+ * pages do we want to reclaim in the direct reclaim mode.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 						struct zone *zone,
 						gfp_t gfp_mask,
 						unsigned long reclaim_options,
-						unsigned long *total_scanned)
+						unsigned long *total_scanned,
+						unsigned long nr_pages)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
@@ -1743,11 +1746,8 @@ static int mem_cgroup_hierarchical_recla
 	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
 	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	struct memcg_scanrecord rec;
-	unsigned long excess;
 	unsigned long scanned;
 
-	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
 	/* If memsw_is_minimum==1, swap-out is of-no-use. */
 	if (!check_soft && !shrink && root_mem->memsw_is_minimum)
 		noswap = true;
@@ -1785,11 +1785,11 @@ static int mem_cgroup_hierarchical_recla
 				}
 				/*
 				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
+				 * nr_pages >> 2 is not to excessive so as to
 				 * reclaim too much, nor too less that we keep
 				 * coming back to reclaim from this cgroup
 				 */
-				if (total >= (excess >> 2) ||
+				if (total >= (nr_pages >> 2) ||
 					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
 					css_put(&victim->css);
 					break;
@@ -1816,7 +1816,7 @@ static int mem_cgroup_hierarchical_recla
 			*total_scanned += scanned;
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, &rec);
+						noswap, &rec, nr_pages);
 		mem_cgroup_record_scanstat(&rec);
 		css_put(&victim->css);
 		/*
@@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
+	/*
+	 * We are lying about nr_pages because we do not want to
+	 * reclaim too much for THP pages which should rather fallback
+	 * to small pages.
+	 */
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags, NULL);
+					      gfp_mask, flags, NULL,
+					      1);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3567,7 +3573,8 @@ static int mem_cgroup_resize_limit(struc
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3628,7 +3635,8 @@ static int mem_cgroup_resize_memsw_limit
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_NOSWAP |
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memswlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3671,10 +3679,12 @@ unsigned long mem_cgroup_soft_limit_recl
 			break;
 
 		nr_scanned = 0;
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
 		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
 						gfp_mask,
 						MEM_CGROUP_RECLAIM_SOFT,
-						&nr_scanned);
+						&nr_scanned,
+						excess >> PAGE_SHIFT);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock(&mctz->lock);
@@ -3871,7 +3881,8 @@ try_to_free:
 		rec.mem = mem;
 		rec.root = mem;
 		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
-						false, &rec);
+						false, &rec,
+						mem->res.usage >> PAGE_SHIFT);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
Index: linus_tree/mm/vmscan.c
===================================================================
--- linus_tree.orig/mm/vmscan.c	2011-08-11 15:44:43.000000000 +0200
+++ linus_tree/mm/vmscan.c	2011-08-11 16:41:22.000000000 +0200
@@ -2340,7 +2340,8 @@ unsigned long mem_cgroup_shrink_node_zon
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
 					   bool noswap,
-					   struct memcg_scanrecord *rec)
+					   struct memcg_scanrecord *rec,
+					   unsigned long nr_pages)
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
@@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.memcg_record = rec,
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-11 14:50         ` Michal Hocko
@ 2011-08-17  0:54           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-17  0:54 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu, 11 Aug 2011 16:50:55 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> What about this (just compile tested)?
> --- 
> From: Michal Hocko <mhocko@suse.cz>
> Subject: memcg: add nr_pages argument for hierarchical reclaim
> 
> Now that we are doing memcg direct reclaim limited to nr_to_reclaim
> pages (introduced by "memcg: stop vmscan when enough done.") we have to
> be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
> most callers but it might cause failures for limit resize or force_empty
> code paths on big NUMA machines.
> 
> Previously we might have reclaimed up to nr_nodes * SWAP_CLUSTER_MAX
> while now we have it at SWAP_CLUSTER_MAX. Both resize and force_empty rely
> on reclaiming a certain amount of pages and retrying if their condition is
> still not met.
> 
> Let's add nr_pages argument to mem_cgroup_hierarchical_reclaim which will
> push it further to try_to_free_mem_cgroup_pages. We still fall back to
> SWAP_CLUSTER_MAX for small requests so the standard code (hot) paths are not
> affected by this.
> 
> Open questions:
> - Should we care about soft limit as well? Currently I am using excess
>   number of pages for the parameter so it can replace direct query for
>   the value in mem_cgroup_hierarchical_reclaim but should we push it to
>   mem_cgroup_shrink_node_zone?
>   I do not think so because we should try to reclaim from more groups in the
>   hierarchy and also it doesn't get to shrink_zones which has been modified
>   by the previous patch.



> - mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
>   OK but will have to think about it some more.

force_empty/rmdir() is allowed to be stopped by Ctrl-C. I think passing res->usage
is overkill.


> - Aren't we going to reclaim too much when we hit the limit due to THP?

When we use THP without memcg, a failed hugepage allocation simply falls
back to small pages and khugepaged will collapse them into a hugepage
later.

Memcg doesn't need to take special care, I think.
If we change it, it will be a performance matter and should be measured.

> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> Index: linus_tree/include/linux/memcontrol.h
> ===================================================================
> --- linus_tree.orig/include/linux/memcontrol.h	2011-08-11 15:44:43.000000000 +0200
> +++ linus_tree/include/linux/memcontrol.h	2011-08-11 15:46:27.000000000 +0200
> @@ -130,7 +130,8 @@ extern void mem_cgroup_print_oom_info(st
>  
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>  						  gfp_t gfp_mask, bool noswap,
> -						  struct memcg_scanrecord *rec);
> +						  struct memcg_scanrecord *rec,
> +						  unsigned long nr_pages);
>  extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>  						gfp_t gfp_mask, bool noswap,
>  						struct zone *zone,
> Index: linus_tree/mm/memcontrol.c
> ===================================================================
> --- linus_tree.orig/mm/memcontrol.c	2011-08-11 15:36:15.000000000 +0200
> +++ linus_tree/mm/memcontrol.c	2011-08-11 16:00:46.000000000 +0200
> @@ -1729,12 +1729,15 @@ static void mem_cgroup_record_scanstat(s
>   * (other groups can be removed while we're walking....)
>   *
>   * If shrink==true, for avoiding to free too much, this returns immedieately.
> + * Given nr_pages tells how many pages are we over the soft limit or how many
> + * pages do we want to reclaim in the direct reclaim mode.
>   */
>  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  						struct zone *zone,
>  						gfp_t gfp_mask,
>  						unsigned long reclaim_options,
> -						unsigned long *total_scanned)
> +						unsigned long *total_scanned,
> +						unsigned long nr_pages)
>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
> @@ -1743,11 +1746,8 @@ static int mem_cgroup_hierarchical_recla
>  	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
>  	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
>  	struct memcg_scanrecord rec;
> -	unsigned long excess;
>  	unsigned long scanned;
>  
> -	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
> -
>  	/* If memsw_is_minimum==1, swap-out is of-no-use. */
>  	if (!check_soft && !shrink && root_mem->memsw_is_minimum)
>  		noswap = true;
> @@ -1785,11 +1785,11 @@ static int mem_cgroup_hierarchical_recla
>  				}
>  				/*
>  				 * We want to do more targeted reclaim.
> -				 * excess >> 2 is not to excessive so as to
> +				 * nr_pages >> 2 is not to excessive so as to
>  				 * reclaim too much, nor too less that we keep
>  				 * coming back to reclaim from this cgroup
>  				 */
> -				if (total >= (excess >> 2) ||
> +				if (total >= (nr_pages >> 2) ||
>  					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
>  					css_put(&victim->css);
>  					break;
> @@ -1816,7 +1816,7 @@ static int mem_cgroup_hierarchical_recla
>  			*total_scanned += scanned;
>  		} else
>  			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> -						noswap, &rec);
> +						noswap, &rec, nr_pages);
>  		mem_cgroup_record_scanstat(&rec);
>  		css_put(&victim->css);
>  		/*
> @@ -2332,7 +2332,8 @@ static int mem_cgroup_do_charge(struct m
>  		return CHARGE_WOULDBLOCK;
>  
>  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> -					      gfp_mask, flags, NULL);
> +					      gfp_mask, flags, NULL,
> +					      nr_pages);

Hmm, usually nr_pages = batch = CHARGE_BATCH = 32? When allocating a hugepage,
this nr_pages will be 512? I think it's too big...
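
(For scale, assuming x86-64 defaults: a PMD-sized THP is 2 MiB, i.e.
1 << 9 = 512 pages of 4 KiB, so passing it through would ask the
reclaimer for 16 times the usual SWAP_CLUSTER_MAX = 32 target in one
go.)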

Thanks,
-Kame



>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>  		return CHARGE_RETRY;
>  	/*
> @@ -3567,7 +3568,8 @@ static int mem_cgroup_resize_limit(struc
>  
>  		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
>  						MEM_CGROUP_RECLAIM_SHRINK,
> -						NULL);
> +						NULL,
> +						(val-memlimit) >> PAGE_SHIFT);
>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>  		/* Usage is reduced ? */
>    		if (curusage >= oldusage)
> @@ -3628,7 +3630,8 @@ static int mem_cgroup_resize_memsw_limit
>  		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
>  						MEM_CGROUP_RECLAIM_NOSWAP |
>  						MEM_CGROUP_RECLAIM_SHRINK,
> -						NULL);
> +						NULL,
> +						(val-memswlimit) >> PAGE_SHIFT);
>  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
> @@ -3671,10 +3674,12 @@ unsigned long mem_cgroup_soft_limit_recl
>  			break;
>  
>  		nr_scanned = 0;
> +		excess = res_counter_soft_limit_excess(&mz->mem->res);
>  		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
>  						gfp_mask,
>  						MEM_CGROUP_RECLAIM_SOFT,
> -						&nr_scanned);
> +						&nr_scanned,
> +						excess >> PAGE_SHIFT);
>  		nr_reclaimed += reclaimed;
>  		*total_scanned += nr_scanned;
>  		spin_lock(&mctz->lock);
> @@ -3871,7 +3876,8 @@ try_to_free:
>  		rec.mem = mem;
>  		rec.root = mem;
>  		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> -						false, &rec);
> +						false, &rec,
> +						mem->res.usage >> PAGE_SHIFT);
>  		if (!progress) {
>  			nr_retries--;
>  			/* maybe some writeback is necessary */
> Index: linus_tree/mm/vmscan.c
> ===================================================================
> --- linus_tree.orig/mm/vmscan.c	2011-08-11 15:44:43.000000000 +0200
> +++ linus_tree/mm/vmscan.c	2011-08-11 16:41:22.000000000 +0200
> @@ -2340,7 +2340,8 @@ unsigned long mem_cgroup_shrink_node_zon
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  					   gfp_t gfp_mask,
>  					   bool noswap,
> -					   struct memcg_scanrecord *rec)
> +					   struct memcg_scanrecord *rec,
> +					   unsigned long nr_pages)
>  {
>  	struct zonelist *zonelist;
>  	unsigned long nr_reclaimed;
> @@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  		.may_writepage = !laptop_mode,
>  		.may_unmap = 1,
>  		.may_swap = !noswap,
> -		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> +		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
>  		.order = 0,
>  		.mem_cgroup = mem_cont,
>  		.memcg_record = rec,
> -- 
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9    
> Czech Republic
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-17  0:54           ` KAMEZAWA Hiroyuki
@ 2011-08-17 11:35             ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-17 11:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Wed 17-08-11 09:54:05, KAMEZAWA Hiroyuki wrote:
> On Thu, 11 Aug 2011 16:50:55 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > What about this (just compile tested)?
> > --- 
> > From: Michal Hocko <mhocko@suse.cz>
> > Subject: memcg: add nr_pages argument for hierarchical reclaim
> > 
> > Now that we are doing memcg direct reclaim limited to nr_to_reclaim
> > pages (introduced by "memcg: stop vmscan when enough done.") we have to
> > be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
> > most callers but it might cause failures for limit resize or force_empty
> > code paths on big NUMA machines.
> > 
> > Previously we might have reclaimed up to nr_nodes * SWAP_CLUSTER_MAX
> > while now we have it at SWAP_CLUSTER_MAX. Both resize and force_empty rely
> > on reclaiming a certain amount of pages and retrying if their condition is
> > still not met.
> > 
> > Let's add nr_pages argument to mem_cgroup_hierarchical_reclaim which will
> > push it further to try_to_free_mem_cgroup_pages. We still fall back to
> > SWAP_CLUSTER_MAX for small requests so the standard code (hot) paths are not
> > affected by this.
> > 
> > Open questions:
> > - Should we care about soft limit as well? Currently I am using excess
> >   number of pages for the parameter so it can replace direct query for
> >   the value in mem_cgroup_hierarchical_reclaim but should we push it to
> >   mem_cgroup_shrink_node_zone?
> >   I do not think so because we should try to reclaim from more groups in the
> >   hierarchy and also it doesn't get to shrink_zones which has been modified
> >   by the previous patch.
> 
> 
> 
> > - mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
> >   OK but will have to think about it some more.
> 
> force_empty/rmdir() is allowed to be stopped by Ctrl-C. I think passing res->usage
> > is overkill.

So, how many pages should be reclaimed then?

> > @@ -2332,7 +2332,8 @@ static int mem_cgroup_do_charge(struct m
> >  		return CHARGE_WOULDBLOCK;
> >  
> >  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > -					      gfp_mask, flags, NULL);
> > +					      gfp_mask, flags, NULL,
> > +					      nr_pages);
> 
> Hmm, in usual, nr_pages = batch = CHARGE_BATCH = 32 ? At allocating Hugepage,
> this nr_pages will be 512 ? I think it's too big...

Yes it is. I have posted an updated version already:
http://www.spinics.net/lists/linux-mm/msg23113.html

> 
> Thanks,
> -Kame

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 4/6]  memg: calculate numa weight for vmscan
  2011-08-09 10:11   ` KAMEZAWA Hiroyuki
@ 2011-08-17 14:34     ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-17 14:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

Sorry it took so long but I was quite busy recently.

On Tue 09-08-11 19:11:00, KAMEZAWA Hiroyuki wrote:
> Calculate node scan weight.
> 
> Now, memory cgroup selects a scan target node in round-robin.
> It's not very good...there is no scheduling based on page usage.
> 
> This patch is for calculating each node's weight for scanning.
> If weight of a node is high, the node is worth to be scanned.
> 
> The weight is now calculated on the following concept.
> 
>    - make use of swappiness.
>    - If inactive-file is enough, ignore active-file
>    - If file is enough (w.r.t swappiness), ignore anon
>    - make use of recent_scan/rotated reclaim stats.

The concept looks good (see the specific comments below). I would
appreciate it if the description were more detailed (especially in the
reclaim statistics part, with the reasoning why it is better).
 
> Then, a node containing many inactive file pages will be the first victim.
> Node selection logic based on this weight will be in the next patch.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |  110 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 105 insertions(+), 5 deletions(-)
> 
> Index: mmotm-Aug3/mm/memcontrol.c
> ===================================================================
> --- mmotm-Aug3.orig/mm/memcontrol.c
> +++ mmotm-Aug3/mm/memcontrol.c
[...]
> @@ -1568,18 +1570,108 @@ static bool test_mem_cgroup_node_reclaim
>  }
>  #if MAX_NUMNODES > 1
>  
> +static unsigned long
> +__mem_cgroup_calc_numascan_weight(struct mem_cgroup * memcg,
> +				int nid,
> +				unsigned long anon_prio,
> +				unsigned long file_prio,
> +				int lru_mask)
> +{
> +	u64 file, anon;
> +	unsigned long weight, mask;

mask is not used anywhere.

> +	unsigned long rotated[2], scanned[2];
> +	int zid;
> +
> +	scanned[0] = 0;
> +	scanned[1] = 0;
> +	rotated[0] = 0;
> +	rotated[1] = 0;
> +
> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +		struct mem_cgroup_per_zone *mz;
> +
> +		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +		scanned[0] += mz->reclaim_stat.recent_scanned[0];
> +		scanned[1] += mz->reclaim_stat.recent_scanned[1];
> +		rotated[0] += mz->reclaim_stat.recent_rotated[0];
> +		rotated[1] += mz->reclaim_stat.recent_rotated[1];
> +	}
> +	file = mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask & LRU_ALL_FILE);
> +
> +	if (total_swap_pages)

What about ((lru_mask & LRU_ALL_ANON) && total_swap_pages)? Why should
we go down into mem_cgroup_node_nr_lru_pages if we are not getting anything?

> +		anon = mem_cgroup_node_nr_lru_pages(memcg,
> +					nid, mask & LRU_ALL_ANON);

btw. s/mask/lru_mask/

> +	else
> +		anon = 0;

Can be initialized during declaration (makes patch smaller).

> +	if (!(file + anon))
> +		node_clear(nid, memcg->scan_nodes);

In that case we can return with 0 right away.

> +
> +	/* 'scanned - rotated/scanned' means ratio of finding not active. */
> +	anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
> +	file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);

OK, makes sense. We should not reclaim from nodes that are known to be
hard to reclaim from. We have to be careful, however, not to exclude a
node from reclaim completely.
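
(A worked example of that discount: if, for a given node, the file LRUs
have scanned[1] = 1000 and rotated[1] = 900, the file term is scaled by
(1000 - 900) / (1000 + 1), so the node keeps only about 10% of its raw
file weight.)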

> +
> +	weight = (anon * anon_prio + file * file_prio) / 200;

Shouldn't we rather normalize the weight to the node size? This way we
are punishing bigger nodes, aren't we?
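
A toy userspace illustration of that concern (the node sizes and rotation
ratios are invented, and dividing by node size is just one possible
normalization, not something taken from the patch):

#include <stdio.h>

int main(void)
{
	/* node 0 is big but hard to reclaim from, node 1 small but easy */
	unsigned long pages[2]       = { 1000000, 100000 };
	unsigned long rotated_pct[2] = { 60, 10 };
	int nid;

	for (nid = 0; nid < 2; nid++) {
		unsigned long raw  = pages[nid] * (100 - rotated_pct[nid]) / 100;
		unsigned long norm = raw * 100 / pages[nid];

		printf("node %d: raw weight %lu, normalized %lu\n",
		       nid, raw, norm);
	}
	return 0;
}

The raw weight still prefers the big node (400000 vs 90000) even though
most of its scanned pages were rotated back, while the normalized value
(40 vs 90) favours the node that is actually easier to reclaim from.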

> +	return weight;
> +}
> +
> +/*
> + * Calculate each NUMA node's scan weight. scan weight is determined by
> + * amount of pages and recent scan ratio, swappiness.
> + */
> +static unsigned long
> +mem_cgroup_calc_numascan_weight(struct mem_cgroup *memcg)
> +{
> +	unsigned long weight, total_weight;
> +	u64 anon_prio, file_prio, nr_anon, nr_file;
> +	int lru_mask;
> +	int nid;
> +
> +	anon_prio = mem_cgroup_swappiness(memcg) + 1;
> +	file_prio = 200 - anon_prio + 1;

What is the +1 good for? I do not see anon_prio being used as a
denominator.

> +
> +	lru_mask = BIT(LRU_INACTIVE_FILE);
> +	if (mem_cgroup_inactive_file_is_low(memcg))
> +		lru_mask |= BIT(LRU_ACTIVE_FILE);
> +	/*
> +	 * In vmscan.c, we'll scan anonymous pages with regard to memcg/zone's
> +	 * amounts of file/anon pages and swappiness and reclaim_stat. Here,
> +	 * we try to find good node to be scanned. If the memcg contains enough
> +	 * file caches, we'll ignore anon's weight.
> +	 * (Note) scanning anon-only node tends to be waste of time.
> +	 */
> +
> +	nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
> +	nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
> +
> +	/* If file cache is small w.r.t swappiness, check anon page's weight */
> +	if (nr_file * file_prio >= nr_anon * anon_prio)
> +		lru_mask |= BIT(LRU_INACTIVE_ANON);

Why do we not care about active anon (e.g. if inactive anon is low)?
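
(For reference, with the default swappiness of 60 this check becomes
anon_prio = 61, file_prio = 140, so the inactive anon LRU is added
whenever nr_file * 140 >= nr_anon * 61, i.e. whenever file pages amount
to at least roughly 44% of anon pages.)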

> +
> +	total_weight = 0;

Can be initialized during declaration.

[...]
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-17 11:35             ` Michal Hocko
@ 2011-08-17 23:52               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-17 23:52 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Wed, 17 Aug 2011 13:35:50 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Wed 17-08-11 09:54:05, KAMEZAWA Hiroyuki wrote:
> > On Thu, 11 Aug 2011 16:50:55 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > What about this (just compile tested)?
> > > --- 
> > > From: Michal Hocko <mhocko@suse.cz>
> > > Subject: memcg: add nr_pages argument for hierarchical reclaim
> > > 
> > > Now that we are doing memcg direct reclaim limited to nr_to_reclaim
> > > pages (introduced by "memcg: stop vmscan when enough done.") we have to
> > > be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
> > > most callers but it might cause failures for limit resize or force_empty
> > > code paths on big NUMA machines.
> > > 
> > > Previously we might have reclaimed up to nr_nodes * SWAP_CLUSTER_MAX
> > > while now we have it at SWAP_CLUSTER_MAX. Both resize and force_empty rely
> > > on reclaiming a certain amount of pages and retrying if their condition is
> > > still not met.
> > > 
> > > Let's add nr_pages argument to mem_cgroup_hierarchical_reclaim which will
> > > push it further to try_to_free_mem_cgroup_pages. We still fall back to
> > > SWAP_CLUSTER_MAX for small requests so the standard code (hot) paths are not
> > > affected by this.
> > > 
> > > Open questions:
> > > - Should we care about soft limit as well? Currently I am using excess
> > >   number of pages for the parameter so it can replace direct query for
> > >   the value in mem_cgroup_hierarchical_reclaim but should we push it to
> > >   mem_cgroup_shrink_node_zone?
> > >   I do not think so because we should try to reclaim from more groups in the
> > >   hierarchy and also it doesn't get to shrink_zones which has been modified
> > >   by the previous patch.
> > 
> > 
> > 
> > > - mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
> > >   OK but will have to think about it some more.
> > 
> > force_empty/rmdir() is allowed to be stopped by Ctrl-C. I think passing res->usage
> > is overkilling.
> 
> So, how many pages should be reclaimed then?
> 

How about (1 << (MAX_ORDER-1))/loop ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 4/6]  memg: calculate numa weight for vmscan
  2011-08-17 14:34     ` Michal Hocko
@ 2011-08-18  0:17       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  0:17 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Wed, 17 Aug 2011 16:34:18 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> Sorry it took so long but I was quite busy recently.
> 
> On Tue 09-08-11 19:11:00, KAMEZAWA Hiroyuki wrote:
> > calculate node scan weight.
> > 
> > Now, memory cgroup selects a scan target node in round-robin.
> > It's not very good... there is no scheduling based on page usage.
> > 
> > This patch is for calculating each node's weight for scanning.
> > If the weight of a node is high, the node is worth scanning.
> > 
> > The weight is now calculated based on the following concept:
> > 
> >    - make use of swappiness.
> >    - If inactive-file is enough, ignore active-file
> >    - If file is enough (w.r.t swappiness), ignore anon
> >    - make use of recent_scan/rotated reclaim stats.
> 
> The concept looks good (see the specific comments below). I would
> appreciate it if the description were more descriptive (especially in the
> reclaim statistics part, with the reasoning why it is better).
> 
> > Then, a node that contains many inactive file pages will be the first victim.
> > Node selection logic based on this weight will be in the next patch.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |  110 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 105 insertions(+), 5 deletions(-)
> > 
> > Index: mmotm-Aug3/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-Aug3.orig/mm/memcontrol.c
> > +++ mmotm-Aug3/mm/memcontrol.c
> [...]
> > @@ -1568,18 +1570,108 @@ static bool test_mem_cgroup_node_reclaim
> >  }
> >  #if MAX_NUMNODES > 1
> >  
> > +static unsigned long
> > +__mem_cgroup_calc_numascan_weight(struct mem_cgroup * memcg,
> > +				int nid,
> > +				unsigned long anon_prio,
> > +				unsigned long file_prio,
> > +				int lru_mask)
> > +{
> > +	u64 file, anon;
> > +	unsigned long weight, mask;
> 
> mask is not used anywhere.
> 
I'll remove this.


> > +	unsigned long rotated[2], scanned[2];
> > +	int zid;
> > +
> > +	scanned[0] = 0;
> > +	scanned[1] = 0;
> > +	rotated[0] = 0;
> > +	rotated[1] = 0;
> > +
> > +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> > +		struct mem_cgroup_per_zone *mz;
> > +
> > +		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> > +		scanned[0] += mz->reclaim_stat.recent_scanned[0];
> > +		scanned[1] += mz->reclaim_stat.recent_scanned[1];
> > +		rotated[0] += mz->reclaim_stat.recent_rotated[0];
> > +		rotated[1] += mz->reclaim_stat.recent_rotated[1];
> > +	}
> > +	file = mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask & LRU_ALL_FILE);
> > +
> > +	if (total_swap_pages)
> 
> What about ((lru_mask & LRU_ALL_ANON) && total_swap_pages)?

Ok. will add that.

> Why should we go down the mem_cgroup_node_nr_lru_pages if are not getting anything?
> 



> > +		anon = mem_cgroup_node_nr_lru_pages(memcg,
> > +					nid, mask & LRU_ALL_ANON);
> 
> btw. s/mask/lru_mask/
> 
yes...

> > +	else
> > +		anon = 0;
> 
> Can be initialized during declaration (makes patch smaller).
> 
Sure.

> > +	if (!(file + anon))
> > +		node_clear(nid, memcg->scan_nodes);
> 
> In that case we can return with 0 right away.
> 
yes.



> > +
> > +	/* 'scanned - rotated/scanned' means ratio of finding not active. */
> > +	anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
> > +	file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);
> 
> OK, makes sense. We should not reclaim from nodes that are known to be
> hard to reclaim from. We, however, have to be careful to not exclude the
> node from reclaiming completely.
> 
> > +
> > +	weight = (anon * anon_prio + file * file_prio) / 200;
> 
> Shouldn't we rather normalize the weight to the node size? This way we
> are punishing bigger nodes, aren't we.
> 

Here, the routine is for reclaiming memory in a memcg in a smooth way,
not for balancing zones. That will be kswapd+memcg (softlimit) work.
The size of a node in this memcg is represented by file + anon.


> > +	return weight;
> > +}
> > +
> > +/*
> > + * Calculate each NUMA node's scan weight. scan weight is determined by
> > + * amount of pages and recent scan ratio, swappiness.
> > + */
> > +static unsigned long
> > +mem_cgroup_calc_numascan_weight(struct mem_cgroup *memcg)
> > +{
> > +	unsigned long weight, total_weight;
> > +	u64 anon_prio, file_prio, nr_anon, nr_file;
> > +	int lru_mask;
> > +	int nid;
> > +
> > +	anon_prio = mem_cgroup_swappiness(memcg) + 1;
> > +	file_prio = 200 - anon_prio + 1;
> 
> What is +1 good for. I do not see that anon_prio would be used as a
> denominator.
> 

weight = (anon * anon_prio + file * file_prio) / 200;

It is just to keep anon's influence from ever becoming 0 (because of a
wrong value set to swappiness by the user).
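
As a concrete example of that point, a tiny userspace sketch (illustration
only, not kernel code; the anon page count is a made-up value) with
swappiness set to 0 and an anon-only node:
==
#include <stdio.h>

int main(void)
{
	unsigned long swappiness = 0;
	unsigned long anon_prio = swappiness + 1;	/* never 0 thanks to the +1 */
	unsigned long file_prio = 200 - anon_prio + 1;	/* 200 */
	unsigned long anon = 10000, file = 0;		/* anon-only node */

	printf("weight = %lu\n", (anon * anon_prio + file * file_prio) / 200);
	/* prints 50; without the +1 the weight would be 0 and the node
	 * would look like it has nothing worth scanning */
	return 0;
}
==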


> > +
> > +	lru_mask = BIT(LRU_INACTIVE_FILE);
> > +	if (mem_cgroup_inactive_file_is_low(memcg))
> > +		lru_mask |= BIT(LRU_ACTIVE_FILE);
> > +	/*
> > +	 * In vmscan.c, we'll scan anonymous pages with regard to memcg/zone's
> > +	 * amounts of file/anon pages and swappiness and reclaim_stat. Here,
> > +	 * we try to find good node to be scanned. If the memcg contains enough
> > +	 * file caches, we'll ignore anon's weight.
> > +	 * (Note) scanning anon-only node tends to be waste of time.
> > +	 */
> > +
> > +	nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
> > +	nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
> > +
> > +	/* If file cache is small w.r.t swappiness, check anon page's weight */
> > +	if (nr_file * file_prio >= nr_anon * anon_prio)
> > +		lru_mask |= BIT(LRU_INACTIVE_ANON);
> 
> Why we do not care about active anon (e.g. if inactive anon is low)?
> 
This condition is wrong...

	if (nr_file * file_prio <= nr_anon * anon_prio)
		lru_mask |= BIT(LRU_INACTIVE_ANON);
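
To see what the fix changes, a tiny userspace check with made-up numbers
(swappiness 60 and an anon-heavy memcg with little file cache; this is
just a sketch, not the kernel code):
==
#include <stdio.h>

int main(void)
{
	unsigned long anon_prio = 60 + 1, file_prio = 200 - anon_prio + 1;
	unsigned long nr_file = 1000, nr_anon = 100000;

	/* original test: adds INACTIVE_ANON only when file cache is large */
	printf("nr_file * file_prio >= nr_anon * anon_prio : %d\n",
	       nr_file * file_prio >= nr_anon * anon_prio);
	/* corrected test: adds INACTIVE_ANON when file cache is small */
	printf("nr_file * file_prio <= nr_anon * anon_prio : %d\n",
	       nr_file * file_prio <= nr_anon * anon_prio);
	/* prints 0 then 1: with ">=" this anon-heavy memcg would never get
	 * LRU_INACTIVE_ANON into the mask, with "<=" it does */
	return 0;
}
==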

I was worried about LRU_ACTIVE_ANON. I considered:
  - We can't handle ACTIVE_ANON and INACTIVE_ANON with the same weight,
    but I don't want to add more magic numbers.
  - vmscan.c:shrink_zone() scans ACTIVE_ANON whenever/only when
    inactive_anon_is_low()==true, SWAP_CLUSTER_MAX per priority.
    It's specially handled.

So I thought that involving the number of ACTIVE_ANON in the weight is
difficult and ignored ACTIVE_ANON here. Do you have an idea?



> > +
> > +	total_weight = 0;
> 
> Can be initialized during declaration.
> 

will fix.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-17 23:52               ` KAMEZAWA Hiroyuki
@ 2011-08-18  6:27                 ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-18  6:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu 18-08-11 08:52:33, KAMEZAWA Hiroyuki wrote:
> On Wed, 17 Aug 2011 13:35:50 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Wed 17-08-11 09:54:05, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 11 Aug 2011 16:50:55 +0200
> > > > - mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
> > > >   OK but will have to think about it some more.
> > > 
> > > force_empty/rmdir() is allowed to be stopped by Ctrl-C. I think passing res->usage
> > > is overkilling.
> > 
> > So, how many pages should be reclaimed then?
> > 
> 
> How about (1 << (MAX_ORDER-1))/loop ?

Hmm, I am not sure I see any benefit. We want to reclaim all those
pages, so why shouldn't we do it in one batch? If we use a value based on
MAX_ORDER then there is a bigger chance that force_empty fails for big
cgroups (e.g. with a lot of page cache).
Anyway, if we want to mimic the previous behavior then we should use
something like nr_nodes * SWAP_CLUSTER_MAX (the above value would be
sufficient for up to 32 nodes).
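
For reference, a trivial userspace check of that arithmetic, assuming the
common defaults of MAX_ORDER == 11 and SWAP_CLUSTER_MAX == 32 (the actual
values depend on the kernel config):
==
#include <stdio.h>

int main(void)
{
	/* assumed config defaults */
	unsigned long max_order = 11, swap_cluster_max = 32;

	printf("1 << (MAX_ORDER-1)          = %lu pages\n",
	       1UL << (max_order - 1));
	printf("32 nodes * SWAP_CLUSTER_MAX = %lu pages\n",
	       32 * swap_cluster_max);
	/* both print 1024, which is why the value above covers 32 nodes */
	return 0;
}
==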

> 
> Thanks,
> -Kame

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-18  6:27                 ` Michal Hocko
@ 2011-08-18  6:42                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  6:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu, 18 Aug 2011 08:27:22 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 18-08-11 08:52:33, KAMEZAWA Hiroyuki wrote:
> > On Wed, 17 Aug 2011 13:35:50 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Wed 17-08-11 09:54:05, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 11 Aug 2011 16:50:55 +0200
> > > > > - mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
> > > > >   OK but will have to think about it some more.
> > > > 
> > > > force_empty/rmdir() is allowed to be stopped by Ctrl-C. I think passing res->usage
> > > > is overkilling.
> > > 
> > > So, how many pages should be reclaimed then?
> > > 
> > 
> > How about (1 << (MAX_ORDER-1))/loop ?
> 
> Hmm, I am not sure I see any benefit. We want to reclaim all those
> pages why shouldn't we do it in one batch? If we use a value based on
> MAX_ORDER then we make a bigger chance that force_empty fails for big
> cgroups (e.g. with a lot of page cache).

Why a bigger chance to fail? The retry counter is decreased only when we
cannot make any reclaim, so the number passed here is not a problem with
respect to the failure.

I don't like a very long vmscan which cannot be stopped by Ctrl-C.


> Anyway, if we want to mimic the previous behavior then we should use
> something like nr_nodes * SWAP_CLUSTER_MAX (the above value would be
> sufficient for up to 32 nodes).
> 

agreed.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 2/6]  memcg: stop vmscan when enough done.
  2011-08-18  6:42                   ` KAMEZAWA Hiroyuki
@ 2011-08-18  7:46                     ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-18  7:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu 18-08-11 15:42:59, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Aug 2011 08:27:22 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Thu 18-08-11 08:52:33, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 17 Aug 2011 13:35:50 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > > 
> > > > On Wed 17-08-11 09:54:05, KAMEZAWA Hiroyuki wrote:
> > > > > On Thu, 11 Aug 2011 16:50:55 +0200
> > > > > > - mem_cgroup_force_empty asks for reclaiming all pages. I guess it should be
> > > > > >   OK but will have to think about it some more.
> > > > > 
> > > > > force_empty/rmdir() is allowed to be stopped by Ctrl-C. I think passing res->usage
> > > > > is overkilling.
> > > > 
> > > > So, how many pages should be reclaimed then?
> > > > 
> > > 
> > > How about (1 << (MAX_ORDER-1))/loop ?
> > 
> > Hmm, I am not sure I see any benefit. We want to reclaim all those
> > pages why shouldn't we do it in one batch? If we use a value based on
> > MAX_ORDER then we make a bigger chance that force_empty fails for big
> > cgroups (e.g. with a lot of page cache).
> 
> Why bigger chance to fail ? retry counter is decreased only when we cannot
> make any reclaim. The number passed here is not problem against the faiulre.

Yes, you are right. I have overlooked that.

 
> I don't like very long vmscan which cannot be stopped by Ctrl-C.

Sure, now I see your point. Thanks for clarification.

> > Anyway, if we want to mimic the previous behavior then we should use
> > something like nr_nodes * SWAP_CLUSTER_MAX (the above value would be
> > sufficient for up to 32 nodes).
> > 
> 
> agreed.

Updated patch:
Changes since v1:
- reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty
--- 
From: Michal Hocko <mhocko@suse.cz>
Subject: memcg: add nr_pages argument for hierarchical reclaim

Now that we are doing memcg direct reclaim limited to nr_to_reclaim
pages (introduced by "memcg: stop vmscan when enough done.") we have to
be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
most callers but it might cause failures for limit resize or force_empty
code paths on big NUMA machines.

Previously we might have reclaimed up to nr_nodes * SWAP_CLUSTER_MAX
while now we have it at SWAP_CLUSTER_MAX. Both resize and force_empty rely
on reclaiming a certain amount of pages and retrying if their condition is
still not met.

Let's add nr_pages argument to mem_cgroup_hierarchical_reclaim which will
push it further to try_to_free_mem_cgroup_pages. We still fall back to
SWAP_CLUSTER_MAX for small requests so the standard code (hot) paths are not
affected by this.

We have to be careful in mem_cgroup_do_charge and not provide the
given nr_pages because we would reclaim too much for THP, which can
safely fall back to single page allocations.

mem_cgroup_force_empty could try to reclaim all pages at once but it is much
better to limit the nr_pages to something reasonable so that we are able to
terminate it by a signal. Let's mimic previous behavior by asking for
MAX_NUMNODES * SWAP_CLUSTER_MAX.

Signed-off-by: Michal Hocko <mhocko@suse.cz>

Index: linus_tree/include/linux/memcontrol.h
===================================================================
--- linus_tree.orig/include/linux/memcontrol.h	2011-08-18 09:30:24.000000000 +0200
+++ linus_tree/include/linux/memcontrol.h	2011-08-18 09:30:36.000000000 +0200
@@ -130,7 +130,8 @@ extern void mem_cgroup_print_oom_info(st
 
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
-						  struct memcg_scanrecord *rec);
+						  struct memcg_scanrecord *rec,
+						  unsigned long nr_pages);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
Index: linus_tree/mm/memcontrol.c
===================================================================
--- linus_tree.orig/mm/memcontrol.c	2011-08-18 09:30:34.000000000 +0200
+++ linus_tree/mm/memcontrol.c	2011-08-18 09:36:41.000000000 +0200
@@ -1729,12 +1729,15 @@ static void mem_cgroup_record_scanstat(s
  * (other groups can be removed while we're walking....)
  *
  * If shrink==true, for avoiding to free too much, this returns immedieately.
+ * Given nr_pages tells how many pages are we over the soft limit or how many
+ * pages do we want to reclaim in the direct reclaim mode.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 						struct zone *zone,
 						gfp_t gfp_mask,
 						unsigned long reclaim_options,
-						unsigned long *total_scanned)
+						unsigned long *total_scanned,
+						unsigned long nr_pages)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
@@ -1743,11 +1746,8 @@ static int mem_cgroup_hierarchical_recla
 	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
 	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	struct memcg_scanrecord rec;
-	unsigned long excess;
 	unsigned long scanned;
 
-	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
 	/* If memsw_is_minimum==1, swap-out is of-no-use. */
 	if (!check_soft && !shrink && root_mem->memsw_is_minimum)
 		noswap = true;
@@ -1785,11 +1785,11 @@ static int mem_cgroup_hierarchical_recla
 				}
 				/*
 				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
+				 * nr_pages >> 2 is not to excessive so as to
 				 * reclaim too much, nor too less that we keep
 				 * coming back to reclaim from this cgroup
 				 */
-				if (total >= (excess >> 2) ||
+				if (total >= (nr_pages >> 2) ||
 					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
 					css_put(&victim->css);
 					break;
@@ -1816,7 +1816,7 @@ static int mem_cgroup_hierarchical_recla
 			*total_scanned += scanned;
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, &rec);
+						noswap, &rec, nr_pages);
 		mem_cgroup_record_scanstat(&rec);
 		css_put(&victim->css);
 		/*
@@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
+	/*
+	 * We are lying about nr_pages because we do not want to
+	 * reclaim too much for THP pages which should rather fallback
+	 * to small pages.
+	 */
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags, NULL);
+					      gfp_mask, flags, NULL,
+					      1);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3567,7 +3573,8 @@ static int mem_cgroup_resize_limit(struc
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3628,7 +3635,8 @@ static int mem_cgroup_resize_memsw_limit
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_NOSWAP |
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memswlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3671,10 +3679,12 @@ unsigned long mem_cgroup_soft_limit_recl
 			break;
 
 		nr_scanned = 0;
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
 		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
 						gfp_mask,
 						MEM_CGROUP_RECLAIM_SOFT,
-						&nr_scanned);
+						&nr_scanned,
+						excess >> PAGE_SHIFT);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock(&mctz->lock);
@@ -3870,8 +3880,10 @@ try_to_free:
 		rec.context = SCAN_BY_SHRINK;
 		rec.mem = mem;
 		rec.root = mem;
+		/* reclaim from every node at least something */
 		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
-						false, &rec);
+						false, &rec,
+						MAX_NUMNODES * SWAP_CLUSTER_MAX);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
Index: linus_tree/mm/vmscan.c
===================================================================
--- linus_tree.orig/mm/vmscan.c	2011-08-18 09:30:24.000000000 +0200
+++ linus_tree/mm/vmscan.c	2011-08-18 09:30:36.000000000 +0200
@@ -2340,7 +2340,8 @@ unsigned long mem_cgroup_shrink_node_zon
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
 					   bool noswap,
-					   struct memcg_scanrecord *rec)
+					   struct memcg_scanrecord *rec,
+					   unsigned long nr_pages)
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
@@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.memcg_record = rec,

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 4/6]  memg: calculate numa weight for vmscan
  2011-08-18  0:17       ` KAMEZAWA Hiroyuki
@ 2011-08-18  8:41         ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-18  8:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu 18-08-11 09:17:50, KAMEZAWA Hiroyuki wrote:
> On Wed, 17 Aug 2011 16:34:18 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Sorry it took so long but I was quite busy recently.
> > 
> > On Tue 09-08-11 19:11:00, KAMEZAWA Hiroyuki wrote:
[...]
> > > Index: mmotm-Aug3/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-Aug3.orig/mm/memcontrol.c
> > > +++ mmotm-Aug3/mm/memcontrol.c
[...]
> > > +
> > > +	/* 'scanned - rotated/scanned' means ratio of finding not active. */
> > > +	anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
> > > +	file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);
> > 
> > OK, makes sense. We should not reclaim from nodes that are known to be
> > hard to reclaim from. We, however, have to be careful to not exclude the
> > node from reclaiming completely.
> > 
> > > +
> > > +	weight = (anon * anon_prio + file * file_prio) / 200;
> > 
> > Shouldn't we rather normalize the weight to the node size? This way we
> > are punishing bigger nodes, aren't we.
> > 
> 
> Here, the routine is for reclaiming memory in a memcg in smooth way.
> And not for balancing zone. It will be kswapd+memcg(softlimit) work.
> The size of node in this memcg is represented by file + anon.

I am not sure I understand what you mean by that, but consider two nodes
with swappiness = 0 (anon_prio = 1, file_prio = 200):

A: 1000 pages, 100 anon, 300 file   -> weight 300,   node is 40% full
B: 15000 pages, 2500 anon, 3500 file -> weight ~3500, node is 40% full

I think that both nodes should get equal weight.
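
To put numbers on it, a userspace sketch of this example (illustration
only; the weight() helper is local to the sketch, and 'node size' is
assumed here to mean the node's total page count):
==
#include <stdio.h>

/* weight formula from the patch */
static unsigned long weight(unsigned long anon, unsigned long file,
			    unsigned long anon_prio, unsigned long file_prio)
{
	return (anon * anon_prio + file * file_prio) / 200;
}

int main(void)
{
	unsigned long anon_prio = 1, file_prio = 200;	/* swappiness == 0 */
	unsigned long wa = weight(100, 300, anon_prio, file_prio);	/* node A */
	unsigned long wb = weight(2500, 3500, anon_prio, file_prio);	/* node B */

	printf("A: weight=%lu, weight per 1000 node pages=%lu\n",
	       wa, wa * 1000 / 1000);
	printf("B: weight=%lu, weight per 1000 node pages=%lu\n",
	       wb, wb * 1000 / 15000);
	/* raw weights: 300 vs ~3500 although both nodes are equally (40%) full;
	 * normalized by node size they end up in the same range */
	return 0;
}
==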

> > > +	return weight;
> > > +}
> > > +
> > > +/*
> > > + * Calculate each NUMA node's scan weight. scan weight is determined by
> > > + * amount of pages and recent scan ratio, swappiness.
> > > + */
> > > +static unsigned long
> > > +mem_cgroup_calc_numascan_weight(struct mem_cgroup *memcg)
> > > +{
> > > +	unsigned long weight, total_weight;
> > > +	u64 anon_prio, file_prio, nr_anon, nr_file;
> > > +	int lru_mask;
> > > +	int nid;
> > > +
> > > +	anon_prio = mem_cgroup_swappiness(memcg) + 1;
> > > +	file_prio = 200 - anon_prio + 1;
> > 
> > What is +1 good for. I do not see that anon_prio would be used as a
> > denominator.
> > 
> 
> weight = (anon * anon_prio + file * file_prio) / 200;
> 
> Just for avoiding the influence of anon never be 0 (by wrong value
> set to swappiness by user.)

OK, so you want to prevent the situation where we have swappiness 0
and there are no file pages, in which case the node would have 0 weight?
Why do you consider 0 swappiness a wrong value?

[...]
> > > +	nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
> > > +	nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
> > > +
> > > +	/* If file cache is small w.r.t swappiness, check anon page's weight */
> > > +	if (nr_file * file_prio >= nr_anon * anon_prio)
> > > +		lru_mask |= BIT(LRU_INACTIVE_ANON);
> > 
> > Why we do not care about active anon (e.g. if inactive anon is low)?
> > 
> This condition is wrong...
> 
> 	if (nr_file * file_prio <= nr_anon * anon_prio)
> 		lru_mask |= BIT(LRU_INACTIVE_ANON);

True. Haven't noticed it before...

> 
> I was worried about LRU_ACTIVE_ANON. I considered
>   - We can't handle ACTIVE_ANON and INACTIVE_ANON in the same weight.
>     But I don't want to add more magic numbers.

Yes I agree, weight shouldn't involve active pages because we do not
want to reclaim nodes according to their active working set.

>   - vmscan.c:shrink_zone() scans ACTIVE_ANON whenever/only when
>     inactive_anon_is_low()==true. SWAP_CLUSTER_MAX per priority.
>     It's specially handled.
> 
> So, I thought involing the number of ACTIVE_ANON to the weight is difficult
> and ignored ACTIVE_ANON, here. Do you have idea ?

I am not sure whether nr_anon should also include active pages, though.
We are comparing all file to all anon pages, which looks consistent; on
the other hand we are not including active pages in the weight. This way
we put bigger pressure on nodes with a big anon working set.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim
  2011-08-18  7:46                     ` Michal Hocko
@ 2011-08-18 12:57                       ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-18 12:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

I have just realized that num_online_nodes should be much better than
MAX_NUMNODES. 
Just for reference, the patch is based on top of
https://lkml.org/lkml/2011/8/9/82 (it doesn't depend on it but it also
doesn't make much sense without it)

Changes since v2:
- use num_online_nodes rather than MAX_NUMNODES
Changes since v1:
- reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty
---
From: Michal Hocko <mhocko@suse.cz>
Subject: memcg: add nr_pages argument for hierarchical reclaim

Now that we are doing memcg direct reclaim limited to nr_to_reclaim
pages (introduced by "memcg: stop vmscan when enough done.") we have to
be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
most callers but it might cause failures for limit resize or force_empty
code paths on big NUMA machines.

Previously we might have reclaimed up to nr_nodes * SWAP_CLUSTER_MAX
while now we have it at SWAP_CLUSTER_MAX. Both resize and force_empty rely
on reclaiming a certain amount of pages and retrying if their condition is
still not met.

Let's add nr_pages argument to mem_cgroup_hierarchical_reclaim which will
push it further to try_to_free_mem_cgroup_pages. We still fall back to
SWAP_CLUSTER_MAX for small requests so the standard code (hot) paths are not
affected by this.

We have to be careful in mem_cgroup_do_charge and not provide the
given nr_pages because we would reclaim too much for THP, which can
safely fall back to single page allocations.

mem_cgroup_force_empty could try to reclaim all pages at once but it is much
better to limit the nr_pages to something reasonable so that we are able to
terminate it by a signal. Let's mimic previous behavior by asking for
num_online_nodes * SWAP_CLUSTER_MAX.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Index: linus_tree/include/linux/memcontrol.h
===================================================================
--- linus_tree.orig/include/linux/memcontrol.h	2011-08-18 09:30:24.000000000 +0200
+++ linus_tree/include/linux/memcontrol.h	2011-08-18 09:30:36.000000000 +0200
@@ -130,7 +130,8 @@ extern void mem_cgroup_print_oom_info(st
 
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
-						  struct memcg_scanrecord *rec);
+						  struct memcg_scanrecord *rec,
+						  unsigned long nr_pages);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
Index: linus_tree/mm/memcontrol.c
===================================================================
--- linus_tree.orig/mm/memcontrol.c	2011-08-18 09:30:34.000000000 +0200
+++ linus_tree/mm/memcontrol.c	2011-08-18 14:50:26.000000000 +0200
@@ -1729,12 +1729,15 @@ static void mem_cgroup_record_scanstat(s
  * (other groups can be removed while we're walking....)
  *
  * If shrink==true, for avoiding to free too much, this returns immedieately.
+ * The given nr_pages tells how many pages we are over the soft limit or how
+ * many pages we want to reclaim in the direct reclaim mode.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 						struct zone *zone,
 						gfp_t gfp_mask,
 						unsigned long reclaim_options,
-						unsigned long *total_scanned)
+						unsigned long *total_scanned,
+						unsigned long nr_pages)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
@@ -1743,11 +1746,8 @@ static int mem_cgroup_hierarchical_recla
 	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
 	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	struct memcg_scanrecord rec;
-	unsigned long excess;
 	unsigned long scanned;
 
-	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
 	/* If memsw_is_minimum==1, swap-out is of-no-use. */
 	if (!check_soft && !shrink && root_mem->memsw_is_minimum)
 		noswap = true;
@@ -1785,11 +1785,11 @@ static int mem_cgroup_hierarchical_recla
 				}
 				/*
 				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
+				 * nr_pages >> 2 is not to excessive so as to
 				 * reclaim too much, nor too less that we keep
 				 * coming back to reclaim from this cgroup
 				 */
-				if (total >= (excess >> 2) ||
+				if (total >= (nr_pages >> 2) ||
 					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
 					css_put(&victim->css);
 					break;
@@ -1816,7 +1816,7 @@ static int mem_cgroup_hierarchical_recla
 			*total_scanned += scanned;
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, &rec);
+						noswap, &rec, nr_pages);
 		mem_cgroup_record_scanstat(&rec);
 		css_put(&victim->css);
 		/*
@@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
+	/*
+	 * We are lying about nr_pages because we do not want to
+	 * reclaim too much for THP pages which should rather fallback
+	 * to small pages.
+	 */
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags, NULL);
+					      gfp_mask, flags, NULL,
+					      1);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3567,7 +3573,8 @@ static int mem_cgroup_resize_limit(struc
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3628,7 +3635,8 @@ static int mem_cgroup_resize_memsw_limit
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_NOSWAP |
 						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+						NULL,
+						(val-memswlimit) >> PAGE_SHIFT);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3671,10 +3679,12 @@ unsigned long mem_cgroup_soft_limit_recl
 			break;
 
 		nr_scanned = 0;
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
 		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
 						gfp_mask,
 						MEM_CGROUP_RECLAIM_SOFT,
-						&nr_scanned);
+						&nr_scanned,
+						excess >> PAGE_SHIFT);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock(&mctz->lock);
@@ -3861,6 +3871,7 @@ try_to_free:
 	shrink = 1;
 	while (nr_retries && mem->res.usage > 0) {
 		struct memcg_scanrecord rec;
+		unsigned long nr_to_reclaim;
 		int progress;
 
 		if (signal_pending(current)) {
@@ -3870,8 +3881,10 @@ try_to_free:
 		rec.context = SCAN_BY_SHRINK;
 		rec.mem = mem;
 		rec.root = mem;
+		/* reclaim from every node at least something */
+		nr_to_reclaim = num_online_nodes() * SWAP_CLUSTER_MAX;
 		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
-						false, &rec);
+						false, &rec, nr_to_reclaim);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
Index: linus_tree/mm/vmscan.c
===================================================================
--- linus_tree.orig/mm/vmscan.c	2011-08-18 09:30:24.000000000 +0200
+++ linus_tree/mm/vmscan.c	2011-08-18 09:30:36.000000000 +0200
@@ -2340,7 +2340,8 @@ unsigned long mem_cgroup_shrink_node_zon
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
 					   bool noswap,
-					   struct memcg_scanrecord *rec)
+					   struct memcg_scanrecord *rec,
+					   unsigned long nr_pages)
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
@@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.memcg_record = rec,

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 5/6]  memg: vmscan select victim node by weight
  2011-08-09 10:12   ` KAMEZAWA Hiroyuki
@ 2011-08-18 13:34     ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-18 13:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Tue 09-08-11 19:12:02, KAMEZAWA Hiroyuki wrote:
> 
> This patch implements a node selection logic based on each node's weight.
> 
> This patch adds a new array of nodescan_tickets[]. This array holds
> each node's scan weight in a tuple of 2 values, as:
> 
>     for (i = 0, total_weight = 0; i < nodes; i++) {
>         weight = node->weight;
>         nodescan_tickets[i].start = total_weight;
>         nodescan_tickets[i].length = weight;
>         total_weight += weight;
>     }
> 
> After this, a lottery logic such as 'ticket = random32() % total_weight'
> will draw a ticket and bsearch(ticket, nodescan_tickets[])
> will find the node whose [start, start + length) range contains the ticket.
> (This is lottery scheduling.)
> 
> By this, a node will be selected in a fair manner, proportional to
> its weight.

The algorithm sounds interesting; I am just wondering how much gain it
gives over simply selecting the node with the maximum weight (plus some
aging so that we do not hammer a single node all the time). Have you tried
that?
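
A minimal sketch of what I mean (not a proposal, just to illustrate; it
reuses the per-node weight from this patch, and the halving is an arbitrary
aging rule):

static int select_max_weight_node(struct mem_cgroup *memcg)
{
	u64 best_weight = 0;
	int nid, best = numa_node_id();

	for_each_node_mask(nid, memcg->scan_nodes) {
		u64 weight = memcg->info.nodeinfo[nid]->weight;

		if (weight > best_weight) {
			best_weight = weight;
			best = nid;
		}
	}
	/* crude aging so that the same node is not hammered forever */
	memcg->info.nodeinfo[best]->weight = best_weight / 2;
	return best;
}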

> 
> This patch improves the scan time. The following is a test result
> of kernel-make on a 4-node fake-numa box under a 500M limit, with 8 cpus,
> 2 cpus per node.
> 
> [Before patch]
>  772.52user 305.67system 4:11.48elapsed 428%CPU
>  (0avgtext+0avgdata 1457264maxresident)k
>  4797592inputs+5483240outputs (12550major+35707629minor)pagefaults 0swaps

Just to make sure I understand: 'Before' means before this patch, not the
entire patch set, right?

> 
> [After patch]
>  773.73user 305.09system 3:51.28elapsed 466%CPU
>  (0avgtext+0avgdata 1458464maxresident)k
>  4400264inputs+4797056outputs (5578major+35690202minor)pagefaults 0swaps
> 
> elapsed time and major faults are reduced.
> 
[...]
> Index: mmotm-Aug3/mm/memcontrol.c
> ===================================================================
> --- mmotm-Aug3.orig/mm/memcontrol.c
> +++ mmotm-Aug3/mm/memcontrol.c
[...]
> @@ -1660,6 +1671,46 @@ mem_cgroup_calc_numascan_weight(struct m
>  }
>  
>  /*
> + * For lottery scheduling, this routine distributes "tickets" for
> + * scanning to each node. The tickets are recorded into the numascan_ticket
> + * array and this array will be used for scheduling later.
> + * To make the lottery fair, we limit the sum of tickets to almost 0xffff.
> + * Later, random() & 0xffff will do a proportionally fair lottery.
> + */
> +#define NUMA_TICKET_SHIFT	(16)
> +#define NUMA_TICKET_FACTOR	((1 << NUMA_TICKET_SHIFT) - 1)
> +static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
> +{
> +	struct numascan_ticket *nt;
> +	unsigned int node_ticket, assigned_tickets;
> +	u64 weight;
> +	int nid, assigned_num, generation;
> +
> +	/* update ticket information by double buffering */
> +	generation = memcg->numascan_generation ^ 0x1;

Double buffering is used for synchronization with consumers (they are
using the buffer that is not being updated here), right?  It would be good
to mention that in the comment...

> +
> +	nt = memcg->numascan_tickets[generation];
> +	assigned_tickets = 0;
> +	assigned_num = 0;
> +	for_each_node_mask(nid, memcg->scan_nodes) {
> +		weight = memcg->info.nodeinfo[nid]->weight;
> +		node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
> +					memcg->total_weight + 1);
> +		if (!node_ticket)
> +			node_ticket = 1;
> +		nt->nid = nid;
> +		nt->start = assigned_tickets;
> +		nt->tickets = node_ticket;
> +		assigned_tickets += node_ticket;
> +		nt++;
> +		assigned_num++;
> +	}
> +	memcg->numascan_tickets_num[generation] = assigned_num;
> +	smp_wmb();
> +	memcg->numascan_generation = generation;
> +}
> +
> +/*
>   * Update all node's scan weight in background.
>   */
>  static void mem_cgroup_numainfo_update_work(struct work_struct *work)
> @@ -1672,6 +1723,8 @@ static void mem_cgroup_numainfo_update_w
>  
>  	memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
>  
> +	synchronize_rcu();
> +	mem_cgroup_update_numascan_tickets(memcg);

OK, so we have only one updater (this one) which will update the generation.
Why do we need RCU here and for the search? ACCESS_ONCE should be
sufficient (in mem_cgroup_select_victim_node), or are you afraid that we
could see two updates during a single search?
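
Something along these lines is what I have in mind (sketch only, same names
as in the patch; correctness still relies on the updater never touching the
buffer readers may be using, i.e. the double buffering):

	int generation = ACCESS_ONCE(memcg->numascan_generation);
	struct numascan_ticket *nt;

	nt = bsearch((void *)lottery,
		     memcg->numascan_tickets[generation],
		     memcg->numascan_tickets_num[generation],
		     sizeof(struct numascan_ticket), node_weight_compare);
	if (nt)
		node = nt->nid;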

>  	atomic_set(&memcg->numainfo_updating, 0);
>  	css_put(&memcg->css);
>  }
[...]
> @@ -1707,32 +1772,38 @@ static void mem_cgroup_may_update_nodema
>   * we'll use or we've used. So, it may make LRU bad. And if several threads
>   * hit limits, it will see a contention on a node. But freeing from remote
>   * node means more costs for memory reclaim because of memory latency.
> - *
> - * Now, we use round-robin. Better algorithm is welcomed.
>   */
> -int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
> +int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
> +				struct memcg_scanrecord *rec)
>  {
> -	int node;
> +	int node = MAX_NUMNODES;
> +	struct numascan_ticket *nt;
> +	unsigned long lottery;
> +	int generation;
>  
> +	if (rec->context == SCAN_BY_SHRINK)
> +		goto out;

Why do we care about shrinking here? Due to overhead in node selection?

> +
> +	mem_cgroup_may_update_nodemask(memcg);
>  	*mask = NULL;
> -	mem_cgroup_may_update_nodemask(mem);
> -	node = mem->last_scanned_node;
> +	lottery = random32() & NUMA_TICKET_FACTOR;
>  
> -	node = next_node(node, mem->scan_nodes);
> -	if (node == MAX_NUMNODES)
> -		node = first_node(mem->scan_nodes);
> -	/*
> -	 * We call this when we hit limit, not when pages are added to LRU.
> -	 * No LRU may hold pages because all pages are UNEVICTABLE or
> -	 * memcg is too small and all pages are not on LRU. In that case,
> -	 * we use curret node.
> -	 */
> -	if (unlikely(node == MAX_NUMNODES))
> +	rcu_read_lock();
> +	generation = memcg->numascan_generation;
> +	nt = bsearch((void *)lottery,
> +		memcg->numascan_tickets[generation],
> +		memcg->numascan_tickets_num[generation],
> +		sizeof(struct numascan_ticket), node_weight_compare);
> +	rcu_read_unlock();
> +	if (nt)
> +		node = nt->nid;
> +out:
> +	if (unlikely(node == MAX_NUMNODES)) {
>  		node = numa_node_id();
> -	else
> -		*mask = &mem->scan_nodes;
> +		*mask = NULL;
> +	} else
> +		*mask = &memcg->scan_nodes;
>  
> -	mem->last_scanned_node = node;
>  	return node;
>  }
>  
[...]
> Index: mmotm-Aug3/mm/vmscan.c
> ===================================================================
> --- mmotm-Aug3.orig/mm/vmscan.c
> +++ mmotm-Aug3/mm/vmscan.c
> @@ -2378,9 +2378,9 @@ unsigned long try_to_free_mem_cgroup_pag
>  	 * take care of from where we get pages. So the node where we start the
>  	 * scan does not need to be the current node.
>  	 */
> -	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask);
> +	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask, rec);
>  
> -	zonelist = NODE_DATA(nid)->node_zonelists;
> +	zonelist = &NODE_DATA(nid)->node_zonelists[0];

Unnecessary change.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim
  2011-08-18 12:57                       ` Michal Hocko
@ 2011-08-18 13:58                         ` Johannes Weiner
  -1 siblings, 0 replies; 70+ messages in thread
From: Johannes Weiner @ 2011-08-18 13:58 UTC (permalink / raw)
  To: Michal Hocko; +Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, akpm, nishimura

On Thu, Aug 18, 2011 at 02:57:54PM +0200, Michal Hocko wrote:
> I have just realized that num_online_nodes should be much better than
> MAX_NUMNODES. 
> Just for reference, the patch is based on top of
> https://lkml.org/lkml/2011/8/9/82 (it doesn't depend on it but it also
> doesn't make much sense without it)
> 
> Changes since v2:
> - use num_online_nodes rather than MAX_NUMNODES
> Changes since v1:
> - reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty
> ---
> From: Michal Hocko <mhocko@suse.cz>
> Subject: memcg: add nr_pages argument for hierarchical reclaim
> 
> Now that we are doing memcg direct reclaim limited to nr_to_reclaim
> pages (introduced by "memcg: stop vmscan when enough done.") we have to
> be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
> most callers but it might cause failures for limit resize or force_empty
> code paths on big NUMA machines.

The limit resizing path retries as long as reclaim makes progress, so
this is just handwaving.

After Kame's patch, the force-empty path has an increased risk of
failing to move huge pages to the parent, because it tries reclaim
only once.  This could need further evaluation, and possibly a fix.
But instead:

> @@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
>  	if (!(gfp_mask & __GFP_WAIT))
>  		return CHARGE_WOULDBLOCK;
>  
> +	/*
> +	 * We are lying about nr_pages because we do not want to
> +	 * reclaim too much for THP pages which should rather fallback
> +	 * to small pages.
> +	 */
>  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> -					      gfp_mask, flags, NULL);
> +					      gfp_mask, flags, NULL,
> +					      1);
>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>  		return CHARGE_RETRY;
>  	/*

You tell it to reclaim _less_ than before, further increasing the risk
of failure...

> @@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  		.may_writepage = !laptop_mode,
>  		.may_unmap = 1,
>  		.may_swap = !noswap,
> -		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> +		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),

...but wait, this transparently fixes it up and ignores the caller's
request.

Sorry, but this is just horrible!

For the past weeks I have been chasing memcg bugs that came in with
sloppy and untested code, that was merged for handwavy reasons.

Changes to algorithms need to be tested and optimizations need to be
quantified in other parts of the VM and the kernel, too.  I have no
idea why this doesn't seem to apply to the memory cgroup subsystem.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim
  2011-08-18 13:58                         ` Johannes Weiner
@ 2011-08-18 14:40                           ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2011-08-18 14:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, akpm, nishimura

On Thu 18-08-11 15:58:21, Johannes Weiner wrote:
> On Thu, Aug 18, 2011 at 02:57:54PM +0200, Michal Hocko wrote:
> > I have just realized that num_online_nodes should be much better than
> > MAX_NUMNODES. 
> > Just for reference, the patch is based on top of
> > https://lkml.org/lkml/2011/8/9/82 (it doesn't depend on it but it also
> > doesn't make much sense without it)
> > 
> > Changes since v2:
> > - use num_online_nodes rather than MAX_NUMNODES
> > Changes since v1:
> > - reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty
> > ---
> > From: Michal Hocko <mhocko@suse.cz>
> > Subject: memcg: add nr_pages argument for hierarchical reclaim
> > 
> > Now that we are doing memcg direct reclaim limited to nr_to_reclaim
> > pages (introduced by "memcg: stop vmscan when enough done.") we have to
> > be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
> > most callers but it might cause failures for limit resize or force_empty
> > code paths on big NUMA machines.
> 
> The limit resizing path retries as long as reclaim makes progress, so
> this is just handwaving.

The limit resizing paths do not check the return value of
mem_cgroup_hierarchical_reclaim, so the number of retries is not
affected. It is true that fixing that would be much easier.
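
For illustration, the easier fix could look something like this in the
mem_cgroup_resize_limit retry loop (sketch only, not a tested patch; it uses
the pre-patch mem_cgroup_hierarchical_reclaim signature and 'progress' would
be a new local variable):

		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
						GFP_KERNEL,
						MEM_CGROUP_RECLAIM_SHRINK,
						NULL);
		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
		/* Usage is reduced ? */
		if (curusage >= oldusage) {
			/* only burn a retry when reclaim made no progress */
			if (!progress)
				retry_count--;
		} else
			oldusage = curusage;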

> 
> After Kame's patch, the force-empty path has an increased risk of
> failing to move huge pages to the parent, because it tries reclaim
> only once.  This could need further evaluation, and possibly a fix.

Agreed

> But instead:
> 
> > @@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
> >  	if (!(gfp_mask & __GFP_WAIT))
> >  		return CHARGE_WOULDBLOCK;
> >  
> > +	/*
> > +	 * We are lying about nr_pages because we do not want to
> > +	 * reclaim too much for THP pages which should rather fallback
> > +	 * to small pages.
> > +	 */
> >  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > -					      gfp_mask, flags, NULL);
> > +					      gfp_mask, flags, NULL,
> > +					      1);
> >  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> >  		return CHARGE_RETRY;
> >  	/*
> 
> You tell it to reclaim _less_ than before, further increasing the risk
> of failure...
> 
> > @@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
> >  		.may_writepage = !laptop_mode,
> >  		.may_unmap = 1,
> >  		.may_swap = !noswap,
> > -		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> > +		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
> 
> ...but wait, this transparently fixes it up and ignores the caller's
> request.
> 
> Sorry, but this is just horrible!

Yes, I do not like it either and tried to point that out in the comment.
Anyway, I do agree that this doesn't solve the problem you are describing
above, and the limit resizing paths can be fixed much more easily, so the
patch is pointless.

> 
> For the past weeks I have been chasing memcg bugs that came in with
> sloppy and untested code, that was merged for handwavy reasons.

Yes, I feel a big responsibility for that.

> 
> Changes to algorithms need to be tested and optimizations need to be
> quantified in other parts of the VM and the kernel, too.  I have no
> idea why this doesn't seem to apply to the memory cgroup subsystem.

Yes, we should definitely do better during the review process.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 4/6]  memg: calculate numa weight for vmscan
  2011-08-18  8:41         ` Michal Hocko
@ 2011-08-19  0:06           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 70+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-19  0:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, hannes, nishimura

On Thu, 18 Aug 2011 10:41:03 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 18-08-11 09:17:50, KAMEZAWA Hiroyuki wrote:
> > On Wed, 17 Aug 2011 16:34:18 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > Sorry it took so long but I was quite busy recently.
> > > 
> > > On Tue 09-08-11 19:11:00, KAMEZAWA Hiroyuki wrote:
> [...]
> > > > Index: mmotm-Aug3/mm/memcontrol.c
> > > > ===================================================================
> > > > --- mmotm-Aug3.orig/mm/memcontrol.c
> > > > +++ mmotm-Aug3/mm/memcontrol.c
> [...]
> > > > +
> > > > +	/* 'scanned - rotated/scanned' means ratio of finding not active. */
> > > > +	anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
> > > > +	file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);
> > > 
> > > OK, makes sense. We should not reclaim from nodes that are known to be
> > > hard to reclaim from. We, however, have to be careful to not exclude the
> > > node from reclaiming completely.
> > > 
> > > > +
> > > > +	weight = (anon * anon_prio + file * file_prio) / 200;
> > > 
> > > Shouldn't we rather normalize the weight to the node size? This way we
> > > are punishing bigger nodes, aren't we.
> > > 
> > 
> > Here, the routine is for reclaiming memory in a memcg in smooth way.
> > And not for balancing zone. It will be kswapd+memcg(softlimit) work.
> > The size of node in this memcg is represented by file + anon.
> 
> I am not sure I understand what you mean by that but consider two nodes.
> swappiness = 0
> anon_prio = 1
> file_prio = 200
> A 1000 pages, 100 anon, 300 file: weight 300, node is 40% full
> B 15000 pages 2500 anon, 3500 file: weight ~3500, node is 40% full
> 
> I think that both nodes should be equal.
> 

OK, let me try to explain again.

I'd like memcg's limit reclaim to care only about the amount of memory and
never about the system's zone balancing. Zone balancing is taken care of by
kswapd and the soft limit.

(Off topic)
I think talking in % is not good.
What memory reclaim tries to do is get available memory, not reduce the % of usage.
What we care about here is just the amount of memory, not the ratio of usage.




> > weight = (anon * anon_prio + file * file_prio) / 200;
> > 
> > Just for avoiding the influence of anon never be 0 (by wrong value
> > set to swappiness by user.)
> 
> OK, so you want to prevent the situation where we have swappiness 0
> and there are no file pages, so the node would have 0 weight?
> Why do you consider 0 swappiness a wrong value?
> 

By setting anon_prio > 1, the weight will be calculated even when a memcg
contains only ANON, and the numa_scan bitmask will be set correctly.
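
(For illustration, assuming the priorities are derived from swappiness
roughly like this -- the non-zero clamp is the point, the exact derivation
is an assumption, not code quoted from the patch:)

	int swappiness = mem_cgroup_swappiness(memcg);
	/* keep anon_prio non-zero so an anon-only memcg never gets weight 0 */
	int anon_prio = max(swappiness, 1);
	int file_prio = 200 - swappiness;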



> [...]
> > > > +	nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
> > > > +	nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
> > > > +
> > > > +	/* If file cache is small w.r.t swappiness, check anon page's weight */
> > > > +	if (nr_file * file_prio >= nr_anon * anon_prio)
> > > > +		lru_mask |= BIT(LRU_INACTIVE_ANON);
> > > 
> > > Why we do not care about active anon (e.g. if inactive anon is low)?
> > > 
> > This condition is wrong...
> > 
> > 	if (nr_file * file_prio <= nr_anon * anon_prio)
> > 		lru_mask |= BIT(LRU_INACTIVE_ANON);
> 
> True. Haven't noticed it before...
> 
> > 
> > I was worried about LRU_ACTIVE_ANON. I considered
> >   - We can't handle ACTIVE_ANON and INACTIVE_ANON in the same weight.
> >     But I don't want to add more magic numbers.
> 
> Yes I agree, weight shouldn't involve active pages because we do not
> want to reclaim nodes according to their active working set.
> 
> >   - vmscan.c:shrink_zone() scans ACTIVE_ANON whenever/only when
> >     inactive_anon_is_low()==true. SWAP_CLUSTER_MAX per priority.
> >     It's specially handled.
> > 
> > So, I thought involving the number of ACTIVE_ANON in the weight is difficult
> > and ignored ACTIVE_ANON here. Do you have an idea?
> 
> I am not sure whether nr_anon should also include active pages, though.
> Comparing all file pages to all anon pages looks consistent; on the
> other hand, we are not including active pages in the weight. This way
> we put more pressure on nodes with a big anon working set.

An early version included ACTIVE_ANON in the weight and I saw BAD scores ;(
Anyway, good ideas are welcome.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread

Thread overview: 70+ messages
2011-08-09 10:04 [PATCH v5 0/6] memg: better numa scanning KAMEZAWA Hiroyuki
2011-08-09 10:04 ` KAMEZAWA Hiroyuki
2011-08-09 10:08 ` [PATCH v5 1/6] " KAMEZAWA Hiroyuki
2011-08-09 10:08   ` KAMEZAWA Hiroyuki
2011-08-10 10:00   ` Michal Hocko
2011-08-10 10:00     ` Michal Hocko
2011-08-10 23:30     ` KAMEZAWA Hiroyuki
2011-08-10 23:30       ` KAMEZAWA Hiroyuki
2011-08-10 23:44       ` [PATCH] memcg: fix comment on update nodemask KAMEZAWA Hiroyuki
2011-08-10 23:44         ` KAMEZAWA Hiroyuki
2011-08-11 13:25         ` Michal Hocko
2011-08-11 13:25           ` Michal Hocko
2011-08-09 10:09 ` [PATCH v5 2/6] memcg: stop vmscan when enough done KAMEZAWA Hiroyuki
2011-08-09 10:09   ` KAMEZAWA Hiroyuki
2011-08-10 14:14   ` Michal Hocko
2011-08-10 14:14     ` Michal Hocko
2011-08-10 23:52     ` KAMEZAWA Hiroyuki
2011-08-10 23:52       ` KAMEZAWA Hiroyuki
2011-08-11 14:50       ` Michal Hocko
2011-08-11 14:50         ` Michal Hocko
2011-08-12 12:44         ` [PATCH] memcg: add nr_pages argument for hierarchical reclaim Michal Hocko
2011-08-12 12:44           ` Michal Hocko
2011-08-17  0:54         ` [PATCH v5 2/6] memcg: stop vmscan when enough done KAMEZAWA Hiroyuki
2011-08-17  0:54           ` KAMEZAWA Hiroyuki
2011-08-17 11:35           ` Michal Hocko
2011-08-17 11:35             ` Michal Hocko
2011-08-17 23:52             ` KAMEZAWA Hiroyuki
2011-08-17 23:52               ` KAMEZAWA Hiroyuki
2011-08-18  6:27               ` Michal Hocko
2011-08-18  6:27                 ` Michal Hocko
2011-08-18  6:42                 ` KAMEZAWA Hiroyuki
2011-08-18  6:42                   ` KAMEZAWA Hiroyuki
2011-08-18  7:46                   ` Michal Hocko
2011-08-18  7:46                     ` Michal Hocko
2011-08-18 12:57                     ` [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim Michal Hocko
2011-08-18 12:57                       ` Michal Hocko
2011-08-18 13:58                       ` Johannes Weiner
2011-08-18 13:58                         ` Johannes Weiner
2011-08-18 14:40                         ` Michal Hocko
2011-08-18 14:40                           ` Michal Hocko
2011-08-09 10:10 ` [PATCH v5 3/6] memg: vmscan pass nodemask KAMEZAWA Hiroyuki
2011-08-09 10:10   ` KAMEZAWA Hiroyuki
2011-08-10 11:19   ` Michal Hocko
2011-08-10 11:19     ` Michal Hocko
2011-08-10 23:43     ` KAMEZAWA Hiroyuki
2011-08-10 23:43       ` KAMEZAWA Hiroyuki
2011-08-09 10:11 ` [PATCH v5 4/6] memg: calculate numa weight for vmscan KAMEZAWA Hiroyuki
2011-08-09 10:11   ` KAMEZAWA Hiroyuki
2011-08-17 14:34   ` Michal Hocko
2011-08-17 14:34     ` Michal Hocko
2011-08-18  0:17     ` KAMEZAWA Hiroyuki
2011-08-18  0:17       ` KAMEZAWA Hiroyuki
2011-08-18  8:41       ` Michal Hocko
2011-08-18  8:41         ` Michal Hocko
2011-08-19  0:06         ` KAMEZAWA Hiroyuki
2011-08-19  0:06           ` KAMEZAWA Hiroyuki
2011-08-09 10:12 ` [PATCH v5 5/6] memg: vmscan select victim node by weight KAMEZAWA Hiroyuki
2011-08-09 10:12   ` KAMEZAWA Hiroyuki
2011-08-18 13:34   ` Michal Hocko
2011-08-18 13:34     ` Michal Hocko
2011-08-09 10:13 ` [PATCH v5 6/6] memg: do target scan if unbalanced KAMEZAWA Hiroyuki
2011-08-09 10:13   ` KAMEZAWA Hiroyuki
2011-08-09 14:33 ` [PATCH v5 0/6] memg: better numa scanning Michal Hocko
2011-08-09 14:33   ` Michal Hocko
2011-08-10  0:15   ` KAMEZAWA Hiroyuki
2011-08-10  0:15     ` KAMEZAWA Hiroyuki
2011-08-10  6:03     ` KAMEZAWA Hiroyuki
2011-08-10  6:03       ` KAMEZAWA Hiroyuki
2011-08-10 14:20     ` Michal Hocko
2011-08-10 14:20       ` Michal Hocko