* [PATCH v4 0/5] memcg : make numa scanning better
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:44 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, akpm, Michal Hocko, nishimura
When commit 889976db (memcg: reclaim memory from nodes in round-robin order) was
pushed, I mentioned "But yes, a better algorithm is needed."
Here is one.
I have already split out some pieces that were in this set and pushed them upstream.
This series contains more fixes and a new core logic.
The concept is to select a node based on page usage.
This patch set calculates a weight for each node and schedules scanning
proportionally fair to each node's weight. The weight is calculated adaptively,
considering the status of the whole memcg. In short, if a node contains
many (inactive) file caches, that node will be the victim.
As before, I ran an apache-bench test as follows.
Host : Xeon, 8 CPUs
Memory: 24GB
What test?
Access a CGI script which reads a file at random, driven by apache-bench.
The randomness of file access is normalized. The full working set is 600MB.
httpd runs under a memcg. This causes memory reclaim and read I/O.
[Set limit as 300M]
<mmotm-0709 + some merged bugfixes>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 41 48 15.0 46 1161
Waiting: 40 46 10.5 44 623
Total: 41 48 15.0 46 1161
scanned_pages_by_limit 410693
elapsed_ns_by_limit 2393975561
<mmotm-0709 + cpuset's page cache spread nodes>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 42 48 16.9 46 1616
Waiting: 40 46 14.7 44 1614
Total: 42 48 16.9 46 1616
scanned_pages_by_limit 271733
elapsed_ns_by_limit 1415085661
<patch>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 1
Processing: 41 46 7.5 45 706
Waiting: 39 45 6.4 44 630
Total: 41 46 7.5 45 706
scanned_pages_by_limit 302282
elapsed_ns_by_limit 1312758481
<patch + cpuset's page cache spread nodes>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 4
Processing: 42 47 11.4 46 962
Waiting: 40 45 8.7 44 493
Total: 42 47 11.4 46 962
scanned_pages_by_limit 349020
elapsed_ns_by_limit 1594144061
[Set Limit as 400M]
<mmotm-0709>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 3
Processing: 40 45 4.7 45 467
Waiting: 39 44 4.4 43 465
Total: 40 45 4.7 45 467
scanned_pages_by_limit 156279
elapsed_ns_by_limit 1274982214
<mmotm-0709 + cpuset's node spread>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 41 46 6.9 45 458
Waiting: 40 44 4.5 44 388
Total: 41 46 6.9 45 459
scanned_pages_by_limit 346534
elapsed_ns_by_limit 2612352442
<Patch>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 1
Processing: 42 45 5.1 45 467
Waiting: 38 44 4.5 43 465
Total: 42 45 5.1 45 467
scanned_pages_by_limit 116307
elapsed_ns_by_limit 624529569
<patch + spread>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 1
Processing: 41 46 5.3 45 392
Waiting: 39 44 4.1 43 388
Total: 41 46 5.3 45 392
scanned_pages_by_limit 154865
elapsed_ns_by_limit 830638510
In general, this patch set reduces memory reclaim scans and elapsed time,
and helps reclaim memory more efficiently.
Thanks,
-Kame
* [PATCH v4 1/5] memcg : update numascan info by schedule_work
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, Michal Hocko, nishimura
Make memcg's NUMA scanning information update via schedule_work().
Currently, memcg's NUMA information is updated by a thread doing
memory reclaim. It is not very heavyweight now, but upcoming updates
around NUMA scanning will add more work. This patch makes
the update happen via schedule_work() and reduces the latency caused
by these updates.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 42 ++++++++++++++++++++++++++++++------------
1 file changed, 30 insertions(+), 12 deletions(-)
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -284,6 +284,7 @@ struct mem_cgroup {
nodemask_t scan_nodes;
atomic_t numainfo_events;
atomic_t numainfo_updating;
+ struct work_struct numainfo_update_work;
#endif
/*
* Should the accounting and control be hierarchical, per subtree?
@@ -1551,6 +1552,23 @@ static bool test_mem_cgroup_node_reclaim
}
#if MAX_NUMNODES > 1
+static void mem_cgroup_numainfo_update_work(struct work_struct *work)
+{
+ struct mem_cgroup *memcg;
+ int nid;
+
+ memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
+
+ memcg->scan_nodes = node_states[N_HIGH_MEMORY];
+ for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+ if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
+ node_clear(nid, memcg->scan_nodes);
+ }
+ atomic_set(&memcg->numainfo_updating, 0);
+ css_put(&memcg->css);
+}
+
+
/*
* Always updating the nodemask is not very good - even if we have an empty
* list or the wrong list here, we can start from some node and traverse all
@@ -1559,7 +1577,6 @@ static bool test_mem_cgroup_node_reclaim
*/
static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
{
- int nid;
/*
* numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
* pagein/pageout changes since the last update.
@@ -1568,18 +1585,9 @@ static void mem_cgroup_may_update_nodema
return;
if (atomic_inc_return(&mem->numainfo_updating) > 1)
return;
-
- /* make a nodemask where this memcg uses memory from */
- mem->scan_nodes = node_states[N_HIGH_MEMORY];
-
- for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
-
- if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
- node_clear(nid, mem->scan_nodes);
- }
-
atomic_set(&mem->numainfo_events, 0);
- atomic_set(&mem->numainfo_updating, 0);
+ css_get(&mem->css);
+ schedule_work(&mem->numainfo_update_work);
}
/*
@@ -1652,6 +1660,12 @@ bool mem_cgroup_reclaimable(struct mem_c
return false;
}
+static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+{
+ INIT_WORK(&memcg->numainfo_update_work,
+ mem_cgroup_numainfo_update_work);
+}
+
#else
int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
@@ -1662,6 +1676,9 @@ bool mem_cgroup_reclaimable(struct mem_c
{
return test_mem_cgroup_node_reclaimable(mem, 0, noswap);
}
+static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+{
+}
#endif
static void __mem_cgroup_record_scanstat(unsigned long *stats,
@@ -5032,6 +5049,7 @@ mem_cgroup_create(struct cgroup_subsys *
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
spin_lock_init(&mem->scanstat.lock);
+ mem_cgroup_numascan_init(mem);
return &mem->css;
free_out:
__mem_cgroup_free(mem);
* [PATCH v4 2/5] memcg : pass scan nodemask
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, Michal Hocko, nishimura
Pass memcg's nodemask to try_to_free_pages().
try_to_free_pages() can take a nodemask as an argument, but memcg
doesn't pass one. Considering that memcg can be used together with
cpuset on big NUMA machines, memcg should pass its nodemask when available.
Now that memcg maintains the nodemask with periodic updates, pass it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 2 +-
mm/memcontrol.c | 8 ++++++--
mm/vmscan.c | 3 ++-
3 files changed, 9 insertions(+), 4 deletions(-)
Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -117,7 +117,7 @@ extern void mem_cgroup_end_migration(str
*/
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask);
unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
int nid, int zid, unsigned int lrumask);
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -1602,10 +1602,11 @@ static void mem_cgroup_may_update_nodema
*
* Now, we use round-robin. Better algorithm is welcomed.
*/
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
{
int node;
+ *mask = NULL;
mem_cgroup_may_update_nodemask(mem);
node = mem->last_scanned_node;
@@ -1620,6 +1621,8 @@ int mem_cgroup_select_victim_node(struct
*/
if (unlikely(node == MAX_NUMNODES))
node = numa_node_id();
+ else
+ *mask = &mem->scan_nodes;
mem->last_scanned_node = node;
return node;
@@ -1667,8 +1670,9 @@ static void mem_cgroup_numascan_init(str
}
#else
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
{
+ *mask = NULL;
return 0;
}
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2280,6 +2280,7 @@ unsigned long try_to_free_mem_cgroup_pag
unsigned long nr_reclaimed;
unsigned long start, end;
int nid;
+ nodemask_t *mask;
struct scan_control sc = {
.may_writepage = !laptop_mode,
.may_unmap = 1,
@@ -2302,7 +2303,7 @@ unsigned long try_to_free_mem_cgroup_pag
* take care of from where we get pages. So the node where we start the
* scan does not need to be the current node.
*/
- nid = mem_cgroup_select_victim_node(mem_cont);
+ nid = mem_cgroup_select_victim_node(mem_cont, &mask);
zonelist = NODE_DATA(nid)->node_zonelists;
* [PATCH v4 3/5] memcg : stop scanning if enough
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, Michal Hocko, nishimura
memcg: avoid node fallback scan if possible.
Now, try_to_free_pages() scans the whole zonelist because the page allocator
should visit all zonelists... but that behavior is harmful for memcg.
Memcg scans memory only because it hit its limit; there is no memory
shortage in the passed zonelist.
For example, with the following unbalanced nodes
        Node 0    Node 1
File    1G        0
Anon    200M      200M
memcg will cause swap-out from Node 1 at every vmscan.
As another example, assume a 1024-node system: memcg will visit all
1024 nodes at every vmscan... This is overkill, and it is why memcg's
victim node selection logic doesn't work as expected.
This patch helps stop vmscan once we have scanned enough.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/vmscan.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2058,6 +2058,16 @@ static void shrink_zones(int priority, s
}
shrink_zone(priority, zone, sc);
+ if (!scanning_global_lru(sc)) {
+ /*
+ * When we do scan for memcg's limit, it's bad to do
+ * fallback into more node/zones because there is no
+ * memory shortage. We quit as much as possible when
+ * we reache target.
+ */
+ if (sc->nr_to_reclaim <= sc->nr_reclaimed)
+ break;
+ }
}
}
* [PATCH v4 4/5] memcg : calculate node scan weight
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, Michal Hocko, nishimura
Calculate node scan weight.
Now, the memory cgroup selects its scan target node in round-robin.
That is not very good... there is no scheduling based on page usage.
This patch calculates each node's weight for scanning.
If a node's weight is high, the node is worth scanning.
The weight is calculated on the following concept:
- make use of swappiness.
- if inactive file pages are enough, ignore active file pages.
- if file pages are enough (w.r.t. swappiness), ignore anon.
- make use of recent_scanned/recent_rotated reclaim stats.
Then, a node containing many inactive file pages will be the first victim.
The node selection logic based on this weight is in the next patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 105 insertions(+), 5 deletions(-)
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -143,6 +143,7 @@ struct mem_cgroup_per_zone {
struct mem_cgroup_per_node {
struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+ unsigned long weight;
};
struct mem_cgroup_lru_info {
@@ -285,6 +286,7 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
struct work_struct numainfo_update_work;
+ unsigned long total_weight;
#endif
/*
* Should the accounting and control be hierarchical, per subtree?
@@ -1552,18 +1554,108 @@ static bool test_mem_cgroup_node_reclaim
}
#if MAX_NUMNODES > 1
+static unsigned long
+__mem_cgroup_calc_numascan_weight(struct mem_cgroup * memcg,
+ int nid,
+ unsigned long anon_prio,
+ unsigned long file_prio,
+ int lru_mask)
+{
+ u64 file, anon;
+ unsigned long weight, mask;
+ unsigned long rotated[2], scanned[2];
+ int zid;
+
+ scanned[0] = 0;
+ scanned[1] = 0;
+ rotated[0] = 0;
+ rotated[1] = 0;
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct mem_cgroup_per_zone *mz;
+
+ mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+ scanned[0] += mz->reclaim_stat.recent_scanned[0];
+ scanned[1] += mz->reclaim_stat.recent_scanned[1];
+ rotated[0] += mz->reclaim_stat.recent_rotated[0];
+ rotated[1] += mz->reclaim_stat.recent_rotated[1];
+ }
+ file = mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask & LRU_ALL_FILE);
+
+ if (total_swap_pages)
+ anon = mem_cgroup_node_nr_lru_pages(memcg,
+ nid, mask & LRU_ALL_ANON);
+ else
+ anon = 0;
+ if (!(file + anon))
+ node_clear(nid, memcg->scan_nodes);
+
+ /* 'scanned - rotated/scanned' means ratio of finding not active. */
+ anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
+ file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);
+
+ weight = (anon * anon_prio + file * file_prio) / 200;
+ return weight;
+}
+
+/*
+ * Calculate each NUMA node's scan weight. scan weight is determined by
+ * amount of pages and recent scan ratio, swappiness.
+ */
+static unsigned long
+mem_cgroup_calc_numascan_weight(struct mem_cgroup *memcg)
+{
+ unsigned long weight, total_weight;
+ u64 anon_prio, file_prio, nr_anon, nr_file;
+ int lru_mask;
+ int nid;
+
+ anon_prio = mem_cgroup_swappiness(memcg) + 1;
+ file_prio = 200 - anon_prio + 1;
+
+ lru_mask = BIT(LRU_INACTIVE_FILE);
+ if (mem_cgroup_inactive_file_is_low(memcg))
+ lru_mask |= BIT(LRU_ACTIVE_FILE);
+ /*
+ * In vmscan.c, we'll scan anonymous pages with regard to memcg/zone's
+ * amounts of file/anon pages and swappiness and reclaim_stat. Here,
+ * we try to find good node to be scanned. If the memcg contains enough
+ * file caches, we'll ignore anon's weight.
+ * (Note) scanning anon-only node tends to be waste of time.
+ */
+
+ nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
+ nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
+
+ /* If file cache is small w.r.t swappiness, check anon page's weight */
+ if (nr_file * file_prio >= nr_anon * anon_prio)
+ lru_mask |= BIT(LRU_INACTIVE_ANON);
+
+ total_weight = 0;
+
+ for_each_node_state(nid, N_HIGH_MEMORY) {
+ weight = __mem_cgroup_calc_numascan_weight(memcg,
+ nid, anon_prio, file_prio, lru_mask);
+ memcg->info.nodeinfo[nid]->weight = weight;
+ total_weight += weight;
+ }
+
+ return total_weight;
+}
+
+/*
+ * Update all node's scan weight in background.
+ */
static void mem_cgroup_numainfo_update_work(struct work_struct *work)
{
struct mem_cgroup *memcg;
- int nid;
memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
memcg->scan_nodes = node_states[N_HIGH_MEMORY];
- for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
- if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
- node_clear(nid, memcg->scan_nodes);
- }
+
+ memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
+
atomic_set(&memcg->numainfo_updating, 0);
css_put(&memcg->css);
}
@@ -4212,6 +4304,14 @@ static int mem_control_numa_stat_show(st
seq_printf(m, " N%d=%lu", nid, node_nr);
}
seq_putc(m, '\n');
+
+ seq_printf(m, "scan_weight=%lu", mem_cont->total_weight);
+ for_each_node_state(nid, N_HIGH_MEMORY) {
+ unsigned long weight;
+ weight = mem_cont->info.nodeinfo[nid]->weight;
+ seq_printf(m, " N%d=%lu", nid, weight);
+ }
+ seq_putc(m, '\n');
return 0;
}
#endif /* CONFIG_NUMA */
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v4 4/5] memcg : calculate node scan weight
@ 2011-07-27 5:49 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, Michal Hocko, nishimura
caclculate node scan weight.
Now, memory cgroup selects a scan target node in round-robin.
It's not very good...there is not scheduling based on page usages.
This patch is for calculating each node's weight for scanning.
If weight of a node is high, the node is worth to be scanned.
The weight is now calucauted on following concept.
- make use of swappiness.
- If inactive-file is enough, ignore active-file
- If file is enough (w.r.t swappiness), ignore anon
- make use of recent_scan/rotated reclaim stats.
Then, a node contains many inactive file pages will be a 1st victim.
Node selection logic based on this weight will be in the next patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 105 insertions(+), 5 deletions(-)
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -143,6 +143,7 @@ struct mem_cgroup_per_zone {
struct mem_cgroup_per_node {
struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+ unsigned long weight;
};
struct mem_cgroup_lru_info {
@@ -285,6 +286,7 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
struct work_struct numainfo_update_work;
+ unsigned long total_weight;
#endif
/*
* Should the accounting and control be hierarchical, per subtree?
@@ -1552,18 +1554,108 @@ static bool test_mem_cgroup_node_reclaim
}
#if MAX_NUMNODES > 1
+static unsigned long
+__mem_cgroup_calc_numascan_weight(struct mem_cgroup * memcg,
+ int nid,
+ unsigned long anon_prio,
+ unsigned long file_prio,
+ int lru_mask)
+{
+ u64 file, anon;
+ unsigned long weight, mask;
+ unsigned long rotated[2], scanned[2];
+ int zid;
+
+ scanned[0] = 0;
+ scanned[1] = 0;
+ rotated[0] = 0;
+ rotated[1] = 0;
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct mem_cgroup_per_zone *mz;
+
+ mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+ scanned[0] += mz->reclaim_stat.recent_scanned[0];
+ scanned[1] += mz->reclaim_stat.recent_scanned[1];
+ rotated[0] += mz->reclaim_stat.recent_rotated[0];
+ rotated[1] += mz->reclaim_stat.recent_rotated[1];
+ }
+ file = mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask & LRU_ALL_FILE);
+
+ if (total_swap_pages)
+ anon = mem_cgroup_node_nr_lru_pages(memcg,
+ nid, lru_mask & LRU_ALL_ANON);
+ else
+ anon = 0;
+ if (!(file + anon))
+ node_clear(nid, memcg->scan_nodes);
+
+ /* (scanned - rotated) / scanned is the ratio of pages found inactive. */
+ anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
+ file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);
+
+ weight = (anon * anon_prio + file * file_prio) / 200;
+ return weight;
+}
+
+/*
+ * Calculate each NUMA node's scan weight. The scan weight is determined
+ * by the amount of pages, the recent scanned/rotated ratio and swappiness.
+ */
+static unsigned long
+mem_cgroup_calc_numascan_weight(struct mem_cgroup *memcg)
+{
+ unsigned long weight, total_weight;
+ u64 anon_prio, file_prio, nr_anon, nr_file;
+ int lru_mask;
+ int nid;
+
+ anon_prio = mem_cgroup_swappiness(memcg) + 1;
+ file_prio = 200 - anon_prio + 1;
+
+ lru_mask = BIT(LRU_INACTIVE_FILE);
+ if (mem_cgroup_inactive_file_is_low(memcg))
+ lru_mask |= BIT(LRU_ACTIVE_FILE);
+ /*
+ * In vmscan.c, we scan anonymous pages with regard to the memcg/zone's
+ * amounts of file/anon pages, swappiness and reclaim_stat. Here, we
+ * try to find a good node to be scanned. If the memcg contains enough
+ * file caches, we'll ignore anon's weight.
+ * (Note: scanning an anon-only node tends to be a waste of time.)
+ */
+
+ nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
+ nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
+
+ /* If file cache is small w.r.t swappiness, check anon page's weight */
+ if (nr_file * file_prio >= nr_anon * anon_prio)
+ lru_mask |= BIT(LRU_INACTIVE_ANON);
+
+ total_weight = 0;
+
+ for_each_node_state(nid, N_HIGH_MEMORY) {
+ weight = __mem_cgroup_calc_numascan_weight(memcg,
+ nid, anon_prio, file_prio, lru_mask);
+ memcg->info.nodeinfo[nid]->weight = weight;
+ total_weight += weight;
+ }
+
+ return total_weight;
+}
+
+/*
+ * Update all node's scan weight in background.
+ */
static void mem_cgroup_numainfo_update_work(struct work_struct *work)
{
struct mem_cgroup *memcg;
- int nid;
memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
memcg->scan_nodes = node_states[N_HIGH_MEMORY];
- for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
- if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
- node_clear(nid, memcg->scan_nodes);
- }
+
+ memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
+
atomic_set(&memcg->numainfo_updating, 0);
css_put(&memcg->css);
}
@@ -4212,6 +4304,14 @@ static int mem_control_numa_stat_show(st
seq_printf(m, " N%d=%lu", nid, node_nr);
}
seq_putc(m, '\n');
+
+ seq_printf(m, "scan_weight=%lu", mem_cont->total_weight);
+ for_each_node_state(nid, N_HIGH_MEMORY) {
+ unsigned long weight;
+ weight = mem_cont->info.nodeinfo[nid]->weight;
+ seq_printf(m, " N%d=%lu", nid, weight);
+ }
+ seq_putc(m, '\n');
return 0;
}
#endif /* CONFIG_NUMA */
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH v4 5/5] memcg : select a victim node by weights
2011-07-27 5:44 ` KAMEZAWA Hiroyuki
@ 2011-07-27 5:51 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:51 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, Michal Hocko, nishimura
This patch implements node selection logic based on each node's weight.
This patch adds a new array, nodescan_tickets[]. The array holds
each node's scan weight as a tuple of two values:
for (i = 0, total_weight = 0; i < nodes; i++) {
weight = node->weight;
nodescan_tickets[i].start = total_weight;
nodescan_tickets[i].length = weight;
total_weight += weight;
}
After this, a lottery draw of 'ticket = random32() % total_weight'
makes a ticket, and bsearch(ticket, nodescan_tickets[])
finds the node whose [start, start + length) range contains the ticket.
(This is lottery scheduling.)
By this, nodes are selected in a fair manner, proportional to
their weight.
This patch improves the scan time. Following is a test result
of apache-bench on 2-node fake-NUMA. In this test, almost all
pages are file cache, so too much scanning of anon pages and swap-out
is harmful. (The result itself is measured with the patches following
this one applied.)
Working set: 600Mbytes random access in normalized distribution
Memory Limit: 300MBytes
<before patch>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 41 48 15.0 46 1161
Waiting: 40 46 10.5 44 623
Total: 41 48 15.0 46 1161
memory.vmscan_stat
scanned_pages_by_limit 410693
elapsed_ns_by_limit 2393975561
<after patch>
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 1
Processing: 41 46 7.5 45 706
Waiting: 39 45 6.4 44 630
Total: 41 46 7.5 45 706
scanned_pages_by_limit 302282
elapsed_ns_by_limit 1312758481
vmscan time is much reduced.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 3
mm/memcontrol.c | 149 ++++++++++++++++++++++++++++++++++++++-------
mm/vmscan.c | 4 -
3 files changed, 130 insertions(+), 26 deletions(-)
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -48,6 +48,9 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/random.h>
+#include <linux/bsearch.h>
+#include <linux/cpuset.h>
#include "internal.h"
#include <asm/uaccess.h>
@@ -150,6 +153,11 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};
+struct numascan_ticket {
+ int nid;
+ unsigned int start, tickets;
+};
+
/*
* Cgroups above their limits are maintained in a RB-Tree, independent of
* their hierarchy representation
@@ -286,7 +294,10 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
struct work_struct numainfo_update_work;
- unsigned long total_weight;
+ unsigned long total_weight;
+ int numascan_generation;
+ int numascan_tickets_num[2];
+ struct numascan_ticket *numascan_tickets[2];
#endif
/*
* Should the accounting and control be hierarchical, per subtree?
@@ -1644,6 +1655,46 @@ mem_cgroup_calc_numascan_weight(struct m
}
/*
+ * For lottery scheduling, this routine distributes "tickets" for
+ * scanning to each node. Tickets are recorded into the numascan_ticket
+ * array and this array will be used for scheduling later.
+ * To make the lottery fair, we limit the sum of tickets to at most 0xffff.
+ * Later, random32() & 0xffff will do a proportionally fair lottery.
+ */
+#define NUMA_TICKET_SHIFT (16)
+#define NUMA_TICKET_FACTOR ((1 << NUMA_TICKET_SHIFT) - 1)
+static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
+{
+ struct numascan_ticket *nt;
+ unsigned int node_ticket, assigned_tickets;
+ u64 weight;
+ int nid, assigned_num, generation;
+
+ /* update ticket information by double buffering */
+ generation = memcg->numascan_generation ^ 0x1;
+
+ nt = memcg->numascan_tickets[generation];
+ assigned_tickets = 0;
+ assigned_num = 0;
+ for_each_node_mask(nid, memcg->scan_nodes) {
+ weight = memcg->info.nodeinfo[nid]->weight;
+ node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
+ memcg->total_weight + 1);
+ if (!node_ticket)
+ node_ticket = 1;
+ nt->nid = nid;
+ nt->start = assigned_tickets;
+ nt->tickets = node_ticket;
+ assigned_tickets += node_ticket;
+ nt++;
+ assigned_num++;
+ }
+ memcg->numascan_tickets_num[generation] = assigned_num;
+ smp_wmb();
+ memcg->numascan_generation = generation;
+}
+
+/*
* Update all node's scan weight in background.
*/
static void mem_cgroup_numainfo_update_work(struct work_struct *work)
@@ -1656,6 +1707,8 @@ static void mem_cgroup_numainfo_update_w
memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
+ synchronize_rcu();
+ mem_cgroup_update_numascan_tickets(memcg);
atomic_set(&memcg->numainfo_updating, 0);
css_put(&memcg->css);
}
@@ -1682,6 +1735,18 @@ static void mem_cgroup_may_update_nodema
schedule_work(&mem->numainfo_update_work);
}
+static int node_weight_compare(const void *key, const void *elt)
+{
+ unsigned long lottery = (unsigned long)key;
+ struct numascan_ticket *nt = (struct numascan_ticket *)elt;
+
+ if (lottery < nt->start)
+ return -1;
+ if (lottery > (nt->start + nt->tickets))
+ return 1;
+ return 0;
+}
+
/*
* Selecting a node where we start reclaim from. Because what we need is just
* reducing usage counter, start from anywhere is O,K. Considering
@@ -1691,32 +1756,38 @@ static void mem_cgroup_may_update_nodema
* we'll use or we've used. So, it may make LRU bad. And if several threads
* hit limits, it will see a contention on a node. But freeing from remote
* node means more costs for memory reclaim because of memory latency.
- *
- * Now, we use round-robin. Better algorithm is welcomed.
*/
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+ struct memcg_scanrecord *rec)
{
- int node;
+ int node = MAX_NUMNODES;
+ struct numascan_ticket *nt;
+ unsigned long lottery;
+ int generation;
+ if (rec->context == SCAN_BY_SHRINK)
+ goto out;
+
+ mem_cgroup_may_update_nodemask(memcg);
*mask = NULL;
- mem_cgroup_may_update_nodemask(mem);
- node = mem->last_scanned_node;
+ lottery = random32() & NUMA_TICKET_FACTOR;
- node = next_node(node, mem->scan_nodes);
- if (node == MAX_NUMNODES)
- node = first_node(mem->scan_nodes);
- /*
- * We call this when we hit limit, not when pages are added to LRU.
- * No LRU may hold pages because all pages are UNEVICTABLE or
- * memcg is too small and all pages are not on LRU. In that case,
- * we use curret node.
- */
- if (unlikely(node == MAX_NUMNODES))
+ rcu_read_lock();
+ generation = memcg->numascan_generation;
+ nt = bsearch((void *)lottery,
+ memcg->numascan_tickets[generation],
+ memcg->numascan_tickets_num[generation],
+ sizeof(struct numascan_ticket), node_weight_compare);
+ rcu_read_unlock();
+ if (nt)
+ node = nt->nid;
+out:
+ if (unlikely(node == MAX_NUMNODES)) {
node = numa_node_id();
- else
- *mask = &mem->scan_nodes;
+ *mask = NULL;
+ } else
+ *mask = &memcg->scan_nodes;
- mem->last_scanned_node = node;
return node;
}
@@ -1755,14 +1826,42 @@ bool mem_cgroup_reclaimable(struct mem_c
return false;
}
-static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+static bool mem_cgroup_numascan_init(struct mem_cgroup *memcg)
{
+ struct numascan_ticket *nt;
+ int nr_nodes;
+
INIT_WORK(&memcg->numainfo_update_work,
mem_cgroup_numainfo_update_work);
+
+ nr_nodes = num_possible_nodes();
+ nt = kmalloc(sizeof(struct numascan_ticket) * nr_nodes,
+ GFP_KERNEL);
+ if (!nt)
+ return false;
+ memcg->numascan_tickets[0] = nt;
+ nt = kmalloc(sizeof(struct numascan_ticket) * nr_nodes,
+ GFP_KERNEL);
+ if (!nt) {
+ kfree(memcg->numascan_tickets[0]);
+ memcg->numascan_tickets[0] = NULL;
+ return false;
+ }
+ memcg->numascan_tickets[1] = nt;
+ memcg->numascan_tickets_num[0] = 0;
+ memcg->numascan_tickets_num[1] = 0;
+ return true;
+}
+
+static void mem_cgroup_numascan_free(struct mem_cgroup *memcg)
+{
+ kfree(memcg->numascan_tickets[0]);
+ kfree(memcg->numascan_tickets[1]);
}
#else
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask,
+ struct memcg_scanrecord *rec)
{
*mask = NULL;
return 0;
@@ -1775,6 +1874,9 @@ bool mem_cgroup_reclaimable(struct mem_c
static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
{
}
+static void mem_cgroup_numascan_free(struct mem_cgroup *memcg)
+{
+}
#endif
static void __mem_cgroup_record_scanstat(unsigned long *stats,
@@ -5015,6 +5117,7 @@ static void __mem_cgroup_free(struct mem
int node;
mem_cgroup_remove_from_trees(mem);
+ mem_cgroup_numascan_free(mem);
free_css_id(&mem_cgroup_subsys, &mem->css);
for_each_node_state(node, N_POSSIBLE)
@@ -5153,7 +5256,8 @@ mem_cgroup_create(struct cgroup_subsys *
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
spin_lock_init(&mem->scanstat.lock);
- mem_cgroup_numascan_init(mem);
+ if (!mem_cgroup_numascan_init(mem))
+ goto free_out;
return &mem->css;
free_out:
__mem_cgroup_free(mem);
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2313,9 +2313,9 @@ unsigned long try_to_free_mem_cgroup_pag
* take care of from where we get pages. So the node where we start the
* scan does not need to be the current node.
*/
- nid = mem_cgroup_select_victim_node(mem_cont, &mask);
+ nid = mem_cgroup_select_victim_node(mem_cont, &mask, rec);
- zonelist = NODE_DATA(nid)->node_zonelists;
+ zonelist = &NODE_DATA(nid)->node_zonelists[0];
trace_mm_vmscan_memcg_reclaim_begin(0,
sc.may_writepage,
Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -117,7 +117,8 @@ extern void mem_cgroup_end_migration(str
*/
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask);
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+ struct memcg_scanrecord *rec);
unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
int nid, int zid, unsigned int lrumask);
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
* [PATCH v4 6/5] memcg : check numa balance
2011-07-27 5:44 ` KAMEZAWA Hiroyuki
@ 2011-07-27 5:52 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:52 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, Michal Hocko, nishimura
This patch was required for handling numa-unbalanced memcg.
==
Because do_try_to_free_pages() scans nodes based on a zonelist,
even if we select a victim node, we may scan other nodes.
When the nodes are balanced, this is good because we'll quit the scan loop
before updating 'priority'. But when the nodes are unbalanced,
it will force scanning of very small nodes and will cause
swap-out when a node doesn't contain enough file caches.
This patch selects which zonelist[] vmscan uses for memcg reclaim.
If the memcg is well balanced among nodes, the usual fallback zonelist (and mask) is used.
If not, it selects the node-local zonelist and does targeted reclaim.
This will reduce unnecessary (anon page) scans when the memcg is not balanced.
Now, memcg/NUMA is considered balanced when each node's weight is between
80% and 120% of the average node weight.
(*) This range is just a magic number but works well in several tests.
Further study to determine this value is appreciated.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 2 +-
mm/memcontrol.c | 20 ++++++++++++++++++--
mm/vmscan.c | 8 ++++++--
3 files changed, 25 insertions(+), 5 deletions(-)
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -295,6 +295,7 @@ struct mem_cgroup {
atomic_t numainfo_updating;
struct work_struct numainfo_update_work;
unsigned long total_weight;
+ bool numascan_balance;
int numascan_generation;
int numascan_tickets_num[2];
struct numascan_ticket *numascan_tickets[2];
@@ -1663,12 +1664,15 @@ mem_cgroup_calc_numascan_weight(struct m
*/
#define NUMA_TICKET_SHIFT (16)
#define NUMA_TICKET_FACTOR ((1 << NUMA_TICKET_SHIFT) - 1)
+#define NUMA_BALANCE_RANGE_LOW (80)
+#define NUMA_BALANCE_RANGE_HIGH (120)
static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
{
struct numascan_ticket *nt;
unsigned int node_ticket, assigned_tickets;
u64 weight;
int nid, assigned_num, generation;
+ unsigned long average, balance_low, balance_high;
/* update ticket information by double buffering */
generation = memcg->numascan_generation ^ 0x1;
@@ -1676,6 +1680,11 @@ static void mem_cgroup_update_numascan_t
nt = memcg->numascan_tickets[generation];
assigned_tickets = 0;
assigned_num = 0;
+ average = memcg->total_weight / (nodes_weight(memcg->scan_nodes) + 1);
+ balance_low = NUMA_BALANCE_RANGE_LOW * average / 100;
+ balance_high = NUMA_BALANCE_RANGE_HIGH * average / 100;
+ memcg->numascan_balance = true;
+
for_each_node_mask(nid, memcg->scan_nodes) {
weight = memcg->info.nodeinfo[nid]->weight;
node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
@@ -1688,6 +1697,9 @@ static void mem_cgroup_update_numascan_t
assigned_tickets += node_ticket;
nt++;
assigned_num++;
+ if ((weight < balance_low) ||
+ (weight > balance_high))
+ memcg->numascan_balance = false;
}
memcg->numascan_tickets_num[generation] = assigned_num;
smp_wmb();
@@ -1758,7 +1770,7 @@ static int node_weight_compare(const voi
* node means more costs for memory reclaim because of memory latency.
*/
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
- struct memcg_scanrecord *rec)
+ struct memcg_scanrecord *rec, bool *fallback)
{
int node = MAX_NUMNODES;
struct numascan_ticket *nt;
@@ -1785,8 +1797,11 @@ out:
if (unlikely(node == MAX_NUMNODES)) {
node = numa_node_id();
*mask = NULL;
- } else
+ *fallback = true;
+ } else {
*mask = &memcg->scan_nodes;
+ *fallback = memcg->numascan_balance;
+ }
return node;
}
@@ -1864,6 +1879,7 @@ int mem_cgroup_select_victim_node(struct
struct memcg_scanrecord *rec)
{
*mask = NULL;
+ *fallback = true;
return 0;
}
Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -118,7 +118,7 @@ extern void mem_cgroup_end_migration(str
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
- struct memcg_scanrecord *rec);
+ struct memcg_scanrecord *rec, bool *fallback);
unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
int nid, int zid, unsigned int lrumask);
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2290,6 +2290,7 @@ unsigned long try_to_free_mem_cgroup_pag
unsigned long nr_reclaimed;
unsigned long start, end;
int nid;
+ bool fallback;
nodemask_t *mask;
struct scan_control sc = {
.may_writepage = !laptop_mode,
@@ -2313,9 +2314,12 @@ unsigned long try_to_free_mem_cgroup_pag
* take care of from where we get pages. So the node where we start the
* scan does not need to be the current node.
*/
- nid = mem_cgroup_select_victim_node(mem_cont, &mask, rec);
+ nid = mem_cgroup_select_victim_node(mem_cont, &mask, rec, &fallback);
- zonelist = &NODE_DATA(nid)->node_zonelists[0];
+ if (fallback) /* memcg/NUMA is balanced and fallback works well */
+ zonelist = &NODE_DATA(nid)->node_zonelists[0];
+ else /* memcg/NUMA is not balanced, do target reclaim */
+ zonelist = &NODE_DATA(nid)->node_zonelists[1];
trace_mm_vmscan_memcg_reclaim_begin(0,
sc.may_writepage,
* [PATCH v4 6/5] memcg : check numa balance
@ 2011-07-27 5:52 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27 5:52 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, Michal Hocko, nishimura
This patch is required for handling a NUMA-unbalanced memcg.
==
Because do_try_to_free_pages() scans nodes based on a zonelist,
we may scan other nodes even after selecting a victim node.
When the nodes are balanced this is fine, because we quit the scan loop
before updating 'priority'. But when the nodes are unbalanced,
it forces scanning of very small nodes and causes swap-out
when a node doesn't contain enough file caches.
This patch selects which zonelist[] vmscan uses as its scan list for memcg.
If the memcg is well balanced among nodes, the usual fallback zonelist (and mask) is used.
If not, it selects the node-local zonelist and does targeted reclaim.
This reduces unnecessary (anon page) scans when the memcg is not balanced.
Now, memcg/NUMA is considered balanced when each node's weight is between
80% and 120% of the average node weight.
(*) This range is just a magic number, but it works well in several tests.
Further study to determine this value is appreciated.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 2 +-
mm/memcontrol.c | 20 ++++++++++++++++++--
mm/vmscan.c | 8 ++++++--
3 files changed, 25 insertions(+), 5 deletions(-)
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -295,6 +295,7 @@ struct mem_cgroup {
atomic_t numainfo_updating;
struct work_struct numainfo_update_work;
unsigned long total_weight;
+ bool numascan_balance;
int numascan_generation;
int numascan_tickets_num[2];
struct numascan_ticket *numascan_tickets[2];
@@ -1663,12 +1664,15 @@ mem_cgroup_calc_numascan_weight(struct m
*/
#define NUMA_TICKET_SHIFT (16)
#define NUMA_TICKET_FACTOR ((1 << NUMA_TICKET_SHIFT) - 1)
+#define NUMA_BALANCE_RANGE_LOW (80)
+#define NUMA_BALANCE_RANGE_HIGH (120)
static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
{
struct numascan_ticket *nt;
unsigned int node_ticket, assigned_tickets;
u64 weight;
int nid, assigned_num, generation;
+ unsigned long average, balance_low, balance_high;
/* update ticket information by double buffering */
generation = memcg->numascan_generation ^ 0x1;
@@ -1676,6 +1680,11 @@ static void mem_cgroup_update_numascan_t
nt = memcg->numascan_tickets[generation];
assigned_tickets = 0;
assigned_num = 0;
+ average = memcg->total_weight / (nodes_weight(memcg->scan_nodes) + 1);
+ balance_low = NUMA_BALANCE_RANGE_LOW * average / 100;
+ balance_high = NUMA_BALANCE_RANGE_HIGH * average / 100;
+ memcg->numascan_balance = true;
+
for_each_node_mask(nid, memcg->scan_nodes) {
weight = memcg->info.nodeinfo[nid]->weight;
node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
@@ -1688,6 +1697,9 @@ static void mem_cgroup_update_numascan_t
assigned_tickets += node_ticket;
nt++;
assigned_num++;
+ if ((weight < balance_low) ||
+ (weight > balance_high))
+ memcg->numascan_balance = false;
}
memcg->numascan_tickets_num[generation] = assigned_num;
smp_wmb();
@@ -1758,7 +1770,7 @@ static int node_weight_compare(const voi
* node means more costs for memory reclaim because of memory latency.
*/
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
- struct memcg_scanrecord *rec)
+ struct memcg_scanrecord *rec, bool *fallback)
{
int node = MAX_NUMNODES;
struct numascan_ticket *nt;
@@ -1785,8 +1797,11 @@ out:
if (unlikely(node == MAX_NUMNODES)) {
node = numa_node_id();
*mask = NULL;
- } else
+ *fallback = true;
+ } else {
*mask = &memcg->scan_nodes;
+ *fallback = memcg->numascan_balance;
+ }
return node;
}
@@ -1864,6 +1879,7 @@ int mem_cgroup_select_victim_node(struct
struct memcg_scanrecord *rec)
{
*mask = NULL;
+ *fallback = true;
return 0;
}
Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -118,7 +118,7 @@ extern void mem_cgroup_end_migration(str
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
- struct memcg_scanrecord *rec);
+ struct memcg_scanrecord *rec, bool *fallback);
unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
int nid, int zid, unsigned int lrumask);
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2290,6 +2290,7 @@ unsigned long try_to_free_mem_cgroup_pag
unsigned long nr_reclaimed;
unsigned long start, end;
int nid;
+ bool fallback;
nodemask_t *mask;
struct scan_control sc = {
.may_writepage = !laptop_mode,
@@ -2313,9 +2314,12 @@ unsigned long try_to_free_mem_cgroup_pag
* take care of from where we get pages. So the node where we start the
* scan does not need to be the current node.
*/
- nid = mem_cgroup_select_victim_node(mem_cont, &mask, rec);
+ nid = mem_cgroup_select_victim_node(mem_cont, &mask, rec, &fallback);
- zonelist = &NODE_DATA(nid)->node_zonelists[0];
+ if (fallback) /* memcg/NUMA is balanced and fallback works well */
+ zonelist = &NODE_DATA(nid)->node_zonelists[0];
+ else /* memcg/NUMA is not balanced, do target reclaim */
+ zonelist = &NODE_DATA(nid)->node_zonelists[1];
trace_mm_vmscan_memcg_reclaim_begin(0,
sc.may_writepage,
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH v4 2/5] memcg : pass scan nodemask
2011-07-27 5:47 ` KAMEZAWA Hiroyuki
@ 2011-08-01 13:59 ` Michal Hocko
0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2011-08-01 13:59 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, nishimura
On Wed 27-07-11 14:47:42, KAMEZAWA Hiroyuki wrote:
>
> pass memcg's nodemask to try_to_free_pages().
>
> try_to_free_pages() can take a nodemask as its argument, but memcg
> doesn't pass one. Considering that memcg can be used with cpuset on
> big NUMA, memcg should pass its nodemask when available.
>
> Now, memcg maintains its nodemask with periodic updates. Pass it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/memcontrol.h | 2 +-
> mm/memcontrol.c | 8 ++++++--
> mm/vmscan.c | 3 ++-
> 3 files changed, 9 insertions(+), 4 deletions(-)
>
[...]
> Index: mmotm-0710/mm/vmscan.c
> ===================================================================
> --- mmotm-0710.orig/mm/vmscan.c
> +++ mmotm-0710/mm/vmscan.c
> @@ -2280,6 +2280,7 @@ unsigned long try_to_free_mem_cgroup_pag
> unsigned long nr_reclaimed;
> unsigned long start, end;
> int nid;
> + nodemask_t *mask;
> struct scan_control sc = {
> .may_writepage = !laptop_mode,
> .may_unmap = 1,
> @@ -2302,7 +2303,7 @@ unsigned long try_to_free_mem_cgroup_pag
> * take care of from where we get pages. So the node where we start the
> * scan does not need to be the current node.
> */
> - nid = mem_cgroup_select_victim_node(mem_cont);
> + nid = mem_cgroup_select_victim_node(mem_cont, &mask);
The mask is not used anywhere AFAICS and using it is a point of the
patch AFAIU. I guess you wanted to use &sc.nodemask, right?
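Presumably the intended wiring, reading the quoted hunk, would be something like this (our guess at the fix, untested):

```
	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask);
	zonelist = &NODE_DATA(nid)->node_zonelists[0];
	/* sc.nodemask would then limit for_each_zone_zonelist_nodemask()
	 * in do_try_to_free_pages() to the memcg's scan_nodes */
```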
Other than that, looks good to me.
Reviewed-by: Michal Hocko <mhocko@suse.cz>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [PATCH v4 3/5] memcg : stop scanning if enough
2011-07-27 5:49 ` KAMEZAWA Hiroyuki
@ 2011-08-01 14:37 ` Michal Hocko
0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2011-08-01 14:37 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, nishimura
On Wed 27-07-11 14:49:00, KAMEZAWA Hiroyuki wrote:
> memcg :avoid node fallback scan if possible.
>
> Now, try_to_free_pages() scans the whole zonelist because the page
> allocator should visit all zonelists...but that behavior is harmful
> for memcg. Memcg scans memory only because it hit its limit...there
> is no memory shortage in the passed zonelist.
>
> For example, with following unbalanced nodes
>
> Node 0 Node 1
> File 1G 0
> Anon 200M 200M
>
> memcg will cause swap-out from Node1 at every vmscan.
>
> Another example: assume a 1024-node system.
> With 1024 nodes, memcg will visit pages on all 1024 nodes
> per vmscan... This is overkill.
>
> This is why memcg's victim node selection logic doesn't work
> as expected.
Previous patch adds nodemask filled by
mem_cgroup_select_victim_node. Shouldn't we rather limit that nodemask
to a victim node?
Or am I missing something?
The patch as is doesn't look nice; it makes shrink_zones even more
memcg-hacky:
for_each_zone_zonelist_nodemask
if (scanning_global_lru(sc))
/.../
shrink_zone(priority, zone, sc);
if (!scanning_global_lru(sc))
/.../
>
> This patch is a help for stopping vmscan when we scanned enough.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/vmscan.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> Index: mmotm-0710/mm/vmscan.c
> ===================================================================
> --- mmotm-0710.orig/mm/vmscan.c
> +++ mmotm-0710/mm/vmscan.c
> @@ -2058,6 +2058,16 @@ static void shrink_zones(int priority, s
> }
>
> shrink_zone(priority, zone, sc);
> + if (!scanning_global_lru(sc)) {
> + /*
> +			 * When we scan for a memcg's limit, it's bad to
> +			 * fall back into more nodes/zones because there is no
> +			 * memory shortage. We quit as early as possible once
> +			 * we reach the target.
> + */
> + if (sc->nr_to_reclaim <= sc->nr_reclaimed)
> + break;
> + }
> }
> }
>
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [PATCH v4 3/5] memcg : stop scanning if enough
2011-08-01 14:37 ` Michal Hocko
@ 2011-08-01 19:49 ` Michal Hocko
0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2011-08-01 19:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, nishimura
On Mon 01-08-11 16:37:45, Michal Hocko wrote:
> On Wed 27-07-11 14:49:00, KAMEZAWA Hiroyuki wrote:
> > memcg :avoid node fallback scan if possible.
> >
> > Now, try_to_free_pages() scans all zonelist because the page allocator
> > should visit all zonelists...but that behavior is harmful for memcg.
> > Memcg just scans memory because it hits limit...no memory shortage
> > in pased zonelist.
> >
> > For example, with following unbalanced nodes
> >
> > Node 0 Node 1
> > File 1G 0
> > Anon 200M 200M
> >
> > memcg will cause swap-out from Node1 at every vmscan.
> >
> > Another example, assume 1024 nodes system.
> > With 1024 node system, memcg will visit 1024 nodes
> > pages per vmscan... This is overkilling.
> >
> > This is why memcg's victim node selection logic doesn't work
> > as expected.
>
> Previous patch adds nodemask filled by
> mem_cgroup_select_victim_node. Shouldn't we rather limit that nodemask
> to a victim node?
Bahh, scratch that. I was jumping from one thing to another and got
totally confused. Victim memcg is not bound to any particular node in
general...
Sorry for noise. I will try to get back to this tomorrow.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [PATCH v4 2/5] memcg : pass scan nodemask
2011-08-01 13:59 ` Michal Hocko
@ 2011-08-02 2:21 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-02 2:21 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, linux-kernel, akpm, nishimura
On Mon, 1 Aug 2011 15:59:53 +0200
Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 27-07-11 14:47:42, KAMEZAWA Hiroyuki wrote:
> >
> > pass memcg's nodemask to try_to_free_pages().
> >
> > try_to_free_pages can take nodemask as its argument but memcg
> > doesn't pass it. Considering memcg can be used with cpuset on
> > big NUMA, memcg should pass nodemask if available.
> >
> > Now, memcg maintain nodemask with periodic updates. pass it.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> > include/linux/memcontrol.h | 2 +-
> > mm/memcontrol.c | 8 ++++++--
> > mm/vmscan.c | 3 ++-
> > 3 files changed, 9 insertions(+), 4 deletions(-)
> >
> [...]
> > Index: mmotm-0710/mm/vmscan.c
> > ===================================================================
> > --- mmotm-0710.orig/mm/vmscan.c
> > +++ mmotm-0710/mm/vmscan.c
> > @@ -2280,6 +2280,7 @@ unsigned long try_to_free_mem_cgroup_pag
> > unsigned long nr_reclaimed;
> > unsigned long start, end;
> > int nid;
> > + nodemask_t *mask;
> > struct scan_control sc = {
> > .may_writepage = !laptop_mode,
> > .may_unmap = 1,
> > @@ -2302,7 +2303,7 @@ unsigned long try_to_free_mem_cgroup_pag
> > * take care of from where we get pages. So the node where we start the
> > * scan does not need to be the current node.
> > */
> > - nid = mem_cgroup_select_victim_node(mem_cont);
> > + nid = mem_cgroup_select_victim_node(mem_cont, &mask);
>
> The mask is not used anywhere AFAICS and using it is a point of the
> patch AFAIU. I guess you wanted to use &sc.nodemask, right?
>
> Other than that, looks good to me.
>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
Ah, sorry. I'll fix.
Thanks,
-Kame
end of thread, other threads:[~2011-08-02 2:29 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-27  5:44 [PATCH v4 0/5] memcg : make numa scanning better KAMEZAWA Hiroyuki
2011-07-27  5:46 ` [PATCH v4 1/5] memcg : update numascan info by schedule_work KAMEZAWA Hiroyuki
2011-07-27  5:47 ` [PATCH v4 2/5] memcg : pass scan nodemask KAMEZAWA Hiroyuki
2011-08-01 13:59   ` Michal Hocko
2011-08-02  2:21     ` KAMEZAWA Hiroyuki
2011-07-27  5:49 ` [PATCH v4 3/5] memcg : stop scanning if enough KAMEZAWA Hiroyuki
2011-08-01 14:37   ` Michal Hocko
2011-08-01 19:49     ` Michal Hocko
2011-07-27  5:49 ` [PATCH v4 4/5] memcg : calculate node scan weight KAMEZAWA Hiroyuki
2011-07-27  5:51 ` [PATCH v4 5/5] memcg : select a victim node by weights KAMEZAWA Hiroyuki
2011-07-27  5:52 ` [PATCH v4 6/5] memcg : check numa balance KAMEZAWA Hiroyuki