* [RFC PATCH V7 0/9] memcg: per cgroup background reclaim
@ 2011-04-22  4:24 Ying Han
  2011-04-22  4:24 ` [PATCH V7 1/9] Add kswapd descriptor Ying Han
                   ` (8 more replies)
  0 siblings, 9 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

The current memcg implementation supports targeted reclaim when a cgroup is reaching its
hard_limit, and we do direct reclaim per cgroup. Per-cgroup background reclaim is needed
to spread memory pressure out over a longer period of time and to smooth out cgroup
performance.

Two watermarks ("high_wmark", "low_wmark") are added to trigger and stop the background
reclaim. The watermarks are calculated based on the cgroup's limit_in_bytes. By default,
the per-memcg kswapd threads run under the root cgroup. There is a per-memcg API which
exports the pid of each kswapd thread, so userspace can configure a cpu cgroup separately.

Prior to this version, there was one kswapd thread per cgroup. The thread was created
when the cgroup changed its limit_in_bytes and was deleted when the cgroup was removed.
In environments where thousands of cgroups are configured on a single host, we would have
thousands of kswapd threads. The memory consumption would be 8k*1000 = 8M. We don't see
a big issue for now if the host can hold that many cgroups.

In this patchset, I applied the thread pool patch from KAMEZAWA. The patch is built on
top of V6 but changes the threading model. All memcgs which need background reclaim are
linked to a list, and memcg-kswapd picks up a memcg from the list and runs reclaim.

The per-memcg-per-kswapd model
Pros:
1. the implementation is simple and straightforward.
2. we can easily isolate the background reclaim overhead between cgroups.
3. better latency from memory pressure to actually starting reclaim.

Cons:
1. memory overhead per thread; the memory consumption would be 8k*1000 = 8M with 1k cgroups.
2. we see lots of threads in 'ps -elf'.

The thread-pool model
Pros:
1. saves some memory resource.

Cons:
1. there is no isolation between memcg background reclaim, since the kswapd threads are shared.
2. it is hard for visibility and debuggability. I have experienced cases where some kswapds
run crazy, and we need a straightforward way to identify which cgroup is causing the reclaim.
3. potential starvation for some memcgs, if one work item gets stuck and the rest of the
work won't proceed.

In general, the per-memcg-per-kswapd implementation looks sane to me at this point,
especially since the shared memcg thread model will make debugging issues very hard later.

changelog v7..v6:
1. applied the [PATCH 1/3] memcg kswapd thread pool from KAMEZAWA and fixed the merge
conflicts.
2. applied the [PATCH 3/3] fix for mem_cgroup_watermark_ok.
3. fixed the compile error when building without memcg.
4. removed two patches from V6 (export kswapd_id and the memory.stat).
5. I haven't applied [PATCH 2/3]. Will include that in the next post after we decide on
the threading model.

I ran a dd test on a large file and then cat'ed the file back. Then I compared the
reclaim-related stats in memory.stat.

Step 1: Create a cgroup with a 500M memory limit.
$ mkdir /dev/cgroup/memory/A
$ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/A/tasks

Step 2: Test and set the wmarks.
$ cat /dev/cgroup/memory/A/memory.low_wmark_distance
0
$ cat /dev/cgroup/memory/A/memory.high_wmark_distance
0

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 524288000
high_wmark 524288000

$ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
$ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 482344960
high_wmark 471859200
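
(As a quick check, each wmark is simply the limit minus the corresponding distance:
524288000 - 41943040 = 482344960 for low_wmark and 524288000 - 52428800 = 471859200 for
high_wmark.)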

$ ps -ef | grep memcg
root       607     2  0 18:25 ?        00:00:01 [memcg_1]
root       608     2  0 18:25 ?        00:00:03 [memcg_2]
root       609     2  0 18:25 ?        00:00:03 [memcg_3]
root     32711 32696  0 21:07 ttyS0    00:00:00 grep memcg

Step 3: Dirty the pages by creating a 20g file on the hard drive.
$ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
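
(ddtest is a non-upstream test tool; assuming -b is the block size in bytes, -n the block
count, and that it writes tf0 under the given directory, a rough plain-dd equivalent of
this 20g write would be:)
$ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520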

Here are the memory.stat numbers with vs. without per-memcg background reclaim.
Previously all of the pages were reclaimed via direct reclaim; now some of the pages are
also being reclaimed in the background.

Only direct reclaim                      With background reclaim:

pgpgin 5243093                           pgpgin 5242926
pgpgout 5115140                          pgpgout 5127779
kswapd_steal 0                           kswapd_steal 1427513
pg_pgsteal 5115117                       pg_pgsteal 3700243
kswapd_pgscan 0                          kswapd_pgscan 2636508
pg_scan 5941889                          pg_scan 21933629
pgrefill 264792                          pgrefill 283584
pgoutrun 0                               pgoutrun 43875
allocstall 158383                        allocstall 114128

real   5m1.462s                          real    5m0.988s
user   0m1.235s                          user    0m1.209s
sys    1m8.929s                          sys     1m11.348s

throughput is 67.81 MB/sec               throughput is 68.04 MB/sec

Step 4: Cleanup
$ echo $$ >/dev/cgroup/memory/tasks
$ echo 1 > /dev/cgroup/memory/A/memory.force_empty
$ rmdir /dev/cgroup/memory/A
$ echo 3 >/proc/sys/vm/drop_caches

Step 5: Create the same cgroup and read the 20g file into pagecache.
$ cat /export/hdc3/dd/tf0 > /dev/zero

With per-cgroup background reclaim, all of the pages are reclaimed in the background
instead of via direct reclaim.

Only direct reclaim                       With background reclaim:

pgpgin 5243093                            pgpgin 5243032
pgpgout 5115140                           pgpgout 5127889
kswapd_steal 0                            kswapd_steal 5127830
pg_pgsteal 5115117                        pg_pgsteal 0
kswapd_pgscan 0                           kswapd_pgscan 5127840
pg_scan 5941889                           pg_scan 0
pgrefill 264792                           pgrefill 0
pgoutrun 0                                pgoutrun 160242
allocstall 158383                         allocstall 0

real    4m24.373s                         real    4m20.842s
user    0m0.265s                          user    0m0.289s
sys     0m23.205s                         sys     0m24.393s

Note:
This is the first effort at enhancing targeted reclaim for memcg. Here are the existing
known issues and our plan:

Items 1 and 2 below are from previous versions and are kept for the record.
1. there was one kswapd thread per cgroup. The thread was created when the cgroup changed
its limit_in_bytes and was deleted when the cgroup was removed. In environments where
thousands of cgroups are configured on a single host, we would have thousands of kswapd
threads. The memory consumption would be 8k*1000 = 8M. We don't see a big issue for now
if the host can hold that many cgroups.

2. regarding the alternative workqueue approach: it is more complicated and we need to be
very careful with work items in the workqueue. We have experienced cases where one work
item gets stuck and the rest of the work items won't proceed. For example, in dirty page
writeback, one heavy-writer cgroup could starve the other cgroups from flushing dirty
pages to the same disk. In the kswapd case, I can imagine we might have a similar
scenario. How to prioritize the work items is another problem: the order of adding the
work items to the queue reflects the order in which cgroups are reclaimed. We don't have
that restriction currently but rely on the cpu scheduler to put kswapd on the right
cpu-core to run. We "might" introduce a priority later for reclaim, and how to deal with
that is still open.

3. there is potential lock contention between per-cgroup kswapds, and the worst case
depends on the number of cpu cores on the system. Basically we are now sharing the
zone->lru_lock between the per-memcg LRU and the global LRU. I have a plan to get rid of
the global LRU eventually, which requires enhancing the existing targeted reclaim (this
patchset is part of that). I would like to get there, at which point the lock contention
problem will be solved naturally.

4. no hierarchical reclaim support in this patchset. I would like to get to it after the
basic stuff is accepted.

5. By default, the kswapd threads run under the root cgroup. If there is a need to put a
kswapd thread into a cpu cgroup, userspace can make that change by reading the pid from
the new API and echoing it into the cpu cgroup's tasks file. On a non-preemptible kernel,
we need to be careful about priority inversion when restricting kswapd cpu time while it
is holding a mutex.
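
A minimal sketch of that userspace step (the pid here is taken from the ps listing above,
and /dev/cgroup/cpu/kswapd is a hypothetical cpu cgroup created for this purpose):
$ mkdir /dev/cgroup/cpu/kswapd
$ echo 607 >/dev/cgroup/cpu/kswapd/tasks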


Ying Han (9):
  Add kswapd descriptor
  Add per memcg reclaim watermarks
  New APIs to adjust per-memcg wmarks
  Add memcg kswapd thread pool
  Infrastructure to support per-memcg reclaim.
  Implement the select_victim_node within memcg.
  Per-memcg background reclaim.
  Add per-memcg zone "unreclaimable"
  Enable per-memcg background reclaim.

 include/linux/memcontrol.h  |  122 +++++++++++
 include/linux/mmzone.h      |    2 +-
 include/linux/res_counter.h |   78 +++++++
 include/linux/sched.h       |    1 +
 include/linux/swap.h        |   11 +-
 kernel/res_counter.c        |    6 +
 mm/memcontrol.c             |  467 ++++++++++++++++++++++++++++++++++++++++++-
 mm/memory_hotplug.c         |    2 +-
 mm/vmscan.c                 |  337 +++++++++++++++++++++++++------
 9 files changed, 959 insertions(+), 67 deletions(-)

-- 
1.7.3.1


* [PATCH V7 1/9] Add kswapd descriptor
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:31   ` KAMEZAWA Hiroyuki
  2011-04-22  4:47   ` KOSAKI Motohiro
  2011-04-22  4:24 ` [PATCH V7 2/9] Add per memcg reclaim watermarks Ying Han
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

There is a kswapd kernel thread for each NUMA node. We will add a different kswapd for
each memcg. The kswapd sleeps on the wait queue headed at the kswapd_wait field of a
kswapd descriptor. The kswapd descriptor stores information about the node or memcg, and
it allows the global and per-memcg background reclaim to share common reclaim algorithms.

This patch adds the kswapd descriptor and moves the per-node kswapd to use the
new structure.

changelog v7..v6:
1. revert wait_queue_head change in pgdat. Keep the wait_queue_head in pgdat

changelog v6..v5:
1. rename kswapd_thr to kswapd_tsk
2. revert the api change on sleeping_prematurely since memcg doesn't support it.

changelog v5..v4:
1. add comment on kswapds_spinlock
2. remove the kswapds_spinlock. we don't need it here since the kswapd and pgdat
have 1:1 mapping.

changelog v3..v2:
1. move the struct mem_cgroup *kswapd_mem in the kswapd struct to a later patch.
2. rename thr in kswapd_run to something else.

changelog v2..v1:
1. dynamically allocate the kswapd descriptor and initialize the wait_queue_head of pgdat
in kswapd_run.
2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup kswapd
descriptor.

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/mmzone.h |    2 +-
 include/linux/swap.h   |    7 +++++
 mm/vmscan.c            |   64 ++++++++++++++++++++++++++++++++++++------------
 3 files changed, 56 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 628f07b..53c3c61 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -641,7 +641,7 @@ typedef struct pglist_data {
 					     range, including holes */
 	int node_id;
 	wait_queue_head_t kswapd_wait;
-	struct task_struct *kswapd;
+	struct kswapd *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 } pg_data_t;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed6ebe6..9b91ca4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
 	return current->flags & PF_KSWAPD;
 }
 
+struct kswapd {
+	struct task_struct *kswapd_task;
+	wait_queue_head_t *kswapd_wait;
+	pg_data_t *kswapd_pgdat;
+};
+
+int kswapd(void *p);
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..7aba681 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2570,21 +2570,24 @@ out:
 	return order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
+				int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = kswapd_p->kswapd_wait;
 
 	if (freezing(current) || kthread_should_stop())
 		return;
 
-	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
 	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
-		finish_wait(&pgdat->kswapd_wait, &wait);
-		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+		finish_wait(wait_h, &wait);
+		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 	}
 
 	/*
@@ -2611,7 +2614,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		else
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
-	finish_wait(&pgdat->kswapd_wait, &wait);
+	finish_wait(wait_h, &wait);
 }
 
 /*
@@ -2627,20 +2630,22 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
  * If there are applications that are active memory-allocators
  * (most normal use), this basically shouldn't matter.
  */
-static int kswapd(void *p)
+int kswapd(void *p)
 {
 	unsigned long order;
 	int classzone_idx;
-	pg_data_t *pgdat = (pg_data_t*)p;
+	struct kswapd *kswapd_p = (struct kswapd *)p;
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
 	struct task_struct *tsk = current;
 
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
 	};
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	const struct cpumask *cpumask;
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
+	cpumask = cpumask_of_node(pgdat->node_id);
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 	current->reclaim_state = &reclaim_state;
@@ -2679,7 +2684,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx);
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
@@ -2817,12 +2822,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 		for_each_node_state(nid, N_HIGH_MEMORY) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 			const struct cpumask *mask;
+			struct kswapd *kswapd_p;
+			struct task_struct *kswapd_tsk;
+			wait_queue_head_t *wait;
 
 			mask = cpumask_of_node(pgdat->node_id);
 
+			wait = &pgdat->kswapd_wait;
+			kswapd_p = pgdat->kswapd;
+			kswapd_tsk = kswapd_p->kswapd_task;
+
 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
 				/* One of our CPUs online: restore mask */
-				set_cpus_allowed_ptr(pgdat->kswapd, mask);
+				if (kswapd_tsk)
+					set_cpus_allowed_ptr(kswapd_tsk, mask);
 		}
 	}
 	return NOTIFY_OK;
@@ -2835,18 +2848,31 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 int kswapd_run(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
+	struct task_struct *kswapd_tsk;
+	struct kswapd *kswapd_p;
 	int ret = 0;
 
 	if (pgdat->kswapd)
 		return 0;
 
-	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
-	if (IS_ERR(pgdat->kswapd)) {
+	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
+	if (!kswapd_p)
+		return -ENOMEM;
+
+	pgdat->kswapd = kswapd_p;
+	kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
+	kswapd_p->kswapd_pgdat = pgdat;
+
+	kswapd_tsk = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (IS_ERR(kswapd_tsk)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
 		printk("Failed to start kswapd on node %d\n",nid);
+		pgdat->kswapd = NULL;
+		kfree(kswapd_p);
 		ret = -1;
-	}
+	} else
+		kswapd_p->kswapd_task = kswapd_tsk;
 	return ret;
 }
 
@@ -2855,10 +2881,16 @@ int kswapd_run(int nid)
  */
 void kswapd_stop(int nid)
 {
-	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+	struct task_struct *kswapd_tsk = NULL;
+	struct kswapd *kswapd_p = NULL;
+
+	kswapd_p = NODE_DATA(nid)->kswapd;
+	kswapd_tsk = kswapd_p->kswapd_task;
+	kswapd_p->kswapd_task = NULL;
+	if (kswapd_tsk)
+		kthread_stop(kswapd_tsk);
 
-	if (kswapd)
-		kthread_stop(kswapd);
+	kfree(kswapd_p);
 }
 
 static int __init kswapd_init(void)
-- 
1.7.3.1


* [PATCH V7 2/9] Add per memcg reclaim watermarks
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
  2011-04-22  4:24 ` [PATCH V7 1/9] Add kswapd descriptor Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:24 ` [PATCH V7 3/9] New APIs to adjust per-memcg wmarks Ying Han
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Two watermarks are added per-memcg: "high_wmark" and "low_wmark". The per-memcg kswapd
is invoked when the memcg's memory usage (usage_in_bytes) rises above the low_wmark. The
kswapd thread then reclaims pages until the usage drops below the high_wmark.

Each watermark is calculated based on the memcg's hard_limit (limit_in_bytes). Each time
the hard_limit is changed, the corresponding wmarks are re-calculated. Since the memory
controller charges only user pages, there is no need for a "min_wmark". The current
calculation of the wmarks is based on the individual tunables low/high_wmark_distance,
which default to 0.
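
A concrete example (numbers from the cover letter): with limit_in_bytes = 500m,
high_wmark_distance = 50m and low_wmark_distance = 40m, the low_wmark (482344960) sits
numerically above the high_wmark (471859200), so reclaim kicks in once usage climbs
within 40m of the limit and stops once usage has fallen at least 50m below it.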

changelog v7..v6:
1. apply the fix to mem_cgroup_watermark_ok from Kamezawa.

changelog v5..v4:
1. rename res_counter_low_wmark_limit_locked().
2. rename res_counter_high_wmark_limit_locked().

changelog v4..v3:
1. remove legacy comments
2. rename the res_counter_check_under_high_wmark_limit
3. replace the wmark_ratio per-memcg by individual tunable for both wmarks.
4. add comments on low/high_wmark
5. add individual tunables for low/high_wmarks and remove wmark_ratio
6. replace the mem_cgroup_get_limit() call by res_count_read_u64(). The first
one returns large value w/ swapon.

changelog v3..v2:
1. Add VM_BUG_ON() on couple of places.
2. Remove the spinlock around min_free_kbytes since the consequence of reading stale data
there is acceptable.
3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based on
hard_limit.

changelog v2..v1:
1. Remove the res_counter_charge on wmark due to performance concerns.
2. Move the new APIs min_free_kbytes, reclaim_wmarks into a separate commit.
3. Calculate the min_free_kbytes automatically based on the limit_in_bytes.
4. Make the wmark consistent with the core VM, which checks free pages instead of usage.
5. changed wmark to be boolean

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h  |    1 +
 include/linux/res_counter.h |   78 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    6 +++
 mm/memcontrol.c             |   51 ++++++++++++++++++++++++++++
 4 files changed, 136 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..3ece36d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..669f199 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -39,6 +39,14 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit that reclaim triggers.
+	 */
+	unsigned long long low_wmark_limit;
+	/*
+	 * the limit that reclaim stops.
+	 */
+	unsigned long long high_wmark_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -55,6 +63,9 @@ struct res_counter {
 
 #define RESOURCE_MAX (unsigned long long)LLONG_MAX
 
+#define CHARGE_WMARK_LOW	0x01
+#define CHARGE_WMARK_HIGH	0x02
+
 /**
  * Helpers to interact with userspace
  * res_counter_read_u64() - returns the value of the specified member.
@@ -92,6 +103,8 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_WMARK_LIMIT,
+	RES_HIGH_WMARK_LIMIT
 };
 
 /*
@@ -147,6 +160,24 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
 	return margin;
 }
 
+static inline bool
+res_counter_under_high_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->high_wmark_limit)
+		return true;
+
+	return false;
+}
+
+static inline bool
+res_counter_under_low_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->low_wmark_limit)
+		return true;
+
+	return false;
+}
+
 /**
  * Get the difference between the usage and the soft limit
  * @cnt: The counter
@@ -169,6 +200,30 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+static inline bool
+res_counter_under_low_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_under_low_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool
+res_counter_under_high_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_under_high_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -214,4 +269,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_high_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
+static inline int
+res_counter_set_low_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..206a724 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
+	counter->low_wmark_limit = RESOURCE_MAX;
+	counter->high_wmark_limit = RESOURCE_MAX;
 	counter->parent = parent;
 }
 
@@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_WMARK_LIMIT:
+		return &counter->low_wmark_limit;
+	case RES_HIGH_WMARK_LIMIT:
+		return &counter->high_wmark_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4407dd0..c351e62 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -272,6 +272,12 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	/*
+	 * used to calculate the low/high_wmarks based on the limit_in_bytes.
+	 */
+	u64 high_wmark_distance;
+	u64 low_wmark_distance;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -813,6 +819,25 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	u64 limit;
+
+	limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+	if (mem->high_wmark_distance == 0) {
+		res_counter_set_low_wmark_limit(&mem->res, limit);
+		res_counter_set_high_wmark_limit(&mem->res, limit);
+	} else {
+		u64 low_wmark, high_wmark;
+
+		low_wmark = limit - mem->low_wmark_distance;
+		high_wmark = limit - mem->high_wmark_distance;
+
+		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
+		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+	}
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -3205,6 +3230,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3264,6 +3290,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -4521,6 +4548,30 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+/*
+ * We use low_wmark and high_wmark for triggering per-memcg kswapd.
+ * The reclaim is triggered by low_wmark (usage > low_wmark) and stopped
+ * by high_wmark (usage < high_wmark).
+ */
+int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
+				int charge_flags)
+{
+	long ret = 0;
+	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
+
+	if (!mem->low_wmark_distance)
+		return 1;
+
+	VM_BUG_ON((charge_flags & flags) == flags);
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		ret = res_counter_under_low_wmark_limit(&mem->res);
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		ret = res_counter_under_high_wmark_limit(&mem->res);
+
+	return ret;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
-- 
1.7.3.1


* [PATCH V7 3/9] New APIs to adjust per-memcg wmarks
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
  2011-04-22  4:24 ` [PATCH V7 1/9] Add kswapd descriptor Ying Han
  2011-04-22  4:24 ` [PATCH V7 2/9] Add per memcg reclaim watermarks Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:32   ` KAMEZAWA Hiroyuki
  2011-04-22  4:24 ` [PATCH V7 4/9] Add memcg kswapd thread pool Ying Han
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Add memory.low_wmark_distance, memory.high_wmark_distance and reclaim_wmarks APIs
per-memcg. The first two adjust the internal low/high wmark calculation, and
reclaim_wmarks exports the current watermark values.

The low/high_wmark is calculated by subtracting the corresponding distance from the
hard_limit (limit_in_bytes); by default both distances are 0. When configuring the
distances, the user must set high_wmark_distance before low_wmark_distance; likewise, the
user must zero low_wmark_distance before zeroing high_wmark_distance (see the reset
example after the listing below).

$ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
$ cat /dev/cgroup/A/memory.limit_in_bytes
524288000

$ echo 50m >/dev/cgroup/A/memory.high_wmark_distance
$ echo 40m >/dev/cgroup/A/memory.low_wmark_distance

$ cat /dev/cgroup/A/memory.reclaim_wmarks
low_wmark 482344960
high_wmark 471859200
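
Resetting follows the reverse order. A minimal sketch of the documented constraint
(zero low_wmark_distance before high_wmark_distance):
$ echo 0 >/dev/cgroup/A/memory.low_wmark_distance
$ echo 0 >/dev/cgroup/A/memory.high_wmark_distance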

changelog v5..v4:
1. add sanity check for setting high/low_wmark_distance for root cgroup.

changelog v4..v3:
1. replace the "wmark_ratio" API with individual tunable for low/high_wmarks.

changelog v3..v2:
1. replace the "min_free_kbytes" API with "wmark_ratio". This is based on review
feedback.

Signed-off-by: Ying Han <yinghan@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c351e62..6029f1b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3974,6 +3974,78 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 mem_cgroup_high_wmark_distance_read(struct cgroup *cgrp,
+					       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->high_wmark_distance;
+}
+
+static u64 mem_cgroup_low_wmark_distance_read(struct cgroup *cgrp,
+					      struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->low_wmark_distance;
+}
+
+static int mem_cgroup_high_wmark_distance_write(struct cgroup *cont,
+						struct cftype *cft,
+						const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	u64 low_wmark_distance = memcg->low_wmark_distance;
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	if (!cont->parent)
+		return -EINVAL;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if ((val >= limit) || (val < low_wmark_distance) ||
+	   (low_wmark_distance && val == low_wmark_distance))
+		return -EINVAL;
+
+	memcg->high_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
+static int mem_cgroup_low_wmark_distance_write(struct cgroup *cont,
+					       struct cftype *cft,
+					       const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	u64 high_wmark_distance = memcg->high_wmark_distance;
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	if (!cont->parent)
+		return -EINVAL;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if ((val >= limit) || (val > high_wmark_distance) ||
+	    (high_wmark_distance && val == high_wmark_distance))
+		return -EINVAL;
+
+	memcg->low_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4265,6 +4337,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 	mutex_unlock(&memcg_oom_mutex);
 }
 
+static int mem_cgroup_wmark_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	u64 low_wmark, high_wmark;
+
+	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+
+	cb->fill(cb, "low_wmark", low_wmark);
+	cb->fill(cb, "high_wmark", high_wmark);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4368,6 +4455,20 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "high_wmark_distance",
+		.write_string = mem_cgroup_high_wmark_distance_write,
+		.read_u64 = mem_cgroup_high_wmark_distance_read,
+	},
+	{
+		.name = "low_wmark_distance",
+		.write_string = mem_cgroup_low_wmark_distance_write,
+		.read_u64 = mem_cgroup_low_wmark_distance_read,
+	},
+	{
+		.name = "reclaim_wmarks",
+		.read_map = mem_cgroup_wmark_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1


* [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
                   ` (2 preceding siblings ...)
  2011-04-22  4:24 ` [PATCH V7 3/9] New APIs to adjust per-memcg wmarks Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:36   ` KAMEZAWA Hiroyuki
  2011-04-22  5:39   ` KOSAKI Motohiro
  2011-04-22  4:24 ` [PATCH V7 5/9] Infrastructure to support per-memcg reclaim Ying Han
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This patch creates a thread pool for memcg-kswapd. All memcgs which need background
reclaim are linked to a list, and memcg-kswapd picks up a memcg from the list and runs
reclaim.

The concern with using a per-memcg kswapd thread is the system overhead, including memory
and cpu time.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |   69 +++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c            |   82 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 151 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3ece36d..9157c4d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -84,6 +84,11 @@ extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
+bool mem_cgroup_kswapd_can_sleep(void);
+struct mem_cgroup *mem_cgroup_get_shrink_target(void);
+void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
+wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
+
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
 {
@@ -355,6 +360,70 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
 {
 }
 
+/* background reclaim stats */
+static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
+					   int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
+				       int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
+					    int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
+					int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
+				       int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
+					int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
+					  int val)
+{
+	return 0;
+}
+
+static inline bool mem_cgroup_kswapd_can_sleep(void)
+{
+	return false;
+}
+
+static inline
+struct mem_cgroup *mem_cgroup_get_shrink_target(void)
+{
+	return NULL;
+}
+
+static inline void mem_cgroup_put_shrink_target(struct mem_cgroup *mem)
+{
+}
+
+static inline
+wait_queue_head_t *mem_cgroup_kswapd_waitq(void)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6029f1b..527ad9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,8 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include "internal.h"
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 
 #include <asm/uaccess.h>
 
@@ -262,6 +264,14 @@ struct mem_cgroup {
 	 * mem_cgroup ? And what type of charges should we move ?
 	 */
 	unsigned long 	move_charge_at_immigrate;
+
+	/* !=0 if a kswapd runs */
+	atomic_t kswapd_running;
+	/* for waiting the end*/
+	wait_queue_head_t memcg_kswapd_end;
+	/* for shceduling */
+	struct list_head memcg_kswapd_wait_list;
+
 	/*
 	 * percpu counter.
 	 */
@@ -4392,6 +4402,76 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }
 
+/*
+ * Controls for background memory reclam stuff.
+ */
+struct memcg_kswapd_work {
+	spinlock_t lock;
+	struct list_head list;
+	wait_queue_head_t waitq;
+};
+
+struct memcg_kswapd_work memcg_kswapd_control;
+
+static void memcg_kswapd_wait_end(struct mem_cgroup *mem)
+{
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&mem->memcg_kswapd_end, &wait, TASK_INTERRUPTIBLE);
+	if (atomic_read(&mem->kswapd_running))
+		schedule();
+	finish_wait(&mem->memcg_kswapd_end, &wait);
+}
+
+struct mem_cgroup *mem_cgroup_get_shrink_target(void)
+{
+	struct mem_cgroup *mem;
+
+	spin_lock(&memcg_kswapd_control.lock);
+	rcu_read_lock();
+	do {
+		mem = NULL;
+		if (!list_empty(&memcg_kswapd_control.list)) {
+			mem = list_entry(memcg_kswapd_control.list.next,
+					struct mem_cgroup,
+					memcg_kswapd_wait_list);
+			list_del_init(&mem->memcg_kswapd_wait_list);
+		}
+	} while (mem && !css_tryget(&mem->css));
+	if (mem)
+		atomic_inc(&mem->kswapd_running);
+	rcu_read_unlock();
+	spin_unlock(&memcg_kswapd_control.lock);
+	return mem;
+}
+
+void mem_cgroup_put_shrink_target(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return;
+	atomic_dec(&mem->kswapd_running);
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
+		spin_lock(&memcg_kswapd_control.lock);
+		if (list_empty(&mem->memcg_kswapd_wait_list)) {
+			list_add_tail(&mem->memcg_kswapd_wait_list,
+					&memcg_kswapd_control.list);
+		}
+		spin_unlock(&memcg_kswapd_control.lock);
+	}
+	wake_up_all(&mem->memcg_kswapd_end);
+	cgroup_release_and_wakeup_rmdir(&mem->css);
+}
+
+bool mem_cgroup_kswapd_can_sleep(void)
+{
+	return list_empty(&memcg_kswapd_control.list);
+}
+
+wait_queue_head_t *mem_cgroup_kswapd_waitq(void)
+{
+	return &memcg_kswapd_control.waitq;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4755,6 +4835,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
+	init_waitqueue_head(&mem->memcg_kswapd_end);
+	INIT_LIST_HEAD(&mem->memcg_kswapd_wait_list);
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
-- 
1.7.3.1


* [PATCH V7 5/9] Infrastructure to support per-memcg reclaim.
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
                   ` (3 preceding siblings ...)
  2011-04-22  4:24 ` [PATCH V7 4/9] Add memcg kswapd thread pool Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:38   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2011-04-22  4:24 ` [PATCH V7 6/9] Implement the select_victim_node within memcg Ying Han
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Add the kswapd_mem field to the kswapd descriptor, which links the kswapd kernel thread
to a memcg. The per-memcg kswapd sleeps on the wait queue headed at the kswapd_wait field
of the kswapd descriptor.

The kswapd() function is now shared between global and per-memcg kswapd. It is passed the
kswapd descriptor, which contains the information of either the node or the memcg. The
new function balance_mem_cgroup_pgdat is invoked if it is a per-memcg kswapd thread; the
implementation of that function is in the following patch.

changelog v7..v6:
1. change the threading model of memcg kswapd from per-memcg-per-thread to a thread pool.
This is based on the patch from KAMEZAWA.

change v6..v5:
1. rename is_node_kswapd to is_global_kswapd to match the scanning_global_lru.
2. revert the sleeping_prematurely change, but keep the kswapd_try_to_sleep()
for memcg.

changelog v4..v3:
1. fix up kswapd_run and kswapd_stop for online_pages() and offline_pages().
2. drop the PF_MEMALLOC flag for memcg kswapd for now, per KAMEZAWA's request.

changelog v3..v2:
1. split off from the initial patch which includes all changes of the following
three patches.

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    2 +-
 mm/memory_hotplug.c  |    2 +-
 mm/vmscan.c          |  156 +++++++++++++++++++++++++++++++-------------------
 3 files changed, 100 insertions(+), 60 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9b91ca4..a062f0b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -303,7 +303,7 @@ static inline void scan_unevictable_unregister_node(struct node *node)
 }
 #endif
 
-extern int kswapd_run(int nid);
+extern int kswapd_run(int nid, int id);
 extern void kswapd_stop(int nid);
 
 #ifdef CONFIG_MMU
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 321fc74..36b4eed 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -462,7 +462,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages)
 	setup_per_zone_wmarks();
 	calculate_zone_inactive_ratio(zone);
 	if (onlined_pages) {
-		kswapd_run(zone_to_nid(zone));
+		kswapd_run(zone_to_nid(zone), 0);
 		node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7aba681..63c557e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2241,6 +2241,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages > (present_pages >> 2);
 }
 
+#define is_global_kswapd(kswapd_p) ((kswapd_p)->kswapd_pgdat)
+
 /* is kswapd sleeping prematurely? */
 static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
@@ -2583,40 +2585,46 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 
 	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 
-	/* Try to sleep for a short interval */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
-		remaining = schedule_timeout(HZ/10);
-		finish_wait(wait_h, &wait);
-		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
-	}
-
-	/*
-	 * After a short sleep, check if it was a premature sleep. If not, then
-	 * go fully to sleep until explicitly woken up.
-	 */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
-		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+	if (is_global_kswapd(kswapd_p)) {
+		/* Try to sleep for a short interval */
+		if (!sleeping_prematurely(pgdat, order,
+				remaining, classzone_idx)) {
+			remaining = schedule_timeout(HZ/10);
+			finish_wait(wait_h, &wait);
+			prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
+		}
 
 		/*
-		 * vmstat counters are not perfectly accurate and the estimated
-		 * value for counters such as NR_FREE_PAGES can deviate from the
-		 * true value by nr_online_cpus * threshold. To avoid the zone
-		 * watermarks being breached while under pressure, we reduce the
-		 * per-cpu vmstat threshold while kswapd is awake and restore
-		 * them before going back to sleep.
+		 * After a short sleep, check if it was a premature sleep.
+		 * If not, then go fully to sleep until explicitly woken up.
 		 */
-		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
-		schedule();
-		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+		if (!sleeping_prematurely(pgdat, order,
+					remaining, classzone_idx)) {
+			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+			set_pgdat_percpu_threshold(pgdat,
+					calculate_normal_threshold);
+			schedule();
+			set_pgdat_percpu_threshold(pgdat,
+					calculate_pressure_threshold);
+		} else {
+			if (remaining)
+				count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
+			else
+				count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
+		}
 	} else {
-		if (remaining)
-			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
-		else
-			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
+		/* For now, we just check the remaining works.*/
+		if (mem_cgroup_kswapd_can_sleep())
+			schedule();
 	}
 	finish_wait(wait_h, &wait);
 }
 
+static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
+{
+	return 0;
+}
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2636,6 +2644,7 @@ int kswapd(void *p)
 	int classzone_idx;
 	struct kswapd *kswapd_p = (struct kswapd *)p;
 	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	struct mem_cgroup *mem;
 	struct task_struct *tsk = current;
 
 	struct reclaim_state reclaim_state = {
@@ -2645,9 +2654,11 @@ int kswapd(void *p)
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
-	cpumask = cpumask_of_node(pgdat->node_id);
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
+	if (is_global_kswapd(kswapd_p)) {
+		cpumask = cpumask_of_node(pgdat->node_id);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(tsk, cpumask);
+	}
 	current->reclaim_state = &reclaim_state;
 
 	/*
@@ -2662,7 +2673,10 @@ int kswapd(void *p)
 	 * us from recursively trying to free more memory as we're
 	 * trying to free the first piece of memory in the first place).
 	 */
-	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	if (is_global_kswapd(kswapd_p))
+		tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	else
+		tsk->flags |= PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
 	order = 0;
@@ -2672,36 +2686,48 @@ int kswapd(void *p)
 		int new_classzone_idx;
 		int ret;
 
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order = 0;
-		pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		if (order < new_order || classzone_idx > new_classzone_idx) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation or has tigher zone constraints
-			 */
-			order = new_order;
-			classzone_idx = new_classzone_idx;
-		} else {
-			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
-			order = pgdat->kswapd_max_order;
-			classzone_idx = pgdat->classzone_idx;
+		if (is_global_kswapd(kswapd_p)) {
+			new_order = pgdat->kswapd_max_order;
+			new_classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		}
+			if (order < new_order ||
+					classzone_idx > new_classzone_idx) {
+				/*
+				 * Don't sleep if someone wants a larger 'order'
+				 * allocation or has tigher zone constraints
+				 */
+				order = new_order;
+				classzone_idx = new_classzone_idx;
+			} else {
+				kswapd_try_to_sleep(kswapd_p, order,
+						    classzone_idx);
+				order = pgdat->kswapd_max_order;
+				classzone_idx = pgdat->classzone_idx;
+				pgdat->kswapd_max_order = 0;
+				pgdat->classzone_idx = MAX_NR_ZONES - 1;
+			}
+		} else
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
 			break;
 
+		if (ret)
+			continue;
 		/*
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret) {
+		if (is_global_kswapd(kswapd_p)) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			order = balance_pgdat(pgdat, order, &classzone_idx);
+		} else {
+			mem = mem_cgroup_get_shrink_target();
+			if (mem)
+				shrink_mem_cgroup(mem, order);
+			mem_cgroup_put_shrink_target(mem);
 		}
 	}
 	return 0;
@@ -2845,30 +2871,44 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
  * This kswapd start function will be called by init and node-hot-add.
  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
  */
-int kswapd_run(int nid)
+int kswapd_run(int nid, int memcgid)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
 	struct task_struct *kswapd_tsk;
+	pg_data_t *pgdat = NULL;
 	struct kswapd *kswapd_p;
+	static char name[TASK_COMM_LEN];
 	int ret = 0;
 
-	if (pgdat->kswapd)
-		return 0;
+	if (!memcgid) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kswapd)
+			return ret;
+	}
 
 	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
 	if (!kswapd_p)
 		return -ENOMEM;
 
-	pgdat->kswapd = kswapd_p;
-	kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
-	kswapd_p->kswapd_pgdat = pgdat;
+	if (!memcgid) {
+		pgdat->kswapd = kswapd_p;
+		kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
+		kswapd_p->kswapd_pgdat = pgdat;
+		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
+	} else {
+		kswapd_p->kswapd_wait = mem_cgroup_kswapd_waitq();
+		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcgid);
+	}
 
-	kswapd_tsk = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	kswapd_tsk = kthread_run(kswapd, kswapd_p, name);
 	if (IS_ERR(kswapd_tsk)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
-		printk("Failed to start kswapd on node %d\n",nid);
-		pgdat->kswapd = NULL;
+		if (!memcgid) {
+			printk(KERN_ERR "Failed to start kswapd on node %d\n",
+								nid);
+			pgdat->kswapd = NULL;
+		} else
+			printk(KERN_ERR "Failed to start kswapd on memcg\n");
 		kfree(kswapd_p);
 		ret = -1;
 	} else
@@ -2899,7 +2939,7 @@ static int __init kswapd_init(void)
 
 	swap_setup();
 	for_each_node_state(nid, N_HIGH_MEMORY)
- 		kswapd_run(nid);
+		kswapd_run(nid, 0);
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
-- 
1.7.3.1


* [PATCH V7 6/9] Implement the select_victim_node within memcg.
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
                   ` (4 preceding siblings ...)
  2011-04-22  4:24 ` [PATCH V7 5/9] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:39   ` KAMEZAWA Hiroyuki
  2011-04-22  4:24 ` [PATCH V7 7/9] Per-memcg background reclaim Ying Han
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This adds the node-selection mechanism for background reclaim: we remember the last
scanned node and always start from the next one each time. This simple round-robin
fashion provides fairness between nodes for each memcg.

changelog v6..v5:
1. fix the correct comment style.

changelog v5..v4:
1. initialize the last_scanned_node to MAX_NUMNODES.

changelog v4..v3:
1. split off from the per-memcg background reclaim patch.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    3 +++
 mm/memcontrol.c            |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9157c4d..7444738 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -83,6 +83,9 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
+extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
+extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
+					const nodemask_t *nodes);
 
 bool mem_cgroup_kswapd_can_sleep(void);
 struct mem_cgroup *mem_cgroup_get_shrink_target(void);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 527ad9a..4696fd8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -288,6 +288,12 @@ struct mem_cgroup {
 	 */
 	u64 high_wmark_distance;
 	u64 low_wmark_distance;
+
+	/*
+	 * While doing per cgroup background reclaim, we cache the
+	 * last node we reclaimed from
+	 */
+	int last_scanned_node;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -1544,6 +1550,27 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
+ * Visit the first node after the last_scanned_node of @mem and use that to
+ * reclaim free pages from.
+ */
+int
+mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
+{
+	int next_nid;
+	int last_scanned;
+
+	last_scanned = mem->last_scanned_node;
+	next_nid = next_node(last_scanned, *nodes);
+
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(*nodes);
+
+	mem->last_scanned_node = next_nid;
+
+	return next_nid;
+}
+
+/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -4753,6 +4780,14 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
 	return ret;
 }
 
+int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return -1;
+
+	return mem->last_scanned_node;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4828,6 +4863,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
 	if (parent)
-- 
1.7.3.1


* [PATCH V7 7/9] Per-memcg background reclaim.
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
                   ` (5 preceding siblings ...)
  2011-04-22  4:24 ` [PATCH V7 6/9] Implement the select_victim_node within memcg Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:40   ` KAMEZAWA Hiroyuki
  2011-04-22  6:00   ` KOSAKI Motohiro
  2011-04-22  4:24 ` [PATCH V7 8/9] Add per-memcg zone "unreclaimable" Ying Han
  2011-04-22  4:24 ` [PATCH V7 9/9] Enable per-memcg background reclaim Ying Han
  8 siblings, 2 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This is the main loop of per-memcg background reclaim which is implemented in
function balance_mem_cgroup_pgdat().

The function performs a priority loop similar to global reclaim. During each
iteration it invokes balance_pgdat_node() for all nodes on the system, which
is another new function that performs background reclaim per node. After reclaiming
each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
it returns true.
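
For readers who just want the shape of that loop, here is a stand-alone toy
model (user-space C with a stubbed per-node shrink; DEF_PRIORITY and
SWAP_CLUSTER_MAX are borrowed from the kernel, everything else is made up
for illustration and is not the patch itself):

#include <stdio.h>

#define DEF_PRIORITY		12
#define SWAP_CLUSTER_MAX	32
#define NR_NODES		4

/* stub standing in for the per-node shrink: pretend we reclaim a few pages */
static unsigned long shrink_node_stub(int nid, int priority)
{
	(void)nid;
	(void)priority;
	return 4;
}

int main(void)
{
	unsigned long nr_reclaimed = 0;
	int last_node = NR_NODES - 1;	/* first pass starts at node 0 */
	int priority, loop;

	for (priority = DEF_PRIORITY;
	     priority >= 0 && nr_reclaimed < SWAP_CLUSTER_MAX;
	     priority--) {
		/* one round-robin pass over the nodes per priority level */
		for (loop = NR_NODES; loop > 0; loop--) {
			last_node = (last_node + 1) % NR_NODES;
			nr_reclaimed += shrink_node_stub(last_node, priority);

			/* stand-in for the mem_cgroup_watermark_ok() break */
			if (nr_reclaimed >= SWAP_CLUSTER_MAX)
				goto out;
		}
	}
out:
	printf("reclaimed %lu pages, stopped at priority %d\n",
	       nr_reclaimed, priority);
	return 0;
}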

changelog v7..v6:
1. change based on KAMEZAWA's patchset. Each memcg now reclaims
SWAP_CLUSTER_MAX pages at a time and puts the memcg back to the tail of the list.
memcg-kswapd will visit memcgs in a round-robin manner and reduce their usage.

changelog v6..v5:
1. add mem_cgroup_zone_reclaimable_pages()
2. fix some comment style.

changelog v5..v4:
1. remove duplicate check on nodes_empty()
2. add logic to check if the per-memcg lru is empty on the zone.

changelog v4..v3:
1. split the select_victim_node and zone_unreclaimable into separate patches
2. remove the logic tries to do zone balancing.

changelog v3..v2:
1. change mz->all_unreclaimable to be boolean.
2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
3. some more clean-up.

changelog v2..v1:
1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
2. shared the kswapd_run/kswapd_stop for per-memcg and global background
reclaim.
3. name the per-memcg kswapd as "memcg-id" (css->id), and the global kswapd
keeps the same name.
4. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
after freeing.
5. add the fairness in zonelist where memcg remember the last zone reclaimed
from.

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    9 +++
 mm/memcontrol.c            |   18 +++++++
 mm/vmscan.c                |  118 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7444738..39eade6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+						  struct zone *zone);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru);
@@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 }
 
 static inline unsigned long
+mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+				    struct zone *zone)
+{
+	return 0;
+}
+
+static inline unsigned long
 mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
 			 enum lru_list lru)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4696fd8..41eaa62 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1105,6 +1105,24 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 	return (active > inactive);
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+						struct zone *zone)
+{
+	int nr;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
+	     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
+		      MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
+
+	return nr;
+}
+
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 63c557e..ba03a10 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -47,6 +47,8 @@
 
 #include <linux/swapops.h>
 
+#include <linux/res_counter.h>
+
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -111,6 +113,8 @@ struct scan_control {
 	 * are scanned.
 	 */
 	nodemask_t	*nodemask;
+
+	int priority;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -2620,10 +2624,124 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	finish_wait(wait_h, &wait);
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * The function is used for per-memcg LRU. It scans all the zones of the
+ * node and returns the nr_scanned and nr_reclaimed.
+ */
+static void shrink_memcg_node(pg_data_t *pgdat, int order,
+				struct scan_control *sc)
+{
+	int i;
+	unsigned long total_scanned = 0;
+	struct mem_cgroup *mem_cont = sc->mem_cgroup;
+	int priority = sc->priority;
+
+	/*
+	 * This dma->highmem order is consistent with global reclaim.
+	 * We do this because the page allocator works in the opposite
+	 * direction although memcg user pages are mostly allocated at
+	 * highmem.
+	 */
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+		unsigned long scan = 0;
+
+		scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
+		if (!scan)
+			continue;
+
+		sc->nr_scanned = 0;
+		shrink_zone(priority, zone, sc);
+		total_scanned += sc->nr_scanned;
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
+			sc->may_writepage = 1;
+		}
+	}
+
+	sc->nr_scanned = total_scanned;
+}
+
+/*
+ * Per cgroup background reclaim.
+ * TODO: Take off the order since memcg always do order 0
+ */
+static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
+{
+	int i, nid, priority, loop;
+	pg_data_t *pgdat;
+	nodemask_t do_nodes;
+	unsigned long total_scanned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.swappiness = vm_swappiness,
+		.order = order,
+		.mem_cgroup = mem_cont,
+	};
+
+	do_nodes = NODE_MASK_NONE;
+	sc.may_writepage = !laptop_mode;
+	sc.nr_reclaimed = 0;
+	total_scanned = 0;
+
+	do_nodes = node_states[N_ONLINE];
+
+	for (priority = DEF_PRIORITY;
+		(priority >= 0) && (sc.nr_to_reclaim > sc.nr_reclaimed);
+		priority--) {
+
+		sc.priority = priority;
+		/* The swap token gets in the way of swapout... */
+		if (!priority)
+			disable_swap_token();
+
+		for (loop = num_online_nodes();
+			(loop > 0) && !nodes_empty(do_nodes);
+			loop--) {
+
+			nid = mem_cgroup_select_victim_node(mem_cont,
+							&do_nodes);
+
+			pgdat = NODE_DATA(nid);
+			shrink_memcg_node(pgdat, order, &sc);
+			total_scanned += sc.nr_scanned;
+
+			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+				struct zone *zone = pgdat->node_zones + i;
+
+				if (populated_zone(zone))
+					break;
+			}
+			if (i < 0)
+				node_clear(nid, do_nodes);
+
+			if (mem_cgroup_watermark_ok(mem_cont,
+						CHARGE_WMARK_HIGH))
+				goto out;
+		}
+
+		if (total_scanned && priority < DEF_PRIORITY - 2)
+			congestion_wait(WRITE, HZ/10);
+	}
+out:
+	return sc.nr_reclaimed;
+}
+#else
 static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
 {
 	return 0;
 }
+#endif
 
 /*
  * The background pageout daemon, started as a kernel thread
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V7 8/9] Add per-memcg zone "unreclaimable"
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
                   ` (6 preceding siblings ...)
  2011-04-22  4:24 ` [PATCH V7 7/9] Per-memcg background reclaim Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:43   ` KAMEZAWA Hiroyuki
  2011-04-22  6:13   ` KOSAKI Motohiro
  2011-04-22  4:24 ` [PATCH V7 9/9] Enable per-memcg background reclaim Ying Han
  8 siblings, 2 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

After reclaiming each node per memcg, it checks mem_cgroup_watermark_ok()
and breaks the priority loop if it returns true. The per-memcg zone will
be marked as "unreclaimable" if the scanning rate is much greater than the
reclaiming rate on the per-memcg LRU. The bit is cleared when a page
charged to the memcg is freed. Kswapd breaks the priority loop if
all the zones are marked as "unreclaimable".
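
The test itself is the same ratio check the global zone_reclaimable() uses;
as a stand-alone sketch (assuming the shared ZONE_RECLAIMABLE_RATE of 6 from
this series, helper name made up for illustration):

/*
 * Sketch of the per-memcg "reclaimable" test: a zone stays reclaimable
 * for a memcg as long as the pages scanned since the counter was last
 * cleared (at uncharge time) stay below 6x the reclaimable pages on the
 * per-memcg LRU of that zone.
 */
int mz_reclaimable(unsigned long pages_scanned, unsigned long reclaimable_pages)
{
	return pages_scanned < reclaimable_pages * 6; /* ZONE_RECLAIMABLE_RATE */
}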

changelog v7..v6:
1. fix merge conflicts w/ the thread-pool patch.

changelog v6..v5:
1. make global zone_unreclaimable use the ZONE_MEMCG_RECLAIMABLE_RATE.
2. add comment on the zone_unreclaimable

changelog v5..v4:
1. reduce the frequency of updating mz->unreclaimable bit by using the existing
memcg batch in task struct.
2. add new function mem_cgroup_mz_clear_unreclaimable() for recognizing the zone.

changelog v4..v3:
1. split off from the per-memcg background reclaim patch in V3.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |   40 +++++++++++++++
 include/linux/sched.h      |    1 +
 include/linux/swap.h       |    2 +
 mm/memcontrol.c            |  118 +++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   25 +++++++++-
 5 files changed, 183 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 39eade6..29a945a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -157,6 +157,14 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, struct zone *zone);
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
+void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+					struct zone *zone);
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+					unsigned long nr_scanned);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
@@ -354,6 +362,38 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem,
+					       struct zone *zone)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
+						struct zone *zone)
+{
+	return false;
+}
+
+static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
+							struct zone *zone)
+{
+}
+
+static inline void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem,
+							struct page *page)
+{
+}
+
+static inline void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+							struct zone *zone)
+{
+}
+static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
+						struct zone *zone,
+						unsigned long nr_scanned)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 98fc7ed..3370c5a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1526,6 +1526,7 @@ struct task_struct {
 		struct mem_cgroup *memcg; /* target memcg of uncharge */
 		unsigned long nr_pages;	/* uncharged usage */
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
+		struct zone *zone; /* a zone page is last uncharged */
 	} memcg_batch;
 #endif
 };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a062f0b..b868e597 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -159,6 +159,8 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
+#define ZONE_RECLAIMABLE_RATE 6
+
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 41eaa62..9e535b2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -135,7 +135,10 @@ struct mem_cgroup_per_zone {
 	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	unsigned long		pages_scanned;	/* since last reclaim */
+	bool			all_unreclaimable;	/* All pages pinned */
 };
+
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
@@ -1162,6 +1165,103 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone *zone,
+						unsigned long nr_scanned)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->pages_scanned += nr_scanned;
+}
+
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->pages_scanned <
+				mem_cgroup_zone_reclaimable_pages(mem, zone) *
+				ZONE_RECLAIMABLE_RATE;
+	return 0;
+}
+
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return false;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->all_unreclaimable;
+
+	return false;
+}
+
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->all_unreclaimable = true;
+}
+
+void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+				       struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return;
+
+	mz = page_cgroup_zoneinfo(mem, page);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -2709,6 +2809,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 
 static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 				   unsigned int nr_pages,
+				   struct page *page,
 				   const enum charge_type ctype)
 {
 	struct memcg_batch_info *batch = NULL;
@@ -2726,6 +2827,10 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 	 */
 	if (!batch->memcg)
 		batch->memcg = mem;
+
+	if (!batch->zone)
+		batch->zone = page_zone(page);
+
 	/*
 	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
 	 * In those cases, all pages freed continously can be expected to be in
@@ -2747,12 +2852,17 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 	 */
 	if (batch->memcg != mem)
 		goto direct_uncharge;
+
+	if (batch->zone != page_zone(page))
+		mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
+
 	/* remember freed charge and uncharge it later */
 	batch->nr_pages++;
 	if (uncharge_memsw)
 		batch->memsw_nr_pages++;
 	return;
 direct_uncharge:
+	mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
 	res_counter_uncharge(&mem->res, nr_pages * PAGE_SIZE);
 	if (uncharge_memsw)
 		res_counter_uncharge(&mem->memsw, nr_pages * PAGE_SIZE);
@@ -2834,7 +2944,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 		mem_cgroup_get(mem);
 	}
 	if (!mem_cgroup_is_root(mem))
-		mem_cgroup_do_uncharge(mem, nr_pages, ctype);
+		mem_cgroup_do_uncharge(mem, nr_pages, page, ctype);
 
 	return mem;
 
@@ -2902,6 +3012,10 @@ void mem_cgroup_uncharge_end(void)
 	if (batch->memsw_nr_pages)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
+	if (batch->zone)
+		mem_cgroup_mz_clear_unreclaimable(batch->memcg, batch->zone);
+	batch->zone = NULL;
+
 	memcg_oom_recover(batch->memcg);
 	/* forget this pointer (for sanity check) */
 	batch->memcg = NULL;
@@ -4667,6 +4781,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
 	}
 	return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba03a10..87653d6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
+
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
+
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
@@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
@@ -1989,7 +1993,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 
 static bool zone_reclaimable(struct zone *zone)
 {
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+	return zone->pages_scanned < zone_reclaimable_pages(zone) *
+					ZONE_RECLAIMABLE_RATE;
 }
 
 /*
@@ -2651,10 +2656,20 @@ static void shrink_memcg_node(pg_data_t *pgdat, int order,
 		if (!scan)
 			continue;
 
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+			priority != DEF_PRIORITY)
+			continue;
+
 		sc->nr_scanned = 0;
 		shrink_zone(priority, zone, sc);
 		total_scanned += sc->nr_scanned;
 
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
+			continue;
+
+		if (!mem_cgroup_zone_reclaimable(mem_cont, zone))
+			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
+
 		/*
 		 * If we've done a decent amount of scanning and
 		 * the reclaim ratio is low, start doing writepage
@@ -2716,10 +2731,16 @@ static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
 			shrink_memcg_node(pgdat, order, &sc);
 			total_scanned += sc.nr_scanned;
 
+			/*
+			 * Set the node which has at least one reclaimable
+			 * zone
+			 */
 			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
 				struct zone *zone = pgdat->node_zones + i;
 
-				if (populated_zone(zone))
+				if (populated_zone(zone) &&
+				    !mem_cgroup_mz_unreclaimable(mem_cont,
+								zone))
 					break;
 			}
 			if (i < 0)
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V7 9/9] Enable per-memcg background reclaim.
  2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
                   ` (7 preceding siblings ...)
  2011-04-22  4:24 ` [PATCH V7 8/9] Add per-memcg zone "unreclaimable" Ying Han
@ 2011-04-22  4:24 ` Ying Han
  2011-04-22  4:44   ` KAMEZAWA Hiroyuki
  8 siblings, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

By default the per-memcg background reclaim is disabled when the limit_in_bytes
is set to the maximum. The kswapd_run() is called when the memcg is being resized,
and kswapd_stop() is called when the memcg is being deleted.

The per-memcg kswapd is woken up based on the usage and low_wmark, which is
checked once per 1024 increments per cpu. The memcg's kswapd is woken up if the
usage is larger than the low_wmark.
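
For reference, the wakeup test reduces to a comparison against the low
watermark defined in the earlier wmark patch (low_wmark = limit_in_bytes -
low_wmark_distance); a stand-alone sketch with a made-up helper name:

/*
 * Sketch of the check run from the per-cpu event counter path (once
 * every WMARK_EVENTS_TARGET = 1024 events): wake the memcg kswapd once
 * usage climbs above the low watermark.
 */
int memcg_over_low_wmark(unsigned long long usage,
			 unsigned long long limit,
			 unsigned long long low_wmark_distance)
{
	unsigned long long low_wmark = limit - low_wmark_distance;

	return usage > low_wmark;
}

With the default low_wmark_distance of 0 the low watermark equals the limit,
so usage can never exceed it and the memcg kswapd is never woken, which is
what "disabled by default" means above. With the example from the wmark patch
(500m limit, 40m distance), the wakeup threshold is 482344960 bytes. The pool
size in the diff below is int_sqrt(num_possible_cpus()) + 1, e.g. 5
memcg-kswapd threads on a 16-cpu machine.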

changelog v7..v6:
1. merge the thread-pool and add memcg_kswapd_stop(), memcg_kswapd_init() based
on thread-pool.

changelog v4..v3:
1. move kswapd_stop to mem_cgroup_destroy based on comments from KAMEZAWA
2. move kswapd_run to setup_mem_cgroup_wmark, since the actual watermarks
determines whether or not enabling per-memcg background reclaim.

changelog v3..v2:
1. some clean-ups

changelog v2..v1:
1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
2. remove checking the wmark from per-page charging. now it checks the wmark
periodically based on the event counter.

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e535b2..a98471b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -107,10 +107,12 @@ enum mem_cgroup_events_index {
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
 	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_WMARK_EVENTS_THRESH,
 	MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
 #define SOFTLIMIT_EVENTS_TARGET (1024)
+#define WMARK_EVENTS_TARGET (1024)
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -379,6 +381,9 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 
+static void wake_memcg_kswapd(struct mem_cgroup *mem);
+static void memcg_kswapd_stop(struct mem_cgroup *mem);
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
@@ -557,6 +562,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
+static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
+		wake_memcg_kswapd(mem);
+}
+
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -687,6 +698,9 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
 	case MEM_CGROUP_TARGET_SOFTLIMIT:
 		next = val + SOFTLIMIT_EVENTS_TARGET;
 		break;
+	case MEM_CGROUP_WMARK_EVENTS_THRESH:
+		next = val + WMARK_EVENTS_TARGET;
+		break;
 	default:
 		return;
 	}
@@ -710,6 +724,10 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
+		if (unlikely(__memcg_event_check(mem,
+			MEM_CGROUP_WMARK_EVENTS_THRESH))){
+			mem_cgroup_check_wmark(mem);
+		}
 	}
 }
 
@@ -3651,6 +3669,7 @@ move_account:
 		ret = -EBUSY;
 		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
 			goto out;
+		memcg_kswapd_stop(mem);
 		ret = -EINTR;
 		if (signal_pending(current))
 			goto out;
@@ -4572,6 +4591,21 @@ struct memcg_kswapd_work {
 
 struct memcg_kswapd_work memcg_kswapd_control;
 
+static void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	/* already running */
+	if (atomic_read(&mem->kswapd_running))
+		return;
+
+	spin_lock(&memcg_kswapd_control.lock);
+	if (list_empty(&mem->memcg_kswapd_wait_list))
+		list_add_tail(&mem->memcg_kswapd_wait_list,
+				&memcg_kswapd_control.list);
+	spin_unlock(&memcg_kswapd_control.lock);
+	wake_up(&memcg_kswapd_control.waitq);
+	return;
+}
+
 static void memcg_kswapd_wait_end(struct mem_cgroup *mem)
 {
 	DEFINE_WAIT(wait);
@@ -4582,6 +4616,17 @@ static void memcg_kswapd_wait_end(struct mem_cgroup *mem)
 	finish_wait(&mem->memcg_kswapd_end, &wait);
 }
 
+/* called at pre_destroy */
+static void memcg_kswapd_stop(struct mem_cgroup *mem)
+{
+	spin_lock(&memcg_kswapd_control.lock);
+	if (!list_empty(&mem->memcg_kswapd_wait_list))
+		list_del(&mem->memcg_kswapd_wait_list);
+	spin_unlock(&memcg_kswapd_control.lock);
+
+	memcg_kswapd_wait_end(mem);
+}
+
 struct mem_cgroup *mem_cgroup_get_shrink_target(void)
 {
 	struct mem_cgroup *mem;
@@ -4631,6 +4676,22 @@ wait_queue_head_t *mem_cgroup_kswapd_waitq(void)
 	return &memcg_kswapd_control.waitq;
 }
 
+static int __init memcg_kswapd_init(void)
+{
+	int i, nr_threads;
+
+	spin_lock_init(&memcg_kswapd_control.lock);
+	INIT_LIST_HEAD(&memcg_kswapd_control.list);
+	init_waitqueue_head(&memcg_kswapd_control.waitq);
+
+	nr_threads = int_sqrt(num_possible_cpus()) + 1;
+	for (i = 0; i < nr_threads; i++)
+		if (kswapd_run(0, i + 1) == -1)
+			break;
+	return 0;
+}
+module_init(memcg_kswapd_init);
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 1/9] Add kswapd descriptor
  2011-04-22  4:24 ` [PATCH V7 1/9] Add kswapd descriptor Ying Han
@ 2011-04-22  4:31   ` KAMEZAWA Hiroyuki
  2011-04-22  4:47   ` KOSAKI Motohiro
  1 sibling, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  4:31 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:24:12 -0700
Ying Han <yinghan@google.com> wrote:

> There is a kswapd kernel thread for each numa node. We will add a different
> kswapd for each memcg. The kswapd is sleeping in the wait queue headed at
> kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
> information of node and memcgs, and it allows the global and per-memcg
> background reclaim to share common reclaim algorithms.
> 
> This patch adds the kswapd descriptor and moves the per-node kswapd to use the
> new structure.
> 
> changelog v7..v6:
> 1. revert wait_queue_head change in pgdat. Keep the wait_queue_head in pgdat
> 
> changelog v6..v5:
> 1. rename kswapd_thr to kswapd_tsk
> 2. revert the api change on sleeping_prematurely since memcg doesn't support it.
> 
> changelog v5..v4:
> 1. add comment on kswapds_spinlock
> 2. remove the kswapds_spinlock. we don't need it here since the kswapd and pgdat
> have 1:1 mapping.
> 
> changelog v3..v2:
> 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later patch.
> 2. rename thr in kswapd_run to something else.
> 
> changelog v2..v1:
> 1. dynamic allocate kswapd descriptor and initialize the wait_queue_head of pgdat
> at kswapd_run.
> 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup kswapd
> descriptor.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Seems ok to me. Thank you for merging my dirty patch.

If I add a comment to this patch: it is just for sharing code between kswapd
and memory cgroup's background reclaim. By this, it's easy to compare memcg
background reclaim and kswapd, and it will be good for maintenance.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 3/9] New APIs to adjust per-memcg wmarks
  2011-04-22  4:24 ` [PATCH V7 3/9] New APIs to adjust per-memcg wmarks Ying Han
@ 2011-04-22  4:32   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  4:32 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:24:14 -0700
Ying Han <yinghan@google.com> wrote:

> Add memory.low_wmark_distance, memory.high_wmark_distance and reclaim_wmarks
> APIs per-memcg. The first two adjust the internal low/high wmark calculation
> and the reclaim_wmarks exports the current value of watermarks.
> 
> By default, the low/high_wmark is calculated by subtracting the distance from
> the hard_limit(limit_in_bytes). When configuring the low/high_wmark distance,
> user must setup the high_wmark_distance before low_wmark_distance. Also user
> must zero low_wmark_distance before high_wmark_distance.
> 
> $ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
> $ cat /dev/cgroup/A/memory.limit_in_bytes
> 524288000
> 
> $ echo 50m >/dev/cgroup/A/memory.high_wmark_distance
> $ echo 40m >/dev/cgroup/A/memory.low_wmark_distance
> 
> $ cat /dev/cgroup/A/memory.reclaim_wmarks
> low_wmark 482344960
> high_wmark 471859200
> 
> change v5..v4
> 1. add sanity check for setting high/low_wmark_distance for root cgroup.
> 
> changelog v4..v3:
> 1. replace the "wmark_ratio" API with individual tunable for low/high_wmarks.
> 
> changelog v3..v2:
> 1. replace the "min_free_kbytes" api with "wmark_ratio". This is part of
> feedbacks
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  4:24 ` [PATCH V7 4/9] Add memcg kswapd thread pool Ying Han
@ 2011-04-22  4:36   ` KAMEZAWA Hiroyuki
  2011-04-22  4:49     ` Ying Han
  2011-04-22  5:39   ` KOSAKI Motohiro
  1 sibling, 1 reply; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  4:36 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:24:15 -0700
Ying Han <yinghan@google.com> wrote:

> This patch creates a thread pool for memcg-kswapd. All memcg which needs
> background recalim are linked to a list and memcg-kswapd picks up a memcg
> from the list and run reclaim.
> 
> The concern of using per-memcg-kswapd thread is the system overhead including
> memory and cputime.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Ying Han <yinghan@google.com>

Thank you for merging. This seems ok to me.

Further development may make this better or change thread pools (to some other),
but I think this is enough good.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 5/9] Infrastructure to support per-memcg reclaim.
  2011-04-22  4:24 ` [PATCH V7 5/9] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-22  4:38   ` KAMEZAWA Hiroyuki
  2011-04-22  5:11   ` KOSAKI Motohiro
  2011-04-22  5:27   ` KOSAKI Motohiro
  2 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  4:38 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:24:16 -0700
Ying Han <yinghan@google.com> wrote:

> Add the kswapd_mem field in kswapd descriptor which links the kswapd
> kernel thread to a memcg. The per-memcg kswapd is sleeping in the wait
> queue headed at kswapd_wait field of the kswapd descriptor.
> 
> The kswapd() function is now shared between global and per-memcg kswapd. It
> is passed in with the kswapd descriptor which contains the information of
> either node or memcg. Then the new function balance_mem_cgroup_pgdat is
> invoked if it is per-mem kswapd thread, and the implementation of the function
> is on the following patch.
> 
> change v7..v6:
> 1. change the threading model of memcg from per-memcg-per-thread to thread-pool.
> this is based on the patch from KAMAZAWA.
> 
> change v6..v5:
> 1. rename is_node_kswapd to is_global_kswapd to match the scanning_global_lru.
> 2. revert the sleeping_prematurely change, but keep the kswapd_try_to_sleep()
> for memcg.
> 
> changelog v4..v3:
> 1. fix up the kswapd_run and kswapd_stop for online_pages() and offline_pages.
> 2. drop the PF_MEMALLOC flag for memcg kswapd for now per KAMAZAWA's request.
> 
> changelog v3..v2:
> 1. split off from the initial patch which includes all changes of the following
> three patches.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Ack. Seems ok to me.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 6/9] Implement the select_victim_node within memcg.
  2011-04-22  4:24 ` [PATCH V7 6/9] Implement the select_victim_node within memcg Ying Han
@ 2011-04-22  4:39   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  4:39 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:24:17 -0700
Ying Han <yinghan@google.com> wrote:

> This add the mechanism for background reclaim which we remember the
> last scanned node and always starting from the next one each time.
> The simple round-robin fasion provide the fairness between nodes for
> each memcg.
> 
> changelog v6..v5:
> 1. fix the correct comment style.
> 
> changelog v5..v4:
> 1. initialize the last_scanned_node to MAX_NUMNODES.
> 
> changelog v4..v3:
> 1. split off from the per-memcg background reclaim patch.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 7/9] Per-memcg background reclaim.
  2011-04-22  4:24 ` [PATCH V7 7/9] Per-memcg background reclaim Ying Han
@ 2011-04-22  4:40   ` KAMEZAWA Hiroyuki
  2011-04-22  6:00   ` KOSAKI Motohiro
  1 sibling, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  4:40 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:24:18 -0700
Ying Han <yinghan@google.com> wrote:

> This is the main loop of per-memcg background reclaim which is implemented in
> function balance_mem_cgroup_pgdat().
> 
> The function performs a priority loop similar to global reclaim. During each
> iteration it invokes balance_pgdat_node() for all nodes on the system, which
> is another new function performs background reclaim per node. After reclaiming
> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
> it returns true.
> 
> changelog v7..v6:
> 1. change based on KAMAZAWA's patchset. Each memcg reclaims now reclaims
> SWAP_CLUSTER_MAX of pages and putback the memcg to the tail of list.
> memcg-kswapd will visit memcgs in round-robin manner and reduce usages.
> 
> changelog v6..v5:
> 1. add mem_cgroup_zone_reclaimable_pages()
> 2. fix some comment style.
> 
> changelog v5..v4:
> 1. remove duplicate check on nodes_empty()
> 2. add logic to check if the per-memcg lru is empty on the zone.
> 
> changelog v4..v3:
> 1. split the select_victim_node and zone_unreclaimable to a seperate patches
> 2. remove the logic tries to do zone balancing.
> 
> changelog v3..v2:
> 1. change mz->all_unreclaimable to be boolean.
> 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
> 3. some more clean-up.
> 
> changelog v2..v1:
> 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> reclaim.
> 3. name the per-memcg memcg as "memcg-id" (css->id). And the global kswapd
> keeps the same name.
> 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
> after freeing.
> 5. add the fairness in zonelist where memcg remember the last zone reclaimed
> from.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

seems good.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 8/9] Add per-memcg zone "unreclaimable"
  2011-04-22  4:24 ` [PATCH V7 8/9] Add per-memcg zone "unreclaimable" Ying Han
@ 2011-04-22  4:43   ` KAMEZAWA Hiroyuki
  2011-04-22  6:13   ` KOSAKI Motohiro
  1 sibling, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  4:43 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:24:19 -0700
Ying Han <yinghan@google.com> wrote:

> After reclaiming each node per memcg, it checks mem_cgroup_watermark_ok()
> and breaks the priority loop if it returns true. The per-memcg zone will
> be marked as "unreclaimable" if the scanning rate is much greater than the
> reclaiming rate on the per-memcg LRU. The bit is cleared when there is a
> page charged to the memcg being freed. Kswapd breaks the priority loop if
> all the zones are marked as "unreclaimable".
> 
> changelog v7..v6:
> 1. fix merge conflicts w/ the thread-pool patch.
> 
> changelog v6..v5:
> 1. make global zone_unreclaimable use the ZONE_MEMCG_RECLAIMABLE_RATE.
> 2. add comment on the zone_unreclaimable
> 
> changelog v5..v4:
> 1. reduce the frequency of updating mz->unreclaimable bit by using the existing
> memcg batch in task struct.
> 2. add new function mem_cgroup_mz_clear_unreclaimable() for recoganizing zone.
> 
> changelog v4..v3:
> 1. split off from the per-memcg background reclaim patch in V3.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 9/9] Enable per-memcg background reclaim.
  2011-04-22  4:24 ` [PATCH V7 9/9] Enable per-memcg background reclaim Ying Han
@ 2011-04-22  4:44   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  4:44 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:24:20 -0700
Ying Han <yinghan@google.com> wrote:

> By default the per-memcg background reclaim is disabled when the limit_in_bytes
> is set the maximum. The kswapd_run() is called when the memcg is being resized,
> and kswapd_stop() is called when the memcg is being deleted.
> 
> The per-memcg kswapd is waked up based on the usage and low_wmark, which is
> checked once per 1024 increments per cpu. The memcg's kswapd is waked up if the
> usage is larger than the low_wmark.
> 
> changelog v7..v6:
> 1. merge the thread-pool and add memcg_kswapd_stop(), memcg_kswapd_init() based
> on thread-pool.
> 
> changelog v4..v3:
> 1. move kswapd_stop to mem_cgroup_destroy based on comments from KAMAZAWA
> 2. move kswapd_run to setup_mem_cgroup_wmark, since the actual watermarks
> determines whether or not enabling per-memcg background reclaim.
> 
> changelog v3..v2:
> 1. some clean-ups
> 
> changelog v2..v1:
> 1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
> 2. remove checking the wmark from per-page charging. now it checks the wmark
> periodically based on the event counter.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

seems ok to me. Maybe we need to revisit the suitable number of threads after
seeing real world workload.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 1/9] Add kswapd descriptor
  2011-04-22  4:24 ` [PATCH V7 1/9] Add kswapd descriptor Ying Han
  2011-04-22  4:31   ` KAMEZAWA Hiroyuki
@ 2011-04-22  4:47   ` KOSAKI Motohiro
  2011-04-22  5:55     ` Ying Han
  1 sibling, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-22  4:47 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

Hi,

This seems to have no ugly parts.


nitpick:

> -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +	const struct cpumask *cpumask;
>  
>  	lockdep_set_current_reclaim_state(GFP_KERNEL);
>  
> +	cpumask = cpumask_of_node(pgdat->node_id);

no effect change?


>  	if (!cpumask_empty(cpumask))
>  		set_cpus_allowed_ptr(tsk, cpumask);
>  	current->reclaim_state = &reclaim_state;
> @@ -2679,7 +2684,7 @@ static int kswapd(void *p)
>  			order = new_order;
>  			classzone_idx = new_classzone_idx;
>  		} else {
> -			kswapd_try_to_sleep(pgdat, order, classzone_idx);
> +			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
>  			order = pgdat->kswapd_max_order;
>  			classzone_idx = pgdat->classzone_idx;
>  			pgdat->kswapd_max_order = 0;
> @@ -2817,12 +2822,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  		for_each_node_state(nid, N_HIGH_MEMORY) {
>  			pg_data_t *pgdat = NODE_DATA(nid);
>  			const struct cpumask *mask;
> +			struct kswapd *kswapd_p;
> +			struct task_struct *kswapd_tsk;
> +			wait_queue_head_t *wait;
>  
>  			mask = cpumask_of_node(pgdat->node_id);
>  
> +			wait = &pgdat->kswapd_wait;

In kswapd_try_to_sleep(), this waitqueue is called wait_h. Can you
please keep naming consistency?


> +			kswapd_p = pgdat->kswapd;
> +			kswapd_tsk = kswapd_p->kswapd_task;
> +
>  			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>  				/* One of our CPUs online: restore mask */
> -				set_cpus_allowed_ptr(pgdat->kswapd, mask);
> +				if (kswapd_tsk)
> +					set_cpus_allowed_ptr(kswapd_tsk, mask);

Need to add comments. What does kswapd_tsk==NULL mean and when does it occur?
I apologize if that is done in a later patch.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  4:36   ` KAMEZAWA Hiroyuki
@ 2011-04-22  4:49     ` Ying Han
  2011-04-22  5:00       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-22  4:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 21 Apr 2011 21:24:15 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This patch creates a thread pool for memcg-kswapd. All memcg which needs
> > background recalim are linked to a list and memcg-kswapd picks up a memcg
> > from the list and run reclaim.
> >
> > The concern of using per-memcg-kswapd thread is the system overhead including
> > memory and cputime.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> Thank you for merging. This seems ok to me.
>
> Further development may make this better or change thread pools (to some other),
> but I think this is enough good.
>

Thank you for reviewing and acking. At the same time, I still have doubts about
the thread-pool model, which I posted in the cover letter :)

The per-memcg-per-kswapd model
Pros:
1. memory overhead per thread; the memory consumption would be 8k*1000 = 8M
with 1k cgroups.
2. we see lots of threads in 'ps -elf'

Cons:
1. the implementation is simple and straightforward.
2. we can easily isolate the background reclaim overhead between cgroups.
3. better latency from memory pressure to actually starting reclaim

The thread-pool model
Pros:
1. there is no isolation between memcg background reclaim, since the memcg
threads are shared.
2. it is hard for visibility and debuggability. I have seen cases where
some kswapds run crazy and we need a straightforward way to identify which
cgroup is causing the reclaim.
3. potential starvation for some memcgs, if one work item gets stuck and the
rest of the work won't proceed.

Cons:
1. save some memory resource.

In general, the per-memcg-per-kswapd implementation looks sane to me at this
point, especially since the shared memcg thread model will make debugging
issues very hard later.

Comments?

--Ying


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  4:49     ` Ying Han
@ 2011-04-22  5:00       ` KAMEZAWA Hiroyuki
  2011-04-22  5:53         ` Ying Han
  2011-04-22  6:02         ` Zhu Yanhai
  0 siblings, 2 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  5:00 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 21:49:04 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 21 Apr 2011 21:24:15 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > This patch creates a thread pool for memcg-kswapd. All memcg which needs
> > > background recalim are linked to a list and memcg-kswapd picks up a memcg
> > > from the list and run reclaim.
> > >
> > > The concern of using per-memcg-kswapd thread is the system overhead including
> > > memory and cputime.
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Signed-off-by: Ying Han <yinghan@google.com>
> >
> > Thank you for merging. This seems ok to me.
> >
> > Further development may make this better or change thread pools (to some other),
> > but I think this is enough good.
> >
> 
> Thank you for reviewing and acking. At the same time, I still have doubts about
> the thread-pool model, which I posted in the cover letter :)
> 
> The per-memcg-per-kswapd model
> Pros:
> 1. memory overhead per thread; the memory consumption would be 8k*1000 = 8M
> with 1k cgroups.
> 2. we see lots of threads in 'ps -elf'
> 
> Cons:
> 1. the implementation is simple and straightforward.
> 2. we can easily isolate the background reclaim overhead between cgroups.
> 3. better latency from memory pressure to actually starting reclaim
> 
> The thread-pool model
> Pros:
> 1. there is no isolation between memcg background reclaim, since the memcg
> threads are shared.
> 2. it is hard for visibility and debuggability. I have seen cases where
> some kswapds run crazy and we need a straightforward way to identify which
> cgroup is causing the reclaim.
> 3. potential starvation for some memcgs, if one work item gets stuck and the
> rest of the work won't proceed.
> 
> Cons:
> 1. save some memory resource.
> 
> In general, the per-memcg-per-kswapd implementation looks sane to me at this
> point, especially since the shared memcg thread model will make debugging
> issues very hard later.
> 
> Comments?
> 
Pros <-> Cons ?

My idea is adding trace points for memcg-kswapd and seeing what it's doing now.
(We don't have many trace points in memcg...)

I don't think it's sane to create a kthread per memcg, because we know there is a user
who creates hundreds/thousands of memcgs.

And I think that creating threads which do the same job, more than the number
of cpus, will cause much more difficult starvation and priority inversion issues.
Keeping the scheduling knobs/chances of jobs in memcg is important. I don't want to
give a hint to the scheduler because of a memcg internal issue.

And even if memcg-kswapd doesn't exist, memcg works (well?).
memcg-kswapd just helps make things better but doesn't do any critical jobs.
So it's okay to have this as a best-effort service.
Of course, a better scheduling idea for picking up memcgs is welcome. It's now
round-robin.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 5/9] Infrastructure to support per-memcg reclaim.
  2011-04-22  4:24 ` [PATCH V7 5/9] Infrastructure to support per-memcg reclaim Ying Han
  2011-04-22  4:38   ` KAMEZAWA Hiroyuki
@ 2011-04-22  5:11   ` KOSAKI Motohiro
  2011-04-22  5:59     ` Ying Han
  2011-04-22  5:27   ` KOSAKI Motohiro
  2 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-22  5:11 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

> Add the kswapd_mem field in kswapd descriptor which links the kswapd
> kernel thread to a memcg. The per-memcg kswapd is sleeping in the wait
> queue headed at kswapd_wait field of the kswapd descriptor.
> 
> The kswapd() function is now shared between global and per-memcg kswapd. It
> is passed in with the kswapd descriptor which contains the information of
> either node or memcg. Then the new function balance_mem_cgroup_pgdat is
> invoked if it is per-mem kswapd thread, and the implementation of the function
> is on the following patch.
> 
> change v7..v6:
> 1. change the threading model of memcg from per-memcg-per-thread to thread-pool.
> this is based on the patch from KAMAZAWA.
> 
> change v6..v5:
> 1. rename is_node_kswapd to is_global_kswapd to match the scanning_global_lru.
> 2. revert the sleeping_prematurely change, but keep the kswapd_try_to_sleep()
> for memcg.
> 
> changelog v4..v3:
> 1. fix up the kswapd_run and kswapd_stop for online_pages() and offline_pages.
> 2. drop the PF_MEMALLOC flag for memcg kswapd for now per KAMAZAWA's request.
> 
> changelog v3..v2:
> 1. split off from the initial patch which includes all changes of the following
> three patches.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Looks ok, but this one has some ugly coding style.

functioon()
{
	if (is_global_kswapd()) {
		looooooooong lines
		...
		..
	} else {
		another looooooong lines
		...
		..
	}
}

Please pay more attention to keeping the code simple.
However, I don't think this patch has any major issue. I expect I can ack the next version.



> ---
>  include/linux/swap.h |    2 +-
>  mm/memory_hotplug.c  |    2 +-
>  mm/vmscan.c          |  156 +++++++++++++++++++++++++++++++-------------------
>  3 files changed, 100 insertions(+), 60 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 9b91ca4..a062f0b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -303,7 +303,7 @@ static inline void scan_unevictable_unregister_node(struct node *node)
>  }
>  #endif
>  
> -extern int kswapd_run(int nid);
> +extern int kswapd_run(int nid, int id);

"id" is bad name. there is no information. please use memcg-id or so on.


>  extern void kswapd_stop(int nid);
>  
>  #ifdef CONFIG_MMU
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 321fc74..36b4eed 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -462,7 +462,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages)
>  	setup_per_zone_wmarks();
>  	calculate_zone_inactive_ratio(zone);
>  	if (onlined_pages) {
> -		kswapd_run(zone_to_nid(zone));
> +		kswapd_run(zone_to_nid(zone), 0);
>  		node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
>  	}
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7aba681..63c557e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2241,6 +2241,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
>  	return balanced_pages > (present_pages >> 2);
>  }
>  
> +#define is_global_kswapd(kswapd_p) ((kswapd_p)->kswapd_pgdat)

Please use an inline function instead.
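
Something like this (untested) keeps the same behaviour but gets type checking:

static inline bool is_global_kswapd(struct kswapd *kswapd_p)
{
	return kswapd_p->kswapd_pgdat != NULL;
}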



> +
>  /* is kswapd sleeping prematurely? */
>  static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  					int classzone_idx)
> @@ -2583,40 +2585,46 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>  
>  	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>  
> -	/* Try to sleep for a short interval */
> -	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
> -		remaining = schedule_timeout(HZ/10);
> -		finish_wait(wait_h, &wait);
> -		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> -	}
> -
> -	/*
> -	 * After a short sleep, check if it was a premature sleep. If not, then
> -	 * go fully to sleep until explicitly woken up.
> -	 */
> -	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
> -		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +	if (is_global_kswapd(kswapd_p)) {

Bad indentation. :-/
Please don't add to the coding mess.

	if (!is_global_kswapd(kswapd_p)) {
		kswapd_try_to_sleep_memcg();
		return;
	}

is simpler.
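
A rough sketch of the split-out memcg path, so the global path keeps its
original indentation (the helper name is only a suggestion and it would need
the descriptor passed in; untested):

static void kswapd_try_to_sleep_memcg(struct kswapd *kswapd_p)
{
	DEFINE_WAIT(wait);
	wait_queue_head_t *wait_h = kswapd_p->kswapd_wait;

	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
	/* For now, only check whether any memcg still has pending work. */
	if (mem_cgroup_kswapd_can_sleep())
		schedule();
	finish_wait(wait_h, &wait);
}

Then kswapd_try_to_sleep() just does the early return shown above for the
!is_global_kswapd() case.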


> +		/* Try to sleep for a short interval */
> +		if (!sleeping_prematurely(pgdat, order,
> +				remaining, classzone_idx)) {
> +			remaining = schedule_timeout(HZ/10);
> +			finish_wait(wait_h, &wait);
> +			prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> +		}
>  
>  		/*
> -		 * vmstat counters are not perfectly accurate and the estimated
> -		 * value for counters such as NR_FREE_PAGES can deviate from the
> -		 * true value by nr_online_cpus * threshold. To avoid the zone
> -		 * watermarks being breached while under pressure, we reduce the
> -		 * per-cpu vmstat threshold while kswapd is awake and restore
> -		 * them before going back to sleep.
> +		 * After a short sleep, check if it was a premature sleep.
> +		 * If not, then go fully to sleep until explicitly woken up.
>  		 */
> -		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
> -		schedule();
> -		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
> +		if (!sleeping_prematurely(pgdat, order,
> +					remaining, classzone_idx)) {
> +			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +			set_pgdat_percpu_threshold(pgdat,
> +					calculate_normal_threshold);
> +			schedule();
> +			set_pgdat_percpu_threshold(pgdat,
> +					calculate_pressure_threshold);
> +		} else {
> +			if (remaining)
> +				count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> +			else
> +				count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> +		}
>  	} else {
> -		if (remaining)
> -			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> -		else
> -			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> +		/* For now, we just check the remaining works.*/
> +		if (mem_cgroup_kswapd_can_sleep())
> +			schedule();
>  	}
>  	finish_wait(wait_h, &wait);


>  }
>  
> +static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
> +{
> +	return 0;
> +}
> +
>  /*
>   * The background pageout daemon, started as a kernel thread
>   * from the init process.
> @@ -2636,6 +2644,7 @@ int kswapd(void *p)
>  	int classzone_idx;
>  	struct kswapd *kswapd_p = (struct kswapd *)p;
>  	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +	struct mem_cgroup *mem;
>  	struct task_struct *tsk = current;
>  
>  	struct reclaim_state reclaim_state = {
> @@ -2645,9 +2654,11 @@ int kswapd(void *p)
>  
>  	lockdep_set_current_reclaim_state(GFP_KERNEL);
>  
> -	cpumask = cpumask_of_node(pgdat->node_id);
> -	if (!cpumask_empty(cpumask))
> -		set_cpus_allowed_ptr(tsk, cpumask);
> +	if (is_global_kswapd(kswapd_p)) {
> +		cpumask = cpumask_of_node(pgdat->node_id);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(tsk, cpumask);
> +	}
>  	current->reclaim_state = &reclaim_state;
>  
>  	/*
> @@ -2662,7 +2673,10 @@ int kswapd(void *p)
>  	 * us from recursively trying to free more memory as we're
>  	 * trying to free the first piece of memory in the first place).
>  	 */
> -	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> +	if (is_global_kswapd(kswapd_p))
> +		tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> +	else
> +		tsk->flags |= PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>  
>  	order = 0;
> @@ -2672,36 +2686,48 @@ int kswapd(void *p)
>  		int new_classzone_idx;
>  		int ret;
>  
> -		new_order = pgdat->kswapd_max_order;
> -		new_classzone_idx = pgdat->classzone_idx;
> -		pgdat->kswapd_max_order = 0;
> -		pgdat->classzone_idx = MAX_NR_ZONES - 1;
> -		if (order < new_order || classzone_idx > new_classzone_idx) {
> -			/*
> -			 * Don't sleep if someone wants a larger 'order'
> -			 * allocation or has tigher zone constraints
> -			 */
> -			order = new_order;
> -			classzone_idx = new_classzone_idx;
> -		} else {
> -			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
> -			order = pgdat->kswapd_max_order;
> -			classzone_idx = pgdat->classzone_idx;
> +		if (is_global_kswapd(kswapd_p)) {
> +			new_order = pgdat->kswapd_max_order;
> +			new_classzone_idx = pgdat->classzone_idx;
>  			pgdat->kswapd_max_order = 0;
>  			pgdat->classzone_idx = MAX_NR_ZONES - 1;
> -		}
> +			if (order < new_order ||
> +					classzone_idx > new_classzone_idx) {
> +				/*
> +				 * Don't sleep if someone wants a larger 'order'
> +				 * allocation or has tigher zone constraints
> +				 */
> +				order = new_order;
> +				classzone_idx = new_classzone_idx;
> +			} else {
> +				kswapd_try_to_sleep(kswapd_p, order,
> +						    classzone_idx);
> +				order = pgdat->kswapd_max_order;
> +				classzone_idx = pgdat->classzone_idx;
> +				pgdat->kswapd_max_order = 0;
> +				pgdat->classzone_idx = MAX_NR_ZONES - 1;

-ETOODEEPNEST.


> +			}
> +		} else
> +			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
>  
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())
>  			break;
>  
> +		if (ret)
> +			continue;
>  		/*
>  		 * We can speed up thawing tasks if we don't call balance_pgdat
>  		 * after returning from the refrigerator
>  		 */
> -		if (!ret) {
> +		if (is_global_kswapd(kswapd_p)) {
>  			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
>  			order = balance_pgdat(pgdat, order, &classzone_idx);
> +		} else {
> +			mem = mem_cgroup_get_shrink_target();
> +			if (mem)
> +				shrink_mem_cgroup(mem, order);
> +			mem_cgroup_put_shrink_target(mem);
>  		}



>  	}
>  	return 0;
> @@ -2845,30 +2871,44 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>   * This kswapd start function will be called by init and node-hot-add.
>   * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
>   */
> -int kswapd_run(int nid)
> +int kswapd_run(int nid, int memcgid)
>  {
> -	pg_data_t *pgdat = NODE_DATA(nid);
>  	struct task_struct *kswapd_tsk;
> +	pg_data_t *pgdat = NULL;
>  	struct kswapd *kswapd_p;
> +	static char name[TASK_COMM_LEN];
>  	int ret = 0;
>  
> -	if (pgdat->kswapd)
> -		return 0;
> +	if (!memcgid) {
> +		pgdat = NODE_DATA(nid);
> +		if (pgdat->kswapd)
> +			return ret;
> +	}
>  
>  	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
>  	if (!kswapd_p)
>  		return -ENOMEM;
>  
> -	pgdat->kswapd = kswapd_p;
> -	kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
> -	kswapd_p->kswapd_pgdat = pgdat;
> +	if (!memcgid) {
> +		pgdat->kswapd = kswapd_p;
> +		kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
> +		kswapd_p->kswapd_pgdat = pgdat;
> +		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
> +	} else {
> +		kswapd_p->kswapd_wait = mem_cgroup_kswapd_waitq();
> +		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcgid);
> +	}
>  
> -	kswapd_tsk = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);

You seem to have changed the kswapd thread name slightly ("kswapd%d" -> "kswapd_%d").



> +	kswapd_tsk = kthread_run(kswapd, kswapd_p, name);
>  	if (IS_ERR(kswapd_tsk)) {
>  		/* failure at boot is fatal */
>  		BUG_ON(system_state == SYSTEM_BOOTING);
> -		printk("Failed to start kswapd on node %d\n",nid);
> -		pgdat->kswapd = NULL;
> +		if (!memcgid) {
> +			printk(KERN_ERR "Failed to start kswapd on node %d\n",
> +								nid);
> +			pgdat->kswapd = NULL;
> +		} else
> +			printk(KERN_ERR "Failed to start kswapd on memcg\n");

Why don't you show the memcg id here?
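
e.g. (memcgid is already available here):

	printk(KERN_ERR "Failed to start kswapd for memcg %d\n", memcgid);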


>  		kfree(kswapd_p);
>  		ret = -1;
>  	} else
> @@ -2899,7 +2939,7 @@ static int __init kswapd_init(void)
>  
>  	swap_setup();
>  	for_each_node_state(nid, N_HIGH_MEMORY)
> - 		kswapd_run(nid);
> +		kswapd_run(nid, 0);
>  	hotcpu_notifier(cpu_callback, 0);
>  	return 0;
>  }
> -- 
> 1.7.3.1
> 




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 5/9] Infrastructure to support per-memcg reclaim.
  2011-04-22  4:24 ` [PATCH V7 5/9] Infrastructure to support per-memcg reclaim Ying Han
  2011-04-22  4:38   ` KAMEZAWA Hiroyuki
  2011-04-22  5:11   ` KOSAKI Motohiro
@ 2011-04-22  5:27   ` KOSAKI Motohiro
  2011-04-22  6:00     ` Ying Han
  2 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-22  5:27 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

> +static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
> +{
> +	return 0;
> +}

this one and

> @@ -2672,36 +2686,48 @@ int kswapd(void *p)
(snip)
>  		/*
>  		 * We can speed up thawing tasks if we don't call balance_pgdat
>  		 * after returning from the refrigerator
>  		 */
> -		if (!ret) {
> +		if (is_global_kswapd(kswapd_p)) {
>  			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
>  			order = balance_pgdat(pgdat, order, &classzone_idx);
> +		} else {
> +			mem = mem_cgroup_get_shrink_target();
> +			if (mem)
> +				shrink_mem_cgroup(mem, order);
> +			mem_cgroup_put_shrink_target(mem);
>  		}
>  	}

These should be placed in "[7/9] Per-memcg background reclaim", shouldn't they?





^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  4:24 ` [PATCH V7 4/9] Add memcg kswapd thread pool Ying Han
  2011-04-22  4:36   ` KAMEZAWA Hiroyuki
@ 2011-04-22  5:39   ` KOSAKI Motohiro
  2011-04-22  5:56     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-22  5:39 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

> +bool mem_cgroup_kswapd_can_sleep(void)
> +{
> +	return list_empty(&memcg_kswapd_control.list);
> +}

and, 

> @@ -2583,40 +2585,46 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>  	} else {
> +		/* For now, we just check the remaining works.*/
> +		if (mem_cgroup_kswapd_can_sleep())
> +			schedule();

This rests on a bad assumption. If very little memory is freeable and the
kswapds are contended, memcg-kswapd also has to give up and go to sleep,
just as global kswapd does.

Otherwise, we are going to see the kswapd 100% CPU consumption issue again.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  5:00       ` KAMEZAWA Hiroyuki
@ 2011-04-22  5:53         ` Ying Han
  2011-04-22  5:59           ` KAMEZAWA Hiroyuki
  2011-04-22  6:02         ` Zhu Yanhai
  1 sibling, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-22  5:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3535 bytes --]

On Thu, Apr 21, 2011 at 10:00 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 21 Apr 2011 21:49:04 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Thu, 21 Apr 2011 21:24:15 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > This patch creates a thread pool for memcg-kswapd. All memcg which
> needs
> > > > background recalim are linked to a list and memcg-kswapd picks up a
> memcg
> > > > from the list and run reclaim.
> > > >
> > > > The concern of using per-memcg-kswapd thread is the system overhead
> > > including
> > > > memory and cputime.
> > > >
> > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Signed-off-by: Ying Han <yinghan@google.com>
> > >
> > > Thank you for merging. This seems ok to me.
> > >
> > > Further development may make this better or change thread pools (to
> some
> > > other),
> > > but I think this is enough good.
> > >
> >
> > Thank you for reviewing and Acking. At the same time, I do have wondering
> on
> > the thread-pool modeling which I posted on the cover-letter :)
> >
> > The per-memcg-per-kswapd model
> > Pros:
> > 1. memory overhead per thread, and The memory consumption would be
> 8k*1000 =
> > 8M
> > with 1k cgroup.
> > 2. we see lots of threads at 'ps -elf'
> >
> > Cons:
> > 1. the implementation is simply and straigh-forward.
> > 2. we can easily isolate the background reclaim overhead between cgroups.
> > 3. better latency from memory pressure to actual start reclaiming
> >
> > The thread-pool model
> > Pros:
> > 1. there is no isolation between memcg background reclaim, since the
> memcg
> > threads
> > are shared.
> > 2. it is hard for visibility and debugability. I have been experienced a
> lot
> > when
> > some kswapds running creazy and we need a stright-forward way to identify
> > which
> > cgroup causing the reclaim.
> > 3. potential starvation for some memcgs, if one workitem stucks and the
> rest
> > of work
> > won't proceed.
> >
> > Cons:
> > 1. save some memory resource.
> >
> > In general, the per-memcg-per-kswapd implmentation looks sane to me at
> this
> > point, esepcially the sharing memcg thread model will make debugging
> issue
> > very hard later.
> >
> > Comments?
> >
> Pros <-> Cons ?
>
> My idea is adding trace point for memcg-kswapd and seeing what it's now
> doing.
> (We don't have too small trace point in memcg...)
>
> I don't think its sane to create kthread per memcg because we know there is
> a user
> who makes hundreds/thousands of memcg.
>
> And, I think that creating threads, which does the same job, more than the
> number
> of cpus will cause much more difficult starvation, priority inversion
> issue.
> Keeping scheduling knob/chances of jobs in memcg is important. I don't want
> to
> give a hint to scheduler because of memcg internal issue.
>
> And, even if memcg-kswapd doesn't exist, memcg works (well?).
> memcg-kswapd just helps making things better but not do any critical jobs.
> So, it's okay to have this as best-effort service.
> Of course, better scheduling idea for picking up memcg is welcomed. It's
> now
> round-robin.
>
Hmm. The concern I have is debuggability. Let's say I am running a
system and find memcg-3 running crazy. Is there a way to find out which
memcg it is trying to reclaim pages from? Also, how would we account the
cputime of the shared kswapd threads back to the individual memcgs if we
wanted to?

--Ying


> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 1/9] Add kswapd descriptor
  2011-04-22  4:47   ` KOSAKI Motohiro
@ 2011-04-22  5:55     ` Ying Han
  0 siblings, 0 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  5:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2475 bytes --]

On Thu, Apr 21, 2011 at 9:47 PM, KOSAKI Motohiro <
kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi,
>
> This seems to have no ugly parts.
>

Thank you for reviewing.

>
>
> nitpick:
>
> > -     const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> > +     const struct cpumask *cpumask;
> >
> >       lockdep_set_current_reclaim_state(GFP_KERNEL);
> >
> > +     cpumask = cpumask_of_node(pgdat->node_id);
>
> no effect change?
>

Yes, will change.

>
>
> >       if (!cpumask_empty(cpumask))
> >               set_cpus_allowed_ptr(tsk, cpumask);
> >       current->reclaim_state = &reclaim_state;
> > @@ -2679,7 +2684,7 @@ static int kswapd(void *p)
> >                       order = new_order;
> >                       classzone_idx = new_classzone_idx;
> >               } else {
> > -                     kswapd_try_to_sleep(pgdat, order, classzone_idx);
> > +                     kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> >                       order = pgdat->kswapd_max_order;
> >                       classzone_idx = pgdat->classzone_idx;
> >                       pgdat->kswapd_max_order = 0;
> > @@ -2817,12 +2822,20 @@ static int __devinit cpu_callback(struct
> notifier_block *nfb,
> >               for_each_node_state(nid, N_HIGH_MEMORY) {
> >                       pg_data_t *pgdat = NODE_DATA(nid);
> >                       const struct cpumask *mask;
> > +                     struct kswapd *kswapd_p;
> > +                     struct task_struct *kswapd_tsk;
> > +                     wait_queue_head_t *wait;
> >
> >                       mask = cpumask_of_node(pgdat->node_id);
> >
> > +                     wait = &pgdat->kswapd_wait;
>
> In kswapd_try_to_sleep(), this waitqueue is called wait_h. Can you
> please keep naming consistency?
>
Ok.

>
> > +                     kswapd_p = pgdat->kswapd;
> > +                     kswapd_tsk = kswapd_p->kswapd_task;
> > +
> >                       if (cpumask_any_and(cpu_online_mask, mask) <
> nr_cpu_ids)
> >                               /* One of our CPUs online: restore mask */
> > -                             set_cpus_allowed_ptr(pgdat->kswapd, mask);
> > +                             if (kswapd_tsk)
> > +                                     set_cpus_allowed_ptr(kswapd_tsk,
> mask);
>
> Need to add comments. What does kswapd_tsk==NULL mean, and when can it occur?
> I apologize if this is done in a later patch.
>

I don't think I added the comments in a later patch. Will add.

--Ying


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  5:39   ` KOSAKI Motohiro
@ 2011-04-22  5:56     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  5:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ying Han, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Fri, 22 Apr 2011 14:39:24 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > +bool mem_cgroup_kswapd_can_sleep(void)
> > +{
> > +	return list_empty(&memcg_kswapd_control.list);
> > +}
> 
> and, 
> 
> > @@ -2583,40 +2585,46 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
> >  	} else {
> > +		/* For now, we just check the remaining works.*/
> > +		if (mem_cgroup_kswapd_can_sleep())
> > +			schedule();
> 
> has bad assumption. If freeable memory is very little and kswapds are
> contended, memcg-kswap also have to give up and go into sleep as global
> kswapd.
> 
> Otherwise, We are going to see kswapd cpu 100% consumption issue again.
> 

Hmm, OK. We need to add more logic. Is it OK to have an add-on patch like this?
I'll think about something smarter and fairer....
==

Because memcg-kswapd pushes a memcg back onto the list when there is remaining
work, it may consume too much CPU when it hits a hard-to-reclaim memcg.

This patch adds a penalty to hard-to-reclaim memcgs and reduces their chance
of being scheduled again.

---
 include/linux/memcontrol.h |    2 +-
 mm/memcontrol.c            |   14 +++++++++++---
 mm/vmscan.c                |    4 ++--
 3 files changed, 14 insertions(+), 6 deletions(-)

Index: mmotm-Apr14/include/linux/memcontrol.h
===================================================================
--- mmotm-Apr14.orig/include/linux/memcontrol.h
+++ mmotm-Apr14/include/linux/memcontrol.h
@@ -96,7 +96,7 @@ extern int mem_cgroup_select_victim_node
 
 extern bool mem_cgroup_kswapd_can_sleep(void);
 extern struct mem_cgroup *mem_cgroup_get_shrink_target(void);
-extern void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
+extern void mem_cgroup_put_shrink_target(struct mem_cgroup *mem, int pages);
 extern wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
 extern int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem);
 
Index: mmotm-Apr14/mm/memcontrol.c
===================================================================
--- mmotm-Apr14.orig/mm/memcontrol.c
+++ mmotm-Apr14/mm/memcontrol.c
@@ -4739,6 +4739,10 @@ struct mem_cgroup *mem_cgroup_get_shrink
 				 	memcg_kswapd_wait_list);
 			list_del_init(&mem->memcg_kswapd_wait_list);
 		}
+		if (mem && mem->stalled) {
+			mem->stalled--; /* This memcg was cpu hog */
+			continue;
+		}
 	} while (mem && !css_tryget(&mem->css));
 	if (mem)
 		atomic_inc(&mem->kswapd_running);
@@ -4747,7 +4751,7 @@ struct mem_cgroup *mem_cgroup_get_shrink
 	return mem;
 }
 
-void mem_cgroup_put_shrink_target(struct mem_cgroup *mem)
+void mem_cgroup_put_shrink_target(struct mem_cgroup *mem, int nr_pages)
 {
 	if (!mem)
 		return;
@@ -4755,8 +4759,12 @@ void mem_cgroup_put_shrink_target(struct
 	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
 		spin_lock(&memcg_kswapd_control.lock);
 		if (list_empty(&mem->memcg_kswapd_wait_list)) {
-			list_add_tail(&mem->memcg_kswapd_wait_list,
-					&memcg_kswapd_control.list);
+			/* If memory reclaim was smooth, resched it */
+			if (nr_pages >= SWAP_CLUSTER_MAX/2)
+				list_add_tail(&mem->memcg_kswapd_wait_list,
+						&memcg_kswapd_control.list);
+			else
+				mem->stalled += 1; /* ignore this memcg for a while */
 		}
 		spin_unlock(&memcg_kswapd_control.lock);
 	}
Index: mmotm-Apr14/mm/vmscan.c
===================================================================
--- mmotm-Apr14.orig/mm/vmscan.c
+++ mmotm-Apr14/mm/vmscan.c
@@ -2892,8 +2892,8 @@ int kswapd(void *p)
 		} else {
 			mem = mem_cgroup_get_shrink_target();
 			if (mem)
-				shrink_mem_cgroup(mem, order);
-			mem_cgroup_put_shrink_target(mem);
+				ret = shrink_mem_cgroup(mem, order);
+			mem_cgroup_put_shrink_target(mem, ret);
 		}
 	}
 	return 0;
==


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 5/9] Infrastructure to support per-memcg reclaim.
  2011-04-22  5:11   ` KOSAKI Motohiro
@ 2011-04-22  5:59     ` Ying Han
  0 siblings, 0 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  5:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 15025 bytes --]

On Thu, Apr 21, 2011 at 10:11 PM, KOSAKI Motohiro <
kosaki.motohiro@jp.fujitsu.com> wrote:

> > Add the kswapd_mem field in kswapd descriptor which links the kswapd
> > kernel thread to a memcg. The per-memcg kswapd is sleeping in the wait
> > queue headed at kswapd_wait field of the kswapd descriptor.
> >
> > The kswapd() function is now shared between global and per-memcg kswapd.
> It
> > is passed in with the kswapd descriptor which contains the information of
> > either node or memcg. Then the new function balance_mem_cgroup_pgdat is
> > invoked if it is per-mem kswapd thread, and the implementation of the
> function
> > is on the following patch.
> >
> > change v7..v6:
> > 1. change the threading model of memcg from per-memcg-per-thread to
> thread-pool.
> > this is based on the patch from KAMAZAWA.
> >
> > change v6..v5:
> > 1. rename is_node_kswapd to is_global_kswapd to match the
> scanning_global_lru.
> > 2. revert the sleeping_prematurely change, but keep the
> kswapd_try_to_sleep()
> > for memcg.
> >
> > changelog v4..v3:
> > 1. fix up the kswapd_run and kswapd_stop for online_pages() and
> offline_pages.
> > 2. drop the PF_MEMALLOC flag for memcg kswapd for now per KAMAZAWA's
> request.
> >
> > changelog v3..v2:
> > 1. split off from the initial patch which includes all changes of the
> following
> > three patches.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Looks ok. but this one have some ugly coding style.
>
> functioon()
> {
>        if (is_global_kswapd()) {
>                looooooooong lines
>                ...
>                ..
>        } else {
>                another looooooong lines
>                ...
>                ..
>        }
> }
>
> please pay attention more to keep simpler code.
> However, I don't think this patch has major issue. I expect I can ack next
> version.
>
>
Thank you for reviewing.

>
> > ---
> >  include/linux/swap.h |    2 +-
> >  mm/memory_hotplug.c  |    2 +-
> >  mm/vmscan.c          |  156
> +++++++++++++++++++++++++++++++-------------------
> >  3 files changed, 100 insertions(+), 60 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 9b91ca4..a062f0b 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -303,7 +303,7 @@ static inline void
> scan_unevictable_unregister_node(struct node *node)
> >  }
> >  #endif
> >
> > -extern int kswapd_run(int nid);
> > +extern int kswapd_run(int nid, int id);
>
> "id" is bad name. there is no information. please use memcg-id or so on.
>

Will change.

>
>
> >  extern void kswapd_stop(int nid);
> >
> >  #ifdef CONFIG_MMU
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 321fc74..36b4eed 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -462,7 +462,7 @@ int online_pages(unsigned long pfn, unsigned long
> nr_pages)
> >       setup_per_zone_wmarks();
> >       calculate_zone_inactive_ratio(zone);
> >       if (onlined_pages) {
> > -             kswapd_run(zone_to_nid(zone));
> > +             kswapd_run(zone_to_nid(zone), 0);
> >               node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
> >       }
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7aba681..63c557e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2241,6 +2241,8 @@ static bool pgdat_balanced(pg_data_t *pgdat,
> unsigned long balanced_pages,
> >       return balanced_pages > (present_pages >> 2);
> >  }
> >
> > +#define is_global_kswapd(kswapd_p) ((kswapd_p)->kswapd_pgdat)
>
> please use inline function.
>

Hmm, I see. Will change in the next post.

>
>

>
> > +
> >  /* is kswapd sleeping prematurely? */
> >  static bool sleeping_prematurely(pg_data_t *pgdat, int order, long
> remaining,
> >                                       int classzone_idx)
> > @@ -2583,40 +2585,46 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >
> >       prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> >
> > -     /* Try to sleep for a short interval */
> > -     if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
> {
> > -             remaining = schedule_timeout(HZ/10);
> > -             finish_wait(wait_h, &wait);
> > -             prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> > -     }
> > -
> > -     /*
> > -      * After a short sleep, check if it was a premature sleep. If not,
> then
> > -      * go fully to sleep until explicitly woken up.
> > -      */
> > -     if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
> {
> > -             trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +     if (is_global_kswapd(kswapd_p)) {
>
> bad indentation. :-/
> please don't increase coding mess.
>
>        if (!is_global_kswapd(kswapd_p)) {
>                 kswapd_try_to_sleep_memcg();
>                return;
>        }
>
> is simpler.
>
Ok. I will check in the next post.

>
> > +             /* Try to sleep for a short interval */
> > +             if (!sleeping_prematurely(pgdat, order,
> > +                             remaining, classzone_idx)) {
> > +                     remaining = schedule_timeout(HZ/10);
> > +                     finish_wait(wait_h, &wait);
> > +                     prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> > +             }
> >
> >               /*
> > -              * vmstat counters are not perfectly accurate and the
> estimated
> > -              * value for counters such as NR_FREE_PAGES can deviate
> from the
> > -              * true value by nr_online_cpus * threshold. To avoid the
> zone
> > -              * watermarks being breached while under pressure, we
> reduce the
> > -              * per-cpu vmstat threshold while kswapd is awake and
> restore
> > -              * them before going back to sleep.
> > +              * After a short sleep, check if it was a premature sleep.
> > +              * If not, then go fully to sleep until explicitly woken
> up.
> >                */
> > -             set_pgdat_percpu_threshold(pgdat,
> calculate_normal_threshold);
> > -             schedule();
> > -             set_pgdat_percpu_threshold(pgdat,
> calculate_pressure_threshold);
> > +             if (!sleeping_prematurely(pgdat, order,
> > +                                     remaining, classzone_idx)) {
> > +                     trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +                     set_pgdat_percpu_threshold(pgdat,
> > +                                     calculate_normal_threshold);
> > +                     schedule();
> > +                     set_pgdat_percpu_threshold(pgdat,
> > +                                     calculate_pressure_threshold);
> > +             } else {
> > +                     if (remaining)
> > +
> count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > +                     else
> > +
> count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> > +             }
> >       } else {
> > -             if (remaining)
> > -                     count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > -             else
> > -                     count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> > +             /* For now, we just check the remaining works.*/
> > +             if (mem_cgroup_kswapd_can_sleep())
> > +                     schedule();
> >       }
> >       finish_wait(wait_h, &wait);
>
>
> >  }
> >
> > +static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int
> order)
> > +{
> > +     return 0;
> > +}
> > +
> >  /*
> >   * The background pageout daemon, started as a kernel thread
> >   * from the init process.
> > @@ -2636,6 +2644,7 @@ int kswapd(void *p)
> >       int classzone_idx;
> >       struct kswapd *kswapd_p = (struct kswapd *)p;
> >       pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +     struct mem_cgroup *mem;
> >       struct task_struct *tsk = current;
> >
> >       struct reclaim_state reclaim_state = {
> > @@ -2645,9 +2654,11 @@ int kswapd(void *p)
> >
> >       lockdep_set_current_reclaim_state(GFP_KERNEL);
> >
> > -     cpumask = cpumask_of_node(pgdat->node_id);
> > -     if (!cpumask_empty(cpumask))
> > -             set_cpus_allowed_ptr(tsk, cpumask);
> > +     if (is_global_kswapd(kswapd_p)) {
> > +             cpumask = cpumask_of_node(pgdat->node_id);
> > +             if (!cpumask_empty(cpumask))
> > +                     set_cpus_allowed_ptr(tsk, cpumask);
> > +     }
> >       current->reclaim_state = &reclaim_state;
> >
> >       /*
> > @@ -2662,7 +2673,10 @@ int kswapd(void *p)
> >        * us from recursively trying to free more memory as we're
> >        * trying to free the first piece of memory in the first place).
> >        */
> > -     tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > +     if (is_global_kswapd(kswapd_p))
> > +             tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > +     else
> > +             tsk->flags |= PF_SWAPWRITE | PF_KSWAPD;
> >       set_freezable();
> >
> >       order = 0;
> > @@ -2672,36 +2686,48 @@ int kswapd(void *p)
> >               int new_classzone_idx;
> >               int ret;
> >
> > -             new_order = pgdat->kswapd_max_order;
> > -             new_classzone_idx = pgdat->classzone_idx;
> > -             pgdat->kswapd_max_order = 0;
> > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > -             if (order < new_order || classzone_idx > new_classzone_idx)
> {
> > -                     /*
> > -                      * Don't sleep if someone wants a larger 'order'
> > -                      * allocation or has tigher zone constraints
> > -                      */
> > -                     order = new_order;
> > -                     classzone_idx = new_classzone_idx;
> > -             } else {
> > -                     kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> > -                     order = pgdat->kswapd_max_order;
> > -                     classzone_idx = pgdat->classzone_idx;
> > +             if (is_global_kswapd(kswapd_p)) {
> > +                     new_order = pgdat->kswapd_max_order;
> > +                     new_classzone_idx = pgdat->classzone_idx;
> >                       pgdat->kswapd_max_order = 0;
> >                       pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > -             }
> > +                     if (order < new_order ||
> > +                                     classzone_idx > new_classzone_idx)
> {
> > +                             /*
> > +                              * Don't sleep if someone wants a larger
> 'order'
> > +                              * allocation or has tigher zone
> constraints
> > +                              */
> > +                             order = new_order;
> > +                             classzone_idx = new_classzone_idx;
> > +                     } else {
> > +                             kswapd_try_to_sleep(kswapd_p, order,
> > +                                                 classzone_idx);
> > +                             order = pgdat->kswapd_max_order;
> > +                             classzone_idx = pgdat->classzone_idx;
> > +                             pgdat->kswapd_max_order = 0;
> > +                             pgdat->classzone_idx = MAX_NR_ZONES - 1;
>
> -ETOODEEPNEST.
>
>
> > +                     }
> > +             } else
> > +                     kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> >
> >               ret = try_to_freeze();
> >               if (kthread_should_stop())
> >                       break;
> >
> > +             if (ret)
> > +                     continue;
> >               /*
> >                * We can speed up thawing tasks if we don't call
> balance_pgdat
> >                * after returning from the refrigerator
> >                */
> > -             if (!ret) {
> > +             if (is_global_kswapd(kswapd_p)) {
> >                       trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >                       order = balance_pgdat(pgdat, order,
> &classzone_idx);
> > +             } else {
> > +                     mem = mem_cgroup_get_shrink_target();
> > +                     if (mem)
> > +                             shrink_mem_cgroup(mem, order);
> > +                     mem_cgroup_put_shrink_target(mem);
> >               }
>
>
>
> >       }
> >       return 0;
> > @@ -2845,30 +2871,44 @@ static int __devinit cpu_callback(struct
> notifier_block *nfb,
> >   * This kswapd start function will be called by init and node-hot-add.
> >   * On node-hot-add, kswapd will moved to proper cpus if cpus are
> hot-added.
> >   */
> > -int kswapd_run(int nid)
> > +int kswapd_run(int nid, int memcgid)
> >  {
> > -     pg_data_t *pgdat = NODE_DATA(nid);
> >       struct task_struct *kswapd_tsk;
> > +     pg_data_t *pgdat = NULL;
> >       struct kswapd *kswapd_p;
> > +     static char name[TASK_COMM_LEN];
> >       int ret = 0;
> >
> > -     if (pgdat->kswapd)
> > -             return 0;
> > +     if (!memcgid) {
> > +             pgdat = NODE_DATA(nid);
> > +             if (pgdat->kswapd)
> > +                     return ret;
> > +     }
> >
> >       kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
> >       if (!kswapd_p)
> >               return -ENOMEM;
> >
> > -     pgdat->kswapd = kswapd_p;
> > -     kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
> > -     kswapd_p->kswapd_pgdat = pgdat;
> > +     if (!memcgid) {
> > +             pgdat->kswapd = kswapd_p;
> > +             kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
> > +             kswapd_p->kswapd_pgdat = pgdat;
> > +             snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
> > +     } else {
> > +             kswapd_p->kswapd_wait = mem_cgroup_kswapd_waitq();
> > +             snprintf(name, TASK_COMM_LEN, "memcg_%d", memcgid);
> > +     }
> >
> > -     kswapd_tsk = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
>
> You seems to change kswapd name slightly.
>
>
>
> > +     kswapd_tsk = kthread_run(kswapd, kswapd_p, name);
> >       if (IS_ERR(kswapd_tsk)) {
> >               /* failure at boot is fatal */
> >               BUG_ON(system_state == SYSTEM_BOOTING);
> > -             printk("Failed to start kswapd on node %d\n",nid);
> > -             pgdat->kswapd = NULL;
> > +             if (!memcgid) {
> > +                     printk(KERN_ERR "Failed to start kswapd on node
> %d\n",
> > +                                                             nid);
> > +                     pgdat->kswapd = NULL;
> > +             } else
> > +                     printk(KERN_ERR "Failed to start kswapd on
> memcg\n");
>
> Why don't you show memcg-id here?
>

will change.

>
>
> >               kfree(kswapd_p);
> >               ret = -1;
> >       } else
> > @@ -2899,7 +2939,7 @@ static int __init kswapd_init(void)
> >
> >       swap_setup();
> >       for_each_node_state(nid, N_HIGH_MEMORY)
> > -             kswapd_run(nid);
> > +             kswapd_run(nid, 0);
> >       hotcpu_notifier(cpu_callback, 0);
> >       return 0;
> >  }
> > --
> > 1.7.3.1
> >
>
>
>
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  5:53         ` Ying Han
@ 2011-04-22  5:59           ` KAMEZAWA Hiroyuki
  2011-04-22  6:10             ` Ying Han
  0 siblings, 1 reply; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  5:59 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 22:53:19 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, Apr 21, 2011 at 10:00 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 21 Apr 2011 21:49:04 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
> > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > > On Thu, 21 Apr 2011 21:24:15 -0700
> > > > Ying Han <yinghan@google.com> wrote:
> > > >
> > > > > This patch creates a thread pool for memcg-kswapd. All memcg which
> > needs
> > > > > background recalim are linked to a list and memcg-kswapd picks up a
> > memcg
> > > > > from the list and run reclaim.
> > > > >
> > > > > The concern of using per-memcg-kswapd thread is the system overhead
> > > > including
> > > > > memory and cputime.
> > > > >
> > > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > Signed-off-by: Ying Han <yinghan@google.com>
> > > >
> > > > Thank you for merging. This seems ok to me.
> > > >
> > > > Further development may make this better or change thread pools (to
> > some
> > > > other),
> > > > but I think this is enough good.
> > > >
> > >
> > > Thank you for reviewing and Acking. At the same time, I do have wondering
> > on
> > > the thread-pool modeling which I posted on the cover-letter :)
> > >
> > > The per-memcg-per-kswapd model
> > > Pros:
> > > 1. memory overhead per thread, and The memory consumption would be
> > 8k*1000 =
> > > 8M
> > > with 1k cgroup.
> > > 2. we see lots of threads at 'ps -elf'
> > >
> > > Cons:
> > > 1. the implementation is simply and straigh-forward.
> > > 2. we can easily isolate the background reclaim overhead between cgroups.
> > > 3. better latency from memory pressure to actual start reclaiming
> > >
> > > The thread-pool model
> > > Pros:
> > > 1. there is no isolation between memcg background reclaim, since the
> > memcg
> > > threads
> > > are shared.
> > > 2. it is hard for visibility and debugability. I have been experienced a
> > lot
> > > when
> > > some kswapds running creazy and we need a stright-forward way to identify
> > > which
> > > cgroup causing the reclaim.
> > > 3. potential starvation for some memcgs, if one workitem stucks and the
> > rest
> > > of work
> > > won't proceed.
> > >
> > > Cons:
> > > 1. save some memory resource.
> > >
> > > In general, the per-memcg-per-kswapd implmentation looks sane to me at
> > this
> > > point, esepcially the sharing memcg thread model will make debugging
> > issue
> > > very hard later.
> > >
> > > Comments?
> > >
> > Pros <-> Cons ?
> >
> > My idea is adding trace point for memcg-kswapd and seeing what it's now
> > doing.
> > (We don't have too small trace point in memcg...)
> >
> > I don't think its sane to create kthread per memcg because we know there is
> > a user
> > who makes hundreds/thousands of memcg.
> >
> > And, I think that creating threads, which does the same job, more than the
> > number
> > of cpus will cause much more difficult starvation, priority inversion
> > issue.
> > Keeping scheduling knob/chances of jobs in memcg is important. I don't want
> > to
> > give a hint to scheduler because of memcg internal issue.
> >
> > And, even if memcg-kswapd doesn't exist, memcg works (well?).
> > memcg-kswapd just helps making things better but not do any critical jobs.
> > So, it's okay to have this as best-effort service.
> > Of course, better scheduling idea for picking up memcg is welcomed. It's
> > now
> > round-robin.
> >
> > Hmm. The concern I have is the debug-ability. Let's say I am running a
> system and found memcg-3 running crazy. Is there a way to find out which
> memcg it is trying to reclaim pages from? Also, how to count cputime for the
> shared memcg to the memcgs if we wanted to.
> 

Adding counters for kswapd-scan, kswapd-reclaim and kswapd-pickup will give
you that information; if necessary it would also be good to show some latency
stats. I think we can expose enough information by adding stats (or by
debugging with perf tools). I'll consider this a bit more.
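
A very rough sketch of what I mean (the field names are invented for this
sketch, untested):

	/* in struct mem_cgroup */
	atomic_long_t kswapd_pickups;	/* times picked off the shared list */
	atomic_long_t kswapd_scanned;	/* pages scanned by memcg-kswapd */
	atomic_long_t kswapd_reclaimed;	/* pages reclaimed by memcg-kswapd */

	/* in mem_cgroup_get_shrink_target(), once a memcg is picked */
	atomic_long_inc(&mem->kswapd_pickups);

	/* in mem_cgroup_put_shrink_target(), with the reclaimed page count */
	atomic_long_add(nr_pages, &mem->kswapd_reclaimed);

These could then be exported per memcg through memory.stat.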

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 7/9] Per-memcg background reclaim.
  2011-04-22  4:24 ` [PATCH V7 7/9] Per-memcg background reclaim Ying Han
  2011-04-22  4:40   ` KAMEZAWA Hiroyuki
@ 2011-04-22  6:00   ` KOSAKI Motohiro
  2011-04-22  7:54     ` Ying Han
  1 sibling, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-22  6:00 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

> This is the main loop of per-memcg background reclaim which is implemented in
> function balance_mem_cgroup_pgdat().
> 
> The function performs a priority loop similar to global reclaim. During each
> iteration it invokes balance_pgdat_node() for all nodes on the system, which
> is another new function performs background reclaim per node. After reclaiming
> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
> it returns true.
> 
> changelog v7..v6:
> 1. change based on KAMAZAWA's patchset. Each memcg reclaims now reclaims
> SWAP_CLUSTER_MAX of pages and putback the memcg to the tail of list.
> memcg-kswapd will visit memcgs in round-robin manner and reduce usages.
> 
> changelog v6..v5:
> 1. add mem_cgroup_zone_reclaimable_pages()
> 2. fix some comment style.
> 
> changelog v5..v4:
> 1. remove duplicate check on nodes_empty()
> 2. add logic to check if the per-memcg lru is empty on the zone.
> 
> changelog v4..v3:
> 1. split the select_victim_node and zone_unreclaimable to a seperate patches
> 2. remove the logic tries to do zone balancing.
> 
> changelog v3..v2:
> 1. change mz->all_unreclaimable to be boolean.
> 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
> 3. some more clean-up.
> 
> changelog v2..v1:
> 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> reclaim.
> 3. name the per-memcg memcg as "memcg-id" (css->id). And the global kswapd
> keeps the same name.
> 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
> after freeing.
> 5. add the fairness in zonelist where memcg remember the last zone reclaimed
> from.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    9 +++
>  mm/memcontrol.c            |   18 +++++++
>  mm/vmscan.c                |  118 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 145 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7444738..39eade6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>   */
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +						  struct zone *zone);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>  				       struct zone *zone,
>  				       enum lru_list lru);
> @@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>  }
>  
>  static inline unsigned long
> +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +				    struct zone *zone)
> +{
> +	return 0;
> +}
> +
> +static inline unsigned long
>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>  			 enum lru_list lru)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4696fd8..41eaa62 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1105,6 +1105,24 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>  	return (active > inactive);
>  }
>  
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +						struct zone *zone)
> +{
> +	int nr;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> +	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> +	     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> +
> +	if (nr_swap_pages > 0)
> +		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> +		      MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>  				       struct zone *zone,
>  				       enum lru_list lru)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 63c557e..ba03a10 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -47,6 +47,8 @@
>  
>  #include <linux/swapops.h>
>  
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -111,6 +113,8 @@ struct scan_control {
>  	 * are scanned.
>  	 */
>  	nodemask_t	*nodemask;
> +
> +	int priority;
>  };

Bah!
If you need sc.priority, you have to make a cleanup patch first, and
all current reclaim paths have to use sc.priority. Please don't add
unnecessary mess.
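
i.e. a preparation patch that adds the field and keeps it in sync in the
existing reclaim loops; a sketch of the idea (untested):

	/* e.g. in do_try_to_free_pages() / balance_pgdat() */
	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		sc.priority = priority;	/* or sc->priority where sc is a pointer */
		/* ... existing loop body unchanged ... */
	}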


>  
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> @@ -2620,10 +2624,124 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>  	finish_wait(wait_h, &wait);
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scanns all the zones of the
> + * node and returns the nr_scanned and nr_reclaimed.
> + */
> +static void shrink_memcg_node(pg_data_t *pgdat, int order,
> +				struct scan_control *sc)
> +{
> +	int i;
> +	unsigned long total_scanned = 0;
> +	struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +	int priority = sc->priority;

Unnecessary local variables; we can keep the stack smaller.


> +
> +	/*
> +	 * This dma->highmem order is consistant with global reclaim.
> +	 * We do this because the page allocator works in the opposite
> +	 * direction although memcg user pages are mostly allocated at
> +	 * highmem.
> +	 */
> +	for (i = 0; i < pgdat->nr_zones; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +		unsigned long scan = 0;
> +
> +		scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
> +		if (!scan)
> +			continue;
> +
> +		sc->nr_scanned = 0;
> +		shrink_zone(priority, zone, sc);
> +		total_scanned += sc->nr_scanned;
> +
> +		/*
> +		 * If we've done a decent amount of scanning and
> +		 * the reclaim ratio is low, start doing writepage
> +		 * even in laptop mode
> +		 */
> +		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> +			sc->may_writepage = 1;

Please make a helper function for may_writepage; IOW, don't cut-and-paste.
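
e.g. a small helper shared with balance_pgdat() (the name is made up):

static void update_may_writepage(struct scan_control *sc,
				 unsigned long total_scanned)
{
	/*
	 * If we've done a decent amount of scanning and the reclaim
	 * ratio is low, start doing writepage even in laptop mode.
	 */
	if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
	    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2)
		sc->may_writepage = 1;
}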


> +/*
> + * Per cgroup background reclaim.
> + * TODO: Take off the order since memcg always do order 0
> + */
> +static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
> +{
> +	int i, nid, priority, loop;
> +	pg_data_t *pgdat;
> +	nodemask_t do_nodes;
> +	unsigned long total_scanned;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> +		.swappiness = vm_swappiness,

No. memcg has a per-memcg swappiness. Please don't use the global swappiness value.
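
e.g. something like the line below, assuming the per-memcg swappiness accessor
in memcontrol.c gets exported (the helper name here is only a placeholder):

		.swappiness = mem_cgroup_swappiness(mem_cont),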


> +		.order = order,
> +		.mem_cgroup = mem_cont,
> +	};
> +
> +	do_nodes = NODE_MASK_NONE;
> +	sc.may_writepage = !laptop_mode;
> +	sc.nr_reclaimed = 0;

Move this initialization into the sc initializer. balance_pgdat() has a
loop_again label (which is why it re-initializes these), but this function doesn't.

> +	total_scanned = 0;
> +
> +	do_nodes = node_states[N_ONLINE];

Why do we need to care about memoryless nodes? Would N_HIGH_MEMORY be wrong here?

> +
> +	for (priority = DEF_PRIORITY;
> +		(priority >= 0) && (sc.nr_to_reclaim > sc.nr_reclaimed);
> +		priority--) {

Bah, bad coding style...

> +
> +		sc.priority = priority;
> +		/* The swap token gets in the way of swapout... */
> +		if (!priority)
> +			disable_swap_token();

Why?

Disabling the swap token means "please divest the swap-prevention privilege
from the owner task; instead we endure a swap storm and a performance hit".
However, I doubt memcg memory shortage is a good situation in which to cause a swap storm.


> +
> +		for (loop = num_online_nodes();
> +			(loop > 0) && !nodes_empty(do_nodes);
> +			loop--) {

Why don't you use for_each_online_node()?
Maybe for_each_node_state(nid, N_HIGH_MEMORY) is the best option?

At least find_next_bit() is more efficient than a bare loop.
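
i.e. roughly (sketch only; this drops the mem_cgroup_select_victim_node()
rotation, which you may still want for fairness):

	for_each_node_state(nid, N_HIGH_MEMORY) {
		if (!node_isset(nid, do_nodes))
			continue;

		shrink_memcg_node(NODE_DATA(nid), order, &sc);
		total_scanned += sc.nr_scanned;

		if (mem_cgroup_watermark_ok(mem_cont, CHARGE_WMARK_HIGH))
			goto out;
	}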

> +
> +			nid = mem_cgroup_select_victim_node(mem_cont,
> +							&do_nodes);
> +
> +			pgdat = NODE_DATA(nid);
> +			shrink_memcg_node(pgdat, order, &sc);
> +			total_scanned += sc.nr_scanned;
> +
> +			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +				struct zone *zone = pgdat->node_zones + i;
> +
> +				if (populated_zone(zone))
> +					break;
> +			}

The memoryless-node check is here, but we could do it earlier.

> +			if (i < 0)
> +				node_clear(nid, do_nodes);
> +
> +			if (mem_cgroup_watermark_ok(mem_cont,
> +						CHARGE_WMARK_HIGH))
> +				goto out;
> +		}
> +
> +		if (total_scanned && priority < DEF_PRIORITY - 2)
> +			congestion_wait(WRITE, HZ/10);
> +	}
> +out:
> +	return sc.nr_reclaimed;
> +}
> +#else
>  static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
>  {
>  	return 0;
>  }
> +#endif
>  
>  /*
>   * The background pageout daemon, started as a kernel thread
> -- 
> 1.7.3.1
> 




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 5/9] Infrastructure to support per-memcg reclaim.
  2011-04-22  5:27   ` KOSAKI Motohiro
@ 2011-04-22  6:00     ` Ying Han
  0 siblings, 0 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  6:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1197 bytes --]

On Thu, Apr 21, 2011 at 10:27 PM, KOSAKI Motohiro <
kosaki.motohiro@jp.fujitsu.com> wrote:

> > +static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int
> order)
> > +{
> > +     return 0;
> > +}
>
> this one and
>
> > @@ -2672,36 +2686,48 @@ int kswapd(void *p)
> (snip)
> >               /*
> >                * We can speed up thawing tasks if we don't call
> balance_pgdat
> >                * after returning from the refrigerator
> >                */
> > -             if (!ret) {
> > +             if (is_global_kswapd(kswapd_p)) {
> >                       trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >                       order = balance_pgdat(pgdat, order,
> &classzone_idx);
> > +             } else {
> > +                     mem = mem_cgroup_get_shrink_target();
> > +                     if (mem)
> > +                             shrink_mem_cgroup(mem, order);
> > +                     mem_cgroup_put_shrink_target(mem);
> >               }
> >       }
>
> this one shold be placed in "[7/9] Per-memcg background reclaim". isn't it?
>

This is the infrastructure patch, and shrink_mem_cgroup() is a no-op here. [7/9]
contains the actual implementation.

--Ying

[-- Attachment #2: Type: text/html, Size: 1715 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  5:00       ` KAMEZAWA Hiroyuki
  2011-04-22  5:53         ` Ying Han
@ 2011-04-22  6:02         ` Zhu Yanhai
  2011-04-22  6:14           ` Ying Han
  1 sibling, 1 reply; 48+ messages in thread
From: Zhu Yanhai @ 2011-04-22  6:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4335 bytes --]

Hi Kame,

2011/4/22 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> On Thu, 21 Apr 2011 21:49:04 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Thu, 21 Apr 2011 21:24:15 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > This patch creates a thread pool for memcg-kswapd. All memcg which
> needs
> > > > background recalim are linked to a list and memcg-kswapd picks up a
> memcg
> > > > from the list and run reclaim.
> > > >
> > > > The concern of using per-memcg-kswapd thread is the system overhead
> > > including
> > > > memory and cputime.
> > > >
> > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Signed-off-by: Ying Han <yinghan@google.com>
> > >
> > > Thank you for merging. This seems ok to me.
> > >
> > > Further development may make this better or change thread pools (to
> some
> > > other),
> > > but I think this is enough good.
> > >
> >
> > Thank you for reviewing and Acking. At the same time, I do have wondering
> on
> > the thread-pool modeling which I posted on the cover-letter :)
> >
> > The per-memcg-per-kswapd model
> > Pros:
> > 1. memory overhead per thread, and The memory consumption would be
> 8k*1000 =
> > 8M
> > with 1k cgroup.
> > 2. we see lots of threads at 'ps -elf'
> >
> > Cons:
> > 1. the implementation is simply and straigh-forward.
> > 2. we can easily isolate the background reclaim overhead between cgroups.
> > 3. better latency from memory pressure to actual start reclaiming
> >
> > The thread-pool model
> > Pros:
> > 1. there is no isolation between memcg background reclaim, since the
> memcg
> > threads
> > are shared.
> > 2. it is hard for visibility and debugability. I have been experienced a
> lot
> > when
> > some kswapds running creazy and we need a stright-forward way to identify
> > which
> > cgroup causing the reclaim.
> > 3. potential starvation for some memcgs, if one workitem stucks and the
> rest
> > of work
> > won't proceed.
> >
> > Cons:
> > 1. save some memory resource.
> >
> > In general, the per-memcg-per-kswapd implmentation looks sane to me at
> this
> > point, esepcially the sharing memcg thread model will make debugging
> issue
> > very hard later.
> >
> > Comments?
> >
> Pros <-> Cons ?
>
> My idea is adding trace point for memcg-kswapd and seeing what it's now
> doing.
> (We don't have too small trace point in memcg...)
>
> I don't think its sane to create kthread per memcg because we know there is
> a user
> who makes hundreds/thousands of memcg.
>

I think we need to think about the exact usage of 'thousands of cgroups' in
this case. Although not quite in detail, in Ying's previous email she did
say that they created thousands of cgroups on each box in Google's cluster
and most of them _slept_ most of the time. So I guess what they actually do
is create a large number of cgroups, each with different limits on various
resources. Then at job-dispatching time they can choose a suitable group on
each box and submit the job into it - without touching the other thousands
of sleeping groups. That is to say, although Google has a huge number of
groups on each box, they have only a few jobs on it, so it's impossible to
see too many busy groups at the same time.
If the above is correct, then I think Ying can call kthread_stop() the moment
we find there are no tasks left in the group, to kill the memcg thread (as
the group is expected to sleep for a long time after all the jobs leave). In
this way we can keep the number of memcg threads small and not lose the
debug-ability.
What do you think?

Regards,
Zhu Yanhai

>
> And, I think that creating threads, which does the same job, more than the
> number
> of cpus will cause much more difficult starvation, priority inversion
> issue.
> Keeping scheduling knob/chances of jobs in memcg is important. I don't want
> to
> give a hint to scheduler because of memcg internal issue.
>
> And, even if memcg-kswapd doesn't exist, memcg works (well?).
> memcg-kswapd just helps making things better but not do any critical jobs.
> So, it's okay to have this as best-effort service.
> Of course, better scheduling idea for picking up memcg is welcomed. It's
> now
> round-robin.
>
> Thanks,
> -Kame
>
>

[-- Attachment #2: Type: text/html, Size: 6341 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  5:59           ` KAMEZAWA Hiroyuki
@ 2011-04-22  6:10             ` Ying Han
  2011-04-22  7:46               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-22  6:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4689 bytes --]

On Thu, Apr 21, 2011 at 10:59 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 21 Apr 2011 22:53:19 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Thu, Apr 21, 2011 at 10:00 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Thu, 21 Apr 2011 21:49:04 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
> > > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > >
> > > > > On Thu, 21 Apr 2011 21:24:15 -0700
> > > > > Ying Han <yinghan@google.com> wrote:
> > > > >
> > > > > > This patch creates a thread pool for memcg-kswapd. All memcg
> which
> > > needs
> > > > > > background recalim are linked to a list and memcg-kswapd picks up
> a
> > > memcg
> > > > > > from the list and run reclaim.
> > > > > >
> > > > > > The concern of using per-memcg-kswapd thread is the system
> overhead
> > > > > including
> > > > > > memory and cputime.
> > > > > >
> > > > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com
> >
> > > > > > Signed-off-by: Ying Han <yinghan@google.com>
> > > > >
> > > > > Thank you for merging. This seems ok to me.
> > > > >
> > > > > Further development may make this better or change thread pools (to
> > > some
> > > > > other),
> > > > > but I think this is enough good.
> > > > >
> > > >
> > > > Thank you for reviewing and Acking. At the same time, I do have
> wondering
> > > on
> > > > the thread-pool modeling which I posted on the cover-letter :)
> > > >
> > > > The per-memcg-per-kswapd model
> > > > Pros:
> > > > 1. memory overhead per thread, and The memory consumption would be
> > > 8k*1000 =
> > > > 8M
> > > > with 1k cgroup.
> > > > 2. we see lots of threads at 'ps -elf'
> > > >
> > > > Cons:
> > > > 1. the implementation is simply and straigh-forward.
> > > > 2. we can easily isolate the background reclaim overhead between
> cgroups.
> > > > 3. better latency from memory pressure to actual start reclaiming
> > > >
> > > > The thread-pool model
> > > > Pros:
> > > > 1. there is no isolation between memcg background reclaim, since the
> > > memcg
> > > > threads
> > > > are shared.
> > > > 2. it is hard for visibility and debugability. I have been
> experienced a
> > > lot
> > > > when
> > > > some kswapds running creazy and we need a stright-forward way to
> identify
> > > > which
> > > > cgroup causing the reclaim.
> > > > 3. potential starvation for some memcgs, if one workitem stucks and
> the
> > > rest
> > > > of work
> > > > won't proceed.
> > > >
> > > > Cons:
> > > > 1. save some memory resource.
> > > >
> > > > In general, the per-memcg-per-kswapd implmentation looks sane to me
> at
> > > this
> > > > point, esepcially the sharing memcg thread model will make debugging
> > > issue
> > > > very hard later.
> > > >
> > > > Comments?
> > > >
> > > Pros <-> Cons ?
> > >
> > > My idea is adding trace point for memcg-kswapd and seeing what it's now
> > > doing.
> > > (We don't have too small trace point in memcg...)
> > >
> > > I don't think its sane to create kthread per memcg because we know
> there is
> > > a user
> > > who makes hundreds/thousands of memcg.
> > >
> > > And, I think that creating threads, which does the same job, more than
> the
> > > number
> > > of cpus will cause much more difficult starvation, priority inversion
> > > issue.
> > > Keeping scheduling knob/chances of jobs in memcg is important. I don't
> want
> > > to
> > > give a hint to scheduler because of memcg internal issue.
> > >
> > > And, even if memcg-kswapd doesn't exist, memcg works (well?).
> > > memcg-kswapd just helps making things better but not do any critical
> jobs.
> > > So, it's okay to have this as best-effort service.
> > > Of course, better scheduling idea for picking up memcg is welcomed.
> It's
> > > now
> > > round-robin.
> > >
> > > Hmm. The concern I have is the debug-ability. Let's say I am running a
> > system and found memcg-3 running crazy. Is there a way to find out which
> > memcg it is trying to reclaim pages from? Also, how to count cputime for
> the
> > shared memcg to the memcgs if we wanted to.
> >
>
> add a counter for kswapd-scan and kswapd-reclaim, kswapd-pickup will show
> you information, if necessary it's good to show some latecy stat. I think
> we can add enough information by adding stats (or debug by perf tools.)
> I'll consider this a a bit more.
>

Something like "kswapd_pgscan" and "kswapd_steal" per memcg? If we are going
with the thread-pool, we definitely need to add more stats to give us enough
visibility into per-memcg background reclaim activity. Still, not sure how we
would account the cpu cycles.
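A rough sketch of what such per-memcg counters could look like (names and
plumbing invented for illustration; a real patch would presumably hook into the
existing memcg statistics and show up in memory.stat):

	struct memcg_kswapd_stat {
		atomic_long_t pgscan;	/* pages scanned by memcg background reclaim */
		atomic_long_t steal;	/* pages actually reclaimed */
	};

	static inline void memcg_kswapd_account(struct memcg_kswapd_stat *stat,
						unsigned long scanned,
						unsigned long reclaimed)
	{
		atomic_long_add(scanned, &stat->pgscan);
		atomic_long_add(reclaimed, &stat->steal);
	}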

--Ying

>
> Thanks,
> -Kame
>
>
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 6822 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 8/9] Add per-memcg zone "unreclaimable"
  2011-04-22  4:24 ` [PATCH V7 8/9] Add per-memcg zone "unreclaimable" Ying Han
  2011-04-22  4:43   ` KAMEZAWA Hiroyuki
@ 2011-04-22  6:13   ` KOSAKI Motohiro
  2011-04-22  6:17     ` Ying Han
  1 sibling, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-22  6:13 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 98fc7ed..3370c5a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1526,6 +1526,7 @@ struct task_struct {
>  		struct mem_cgroup *memcg; /* target memcg of uncharge */
>  		unsigned long nr_pages;	/* uncharged usage */
>  		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
> +		struct zone *zone; /* a zone page is last uncharged */

"zone" is bad name for task_struct. :-/


>  	} memcg_batch;
>  #endif
>  };
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a062f0b..b868e597 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -159,6 +159,8 @@ enum {
>  	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
>  };
>  
> +#define ZONE_RECLAIMABLE_RATE 6
> +

Need comment?
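For example (wording only a suggestion, mirroring the "* 6" heuristic that
zone_reclaimable() used to hard-code):

	/*
	 * A zone (global or per-memcg) is considered unreclaimable once the
	 * pages scanned since the last successful reclaim exceed the number
	 * of reclaimable pages by this factor.
	 */
	#define ZONE_RECLAIMABLE_RATE 6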


>  #define SWAP_CLUSTER_MAX 32
>  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 41eaa62..9e535b2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -135,7 +135,10 @@ struct mem_cgroup_per_zone {
>  	bool			on_tree;
>  	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
>  						/* use container_of	   */
> +	unsigned long		pages_scanned;	/* since last reclaim */
> +	bool			all_unreclaimable;	/* All pages pinned */
>  };
> +
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
>  
> @@ -1162,6 +1165,103 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>  	return &mz->reclaim_stat;
>  }
>  
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone *zone,
> +						unsigned long nr_scanned)

This name sounds like a helper that gets the pages_scanned value, not one that adds to it.
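For instance, a name that reads as an accumulator rather than a getter
(suggestion only):

	void mem_cgroup_mz_add_pages_scanned(struct mem_cgroup *mem,
					     struct zone *zone,
					     unsigned long nr_scanned);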


> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->pages_scanned += nr_scanned;
> +}
> +
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return 0;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->pages_scanned <
> +				mem_cgroup_zone_reclaimable_pages(mem, zone) *
> +				ZONE_RECLAIMABLE_RATE;
> +	return 0;
> +}
> +
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return false;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->all_unreclaimable;
> +
> +	return false;
> +}
> +
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->all_unreclaimable = true;
> +}
> +
> +void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
> +				       struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz) {
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = false;
> +	}
> +
> +	return;
> +}
> +
> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +	if (!mem)
> +		return;
> +
> +	mz = page_cgroup_zoneinfo(mem, page);
> +	if (mz) {
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = false;
> +	}
> +
> +	return;
> +}
> +
>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -2709,6 +2809,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
>  
>  static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
>  				   unsigned int nr_pages,
> +				   struct page *page,
>  				   const enum charge_type ctype)
>  {
>  	struct memcg_batch_info *batch = NULL;
> @@ -2726,6 +2827,10 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
>  	 */
>  	if (!batch->memcg)
>  		batch->memcg = mem;
> +
> +	if (!batch->zone)
> +		batch->zone = page_zone(page);
> +
>  	/*
>  	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
>  	 * In those cases, all pages freed continously can be expected to be in
> @@ -2747,12 +2852,17 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
>  	 */
>  	if (batch->memcg != mem)
>  		goto direct_uncharge;
> +
> +	if (batch->zone != page_zone(page))
> +		mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
> +
>  	/* remember freed charge and uncharge it later */
>  	batch->nr_pages++;
>  	if (uncharge_memsw)
>  		batch->memsw_nr_pages++;
>  	return;
>  direct_uncharge:
> +	mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
>  	res_counter_uncharge(&mem->res, nr_pages * PAGE_SIZE);
>  	if (uncharge_memsw)
>  		res_counter_uncharge(&mem->memsw, nr_pages * PAGE_SIZE);
> @@ -2834,7 +2944,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  		mem_cgroup_get(mem);
>  	}
>  	if (!mem_cgroup_is_root(mem))
> -		mem_cgroup_do_uncharge(mem, nr_pages, ctype);
> +		mem_cgroup_do_uncharge(mem, nr_pages, page, ctype);
>  
>  	return mem;
>  
> @@ -2902,6 +3012,10 @@ void mem_cgroup_uncharge_end(void)
>  	if (batch->memsw_nr_pages)
>  		res_counter_uncharge(&batch->memcg->memsw,
>  				     batch->memsw_nr_pages * PAGE_SIZE);
> +	if (batch->zone)
> +		mem_cgroup_mz_clear_unreclaimable(batch->memcg, batch->zone);
> +	batch->zone = NULL;
> +
>  	memcg_oom_recover(batch->memcg);
>  	/* forget this pointer (for sanity check) */
>  	batch->memcg = NULL;
> @@ -4667,6 +4781,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>  		mz->usage_in_excess = 0;
>  		mz->on_tree = false;
>  		mz->mem = mem;
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = false;
>  	}
>  	return 0;
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ba03a10..87653d6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  					ISOLATE_BOTH : ISOLATE_INACTIVE,
>  			zone, sc->mem_cgroup,
>  			0, file);
> +
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
> +
>  		/*
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
> @@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
>  		 */
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>  	}
>  
>  	reclaim_stat->recent_scanned[file] += nr_taken;
> @@ -1989,7 +1993,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
>  
>  static bool zone_reclaimable(struct zone *zone)
>  {
> -	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +	return zone->pages_scanned < zone_reclaimable_pages(zone) *
> +					ZONE_RECLAIMABLE_RATE;
>  }
>  
>  /*
> @@ -2651,10 +2656,20 @@ static void shrink_memcg_node(pg_data_t *pgdat, int order,
>  		if (!scan)
>  			continue;
>  
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> +			priority != DEF_PRIORITY)
> +			continue;
> +
>  		sc->nr_scanned = 0;
>  		shrink_zone(priority, zone, sc);
>  		total_scanned += sc->nr_scanned;
>  
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
> +			continue;
> +
> +		if (!mem_cgroup_zone_reclaimable(mem_cont, zone))
> +			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
> +
>  		/*
>  		 * If we've done a decent amount of scanning and
>  		 * the reclaim ratio is low, start doing writepage
> @@ -2716,10 +2731,16 @@ static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
>  			shrink_memcg_node(pgdat, order, &sc);
>  			total_scanned += sc.nr_scanned;
>  
> +			/*
> +			 * Set the node which has at least one reclaimable
> +			 * zone
> +			 */
>  			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
>  				struct zone *zone = pgdat->node_zones + i;
>  
> -				if (populated_zone(zone))
> +				if (populated_zone(zone) &&
> +				    !mem_cgroup_mz_unreclaimable(mem_cont,
> +								zone))
>  					break;

Global reclaim calls shrink_zone() when priority == DEF_PRIORITY even if
all_unreclaimable is set. Is this an intentional change?
If so, please add some comments.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  6:02         ` Zhu Yanhai
@ 2011-04-22  6:14           ` Ying Han
  0 siblings, 0 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  6:14 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
	Dave Hansen, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4765 bytes --]

On Thu, Apr 21, 2011 at 11:02 PM, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:

> Hi Kame,
>
> 2011/4/22 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
>> On Thu, 21 Apr 2011 21:49:04 -0700
>> Ying Han <yinghan@google.com> wrote:
>>
>> > On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
>> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> > > On Thu, 21 Apr 2011 21:24:15 -0700
>> > > Ying Han <yinghan@google.com> wrote:
>> > >
>> > > > This patch creates a thread pool for memcg-kswapd. All memcg which
>> needs
>> > > > background recalim are linked to a list and memcg-kswapd picks up a
>> memcg
>> > > > from the list and run reclaim.
>> > > >
>> > > > The concern of using per-memcg-kswapd thread is the system overhead
>> > > including
>> > > > memory and cputime.
>> > > >
>> > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > > > Signed-off-by: Ying Han <yinghan@google.com>
>> > >
>> > > Thank you for merging. This seems ok to me.
>> > >
>> > > Further development may make this better or change thread pools (to
>> some
>> > > other),
>> > > but I think this is enough good.
>> > >
>> >
>> > Thank you for reviewing and Acking. At the same time, I do have
>> wondering on
>> > the thread-pool modeling which I posted on the cover-letter :)
>> >
>> > The per-memcg-per-kswapd model
>> > Pros:
>> > 1. memory overhead per thread, and The memory consumption would be
>> 8k*1000 =
>> > 8M
>> > with 1k cgroup.
>> > 2. we see lots of threads at 'ps -elf'
>> >
>> > Cons:
>> > 1. the implementation is simply and straigh-forward.
>> > 2. we can easily isolate the background reclaim overhead between
>> cgroups.
>> > 3. better latency from memory pressure to actual start reclaiming
>> >
>> > The thread-pool model
>> > Pros:
>> > 1. there is no isolation between memcg background reclaim, since the
>> memcg
>> > threads
>> > are shared.
>> > 2. it is hard for visibility and debugability. I have been experienced a
>> lot
>> > when
>> > some kswapds running creazy and we need a stright-forward way to
>> identify
>> > which
>> > cgroup causing the reclaim.
>> > 3. potential starvation for some memcgs, if one workitem stucks and the
>> rest
>> > of work
>> > won't proceed.
>> >
>> > Cons:
>> > 1. save some memory resource.
>> >
>> > In general, the per-memcg-per-kswapd implmentation looks sane to me at
>> this
>> > point, esepcially the sharing memcg thread model will make debugging
>> issue
>> > very hard later.
>> >
>> > Comments?
>> >
>> Pros <-> Cons ?
>>
>> My idea is adding trace point for memcg-kswapd and seeing what it's now
>> doing.
>> (We don't have too small trace point in memcg...)
>>
>> I don't think its sane to create kthread per memcg because we know there
>> is a user
>> who makes hundreds/thousands of memcg.
>>
>
> I think we need to think about the exact usage of  'thousands of cgroups'
> in this case. Although not quite in detail, in Ying's previous email she did
> say that they created thousands of cgroups on each box in Google's cluster
> and most of them _slept_ in most of the time. So I guess actually what they
> did is creating a larger number of cgroups, each of them has different
> limits on various resources. Then on the time of job dispatching, they can
> choose a suitable group from each box and submit the job into it - without
> touching the other thousands of sleeping groups. That's to say, though
> Google has a huge number of groups on each box, they have only few jobs on
> it, so it's impossible to see too many busy groups at the same time.
>

The number of memcg threads running at the same time is capped at the number
of cpu cores. The rest of them just idle.


> If above is correct, then I think Ying can call kthread_stop at the moment
> we find there's no tasks in the group anymore, to kill the memcg thread (as
> this group is expected to sleep for a long time after all the job leave). In
> this way we can keep the number of memcg threads small and don't lose the
> debug-ability.
> What do you think?
>

In V6, I have kswapd_stop() called from mem_cgroup_destroy().

--Ying

>
> Regards,
> Zhu Yanhai
>
>>
>> And, I think that creating threads, which does the same job, more than the
>> number
>> of cpus will cause much more difficult starvation, priority inversion
>> issue.
>> Keeping scheduling knob/chances of jobs in memcg is important. I don't
>> want to
>> give a hint to scheduler because of memcg internal issue.
>>
>> And, even if memcg-kswapd doesn't exist, memcg works (well?).
>> memcg-kswapd just helps making things better but not do any critical jobs.
>> So, it's okay to have this as best-effort service.
>> Of course, better scheduling idea for picking up memcg is welcomed. It's
>> now
>> round-robin.
>>
>> Thanks,
>> -Kame
>>
>>
>

[-- Attachment #2: Type: text/html, Size: 7137 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 8/9] Add per-memcg zone "unreclaimable"
  2011-04-22  6:13   ` KOSAKI Motohiro
@ 2011-04-22  6:17     ` Ying Han
  0 siblings, 0 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  6:17 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 11170 bytes --]

On Thu, Apr 21, 2011 at 11:13 PM, KOSAKI Motohiro <
kosaki.motohiro@jp.fujitsu.com> wrote:

> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 98fc7ed..3370c5a 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1526,6 +1526,7 @@ struct task_struct {
> >               struct mem_cgroup *memcg; /* target memcg of uncharge */
> >               unsigned long nr_pages; /* uncharged usage */
> >               unsigned long memsw_nr_pages; /* uncharged mem+swap usage
> */
> > +             struct zone *zone; /* a zone page is last uncharged */
>
> "zone" is bad name for task_struct. :-/
>

Hmm. then "zone_uncharged" ?

>
>
> >       } memcg_batch;
> >  #endif
> >  };
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index a062f0b..b868e597 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -159,6 +159,8 @@ enum {
> >       SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
> >  };
> >
> > +#define ZONE_RECLAIMABLE_RATE 6
> > +
>
> Need comment?
>

ok.

>
>
> >  #define SWAP_CLUSTER_MAX 32
> >  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 41eaa62..9e535b2 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -135,7 +135,10 @@ struct mem_cgroup_per_zone {
> >       bool                    on_tree;
> >       struct mem_cgroup       *mem;           /* Back pointer, we cannot
> */
> >                                               /* use container_of
>  */
> > +     unsigned long           pages_scanned;  /* since last reclaim */
> > +     bool                    all_unreclaimable;      /* All pages pinned
> */
> >  };
> > +
> >  /* Macro for accessing counter */
> >  #define MEM_CGROUP_ZSTAT(mz, idx)    ((mz)->count[(idx)])
> >
> > @@ -1162,6 +1165,103 @@ mem_cgroup_get_reclaim_stat_from_page(struct page
> *page)
> >       return &mz->reclaim_stat;
> >  }
> >
> > +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone
> *zone,
> > +                                             unsigned long nr_scanned)
>
> this names sound like pages_scanned value getting helper function.
>



> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +
> > +     if (!mem)
> > +             return;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz)
> > +             mz->pages_scanned += nr_scanned;
> > +}
> > +
> > +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, struct zone
> *zone)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +
> > +     if (!mem)
> > +             return 0;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz)
> > +             return mz->pages_scanned <
> > +                             mem_cgroup_zone_reclaimable_pages(mem,
> zone) *
> > +                             ZONE_RECLAIMABLE_RATE;
> > +     return 0;
> > +}
> > +
> > +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone
> *zone)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +
> > +     if (!mem)
> > +             return false;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz)
> > +             return mz->all_unreclaimable;
> > +
> > +     return false;
> > +}
> > +
> > +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone
> *zone)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +
> > +     if (!mem)
> > +             return;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz)
> > +             mz->all_unreclaimable = true;
> > +}
> > +
> > +void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
> > +                                    struct zone *zone)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +
> > +     if (!mem)
> > +             return;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz) {
> > +             mz->pages_scanned = 0;
> > +             mz->all_unreclaimable = false;
> > +     }
> > +
> > +     return;
> > +}
> > +
> > +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page
> *page)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +
> > +     if (!mem)
> > +             return;
> > +
> > +     mz = page_cgroup_zoneinfo(mem, page);
> > +     if (mz) {
> > +             mz->pages_scanned = 0;
> > +             mz->all_unreclaimable = false;
> > +     }
> > +
> > +     return;
> > +}
> > +
> >  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> >                                       struct list_head *dst,
> >                                       unsigned long *scanned, int order,
> > @@ -2709,6 +2809,7 @@ void mem_cgroup_cancel_charge_swapin(struct
> mem_cgroup *mem)
> >
> >  static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
> >                                  unsigned int nr_pages,
> > +                                struct page *page,
> >                                  const enum charge_type ctype)
> >  {
> >       struct memcg_batch_info *batch = NULL;
> > @@ -2726,6 +2827,10 @@ static void mem_cgroup_do_uncharge(struct
> mem_cgroup *mem,
> >        */
> >       if (!batch->memcg)
> >               batch->memcg = mem;
> > +
> > +     if (!batch->zone)
> > +             batch->zone = page_zone(page);
> > +
> >       /*
> >        * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> >        * In those cases, all pages freed continously can be expected to
> be in
> > @@ -2747,12 +2852,17 @@ static void mem_cgroup_do_uncharge(struct
> mem_cgroup *mem,
> >        */
> >       if (batch->memcg != mem)
> >               goto direct_uncharge;
> > +
> > +     if (batch->zone != page_zone(page))
> > +             mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
> > +
> >       /* remember freed charge and uncharge it later */
> >       batch->nr_pages++;
> >       if (uncharge_memsw)
> >               batch->memsw_nr_pages++;
> >       return;
> >  direct_uncharge:
> > +     mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
> >       res_counter_uncharge(&mem->res, nr_pages * PAGE_SIZE);
> >       if (uncharge_memsw)
> >               res_counter_uncharge(&mem->memsw, nr_pages * PAGE_SIZE);
> > @@ -2834,7 +2944,7 @@ __mem_cgroup_uncharge_common(struct page *page,
> enum charge_type ctype)
> >               mem_cgroup_get(mem);
> >       }
> >       if (!mem_cgroup_is_root(mem))
> > -             mem_cgroup_do_uncharge(mem, nr_pages, ctype);
> > +             mem_cgroup_do_uncharge(mem, nr_pages, page, ctype);
> >
> >       return mem;
> >
> > @@ -2902,6 +3012,10 @@ void mem_cgroup_uncharge_end(void)
> >       if (batch->memsw_nr_pages)
> >               res_counter_uncharge(&batch->memcg->memsw,
> >                                    batch->memsw_nr_pages * PAGE_SIZE);
> > +     if (batch->zone)
> > +             mem_cgroup_mz_clear_unreclaimable(batch->memcg,
> batch->zone);
> > +     batch->zone = NULL;
> > +
> >       memcg_oom_recover(batch->memcg);
> >       /* forget this pointer (for sanity check) */
> >       batch->memcg = NULL;
> > @@ -4667,6 +4781,8 @@ static int alloc_mem_cgroup_per_zone_info(struct
> mem_cgroup *mem, int node)
> >               mz->usage_in_excess = 0;
> >               mz->on_tree = false;
> >               mz->mem = mem;
> > +             mz->pages_scanned = 0;
> > +             mz->all_unreclaimable = false;
> >       }
> >       return 0;
> >  }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ba03a10..87653d6 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
> >                                       ISOLATE_BOTH : ISOLATE_INACTIVE,
> >                       zone, sc->mem_cgroup,
> >                       0, file);
> > +
> > +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone,
> nr_scanned);
> > +
> >               /*
> >                * mem_cgroup_isolate_pages() keeps track of
> >                * scanned pages on its own.
> > @@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long
> nr_pages, struct zone *zone,
> >                * mem_cgroup_isolate_pages() keeps track of
> >                * scanned pages on its own.
> >                */
> > +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone,
> pgscanned);
> >       }
> >
> >       reclaim_stat->recent_scanned[file] += nr_taken;
> > @@ -1989,7 +1993,8 @@ static void shrink_zones(int priority, struct
> zonelist *zonelist,
> >
> >  static bool zone_reclaimable(struct zone *zone)
> >  {
> > -     return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> > +     return zone->pages_scanned < zone_reclaimable_pages(zone) *
> > +                                     ZONE_RECLAIMABLE_RATE;
> >  }
> >
> >  /*
> > @@ -2651,10 +2656,20 @@ static void shrink_memcg_node(pg_data_t *pgdat,
> int order,
> >               if (!scan)
> >                       continue;
> >
> > +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> > +                     priority != DEF_PRIORITY)
> > +                     continue;
> > +
> >               sc->nr_scanned = 0;
> >               shrink_zone(priority, zone, sc);
> >               total_scanned += sc->nr_scanned;
> >
> > +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
> > +                     continue;
> > +
> > +             if (!mem_cgroup_zone_reclaimable(mem_cont, zone))
> > +                     mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
> > +
> >               /*
> >                * If we've done a decent amount of scanning and
> >                * the reclaim ratio is low, start doing writepage
> > @@ -2716,10 +2731,16 @@ static unsigned long shrink_mem_cgroup(struct
> mem_cgroup *mem_cont, int order)
> >                       shrink_memcg_node(pgdat, order, &sc);
> >                       total_scanned += sc.nr_scanned;
> >
> > +                     /*
> > +                      * Set the node which has at least one reclaimable
> > +                      * zone
> > +                      */
> >                       for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> >                               struct zone *zone = pgdat->node_zones + i;
> >
> > -                             if (populated_zone(zone))
> > +                             if (populated_zone(zone) &&
> > +                                 !mem_cgroup_mz_unreclaimable(mem_cont,
> > +                                                             zone))
> >                                       break;
>
> global reclaim call shrink_zone() when priority==DEF_PRIORITY even if
> all_unreclaimable is set. Is this intentional change?
> If so, please add some comments.
>
> Ok.

--Ying

[-- Attachment #2: Type: text/html, Size: 13863 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  6:10             ` Ying Han
@ 2011-04-22  7:46               ` KAMEZAWA Hiroyuki
  2011-04-22  7:59                 ` Ying Han
  0 siblings, 1 reply; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  7:46 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 23:10:58 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, Apr 21, 2011 at 10:59 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 21 Apr 2011 22:53:19 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > On Thu, Apr 21, 2011 at 10:00 PM, KAMEZAWA Hiroyuki <
> > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > > On Thu, 21 Apr 2011 21:49:04 -0700
> > > > Ying Han <yinghan@google.com> wrote:
> > > >
> > > > > On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
> > > > > kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > add a counter for kswapd-scan and kswapd-reclaim, kswapd-pickup will show
> > you information, if necessary it's good to show some latecy stat. I think
> > we can add enough information by adding stats (or debug by perf tools.)
> > I'll consider this a a bit more.
> >
> 
> Something like "kswapd_pgscan" and "kswapd_steal" per memcg? If we are going
> to the thread-pool, we definitely need to add more stats to give us enough
> visibility of per-memcg background reclaim activity. Still, not sure about
> the cpu-cycles.
> 

BTW, Kosaki requested that I not have a private thread pool implementation and
use a workqueue instead. I think he is right. So I'd like to write a patch to
enhance the workqueue infrastructure for memcg's use (of course, I'll make a
private workqueue).
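Roughly, the sort of thing meant here is the following sketch (names are
placeholders, not the actual patch; the WQ_UNBOUND/WQ_MEM_RECLAIM flags, the
single work item, and memcg_kswapd_workfn() are assumptions for illustration):

static struct workqueue_struct *memcg_kswapd_wq;
static struct work_struct memcg_kswapd_work;	/* placeholder: one item for now */

static void memcg_kswapd_workfn(struct work_struct *work);	/* hypothetical work function */

static int __init memcg_kswapd_wq_init(void)
{
	/* private, unbound workqueue instead of schedule_work() on the system one */
	memcg_kswapd_wq = alloc_workqueue("memcg_kswapd",
					  WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	if (!memcg_kswapd_wq)
		return -ENOMEM;
	INIT_WORK(&memcg_kswapd_work, memcg_kswapd_workfn);
	return 0;
}

/* called when a memcg crosses its low watermark */
static void memcg_kswapd_wakeup(void)
{
	queue_work(memcg_kswapd_wq, &memcg_kswapd_work);
}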


==
2. regarding to the alternative workqueue, which is more complicated and we
need to be very careful of work items in the workqueue. We've experienced in
one workitem stucks and the rest of the work item won't proceed. For example
in dirty page writeback, one heavily writer cgroup could starve the other
cgroups from flushing dirty pages to the same disk. In the kswapd case, I can
imagine we might have similar senario. How to prioritize the workitems is
another problem. The order of adding the workitems in the queue reflects the
order of cgroups being reclaimed. We don't have that restriction currently but
relying on the cpu scheduler to put kswapd on the right cpu-core to run. We
"might" introduce priority later for reclaim and how are we gonna deal with
that.
==

From this, I feel I need to use an unbound workqueue. BTW, with the patches for
the current thread pool model, I don't think the starvation problem caused by
dirty pages can be seen.
Anyway, I'll give it a try.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 7/9] Per-memcg background reclaim.
  2011-04-22  6:00   ` KOSAKI Motohiro
@ 2011-04-22  7:54     ` Ying Han
  2011-04-22  8:44       ` KOSAKI Motohiro
  0 siblings, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-22  7:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 11239 bytes --]

On Thu, Apr 21, 2011 at 11:00 PM, KOSAKI Motohiro <
kosaki.motohiro@jp.fujitsu.com> wrote:

> > This is the main loop of per-memcg background reclaim which is
> implemented in
> > function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During
> each
> > iteration it invokes balance_pgdat_node() for all nodes on the system,
> which
> > is another new function performs background reclaim per node. After
> reclaiming
> > each node, it checks mem_cgroup_watermark_ok() and breaks the priority
> loop if
> > it returns true.
> >
> > changelog v7..v6:
> > 1. change based on KAMAZAWA's patchset. Each memcg reclaims now reclaims
> > SWAP_CLUSTER_MAX of pages and putback the memcg to the tail of list.
> > memcg-kswapd will visit memcgs in round-robin manner and reduce usages.
> >
> > changelog v6..v5:
> > 1. add mem_cgroup_zone_reclaimable_pages()
> > 2. fix some comment style.
> >
> > changelog v5..v4:
> > 1. remove duplicate check on nodes_empty()
> > 2. add logic to check if the per-memcg lru is empty on the zone.
> >
> > changelog v4..v3:
> > 1. split the select_victim_node and zone_unreclaimable to a seperate
> patches
> > 2. remove the logic tries to do zone balancing.
> >
> > changelog v3..v2:
> > 1. change mz->all_unreclaimable to be boolean.
> > 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg
> reclaim.
> > 3. some more clean-up.
> >
> > changelog v2..v1:
> > 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> > 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> > reclaim.
> > 3. name the per-memcg memcg as "memcg-id" (css->id). And the global
> kswapd
> > keeps the same name.
> > 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be
> accessed
> > after freeing.
> > 5. add the fairness in zonelist where memcg remember the last zone
> reclaimed
> > from.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |    9 +++
> >  mm/memcontrol.c            |   18 +++++++
> >  mm/vmscan.c                |  118
> ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 145 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 7444738..39eade6 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct
> mem_cgroup *mem,
> >   */
> >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> >  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup
> *memcg,
> > +                                               struct zone *zone);
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                      struct zone *zone,
> >                                      enum lru_list lru);
> > @@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup
> *memcg)
> >  }
> >
> >  static inline unsigned long
> > +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> > +                                 struct zone *zone)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline unsigned long
> >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> >                        enum lru_list lru)
> >  {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 4696fd8..41eaa62 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1105,6 +1105,24 @@ int mem_cgroup_inactive_file_is_low(struct
> mem_cgroup *memcg)
> >       return (active > inactive);
> >  }
> >
> > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup
> *memcg,
> > +                                             struct zone *zone)
> > +{
> > +     int nr;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +     struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid,
> zid);
> > +
> > +     nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> > +          MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> > +
> > +     if (nr_swap_pages > 0)
> > +             nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> > +                   MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> > +
> > +     return nr;
> > +}
> > +
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                      struct zone *zone,
> >                                      enum lru_list lru)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 63c557e..ba03a10 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -47,6 +47,8 @@
> >
> >  #include <linux/swapops.h>
> >
> > +#include <linux/res_counter.h>
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -111,6 +113,8 @@ struct scan_control {
> >        * are scanned.
> >        */
> >       nodemask_t      *nodemask;
> > +
> > +     int priority;
> >  };
>
> Bah!
> If you need sc.priority, you have to make cleanup patch at first. and
> all current reclaim path have to use sc.priority. Please don't increase
> unnecessary mess.
>
> hmm. so then I would change it by passing the priority
> as separate parameter.
>


>
> >
> >  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> > @@ -2620,10 +2624,124 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >       finish_wait(wait_h, &wait);
> >  }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * The function is used for per-memcg LRU. It scanns all the zones of
> the
> > + * node and returns the nr_scanned and nr_reclaimed.
> > + */
> > +static void shrink_memcg_node(pg_data_t *pgdat, int order,
> > +                             struct scan_control *sc)
> > +{
> > +     int i;
> > +     unsigned long total_scanned = 0;
> > +     struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > +     int priority = sc->priority;
>
> unnecessary local variables. we can keep smaller stack.
>
> ok.

>
> > +
> > +     /*
> > +      * This dma->highmem order is consistant with global reclaim.
> > +      * We do this because the page allocator works in the opposite
> > +      * direction although memcg user pages are mostly allocated at
> > +      * highmem.
> > +      */
> > +     for (i = 0; i < pgdat->nr_zones; i++) {
> > +             struct zone *zone = pgdat->node_zones + i;
> > +             unsigned long scan = 0;
> > +
> > +             scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
> > +             if (!scan)
> > +                     continue;
> > +
> > +             sc->nr_scanned = 0;
> > +             shrink_zone(priority, zone, sc);
> > +             total_scanned += sc->nr_scanned;
> > +
> > +             /*
> > +              * If we've done a decent amount of scanning and
> > +              * the reclaim ratio is low, start doing writepage
> > +              * even in laptop mode
> > +              */
> > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> 2) {
> > +                     sc->may_writepage = 1;
>
> please make helper function for may_writepage. iow, don't cut-n-paste.
>
> hmm, can you help to clarify that?
>
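For reference, one way to read that suggestion is a small shared helper along
these lines (illustrative only; the name is made up):

/*
 * Once a decent amount of scanning has been done and the reclaim ratio is
 * low, start doing writepage even in laptop mode.
 */
static inline bool should_reclaim_writepage(unsigned long total_scanned,
					    unsigned long nr_reclaimed)
{
	return total_scanned > SWAP_CLUSTER_MAX * 2 &&
	       total_scanned > nr_reclaimed + nr_reclaimed / 2;
}

Both balance_pgdat() and shrink_memcg_node() could then set sc->may_writepage
from it instead of open-coding the test.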


> > +/*
> > + * Per cgroup background reclaim.
> > + * TODO: Take off the order since memcg always do order 0
> > + */
> > +static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int
> order)
> > +{
> > +     int i, nid, priority, loop;
> > +     pg_data_t *pgdat;
> > +     nodemask_t do_nodes;
> > +     unsigned long total_scanned;
> > +     struct scan_control sc = {
> > +             .gfp_mask = GFP_KERNEL,
> > +             .may_unmap = 1,
> > +             .may_swap = 1,
> > +             .nr_to_reclaim = SWAP_CLUSTER_MAX,
> > +             .swappiness = vm_swappiness,
>
> No. memcg has per-memcg swappiness. Please don't use global swappiness
> value.
>
> sounds reasonable, i will take a look at it.

>
> > +             .order = order,
> > +             .mem_cgroup = mem_cont,
> > +     };
> > +
> > +     do_nodes = NODE_MASK_NONE;
> > +     sc.may_writepage = !laptop_mode;
> > +     sc.nr_reclaimed = 0;
>
> this initialization move into sc static initializer. balance pgdat has
> loop_again label and this doesn't.
>

will change.


>
> > +     total_scanned = 0;
> > +
> > +     do_nodes = node_states[N_ONLINE];
>
> Why do we need care memoryless node? N_HIGH_MEMORY is wrong?
>
hmm, let me look into that.

>
> > +
> > +     for (priority = DEF_PRIORITY;
> > +             (priority >= 0) && (sc.nr_to_reclaim > sc.nr_reclaimed);
> > +             priority--) {
>
> bah. bad coding style...
>

ok. will change.

>
> > +
> > +             sc.priority = priority;
> > +             /* The swap token gets in the way of swapout... */
> > +             if (!priority)
> > +                     disable_swap_token();
>
> Why?
>
> disable swap token mean "Please devest swap preventation privilege from
> owner task. Instead we endure swap storm and performance hit".
> However I doublt memcg memory shortage is good situation to make swap
> storm.
>

I am not sure about that either way. We can probably leave it as is and make the
corresponding change if a real problem is observed.

>
>
> > +
> > +             for (loop = num_online_nodes();
> > +                     (loop > 0) && !nodes_empty(do_nodes);
> > +                     loop--) {
>
> Why don't you use for_each_online_node()?
> Maybe for_each_node_state(n, N_HIGH_MEMORY) is best option?
>
> At least, find_next_bit() is efficient than bare loop?
>



> > +
> > +                     nid = mem_cgroup_select_victim_node(mem_cont,
> > +                                                     &do_nodes);
> > +
> > +                     pgdat = NODE_DATA(nid);
> > +                     shrink_memcg_node(pgdat, order, &sc);
> > +                     total_scanned += sc.nr_scanned;
> > +
> > +                     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +                             struct zone *zone = pgdat->node_zones + i;
> > +
> > +                             if (populated_zone(zone))
> > +                                     break;
> > +                     }
>
> memory less node check is here. but we can check it before.
>

Not sure I understand this; can you help clarify?

Thank you for reviewing.

--Ying

>
> > +                     if (i < 0)
> > +                             node_clear(nid, do_nodes);
> > +
> > +                     if (mem_cgroup_watermark_ok(mem_cont,
> > +                                             CHARGE_WMARK_HIGH))
> > +                             goto out;
> > +             }
> > +
> > +             if (total_scanned && priority < DEF_PRIORITY - 2)
> > +                     congestion_wait(WRITE, HZ/10);
> > +     }
> > +out:
> > +     return sc.nr_reclaimed;
> > +}
> > +#else
> >  static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont, int
> order)
> >  {
> >       return 0;
> >  }
> > +#endif
> >
> >  /*
> >   * The background pageout daemon, started as a kernel thread
> > --
> > 1.7.3.1
> >
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 15156 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  7:46               ` KAMEZAWA Hiroyuki
@ 2011-04-22  7:59                 ` Ying Han
  2011-04-22  8:02                   ` KAMEZAWA Hiroyuki
  2011-04-24 23:26                   ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 48+ messages in thread
From: Ying Han @ 2011-04-22  7:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2780 bytes --]

On Fri, Apr 22, 2011 at 12:46 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 21 Apr 2011 23:10:58 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Thu, Apr 21, 2011 at 10:59 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Thu, 21 Apr 2011 22:53:19 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > On Thu, Apr 21, 2011 at 10:00 PM, KAMEZAWA Hiroyuki <
> > > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > >
> > > > > On Thu, 21 Apr 2011 21:49:04 -0700
> > > > > Ying Han <yinghan@google.com> wrote:
> > > > >
> > > > > > On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
> > > > > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > > add a counter for kswapd-scan and kswapd-reclaim, kswapd-pickup will
> show
> > > you information, if necessary it's good to show some latecy stat. I
> think
> > > we can add enough information by adding stats (or debug by perf tools.)
> > > I'll consider this a a bit more.
> > >
> >
> > Something like "kswapd_pgscan" and "kswapd_steal" per memcg? If we are
> going
> > to the thread-pool, we definitely need to add more stats to give us
> enough
> > visibility of per-memcg background reclaim activity. Still, not sure
> about
> > the cpu-cycles.
> >
>
> BTW, Kosaki requeted me not to have private thread pool implementation and
> use workqueue. I think he is right. So, I'd like to write a patch to
> enhance
> workqueue for using it for memcg (Of couse, I'll make a private workqueue.)
>
> Hmm. Can you give a bit more details of the logic behind? and what's about
the private workqueue? Also, how
we plan to solve the better debug-ability issue.


>
> ==
> 2. regarding to the alternative workqueue, which is more complicated and we
> need to be very careful of work items in the workqueue. We've experienced
> in
> one workitem stucks and the rest of the work item won't proceed. For
> example
> in dirty page writeback, one heavily writer cgroup could starve the other
> cgroups from flushing dirty pages to the same disk. In the kswapd case, I
> can
> imagine we might have similar senario. How to prioritize the workitems is
> another problem. The order of adding the workitems in the queue reflects
> the
> order of cgroups being reclaimed. We don't have that restriction currently
> but
> relying on the cpu scheduler to put kswapd on the right cpu-core to run. We
> "might" introduce priority later for reclaim and how are we gonna deal with
> that.
> ==
>
> From this, I feel I need to use unbound workqueue. BTW, with patches for
> current thread pool model, I think starvation problem by dirty pages
> cannot be seen.
> Anyway, I'll give a try.
>

Then do you suggest that I wait for your patch before my next post?

--Ying

>
> Thanks,
> -Kame


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  7:59                 ` Ying Han
@ 2011-04-22  8:02                   ` KAMEZAWA Hiroyuki
  2011-04-24 23:26                   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-22  8:02 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Fri, 22 Apr 2011 00:59:26 -0700
Ying Han <yinghan@google.com> wrote:

> On Fri, Apr 22, 2011 at 12:46 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 21 Apr 2011 23:10:58 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > On Thu, Apr 21, 2011 at 10:59 PM, KAMEZAWA Hiroyuki <
> > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > > On Thu, 21 Apr 2011 22:53:19 -0700
> > > > Ying Han <yinghan@google.com> wrote:
> > > >
> > > > > On Thu, Apr 21, 2011 at 10:00 PM, KAMEZAWA Hiroyuki <
> > > > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > >
> > > > > > On Thu, 21 Apr 2011 21:49:04 -0700
> > > > > > Ying Han <yinghan@google.com> wrote:
> > > > > >
> > > > > > > On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
> > > > > > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > > add a counter for kswapd-scan and kswapd-reclaim, kswapd-pickup will
> > show
> > > > you information, if necessary it's good to show some latecy stat. I
> > think
> > > > we can add enough information by adding stats (or debug by perf tools.)
> > > > I'll consider this a a bit more.
> > > >
> > >
> > > Something like "kswapd_pgscan" and "kswapd_steal" per memcg? If we are
> > going
> > > to the thread-pool, we definitely need to add more stats to give us
> > enough
> > > visibility of per-memcg background reclaim activity. Still, not sure
> > about
> > > the cpu-cycles.
> > >
> >
> > BTW, Kosaki requeted me not to have private thread pool implementation and
> > use workqueue. I think he is right. So, I'd like to write a patch to
> > enhance
> > workqueue for using it for memcg (Of couse, I'll make a private workqueue.)
> >
> > Hmm. Can you give a bit more details of the logic behind? 

Kosaki just says please avoid reinventing the wheel. It seems workqueues are used
in many places and have rich functionality. I think we should give it a try. With
my patch for thread pools, I think I can avoid the starvation problem.

> and what's about the private workqueue? 

Don't use schedule_work(); use queue_work() instead.
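
A minimal sketch of that direction, with hypothetical names (the workqueue,
work item and function names below are assumptions for the sketch, not from
any posted patch):

#include <linux/workqueue.h>

static struct workqueue_struct *memcg_kswapd_wq;

static void memcg_bgreclaim_func(struct work_struct *work)
{
        /* pick a memcg below its low watermark and reclaim from it */
}

static DECLARE_WORK(memcg_bgreclaim_work, memcg_bgreclaim_func);

static int __init memcg_kswapd_init(void)
{
        /* a private, unbound workqueue, not shared with system_wq users */
        memcg_kswapd_wq = alloc_workqueue("memcg_kswapd", WQ_UNBOUND, 0);
        return memcg_kswapd_wq ? 0 : -ENOMEM;
}

/*
 * When a memcg crosses its high watermark, the charge path would do:
 *      queue_work(memcg_kswapd_wq, &memcg_bgreclaim_work);
 */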

> Also, how we plan to solve the better debug-ability issue.
> 

I'll add a patch for debug-ability. I believe I can record the cputime used for
memory reclaim per memcg, both for direct and background reclaim. I think it's
a good feature for memcg regardless of background reclaim.
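
A rough sketch of such accounting, with hypothetical names (shrink_mem_cgroup()
is the per-memcg reclaim entry from this series; mem_cgroup_add_reclaim_cputime()
is made up for the sketch):

static unsigned long timed_shrink_mem_cgroup(struct mem_cgroup *memcg, int order)
{
        /* CPU time this thread has consumed so far, in nanoseconds */
        unsigned long long start = task_sched_runtime(current);
        unsigned long reclaimed = shrink_mem_cgroup(memcg, order);

        /* charge the reclaim cost to the memcg being reclaimed */
        mem_cgroup_add_reclaim_cputime(memcg, task_sched_runtime(current) - start);
        return reclaimed;
}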
 
> >
> > ==
> > 2. regarding to the alternative workqueue, which is more complicated and we
> > need to be very careful of work items in the workqueue. We've experienced
> > in
> > one workitem stucks and the rest of the work item won't proceed. For
> > example
> > in dirty page writeback, one heavily writer cgroup could starve the other
> > cgroups from flushing dirty pages to the same disk. In the kswapd case, I
> > can
> > imagine we might have similar senario. How to prioritize the workitems is
> > another problem. The order of adding the workitems in the queue reflects
> > the
> > order of cgroups being reclaimed. We don't have that restriction currently
> > but
> > relying on the cpu scheduler to put kswapd on the right cpu-core to run. We
> > "might" introduce priority later for reclaim and how are we gonna deal with
> > that.
> > ==
> >
> > From this, I feel I need to use unbound workqueue. BTW, with patches for
> > current thread pool model, I think starvation problem by dirty pages
> > cannot be seen.
> > Anyway, I'll give a try.
> >
> 
> Then do you suggest me to wait for your patch for my next post?
> 

Feel free to do so. But it's close to the weekend, and posting the patch now
versus on Monday will not make a big difference.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 7/9] Per-memcg background reclaim.
  2011-04-22  7:54     ` Ying Han
@ 2011-04-22  8:44       ` KOSAKI Motohiro
  2011-04-22 18:37         ` Ying Han
  0 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-22  8:44 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

> > > @@ -111,6 +113,8 @@ struct scan_control {
> > >        * are scanned.
> > >        */
> > >       nodemask_t      *nodemask;
> > > +
> > > +     int priority;
> > >  };
> >
> > Bah!
> > If you need sc.priority, you have to make cleanup patch at first. and
> > all current reclaim path have to use sc.priority. Please don't increase
> > unnecessary mess.
> >
> > hmm. so then I would change it by passing the priority
> > as separate parameter.

ok.

> > > +             /*
> > > +              * If we've done a decent amount of scanning and
> > > +              * the reclaim ratio is low, start doing writepage
> > > +              * even in laptop mode
> > > +              */
> > > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> > 2) {
> > > +                     sc->may_writepage = 1;
> >
> > please make helper function for may_writepage. iow, don't cut-n-paste.
> >
> > hmm, can you help to clarify that?

I meant that the code and comments here are a complete cut-and-paste.


> > > +     total_scanned = 0;
> > > +
> > > +     do_nodes = node_states[N_ONLINE];
> >
> > Why do we need care memoryless node? N_HIGH_MEMORY is wrong?
> >
> hmm, let me look into that.


> > > +             sc.priority = priority;
> > > +             /* The swap token gets in the way of swapout... */
> > > +             if (!priority)
> > > +                     disable_swap_token();
> >
> > Why?
> >
> > disable swap token mean "Please devest swap preventation privilege from
> > owner task. Instead we endure swap storm and performance hit".
> > However I doublt memcg memory shortage is good situation to make swap
> > storm.
> >
> 
> I am not sure about that either way. we probably can leave as it is and make
> corresponding change if real problem is observed?

Why?
This is not only a memcg issue; it can also lead to global swap ping-pong.

But I give up. I have no time to persuade you.


> > > +                     nid = mem_cgroup_select_victim_node(mem_cont,
> > > +                                                     &do_nodes);
> > > +
> > > +                     pgdat = NODE_DATA(nid);
> > > +                     shrink_memcg_node(pgdat, order, &sc);
> > > +                     total_scanned += sc.nr_scanned;
> > > +
> > > +                     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > > +                             struct zone *zone = pgdat->node_zones + i;
> > > +
> > > +                             if (populated_zone(zone))
> > > +                                     break;
> > > +                     }
> >
> > memory less node check is here. but we can check it before.
> 
> Not sure I understand this, can you help to clarify?

Same as the N_HIGH_MEMORY comment above.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 7/9] Per-memcg background reclaim.
  2011-04-22  8:44       ` KOSAKI Motohiro
@ 2011-04-22 18:37         ` Ying Han
  2011-04-25  2:21           ` [PATCH] vmscan,memcg: memcg aware swap token KOSAKI Motohiro
  0 siblings, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-22 18:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Fri, Apr 22, 2011 at 1:44 AM, KOSAKI Motohiro <
kosaki.motohiro@jp.fujitsu.com> wrote:

> > > > @@ -111,6 +113,8 @@ struct scan_control {
> > > >        * are scanned.
> > > >        */
> > > >       nodemask_t      *nodemask;
> > > > +
> > > > +     int priority;
> > > >  };
> > >
> > > Bah!
> > > If you need sc.priority, you have to make cleanup patch at first. and
> > > all current reclaim path have to use sc.priority. Please don't increase
> > > unnecessary mess.
> > >
> > > hmm. so then I would change it by passing the priority
> > > as separate parameter.
>
> ok.
>
> > > > +             /*
> > > > +              * If we've done a decent amount of scanning and
> > > > +              * the reclaim ratio is low, start doing writepage
> > > > +              * even in laptop mode
> > > > +              */
> > > > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed
> /
> > > 2) {
> > > > +                     sc->may_writepage = 1;
> > >
> > > please make helper function for may_writepage. iow, don't cut-n-paste.
> > >
> > > hmm, can you help to clarify that?
>
> I meant completely cut-n-paste code and comments is here.
>
>
> > > > +     total_scanned = 0;
> > > > +
> > > > +     do_nodes = node_states[N_ONLINE];
> > >
> > > Why do we need care memoryless node? N_HIGH_MEMORY is wrong?
> > >
> > hmm, let me look into that.
>
>
> > > > +             sc.priority = priority;
> > > > +             /* The swap token gets in the way of swapout... */
> > > > +             if (!priority)
> > > > +                     disable_swap_token();
> > >
> > > Why?
> > >
> > > disable swap token mean "Please devest swap preventation privilege from
> > > owner task. Instead we endure swap storm and performance hit".
> > > However I doublt memcg memory shortage is good situation to make swap
> > > storm.
> > >
> >
> > I am not sure about that either way. we probably can leave as it is and
> make
> > corresponding change if real problem is observed?
>
> Why?
> This is not only memcg issue, but also can lead to global swap ping-pong.
>
> But I give up. I have no time to persuade you.
>
Thank you for pointing that out. I didn't pay much attention to the swap_token
but simply inherited it from the global logic. Now, after reading a bit more, I
think you were right about it. It would be a bad idea to have memcg kswapds
affecting the global swap token.

I will remove it from the next post.

>
> > > > +                     nid = mem_cgroup_select_victim_node(mem_cont,
> > > > +                                                     &do_nodes);
> > > > +
> > > > +                     pgdat = NODE_DATA(nid);
> > > > +                     shrink_memcg_node(pgdat, order, &sc);
> > > > +                     total_scanned += sc.nr_scanned;
> > > > +
> > > > +                     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > > > +                             struct zone *zone = pgdat->node_zones +
> i;
> > > > +
> > > > +                             if (populated_zone(zone))
> > > > +                                     break;
> > > > +                     }
> > >
> > > memory less node check is here. but we can check it before.
> >
> > Not sure I understand this, can you help to clarify?
>
> Same with above N_HIGH_MEMORY comments.
>

OK, agreed on the N_HIGH_MEMORY point; I will change that.

--Ying


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-22  7:59                 ` Ying Han
  2011-04-22  8:02                   ` KAMEZAWA Hiroyuki
@ 2011-04-24 23:26                   ` KAMEZAWA Hiroyuki
  2011-04-25  2:08                     ` Ying Han
  1 sibling, 1 reply; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-24 23:26 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Fri, 22 Apr 2011 00:59:26 -0700
Ying Han <yinghan@google.com> wrote:

> On Fri, Apr 22, 2011 at 12:46 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > From this, I feel I need to use unbound workqueue. BTW, with patches for
> > current thread pool model, I think starvation problem by dirty pages
> > cannot be seen.
> > Anyway, I'll give a try.
> >
> 
> Then do you suggest me to wait for your patch for my next post?
> 

I used most of the weekend for background reclaim on workqueue, and I changed many
things based on your patch (but dropped most of the kswapd descriptor... patches).

I'll post it today after some tests on machines in my office. It worked well
on my laptop.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V7 4/9] Add memcg kswapd thread pool
  2011-04-24 23:26                   ` KAMEZAWA Hiroyuki
@ 2011-04-25  2:08                     ` Ying Han
  0 siblings, 0 replies; 48+ messages in thread
From: Ying Han @ 2011-04-25  2:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Sun, Apr 24, 2011 at 4:26 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri, 22 Apr 2011 00:59:26 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Fri, Apr 22, 2011 at 12:46 AM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > > From this, I feel I need to use unbound workqueue. BTW, with patches
> for
> > > current thread pool model, I think starvation problem by dirty pages
> > > cannot be seen.
> > > Anyway, I'll give a try.
> > >
> >
> > Then do you suggest me to wait for your patch for my next post?
> >
>
> I used most of weekend for background reclaim on workqueue and I changed
> many
> things based on your patch (but dropped most of kswapd
> descriptor...patches.)
>
Thank you for the heads up. Although I still have concerns about the workqueue
approach, thank you for taking the time to give it a try.

One of my concerns is still the debug-ability, and I am not convinced that the
resource consumption is a killer issue for the per-memcg kswapd threads.
Anyway, looking forward to seeing your change.

--Ying



> I'll post it today after some tests on machines in my office. It worked
> well
> on my laptop.
>
> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH] vmscan,memcg: memcg aware swap token
  2011-04-22 18:37         ` Ying Han
@ 2011-04-25  2:21           ` KOSAKI Motohiro
  2011-04-25  9:47             ` KAMEZAWA Hiroyuki
  2011-04-25 17:13             ` Ying Han
  0 siblings, 2 replies; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-25  2:21 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

> > > > > +             sc.priority = priority;
> > > > > +             /* The swap token gets in the way of swapout... */
> > > > > +             if (!priority)
> > > > > +                     disable_swap_token();
> > > >
> > > > Why?
> > > >
> > > > disable swap token mean "Please devest swap preventation privilege from
> > > > owner task. Instead we endure swap storm and performance hit".
> > > > However I doublt memcg memory shortage is good situation to make swap
> > > > storm.
> > > >
> > >
> > > I am not sure about that either way. we probably can leave as it is and
> > make
> > > corresponding change if real problem is observed?
> >
> > Why?
> > This is not only memcg issue, but also can lead to global swap ping-pong.
> >
> > But I give up. I have no time to persuade you.
> >
> > Thank you for pointing that out. I didn't pay much attention on the
> swap_token but just simply inherited
> it from the global logic. Now after reading a bit more, i think you were
> right about it.  It would be a bad
> idea to have memcg kswapds affecting much the global swap token being set.
> 
> I will remove it from the next post.

The better approach is to make the swap token recognize memcg and behave cleverly? :)



From 106c21d7f9cf8641592cbfe1416af66470af4f9a Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Mon, 25 Apr 2011 10:57:54 +0900
Subject: [PATCH] vmscan,memcg: memcg aware swap token

Currently, memcg reclaim can disable the swap token even if the swap token
mm doesn't belong to its memory cgroup. That is slightly risky. If an admin
creates a very small mem-cgroup and somebody runs a continuous heavy memory
pressure workload in it, all tasks in the system will lose the swap token
and the system may become unresponsive. That's bad.

This patch adds a 'memcg' parameter to disable_swap_token(); if the parameter
doesn't match the swap token owner, the VM doesn't put the swap token.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    6 ++++++
 include/linux/swap.h       |   24 +++++++++++++++++-------
 mm/memcontrol.c            |    2 +-
 mm/thrash.c                |   17 +++++++++++++++++
 mm/vmscan.c                |    4 ++--
 5 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6a0cffd..df572af 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -84,6 +84,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
@@ -244,6 +245,11 @@ static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	return NULL;
 }
 
+static inline struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
+{
+	return NULL;
+}
+
 static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
 {
 	return 1;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 384eb5f..ccea15d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -358,21 +358,31 @@ struct backing_dev_info;
 extern struct mm_struct *swap_token_mm;
 extern void grab_swap_token(struct mm_struct *);
 extern void __put_swap_token(struct mm_struct *);
+extern int has_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg);
 
-static inline int has_swap_token(struct mm_struct *mm)
+static inline
+int has_swap_token(struct mm_struct *mm)
 {
-	return (mm == swap_token_mm);
+	return has_swap_token_memcg(mm, NULL);
 }
 
-static inline void put_swap_token(struct mm_struct *mm)
+static inline
+void put_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
 {
-	if (has_swap_token(mm))
+	if (has_swap_token_memcg(mm, memcg))
 		__put_swap_token(mm);
 }
 
-static inline void disable_swap_token(void)
+static inline
+void put_swap_token(struct mm_struct *mm)
+{
+	return put_swap_token_memcg(mm, NULL);
+}
+
+static inline
+void disable_swap_token(struct mem_cgroup *memcg)
 {
-	put_swap_token(swap_token_mm);
+	put_swap_token_memcg(swap_token_mm, memcg);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -500,7 +510,7 @@ static inline int has_swap_token(struct mm_struct *mm)
 	return 0;
 }
 
-static inline void disable_swap_token(void)
+static inline void disable_swap_token(struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c2776f1..5683c7a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -735,7 +735,7 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 				struct mem_cgroup, css);
 }
 
-static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
+struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
 	struct mem_cgroup *mem = NULL;
 
diff --git a/mm/thrash.c b/mm/thrash.c
index 2372d4e..f892a6e 100644
--- a/mm/thrash.c
+++ b/mm/thrash.c
@@ -21,6 +21,7 @@
 #include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/swap.h>
+#include <linux/memcontrol.h>
 
 static DEFINE_SPINLOCK(swap_token_lock);
 struct mm_struct *swap_token_mm;
@@ -75,3 +76,19 @@ void __put_swap_token(struct mm_struct *mm)
 		swap_token_mm = NULL;
 	spin_unlock(&swap_token_lock);
 }
+
+int has_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+	if (memcg) {
+		struct mem_cgroup *swap_token_memcg;
+
+		/*
+		 * memcgroup reclaim can disable swap token only if token task
+		 * is in the same cgroup.
+		 */
+		swap_token_memcg = try_get_mem_cgroup_from_mm(swap_token_mm);
+		return ((mm == swap_token_mm) && (memcg == swap_token_memcg));
+	} else
+		return (mm == swap_token_mm);
+}
+
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b3a569f..19e179b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2044,7 +2044,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
 		if (!priority)
-			disable_swap_token();
+			disable_swap_token(sc->mem_cgroup);
 		shrink_zones(priority, zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
@@ -2353,7 +2353,7 @@ loop_again:
 
 		/* The swap token gets in the way of swapout... */
 		if (!priority)
-			disable_swap_token();
+			disable_swap_token(NULL);
 
 		all_zones_ok = 1;
 		balanced = 0;
-- 
1.7.3.1




^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] vmscan,memcg: memcg aware swap token
  2011-04-25  2:21           ` [PATCH] vmscan,memcg: memcg aware swap token KOSAKI Motohiro
@ 2011-04-25  9:47             ` KAMEZAWA Hiroyuki
  2011-04-25 17:13             ` Ying Han
  1 sibling, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ying Han, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Mon, 25 Apr 2011 11:21:57 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> The better approach is swap-token recognize memcg and behave clever? :)
> 
> 
> 
> From 106c21d7f9cf8641592cbfe1416af66470af4f9a Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Mon, 25 Apr 2011 10:57:54 +0900
> Subject: [PATCH] vmscan,memcg: memcg aware swap token
> 
> Currently, memcg reclaim can disable swap token even if the swap token
> mm doesn't belong in its memory cgroup. It's slightly riskly. If an
> admin makes very small mem-cgroup and silly guy runs contenious heavy
> memory pressure workloa, whole tasks in the system are going to lose
> swap-token and then system may become unresponsive. That's bad.
> 
> This patch adds 'memcg' parameter into disable_swap_token(). and if
> the parameter doesn't match swap-token, VM doesn't put swap-token.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Oh, thank you. This seems necessary. But we should maintain the swap token
under memcg with the hierarchy in mind, I think.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

(It's OK for me to update this myself if the hierarchy handling turns out to be complicated.)
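
For the hierarchy case, a sketch of what the check could become (sketch only:
mem_cgroup_is_descendant() is a hypothetical helper, and the reference taken
by try_get_mem_cgroup_from_mm() would still need to be dropped; the posted
patch simply compares the two memcg pointers):

int has_swap_token_memcg_hier(struct mm_struct *mm, struct mem_cgroup *memcg)
{
        struct mem_cgroup *owner;

        if (!memcg)
                return mm == swap_token_mm;
        if (mm != swap_token_mm)
                return 0;
        /* count the token owner as a match anywhere in memcg's subtree */
        owner = try_get_mem_cgroup_from_mm(swap_token_mm);
        return owner && mem_cgroup_is_descendant(owner, memcg);
}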


> ---
>  include/linux/memcontrol.h |    6 ++++++
>  include/linux/swap.h       |   24 +++++++++++++++++-------
>  mm/memcontrol.c            |    2 +-
>  mm/thrash.c                |   17 +++++++++++++++++
>  mm/vmscan.c                |    4 ++--
>  5 files changed, 43 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6a0cffd..df572af 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -84,6 +84,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> @@ -244,6 +245,11 @@ static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  	return NULL;
>  }
>  
> +static inline struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> +{
> +	return NULL;
> +}
> +
>  static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
>  {
>  	return 1;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 384eb5f..ccea15d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -358,21 +358,31 @@ struct backing_dev_info;
>  extern struct mm_struct *swap_token_mm;
>  extern void grab_swap_token(struct mm_struct *);
>  extern void __put_swap_token(struct mm_struct *);
> +extern int has_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg);
>  
> -static inline int has_swap_token(struct mm_struct *mm)
> +static inline
> +int has_swap_token(struct mm_struct *mm)
>  {
> -	return (mm == swap_token_mm);
> +	return has_swap_token_memcg(mm, NULL);
>  }
>  
> -static inline void put_swap_token(struct mm_struct *mm)
> +static inline
> +void put_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
>  {
> -	if (has_swap_token(mm))
> +	if (has_swap_token_memcg(mm, memcg))
>  		__put_swap_token(mm);
>  }
>  
> -static inline void disable_swap_token(void)
> +static inline
> +void put_swap_token(struct mm_struct *mm)
> +{
> +	return put_swap_token_memcg(mm, NULL);
> +}
> +
> +static inline
> +void disable_swap_token(struct mem_cgroup *memcg)
>  {
> -	put_swap_token(swap_token_mm);
> +	put_swap_token_memcg(swap_token_mm, memcg);
>  }
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> @@ -500,7 +510,7 @@ static inline int has_swap_token(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -static inline void disable_swap_token(void)
> +static inline void disable_swap_token(struct mem_cgroup *memcg)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c2776f1..5683c7a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -735,7 +735,7 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>  				struct mem_cgroup, css);
>  }
>  
> -static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> +struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>  {
>  	struct mem_cgroup *mem = NULL;
>  
> diff --git a/mm/thrash.c b/mm/thrash.c
> index 2372d4e..f892a6e 100644
> --- a/mm/thrash.c
> +++ b/mm/thrash.c
> @@ -21,6 +21,7 @@
>  #include <linux/mm.h>
>  #include <linux/sched.h>
>  #include <linux/swap.h>
> +#include <linux/memcontrol.h>
>  
>  static DEFINE_SPINLOCK(swap_token_lock);
>  struct mm_struct *swap_token_mm;
> @@ -75,3 +76,19 @@ void __put_swap_token(struct mm_struct *mm)
>  		swap_token_mm = NULL;
>  	spin_unlock(&swap_token_lock);
>  }
> +
> +int has_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> +{
> +	if (memcg) {
> +		struct mem_cgroup *swap_token_memcg;
> +
> +		/*
> +		 * memcgroup reclaim can disable swap token only if token task
> +		 * is in the same cgroup.
> +		 */
> +		swap_token_memcg = try_get_mem_cgroup_from_mm(swap_token_mm);
> +		return ((mm == swap_token_mm) && (memcg == swap_token_memcg));
> +	} else
> +		return (mm == swap_token_mm);
> +}
> +
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b3a569f..19e179b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2044,7 +2044,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>  		sc->nr_scanned = 0;
>  		if (!priority)
> -			disable_swap_token();
> +			disable_swap_token(sc->mem_cgroup);
>  		shrink_zones(priority, zonelist, sc);
>  		/*
>  		 * Don't shrink slabs when reclaiming memory from
> @@ -2353,7 +2353,7 @@ loop_again:
>  
>  		/* The swap token gets in the way of swapout... */
>  		if (!priority)
> -			disable_swap_token();
> +			disable_swap_token(NULL);
>  
>  		all_zones_ok = 1;
>  		balanced = 0;
> -- 
> 1.7.3.1
> 
> 
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] vmscan,memcg: memcg aware swap token
  2011-04-25  2:21           ` [PATCH] vmscan,memcg: memcg aware swap token KOSAKI Motohiro
  2011-04-25  9:47             ` KAMEZAWA Hiroyuki
@ 2011-04-25 17:13             ` Ying Han
  2011-04-26  2:08               ` KOSAKI Motohiro
  1 sibling, 1 reply; 48+ messages in thread
From: Ying Han @ 2011-04-25 17:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Sun, Apr 24, 2011 at 7:21 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> > > > > +             sc.priority = priority;
>> > > > > +             /* The swap token gets in the way of swapout... */
>> > > > > +             if (!priority)
>> > > > > +                     disable_swap_token();
>> > > >
>> > > > Why?
>> > > >
>> > > > disable swap token mean "Please devest swap preventation privilege from
>> > > > owner task. Instead we endure swap storm and performance hit".
>> > > > However I doublt memcg memory shortage is good situation to make swap
>> > > > storm.
>> > > >
>> > >
>> > > I am not sure about that either way. we probably can leave as it is and
>> > make
>> > > corresponding change if real problem is observed?
>> >
>> > Why?
>> > This is not only memcg issue, but also can lead to global swap ping-pong.
>> >
>> > But I give up. I have no time to persuade you.
>> >
>> > Thank you for pointing that out. I didn't pay much attention on the
>> swap_token but just simply inherited
>> it from the global logic. Now after reading a bit more, i think you were
>> right about it.  It would be a bad
>> idea to have memcg kswapds affecting much the global swap token being set.
>>
>> I will remove it from the next post.
>
> The better approach is swap-token recognize memcg and behave clever? :)

OK, this makes sense for the memcg case. Maybe I missed something in the
per-node balance_pgdat, where it seems it will blindly disable swap_token_mm
if there is one:

static inline int has_swap_token(struct mm_struct *mm)
{
        return (mm == swap_token_mm);
}

static inline void put_swap_token(struct mm_struct *mm)
{
        if (has_swap_token(mm))
                __put_swap_token(mm);
}

static inline void disable_swap_token(void)
{
        put_swap_token(swap_token_mm);
}


Should I include this patch in the per-memcg kswapd patchset?

--Ying

>
>
>
> From 106c21d7f9cf8641592cbfe1416af66470af4f9a Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Mon, 25 Apr 2011 10:57:54 +0900
> Subject: [PATCH] vmscan,memcg: memcg aware swap token
>
> Currently, memcg reclaim can disable swap token even if the swap token
> mm doesn't belong in its memory cgroup. It's slightly riskly. If an
> admin makes very small mem-cgroup and silly guy runs contenious heavy
> memory pressure workloa, whole tasks in the system are going to lose
> swap-token and then system may become unresponsive. That's bad.
>
> This patch adds 'memcg' parameter into disable_swap_token(). and if
> the parameter doesn't match swap-token, VM doesn't put swap-token.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    6 ++++++
>  include/linux/swap.h       |   24 +++++++++++++++++-------
>  mm/memcontrol.c            |    2 +-
>  mm/thrash.c                |   17 +++++++++++++++++
>  mm/vmscan.c                |    4 ++--
>  5 files changed, 43 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6a0cffd..df572af 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -84,6 +84,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm);
>
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> @@ -244,6 +245,11 @@ static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>        return NULL;
>  }
>
> +static inline struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> +{
> +       return NULL;
> +}
> +
>  static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
>  {
>        return 1;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 384eb5f..ccea15d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -358,21 +358,31 @@ struct backing_dev_info;
>  extern struct mm_struct *swap_token_mm;
>  extern void grab_swap_token(struct mm_struct *);
>  extern void __put_swap_token(struct mm_struct *);
> +extern int has_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg);
>
> -static inline int has_swap_token(struct mm_struct *mm)
> +static inline
> +int has_swap_token(struct mm_struct *mm)
>  {
> -       return (mm == swap_token_mm);
> +       return has_swap_token_memcg(mm, NULL);
>  }
>
> -static inline void put_swap_token(struct mm_struct *mm)
> +static inline
> +void put_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
>  {
> -       if (has_swap_token(mm))
> +       if (has_swap_token_memcg(mm, memcg))
>                __put_swap_token(mm);
>  }
>
> -static inline void disable_swap_token(void)
> +static inline
> +void put_swap_token(struct mm_struct *mm)
> +{
> +       return put_swap_token_memcg(mm, NULL);
> +}
> +
> +static inline
> +void disable_swap_token(struct mem_cgroup *memcg)
>  {
> -       put_swap_token(swap_token_mm);
> +       put_swap_token_memcg(swap_token_mm, memcg);
>  }
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> @@ -500,7 +510,7 @@ static inline int has_swap_token(struct mm_struct *mm)
>        return 0;
>  }
>
> -static inline void disable_swap_token(void)
> +static inline void disable_swap_token(struct mem_cgroup *memcg)
>  {
>  }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c2776f1..5683c7a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -735,7 +735,7 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>                                struct mem_cgroup, css);
>  }
>
> -static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> +struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>  {
>        struct mem_cgroup *mem = NULL;
>
> diff --git a/mm/thrash.c b/mm/thrash.c
> index 2372d4e..f892a6e 100644
> --- a/mm/thrash.c
> +++ b/mm/thrash.c
> @@ -21,6 +21,7 @@
>  #include <linux/mm.h>
>  #include <linux/sched.h>
>  #include <linux/swap.h>
> +#include <linux/memcontrol.h>
>
>  static DEFINE_SPINLOCK(swap_token_lock);
>  struct mm_struct *swap_token_mm;
> @@ -75,3 +76,19 @@ void __put_swap_token(struct mm_struct *mm)
>                swap_token_mm = NULL;
>        spin_unlock(&swap_token_lock);
>  }
> +
> +int has_swap_token_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> +{
> +       if (memcg) {
> +               struct mem_cgroup *swap_token_memcg;
> +
> +               /*
> +                * memcgroup reclaim can disable swap token only if token task
> +                * is in the same cgroup.
> +                */
> +               swap_token_memcg = try_get_mem_cgroup_from_mm(swap_token_mm);
> +               return ((mm == swap_token_mm) && (memcg == swap_token_memcg));
> +       } else
> +               return (mm == swap_token_mm);
> +}
> +
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b3a569f..19e179b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2044,7 +2044,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>                sc->nr_scanned = 0;
>                if (!priority)
> -                       disable_swap_token();
> +                       disable_swap_token(sc->mem_cgroup);
>                shrink_zones(priority, zonelist, sc);
>                /*
>                 * Don't shrink slabs when reclaiming memory from
> @@ -2353,7 +2353,7 @@ loop_again:
>
>                /* The swap token gets in the way of swapout... */
>                if (!priority)
> -                       disable_swap_token();
> +                       disable_swap_token(NULL);
>
>                all_zones_ok = 1;
>                balanced = 0;
> --
> 1.7.3.1
>
>
>
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] vmscan,memcg: memcg aware swap token
  2011-04-25 17:13             ` Ying Han
@ 2011-04-26  2:08               ` KOSAKI Motohiro
  0 siblings, 0 replies; 48+ messages in thread
From: KOSAKI Motohiro @ 2011-04-26  2:08 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

> > The better approach is swap-token recognize memcg and behave clever? :)
> 
> Ok, this makes sense for memcg case. Maybe I missed something on the
> per-node balance_pgdat, where it seems it will blindly disable the
> swap_token_mm if there is a one.

That's by design. The 'disable' in disable_swap_token() means blindly disable.
The intention is:
  priority != 0: try to avoid a swap storm
  priority == 0: allow thrashing; it's better than a false-positive OOM.
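
In other words, the reclaim loops follow roughly this pattern (a paraphrase of
the do_try_to_free_pages() hunk quoted above, not new code):

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                sc->nr_scanned = 0;
                if (!priority)
                        /* last pass: accept thrashing over a false-positive oom */
                        disable_swap_token(sc->mem_cgroup);
                shrink_zones(priority, zonelist, sc);
                /* ... rest of the reclaim loop ... */
        }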


> Should I include this patch into the per-memcg kswapd patset?

Nope.
This is a standalone patch. The current memcg direct reclaim path has the same
problem.






^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2011-04-26  2:08 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-22  4:24 [RFC PATCH V7 0/9] memcg: per cgroup background reclaim Ying Han
2011-04-22  4:24 ` [PATCH V7 1/9] Add kswapd descriptor Ying Han
2011-04-22  4:31   ` KAMEZAWA Hiroyuki
2011-04-22  4:47   ` KOSAKI Motohiro
2011-04-22  5:55     ` Ying Han
2011-04-22  4:24 ` [PATCH V7 2/9] Add per memcg reclaim watermarks Ying Han
2011-04-22  4:24 ` [PATCH V7 3/9] New APIs to adjust per-memcg wmarks Ying Han
2011-04-22  4:32   ` KAMEZAWA Hiroyuki
2011-04-22  4:24 ` [PATCH V7 4/9] Add memcg kswapd thread pool Ying Han
2011-04-22  4:36   ` KAMEZAWA Hiroyuki
2011-04-22  4:49     ` Ying Han
2011-04-22  5:00       ` KAMEZAWA Hiroyuki
2011-04-22  5:53         ` Ying Han
2011-04-22  5:59           ` KAMEZAWA Hiroyuki
2011-04-22  6:10             ` Ying Han
2011-04-22  7:46               ` KAMEZAWA Hiroyuki
2011-04-22  7:59                 ` Ying Han
2011-04-22  8:02                   ` KAMEZAWA Hiroyuki
2011-04-24 23:26                   ` KAMEZAWA Hiroyuki
2011-04-25  2:08                     ` Ying Han
2011-04-22  6:02         ` Zhu Yanhai
2011-04-22  6:14           ` Ying Han
2011-04-22  5:39   ` KOSAKI Motohiro
2011-04-22  5:56     ` KAMEZAWA Hiroyuki
2011-04-22  4:24 ` [PATCH V7 5/9] Infrastructure to support per-memcg reclaim Ying Han
2011-04-22  4:38   ` KAMEZAWA Hiroyuki
2011-04-22  5:11   ` KOSAKI Motohiro
2011-04-22  5:59     ` Ying Han
2011-04-22  5:27   ` KOSAKI Motohiro
2011-04-22  6:00     ` Ying Han
2011-04-22  4:24 ` [PATCH V7 6/9] Implement the select_victim_node within memcg Ying Han
2011-04-22  4:39   ` KAMEZAWA Hiroyuki
2011-04-22  4:24 ` [PATCH V7 7/9] Per-memcg background reclaim Ying Han
2011-04-22  4:40   ` KAMEZAWA Hiroyuki
2011-04-22  6:00   ` KOSAKI Motohiro
2011-04-22  7:54     ` Ying Han
2011-04-22  8:44       ` KOSAKI Motohiro
2011-04-22 18:37         ` Ying Han
2011-04-25  2:21           ` [PATCH] vmscan,memcg: memcg aware swap token KOSAKI Motohiro
2011-04-25  9:47             ` KAMEZAWA Hiroyuki
2011-04-25 17:13             ` Ying Han
2011-04-26  2:08               ` KOSAKI Motohiro
2011-04-22  4:24 ` [PATCH V7 8/9] Add per-memcg zone "unreclaimable" Ying Han
2011-04-22  4:43   ` KAMEZAWA Hiroyuki
2011-04-22  6:13   ` KOSAKI Motohiro
2011-04-22  6:17     ` Ying Han
2011-04-22  4:24 ` [PATCH V7 9/9] Enable per-memcg background reclaim Ying Han
2011-04-22  4:44   ` KAMEZAWA Hiroyuki
