* [PATCH V6 00/10] memcg: per cgroup background reclaim
@ 2011-04-19  3:57 Ying Han
  2011-04-19  3:57 ` [PATCH V6 01/10] Add kswapd descriptor Ying Han
                   ` (12 more replies)
  0 siblings, 13 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

The current memcg implementation supports targeted reclaim only when a cgroup is
hitting its hard_limit, and that reclaim is done synchronously (direct reclaim)
per cgroup. Per-cgroup background reclaim is needed to spread the memory
pressure out over a longer period of time and smooth out the cgroup's performance.

If a cgroup is configured to use per-cgroup background reclaim, a kswapd thread
is created which only scans that memcg's LRU lists. Two watermarks ("high_wmark",
"low_wmark") are added to trigger and stop the background reclaim; both are
calculated from the cgroup's limit_in_bytes. By default, the per-memcg kswapd
threads run under the root cgroup. A per-memcg API exports the pid of each
kswapd thread, so userspace can configure a cpu cgroup for it separately.

I ran a dd test on a large file and then cat the file back, and compared the
reclaim-related stats in memory.stat.

Step 1: Create a cgroup with a 500M memory limit.
$ mkdir /dev/cgroup/memory/A
$ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/A/tasks

Step 2: Check and set the wmarks.
$ cat /dev/cgroup/memory/A/memory.low_wmark_distance
0
$ cat /dev/cgroup/memory/A/memory.high_wmark_distance
0

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 524288000
high_wmark 524288000

$ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
$ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 482344960
high_wmark 471859200
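
(For reference, these values follow directly from the distances, taking 1m as
1048576 bytes:
    low_wmark  = 524288000 - 40*1048576 = 482344960
    high_wmark = 524288000 - 50*1048576 = 471859200)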

$ ps -ef | grep memcg
root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg

$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_3 18126

Step 3: Dirty the pages by creating a 20g file on the hard drive.
$ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
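
(ddtest appears to be a local test tool; assuming -b is the block size in bytes
and -n the block count, a roughly equivalent plain dd invocation would be
$ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520
writing the same 20g file that is read back in Step 5.)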

Here is memory.stat with vs. without the per-memcg reclaim. Previously all the
pages were reclaimed via direct reclaim; now some of the pages are also being
reclaimed in the background.

Only direct reclaim                      With background reclaim:

pgpgin 5243222                           pgpgin 5243267
pgpgout 5115252                          pgpgout 5127978
kswapd_steal 0                           kswapd_steal 2699807
pg_pgsteal 5115229                       pg_pgsteal 2428102
kswapd_pgscan 0                          kswapd_pgscan 10527319
pg_scan 5918875                          pg_scan 15533740
pgrefill 264761                          pgrefill 294801
pgoutrun 0                               pgoutrun 81097
allocstall 158406                        allocstall 73799

real   4m55.684s                         real    5m1.123s
user   0m1.227s                          user    0m1.205s
sys    1m7.793s                          sys     1m6.647s

throughput is 67.37 MB/sec               throughput is 68.04 MB/sec

Step 4: Cleanup
$ echo $$ >/dev/cgroup/memory/tasks
$ echo 1 > /dev/cgroup/memory/A/memory.force_empty
$ rmdir /dev/cgroup/memory/A
$ echo 3 >/proc/sys/vm/drop_caches

Step 5: Create the same cgroup and read the 20g file into pagecache.
$ cat /export/hdc3/dd/tf0 > /dev/zero

With the per-cgroup reclaim, all of the pages are reclaimed in the background
instead of via direct reclaim.

Only direct reclaim                       With background reclaim:

pgpgin 5242929                            pgpgin 5242935
pgpgout 5114974                           pgpgout 5125504
kswapd_steal 0                            kswapd_steal 5125470
pg_pgsteal 5114944                        pg_pgsteal 0
kswapd_pgscan 0                           kswapd_pgscan 5125472
pg_scan 5114944                           pg_scan 0
pgrefill 0                                pgrefill 0
pgoutrun 0                                pgoutrun 160184
allocstall 159842                         allocstall 0

real    4m20.678s                         real    4m20.632s
user    0m0.198s                          user    0m0.280s
sys     0m32.569s                         sys     0m24.580s

Note:
This is the first step of extending targeted reclaim into memcg. Here are the
known issues and our plan:

1. There is one kswapd thread per cgroup. The thread is created when the cgroup
changes its limit_in_bytes and is deleted when the cgroup is removed. In
environments where thousands of cgroups are configured on a single host, we will
have thousands of kswapd threads. The memory consumption would be about
8k * 1000 = 8M per thousand cgroups. We don't see a big issue for now if the
host can handle that many cgroups.

2. Regarding the workqueue alternative: it is more complicated, and we would
need to be very careful about work items in the workqueue. We have seen cases
where one stuck work item keeps the rest of the work items from making progress.
For example, in dirty page writeback one heavy-writer cgroup could starve the
other cgroups from flushing dirty pages to the same disk; I can imagine a
similar scenario in the kswapd case. How to prioritize the work items is another
problem: the order in which items are added to the queue dictates the order in
which cgroups get reclaimed. We don't have that restriction today; we rely on
the cpu scheduler to put each kswapd on the right cpu core. We might introduce
reclaim priorities later, and it is unclear how a workqueue would handle that.

3. There is potential lock contention between per-cgroup kswapds, and the worst
case depends on the number of cpu cores in the system. Basically we are now
sharing zone->lru_lock between the per-memcg LRU and the global LRU. I have a
plan to get rid of the global LRU eventually, which requires enhancing the
existing targeted reclaim (this patchset is part of that). I would like to get
to the point where the lock contention problem is solved naturally.

4. There is no hierarchical reclaim support in this patchset. I would like to
add it after the basic pieces are accepted.

5. By default, each per-memcg kswapd runs under the root cgroup. If there is a
need to put a kswapd thread into a cpu cgroup, userspace can do so by reading
the pid from the new API and echoing it into the cpu cgroup's tasks file. On a
non-preemptive kernel, we need to be careful about priority inversion when
restricting kswapd cpu time while it is holding a mutex.
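
For example, assuming a cpu cgroup named "kswapd_A" already exists (the path is
just for illustration):

$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_3 18126
$ echo 18126 > /dev/cgroup/cpu/kswapd_A/tasks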

Ying Han (10):
  Add kswapd descriptor
  Add per memcg reclaim watermarks
  New APIs to adjust per-memcg wmarks
  Infrastructure to support per-memcg reclaim.
  Implement the select_victim_node within memcg.
  Per-memcg background reclaim.
  Add per-memcg zone "unreclaimable"
  Enable per-memcg background reclaim.
  Add API to export per-memcg kswapd pid.
  Add some per-memcg stats

 Documentation/cgroups/memory.txt |   14 +
 include/linux/memcontrol.h       |  109 +++++++++
 include/linux/mmzone.h           |    3 +-
 include/linux/res_counter.h      |   78 ++++++
 include/linux/sched.h            |    1 +
 include/linux/swap.h             |   14 +-
 kernel/res_counter.c             |    6 +
 mm/memcontrol.c                  |  490 +++++++++++++++++++++++++++++++++++++-
 mm/memory_hotplug.c              |    4 +-
 mm/page_alloc.c                  |    1 -
 mm/vmscan.c                      |  391 ++++++++++++++++++++++++++----
 11 files changed, 1053 insertions(+), 58 deletions(-)

-- 
1.7.3.1

* [PATCH V6 01/10] Add kswapd descriptor
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-19  3:57 ` [PATCH V6 02/10] Add per memcg reclaim watermarks Ying Han
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

There is one kswapd kernel thread for each NUMA node; we will add a separate
kswapd for each memcg. Each kswapd sleeps on the wait queue headed at the
kswapd_wait field of its kswapd descriptor. The descriptor stores the node or
memcg it serves, which lets the global and per-memcg background reclaim share
common reclaim algorithms.

This patch adds the kswapd descriptor and moves the per-node kswapd to use the
new structure.
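
For illustration, this is the indirection the hunks below introduce: pg_data_t
keeps only a pointer to the descriptor's wait queue, and the owning descriptor
(and its task) is recovered via container_of(), e.g.

	struct kswapd *kswapd_p = container_of(pgdat->kswapd_wait,
					       struct kswapd, kswapd_wait);
	struct task_struct *kswapd_tsk = kswapd_p->kswapd_task;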

changelog v6..v5:
1. rename kswapd_thr to kswapd_tsk
2. revert the api change on sleeping_prematurely since memcg doesn't support it.

changelog v5..v4:
1. add comment on kswapds_spinlock
2. remove the kswapds_spinlock. we don't need it here since the kswapd and pgdat
have 1:1 mapping.

changelog v3..v2:
1. move the struct mem_cgroup *kswapd_mem in the kswapd struct to a later patch.
2. rename thr in kswapd_run to something else.

changelog v2..v1:
1. dynamically allocate the kswapd descriptor and initialize the wait_queue_head of
pgdat at kswapd_run.
2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup kswapd
descriptor.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/mmzone.h |    3 +-
 include/linux/swap.h   |    7 ++++
 mm/page_alloc.c        |    1 -
 mm/vmscan.c            |   80 ++++++++++++++++++++++++++++++++++++-----------
 4 files changed, 69 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 628f07b..6cba7d2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -640,8 +640,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	wait_queue_head_t kswapd_wait;
-	struct task_struct *kswapd;
+	wait_queue_head_t *kswapd_wait;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 } pg_data_t;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed6ebe6..f43d406 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
 	return current->flags & PF_KSWAPD;
 }
 
+struct kswapd {
+	struct task_struct *kswapd_task;
+	wait_queue_head_t kswapd_wait;
+	pg_data_t *kswapd_pgdat;
+};
+
+int kswapd(void *p);
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e1b52a..6340865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
-	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..ba5e591 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2570,21 +2570,24 @@ out:
 	return order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
+				int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 
 	if (freezing(current) || kthread_should_stop())
 		return;
 
-	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
 	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
-		finish_wait(&pgdat->kswapd_wait, &wait);
-		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+		finish_wait(wait_h, &wait);
+		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 	}
 
 	/*
@@ -2611,7 +2614,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		else
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
-	finish_wait(&pgdat->kswapd_wait, &wait);
+	finish_wait(wait_h, &wait);
 }
 
 /*
@@ -2627,20 +2630,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
  * If there are applications that are active memory-allocators
  * (most normal use), this basically shouldn't matter.
  */
-static int kswapd(void *p)
+int kswapd(void *p)
 {
 	unsigned long order;
 	int classzone_idx;
-	pg_data_t *pgdat = (pg_data_t*)p;
+	struct kswapd *kswapd_p = (struct kswapd *)p;
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
 	};
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	const struct cpumask *cpumask;
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
+	BUG_ON(pgdat->kswapd_wait != wait_h);
+	cpumask = cpumask_of_node(pgdat->node_id);
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 	current->reclaim_state = &reclaim_state;
@@ -2679,7 +2686,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx);
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
@@ -2719,13 +2726,13 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 		pgdat->kswapd_max_order = order;
 		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
 	}
-	if (!waitqueue_active(&pgdat->kswapd_wait))
+	if (!waitqueue_active(pgdat->kswapd_wait))
 		return;
 	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	wake_up_interruptible(&pgdat->kswapd_wait);
+	wake_up_interruptible(pgdat->kswapd_wait);
 }
 
 /*
@@ -2817,12 +2824,21 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 		for_each_node_state(nid, N_HIGH_MEMORY) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 			const struct cpumask *mask;
+			struct kswapd *kswapd_p;
+			struct task_struct *kswapd_tsk;
+			wait_queue_head_t *wait;
 
 			mask = cpumask_of_node(pgdat->node_id);
 
+			wait = pgdat->kswapd_wait;
+			kswapd_p = container_of(wait, struct kswapd,
+						kswapd_wait);
+			kswapd_tsk = kswapd_p->kswapd_task;
+
 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
 				/* One of our CPUs online: restore mask */
-				set_cpus_allowed_ptr(pgdat->kswapd, mask);
+				if (kswapd_tsk)
+					set_cpus_allowed_ptr(kswapd_tsk, mask);
 		}
 	}
 	return NOTIFY_OK;
@@ -2835,18 +2851,31 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 int kswapd_run(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
+	struct task_struct *kswapd_tsk;
+	struct kswapd *kswapd_p;
 	int ret = 0;
 
-	if (pgdat->kswapd)
+	if (pgdat->kswapd_wait)
 		return 0;
 
-	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
-	if (IS_ERR(pgdat->kswapd)) {
+	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
+	if (!kswapd_p)
+		return -ENOMEM;
+
+	init_waitqueue_head(&kswapd_p->kswapd_wait);
+	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_pgdat = pgdat;
+
+	kswapd_tsk = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (IS_ERR(kswapd_tsk)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
 		printk("Failed to start kswapd on node %d\n",nid);
+		pgdat->kswapd_wait = NULL;
+		kfree(kswapd_p);
 		ret = -1;
-	}
+	} else
+		kswapd_p->kswapd_task = kswapd_tsk;
 	return ret;
 }
 
@@ -2855,10 +2884,23 @@ int kswapd_run(int nid)
  */
 void kswapd_stop(int nid)
 {
-	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+	struct task_struct *kswapd_tsk = NULL;
+	struct kswapd *kswapd_p = NULL;
+	wait_queue_head_t *wait;
+
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	wait = pgdat->kswapd_wait;
+	if (wait) {
+		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+		kswapd_tsk = kswapd_p->kswapd_task;
+		kswapd_p->kswapd_task = NULL;
+	}
+
+	if (kswapd_tsk)
+		kthread_stop(kswapd_tsk);
 
-	if (kswapd)
-		kthread_stop(kswapd);
+	kfree(kswapd_p);
 }
 
 static int __init kswapd_init(void)
-- 
1.7.3.1

* [PATCH V6 02/10] Add per memcg reclaim watermarks
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
  2011-04-19  3:57 ` [PATCH V6 01/10] Add kswapd descriptor Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-19  3:57 ` [PATCH V6 03/10] New APIs to adjust per-memcg wmarks Ying Han
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Two watermarks are added per-memcg: "high_wmark" and "low_wmark". The per-memcg
kswapd is invoked when the memcg's memory usage (usage_in_bytes) rises above the
low_wmark, and the kswapd thread then reclaims pages until the usage drops below
the high_wmark.

Each watermark is calculated from the memcg's hard_limit (limit_in_bytes); each
time the hard_limit is changed, the corresponding wmarks are re-calculated. Since
the memory controller charges only user pages, there is no need for a
"min_wmark". The wmarks are currently derived from the individual tunables
low/high_wmark_distance, which default to 0.
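
As a sketch of the intended semantics (wakeup_memcg_kswapd() and reclaim_pages()
below are placeholder names, not functions added by this series):

	/* start background reclaim once usage climbs above low_wmark */
	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
		wakeup_memcg_kswapd(mem);

	/* ... and keep reclaiming until usage drops back below high_wmark */
	while (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
		reclaim_pages(mem);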

changelog v5..v4:
1. rename res_counter_low_wmark_limit_locked().
2. rename res_counter_high_wmark_limit_locked().

changelog v4..v3:
1. remove legacy comments
2. rename the res_counter_check_under_high_wmark_limit
3. replace the wmark_ratio per-memcg by individual tunable for both wmarks.
4. add comments on low/high_wmark
5. add individual tunables for low/high_wmarks and remove wmark_ratio
6. replace the mem_cgroup_get_limit() call with res_counter_read_u64(); the
former returns a large value with swap on.

changelog v3..v2:
1. Add VM_BUG_ON() on couple of places.
2. Remove the spinlock on min_free_kbytes since reading stale data there has no
serious consequence.
3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based on
hard_limit.

changelog v2..v1:
1. Remove the res_counter_charge on wmark due to performance concern.
2. Move the new APIs min_free_kbytes, reclaim_wmarks into a separate commit.
3. Calculate the min_free_kbytes automatically based on the limit_in_bytes.
4. make the wmark consistent with the core VM, which checks free pages instead
of usage.
5. changed wmark to be boolean

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h  |    1 +
 include/linux/res_counter.h |   78 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    6 +++
 mm/memcontrol.c             |   48 ++++++++++++++++++++++++++
 4 files changed, 133 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..3ece36d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..669f199 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -39,6 +39,14 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit that reclaim triggers.
+	 */
+	unsigned long long low_wmark_limit;
+	/*
+	 * the limit that reclaim stops.
+	 */
+	unsigned long long high_wmark_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -55,6 +63,9 @@ struct res_counter {
 
 #define RESOURCE_MAX (unsigned long long)LLONG_MAX
 
+#define CHARGE_WMARK_LOW	0x01
+#define CHARGE_WMARK_HIGH	0x02
+
 /**
  * Helpers to interact with userspace
  * res_counter_read_u64() - returns the value of the specified member.
@@ -92,6 +103,8 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_WMARK_LIMIT,
+	RES_HIGH_WMARK_LIMIT
 };
 
 /*
@@ -147,6 +160,24 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
 	return margin;
 }
 
+static inline bool
+res_counter_under_high_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->high_wmark_limit)
+		return true;
+
+	return false;
+}
+
+static inline bool
+res_counter_under_low_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->low_wmark_limit)
+		return true;
+
+	return false;
+}
+
 /**
  * Get the difference between the usage and the soft limit
  * @cnt: The counter
@@ -169,6 +200,30 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+static inline bool
+res_counter_under_low_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_under_low_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool
+res_counter_under_high_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_under_high_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -214,4 +269,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_high_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
+static inline int
+res_counter_set_low_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..206a724 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
+	counter->low_wmark_limit = RESOURCE_MAX;
+	counter->high_wmark_limit = RESOURCE_MAX;
 	counter->parent = parent;
 }
 
@@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_WMARK_LIMIT:
+		return &counter->low_wmark_limit;
+	case RES_HIGH_WMARK_LIMIT:
+		return &counter->high_wmark_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4407dd0..1ec4014 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -272,6 +272,12 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	/*
+	 * used to calculate the low/high_wmarks based on the limit_in_bytes.
+	 */
+	u64 high_wmark_distance;
+	u64 low_wmark_distance;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -813,6 +819,25 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	u64 limit;
+
+	limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+	if (mem->high_wmark_distance == 0) {
+		res_counter_set_low_wmark_limit(&mem->res, limit);
+		res_counter_set_high_wmark_limit(&mem->res, limit);
+	} else {
+		u64 low_wmark, high_wmark;
+
+		low_wmark = limit - mem->low_wmark_distance;
+		high_wmark = limit - mem->high_wmark_distance;
+
+		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
+		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+	}
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -3205,6 +3230,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3264,6 +3290,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -4521,6 +4548,27 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+/*
+ * We use low_wmark and high_wmark for triggering per-memcg kswapd.
+ * The reclaim is triggered by low_wmark (usage > low_wmark) and stopped
+ * by high_wmark (usage < high_wmark).
+ */
+int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
+				int charge_flags)
+{
+	long ret = 0;
+	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
+
+	VM_BUG_ON((charge_flags & flags) == flags);
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		ret = res_counter_under_low_wmark_limit(&mem->res);
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		ret = res_counter_under_high_wmark_limit(&mem->res);
+
+	return ret;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
-- 
1.7.3.1

* [PATCH V6 03/10] New APIs to adjust per-memcg wmarks
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
  2011-04-19  3:57 ` [PATCH V6 01/10] Add kswapd descriptor Ying Han
  2011-04-19  3:57 ` [PATCH V6 02/10] Add per memcg reclaim watermarks Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-19  3:57 ` [PATCH V6 04/10] Infrastructure to support per-memcg reclaim Ying Han
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Add the per-memcg APIs memory.low_wmark_distance, memory.high_wmark_distance and
memory.reclaim_wmarks. The first two adjust the internal low/high wmark
calculation, and reclaim_wmarks exports the current watermark values.

The low/high_wmark is calculated by subtracting the corresponding distance from
the hard_limit (limit_in_bytes). When configuring the distances, the user must
set high_wmark_distance before low_wmark_distance, and must zero
low_wmark_distance before zeroing high_wmark_distance.

$ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
$ cat /dev/cgroup/A/memory.limit_in_bytes
524288000

$ echo 50m >/dev/cgroup/A/memory.high_wmark_distance
$ echo 40m >/dev/cgroup/A/memory.low_wmark_distance

$ cat /dev/cgroup/A/memory.reclaim_wmarks
low_wmark 482344960
high_wmark 471859200
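
Clearing the distances has to happen in the opposite order, per the rule above:

$ echo 0 >/dev/cgroup/A/memory.low_wmark_distance
$ echo 0 >/dev/cgroup/A/memory.high_wmark_distance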

changelog v5..v4:
1. add sanity check for setting high/low_wmark_distance for root cgroup.

changelog v4..v3:
1. replace the "wmark_ratio" API with individual tunable for low/high_wmarks.

changelog v3..v2:
1. replace the "min_free_kbytes" api with "wmark_ratio". This is part of
feedbacks

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |  101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ec4014..76ad009 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3974,6 +3974,78 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 mem_cgroup_high_wmark_distance_read(struct cgroup *cgrp,
+					       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->high_wmark_distance;
+}
+
+static u64 mem_cgroup_low_wmark_distance_read(struct cgroup *cgrp,
+					      struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->low_wmark_distance;
+}
+
+static int mem_cgroup_high_wmark_distance_write(struct cgroup *cont,
+						struct cftype *cft,
+						const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	u64 low_wmark_distance = memcg->low_wmark_distance;
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	if (!cont->parent)
+		return -EINVAL;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if ((val >= limit) || (val < low_wmark_distance) ||
+	   (low_wmark_distance && val == low_wmark_distance))
+		return -EINVAL;
+
+	memcg->high_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
+static int mem_cgroup_low_wmark_distance_write(struct cgroup *cont,
+					       struct cftype *cft,
+					       const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	u64 high_wmark_distance = memcg->high_wmark_distance;
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	if (!cont->parent)
+		return -EINVAL;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if ((val >= limit) || (val > high_wmark_distance) ||
+	    (high_wmark_distance && val == high_wmark_distance))
+		return -EINVAL;
+
+	memcg->low_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4265,6 +4337,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 	mutex_unlock(&memcg_oom_mutex);
 }
 
+static int mem_cgroup_wmark_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	u64 low_wmark, high_wmark;
+
+	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+
+	cb->fill(cb, "low_wmark", low_wmark);
+	cb->fill(cb, "high_wmark", high_wmark);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4368,6 +4455,20 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "high_wmark_distance",
+		.write_string = mem_cgroup_high_wmark_distance_write,
+		.read_u64 = mem_cgroup_high_wmark_distance_read,
+	},
+	{
+		.name = "low_wmark_distance",
+		.write_string = mem_cgroup_low_wmark_distance_write,
+		.read_u64 = mem_cgroup_low_wmark_distance_read,
+	},
+	{
+		.name = "reclaim_wmarks",
+		.read_map = mem_cgroup_wmark_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1

* [PATCH V6 04/10] Infrastructure to support per-memcg reclaim.
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (2 preceding siblings ...)
  2011-04-19  3:57 ` [PATCH V6 03/10] New APIs to adjust per-memcg wmarks Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-19  3:57 ` [PATCH V6 05/10] Implement the select_victim_node within memcg Ying Han
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Add a kswapd_mem field to the kswapd descriptor, linking the kswapd kernel
thread to a memcg. The per-memcg kswapd sleeps on the wait queue headed at the
kswapd_wait field of the kswapd descriptor.

The kswapd() function is now shared between the global and per-memcg kswapds. It
is passed the kswapd descriptor, which identifies either a node or a memcg. For
a per-memcg kswapd thread the new function balance_mem_cgroup_pgdat() is
invoked; its implementation comes in a following patch.
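
As a quick illustration of the changed entry points (the per-memcg call site is
only sketched here; it is wired up by a later patch in the series):

	kswapd_run(nid, NULL);	/* per-node kswapd, behaves as before */
	kswapd_run(0, mem);	/* per-memcg kswapd, named "memcg_<css_id>" */
	...
	kswapd_stop(0, mem);	/* nid is ignored when a memcg is passed */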

changelog v6..v5:
1. rename is_node_kswapd to is_global_kswapd to match the scanning_global_lru.
2. revert the sleeping_prematurely change, but keep the kswapd_try_to_sleep()
for memcg.

changelog v4..v3:
1. fix up the kswapd_run and kswapd_stop for online_pages() and offline_pages.
2. drop the PF_MEMALLOC flag for the memcg kswapd for now, per KAMEZAWA's request.

changelog v3..v2:
1. split off from the initial patch which includes all changes of the following
three patches.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    5 ++
 include/linux/swap.h       |    5 +-
 mm/memcontrol.c            |   29 ++++++++++
 mm/memory_hotplug.c        |    4 +-
 mm/vmscan.c                |  127 +++++++++++++++++++++++++++++++------------
 5 files changed, 130 insertions(+), 40 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3ece36d..f7ffd1f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+struct kswapd;
 
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
@@ -83,6 +84,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
+extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
+				  struct kswapd *kswapd_p);
+extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
+extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f43d406..17e0511 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -30,6 +30,7 @@ struct kswapd {
 	struct task_struct *kswapd_task;
 	wait_queue_head_t kswapd_wait;
 	pg_data_t *kswapd_pgdat;
+	struct mem_cgroup *kswapd_mem;
 };
 
 int kswapd(void *p);
@@ -303,8 +304,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
 }
 #endif
 
-extern int kswapd_run(int nid);
-extern void kswapd_stop(int nid);
+extern int kswapd_run(int nid, struct mem_cgroup *mem);
+extern void kswapd_stop(int nid, struct mem_cgroup *mem);
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 76ad009..8761a6f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -278,6 +278,8 @@ struct mem_cgroup {
 	 */
 	u64 high_wmark_distance;
 	u64 low_wmark_distance;
+
+	wait_queue_head_t *kswapd_wait;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -4670,6 +4672,33 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
 	return ret;
 }
 
+int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
+{
+	if (!mem || !kswapd_p)
+		return 0;
+
+	mem->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_mem = mem;
+
+	return css_id(&mem->css);
+}
+
+void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
+{
+	if (mem)
+		mem->kswapd_wait = NULL;
+
+	return;
+}
+
+wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return NULL;
+
+	return mem->kswapd_wait;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 321fc74..2f78ff6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -462,7 +462,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages)
 	setup_per_zone_wmarks();
 	calculate_zone_inactive_ratio(zone);
 	if (onlined_pages) {
-		kswapd_run(zone_to_nid(zone));
+		kswapd_run(zone_to_nid(zone), NULL);
 		node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
 	}
 
@@ -897,7 +897,7 @@ repeat:
 	calculate_zone_inactive_ratio(zone);
 	if (!node_present_pages(node)) {
 		node_clear_state(node, N_HIGH_MEMORY);
-		kswapd_stop(node);
+		kswapd_stop(node, NULL);
 	}
 
 	vm_total_pages = nr_free_pagecache_pages();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba5e591..0060d1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2241,6 +2241,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages > (present_pages >> 2);
 }
 
+#define is_global_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
+
 /* is kswapd sleeping prematurely? */
 static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
@@ -2583,6 +2585,11 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 
 	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 
+	if (!is_global_kswapd(kswapd_p)) {
+		schedule();
+		goto out;
+	}
+
 	/* Try to sleep for a short interval */
 	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
@@ -2614,9 +2621,16 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 		else
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
+out:
 	finish_wait(wait_h, &wait);
 }
 
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+							int order)
+{
+	return 0;
+}
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2636,6 +2650,7 @@ int kswapd(void *p)
 	int classzone_idx;
 	struct kswapd *kswapd_p = (struct kswapd *)p;
 	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
 	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 
@@ -2646,10 +2661,12 @@ int kswapd(void *p)
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
-	BUG_ON(pgdat->kswapd_wait != wait_h);
-	cpumask = cpumask_of_node(pgdat->node_id);
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
+	if (is_global_kswapd(kswapd_p)) {
+		BUG_ON(pgdat->kswapd_wait != wait_h);
+		cpumask = cpumask_of_node(pgdat->node_id);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(tsk, cpumask);
+	}
 	current->reclaim_state = &reclaim_state;
 
 	/*
@@ -2664,7 +2681,10 @@ int kswapd(void *p)
 	 * us from recursively trying to free more memory as we're
 	 * trying to free the first piece of memory in the first place).
 	 */
-	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	if (is_global_kswapd(kswapd_p))
+		tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	else
+		tsk->flags |= PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
 	order = 0;
@@ -2674,24 +2694,29 @@ int kswapd(void *p)
 		int new_classzone_idx;
 		int ret;
 
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order = 0;
-		pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		if (order < new_order || classzone_idx > new_classzone_idx) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation or has tigher zone constraints
-			 */
-			order = new_order;
-			classzone_idx = new_classzone_idx;
-		} else {
-			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
-			order = pgdat->kswapd_max_order;
-			classzone_idx = pgdat->classzone_idx;
+		if (is_global_kswapd(kswapd_p)) {
+			new_order = pgdat->kswapd_max_order;
+			new_classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		}
+			if (order < new_order ||
+					classzone_idx > new_classzone_idx) {
+				/*
+				 * Don't sleep if someone wants a larger 'order'
+				 * allocation or has tigher zone constraints
+				 */
+				order = new_order;
+				classzone_idx = new_classzone_idx;
+			} else {
+				kswapd_try_to_sleep(kswapd_p, order,
+						    classzone_idx);
+				order = pgdat->kswapd_max_order;
+				classzone_idx = pgdat->classzone_idx;
+				pgdat->kswapd_max_order = 0;
+				pgdat->classzone_idx = MAX_NR_ZONES - 1;
+			}
+		} else
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -2702,8 +2727,13 @@ int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			order = balance_pgdat(pgdat, order, &classzone_idx);
+			if (is_global_kswapd(kswapd_p)) {
+				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
+								order);
+				order = balance_pgdat(pgdat, order,
+							&classzone_idx);
+			} else
+				balance_mem_cgroup_pgdat(mem, order);
 		}
 	}
 	return 0;
@@ -2848,30 +2878,53 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
  * This kswapd start function will be called by init and node-hot-add.
  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
  */
-int kswapd_run(int nid)
+int kswapd_run(int nid, struct mem_cgroup *mem)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
 	struct task_struct *kswapd_tsk;
+	pg_data_t *pgdat = NULL;
 	struct kswapd *kswapd_p;
+	static char name[TASK_COMM_LEN];
+	int memcg_id = -1;
 	int ret = 0;
 
-	if (pgdat->kswapd_wait)
-		return 0;
+	if (!mem) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kswapd_wait)
+			return ret;
+	}
 
 	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
 	if (!kswapd_p)
 		return -ENOMEM;
 
 	init_waitqueue_head(&kswapd_p->kswapd_wait);
-	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
-	kswapd_p->kswapd_pgdat = pgdat;
 
-	kswapd_tsk = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (!mem) {
+		pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+		kswapd_p->kswapd_pgdat = pgdat;
+		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
+	} else {
+		memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
+		if (!memcg_id) {
+			kfree(kswapd_p);
+			return ret;
+		}
+		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
+	}
+
+	kswapd_tsk = kthread_run(kswapd, kswapd_p, name);
 	if (IS_ERR(kswapd_tsk)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
-		printk("Failed to start kswapd on node %d\n",nid);
-		pgdat->kswapd_wait = NULL;
+		if (!mem) {
+			printk(KERN_ERR "Failed to start kswapd on node %d\n",
+								nid);
+			pgdat->kswapd_wait = NULL;
+		} else {
+			printk(KERN_ERR "Failed to start kswapd on memcg %d\n",
+								memcg_id);
+			mem_cgroup_clear_kswapd(mem);
+		}
 		kfree(kswapd_p);
 		ret = -1;
 	} else
@@ -2882,15 +2935,17 @@ int kswapd_run(int nid)
 /*
  * Called by memory hotplug when all memory in a node is offlined.
  */
-void kswapd_stop(int nid)
+void kswapd_stop(int nid, struct mem_cgroup *mem)
 {
 	struct task_struct *kswapd_tsk = NULL;
 	struct kswapd *kswapd_p = NULL;
 	wait_queue_head_t *wait;
 
-	pg_data_t *pgdat = NODE_DATA(nid);
+	if (!mem)
+		wait = NODE_DATA(nid)->kswapd_wait;
+	else
+		wait = mem_cgroup_kswapd_wait(mem);
 
-	wait = pgdat->kswapd_wait;
 	if (wait) {
 		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
 		kswapd_tsk = kswapd_p->kswapd_task;
@@ -2909,7 +2964,7 @@ static int __init kswapd_init(void)
 
 	swap_setup();
 	for_each_node_state(nid, N_HIGH_MEMORY)
- 		kswapd_run(nid);
+		kswapd_run(nid, NULL);
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
-- 
1.7.3.1

* [PATCH V6 05/10] Implement the select_victim_node within memcg.
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (3 preceding siblings ...)
  2011-04-19  3:57 ` [PATCH V6 04/10] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-19  3:57 ` [PATCH V6 06/10] Per-memcg background reclaim Ying Han
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This adds the node-selection mechanism for background reclaim: we remember the
last scanned node and always start from the next one on each invocation. This
simple round-robin fashion provides fairness between nodes for each memcg. For
example, with online nodes {0, 1, 2} and last_scanned_node == 1, successive
calls pick node 2, then 0, then 1, and so on.
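
Roughly how the background-reclaim loop (added later in this series) is expected
to drive it:

	nodemask_t do_nodes = node_states[N_ONLINE];
	int nid = mem_cgroup_select_victim_node(mem, &do_nodes);
	/* reclaim from NODE_DATA(nid), then call again for the next node */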

changelog v6..v5:
1. fix the correct comment style.

changelog v5..v4:
1. initialize the last_scanned_node to MAX_NUMNODES.

changelog v4..v3:
1. split off from the per-memcg background reclaim patch.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    3 +++
 mm/memcontrol.c            |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f7ffd1f..d4ff7f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
 				  struct kswapd *kswapd_p);
 extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
 extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
+extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
+extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
+					const nodemask_t *nodes);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8761a6f..06fddd2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -279,6 +279,12 @@ struct mem_cgroup {
 	u64 high_wmark_distance;
 	u64 low_wmark_distance;
 
+	/*
+	 * While doing per cgroup background reclaim, we cache the
+	 * last node we reclaimed from
+	 */
+	int last_scanned_node;
+
 	wait_queue_head_t *kswapd_wait;
 };
 
@@ -1536,6 +1542,27 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
+ * Visit the first node after the last_scanned_node of @mem and use that to
+ * reclaim free pages from.
+ */
+int
+mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
+{
+	int next_nid;
+	int last_scanned;
+
+	last_scanned = mem->last_scanned_node;
+	next_nid = next_node(last_scanned, *nodes);
+
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(*nodes);
+
+	mem->last_scanned_node = next_nid;
+
+	return next_nid;
+}
+
+/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -4699,6 +4726,14 @@ wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
 	return mem->kswapd_wait;
 }
 
+int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return -1;
+
+	return mem->last_scanned_node;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4774,6 +4809,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
 	if (parent)
-- 
1.7.3.1

* [PATCH V6 06/10] Per-memcg background reclaim.
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (4 preceding siblings ...)
  2011-04-19  3:57 ` [PATCH V6 05/10] Implement the select_victim_node within memcg Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-20  1:03   ` KAMEZAWA Hiroyuki
  2012-03-19  8:14   ` Zhu Yanhai
  2011-04-19  3:57 ` [PATCH V6 07/10] Add per-memcg zone "unreclaimable" Ying Han
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This is the main loop of per-memcg background reclaim, implemented in
balance_mem_cgroup_pgdat().

The function performs a priority loop similar to global reclaim. During each
iteration it invokes balance_pgdat_node() for each node on the system; that new
function performs the background reclaim per node. After reclaiming each node it
checks mem_cgroup_watermark_ok() and breaks out of the priority loop if that
returns true.
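
Condensed (loosely, omitting the node-mask bookkeeping and the retry when the
watermark is still not met), the control flow of the hunk below is:

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		/* walk the online nodes round-robin */
		nid = mem_cgroup_select_victim_node(mem_cont, &do_nodes);
		balance_pgdat_node(NODE_DATA(nid), order, &sc);
		if (mem_cgroup_watermark_ok(mem_cont, CHARGE_WMARK_HIGH))
			break;
	}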

changelog v6..v5:
1. add mem_cgroup_zone_reclaimable_pages()
2. fix some comment style.

changelog v5..v4:
1. remove duplicate check on nodes_empty()
2. add logic to check if the per-memcg lru is empty on the zone.

changelog v4..v3:
1. split the select_victim_node and zone_unreclaimable changes into separate
patches

changelog v3..v2:
1. change mz->all_unreclaimable to be boolean.
2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
3. some more clean-up.

changelog v2..v1:
1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
2. shared the kswapd_run/kswapd_stop for per-memcg and global background
reclaim.
3. name the per-memcg kswapd thread "memcg_<id>" (css->id); the global kswapd
keeps the same name.
4. fix a race on kswapd_stop where the per-memcg per-zone info could be accessed
after freeing.
5. add the fairness in zonelist where memcg remember the last zone reclaimed
from.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    9 +++
 mm/memcontrol.c            |   18 +++++
 mm/vmscan.c                |  151 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 178 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d4ff7f2..a4747b0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+						  struct zone *zone);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru);
@@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 }
 
 static inline unsigned long
+mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+				    struct zone *zone)
+{
+	return 0;
+}
+
+static inline unsigned long
 mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
 			 enum lru_list lru)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 06fddd2..7490147 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1097,6 +1097,24 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 	return (active > inactive);
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+						struct zone *zone)
+{
+	int nr;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
+	     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
+		      MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
+
+	return nr;
+}
+
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0060d1e..2a5c734 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -47,6 +47,8 @@
 
 #include <linux/swapops.h>
 
+#include <linux/res_counter.h>
+
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -111,6 +113,8 @@ struct scan_control {
 	 * are scanned.
 	 */
 	nodemask_t	*nodemask;
+
+	int priority;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -2625,11 +2629,158 @@ out:
 	finish_wait(wait_h, &wait);
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * The function is used for per-memcg LRU. It scanns all the zones of the
+ * node and returns the nr_scanned and nr_reclaimed.
+ */
+static void balance_pgdat_node(pg_data_t *pgdat, int order,
+					struct scan_control *sc)
+{
+	int i;
+	unsigned long total_scanned = 0;
+	struct mem_cgroup *mem_cont = sc->mem_cgroup;
+	int priority = sc->priority;
+
+	/*
+	 * This dma->highmem order is consistant with global reclaim.
+	 * We do this because the page allocator works in the opposite
+	 * direction although memcg user pages are mostly allocated at
+	 * highmem.
+	 */
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+		unsigned long scan = 0;
+
+		scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
+		if (!scan)
+			continue;
+
+		sc->nr_scanned = 0;
+		shrink_zone(priority, zone, sc);
+		total_scanned += sc->nr_scanned;
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
+			sc->may_writepage = 1;
+		}
+	}
+
+	sc->nr_scanned = total_scanned;
+}
+
+/*
+ * Per cgroup background reclaim.
+ * TODO: Take off the order since memcg always do order 0
+ */
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+					      int order)
+{
+	int i, nid;
+	int start_node;
+	int priority;
+	bool wmark_ok;
+	int loop;
+	pg_data_t *pgdat;
+	nodemask_t do_nodes;
+	unsigned long total_scanned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.swappiness = vm_swappiness,
+		.order = order,
+		.mem_cgroup = mem_cont,
+	};
+
+loop_again:
+	do_nodes = NODE_MASK_NONE;
+	sc.may_writepage = !laptop_mode;
+	sc.nr_reclaimed = 0;
+	total_scanned = 0;
+
+	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		sc.priority = priority;
+		wmark_ok = false;
+		loop = 0;
+
+		/* The swap token gets in the way of swapout... */
+		if (!priority)
+			disable_swap_token();
+
+		if (priority == DEF_PRIORITY)
+			do_nodes = node_states[N_ONLINE];
+
+		while (1) {
+			nid = mem_cgroup_select_victim_node(mem_cont,
+							&do_nodes);
+
+			/*
+			 * Indicate we have cycled the nodelist once
+			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
+			 * kswapd burning cpu cycles.
+			 */
+			if (loop == 0) {
+				start_node = nid;
+				loop++;
+			} else if (nid == start_node)
+				break;
+
+			pgdat = NODE_DATA(nid);
+			balance_pgdat_node(pgdat, order, &sc);
+			total_scanned += sc.nr_scanned;
+
+			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+				struct zone *zone = pgdat->node_zones + i;
+
+				if (!populated_zone(zone))
+					continue;
+			}
+			if (i < 0)
+				node_clear(nid, do_nodes);
+
+			if (mem_cgroup_watermark_ok(mem_cont,
+							CHARGE_WMARK_HIGH)) {
+				wmark_ok = true;
+				goto out;
+			}
+
+			if (nodes_empty(do_nodes)) {
+				wmark_ok = true;
+				goto out;
+			}
+		}
+
+		if (total_scanned && priority < DEF_PRIORITY - 2)
+			congestion_wait(WRITE, HZ/10);
+
+		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
+			break;
+	}
+out:
+	if (!wmark_ok) {
+		cond_resched();
+
+		try_to_freeze();
+
+		goto loop_again;
+	}
+
+	return sc.nr_reclaimed;
+}
+#else
 static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
 							int order)
 {
 	return 0;
 }
+#endif
 
 /*
  * The background pageout daemon, started as a kernel thread
-- 
1.7.3.1


* [PATCH V6 07/10] Add per-memcg zone "unreclaimable"
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (5 preceding siblings ...)
  2011-04-19  3:57 ` [PATCH V6 06/10] Per-memcg background reclaim Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-19  3:57 ` [PATCH V6 08/10] Enable per-memcg background reclaim Ying Han
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

After reclaiming each node per memcg, kswapd checks mem_cgroup_watermark_ok()
and breaks the priority loop if it returns true. The per-memcg zone will be
marked as "unreclaimable" if the scanning rate is much greater than the
reclaiming rate on the per-memcg LRU. The bit is cleared when a page charged
to the memcg is freed. Kswapd breaks the priority loop if all the zones are
marked as "unreclaimable".

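For reference, the test itself is just a ratio check. A condensed sketch,
using only names that appear in the hunks below (nothing new is introduced):

	/*
	 * A per-memcg zone stays "reclaimable" as long as the pages scanned
	 * since the last reclaim stay below ZONE_RECLAIMABLE_RATE (6) times
	 * the reclaimable pages left on its per-memcg LRUs.
	 */
	if (!mem_cgroup_zone_reclaimable(mem_cont, zone))
		mem_cgroup_mz_set_unreclaimable(mem_cont, zone);

mem_cgroup_zone_reclaimable() simply compares mz->pages_scanned against
mem_cgroup_zone_reclaimable_pages(mem, zone) * ZONE_RECLAIMABLE_RATE.
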
changelog v6..v5:
1. make the global zone_reclaimable() check use the shared ZONE_RECLAIMABLE_RATE.
2. add comment on the zone_unreclaimable

changelog v5..v4:
1. reduce the frequency of updating mz->unreclaimable bit by using the existing
memcg batch in task struct.
2. add new function mem_cgroup_mz_clear_unreclaimable() for clearing a zone's unreclaimable state.

changelog v4..v3:
1. split off from the per-memcg background reclaim patch in V3.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |   40 +++++++++++++++
 include/linux/sched.h      |    1 +
 include/linux/swap.h       |    2 +
 mm/memcontrol.c            |  118 +++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   25 +++++++++-
 5 files changed, 184 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a4747b0..29bbca2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -157,6 +157,14 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, struct zone *zone);
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
+void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+					struct zone *zone);
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+					unsigned long nr_scanned);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
@@ -354,6 +362,38 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem,
+					       struct zone *zone)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
+						struct zone *zone)
+{
+	return false;
+}
+
+static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
+							struct zone *zone)
+{
+}
+
+static inline void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem,
+							struct page *page)
+{
+}
+
+static inline void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+							struct zone *zone)
+{
+}
+static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
+						struct zone *zone,
+						unsigned long nr_scanned)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 98fc7ed..3370c5a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1526,6 +1526,7 @@ struct task_struct {
 		struct mem_cgroup *memcg; /* target memcg of uncharge */
 		unsigned long nr_pages;	/* uncharged usage */
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
+		struct zone *zone; /* zone of the last uncharged page */
 	} memcg_batch;
 #endif
 };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 17e0511..319b800 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -160,6 +160,8 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
+#define ZONE_RECLAIMABLE_RATE 6
+
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7490147..0dfdf27 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
 	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	unsigned long		pages_scanned;	/* since last reclaim */
+	bool			all_unreclaimable;	/* All pages pinned */
 };
+
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
@@ -1154,6 +1157,103 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone *zone,
+						unsigned long nr_scanned)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->pages_scanned += nr_scanned;
+}
+
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->pages_scanned <
+				mem_cgroup_zone_reclaimable_pages(mem, zone) *
+				ZONE_RECLAIMABLE_RATE;
+	return 0;
+}
+
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return false;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->all_unreclaimable;
+
+	return false;
+}
+
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->all_unreclaimable = true;
+}
+
+void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+				       struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return;
+
+	mz = page_cgroup_zoneinfo(mem, page);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -2701,6 +2801,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 
 static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 				   unsigned int nr_pages,
+				   struct page *page,
 				   const enum charge_type ctype)
 {
 	struct memcg_batch_info *batch = NULL;
@@ -2718,6 +2819,10 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 	 */
 	if (!batch->memcg)
 		batch->memcg = mem;
+
+	if (!batch->zone)
+		batch->zone = page_zone(page);
+
 	/*
 	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
 	 * In those cases, all pages freed continously can be expected to be in
@@ -2739,12 +2844,17 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 	 */
 	if (batch->memcg != mem)
 		goto direct_uncharge;
+
+	if (batch->zone != page_zone(page))
+		mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
+
 	/* remember freed charge and uncharge it later */
 	batch->nr_pages++;
 	if (uncharge_memsw)
 		batch->memsw_nr_pages++;
 	return;
 direct_uncharge:
+	mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
 	res_counter_uncharge(&mem->res, nr_pages * PAGE_SIZE);
 	if (uncharge_memsw)
 		res_counter_uncharge(&mem->memsw, nr_pages * PAGE_SIZE);
@@ -2826,7 +2936,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 		mem_cgroup_get(mem);
 	}
 	if (!mem_cgroup_is_root(mem))
-		mem_cgroup_do_uncharge(mem, nr_pages, ctype);
+		mem_cgroup_do_uncharge(mem, nr_pages, page, ctype);
 
 	return mem;
 
@@ -2894,6 +3004,10 @@ void mem_cgroup_uncharge_end(void)
 	if (batch->memsw_nr_pages)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
+	if (batch->zone)
+		mem_cgroup_mz_clear_unreclaimable(batch->memcg, batch->zone);
+	batch->zone = NULL;
+
 	memcg_oom_recover(batch->memcg);
 	/* forget this pointer (for sanity check) */
 	batch->memcg = NULL;
@@ -4589,6 +4703,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
 	}
 	return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2a5c734..ed4622b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
+
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
+
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
@@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
@@ -1989,7 +1993,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 
 static bool zone_reclaimable(struct zone *zone)
 {
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+	return zone->pages_scanned < zone_reclaimable_pages(zone) *
+					ZONE_RECLAIMABLE_RATE;
 }
 
 /*
@@ -2656,10 +2661,20 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
 		if (!scan)
 			continue;
 
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+			priority != DEF_PRIORITY)
+			continue;
+
 		sc->nr_scanned = 0;
 		shrink_zone(priority, zone, sc);
 		total_scanned += sc->nr_scanned;
 
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
+			continue;
+
+		if (!mem_cgroup_zone_reclaimable(mem_cont, zone))
+			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
+
 		/*
 		 * If we've done a decent amount of scanning and
 		 * the reclaim ratio is low, start doing writepage
@@ -2736,11 +2751,19 @@ loop_again:
 			balance_pgdat_node(pgdat, order, &sc);
 			total_scanned += sc.nr_scanned;
 
+			/*
+			 * Clear the node from do_nodes if none of its
+			 * zones is reclaimable
+			 */
 			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
 				struct zone *zone = pgdat->node_zones + i;
 
 				if (!populated_zone(zone))
 					continue;
+
+				if (!mem_cgroup_mz_unreclaimable(mem_cont,
+								zone))
+					break;
 			}
 			if (i < 0)
 				node_clear(nid, do_nodes);
-- 
1.7.3.1


* [PATCH V6 08/10] Enable per-memcg background reclaim.
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (6 preceding siblings ...)
  2011-04-19  3:57 ` [PATCH V6 07/10] Add per-memcg zone "unreclaimable" Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-19  3:57 ` [PATCH V6 09/10] Add API to export per-memcg kswapd pid Ying Han
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

By default the per-memcg background reclaim is disabled when the limit_in_bytes
is set to the maximum. kswapd_run() is called when the memcg is being resized,
and kswapd_stop() is called when the memcg is being deleted.

The per-memcg kswapd is woken up based on the usage and the low_wmark, which is
checked once per 1024 increments per cpu. The memcg's kswapd is woken up if the
usage is larger than the low_wmark.

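A condensed sketch of the periodic check, using only functions from the hunks
below (not a literal copy of the final code):

	/* fires at most once per WMARK_EVENTS_TARGET (1024) page events per cpu */
	if (unlikely(__memcg_event_check(mem, MEM_CGROUP_WMARK_EVENTS_THRESH)))
		mem_cgroup_check_wmark(mem);

	static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
	{
		/* usage has crossed the low watermark: kick the memcg kswapd */
		if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
			wake_memcg_kswapd(mem);
	}
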
changelog v4..v3:
1. move kswapd_stop to mem_cgroup_destroy based on comments from KAMEZAWA
2. move kswapd_run to setup_mem_cgroup_wmark, since the actual watermarks
determine whether or not per-memcg background reclaim is enabled.

changelog v3..v2:
1. some clean-ups

changelog v2..v1:
1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
2. remove checking the wmark from per-page charging. now it checks the wmark
periodically based on the event counter.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0dfdf27..d5b284c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -105,10 +105,12 @@ enum mem_cgroup_events_index {
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
 	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_WMARK_EVENTS_THRESH,
 	MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
 #define SOFTLIMIT_EVENTS_TARGET (1024)
+#define WMARK_EVENTS_TARGET (1024)
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -371,6 +373,8 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 
+static void wake_memcg_kswapd(struct mem_cgroup *mem);
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
@@ -549,6 +553,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
+static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
+		wake_memcg_kswapd(mem);
+}
+
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -679,6 +689,9 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
 	case MEM_CGROUP_TARGET_SOFTLIMIT:
 		next = val + SOFTLIMIT_EVENTS_TARGET;
 		break;
+	case MEM_CGROUP_WMARK_EVENTS_THRESH:
+		next = val + WMARK_EVENTS_TARGET;
+		break;
 	default:
 		return;
 	}
@@ -702,6 +715,10 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
+		if (unlikely(__memcg_event_check(mem,
+			MEM_CGROUP_WMARK_EVENTS_THRESH))){
+			mem_cgroup_check_wmark(mem);
+		}
 	}
 }
 
@@ -846,6 +863,9 @@ static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
 
 		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
 		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+
+		if (!mem_cgroup_is_root(mem) && !mem->kswapd_wait)
+			kswapd_run(0, mem);
 	}
 }
 
@@ -4868,6 +4888,22 @@ int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
 	return mem->last_scanned_node;
 }
 
+static inline
+void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	wait_queue_head_t *wait;
+
+	if (!mem || !mem->high_wmark_distance)
+		return;
+
+	wait = mem->kswapd_wait;
+
+	if (!wait || !waitqueue_active(wait))
+		return;
+
+	wake_up_interruptible(wait);
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4971,6 +5007,7 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	kswapd_stop(0, mem);
 	mem_cgroup_put(mem);
 }
 
-- 
1.7.3.1


* [PATCH V6 09/10] Add API to export per-memcg kswapd pid.
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (7 preceding siblings ...)
  2011-04-19  3:57 ` [PATCH V6 08/10] Enable per-memcg background reclaim Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-20  1:15   ` KAMEZAWA Hiroyuki
  2011-04-19  3:57 ` [PATCH V6 10/10] Add some per-memcg stats Ying Han
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This adds the API which exports the per-memcg kswapd thread pid. The kswapd
thread is named "memcg_" + css_id, and the pid can be used to put the
kswapd thread into a cpu cgroup later.

$ mkdir /dev/cgroup/memory/A
$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_null 0

$ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
$ ps -ef | grep memcg
root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg

$ cat memory.kswapd_pid
memcg_3 6727

changelog v6..v5
1. Remove the legacy spinlock which has been removed from previous post.

changelog v5..v4
1. Initialize the memcg-kswapd pid to -1 instead of 0.
2. Remove the kswapds_spinlock.

changelog v4..v3
1. Add the API based on KAMEZAWA's request on patch v3.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |   31 +++++++++++++++++++++++++++++++
 1 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5b284c..0b108b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4533,6 +4533,33 @@ static int mem_cgroup_wmark_read(struct cgroup *cgrp,
 	return 0;
 }
 
+static int mem_cgroup_kswapd_pid_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	struct task_struct *kswapd_thr = NULL;
+	struct kswapd *kswapd_p = NULL;
+	wait_queue_head_t *wait;
+	char name[TASK_COMM_LEN];
+	pid_t pid = -1;
+
+	sprintf(name, "memcg_null");
+
+	wait = mem_cgroup_kswapd_wait(mem);
+	if (wait) {
+		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+		kswapd_thr = kswapd_p->kswapd_task;
+		if (kswapd_thr) {
+			get_task_comm(name, kswapd_thr);
+			pid = kswapd_thr->pid;
+		}
+	}
+
+	cb->fill(cb, name, pid);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4650,6 +4677,10 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "reclaim_wmarks",
 		.read_map = mem_cgroup_wmark_read,
 	},
+	{
+		.name = "kswapd_pid",
+		.read_map = mem_cgroup_kswapd_pid_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1


* [PATCH V6 10/10] Add some per-memcg stats
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (8 preceding siblings ...)
  2011-04-19  3:57 ` [PATCH V6 09/10] Add API to export per-memcg kswapd pid Ying Han
@ 2011-04-19  3:57 ` Ying Han
  2011-04-21  2:51 ` [PATCH V6 00/10] memcg: per cgroup background reclaim Johannes Weiner
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-19  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

A bunch of statistics are added to memory.stat to monitor per-cgroup
kswapd performance.

$ cat /dev/cgroup/yinghan/memory.stat
kswapd_steal 12588994
pg_pgsteal 0
kswapd_pgscan 18629519
pg_scan 0
pgrefill 2893517
pgoutrun 5342267948
allocstall 0

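Each counter is a per-cpu memcg event. For example, kswapd_steal/pg_pgsteal
are bumped from the reclaim path roughly as below (condensed from the vmscan.c
hunk in this patch; the helpers are thin this_cpu_add() wrappers):

	if (current_is_kswapd())
		mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
	else
		mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
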
changelog v2..v1:
1. account the stats as memcg events instead of stat counters.
2. add the stats in the Documentation

Signed-off-by: Ying Han <yinghan@google.com>
---
 Documentation/cgroups/memory.txt |   14 +++++++
 include/linux/memcontrol.h       |   51 +++++++++++++++++++++++++++
 mm/memcontrol.c                  |   72 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                      |   28 ++++++++++++--
 4 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b6ed61c..29dee73 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,13 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+kswapd_steal	- # of pages reclaimed by kswapd
+pg_pgsteal	- # of pages reclaimed by direct reclaim
+kswapd_pgscan	- # of pages scanned by kswapd
+pg_scan		- # of pages scanned by direct reclaim
+pgrefill	- # of pages scanned on the active list
+pgoutrun	- # of times background reclaim was triggered
+allocstall	- # of times direct reclaim was triggered
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -406,6 +413,13 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_kswapd_steal	- sum of all children's "kswapd_steal"
+total_pg_pgsteal	- sum of all children's "pg_pgsteal"
+total_kswapd_pgscan	- sum of all children's "kswapd_pgscan"
+total_pg_scan		- sum of all children's "pg_scan"
+total_pgrefill		- sum of all children's "pgrefill"
+total_pgoutrun		- sum of all children's "pgoutrun"
+total_allocstall	- sum of all children's "allocstall"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 29bbca2..3207dbf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -166,6 +166,15 @@ void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
 void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
 					unsigned long nr_scanned);
 
+/* background reclaim stats */
+void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pgrefill(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_outrun(struct mem_cgroup *memcg, int val);
+void mem_cgroup_alloc_stall(struct mem_cgroup *memcg, int val);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
 #endif
@@ -412,6 +421,48 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
 {
 }
 
+/* background reclaim stats */
+static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
+					   int val)
+{
+
+}
+
+static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
+				       int val)
+{
+
+}
+
+static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
+					    int val)
+{
+
+}
+
+static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
+					int val)
+{
+
+}
+
+static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
+				       int val)
+{
+
+}
+
+static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
+					int val)
+{
+
+}
+
+static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
+					  int val)
+{
+
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b108b9..84208bb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,13 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	MEM_CGROUP_EVENTS_KSWAPD_STEAL, /* # of pages reclaimed from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSTEAL, /* # of pages reclaimed from ttfp */
+	MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, /* # of pages scanned from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSCAN, /* # of pages scanned from ttfp */
+	MEM_CGROUP_EVENTS_PGREFILL, /* # of pages scanned on active list */
+	MEM_CGROUP_EVENTS_PGOUTRUN, /* # of triggers of background reclaim */
+	MEM_CGROUP_EVENTS_ALLOCSTALL, /* # of triggers of direct reclaim */
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -612,6 +619,41 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_kswapd_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_STEAL], val);
+}
+
+void mem_cgroup_pg_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSTEAL], val);
+}
+
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_PGSCAN], val);
+}
+
+void mem_cgroup_pg_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSCAN], val);
+}
+
+void mem_cgroup_pgrefill(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGREFILL], val);
+}
+
+void mem_cgroup_pg_outrun(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGOUTRUN], val);
+}
+
+void mem_cgroup_alloc_stall(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_ALLOCSTALL], val);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -3980,6 +4022,13 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_KSWAPD_STEAL,
+	MCS_PG_PGSTEAL,
+	MCS_KSWAPD_PGSCAN,
+	MCS_PG_PGSCAN,
+	MCS_PGREFILL,
+	MCS_PGOUTRUN,
+	MCS_ALLOCSTALL,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -4002,6 +4051,13 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"kswapd_steal", "total_kswapd_steal"},
+	{"pg_pgsteal", "total_pg_pgsteal"},
+	{"kswapd_pgscan", "total_kswapd_pgscan"},
+	{"pg_scan", "total_pg_scan"},
+	{"pgrefill", "total_pgrefill"},
+	{"pgoutrun", "total_pgoutrun"},
+	{"allocstall", "total_allocstall"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -4031,6 +4087,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
+	/* kswapd stat */
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_STEAL);
+	s->stat[MCS_KSWAPD_STEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSTEAL);
+	s->stat[MCS_PG_PGSTEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN);
+	s->stat[MCS_KSWAPD_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSCAN);
+	s->stat[MCS_PG_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGREFILL);
+	s->stat[MCS_PGREFILL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGOUTRUN);
+	s->stat[MCS_PGOUTRUN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_ALLOCSTALL);
+	s->stat[MCS_ALLOCSTALL] += val;
+
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ed4622b..c8f4ce5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1421,6 +1421,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_pgscan(sc->mem_cgroup, nr_scanned);
+		else
+			mem_cgroup_pg_pgscan(sc->mem_cgroup, nr_scanned);
 	}
 
 	if (nr_taken == 0) {
@@ -1441,9 +1445,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	local_irq_disable();
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	if (scanning_global_lru(sc)) {
+		if (current_is_kswapd())
+			__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+		__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	} else {
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
+		else
+			mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
+	}
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
@@ -1541,7 +1552,12 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
-	__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	if (scanning_global_lru(sc))
+		__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	else
+		mem_cgroup_pgrefill(sc->mem_cgroup, pgscanned);
+
+
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -nr_taken);
 	else
@@ -2055,6 +2071,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
+	else
+		mem_cgroup_alloc_stall(sc->mem_cgroup, 1);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
@@ -2720,6 +2738,8 @@ loop_again:
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
 
+	mem_cgroup_pg_outrun(mem_cont, 1);
+
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc.priority = priority;
 		wmark_ok = false;
-- 
1.7.3.1


* Re: [PATCH V6 06/10] Per-memcg background reclaim.
  2011-04-19  3:57 ` [PATCH V6 06/10] Per-memcg background reclaim Ying Han
@ 2011-04-20  1:03   ` KAMEZAWA Hiroyuki
  2011-04-20  3:25     ` Ying Han
  2011-04-20  4:20     ` Ying Han
  2012-03-19  8:14   ` Zhu Yanhai
  1 sibling, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-20  1:03 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Mon, 18 Apr 2011 20:57:42 -0700
Ying Han <yinghan@google.com> wrote:

> This is the main loop of per-memcg background reclaim which is implemented in
> function balance_mem_cgroup_pgdat().
> 
> The function performs a priority loop similar to global reclaim. During each
> iteration it invokes balance_pgdat_node() for all nodes on the system, which
> is another new function that performs background reclaim per node. After reclaiming
> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
> it returns true.
> 

Seems getting better. But some comments, below.


> changelog v6..v5:
> 1. add mem_cgroup_zone_reclaimable_pages()
> 2. fix some comment style.
> 
> changelog v5..v4:
> 1. remove duplicate check on nodes_empty()
> 2. add logic to check if the per-memcg lru is empty on the zone.
> 
> changelog v4..v3:
> 1. split the select_victim_node and zone_unreclaimable to a seperate patches
> 2. remove the logic tries to do zone balancing.
> 
> changelog v3..v2:
> 1. change mz->all_unreclaimable to be boolean.
> 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
> 3. some more clean-up.
> 
> changelog v2..v1:
> 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> reclaim.
> 3. name the per-memcg memcg as "memcg-id" (css->id). And the global kswapd
> keeps the same name.
> 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
> after freeing.
> 5. add the fairness in zonelist where memcg remember the last zone reclaimed
> from.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h |    9 +++
>  mm/memcontrol.c            |   18 +++++
>  mm/vmscan.c                |  151 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 178 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d4ff7f2..a4747b0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>   */
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +						  struct zone *zone);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>  				       struct zone *zone,
>  				       enum lru_list lru);
> @@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>  }
>  
>  static inline unsigned long
> +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +				    struct zone *zone)
> +{
> +	return 0;
> +}
> +
> +static inline unsigned long
>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>  			 enum lru_list lru)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 06fddd2..7490147 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1097,6 +1097,24 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>  	return (active > inactive);
>  }
>  
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +						struct zone *zone)
> +{
> +	int nr;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> +	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> +	     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> +
> +	if (nr_swap_pages > 0)
> +		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> +		      MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>  				       struct zone *zone,
>  				       enum lru_list lru)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0060d1e..2a5c734 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -47,6 +47,8 @@
>  
>  #include <linux/swapops.h>
>  
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -111,6 +113,8 @@ struct scan_control {
>  	 * are scanned.
>  	 */
>  	nodemask_t	*nodemask;
> +
> +	int priority;
>  };
>  
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> @@ -2625,11 +2629,158 @@ out:
>  	finish_wait(wait_h, &wait);
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scans all the zones of the
> + * node and returns the nr_scanned and nr_reclaimed.
> + */
> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> +					struct scan_control *sc)
> +{


shrink_memcg_node() instead of balance_pgdat_node() ?

I guess the name is misleading.

> +	int i;
> +	unsigned long total_scanned = 0;
> +	struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +	int priority = sc->priority;
> +
> +	/*
> +	 * This dma->highmem order is consistent with global reclaim.
> +	 * We do this because the page allocator works in the opposite
> +	 * direction although memcg user pages are mostly allocated at
> +	 * highmem.
> +	 */
> +	for (i = 0; i < pgdat->nr_zones; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +		unsigned long scan = 0;
> +
> +		scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
> +		if (!scan)
> +			continue;
> +
> +		sc->nr_scanned = 0;
> +		shrink_zone(priority, zone, sc);
> +		total_scanned += sc->nr_scanned;
> +
> +		/*
> +		 * If we've done a decent amount of scanning and
> +		 * the reclaim ratio is low, start doing writepage
> +		 * even in laptop mode
> +		 */
> +		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> +			sc->may_writepage = 1;
> +		}
> +	}
> +
> +	sc->nr_scanned = total_scanned;
> +}
> +
> +/*
> + * Per cgroup background reclaim.
> + * TODO: Take off the order since memcg always do order 0
> + */
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +					      int order)

Here, too. shrink_mem_cgroup() may be straightforward.


> +{
> +	int i, nid;
> +	int start_node;
> +	int priority;
> +	bool wmark_ok;
> +	int loop;
> +	pg_data_t *pgdat;
> +	nodemask_t do_nodes;
> +	unsigned long total_scanned;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> +		.swappiness = vm_swappiness,
> +		.order = order,
> +		.mem_cgroup = mem_cont,
> +	};
> +
> +loop_again:
> +	do_nodes = NODE_MASK_NONE;
> +	sc.may_writepage = !laptop_mode;

Even with !laptop_mode, "write_page since the 1st scan" should be avoided.
How about sc.may_writepage = 1 when we do "goto loop_again;" ?


> +	sc.nr_reclaimed = 0;
> +	total_scanned = 0;
> +
> +	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +		sc.priority = priority;
> +		wmark_ok = false;
> +		loop = 0;
> +
> +		/* The swap token gets in the way of swapout... */
> +		if (!priority)
> +			disable_swap_token();
> +
> +		if (priority == DEF_PRIORITY)
> +			do_nodes = node_states[N_ONLINE];

This can be moved out from the loop.

> +
> +		while (1) {
> +			nid = mem_cgroup_select_victim_node(mem_cont,
> +							&do_nodes);
> +
> +			/*
> +			 * Indicate we have cycled the nodelist once
> +			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
> +			 * kswapd burning cpu cycles.
> +			 */
> +			if (loop == 0) {
> +				start_node = nid;
> +				loop++;
> +			} else if (nid == start_node)
> +				break;
> +

Hmm...let me try a different style.
==
	start_node = mem_cgroup_select_victim_node(mem_cont, &do_nodes);
	for (nid = start_node;
             nid != start_node && !node_empty(do_nodes);
             nid = mem_cgroup_select_victim_node(mem_cont, &do_nodes)) {

		shrink_memcg_node(NODE_DATA(nid), order, &sc);
		total_scanned += sc.nr_scanned;
		for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
			if (populated_zone(NODE_DATA(nid)->node_zones + i))
				break;
		}
		if (i == NODE_DATA(nid)->nr_zones)
			node_clear(nid, do_nodes);
		if (mem_cgroup_watermark_ok(mem_cont, CHARGE_WMARK_HIGH))
			break;
	}
==

In short, I like for() loop rather than while(1) because next calculation and
end condition are clear.



> +			pgdat = NODE_DATA(nid);
> +			balance_pgdat_node(pgdat, order, &sc);
> +			total_scanned += sc.nr_scanned;
> +
> +			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +				struct zone *zone = pgdat->node_zones + i;
> +
> +				if (!populated_zone(zone))
> +					continue;
> +			}
> +			if (i < 0)
> +				node_clear(nid, do_nodes);
Isn't this wrong ? I guess
		if (populated_zone(zone))
			break;
is what you want to do.

Thanks,
-Kame
> +
> +			if (mem_cgroup_watermark_ok(mem_cont,
> +							CHARGE_WMARK_HIGH)) {
> +				wmark_ok = true;
> +				goto out;
> +			}
> +
> +			if (nodes_empty(do_nodes)) {
> +				wmark_ok = true;
> +				goto out;
> +			}
> +		}
> +
> +		if (total_scanned && priority < DEF_PRIORITY - 2)
> +			congestion_wait(WRITE, HZ/10);
> +
> +		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> +			break;
> +	}
> +out:
> +	if (!wmark_ok) {
> +		cond_resched();
> +
> +		try_to_freeze();
> +
> +		goto loop_again;
> +	}
> +
> +	return sc.nr_reclaimed;
> +}
> +#else
>  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
>  							int order)
>  {
>  	return 0;
>  }
> +#endif
>  
>  /*
>   * The background pageout daemon, started as a kernel thread
> -- 
> 1.7.3.1
> 
> 


* Re: [PATCH V6 09/10] Add API to export per-memcg kswapd pid.
  2011-04-19  3:57 ` [PATCH V6 09/10] Add API to export per-memcg kswapd pid Ying Han
@ 2011-04-20  1:15   ` KAMEZAWA Hiroyuki
  2011-04-20  3:39     ` Ying Han
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-20  1:15 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Mon, 18 Apr 2011 20:57:45 -0700
Ying Han <yinghan@google.com> wrote:

> This adds the API which exports the per-memcg kswapd thread pid. The kswapd
> thread is named "memcg_" + css_id, and the pid can be used to put the
> kswapd thread into a cpu cgroup later.
> 
> $ mkdir /dev/cgroup/memory/A
> $ cat /dev/cgroup/memory/A/memory.kswapd_pid
> memcg_null 0
> 
> $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> $ ps -ef | grep memcg
> root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
> root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg
> 
> $ cat memory.kswapd_pid
> memcg_3 6727
> 
> changelog v6..v5
> 1. Remove the legacy spinlock which has been removed from previous post.
> 
> changelog v5..v4
> 1. Initialize the memcg-kswapd pid to -1 instead of 0.
> 2. Remove the kswapds_spinlock.
> 
> changelog v4..v3
> 1. Add the API based on KAMAZAWA's request on patch v3.
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Ying Han <yinghan@google.com>

I'm very sorry but please drop this. There is a discussion that
we should use thread pool rather than one-thread-per-one-memcg.
If so, we need to remove this interface and we'll see regression.

I think we need some control knobs as priority/share in thread pools finally...
(So, I want to use cpu cgroup.) If not, there will be unfair utilization of
cpu/thread. But for now, it seems adding this is too early.


> ---
>  mm/memcontrol.c |   31 +++++++++++++++++++++++++++++++
>  1 files changed, 31 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d5b284c..0b108b9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4533,6 +4533,33 @@ static int mem_cgroup_wmark_read(struct cgroup *cgrp,
>  	return 0;
>  }
>  
> +static int mem_cgroup_kswapd_pid_read(struct cgroup *cgrp,
> +	struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	struct task_struct *kswapd_thr = NULL;
> +	struct kswapd *kswapd_p = NULL;
> +	wait_queue_head_t *wait;
> +	char name[TASK_COMM_LEN];
> +	pid_t pid = -1;
> +
> +	sprintf(name, "memcg_null");
> +
> +	wait = mem_cgroup_kswapd_wait(mem);
> +	if (wait) {
> +		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +		kswapd_thr = kswapd_p->kswapd_task;
> +		if (kswapd_thr) {
> +			get_task_comm(name, kswapd_thr);
> +			pid = kswapd_thr->pid;
> +		}
> +	}
> +
> +	cb->fill(cb, name, pid);
> +
> +	return 0;
> +}
> +
>  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>  	struct cftype *cft,  struct cgroup_map_cb *cb)
>  {
> @@ -4650,6 +4677,10 @@ static struct cftype mem_cgroup_files[] = {
>  		.name = "reclaim_wmarks",
>  		.read_map = mem_cgroup_wmark_read,
>  	},
> +	{
> +		.name = "kswapd_pid",
> +		.read_map = mem_cgroup_kswapd_pid_read,
> +	},
>  };
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> -- 
> 1.7.3.1
> 
> 


* Re: [PATCH V6 06/10] Per-memcg background reclaim.
  2011-04-20  1:03   ` KAMEZAWA Hiroyuki
@ 2011-04-20  3:25     ` Ying Han
  2011-04-20  4:20     ` Ying Han
  1 sibling, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-20  3:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Tue, Apr 19, 2011 at 6:03 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 18 Apr 2011 20:57:42 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This is the main loop of per-memcg background reclaim which is
> implemented in
> > function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During
> each
> > iteration it invokes balance_pgdat_node() for all nodes on the system,
> which
> > is another new function that performs background reclaim per node. After
> reclaiming
> > each node, it checks mem_cgroup_watermark_ok() and breaks the priority
> loop if
> > it returns true.
> >
>
> Seems getting better. But some comments, below.
>

thank you for reviewing.

>
>
> > changelog v6..v5:
> > 1. add mem_cgroup_zone_reclaimable_pages()
> > 2. fix some comment style.
> >
> > changelog v5..v4:
> > 1. remove duplicate check on nodes_empty()
> > 2. add logic to check if the per-memcg lru is empty on the zone.
> >
> > changelog v4..v3:
> > 1. split the select_victim_node and zone_unreclaimable to a seperate
> patches
> > 2. remove the logic tries to do zone balancing.
> >
> > changelog v3..v2:
> > 1. change mz->all_unreclaimable to be boolean.
> > 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg
> reclaim.
> > 3. some more clean-up.
> >
> > changelog v2..v1:
> > 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> > 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> > reclaim.
> > 3. name the per-memcg memcg as "memcg-id" (css->id). And the global
> kswapd
> > keeps the same name.
> > 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be
> accessed
> > after freeing.
> > 5. add the fairness in zonelist where memcg remember the last zone
> reclaimed
> > from.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/memcontrol.h |    9 +++
> >  mm/memcontrol.c            |   18 +++++
> >  mm/vmscan.c                |  151
> ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 178 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index d4ff7f2..a4747b0 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct
> mem_cgroup *mem,
> >   */
> >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> >  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup
> *memcg,
> > +                                               struct zone *zone);
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                      struct zone *zone,
> >                                      enum lru_list lru);
> > @@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup
> *memcg)
> >  }
> >
> >  static inline unsigned long
> > +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> > +                                 struct zone *zone)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline unsigned long
> >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> >                        enum lru_list lru)
> >  {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 06fddd2..7490147 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1097,6 +1097,24 @@ int mem_cgroup_inactive_file_is_low(struct
> mem_cgroup *memcg)
> >       return (active > inactive);
> >  }
> >
> > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup
> *memcg,
> > +                                             struct zone *zone)
> > +{
> > +     int nr;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +     struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid,
> zid);
> > +
> > +     nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> > +          MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> > +
> > +     if (nr_swap_pages > 0)
> > +             nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> > +                   MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> > +
> > +     return nr;
> > +}
> > +
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                      struct zone *zone,
> >                                      enum lru_list lru)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0060d1e..2a5c734 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -47,6 +47,8 @@
> >
> >  #include <linux/swapops.h>
> >
> > +#include <linux/res_counter.h>
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -111,6 +113,8 @@ struct scan_control {
> >        * are scanned.
> >        */
> >       nodemask_t      *nodemask;
> > +
> > +     int priority;
> >  };
> >
> >  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> > @@ -2625,11 +2629,158 @@ out:
> >       finish_wait(wait_h, &wait);
> >  }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * The function is used for per-memcg LRU. It scans all the zones of
> the
> > + * node and returns the nr_scanned and nr_reclaimed.
> > + */
> > +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> > +                                     struct scan_control *sc)
> > +{
>
>
> shrink_memcg_node() instead of balance_pgdat_node() ?
>
> I guess the name is misleading.
>

ok. will make the change

>
> > +     int i;
> > +     unsigned long total_scanned = 0;
> > +     struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > +     int priority = sc->priority;
> > +
> > +     /*
> > +      * This dma->highmem order is consistent with global reclaim.
> > +      * We do this because the page allocator works in the opposite
> > +      * direction although memcg user pages are mostly allocated at
> > +      * highmem.
> > +      */
> > +     for (i = 0; i < pgdat->nr_zones; i++) {
> > +             struct zone *zone = pgdat->node_zones + i;
> > +             unsigned long scan = 0;
> > +
> > +             scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
> > +             if (!scan)
> > +                     continue;
> > +
> > +             sc->nr_scanned = 0;
> > +             shrink_zone(priority, zone, sc);
> > +             total_scanned += sc->nr_scanned;
> > +
> > +             /*
> > +              * If we've done a decent amount of scanning and
> > +              * the reclaim ratio is low, start doing writepage
> > +              * even in laptop mode
> > +              */
> > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> 2) {
> > +                     sc->may_writepage = 1;
> > +             }
> > +     }
> > +
> > +     sc->nr_scanned = total_scanned;
> > +}
> > +
> > +/*
> > + * Per cgroup background reclaim.
> > + * TODO: Take off the order since memcg always do order 0
> > + */
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> > +                                           int order)
>
> Here, too. shrink_mem_cgroup() may be straightforward.
>

will make the change.

>
>
> > +{
> > +     int i, nid;
> > +     int start_node;
> > +     int priority;
> > +     bool wmark_ok;
> > +     int loop;
> > +     pg_data_t *pgdat;
> > +     nodemask_t do_nodes;
> > +     unsigned long total_scanned;
> > +     struct scan_control sc = {
> > +             .gfp_mask = GFP_KERNEL,
> > +             .may_unmap = 1,
> > +             .may_swap = 1,
> > +             .nr_to_reclaim = SWAP_CLUSTER_MAX,
> > +             .swappiness = vm_swappiness,
> > +             .order = order,
> > +             .mem_cgroup = mem_cont,
> > +     };
> > +
> > +loop_again:
> > +     do_nodes = NODE_MASK_NONE;
> > +     sc.may_writepage = !laptop_mode;
>
> Even with !laptop_mode, "write_page since the 1st scan" should be avoided.
> How about sc.may_writepage = 1 when we do "goto loop_again;" ?
>
sounds a safe change to make. will add it.
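
For illustration, one way the agreed change could look (a sketch under the
assumption that the loop_again structure stays as in this patch; it is not
code from a later revision): initialize may_writepage once before the label
so that only the first pass honours laptop_mode.

==
	sc.may_writepage = !laptop_mode;	/* first pass only */
loop_again:
	do_nodes = NODE_MASK_NONE;
	sc.nr_reclaimed = 0;
	total_scanned = 0;
	/* ... priority loop as in the patch ... */
out:
	if (!wmark_ok) {
		cond_resched();
		try_to_freeze();
		sc.may_writepage = 1;	/* retry passes may write pages out */
		goto loop_again;
	}
==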
>


> > +     sc.nr_reclaimed = 0;
> > +     total_scanned = 0;
> > +
> > +     for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > +             sc.priority = priority;
> > +             wmark_ok = false;
> > +             loop = 0;
> > +
> > +             /* The swap token gets in the way of swapout... */
> > +             if (!priority)
> > +                     disable_swap_token();
> > +
> > +             if (priority == DEF_PRIORITY)
> > +                     do_nodes = node_states[N_ONLINE];
>
> This can be moved out from the loop.
>
ok. changed.

> > +
> > +             while (1) {
> > +                     nid = mem_cgroup_select_victim_node(mem_cont,
> > +                                                     &do_nodes);
> > +
> > +                     /*
> > +                      * Indicate we have cycled the nodelist once
> > +                      * TODO: we might add MAX_RECLAIM_LOOP for
> preventing
> > +                      * kswapd burning cpu cycles.
> > +                      */
> > +                     if (loop == 0) {
> > +                             start_node = nid;
> > +                             loop++;
> > +                     } else if (nid == start_node)
> > +                             break;
> > +
>
> Hmm...let me try a different style.
> ==
>        start_node = mem_cgroup_select_victim_node(mem_cont, &do_nodes);
>        for (nid = start_node;
>             nid != start_node && !node_empty(do_nodes);
>             nid = mem_cgroup_select_victim_node(mem_cont, &do_nodes)) {
>
>                shrink_memcg_node(NODE_DATA(nid), order, &sc);
>                 total_scanned += sc.nr_scanned;
>                 for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
>                        if (populated_zone(NODE_DATA(nid)->node_zones + i))
>                                break;
>                }
>                if (i == NODE_DATA(nid)->nr_zones)
>                         node_clear(nid, do_nodes);
>                if (mem_cgroup_watermark_ok(mem_cont, CHARGE_WMARK_HIGH))
>                         break;
>        }
> ==
>
> In short, I like for() loop rather than while(1) because next calculation
> and
> end condition are clear.
>

Ok. I should be able to make that change.

>
>
>
> > +                     pgdat = NODE_DATA(nid);
> > +                     balance_pgdat_node(pgdat, order, &sc);
> > +                     total_scanned += sc.nr_scanned;
> > +
> > +                     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +                             struct zone *zone = pgdat->node_zones + i;
> > +
> > +                             if (!populated_zone(zone))
> > +                                     continue;
> > +                     }
> > +                     if (i < 0)
> > +                             node_clear(nid, do_nodes);
> Isn't this wrong ? I guess
>                if (populated_zone(zone))
>                        break;
> is what you want to do.
>
hmm. you are right.
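
For clarity, a sketch of the agreed fix (illustration only, not the exact
code from a later revision): the node should be dropped from do_nodes only
when it has no populated zone left at all.

==
	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
		struct zone *zone = pgdat->node_zones + i;

		if (populated_zone(zone))
			break;	/* found a populated zone, keep this node */
	}
	if (i < 0)
		node_clear(nid, do_nodes);
==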

--Ying

> Thanks,
> -Kame
> > +
> > +                     if (mem_cgroup_watermark_ok(mem_cont,
> > +                                                     CHARGE_WMARK_HIGH))
> {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +
> > +                     if (nodes_empty(do_nodes)) {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +             }
> > +
> > +             if (total_scanned && priority < DEF_PRIORITY - 2)
> > +                     congestion_wait(WRITE, HZ/10);
> > +
> > +             if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > +                     break;
> > +     }
> > +out:
> > +     if (!wmark_ok) {
> > +             cond_resched();
> > +
> > +             try_to_freeze();
> > +
> > +             goto loop_again;
> > +     }
> > +
> > +     return sc.nr_reclaimed;
> > +}
> > +#else
> >  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> >                                                       int order)
> >  {
> >       return 0;
> >  }
> > +#endif
> >
> >  /*
> >   * The background pageout daemon, started as a kernel thread
> > --
> > 1.7.3.1
> >
> >
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 09/10] Add API to export per-memcg kswapd pid.
  2011-04-20  1:15   ` KAMEZAWA Hiroyuki
@ 2011-04-20  3:39     ` Ying Han
  0 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-20  3:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Tue, Apr 19, 2011 at 6:15 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 18 Apr 2011 20:57:45 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This add the API which exports per-memcg kswapd thread pid. The kswapd
> > thread is named as "memcg_" + css_id, and the pid can be used to put
> > kswapd thread into cpu cgroup later.
> >
> > $ mkdir /dev/cgroup/memory/A
> > $ cat /dev/cgroup/memory/A/memory.kswapd_pid
> > memcg_null 0
> >
> > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > $ ps -ef | grep memcg
> > root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
> > root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg
> >
> > $ cat memory.kswapd_pid
> > memcg_3 6727
> >
> > changelog v6..v5
> > 1. Remove the legacy spinlock which has been removed from previous post.
> >
> > changelog v5..v4
> > 1. Initialize the memcg-kswapd pid to -1 instead of 0.
> > 2. Remove the kswapds_spinlock.
> >
> > changelog v4..v3
> > 1. Add the API based on KAMAZAWA's request on patch v3.
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> I'm very sorry, but please drop this. There is a discussion that
> we should use a thread pool rather than one thread per memcg.
> If so, we would need to remove this interface and we'd see a regression.
>
> I think we will eventually need some control knobs such as priority/share
> in the thread pool...
> (So, I want to use the cpu cgroup.) If not, there will be unfair utilization
> of cpu/threads. But for now, it seems adding this is too early.
>

This patch is very self-contained and I have no problem dropping it for now.
I won't include it in my next post.

--Ying

>
>
> > ---
> >  mm/memcontrol.c |   31 +++++++++++++++++++++++++++++++
> >  1 files changed, 31 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d5b284c..0b108b9 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4533,6 +4533,33 @@ static int mem_cgroup_wmark_read(struct cgroup
> *cgrp,
> >       return 0;
> >  }
> >
> > +static int mem_cgroup_kswapd_pid_read(struct cgroup *cgrp,
> > +     struct cftype *cft,  struct cgroup_map_cb *cb)
> > +{
> > +     struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> > +     struct task_struct *kswapd_thr = NULL;
> > +     struct kswapd *kswapd_p = NULL;
> > +     wait_queue_head_t *wait;
> > +     char name[TASK_COMM_LEN];
> > +     pid_t pid = -1;
> > +
> > +     sprintf(name, "memcg_null");
> > +
> > +     wait = mem_cgroup_kswapd_wait(mem);
> > +     if (wait) {
> > +             kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> > +             kswapd_thr = kswapd_p->kswapd_task;
> > +             if (kswapd_thr) {
> > +                     get_task_comm(name, kswapd_thr);
> > +                     pid = kswapd_thr->pid;
> > +             }
> > +     }
> > +
> > +     cb->fill(cb, name, pid);
> > +
> > +     return 0;
> > +}
> > +
> >  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
> >       struct cftype *cft,  struct cgroup_map_cb *cb)
> >  {
> > @@ -4650,6 +4677,10 @@ static struct cftype mem_cgroup_files[] = {
> >               .name = "reclaim_wmarks",
> >               .read_map = mem_cgroup_wmark_read,
> >       },
> > +     {
> > +             .name = "kswapd_pid",
> > +             .read_map = mem_cgroup_kswapd_pid_read,
> > +     },
> >  };
> >
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > --
> > 1.7.3.1
> >
> >
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 06/10] Per-memcg background reclaim.
  2011-04-20  1:03   ` KAMEZAWA Hiroyuki
  2011-04-20  3:25     ` Ying Han
@ 2011-04-20  4:20     ` Ying Han
  1 sibling, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-20  4:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Tue, Apr 19, 2011 at 6:03 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 18 Apr 2011 20:57:42 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This is the main loop of per-memcg background reclaim which is
> implemented in
> > function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During
> each
> > iteration it invokes balance_pgdat_node() for all nodes on the system,
> which
> > is another new function performs background reclaim per node. After
> reclaiming
> > each node, it checks mem_cgroup_watermark_ok() and breaks the priority
> loop if
> > it returns true.
> >
>
> Seems getting better. But some comments, below.
>
>
> > changelog v6..v5:
> > 1. add mem_cgroup_zone_reclaimable_pages()
> > 2. fix some comment style.
> >
> > changelog v5..v4:
> > 1. remove duplicate check on nodes_empty()
> > 2. add logic to check if the per-memcg lru is empty on the zone.
> >
> > changelog v4..v3:
> > 1. split the select_victim_node and zone_unreclaimable to a seperate
> patches
> > 2. remove the logic tries to do zone balancing.
> >
> > changelog v3..v2:
> > 1. change mz->all_unreclaimable to be boolean.
> > 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg
> reclaim.
> > 3. some more clean-up.
> >
> > changelog v2..v1:
> > 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> > 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> > reclaim.
> > 3. name the per-memcg memcg as "memcg-id" (css->id). And the global
> kswapd
> > keeps the same name.
> > 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be
> accessed
> > after freeing.
> > 5. add the fairness in zonelist where memcg remember the last zone
> reclaimed
> > from.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/memcontrol.h |    9 +++
> >  mm/memcontrol.c            |   18 +++++
> >  mm/vmscan.c                |  151
> ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 178 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index d4ff7f2..a4747b0 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct
> mem_cgroup *mem,
> >   */
> >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> >  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup
> *memcg,
> > +                                               struct zone *zone);
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                      struct zone *zone,
> >                                      enum lru_list lru);
> > @@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup
> *memcg)
> >  }
> >
> >  static inline unsigned long
> > +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> > +                                 struct zone *zone)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline unsigned long
> >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> >                        enum lru_list lru)
> >  {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 06fddd2..7490147 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1097,6 +1097,24 @@ int mem_cgroup_inactive_file_is_low(struct
> mem_cgroup *memcg)
> >       return (active > inactive);
> >  }
> >
> > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup
> *memcg,
> > +                                             struct zone *zone)
> > +{
> > +     int nr;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +     struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid,
> zid);
> > +
> > +     nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> > +          MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> > +
> > +     if (nr_swap_pages > 0)
> > +             nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> > +                   MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> > +
> > +     return nr;
> > +}
> > +
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                      struct zone *zone,
> >                                      enum lru_list lru)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0060d1e..2a5c734 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -47,6 +47,8 @@
> >
> >  #include <linux/swapops.h>
> >
> > +#include <linux/res_counter.h>
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -111,6 +113,8 @@ struct scan_control {
> >        * are scanned.
> >        */
> >       nodemask_t      *nodemask;
> > +
> > +     int priority;
> >  };
> >
> >  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> > @@ -2625,11 +2629,158 @@ out:
> >       finish_wait(wait_h, &wait);
> >  }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * The function is used for per-memcg LRU. It scanns all the zones of
> the
> > + * node and returns the nr_scanned and nr_reclaimed.
> > + */
> > +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> > +                                     struct scan_control *sc)
> > +{
>
>
> shrink_memcg_node() instead of balance_pgdat_node() ?
>
> I guess the name is misleading.
>
> > +     int i;
> > +     unsigned long total_scanned = 0;
> > +     struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > +     int priority = sc->priority;
> > +
> > +     /*
> > +      * This dma->highmem order is consistant with global reclaim.
> > +      * We do this because the page allocator works in the opposite
> > +      * direction although memcg user pages are mostly allocated at
> > +      * highmem.
> > +      */
> > +     for (i = 0; i < pgdat->nr_zones; i++) {
> > +             struct zone *zone = pgdat->node_zones + i;
> > +             unsigned long scan = 0;
> > +
> > +             scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
> > +             if (!scan)
> > +                     continue;
> > +
> > +             sc->nr_scanned = 0;
> > +             shrink_zone(priority, zone, sc);
> > +             total_scanned += sc->nr_scanned;
> > +
> > +             /*
> > +              * If we've done a decent amount of scanning and
> > +              * the reclaim ratio is low, start doing writepage
> > +              * even in laptop mode
> > +              */
> > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> 2) {
> > +                     sc->may_writepage = 1;
> > +             }
> > +     }
> > +
> > +     sc->nr_scanned = total_scanned;
> > +}
> > +
> > +/*
> > + * Per cgroup background reclaim.
> > + * TODO: Take off the order since memcg always do order 0
> > + */
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> > +                                           int order)
>
> Here, too. shrink_mem_cgroup() may be straightforward.
>
>
> > +{
> > +     int i, nid;
> > +     int start_node;
> > +     int priority;
> > +     bool wmark_ok;
> > +     int loop;
> > +     pg_data_t *pgdat;
> > +     nodemask_t do_nodes;
> > +     unsigned long total_scanned;
> > +     struct scan_control sc = {
> > +             .gfp_mask = GFP_KERNEL,
> > +             .may_unmap = 1,
> > +             .may_swap = 1,
> > +             .nr_to_reclaim = SWAP_CLUSTER_MAX,
> > +             .swappiness = vm_swappiness,
> > +             .order = order,
> > +             .mem_cgroup = mem_cont,
> > +     };
> > +
> > +loop_again:
> > +     do_nodes = NODE_MASK_NONE;
> > +     sc.may_writepage = !laptop_mode;
>
> Even with !laptop_mode, "write_page since the 1st scan" should be avoided.
> How about sc.may_writepage = 1 when we do "goto loop_again;" ?
>
>
> > +     sc.nr_reclaimed = 0;
> > +     total_scanned = 0;
> > +
> > +     for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > +             sc.priority = priority;
> > +             wmark_ok = false;
> > +             loop = 0;
> > +
> > +             /* The swap token gets in the way of swapout... */
> > +             if (!priority)
> > +                     disable_swap_token();
> > +
> > +             if (priority == DEF_PRIORITY)
> > +                     do_nodes = node_states[N_ONLINE];
>
> This can be moved out from the loop.
>
> > +
> > +             while (1) {
> > +                     nid = mem_cgroup_select_victim_node(mem_cont,
> > +                                                     &do_nodes);
> > +
> > +                     /*
> > +                      * Indicate we have cycled the nodelist once
> > +                      * TODO: we might add MAX_RECLAIM_LOOP for
> preventing
> > +                      * kswapd burning cpu cycles.
> > +                      */
> > +                     if (loop == 0) {
> > +                             start_node = nid;
> > +                             loop++;
> > +                     } else if (nid == start_node)
> > +                             break;
> > +
>
> Hmm...let me try a different style.
> ==
>        start_node = mem_cgroup_select_victim_node(mem_cont, &do_nodes);
>        for (nid = start_node;
>             nid != start_node && !node_empty(do_nodes);
>             nid = mem_cgroup_select_victim_node(mem_cont, &do_nodes)) {
>
>                shrink_memcg_node(NODE_DATA(nid), order, &sc);
>                 total_scanned += sc.nr_scanned;
>                 for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
>                        if (populated_zone(NODE_DATA(nid)->node_zones + i))
>                                break;
>                }
>                if (i == NODE_DATA(nid)->nr_zones)
>                         node_clear(nid, do_nodes);
>                if (mem_cgroup_watermark_ok(mem_cont, CHARGE_WMARK_HIGH))
>                         break;
>        }
> ==
>
> In short, I like for() loop rather than while(1) because next calculation
> and
> end condition are clear.
>
I tried it, and the logic above doesn't work as is. Basically we need to
start with start_node and then end with start_node, and the for() loop
doesn't express that well.

If possible, I would like to keep the existing implementation, given that it
does not impact the logic or the performance. I would like to move on to the
next step after the basic stuff is in. Hopefully that works.

--Ying

>
>
> > +                     pgdat = NODE_DATA(nid);
> > +                     balance_pgdat_node(pgdat, order, &sc);
> > +                     total_scanned += sc.nr_scanned;
> > +
> > +                     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +                             struct zone *zone = pgdat->node_zones + i;
> > +
> > +                             if (!populated_zone(zone))
> > +                                     continue;
> > +                     }
> > +                     if (i < 0)
> > +                             node_clear(nid, do_nodes);
> Isn't this wrong ? I guess
>                if (populated_zone(zone))
>                        break;
> is what you want to do.
>
> Thanks,
> -Kame
> > +
> > +                     if (mem_cgroup_watermark_ok(mem_cont,
> > +                                                     CHARGE_WMARK_HIGH))
> {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +
> > +                     if (nodes_empty(do_nodes)) {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +             }
> > +
> > +             if (total_scanned && priority < DEF_PRIORITY - 2)
> > +                     congestion_wait(WRITE, HZ/10);
> > +
> > +             if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > +                     break;
> > +     }
> > +out:
> > +     if (!wmark_ok) {
> > +             cond_resched();
> > +
> > +             try_to_freeze();
> > +
> > +             goto loop_again;
> > +     }
> > +
> > +     return sc.nr_reclaimed;
> > +}
> > +#else
> >  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> >                                                       int order)
> >  {
> >       return 0;
> >  }
> > +#endif
> >
> >  /*
> >   * The background pageout daemon, started as a kernel thread
> > --
> > 1.7.3.1
> >
> >
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (9 preceding siblings ...)
  2011-04-19  3:57 ` [PATCH V6 10/10] Add some per-memcg stats Ying Han
@ 2011-04-21  2:51 ` Johannes Weiner
  2011-04-21  3:05   ` Ying Han
  2011-04-21  4:00   ` KAMEZAWA Hiroyuki
  2011-04-21  3:40 ` KAMEZAWA Hiroyuki
  2011-04-21  3:43 ` [PATCH 1/3] memcg kswapd thread pool (Was " KAMEZAWA Hiroyuki
  12 siblings, 2 replies; 58+ messages in thread
From: Johannes Weiner @ 2011-04-21  2:51 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

Hello Ying,

I'm sorry that I chime in so late, I was still traveling until Monday.

On Mon, Apr 18, 2011 at 08:57:36PM -0700, Ying Han wrote:
> The current implementation of memcg supports targeting reclaim when the
> cgroup is reaching its hard_limit and we do direct reclaim per cgroup.
> Per cgroup background reclaim is needed which helps to spread out memory
> pressure over longer period of time and smoothes out the cgroup performance.

Latency reduction makes perfect sense, the reasons kswapd exists apply
to memory control groups as well.  But I disagree with the design
choices you made.

> If the cgroup is configured to use per cgroup background reclaim, a kswapd
> thread is created which only scans the per-memcg LRU list.

We already have direct reclaim, direct reclaim on behalf of a memcg,
and global kswapd-reclaim.  Please don't add yet another reclaim path
that does its own thing and interacts unpredictably with the rest of
them.

As discussed on LSF, we want to get rid of the global LRU.  So the
goal is to have each reclaim entry end up at the same core part of
reclaim that round-robin scans a subset of zones from a subset of
memory control groups.

> Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> background reclaim and stop it. The watermarks are calculated based
> on the cgroup's limit_in_bytes.

Which brings me to the next issue: making the watermarks configurable.

You argued that having them adjustable from userspace is required for
overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
in in case of global memory pressure.  But that is only a problem
because global kswapd reclaim is (apart from soft limit reclaim)
unaware of memory control groups.

I think the much better solution is to make global kswapd memcg aware
(with the above mentioned round-robin reclaim scheduler), compared to
adding new (and final!) kernel ABI to avoid an internal shortcoming.

The whole exercise of asynchronous background reclaim is to reduce
reclaim latency.  We already have a mechanism for global memory
pressure in place.  Per-memcg watermarks should only exist to avoid
direct reclaim due to hitting the hardlimit, nothing else.

So in summary, I think converting the reclaim core to this round-robin
scheduler solves all these problems at once: a single code path for
reclaim, breaking up of the global lru lock, fair soft limit reclaim,
and a mechanism for latency reduction that just DTRT without any
user-space configuration necessary.

	Hannes


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  2:51 ` [PATCH V6 00/10] memcg: per cgroup background reclaim Johannes Weiner
@ 2011-04-21  3:05   ` Ying Han
  2011-04-21  3:53     ` Johannes Weiner
  2011-04-21  4:00   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-21  3:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 7:51 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> Hello Ying,
>
> I'm sorry that I chime in so late, I was still traveling until Monday.
>

Hey, hope you had a great trip :)

>
> On Mon, Apr 18, 2011 at 08:57:36PM -0700, Ying Han wrote:
> > The current implementation of memcg supports targeting reclaim when the
> > cgroup is reaching its hard_limit and we do direct reclaim per cgroup.
> > Per cgroup background reclaim is needed which helps to spread out memory
> > pressure over longer period of time and smoothes out the cgroup
> performance.
>
> Latency reduction makes perfect sense, the reasons kswapd exists apply
> to memory control groups as well.  But I disagree with the design
> choices you made.
>
> > If the cgroup is configured to use per cgroup background reclaim, a
> kswapd
> > thread is created which only scans the per-memcg LRU list.
>
> We already have direct reclaim, direct reclaim on behalf of a memcg,
> and global kswapd-reclaim.  Please don't add yet another reclaim path
> that does its own thing and interacts unpredictably with the rest of
> them.
>

Yes, we do have per-memcg direct reclaim and kswapd reclaim, but the latter
is global and we don't want to start reclaiming from each memcg until we
reach global memory pressure.

>
> As discussed on LSF, we want to get rid of the global LRU.  So the
> goal is to have each reclaim entry end up at the same core part of
> reclaim that round-robin scans a subset of zones from a subset of
> memory control groups.
>

True, but that is for a system under global memory pressure, where we would
like to do targeted reclaim instead of reclaiming from the global LRU. That
is not the same as this patch, which does targeted reclaim proactively
per-memcg based on its hard_limit.

>
> > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > background reclaim and stop it. The watermarks are calculated based
> > on the cgroup's limit_in_bytes.
>
> Which brings me to the next issue: making the watermarks configurable.
>
> You argued that having them adjustable from userspace is required for
> overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> in in case of global memory pressure.  But that is only a problem
> because global kswapd reclaim is (apart from soft limit reclaim)
> unaware of memory control groups.
>
> I think the much better solution is to make global kswapd memcg aware
> (with the above mentioned round-robin reclaim scheduler), compared to
> adding new (and final!) kernel ABI to avoid an internal shortcoming.
>

We need to make the global kswapd memcg aware, and that is the soft_limit
hierarchical reclaim. It is different from per-memcg background reclaim,
where we want to reclaim memory from a memcg before it falls into per-memcg
direct reclaim.

>
> The whole excercise of asynchroneous background reclaim is to reduce
> reclaim latency.  We already have a mechanism for global memory
> pressure in place.  Per-memcg watermarks should only exist to avoid
> direct reclaim due to hitting the hardlimit, nothing else.
>

Yes, but we have per-memcg direct reclaim, which is based on the hard_limit.
The latency we need to reduce is that of direct reclaim, which is different
from global memory pressure.

>
> So in summary, I think converting the reclaim core to this round-robin
> scheduler solves all these problems at once: a single code path for
> reclaim, breaking up of the global lru lock, fair soft limit reclaim,
> and a mechanism for latency reduction that just DTRT without any
> user-space configuration necessary.
>

Not exactly. We will have cases where only a few cgroups are configured and
the total hard_limit is always less than the machine capacity, so we will
never trigger global memory pressure. However, we still need to smooth out
per-memcg performance by doing background page reclaim proactively before
each cgroup hits its hard_limit (direct reclaim).
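
(As an illustration with made-up numbers: three cgroups hard-limited to 8G
each on a 32G machine can all sit near their limits without ever creating
global memory pressure, yet each of them still benefits from being reclaimed
back below its own watermarks in the background.)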

--Ying


>
>        Hannes
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (10 preceding siblings ...)
  2011-04-21  2:51 ` [PATCH V6 00/10] memcg: per cgroup background reclaim Johannes Weiner
@ 2011-04-21  3:40 ` KAMEZAWA Hiroyuki
  2011-04-21  3:48   ` [PATCH 2/3] weight for memcg background reclaim (Was " KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2011-04-21  3:43 ` [PATCH 1/3] memcg kswapd thread pool (Was " KAMEZAWA Hiroyuki
  12 siblings, 3 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  3:40 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Mon, 18 Apr 2011 20:57:36 -0700
Ying Han <yinghan@google.com> wrote:

> 1. there is one kswapd thread per cgroup. The thread is created when the
> cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> removed. In some environments where thousands of cgroups are configured on
> a single host, we will have thousands of kswapd threads. The memory
> consumption would be 8k * 1000 = 8M. We don't see a big issue for now if
> the host can host that many cgroups.
> 

I don't think leaving this unfixed is OK.

Here is a thread pool patch on top of your set (and it includes some more).
3 patches follow in the next e-mails.
Any comments are welcome, but my response may be delayed.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 1/3] memcg kswapd thread pool (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (11 preceding siblings ...)
  2011-04-21  3:40 ` KAMEZAWA Hiroyuki
@ 2011-04-21  3:43 ` KAMEZAWA Hiroyuki
  2011-04-21  7:09   ` Ying Han
  2011-04-21  8:10   ` Minchan Kim
  12 siblings, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  3:43 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

Ying, please take this as just a hint; you don't need to implement it as is.
==
Now, memcg-kswapd is created per cgroup. Considering that there are users
who create hundreds of cgroups on a system, this consumes too many
resources: memory and cpu time.

This patch creates a thread pool for memcg-kswapd. All memcgs which
need background reclaim are linked to a list, and a memcg-kswapd
picks up a memcg from the list and runs reclaim on it. Each pass reclaims
SWAP_CLUSTER_MAX pages and puts the memcg back at the tail of the
list. memcg-kswapd visits memcgs in a round-robin manner and
reduces their usage.

This patch does

 - adds memcg-kswapd thread pool, the number of threads is now
   sqrt(num_of_cpus) + 1.
 - use unified kswapd_waitq for all memcgs.
 - refine memcg shrink codes in vmscan.c

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    5 
 include/linux/swap.h       |    7 -
 mm/memcontrol.c            |  174 +++++++++++++++++++++++----------
 mm/memory_hotplug.c        |    4 
 mm/page_alloc.c            |    1 
 mm/vmscan.c                |  237 ++++++++++++++++++---------------------------
 6 files changed, 232 insertions(+), 196 deletions(-)

Index: mmotm-Apr14/mm/memcontrol.c
===================================================================
--- mmotm-Apr14.orig/mm/memcontrol.c
+++ mmotm-Apr14/mm/memcontrol.c
@@ -49,6 +49,8 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include "internal.h"
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 
 #include <asm/uaccess.h>
 
@@ -274,6 +276,12 @@ struct mem_cgroup {
 	 */
 	unsigned long 	move_charge_at_immigrate;
 	/*
+ 	 * memcg kswapd control stuff.
+ 	 */
+	atomic_t		kswapd_running; /* !=0 if a kswapd runs */
+	wait_queue_head_t	memcg_kswapd_end; /* for waiting the end*/
+	struct list_head	memcg_kswapd_wait_list;/* for shceduling */
+	/*
 	 * percpu counter.
 	 */
 	struct mem_cgroup_stat_cpu *stat;
@@ -296,7 +304,6 @@ struct mem_cgroup {
 	 */
 	int last_scanned_node;
 
-	wait_queue_head_t *kswapd_wait;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -380,6 +387,7 @@ static struct mem_cgroup *parent_mem_cgr
 static void drain_all_stock_async(void);
 
 static void wake_memcg_kswapd(struct mem_cgroup *mem);
+static void memcg_kswapd_stop(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -916,9 +924,6 @@ static void setup_per_memcg_wmarks(struc
 
 		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
 		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
-
-		if (!mem_cgroup_is_root(mem) && !mem->kswapd_wait)
-			kswapd_run(0, mem);
 	}
 }
 
@@ -3729,6 +3734,7 @@ move_account:
 		ret = -EBUSY;
 		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
 			goto out;
+		memcg_kswapd_stop(mem);
 		ret = -EINTR;
 		if (signal_pending(current))
 			goto out;
@@ -4655,6 +4661,120 @@ static int mem_cgroup_oom_control_write(
 	return 0;
 }
 
+/*
+ * Controls for background memory reclaim stuff.
+ */
+struct memcg_kswapd_work
+{
+	spinlock_t		lock;  /* lock for list */
+	struct list_head	list;  /* list of works. */
+	wait_queue_head_t	waitq;
+};
+
+struct memcg_kswapd_work	memcg_kswapd_control;
+
+static void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	if (atomic_read(&mem->kswapd_running)) /* already running */
+		return;
+
+	spin_lock(&memcg_kswapd_control.lock);
+	if (list_empty(&mem->memcg_kswapd_wait_list))
+		list_add_tail(&mem->memcg_kswapd_wait_list,
+				&memcg_kswapd_control.list);
+	spin_unlock(&memcg_kswapd_control.lock);
+	wake_up(&memcg_kswapd_control.waitq);
+	return;
+}
+
+static void memcg_kswapd_wait_end(struct mem_cgroup *mem)
+{
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&mem->memcg_kswapd_end, &wait, TASK_INTERRUPTIBLE);
+	if (atomic_read(&mem->kswapd_running))
+		schedule();
+	finish_wait(&mem->memcg_kswapd_end, &wait);
+}
+
+/* called at pre_destroy */
+static void memcg_kswapd_stop(struct mem_cgroup *mem)
+{
+	spin_lock(&memcg_kswapd_control.lock);
+	if (!list_empty(&mem->memcg_kswapd_wait_list))
+		list_del(&mem->memcg_kswapd_wait_list);
+	spin_unlock(&memcg_kswapd_control.lock);
+
+	memcg_kswapd_wait_end(mem);
+}
+
+struct mem_cgroup *mem_cgroup_get_shrink_target(void)
+{
+	struct mem_cgroup *mem;
+
+	spin_lock(&memcg_kswapd_control.lock);
+	rcu_read_lock();
+	do {
+		mem = NULL;
+		if (!list_empty(&memcg_kswapd_control.list)) {
+			mem = list_entry(memcg_kswapd_control.list.next,
+				 	struct mem_cgroup,
+				 	memcg_kswapd_wait_list);
+			list_del_init(&mem->memcg_kswapd_wait_list);
+		}
+	} while (mem && !css_tryget(&mem->css));
+	if (mem)
+		atomic_inc(&mem->kswapd_running);
+	rcu_read_unlock();
+	spin_unlock(&memcg_kswapd_control.lock);
+	return mem;
+}
+
+void mem_cgroup_put_shrink_target(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return;
+	atomic_dec(&mem->kswapd_running);
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
+		spin_lock(&memcg_kswapd_control.lock);
+		if (list_empty(&mem->memcg_kswapd_wait_list)) {
+			list_add_tail(&mem->memcg_kswapd_wait_list,
+					&memcg_kswapd_control.list);
+		}
+		spin_unlock(&memcg_kswapd_control.lock);
+	}
+	wake_up_all(&mem->memcg_kswapd_end);
+	cgroup_release_and_wakeup_rmdir(&mem->css);
+}
+
+bool mem_cgroup_kswapd_can_sleep(void)
+{
+	return list_empty(&memcg_kswapd_control.list);
+}
+
+wait_queue_head_t *mem_cgroup_kswapd_waitq(void)
+{
+	return &memcg_kswapd_control.waitq;
+}
+
+static int __init memcg_kswapd_init(void)
+{
+
+	int i, nr_threads;
+
+	spin_lock_init(&memcg_kswapd_control.lock);
+	INIT_LIST_HEAD(&memcg_kswapd_control.list);
+	init_waitqueue_head(&memcg_kswapd_control.waitq);
+
+	nr_threads = int_sqrt(num_possible_cpus()) + 1;
+	for (i = 0; i < nr_threads; i++)
+		if (kswapd_run(0, i + 1) == -1)
+			break;
+	return 0;
+}
+module_init(memcg_kswapd_init);
+
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4935,33 +5055,6 @@ int mem_cgroup_watermark_ok(struct mem_c
 	return ret;
 }
 
-int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
-{
-	if (!mem || !kswapd_p)
-		return 0;
-
-	mem->kswapd_wait = &kswapd_p->kswapd_wait;
-	kswapd_p->kswapd_mem = mem;
-
-	return css_id(&mem->css);
-}
-
-void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
-{
-	if (mem)
-		mem->kswapd_wait = NULL;
-
-	return;
-}
-
-wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
-{
-	if (!mem)
-		return NULL;
-
-	return mem->kswapd_wait;
-}
-
 int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
 {
 	if (!mem)
@@ -4970,22 +5063,6 @@ int mem_cgroup_last_scanned_node(struct 
 	return mem->last_scanned_node;
 }
 
-static inline
-void wake_memcg_kswapd(struct mem_cgroup *mem)
-{
-	wait_queue_head_t *wait;
-
-	if (!mem || !mem->high_wmark_distance)
-		return;
-
-	wait = mem->kswapd_wait;
-
-	if (!wait || !waitqueue_active(wait))
-		return;
-
-	wake_up_interruptible(wait);
-}
-
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -5069,6 +5146,8 @@ mem_cgroup_create(struct cgroup_subsys *
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
+	init_waitqueue_head(&mem->memcg_kswapd_end);
+	INIT_LIST_HEAD(&mem->memcg_kswapd_wait_list);
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
@@ -5089,7 +5168,6 @@ static void mem_cgroup_destroy(struct cg
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
-	kswapd_stop(0, mem);
 	mem_cgroup_put(mem);
 }
 
Index: mmotm-Apr14/include/linux/swap.h
===================================================================
--- mmotm-Apr14.orig/include/linux/swap.h
+++ mmotm-Apr14/include/linux/swap.h
@@ -28,9 +28,8 @@ static inline int current_is_kswapd(void
 
 struct kswapd {
 	struct task_struct *kswapd_task;
-	wait_queue_head_t kswapd_wait;
+	wait_queue_head_t *kswapd_wait;
 	pg_data_t *kswapd_pgdat;
-	struct mem_cgroup *kswapd_mem;
 };
 
 int kswapd(void *p);
@@ -307,8 +306,8 @@ static inline void scan_unevictable_unre
 }
 #endif
 
-extern int kswapd_run(int nid, struct mem_cgroup *mem);
-extern void kswapd_stop(int nid, struct mem_cgroup *mem);
+extern int kswapd_run(int nid, int id);
+extern void kswapd_stop(int nid);
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
Index: mmotm-Apr14/mm/page_alloc.c
===================================================================
--- mmotm-Apr14.orig/mm/page_alloc.c
+++ mmotm-Apr14/mm/page_alloc.c
@@ -4199,6 +4199,7 @@ static void __paginginit free_area_init_
 
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
+	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
Index: mmotm-Apr14/mm/vmscan.c
===================================================================
--- mmotm-Apr14.orig/mm/vmscan.c
+++ mmotm-Apr14/mm/vmscan.c
@@ -2256,7 +2256,7 @@ static bool pgdat_balanced(pg_data_t *pg
 	return balanced_pages > (present_pages >> 2);
 }
 
-#define is_global_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
+#define is_global_kswapd(kswapd_p) ((kswapd_p)->kswapd_pgdat)
 
 /* is kswapd sleeping prematurely? */
 static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
@@ -2599,50 +2599,56 @@ static void kswapd_try_to_sleep(struct k
 	long remaining = 0;
 	DEFINE_WAIT(wait);
 	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
-	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
+	wait_queue_head_t *wait_h = kswapd_p->kswapd_wait;
 
 	if (freezing(current) || kthread_should_stop())
 		return;
 
 	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 
-	if (!is_global_kswapd(kswapd_p)) {
-		schedule();
-		goto out;
-	}
-
-	/* Try to sleep for a short interval */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
-		remaining = schedule_timeout(HZ/10);
-		finish_wait(wait_h, &wait);
-		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
-	}
-
-	/*
-	 * After a short sleep, check if it was a premature sleep. If not, then
-	 * go fully to sleep until explicitly woken up.
-	 */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
-		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+	if (is_global_kswapd(kswapd_p)) {
+		/* Try to sleep for a short interval */
+		if (!sleeping_prematurely(pgdat, order,
+				remaining, classzone_idx)) {
+			remaining = schedule_timeout(HZ/10);
+			finish_wait(wait_h, &wait);
+			prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
+		}
 
 		/*
-		 * vmstat counters are not perfectly accurate and the estimated
-		 * value for counters such as NR_FREE_PAGES can deviate from the
-		 * true value by nr_online_cpus * threshold. To avoid the zone
-		 * watermarks being breached while under pressure, we reduce the
-		 * per-cpu vmstat threshold while kswapd is awake and restore
-		 * them before going back to sleep.
-		 */
-		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
-		schedule();
-		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+	 	 * After a short sleep, check if it was a premature sleep.
+	 	 * If not, then go fully to sleep until explicitly woken up.
+	 	 */
+		if (!sleeping_prematurely(pgdat, order,
+					remaining, classzone_idx)) {
+			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+			/*
+			 * vmstat counters are not perfectly accurate and the
+			 * estimated value for counters such as NR_FREE_PAGES
+			 * can deviate from the true value by
+			 * nr_online_cpus * threshold. To avoid the zone
+			 * watermarks being breached while under pressure, we
+			 * reduce the per-cpu vmstat threshold while kswapd is
+			 * awake and restore them before going back to sleep.
+		 	 */
+			set_pgdat_percpu_threshold(pgdat,
+					calculate_normal_threshold);
+			schedule();
+			set_pgdat_percpu_threshold(pgdat,
+					calculate_pressure_threshold);
+		} else {
+			if (remaining)
+				count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
+			else
+				count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
+		}
 	} else {
-		if (remaining)
-			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
-		else
-			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
+		/* For now, we just check the remaining works.*/
+		if (mem_cgroup_kswapd_can_sleep())
+			schedule();
 	}
-out:
 	finish_wait(wait_h, &wait);
 }
 
@@ -2651,8 +2657,8 @@ out:
  * The function is used for per-memcg LRU. It scanns all the zones of the
  * node and returns the nr_scanned and nr_reclaimed.
  */
-static void balance_pgdat_node(pg_data_t *pgdat, int order,
-					struct scan_control *sc)
+static void shrink_memcg_node(pg_data_t *pgdat, int order,
+				struct scan_control *sc)
 {
 	int i;
 	unsigned long total_scanned = 0;
@@ -2705,14 +2711,9 @@ static void balance_pgdat_node(pg_data_t
  * Per cgroup background reclaim.
  * TODO: Take off the order since memcg always do order 0
  */
-static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
-					      int order)
+static int shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
 {
-	int i, nid;
-	int start_node;
-	int priority;
-	bool wmark_ok;
-	int loop;
+	int i, nid, priority, loop;
 	pg_data_t *pgdat;
 	nodemask_t do_nodes;
 	unsigned long total_scanned;
@@ -2726,43 +2727,34 @@ static unsigned long balance_mem_cgroup_
 		.mem_cgroup = mem_cont,
 	};
 
-loop_again:
 	do_nodes = NODE_MASK_NONE;
 	sc.may_writepage = !laptop_mode;
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
 
-	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		sc.priority = priority;
-		wmark_ok = false;
-		loop = 0;
+	do_nodes = node_states[N_ONLINE];
 
+	for (priority = DEF_PRIORITY;
+		(priority >= 0) && (sc.nr_to_reclaim > sc.nr_reclaimed);
+		priority--) {
+
+		sc.priority = priority;
 		/* The swap token gets in the way of swapout... */
 		if (!priority)
 			disable_swap_token();
+		/*
+		 * We'll scan a node given by memcg's logic. For avoiding
+		 * burning cpu, we have a limit of this loop.
+		 */
+		for (loop = num_online_nodes();
+			(loop > 0) && !nodes_empty(do_nodes);
+			loop--) {
 
-		if (priority == DEF_PRIORITY)
-			do_nodes = node_states[N_ONLINE];
-
-		while (1) {
 			nid = mem_cgroup_select_victim_node(mem_cont,
 							&do_nodes);
-
-			/*
-			 * Indicate we have cycled the nodelist once
-			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
-			 * kswapd burning cpu cycles.
-			 */
-			if (loop == 0) {
-				start_node = nid;
-				loop++;
-			} else if (nid == start_node)
-				break;
-
 			pgdat = NODE_DATA(nid);
-			balance_pgdat_node(pgdat, order, &sc);
+			shrink_memcg_node(pgdat, order, &sc);
 			total_scanned += sc.nr_scanned;
-
 			/*
 			 * Set the node which has at least one reclaimable
 			 * zone
@@ -2770,10 +2762,8 @@ loop_again:
 			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
 				struct zone *zone = pgdat->node_zones + i;
 
-				if (!populated_zone(zone))
-					continue;
-
-				if (!mem_cgroup_mz_unreclaimable(mem_cont,
+				if (populated_zone(zone) &&
+				    !mem_cgroup_mz_unreclaimable(mem_cont,
 								zone))
 					break;
 			}
@@ -2781,36 +2771,18 @@ loop_again:
 				node_clear(nid, do_nodes);
 
 			if (mem_cgroup_watermark_ok(mem_cont,
-							CHARGE_WMARK_HIGH)) {
-				wmark_ok = true;
-				goto out;
-			}
-
-			if (nodes_empty(do_nodes)) {
-				wmark_ok = true;
+						CHARGE_WMARK_HIGH))
 				goto out;
-			}
 		}
 
 		if (total_scanned && priority < DEF_PRIORITY - 2)
 			congestion_wait(WRITE, HZ/10);
-
-		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
-			break;
 	}
 out:
-	if (!wmark_ok) {
-		cond_resched();
-
-		try_to_freeze();
-
-		goto loop_again;
-	}
-
 	return sc.nr_reclaimed;
 }
 #else
-static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont,
 							int order)
 {
 	return 0;
@@ -2836,8 +2808,7 @@ int kswapd(void *p)
 	int classzone_idx;
 	struct kswapd *kswapd_p = (struct kswapd *)p;
 	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
-	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
-	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
+	struct mem_cgroup *mem;
 	struct task_struct *tsk = current;
 
 	struct reclaim_state reclaim_state = {
@@ -2848,7 +2819,6 @@ int kswapd(void *p)
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
 	if (is_global_kswapd(kswapd_p)) {
-		BUG_ON(pgdat->kswapd_wait != wait_h);
 		cpumask = cpumask_of_node(pgdat->node_id);
 		if (!cpumask_empty(cpumask))
 			set_cpus_allowed_ptr(tsk, cpumask);
@@ -2908,18 +2878,20 @@ int kswapd(void *p)
 		if (kthread_should_stop())
 			break;
 
+		if (ret)
+			continue;
 		/*
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret) {
-			if (is_global_kswapd(kswapd_p)) {
-				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
-								order);
-				order = balance_pgdat(pgdat, order,
-							&classzone_idx);
-			} else
-				balance_mem_cgroup_pgdat(mem, order);
+		if (is_global_kswapd(kswapd_p)) {
+			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
+			order = balance_pgdat(pgdat, order, &classzone_idx);
+		} else {
+			mem = mem_cgroup_get_shrink_target();
+			if (mem)
+				shrink_mem_cgroup(mem, order);
+			mem_cgroup_put_shrink_target(mem);
 		}
 	}
 	return 0;
@@ -2942,13 +2914,13 @@ void wakeup_kswapd(struct zone *zone, in
 		pgdat->kswapd_max_order = order;
 		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
 	}
-	if (!waitqueue_active(pgdat->kswapd_wait))
+	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	wake_up_interruptible(pgdat->kswapd_wait);
+	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
 /*
@@ -3046,9 +3018,8 @@ static int __devinit cpu_callback(struct
 
 			mask = cpumask_of_node(pgdat->node_id);
 
-			wait = pgdat->kswapd_wait;
-			kswapd_p = container_of(wait, struct kswapd,
-						kswapd_wait);
+			wait = &pgdat->kswapd_wait;
+			kswapd_p = pgdat->kswapd;
 			kswapd_tsk = kswapd_p->kswapd_task;
 
 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
@@ -3064,18 +3035,17 @@ static int __devinit cpu_callback(struct
  * This kswapd start function will be called by init and node-hot-add.
  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
  */
-int kswapd_run(int nid, struct mem_cgroup *mem)
+int kswapd_run(int nid, int memcgid)
 {
 	struct task_struct *kswapd_tsk;
 	pg_data_t *pgdat = NULL;
 	struct kswapd *kswapd_p;
 	static char name[TASK_COMM_LEN];
-	int memcg_id = -1;
 	int ret = 0;
 
-	if (!mem) {
+	if (!memcgid) {
 		pgdat = NODE_DATA(nid);
-		if (pgdat->kswapd_wait)
+		if (pgdat->kswapd)
 			return ret;
 	}
 
@@ -3083,34 +3053,26 @@ int kswapd_run(int nid, struct mem_cgrou
 	if (!kswapd_p)
 		return -ENOMEM;
 
-	init_waitqueue_head(&kswapd_p->kswapd_wait);
-
-	if (!mem) {
-		pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+	if (!memcgid) {
+		pgdat->kswapd = kswapd_p;
+		kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
 		kswapd_p->kswapd_pgdat = pgdat;
 		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
 	} else {
-		memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
-		if (!memcg_id) {
-			kfree(kswapd_p);
-			return ret;
-		}
-		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
+		kswapd_p->kswapd_wait = mem_cgroup_kswapd_waitq();
+		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcgid);
 	}
 
 	kswapd_tsk = kthread_run(kswapd, kswapd_p, name);
 	if (IS_ERR(kswapd_tsk)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
-		if (!mem) {
+		if (!memcgid) {
 			printk(KERN_ERR "Failed to start kswapd on node %d\n",
 								nid);
-			pgdat->kswapd_wait = NULL;
-		} else {
-			printk(KERN_ERR "Failed to start kswapd on memcg %d\n",
-								memcg_id);
-			mem_cgroup_clear_kswapd(mem);
-		}
+			pgdat->kswapd = NULL;
+		} else
+			printk(KERN_ERR "Failed to start kswapd on memcg\n");
 		kfree(kswapd_p);
 		ret = -1;
 	} else
@@ -3121,23 +3083,14 @@ int kswapd_run(int nid, struct mem_cgrou
 /*
  * Called by memory hotplug when all memory in a node is offlined.
  */
-void kswapd_stop(int nid, struct mem_cgroup *mem)
+void kswapd_stop(int nid)
 {
 	struct task_struct *kswapd_tsk = NULL;
 	struct kswapd *kswapd_p = NULL;
-	wait_queue_head_t *wait;
-
-	if (!mem)
-		wait = NODE_DATA(nid)->kswapd_wait;
-	else
-		wait = mem_cgroup_kswapd_wait(mem);
-
-	if (wait) {
-		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
-		kswapd_tsk = kswapd_p->kswapd_task;
-		kswapd_p->kswapd_task = NULL;
-	}
 
+	kswapd_p = NODE_DATA(nid)->kswapd;
+	kswapd_tsk = kswapd_p->kswapd_task;
+	kswapd_p->kswapd_task = NULL;
 	if (kswapd_tsk)
 		kthread_stop(kswapd_tsk);
 
@@ -3150,7 +3103,7 @@ static int __init kswapd_init(void)
 
 	swap_setup();
 	for_each_node_state(nid, N_HIGH_MEMORY)
-		kswapd_run(nid, NULL);
+		kswapd_run(nid, 0);
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
Index: mmotm-Apr14/include/linux/memcontrol.h
===================================================================
--- mmotm-Apr14.orig/include/linux/memcontrol.h
+++ mmotm-Apr14/include/linux/memcontrol.h
@@ -94,6 +94,11 @@ extern int mem_cgroup_last_scanned_node(
 extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
 					const nodemask_t *nodes);
 
+extern bool mem_cgroup_kswapd_can_sleep(void);
+extern struct mem_cgroup *mem_cgroup_get_shrink_target(void);
+extern void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
+extern wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
+
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
 {
Index: mmotm-Apr14/mm/memory_hotplug.c
===================================================================
--- mmotm-Apr14.orig/mm/memory_hotplug.c
+++ mmotm-Apr14/mm/memory_hotplug.c
@@ -463,7 +463,7 @@ int __ref online_pages(unsigned long pfn
 	init_per_zone_wmark_min();
 
 	if (onlined_pages) {
-		kswapd_run(zone_to_nid(zone), NULL);
+		kswapd_run(zone_to_nid(zone), 0);
 		node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
 	}
 
@@ -898,7 +898,7 @@ repeat:
 
 	if (!node_present_pages(node)) {
 		node_clear_state(node, N_HIGH_MEMORY);
-		kswapd_stop(node, NULL);
+		kswapd_stop(node);
 	}
 
 	vm_total_pages = nr_free_pagecache_pages();
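
For reference, in the hunks shown here the per-node threads are still started
with kswapd_run(nid, 0) (see kswapd_init() and online_pages() above), and a
memcg caller passes a non-zero id which, in these hunks, is only used to build
the "memcg_%d" thread name. A memcg-side call site is not part of this patch;
purely as a sketch, with css_id() being just one plausible choice of
identifier, it could look like:

	/* hypothetical: started from memcontrol.c once watermarks are set;
	 * the id only feeds the thread name visible in ps output.
	 */
	kswapd_run(0, css_id(&mem->css));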


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 2/3] weight for memcg background reclaim (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  3:40 ` KAMEZAWA Hiroyuki
@ 2011-04-21  3:48   ` KAMEZAWA Hiroyuki
  2011-04-21  6:11     ` Ying Han
  2011-04-21  3:50   ` [PATCH 3/3/] fix mem_cgroup_watemark_ok " KAMEZAWA Hiroyuki
  2011-04-21  4:22   ` Ying Han
  2 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  3:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm


memcg-kswapd visits each memcg in round-robin fashion. But the amount of work
required depends on the memcg's usage and its high/low watermarks, and taking
that into account would be an improvement.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |   17 +++++++++++++++++
 mm/vmscan.c                |    2 ++
 3 files changed, 20 insertions(+)

Index: mmotm-Apr14/include/linux/memcontrol.h
===================================================================
--- mmotm-Apr14.orig/include/linux/memcontrol.h
+++ mmotm-Apr14/include/linux/memcontrol.h
@@ -98,6 +98,7 @@ extern bool mem_cgroup_kswapd_can_sleep(
 extern struct mem_cgroup *mem_cgroup_get_shrink_target(void);
 extern void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
 extern wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
+extern int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
Index: mmotm-Apr14/mm/memcontrol.c
===================================================================
--- mmotm-Apr14.orig/mm/memcontrol.c
+++ mmotm-Apr14/mm/memcontrol.c
@@ -4673,6 +4673,23 @@ struct memcg_kswapd_work
 
 struct memcg_kswapd_work	memcg_kswapd_control;
 
+int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem)
+{
+	unsigned long long usage, lowat, hiwat;
+	int rate;
+
+	usage = res_counter_read_u64(&mem->res, RES_USAGE);
+	lowat = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+	if (lowat == hiwat)
+		return 0;
+
+	rate = (usage - hiwat) * 10 / (lowat - hiwat);
+	/* If usage is big, we reclaim more */
+	return rate * SWAP_CLUSTER_MAX;
+}
+
+
 static void wake_memcg_kswapd(struct mem_cgroup *mem)
 {
 	if (atomic_read(&mem->kswapd_running)) /* already running */
Index: mmotm-Apr14/mm/vmscan.c
===================================================================
--- mmotm-Apr14.orig/mm/vmscan.c
+++ mmotm-Apr14/mm/vmscan.c
@@ -2732,6 +2732,8 @@ static int shrink_mem_cgroup(struct mem_
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
 
+	sc.nr_to_reclaim += mem_cgroup_kswapd_bonus(mem_cont);
+
 	do_nodes = node_states[N_ONLINE];
 
 	for (priority = DEF_PRIORITY;
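
As a worked example of the formula above (an illustration, not part of the
patch): when usage sits exactly at the low watermark, (usage - hiwat) equals
(lowat - hiwat), so rate is 10 and the bonus is 10 * SWAP_CLUSTER_MAX, i.e.
320 extra pages to reclaim with the usual SWAP_CLUSTER_MAX of 32. At the high
watermark the bonus is 0, and usage above the low watermark scales the bonus
up further, linearly.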


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 3/3/] fix mem_cgroup_watemark_ok (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  3:40 ` KAMEZAWA Hiroyuki
  2011-04-21  3:48   ` [PATCH 2/3] weight for memcg background reclaim (Was " KAMEZAWA Hiroyuki
@ 2011-04-21  3:50   ` KAMEZAWA Hiroyuki
  2011-04-21  5:29     ` Ying Han
  2011-04-21  4:22   ` Ying Han
  2 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  3:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm


Ying, I noticed this while testing. Please fold this fix into your set.
==
If low_wmark_distance is 0, mem_cgroup_watermark_ok() returns
false when usage hits the limit.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    3 +++
 1 file changed, 3 insertions(+)

Index: mmotm-Apr14/mm/memcontrol.c
===================================================================
--- mmotm-Apr14.orig/mm/memcontrol.c
+++ mmotm-Apr14/mm/memcontrol.c
@@ -5062,6 +5062,9 @@ int mem_cgroup_watermark_ok(struct mem_c
 	long ret = 0;
 	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
 
+	if (!mem->low_wmark_distance)
+		return 1;
+
 	VM_BUG_ON((charge_flags & flags) == flags);
 
 	if (charge_flags & CHARGE_WMARK_LOW)
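
For context, the watermarks in this series are derived from the hard limit,
roughly as follows (a sketch of the intended semantics, not the actual code):

	high_wmark = limit_in_bytes - high_wmark_distance;
	low_wmark  = limit_in_bytes - low_wmark_distance;

With a zero distance both marks collapse onto the hard limit, so without the
early return added above, mem_cgroup_watermark_ok() would report a watermark
violation as soon as usage reaches the limit, even though background reclaim
is effectively disabled for that memcg.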


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  3:05   ` Ying Han
@ 2011-04-21  3:53     ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2011-04-21  3:53 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Wed, Apr 20, 2011 at 08:05:05PM -0700, Ying Han wrote:
> On Wed, Apr 20, 2011 at 7:51 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > I'm sorry that I chime in so late, I was still traveling until Monday.
> 
> Hey, hope you had a great trip :)

It was fantastic, thanks ;)

> > > If the cgroup is configured to use per cgroup background
> > > reclaim, a kswapd thread is created which only scans the
> > > per-memcg > > LRU list.
> >
> > We already have direct reclaim, direct reclaim on behalf of a memcg,
> > and global kswapd-reclaim.  Please don't add yet another reclaim path
> > that does its own thing and interacts unpredictably with the rest of
> > them.
> 
> Yes, we do have per-memcg direct reclaim and kswapd-reclaim. but the later
> one is global and we don't want to start reclaiming from each memcg until we
> reach the global memory pressure.

Not each, but a selected subset.  See below.

> > As discussed on LSF, we want to get rid of the global LRU.  So the
> > goal is to have each reclaim entry end up at the same core part of
> > reclaim that round-robin scans a subset of zones from a subset of
> > memory control groups.
> 
> True, but that is for system under global memory pressure and we would like
> to do targeting reclaim instead of reclaiming from the global LRU. That is
> not the same in this patch, which is doing targeting reclaim proactively
> per-memcg based on their hard_limit.

When triggered by global memory pressure we want to scan the subset of
memcgs that are above their soft limit, or all memcgs if none of them
exceeds their soft limit, which is a singleton in the no-memcg case.

When triggered by the hard limit, we want to scan the subset of memcgs
that have reached their hard limit, which is a singleton.

When triggered by the hard limit watermarks, we want to scan the subset
of memcgs that are in violation of their watermarks.

I argue that the 'reclaim round-robin from a subset of cgroups' is the
same for all cases and that it makes sense to not encode differences
where there really are none.
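
Just to illustrate the shape of the shared core I mean, a rough sketch (the
helper names, the done() predicate and the reclaim_list member are all
hypothetical, not existing interfaces):

	/*
	 * Round-robin over whatever subset of memcgs the caller selected
	 * (soft-limit offenders, the hard-limit singleton, or watermark
	 * violators) until the triggering condition clears.
	 */
	static void reclaim_memcg_set(struct list_head *memcgs, bool (*done)(void))
	{
		struct mem_cgroup *mem;

		while (!done())
			list_for_each_entry(mem, memcgs, reclaim_list)
				/* scan a small batch from each memcg in turn */
				shrink_one_memcg(mem, SWAP_CLUSTER_MAX);
	}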

> > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > background reclaim and stop it. The watermarks are calculated based
> > > on the cgroup's limit_in_bytes.
> >
> > Which brings me to the next issue: making the watermarks configurable.
> >
> > You argued that having them adjustable from userspace is required for
> > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > in in case of global memory pressure.  But that is only a problem
> > because global kswapd reclaim is (apart from soft limit reclaim)
> > unaware of memory control groups.
> >
> > I think the much better solution is to make global kswapd memcg aware
> > (with the above mentioned round-robin reclaim scheduler), compared to
> > adding new (and final!) kernel ABI to avoid an internal shortcoming.
> 
> We need to make the global kswapd memcg aware and that is the
> soft_limit hierarchical reclaim.

Yes, but not only.

> It is different from doing per-memcg background reclaim which we
> want to reclaim memory per-memcg before they goes to per-memcg
> direct reclaim.

Both the condition for waking up kswapd and the subset of control
groups to reclaim from are different.  But not the basic code that
goes through that subset and reclaims until the condition is resolved.

> > The whole excercise of asynchroneous background reclaim is to reduce
> > reclaim latency.  We already have a mechanism for global memory
> > pressure in place.  Per-memcg watermarks should only exist to avoid
> > direct reclaim due to hitting the hardlimit, nothing else.
> 
> Yes, but we have per-memcg direct reclaim which is based on the hard_limit.
> The latency we need to reduce is the direct reclaim which is different from
> global memory pressure.

Where is the difference?  Direct reclaim happens due to physical
memory pressure or due to cgroup-limit memory pressure.  We want both
cases to be mitigated by watermark-triggered asynchronous reclaim.

I only say that per-memcg watermarks should not be abused to deal with
global memory pressure.

> > So in summary, I think converting the reclaim core to this round-robin
> > scheduler solves all these problems at once: a single code path for
> > reclaim, breaking up of the global lru lock, fair soft limit reclaim,
> > and a mechanism for latency reduction that just DTRT without any
> > user-space configuration necessary.
> 
> Not exactly. We will have cases where only few cgroups configured and the
> total hard_limit always less than the machine capacity. So we will never
> trigger the global memory pressure. However, we still need to smooth out the
> performance per-memcg by doing background page reclaim proactively before
> they hit their hard_limit (direct reclaim)

I did not want to argue against the hard-limit watermarks.  Sorry, I
now realize that my summary was ambiguous.

What I meant was that this group reclaimer should optimally be
implemented first, and then the hard-limit watermarks can be added as
just another trigger + subset filter for asynchronous reclaim.

	Hannes


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  2:51 ` [PATCH V6 00/10] memcg: per cgroup background reclaim Johannes Weiner
  2011-04-21  3:05   ` Ying Han
@ 2011-04-21  4:00   ` KAMEZAWA Hiroyuki
  2011-04-21  4:24     ` Ying Han
  2011-04-21  5:08     ` Johannes Weiner
  1 sibling, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  4:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 04:51:07 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> > If the cgroup is configured to use per cgroup background reclaim, a kswapd
> > thread is created which only scans the per-memcg LRU list.
> 
> We already have direct reclaim, direct reclaim on behalf of a memcg,
> and global kswapd-reclaim.  Please don't add yet another reclaim path
> that does its own thing and interacts unpredictably with the rest of
> them.
> 
> As discussed on LSF, we want to get rid of the global LRU.  So the
> goal is to have each reclaim entry end up at the same core part of
> reclaim that round-robin scans a subset of zones from a subset of
> memory control groups.
> 

It's not related to this set. And I think even if we remove the global LRU,
global-kswapd and memcg-kswapd need to do independent work.

global-kswapd : works for zone/node balancing, making free pages,
                and compaction. Selects a memcg victim and asks it
                to reduce memory with regard to gfp_mask. Starts its work
                when a zone/node is unbalanced.

memcg-kswapd  : works for reducing memory usage, with no interest in
                zones/nodes. Starts when the high/low watermarks are hit.

We can eventually share 'reclaim_memcg_this_zone()'-style code, but that can
be revisited when we remove the global LRU.
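
For illustration, such a shared per-zone helper might have a shape along these
lines (the name and signature are hypothetical, not existing code):

	/* reclaim up to nr_to_reclaim pages charged to @mem that sit in @zone */
	static unsigned long reclaim_memcg_this_zone(struct mem_cgroup *mem,
						     struct zone *zone,
						     unsigned long nr_to_reclaim);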


> > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > background reclaim and stop it. The watermarks are calculated based
> > on the cgroup's limit_in_bytes.
> 
> Which brings me to the next issue: making the watermarks configurable.
> 
> You argued that having them adjustable from userspace is required for
> overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> in in case of global memory pressure.  But that is only a problem
> because global kswapd reclaim is (apart from soft limit reclaim)
> unaware of memory control groups.
> 
> I think the much better solution is to make global kswapd memcg aware
> (with the above mentioned round-robin reclaim scheduler), compared to
> adding new (and final!) kernel ABI to avoid an internal shortcoming.
> 

I don't think it's a good idea to kick kswapd even when there is enough free
memory.

If memcg-kswapd is implemented, I'd like to add an automatic cgroup for
memcg-kswapd and limit its cpu usage, because it runs even when memory is not
short.


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  3:40 ` KAMEZAWA Hiroyuki
  2011-04-21  3:48   ` [PATCH 2/3] weight for memcg background reclaim (Was " KAMEZAWA Hiroyuki
  2011-04-21  3:50   ` [PATCH 3/3/] fix mem_cgroup_watemark_ok " KAMEZAWA Hiroyuki
@ 2011-04-21  4:22   ` Ying Han
  2011-04-21  4:27     ` KAMEZAWA Hiroyuki
  2011-04-21  4:31     ` Ying Han
  2 siblings, 2 replies; 58+ messages in thread
From: Ying Han @ 2011-04-21  4:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 8:40 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 18 Apr 2011 20:57:36 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > 1. there are one kswapd thread per cgroup. the thread is created when the
> > cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> > removed. In some enviroment when thousand of cgroups are being configured
> on
> > a single host, we will have thousand of kswapd threads. The memory
> consumption
> > would be 8k*100 = 8M. We don't see a big issue for now if the host can
> host
> > that many of cgroups.
> >
>
> I don't think no-fix to this is ok.
>
> Here is a thread pool patch on your set. (and includes some more).
> 3 patches in following e-mails.
> Any comments are welcome, but my response may be delayed.
>

Thank you for making up the patch, and I will take a look. Do I apply the 3
patches on top of my patchset, or do they come separately?

--Ying

> Thanks,
> -Kame


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  4:00   ` KAMEZAWA Hiroyuki
@ 2011-04-21  4:24     ` Ying Han
  2011-04-21  4:46       ` KAMEZAWA Hiroyuki
  2011-04-21  5:08     ` Johannes Weiner
  1 sibling, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-21  4:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 9:00 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 21 Apr 2011 04:51:07 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > > If the cgroup is configured to use per cgroup background reclaim, a
> kswapd
> > > thread is created which only scans the per-memcg LRU list.
> >
> > We already have direct reclaim, direct reclaim on behalf of a memcg,
> > and global kswapd-reclaim.  Please don't add yet another reclaim path
> > that does its own thing and interacts unpredictably with the rest of
> > them.
> >
> > As discussed on LSF, we want to get rid of the global LRU.  So the
> > goal is to have each reclaim entry end up at the same core part of
> > reclaim that round-robin scans a subset of zones from a subset of
> > memory control groups.
> >
>
> It's not related to this set. And I think even if we remove global LRU,
> global-kswapd and memcg-kswapd need to do independent work.
>
> global-kswapd : works for zone/node balancing and making free pages,
>                and compaction. select a memcg vicitm and ask it
>                to reduce memory with regard to gfp_mask. Starts its work
>                when zone/node is unbalanced.
>
> memcg-kswapd  : works for reducing usage of memory, no interests on
>                zone/nodes. Starts when high/low watermaks hits.
>
> We can share 'recalim_memcg_this_zone()' code finally, but it can be
> changed when we remove global LRU.
>
>
> > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > background reclaim and stop it. The watermarks are calculated based
> > > on the cgroup's limit_in_bytes.
> >
> > Which brings me to the next issue: making the watermarks configurable.
> >
> > You argued that having them adjustable from userspace is required for
> > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > in in case of global memory pressure.  But that is only a problem
> > because global kswapd reclaim is (apart from soft limit reclaim)
> > unaware of memory control groups.
> >
> > I think the much better solution is to make global kswapd memcg aware
> > (with the above mentioned round-robin reclaim scheduler), compared to
> > adding new (and final!) kernel ABI to avoid an internal shortcoming.
> >
>
> I don't think its a good idea to kick kswapd even when free memory is
> enough.
>
> If memcg-kswapd implemted, I'd like to add auto-cgroup for memcg-kswapd and
> limit its cpu usage because it works even when memory is not in-short.
>

How are we gonna isolate the memcg-kswapd cpu usage under the workqueue
model?

--Ying

>
>
> Thanks,
> -Kame
>
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  4:22   ` Ying Han
@ 2011-04-21  4:27     ` KAMEZAWA Hiroyuki
  2011-04-21  4:31     ` Ying Han
  1 sibling, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  4:27 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Wed, 20 Apr 2011 21:22:43 -0700
Ying Han <yinghan@google.com> wrote:

> On Wed, Apr 20, 2011 at 8:40 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Mon, 18 Apr 2011 20:57:36 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > 1. there are one kswapd thread per cgroup. the thread is created when the
> > > cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> > > removed. In some enviroment when thousand of cgroups are being configured
> > on
> > > a single host, we will have thousand of kswapd threads. The memory
> > consumption
> > > would be 8k*100 = 8M. We don't see a big issue for now if the host can
> > host
> > > that many of cgroups.
> > >
> >
> > I don't think no-fix to this is ok.
> >
> > Here is a thread pool patch on your set. (and includes some more).
> > 3 patches in following e-mails.
> > Any comments are welocme, but my response may be delayed.
> >
> Thank you for making up the patch, and I will take a look. Do I apply the 3
> patches on top of my patchset, or do they come separately?

Ah, sorry, I made the patches on
mmotm-Apr15 + your patches 1-8 (not including 9 and 10).

I dropped 10 just because of a hunk conflict (caused by dropping 9), but as
David pointed out, we should consolidate this with count_vm_event() (in a
different patch set)...
And I think you already have v7.
For now, you can pick the usable parts into your set. I'll make an add-on again.

What is important here will be the discussion toward a better implementation.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  4:22   ` Ying Han
  2011-04-21  4:27     ` KAMEZAWA Hiroyuki
@ 2011-04-21  4:31     ` Ying Han
  1 sibling, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-21  4:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 9:22 PM, Ying Han <yinghan@google.com> wrote:

>
>
> On Wed, Apr 20, 2011 at 8:40 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Mon, 18 Apr 2011 20:57:36 -0700
>> Ying Han <yinghan@google.com> wrote:
>>
>> > 1. there are one kswapd thread per cgroup. the thread is created when
>> the
>> > cgroup changes its limit_in_bytes and is deleted when the cgroup is
>> being
>> > removed. In some enviroment when thousand of cgroups are being
>> configured on
>> > a single host, we will have thousand of kswapd threads. The memory
>> consumption
>> > would be 8k*100 = 8M. We don't see a big issue for now if the host can
>> host
>> > that many of cgroups.
>> >
>>
>> I don't think no-fix to this is ok.
>>
>> Here is a thread pool patch on your set. (and includes some more).
>> 3 patches in following e-mails.
>> Any comments are welocme, but my response may be delayed.
>>
> Thank you for making up the patch, and I will take a look. Do I apply the
> 3 patches on top of my patchset, or do they come separately?
>

Sorry, please ignore my last question. It looks like the patches are based on
my existing per-memcg kswapd patchset. I will try to apply them.

--Ying

>
> --Ying
>
> Thanks,
>> -Kame
>>
>>
>>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  4:24     ` Ying Han
@ 2011-04-21  4:46       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  4:46 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Wed, 20 Apr 2011 21:24:07 -0700
Ying Han <yinghan@google.com> wrote:

> On Wed, Apr 20, 2011 at 9:00 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > > background reclaim and stop it. The watermarks are calculated based
> > > > on the cgroup's limit_in_bytes.
> > >
> > > Which brings me to the next issue: making the watermarks configurable.
> > >
> > > You argued that having them adjustable from userspace is required for
> > > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > > in in case of global memory pressure.  But that is only a problem
> > > because global kswapd reclaim is (apart from soft limit reclaim)
> > > unaware of memory control groups.
> > >
> > > I think the much better solution is to make global kswapd memcg aware
> > > (with the above mentioned round-robin reclaim scheduler), compared to
> > > adding new (and final!) kernel ABI to avoid an internal shortcoming.
> > >
> >
> > I don't think its a good idea to kick kswapd even when free memory is
> > enough.
> >
> > If memcg-kswapd implemted, I'd like to add auto-cgroup for memcg-kswapd and
> > limit its cpu usage because it works even when memory is not in-short.
> >
> 
> How are we gonna isolate the memcg-kswapd cpu usage under the workqueue
> model?
> 

The admin can limit the total cpu usage of memcg-kswapd, so using a private
workqueue model seems to make sense.
If background reclaim uses up its cpu share, a memcg doing heavy work will hit
direct reclaim and have to consume its own cpu time. I think that's fair.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  4:00   ` KAMEZAWA Hiroyuki
  2011-04-21  4:24     ` Ying Han
@ 2011-04-21  5:08     ` Johannes Weiner
  2011-04-21  5:28       ` Ying Han
  2011-04-21  5:41       ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 58+ messages in thread
From: Johannes Weiner @ 2011-04-21  5:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 21 Apr 2011 04:51:07 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > > If the cgroup is configured to use per cgroup background reclaim, a kswapd
> > > thread is created which only scans the per-memcg LRU list.
> > 
> > We already have direct reclaim, direct reclaim on behalf of a memcg,
> > and global kswapd-reclaim.  Please don't add yet another reclaim path
> > that does its own thing and interacts unpredictably with the rest of
> > them.
> > 
> > As discussed on LSF, we want to get rid of the global LRU.  So the
> > goal is to have each reclaim entry end up at the same core part of
> > reclaim that round-robin scans a subset of zones from a subset of
> > memory control groups.
> 
> It's not related to this set. And I think even if we remove global LRU,
> global-kswapd and memcg-kswapd need to do independent work.
> 
> global-kswapd : works for zone/node balancing and making free pages,
>                 and compaction. select a memcg vicitm and ask it
>                 to reduce memory with regard to gfp_mask. Starts its work
>                 when zone/node is unbalanced.

For soft limit reclaim (which is triggered by global memory pressure),
we want to scan a group of memory cgroups equally in round robin
fashion.  I think at LSF we established that it is not fair to find
the one that exceeds its limit the most and hammer it until memory
pressure is resolved or there is another group with more excess.

So even for global kswapd, sooner or later we need a mechanism to
apply equal pressure to a set of memcgs.

With the removal of the global LRU, we ALWAYS operate on a set of
memcgs in a round-robin fashion, not just for soft limit reclaim.

So yes, these are two different things, but they have the same
requirements.

> memcg-kswapd  : works for reducing usage of memory, no interests on
>                 zone/nodes. Starts when high/low watermaks hits.

When the watermark is hit in the charge path, we want to wake up the
daemon to reclaim from a specific memcg.

When multiple memcgs exceed their watermarks in parallel (after all,
we DO allow concurrency), we again have a group of memcgs we want to
reclaim from in a fair fashion until their watermarks are met again.

And memcg reclaim is not oblivious to nodes and zones; right now we also
mind the current node and respect zone balancing when we do direct reclaim
on behalf of a memcg.

So, to be honest, I really don't see how both cases should be
independent from each other.  On the contrary, I see very little
difference between them.  The entry path differs slightly as well as
the predicate for the set of memcgs to scan.  But most of the worker
code is exactly the same, no?

> > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > background reclaim and stop it. The watermarks are calculated based
> > > on the cgroup's limit_in_bytes.
> > 
> > Which brings me to the next issue: making the watermarks configurable.
> > 
> > You argued that having them adjustable from userspace is required for
> > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > in in case of global memory pressure.  But that is only a problem
> > because global kswapd reclaim is (apart from soft limit reclaim)
> > unaware of memory control groups.
> > 
> > I think the much better solution is to make global kswapd memcg aware
> > (with the above mentioned round-robin reclaim scheduler), compared to
> > adding new (and final!) kernel ABI to avoid an internal shortcoming.
> 
> I don't think its a good idea to kick kswapd even when free memory is enough.

This depends on what kswapd is supposed to be doing.  I don't say we
should reclaim from all memcgs (i.e. globally) just because one memcg
hits its watermark, of course.

But the argument was that we need the watermarks configurable to force
per-memcg reclaim even when the hard limits are overcommitted, because
global reclaim does not do a fair job to balance memcgs.  My counter
proposal is to fix global reclaim instead and apply equal pressure on
memcgs, such that we never have to tweak per-memcg watermarks to
achieve the same thing.

	Hannes


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  5:08     ` Johannes Weiner
@ 2011-04-21  5:28       ` Ying Han
  2011-04-23  1:35         ` Johannes Weiner
  2011-04-21  5:41       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-21  5:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 10:08 PM, Johannes Weiner <hannes@cmpxchg.org>wrote:

> On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 21 Apr 2011 04:51:07 +0200
> > Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > > > If the cgroup is configured to use per cgroup background reclaim, a
> kswapd
> > > > thread is created which only scans the per-memcg LRU list.
> > >
> > > We already have direct reclaim, direct reclaim on behalf of a memcg,
> > > and global kswapd-reclaim.  Please don't add yet another reclaim path
> > > that does its own thing and interacts unpredictably with the rest of
> > > them.
> > >
> > > As discussed on LSF, we want to get rid of the global LRU.  So the
> > > goal is to have each reclaim entry end up at the same core part of
> > > reclaim that round-robin scans a subset of zones from a subset of
> > > memory control groups.
> >
> > It's not related to this set. And I think even if we remove global LRU,
> > global-kswapd and memcg-kswapd need to do independent work.
> >
> > global-kswapd : works for zone/node balancing and making free pages,
> >                 and compaction. select a memcg vicitm and ask it
> >                 to reduce memory with regard to gfp_mask. Starts its work
> >                 when zone/node is unbalanced.
>
> For soft limit reclaim (which is triggered by global memory pressure),
> we want to scan a group of memory cgroups equally in round robin
> fashion.  I think at LSF we established that it is not fair to find
> the one that exceeds its limit the most and hammer it until memory
> pressure is resolved or there is another group with more excess.
>
> So even for global kswapd, sooner or later we need a mechanism to
> apply equal pressure to a set of memcgs.
>
> With the removal of the global LRU, we ALWAYS operate on a set of
> memcgs in a round-robin fashion, not just for soft limit reclaim.
>
> So yes, these are two different things, but they have the same
> requirements.
>

Hmm. I don't think we have a disagreement on the global kswapd. The plan now
is to do the round-robin based on their soft_limit. (Note: this is not how it
is implemented now, and I am working on that patch.)

>
> > memcg-kswapd  : works for reducing usage of memory, no interests on
> >                 zone/nodes. Starts when high/low watermaks hits.
>
> When the watermark is hit in the charge path, we want to wake up the
> daemon to reclaim from a specific memcg.
>
> When multiple memcgs exceed their watermarks in parallel (after all,
> we DO allow concurrency), we again have a group of memcgs we want to
> reclaim from in a fair fashion until their watermarks are met again.
>
> And memcg reclaim is not oblivious to nodes and zones, right now, we
> also do mind the current node and respect the zone balancing when we
> do direct reclaim on behalf of a memcg.
>
> So, to be honest, I really don't see how both cases should be
> independent from each other.  On the contrary, I see very little
> difference between them.  The entry path differs slightly as well as
> the predicate for the set of memcgs to scan.  But most of the worker
> code is exactly the same, no?
>

They are triggered at different points and their targets are different. One is
triggered under global pressure, and the calculation of which memcg to reclaim
from, and how much, is based on the soft_limit; its target is to bring the
zone's free pages back above the watermark and to keep the zones balanced. The
other is triggered per-memcg by its watermarks, and its target is to bring the
memcg's usage below its watermark.

>
> > > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > > background reclaim and stop it. The watermarks are calculated based
> > > > on the cgroup's limit_in_bytes.
> > >
> > > Which brings me to the next issue: making the watermarks configurable.
> > >
> > > You argued that having them adjustable from userspace is required for
> > > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > > in in case of global memory pressure.  But that is only a problem
> > > because global kswapd reclaim is (apart from soft limit reclaim)
> > > unaware of memory control groups.
> > >
> > > I think the much better solution is to make global kswapd memcg aware
> > > (with the above mentioned round-robin reclaim scheduler), compared to
> > > adding new (and final!) kernel ABI to avoid an internal shortcoming.
> >
> > I don't think its a good idea to kick kswapd even when free memory is
> enough.
>
> This depends on what kswapd is supposed to be doing.  I don't say we
> should reclaim from all memcgs (i.e. globally) just because one memcg
> hits its watermark, of course.
>
> But the argument was that we need the watermarks configurable to force
> per-memcg reclaim even when the hard limits are overcommitted, because
> global reclaim does not do a fair job to balance memcgs.


There seems to be some confusion here. The watermark we defined is per-memcg,
and it is calculated based on the hard_limit. We need the per-memcg watermark
for the same reason as the per-zone watermark, which triggers background
reclaim before direct reclaim.

There is a patch in my patchset which adds tunables for both the high and low
watermarks, which gives the admin more flexibility to configure the host. In
an over-commit environment, we might never hit the watermarks if they are all
set internally.

> My counter proposal is to fix global reclaim instead and apply equal
> pressure on memcgs, such that we never have to tweak per-memcg watermarks
> to achieve the same thing.

We still need this, and that is the soft_limit reclaim under global
background reclaim.

--Ying


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/3/] fix mem_cgroup_watemark_ok (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  3:50   ` [PATCH 3/3/] fix mem_cgroup_watemark_ok " KAMEZAWA Hiroyuki
@ 2011-04-21  5:29     ` Ying Han
  0 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-21  5:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 8:50 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

>
> Ying, I noticed this at test. please fix the code in your set.
> ==
> if low_wmark_distance = 0, mem_cgroup_watermark_ok() returns
> false when usage hits limit.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> Index: mmotm-Apr14/mm/memcontrol.c
> ===================================================================
> --- mmotm-Apr14.orig/mm/memcontrol.c
> +++ mmotm-Apr14/mm/memcontrol.c
> @@ -5062,6 +5062,9 @@ int mem_cgroup_watermark_ok(struct mem_c
>        long ret = 0;
>        int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
>
> +       if (!mem->low_wmark_distance)
> +               return 1;
> +
>        VM_BUG_ON((charge_flags & flags) == flags);
>
>        if (charge_flags & CHARGE_WMARK_LOW)
>

Thanks. Will add this in the next post.

--Ying


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  5:08     ` Johannes Weiner
  2011-04-21  5:28       ` Ying Han
@ 2011-04-21  5:41       ` KAMEZAWA Hiroyuki
  2011-04-21  6:23         ` Ying Han
  2011-04-23  2:02         ` Johannes Weiner
  1 sibling, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  5:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 07:08:51 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 21 Apr 2011 04:51:07 +0200
> > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > > If the cgroup is configured to use per cgroup background reclaim, a kswapd
> > > > thread is created which only scans the per-memcg LRU list.
> > > 
> > > We already have direct reclaim, direct reclaim on behalf of a memcg,
> > > and global kswapd-reclaim.  Please don't add yet another reclaim path
> > > that does its own thing and interacts unpredictably with the rest of
> > > them.
> > > 
> > > As discussed on LSF, we want to get rid of the global LRU.  So the
> > > goal is to have each reclaim entry end up at the same core part of
> > > reclaim that round-robin scans a subset of zones from a subset of
> > > memory control groups.
> > 
> > It's not related to this set. And I think even if we remove global LRU,
> > global-kswapd and memcg-kswapd need to do independent work.
> > 
> > global-kswapd : works for zone/node balancing and making free pages,
> >                 and compaction. select a memcg vicitm and ask it
> >                 to reduce memory with regard to gfp_mask. Starts its work
> >                 when zone/node is unbalanced.
> 
> For soft limit reclaim (which is triggered by global memory pressure),
> we want to scan a group of memory cgroups equally in round robin
> fashion.  I think at LSF we established that it is not fair to find
> the one that exceeds its limit the most and hammer it until memory
> pressure is resolved or there is another group with more excess.
> 

Why do you guys keep mixing the softlimit discussion with the high/low
watermark discussion?


> So even for global kswapd, sooner or later we need a mechanism to
> apply equal pressure to a set of memcgs.
> 

yes, please do rework.


> With the removal of the global LRU, we ALWAYS operate on a set of
> memcgs in a round-robin fashion, not just for soft limit reclaim.
> 
> So yes, these are two different things, but they have the same
> requirements.
> 

Then please go ahead and make all of those changes.


> > memcg-kswapd  : works for reducing usage of memory, no interests on
> >                 zone/nodes. Starts when high/low watermaks hits.
> 
> When the watermark is hit in the charge path, we want to wake up the
> daemon to reclaim from a specific memcg.
> 
> When multiple memcgs exceed their watermarks in parallel (after all,
> we DO allow concurrency), we again have a group of memcgs we want to
> reclaim from in a fair fashion until their watermarks are met again.
> 

That is never a reason to wake up kswapd.


> And memcg reclaim is not oblivious to nodes and zones, right now, we
> also do mind the current node and respect the zone balancing when we
> do direct reclaim on behalf of a memcg.
> 
If you find a problem, please fix it.


> So, to be honest, I really don't see how both cases should be
> independent from each other.  On the contrary, I see very little
> difference between them.  The entry path differs slightly as well as
> the predicate for the set of memcgs to scan.  But most of the worker
> code is exactly the same, no?
> 

No. memcg-background-reclaim will eventually need a better algorithm, using
the memcg's file/anon ratio, swappiness, and dirty ratio. It works as a
service provided by the kernel to help performance.

global-background-reclaim will need to depend on the global file/anon ratio,
swappiness, and dirty ratio. It works as a service provided by the kernel to
maintain free memory.

I don't want to mix the two here until we are convinced we can do that.

memcg-kswapd does:
 1. pick up a memcg
 2. scan and reclaim from it

global-kswapd does:
 1. pick up a zone
 2. pick up a suitable memcg for reclaiming that zone's pages
 3. check zone balancing
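
A very rough sketch of those two loops, for illustration only:
mem_cgroup_get_shrink_target(), mem_cgroup_put_shrink_target(),
mem_cgroup_kswapd_can_sleep(), mem_cgroup_kswapd_waitq() and
shrink_mem_cgroup() are from the patches earlier in this thread
(shrink_mem_cgroup() is shown with a simplified signature), while
reclaim_zone_from_memcg() and pick_memcg_for_zone() are hypothetical:

	/* memcg-kswapd side: no zone/node awareness, only usage vs. watermarks */
	while (!mem_cgroup_kswapd_can_sleep()) {
		struct mem_cgroup *mem = mem_cgroup_get_shrink_target();

		if (mem) {
			shrink_mem_cgroup(mem);		/* simplified signature */
			mem_cgroup_put_shrink_target(mem);
		}
	}
	/* ...then sleep on mem_cgroup_kswapd_waitq() until woken again */

	/* global-kswapd side: per-zone balancing, one suitable memcg per zone */
	struct zone *zone;

	for_each_zone(zone)
		if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone), 0, 0))
			reclaim_zone_from_memcg(zone, pick_memcg_for_zone(zone));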

We _may_ be able to merge them eventually, but I'm unsure. A total rework
after implementing a nicely working memcg-kswapd is welcome.

I want to fix problems one by one. Reworking this area when removing the
global LRU is not a heavy burden, and it will be an interesting job. At that
rework, global kswapd/global direct reclaim need to consider
  - getting free memory
  - compaction of multi-order pages
  - balancing zones
  - balancing nodes
  - OOM
  + balancing memcgs (with softlimit) and LRU ordering
  + dirty-ratio (it may be better for kswapd to avoid picking a busy memcg)
  + hi/low watermarks (if you want)

"+" marks the new things added by memcg.
We need to establish each of these, and each needs performance/statistics
checks.

I don't think we can implement them all perfectly in a rush. I expect to hit
unexpected problems on the way to a realistic solution.

> > > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > > background reclaim and stop it. The watermarks are calculated based
> > > > on the cgroup's limit_in_bytes.
> > > 
> > > Which brings me to the next issue: making the watermarks configurable.
> > > 
> > > You argued that having them adjustable from userspace is required for
> > > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > > in in case of global memory pressure.  But that is only a problem
> > > because global kswapd reclaim is (apart from soft limit reclaim)
> > > unaware of memory control groups.
> > > 
> > > I think the much better solution is to make global kswapd memcg aware
> > > (with the above mentioned round-robin reclaim scheduler), compared to
> > > adding new (and final!) kernel ABI to avoid an internal shortcoming.
> > 
> > I don't think its a good idea to kick kswapd even when free memory is enough.
> 
> This depends on what kswapd is supposed to be doing.  I don't say we
> should reclaim from all memcgs (i.e. globally) just because one memcg
> hits its watermark, of course.
> 
> But the argument was that we need the watermarks configurable to force
> per-memcg reclaim even when the hard limits are overcommitted, because
> global reclaim does not do a fair job to balance memcgs.  

I cannot follow you here. Why does global reclaim need to do work other than
balancing zones? And what is "balancing memcgs"? Are you referring to the
softlimit?

> My counter
> proposal is to fix global reclaim instead and apply equal pressure on
> memcgs, such that we never have to tweak per-memcg watermarks to
> achieve the same thing.
> 

I cannot understand this, either. Aren't you mixing this up with the softlimit
discussion? Making global kswapd better is a separate discussion.

The high/low watermark is a feature in its own right. It is the third way to
limit memory usage. Compared with the hard_limit and soft_limit, it works in a
moderate way in the background, and it works regardless of global memory
usage. I think it's valid to have interfaces for tuning this.


Thanks,
-Kame

 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/3] weight for memcg background reclaim (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  3:48   ` [PATCH 2/3] weight for memcg background reclaim (Was " KAMEZAWA Hiroyuki
@ 2011-04-21  6:11     ` Ying Han
  2011-04-21  6:38       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-21  6:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 8:48 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

>
> memcg-kswapd visits each memcg in round-robin. But required
> amounts of works depends on memcg' usage and hi/low watermark
> and taking it into account will be good.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |   17 +++++++++++++++++
>  mm/vmscan.c                |    2 ++
>  3 files changed, 20 insertions(+)
>
> Index: mmotm-Apr14/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-Apr14.orig/include/linux/memcontrol.h
> +++ mmotm-Apr14/include/linux/memcontrol.h
> @@ -98,6 +98,7 @@ extern bool mem_cgroup_kswapd_can_sleep(
>  extern struct mem_cgroup *mem_cgroup_get_shrink_target(void);
>  extern void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
>  extern wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
> +extern int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem);
>
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> Index: mmotm-Apr14/mm/memcontrol.c
> ===================================================================
> --- mmotm-Apr14.orig/mm/memcontrol.c
> +++ mmotm-Apr14/mm/memcontrol.c
> @@ -4673,6 +4673,23 @@ struct memcg_kswapd_work
>
>  struct memcg_kswapd_work       memcg_kswapd_control;
>
> +int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem)
> +{
> +       unsigned long long usage, lowat, hiwat;
> +       int rate;
> +
> +       usage = res_counter_read_u64(&mem->res, RES_USAGE);
> +       lowat = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
> +       hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> +       if (lowat == hiwat)
> +               return 0;
> +
> +       rate = (usage - hiwat) * 10 / (lowat - hiwat);
> +       /* If usage is big, we reclaim more */
> +       return rate * SWAP_CLUSTER_MAX;
> +}
> +
>


I understand the logic in general: we would like to reclaim more each time if
more work needs to be done. But I am not quite sure about the calculation
here; (usage - hiwat) determines the amount of work for kswapd, so why divide
by (lowat - hiwat)? My guess is that the larger that value is, the later we
trigger kswapd?


--Ying



>


>


>  static void wake_memcg_kswapd(struct mem_cgroup *mem)
>  {
>        if (atomic_read(&mem->kswapd_running)) /* already running */
> Index: mmotm-Apr14/mm/vmscan.c
> ===================================================================
> --- mmotm-Apr14.orig/mm/vmscan.c
> +++ mmotm-Apr14/mm/vmscan.c
> @@ -2732,6 +2732,8 @@ static int shrink_mem_cgroup(struct mem_
>        sc.nr_reclaimed = 0;
>        total_scanned = 0;
>
> +       sc.nr_to_reclaim += mem_cgroup_kswapd_bonus(mem_cont);
> +
>        do_nodes = node_states[N_ONLINE];
>
>        for (priority = DEF_PRIORITY;
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  5:41       ` KAMEZAWA Hiroyuki
@ 2011-04-21  6:23         ` Ying Han
  2011-04-23  2:02         ` Johannes Weiner
  1 sibling, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-21  6:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 10:41 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 21 Apr 2011 07:08:51 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 21 Apr 2011 04:51:07 +0200
> > > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > > > If the cgroup is configured to use per cgroup background reclaim, a
> kswapd
> > > > > thread is created which only scans the per-memcg LRU list.
> > > >
> > > > We already have direct reclaim, direct reclaim on behalf of a memcg,
> > > > and global kswapd-reclaim.  Please don't add yet another reclaim path
> > > > that does its own thing and interacts unpredictably with the rest of
> > > > them.
> > > >
> > > > As discussed on LSF, we want to get rid of the global LRU.  So the
> > > > goal is to have each reclaim entry end up at the same core part of
> > > > reclaim that round-robin scans a subset of zones from a subset of
> > > > memory control groups.
> > >
> > > It's not related to this set. And I think even if we remove global LRU,
> > > global-kswapd and memcg-kswapd need to do independent work.
> > >
> > > global-kswapd : works for zone/node balancing and making free pages,
> > >                 and compaction. select a memcg vicitm and ask it
> > >                 to reduce memory with regard to gfp_mask. Starts its
> work
> > >                 when zone/node is unbalanced.
> >
> > For soft limit reclaim (which is triggered by global memory pressure),
> > we want to scan a group of memory cgroups equally in round robin
> > fashion.  I think at LSF we established that it is not fair to find
> > the one that exceeds its limit the most and hammer it until memory
> > pressure is resolved or there is another group with more excess.
> >
>
> Why do you guys like to make a mixture discussion of softlimit and
> high/low watermarks ?
>

Yes, we've been talking about the soft_limit discussion at LSF, but I haven't
mentioned this per-memcg kswapd effort enough. They are indeed independent
efforts.

>
> > So even for global kswapd, sooner or later we need a mechanism to
> > apply equal pressure to a set of memcgs.
> >
>
> yes, please do rework.
>
>
> > With the removal of the global LRU, we ALWAYS operate on a set of
> > memcgs in a round-robin fashion, not just for soft limit reclaim.
> >
> > So yes, these are two different things, but they have the same
> > requirements.
> >
>
> Please do make changes all again.
>
>
> > > memcg-kswapd  : works for reducing usage of memory, no interests on
> > >                 zone/nodes. Starts when high/low watermaks hits.
> >
> > When the watermark is hit in the charge path, we want to wake up the
> > daemon to reclaim from a specific memcg.
> >
> > When multiple memcgs exceed their watermarks in parallel (after all,
> > we DO allow concurrency), we again have a group of memcgs we want to
> > reclaim from in a fair fashion until their watermarks are met again.
> >
>
> It's never be reason to make kswapd wake up.
>
>
> > And memcg reclaim is not oblivious to nodes and zones, right now, we
> > also do mind the current node and respect the zone balancing when we
> > do direct reclaim on behalf of a memcg.
> >
> If you find problem, please fix.
>
>
> > So, to be honest, I really don't see how both cases should be
> > independent from each other.  On the contrary, I see very little
> > difference between them.  The entry path differs slightly as well as
> > the predicate for the set of memcgs to scan.  But most of the worker
> > code is exactly the same, no?
> >
>
> No. memcg-background-reclaim will eventually need a better algorithm,
> using the per-memcg file/anon ratio, swappiness and dirty-ratio. And it
> works as a service provided by the kernel for helping performance.
>
> global-background-reclaim will need to depend on the global file/anon
> ratio, swappiness and dirty-ratio. This works as a service provided by
> the kernel for maintaining free memory.
>
> I don't want to make a mixture here until we're convinced we can do that.
>
> memcg-kswapd does:
>  1. pick up memcg
>  2. do scan and reclaim
>
> global-kswapd does:
>  1. pick up zone.
>  2. pick up suitable memcg for reclaiming this zone's page
>  3. check zone balancing.
>
> We _may_ be able to merge them eventually, but I'm unsure. A total rework
> after implementing a nicely-working memcg-kswapd is welcome.
>
> I want to fix problems one by one. Reworking this area when removing the
> global LRU is not a heavy burden, but will be an interesting job. At that
> rework, global kswapd/global direct-reclaim need to consider
>  - getting free memory
>  - compaction of multi-order pages.
>
This is the interesting part. We don't deal with high-order page reclaim in
memcg, so there will be no lumpy reclaim in the soft_limit reclaim under
global kswapd. I also mentioned that in:
http://permalink.gmane.org/gmane.linux.kernel.mm/60966


>  - balancing zones
>

This should be covered by the current soft_limit reclaim proposal above. I
don't want to go into too much detail in this thread.

>  - balancing nodes
>
Not sure about this.


>  - OOM.
>  + balancing memcgs (with softlimit) and LRU ordering
>

Agreed, and I would like to start with round-robin.


>  + dirty-ratio (it may be better for kswapd to avoid picking a busy memcg.)
>  + hi/low watermark (if you want).
>

I assume this refers to the zone watermarks.

>
> "+" is new things added by memcg.
> We need to establish each of these, and each needs performance/statistics
> checks.
>
> I don't think we can implement them all perfectly in a rush. I think I'll
> see unexpected problems on my way to a realistic solution.
>

I will review the 3 patches you just posted and test them with my V7.

--Ying

>
> > > > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > > > background reclaim and stop it. The watermarks are calculated based
> > > > > on the cgroup's limit_in_bytes.
> > > >
> > > > Which brings me to the next issue: making the watermarks configurable.
> > > >
> > > > You argued that having them adjustable from userspace is required for
> > > > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > > > in in case of global memory pressure.  But that is only a problem
> > > > because global kswapd reclaim is (apart from soft limit reclaim)
> > > > unaware of memory control groups.
> > > >
> > > > I think the much better solution is to make global kswapd memcg aware
> > > > (with the above mentioned round-robin reclaim scheduler), compared to
> > > > adding new (and final!) kernel ABI to avoid an internal shortcoming.
> > >
> > > I don't think it's a good idea to kick kswapd even when free memory is
> > > enough.
> >
> > This depends on what kswapd is supposed to be doing.  I don't say we
> > should reclaim from all memcgs (i.e. globally) just because one memcg
> > hits its watermark, of course.
> >
> > But the argument was that we need the watermarks configurable to force
> > per-memcg reclaim even when the hard limits are overcommitted, because
> > global reclaim does not do a fair job to balance memcgs.
>
> I cannot understand this. Why does global reclaim need to do work other than
> balancing zones? And what is balancing memcgs? Are you mentioning softlimit?
>
> > My counter
> > proposal is to fix global reclaim instead and apply equal pressure on
> > memcgs, such that we never have to tweak per-memcg watermarks to
> > achieve the same thing.
> >
>
> I cannot understand this, either. Aren't you making a mixture of the
> discussion with softlimit? Making global kswapd better is another discussion.
>
> The hi/low watermark is a feature in its own right. It is the 3rd way to
> limit memory usage. Compared to hard_limit and soft_limit, it works in a
> moderate way in the background and works regardless of global memory usage.
> I think it's valid to have interfaces for tuning this.
>
>
> Thanks,
> -Kame
>
>
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/3] weight for memcg background reclaim (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  6:11     ` Ying Han
@ 2011-04-21  6:38       ` KAMEZAWA Hiroyuki
  2011-04-21  6:59         ` Ying Han
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  6:38 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Wed, 20 Apr 2011 23:11:42 -0700
Ying Han <yinghan@google.com> wrote:

> On Wed, Apr 20, 2011 at 8:48 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> >
> > memcg-kswapd visits each memcg in round-robin. But the required
> > amount of work depends on the memcg's usage and hi/low watermarks,
> > and taking that into account will be good.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |    1 +
> >  mm/memcontrol.c            |   17 +++++++++++++++++
> >  mm/vmscan.c                |    2 ++
> >  3 files changed, 20 insertions(+)
> >
> > Index: mmotm-Apr14/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-Apr14.orig/include/linux/memcontrol.h
> > +++ mmotm-Apr14/include/linux/memcontrol.h
> > @@ -98,6 +98,7 @@ extern bool mem_cgroup_kswapd_can_sleep(
> >  extern struct mem_cgroup *mem_cgroup_get_shrink_target(void);
> >  extern void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
> >  extern wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
> > +extern int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem);
> >
> >  static inline
> >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> > *cgroup)
> > Index: mmotm-Apr14/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-Apr14.orig/mm/memcontrol.c
> > +++ mmotm-Apr14/mm/memcontrol.c
> > @@ -4673,6 +4673,23 @@ struct memcg_kswapd_work
> >
> >  struct memcg_kswapd_work       memcg_kswapd_control;
> >
> > +int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem)
> > +{
> > +       unsigned long long usage, lowat, hiwat;
> > +       int rate;
> > +
> > +       usage = res_counter_read_u64(&mem->res, RES_USAGE);
> > +       lowat = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
> > +       hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> > +       if (lowat == hiwat)
> > +               return 0;
> > +
> > +       rate = (usage - hiwat) * 10 / (lowat - hiwat);
> > +       /* If usage is big, we reclaim more */
> > +       return rate * SWAP_CLUSTER_MAX;

This may be buggy and we should have an upper limit on this 'rate'.


> > +}
> > +
> >
> 
> 
> > I understand the logic in general: we would like to reclaim more each
> > time if more work needs to be done. But I'm not quite sure about the
> > calculation here: (usage - hiwat) determines the amount of work for
> > kswapd, so why divide by (lowat - hiwat)? My guess is that the larger
> > the value, the later we will trigger kswapd?
> 
Because memcg-kswapd will require more work on this memcg if usage-high is large.

For now, I'm not sure this logic is good, but I wanted to show there is a
chance to do some scheduling.

We have 2 ways to implement this kind of weight:

 1. modify the select-memcg logic
    I think we'll see starvation easily, so I didn't do this this time.

 2. modify the amount of nr_to_reclaim
    We'll be able to determine the amount by some calculation using some statistics.

I selected "2" this time.

With the HIGH/LOW watermarks, the admin sets the LOW watermark as a kind of
limit. Then, if usage is more than the LOW watermark, that memcg's priority
will be higher than other memcgs which have lower (relative) usage. In
general, memcg-kswapd can reduce memory down to the high watermark only when
the system is not busy. So, this logic tries to remove more memory from a
busy cgroup to reduce 'hit limit'.

And I wonder: a memcg contains pages which are related to each other, so
reclaiming an amount larger than 32 pages at once may make sense.
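
For illustration, here is a minimal userspace C sketch of that weighting
with an upper bound on 'rate'; the cap value, the helper name and the
numbers in main() are hypothetical and not part of the posted patch:

#include <stdio.h>

#define SWAP_CLUSTER_MAX        32UL
#define RATE_MAX                10UL    /* hypothetical cap on the weight */

/*
 * Weight a memcg by how far usage sits above the high watermark,
 * normalized by the low-high gap, then clamp the result so a single
 * visit never reclaims an unbounded amount.
 */
static unsigned long kswapd_bonus(unsigned long long usage,
                                  unsigned long long lowat,
                                  unsigned long long hiwat)
{
        unsigned long long rate;

        if (lowat <= hiwat || usage <= hiwat)
                return 0;

        rate = (usage - hiwat) * 10 / (lowat - hiwat);
        if (rate > RATE_MAX)
                rate = RATE_MAX;

        return rate * SWAP_CLUSTER_MAX; /* extra pages for this visit */
}

int main(void)
{
        /* hypothetical memcg: high watermark 400MB, low watermark 480MB */
        unsigned long long hiwat = 400ULL << 20, lowat = 480ULL << 20;

        /* a little above the high watermark: small bonus */
        printf("usage 420MB -> bonus %lu pages\n",
               kswapd_bonus(420ULL << 20, lowat, hiwat));
        /* far above the low watermark: the cap kicks in */
        printf("usage 620MB -> bonus %lu pages\n",
               kswapd_bonus(620ULL << 20, lowat, hiwat));
        return 0;
}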

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/3] weight for memcg background reclaim (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  6:38       ` KAMEZAWA Hiroyuki
@ 2011-04-21  6:59         ` Ying Han
  2011-04-21  7:01           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-21  6:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 11:38 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 20 Apr 2011 23:11:42 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Wed, Apr 20, 2011 at 8:48 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > >
> > > memcg-kswapd visits each memcg in round-robin. But required
> > > amounts of works depends on memcg' usage and hi/low watermark
> > > and taking it into account will be good.
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  include/linux/memcontrol.h |    1 +
> > >  mm/memcontrol.c            |   17 +++++++++++++++++
> > >  mm/vmscan.c                |    2 ++
> > >  3 files changed, 20 insertions(+)
> > >
> > > Index: mmotm-Apr14/include/linux/memcontrol.h
> > > ===================================================================
> > > --- mmotm-Apr14.orig/include/linux/memcontrol.h
> > > +++ mmotm-Apr14/include/linux/memcontrol.h
> > > @@ -98,6 +98,7 @@ extern bool mem_cgroup_kswapd_can_sleep(
> > >  extern struct mem_cgroup *mem_cgroup_get_shrink_target(void);
> > >  extern void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
> > >  extern wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
> > > +extern int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem);
> > >
> > >  static inline
> > >  int mm_match_cgroup(const struct mm_struct *mm, const struct
> mem_cgroup
> > > *cgroup)
> > > Index: mmotm-Apr14/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-Apr14.orig/mm/memcontrol.c
> > > +++ mmotm-Apr14/mm/memcontrol.c
> > > @@ -4673,6 +4673,23 @@ struct memcg_kswapd_work
> > >
> > >  struct memcg_kswapd_work       memcg_kswapd_control;
> > >
> > > +int mem_cgroup_kswapd_bonus(struct mem_cgroup *mem)
> > > +{
> > > +       unsigned long long usage, lowat, hiwat;
> > > +       int rate;
> > > +
> > > +       usage = res_counter_read_u64(&mem->res, RES_USAGE);
> > > +       lowat = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
> > > +       hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> > > +       if (lowat == hiwat)
> > > +               return 0;
> > > +
> > > +       rate = (usage - hiwat) * 10 / (lowat - hiwat);
> > > +       /* If usage is big, we reclaim more */
> > > +       return rate * SWAP_CLUSTER_MAX;
>
> This may be buggy and we should have upper limit on this 'rate'.
>
>
> > > +}
> > > +
> > >
> >
> >
> > > I understand the logic in general: we would like to reclaim more each
> > > time if more work needs to be done. But I'm not quite sure about the
> > > calculation here: (usage - hiwat) determines the amount of work for
> > > kswapd, so why divide by (lowat - hiwat)? My guess is that the larger
> > > the value, the later we will trigger kswapd?
> >
> Because memcg-kswapd will require more work on this memcg if usage-high is
> large.
>

Agreed, and that is the idea of making "rate" proportional to
(usage - high).

>
> At first, I'm not sure this logic is good but wanted to show there is a
> chance to
> do some schedule.
>
> We have 2 ways to implement this kind of weight
>
>  1. modify to select memcg logic
>    I think we'll see starvation easily. So, didn't this for this time.
>
>  2. modify the amount to nr_to_reclaim
>    We'll be able to determine the amount by some calculation using some
> statistics.
>
> I selected "2" for this time.
>
> With HIGH/LOW watermark, the admin set LOW watermark as a kind of limit.
> Then,
> if usage is more than LOW watermark, its priority will be higher than other
> memcg
> which has lower (relative) usage.


OK, now I know a bit more of the logic behind it. Here, we would like to
reclaim more from the memcg which has a higher (usage - low).

> In general, memcg-kswapd can reduce memory down to the high watermark only
> when the system is not busy. So, this logic tries to remove more memory
> from a busy cgroup to reduce 'hit limit'.
>

So, the "busy cgroup" here means the memcg has higher (usage - low)?

--Ying

>
> And I wonder: a memcg contains pages which are related to each other, so
> reclaiming an amount larger than 32 pages at once may make sense.
>
>

> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/3] weight for memcg background reclaim (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  6:59         ` Ying Han
@ 2011-04-21  7:01           ` KAMEZAWA Hiroyuki
  2011-04-21  7:12             ` Ying Han
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  7:01 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Wed, 20 Apr 2011 23:59:52 -0700
Ying Han <yinghan@google.com> wrote:

> On Wed, Apr 20, 2011 at 11:38 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 20 Apr 2011 23:11:42 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > On Wed, Apr 20, 2011 at 8:48 PM, KAMEZAWA Hiroyuki <
> > > kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > In general, memcg-kswapd can reduce memory down to the high watermark only
> > when the system is not busy. So, this logic tries to remove more memory
> > from a busy cgroup to reduce 'hit limit'.
> >
> 
> So, the "busy cgroup" here means the memcg has higher (usage - low)?
> 

  high < usage < low < limit

Yes, if background reclaim wins, usage - high decreases.
If the tasks in the cgroup use more memory than reclaim frees, usage - high
increases even while background reclaim runs. So, if usage - high is large,
the cgroup is busy.
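
To make 'busy' concrete, a tiny standalone C example with two hypothetical
memcgs sharing the same watermarks; every number here is made up:

#include <stdio.h>

/*
 * The memcg whose tasks allocate faster than background reclaim frees
 * keeps a larger (usage - high), so it is the "busy" one and gets the
 * larger per-visit weight.
 */
int main(void)
{
        unsigned long long high = 400ULL << 20, low = 480ULL << 20;
        unsigned long long idle_usage = 410ULL << 20;   /* almost back at high */
        unsigned long long busy_usage = 600ULL << 20;   /* still growing */

        printf("idle: usage - high = %lluMB, rate = %llu\n",
               (idle_usage - high) >> 20,
               (idle_usage - high) * 10 / (low - high));
        printf("busy: usage - high = %lluMB, rate = %llu (would be capped)\n",
               (busy_usage - high) >> 20,
               (busy_usage - high) * 10 / (low - high));
        return 0;
}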



Thanks,
-Kame


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/3] memcg kswapd thread pool (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  3:43 ` [PATCH 1/3] memcg kswapd thread pool (Was " KAMEZAWA Hiroyuki
@ 2011-04-21  7:09   ` Ying Han
  2011-04-21  7:14     ` KAMEZAWA Hiroyuki
  2011-04-21  8:10   ` Minchan Kim
  1 sibling, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-21  7:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Wed, Apr 20, 2011 at 8:43 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Ying, please take this as just a hint; you don't need to implement this as is.
>

Thank you for the patch.


> ==
> Now, memcg-kswapd is created per cgroup. Considering there are users
> who create hundreds of cgroups on a system, it consumes too many
> resources: memory and cpu time.
>
> This patch creates a thread pool for memcg-kswapd. All memcgs which
> need background reclaim are linked to a list, and memcg-kswapd
> picks up a memcg from the list and runs reclaim. This reclaims
> SWAP_CLUSTER_MAX pages and puts the memcg back at the tail of the
> list. memcg-kswapd will visit memcgs in a round-robin manner and
> reduce usages.
>
> This patch does
>
>  - adds memcg-kswapd thread pool, the number of threads is now
>   sqrt(num_of_cpus) + 1.
>  - use unified kswapd_waitq for all memcgs.
>

So I looked through the patch: it implements an alternative threading model
using a thread pool. It also includes some changes to calculating how many
pages to reclaim per memcg. Other than that, the existing implementation of
per-memcg kswapd doesn't seem to be impacted.

I tried to apply the patch but got some conflicts in vmscan.c; I will try
some manual work tomorrow. Meantime, after applying the patch, I will test
it with the same test suite I used on the original patch. AFAIK, the only
difference between the two threading models is the amount of resources we
consume on the kswapd kernel threads, which shouldn't cause run-time
performance differences.


>  - refine memcg shrink codes in vmscan.c
>

Those seem to be the comments from V6, and I already have them changed in V7
(haven't posted it yet).

 --Ying


> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    5
>  include/linux/swap.h       |    7 -
>  mm/memcontrol.c            |  174 +++++++++++++++++++++++----------
>  mm/memory_hotplug.c        |    4
>  mm/page_alloc.c            |    1
>  mm/vmscan.c                |  237
> ++++++++++++++++++---------------------------
>  6 files changed, 232 insertions(+), 196 deletions(-)
>
> Index: mmotm-Apr14/mm/memcontrol.c
> ===================================================================
> --- mmotm-Apr14.orig/mm/memcontrol.c
> +++ mmotm-Apr14/mm/memcontrol.c
> @@ -49,6 +49,8 @@
>  #include <linux/cpu.h>
>  #include <linux/oom.h>
>  #include "internal.h"
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
>
>  #include <asm/uaccess.h>
>
> @@ -274,6 +276,12 @@ struct mem_cgroup {
>         */
>        unsigned long   move_charge_at_immigrate;
>        /*
> +        * memcg kswapd control stuff.
> +        */
> +       atomic_t                kswapd_running; /* !=0 if a kswapd runs */
> +       wait_queue_head_t       memcg_kswapd_end; /* for waiting the end*/
> +       struct list_head        memcg_kswapd_wait_list;/* for shceduling */
> +       /*
>         * percpu counter.
>         */
>        struct mem_cgroup_stat_cpu *stat;
> @@ -296,7 +304,6 @@ struct mem_cgroup {
>         */
>        int last_scanned_node;
>
> -       wait_queue_head_t *kswapd_wait;
>  };
>
>  /* Stuffs for move charges at task migration. */
> @@ -380,6 +387,7 @@ static struct mem_cgroup *parent_mem_cgr
>  static void drain_all_stock_async(void);
>
>  static void wake_memcg_kswapd(struct mem_cgroup *mem);
> +static void memcg_kswapd_stop(struct mem_cgroup *mem);
>
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -916,9 +924,6 @@ static void setup_per_memcg_wmarks(struc
>
>                res_counter_set_low_wmark_limit(&mem->res, low_wmark);
>                res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> -
> -               if (!mem_cgroup_is_root(mem) && !mem->kswapd_wait)
> -                       kswapd_run(0, mem);
>        }
>  }
>
> @@ -3729,6 +3734,7 @@ move_account:
>                ret = -EBUSY;
>                if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
>                        goto out;
> +               memcg_kswapd_stop(mem);
>                ret = -EINTR;
>                if (signal_pending(current))
>                        goto out;
> @@ -4655,6 +4661,120 @@ static int mem_cgroup_oom_control_write(
>        return 0;
>  }
>
> +/*
> + * Controls for background memory reclam stuff.
> + */
> +struct memcg_kswapd_work
> +{
> +       spinlock_t              lock;  /* lock for list */
> +       struct list_head        list;  /* list of works. */
> +       wait_queue_head_t       waitq;
> +};
> +
> +struct memcg_kswapd_work       memcg_kswapd_control;
> +
> +static void wake_memcg_kswapd(struct mem_cgroup *mem)
> +{
> +       if (atomic_read(&mem->kswapd_running)) /* already running */
> +               return;
> +
> +       spin_lock(&memcg_kswapd_control.lock);
> +       if (list_empty(&mem->memcg_kswapd_wait_list))
> +               list_add_tail(&mem->memcg_kswapd_wait_list,
> +                               &memcg_kswapd_control.list);
> +       spin_unlock(&memcg_kswapd_control.lock);
> +       wake_up(&memcg_kswapd_control.waitq);
> +       return;
> +}
> +
> +static void memcg_kswapd_wait_end(struct mem_cgroup *mem)
> +{
> +       DEFINE_WAIT(wait);
> +
> +       prepare_to_wait(&mem->memcg_kswapd_end, &wait, TASK_INTERRUPTIBLE);
> +       if (atomic_read(&mem->kswapd_running))
> +               schedule();
> +       finish_wait(&mem->memcg_kswapd_end, &wait);
> +}
> +
> +/* called at pre_destroy */
> +static void memcg_kswapd_stop(struct mem_cgroup *mem)
> +{
> +       spin_lock(&memcg_kswapd_control.lock);
> +       if (!list_empty(&mem->memcg_kswapd_wait_list))
> +               list_del(&mem->memcg_kswapd_wait_list);
> +       spin_unlock(&memcg_kswapd_control.lock);
> +
> +       memcg_kswapd_wait_end(mem);
> +}
> +
> +struct mem_cgroup *mem_cgroup_get_shrink_target(void)
> +{
> +       struct mem_cgroup *mem;
> +
> +       spin_lock(&memcg_kswapd_control.lock);
> +       rcu_read_lock();
> +       do {
> +               mem = NULL;
> +               if (!list_empty(&memcg_kswapd_control.list)) {
> +                       mem = list_entry(memcg_kswapd_control.list.next,
> +                                       struct mem_cgroup,
> +                                       memcg_kswapd_wait_list);
> +                       list_del_init(&mem->memcg_kswapd_wait_list);
> +               }
> +       } while (mem && !css_tryget(&mem->css));
> +       if (mem)
> +               atomic_inc(&mem->kswapd_running);
> +       rcu_read_unlock();
> +       spin_unlock(&memcg_kswapd_control.lock);
> +       return mem;
> +}
> +
> +void mem_cgroup_put_shrink_target(struct mem_cgroup *mem)
> +{
> +       if (!mem)
> +               return;
> +       atomic_dec(&mem->kswapd_running);
> +       if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
> +               spin_lock(&memcg_kswapd_control.lock);
> +               if (list_empty(&mem->memcg_kswapd_wait_list)) {
> +                       list_add_tail(&mem->memcg_kswapd_wait_list,
> +                                       &memcg_kswapd_control.list);
> +               }
> +               spin_unlock(&memcg_kswapd_control.lock);
> +       }
> +       wake_up_all(&mem->memcg_kswapd_end);
> +       cgroup_release_and_wakeup_rmdir(&mem->css);
> +}
> +
> +bool mem_cgroup_kswapd_can_sleep(void)
> +{
> +       return list_empty(&memcg_kswapd_control.list);
> +}
> +
> +wait_queue_head_t *mem_cgroup_kswapd_waitq(void)
> +{
> +       return &memcg_kswapd_control.waitq;
> +}
> +
> +static int __init memcg_kswapd_init(void)
> +{
> +
> +       int i, nr_threads;
> +
> +       spin_lock_init(&memcg_kswapd_control.lock);
> +       INIT_LIST_HEAD(&memcg_kswapd_control.list);
> +       init_waitqueue_head(&memcg_kswapd_control.waitq);
> +
> +       nr_threads = int_sqrt(num_possible_cpus()) + 1;
> +       for (i = 0; i < nr_threads; i++)
> +               if (kswapd_run(0, i + 1) == -1)
> +                       break;
> +       return 0;
> +}
> +module_init(memcg_kswapd_init);
> +
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -4935,33 +5055,6 @@ int mem_cgroup_watermark_ok(struct mem_c
>        return ret;
>  }
>
> -int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd
> *kswapd_p)
> -{
> -       if (!mem || !kswapd_p)
> -               return 0;
> -
> -       mem->kswapd_wait = &kswapd_p->kswapd_wait;
> -       kswapd_p->kswapd_mem = mem;
> -
> -       return css_id(&mem->css);
> -}
> -
> -void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
> -{
> -       if (mem)
> -               mem->kswapd_wait = NULL;
> -
> -       return;
> -}
> -
> -wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
> -{
> -       if (!mem)
> -               return NULL;
> -
> -       return mem->kswapd_wait;
> -}
> -
>  int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
>  {
>        if (!mem)
> @@ -4970,22 +5063,6 @@ int mem_cgroup_last_scanned_node(struct
>        return mem->last_scanned_node;
>  }
>
> -static inline
> -void wake_memcg_kswapd(struct mem_cgroup *mem)
> -{
> -       wait_queue_head_t *wait;
> -
> -       if (!mem || !mem->high_wmark_distance)
> -               return;
> -
> -       wait = mem->kswapd_wait;
> -
> -       if (!wait || !waitqueue_active(wait))
> -               return;
> -
> -       wake_up_interruptible(wait);
> -}
> -
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>        struct mem_cgroup_tree_per_node *rtpn;
> @@ -5069,6 +5146,8 @@ mem_cgroup_create(struct cgroup_subsys *
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> +       init_waitqueue_head(&mem->memcg_kswapd_end);
> +       INIT_LIST_HEAD(&mem->memcg_kswapd_wait_list);
>        return &mem->css;
>  free_out:
>        __mem_cgroup_free(mem);
> @@ -5089,7 +5168,6 @@ static void mem_cgroup_destroy(struct cg
>  {
>        struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>
> -       kswapd_stop(0, mem);
>        mem_cgroup_put(mem);
>  }
>
> Index: mmotm-Apr14/include/linux/swap.h
> ===================================================================
> --- mmotm-Apr14.orig/include/linux/swap.h
> +++ mmotm-Apr14/include/linux/swap.h
> @@ -28,9 +28,8 @@ static inline int current_is_kswapd(void
>
>  struct kswapd {
>        struct task_struct *kswapd_task;
> -       wait_queue_head_t kswapd_wait;
> +       wait_queue_head_t *kswapd_wait;
>        pg_data_t *kswapd_pgdat;
> -       struct mem_cgroup *kswapd_mem;
>  };
>
>  int kswapd(void *p);
> @@ -307,8 +306,8 @@ static inline void scan_unevictable_unre
>  }
>  #endif
>
> -extern int kswapd_run(int nid, struct mem_cgroup *mem);
> -extern void kswapd_stop(int nid, struct mem_cgroup *mem);
> +extern int kswapd_run(int nid, int id);
> +extern void kswapd_stop(int nid);
>
>  #ifdef CONFIG_MMU
>  /* linux/mm/shmem.c */
> Index: mmotm-Apr14/mm/page_alloc.c
> ===================================================================
> --- mmotm-Apr14.orig/mm/page_alloc.c
> +++ mmotm-Apr14/mm/page_alloc.c
> @@ -4199,6 +4199,7 @@ static void __paginginit free_area_init_
>
>        pgdat_resize_init(pgdat);
>        pgdat->nr_zones = 0;
> +       init_waitqueue_head(&pgdat->kswapd_wait);
>        pgdat->kswapd_max_order = 0;
>        pgdat_page_cgroup_init(pgdat);
>
> Index: mmotm-Apr14/mm/vmscan.c
> ===================================================================
> --- mmotm-Apr14.orig/mm/vmscan.c
> +++ mmotm-Apr14/mm/vmscan.c
> @@ -2256,7 +2256,7 @@ static bool pgdat_balanced(pg_data_t *pg
>        return balanced_pages > (present_pages >> 2);
>  }
>
> -#define is_global_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
> +#define is_global_kswapd(kswapd_p) ((kswapd_p)->kswapd_pgdat)
>
>  /* is kswapd sleeping prematurely? */
>  static bool sleeping_prematurely(pg_data_t *pgdat, int order, long
> remaining,
> @@ -2599,50 +2599,56 @@ static void kswapd_try_to_sleep(struct k
>        long remaining = 0;
>        DEFINE_WAIT(wait);
>        pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> -       wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> +       wait_queue_head_t *wait_h = kswapd_p->kswapd_wait;
>
>        if (freezing(current) || kthread_should_stop())
>                return;
>
>        prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>
> -       if (!is_global_kswapd(kswapd_p)) {
> -               schedule();
> -               goto out;
> -       }
> -
> -       /* Try to sleep for a short interval */
> -       if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
> {
> -               remaining = schedule_timeout(HZ/10);
> -               finish_wait(wait_h, &wait);
> -               prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> -       }
> -
> -       /*
> -        * After a short sleep, check if it was a premature sleep. If not,
> then
> -        * go fully to sleep until explicitly woken up.
> -        */
> -       if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
> {
> -               trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +       if (is_global_kswapd(kswapd_p)) {
> +               /* Try to sleep for a short interval */
> +               if (!sleeping_prematurely(pgdat, order,
> +                               remaining, classzone_idx)) {
> +                       remaining = schedule_timeout(HZ/10);
> +                       finish_wait(wait_h, &wait);
> +                       prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> +               }
>
>                /*
> -                * vmstat counters are not perfectly accurate and the
> estimated
> -                * value for counters such as NR_FREE_PAGES can deviate
> from the
> -                * true value by nr_online_cpus * threshold. To avoid the
> zone
> -                * watermarks being breached while under pressure, we
> reduce the
> -                * per-cpu vmstat threshold while kswapd is awake and
> restore
> -                * them before going back to sleep.
> -                */
> -               set_pgdat_percpu_threshold(pgdat,
> calculate_normal_threshold);
> -               schedule();
> -               set_pgdat_percpu_threshold(pgdat,
> calculate_pressure_threshold);
> +                * After a short sleep, check if it was a premature sleep.
> +                * If not, then go fully to sleep until explicitly woken
> up.
> +                */
> +               if (!sleeping_prematurely(pgdat, order,
> +                                       remaining, classzone_idx)) {
> +                       trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +                       /*
> +                        * vmstat counters are not perfectly accurate and
> +                        * the estimated value for counters such as
> +                        * NR_FREE_PAGES  can deviate from the true value
> for
> +                        * counters such as NR_FREE_PAGES can deviate from
> the
> +                        *  true value by nr_online_cpus * threshold. To
> avoid
> +                        *  the zonewatermarks being breached while under
> +                        *  pressure, we reduce the per-cpu vmstat
> threshold
> +                        *  while kswapd is awake and restore them before
> +                        *  going back to sleep.
> +                        */
> +                       set_pgdat_percpu_threshold(pgdat,
> +                                       calculate_normal_threshold);
> +                       schedule();
> +                       set_pgdat_percpu_threshold(pgdat,
> +                                       calculate_pressure_threshold);
> +               } else {
> +                       if (remaining)
> +
> count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> +                       else
> +
> count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> +               }
>        } else {
> -               if (remaining)
> -                       count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> -               else
> -                       count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> +               /* For now, we just check the remaining works.*/
> +               if (mem_cgroup_kswapd_can_sleep())
> +                       schedule();
>        }
> -out:
>        finish_wait(wait_h, &wait);
>  }
>
> @@ -2651,8 +2657,8 @@ out:
>  * The function is used for per-memcg LRU. It scanns all the zones of the
>  * node and returns the nr_scanned and nr_reclaimed.
>  */
> -static void balance_pgdat_node(pg_data_t *pgdat, int order,
> -                                       struct scan_control *sc)
> +static void shrink_memcg_node(pg_data_t *pgdat, int order,
> +                               struct scan_control *sc)
>  {
>        int i;
>        unsigned long total_scanned = 0;
> @@ -2705,14 +2711,9 @@ static void balance_pgdat_node(pg_data_t
>  * Per cgroup background reclaim.
>  * TODO: Take off the order since memcg always do order 0
>  */
> -static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> -                                             int order)
> +static int shrink_mem_cgroup(struct mem_cgroup *mem_cont, int order)
>  {
> -       int i, nid;
> -       int start_node;
> -       int priority;
> -       bool wmark_ok;
> -       int loop;
> +       int i, nid, priority, loop;
>        pg_data_t *pgdat;
>        nodemask_t do_nodes;
>        unsigned long total_scanned;
> @@ -2726,43 +2727,34 @@ static unsigned long balance_mem_cgroup_
>                .mem_cgroup = mem_cont,
>        };
>
> -loop_again:
>        do_nodes = NODE_MASK_NONE;
>        sc.may_writepage = !laptop_mode;
>        sc.nr_reclaimed = 0;
>        total_scanned = 0;
>
> -       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> -               sc.priority = priority;
> -               wmark_ok = false;
> -               loop = 0;
> +       do_nodes = node_states[N_ONLINE];
>
> +       for (priority = DEF_PRIORITY;
> +               (priority >= 0) && (sc.nr_to_reclaim > sc.nr_reclaimed);
> +               priority--) {
> +
> +               sc.priority = priority;
>                /* The swap token gets in the way of swapout... */
>                if (!priority)
>                        disable_swap_token();
> +               /*
> +                * We'll scan a node given by memcg's logic. For avoiding
> +                * burning cpu, we have a limit of this loop.
> +                */
> +               for (loop = num_online_nodes();
> +                       (loop > 0) && !nodes_empty(do_nodes);
> +                       loop--) {
>
> -               if (priority == DEF_PRIORITY)
> -                       do_nodes = node_states[N_ONLINE];
> -
> -               while (1) {
>                        nid = mem_cgroup_select_victim_node(mem_cont,
>                                                        &do_nodes);
> -
> -                       /*
> -                        * Indicate we have cycled the nodelist once
> -                        * TODO: we might add MAX_RECLAIM_LOOP for
> preventing
> -                        * kswapd burning cpu cycles.
> -                        */
> -                       if (loop == 0) {
> -                               start_node = nid;
> -                               loop++;
> -                       } else if (nid == start_node)
> -                               break;
> -
>                        pgdat = NODE_DATA(nid);
> -                       balance_pgdat_node(pgdat, order, &sc);
> +                       shrink_memcg_node(pgdat, order, &sc);
>                        total_scanned += sc.nr_scanned;
> -
>                        /*
>                         * Set the node which has at least one reclaimable
>                         * zone
> @@ -2770,10 +2762,8 @@ loop_again:
>                        for (i = pgdat->nr_zones - 1; i >= 0; i--) {
>                                struct zone *zone = pgdat->node_zones + i;
>
> -                               if (!populated_zone(zone))
> -                                       continue;
> -
> -                               if (!mem_cgroup_mz_unreclaimable(mem_cont,
> +                               if (populated_zone(zone) &&
> +                                   !mem_cgroup_mz_unreclaimable(mem_cont,
>                                                                zone))
>                                        break;
>                        }
> @@ -2781,36 +2771,18 @@ loop_again:
>                                node_clear(nid, do_nodes);
>
>                        if (mem_cgroup_watermark_ok(mem_cont,
> -                                                       CHARGE_WMARK_HIGH))
> {
> -                               wmark_ok = true;
> -                               goto out;
> -                       }
> -
> -                       if (nodes_empty(do_nodes)) {
> -                               wmark_ok = true;
> +                                               CHARGE_WMARK_HIGH))
>                                goto out;
> -                       }
>                }
>
>                if (total_scanned && priority < DEF_PRIORITY - 2)
>                        congestion_wait(WRITE, HZ/10);
> -
> -               if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> -                       break;
>        }
>  out:
> -       if (!wmark_ok) {
> -               cond_resched();
> -
> -               try_to_freeze();
> -
> -               goto loop_again;
> -       }
> -
>        return sc.nr_reclaimed;
>  }
>  #else
> -static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +static unsigned long shrink_mem_cgroup(struct mem_cgroup *mem_cont,
>                                                        int order)
>  {
>        return 0;
> @@ -2836,8 +2808,7 @@ int kswapd(void *p)
>        int classzone_idx;
>        struct kswapd *kswapd_p = (struct kswapd *)p;
>        pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> -       struct mem_cgroup *mem = kswapd_p->kswapd_mem;
> -       wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> +       struct mem_cgroup *mem;
>        struct task_struct *tsk = current;
>
>        struct reclaim_state reclaim_state = {
> @@ -2848,7 +2819,6 @@ int kswapd(void *p)
>        lockdep_set_current_reclaim_state(GFP_KERNEL);
>
>        if (is_global_kswapd(kswapd_p)) {
> -               BUG_ON(pgdat->kswapd_wait != wait_h);
>                cpumask = cpumask_of_node(pgdat->node_id);
>                if (!cpumask_empty(cpumask))
>                        set_cpus_allowed_ptr(tsk, cpumask);
> @@ -2908,18 +2878,20 @@ int kswapd(void *p)
>                if (kthread_should_stop())
>                        break;
>
> +               if (ret)
> +                       continue;
>                /*
>                 * We can speed up thawing tasks if we don't call
> balance_pgdat
>                 * after returning from the refrigerator
>                 */
> -               if (!ret) {
> -                       if (is_global_kswapd(kswapd_p)) {
> -                               trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> -                                                               order);
> -                               order = balance_pgdat(pgdat, order,
> -                                                       &classzone_idx);
> -                       } else
> -                               balance_mem_cgroup_pgdat(mem, order);
> +               if (is_global_kswapd(kswapd_p)) {
> +                       trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> +                       order = balance_pgdat(pgdat, order,
> &classzone_idx);
> +               } else {
> +                       mem = mem_cgroup_get_shrink_target();
> +                       if (mem)
> +                               shrink_mem_cgroup(mem, order);
> +                       mem_cgroup_put_shrink_target(mem);
>                }
>        }
>        return 0;
> @@ -2942,13 +2914,13 @@ void wakeup_kswapd(struct zone *zone, in
>                pgdat->kswapd_max_order = order;
>                pgdat->classzone_idx = min(pgdat->classzone_idx,
> classzone_idx);
>        }
> -       if (!waitqueue_active(pgdat->kswapd_wait))
> +       if (!waitqueue_active(&pgdat->kswapd_wait))
>                return;
>        if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0,
> 0))
>                return;
>
>        trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone),
> order);
> -       wake_up_interruptible(pgdat->kswapd_wait);
> +       wake_up_interruptible(&pgdat->kswapd_wait);
>  }
>
>  /*
> @@ -3046,9 +3018,8 @@ static int __devinit cpu_callback(struct
>
>                        mask = cpumask_of_node(pgdat->node_id);
>
> -                       wait = pgdat->kswapd_wait;
> -                       kswapd_p = container_of(wait, struct kswapd,
> -                                               kswapd_wait);
> +                       wait = &pgdat->kswapd_wait;
> +                       kswapd_p = pgdat->kswapd;
>                        kswapd_tsk = kswapd_p->kswapd_task;
>
>                        if (cpumask_any_and(cpu_online_mask, mask) <
> nr_cpu_ids)
> @@ -3064,18 +3035,17 @@ static int __devinit cpu_callback(struct
>  * This kswapd start function will be called by init and node-hot-add.
>  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
>  */
> -int kswapd_run(int nid, struct mem_cgroup *mem)
> +int kswapd_run(int nid, int memcgid)
>  {
>        struct task_struct *kswapd_tsk;
>        pg_data_t *pgdat = NULL;
>        struct kswapd *kswapd_p;
>        static char name[TASK_COMM_LEN];
> -       int memcg_id = -1;
>        int ret = 0;
>
> -       if (!mem) {
> +       if (!memcgid) {
>                pgdat = NODE_DATA(nid);
> -               if (pgdat->kswapd_wait)
> +               if (pgdat->kswapd)
>                        return ret;
>        }
>
> @@ -3083,34 +3053,26 @@ int kswapd_run(int nid, struct mem_cgrou
>        if (!kswapd_p)
>                return -ENOMEM;
>
> -       init_waitqueue_head(&kswapd_p->kswapd_wait);
> -
> -       if (!mem) {
> -               pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +       if (!memcgid) {
> +               pgdat->kswapd = kswapd_p;
> +               kswapd_p->kswapd_wait = &pgdat->kswapd_wait;
>                kswapd_p->kswapd_pgdat = pgdat;
>                snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
>        } else {
> -               memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
> -               if (!memcg_id) {
> -                       kfree(kswapd_p);
> -                       return ret;
> -               }
> -               snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
> +               kswapd_p->kswapd_wait = mem_cgroup_kswapd_waitq();
> +               snprintf(name, TASK_COMM_LEN, "memcg_%d", memcgid);
>        }
>
>        kswapd_tsk = kthread_run(kswapd, kswapd_p, name);
>        if (IS_ERR(kswapd_tsk)) {
>                /* failure at boot is fatal */
>                BUG_ON(system_state == SYSTEM_BOOTING);
> -               if (!mem) {
> +               if (!memcgid) {
>                        printk(KERN_ERR "Failed to start kswapd on node
> %d\n",
>                                                                nid);
> -                       pgdat->kswapd_wait = NULL;
> -               } else {
> -                       printk(KERN_ERR "Failed to start kswapd on memcg
> %d\n",
> -                                                               memcg_id);
> -                       mem_cgroup_clear_kswapd(mem);
> -               }
> +                       pgdat->kswapd = NULL;
> +               } else
> +                       printk(KERN_ERR "Failed to start kswapd on
> memcg\n");
>                kfree(kswapd_p);
>                ret = -1;
>        } else
> @@ -3121,23 +3083,14 @@ int kswapd_run(int nid, struct mem_cgrou
>  /*
>  * Called by memory hotplug when all memory in a node is offlined.
>  */
> -void kswapd_stop(int nid, struct mem_cgroup *mem)
> +void kswapd_stop(int nid)
>  {
>        struct task_struct *kswapd_tsk = NULL;
>        struct kswapd *kswapd_p = NULL;
> -       wait_queue_head_t *wait;
> -
> -       if (!mem)
> -               wait = NODE_DATA(nid)->kswapd_wait;
> -       else
> -               wait = mem_cgroup_kswapd_wait(mem);
> -
> -       if (wait) {
> -               kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> -               kswapd_tsk = kswapd_p->kswapd_task;
> -               kswapd_p->kswapd_task = NULL;
> -       }
>
> +       kswapd_p = NODE_DATA(nid)->kswapd;
> +       kswapd_tsk = kswapd_p->kswapd_task;
> +       kswapd_p->kswapd_task = NULL;
>        if (kswapd_tsk)
>                kthread_stop(kswapd_tsk);
>
> @@ -3150,7 +3103,7 @@ static int __init kswapd_init(void)
>
>        swap_setup();
>        for_each_node_state(nid, N_HIGH_MEMORY)
> -               kswapd_run(nid, NULL);
> +               kswapd_run(nid, 0);
>        hotcpu_notifier(cpu_callback, 0);
>        return 0;
>  }
> Index: mmotm-Apr14/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-Apr14.orig/include/linux/memcontrol.h
> +++ mmotm-Apr14/include/linux/memcontrol.h
> @@ -94,6 +94,11 @@ extern int mem_cgroup_last_scanned_node(
>  extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
>                                        const nodemask_t *nodes);
>
> +extern bool mem_cgroup_kswapd_can_sleep(void);
> +extern struct mem_cgroup *mem_cgroup_get_shrink_target(void);
> +extern void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
> +extern wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
> +
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
>  {
> Index: mmotm-Apr14/mm/memory_hotplug.c
> ===================================================================
> --- mmotm-Apr14.orig/mm/memory_hotplug.c
> +++ mmotm-Apr14/mm/memory_hotplug.c
> @@ -463,7 +463,7 @@ int __ref online_pages(unsigned long pfn
>        init_per_zone_wmark_min();
>
>        if (onlined_pages) {
> -               kswapd_run(zone_to_nid(zone), NULL);
> +               kswapd_run(zone_to_nid(zone), 0);
>                node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
>        }
>
> @@ -898,7 +898,7 @@ repeat:
>
>        if (!node_present_pages(node)) {
>                node_clear_state(node, N_HIGH_MEMORY);
> -               kswapd_stop(node, NULL);
> +               kswapd_stop(node);
>        }
>
>        vm_total_pages = nr_free_pagecache_pages();
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/3] weight for memcg background reclaim (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  7:01           ` KAMEZAWA Hiroyuki
@ 2011-04-21  7:12             ` Ying Han
  0 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-21  7:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 21, 2011 at 12:01 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 20 Apr 2011 23:59:52 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Wed, Apr 20, 2011 at 11:38 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Wed, 20 Apr 2011 23:11:42 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > On Wed, Apr 20, 2011 at 8:48 PM, KAMEZAWA Hiroyuki <
> > > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > > In general, memcg-kswapd can reduce memory down to the high watermark
> > > only when the system is not busy. So, this logic tries to remove more
> > > memory from a busy cgroup to reduce 'hit limit'.
> > >
> >
> > So, the "busy cgroup" here means the memcg has higher (usage - low)?
> >
>
>   high < usage < low < limit
>
> Yes, if background reclaim wins, usage - high decreases.
> If tasks on cgroup uses more memory than reclaim, usage - high increases
> even
> if background reclaim runs. So, if usage-high is large, cgroup is busy.
>
Yes, I think I understand the (usage - high) in the calculation, but not
the (low - high).

--Ying

>
>
> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/3] memcg kswapd thread pool (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  7:09   ` Ying Han
@ 2011-04-21  7:14     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  7:14 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 00:09:13 -0700
Ying Han <yinghan@google.com> wrote:

> On Wed, Apr 20, 2011 at 8:43 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Ying, please take this just a hint, you don't need to implement this as is.
> >
> 
> Thank you for the patch.
> 
> 
> > ==
> > Now, memcg-kswapd is created per a cgroup. Considering there are users
> > who creates hundreds on cgroup on a system, it consumes too much
> > resources, memory, cputime.
> >
> > This patch creates a thread pool for memcg-kswapd. All memcg which
> > needs background recalim are linked to a list and memcg-kswapd
> > picks up a memcg from the list and run reclaim. This reclaimes
> > SWAP_CLUSTER_MAX of pages and putback the memcg to the lail of
> > list. memcg-kswapd will visit memcgs in round-robin manner and
> > reduce usages.
> >
> > This patch does
> >
> >  - adds memcg-kswapd thread pool, the number of threads is now
> >   sqrt(num_of_cpus) + 1.
> >  - use unified kswapd_waitq for all memcgs.
> >
> 
> So I looked through the patch, it implements an alternative threading model
> using thread-pool. Also it includes some changes on calculating how much
> pages to reclaim per memcg. Other than that, all the existing implementation
> of per-memcg-kswapd seems not being impacted.
> 
> I tried to apply the patch but get some conflicts on vmscan.c/ I will try
> some manual work tomorrow. Meantime, after applying the patch, I will try to
> test it w/ the same test suite i used on original patch. AFAIK, the only
> difference of the two threading model is the amount of resources we consume
> on the kswapd kernel thread, which shouldn't have run-time performance
> differences.
> 

I hope so.
To be honest, I don't like the one-thread-per-one-job model because it wastes
resources and cache footprint, and all we can do is hope the scheduler
schedules the tasks well. I like one-thread-per-multiple-jobs, switching at a
finer grain with knowledge of the memory cgroups.
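
As a rough illustration of that model, here is a small userspace pthread
sketch of a shared worklist served by a fixed pool of workers; it is only
an analogy for the scheme above, not the kernel code, and all names and
numbers are invented:

#include <pthread.h>
#include <stdio.h>

/* one work item per memcg that is above its low watermark */
struct memcg_item {
        const char *name;
        long pages_over_high;           /* work left before it can be dropped */
        struct memcg_item *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct memcg_item *head;         /* shared round-robin list */

static void requeue(struct memcg_item *m)
{
        struct memcg_item **p = &head;

        while (*p)
                p = &(*p)->next;        /* append at the tail */
        m->next = NULL;
        *p = m;
}

static struct memcg_item *dequeue(void)
{
        struct memcg_item *m = head;

        if (m)
                head = m->next;
        return m;
}

static void *worker(void *arg)
{
        (void)arg;
        for (;;) {
                struct memcg_item *m;

                pthread_mutex_lock(&lock);
                m = dequeue();
                pthread_mutex_unlock(&lock);
                if (!m)
                        break;          /* real code would sleep on a waitqueue */

                /* reclaim one small batch, then requeue if not done yet */
                m->pages_over_high -= 32;
                printf("%s: %ld pages still over the high watermark\n",
                       m->name, m->pages_over_high > 0 ? m->pages_over_high : 0);

                pthread_mutex_lock(&lock);
                if (m->pages_over_high > 0)
                        requeue(m);
                pthread_mutex_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        struct memcg_item a = { "memcg-A", 64, NULL };
        struct memcg_item b = { "memcg-B", 128, NULL };
        pthread_t t[2];
        int i;

        requeue(&a);
        requeue(&b);
        /* a small fixed pool of workers, not one kswapd thread per memcg */
        for (i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, worker, NULL);
        for (i = 0; i < 2; i++)
                pthread_join(t[i], NULL);
        return 0;
}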

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/3] memcg kswapd thread pool (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  3:43 ` [PATCH 1/3] memcg kswapd thread pool (Was " KAMEZAWA Hiroyuki
  2011-04-21  7:09   ` Ying Han
@ 2011-04-21  8:10   ` Minchan Kim
  2011-04-21  8:46     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-04-21  8:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

Hi Kame,

On Thu, Apr 21, 2011 at 12:43 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Ying, please take this just a hint, you don't need to implement this as is.
> ==
> Now, memcg-kswapd is created per a cgroup. Considering there are users
> who creates hundreds on cgroup on a system, it consumes too much
> resources, memory, cputime.
>
> This patch creates a thread pool for memcg-kswapd. All memcg which
> needs background recalim are linked to a list and memcg-kswapd
> picks up a memcg from the list and run reclaim. This reclaimes
> SWAP_CLUSTER_MAX of pages and putback the memcg to the lail of
> list. memcg-kswapd will visit memcgs in round-robin manner and
> reduce usages.
>

I haven't looked at the code yet, but just looking over the description, I
have a concern.
We have discussed LRU separation between global and memcg.
The clear goal is how to keep _fairness_.

For example,

memcg-1 : # pages of LRU : 64
memcg-2 : # pages of LRU : 128
memcg-3 : # pages of LRU : 256

If we have to reclaim 96 pages, memcg-1 would lose half of its pages.
That's a much bigger fraction than for the others, so memcg-1's page LRU
rotation cycle would be very fast, and working set pages in memcg-1
wouldn't get a chance to be promoted.
Is that fair?

I think we should take the memcg LRU size into account when doing round-robin.
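
To make the fairness point concrete, a tiny standalone sketch that weights
each memcg's scan target by its LRU size, using the numbers above; this is
only an illustration of the suggestion, not code from the series:

#include <stdio.h>

/*
 * Split one reclaim target across memcgs in proportion to their LRU size,
 * so the smallest memcg is not asked to give up half of its pages.
 * Integer division loses a couple of pages to rounding.
 */
int main(void)
{
        unsigned long lru[3] = { 64, 128, 256 };        /* the example above */
        unsigned long goal = 96, total = 0;
        int i;

        for (i = 0; i < 3; i++)
                total += lru[i];
        for (i = 0; i < 3; i++)
                printf("memcg-%d: scan target %lu of %lu LRU pages\n",
                       i + 1, goal * lru[i] / total, lru[i]);
        return 0;
}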

Thanks.

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/3] memcg kswapd thread pool (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  8:10   ` Minchan Kim
@ 2011-04-21  8:46     ` KAMEZAWA Hiroyuki
  2011-04-21  9:05       ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-21  8:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Ying Han, KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 21 Apr 2011 17:10:23 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi Kame,
> 
> On Thu, Apr 21, 2011 at 12:43 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Ying, please take this just a hint, you don't need to implement this as is.
> > ==
> > Now, memcg-kswapd is created per a cgroup. Considering there are users
> > who creates hundreds on cgroup on a system, it consumes too much
> > resources, memory, cputime.
> >
> > This patch creates a thread pool for memcg-kswapd. All memcg which
> > needs background recalim are linked to a list and memcg-kswapd
> > picks up a memcg from the list and run reclaim. This reclaimes
> > SWAP_CLUSTER_MAX of pages and putback the memcg to the lail of
> > list. memcg-kswapd will visit memcgs in round-robin manner and
> > reduce usages.
> >
> 
> I didn't look at code yet but as I just look over the description, I
> have a concern.
> We have discussed LRU separation between global and memcg.

Please discuss the global LRU in another thread. memcg-kswapd is not related
to the global LRU _at all_.

And this patch set is independent of the things we discussed at LSF.


> The clear goal is that how to keep _fairness_.
> 
> For example,
> 
> memcg-1 : # pages of LRU : 64
> memcg-2 : # pages of LRU : 128
> memcg-3 : # pages of LRU : 256
> 
> If we have to reclaim 96 pages, memcg-1 would be lost half of pages.
> It's much greater than others so memcg 1's page LRU rotation cycle
> would be very fast, then working set pages in memcg-1 don't have a
> chance to promote.
> Is it fair?
> 
> I think we should consider memcg-LRU size as doing round-robin.
> 

This set doesn't implement a feature to handle your example case, at all.

This patch set handles

memcg-1: # pages over watermark : 64
memcg-2: # pages over watermark : 128
memcg-3: # pages over watermark : 256

and finally reclaims all the pages over the watermarks which the user
requested. Considering fairness, what we consider is in what order we
reclaim memory from memcg-1, memcg-2 and memcg-3, and how to avoid
unnecessary cpu hogging while reclaiming all of it (64+128+256).

This thread pool reclaim 32 pages per iteration with patch-1 and visit all
in round-robin.
With patch-2, reclaim 32*weight pages per iteration on each memcg.
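
To make that behaviour concrete, here is a tiny user-space C simulation (an
illustration only, not the kernel code in these patches) of the patch-1
loop: each queued memcg is visited in round-robin and relieved of at most
SWAP_CLUSTER_MAX (32) pages per visit until its excess over the watermark is
gone; patch-2 would scale the per-visit chunk by a per-memcg weight.

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32

int main(void)
{
	/* Pages above the watermark for memcg-1..3, as in the example. */
	unsigned long over[3] = { 64, 128, 256 };
	int i, pass = 0, remaining = 3;

	while (remaining) {
		remaining = 0;
		pass++;
		for (i = 0; i < 3; i++) {
			unsigned long chunk;

			if (!over[i])
				continue;
			/* Fixed chunk per visit (patch-1); patch-2 would
			 * scale this chunk by a per-memcg weight.       */
			chunk = over[i] < SWAP_CLUSTER_MAX ?
					over[i] : SWAP_CLUSTER_MAX;
			over[i] -= chunk;
			printf("pass %d: memcg-%d reclaims %lu, %lu left\n",
			       pass, i + 1, chunk, over[i]);
			if (over[i])
				remaining++;
		}
	}
	return 0;
}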


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/3] memcg kswapd thread pool (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  8:46     ` KAMEZAWA Hiroyuki
@ 2011-04-21  9:05       ` Minchan Kim
  2011-04-21 16:56         ` Ying Han
  0 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2011-04-21  9:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, Apr 21, 2011 at 5:46 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 21 Apr 2011 17:10:23 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi Kame,
>>
>> On Thu, Apr 21, 2011 at 12:43 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > Ying, please take this just a hint, you don't need to implement this as is.
>> > ==
>> > Now, memcg-kswapd is created per a cgroup. Considering there are users
>> > who creates hundreds on cgroup on a system, it consumes too much
>> > resources, memory, cputime.
>> >
>> > This patch creates a thread pool for memcg-kswapd. All memcg which
>> > needs background recalim are linked to a list and memcg-kswapd
>> > picks up a memcg from the list and run reclaim. This reclaimes
>> > SWAP_CLUSTER_MAX of pages and putback the memcg to the lail of
>> > list. memcg-kswapd will visit memcgs in round-robin manner and
>> > reduce usages.
>> >
>>
>> I didn't look at code yet but as I just look over the description, I
>> have a concern.
>> We have discussed LRU separation between global and memcg.
>
> Please discuss global LRU in other thread. memcg-kswapd is not related
> to global LRU _at all_.
>
> And this patch set is independent from the things we discussed at LSF.
>
>
>> The clear goal is that how to keep _fairness_.
>>
>> For example,
>>
>> memcg-1 : # pages of LRU : 64
>> memcg-2 : # pages of LRU : 128
>> memcg-3 : # pages of LRU : 256
>>
>> If we have to reclaim 96 pages, memcg-1 would be lost half of pages.
>> It's much greater than others so memcg 1's page LRU rotation cycle
>> would be very fast, then working set pages in memcg-1 don't have a
>> chance to promote.
>> Is it fair?
>>
>> I think we should consider memcg-LRU size as doing round-robin.
>>
>
> This set doesn't implement a feature to handle your example case, at all.

Sure. Sorry for the confusion.
I don't mean the global LRU but fairness, although this series is based on
per-memcg targeting.

>
> This patch set handles
>
> memcg-1: # pages of over watermark : 64
> memcg-2: # pages of over watermark : 128
> memcg-3: # pages of over watermark : 256
>
> And finally reclaim all pages over watermarks which user requested.
> Considering fairness, what we consider is in what order we reclaim
> memory memcg-1, memcg-2, memcg-3 and how to avoid unnecessary cpu
> hogging at reclaiming memory all (64+128+256)
>
> This thread pool reclaim 32 pages per iteration with patch-1 and visit all
> in round-robin.
> With patch-2, reclaim 32*weight pages per iteration on each memcg.
>

I should have seen patch [2/3] before posting the comment.
It seems you may have already considered my concern.
Okay, I will look at the idea.

>
> Thanks,
> -Kame
>
>



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/3] memcg kswapd thread pool (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  9:05       ` Minchan Kim
@ 2011-04-21 16:56         ` Ying Han
  2011-04-22  1:02           ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-21 16:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm


On Thu, Apr 21, 2011 at 2:05 AM, Minchan Kim <minchan.kim@gmail.com> wrote:

> On Thu, Apr 21, 2011 at 5:46 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 21 Apr 2011 17:10:23 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> Hi Kame,
> >>
> >> On Thu, Apr 21, 2011 at 12:43 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > Ying, please take this just a hint, you don't need to implement this
> as is.
> >> > ==
> >> > Now, memcg-kswapd is created per a cgroup. Considering there are users
> >> > who creates hundreds on cgroup on a system, it consumes too much
> >> > resources, memory, cputime.
> >> >
> >> > This patch creates a thread pool for memcg-kswapd. All memcg which
> >> > needs background recalim are linked to a list and memcg-kswapd
> >> > picks up a memcg from the list and run reclaim. This reclaimes
> >> > SWAP_CLUSTER_MAX of pages and putback the memcg to the lail of
> >> > list. memcg-kswapd will visit memcgs in round-robin manner and
> >> > reduce usages.
> >> >
> >>
> >> I didn't look at code yet but as I just look over the description, I
> >> have a concern.
> >> We have discussed LRU separation between global and memcg.
> >
> > Please discuss global LRU in other thread. memcg-kswapd is not related
> > to global LRU _at all_.
> >
> > And this patch set is independent from the things we discussed at LSF.
> >
> >
> >> The clear goal is that how to keep _fairness_.
> >>
> >> For example,
> >>
> >> memcg-1 : # pages of LRU : 64
> >> memcg-2 : # pages of LRU : 128
> >> memcg-3 : # pages of LRU : 256
> >>
> >> If we have to reclaim 96 pages, memcg-1 would be lost half of pages.
> >> It's much greater than others so memcg 1's page LRU rotation cycle
> >> would be very fast, then working set pages in memcg-1 don't have a
> >> chance to promote.
> >> Is it fair?
> >>
> >> I think we should consider memcg-LRU size as doing round-robin.
> >>
> >
> > This set doesn't implement a feature to handle your example case, at all.
>
> Sure. Sorry for the confusing.
> I don't mean global LRU but it a fairness although this series is
> based on per-memcg targeting.
>
> >
> > This patch set handles
> >
> > memcg-1: # pages of over watermark : 64
> > memcg-2: # pages of over watermark : 128
> > memcg-3: # pages of over watermark : 256
> >
> > And finally reclaim all pages over watermarks which user requested.
> > Considering fairness, what we consider is in what order we reclaim
> > memory memcg-1, memcg-2, memcg-3 and how to avoid unnecessary cpu
> > hogging at reclaiming memory all (64+128+256)
> >
> > This thread pool reclaim 32 pages per iteration with patch-1 and visit
> all
> > in round-robin.
> > With patch-2, reclaim 32*weight pages per iteration on each memcg.
> >
>
> I should have seen the patch [2/3] before posting the comment.
> Maybe you seem consider my concern.
> Okay. I will look the idea.
>

For any ideas on global kswapd and soft_limit reclaim based on round-robin
(discussed at LSF), please move the discussion to:

[RFC no patch yet] memcg: revisit soft_limit reclaim on contention:
http://permalink.gmane.org/gmane.linux.kernel.mm/60966

I have already started on the patch and hope to post some results soon.

--Ying


> >
> > Thanks,
> > -Kame
> >
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/3] memcg kswapd thread pool (Was Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21 16:56         ` Ying Han
@ 2011-04-22  1:02           ` Minchan Kim
  0 siblings, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2011-04-22  1:02 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

On Fri, Apr 22, 2011 at 1:56 AM, Ying Han <yinghan@google.com> wrote:
>
>
> On Thu, Apr 21, 2011 at 2:05 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>
>> On Thu, Apr 21, 2011 at 5:46 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 21 Apr 2011 17:10:23 +0900
>> > Minchan Kim <minchan.kim@gmail.com> wrote:
>> >
>> >> Hi Kame,
>> >>
>> >> On Thu, Apr 21, 2011 at 12:43 PM, KAMEZAWA Hiroyuki
>> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >> > Ying, please take this just a hint, you don't need to implement this
>> >> > as is.
>> >> > ==
>> >> > Now, memcg-kswapd is created per a cgroup. Considering there are
>> >> > users
>> >> > who creates hundreds on cgroup on a system, it consumes too much
>> >> > resources, memory, cputime.
>> >> >
>> >> > This patch creates a thread pool for memcg-kswapd. All memcg which
>> >> > needs background recalim are linked to a list and memcg-kswapd
>> >> > picks up a memcg from the list and run reclaim. This reclaimes
>> >> > SWAP_CLUSTER_MAX of pages and putback the memcg to the lail of
>> >> > list. memcg-kswapd will visit memcgs in round-robin manner and
>> >> > reduce usages.
>> >> >
>> >>
>> >> I didn't look at code yet but as I just look over the description, I
>> >> have a concern.
>> >> We have discussed LRU separation between global and memcg.
>> >
>> > Please discuss global LRU in other thread. memcg-kswapd is not related
>> > to global LRU _at all_.
>> >
>> > And this patch set is independent from the things we discussed at LSF.
>> >
>> >
>> >> The clear goal is that how to keep _fairness_.
>> >>
>> >> For example,
>> >>
>> >> memcg-1 : # pages of LRU : 64
>> >> memcg-2 : # pages of LRU : 128
>> >> memcg-3 : # pages of LRU : 256
>> >>
>> >> If we have to reclaim 96 pages, memcg-1 would be lost half of pages.
>> >> It's much greater than others so memcg 1's page LRU rotation cycle
>> >> would be very fast, then working set pages in memcg-1 don't have a
>> >> chance to promote.
>> >> Is it fair?
>> >>
>> >> I think we should consider memcg-LRU size as doing round-robin.
>> >>
>> >
>> > This set doesn't implement a feature to handle your example case, at
>> > all.
>>
>> Sure. Sorry for the confusing.
>> I don't mean global LRU but it a fairness although this series is
>> based on per-memcg targeting.
>>
>> >
>> > This patch set handles
>> >
>> > memcg-1: # pages of over watermark : 64
>> > memcg-2: # pages of over watermark : 128
>> > memcg-3: # pages of over watermark : 256
>> >
>> > And finally reclaim all pages over watermarks which user requested.
>> > Considering fairness, what we consider is in what order we reclaim
>> > memory memcg-1, memcg-2, memcg-3 and how to avoid unnecessary cpu
>> > hogging at reclaiming memory all (64+128+256)
>> >
>> > This thread pool reclaim 32 pages per iteration with patch-1 and visit
>> > all
>> > in round-robin.
>> > With patch-2, reclaim 32*weight pages per iteration on each memcg.
>> >
>>
>> I should have seen the patch [2/3] before posting the comment.
>> Maybe you seem consider my concern.
>> Okay. I will look the idea.
>
> For any ideas on global kswapd and soft_limit reclaim based on round-robin (
> discussed in LSF), please move the discussion to :
>
> [RFC no patch yet] memcg: revisit soft_limit reclaim on contention:
>
> http://permalink.gmane.org/gmane.linux.kernel.mm/60966"
>
> I already started with the patch and hopefully to post some result soon.

Okay. I am looking forward to seeing.
Thanks, Ying.

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  5:28       ` Ying Han
@ 2011-04-23  1:35         ` Johannes Weiner
  2011-04-23  2:10           ` Ying Han
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2011-04-23  1:35 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

On Wed, Apr 20, 2011 at 10:28:17PM -0700, Ying Han wrote:
> On Wed, Apr 20, 2011 at 10:08 PM, Johannes Weiner <hannes@cmpxchg.org>wrote:
> > On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> > > I don't think its a good idea to kick kswapd even when free memory is
> > enough.
> >
> > This depends on what kswapd is supposed to be doing.  I don't say we
> > should reclaim from all memcgs (i.e. globally) just because one memcg
> > hits its watermark, of course.
> >
> > But the argument was that we need the watermarks configurable to force
> > per-memcg reclaim even when the hard limits are overcommitted, because
> > global reclaim does not do a fair job to balance memcgs.
> 
> There seems to be some confusion here. The watermark we defined is
> per-memcg, and that is calculated
> based on the hard_limit. We need the per-memcg wmark the same reason of
> per-zone wmart which triggers
> the background reclaim before direct reclaim.

Of course, I am not arguing against the watermarks.  I am just
(violently) against making them configurable from userspace.

> There is a patch in my patchset which adds the tunable for both
> high/low_mark, which gives more flexibility to admin to config the host. In
> over-commit environment, we might never hit the wmark if all the wmarks are
> set internally.

And my point is that this should not be a problem at all!  If the
watermarks are not physically reachable, there is no reason to reclaim
on behalf of them.

In such an environment, global memory pressure arises before the
memcgs get close to their hard limit, and global memory pressure
reduction should do the right thing and equally push back all memcgs.

Flexibility in itself is not an argument.  On the contrary.  We commit
ourselves to that ABI and have to maintain this flexibility forever.
Instead, please find a convincing argument for the flexibility itself,
other than the need to work around the current global kswapd reclaim.

(I fixed up the following quotation, please be more careful when
replying, this makes it so hard to follow your emails.  thanks!)

> > My counter proposal is to fix global reclaim instead and apply equal
> > pressure on memcgs, such that we never have to tweak per-memcg watermarks
> > to achieve the same thing.
> 
> We still need this and that is the soft_limit reclaim under global
> background reclaim.

I don't understand what you mean by that.  Could you elaborate?

Thanks,

	Hannes


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-21  5:41       ` KAMEZAWA Hiroyuki
  2011-04-21  6:23         ` Ying Han
@ 2011-04-23  2:02         ` Johannes Weiner
  1 sibling, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2011-04-23  2:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, Apr 21, 2011 at 02:41:56PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 21 Apr 2011 07:08:51 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 21 Apr 2011 04:51:07 +0200
> > > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > 
> > > > > If the cgroup is configured to use per cgroup background reclaim, a kswapd
> > > > > thread is created which only scans the per-memcg LRU list.
> > > > 
> > > > We already have direct reclaim, direct reclaim on behalf of a memcg,
> > > > and global kswapd-reclaim.  Please don't add yet another reclaim path
> > > > that does its own thing and interacts unpredictably with the rest of
> > > > them.
> > > > 
> > > > As discussed on LSF, we want to get rid of the global LRU.  So the
> > > > goal is to have each reclaim entry end up at the same core part of
> > > > reclaim that round-robin scans a subset of zones from a subset of
> > > > memory control groups.
> > > 
> > > It's not related to this set. And I think even if we remove global LRU,
> > > global-kswapd and memcg-kswapd need to do independent work.
> > > 
> > > global-kswapd : works for zone/node balancing and making free pages,
> > >                 and compaction. select a memcg vicitm and ask it
> > >                 to reduce memory with regard to gfp_mask. Starts its work
> > >                 when zone/node is unbalanced.
> > 
> > For soft limit reclaim (which is triggered by global memory pressure),
> > we want to scan a group of memory cgroups equally in round robin
> > fashion.  I think at LSF we established that it is not fair to find
> > the one that exceeds its limit the most and hammer it until memory
> > pressure is resolved or there is another group with more excess.
> > 
> 
> Why do you guys like to make a mixture discussion of softlimit and
> high/low watermarks ?

I just tried to make the point that both have the same requirements
and argued that it would make sense to go in a direction that benefits
future work as well.

> > > > Which brings me to the next issue: making the watermarks configurable.
> > > > 
> > > > You argued that having them adjustable from userspace is required for
> > > > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > > > in in case of global memory pressure.  But that is only a problem
> > > > because global kswapd reclaim is (apart from soft limit reclaim)
> > > > unaware of memory control groups.
> > > > 
> > > > I think the much better solution is to make global kswapd memcg aware
> > > > (with the above mentioned round-robin reclaim scheduler), compared to
> > > > adding new (and final!) kernel ABI to avoid an internal shortcoming.
> > > 
> > > I don't think its a good idea to kick kswapd even when free memory is enough.
> > 
> > This depends on what kswapd is supposed to be doing.  I don't say we
> > should reclaim from all memcgs (i.e. globally) just because one memcg
> > hits its watermark, of course.
> > 
> > But the argument was that we need the watermarks configurable to force
> > per-memcg reclaim even when the hard limits are overcommitted, because
> > global reclaim does not do a fair job to balance memcgs.  
> 
> I cannot understand here. Why global reclaim need to do works other than
> balancing zones ? And what is balancing memcg ? Mentioning softlimit ?

By 'balancing memcgs' I mean equally distributing scan pressure
amongst them.  When global reclaim kicks in, it may reclaim much more
from one memcg than from another by accident.

I assume that the only reason for making watermarks configurable is
that global reclaim sucks and that you want to force watermark-based
reclaim even when overcommitting.  Maybe I should stop making this
assumption and ask you for a good explanation of why you want to make
watermarks configurable.

> Hi/Low watermark is a feature as it is. It is the 3rd way to limit memory
> usage. Compared with hard_limit and soft_limit, it works in a moderate way
> in the background and works regardless of global memory usage. I think it's
> valid to have interfaces for tuning this.
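
For reference, a minimal sketch of the Hi/Low watermark scheme under
discussion, assuming usage thresholds derived from the hard limit by the two
*_wmark_distance values; the names, types and exact comparisons below are
assumptions for illustration, not the posted code:

#include <stdio.h>

/* Illustrative names only; these are not the patch's identifiers. */
struct memcg_wmarks {
	unsigned long long limit;	/* hard limit, in bytes          */
	unsigned long long high_dist;	/* larger distance below limit   */
	unsigned long long low_dist;	/* smaller distance below limit  */
};

/* Assumed trigger: usage climbs to within low_dist of the hard limit. */
static int memcg_over_low_wmark(const struct memcg_wmarks *w,
				unsigned long long usage)
{
	return usage > w->limit - w->low_dist;
}

/* Assumed stop condition: usage is at least high_dist below the limit. */
static int memcg_watermark_ok(const struct memcg_wmarks *w,
			      unsigned long long usage)
{
	return usage <= w->limit - w->high_dist;
}

int main(void)
{
	struct memcg_wmarks w = {
		.limit     = 1024ULL << 20,	/* 1G hard limit */
		.high_dist =  100ULL << 20,
		.low_dist  =   64ULL << 20,
	};

	/* Background reclaim would kick in above 960M of usage ... */
	printf("%d\n", memcg_over_low_wmark(&w, 980ULL << 20));	/* 1 */
	/* ... and could stop once usage is back below 924M.      */
	printf("%d\n", memcg_watermark_ok(&w, 900ULL << 20));		/* 1 */
	return 0;
}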

Can you elaborate more on this?  I don't see your argument for it.

Thanks,

	Hannes


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-23  1:35         ` Johannes Weiner
@ 2011-04-23  2:10           ` Ying Han
  2011-04-23  2:34             ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-23  2:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm


On Fri, Apr 22, 2011 at 6:35 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, Apr 20, 2011 at 10:28:17PM -0700, Ying Han wrote:
> > On Wed, Apr 20, 2011 at 10:08 PM, Johannes Weiner <hannes@cmpxchg.org
> >wrote:
> > > On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > I don't think its a good idea to kick kswapd even when free memory is
> > > enough.
> > >
> > > This depends on what kswapd is supposed to be doing.  I don't say we
> > > should reclaim from all memcgs (i.e. globally) just because one memcg
> > > hits its watermark, of course.
> > >
> > > But the argument was that we need the watermarks configurable to force
> > > per-memcg reclaim even when the hard limits are overcommitted, because
> > > global reclaim does not do a fair job to balance memcgs.
> >
> > There seems to be some confusion here. The watermark we defined is
> > per-memcg, and that is calculated
> > based on the hard_limit. We need the per-memcg wmark the same reason of
> > per-zone wmart which triggers
> > the background reclaim before direct reclaim.
>
> Of course, I am not arguing against the watermarks.  I am just
> (violently) against making them configurable from userspace.
>
> > There is a patch in my patchset which adds the tunable for both
> > high/low_mark, which gives more flexibility to admin to config the host.
> In
> > over-commit environment, we might never hit the wmark if all the wmarks
> are
> > set internally.
>
> And my point is that this should not be a problem at all!  If the
> watermarks are not physically reachable, there is no reason to reclaim
> on behalf of them.
>
> In such an environment, global memory pressure arises before the
> memcgs get close to their hard limit, and global memory pressure
> reduction should do the right thing and equally push back all memcgs.
>
> Flexibility in itself is not an argument.  On the contrary.  We commit
> ourselves to that ABI and have to maintain this flexibility forever.
> Instead, please find a convincing argument for the flexibility itself,
> other than the need to workaround the current global kswapd reclaim.
>
Ok, I tend to agree with you now that the over-commit example I gave earlier
is a weak argument. We don't need to provide the ability to reclaim from a
memcg before it reaches its wmarks in an over-commit environment.

However, I still think there is a need for the admin to have some control
over which memcgs do background reclaim proactively (before global memory
pressure), and that was the initial logic behind the API.

I used to have a per-memcg wmark_ratio api which controlled both
high/low_wmark based on hard_limit, but the two APIs seem to give finer
granularity.

--Ying


> (I fixed up the following quotation, please be more careful when
> replying, this makes it so hard to follow your emails.  thanks!)
>
> > > My counter proposal is to fix global reclaim instead and apply equal
> > > pressure on memcgs, such that we never have to tweak per-memcg
> watermarks
> > > to achieve the same thing.
> >
> > We still need this and that is the soft_limit reclaim under global
> > background reclaim.
>
> I don't understand what you mean by that.  Could you elaborate?
>

Sorry, I think I misunderstood your earlier comment. What I pointed out here
was that we need both per-memcg background reclaim and global soft_limit
reclaim. I don't think we have any disagreement on that at this point.

>
> Thanks,
>
>        Hannes
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-23  2:10           ` Ying Han
@ 2011-04-23  2:34             ` Johannes Weiner
  2011-04-23  3:33               ` Ying Han
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2011-04-23  2:34 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

On Fri, Apr 22, 2011 at 07:10:25PM -0700, Ying Han wrote:
> On Fri, Apr 22, 2011 at 6:35 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Wed, Apr 20, 2011 at 10:28:17PM -0700, Ying Han wrote:
> > > On Wed, Apr 20, 2011 at 10:08 PM, Johannes Weiner <hannes@cmpxchg.org
> > >wrote:
> > > > On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > I don't think its a good idea to kick kswapd even when free memory is
> > > > enough.
> > > >
> > > > This depends on what kswapd is supposed to be doing.  I don't say we
> > > > should reclaim from all memcgs (i.e. globally) just because one memcg
> > > > hits its watermark, of course.
> > > >
> > > > But the argument was that we need the watermarks configurable to force
> > > > per-memcg reclaim even when the hard limits are overcommitted, because
> > > > global reclaim does not do a fair job to balance memcgs.
> > >
> > > There seems to be some confusion here. The watermark we defined is
> > > per-memcg, and that is calculated
> > > based on the hard_limit. We need the per-memcg wmark the same reason of
> > > per-zone wmart which triggers
> > > the background reclaim before direct reclaim.
> >
> > Of course, I am not arguing against the watermarks.  I am just
> > (violently) against making them configurable from userspace.
> >
> > > There is a patch in my patchset which adds the tunable for both
> > > high/low_mark, which gives more flexibility to admin to config the host.
> > In
> > > over-commit environment, we might never hit the wmark if all the wmarks
> > are
> > > set internally.
> >
> > And my point is that this should not be a problem at all!  If the
> > watermarks are not physically reachable, there is no reason to reclaim
> > on behalf of them.
> >
> > In such an environment, global memory pressure arises before the
> > memcgs get close to their hard limit, and global memory pressure
> > reduction should do the right thing and equally push back all memcgs.
> >
> > Flexibility in itself is not an argument.  On the contrary.  We commit
> > ourselves to that ABI and have to maintain this flexibility forever.
> > Instead, please find a convincing argument for the flexibility itself,
> > other than the need to workaround the current global kswapd reclaim.

[fixed following quotation]

> Ok, I tend to agree with you now that the over-commit example i gave
> early is a weak argument. We don't need to provide the ability to
> reclaim from a memcg before it is reaching its wmarks in over-commit
> environment.

Yep.  If it is impossible to reach the hard limit, it can't possibly
be a source of latency.

> However, i still think there is a need from the admin to have some controls
> of which memcg to do background reclaim proactively (before global memory
> pressure) and that was the initial logic behind the API.

That sounds more interesting.  Do you have a specific use case that
requires this?

min_free_kbytes more or less indirectly provides the same on a global
level, but I don't think anybody tunes it just for aggressiveness of
background reclaim.

> > (I fixed up the following quotation, please be more careful when
> > replying, this makes it so hard to follow your emails.  thanks!)

^^^^

> > > > My counter proposal is to fix global reclaim instead and apply
> > > > equal pressure on memcgs, such that we never have to tweak
> > > > per-memcg > > watermarks to achieve the same thing.
> > >
> > > We still need this and that is the soft_limit reclaim under global
> > > background reclaim.
> >
> > I don't understand what you mean by that.  Could you elaborate?
> 
> Sorry I think I misunderstood your early comment. What I pointed out here
> was that we need both per-memcg
> background reclaim and global soft_limit reclaim. I don't think we have
> disagreement on that at this point.

Ah, got you, thanks.

	Hannes


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-23  2:34             ` Johannes Weiner
@ 2011-04-23  3:33               ` Ying Han
  2011-04-23  3:41                 ` Rik van Riel
  2011-04-27  7:36                 ` Johannes Weiner
  0 siblings, 2 replies; 58+ messages in thread
From: Ying Han @ 2011-04-23  3:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm


On Fri, Apr 22, 2011 at 7:34 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Apr 22, 2011 at 07:10:25PM -0700, Ying Han wrote:
> > On Fri, Apr 22, 2011 at 6:35 PM, Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > > On Wed, Apr 20, 2011 at 10:28:17PM -0700, Ying Han wrote:
> > > > On Wed, Apr 20, 2011 at 10:08 PM, Johannes Weiner <
> hannes@cmpxchg.org
> > > >wrote:
> > > > > On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > I don't think its a good idea to kick kswapd even when free
> memory is
> > > > > enough.
> > > > >
> > > > > This depends on what kswapd is supposed to be doing.  I don't say
> we
> > > > > should reclaim from all memcgs (i.e. globally) just because one
> memcg
> > > > > hits its watermark, of course.
> > > > >
> > > > > But the argument was that we need the watermarks configurable to
> force
> > > > > per-memcg reclaim even when the hard limits are overcommitted,
> because
> > > > > global reclaim does not do a fair job to balance memcgs.
> > > >
> > > > There seems to be some confusion here. The watermark we defined is
> > > > per-memcg, and that is calculated
> > > > based on the hard_limit. We need the per-memcg wmark the same reason
> of
> > > > per-zone wmart which triggers
> > > > the background reclaim before direct reclaim.
> > >
> > > Of course, I am not arguing against the watermarks.  I am just
> > > (violently) against making them configurable from userspace.
> > >
> > > > There is a patch in my patchset which adds the tunable for both
> > > > high/low_mark, which gives more flexibility to admin to config the
> host.
> > > In
> > > > over-commit environment, we might never hit the wmark if all the
> wmarks
> > > are
> > > > set internally.
> > >
> > > And my point is that this should not be a problem at all!  If the
> > > watermarks are not physically reachable, there is no reason to reclaim
> > > on behalf of them.
> > >
> > > In such an environment, global memory pressure arises before the
> > > memcgs get close to their hard limit, and global memory pressure
> > > reduction should do the right thing and equally push back all memcgs.
> > >
> > > Flexibility in itself is not an argument.  On the contrary.  We commit
> > > ourselves to that ABI and have to maintain this flexibility forever.
> > > Instead, please find a convincing argument for the flexibility itself,
> > > other than the need to workaround the current global kswapd reclaim.
>
> [fixed following quotation]
>
> > Ok, I tend to agree with you now that the over-commit example i gave
> > early is a weak argument. We don't need to provide the ability to
> > reclaim from a memcg before it is reaching its wmarks in over-commit
> > environment.
>
> Yep.  If it is impossible to reach the hard limit, it can't possibly
> be a source of latency.
>
> > However, i still think there is a need from the admin to have some
> controls
> > of which memcg to do background reclaim proactively (before global memory
> > pressure) and that was the initial logic behind the API.
>
> That sounds more interesting.  Do you have a specific use case that
> requires this?
>
There might be more interesting use cases there, and here is one I can
think of:

Let's say we have three jobs A, B and C, and one host with 32G of RAM. We
configure each job's hard_limit as its peak memory usage.
A: 16G
B: 16G
C: 10G

1. We start running A with hard_limit 15G, and start running B with
hard_limit 15G.
2. We set A's and B's soft_limit based on their "hot" memory. Let's say we
set A's soft_limit to 10G and B's soft_limit to 10G.
(The soft_limit will change based on their runtime memory usage.)

If no other jobs are running on the system, A and B will easily fill up the
whole system with pagecache pages. Since we are not over-committing the
machine with their hard_limits, there will be no pressure to push their
memory usage down to the soft_limit.

Now we would like to launch another job C, since we know there is A(16G -
10G) + B(16G - 10G) = 12G of "cold" memory that can be reclaimed (w/o
impacting A's and B's performance). So what will happen:

1. We start running C on the host, which triggers global memory pressure
right away. If the reclaim is fast, C starts growing with the free pages
from A and B.

However, it is possible that the reclaim cannot catch up with the job's page
allocations. We end up with either an OOM condition or a performance spike
on one of the running jobs.

One way to improve this is to set a wmark on A and/or B so that they
proactively reclaim pages before we launch C. Global memory pressure won't
help much here since we won't trigger it.



> min_free_kbytes more or less indirectly provides the same on a global
> level, but I don't think anybody tunes it just for aggressiveness of
> background reclaim.
>

Hmm, we do scale that in the google workload. With large machines under lots
of memory pressure and heavy network traffic workloads, we would like to
reduce the likelihood of page alloc failure. But this is kind of different
from what we are talking about here.

--Ying



>
> > > (I fixed up the following quotation, please be more careful when
> > > replying, this makes it so hard to follow your emails.  thanks!)
>
> ^^^^
>
> > > > > My counter proposal is to fix global reclaim instead and apply
> > > > > equal pressure on memcgs, such that we never have to tweak
> > > > > per-memcg > > watermarks to achieve the same thing.
> > > >
> > > > We still need this and that is the soft_limit reclaim under global
> > > > background reclaim.
> > >
> > > I don't understand what you mean by that.  Could you elaborate?
> >
> > Sorry I think I misunderstood your early comment. What I pointed out here
> > was that we need both per-memcg
> > background reclaim and global soft_limit reclaim. I don't think we have
> > disagreement on that at this point.
>
> Ah, got you, thanks.
>
>        Hannes
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-23  3:33               ` Ying Han
@ 2011-04-23  3:41                 ` Rik van Riel
  2011-04-23  3:49                   ` Ying Han
  2011-04-27  7:36                 ` Johannes Weiner
  1 sibling, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2011-04-23  3:41 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On 04/22/2011 11:33 PM, Ying Han wrote:

> Now we would like to launch another job C, since we know there are A(16G
> - 10G) + B(16G - 10G)  = 12G "cold" memory can be reclaimed (w/o
> impacting the A and B's performance). So what will happen
>
> 1. start running C on the host, which triggers global memory pressure
> right away. If the reclaim is fast, C start growing with the free pages
> from A and B.
>
> However, it might be possible that the reclaim can not catch-up with the
> job's page allocation. We end up with either OOM condition or
> performance spike on any of the running jobs.
>
> One way to improve it is to set a wmark on either A/B to be proactively
> reclaiming pages before launching C. The global memory pressure won't
> help much here since we won't trigger that.
>
>     min_free_kbytes more or less indirectly provides the same on a global
>     level, but I don't think anybody tunes it just for aggressiveness of
>     background reclaim.

This sounds like yet another reason to have a tunable that
can increase the gap between min_free_kbytes and low_free_kbytes
(automatically scaled to size in every zone).

The realtime people want this to reduce allocation latencies.

I want it for dynamic virtual machine resizing, without the
memory fragmentation inherent in balloons (which would destroy
the performance benefit of transparent hugepages).

Now Google wants it for job placement.

Is there any good reason we can't have a low watermark
equivalent to min_free_kbytes? :)

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-23  3:41                 ` Rik van Riel
@ 2011-04-23  3:49                   ` Ying Han
  0 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2011-04-23  3:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Fri, Apr 22, 2011 at 8:41 PM, Rik van Riel <riel@redhat.com> wrote:

> On 04/22/2011 11:33 PM, Ying Han wrote:
>
>  Now we would like to launch another job C, since we know there are A(16G
>> - 10G) + B(16G - 10G)  = 12G "cold" memory can be reclaimed (w/o
>> impacting the A and B's performance). So what will happen
>>
>> 1. start running C on the host, which triggers global memory pressure
>> right away. If the reclaim is fast, C start growing with the free pages
>> from A and B.
>>
>> However, it might be possible that the reclaim can not catch-up with the
>> job's page allocation. We end up with either OOM condition or
>> performance spike on any of the running jobs.
>>
>> One way to improve it is to set a wmark on either A/B to be proactively
>> reclaiming pages before launching C. The global memory pressure won't
>> help much here since we won't trigger that.
>>
>>    min_free_kbytes more or less indirectly provides the same on a global
>>    level, but I don't think anybody tunes it just for aggressiveness of
>>    background reclaim.
>>
>
> This sounds like yet another reason to have a tunable that
> can increase the gap between min_free_kbytes and low_free_kbytes
> (automatically scaled to size in every zone).
>
> The realtime people want this to reduce allocation latencies.
>
> I want it for dynamic virtual machine resizing, without the
> memory fragmentation inherent in balloons (which would destroy
> the performance benefit of transparent hugepages).
>
> Now Google wants it for job placement.
>

To clarify a bit, we scale min_free_kbytes to reduce the likelihood of
page allocation failure. This is still about the global per-zone page
allocation, and is different from the memcg discussion we have in this
thread. To be more specific, our case is more or less caused by the 128M
fake node size.

Anyway, this is different from what has been discussed so far in this
thread. :)

--Ying

>
> Is there any good reason we can't have a low watermark
> equivalent to min_free_kbytes? :)
>
> --
> All rights reversed
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-23  3:33               ` Ying Han
  2011-04-23  3:41                 ` Rik van Riel
@ 2011-04-27  7:36                 ` Johannes Weiner
  2011-04-27 17:41                   ` Ying Han
  1 sibling, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2011-04-27  7:36 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

On Fri, Apr 22, 2011 at 08:33:58PM -0700, Ying Han wrote:
> On Fri, Apr 22, 2011 at 7:34 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Fri, Apr 22, 2011 at 07:10:25PM -0700, Ying Han wrote: >
> > However, i still think there is a need from the admin to have some
> > controls > of which memcg to do background reclaim proactively
> > (before global memory > pressure) and that was the initial logic
> > behind the API.
> >
> > That sounds more interesting.  Do you have a specific use case
> > that requires this?
> 
> There might be more interesting use cases there, and here is one I
> can think of:
> 
> let's say we three jobs A, B and C, and one host with 32G of RAM. We
> configure each job's hard_limit as their peak memory usage.
> A: 16G
> B: 16G
> C: 10G
> 
> 1. we start running A with hard_limit 15G, and start running B with
> hard_limit 15G.
> 2. we set A and B's soft_limit based on their "hot" memory. Let's say
> setting A's soft_limit 10G and B's soft_limit 10G.
> (The soft_limit will be changing based on their runtime memory usage)
> 
> If no more jobs running on the system, A and B will easily fill up the whole
> system with pagecache pages. Since we are not over-committing the machine
> with their hard_limit, there will be no pressure to push their memory usage
> down to soft_limit.
> 
> Now we would like to launch another job C, since we know there are A(16G -
> 10G) + B(16G - 10G)  = 12G "cold" memory can be reclaimed (w/o impacting the
> A and B's performance). So what will happen
> 
> 1. start running C on the host, which triggers global memory pressure right
> away. If the reclaim is fast, C start growing with the free pages from A and
> B.
> 
> However, it might be possible that the reclaim can not catch-up with the
> job's page allocation. We end up with either OOM condition or performance
> spike on any of the running jobs.

If background reclaim can not catch up, C will go into direct reclaim,
which will have exactly the same effect, only that C will have to do
the work itself.

> One way to improve it is to set a wmark on either A/B to be proactively
> reclaiming pages before launching C. The global memory pressure won't help
> much here since we won't trigger that.

Ok, so you want to use the watermarks to push back and limit the usage
of A and B to make room for C.  Isn't this exactly what the hard limit
is for?

I don't understand the second sentence: global memory pressure won't
kick in with only A and B, but it will once C starts up.

> > min_free_kbytes more or less indirectly provides the same on a global
> > level, but I don't think anybody tunes it just for aggressiveness of
> > background reclaim.
> >
> 
> Hmm, we do scale that in google workload. With large machines under lots of
> memory pressure and heavily network traffic workload, we would like to
> reduce the likelyhood of page alloc failure. But this is kind of different
> from what we are talking about here.

My point indeed ;-)

	Hannes


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-27  7:36                 ` Johannes Weiner
@ 2011-04-27 17:41                   ` Ying Han
  2011-04-27 21:37                     ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Ying Han @ 2011-04-27 17:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm


On Wed, Apr 27, 2011 at 12:36 AM, Johannes Weiner <hannes@cmpxchg.org>wrote:

> On Fri, Apr 22, 2011 at 08:33:58PM -0700, Ying Han wrote:
> > On Fri, Apr 22, 2011 at 7:34 PM, Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > > On Fri, Apr 22, 2011 at 07:10:25PM -0700, Ying Han wrote: >
> > > However, i still think there is a need from the admin to have some
> > > controls > of which memcg to do background reclaim proactively
> > > (before global memory > pressure) and that was the initial logic
> > > behind the API.
> > >
> > > That sounds more interesting.  Do you have a specific use case
> > > that requires this?
> >
> > There might be more interesting use cases there, and here is one I
> > can think of:
> >
> > let's say we three jobs A, B and C, and one host with 32G of RAM. We
> > configure each job's hard_limit as their peak memory usage.
> > A: 16G
> > B: 16G
> > C: 10G
> >
> > 1. we start running A with hard_limit 15G, and start running B with
> > hard_limit 15G.
> > 2. we set A and B's soft_limit based on their "hot" memory. Let's say
> > setting A's soft_limit 10G and B's soft_limit 10G.
> > (The soft_limit will be changing based on their runtime memory usage)
> >
> > If no more jobs running on the system, A and B will easily fill up the
> whole
> > system with pagecache pages. Since we are not over-committing the machine
> > with their hard_limit, there will be no pressure to push their memory
> usage
> > down to soft_limit.
> >
> > Now we would like to launch another job C, since we know there are A(16G
> -
> > 10G) + B(16G - 10G)  = 12G "cold" memory can be reclaimed (w/o impacting
> the
> > A and B's performance). So what will happen
> >
> > 1. start running C on the host, which triggers global memory pressure
> right
> > away. If the reclaim is fast, C start growing with the free pages from A
> and
> > B.
> >
> > However, it might be possible that the reclaim can not catch-up with the
> > job's page allocation. We end up with either OOM condition or performance
> > spike on any of the running jobs.
>
> If background reclaim can not catch up, C will go into direct reclaim,
> which will have exactly the same effect, only that C will have to do
> the work itself.
>
> > One way to improve it is to set a wmark on either A/B to be proactively
> > reclaiming pages before launching C. The global memory pressure won't
> help
> > much here since we won't trigger that.
>
> Ok, so you want to use the watermarks to push back and limit the usage
> of A and B to make room for C.  Isn't this exactly what the hard limit
> is for?
>
Similar, but not exactly the same. There is no need to hard cap the memory
usage for A and B in that case. What we need is some period of time during
which A and B slowly reclaim pages and leave some room to launch C smoothly.


> I don't understand the second sentence: global memory pressure won't
> kick in with only A and B, but it will once C starts up.
>
In my example, the hard_limit of A+B is less than the machine capacity, and
after we have per-memcg bg reclaim, ideally we won't trigger global reclaim
much.

But when we launch C, we end up over-committing the machine, so global
reclaim will fire up quickly.

Anyway, I agree that fewer newly invented kernel APIs is good. If you check
the latest V8 post from Kame, we reduced the two APIs to one: we only allow
setting the high_wmark_distance, and the low_wmark_distance is set
internally. I think this is good enough.

--Ying



> > > min_free_kbytes more or less indirectly provides the same on a global
> > > level, but I don't think anybody tunes it just for aggressiveness of
> > > background reclaim.
> > >
> >
> > Hmm, we do scale that in google workload. With large machines under lots
> of
> > memory pressure and heavily network traffic workload, we would like to
> > reduce the likelyhood of page alloc failure. But this is kind of
> different
> > from what we are talking about here.
>
> My point indeed ;-)
>
>        Hannes
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
  2011-04-27 17:41                   ` Ying Han
@ 2011-04-27 21:37                     ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2011-04-27 21:37 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai, linux-mm

On Wed, Apr 27, 2011 at 10:41:47AM -0700, Ying Han wrote:
> On Wed, Apr 27, 2011 at 12:36 AM, Johannes Weiner <hannes@cmpxchg.org>wrote:
> 
> > On Fri, Apr 22, 2011 at 08:33:58PM -0700, Ying Han wrote:
> > > On Fri, Apr 22, 2011 at 7:34 PM, Johannes Weiner <hannes@cmpxchg.org>
> > wrote:
> > >
> > > > On Fri, Apr 22, 2011 at 07:10:25PM -0700, Ying Han wrote: >
> > > > However, i still think there is a need from the admin to have some
> > > > controls > of which memcg to do background reclaim proactively
> > > > (before global memory > pressure) and that was the initial logic
> > > > behind the API.
> > > >
> > > > That sounds more interesting.  Do you have a specific use case
> > > > that requires this?
> > >
> > > There might be more interesting use cases there, and here is one I
> > > can think of:
> > >
> > > let's say we three jobs A, B and C, and one host with 32G of RAM. We
> > > configure each job's hard_limit as their peak memory usage.
> > > A: 16G
> > > B: 16G
> > > C: 10G
> > >
> > > 1. we start running A with hard_limit 15G, and start running B with
> > > hard_limit 15G.
> > > 2. we set A and B's soft_limit based on their "hot" memory. Let's say
> > > setting A's soft_limit 10G and B's soft_limit 10G.
> > > (The soft_limit will be changing based on their runtime memory usage)
> > >
> > > If no more jobs running on the system, A and B will easily fill up the
> > whole
> > > system with pagecache pages. Since we are not over-committing the machine
> > > with their hard_limit, there will be no pressure to push their memory
> > usage
> > > down to soft_limit.
> > >
> > > Now we would like to launch another job C, since we know there are A(16G
> > -
> > > 10G) + B(16G - 10G)  = 12G "cold" memory can be reclaimed (w/o impacting
> > the
> > > A and B's performance). So what will happen
> > >
> > > 1. start running C on the host, which triggers global memory pressure
> > right
> > > away. If the reclaim is fast, C start growing with the free pages from A
> > and
> > > B.
> > >
> > > However, it might be possible that the reclaim can not catch-up with the
> > > job's page allocation. We end up with either OOM condition or performance
> > > spike on any of the running jobs.
> >
> > If background reclaim can not catch up, C will go into direct reclaim,
> > which will have exactly the same effect, only that C will have to do
> > the work itself.
> >
> > > One way to improve it is to set a wmark on either A/B to be proactively
> > > reclaiming pages before launching C. The global memory pressure won't
> > help
> > > much here since we won't trigger that.
> >
> > Ok, so you want to use the watermarks to push back and limit the usage
> > of A and B to make room for C.  Isn't this exactly what the hard limit
> > is for?
> 
> similar, but not exactly the same. there is no need to hard cap the memory
> usage for A and B in that case.
> what we need is to have some period of time that A and B slowly reclaim
> pages and leaves some room to
> launch C smoothly.

I think we are going in circles now.

Since starting with C the machine is overcommitted, the problem is no
longer memcg-internal latency but latency of global memory scarcity.

My suggestion to that was, and still is, to fix global background
reclaim, which should apply pressure equally to all memcgs until the
_global_ watermarks are met again.

This would do the right thing for this case: C starts up, the global
watermark is breached sooner or later and background reclaim will push
back A and B, hopefully before anyone has to go into direct
reclaim. ('Hopefully' because the allocations may still happen faster
than background reclaim can keep up freeing pages.  But this applies
to your scenario as well.)

I think this should work out of the box, without tweaking obscure
knobs from userspace.

Anyway, at this point I can only repeat myself, so I will shut up now.

	Hannes


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH V6 06/10] Per-memcg background reclaim.
  2011-04-19  3:57 ` [PATCH V6 06/10] Per-memcg background reclaim Ying Han
  2011-04-20  1:03   ` KAMEZAWA Hiroyuki
@ 2012-03-19  8:14   ` Zhu Yanhai
  2012-03-20  5:37     ` Ying Han
  1 sibling, 1 reply; 58+ messages in thread
From: Zhu Yanhai @ 2012-03-19  8:14 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, linux-mm

2011/4/19 Ying Han <yinghan@google.com>:
> This is the main loop of per-memcg background reclaim which is implemented in
> function balance_mem_cgroup_pgdat().
>
> The function performs a priority loop similar to global reclaim. During each
> iteration it invokes balance_pgdat_node() for all nodes on the system, which
> is another new function that performs background reclaim per node. After reclaiming
> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
> it returns true.
>
> changelog v6..v5:
> 1. add mem_cgroup_zone_reclaimable_pages()
> 2. fix some comment style.
>
> changelog v5..v4:
> 1. remove duplicate check on nodes_empty()
> 2. add logic to check if the per-memcg lru is empty on the zone.
>
> changelog v4..v3:
> 1. split the select_victim_node and zone_unreclaimable changes into separate patches
> 2. remove the logic that tries to do zone balancing.
>
> changelog v3..v2:
> 1. change mz->all_unreclaimable to be boolean.
> 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
> 3. some more clean-up.
>
> changelog v2..v1:
> 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> reclaim.
> 3. name the per-memcg kswapd as "memcg-id" (css->id). The global kswapd
> keeps the same name.
> 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
> after freeing.
> 5. add fairness to the zonelist, where the memcg remembers the last zone
> reclaimed from.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h |    9 +++
>  mm/memcontrol.c            |   18 +++++
>  mm/vmscan.c                |  151 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 178 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d4ff7f2..a4747b0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  */
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +                                                 struct zone *zone);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                       struct zone *zone,
>                                       enum lru_list lru);
> @@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>  }
>
>  static inline unsigned long
> +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +                                   struct zone *zone)
> +{
> +       return 0;
> +}
> +
> +static inline unsigned long
>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>                         enum lru_list lru)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 06fddd2..7490147 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1097,6 +1097,24 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>        return (active > inactive);
>  }
>
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +                                               struct zone *zone)
> +{
> +       int nr;
> +       int nid = zone_to_nid(zone);
> +       int zid = zone_idx(zone);
> +       struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> +       nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> +            MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> +
> +       if (nr_swap_pages > 0)

Do we also need to check memcg->memsw_is_minimum here? That's to say,
       if (nr_swap_pages > 0 && !memcg->memsw_is_minimum)
                        .....
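
With that extra condition folded in, the hunk might look roughly like the
sketch below (just a sketch of the suggested change, assuming the memcg's
memsw_is_minimum flag is reachable from mem_cgroup_zone_reclaimable_pages()):

	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
	     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);

	/*
	 * Count anon pages only when there is swap space left and the
	 * memsw limit does not already prevent swapping them out
	 * (the suggested extra condition).
	 */
	if (nr_swap_pages > 0 && !memcg->memsw_is_minimum)
		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
		      MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);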
--
Thanks,
Zhu Yanhai

> +               nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> +                     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> +
> +       return nr;
> +}
> +
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                       struct zone *zone,
>                                       enum lru_list lru)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0060d1e..2a5c734 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -47,6 +47,8 @@
>
>  #include <linux/swapops.h>
>
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>
>  #define CREATE_TRACE_POINTS
> @@ -111,6 +113,8 @@ struct scan_control {
>         * are scanned.
>         */
>        nodemask_t      *nodemask;
> +
> +       int priority;
>  };
>
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> @@ -2625,11 +2629,158 @@ out:
>        finish_wait(wait_h, &wait);
>  }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scans all the zones of the
> + * node and returns the nr_scanned and nr_reclaimed.
> + */
> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> +                                       struct scan_control *sc)
> +{
> +       int i;
> +       unsigned long total_scanned = 0;
> +       struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +       int priority = sc->priority;
> +
> +       /*
> +        * This dma->highmem order is consistent with global reclaim.
> +        * We do this because the page allocator works in the opposite
> +        * direction although memcg user pages are mostly allocated at
> +        * highmem.
> +        */
> +       for (i = 0; i < pgdat->nr_zones; i++) {
> +               struct zone *zone = pgdat->node_zones + i;
> +               unsigned long scan = 0;
> +
> +               scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
> +               if (!scan)
> +                       continue;
> +
> +               sc->nr_scanned = 0;
> +               shrink_zone(priority, zone, sc);
> +               total_scanned += sc->nr_scanned;
> +
> +               /*
> +                * If we've done a decent amount of scanning and
> +                * the reclaim ratio is low, start doing writepage
> +                * even in laptop mode
> +                */
> +               if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +                   total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> +                       sc->may_writepage = 1;
> +               }
> +       }
> +
> +       sc->nr_scanned = total_scanned;
> +}
> +
> +/*
> + * Per cgroup background reclaim.
> + * TODO: Take off the order since memcg always do order 0
> + */
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +                                             int order)
> +{
> +       int i, nid;
> +       int start_node;
> +       int priority;
> +       bool wmark_ok;
> +       int loop;
> +       pg_data_t *pgdat;
> +       nodemask_t do_nodes;
> +       unsigned long total_scanned;
> +       struct scan_control sc = {
> +               .gfp_mask = GFP_KERNEL,
> +               .may_unmap = 1,
> +               .may_swap = 1,
> +               .nr_to_reclaim = SWAP_CLUSTER_MAX,
> +               .swappiness = vm_swappiness,
> +               .order = order,
> +               .mem_cgroup = mem_cont,
> +       };
> +
> +loop_again:
> +       do_nodes = NODE_MASK_NONE;
> +       sc.may_writepage = !laptop_mode;
> +       sc.nr_reclaimed = 0;
> +       total_scanned = 0;
> +
> +       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +               sc.priority = priority;
> +               wmark_ok = false;
> +               loop = 0;
> +
> +               /* The swap token gets in the way of swapout... */
> +               if (!priority)
> +                       disable_swap_token();
> +
> +               if (priority == DEF_PRIORITY)
> +                       do_nodes = node_states[N_ONLINE];
> +
> +               while (1) {
> +                       nid = mem_cgroup_select_victim_node(mem_cont,
> +                                                       &do_nodes);
> +
> +                       /*
> +                        * Indicate we have cycled the nodelist once
> +                        * TODO: we might add MAX_RECLAIM_LOOP for preventing
> +                        * kswapd burning cpu cycles.
> +                        */
> +                       if (loop == 0) {
> +                               start_node = nid;
> +                               loop++;
> +                       } else if (nid == start_node)
> +                               break;
> +
> +                       pgdat = NODE_DATA(nid);
> +                       balance_pgdat_node(pgdat, order, &sc);
> +                       total_scanned += sc.nr_scanned;
> +
> +                       for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +                               struct zone *zone = pgdat->node_zones + i;
> +
> +                               if (!populated_zone(zone))
> +                                       continue;
> +                       }
> +                       if (i < 0)
> +                               node_clear(nid, do_nodes);
> +
> +                       if (mem_cgroup_watermark_ok(mem_cont,
> +                                                       CHARGE_WMARK_HIGH)) {
> +                               wmark_ok = true;
> +                               goto out;
> +                       }
> +
> +                       if (nodes_empty(do_nodes)) {
> +                               wmark_ok = true;
> +                               goto out;
> +                       }
> +               }
> +
> +               if (total_scanned && priority < DEF_PRIORITY - 2)
> +                       congestion_wait(WRITE, HZ/10);
> +
> +               if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> +                       break;
> +       }
> +out:
> +       if (!wmark_ok) {
> +               cond_resched();
> +
> +               try_to_freeze();
> +
> +               goto loop_again;
> +       }
> +
> +       return sc.nr_reclaimed;
> +}
> +#else
>  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
>                                                        int order)
>  {
>        return 0;
>  }
> +#endif
>
>  /*
>  * The background pageout daemon, started as a kernel thread
> --
> 1.7.3.1
>


* Re: [PATCH V6 06/10] Per-memcg background reclaim.
  2012-03-19  8:14   ` Zhu Yanhai
@ 2012-03-20  5:37     ` Ying Han
  0 siblings, 0 replies; 58+ messages in thread
From: Ying Han @ 2012-03-20  5:37 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, linux-mm

On Mon, Mar 19, 2012 at 1:14 AM, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
> 2011/4/19 Ying Han <yinghan@google.com>:
>> This is the main loop of per-memcg background reclaim which is implemented in
>> function balance_mem_cgroup_pgdat().
>>
>> The function performs a priority loop similar to global reclaim. During each
>> iteration it invokes balance_pgdat_node() for all nodes on the system, which
>> is another new function that performs background reclaim per node. After reclaiming
>> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
>> it returns true.
>>
>> changelog v6..v5:
>> 1. add mem_cgroup_zone_reclaimable_pages()
>> 2. fix some comment style.
>>
>> changelog v5..v4:
>> 1. remove duplicate check on nodes_empty()
>> 2. add logic to check if the per-memcg lru is empty on the zone.
>>
>> changelog v4..v3:
>> 1. split the select_victim_node and zone_unreclaimable changes into separate patches
>> 2. remove the logic that tries to do zone balancing.
>>
>> changelog v3..v2:
>> 1. change mz->all_unreclaimable to be boolean.
>> 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
>> 3. some more clean-up.
>>
>> changelog v2..v1:
>> 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
>> 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
>> reclaim.
>> 3. name the per-memcg kswapd as "memcg-id" (css->id). The global kswapd
>> keeps the same name.
>> 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
>> after freeing.
>> 5. add fairness to the zonelist, where the memcg remembers the last zone
>> reclaimed from.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  include/linux/memcontrol.h |    9 +++
>>  mm/memcontrol.c            |   18 +++++
>>  mm/vmscan.c                |  151 ++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 178 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index d4ff7f2..a4747b0 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -115,6 +115,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>>  */
>>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
>> +                                                 struct zone *zone);
>>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>>                                       struct zone *zone,
>>                                       enum lru_list lru);
>> @@ -311,6 +313,13 @@ mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>>  }
>>
>>  static inline unsigned long
>> +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
>> +                                   struct zone *zone)
>> +{
>> +       return 0;
>> +}
>> +
>> +static inline unsigned long
>>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>>                         enum lru_list lru)
>>  {
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 06fddd2..7490147 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1097,6 +1097,24 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>>        return (active > inactive);
>>  }
>>
>> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
>> +                                               struct zone *zone)
>> +{
>> +       int nr;
>> +       int nid = zone_to_nid(zone);
>> +       int zid = zone_idx(zone);
>> +       struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
>> +
>> +       nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
>> +            MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
>> +
>> +       if (nr_swap_pages > 0)
>
> Do we also need to check memcg->memsw_is_minimum here? That's to say,
>       if (nr_swap_pages > 0 && !memcg->memsw_is_minimum)
>                        .....

That sounds about right. Given that swapon isn't common in our test
environment, I am not surprised that condition was missed at the time.

--Ying

> --
> Thanks,
> Zhu Yanhai
>
>> +               nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
>> +                     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
>> +
>> +       return nr;
>> +}
>> +
>>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>>                                       struct zone *zone,
>>                                       enum lru_list lru)
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 0060d1e..2a5c734 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -47,6 +47,8 @@
>>
>>  #include <linux/swapops.h>
>>
>> +#include <linux/res_counter.h>
>> +
>>  #include "internal.h"
>>
>>  #define CREATE_TRACE_POINTS
>> @@ -111,6 +113,8 @@ struct scan_control {
>>         * are scanned.
>>         */
>>        nodemask_t      *nodemask;
>> +
>> +       int priority;
>>  };
>>
>>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
>> @@ -2625,11 +2629,158 @@ out:
>>        finish_wait(wait_h, &wait);
>>  }
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> +/*
>> + * The function is used for per-memcg LRU. It scans all the zones of the
>> + * node and returns the nr_scanned and nr_reclaimed.
>> + */
>> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
>> +                                       struct scan_control *sc)
>> +{
>> +       int i;
>> +       unsigned long total_scanned = 0;
>> +       struct mem_cgroup *mem_cont = sc->mem_cgroup;
>> +       int priority = sc->priority;
>> +
>> +       /*
>> +        * This dma->highmem order is consistent with global reclaim.
>> +        * We do this because the page allocator works in the opposite
>> +        * direction although memcg user pages are mostly allocated at
>> +        * highmem.
>> +        */
>> +       for (i = 0; i < pgdat->nr_zones; i++) {
>> +               struct zone *zone = pgdat->node_zones + i;
>> +               unsigned long scan = 0;
>> +
>> +               scan = mem_cgroup_zone_reclaimable_pages(mem_cont, zone);
>> +               if (!scan)
>> +                       continue;
>> +
>> +               sc->nr_scanned = 0;
>> +               shrink_zone(priority, zone, sc);
>> +               total_scanned += sc->nr_scanned;
>> +
>> +               /*
>> +                * If we've done a decent amount of scanning and
>> +                * the reclaim ratio is low, start doing writepage
>> +                * even in laptop mode
>> +                */
>> +               if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>> +                   total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
>> +                       sc->may_writepage = 1;
>> +               }
>> +       }
>> +
>> +       sc->nr_scanned = total_scanned;
>> +}
>> +
>> +/*
>> + * Per cgroup background reclaim.
>> + * TODO: Take off the order since memcg always do order 0
>> + */
>> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
>> +                                             int order)
>> +{
>> +       int i, nid;
>> +       int start_node;
>> +       int priority;
>> +       bool wmark_ok;
>> +       int loop;
>> +       pg_data_t *pgdat;
>> +       nodemask_t do_nodes;
>> +       unsigned long total_scanned;
>> +       struct scan_control sc = {
>> +               .gfp_mask = GFP_KERNEL,
>> +               .may_unmap = 1,
>> +               .may_swap = 1,
>> +               .nr_to_reclaim = SWAP_CLUSTER_MAX,
>> +               .swappiness = vm_swappiness,
>> +               .order = order,
>> +               .mem_cgroup = mem_cont,
>> +       };
>> +
>> +loop_again:
>> +       do_nodes = NODE_MASK_NONE;
>> +       sc.may_writepage = !laptop_mode;
>> +       sc.nr_reclaimed = 0;
>> +       total_scanned = 0;
>> +
>> +       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>> +               sc.priority = priority;
>> +               wmark_ok = false;
>> +               loop = 0;
>> +
>> +               /* The swap token gets in the way of swapout... */
>> +               if (!priority)
>> +                       disable_swap_token();
>> +
>> +               if (priority == DEF_PRIORITY)
>> +                       do_nodes = node_states[N_ONLINE];
>> +
>> +               while (1) {
>> +                       nid = mem_cgroup_select_victim_node(mem_cont,
>> +                                                       &do_nodes);
>> +
>> +                       /*
>> +                        * Indicate we have cycled the nodelist once
>> +                        * TODO: we might add MAX_RECLAIM_LOOP for preventing
>> +                        * kswapd burning cpu cycles.
>> +                        */
>> +                       if (loop == 0) {
>> +                               start_node = nid;
>> +                               loop++;
>> +                       } else if (nid == start_node)
>> +                               break;
>> +
>> +                       pgdat = NODE_DATA(nid);
>> +                       balance_pgdat_node(pgdat, order, &sc);
>> +                       total_scanned += sc.nr_scanned;
>> +
>> +                       for (i = pgdat->nr_zones - 1; i >= 0; i--) {
>> +                               struct zone *zone = pgdat->node_zones + i;
>> +
>> +                               if (!populated_zone(zone))
>> +                                       continue;
>> +                       }
>> +                       if (i < 0)
>> +                               node_clear(nid, do_nodes);
>> +
>> +                       if (mem_cgroup_watermark_ok(mem_cont,
>> +                                                       CHARGE_WMARK_HIGH)) {
>> +                               wmark_ok = true;
>> +                               goto out;
>> +                       }
>> +
>> +                       if (nodes_empty(do_nodes)) {
>> +                               wmark_ok = true;
>> +                               goto out;
>> +                       }
>> +               }
>> +
>> +               if (total_scanned && priority < DEF_PRIORITY - 2)
>> +                       congestion_wait(WRITE, HZ/10);
>> +
>> +               if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
>> +                       break;
>> +       }
>> +out:
>> +       if (!wmark_ok) {
>> +               cond_resched();
>> +
>> +               try_to_freeze();
>> +
>> +               goto loop_again;
>> +       }
>> +
>> +       return sc.nr_reclaimed;
>> +}
>> +#else
>>  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
>>                                                        int order)
>>  {
>>        return 0;
>>  }
>> +#endif
>>
>>  /*
>>  * The background pageout daemon, started as a kernel thread
>> --
>> 1.7.3.1
>>


end of thread, other threads:[~2012-03-20  5:37 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-19  3:57 [PATCH V6 00/10] memcg: per cgroup background reclaim Ying Han
2011-04-19  3:57 ` [PATCH V6 01/10] Add kswapd descriptor Ying Han
2011-04-19  3:57 ` [PATCH V6 02/10] Add per memcg reclaim watermarks Ying Han
2011-04-19  3:57 ` [PATCH V6 03/10] New APIs to adjust per-memcg wmarks Ying Han
2011-04-19  3:57 ` [PATCH V6 04/10] Infrastructure to support per-memcg reclaim Ying Han
2011-04-19  3:57 ` [PATCH V6 05/10] Implement the select_victim_node within memcg Ying Han
2011-04-19  3:57 ` [PATCH V6 06/10] Per-memcg background reclaim Ying Han
2011-04-20  1:03   ` KAMEZAWA Hiroyuki
2011-04-20  3:25     ` Ying Han
2011-04-20  4:20     ` Ying Han
2012-03-19  8:14   ` Zhu Yanhai
2012-03-20  5:37     ` Ying Han
2011-04-19  3:57 ` [PATCH V6 07/10] Add per-memcg zone "unreclaimable" Ying Han
2011-04-19  3:57 ` [PATCH V6 08/10] Enable per-memcg background reclaim Ying Han
2011-04-19  3:57 ` [PATCH V6 09/10] Add API to export per-memcg kswapd pid Ying Han
2011-04-20  1:15   ` KAMEZAWA Hiroyuki
2011-04-20  3:39     ` Ying Han
2011-04-19  3:57 ` [PATCH V6 10/10] Add some per-memcg stats Ying Han
2011-04-21  2:51 ` [PATCH V6 00/10] memcg: per cgroup background reclaim Johannes Weiner
2011-04-21  3:05   ` Ying Han
2011-04-21  3:53     ` Johannes Weiner
2011-04-21  4:00   ` KAMEZAWA Hiroyuki
2011-04-21  4:24     ` Ying Han
2011-04-21  4:46       ` KAMEZAWA Hiroyuki
2011-04-21  5:08     ` Johannes Weiner
2011-04-21  5:28       ` Ying Han
2011-04-23  1:35         ` Johannes Weiner
2011-04-23  2:10           ` Ying Han
2011-04-23  2:34             ` Johannes Weiner
2011-04-23  3:33               ` Ying Han
2011-04-23  3:41                 ` Rik van Riel
2011-04-23  3:49                   ` Ying Han
2011-04-27  7:36                 ` Johannes Weiner
2011-04-27 17:41                   ` Ying Han
2011-04-27 21:37                     ` Johannes Weiner
2011-04-21  5:41       ` KAMEZAWA Hiroyuki
2011-04-21  6:23         ` Ying Han
2011-04-23  2:02         ` Johannes Weiner
2011-04-21  3:40 ` KAMEZAWA Hiroyuki
2011-04-21  3:48   ` [PATCH 2/3] weight for memcg background reclaim (Was " KAMEZAWA Hiroyuki
2011-04-21  6:11     ` Ying Han
2011-04-21  6:38       ` KAMEZAWA Hiroyuki
2011-04-21  6:59         ` Ying Han
2011-04-21  7:01           ` KAMEZAWA Hiroyuki
2011-04-21  7:12             ` Ying Han
2011-04-21  3:50   ` [PATCH 3/3/] fix mem_cgroup_watemark_ok " KAMEZAWA Hiroyuki
2011-04-21  5:29     ` Ying Han
2011-04-21  4:22   ` Ying Han
2011-04-21  4:27     ` KAMEZAWA Hiroyuki
2011-04-21  4:31     ` Ying Han
2011-04-21  3:43 ` [PATCH 1/3] memcg kswapd thread pool (Was " KAMEZAWA Hiroyuki
2011-04-21  7:09   ` Ying Han
2011-04-21  7:14     ` KAMEZAWA Hiroyuki
2011-04-21  8:10   ` Minchan Kim
2011-04-21  8:46     ` KAMEZAWA Hiroyuki
2011-04-21  9:05       ` Minchan Kim
2011-04-21 16:56         ` Ying Han
2011-04-22  1:02           ` Minchan Kim
