* [PATCH V3 0/7] memcg: per cgroup background reclaim
@ 2011-04-13  7:03 Ying Han
  2011-04-13  7:03 ` [PATCH V3 1/7] Add kswapd descriptor Ying Han
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Ying Han @ 2011-04-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen
  Cc: linux-mm

The current implementation of memcg supports targeted reclaim only when a
cgroup is reaching its hard_limit, and we do direct reclaim per cgroup.
Per-cgroup background reclaim is needed to help spread the memory pressure
over a longer period of time and smooth out the cgroup's performance.

If a cgroup is configured to use per-cgroup background reclaim, a kswapd
thread is created which scans only that memcg's LRU lists. Two watermarks
("low_wmark", "high_wmark") are added to trigger the background reclaim and
to stop it. The watermarks are calculated from the cgroup's limit_in_bytes.

I ran a dd test writing a large file and then cat'd the file back. Then I
compared the reclaim-related stats in memory.stat.

Step 1: Create a cgroup with a 500M memory limit.
$ mkdir /dev/cgroup/memory/A
$ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/A/tasks

Step 2: Check and set the wmarks.
$ cat /dev/cgroup/memory/A/memory.wmark_ratio
0

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 524288000
high_wmark 524288000

$ echo 90 >/dev/cgroup/memory/A/memory.wmark_ratio

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 471859200
high_wmark 470016000

$ ps -ef | grep memcg
root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg
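
For reference, the low/high values printed above follow the calculation that
setup_per_memcg_wmarks() uses in patch 2/7. A minimal sketch of the arithmetic
(not part of the series, shown here only to tie the numbers together):

	/*
	 * Sketch only: with limit = 524288000 (500M) and wmark_ratio = 90,
	 *   low_wmark  = 524288000 * 90 / 100         = 471859200
	 *   high_wmark = low_wmark - (low_wmark >> 8) = 470016000
	 * which matches the memory.reclaim_wmarks output above.
	 */
	static void wmarks_from_ratio(unsigned long long limit,
				      unsigned long wmark_ratio,
				      unsigned long long *low,
				      unsigned long long *high)
	{
		unsigned long long tmp = (wmark_ratio * limit) / 100;

		*low  = tmp;			/* reclaim starts above this */
		*high = tmp - (tmp >> 8);	/* reclaim stops below this */
	}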

Step 3: Dirty the pages by creating a 20g file on the hard drive.
$ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1

Here is memory.stat with vs. without the per-memcg background reclaim. It used
to be that all pages were reclaimed through direct reclaim; now some of the
pages are also reclaimed in the background.

Only direct reclaim                       With background reclaim:

pgpgin 5248668                            pgpgin 5248347
pgpgout 5120678                           pgpgout 5133505
kswapd_steal 0                            kswapd_steal 1476614
pg_pgsteal 5120578                        pg_pgsteal 3656868
kswapd_pgscan 0                           kswapd_pgscan 3137098
pg_scan 10861956                          pg_scan 6848006
pgrefill 271174                           pgrefill 290441
pgoutrun 0                                pgoutrun 18047
allocstall 131689                         allocstall 100179

real    7m42.702s                         real 7m42.323s
user    0m0.763s                          user 0m0.748s
sys     0m58.785s                         sys  0m52.123s

throughput is 44.33 MB/sec                throughput is 44.23 MB/sec

Step 4: Cleanup
$ echo $$ >/dev/cgroup/memory/tasks
$ echo 1 > /dev/cgroup/memory/A/memory.force_empty
$ rmdir /dev/cgroup/memory/A
$ echo 3 >/proc/sys/vm/drop_caches

Step 5: Create the same cgroup and read the 20g file into pagecache.
$ cat /export/hdc3/dd/tf0 > /dev/zero

With per-cgroup background reclaim, all the pages are reclaimed in the
background instead of through direct reclaim.

Only direct reclaim                       With background reclaim:
pgpgin 5248668                            pgpgin 5248114
pgpgout 5120678                           pgpgout 5133480
kswapd_steal 0                            kswapd_steal 5133397
pg_pgsteal 5120578                        pg_pgsteal 0
kswapd_pgscan 0                           kswapd_pgscan 5133400
pg_scan 10861956                          pg_scan 0
pgrefill 271174                           pgrefill 0
pgoutrun 0                                pgoutrun 40535
allocstall 131689                         allocstall 0

real    7m42.702s                         real 6m20.439s
user    0m0.763s                          user 0m0.169s
sys     0m58.785s                         sys  0m26.574s

Note:
This is the first effort at extending targeted reclaim into memcg. Here are
the known issues and our plan:

1. There is one kswapd thread per cgroup. The thread is created when the
cgroup changes its limit_in_bytes and is destroyed when the cgroup is
removed. In environments where thousands of cgroups are configured on a
single host, we will have thousands of kswapd threads. The memory consumption
would be about 8k * 1000 = 8M. We don't see a big issue for now if the host
can hold that many cgroups.

2. There is potential lock contention between per-cgroup kswapds, and the
worst case depends on the number of cpu cores on the system. Basically we are
now sharing the zone->lru_lock between the per-memcg LRU and the global LRU. I
have a plan to get rid of the global LRU eventually, which requires enhancing
the existing targeted reclaim (this patchset is part of that). I would like to
get to that point, where the lock contention problem will be solved naturally.

3. There is no hierarchical reclaim support in this patchset. I would like to
get to it after the basic pieces are accepted.

Ying Han (7):
  Add kswapd descriptor
  Add per memcg reclaim watermarks
  New APIs to adjust per-memcg wmarks
  Infrastructure to support per-memcg reclaim.
  Per-memcg background reclaim.
  Enable per-memcg background reclaim.
  Add some per-memcg stats

 Documentation/cgroups/memory.txt |   14 ++
 include/linux/memcontrol.h       |   91 ++++++++
 include/linux/mmzone.h           |    3 +-
 include/linux/res_counter.h      |   80 +++++++
 include/linux/swap.h             |   14 +-
 kernel/res_counter.c             |    6 +
 mm/memcontrol.c                  |  375 +++++++++++++++++++++++++++++++
 mm/page_alloc.c                  |    1 -
 mm/vmscan.c                      |  461 ++++++++++++++++++++++++++++++++------
 9 files changed, 976 insertions(+), 69 deletions(-)

-- 
1.7.3.1


* [PATCH V3 1/7] Add kswapd descriptor
  2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
@ 2011-04-13  7:03 ` Ying Han
  2011-04-13  7:03 ` [PATCH V3 2/7] Add per memcg reclaim watermarks Ying Han
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen
  Cc: linux-mm

There is a kswapd kernel thread for each NUMA node. We will add a separate
kswapd for each memcg. Each kswapd sleeps on the wait queue headed at the
kswapd_wait field of a kswapd descriptor. The descriptor stores information
about the node or the memcg, and it allows global and per-memcg background
reclaim to share common reclaim code.

This patch adds the kswapd descriptor and converts the per-node kswapd to use
the new structure.
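
As an illustration (not a hunk from this patch), the descriptor lets code that
only holds the wait queue pointer recover its context via container_of(), the
way kswapd_stop() and the CPU notifier below do; the helper name here is made
up for the example:

	/* sketch: pgdat->kswapd_wait points into the descriptor after kswapd_run() */
	static struct kswapd *node_kswapd_descriptor(pg_data_t *pgdat)
	{
		return container_of(pgdat->kswapd_wait, struct kswapd, kswapd_wait);
	}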

changelog v2..v1:
1. Dynamically allocate the kswapd descriptor and initialize the
wait_queue_head of the pgdat at kswapd_run.
2. Add the helper macro is_node_kswapd to distinguish per-node from per-cgroup
kswapd descriptors.

changelog v3..v2:
1. Move the struct mem_cgroup *kswapd_mem member of the kswapd struct to a
later patch.
2. Rename thr in kswapd_run to something more descriptive.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/mmzone.h |    3 +-
 include/linux/swap.h   |    7 ++++
 mm/page_alloc.c        |    1 -
 mm/vmscan.c            |   95 ++++++++++++++++++++++++++++++++++++------------
 4 files changed, 80 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 628f07b..6cba7d2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -640,8 +640,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	wait_queue_head_t kswapd_wait;
-	struct task_struct *kswapd;
+	wait_queue_head_t *kswapd_wait;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 } pg_data_t;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed6ebe6..f43d406 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
 	return current->flags & PF_KSWAPD;
 }
 
+struct kswapd {
+	struct task_struct *kswapd_task;
+	wait_queue_head_t kswapd_wait;
+	pg_data_t *kswapd_pgdat;
+};
+
+int kswapd(void *p);
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e1b52a..6340865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
-	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..77ac74f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2241,13 +2241,16 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages > (present_pages >> 2);
 }
 
+static DEFINE_SPINLOCK(kswapds_spinlock);
+
 /* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
-					int classzone_idx)
+static int sleeping_prematurely(struct kswapd *kswapd, int order,
+				long remaining, int classzone_idx)
 {
 	int i;
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
+	pg_data_t *pgdat = kswapd->kswapd_pgdat;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
@@ -2570,28 +2573,31 @@ out:
 	return order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
+				int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 
 	if (freezing(current) || kthread_should_stop())
 		return;
 
-	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
-		finish_wait(&pgdat->kswapd_wait, &wait);
-		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+		finish_wait(wait_h, &wait);
+		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 	}
 
 	/*
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -2611,7 +2617,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		else
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
-	finish_wait(&pgdat->kswapd_wait, &wait);
+	finish_wait(wait_h, &wait);
 }
 
 /*
@@ -2627,20 +2633,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
  * If there are applications that are active memory-allocators
  * (most normal use), this basically shouldn't matter.
  */
-static int kswapd(void *p)
+int kswapd(void *p)
 {
 	unsigned long order;
 	int classzone_idx;
-	pg_data_t *pgdat = (pg_data_t*)p;
+	struct kswapd *kswapd_p = (struct kswapd *)p;
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
 	};
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	const struct cpumask *cpumask;
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
+	BUG_ON(pgdat->kswapd_wait != wait_h);
+	cpumask = cpumask_of_node(pgdat->node_id);
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 	current->reclaim_state = &reclaim_state;
@@ -2679,7 +2689,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx);
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
@@ -2719,13 +2729,13 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 		pgdat->kswapd_max_order = order;
 		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
 	}
-	if (!waitqueue_active(&pgdat->kswapd_wait))
+	if (!waitqueue_active(pgdat->kswapd_wait))
 		return;
 	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	wake_up_interruptible(&pgdat->kswapd_wait);
+	wake_up_interruptible(pgdat->kswapd_wait);
 }
 
 /*
@@ -2817,12 +2827,23 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 		for_each_node_state(nid, N_HIGH_MEMORY) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 			const struct cpumask *mask;
+			struct kswapd *kswapd_p;
+			struct task_struct *kswapd_thr;
+			wait_queue_head_t *wait;
 
 			mask = cpumask_of_node(pgdat->node_id);
 
+			spin_lock(&kswapds_spinlock);
+			wait = pgdat->kswapd_wait;
+			kswapd_p = container_of(wait, struct kswapd,
+						kswapd_wait);
+			kswapd_thr = kswapd_p->kswapd_task;
+			spin_unlock(&kswapds_spinlock);
+
 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
 				/* One of our CPUs online: restore mask */
-				set_cpus_allowed_ptr(pgdat->kswapd, mask);
+				if (kswapd_thr)
+					set_cpus_allowed_ptr(kswapd_thr, mask);
 		}
 	}
 	return NOTIFY_OK;
@@ -2835,18 +2856,31 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 int kswapd_run(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
+	struct task_struct *kswapd_thr;
+	struct kswapd *kswapd_p;
 	int ret = 0;
 
-	if (pgdat->kswapd)
+	if (pgdat->kswapd_wait)
 		return 0;
 
-	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
-	if (IS_ERR(pgdat->kswapd)) {
+	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
+	if (!kswapd_p)
+		return -ENOMEM;
+
+	init_waitqueue_head(&kswapd_p->kswapd_wait);
+	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_pgdat = pgdat;
+
+	kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (IS_ERR(kswapd_thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
 		printk("Failed to start kswapd on node %d\n",nid);
+		pgdat->kswapd_wait = NULL;
+		kfree(kswapd_p);
 		ret = -1;
-	}
+	} else
+		kswapd_p->kswapd_task = kswapd_thr;
 	return ret;
 }
 
@@ -2855,10 +2889,25 @@ int kswapd_run(int nid)
  */
 void kswapd_stop(int nid)
 {
-	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+	struct task_struct *kswapd_thr = NULL;
+	struct kswapd *kswapd_p = NULL;
+	wait_queue_head_t *wait;
+
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	spin_lock(&kswapds_spinlock);
+	wait = pgdat->kswapd_wait;
+	if (wait) {
+		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+		kswapd_thr = kswapd_p->kswapd_task;
+		kswapd_p->kswapd_task = NULL;
+	}
+	spin_unlock(&kswapds_spinlock);
+
+	if (kswapd_thr)
+		kthread_stop(kswapd_thr);
 
-	if (kswapd)
-		kthread_stop(kswapd);
+	kfree(kswapd_p);
 }
 
 static int __init kswapd_init(void)
-- 
1.7.3.1


* [PATCH V3 2/7] Add per memcg reclaim watermarks
  2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
  2011-04-13  7:03 ` [PATCH V3 1/7] Add kswapd descriptor Ying Han
@ 2011-04-13  7:03 ` Ying Han
  2011-04-13  8:25   ` KAMEZAWA Hiroyuki
  2011-04-14  8:24   ` Zhu Yanhai
  2011-04-13  7:03 ` [PATCH V3 3/7] New APIs to adjust per-memcg wmarks Ying Han
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Ying Han @ 2011-04-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen
  Cc: linux-mm

Two watermarks are added per-memcg: "high_wmark" and "low_wmark". The
per-memcg kswapd is invoked when the memcg's memory usage (usage_in_bytes)
rises above the low_wmark. The kswapd thread then reclaims pages until the
usage drops below the high_wmark.

Each watermark is calculated from the memcg's hard_limit (limit_in_bytes).
Each time the hard_limit is changed, the corresponding wmarks are
re-calculated. Since the memory controller charges only user pages, there is
no need for a "min_wmark". The watermarks are currently a function of
"wmark_ratio", which is set to 0 by default. When the value is 0, the
watermarks are equal to the hard_limit.
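
A minimal sketch of how the two checks are meant to be used once the per-memcg
kswapd is wired up in the later patches (the helper names here are made up for
illustration, not part of this patch):

	/* usage has risen above low_wmark: background reclaim should start */
	static bool memcg_over_low_wmark(struct mem_cgroup *mem)
	{
		return !mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW);
	}

	/* usage has dropped below high_wmark: background reclaim can stop */
	static bool memcg_under_high_wmark(struct mem_cgroup *mem)
	{
		return mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH);
	}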

changelog v3..v2:
1. Add VM_BUG_ON() in a couple of places.
2. Remove the spinlock around min_free_kbytes since the only consequence is
reading stale data.
3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based on
the hard_limit.

changelog v2..v1:
1. Remove the res_counter_charge on wmark due to performance concerns.
2. Move the new APIs min_free_kbytes and reclaim_wmarks into a separate commit.
3. Calculate min_free_kbytes automatically based on limit_in_bytes.
4. Make the wmark consistent with the core VM, which checks free pages instead
of usage.
5. Change the wmark checks to be boolean.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h  |    1 +
 include/linux/res_counter.h |   80 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    6 +++
 mm/memcontrol.c             |   52 ++++++++++++++++++++++++++++
 4 files changed, 139 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..3ece36d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..fa4181b 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -39,6 +39,16 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit that reclaim triggers. TODO: res_counter in mem
+	 * or wmark_limit.
+	 */
+	unsigned long long low_wmark_limit;
+	/*
+	 * the limit that reclaim stops. TODO: res_counter in mem or
+	 * wmark_limit.
+	 */
+	unsigned long long high_wmark_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -55,6 +65,9 @@ struct res_counter {
 
 #define RESOURCE_MAX (unsigned long long)LLONG_MAX
 
+#define CHARGE_WMARK_LOW	0x01
+#define CHARGE_WMARK_HIGH	0x02
+
 /**
  * Helpers to interact with userspace
  * res_counter_read_u64() - returns the value of the specified member.
@@ -92,6 +105,8 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_WMARK_LIMIT,
+	RES_HIGH_WMARK_LIMIT
 };
 
 /*
@@ -147,6 +162,24 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
 	return margin;
 }
 
+static inline bool
+res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->high_wmark_limit)
+		return true;
+
+	return false;
+}
+
+static inline bool
+res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->low_wmark_limit)
+		return true;
+
+	return false;
+}
+
 /**
  * Get the difference between the usage and the soft limit
  * @cnt: The counter
@@ -169,6 +202,30 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+static inline bool
+res_counter_check_under_low_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_low_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool
+res_counter_check_under_high_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_high_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -214,4 +271,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_high_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
+static inline int
+res_counter_set_low_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..206a724 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
+	counter->low_wmark_limit = RESOURCE_MAX;
+	counter->high_wmark_limit = RESOURCE_MAX;
 	counter->parent = parent;
 }
 
@@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_WMARK_LIMIT:
+		return &counter->low_wmark_limit;
+	case RES_HIGH_WMARK_LIMIT:
+		return &counter->high_wmark_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4407dd0..664cdc5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -272,6 +272,8 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	int wmark_ratio;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -353,6 +355,7 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
+static unsigned long get_wmark_ratio(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -813,6 +816,27 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	u64 limit;
+	unsigned long wmark_ratio;
+
+	wmark_ratio = get_wmark_ratio(mem);
+	limit = mem_cgroup_get_limit(mem);
+	if (wmark_ratio == 0) {
+		res_counter_set_low_wmark_limit(&mem->res, limit);
+		res_counter_set_high_wmark_limit(&mem->res, limit);
+	} else {
+		unsigned long low_wmark, high_wmark;
+		unsigned long long tmp = (wmark_ratio * limit) / 100;
+
+		low_wmark = tmp;
+		high_wmark = tmp - (tmp >> 8);
+		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
+		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+	}
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -1195,6 +1219,16 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return memcg->swappiness;
 }
 
+static unsigned long get_wmark_ratio(struct mem_cgroup *memcg)
+{
+	struct cgroup *cgrp = memcg->css.cgroup;
+
+	VM_BUG_ON(!cgrp);
+	VM_BUG_ON(!cgrp->parent);
+
+	return memcg->wmark_ratio;
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
@@ -3205,6 +3239,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3264,6 +3299,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -4521,6 +4557,22 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
+				int charge_flags)
+{
+	long ret = 0;
+	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
+
+	VM_BUG_ON((charge_flags & flags) == flags);
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		ret = res_counter_check_under_low_wmark_limit(&mem->res);
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		ret = res_counter_check_under_high_wmark_limit(&mem->res);
+
+	return ret;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
-- 
1.7.3.1


* [PATCH V3 3/7] New APIs to adjust per-memcg wmarks
  2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
  2011-04-13  7:03 ` [PATCH V3 1/7] Add kswapd descriptor Ying Han
  2011-04-13  7:03 ` [PATCH V3 2/7] Add per memcg reclaim watermarks Ying Han
@ 2011-04-13  7:03 ` Ying Han
  2011-04-13  8:30   ` KAMEZAWA Hiroyuki
  2011-04-13  7:03 ` [PATCH V3 4/7] Infrastructure to support per-memcg reclaim Ying Han
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen
  Cc: linux-mm

Add the per-memcg wmark_ratio and reclaim_wmarks APIs. wmark_ratio adjusts
the internal low/high wmark calculation and reclaim_wmarks exports the current
watermark values. By default, wmark_ratio is set to 0 and the watermarks are
equal to the hard_limit (limit_in_bytes).

$ cat /dev/cgroup/A/memory.wmark_ratio
0

$ cat /dev/cgroup/A/memory.limit_in_bytes
524288000

$ echo 80 >/dev/cgroup/A/memory.wmark_ratio

$ cat /dev/cgroup/A/memory.reclaim_wmarks
low_wmark 393216000
high_wmark 419430400

changelog v3..v2:
1. Replace the "min_free_kbytes" API with "wmark_ratio", based on review
feedback.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 664cdc5..36ae377 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3983,6 +3983,31 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 mem_cgroup_wmark_ratio_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return get_wmark_ratio(memcg);
+}
+
+static int mem_cgroup_wmark_ratio_write(struct cgroup *cgrp, struct cftype *cfg,
+				     u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+
+	parent = mem_cgroup_from_cont(cgrp->parent);
+
+	memcg->wmark_ratio = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4274,6 +4299,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 	mutex_unlock(&memcg_oom_mutex);
 }
 
+static int mem_cgroup_wmark_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	u64 low_wmark, high_wmark;
+
+	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+
+	cb->fill(cb, "low_wmark", low_wmark);
+	cb->fill(cb, "high_wmark", high_wmark);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4377,6 +4417,15 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "wmark_ratio",
+		.write_u64 = mem_cgroup_wmark_ratio_write,
+		.read_u64 = mem_cgroup_wmark_ratio_read,
+	},
+	{
+		.name = "reclaim_wmarks",
+		.read_map = mem_cgroup_wmark_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1


* [PATCH V3 4/7] Infrastructure to support per-memcg reclaim.
  2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
                   ` (2 preceding siblings ...)
  2011-04-13  7:03 ` [PATCH V3 3/7] New APIs to adjust per-memcg wmarks Ying Han
@ 2011-04-13  7:03 ` Ying Han
  2011-04-14  3:57   ` Zhu Yanhai
  2011-04-13  7:03 ` [PATCH V3 5/7] Per-memcg background reclaim Ying Han
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen
  Cc: linux-mm

Add the kswapd_mem field to the kswapd descriptor, which links the kswapd
kernel thread to a memcg. The per-memcg kswapd sleeps on the wait queue headed
at the kswapd_wait field of the kswapd descriptor.

The kswapd() function is now shared between the global and per-memcg kswapds.
It is passed the kswapd descriptor, which carries the information for either a
node or a memcg. The new function balance_mem_cgroup_pgdat() is invoked for a
per-memcg kswapd thread; its implementation comes in the following patch.
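
For orientation, a rough sketch of the intended call sites (not taken from the
hunks below; the per-memcg hook-up itself lands in patch 6/7):

	kswapd_run(nid, NULL);	/* per-node: spawns "kswapd_<nid>", keyed off pgdat->kswapd_wait */
	kswapd_run(0, mem);	/* per-memcg: nid is unused, spawns "memcg_<css_id>" */
	kswapd_stop(0, mem);	/* tears the per-memcg kswapd down again */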

changelog v3..v2:
1. Split off from the initial patch, which included all changes of the
following three patches.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    5 ++
 include/linux/swap.h       |    5 +-
 mm/memcontrol.c            |   29 ++++++++
 mm/vmscan.c                |  152 ++++++++++++++++++++++++++++++--------------
 4 files changed, 141 insertions(+), 50 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3ece36d..f7ffd1f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+struct kswapd;
 
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
@@ -83,6 +84,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
+extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
+				  struct kswapd *kswapd_p);
+extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
+extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f43d406..17e0511 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -30,6 +30,7 @@ struct kswapd {
 	struct task_struct *kswapd_task;
 	wait_queue_head_t kswapd_wait;
 	pg_data_t *kswapd_pgdat;
+	struct mem_cgroup *kswapd_mem;
 };
 
 int kswapd(void *p);
@@ -303,8 +304,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
 }
 #endif
 
-extern int kswapd_run(int nid);
-extern void kswapd_stop(int nid);
+extern int kswapd_run(int nid, struct mem_cgroup *mem);
+extern void kswapd_stop(int nid, struct mem_cgroup *mem);
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36ae377..acd84a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -274,6 +274,8 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 
 	int wmark_ratio;
+
+	wait_queue_head_t *kswapd_wait;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -4622,6 +4624,33 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
 	return ret;
 }
 
+int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
+{
+	if (!mem || !kswapd_p)
+		return 0;
+
+	mem->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_mem = mem;
+
+	return css_id(&mem->css);
+}
+
+void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
+{
+	if (mem)
+		mem->kswapd_wait = NULL;
+
+	return;
+}
+
+wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return NULL;
+
+	return mem->kswapd_wait;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 77ac74f..a1a1211 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2242,6 +2242,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 }
 
 static DEFINE_SPINLOCK(kswapds_spinlock);
+#define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
 
 /* is kswapd sleeping prematurely? */
 static int sleeping_prematurely(struct kswapd *kswapd, int order,
@@ -2251,11 +2252,16 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
 	pg_data_t *pgdat = kswapd->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd->kswapd_mem;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
 
+	/* Doesn't support for per-memcg reclaim */
+	if (mem)
+		return false;
+
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
@@ -2598,19 +2604,25 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	 * go fully to sleep until explicitly woken up.
 	 */
 	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
-		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+		if (is_node_kswapd(kswapd_p)) {
+			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
-		/*
-		 * vmstat counters are not perfectly accurate and the estimated
-		 * value for counters such as NR_FREE_PAGES can deviate from the
-		 * true value by nr_online_cpus * threshold. To avoid the zone
-		 * watermarks being breached while under pressure, we reduce the
-		 * per-cpu vmstat threshold while kswapd is awake and restore
-		 * them before going back to sleep.
-		 */
-		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
-		schedule();
-		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+			/*
+			 * vmstat counters are not perfectly accurate and the
+			 * estimated value for counters such as NR_FREE_PAGES
+			 * can deviate from the true value by nr_online_cpus *
+			 * threshold. To avoid the zone watermarks being
+			 * breached while under pressure, we reduce the per-cpu
+			 * vmstat threshold while kswapd is awake and restore
+			 * them before going back to sleep.
+			 */
+			set_pgdat_percpu_threshold(pgdat,
+						   calculate_normal_threshold);
+			schedule();
+			set_pgdat_percpu_threshold(pgdat,
+						calculate_pressure_threshold);
+		} else
+			schedule();
 	} else {
 		if (remaining)
 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2620,6 +2632,12 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	finish_wait(wait_h, &wait);
 }
 
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+							int order)
+{
+	return 0;
+}
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2639,6 +2657,7 @@ int kswapd(void *p)
 	int classzone_idx;
 	struct kswapd *kswapd_p = (struct kswapd *)p;
 	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
 	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 
@@ -2649,10 +2668,12 @@ int kswapd(void *p)
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
-	BUG_ON(pgdat->kswapd_wait != wait_h);
-	cpumask = cpumask_of_node(pgdat->node_id);
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
+	if (is_node_kswapd(kswapd_p)) {
+		BUG_ON(pgdat->kswapd_wait != wait_h);
+		cpumask = cpumask_of_node(pgdat->node_id);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(tsk, cpumask);
+	}
 	current->reclaim_state = &reclaim_state;
 
 	/*
@@ -2677,24 +2698,29 @@ int kswapd(void *p)
 		int new_classzone_idx;
 		int ret;
 
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order = 0;
-		pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		if (order < new_order || classzone_idx > new_classzone_idx) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation or has tigher zone constraints
-			 */
-			order = new_order;
-			classzone_idx = new_classzone_idx;
-		} else {
-			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
-			order = pgdat->kswapd_max_order;
-			classzone_idx = pgdat->classzone_idx;
+		if (is_node_kswapd(kswapd_p)) {
+			new_order = pgdat->kswapd_max_order;
+			new_classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		}
+			if (order < new_order ||
+					classzone_idx > new_classzone_idx) {
+				/*
+				 * Don't sleep if someone wants a larger 'order'
+				 * allocation or has tigher zone constraints
+				 */
+				order = new_order;
+				classzone_idx = new_classzone_idx;
+			} else {
+				kswapd_try_to_sleep(kswapd_p, order,
+						    classzone_idx);
+				order = pgdat->kswapd_max_order;
+				classzone_idx = pgdat->classzone_idx;
+				pgdat->kswapd_max_order = 0;
+				pgdat->classzone_idx = MAX_NR_ZONES - 1;
+			}
+		} else
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -2705,8 +2731,13 @@ int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			order = balance_pgdat(pgdat, order, &classzone_idx);
+			if (is_node_kswapd(kswapd_p)) {
+				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
+								order);
+				order = balance_pgdat(pgdat, order,
+							&classzone_idx);
+			} else
+				balance_mem_cgroup_pgdat(mem, order);
 		}
 	}
 	return 0;
@@ -2853,30 +2884,53 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
  * This kswapd start function will be called by init and node-hot-add.
  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
  */
-int kswapd_run(int nid)
+int kswapd_run(int nid, struct mem_cgroup *mem)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
 	struct task_struct *kswapd_thr;
+	pg_data_t *pgdat = NULL;
 	struct kswapd *kswapd_p;
+	static char name[TASK_COMM_LEN];
+	int memcg_id;
 	int ret = 0;
 
-	if (pgdat->kswapd_wait)
-		return 0;
+	if (!mem) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kswapd_wait)
+			return ret;
+	}
 
 	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
 	if (!kswapd_p)
 		return -ENOMEM;
 
 	init_waitqueue_head(&kswapd_p->kswapd_wait);
-	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
-	kswapd_p->kswapd_pgdat = pgdat;
 
-	kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (!mem) {
+		pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+		kswapd_p->kswapd_pgdat = pgdat;
+		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
+	} else {
+		memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
+		if (!memcg_id) {
+			kfree(kswapd_p);
+			return ret;
+		}
+		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
+	}
+
+	kswapd_thr = kthread_run(kswapd, kswapd_p, name);
 	if (IS_ERR(kswapd_thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
-		printk("Failed to start kswapd on node %d\n",nid);
-		pgdat->kswapd_wait = NULL;
+		if (!mem) {
+			printk(KERN_ERR "Failed to start kswapd on node %d\n",
+								nid);
+			pgdat->kswapd_wait = NULL;
+		} else {
+			printk(KERN_ERR "Failed to start kswapd on memcg %d\n",
+								memcg_id);
+			mem_cgroup_clear_kswapd(mem);
+		}
 		kfree(kswapd_p);
 		ret = -1;
 	} else
@@ -2887,16 +2941,18 @@ int kswapd_run(int nid)
 /*
  * Called by memory hotplug when all memory in a node is offlined.
  */
-void kswapd_stop(int nid)
+void kswapd_stop(int nid, struct mem_cgroup *mem)
 {
 	struct task_struct *kswapd_thr = NULL;
 	struct kswapd *kswapd_p = NULL;
 	wait_queue_head_t *wait;
 
-	pg_data_t *pgdat = NODE_DATA(nid);
-
 	spin_lock(&kswapds_spinlock);
-	wait = pgdat->kswapd_wait;
+	if (!mem)
+		wait = NODE_DATA(nid)->kswapd_wait;
+	else
+		wait = mem_cgroup_kswapd_wait(mem);
+
 	if (wait) {
 		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
 		kswapd_thr = kswapd_p->kswapd_task;
@@ -2916,7 +2972,7 @@ static int __init kswapd_init(void)
 
 	swap_setup();
 	for_each_node_state(nid, N_HIGH_MEMORY)
- 		kswapd_run(nid);
+		kswapd_run(nid, NULL);
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
-- 
1.7.3.1


* [PATCH V3 5/7] Per-memcg background reclaim.
  2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
                   ` (3 preceding siblings ...)
  2011-04-13  7:03 ` [PATCH V3 4/7] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-13  7:03 ` Ying Han
  2011-04-13  8:58   ` KAMEZAWA Hiroyuki
  2011-04-13  7:03 ` [PATCH V3 6/7] Enable per-memcg " Ying Han
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen
  Cc: linux-mm

This adds the main loop of per-memcg background reclaim, implemented in
balance_mem_cgroup_pgdat().

The function performs a priority loop similar to global reclaim. During each
iteration it invokes balance_pgdat_node() for all nodes on the system, another
new function that performs background reclaim per node. A fairness mechanism
remembers the last node it reclaimed from and always starts at the next one.
After reclaiming each node, it checks mem_cgroup_watermark_ok() and breaks out
of the priority loop if that returns true. A per-memcg zone is marked
"unreclaimable" if the scanning rate becomes much greater than the reclaiming
rate on the per-memcg LRU. The bit is cleared when a page charged to the memcg
is freed. Kswapd breaks out of the priority loop if all the zones are marked
"unreclaimable".
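
A condensed paraphrase of that loop (pseudocode, not the literal hunks below;
the cycle detection, congestion_wait() and the loop_again path are omitted):

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		while (/* haven't cycled the whole node list yet */) {
			nid = mem_cgroup_select_victim_node(mem, &do_nodes);
			balance_pgdat_node(NODE_DATA(nid), order, &sc);

			/* a node leaves do_nodes once all of its zones are
			 * all_unreclaimable, i.e. once pages_scanned exceeds
			 * reclaimable_pages * ZONE_RECLAIMABLE_RATE (6) */

			if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH) ||
			    nodes_empty(do_nodes))
				goto out;	/* back under high_wmark, or nothing left */
		}
	}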

changelog v3..v2:
1. Change mz->all_unreclaimable to be a boolean.
2. Define the ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
3. Some more clean-up.

changelog v2..v1:
1. Move the per-memcg per-zone clear_unreclaimable into the uncharge stage.
2. Share kswapd_run/kswapd_stop between per-memcg and global background
reclaim.
3. Name the per-memcg kswapd "memcg_<css_id>"; the global kswapd keeps its
existing name.
4. Fix a race on kswapd_stop where the per-memcg per-zone info could be
accessed after freeing.
5. Add fairness to the zonelist walk, where the memcg remembers the last zone
reclaimed from.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |   33 +++++++
 include/linux/swap.h       |    2 +
 mm/memcontrol.c            |  136 +++++++++++++++++++++++++++++
 mm/vmscan.c                |  208 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 379 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f7ffd1f..a8159f5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
 				  struct kswapd *kswapd_p);
 extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
 extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
+extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
+extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
+					const nodemask_t *nodes);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
@@ -152,6 +155,12 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+				unsigned long nr_scanned);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
@@ -342,6 +351,25 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
+						struct zone *zone,
+						unsigned long nr_scanned)
+{
+}
+
+static inline void mem_cgroup_clear_unreclaimable(struct page *page,
+							struct zone *zone)
+{
+}
+static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
+		struct zone *zone)
+{
+}
+static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
+						struct zone *zone)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
@@ -360,6 +388,11 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
 {
 }
 
+static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
+								int zid)
+{
+	return false;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 17e0511..319b800 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -160,6 +160,8 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
+#define ZONE_RECLAIMABLE_RATE 6
+
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acd84a8..efeade3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
 	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	unsigned long		pages_scanned;	/* since last reclaim */
+	bool			all_unreclaimable;	/* All pages pinned */
 };
+
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
@@ -275,6 +278,11 @@ struct mem_cgroup {
 
 	int wmark_ratio;
 
+	/* While doing per cgroup background reclaim, we cache the
+	 * last node we reclaimed from
+	 */
+	int last_scanned_node;
+
 	wait_queue_head_t *kswapd_wait;
 };
 
@@ -1129,6 +1137,96 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
+static unsigned long mem_cgroup_zone_reclaimable_pages(
+					struct mem_cgroup_per_zone *mz)
+{
+	int nr;
+	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
+		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
+			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+
+	return nr;
+}
+
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+						unsigned long nr_scanned)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->pages_scanned += nr_scanned;
+}
+
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->pages_scanned <
+				mem_cgroup_zone_reclaimable_pages(mz) *
+				ZONE_RECLAIMABLE_RATE;
+	return 0;
+}
+
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return false;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->all_unreclaimable;
+
+	return false;
+}
+
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->all_unreclaimable = true;
+}
+
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return;
+
+	mz = page_cgroup_zoneinfo(mem, page);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -1545,6 +1643,32 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
+ * Visit the first node after the last_scanned_node of @mem and use that to
+ * reclaim free pages from.
+ */
+int
+mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
+{
+	int next_nid;
+	int last_scanned;
+
+	last_scanned = mem->last_scanned_node;
+
+	/* Initial stage and start from node0 */
+	if (last_scanned == -1)
+		next_nid = 0;
+	else
+		next_nid = next_node(last_scanned, *nodes);
+
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(*nodes);
+
+	mem->last_scanned_node = next_nid;
+
+	return next_nid;
+}
+
+/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -2779,6 +2903,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	 * special functions.
 	 */
 
+	mem_cgroup_clear_unreclaimable(mem, page);
 	unlock_page_cgroup(pc);
 	/*
 	 * even after unlock, we have mem->res.usage here and this memcg
@@ -4501,6 +4626,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
 	}
 	return 0;
 }
@@ -4651,6 +4778,14 @@ wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
 	return mem->kswapd_wait;
 }
 
+int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return -1;
+
+	return mem->last_scanned_node;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4726,6 +4861,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->last_scanned_node = -1;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
 	if (parent)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a1a1211..6571eb8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -47,6 +47,8 @@
 
 #include <linux/swapops.h>
 
+#include <linux/res_counter.h>
+
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -111,6 +113,8 @@ struct scan_control {
 	 * are scanned.
 	 */
 	nodemask_t	*nodemask;
+
+	int priority;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -1410,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
+
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
+
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
@@ -1529,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
@@ -2632,11 +2640,211 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	finish_wait(wait_h, &wait);
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * The function is used for per-memcg LRU. It scanns all the zones of the
+ * node and returns the nr_scanned and nr_reclaimed.
+ */
+static void balance_pgdat_node(pg_data_t *pgdat, int order,
+					struct scan_control *sc)
+{
+	int i, end_zone;
+	unsigned long total_scanned;
+	struct mem_cgroup *mem_cont = sc->mem_cgroup;
+	int priority = sc->priority;
+	int nid = pgdat->node_id;
+
+	/*
+	 * Scan in the highmem->dma direction for the highest
+	 * zone which needs scanning
+	 */
+	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+				priority != DEF_PRIORITY)
+			continue;
+		/*
+		 * Do some background aging of the anon list, to give
+		 * pages a chance to be referenced before reclaiming.
+		 */
+		if (inactive_anon_is_low(zone, sc))
+			shrink_active_list(SWAP_CLUSTER_MAX, zone,
+							sc, priority, 0);
+
+		end_zone = i;
+		goto scan;
+	}
+	return;
+
+scan:
+	total_scanned = 0;
+	/*
+	 * Now scan the zone in the dma->highmem direction, stopping
+	 * at the last zone which needs scanning.
+	 *
+	 * We do this because the page allocator works in the opposite
+	 * direction.  This prevents the page allocator from allocating
+	 * pages behind kswapd's direction of progress, which would
+	 * cause too much scanning of the lower zones.
+	 */
+	for (i = 0; i <= end_zone; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+			priority != DEF_PRIORITY)
+			continue;
+
+		sc->nr_scanned = 0;
+		shrink_zone(priority, zone, sc);
+		total_scanned += sc->nr_scanned;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
+			continue;
+
+		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
+			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
+			sc->may_writepage = 1;
+		}
+	}
+
+	sc->nr_scanned = total_scanned;
+	return;
+}
+
+/*
+ * Per cgroup background reclaim.
+ * TODO: Take off the order since memcg always do order 0
+ */
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+					      int order)
+{
+	int i, nid;
+	int start_node;
+	int priority;
+	bool wmark_ok;
+	int loop;
+	pg_data_t *pgdat;
+	nodemask_t do_nodes;
+	unsigned long total_scanned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = ULONG_MAX,
+		.swappiness = vm_swappiness,
+		.order = order,
+		.mem_cgroup = mem_cont,
+	};
+
+loop_again:
+	do_nodes = NODE_MASK_NONE;
+	sc.may_writepage = !laptop_mode;
+	sc.nr_reclaimed = 0;
+	total_scanned = 0;
+
+	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		sc.priority = priority;
+		wmark_ok = false;
+		loop = 0;
+
+		/* The swap token gets in the way of swapout... */
+		if (!priority)
+			disable_swap_token();
+
+		if (priority == DEF_PRIORITY)
+			do_nodes = node_states[N_ONLINE];
+
+		while (1) {
+			nid = mem_cgroup_select_victim_node(mem_cont,
+							&do_nodes);
+
+			/* Indicate we have cycled the nodelist once.
+			 * TODO: we might add MAX_RECLAIM_LOOP to prevent
+			 * kswapd from burning cpu cycles.
+			 */
+			if (loop == 0) {
+				start_node = nid;
+				loop++;
+			} else if (nid == start_node)
+				break;
+
+			pgdat = NODE_DATA(nid);
+			balance_pgdat_node(pgdat, order, &sc);
+			total_scanned += sc.nr_scanned;
+
+			/* Clear the node from do_nodes once it
+			 * has no reclaimable zone left.
+			 */
+			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+				struct zone *zone = pgdat->node_zones + i;
+
+				if (!populated_zone(zone))
+					continue;
+
+				if (!mem_cgroup_mz_unreclaimable(mem_cont,
+								zone))
+					break;
+			}
+			if (i < 0)
+				node_clear(nid, do_nodes);
+
+			if (mem_cgroup_watermark_ok(mem_cont,
+							CHARGE_WMARK_HIGH)) {
+				wmark_ok = true;
+				goto out;
+			}
+
+			if (nodes_empty(do_nodes)) {
+				wmark_ok = true;
+				goto out;
+			}
+		}
+
+		/* All the nodes are unreclaimable, kswapd is done */
+		if (nodes_empty(do_nodes)) {
+			wmark_ok = true;
+			goto out;
+		}
+
+		if (total_scanned && priority < DEF_PRIORITY - 2)
+			congestion_wait(WRITE, HZ/10);
+
+		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
+			break;
+	}
+out:
+	if (!wmark_ok) {
+		cond_resched();
+
+		try_to_freeze();
+
+		goto loop_again;
+	}
+
+	return sc.nr_reclaimed;
+}
+#else
 static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
 							int order)
 {
 	return 0;
 }
+#endif
 
 /*
  * The background pageout daemon, started as a kernel thread
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH V3 6/7] Enable per-memcg background reclaim.
  2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
                   ` (4 preceding siblings ...)
  2011-04-13  7:03 ` [PATCH V3 5/7] Per-memcg background reclaim Ying Han
@ 2011-04-13  7:03 ` Ying Han
  2011-04-13  9:05   ` KAMEZAWA Hiroyuki
  2011-04-13  7:03 ` [PATCH V3 7/7] Add some per-memcg stats Ying Han
  2011-04-13  7:47 ` [PATCH V3 0/7] memcg: per cgroup background reclaim KAMEZAWA Hiroyuki
  7 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen
  Cc: linux-mm

By default the per-memcg background reclaim is disabled when the limit_in_bytes
is set to the maximum or the wmark_ratio is 0. kswapd_run() is called when the
memcg is being resized, and kswapd_stop() is called when the memcg is being
deleted.

The per-memcg kswapd is woken up based on the usage and the low_wmark, which is
checked once per 1024 event increments per cpu. The memcg's kswapd is woken up
if the usage is larger than the low_wmark.

changelog v3..v2:
1. some clean-ups

changelog v2..v1:
1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
2. remove checking the wmark from per-page charging. now it checks the wmark
periodically based on the event counter.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index efeade3..bfa8646 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -105,10 +105,12 @@ enum mem_cgroup_events_index {
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
 	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_WMARK_EVENTS_THRESH,
 	MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
 #define SOFTLIMIT_EVENTS_TARGET (1024)
+#define WMARK_EVENTS_TARGET (1024)
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -366,6 +368,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 static unsigned long get_wmark_ratio(struct mem_cgroup *mem);
+static void wake_memcg_kswapd(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -545,6 +548,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
+static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
+		wake_memcg_kswapd(mem);
+}
+
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -675,6 +684,9 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
 	case MEM_CGROUP_TARGET_SOFTLIMIT:
 		next = val + SOFTLIMIT_EVENTS_TARGET;
 		break;
+	case MEM_CGROUP_WMARK_EVENTS_THRESH:
+		next = val + WMARK_EVENTS_TARGET;
+		break;
 	default:
 		return;
 	}
@@ -698,6 +710,10 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
+		if (unlikely(__memcg_event_check(mem,
+			MEM_CGROUP_WMARK_EVENTS_THRESH))){
+			mem_cgroup_check_wmark(mem);
+		}
 	}
 }
 
@@ -3384,6 +3400,10 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 	if (!ret && enlarge)
 		memcg_oom_recover(memcg);
 
+	if (!mem_cgroup_is_root(memcg) && !memcg->kswapd_wait &&
+			memcg->wmark_ratio)
+		kswapd_run(0, memcg);
+
 	return ret;
 }
 
@@ -4680,6 +4700,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
+	kswapd_stop(0, mem);
 	mem_cgroup_remove_from_trees(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
@@ -4786,6 +4807,22 @@ int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
 	return mem->last_scanned_node;
 }
 
+static inline
+void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	wait_queue_head_t *wait;
+
+	if (!mem || !mem->wmark_ratio)
+		return;
+
+	wait = mem->kswapd_wait;
+
+	if (!wait || !waitqueue_active(wait))
+		return;
+
+	wake_up_interruptible(wait);
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH V3 7/7] Add some per-memcg stats
  2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
                   ` (5 preceding siblings ...)
  2011-04-13  7:03 ` [PATCH V3 6/7] Enable per-memcg " Ying Han
@ 2011-04-13  7:03 ` Ying Han
  2011-04-13  7:47 ` [PATCH V3 0/7] memcg: per cgroup background reclaim KAMEZAWA Hiroyuki
  7 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-13  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen
  Cc: linux-mm

A bunch of statistics are added to memory.stat to monitor per-cgroup
kswapd performance.

$cat /dev/cgroup/yinghan/memory.stat
kswapd_steal 12588994
pg_pgsteal 0
kswapd_pgscan 18629519
pg_scan 0
pgrefill 2893517
pgoutrun 5342267948
allocstall 0

changelog v2..v1:
1. record the stats using events instead of the stats counters.
2. add the stats to the Documentation

Signed-off-by: Ying Han <yinghan@google.com>
---
 Documentation/cgroups/memory.txt |   14 +++++++
 include/linux/memcontrol.h       |   52 +++++++++++++++++++++++++++
 mm/memcontrol.c                  |   72 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                      |   28 ++++++++++++--
 4 files changed, 162 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b6ed61c..29dee73 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,13 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+kswapd_steal	- # of pages reclaimed by kswapd
+pg_pgsteal	- # of pages reclaimed by direct reclaim
+kswapd_pgscan	- # of pages scanned by kswapd
+pg_scan		- # of pages scanned by direct reclaim
+pgrefill	- # of pages scanned on the active list
+pgoutrun	- # of times kswapd was invoked
+allocstall	- # of times direct reclaim was invoked
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -406,6 +413,13 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_kswapd_steal	- sum of all children's "kswapd_steal"
+total_pg_pgsteal	- sum of all children's "pg_pgsteal"
+total_kswapd_pgscan	- sum of all children's "kswapd_pgscan"
+total_pg_scan		- sum of all children's "pg_scan"
+total_pgrefill		- sum of all children's "pgrefill"
+total_pgoutrun		- sum of all children's "pgoutrun"
+total_allocstall	- sum of all children's "allocstall"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a8159f5..0b7fb22 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -162,6 +162,15 @@ void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
 void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
 				unsigned long nr_scanned);
 
+/* background reclaim stats */
+void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pgrefill(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_outrun(struct mem_cgroup *memcg, int val);
+void mem_cgroup_alloc_stall(struct mem_cgroup *memcg, int val);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
 #endif
@@ -393,6 +402,49 @@ static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
 {
 	return false;
 }
+
+/* background reclaim stats */
+static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
+					   int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
+				       int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
+					    int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
+					int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
+				       int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
+					int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
+					  int val)
+{
+	return;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bfa8646..7c9070e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,13 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	MEM_CGROUP_EVENTS_KSWAPD_STEAL, /* # of pages reclaimed from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSTEAL, /* # of pages reclaimed from ttfp */
+	MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, /* # of pages scanned from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSCAN, /* # of pages scanned from ttfp */
+	MEM_CGROUP_EVENTS_PGREFILL, /* # of pages scanned on active list */
+	MEM_CGROUP_EVENTS_PGOUTRUN, /* # of triggers of background reclaim */
+	MEM_CGROUP_EVENTS_ALLOCSTALL, /* # of triggers of direct reclaim */
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -607,6 +614,41 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_kswapd_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_STEAL], val);
+}
+
+void mem_cgroup_pg_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSTEAL], val);
+}
+
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_PGSCAN], val);
+}
+
+void mem_cgroup_pg_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSCAN], val);
+}
+
+void mem_cgroup_pgrefill(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGREFILL], val);
+}
+
+void mem_cgroup_pg_outrun(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGOUTRUN], val);
+}
+
+void mem_cgroup_alloc_stall(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_ALLOCSTALL], val);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -3955,6 +3997,13 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_KSWAPD_STEAL,
+	MCS_PG_PGSTEAL,
+	MCS_KSWAPD_PGSCAN,
+	MCS_PG_PGSCAN,
+	MCS_PGREFILL,
+	MCS_PGOUTRUN,
+	MCS_ALLOCSTALL,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3977,6 +4026,13 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"kswapd_steal", "total_kswapd_steal"},
+	{"pg_pgsteal", "total_pg_pgsteal"},
+	{"kswapd_pgscan", "total_kswapd_pgscan"},
+	{"pg_scan", "total_pg_scan"},
+	{"pgrefill", "total_pgrefill"},
+	{"pgoutrun", "total_pgoutrun"},
+	{"allocstall", "total_allocstall"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -4006,6 +4062,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
+	/* kswapd stat */
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_STEAL);
+	s->stat[MCS_KSWAPD_STEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSTEAL);
+	s->stat[MCS_PG_PGSTEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN);
+	s->stat[MCS_KSWAPD_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSCAN);
+	s->stat[MCS_PG_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGREFILL);
+	s->stat[MCS_PGREFILL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGOUTRUN);
+	s->stat[MCS_PGOUTRUN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_ALLOCSTALL);
+	s->stat[MCS_ALLOCSTALL] += val;
+
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6571eb8..2532459 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1421,6 +1421,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_pgscan(sc->mem_cgroup, nr_scanned);
+		else
+			mem_cgroup_pg_pgscan(sc->mem_cgroup, nr_scanned);
 	}
 
 	if (nr_taken == 0) {
@@ -1441,9 +1445,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	local_irq_disable();
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	if (scanning_global_lru(sc)) {
+		if (current_is_kswapd())
+			__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+		__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	} else {
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
+		else
+			mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
+	}
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
@@ -1541,7 +1552,12 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
-	__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	if (scanning_global_lru(sc))
+		__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	else
+		mem_cgroup_pgrefill(sc->mem_cgroup, pgscanned);
+
+
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -nr_taken);
 	else
@@ -2054,6 +2070,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
+	else
+		mem_cgroup_alloc_stall(sc->mem_cgroup, 1);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
@@ -2757,6 +2775,8 @@ loop_again:
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
 
+	mem_cgroup_pg_outrun(mem_cont, 1);
+
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc.priority = priority;
 		wmark_ok = false;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/7] memcg: per cgroup background reclaim
  2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
                   ` (6 preceding siblings ...)
  2011-04-13  7:03 ` [PATCH V3 7/7] Add some per-memcg stats Ying Han
@ 2011-04-13  7:47 ` KAMEZAWA Hiroyuki
  2011-04-13 17:53   ` Ying Han
  7 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-13  7:47 UTC (permalink / raw)
  To: Ying Han
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

On Wed, 13 Apr 2011 00:03:00 -0700
Ying Han <yinghan@google.com> wrote:

> The current implementation of memcg supports targeting reclaim when the
> cgroup is reaching its hard_limit and we do direct reclaim per cgroup.
> Per cgroup background reclaim is needed which helps to spread out memory
> pressure over longer period of time and smoothes out the cgroup performance.
> 
> If the cgroup is configured to use per cgroup background reclaim, a kswapd
> thread is created which only scans the per-memcg LRU list. Two watermarks
> ("high_wmark", "low_wmark") are added to trigger the background reclaim and
> stop it. The watermarks are calculated based on the cgroup's limit_in_bytes.
> 
> I run through dd test on large file and then cat the file. Then I compared
> the reclaim related stats in memory.stat.
> 
> Step1: Create a cgroup with 500M memory_limit.
> $ mkdir /dev/cgroup/memory/A
> $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> $ echo $$ >/dev/cgroup/memory/A/tasks
> 
> Step2: Test and set the wmarks.
> $ cat /dev/cgroup/memory/A/memory.wmark_ratio
> 0
> 
> $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> low_wmark 524288000
> high_wmark 524288000
> 
> $ echo 90 >/dev/cgroup/memory/A/memory.wmark_ratio
> 
> $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> low_wmark 471859200
> high_wmark 470016000
> 
> $ ps -ef | grep memcg
> root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
> root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg
> 
> Step3: Dirty the pages by creating a 20g file on hard drive.
> $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
> 
> Here are the memory.stat with vs without the per-memcg reclaim. It used to be
> all the pages are reclaimed from direct reclaim, and now some of the pages are
> also being reclaimed at background.
> 
> Only direct reclaim                       With background reclaim:
> 
> pgpgin 5248668                            pgpgin 5248347
> pgpgout 5120678                           pgpgout 5133505
> kswapd_steal 0                            kswapd_steal 1476614
> pg_pgsteal 5120578                        pg_pgsteal 3656868
> kswapd_pgscan 0                           kswapd_pgscan 3137098
> pg_scan 10861956                          pg_scan 6848006
> pgrefill 271174                           pgrefill 290441
> pgoutrun 0                                pgoutrun 18047
> allocstall 131689                         allocstall 100179
> 
> real    7m42.702s                         real 7m42.323s
> user    0m0.763s                          user 0m0.748s
> sys     0m58.785s                         sys  0m52.123s
> 
> throughput is 44.33 MB/sec                throughput is 44.23 MB/sec
> 
> Step 4: Cleanup
> $ echo $$ >/dev/cgroup/memory/tasks
> $ echo 1 > /dev/cgroup/memory/A/memory.force_empty
> $ rmdir /dev/cgroup/memory/A
> $ echo 3 >/proc/sys/vm/drop_caches
> 
> Step 5: Create the same cgroup and read the 20g file into pagecache.
> $ cat /export/hdc3/dd/tf0 > /dev/zero
> 
> All the pages are reclaimed from background instead of direct reclaim with
> the per cgroup reclaim.
> 
> Only direct reclaim                       With background reclaim:
> pgpgin 5248668                            pgpgin 5248114
> pgpgout 5120678                           pgpgout 5133480
> kswapd_steal 0                            kswapd_steal 5133397
> pg_pgsteal 5120578                        pg_pgsteal 0
> kswapd_pgscan 0                           kswapd_pgscan 5133400
> pg_scan 10861956                          pg_scan 0
> pgrefill 271174                           pgrefill 0
> pgoutrun 0                                pgoutrun 40535
> allocstall 131689                         allocstall 0
> 
> real    7m42.702s                         real 6m20.439s
> user    0m0.763s                          user 0m0.169s
> sys     0m58.785s                         sys  0m26.574s
> 
> Note:
> This is the first effort of enhancing the target reclaim into memcg. Here are
> the existing known issues and our plan:
> 
> 1. there are one kswapd thread per cgroup. the thread is created when the
> cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> removed. In some enviroment when thousand of cgroups are being configured on
> a single host, we will have thousand of kswapd threads. The memory consumption
> would be 8k*100 = 8M. We don't see a big issue for now if the host can host
> that many of cgroups.
> 

What's wrong with using a workqueue?

Pros.
  - we don't have to keep our own thread pool.
  - we don't have to see 'ps -elf' filled with kswapd threads...
Cons.
  - because threads are shared, we can't put the kthread into a cpu cgroup.

Regardless of the workqueue, can't we have a moderate number of threads?

I really don't like having too many threads, and I think one-thread-per-memcg
is enough to cause lock contention problems.
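
For illustration, a minimal sketch of the workqueue-based alternative
(not part of this series; struct memcg_bgreclaim, memcg_bgreclaim_fn() and
memcg_schedule_bgreclaim() are made-up names, only the workqueue calls are
existing kernel API):

#include <linux/workqueue.h>

/* one work item per memcg, initialized at cgroup creation with
 * INIT_WORK(&br->work, memcg_bgreclaim_fn) */
struct memcg_bgreclaim {
	struct mem_cgroup	*memcg;
	struct work_struct	work;
};

static void memcg_bgreclaim_fn(struct work_struct *work)
{
	struct memcg_bgreclaim *br =
		container_of(work, struct memcg_bgreclaim, work);

	/* reclaim until the high watermark is met again */
	balance_mem_cgroup_pgdat(br->memcg, 0);
}

/*
 * Called from memcg_check_events() instead of waking a dedicated
 * kthread.  queue_work() is a no-op while the work is still pending,
 * so the once-per-1024-events check keeps the queueing cheap.
 */
static void memcg_schedule_bgreclaim(struct memcg_bgreclaim *br)
{
	if (!mem_cgroup_watermark_ok(br->memcg, CHARGE_WMARK_LOW))
		queue_work(system_unbound_wq, &br->work);
}

The trade-off is the one noted above: the work runs on shared workers, so
the reclaim cpu time cannot be attributed to a cpu cgroup.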

Anyway, thank you for your patches. I'll review.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 2/7] Add per memcg reclaim watermarks
  2011-04-13  7:03 ` [PATCH V3 2/7] Add per memcg reclaim watermarks Ying Han
@ 2011-04-13  8:25   ` KAMEZAWA Hiroyuki
  2011-04-13 18:40     ` Ying Han
  2011-04-14  8:24   ` Zhu Yanhai
  1 sibling, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-13  8:25 UTC (permalink / raw)
  To: Ying Han
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

On Wed, 13 Apr 2011 00:03:02 -0700
Ying Han <yinghan@google.com> wrote:

> There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
> The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
> is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
> until the usage is lower than the high_wmark.
> 
> Each watermark is calculated based on the hard_limit(limit_in_bytes) for each
> memcg. Each time the hard_limit is changed, the corresponding wmarks are
> re-calculated. Since memory controller charges only user pages, there is
> no need for a "min_wmark". The current calculation of wmarks is a function of
> "wmark_ratio" which is set to 0 by default. When the value is 0, the watermarks
> are equal to the hard_limit.
> 
> changelog v3..v2:
> 1. Add VM_BUG_ON() in a couple of places.
> 2. Remove the spinlock on the min_free_kbytes since the only consequence is
> reading stale data.
> 3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based on
> hard_limit.
> 
> changelog v2..v1:
> 1. Remove the res_counter_charge on wmark due to performance concerns.
> 2. Move the new APIs min_free_kbytes, reclaim_wmarks into a separate commit.
> 3. Calculate the min_free_kbytes automatically based on the limit_in_bytes.
> 4. Make the wmark consistent with the core VM, which checks the free pages
> instead of usage.
> 5. changed wmark to be boolean
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h  |    1 +
>  include/linux/res_counter.h |   80 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/res_counter.c        |    6 +++
>  mm/memcontrol.c             |   52 ++++++++++++++++++++++++++++
>  4 files changed, 139 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5a5ce70..3ece36d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index c9d625c..fa4181b 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -39,6 +39,16 @@ struct res_counter {
>  	 */
>  	unsigned long long soft_limit;
>  	/*
> +	 * the limit that reclaim triggers. TODO: res_counter in mem
> +	 * or wmark_limit.
> +	 */
> +	unsigned long long low_wmark_limit;
> +	/*
> +	 * the limit that reclaim stops. TODO: res_counter in mem or
> +	 * wmark_limit.
> +	 */

What does this TODO mean ?


> +	unsigned long long high_wmark_limit;
> +	/*
>  	 * the number of unsuccessful attempts to consume the resource
>  	 */
>  	unsigned long long failcnt;
> @@ -55,6 +65,9 @@ struct res_counter {
>  
>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
>  
> +#define CHARGE_WMARK_LOW	0x01
> +#define CHARGE_WMARK_HIGH	0x02
> +
>  /**
>   * Helpers to interact with userspace
>   * res_counter_read_u64() - returns the value of the specified member.
> @@ -92,6 +105,8 @@ enum {
>  	RES_LIMIT,
>  	RES_FAILCNT,
>  	RES_SOFT_LIMIT,
> +	RES_LOW_WMARK_LIMIT,
> +	RES_HIGH_WMARK_LIMIT
>  };
>  
>  /*
> @@ -147,6 +162,24 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
>  	return margin;
>  }
>  
> +static inline bool
> +res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->high_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +
> +static inline bool
> +res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->low_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +


>  /**
>   * Get the difference between the usage and the soft limit
>   * @cnt: The counter
> @@ -169,6 +202,30 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
>  	return excess;
>  }
>  
> +static inline bool
> +res_counter_check_under_low_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_low_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
> +static inline bool
> +res_counter_check_under_high_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_high_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +

Why are the internal functions named _check_ ? I like _under_ better.


>  static inline void res_counter_reset_max(struct res_counter *cnt)
>  {
>  	unsigned long flags;
> @@ -214,4 +271,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
>  	return 0;
>  }
>  
> +static inline int
> +res_counter_set_high_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->high_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
> +
> +static inline int
> +res_counter_set_low_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->low_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
>  #endif
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 34683ef..206a724 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
>  	spin_lock_init(&counter->lock);
>  	counter->limit = RESOURCE_MAX;
>  	counter->soft_limit = RESOURCE_MAX;
> +	counter->low_wmark_limit = RESOURCE_MAX;
> +	counter->high_wmark_limit = RESOURCE_MAX;
>  	counter->parent = parent;
>  }
>  
> @@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int member)
>  		return &counter->failcnt;
>  	case RES_SOFT_LIMIT:
>  		return &counter->soft_limit;
> +	case RES_LOW_WMARK_LIMIT:
> +		return &counter->low_wmark_limit;
> +	case RES_HIGH_WMARK_LIMIT:
> +		return &counter->high_wmark_limit;
>  	};
>  
>  	BUG();
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4407dd0..664cdc5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -272,6 +272,8 @@ struct mem_cgroup {
>  	 */
>  	struct mem_cgroup_stat_cpu nocpu_base;
>  	spinlock_t pcp_counter_lock;
> +
> +	int wmark_ratio;
>  };
>  
>  /* Stuffs for move charges at task migration. */
> @@ -353,6 +355,7 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  static void drain_all_stock_async(void);
> +static unsigned long get_wmark_ratio(struct mem_cgroup *mem);
>  
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -813,6 +816,27 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>  	return (mem == root_mem_cgroup);
>  }
>  
> +static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> +{
> +	u64 limit;
> +	unsigned long wmark_ratio;
> +
> +	wmark_ratio = get_wmark_ratio(mem);
> +	limit = mem_cgroup_get_limit(mem);
> +	if (wmark_ratio == 0) {
> +		res_counter_set_low_wmark_limit(&mem->res, limit);
> +		res_counter_set_high_wmark_limit(&mem->res, limit);
> +	} else {
> +		unsigned long low_wmark, high_wmark;
> +		unsigned long long tmp = (wmark_ratio * limit) / 100;

could you make this ratio /1000 instead ? percent is too big a unit.
And, considering misc. cases, I don't think having a per-memcg "ratio" is good.

How about the following ?

 - provide an automatic wmark without a knob. A 0 wmark is okay, for me.
 - provide 2 interfaces, as sketched below:
	memory.low_wmark_distance_in_bytes,   # == hard_limit - low_wmark.
	memory.high_wmark_distance_in_bytes,  # == hard_limit - high_wmark.
   (need to add a sanity check into set_limit.)
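
Concretely, the wmark setup could then drop the ratio arithmetic entirely;
a rough sketch (setup_per_memcg_wmarks_bytes() and the *_distance fields are
hypothetical, the res_counter setters are the ones added in this patch):

/*
 * 'low_wmark_distance' and 'high_wmark_distance' would be new u64
 * members of struct mem_cgroup, set through the two files above and
 * validated (distance < hard limit) at write time and in set_limit.
 */
static void setup_per_memcg_wmarks_bytes(struct mem_cgroup *mem)
{
	u64 limit = mem_cgroup_get_limit(mem);

	res_counter_set_low_wmark_limit(&mem->res,
					limit - mem->low_wmark_distance);
	res_counter_set_high_wmark_limit(&mem->res,
					limit - mem->high_wmark_distance);
}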


> +
> +		low_wmark = tmp;
> +		high_wmark = tmp - (tmp >> 8);
> +		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> +		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> +	}
> +}

Could you explain what low_wmark/high_wmark mean somewhere ?

In this patch, does kswapd run while

	high_wmark < usage < low_wmark ?

Hmm, I would like

	low_wmark < usage < high_wmark.

;) because it's kswapd.
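
For reference, with the setup_per_memcg_wmarks() calculation quoted above the
numbers work out as follows (illustrative values, not taken from the changelog):

	limit      = 1048576000 bytes (1000 MiB), wmark_ratio = 80
	tmp        = 80 * 1048576000 / 100  = 838860800  (800 MiB)
	low_wmark  = tmp                    = 838860800
	high_wmark = tmp - (tmp >> 8)       = 835584000  (~796.9 MiB)

So in this series high_wmark < low_wmark <= limit: kswapd is woken once usage
reaches low_wmark (patch 6/7) and reclaims until usage drops back below
high_wmark (patch 5/7).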


> +
>  /*
>   * Following LRU functions are allowed to be used without PCG_LOCK.
>   * Operations are called by routine of global LRU independently from memcg.
> @@ -1195,6 +1219,16 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return memcg->swappiness;
>  }
>  
> +static unsigned long get_wmark_ratio(struct mem_cgroup *memcg)
> +{
> +	struct cgroup *cgrp = memcg->css.cgroup;
> +
> +	VM_BUG_ON(!cgrp);
> +	VM_BUG_ON(!cgrp->parent);
> +

Does this happen ?

> +	return memcg->wmark_ratio;
> +}
> +
>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>  {
>  	int cpu;
> @@ -3205,6 +3239,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  			else
>  				memcg->memsw_is_minimum = false;
>  		}
> +		setup_per_memcg_wmarks(memcg);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -3264,6 +3299,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  			else
>  				memcg->memsw_is_minimum = false;
>  		}
> +		setup_per_memcg_wmarks(memcg);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -4521,6 +4557,22 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>  
> +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> +				int charge_flags)
> +{
> +	long ret = 0;
> +	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
> +
> +	VM_BUG_ON((charge_flags & flags) == flags);
> +
> +	if (charge_flags & CHARGE_WMARK_LOW)
> +		ret = res_counter_check_under_low_wmark_limit(&mem->res);
> +	if (charge_flags & CHARGE_WMARK_HIGH)
> +		ret = res_counter_check_under_high_wmark_limit(&mem->res);
> +
> +	return ret;
> +}

Hmm, do we need this unified function ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 3/7] New APIs to adjust per-memcg wmarks
  2011-04-13  7:03 ` [PATCH V3 3/7] New APIs to adjust per-memcg wmarks Ying Han
@ 2011-04-13  8:30   ` KAMEZAWA Hiroyuki
  2011-04-13 18:46     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-13  8:30 UTC (permalink / raw)
  To: Ying Han
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

On Wed, 13 Apr 2011 00:03:03 -0700
Ying Han <yinghan@google.com> wrote:

> Add wmark_ratio and reclaim_wmarks APIs per-memcg. The wmark_ratio
> adjusts the internal low/high wmark calculation and the reclaim_wmarks
> exports the current value of watermarks. By default, the wmark_ratio is
> set to 0 and the watermarks are equal to the hard_limit(limit_in_bytes).
> 
> $ cat /dev/cgroup/A/memory.wmark_ratio
> 0
> 
> $ cat /dev/cgroup/A/memory.limit_in_bytes
> 524288000
> 
> $ echo 80 >/dev/cgroup/A/memory.wmark_ratio
> 
> $ cat /dev/cgroup/A/memory.reclaim_wmarks
> low_wmark 393216000
> high_wmark 419430400
> 

I think having _ratio_ will finally lead us to a tragedy like dirty_ratio,
a complicated interface.

For memcg, I'd like to have only _bytes.

And, as I wrote in previous mail, how about setting _distance_ ?

   memory.low_wmark_distance_in_bytes .... # hard_limit - low_wmark.
   memory.high_wmark_distance_in_bytes ... # hard_limit - high_wmark.

Anyway, percent is too big a unit.


Thanks,
-Kame


> changelog v3..v2:
> 1. replace the "min_free_kbytes" api with "wmark_ratio". This is part of
> the review feedback.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  mm/memcontrol.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 49 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 664cdc5..36ae377 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3983,6 +3983,31 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +static u64 mem_cgroup_wmark_ratio_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	return get_wmark_ratio(memcg);
> +}
> +
> +static int mem_cgroup_wmark_ratio_write(struct cgroup *cgrp, struct cftype *cfg,
> +				     u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	struct mem_cgroup *parent;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +
> +	parent = mem_cgroup_from_cont(cgrp->parent);
> +
> +	memcg->wmark_ratio = val;
> +
> +	setup_per_memcg_wmarks(memcg);
> +	return 0;
> +
> +}
> +
>  static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
>  {
>  	struct mem_cgroup_threshold_ary *t;
> @@ -4274,6 +4299,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
>  	mutex_unlock(&memcg_oom_mutex);
>  }
>  
> +static int mem_cgroup_wmark_read(struct cgroup *cgrp,
> +	struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	u64 low_wmark, high_wmark;
> +
> +	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
> +	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> +
> +	cb->fill(cb, "low_wmark", low_wmark);
> +	cb->fill(cb, "high_wmark", high_wmark);
> +
> +	return 0;
> +}
> +
>  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>  	struct cftype *cft,  struct cgroup_map_cb *cb)
>  {
> @@ -4377,6 +4417,15 @@ static struct cftype mem_cgroup_files[] = {
>  		.unregister_event = mem_cgroup_oom_unregister_event,
>  		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>  	},
> +	{
> +		.name = "wmark_ratio",
> +		.write_u64 = mem_cgroup_wmark_ratio_write,
> +		.read_u64 = mem_cgroup_wmark_ratio_read,
> +	},
> +	{
> +		.name = "reclaim_wmarks",
> +		.read_map = mem_cgroup_wmark_read,
> +	},
>  };
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> -- 
> 1.7.3.1
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 5/7] Per-memcg background reclaim.
  2011-04-13  7:03 ` [PATCH V3 5/7] Per-memcg background reclaim Ying Han
@ 2011-04-13  8:58   ` KAMEZAWA Hiroyuki
  2011-04-13 22:45     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-13  8:58 UTC (permalink / raw)
  To: Ying Han
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

On Wed, 13 Apr 2011 00:03:05 -0700
Ying Han <yinghan@google.com> wrote:

> This is the main loop of per-memcg background reclaim which is implemented in
> function balance_mem_cgroup_pgdat().
> 
> The function performs a priority loop similar to global reclaim. During each
> iteration it invokes balance_pgdat_node() for all nodes on the system, which
> is another new function that performs background reclaim per node. A fairness
> mechanism is implemented to remember the last node it was reclaiming from and
> always start at the next one. After reclaiming each node, it checks
> mem_cgroup_watermark_ok() and breaks the priority loop if it returns true. The
> per-memcg zone will be marked as "unreclaimable" if the scanning rate is much
> greater than the reclaiming rate on the per-memcg LRU. The bit is cleared when
> there is a page charged to the memcg being freed. Kswapd breaks the priority
> loop if all the zones are marked as "unreclaimable".
> 

Hmm, bigger than expected. I'd be glad if you could divide this into smaller pieces;
see below.


> changelog v3..v2:
> 1. change mz->all_unreclaimable to be boolean.
> 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
> 3. some more clean-up.
> 
> changelog v2..v1:
> 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> 2. share kswapd_run/kswapd_stop between per-memcg and global background
> reclaim.
> 3. name the per-memcg kswapd as "memcg-id" (css->id). The global kswapd
> keeps its name.
> 4. fix a race on kswapd_stop where the per-memcg-per-zone info could be accessed
> after freeing.
> 5. add fairness to the zonelist where the memcg remembers the last zone reclaimed
> from.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h |   33 +++++++
>  include/linux/swap.h       |    2 +
>  mm/memcontrol.c            |  136 +++++++++++++++++++++++++++++
>  mm/vmscan.c                |  208 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 379 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f7ffd1f..a8159f5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
>  				  struct kswapd *kswapd_p);
>  extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
>  extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
> +extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
> +extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> +					const nodemask_t *nodes);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> @@ -152,6 +155,12 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +				unsigned long nr_scanned);
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> @@ -342,6 +351,25 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  {
>  }
>  
> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
> +						struct zone *zone,
> +						unsigned long nr_scanned)
> +{
> +}
> +
> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
> +							struct zone *zone)
> +{
> +}
> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
> +		struct zone *zone)
> +{
> +}
> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
> +						struct zone *zone)
> +{
> +}
> +
>  static inline
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  					    gfp_t gfp_mask)
> @@ -360,6 +388,11 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
>  {
>  }
>  
> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
> +								int zid)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 17e0511..319b800 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -160,6 +160,8 @@ enum {
>  	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
>  };
>  
> +#define ZONE_RECLAIMABLE_RATE 6
> +
>  #define SWAP_CLUSTER_MAX 32
>  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index acd84a8..efeade3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
>  	bool			on_tree;
>  	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
>  						/* use container_of	   */
> +	unsigned long		pages_scanned;	/* since last reclaim */
> +	bool			all_unreclaimable;	/* All pages pinned */
>  };
> +
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
>  
> @@ -275,6 +278,11 @@ struct mem_cgroup {
>  
>  	int wmark_ratio;
>  
> +	/* While doing per cgroup background reclaim, we cache the
> +	 * last node we reclaimed from
> +	 */
> +	int last_scanned_node;
> +
>  	wait_queue_head_t *kswapd_wait;
>  };
>  
> @@ -1129,6 +1137,96 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>  	return &mz->reclaim_stat;
>  }
>  
> +static unsigned long mem_cgroup_zone_reclaimable_pages(
> +					struct mem_cgroup_per_zone *mz)
> +{
> +	int nr;
> +	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
> +		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
> +
> +	if (nr_swap_pages > 0)
> +		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
> +			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +						unsigned long nr_scanned)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->pages_scanned += nr_scanned;
> +}
> +
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +	if (!mem)
> +		return 0;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->pages_scanned <
> +				mem_cgroup_zone_reclaimable_pages(mz) *
> +				ZONE_RECLAIMABLE_RATE;
> +	return 0;
> +}
> +
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return false;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->all_unreclaimable;
> +
> +	return false;
> +}
> +
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->all_unreclaimable = true;
> +}
> +
> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +	if (!mem)
> +		return;
> +
> +	mz = page_cgroup_zoneinfo(mem, page);
> +	if (mz) {
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = false;
> +	}
> +
> +	return;
> +}
> +
>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -1545,6 +1643,32 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  }
>  
>  /*
> + * Visit the first node after the last_scanned_node of @mem and use that to
> + * reclaim free pages from.
> + */
> +int
> +mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
> +{
> +	int next_nid;
> +	int last_scanned;
> +
> +	last_scanned = mem->last_scanned_node;
> +
> +	/* Initial stage and start from node0 */
> +	if (last_scanned == -1)
> +		next_nid = 0;
> +	else
> +		next_nid = next_node(last_scanned, *nodes);
> +
> +	if (next_nid == MAX_NUMNODES)
> +		next_nid = first_node(*nodes);
> +
> +	mem->last_scanned_node = next_nid;
> +
> +	return next_nid;
> +}
> +
> +/*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
>   */
> @@ -2779,6 +2903,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	 * special functions.
>  	 */
>  
> +	mem_cgroup_clear_unreclaimable(mem, page);

Hmm, do we need to do this at every uncharge ?

I doubt we really need mz->all_unreclaimable ....

Anyway, I'd like to see this all_unreclaimable logic in an independent patch,
because the direct-reclaim pass should see it, too.

So, could you divide this into pieces:

1. record last node .... I wonder whether this logic should be used in the direct-reclaim pass, too.

2. all_unreclaimable .... direct reclaim will be affected, too.

3. scanning core.



>  	unlock_page_cgroup(pc);
>  	/*
>  	 * even after unlock, we have mem->res.usage here and this memcg
> @@ -4501,6 +4626,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>  		mz->usage_in_excess = 0;
>  		mz->on_tree = false;
>  		mz->mem = mem;
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = false;
>  	}
>  	return 0;
>  }
> @@ -4651,6 +4778,14 @@ wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
>  	return mem->kswapd_wait;
>  }
>  
> +int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
> +{
> +	if (!mem)
> +		return -1;
> +
> +	return mem->last_scanned_node;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> @@ -4726,6 +4861,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  		res_counter_init(&mem->memsw, NULL);
>  	}
>  	mem->last_scanned_child = 0;
> +	mem->last_scanned_node = -1;
>  	INIT_LIST_HEAD(&mem->oom_notify);
>  
>  	if (parent)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a1a1211..6571eb8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -47,6 +47,8 @@
>  
>  #include <linux/swapops.h>
>  
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -111,6 +113,8 @@ struct scan_control {
>  	 * are scanned.
>  	 */
>  	nodemask_t	*nodemask;
> +
> +	int priority;
>  };
>  
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> @@ -1410,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  					ISOLATE_BOTH : ISOLATE_INACTIVE,
>  			zone, sc->mem_cgroup,
>  			0, file);
> +
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
> +
>  		/*
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
> @@ -1529,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
>  		 */
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>  	}
>  
>  	reclaim_stat->recent_scanned[file] += nr_taken;
> @@ -2632,11 +2640,211 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>  	finish_wait(wait_h, &wait);
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scans all the zones of the
> + * node and returns the nr_scanned and nr_reclaimed.
> + */
> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> +					struct scan_control *sc)
> +{
> +	int i, end_zone;
> +	unsigned long total_scanned;
> +	struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +	int priority = sc->priority;
> +	int nid = pgdat->node_id;
> +
> +	/*
> +	 * Scan in the highmem->dma direction for the highest
> +	 * zone which needs scanning
> +	 */
> +	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> +				priority != DEF_PRIORITY)
> +			continue;
> +		/*
> +		 * Do some background aging of the anon list, to give
> +		 * pages a chance to be referenced before reclaiming.
> +		 */
> +		if (inactive_anon_is_low(zone, sc))
> +			shrink_active_list(SWAP_CLUSTER_MAX, zone,
> +							sc, priority, 0);
> +
> +		end_zone = i;
> +		goto scan;
> +	}

I don't want to see zone balancing logic in memcg.
It should be the job of the global LRU.

IOW, even if we finally remove the global LRU, we should
implement the zone balancing logic in the _global_ (per node) kswapd.
(kswapd can pass a zone mask to each memcg.)

If you want some clever memcg-specific logic, I think it should be
'which node should be the victim' detection rather than round-robin.
(But yes, starting from a simple round robin makes sense.)

So, could you start with something simpler ?

  do {
    select victim node
    do reclaim
  } while (need_stop)

Zone balancing should be done outside of memcg.

What we really need to improve is 'select victim node'.
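
A slightly more concrete sketch of that loop, reusing the helpers introduced
in this series (a sketch only: simple_memcg_bgreclaim() is a made-up name,
the caller is assumed to fill in scan_control the same way
balance_mem_cgroup_pgdat() does, and the unreclaimable-node/backoff handling
is reduced to a comment):

static unsigned long simple_memcg_bgreclaim(struct mem_cgroup *memcg,
					    struct scan_control *sc)
{
	nodemask_t nodes = node_states[N_ONLINE];
	int nid;

	sc->mem_cgroup = memcg;

	do {
		/* round-robin victim selection, as in this series */
		nid = mem_cgroup_select_victim_node(memcg, &nodes);

		/* reclaim from this node only; zone balancing stays global */
		balance_pgdat_node(NODE_DATA(nid), 0, sc);

		/*
		 * A complete version also has to drop nodes that became
		 * unreclaimable from 'nodes' and back off (priority,
		 * congestion_wait) so kswapd does not burn cpu.
		 */
	} while (!mem_cgroup_watermark_ok(memcg, CHARGE_WMARK_HIGH));

	return sc->nr_reclaimed;
}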

Thanks,
-Kame


> +	return;
> +
> +scan:
> +	total_scanned = 0;
> +	/*
> +	 * Now scan the zone in the dma->highmem direction, stopping
> +	 * at the last zone which needs scanning.
> +	 *
> +	 * We do this because the page allocator works in the opposite
> +	 * direction.  This prevents the page allocator from allocating
> +	 * pages behind kswapd's direction of progress, which would
> +	 * cause too much scanning of the lower zones.
> +	 */
> +	for (i = 0; i <= end_zone; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> +			priority != DEF_PRIORITY)
> +			continue;
> +
> +		sc->nr_scanned = 0;
> +		shrink_zone(priority, zone, sc);
> +		total_scanned += sc->nr_scanned;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
> +			continue;
> +
> +		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
> +			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
> +
> +		/*
> +		 * If we've done a decent amount of scanning and
> +		 * the reclaim ratio is low, start doing writepage
> +		 * even in laptop mode
> +		 */
> +		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> +			sc->may_writepage = 1;
> +		}
> +	}
> +
> +	sc->nr_scanned = total_scanned;
> +	return;
> +}
> +
> +/*
> + * Per cgroup background reclaim.
> + * TODO: Take off the order since memcg always do order 0
> + */
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +					      int order)
> +{
> +	int i, nid;
> +	int start_node;
> +	int priority;
> +	bool wmark_ok;
> +	int loop;
> +	pg_data_t *pgdat;
> +	nodemask_t do_nodes;
> +	unsigned long total_scanned;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.nr_to_reclaim = ULONG_MAX,
> +		.swappiness = vm_swappiness,
> +		.order = order,
> +		.mem_cgroup = mem_cont,
> +	};
> +
> +loop_again:
> +	do_nodes = NODE_MASK_NONE;
> +	sc.may_writepage = !laptop_mode;
> +	sc.nr_reclaimed = 0;
> +	total_scanned = 0;
> +
> +	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +		sc.priority = priority;
> +		wmark_ok = false;
> +		loop = 0;
> +
> +		/* The swap token gets in the way of swapout... */
> +		if (!priority)
> +			disable_swap_token();
> +
> +		if (priority == DEF_PRIORITY)
> +			do_nodes = node_states[N_ONLINE];
> +
> +		while (1) {
> +			nid = mem_cgroup_select_victim_node(mem_cont,
> +							&do_nodes);
> +
> +			/* Indicate we have cycled the nodelist once
> +			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
> +			 * kswapd burning cpu cycles.
> +			 */
> +			if (loop == 0) {
> +				start_node = nid;
> +				loop++;
> +			} else if (nid == start_node)
> +				break;
> +
> +			pgdat = NODE_DATA(nid);
> +			balance_pgdat_node(pgdat, order, &sc);
> +			total_scanned += sc.nr_scanned;
> +
> +			/* Set the node which has at least
> +			 * one reclaimable zone
> +			 */
> +			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +				struct zone *zone = pgdat->node_zones + i;
> +
> +				if (!populated_zone(zone))
> +					continue;
> +
> +				if (!mem_cgroup_mz_unreclaimable(mem_cont,
> +								zone))
> +					break;
> +			}
> +			if (i < 0)
> +				node_clear(nid, do_nodes);
> +
> +			if (mem_cgroup_watermark_ok(mem_cont,
> +							CHARGE_WMARK_HIGH)) {
> +				wmark_ok = true;
> +				goto out;
> +			}
> +
> +			if (nodes_empty(do_nodes)) {
> +				wmark_ok = true;
> +				goto out;
> +			}
> +		}
> +
> +		/* All the nodes are unreclaimable, kswapd is done */
> +		if (nodes_empty(do_nodes)) {
> +			wmark_ok = true;
> +			goto out;
> +		}
> +
> +		if (total_scanned && priority < DEF_PRIORITY - 2)
> +			congestion_wait(WRITE, HZ/10);
> +
> +		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> +			break;
> +	}
> +out:
> +	if (!wmark_ok) {
> +		cond_resched();
> +
> +		try_to_freeze();
> +
> +		goto loop_again;
> +	}
> +
> +	return sc.nr_reclaimed;
> +}
> +#else
>  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
>  							int order)
>  {
>  	return 0;
>  }
> +#endif
>  
>  /*
>   * The background pageout daemon, started as a kernel thread
> -- 
> 1.7.3.1
> 


* Re: [PATCH V3 6/7] Enable per-memcg background reclaim.
  2011-04-13  7:03 ` [PATCH V3 6/7] Enable per-memcg " Ying Han
@ 2011-04-13  9:05   ` KAMEZAWA Hiroyuki
  2011-04-13 21:20     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-13  9:05 UTC (permalink / raw)
  To: Ying Han
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

On Wed, 13 Apr 2011 00:03:06 -0700
Ying Han <yinghan@google.com> wrote:

> By default the per-memcg background reclaim is disabled when the limit_in_bytes
> is set to the maximum or the wmark_ratio is 0. kswapd_run() is called when the
> memcg is being resized, and kswapd_stop() is called when the memcg is being
> deleted.
> 
> The per-memcg kswapd is woken up based on the usage and low_wmark, which is
> checked once per 1024 increments per cpu. The memcg's kswapd is woken up if the
> usage is larger than the low_wmark.
> 
> changelog v3..v2:
> 1. some clean-ups
> 
> changelog v2..v1:
> 1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
> 2. remove checking the wmark from per-page charging. now it checks the wmark
> periodically based on the event counter.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

This event logic seems to make sense.

> ---
>  mm/memcontrol.c |   37 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index efeade3..bfa8646 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -105,10 +105,12 @@ enum mem_cgroup_events_index {
>  enum mem_cgroup_events_target {
>  	MEM_CGROUP_TARGET_THRESH,
>  	MEM_CGROUP_TARGET_SOFTLIMIT,
> +	MEM_CGROUP_WMARK_EVENTS_THRESH,
>  	MEM_CGROUP_NTARGETS,
>  };
>  #define THRESHOLDS_EVENTS_TARGET (128)
>  #define SOFTLIMIT_EVENTS_TARGET (1024)
> +#define WMARK_EVENTS_TARGET (1024)
>  
>  struct mem_cgroup_stat_cpu {
>  	long count[MEM_CGROUP_STAT_NSTATS];
> @@ -366,6 +368,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  static void drain_all_stock_async(void);
>  static unsigned long get_wmark_ratio(struct mem_cgroup *mem);
> +static void wake_memcg_kswapd(struct mem_cgroup *mem);
>  
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -545,6 +548,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
>  	return mz;
>  }
>  
> +static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
> +{
> +	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
> +		wake_memcg_kswapd(mem);
> +}
> +
>  /*
>   * Implementation Note: reading percpu statistics for memcg.
>   *
> @@ -675,6 +684,9 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
>  	case MEM_CGROUP_TARGET_SOFTLIMIT:
>  		next = val + SOFTLIMIT_EVENTS_TARGET;
>  		break;
> +	case MEM_CGROUP_WMARK_EVENTS_THRESH:
> +		next = val + WMARK_EVENTS_TARGET;
> +		break;
>  	default:
>  		return;
>  	}
> @@ -698,6 +710,10 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
>  			__mem_cgroup_target_update(mem,
>  				MEM_CGROUP_TARGET_SOFTLIMIT);
>  		}
> +		if (unlikely(__memcg_event_check(mem,
> +			MEM_CGROUP_WMARK_EVENTS_THRESH))){
> +			mem_cgroup_check_wmark(mem);
> +		}
>  	}
>  }
>  
> @@ -3384,6 +3400,10 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  	if (!ret && enlarge)
>  		memcg_oom_recover(memcg);
>  
> +	if (!mem_cgroup_is_root(memcg) && !memcg->kswapd_wait &&
> +			memcg->wmark_ratio)
> +		kswapd_run(0, memcg);
> +

Isn't it enough to have the trigger in the charge() path ?

Rather than here, I think we should check _move_task(). It changes res usage
dramatically without updating events.

Thanks,
-Kame


>  	return ret;
>  }
>  
> @@ -4680,6 +4700,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>  {
>  	int node;
>  
> +	kswapd_stop(0, mem);
>  	mem_cgroup_remove_from_trees(mem);
>  	free_css_id(&mem_cgroup_subsys, &mem->css);
>  

I think kswapd should stop at mem_cgroup_destroy(). No more tasks will use
this memcg after _destroy().

Thanks,
-Kame



> @@ -4786,6 +4807,22 @@ int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
>  	return mem->last_scanned_node;
>  }
>  
> +static inline
> +void wake_memcg_kswapd(struct mem_cgroup *mem)
> +{
> +	wait_queue_head_t *wait;
> +
> +	if (!mem || !mem->wmark_ratio)
> +		return;
> +
> +	wait = mem->kswapd_wait;
> +
> +	if (!wait || !waitqueue_active(wait))
> +		return;
> +
> +	wake_up_interruptible(wait);
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> -- 
> 1.7.3.1
> 


* Re: [PATCH V3 0/7] memcg: per cgroup background reclaim
  2011-04-13  7:47 ` [PATCH V3 0/7] memcg: per cgroup background reclaim KAMEZAWA Hiroyuki
@ 2011-04-13 17:53   ` Ying Han
  2011-04-14  0:14     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-13 17:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm


On Wed, Apr 13, 2011 at 12:47 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 Apr 2011 00:03:00 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > The current implementation of memcg supports targeting reclaim when the
> > cgroup is reaching its hard_limit and we do direct reclaim per cgroup.
> > Per cgroup background reclaim is needed which helps to spread out memory
> > pressure over longer period of time and smoothes out the cgroup
> performance.
> >
> > If the cgroup is configured to use per cgroup background reclaim, a
> kswapd
> > thread is created which only scans the per-memcg LRU list. Two watermarks
> > ("high_wmark", "low_wmark") are added to trigger the background reclaim
> and
> > stop it. The watermarks are calculated based on the cgroup's
> limit_in_bytes.
> >
> > I run through dd test on large file and then cat the file. Then I
> compared
> > the reclaim related stats in memory.stat.
> >
> > Step1: Create a cgroup with 500M memory_limit.
> > $ mkdir /dev/cgroup/memory/A
> > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/A/tasks
> >
> > Step2: Test and set the wmarks.
> > $ cat /dev/cgroup/memory/A/memory.wmark_ratio
> > 0
> >
> > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > low_wmark 524288000
> > high_wmark 524288000
> >
> > $ echo 90 >/dev/cgroup/memory/A/memory.wmark_ratio
> >
> > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > low_wmark 471859200
> > high_wmark 470016000
> >
> > $ ps -ef | grep memcg
> > root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
> > root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg
> >
> > Step3: Dirty the pages by creating a 20g file on hard drive.
> > $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
> >
> > Here are the memory.stat with vs without the per-memcg reclaim. It used
> to be
> > all the pages are reclaimed from direct reclaim, and now some of the
> pages are
> > also being reclaimed at background.
> >
> > Only direct reclaim                       With background reclaim:
> >
> > pgpgin 5248668                            pgpgin 5248347
> > pgpgout 5120678                           pgpgout 5133505
> > kswapd_steal 0                            kswapd_steal 1476614
> > pg_pgsteal 5120578                        pg_pgsteal 3656868
> > kswapd_pgscan 0                           kswapd_pgscan 3137098
> > pg_scan 10861956                          pg_scan 6848006
> > pgrefill 271174                           pgrefill 290441
> > pgoutrun 0                                pgoutrun 18047
> > allocstall 131689                         allocstall 100179
> >
> > real    7m42.702s                         real 7m42.323s
> > user    0m0.763s                          user 0m0.748s
> > sys     0m58.785s                         sys  0m52.123s
> >
> > throughput is 44.33 MB/sec                throughput is 44.23 MB/sec
> >
> > Step 4: Cleanup
> > $ echo $$ >/dev/cgroup/memory/tasks
> > $ echo 1 > /dev/cgroup/memory/A/memory.force_empty
> > $ rmdir /dev/cgroup/memory/A
> > $ echo 3 >/proc/sys/vm/drop_caches
> >
> > Step 5: Create the same cgroup and read the 20g file into pagecache.
> > $ cat /export/hdc3/dd/tf0 > /dev/zero
> >
> > All the pages are reclaimed from background instead of direct reclaim
> with
> > the per cgroup reclaim.
> >
> > Only direct reclaim                       With background reclaim:
> > pgpgin 5248668                            pgpgin 5248114
> > pgpgout 5120678                           pgpgout 5133480
> > kswapd_steal 0                            kswapd_steal 5133397
> > pg_pgsteal 5120578                        pg_pgsteal 0
> > kswapd_pgscan 0                           kswapd_pgscan 5133400
> > pg_scan 10861956                          pg_scan 0
> > pgrefill 271174                           pgrefill 0
> > pgoutrun 0                                pgoutrun 40535
> > allocstall 131689                         allocstall 0
> >
> > real    7m42.702s                         real 6m20.439s
> > user    0m0.763s                          user 0m0.169s
> > sys     0m58.785s                         sys  0m26.574s
> >
> > Note:
> > This is the first effort of enhancing the target reclaim into memcg. Here
> are
> > the existing known issues and our plan:
> >
> > 1. There is one kswapd thread per cgroup. The thread is created when the
> > cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> > removed. In environments where thousands of cgroups are configured on a
> > single host, we will have thousands of kswapd threads. The memory
> > consumption would be 8k * 1000 = 8M. We don't see a big issue for now if
> > the host can host that many cgroups.
> >
>
> What's bad about using a workqueue ?
>
> Pros.
>  - we don't have to keep our own thread pool.
>  - we don't have to see 'ps -elf' filled with kswapd threads...
> Cons.
>  - because threads are shared, we can't put the kthreads into a cpu cgroup.
>

I did some study on workqueues after posting V2. There was a comment suggesting
a workqueue instead of per-memcg kswapd threads, since it would cut the number
of kernel threads created on a host with lots of cgroups. Each kernel thread
allocates about 8K of stack, so roughly 8M in total with a thousand cgroups.

The current workqueue model, merged in the 2.6.36 kernel, is called
"concurrency managed workqueue (cmwq)" and is intended to provide flexible
concurrency without wasting resources. I studied it a bit and here is what I
found:

1. The workqueue is complicated and we need to be very careful with the work
items we put on it. We have seen cases where one work item gets stuck and the
remaining work items cannot proceed. For example, in dirty page writeback one
heavy-writer cgroup could starve the other cgroups from flushing dirty pages
to the same disk. In the kswapd case, I can imagine we might hit a similar
scenario.

2. How to prioritize the work items is another problem. The order in which
work items are queued dictates the order in which cgroups get reclaimed. We
don't have that restriction today; instead we rely on the cpu scheduler to put
each kswapd on the right cpu core. We "might" introduce reclaim priorities
later, and how would we deal with that then?

3. Based on what I have observed, not many callers have migrated to cmwq yet,
and I don't have much data on how well it works.


> Regardless of workqueue, can't we have a moderate number of threads ?
>
> I really don't like having too many threads, and I think one-thread-per-memcg
> is big enough to cause lock contention problems.
>

Back to the current model: on a machine with a thousand cgroups, the kswapd
threads take about 8M in total (8K of stack for each thread). We are running
systems with fakenuma, where each numa node has its own kswapd, and so far we
haven't noticed problems caused by "lots of" kswapd threads. Also, there
shouldn't be any performance overhead from a kernel thread while it is not
running.

Given the complexity of the workqueue and the benefit it provides, I would
like to stick with the current model first. After we get the basic stuff in
and the other targeted reclaim improvements, we can come back to this. What
do you think?

--Ying

>
> Anyway, thank you for your patches. I'll review.
>

Thank you for your review~

>
> Thanks,
> -Kame
>
>
>



* Re: [PATCH V3 2/7] Add per memcg reclaim watermarks
  2011-04-13  8:25   ` KAMEZAWA Hiroyuki
@ 2011-04-13 18:40     ` Ying Han
  2011-04-14  0:27       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-13 18:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm


On Wed, Apr 13, 2011 at 1:25 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 Apr 2011 00:03:02 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > There are two watermarks added per-memcg including "high_wmark" and
> "low_wmark".
> > The per-memcg kswapd is invoked when the memcg's memory
> usage(usage_in_bytes)
> > is higher than the low_wmark. Then the kswapd thread starts to reclaim
> pages
> > until the usage is lower than the high_wmark.
> >
> > Each watermark is calculated based on the hard_limit(limit_in_bytes) for
> each
> > memcg. Each time the hard_limit is changed, the corresponding wmarks are
> > re-calculated. Since memory controller charges only user pages, there is
> > no need for a "min_wmark". The current calculation of wmarks is a
> function of
> > "wmark_ratio" which is set to 0 by default. When the value is 0, the
> watermarks
> > are equal to the hard_limit.
> >
> > changelog v3..v2:
> > 1. Add VM_BUG_ON() on couple of places.
> > 2. Remove the spinlock on the min_free_kbytes since the consequence of
> reading
> > stale data.
> > 3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based
> on
> > hard_limit.
> >
> > changelog v2..v1:
> > 1. Remove the res_counter_charge on wmark due to performance concern.
> > 2. Move the new APIs min_free_kbytes, reclaim_wmarks into seperate
> commit.
> > 3. Calculate the min_free_kbytes automatically based on the
> limit_in_bytes.
> > 4. make the wmark to be consistant with core VM which checks the free
> pages
> > instead of usage.
> > 5. changed wmark to be boolean
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/memcontrol.h  |    1 +
> >  include/linux/res_counter.h |   80
> +++++++++++++++++++++++++++++++++++++++++++
> >  kernel/res_counter.c        |    6 +++
> >  mm/memcontrol.c             |   52 ++++++++++++++++++++++++++++
> >  4 files changed, 139 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 5a5ce70..3ece36d 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const
> struct mem_cgroup *mem);
> >
> >  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page
> *page);
> >  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> > +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int
> charge_flags);
> >
> >  static inline
> >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index c9d625c..fa4181b 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -39,6 +39,16 @@ struct res_counter {
> >        */
> >       unsigned long long soft_limit;
> >       /*
> > +      * the limit that reclaim triggers. TODO: res_counter in mem
> > +      * or wmark_limit.
> > +      */
> > +     unsigned long long low_wmark_limit;
> > +     /*
> > +      * the limit that reclaim stops. TODO: res_counter in mem or
> > +      * wmark_limit.
> > +      */
>
> What does this TODO mean ?
>

Legacy comment. I will remove it.

>
>
> > +     unsigned long long high_wmark_limit;
> > +     /*
> >        * the number of unsuccessful attempts to consume the resource
> >        */
> >       unsigned long long failcnt;
> > @@ -55,6 +65,9 @@ struct res_counter {
> >
> >  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
> >
> > +#define CHARGE_WMARK_LOW     0x01
> > +#define CHARGE_WMARK_HIGH    0x02
> > +
> >  /**
> >   * Helpers to interact with userspace
> >   * res_counter_read_u64() - returns the value of the specified member.
> > @@ -92,6 +105,8 @@ enum {
> >       RES_LIMIT,
> >       RES_FAILCNT,
> >       RES_SOFT_LIMIT,
> > +     RES_LOW_WMARK_LIMIT,
> > +     RES_HIGH_WMARK_LIMIT
> >  };
> >
> >  /*
> > @@ -147,6 +162,24 @@ static inline unsigned long long
> res_counter_margin(struct res_counter *cnt)
> >       return margin;
> >  }
> >
> > +static inline bool
> > +res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
> > +{
> > +     if (cnt->usage < cnt->high_wmark_limit)
> > +             return true;
> > +
> > +     return false;
> > +}
> > +
> > +static inline bool
> > +res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
> > +{
> > +     if (cnt->usage < cnt->low_wmark_limit)
> > +             return true;
> > +
> > +     return false;
> > +}
> > +
>
>
> >  /**
> >   * Get the difference between the usage and the soft limit
> >   * @cnt: The counter
> > @@ -169,6 +202,30 @@ res_counter_soft_limit_excess(struct res_counter
> *cnt)
> >       return excess;
> >  }
> >
> > +static inline bool
> > +res_counter_check_under_low_wmark_limit(struct res_counter *cnt)
> > +{
> > +     bool ret;
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&cnt->lock, flags);
> > +     ret = res_counter_low_wmark_limit_check_locked(cnt);
> > +     spin_unlock_irqrestore(&cnt->lock, flags);
> > +     return ret;
> > +}
> > +
> > +static inline bool
> > +res_counter_check_under_high_wmark_limit(struct res_counter *cnt)
> > +{
> > +     bool ret;
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&cnt->lock, flags);
> > +     ret = res_counter_high_wmark_limit_check_locked(cnt);
> > +     spin_unlock_irqrestore(&cnt->lock, flags);
> > +     return ret;
> > +}
> > +
>
> Why are the internal functions named _check_ ? I'd prefer _under_.
>

Changed; it will be in the next post.

>
>
> >  static inline void res_counter_reset_max(struct res_counter *cnt)
> >  {
> >       unsigned long flags;
> > @@ -214,4 +271,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
> >       return 0;
> >  }
> >
> > +static inline int
> > +res_counter_set_high_wmark_limit(struct res_counter *cnt,
> > +                             unsigned long long wmark_limit)
> > +{
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&cnt->lock, flags);
> > +     cnt->high_wmark_limit = wmark_limit;
> > +     spin_unlock_irqrestore(&cnt->lock, flags);
> > +     return 0;
> > +}
> > +
> > +static inline int
> > +res_counter_set_low_wmark_limit(struct res_counter *cnt,
> > +                             unsigned long long wmark_limit)
> > +{
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&cnt->lock, flags);
> > +     cnt->low_wmark_limit = wmark_limit;
> > +     spin_unlock_irqrestore(&cnt->lock, flags);
> > +     return 0;
> > +}
> >  #endif
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 34683ef..206a724 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter,
> struct res_counter *parent)
> >       spin_lock_init(&counter->lock);
> >       counter->limit = RESOURCE_MAX;
> >       counter->soft_limit = RESOURCE_MAX;
> > +     counter->low_wmark_limit = RESOURCE_MAX;
> > +     counter->high_wmark_limit = RESOURCE_MAX;
> >       counter->parent = parent;
> >  }
> >
> > @@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int
> member)
> >               return &counter->failcnt;
> >       case RES_SOFT_LIMIT:
> >               return &counter->soft_limit;
> > +     case RES_LOW_WMARK_LIMIT:
> > +             return &counter->low_wmark_limit;
> > +     case RES_HIGH_WMARK_LIMIT:
> > +             return &counter->high_wmark_limit;
> >       };
> >
> >       BUG();
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 4407dd0..664cdc5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -272,6 +272,8 @@ struct mem_cgroup {
> >        */
> >       struct mem_cgroup_stat_cpu nocpu_base;
> >       spinlock_t pcp_counter_lock;
> > +
> > +     int wmark_ratio;
> >  };
> >
> >  /* Stuffs for move charges at task migration. */
> > @@ -353,6 +355,7 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> >  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >  static void drain_all_stock_async(void);
> > +static unsigned long get_wmark_ratio(struct mem_cgroup *mem);
> >
> >  static struct mem_cgroup_per_zone *
> >  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> > @@ -813,6 +816,27 @@ static inline bool mem_cgroup_is_root(struct
> mem_cgroup *mem)
> >       return (mem == root_mem_cgroup);
> >  }
> >
> > +static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> > +{
> > +     u64 limit;
> > +     unsigned long wmark_ratio;
> > +
> > +     wmark_ratio = get_wmark_ratio(mem);
> > +     limit = mem_cgroup_get_limit(mem);
> > +     if (wmark_ratio == 0) {
> > +             res_counter_set_low_wmark_limit(&mem->res, limit);
> > +             res_counter_set_high_wmark_limit(&mem->res, limit);
> > +     } else {
> > +             unsigned long low_wmark, high_wmark;
> > +             unsigned long long tmp = (wmark_ratio * limit) / 100;
>
> Could you make this ratio per-mille (/1000) ? Percent is too coarse a unit.
> And, considering misc. cases, I don't think having a per-memcg "ratio" is
> good.
>
> How about the following ?
>
>  - provide an automatic wmark without a knob. A 0 wmark is okay, for me.
>  - provide 2 interfaces:
>        memory.low_wmark_distance_in_bytes,   # == hard_limit - low_wmark.
>        memory.high_wmark_distance_in_bytes,  # == hard_limit - high_wmark.
>   (need to add a sanity check into set_limit.)
>
Hmm. Making the wmarks individually tunable makes sense to me. One problem I
do notice is that using the hard_limit as the bar might not work well on an
over-committed system, which means the per-cgroup background reclaim might
not be triggered before global memory pressure kicks in. Ideally, we would
like to do more per-cgroup reclaim before triggering global memory pressure.

How about adding the two APIs but basing the calculation on the following
(rough sketch below):

-- by default, the wmarks are equal to the hard_limit (no background reclaim)
-- provide 2 interfaces:
       memory.low_wmark_distance_in_bytes,   # == min(hard_limit, soft_limit) - low_wmark
       memory.high_wmark_distance_in_bytes,  # == min(hard_limit, soft_limit) - high_wmark
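
Roughly, the wmark setup could then look like this (just an untested sketch;
low_wmark_distance/high_wmark_distance are assumed new per-memcg fields filled
in from the two files above):

    static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
    {
            u64 limit = min(mem_cgroup_get_limit(mem),
                            res_counter_read_u64(&mem->res, RES_SOFT_LIMIT));

            /* a 0 distance keeps the wmark at the limit, i.e. no background reclaim */
            res_counter_set_low_wmark_limit(&mem->res,
                                            limit - mem->low_wmark_distance);
            res_counter_set_high_wmark_limit(&mem->res,
                                             limit - mem->high_wmark_distance);
    }

A real version would also need the sanity check you mentioned, so that the
distances can never exceed min(hard_limit, soft_limit).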


> > +
> > +             low_wmark = tmp;
> > +             high_wmark = tmp - (tmp >> 8);
> > +             res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> > +             res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> > +     }
> > +}
>
> Could you explain what low_wmark/high_wmark mean somewhere ?
>

Will add comments.

>
> In this patch, kswapd runs while
>
>        high_wmark < usage < low_wmark
> ?
>
> Hmm, I like
>        low_wmark < usage < high_wmark.
>
> ;) because it's kswapd.
>
I adopted the same concept as the global kswapd, where the low_wmark triggers
kswapd and the high_wmark stops it. And here, we have

(limit - high_wmark) < free < (limit - low_wmark)
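
For a concrete (made-up) example of what the current code computes: with a 1G
(1073741824 bytes) hard_limit and wmark_ratio = 90,

    tmp        = 90 * 1073741824 / 100 = 966367641
    low_wmark  = tmp                   = 966367641
    high_wmark = tmp - (tmp >> 8)      = 962592768

so kswapd is woken once usage climbs past ~90% of the limit, and it keeps
reclaiming until usage falls below high_wmark, i.e. about tmp >> 8 bytes
(~0.4% of the limit, ~3.6M here) below low_wmark.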

--Ying

>
> > +
> >  /*
> >   * Following LRU functions are allowed to be used without PCG_LOCK.
> >   * Operations are called by routine of global LRU independently from
> memcg.
> > @@ -1195,6 +1219,16 @@ static unsigned int get_swappiness(struct
> mem_cgroup *memcg)
> >       return memcg->swappiness;
> >  }
> >
> > +static unsigned long get_wmark_ratio(struct mem_cgroup *memcg)
> > +{
> > +     struct cgroup *cgrp = memcg->css.cgroup;
> > +
> > +     VM_BUG_ON(!cgrp);
> > +     VM_BUG_ON(!cgrp->parent);
> > +
>
> Does this happen ?
>
> > +     return memcg->wmark_ratio;
> > +}
> > +
> >  static void mem_cgroup_start_move(struct mem_cgroup *mem)
> >  {
> >       int cpu;
> > @@ -3205,6 +3239,7 @@ static int mem_cgroup_resize_limit(struct
> mem_cgroup *memcg,
> >                       else
> >                               memcg->memsw_is_minimum = false;
> >               }
> > +             setup_per_memcg_wmarks(memcg);
> >               mutex_unlock(&set_limit_mutex);
> >
> >               if (!ret)
> > @@ -3264,6 +3299,7 @@ static int mem_cgroup_resize_memsw_limit(struct
> mem_cgroup *memcg,
> >                       else
> >                               memcg->memsw_is_minimum = false;
> >               }
> > +             setup_per_memcg_wmarks(memcg);
> >               mutex_unlock(&set_limit_mutex);
> >
> >               if (!ret)
> > @@ -4521,6 +4557,22 @@ static void __init enable_swap_cgroup(void)
> >  }
> >  #endif
> >
> > +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> > +                             int charge_flags)
> > +{
> > +     long ret = 0;
> > +     int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
> > +
> > +     VM_BUG_ON((charge_flags & flags) == flags);
> > +
> > +     if (charge_flags & CHARGE_WMARK_LOW)
> > +             ret = res_counter_check_under_low_wmark_limit(&mem->res);
> > +     if (charge_flags & CHARGE_WMARK_HIGH)
> > +             ret = res_counter_check_under_high_wmark_limit(&mem->res);
> > +
> > +     return ret;
> > +}
>
> Hmm, do we need this unified function ?
>
> Thanks,
> -Kame
>
>



* Re: [PATCH V3 3/7] New APIs to adjust per-memcg wmarks
  2011-04-13  8:30   ` KAMEZAWA Hiroyuki
@ 2011-04-13 18:46     ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-13 18:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm


On Wed, Apr 13, 2011 at 1:30 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 Apr 2011 00:03:03 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > Add wmark_ratio and reclaim_wmarks APIs per-memcg. The wmark_ratio
> > adjusts the internal low/high wmark calculation and the reclaim_wmarks
> > exports the current value of watermarks. By default, the wmark_ratio is
> > set to 0 and the watermarks are equal to the hard_limit(limit_in_bytes).
> >
> > $ cat /dev/cgroup/A/memory.wmark_ratio
> > 0
> >
> > $ cat /dev/cgroup/A/memory.limit_in_bytes
> > 524288000
> >
> > $ echo 80 >/dev/cgroup/A/memory.wmark_ratio
> >
> > $ cat /dev/cgroup/A/memory.reclaim_wmarks
> > low_wmark 393216000
> > high_wmark 419430400
> >
>
> I think having a _ratio_ will eventually lead us to the same tragedy as
> dirty_ratio, a complicated interface.
>
> For memcg, I'd like to have only _bytes.
>
> And, as I wrote in the previous mail, how about setting a _distance_ ?
>
>   memory.low_wmark_distance_in_bytes .... # hard_limit - low_wmark.
>   memory.high_wmark_distance_in_bytes ... # hard_limit - high_wmark.
>
> Anyway, percent is too coarse a unit.
>

I replied to your comment on "Add per memcg reclaim watermarks". I have no
problem making the wmarks individually tunable. One thing to confirm before
making the change is to have:


   memory.low_wmark_distance_in_bytes .... # min(hard_limit, soft_limit) - low_wmark
   memory.high_wmark_distance_in_bytes ... # min(hard_limit, soft_limit) - high_wmark
>

Also, some checks on soft_limit are needed: if "soft_limit" == 0, use the
hard_limit.

--Ying


> Thanks,
> -Kame
>
>
> > changelog v3..v2:
> > 1. replace the "min_free_kbytes" api with "wmark_ratio". This is part of
> > feedbacks
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  mm/memcontrol.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 49 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 664cdc5..36ae377 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3983,6 +3983,31 @@ static int mem_cgroup_swappiness_write(struct
> cgroup *cgrp, struct cftype *cft,
> >       return 0;
> >  }
> >
> > +static u64 mem_cgroup_wmark_ratio_read(struct cgroup *cgrp, struct
> cftype *cft)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +
> > +     return get_wmark_ratio(memcg);
> > +}
> > +
> > +static int mem_cgroup_wmark_ratio_write(struct cgroup *cgrp, struct
> cftype *cfg,
> > +                                  u64 val)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +     struct mem_cgroup *parent;
> > +
> > +     if (cgrp->parent == NULL)
> > +             return -EINVAL;
> > +
> > +     parent = mem_cgroup_from_cont(cgrp->parent);
> > +
> > +     memcg->wmark_ratio = val;
> > +
> > +     setup_per_memcg_wmarks(memcg);
> > +     return 0;
> > +
> > +}
> > +
> >  static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> >  {
> >       struct mem_cgroup_threshold_ary *t;
> > @@ -4274,6 +4299,21 @@ static void mem_cgroup_oom_unregister_event(struct
> cgroup *cgrp,
> >       mutex_unlock(&memcg_oom_mutex);
> >  }
> >
> > +static int mem_cgroup_wmark_read(struct cgroup *cgrp,
> > +     struct cftype *cft,  struct cgroup_map_cb *cb)
> > +{
> > +     struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> > +     u64 low_wmark, high_wmark;
> > +
> > +     low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
> > +     high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> > +
> > +     cb->fill(cb, "low_wmark", low_wmark);
> > +     cb->fill(cb, "high_wmark", high_wmark);
> > +
> > +     return 0;
> > +}
> > +
> >  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
> >       struct cftype *cft,  struct cgroup_map_cb *cb)
> >  {
> > @@ -4377,6 +4417,15 @@ static struct cftype mem_cgroup_files[] = {
> >               .unregister_event = mem_cgroup_oom_unregister_event,
> >               .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
> >       },
> > +     {
> > +             .name = "wmark_ratio",
> > +             .write_u64 = mem_cgroup_wmark_ratio_write,
> > +             .read_u64 = mem_cgroup_wmark_ratio_read,
> > +     },
> > +     {
> > +             .name = "reclaim_wmarks",
> > +             .read_map = mem_cgroup_wmark_read,
> > +     },
> >  };
> >
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > --
> > 1.7.3.1
> >
>
>



* Re: [PATCH V3 6/7] Enable per-memcg background reclaim.
  2011-04-13  9:05   ` KAMEZAWA Hiroyuki
@ 2011-04-13 21:20     ` Ying Han
  2011-04-14  0:30       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-13 21:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm


On Wed, Apr 13, 2011 at 2:05 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 Apr 2011 00:03:06 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > By default the per-memcg background reclaim is disabled when the
> limit_in_bytes
> > is set the maximum or the wmark_ratio is 0. The kswapd_run() is called
> when the
> > memcg is being resized, and kswapd_stop() is called when the memcg is
> being
> > deleted.
> >
> > The per-memcg kswapd is waked up based on the usage and low_wmark, which
> is
> > checked once per 1024 increments per cpu. The memcg's kswapd is waked up
> if the
> > usage is larger than the low_wmark.
> >
> > changelog v3..v2:
> > 1. some clean-ups
> >
> > changelog v2..v1:
> > 1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
> > 2. remove checking the wmark from per-page charging. now it checks the
> wmark
> > periodically based on the event counter.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> This event logic seems to make sense.
>
> > ---
> >  mm/memcontrol.c |   37 +++++++++++++++++++++++++++++++++++++
> >  1 files changed, 37 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index efeade3..bfa8646 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -105,10 +105,12 @@ enum mem_cgroup_events_index {
> >  enum mem_cgroup_events_target {
> >       MEM_CGROUP_TARGET_THRESH,
> >       MEM_CGROUP_TARGET_SOFTLIMIT,
> > +     MEM_CGROUP_WMARK_EVENTS_THRESH,
> >       MEM_CGROUP_NTARGETS,
> >  };
> >  #define THRESHOLDS_EVENTS_TARGET (128)
> >  #define SOFTLIMIT_EVENTS_TARGET (1024)
> > +#define WMARK_EVENTS_TARGET (1024)
> >
> >  struct mem_cgroup_stat_cpu {
> >       long count[MEM_CGROUP_STAT_NSTATS];
> > @@ -366,6 +368,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
> >  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >  static void drain_all_stock_async(void);
> >  static unsigned long get_wmark_ratio(struct mem_cgroup *mem);
> > +static void wake_memcg_kswapd(struct mem_cgroup *mem);
> >
> >  static struct mem_cgroup_per_zone *
> >  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> > @@ -545,6 +548,12 @@ mem_cgroup_largest_soft_limit_node(struct
> mem_cgroup_tree_per_zone *mctz)
> >       return mz;
> >  }
> >
> > +static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
> > +{
> > +     if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
> > +             wake_memcg_kswapd(mem);
> > +}
> > +
> >  /*
> >   * Implementation Note: reading percpu statistics for memcg.
> >   *
> > @@ -675,6 +684,9 @@ static void __mem_cgroup_target_update(struct
> mem_cgroup *mem, int target)
> >       case MEM_CGROUP_TARGET_SOFTLIMIT:
> >               next = val + SOFTLIMIT_EVENTS_TARGET;
> >               break;
> > +     case MEM_CGROUP_WMARK_EVENTS_THRESH:
> > +             next = val + WMARK_EVENTS_TARGET;
> > +             break;
> >       default:
> >               return;
> >       }
> > @@ -698,6 +710,10 @@ static void memcg_check_events(struct mem_cgroup
> *mem, struct page *page)
> >                       __mem_cgroup_target_update(mem,
> >                               MEM_CGROUP_TARGET_SOFTLIMIT);
> >               }
> > +             if (unlikely(__memcg_event_check(mem,
> > +                     MEM_CGROUP_WMARK_EVENTS_THRESH))){
> > +                     mem_cgroup_check_wmark(mem);
> > +             }
> >       }
> >  }
> >
> > @@ -3384,6 +3400,10 @@ static int mem_cgroup_resize_limit(struct
> mem_cgroup *memcg,
> >       if (!ret && enlarge)
> >               memcg_oom_recover(memcg);
> >
> > +     if (!mem_cgroup_is_root(memcg) && !memcg->kswapd_wait &&
> > +                     memcg->wmark_ratio)
> > +             kswapd_run(0, memcg);
> > +
>
> Isn't it enough to have the trigger in the charge() path ?
>

Why? kswapd_run() creates the kswapd thread for the memcg. If the memcg's
limit never changes from its initial value, we don't want to create a kswapd
thread for it; we only do so when limit_in_bytes is changed. Adding the hook
to the charge path sounds like too much overhead on the hot path.

However, I might need to add a check here for the case where limit_in_bytes
is set to RESOURCE_MAX.
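
Something like this (untested sketch; assuming the new limit value is "val"
in mem_cgroup_resize_limit()):

	/* only start the memcg kswapd for a real limit and a non-zero ratio */
	if (!mem_cgroup_is_root(memcg) && !memcg->kswapd_wait &&
	    memcg->wmark_ratio && val != RESOURCE_MAX)
		kswapd_run(0, memcg);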

>
> Rather than here, I think we should check _move_task(). It changes res usage
> dramatically without updating events.
>

I see that both mem_cgroup_charge_statistics() and memcg_check_events() are
being called in mem_cgroup_move_account(). Am I missing anything here?


Thanks
--Ying



>
> Thanks,
> -Kame
>
>
> >       return ret;
> >  }
> >
> > @@ -4680,6 +4700,7 @@ static void __mem_cgroup_free(struct mem_cgroup
> *mem)
> >  {
> >       int node;
> >
> > +     kswapd_stop(0, mem);
> >       mem_cgroup_remove_from_trees(mem);
> >       free_css_id(&mem_cgroup_subsys, &mem->css);
> >
>
> I think kswapd should stop at mem_cgroup_destroy(). No more tasks will use
> this memcg after _destroy().
>

I made the change.

>
> Thanks,
> -Kame
>
>
>
> > @@ -4786,6 +4807,22 @@ int mem_cgroup_last_scanned_node(struct mem_cgroup
> *mem)
> >       return mem->last_scanned_node;
> >  }
> >
> > +static inline
> > +void wake_memcg_kswapd(struct mem_cgroup *mem)
> > +{
> > +     wait_queue_head_t *wait;
> > +
> > +     if (!mem || !mem->wmark_ratio)
> > +             return;
> > +
> > +     wait = mem->kswapd_wait;
> > +
> > +     if (!wait || !waitqueue_active(wait))
> > +             return;
> > +
> > +     wake_up_interruptible(wait);
> > +}
> > +
> >  static int mem_cgroup_soft_limit_tree_init(void)
> >  {
> >       struct mem_cgroup_tree_per_node *rtpn;
> > --
> > 1.7.3.1
> >
>
>



* Re: [PATCH V3 5/7] Per-memcg background reclaim.
  2011-04-13  8:58   ` KAMEZAWA Hiroyuki
@ 2011-04-13 22:45     ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-13 22:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm


On Wed, Apr 13, 2011 at 1:58 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 Apr 2011 00:03:05 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This is the main loop of per-memcg background reclaim which is
> implemented in
> > function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During
> each
> > iteration it invokes balance_pgdat_node() for all nodes on the system,
> which
> > is another new function performs background reclaim per node. A fairness
> > mechanism is implemented to remember the last node it was reclaiming from
> and
> > always start at the next one. After reclaiming each node, it checks
> > mem_cgroup_watermark_ok() and breaks the priority loop if it returns
> true. The
> > per-memcg zone will be marked as "unreclaimable" if the scanning rate is
> much
> > greater than the reclaiming rate on the per-memcg LRU. The bit is cleared
> when
> > there is a page charged to the memcg being freed. Kswapd breaks the
> priority
> > loop if all the zones are marked as "unreclaimable".
> >
>
> Hmm, bigger than expected. I'd be glad if you could divide this into smaller
> pieces; see below.
>
>
> > changelog v3..v2:
> > 1. change mz->all_unreclaimable to be boolean.
> > 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg
> reclaim.
> > 3. some more clean-up.
> >
> > changelog v2..v1:
> > 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> > 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> > reclaim.
> > 3. name the per-memcg memcg as "memcg-id" (css->id). And the global
> kswapd
> > keeps the same name.
> > 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be
> accessed
> > after freeing.
> > 5. add the fairness in zonelist where memcg remember the last zone
> reclaimed
> > from.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/memcontrol.h |   33 +++++++
> >  include/linux/swap.h       |    2 +
> >  mm/memcontrol.c            |  136 +++++++++++++++++++++++++++++
> >  mm/vmscan.c                |  208
> ++++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 379 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index f7ffd1f..a8159f5 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup
> *mem,
> >                                 struct kswapd *kswapd_p);
> >  extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
> >  extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup
> *mem);
> > +extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
> > +extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> > +                                     const nodemask_t *nodes);
> >
> >  static inline
> >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> > @@ -152,6 +155,12 @@ static inline void mem_cgroup_dec_page_stat(struct
> page *page,
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int
> order,
> >                                               gfp_t gfp_mask);
> >  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> > +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page
> *page);
> > +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int
> zid);
> > +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone
> *zone);
> > +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone
> *zone);
> > +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone*
> zone,
> > +                             unsigned long nr_scanned);
> >
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> > @@ -342,6 +351,25 @@ static inline void mem_cgroup_dec_page_stat(struct
> page *page,
> >  {
> >  }
> >
> > +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
> > +                                             struct zone *zone,
> > +                                             unsigned long nr_scanned)
> > +{
> > +}
> > +
> > +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
> > +                                                     struct zone *zone)
> > +{
> > +}
> > +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup
> *mem,
> > +             struct zone *zone)
> > +{
> > +}
> > +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
> > +                                             struct zone *zone)
> > +{
> > +}
> > +
> >  static inline
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int
> order,
> >                                           gfp_t gfp_mask)
> > @@ -360,6 +388,11 @@ static inline void
> mem_cgroup_split_huge_fixup(struct page *head,
> >  {
> >  }
> >
> > +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem,
> int nid,
> > +                                                             int zid)
> > +{
> > +     return false;
> > +}
> >  #endif /* CONFIG_CGROUP_MEM_CONT */
> >
> >  #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 17e0511..319b800 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -160,6 +160,8 @@ enum {
> >       SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
> >  };
> >
> > +#define ZONE_RECLAIMABLE_RATE 6
> > +
> >  #define SWAP_CLUSTER_MAX 32
> >  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index acd84a8..efeade3 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
> >       bool                    on_tree;
> >       struct mem_cgroup       *mem;           /* Back pointer, we cannot
> */
> >                                               /* use container_of
>  */
> > +     unsigned long           pages_scanned;  /* since last reclaim */
> > +     bool                    all_unreclaimable;      /* All pages pinned
> */
> >  };
> > +
> >  /* Macro for accessing counter */
> >  #define MEM_CGROUP_ZSTAT(mz, idx)    ((mz)->count[(idx)])
> >
> > @@ -275,6 +278,11 @@ struct mem_cgroup {
> >
> >       int wmark_ratio;
> >
> > +     /* While doing per cgroup background reclaim, we cache the
> > +      * last node we reclaimed from
> > +      */
> > +     int last_scanned_node;
> > +
> >       wait_queue_head_t *kswapd_wait;
> >  };
> >
> > @@ -1129,6 +1137,96 @@ mem_cgroup_get_reclaim_stat_from_page(struct page
> *page)
> >       return &mz->reclaim_stat;
> >  }
> >
> > +static unsigned long mem_cgroup_zone_reclaimable_pages(
> > +                                     struct mem_cgroup_per_zone *mz)
> > +{
> > +     int nr;
> > +     nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
> > +             MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
> > +
> > +     if (nr_swap_pages > 0)
> > +             nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
> > +                     MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
> > +
> > +     return nr;
> > +}
> > +
> > +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone*
> zone,
> > +                                             unsigned long nr_scanned)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +
> > +     if (!mem)
> > +             return;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz)
> > +             mz->pages_scanned += nr_scanned;
> > +}
> > +
> > +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int
> zid)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +
> > +     if (!mem)
> > +             return 0;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz)
> > +             return mz->pages_scanned <
> > +                             mem_cgroup_zone_reclaimable_pages(mz) *
> > +                             ZONE_RECLAIMABLE_RATE;
> > +     return 0;
> > +}
> > +
> > +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone
> *zone)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +
> > +     if (!mem)
> > +             return false;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz)
> > +             return mz->all_unreclaimable;
> > +
> > +     return false;
> > +}
> > +
> > +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone
> *zone)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +     int nid = zone_to_nid(zone);
> > +     int zid = zone_idx(zone);
> > +
> > +     if (!mem)
> > +             return;
> > +
> > +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > +     if (mz)
> > +             mz->all_unreclaimable = true;
> > +}
> > +
> > +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page
> *page)
> > +{
> > +     struct mem_cgroup_per_zone *mz = NULL;
> > +
> > +     if (!mem)
> > +             return;
> > +
> > +     mz = page_cgroup_zoneinfo(mem, page);
> > +     if (mz) {
> > +             mz->pages_scanned = 0;
> > +             mz->all_unreclaimable = false;
> > +     }
> > +
> > +     return;
> > +}
> > +
> >  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> >                                       struct list_head *dst,
> >                                       unsigned long *scanned, int order,
> > @@ -1545,6 +1643,32 @@ static int mem_cgroup_hierarchical_reclaim(struct
> mem_cgroup *root_mem,
> >  }
> >
> >  /*
> > + * Visit the first node after the last_scanned_node of @mem and use that
> to
> > + * reclaim free pages from.
> > + */
> > +int
> > +mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t
> *nodes)
> > +{
> > +     int next_nid;
> > +     int last_scanned;
> > +
> > +     last_scanned = mem->last_scanned_node;
> > +
> > +     /* Initial stage and start from node0 */
> > +     if (last_scanned == -1)
> > +             next_nid = 0;
> > +     else
> > +             next_nid = next_node(last_scanned, *nodes);
> > +
> > +     if (next_nid == MAX_NUMNODES)
> > +             next_nid = first_node(*nodes);
> > +
> > +     mem->last_scanned_node = next_nid;
> > +
> > +     return next_nid;
> > +}
> > +
> > +/*
> >   * Check OOM-Killer is already running under our hierarchy.
> >   * If someone is running, return false.
> >   */
> > @@ -2779,6 +2903,7 @@ __mem_cgroup_uncharge_common(struct page *page,
> enum charge_type ctype)
> >        * special functions.
> >        */
> >
> > +     mem_cgroup_clear_unreclaimable(mem, page);
>
> Hmm, do we need to do this on every uncharge ?
>
> I doubt we really need mz->all_unreclaimable ....
>
> Anyway, I'd like to see this all_unreclaimable logic in an independent patch,
> because the direct-reclaim path should see this, too.
>
> So, could you divide this into the following pieces:
>
> 1. record last node .... I wonder whether this logic should be used in the
> direct-reclaim path, too.
>
> 2. all_unreclaimable .... direct reclaim will be affected, too.
>
> 3. scanning core.
>

OK, I will make the change for the next post.

>
>
>
> >       unlock_page_cgroup(pc);
> >       /*
> >        * even after unlock, we have mem->res.usage here and this memcg
> > @@ -4501,6 +4626,8 @@ static int alloc_mem_cgroup_per_zone_info(struct
> mem_cgroup *mem, int node)
> >               mz->usage_in_excess = 0;
> >               mz->on_tree = false;
> >               mz->mem = mem;
> > +             mz->pages_scanned = 0;
> > +             mz->all_unreclaimable = false;
> >       }
> >       return 0;
> >  }
> > @@ -4651,6 +4778,14 @@ wait_queue_head_t *mem_cgroup_kswapd_wait(struct
> mem_cgroup *mem)
> >       return mem->kswapd_wait;
> >  }
> >
> > +int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
> > +{
> > +     if (!mem)
> > +             return -1;
> > +
> > +     return mem->last_scanned_node;
> > +}
> > +
> >  static int mem_cgroup_soft_limit_tree_init(void)
> >  {
> >       struct mem_cgroup_tree_per_node *rtpn;
> > @@ -4726,6 +4861,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> cgroup *cont)
> >               res_counter_init(&mem->memsw, NULL);
> >       }
> >       mem->last_scanned_child = 0;
> > +     mem->last_scanned_node = -1;
> >       INIT_LIST_HEAD(&mem->oom_notify);
> >
> >       if (parent)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a1a1211..6571eb8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -47,6 +47,8 @@
> >
> >  #include <linux/swapops.h>
> >
> > +#include <linux/res_counter.h>
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -111,6 +113,8 @@ struct scan_control {
> >        * are scanned.
> >        */
> >       nodemask_t      *nodemask;
> > +
> > +     int priority;
> >  };
> >
> >  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> > @@ -1410,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
> >                                       ISOLATE_BOTH : ISOLATE_INACTIVE,
> >                       zone, sc->mem_cgroup,
> >                       0, file);
> > +
> > +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone,
> nr_scanned);
> > +
> >               /*
> >                * mem_cgroup_isolate_pages() keeps track of
> >                * scanned pages on its own.
> > @@ -1529,6 +1536,7 @@ static void shrink_active_list(unsigned long
> nr_pages, struct zone *zone,
> >                * mem_cgroup_isolate_pages() keeps track of
> >                * scanned pages on its own.
> >                */
> > +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone,
> pgscanned);
> >       }
> >
> >       reclaim_stat->recent_scanned[file] += nr_taken;
> > @@ -2632,11 +2640,211 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >       finish_wait(wait_h, &wait);
> >  }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * The function is used for per-memcg LRU. It scans all the zones of the
> > + * node and returns the nr_scanned and nr_reclaimed.
> > + */
> > +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> > +                                     struct scan_control *sc)
> > +{
> > +     int i, end_zone;
> > +     unsigned long total_scanned;
> > +     struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > +     int priority = sc->priority;
> > +     int nid = pgdat->node_id;
> > +
> > +     /*
> > +      * Scan in the highmem->dma direction for the highest
> > +      * zone which needs scanning
> > +      */
> > +     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +             struct zone *zone = pgdat->node_zones + i;
> > +
> > +             if (!populated_zone(zone))
> > +                     continue;
> > +
> > +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> > +                             priority != DEF_PRIORITY)
> > +                     continue;
> > +             /*
> > +              * Do some background aging of the anon list, to give
> > +              * pages a chance to be referenced before reclaiming.
> > +              */
> > +             if (inactive_anon_is_low(zone, sc))
> > +                     shrink_active_list(SWAP_CLUSTER_MAX, zone,
> > +                                                     sc, priority, 0);
> > +
> > +             end_zone = i;
> > +             goto scan;
> > +     }
>
> I don't want to see zone balancing logic in memcg.
> It should be the job of the global LRU.
>
> IOW, even if we remove the global LRU eventually, we should
> implement zone balancing logic in the _global_ (per node) kswapd.
> (kswapd can pass a zone mask to each memcg.)
>
> If you want some clever logic special to memcg, I think it should be
> 'which node should be the victim' detection logic rather than round-robin.
> (But yes, starting from easy round-robin makes sense.)
>
> So, could you add a simpler one ?
>
>  do {
>    select victim node
>    do reclaim
>  } while (need_stop)
>
> zone balancing should be done outside of memcg.
>
> what we really need to improve is 'select victim node'.
>

I will separate out the logic in the next post, so it will be easier to
optimize each piece individually.
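
As a rough illustration of the loop structure suggested above (a sketch only:
reclaim_from_node() is a placeholder for the scanning core, not code from this
series; mem_cgroup_select_victim_node() and mem_cgroup_watermark_ok() are the
helpers already posted):

	static void memcg_kswapd_loop(struct mem_cgroup *mem)
	{
		nodemask_t nodes = node_states[N_ONLINE];
		int nid;

		do {
			/* easy round-robin victim selection for now */
			nid = mem_cgroup_select_victim_node(mem, &nodes);
			/* placeholder for the per-node scanning core */
			reclaim_from_node(mem, nid);
		} while (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH));
	}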

--Ying

>
> Thanks,
> -Kame
>
>
> > +     return;
> > +
> > +scan:
> > +     total_scanned = 0;
> > +     /*
> > +      * Now scan the zone in the dma->highmem direction, stopping
> > +      * at the last zone which needs scanning.
> > +      *
> > +      * We do this because the page allocator works in the opposite
> > +      * direction.  This prevents the page allocator from allocating
> > +      * pages behind kswapd's direction of progress, which would
> > +      * cause too much scanning of the lower zones.
> > +      */
> > +     for (i = 0; i <= end_zone; i++) {
> > +             struct zone *zone = pgdat->node_zones + i;
> > +
> > +             if (!populated_zone(zone))
> > +                     continue;
> > +
> > +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> > +                     priority != DEF_PRIORITY)
> > +                     continue;
> > +
> > +             sc->nr_scanned = 0;
> > +             shrink_zone(priority, zone, sc);
> > +             total_scanned += sc->nr_scanned;
> > +
> > +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
> > +                     continue;
> > +
> > +             if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
> > +                     mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
> > +
> > +             /*
> > +              * If we've done a decent amount of scanning and
> > +              * the reclaim ratio is low, start doing writepage
> > +              * even in laptop mode
> > +              */
> > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> 2) {
> > +                     sc->may_writepage = 1;
> > +             }
> > +     }
> > +
> > +     sc->nr_scanned = total_scanned;
> > +     return;
> > +}
> > +
> > +/*
> > + * Per cgroup background reclaim.
> > + * TODO: Take off the order since memcg always do order 0
> > + */
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> > +                                           int order)
> > +{
> > +     int i, nid;
> > +     int start_node;
> > +     int priority;
> > +     bool wmark_ok;
> > +     int loop;
> > +     pg_data_t *pgdat;
> > +     nodemask_t do_nodes;
> > +     unsigned long total_scanned;
> > +     struct scan_control sc = {
> > +             .gfp_mask = GFP_KERNEL,
> > +             .may_unmap = 1,
> > +             .may_swap = 1,
> > +             .nr_to_reclaim = ULONG_MAX,
> > +             .swappiness = vm_swappiness,
> > +             .order = order,
> > +             .mem_cgroup = mem_cont,
> > +     };
> > +
> > +loop_again:
> > +     do_nodes = NODE_MASK_NONE;
> > +     sc.may_writepage = !laptop_mode;
> > +     sc.nr_reclaimed = 0;
> > +     total_scanned = 0;
> > +
> > +     for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > +             sc.priority = priority;
> > +             wmark_ok = false;
> > +             loop = 0;
> > +
> > +             /* The swap token gets in the way of swapout... */
> > +             if (!priority)
> > +                     disable_swap_token();
> > +
> > +             if (priority == DEF_PRIORITY)
> > +                     do_nodes = node_states[N_ONLINE];
> > +
> > +             while (1) {
> > +                     nid = mem_cgroup_select_victim_node(mem_cont,
> > +                                                     &do_nodes);
> > +
> > +                     /* Indicate we have cycled the nodelist once
> > +                      * TODO: we might add MAX_RECLAIM_LOOP for
> preventing
> > +                      * kswapd burning cpu cycles.
> > +                      */
> > +                     if (loop == 0) {
> > +                             start_node = nid;
> > +                             loop++;
> > +                     } else if (nid == start_node)
> > +                             break;
> > +
> > +                     pgdat = NODE_DATA(nid);
> > +                     balance_pgdat_node(pgdat, order, &sc);
> > +                     total_scanned += sc.nr_scanned;
> > +
> > +                     /* Set the node which has at least
> > +                      * one reclaimable zone
> > +                      */
> > +                     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +                             struct zone *zone = pgdat->node_zones + i;
> > +
> > +                             if (!populated_zone(zone))
> > +                                     continue;
> > +
> > +                             if (!mem_cgroup_mz_unreclaimable(mem_cont,
> > +                                                             zone))
> > +                                     break;
> > +                     }
> > +                     if (i < 0)
> > +                             node_clear(nid, do_nodes);
> > +
> > +                     if (mem_cgroup_watermark_ok(mem_cont,
> > +                                                     CHARGE_WMARK_HIGH))
> {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +
> > +                     if (nodes_empty(do_nodes)) {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +             }
> > +
> > +             /* All the nodes are unreclaimable, kswapd is done */
> > +             if (nodes_empty(do_nodes)) {
> > +                     wmark_ok = true;
> > +                     goto out;
> > +             }
> > +
> > +             if (total_scanned && priority < DEF_PRIORITY - 2)
> > +                     congestion_wait(WRITE, HZ/10);
> > +
> > +             if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > +                     break;
> > +     }
> > +out:
> > +     if (!wmark_ok) {
> > +             cond_resched();
> > +
> > +             try_to_freeze();
> > +
> > +             goto loop_again;
> > +     }
> > +
> > +     return sc.nr_reclaimed;
> > +}
> > +#else
> >  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> >                                                       int order)
> >  {
> >       return 0;
> >  }
> > +#endif
> >
> >  /*
> >   * The background pageout daemon, started as a kernel thread
> > --
> > 1.7.3.1
> >
> >
>
>



* Re: [PATCH V3 0/7] memcg: per cgroup background reclaim
  2011-04-13 17:53   ` Ying Han
@ 2011-04-14  0:14     ` KAMEZAWA Hiroyuki
  2011-04-14 17:38       ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-14  0:14 UTC (permalink / raw)
  To: Ying Han
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

On Wed, 13 Apr 2011 10:53:19 -0700
Ying Han <yinghan@google.com> wrote:

> On Wed, Apr 13, 2011 at 12:47 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 13 Apr 2011 00:03:00 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > The current implementation of memcg supports targeting reclaim when the
> > > cgroup is reaching its hard_limit and we do direct reclaim per cgroup.
> > > Per cgroup background reclaim is needed which helps to spread out memory
> > > pressure over longer period of time and smoothes out the cgroup
> > performance.
> > >
> > > If the cgroup is configured to use per cgroup background reclaim, a
> > kswapd
> > > thread is created which only scans the per-memcg LRU list. Two watermarks
> > > ("high_wmark", "low_wmark") are added to trigger the background reclaim
> > and
> > > stop it. The watermarks are calculated based on the cgroup's
> > limit_in_bytes.
> > >
> > > I run through dd test on large file and then cat the file. Then I
> > compared
> > > the reclaim related stats in memory.stat.
> > >
> > > Step1: Create a cgroup with 500M memory_limit.
> > > $ mkdir /dev/cgroup/memory/A
> > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > >
> > > Step2: Test and set the wmarks.
> > > $ cat /dev/cgroup/memory/A/memory.wmark_ratio
> > > 0
> > >
> > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > low_wmark 524288000
> > > high_wmark 524288000
> > >
> > > $ echo 90 >/dev/cgroup/memory/A/memory.wmark_ratio
> > >
> > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > low_wmark 471859200
> > > high_wmark 470016000
> > >
> > > $ ps -ef | grep memcg
> > > root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
> > > root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg
> > >
> > > Step3: Dirty the pages by creating a 20g file on hard drive.
> > > $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
> > >
> > > Here are the memory.stat with vs without the per-memcg reclaim. It used
> > to be
> > > all the pages are reclaimed from direct reclaim, and now some of the
> > pages are
> > > also being reclaimed at background.
> > >
> > > Only direct reclaim                       With background reclaim:
> > >
> > > pgpgin 5248668                            pgpgin 5248347
> > > pgpgout 5120678                           pgpgout 5133505
> > > kswapd_steal 0                            kswapd_steal 1476614
> > > pg_pgsteal 5120578                        pg_pgsteal 3656868
> > > kswapd_pgscan 0                           kswapd_pgscan 3137098
> > > pg_scan 10861956                          pg_scan 6848006
> > > pgrefill 271174                           pgrefill 290441
> > > pgoutrun 0                                pgoutrun 18047
> > > allocstall 131689                         allocstall 100179
> > >
> > > real    7m42.702s                         real 7m42.323s
> > > user    0m0.763s                          user 0m0.748s
> > > sys     0m58.785s                         sys  0m52.123s
> > >
> > > throughput is 44.33 MB/sec                throughput is 44.23 MB/sec
> > >
> > > Step 4: Cleanup
> > > $ echo $$ >/dev/cgroup/memory/tasks
> > > $ echo 1 > /dev/cgroup/memory/A/memory.force_empty
> > > $ rmdir /dev/cgroup/memory/A
> > > $ echo 3 >/proc/sys/vm/drop_caches
> > >
> > > Step 5: Create the same cgroup and read the 20g file into pagecache.
> > > $ cat /export/hdc3/dd/tf0 > /dev/zero
> > >
> > > All the pages are reclaimed from background instead of direct reclaim
> > with
> > > the per cgroup reclaim.
> > >
> > > Only direct reclaim                       With background reclaim:
> > > pgpgin 5248668                            pgpgin 5248114
> > > pgpgout 5120678                           pgpgout 5133480
> > > kswapd_steal 0                            kswapd_steal 5133397
> > > pg_pgsteal 5120578                        pg_pgsteal 0
> > > kswapd_pgscan 0                           kswapd_pgscan 5133400
> > > pg_scan 10861956                          pg_scan 0
> > > pgrefill 271174                           pgrefill 0
> > > pgoutrun 0                                pgoutrun 40535
> > > allocstall 131689                         allocstall 0
> > >
> > > real    7m42.702s                         real 6m20.439s
> > > user    0m0.763s                          user 0m0.169s
> > > sys     0m58.785s                         sys  0m26.574s
> > >
> > > Note:
> > > This is the first effort of enhancing the targeted reclaim in memcg. Here
> > > are the existing known issues and our plan:
> > >
> > > 1. there is one kswapd thread per cgroup. the thread is created when the
> > > cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> > > removed. In some environments where thousands of cgroups are configured on
> > > a single host, we will have thousands of kswapd threads. The memory
> > > consumption would be 8k*1000 = 8M. We don't see a big issue for now if the
> > > host can host that many cgroups.
> > >
> >
> > What's bad with using workqueue ?
> >
> > Pros.
> >  - we don't have to keep our own thread pool.
> >  - we don't have to see 'ps -elf' filled with kswapds...
> > Cons.
> >  - because threads are shared, we can't put the kthreads into a cpu cgroup.
> >
> 
> I did some study on workqueues after posting V2. There was a comment suggesting
> a workqueue instead of per-memcg kswapd threads, since it would cut the number
> of kernel threads created on a host with lots of cgroups. Each kernel thread
> allocates about 8K of stack, so about 8M in total with a thousand cgroups.
> 
> The current workqueue model, merged in the 2.6.36 kernel, is called "concurrency
> managed workqueue" (cmwq), and is intended to provide flexible concurrency
> without wasting resources. I studied it a bit and here is what I found:
> 
> 1. The workqueue is complicated and we need to be very careful with the work
> items in it. We have seen cases where one work item gets stuck and the rest of
> the work items cannot proceed. For example, in dirty page writeback one
> heavy-writer cgroup could starve the other cgroups from flushing dirty
> pages to the same disk. In the kswapd case, I can imagine we might have a
> similar scenario.
> 
> 2. How to prioritize the work items is another problem. The order in which
> work items are added to the queue dictates the order in which cgroups get
> reclaimed. We don't have that restriction currently; we rely on the cpu
> scheduler to put each kswapd on the right cpu core. We "might" introduce
> reclaim priorities later, and it is not clear how we would deal with that.
> 
> 3. Based on what I have observed, not many callers have migrated to cmwq yet,
> and I don't have much data on how well it works.
> 
> 
> > Regardless of workqueue, can't we have a moderate number of threads ?
> >
> > I really don't like to have too many threads, and I think
> > one-thread-per-memcg
> > is big enough to cause lock contention problems.
> >
> 
> Back to the current model: on a machine with thousands of cgroups, it
> will take about 8M in total for a thousand kswapd threads (8K of stack for
> each thread). We are running systems with fakenuma, where each numa node has
> its own kswapd, and so far we haven't noticed issues caused by "lots of"
> kswapd threads. Also, there shouldn't be any performance overhead for a
> kernel thread while it is not running.
> 
> Given the complexity of the workqueue and the benefit it provides, I would
> like to stick with the current model first. After we get the basic stuff in
> and the other targeted reclaim improvements, we can come back to this. What
> do you think?
> 

Okay, fair enough. kthread_run() will win.

Then, I have another request. I'd like to be able to put kswapd-for-memcg into
some cpu cgroup to limit its cpu usage.

- Could you show the thread ID somewhere, and
  confirm we can put it into some cpu cgroup ?
  (Creating an automatic cpu cgroup for memcg kswapd is one option, I think.)

  BTW, when kthread_run() creates a kthread, which cgroup will it be under ?
  If it ends up under the cgroup of the caller of kthread_run(), the per-memcg
  kswapd will implicitly go into the cgroup where the user sets the high/low
  wmark. I don't think this is very bad, but it's better to mention the
  behavior somewhere because memcg tends to be used together with the cpu
  cgroup. Could you check and add some doc ?
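
  For illustration only: the per-memcg kswapd already shows up in ps as
  [memcg_<css_id>], so once the thread ID is visible something like the
  following should be enough (the /dev/cgroup/cpu mount point and the
  memcg_kswapd group below are my assumptions, not part of this series):

  $ ps -ef | grep memcg_
  $ mkdir /dev/cgroup/cpu/memcg_kswapd
  $ echo <pid of the memcg_N thread> > /dev/cgroup/cpu/memcg_kswapd/tasks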

And
- Could you drop PF_MEMALLOC ? (for now.) (in patch 4)
- Could you check PF_KSWAPD doesn't do anything bad ?



Thanks,
-Kame






* Re: [PATCH V3 2/7] Add per memcg reclaim watermarks
  2011-04-13 18:40     ` Ying Han
@ 2011-04-14  0:27       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-14  0:27 UTC (permalink / raw)
  To: Ying Han
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

On Wed, 13 Apr 2011 11:40:26 -0700
Ying Han <yinghan@google.com> wrote:

> On Wed, Apr 13, 2011 at 1:25 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 13 Apr 2011 00:03:02 -0700
> > Ying Han <yinghan@google.com> wrote:

> > > +static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> > > +{
> > > +     u64 limit;
> > > +     unsigned long wmark_ratio;
> > > +
> > > +     wmark_ratio = get_wmark_ratio(mem);
> > > +     limit = mem_cgroup_get_limit(mem);
> > > +     if (wmark_ratio == 0) {
> > > +             res_counter_set_low_wmark_limit(&mem->res, limit);
> > > +             res_counter_set_high_wmark_limit(&mem->res, limit);
> > > +     } else {
> > > +             unsigned long low_wmark, high_wmark;
> > > +             unsigned long long tmp = (wmark_ratio * limit) / 100;
> >
> > could you make this ratio /1000 ? A percent granularity is too big.
> > And, considering misc. cases, I don't think having a per-memcg "ratio" is
> > good.
> >
> > How about following ?
> >
> >  - provides an automatic wmark without knob. 0 wmark is okay, for me.
> >  - provides 2 interfaces:
> >        memory.low_wmark_distance_in_bytes,  # == hard_limit - low_wmark.
> >        memory.high_wmark_in_bytes,          # == hard_limit - high_wmark.
> >   (need to add sanity check into set_limit.)
> >
> Hmm. Making the wmarks tunable individually makes sense to me. One problem I
> do notice is that using the hard_limit as the bar might not work well on an
> over-committing system, which means the per-cgroup background reclaim might
> not be triggered before global memory pressure. Ideally, we would like to do
> more per-cgroup reclaim before triggering global memory pressure.
> 
hmm.

> How about adding the two APIs but making the calculation based on:
> 
> -- by default, the wmarks are equal to hard_limit. (no background reclaim)

ok.

> -- provides 2 interfaces:
>        memory.low_wmark_distance_in_bytes,  # == min(hard_limit, soft_limit) - low_wmark.
>        memory.high_wmark_in_bytes,          # == min(hard_limit, soft_limit) - high_wmark.
> 

Hmm, with that interface, softlimit=0 (or some low value) will disable background
reclaim. (IOW, all memory will be reclaimed.)

IMHO, we don't need to take care of softlimit vs. high/low wmark; that's
userland's job. And we cannot know when global reclaim runs via the memcg's
memory usage.... because of nodes and zones. I think the low/high wmark should
work against the hard_limit.


> >
> > In this patch, kswapd runs while
> >
> >        high_wmark < usage < low_wmark
> > ?
> >
> > Hmm, I like
> >        low_wmark < usage < high_wmark.
> >
> > ;) because it's kswapd.
> >
> I adopt the same concept as global kswapd, where the low_wmark triggers
> kswapd and the high_wmark stops it. And here, we have
> 
> (limit - high_wmark) < free < (limit - low_wmark)
> 

Hm, ok. Please add a comment somewhere.
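
As a sanity check with the numbers posted in the test above: with
limit_in_bytes = 524288000 and wmark_ratio = 90, the patch computes
low_wmark = 524288000 * 90 / 100 = 471859200 and
high_wmark = 471859200 - (471859200 >> 8) = 470016000, which matches the
memory.reclaim_wmarks output. So kswapd is woken once usage rises above
low_wmark (free memory below 524288000 - 471859200 = 52428800 bytes) and keeps
reclaiming until usage drops below high_wmark (free memory above
524288000 - 470016000 = 54272000 bytes).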

Thanks,
-Kame









* Re: [PATCH V3 6/7] Enable per-memcg background reclaim.
  2011-04-13 21:20     ` Ying Han
@ 2011-04-14  0:30       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-14  0:30 UTC (permalink / raw)
  To: Ying Han
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

On Wed, 13 Apr 2011 14:20:22 -0700
Ying Han <yinghan@google.com> wrote:

> On Wed, Apr 13, 2011 at 2:05 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 13 Apr 2011 00:03:06 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > By default the per-memcg background reclaim is disabled when the
> > limit_in_bytes
> > > is set the maximum or the wmark_ratio is 0. The kswapd_run() is called
> > when the
> > > memcg is being resized, and kswapd_stop() is called when the memcg is
> > being
> > > deleted.
> > >
> > > The per-memcg kswapd is waked up based on the usage and low_wmark, which
> > is
> > > checked once per 1024 increments per cpu. The memcg's kswapd is waked up
> > if the
> > > usage is larger than the low_wmark.
> > >
> > > changelog v3..v2:
> > > 1. some clean-ups
> > >
> > > changelog v2..v1:
> > > 1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
> > > 2. remove checking the wmark from per-page charging. now it checks the
> > wmark
> > > periodically based on the event counter.
> > >
> > > Signed-off-by: Ying Han <yinghan@google.com>
> >
> > This event logic seems to make sense.
> >
> > > ---
> > >  mm/memcontrol.c |   37 +++++++++++++++++++++++++++++++++++++
> > >  1 files changed, 37 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index efeade3..bfa8646 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -105,10 +105,12 @@ enum mem_cgroup_events_index {
> > >  enum mem_cgroup_events_target {
> > >       MEM_CGROUP_TARGET_THRESH,
> > >       MEM_CGROUP_TARGET_SOFTLIMIT,
> > > +     MEM_CGROUP_WMARK_EVENTS_THRESH,
> > >       MEM_CGROUP_NTARGETS,
> > >  };
> > >  #define THRESHOLDS_EVENTS_TARGET (128)
> > >  #define SOFTLIMIT_EVENTS_TARGET (1024)
> > > +#define WMARK_EVENTS_TARGET (1024)
> > >
> > >  struct mem_cgroup_stat_cpu {
> > >       long count[MEM_CGROUP_STAT_NSTATS];
> > > @@ -366,6 +368,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
> > >  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> > >  static void drain_all_stock_async(void);
> > >  static unsigned long get_wmark_ratio(struct mem_cgroup *mem);
> > > +static void wake_memcg_kswapd(struct mem_cgroup *mem);
> > >
> > >  static struct mem_cgroup_per_zone *
> > >  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> > > @@ -545,6 +548,12 @@ mem_cgroup_largest_soft_limit_node(struct
> > mem_cgroup_tree_per_zone *mctz)
> > >       return mz;
> > >  }
> > >
> > > +static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
> > > +{
> > > +     if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
> > > +             wake_memcg_kswapd(mem);
> > > +}
> > > +
> > >  /*
> > >   * Implementation Note: reading percpu statistics for memcg.
> > >   *
> > > @@ -675,6 +684,9 @@ static void __mem_cgroup_target_update(struct
> > mem_cgroup *mem, int target)
> > >       case MEM_CGROUP_TARGET_SOFTLIMIT:
> > >               next = val + SOFTLIMIT_EVENTS_TARGET;
> > >               break;
> > > +     case MEM_CGROUP_WMARK_EVENTS_THRESH:
> > > +             next = val + WMARK_EVENTS_TARGET;
> > > +             break;
> > >       default:
> > >               return;
> > >       }
> > > @@ -698,6 +710,10 @@ static void memcg_check_events(struct mem_cgroup
> > *mem, struct page *page)
> > >                       __mem_cgroup_target_update(mem,
> > >                               MEM_CGROUP_TARGET_SOFTLIMIT);
> > >               }
> > > +             if (unlikely(__memcg_event_check(mem,
> > > +                     MEM_CGROUP_WMARK_EVENTS_THRESH))){
> > > +                     mem_cgroup_check_wmark(mem);
> > > +             }
> > >       }
> > >  }
> > >
> > > @@ -3384,6 +3400,10 @@ static int mem_cgroup_resize_limit(struct
> > mem_cgroup *memcg,
> > >       if (!ret && enlarge)
> > >               memcg_oom_recover(memcg);
> > >
> > > +     if (!mem_cgroup_is_root(memcg) && !memcg->kswapd_wait &&
> > > +                     memcg->wmark_ratio)
> > > +             kswapd_run(0, memcg);
> > > +
> >
> > Isn't it enough to have the trigger in the charge() path ?
> >
> 
> Why? kswapd_run() creates the kswapd thread for the memcg. If the
> memcg's limit never changes from the initial value, we don't want to create
> a kswapd thread for it; we only want one once limit_in_bytes is changed. Adding
> the hook in the charge path sounds like too much overhead for the hotpath.
> 

Ah, sorry. I misunderstood.


> However, I might need to add a check here for the case where limit_in_bytes
> is set to RESOURCE_MAX.
> 
> >
> > rather than here, I think we should check _move_task(). It changes res
> > usage
> > dramatically without updating events.
> >
> 
> I see that both mem_cgroup_charge_statistics() and memcg_check_events() are
> being called in mem_cgroup_move_account(). Am I missing anything here?
> 

My fault.

Thanks,
-Kame





* Re: [PATCH V3 4/7] Infrastructure to support per-memcg reclaim.
  2011-04-13  7:03 ` [PATCH V3 4/7] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-14  3:57   ` Zhu Yanhai
  2011-04-14  6:32     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Zhu Yanhai @ 2011-04-14  3:57 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen, linux-mm

Hi Ying,

2011/4/13 Ying Han <yinghan@google.com>:
> -extern int kswapd_run(int nid);
> -extern void kswapd_stop(int nid);
> +extern int kswapd_run(int nid, struct mem_cgroup *mem);
> +extern void kswapd_stop(int nid, struct mem_cgroup *mem);

This breaks online_pages() and offline_pages(), which are also
callers of kswapd_run() and kswapd_stop().

Thanks,
Zhu Yanhai

>
>  #ifdef CONFIG_MMU
>  /* linux/mm/shmem.c */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36ae377..acd84a8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -274,6 +274,8 @@ struct mem_cgroup {
>        spinlock_t pcp_counter_lock;
>
>        int wmark_ratio;
> +
> +       wait_queue_head_t *kswapd_wait;
>  };
>
>  /* Stuffs for move charges at task migration. */
> @@ -4622,6 +4624,33 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
>        return ret;
>  }
>
> +int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
> +{
> +       if (!mem || !kswapd_p)
> +               return 0;
> +
> +       mem->kswapd_wait = &kswapd_p->kswapd_wait;
> +       kswapd_p->kswapd_mem = mem;
> +
> +       return css_id(&mem->css);
> +}
> +
> +void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
> +{
> +       if (mem)
> +               mem->kswapd_wait = NULL;
> +
> +       return;
> +}
> +
> +wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
> +{
> +       if (!mem)
> +               return NULL;
> +
> +       return mem->kswapd_wait;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>        struct mem_cgroup_tree_per_node *rtpn;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 77ac74f..a1a1211 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2242,6 +2242,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
>  }
>
>  static DEFINE_SPINLOCK(kswapds_spinlock);
> +#define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
>
>  /* is kswapd sleeping prematurely? */
>  static int sleeping_prematurely(struct kswapd *kswapd, int order,
> @@ -2251,11 +2252,16 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
>        unsigned long balanced = 0;
>        bool all_zones_ok = true;
>        pg_data_t *pgdat = kswapd->kswapd_pgdat;
> +       struct mem_cgroup *mem = kswapd->kswapd_mem;
>
>        /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>        if (remaining)
>                return true;
>
> +       /* Doesn't support for per-memcg reclaim */
> +       if (mem)
> +               return false;
> +
>        /* Check the watermark levels */
>        for (i = 0; i < pgdat->nr_zones; i++) {
>                struct zone *zone = pgdat->node_zones + i;
> @@ -2598,19 +2604,25 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>         * go fully to sleep until explicitly woken up.
>         */
>        if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
> -               trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +               if (is_node_kswapd(kswapd_p)) {
> +                       trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>
> -               /*
> -                * vmstat counters are not perfectly accurate and the estimated
> -                * value for counters such as NR_FREE_PAGES can deviate from the
> -                * true value by nr_online_cpus * threshold. To avoid the zone
> -                * watermarks being breached while under pressure, we reduce the
> -                * per-cpu vmstat threshold while kswapd is awake and restore
> -                * them before going back to sleep.
> -                */
> -               set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
> -               schedule();
> -               set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
> +                       /*
> +                        * vmstat counters are not perfectly accurate and the
> +                        * estimated value for counters such as NR_FREE_PAGES
> +                        * can deviate from the true value by nr_online_cpus *
> +                        * threshold. To avoid the zone watermarks being
> +                        * breached while under pressure, we reduce the per-cpu
> +                        * vmstat threshold while kswapd is awake and restore
> +                        * them before going back to sleep.
> +                        */
> +                       set_pgdat_percpu_threshold(pgdat,
> +                                                  calculate_normal_threshold);
> +                       schedule();
> +                       set_pgdat_percpu_threshold(pgdat,
> +                                               calculate_pressure_threshold);
> +               } else
> +                       schedule();
>        } else {
>                if (remaining)
>                        count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> @@ -2620,6 +2632,12 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>        finish_wait(wait_h, &wait);
>  }
>
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +                                                       int order)
> +{
> +       return 0;
> +}
> +
>  /*
>  * The background pageout daemon, started as a kernel thread
>  * from the init process.
> @@ -2639,6 +2657,7 @@ int kswapd(void *p)
>        int classzone_idx;
>        struct kswapd *kswapd_p = (struct kswapd *)p;
>        pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +       struct mem_cgroup *mem = kswapd_p->kswapd_mem;
>        wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>        struct task_struct *tsk = current;
>
> @@ -2649,10 +2668,12 @@ int kswapd(void *p)
>
>        lockdep_set_current_reclaim_state(GFP_KERNEL);
>
> -       BUG_ON(pgdat->kswapd_wait != wait_h);
> -       cpumask = cpumask_of_node(pgdat->node_id);
> -       if (!cpumask_empty(cpumask))
> -               set_cpus_allowed_ptr(tsk, cpumask);
> +       if (is_node_kswapd(kswapd_p)) {
> +               BUG_ON(pgdat->kswapd_wait != wait_h);
> +               cpumask = cpumask_of_node(pgdat->node_id);
> +               if (!cpumask_empty(cpumask))
> +                       set_cpus_allowed_ptr(tsk, cpumask);
> +       }
>        current->reclaim_state = &reclaim_state;
>
>        /*
> @@ -2677,24 +2698,29 @@ int kswapd(void *p)
>                int new_classzone_idx;
>                int ret;
>
> -               new_order = pgdat->kswapd_max_order;
> -               new_classzone_idx = pgdat->classzone_idx;
> -               pgdat->kswapd_max_order = 0;
> -               pgdat->classzone_idx = MAX_NR_ZONES - 1;
> -               if (order < new_order || classzone_idx > new_classzone_idx) {
> -                       /*
> -                        * Don't sleep if someone wants a larger 'order'
> -                        * allocation or has tigher zone constraints
> -                        */
> -                       order = new_order;
> -                       classzone_idx = new_classzone_idx;
> -               } else {
> -                       kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
> -                       order = pgdat->kswapd_max_order;
> -                       classzone_idx = pgdat->classzone_idx;
> +               if (is_node_kswapd(kswapd_p)) {
> +                       new_order = pgdat->kswapd_max_order;
> +                       new_classzone_idx = pgdat->classzone_idx;
>                        pgdat->kswapd_max_order = 0;
>                        pgdat->classzone_idx = MAX_NR_ZONES - 1;
> -               }
> +                       if (order < new_order ||
> +                                       classzone_idx > new_classzone_idx) {
> +                               /*
> +                                * Don't sleep if someone wants a larger 'order'
> +                                * allocation or has tigher zone constraints
> +                                */
> +                               order = new_order;
> +                               classzone_idx = new_classzone_idx;
> +                       } else {
> +                               kswapd_try_to_sleep(kswapd_p, order,
> +                                                   classzone_idx);
> +                               order = pgdat->kswapd_max_order;
> +                               classzone_idx = pgdat->classzone_idx;
> +                               pgdat->kswapd_max_order = 0;
> +                               pgdat->classzone_idx = MAX_NR_ZONES - 1;
> +                       }
> +               } else
> +                       kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
>
>                ret = try_to_freeze();
>                if (kthread_should_stop())
> @@ -2705,8 +2731,13 @@ int kswapd(void *p)
>                 * after returning from the refrigerator
>                 */
>                if (!ret) {
> -                       trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> -                       order = balance_pgdat(pgdat, order, &classzone_idx);
> +                       if (is_node_kswapd(kswapd_p)) {
> +                               trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> +                                                               order);
> +                               order = balance_pgdat(pgdat, order,
> +                                                       &classzone_idx);
> +                       } else
> +                               balance_mem_cgroup_pgdat(mem, order);
>                }
>        }
>        return 0;
> @@ -2853,30 +2884,53 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  * This kswapd start function will be called by init and node-hot-add.
>  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
>  */
> -int kswapd_run(int nid)
> +int kswapd_run(int nid, struct mem_cgroup *mem)
>  {
> -       pg_data_t *pgdat = NODE_DATA(nid);
>        struct task_struct *kswapd_thr;
> +       pg_data_t *pgdat = NULL;
>        struct kswapd *kswapd_p;
> +       static char name[TASK_COMM_LEN];
> +       int memcg_id;
>        int ret = 0;
>
> -       if (pgdat->kswapd_wait)
> -               return 0;
> +       if (!mem) {
> +               pgdat = NODE_DATA(nid);
> +               if (pgdat->kswapd_wait)
> +                       return ret;
> +       }
>
>        kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
>        if (!kswapd_p)
>                return -ENOMEM;
>
>        init_waitqueue_head(&kswapd_p->kswapd_wait);
> -       pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> -       kswapd_p->kswapd_pgdat = pgdat;
>
> -       kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> +       if (!mem) {
> +               pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +               kswapd_p->kswapd_pgdat = pgdat;
> +               snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
> +       } else {
> +               memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
> +               if (!memcg_id) {
> +                       kfree(kswapd_p);
> +                       return ret;
> +               }
> +               snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
> +       }
> +
> +       kswapd_thr = kthread_run(kswapd, kswapd_p, name);
>        if (IS_ERR(kswapd_thr)) {
>                /* failure at boot is fatal */
>                BUG_ON(system_state == SYSTEM_BOOTING);
> -               printk("Failed to start kswapd on node %d\n",nid);
> -               pgdat->kswapd_wait = NULL;
> +               if (!mem) {
> +                       printk(KERN_ERR "Failed to start kswapd on node %d\n",
> +                                                               nid);
> +                       pgdat->kswapd_wait = NULL;
> +               } else {
> +                       printk(KERN_ERR "Failed to start kswapd on memcg %d\n",
> +                                                               memcg_id);
> +                       mem_cgroup_clear_kswapd(mem);
> +               }
>                kfree(kswapd_p);
>                ret = -1;
>        } else
> @@ -2887,16 +2941,18 @@ int kswapd_run(int nid)
>  /*
>  * Called by memory hotplug when all memory in a node is offlined.
>  */
> -void kswapd_stop(int nid)
> +void kswapd_stop(int nid, struct mem_cgroup *mem)
>  {
>        struct task_struct *kswapd_thr = NULL;
>        struct kswapd *kswapd_p = NULL;
>        wait_queue_head_t *wait;
>
> -       pg_data_t *pgdat = NODE_DATA(nid);
> -
>        spin_lock(&kswapds_spinlock);
> -       wait = pgdat->kswapd_wait;
> +       if (!mem)
> +               wait = NODE_DATA(nid)->kswapd_wait;
> +       else
> +               wait = mem_cgroup_kswapd_wait(mem);
> +
>        if (wait) {
>                kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>                kswapd_thr = kswapd_p->kswapd_task;
> @@ -2916,7 +2972,7 @@ static int __init kswapd_init(void)
>
>        swap_setup();
>        for_each_node_state(nid, N_HIGH_MEMORY)
> -               kswapd_run(nid);
> +               kswapd_run(nid, NULL);
>        hotcpu_notifier(cpu_callback, 0);
>        return 0;
>  }
> --
> 1.7.3.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>


* Re: [PATCH V3 4/7] Infrastructure to support per-memcg reclaim.
  2011-04-14  3:57   ` Zhu Yanhai
@ 2011-04-14  6:32     ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-14  6:32 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen, linux-mm


On Wed, Apr 13, 2011 at 8:57 PM, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:

> Hi Ying,
>
> 2011/4/13 Ying Han <yinghan@google.com>:
> > -extern int kswapd_run(int nid);
> > -extern void kswapd_stop(int nid);
> > +extern int kswapd_run(int nid, struct mem_cgroup *mem);
> > +extern void kswapd_stop(int nid, struct mem_cgroup *mem);
>
> This breaks online_pages() and offline_pages(), which are also
> callers of kswapd_run() and kswapd_stop().
>

Thanks, that will be fixed in the next post.
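
Presumably the node-hotplug call sites just keep their node-only behaviour by
passing a NULL memcg, mirroring the kswapd_init() change in this patch. As a
rough sketch only (the surrounding mm/memory_hotplug.c context is omitted and
this is not the actual fix):

	/* in online_pages(): start the node kswapd, no memcg attached */
	kswapd_run(nid, NULL);

	/* in offline_pages(): stop the node kswapd */
	kswapd_stop(nid, NULL);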

--Ying

>
> Thanks,
> Zhu Yanhai
>
> >
> >  #ifdef CONFIG_MMU
> >  /* linux/mm/shmem.c */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 36ae377..acd84a8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -274,6 +274,8 @@ struct mem_cgroup {
> >        spinlock_t pcp_counter_lock;
> >
> >        int wmark_ratio;
> > +
> > +       wait_queue_head_t *kswapd_wait;
> >  };
> >
> >  /* Stuffs for move charges at task migration. */
> > @@ -4622,6 +4624,33 @@ int mem_cgroup_watermark_ok(struct mem_cgroup
> *mem,
> >        return ret;
> >  }
> >
> > +int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd
> *kswapd_p)
> > +{
> > +       if (!mem || !kswapd_p)
> > +               return 0;
> > +
> > +       mem->kswapd_wait = &kswapd_p->kswapd_wait;
> > +       kswapd_p->kswapd_mem = mem;
> > +
> > +       return css_id(&mem->css);
> > +}
> > +
> > +void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
> > +{
> > +       if (mem)
> > +               mem->kswapd_wait = NULL;
> > +
> > +       return;
> > +}
> > +
> > +wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
> > +{
> > +       if (!mem)
> > +               return NULL;
> > +
> > +       return mem->kswapd_wait;
> > +}
> > +
> >  static int mem_cgroup_soft_limit_tree_init(void)
> >  {
> >        struct mem_cgroup_tree_per_node *rtpn;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 77ac74f..a1a1211 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2242,6 +2242,7 @@ static bool pgdat_balanced(pg_data_t *pgdat,
> unsigned long balanced_pages,
> >  }
> >
> >  static DEFINE_SPINLOCK(kswapds_spinlock);
> > +#define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
> >
> >  /* is kswapd sleeping prematurely? */
> >  static int sleeping_prematurely(struct kswapd *kswapd, int order,
> > @@ -2251,11 +2252,16 @@ static int sleeping_prematurely(struct kswapd
> *kswapd, int order,
> >        unsigned long balanced = 0;
> >        bool all_zones_ok = true;
> >        pg_data_t *pgdat = kswapd->kswapd_pgdat;
> > +       struct mem_cgroup *mem = kswapd->kswapd_mem;
> >
> >        /* If a direct reclaimer woke kswapd within HZ/10, it's premature
> */
> >        if (remaining)
> >                return true;
> >
> > +       /* Doesn't support for per-memcg reclaim */
> > +       if (mem)
> > +               return false;
> > +
> >        /* Check the watermark levels */
> >        for (i = 0; i < pgdat->nr_zones; i++) {
> >                struct zone *zone = pgdat->node_zones + i;
> > @@ -2598,19 +2604,25 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >         * go fully to sleep until explicitly woken up.
> >         */
> >        if (!sleeping_prematurely(kswapd_p, order, remaining,
> classzone_idx)) {
> > -               trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +               if (is_node_kswapd(kswapd_p)) {
> > +                       trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> >
> > -               /*
> > -                * vmstat counters are not perfectly accurate and the
> estimated
> > -                * value for counters such as NR_FREE_PAGES can deviate
> from the
> > -                * true value by nr_online_cpus * threshold. To avoid the
> zone
> > -                * watermarks being breached while under pressure, we
> reduce the
> > -                * per-cpu vmstat threshold while kswapd is awake and
> restore
> > -                * them before going back to sleep.
> > -                */
> > -               set_pgdat_percpu_threshold(pgdat,
> calculate_normal_threshold);
> > -               schedule();
> > -               set_pgdat_percpu_threshold(pgdat,
> calculate_pressure_threshold);
> > +                       /*
> > +                        * vmstat counters are not perfectly accurate and
> the
> > +                        * estimated value for counters such as
> NR_FREE_PAGES
> > +                        * can deviate from the true value by
> nr_online_cpus *
> > +                        * threshold. To avoid the zone watermarks being
> > +                        * breached while under pressure, we reduce the
> per-cpu
> > +                        * vmstat threshold while kswapd is awake and
> restore
> > +                        * them before going back to sleep.
> > +                        */
> > +                       set_pgdat_percpu_threshold(pgdat,
> > +
>  calculate_normal_threshold);
> > +                       schedule();
> > +                       set_pgdat_percpu_threshold(pgdat,
> > +
> calculate_pressure_threshold);
> > +               } else
> > +                       schedule();
> >        } else {
> >                if (remaining)
> >                        count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > @@ -2620,6 +2632,12 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >        finish_wait(wait_h, &wait);
> >  }
> >
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> > +                                                       int order)
> > +{
> > +       return 0;
> > +}
> > +
> >  /*
> >  * The background pageout daemon, started as a kernel thread
> >  * from the init process.
> > @@ -2639,6 +2657,7 @@ int kswapd(void *p)
> >        int classzone_idx;
> >        struct kswapd *kswapd_p = (struct kswapd *)p;
> >        pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +       struct mem_cgroup *mem = kswapd_p->kswapd_mem;
> >        wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> >        struct task_struct *tsk = current;
> >
> > @@ -2649,10 +2668,12 @@ int kswapd(void *p)
> >
> >        lockdep_set_current_reclaim_state(GFP_KERNEL);
> >
> > -       BUG_ON(pgdat->kswapd_wait != wait_h);
> > -       cpumask = cpumask_of_node(pgdat->node_id);
> > -       if (!cpumask_empty(cpumask))
> > -               set_cpus_allowed_ptr(tsk, cpumask);
> > +       if (is_node_kswapd(kswapd_p)) {
> > +               BUG_ON(pgdat->kswapd_wait != wait_h);
> > +               cpumask = cpumask_of_node(pgdat->node_id);
> > +               if (!cpumask_empty(cpumask))
> > +                       set_cpus_allowed_ptr(tsk, cpumask);
> > +       }
> >        current->reclaim_state = &reclaim_state;
> >
> >        /*
> > @@ -2677,24 +2698,29 @@ int kswapd(void *p)
> >                int new_classzone_idx;
> >                int ret;
> >
> > -               new_order = pgdat->kswapd_max_order;
> > -               new_classzone_idx = pgdat->classzone_idx;
> > -               pgdat->kswapd_max_order = 0;
> > -               pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > -               if (order < new_order || classzone_idx >
> new_classzone_idx) {
> > -                       /*
> > -                        * Don't sleep if someone wants a larger 'order'
> > -                        * allocation or has tigher zone constraints
> > -                        */
> > -                       order = new_order;
> > -                       classzone_idx = new_classzone_idx;
> > -               } else {
> > -                       kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> > -                       order = pgdat->kswapd_max_order;
> > -                       classzone_idx = pgdat->classzone_idx;
> > +               if (is_node_kswapd(kswapd_p)) {
> > +                       new_order = pgdat->kswapd_max_order;
> > +                       new_classzone_idx = pgdat->classzone_idx;
> >                        pgdat->kswapd_max_order = 0;
> >                        pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > -               }
> > +                       if (order < new_order ||
> > +                                       classzone_idx >
> new_classzone_idx) {
> > +                               /*
> > +                                * Don't sleep if someone wants a larger
> 'order'
> > +                                * allocation or has tigher zone
> constraints
> > +                                */
> > +                               order = new_order;
> > +                               classzone_idx = new_classzone_idx;
> > +                       } else {
> > +                               kswapd_try_to_sleep(kswapd_p, order,
> > +                                                   classzone_idx);
> > +                               order = pgdat->kswapd_max_order;
> > +                               classzone_idx = pgdat->classzone_idx;
> > +                               pgdat->kswapd_max_order = 0;
> > +                               pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > +                       }
> > +               } else
> > +                       kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> >
> >                ret = try_to_freeze();
> >                if (kthread_should_stop())
> > @@ -2705,8 +2731,13 @@ int kswapd(void *p)
> >                 * after returning from the refrigerator
> >                 */
> >                if (!ret) {
> > -                       trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> order);
> > -                       order = balance_pgdat(pgdat, order,
> &classzone_idx);
> > +                       if (is_node_kswapd(kswapd_p)) {
> > +
> trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> > +                                                               order);
> > +                               order = balance_pgdat(pgdat, order,
> > +                                                       &classzone_idx);
> > +                       } else
> > +                               balance_mem_cgroup_pgdat(mem, order);
> >                }
> >        }
> >        return 0;
> > @@ -2853,30 +2884,53 @@ static int __devinit cpu_callback(struct
> notifier_block *nfb,
> >  * This kswapd start function will be called by init and node-hot-add.
> >  * On node-hot-add, kswapd will moved to proper cpus if cpus are
> hot-added.
> >  */
> > -int kswapd_run(int nid)
> > +int kswapd_run(int nid, struct mem_cgroup *mem)
> >  {
> > -       pg_data_t *pgdat = NODE_DATA(nid);
> >        struct task_struct *kswapd_thr;
> > +       pg_data_t *pgdat = NULL;
> >        struct kswapd *kswapd_p;
> > +       static char name[TASK_COMM_LEN];
> > +       int memcg_id;
> >        int ret = 0;
> >
> > -       if (pgdat->kswapd_wait)
> > -               return 0;
> > +       if (!mem) {
> > +               pgdat = NODE_DATA(nid);
> > +               if (pgdat->kswapd_wait)
> > +                       return ret;
> > +       }
> >
> >        kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
> >        if (!kswapd_p)
> >                return -ENOMEM;
> >
> >        init_waitqueue_head(&kswapd_p->kswapd_wait);
> > -       pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> > -       kswapd_p->kswapd_pgdat = pgdat;
> >
> > -       kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> > +       if (!mem) {
> > +               pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> > +               kswapd_p->kswapd_pgdat = pgdat;
> > +               snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
> > +       } else {
> > +               memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
> > +               if (!memcg_id) {
> > +                       kfree(kswapd_p);
> > +                       return ret;
> > +               }
> > +               snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
> > +       }
> > +
> > +       kswapd_thr = kthread_run(kswapd, kswapd_p, name);
> >        if (IS_ERR(kswapd_thr)) {
> >                /* failure at boot is fatal */
> >                BUG_ON(system_state == SYSTEM_BOOTING);
> > -               printk("Failed to start kswapd on node %d\n",nid);
> > -               pgdat->kswapd_wait = NULL;
> > +               if (!mem) {
> > +                       printk(KERN_ERR "Failed to start kswapd on node
> %d\n",
> > +                                                               nid);
> > +                       pgdat->kswapd_wait = NULL;
> > +               } else {
> > +                       printk(KERN_ERR "Failed to start kswapd on memcg
> %d\n",
> > +
> memcg_id);
> > +                       mem_cgroup_clear_kswapd(mem);
> > +               }
> >                kfree(kswapd_p);
> >                ret = -1;
> >        } else
> > @@ -2887,16 +2941,18 @@ int kswapd_run(int nid)
> >  /*
> >  * Called by memory hotplug when all memory in a node is offlined.
> >  */
> > -void kswapd_stop(int nid)
> > +void kswapd_stop(int nid, struct mem_cgroup *mem)
> >  {
> >        struct task_struct *kswapd_thr = NULL;
> >        struct kswapd *kswapd_p = NULL;
> >        wait_queue_head_t *wait;
> >
> > -       pg_data_t *pgdat = NODE_DATA(nid);
> > -
> >        spin_lock(&kswapds_spinlock);
> > -       wait = pgdat->kswapd_wait;
> > +       if (!mem)
> > +               wait = NODE_DATA(nid)->kswapd_wait;
> > +       else
> > +               wait = mem_cgroup_kswapd_wait(mem);
> > +
> >        if (wait) {
> >                kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> >                kswapd_thr = kswapd_p->kswapd_task;
> > @@ -2916,7 +2972,7 @@ static int __init kswapd_init(void)
> >
> >        swap_setup();
> >        for_each_node_state(nid, N_HIGH_MEMORY)
> > -               kswapd_run(nid);
> > +               kswapd_run(nid, NULL);
> >        hotcpu_notifier(cpu_callback, 0);
> >        return 0;
> >  }
> > --
> > 1.7.3.1
> >
>

[-- Attachment #2: Type: text/html, Size: 17372 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 2/7] Add per memcg reclaim watermarks
  2011-04-13  7:03 ` [PATCH V3 2/7] Add per memcg reclaim watermarks Ying Han
  2011-04-13  8:25   ` KAMEZAWA Hiroyuki
@ 2011-04-14  8:24   ` Zhu Yanhai
  2011-04-14 17:43     ` Ying Han
  1 sibling, 1 reply; 27+ messages in thread
From: Zhu Yanhai @ 2011-04-14  8:24 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen, linux-mm

Hi Ying,

2011/4/13 Ying Han <yinghan@google.com>:
> +static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> +{
> +       u64 limit;
> +       unsigned long wmark_ratio;
> +
> +       wmark_ratio = get_wmark_ratio(mem);
> +       limit = mem_cgroup_get_limit(mem);

mem_cgroup_get_limit() doesn't return the limit you want here;
it is only meant for the OOM killer and takes swap into account,
so on a box with swap enabled you will get a huge number.
You should use
limit = res_counter_read_u64(&mem->res, RES_LIMIT)
directly instead.
e.g.
 [root@zyh-fedora a]# echo 500m > memory.limit_in_bytes
[root@zyh-fedora a]# cat memory.limit_in_bytes
524288000
[root@zyh-fedora a]# cat memory.reclaim_wmarks
low_wmark 9114218496
high_wmark 9114218496
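
For clarity, a minimal sketch of what the suggested change could look like
(illustrative only; it reuses get_wmark_ratio() and the res_counter wmark
setters introduced earlier in this series, and only the limit read differs
from the posted version):

static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
{
        unsigned long wmark_ratio = get_wmark_ratio(mem);
        /* read the raw memory limit; memsw/swap is never added in */
        u64 limit = res_counter_read_u64(&mem->res, RES_LIMIT);

        if (wmark_ratio == 0) {
                res_counter_set_low_wmark_limit(&mem->res, limit);
                res_counter_set_high_wmark_limit(&mem->res, limit);
        } else {
                /* keep a small gap (tmp >> 8, ~0.4%) between the wmarks */
                u64 tmp = (wmark_ratio * limit) / 100;

                res_counter_set_low_wmark_limit(&mem->res, tmp);
                res_counter_set_high_wmark_limit(&mem->res, tmp - (tmp >> 8));
        }
}

With the 500m limit and wmark_ratio of 90 used in the test setup, tmp works
out to 471859200 and tmp - (tmp >> 8) to 470016000, matching the
memory.reclaim_wmarks output quoted in the cover letter.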

Regards,
Zhu Yanhai


> +       if (wmark_ratio == 0) {
> +               res_counter_set_low_wmark_limit(&mem->res, limit);
> +               res_counter_set_high_wmark_limit(&mem->res, limit);
> +       } else {
> +               unsigned long low_wmark, high_wmark;
> +               unsigned long long tmp = (wmark_ratio * limit) / 100;
> +
> +               low_wmark = tmp;
> +               high_wmark = tmp - (tmp >> 8);
> +               res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> +               res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> +       }
> +}
> +
>  /*
>  * Following LRU functions are allowed to be used without PCG_LOCK.
>  * Operations are called by routine of global LRU independently from memcg.
> @@ -1195,6 +1219,16 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return memcg->swappiness;
>  }
>
> +static unsigned long get_wmark_ratio(struct mem_cgroup *memcg)
> +{
> +       struct cgroup *cgrp = memcg->css.cgroup;
> +
> +       VM_BUG_ON(!cgrp);
> +       VM_BUG_ON(!cgrp->parent);
> +
> +       return memcg->wmark_ratio;
> +}
> +
>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>  {
>        int cpu;
> @@ -3205,6 +3239,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>                        else
>                                memcg->memsw_is_minimum = false;
>                }
> +               setup_per_memcg_wmarks(memcg);
>                mutex_unlock(&set_limit_mutex);
>
>                if (!ret)
> @@ -3264,6 +3299,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>                        else
>                                memcg->memsw_is_minimum = false;
>                }
> +               setup_per_memcg_wmarks(memcg);
>                mutex_unlock(&set_limit_mutex);
>
>                if (!ret)
> @@ -4521,6 +4557,22 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>
> +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> +                               int charge_flags)
> +{
> +       long ret = 0;
> +       int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
> +
> +       VM_BUG_ON((charge_flags & flags) == flags);
> +
> +       if (charge_flags & CHARGE_WMARK_LOW)
> +               ret = res_counter_check_under_low_wmark_limit(&mem->res);
> +       if (charge_flags & CHARGE_WMARK_HIGH)
> +               ret = res_counter_check_under_high_wmark_limit(&mem->res);
> +
> +       return ret;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>        struct mem_cgroup_tree_per_node *rtpn;
> --
> 1.7.3.1
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/7] memcg: per cgroup background reclaim
  2011-04-14  0:14     ` KAMEZAWA Hiroyuki
@ 2011-04-14 17:38       ` Ying Han
  2011-04-14 21:59         ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-14 17:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

[-- Attachment #1: Type: text/plain, Size: 9283 bytes --]

On Wed, Apr 13, 2011 at 5:14 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 Apr 2011 10:53:19 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Wed, Apr 13, 2011 at 12:47 AM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Wed, 13 Apr 2011 00:03:00 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > The current implementation of memcg supports targeting reclaim when
> the
> > > > cgroup is reaching its hard_limit and we do direct reclaim per
> cgroup.
> > > > Per cgroup background reclaim is needed which helps to spread out
> memory
> > > > pressure over longer period of time and smoothes out the cgroup
> > > performance.
> > > >
> > > > If the cgroup is configured to use per cgroup background reclaim, a
> > > kswapd
> > > > thread is created which only scans the per-memcg LRU list. Two
> watermarks
> > > > ("high_wmark", "low_wmark") are added to trigger the background
> reclaim
> > > and
> > > > stop it. The watermarks are calculated based on the cgroup's
> > > limit_in_bytes.
> > > >
> > > > I run through dd test on large file and then cat the file. Then I
> > > compared
> > > > the reclaim related stats in memory.stat.
> > > >
> > > > Step1: Create a cgroup with 500M memory_limit.
> > > > $ mkdir /dev/cgroup/memory/A
> > > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > > >
> > > > Step2: Test and set the wmarks.
> > > > $ cat /dev/cgroup/memory/A/memory.wmark_ratio
> > > > 0
> > > >
> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > low_wmark 524288000
> > > > high_wmark 524288000
> > > >
> > > > $ echo 90 >/dev/cgroup/memory/A/memory.wmark_ratio
> > > >
> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > low_wmark 471859200
> > > > high_wmark 470016000
> > > >
> > > > $ ps -ef | grep memcg
> > > > root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
> > > > root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg
> > > >
> > > > Step3: Dirty the pages by creating a 20g file on hard drive.
> > > > $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
> > > >
> > > > Here are the memory.stat with vs without the per-memcg reclaim. It
> used
> > > to be
> > > > all the pages are reclaimed from direct reclaim, and now some of the
> > > pages are
> > > > also being reclaimed at background.
> > > >
> > > > Only direct reclaim                       With background reclaim:
> > > >
> > > > pgpgin 5248668                            pgpgin 5248347
> > > > pgpgout 5120678                           pgpgout 5133505
> > > > kswapd_steal 0                            kswapd_steal 1476614
> > > > pg_pgsteal 5120578                        pg_pgsteal 3656868
> > > > kswapd_pgscan 0                           kswapd_pgscan 3137098
> > > > pg_scan 10861956                          pg_scan 6848006
> > > > pgrefill 271174                           pgrefill 290441
> > > > pgoutrun 0                                pgoutrun 18047
> > > > allocstall 131689                         allocstall 100179
> > > >
> > > > real    7m42.702s                         real 7m42.323s
> > > > user    0m0.763s                          user 0m0.748s
> > > > sys     0m58.785s                         sys  0m52.123s
> > > >
> > > > throughput is 44.33 MB/sec                throughput is 44.23 MB/sec
> > > >
> > > > Step 4: Cleanup
> > > > $ echo $$ >/dev/cgroup/memory/tasks
> > > > $ echo 1 > /dev/cgroup/memory/A/memory.force_empty
> > > > $ rmdir /dev/cgroup/memory/A
> > > > $ echo 3 >/proc/sys/vm/drop_caches
> > > >
> > > > Step 5: Create the same cgroup and read the 20g file into pagecache.
> > > > $ cat /export/hdc3/dd/tf0 > /dev/zero
> > > >
> > > > All the pages are reclaimed from background instead of direct reclaim
> > > with
> > > > the per cgroup reclaim.
> > > >
> > > > Only direct reclaim                       With background reclaim:
> > > > pgpgin 5248668                            pgpgin 5248114
> > > > pgpgout 5120678                           pgpgout 5133480
> > > > kswapd_steal 0                            kswapd_steal 5133397
> > > > pg_pgsteal 5120578                        pg_pgsteal 0
> > > > kswapd_pgscan 0                           kswapd_pgscan 5133400
> > > > pg_scan 10861956                          pg_scan 0
> > > > pgrefill 271174                           pgrefill 0
> > > > pgoutrun 0                                pgoutrun 40535
> > > > allocstall 131689                         allocstall 0
> > > >
> > > > real    7m42.702s                         real 6m20.439s
> > > > user    0m0.763s                          user 0m0.169s
> > > > sys     0m58.785s                         sys  0m26.574s
> > > >
> > > > Note:
> > > > This is the first effort of enhancing the target reclaim into memcg.
> Here
> > > are
> > > > the existing known issues and our plan:
> > > >
> > > > 1. there are one kswapd thread per cgroup. the thread is created when
> the
> > > > cgroup changes its limit_in_bytes and is deleted when the cgroup is
> being
> > > > removed. In some enviroment when thousand of cgroups are being
> configured
> > > on
> > > > a single host, we will have thousand of kswapd threads. The memory
> > > consumption
> > > > would be 8k*100 = 8M. We don't see a big issue for now if the host
> can
> > > host
> > > > that many of cgroups.
> > > >
> > >
> > > What's bad with using workqueue ?
> > >
> > > Pros.
> > >  - we don't have to keep our own thread pool.
> > >  - we don't have to see 'ps -elf' is filled by kswapd...
> > > Cons.
> > >  - because threads are shared, we can't put kthread to cpu cgroup.
> > >
> >
> > I did some study on workqueue after posting V2. There was a comment
> suggesting
> > workqueue instead of per-memcg kswapd thread, since it will cut the
> number
> > of kernel threads being created in host with lots of cgroups. Each kernel
> > thread allocates about 8K of stack and 8M in total w/ thousand of
> cgroups.
> >
> > The current workqueue model merged in 2.6.36 kernel is called
> "concurrency
> > managed workqueu(cmwq)", which is intended to provide flexible
> concurrency
> > without wasting resources. I studied a bit and here it is:
> >
> > 1. The workqueue is complicated and we need to be very careful of work
> items
> > in the workqueue. We've experienced in one workitem stucks and the rest
> of
> > the work item won't proceed. For example in dirty page writeback,  one
> > heavily writer cgroup could starve the other cgroups from flushing dirty
> > pages to the same disk. In the kswapd case, I can image we might have
> > similar scenario.
> >
> > 2. How to prioritize the workitems is another problem. The order of
> adding
> > the workitems in the queue reflects the order of cgroups being reclaimed.
> We
> > don't have that restriction currently but relying on the cpu scheduler to
> > put kswapd on the right cpu-core to run. We "might" introduce priority
> later
> > for reclaim and how are we gonna deal with that.
> >
> > 3. Based on what i observed, not many callers has migrated to the cmwq
> and I
> > don't have much data of how good it is.
> >
> >
> > Regardless of workqueue, can't we have moderate numbers of threads ?
> > >
> > > I really don't like to have too much threads and thinks
> > > one-thread-per-memcg
> > > is big enough to cause lock contension problem.
> > >
> >
> > Back to the current model, on machine with thousands of cgroups which it
> > will take 8M total for thousand of kswapd threads (8K stack for each
> > thread).  We are running system with fakenuma which each numa node has a
> > kswapd. So far we haven't noticed issue caused by "lots of" kswapd
> threads.
> > Also, there shouldn't be any performance overhead for kernel thread if it
> is
> > not running.
> >
> > Based on the complexity of workqueue and the benefit it provides, I would
> > like to stick to the current model first. After we get the basic stuff in
> > and other targeting reclaim improvement, we can come back to this. What
> do
> > you think?
> >
>
> Okay, fair enough. kthread_run() will win.
>
> Then, I have another request. I'd like to kswapd-for-memcg to some cpu
> cgroup to limit cpu usage.
>
> - Could you show thread ID somewhere ? and
>  confirm we can put it to some cpu cgroup ?
>  (creating a auto cpu cgroup for memcg kswapd is a choice, I think.)
>
>  BTW, when kthread_run() creates a kthread, which cgroup it will be under ?
>  If it will be under a cgroup who calls kthread_run(), per-memcg kswapd
> will
>  go under cgroup where the user sets hi/low wmark, implicitly.
>  I don't think this is very bad. But it's better to mention the behavior
>  somewhere because memcg is tend to be used with cpu cgroup.
>  Could you check and add some doc ?
>
> And
> - Could you drop PF_MEMALLOC ? (for now.) (in patch 4)
>
Hmm, do you mean to drop it for per-memcg kswapd?


> - Could you check PF_KSWAPD doesn't do anything bad ?
>

There are eight places where current_is_kswapd() is called. Five of them
only update counters, and the rest look fine to me.

1. too_many_isolated()
    returns false if kswapd

2. should_reclaim_stall()
    returns false if kswapd

3.  nfs_commit_inode()
   may_wait = NULL if kswapd

--Ying


>
> Thanks,
> -Kame
>
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 12058 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 2/7] Add per memcg reclaim watermarks
  2011-04-14  8:24   ` Zhu Yanhai
@ 2011-04-14 17:43     ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-14 17:43 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: KAMEZAWA Hiroyuki, Pavel Emelyanov, Balbir Singh,
	Daisuke Nishimura, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, KOSAKI Motohiro,
	Tejun Heo, Michal Hocko, Andrew Morton, Dave Hansen, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4011 bytes --]

On Thu, Apr 14, 2011 at 1:24 AM, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:

> Hi Ying,
>
> 2011/4/13 Ying Han <yinghan@google.com>:
> > +static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> > +{
> > +       u64 limit;
> > +       unsigned long wmark_ratio;
> > +
> > +       wmark_ratio = get_wmark_ratio(mem);
> > +       limit = mem_cgroup_get_limit(mem);
>
> mem_cgroup_get_limit() doesn't return the limit you want here;
> it is only meant for the OOM killer and takes swap into account,
> so on a box with swap enabled you will get a huge number.
> You should use
> limit = res_counter_read_u64(&mem->res, RES_LIMIT)
> directly instead.
> e.g.
>  [root@zyh-fedora a]# echo 500m > memory.limit_in_bytes
> [root@zyh-fedora a]# cat memory.limit_in_bytes
> 524288000
> [root@zyh-fedora a]# cat memory.reclaim_wmarks
> low_wmark 9114218496
> high_wmark 9114218496
>

Thank you~ I will fix it in the next post.

--Ying

>
> Regards,
> Zhu Yanhai
>
>
> > +       if (wmark_ratio == 0) {
> > +               res_counter_set_low_wmark_limit(&mem->res, limit);
> > +               res_counter_set_high_wmark_limit(&mem->res, limit);
> > +       } else {
> > +               unsigned long low_wmark, high_wmark;
> > +               unsigned long long tmp = (wmark_ratio * limit) / 100;
> > +
> > +               low_wmark = tmp;
> > +               high_wmark = tmp - (tmp >> 8);
> > +               res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> > +               res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> > +       }
> > +}
> > +
> >  /*
> >  * Following LRU functions are allowed to be used without PCG_LOCK.
> >  * Operations are called by routine of global LRU independently from memcg.
> > @@ -1195,6 +1219,16 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >        return memcg->swappiness;
> >  }
> >
> > +static unsigned long get_wmark_ratio(struct mem_cgroup *memcg)
> > +{
> > +       struct cgroup *cgrp = memcg->css.cgroup;
> > +
> > +       VM_BUG_ON(!cgrp);
> > +       VM_BUG_ON(!cgrp->parent);
> > +
> > +       return memcg->wmark_ratio;
> > +}
> > +
> >  static void mem_cgroup_start_move(struct mem_cgroup *mem)
> >  {
> >        int cpu;
> > @@ -3205,6 +3239,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> >                        else
> >                                memcg->memsw_is_minimum = false;
> >                }
> > +               setup_per_memcg_wmarks(memcg);
> >                mutex_unlock(&set_limit_mutex);
> >
> >                if (!ret)
> > @@ -3264,6 +3299,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> >                        else
> >                                memcg->memsw_is_minimum = false;
> >                }
> > +               setup_per_memcg_wmarks(memcg);
> >                mutex_unlock(&set_limit_mutex);
> >
> >                if (!ret)
> > @@ -4521,6 +4557,22 @@ static void __init enable_swap_cgroup(void)
> >  }
> >  #endif
> >
> > +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> > +                               int charge_flags)
> > +{
> > +       long ret = 0;
> > +       int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
> > +
> > +       VM_BUG_ON((charge_flags & flags) == flags);
> > +
> > +       if (charge_flags & CHARGE_WMARK_LOW)
> > +               ret = res_counter_check_under_low_wmark_limit(&mem->res);
> > +       if (charge_flags & CHARGE_WMARK_HIGH)
> > +               ret = res_counter_check_under_high_wmark_limit(&mem->res);
> > +
> > +       return ret;
> > +}
> > +
> >  static int mem_cgroup_soft_limit_tree_init(void)
> >  {
> >        struct mem_cgroup_tree_per_node *rtpn;
> > --
> > 1.7.3.1
> >
>

[-- Attachment #2: Type: text/html, Size: 5481 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/7] memcg: per cgroup background reclaim
  2011-04-14 17:38       ` Ying Han
@ 2011-04-14 21:59         ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-14 21:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm

[-- Attachment #1: Type: text/plain, Size: 10998 bytes --]

On Thu, Apr 14, 2011 at 10:38 AM, Ying Han <yinghan@google.com> wrote:

>
>
> On Wed, Apr 13, 2011 at 5:14 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Wed, 13 Apr 2011 10:53:19 -0700
>> Ying Han <yinghan@google.com> wrote:
>>
>> > On Wed, Apr 13, 2011 at 12:47 AM, KAMEZAWA Hiroyuki <
>> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> > > On Wed, 13 Apr 2011 00:03:00 -0700
>> > > Ying Han <yinghan@google.com> wrote:
>> > >
>> > > > The current implementation of memcg supports targeting reclaim when
>> the
>> > > > cgroup is reaching its hard_limit and we do direct reclaim per
>> cgroup.
>> > > > Per cgroup background reclaim is needed which helps to spread out
>> memory
>> > > > pressure over longer period of time and smoothes out the cgroup
>> > > performance.
>> > > >
>> > > > If the cgroup is configured to use per cgroup background reclaim, a
>> > > kswapd
>> > > > thread is created which only scans the per-memcg LRU list. Two
>> watermarks
>> > > > ("high_wmark", "low_wmark") are added to trigger the background
>> reclaim
>> > > and
>> > > > stop it. The watermarks are calculated based on the cgroup's
>> > > limit_in_bytes.
>> > > >
>> > > > I run through dd test on large file and then cat the file. Then I
>> > > compared
>> > > > the reclaim related stats in memory.stat.
>> > > >
>> > > > Step1: Create a cgroup with 500M memory_limit.
>> > > > $ mkdir /dev/cgroup/memory/A
>> > > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
>> > > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> > > >
>> > > > Step2: Test and set the wmarks.
>> > > > $ cat /dev/cgroup/memory/A/memory.wmark_ratio
>> > > > 0
>> > > >
>> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
>> > > > low_wmark 524288000
>> > > > high_wmark 524288000
>> > > >
>> > > > $ echo 90 >/dev/cgroup/memory/A/memory.wmark_ratio
>> > > >
>> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
>> > > > low_wmark 471859200
>> > > > high_wmark 470016000
>> > > >
>> > > > $ ps -ef | grep memcg
>> > > > root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
>> > > > root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg
>> > > >
>> > > > Step3: Dirty the pages by creating a 20g file on hard drive.
>> > > > $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
>> > > >
>> > > > Here are the memory.stat with vs without the per-memcg reclaim. It
>> used
>> > > to be
>> > > > all the pages are reclaimed from direct reclaim, and now some of the
>> > > pages are
>> > > > also being reclaimed at background.
>> > > >
>> > > > Only direct reclaim                       With background reclaim:
>> > > >
>> > > > pgpgin 5248668                            pgpgin 5248347
>> > > > pgpgout 5120678                           pgpgout 5133505
>> > > > kswapd_steal 0                            kswapd_steal 1476614
>> > > > pg_pgsteal 5120578                        pg_pgsteal 3656868
>> > > > kswapd_pgscan 0                           kswapd_pgscan 3137098
>> > > > pg_scan 10861956                          pg_scan 6848006
>> > > > pgrefill 271174                           pgrefill 290441
>> > > > pgoutrun 0                                pgoutrun 18047
>> > > > allocstall 131689                         allocstall 100179
>> > > >
>> > > > real    7m42.702s                         real 7m42.323s
>> > > > user    0m0.763s                          user 0m0.748s
>> > > > sys     0m58.785s                         sys  0m52.123s
>> > > >
>> > > > throughput is 44.33 MB/sec                throughput is 44.23 MB/sec
>> > > >
>> > > > Step 4: Cleanup
>> > > > $ echo $$ >/dev/cgroup/memory/tasks
>> > > > $ echo 1 > /dev/cgroup/memory/A/memory.force_empty
>> > > > $ rmdir /dev/cgroup/memory/A
>> > > > $ echo 3 >/proc/sys/vm/drop_caches
>> > > >
>> > > > Step 5: Create the same cgroup and read the 20g file into pagecache.
>> > > > $ cat /export/hdc3/dd/tf0 > /dev/zero
>> > > >
>> > > > All the pages are reclaimed from background instead of direct
>> reclaim
>> > > with
>> > > > the per cgroup reclaim.
>> > > >
>> > > > Only direct reclaim                       With background reclaim:
>> > > > pgpgin 5248668                            pgpgin 5248114
>> > > > pgpgout 5120678                           pgpgout 5133480
>> > > > kswapd_steal 0                            kswapd_steal 5133397
>> > > > pg_pgsteal 5120578                        pg_pgsteal 0
>> > > > kswapd_pgscan 0                           kswapd_pgscan 5133400
>> > > > pg_scan 10861956                          pg_scan 0
>> > > > pgrefill 271174                           pgrefill 0
>> > > > pgoutrun 0                                pgoutrun 40535
>> > > > allocstall 131689                         allocstall 0
>> > > >
>> > > > real    7m42.702s                         real 6m20.439s
>> > > > user    0m0.763s                          user 0m0.169s
>> > > > sys     0m58.785s                         sys  0m26.574s
>> > > >
>> > > > Note:
>> > > > This is the first effort of enhancing the target reclaim into memcg.
>> Here
>> > > are
>> > > > the existing known issues and our plan:
>> > > >
>> > > > 1. there are one kswapd thread per cgroup. the thread is created
>> when the
>> > > > cgroup changes its limit_in_bytes and is deleted when the cgroup is
>> being
>> > > > removed. In some enviroment when thousand of cgroups are being
>> configured
>> > > on
>> > > > a single host, we will have thousand of kswapd threads. The memory
>> > > consumption
>> > > > would be 8k*100 = 8M. We don't see a big issue for now if the host
>> can
>> > > host
>> > > > that many of cgroups.
>> > > >
>> > >
>> > > What's bad with using workqueue ?
>> > >
>> > > Pros.
>> > >  - we don't have to keep our own thread pool.
>> > >  - we don't have to see 'ps -elf' is filled by kswapd...
>> > > Cons.
>> > >  - because threads are shared, we can't put kthread to cpu cgroup.
>> > >
>> >
>> > I did some study on workqueue after posting V2. There was a comment
>> suggesting
>> > workqueue instead of per-memcg kswapd thread, since it will cut the
>> number
>> > of kernel threads being created in host with lots of cgroups. Each
>> kernel
>> > thread allocates about 8K of stack and 8M in total w/ thousand of
>> cgroups.
>> >
>> > The current workqueue model merged in 2.6.36 kernel is called
>> "concurrency
>> > managed workqueu(cmwq)", which is intended to provide flexible
>> concurrency
>> > without wasting resources. I studied a bit and here it is:
>> >
>> > 1. The workqueue is complicated and we need to be very careful of work
>> items
>> > in the workqueue. We've experienced in one workitem stucks and the rest
>> of
>> > the work item won't proceed. For example in dirty page writeback,  one
>> > heavily writer cgroup could starve the other cgroups from flushing dirty
>> > pages to the same disk. In the kswapd case, I can image we might have
>> > similar scenario.
>> >
>> > 2. How to prioritize the workitems is another problem. The order of
>> adding
>> > the workitems in the queue reflects the order of cgroups being
>> reclaimed. We
>> > don't have that restriction currently but relying on the cpu scheduler
>> to
>> > put kswapd on the right cpu-core to run. We "might" introduce priority
>> later
>> > for reclaim and how are we gonna deal with that.
>> >
>> > 3. Based on what i observed, not many callers has migrated to the cmwq
>> and I
>> > don't have much data of how good it is.
>> >
>> >
>> > Regardless of workqueue, can't we have moderate numbers of threads ?
>> > >
>> > > I really don't like to have too much threads and thinks
>> > > one-thread-per-memcg
>> > > is big enough to cause lock contension problem.
>> > >
>> >
>> > Back to the current model, on machine with thousands of cgroups which it
>> > will take 8M total for thousand of kswapd threads (8K stack for each
>> > thread).  We are running system with fakenuma which each numa node has a
>> > kswapd. So far we haven't noticed issue caused by "lots of" kswapd
>> threads.
>> > Also, there shouldn't be any performance overhead for kernel thread if
>> it is
>> > not running.
>> >
>> > Based on the complexity of workqueue and the benefit it provides, I
>> would
>> > like to stick to the current model first. After we get the basic stuff
>> in
>> > and other targeting reclaim improvement, we can come back to this. What
>> do
>> > you think?
>> >
>>
>> Okay, fair enough. kthread_run() will win.
>>
>> Then, I have another request. I'd like to kswapd-for-memcg to some cpu
>> cgroup to limit cpu usage.
>>
>> - Could you show thread ID somewhere ?
>
I added a patch which exports the per-memcg kswapd pid. This is needed so
userspace can later link the kswapd thread back to its owning memcg.

$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_3 6727



>> and confirm we can put it to some cpu cgroup ?
>>
>
I tested it by echoing the memcg kswapd thread's pid into a cpu cgroup with
some cpu shares assigned.


>>  (creating a auto cpu cgroup for memcg kswapd is a choice, I think.)
>>
>>  BTW, when kthread_run() creates a kthread, which cgroup it will be under
>> ?
>>
>
By default, it runs under the root cgroup. If there is a need to put the
kswapd thread into a cpu cgroup, userspace can do that by reading the pid
from the new API and echoing it into the target cgroup's tasks file, for
example as sketched below.
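
(Illustrative only: this assumes the cpu controller is mounted at
/dev/cgroup/cpu and reuses the pid shown above; the cgroup name and share
value are made up for the example.)

$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_3 6727
$ mkdir /dev/cgroup/cpu/memcg_3_kswapd
$ echo 512 > /dev/cgroup/cpu/memcg_3_kswapd/cpu.shares
$ echo 6727 > /dev/cgroup/cpu/memcg_3_kswapd/tasks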


>>  If it will be under a cgroup who calls kthread_run(), per-memcg kswapd
>> will
>>  go under cgroup where the user sets hi/low wmark, implicitly.
>>  I don't think this is very bad. But it's better to mention the behavior
>>  somewhere because memcg is tend to be used with cpu cgroup.
>>  Could you check and add some doc ?
>>
>
It makes sense to constrain the cpu usage of the per-memcg kswapd thread as
part of the memcg. However, I see more problems than benefits in doing so.

pros:
1. it is good for isolation, since it prevents one cgroup's heavy reclaim
from stealing cpu cycles from other cgroups.

cons:
1. constraining background reclaim pushes more pressure onto direct
reclaim, which hurts process performance, especially on machines with
spare cpu cycles most of the time.
2. there is a danger of priority inversion when the kswapd thread is
preempted. In a non-preemptive kernel we should be ok; in a preemptive
kernel we might preempt a kswapd that is holding a mutex.
3. when users configure both the cpu cgroup and the memory cgroup, they
need to make the cpu reservation proportional to the memcg size.

--Ying


>
>> And
>> - Could you drop PF_MEMALLOC ? (for now.) (in patch 4)
>>
> Hmm, do you mean to drop it for per-memcg kswapd?
>

Ok, I dropped the flag for the per-memcg kswapd and also added a comment
about it.

>
>
>> - Could you check PF_KSWAPD doesn't do anything bad ?
>>
>
> There are eight places where current_is_kswapd() is called. Five of them
> only update counters, and the rest look fine to me.
>
> 1. too_many_isolated()
>     returns false if kswapd
>
> 2. should_reclaim_stall()
>     returns false if kswapd
>
> 3.  nfs_commit_inode()
>    may_wait = NULL if kswapd
>
> --Ying
>
>
>>
>> Thanks,
>> -Kame
>>
>>
>>
>>
>>
>

[-- Attachment #2: Type: text/html, Size: 15479 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2011-04-14 21:59 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-13  7:03 [PATCH V3 0/7] memcg: per cgroup background reclaim Ying Han
2011-04-13  7:03 ` [PATCH V3 1/7] Add kswapd descriptor Ying Han
2011-04-13  7:03 ` [PATCH V3 2/7] Add per memcg reclaim watermarks Ying Han
2011-04-13  8:25   ` KAMEZAWA Hiroyuki
2011-04-13 18:40     ` Ying Han
2011-04-14  0:27       ` KAMEZAWA Hiroyuki
2011-04-14  8:24   ` Zhu Yanhai
2011-04-14 17:43     ` Ying Han
2011-04-13  7:03 ` [PATCH V3 3/7] New APIs to adjust per-memcg wmarks Ying Han
2011-04-13  8:30   ` KAMEZAWA Hiroyuki
2011-04-13 18:46     ` Ying Han
2011-04-13  7:03 ` [PATCH V3 4/7] Infrastructure to support per-memcg reclaim Ying Han
2011-04-14  3:57   ` Zhu Yanhai
2011-04-14  6:32     ` Ying Han
2011-04-13  7:03 ` [PATCH V3 5/7] Per-memcg background reclaim Ying Han
2011-04-13  8:58   ` KAMEZAWA Hiroyuki
2011-04-13 22:45     ` Ying Han
2011-04-13  7:03 ` [PATCH V3 6/7] Enable per-memcg " Ying Han
2011-04-13  9:05   ` KAMEZAWA Hiroyuki
2011-04-13 21:20     ` Ying Han
2011-04-14  0:30       ` KAMEZAWA Hiroyuki
2011-04-13  7:03 ` [PATCH V3 7/7] Add some per-memcg stats Ying Han
2011-04-13  7:47 ` [PATCH V3 0/7] memcg: per cgroup background reclaim KAMEZAWA Hiroyuki
2011-04-13 17:53   ` Ying Han
2011-04-14  0:14     ` KAMEZAWA Hiroyuki
2011-04-14 17:38       ` Ying Han
2011-04-14 21:59         ` Ying Han
