* [PATCH V4 00/10] memcg: per cgroup background reclaim
@ 2011-04-14 22:54 Ying Han
  2011-04-14 22:54 ` [PATCH V4 01/10] Add kswapd descriptor Ying Han
                   ` (10 more replies)
  0 siblings, 11 replies; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

The current memcg implementation supports target reclaim when a cgroup is
reaching its hard_limit, and that reclaim is done as per-cgroup direct reclaim.
Per-cgroup background reclaim is needed to spread the memory pressure out over
a longer period of time and smooth out the cgroup's performance.

If a cgroup is configured to use per-cgroup background reclaim, a kswapd
thread is created which scans only the per-memcg LRU list. Two watermarks
("high_wmark", "low_wmark") are added to trigger and stop the background
reclaim; they are calculated from the cgroup's limit_in_bytes. By default, the
per-memcg kswapd threads run under the root cgroup. A per-memcg API exports
the pid of each kswapd thread, so userspace can configure a cpu cgroup for it
separately.

I ran a dd test on a large file and then cat the file, and compared the
reclaim-related stats in memory.stat.

Step 1: Create a cgroup with a 500M memory limit.
$ mkdir /dev/cgroup/memory/A
$ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/A/tasks

Step 2: Check and set the wmarks.
$ cat /dev/cgroup/memory/A/memory.low_wmark_distance
0
$ cat /dev/cgroup/memory/A/memory.high_wmark_distance
0

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 524288000
high_wmark 524288000

$ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
$ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 482344960
high_wmark 471859200
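
(These values follow from low/high_wmark = limit_in_bytes - low/high_wmark_distance:
524288000 - 40*1024*1024 = 482344960 and 524288000 - 50*1024*1024 = 471859200.)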

$ ps -ef | grep memcg
root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg

$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_3 18126
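
(The per-memcg kswapd thread is named "memcg_<css id>": here the cgroup's css
id is 3 and the thread's pid is 18126, matching the [memcg_3] entry in the ps
output above.)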

Step 3: Dirty the pages by creating a 20g file on the hard drive.
$ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1

Here is memory.stat with vs. without the per-memcg background reclaim.
Previously all the pages were reclaimed via direct reclaim; now some of the
pages are also reclaimed in the background.

Only direct reclaim                       With background reclaim:

pgpgin 5243228                           pgpgin 5243233
pgpgout 5115242                          pgpgout 5131962
kswapd_steal 0                           kswapd_steal 3836249
pg_pgsteal 5115218                       pg_pgsteal 1295690
kswapd_pgscan 0                          kswapd_pgscan 21233387
pg_scan 5929918                          pg_scan 4359408
pgrefill 264763                          pgrefill 307044
pgoutrun 0                               pgoutrun 54431
allocstall 158364                        allocstall 40858

real   5m4.183s                          real    4m57.100s
user   0m1.197s                          user    0m1.261s
sys    1m8.176s                          sys     1m5.866s

throughput is 67.37 MB/sec               throughput is 68.96 MB/sec

Step 4: Cleanup
$ echo $$ >/dev/cgroup/memory/tasks
$ echo 1 > /dev/cgroup/memory/A/memory.force_empty
$ rmdir /dev/cgroup/memory/A
$ echo 3 >/proc/sys/vm/drop_caches

Step 5: Create the same cgroup and read the 20g file into pagecache.
$ cat /export/hdc3/dd/tf0 > /dev/zero

With the per-cgroup background reclaim, all the pages are reclaimed in the
background instead of via direct reclaim.

Only direct reclaim                       With background reclaim:

pgpgin 5242937                            pgpgin 5242988
pgpgout 5114971                           pgpgout 5126756
kswapd_steal 0                            kswapd_steal 5126676
pg_pgsteal 5114941                        pg_pgsteal 0
kswapd_pgscan 0                           kswapd_pgscan 5126682
pg_scan 5114944                           pg_scan 0
pgrefill 0                                pgrefill 0
pgoutrun 0                                pgoutrun 60765
allocstall 159840                         allocstall 0

real    4m20.649s                         real    4m20.685s
user    0m0.193s                          user    0m0.268s
sys     0m32.266s                         sys     0m24.506s

Note:
This is the first effort at extending target reclaim into memcg. Here are the
known issues and our plan:

1. There is one kswapd thread per cgroup. The thread is created when the
cgroup changes its limit_in_bytes and is deleted when the cgroup is removed.
In environments where thousands of cgroups are configured on a single host, we
will have thousands of kswapd threads. The memory consumption would be about
8k * 1000 = 8M of kernel stacks. We don't see a big issue for now if the host
can accommodate that many cgroups.

2. Regarding the workqueue alternative: it is more complicated and we need to
be very careful about work items in the workqueue. We have seen cases where
one work item gets stuck and the rest of the work items cannot proceed. For
example, in dirty page writeback one heavy-writer cgroup could starve the
other cgroups from flushing dirty pages to the same disk; in the kswapd case I
can imagine a similar scenario. How to prioritize the work items is another
problem: the order in which work items are added to the queue dictates the
order in which cgroups get reclaimed. We don't have that restriction
currently, but instead rely on the cpu scheduler to put each kswapd on the
right cpu core to run. We might introduce reclaim priorities later, and it is
unclear how a workqueue would handle that.

3. There is potential lock contention between the per-cgroup kswapds, and the
worst case depends on the number of cpu cores on the system. Basically we are
now sharing zone->lru_lock between the per-memcg LRU and the global LRU. I
have a plan to get rid of the global LRU eventually, which requires enhancing
the existing target reclaim (this patchset is part of that). I would like to
get to the point where the lock contention problem is solved naturally.

4. There is no hierarchical reclaim support in this patchset. I would like to
get to it after the basic pieces are accepted.

5. By default, each per-memcg kswapd runs under the root cgroup. If there is a
need to put a kswapd thread into a cpu cgroup, userspace can make that change
by reading the pid from the new API and echoing it into the cpu cgroup's tasks
file, as sketched below. On a non-preemptive kernel, we need to be careful
about priority inversion when restricting kswapd's cpu time while it is
holding a mutex.
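
A rough example of that (the cpu cgroup path, group name and cpu.shares value
below are illustrative only; the cpu controller may be mounted elsewhere on a
given system):

$ mkdir /dev/cgroup/cpu/kswapd_A
$ echo 512 > /dev/cgroup/cpu/kswapd_A/cpu.shares
$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_3 18126
$ echo 18126 > /dev/cgroup/cpu/kswapd_A/tasks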

Ying Han (10):
  Add kswapd descriptor
  Add per memcg reclaim watermarks
  New APIs to adjust per-memcg wmarks
  Infrastructure to support per-memcg reclaim.
  Implement the select_victim_node within memcg.
  Per-memcg background reclaim.
  Add per-memcg zone "unreclaimable"
  Enable per-memcg background reclaim.
  Add API to export per-memcg kswapd pid.
  Add some per-memcg stats

 Documentation/cgroups/memory.txt |   14 ++
 include/linux/memcontrol.h       |   91 ++++++++
 include/linux/mmzone.h           |    3 +-
 include/linux/res_counter.h      |   78 +++++++
 include/linux/swap.h             |   16 ++-
 kernel/res_counter.c             |    6 +
 mm/memcontrol.c                  |  450 ++++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c              |    4 +-
 mm/page_alloc.c                  |    1 -
 mm/vmscan.c                      |  438 +++++++++++++++++++++++++++++++------
 10 files changed, 1029 insertions(+), 72 deletions(-)

-- 
1.7.3.1


* [PATCH V4 01/10] Add kswapd descriptor
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  0:04   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 02/10] Add per memcg reclaim watermarks Ying Han
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

There is a kswapd kernel thread for each NUMA node. We will add a separate
kswapd for each memcg. The kswapd sleeps in the wait queue headed at the
kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
information about the node or the memcg, and it allows the global and
per-memcg background reclaim to share common reclaim algorithms.

This patch adds the kswapd descriptor and converts the per-node kswapd to use
the new structure.

changelog v2..v1:
1. dynamically allocate the kswapd descriptor and initialize the
wait_queue_head of pgdat at kswapd_run.
2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup kswapd
descriptor.

changelog v3..v2:
1. move the struct mem_cgroup *kswapd_mem in the kswapd struct to a later patch.
2. rename thr in kswapd_run to something else.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/mmzone.h |    3 +-
 include/linux/swap.h   |    7 ++++
 mm/page_alloc.c        |    1 -
 mm/vmscan.c            |   95 ++++++++++++++++++++++++++++++++++++------------
 4 files changed, 80 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 628f07b..6cba7d2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -640,8 +640,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	wait_queue_head_t kswapd_wait;
-	struct task_struct *kswapd;
+	wait_queue_head_t *kswapd_wait;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 } pg_data_t;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed6ebe6..f43d406 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
 	return current->flags & PF_KSWAPD;
 }
 
+struct kswapd {
+	struct task_struct *kswapd_task;
+	wait_queue_head_t kswapd_wait;
+	pg_data_t *kswapd_pgdat;
+};
+
+int kswapd(void *p);
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e1b52a..6340865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
-	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..77ac74f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2241,13 +2241,16 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages > (present_pages >> 2);
 }
 
+static DEFINE_SPINLOCK(kswapds_spinlock);
+
 /* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
-					int classzone_idx)
+static int sleeping_prematurely(struct kswapd *kswapd, int order,
+				long remaining, int classzone_idx)
 {
 	int i;
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
+	pg_data_t *pgdat = kswapd->kswapd_pgdat;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
@@ -2570,28 +2573,31 @@ out:
 	return order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
+				int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 
 	if (freezing(current) || kthread_should_stop())
 		return;
 
-	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
-		finish_wait(&pgdat->kswapd_wait, &wait);
-		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+		finish_wait(wait_h, &wait);
+		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 	}
 
 	/*
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -2611,7 +2617,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		else
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
-	finish_wait(&pgdat->kswapd_wait, &wait);
+	finish_wait(wait_h, &wait);
 }
 
 /*
@@ -2627,20 +2633,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
  * If there are applications that are active memory-allocators
  * (most normal use), this basically shouldn't matter.
  */
-static int kswapd(void *p)
+int kswapd(void *p)
 {
 	unsigned long order;
 	int classzone_idx;
-	pg_data_t *pgdat = (pg_data_t*)p;
+	struct kswapd *kswapd_p = (struct kswapd *)p;
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
 	};
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	const struct cpumask *cpumask;
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
+	BUG_ON(pgdat->kswapd_wait != wait_h);
+	cpumask = cpumask_of_node(pgdat->node_id);
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 	current->reclaim_state = &reclaim_state;
@@ -2679,7 +2689,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx);
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
@@ -2719,13 +2729,13 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 		pgdat->kswapd_max_order = order;
 		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
 	}
-	if (!waitqueue_active(&pgdat->kswapd_wait))
+	if (!waitqueue_active(pgdat->kswapd_wait))
 		return;
 	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	wake_up_interruptible(&pgdat->kswapd_wait);
+	wake_up_interruptible(pgdat->kswapd_wait);
 }
 
 /*
@@ -2817,12 +2827,23 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 		for_each_node_state(nid, N_HIGH_MEMORY) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 			const struct cpumask *mask;
+			struct kswapd *kswapd_p;
+			struct task_struct *kswapd_thr;
+			wait_queue_head_t *wait;
 
 			mask = cpumask_of_node(pgdat->node_id);
 
+			spin_lock(&kswapds_spinlock);
+			wait = pgdat->kswapd_wait;
+			kswapd_p = container_of(wait, struct kswapd,
+						kswapd_wait);
+			kswapd_thr = kswapd_p->kswapd_task;
+			spin_unlock(&kswapds_spinlock);
+
 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
 				/* One of our CPUs online: restore mask */
-				set_cpus_allowed_ptr(pgdat->kswapd, mask);
+				if (kswapd_thr)
+					set_cpus_allowed_ptr(kswapd_thr, mask);
 		}
 	}
 	return NOTIFY_OK;
@@ -2835,18 +2856,31 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 int kswapd_run(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
+	struct task_struct *kswapd_thr;
+	struct kswapd *kswapd_p;
 	int ret = 0;
 
-	if (pgdat->kswapd)
+	if (pgdat->kswapd_wait)
 		return 0;
 
-	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
-	if (IS_ERR(pgdat->kswapd)) {
+	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
+	if (!kswapd_p)
+		return -ENOMEM;
+
+	init_waitqueue_head(&kswapd_p->kswapd_wait);
+	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_pgdat = pgdat;
+
+	kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (IS_ERR(kswapd_thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
 		printk("Failed to start kswapd on node %d\n",nid);
+		pgdat->kswapd_wait = NULL;
+		kfree(kswapd_p);
 		ret = -1;
-	}
+	} else
+		kswapd_p->kswapd_task = kswapd_thr;
 	return ret;
 }
 
@@ -2855,10 +2889,25 @@ int kswapd_run(int nid)
  */
 void kswapd_stop(int nid)
 {
-	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+	struct task_struct *kswapd_thr = NULL;
+	struct kswapd *kswapd_p = NULL;
+	wait_queue_head_t *wait;
+
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	spin_lock(&kswapds_spinlock);
+	wait = pgdat->kswapd_wait;
+	if (wait) {
+		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+		kswapd_thr = kswapd_p->kswapd_task;
+		kswapd_p->kswapd_task = NULL;
+	}
+	spin_unlock(&kswapds_spinlock);
+
+	if (kswapd_thr)
+		kthread_stop(kswapd_thr);
 
-	if (kswapd)
-		kthread_stop(kswapd);
+	kfree(kswapd_p);
 }
 
 static int __init kswapd_init(void)
-- 
1.7.3.1


* [PATCH V4 02/10] Add per memcg reclaim watermarks
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
  2011-04-14 22:54 ` [PATCH V4 01/10] Add kswapd descriptor Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  0:16   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 03/10] New APIs to adjust per-memcg wmarks Ying Han
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Two watermarks, "high_wmark" and "low_wmark", are added per memcg. The
per-memcg kswapd is invoked when the memcg's memory usage (usage_in_bytes)
rises above the low_wmark. The kswapd thread then reclaims pages until the
usage drops below the high_wmark.

Each watermark is calculated from the memcg's hard_limit (limit_in_bytes).
Each time the hard_limit is changed, the corresponding wmarks are
re-calculated. Since the memory controller charges only user pages, there is
no need for a "min_wmark". The wmarks are currently derived from the
individually tunable low/high_wmark_distance values, which default to 0.
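
To make the intended semantics concrete, here is a minimal sketch (not part of
the patch) written against the helpers this patch introduces; the wake-up side
only lands in a later patch of the series, so the hook name below is
hypothetical:

	/* charge path: usage has climbed above low_wmark, wake the
	 * per-memcg kswapd (hypothetical hook, added later in the series) */
	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
		wake_up_per_memcg_kswapd(mem);

	/* background reclaim loop: usage has dropped below high_wmark,
	 * so background reclaim can stop */
	if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
		goto out;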

changelog v4..v3:
1. remove legacy comments
2. rename the res_counter_check_under_high_wmark_limit
3. replace the wmark_ratio per-memcg by individual tunable for both wmarks.
4. add comments on low/high_wmark
5. add individual tunables for low/high_wmarks and remove wmark_ratio
6. replace the mem_cgroup_get_limit() call with res_counter_read_u64(); the
former returns a large value when swap is enabled.

changelog v3..v2:
1. Add VM_BUG_ON() in a couple of places.
2. Remove the spinlock on min_free_kbytes since reading slightly stale data
there has no serious consequence.
3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based on
hard_limit.

changelog v2..v1:
1. Remove the res_counter_charge on wmark due to performance concerns.
2. Move the new APIs min_free_kbytes, reclaim_wmarks into a separate commit.
3. Calculate the min_free_kbytes automatically based on the limit_in_bytes.
4. Make the wmark check consistent with the core VM, which checks free pages
instead of usage.
5. Change the wmark check to be boolean.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h  |    1 +
 include/linux/res_counter.h |   78 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    6 +++
 mm/memcontrol.c             |   48 ++++++++++++++++++++++++++
 4 files changed, 133 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..3ece36d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..77eaaa9 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -39,6 +39,14 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit that reclaim triggers.
+	 */
+	unsigned long long low_wmark_limit;
+	/*
+	 * the limit that reclaim stops.
+	 */
+	unsigned long long high_wmark_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -55,6 +63,9 @@ struct res_counter {
 
 #define RESOURCE_MAX (unsigned long long)LLONG_MAX
 
+#define CHARGE_WMARK_LOW	0x01
+#define CHARGE_WMARK_HIGH	0x02
+
 /**
  * Helpers to interact with userspace
  * res_counter_read_u64() - returns the value of the specified member.
@@ -92,6 +103,8 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_WMARK_LIMIT,
+	RES_HIGH_WMARK_LIMIT
 };
 
 /*
@@ -147,6 +160,24 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
 	return margin;
 }
 
+static inline bool
+res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->high_wmark_limit)
+		return true;
+
+	return false;
+}
+
+static inline bool
+res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->low_wmark_limit)
+		return true;
+
+	return false;
+}
+
 /**
  * Get the difference between the usage and the soft limit
  * @cnt: The counter
@@ -169,6 +200,30 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+static inline bool
+res_counter_under_low_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_low_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool
+res_counter_under_high_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_high_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -214,4 +269,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_high_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
+static inline int
+res_counter_set_low_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..206a724 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
+	counter->low_wmark_limit = RESOURCE_MAX;
+	counter->high_wmark_limit = RESOURCE_MAX;
 	counter->parent = parent;
 }
 
@@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_WMARK_LIMIT:
+		return &counter->low_wmark_limit;
+	case RES_HIGH_WMARK_LIMIT:
+		return &counter->high_wmark_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4407dd0..1ec4014 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -272,6 +272,12 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	/*
+	 * used to calculate the low/high_wmarks based on the limit_in_bytes.
+	 */
+	u64 high_wmark_distance;
+	u64 low_wmark_distance;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -813,6 +819,25 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	u64 limit;
+
+	limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+	if (mem->high_wmark_distance == 0) {
+		res_counter_set_low_wmark_limit(&mem->res, limit);
+		res_counter_set_high_wmark_limit(&mem->res, limit);
+	} else {
+		u64 low_wmark, high_wmark;
+
+		low_wmark = limit - mem->low_wmark_distance;
+		high_wmark = limit - mem->high_wmark_distance;
+
+		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
+		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+	}
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -3205,6 +3230,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3264,6 +3290,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -4521,6 +4548,27 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+/*
+ * We use low_wmark and high_wmark for triggering per-memcg kswapd.
+ * The reclaim is triggered by low_wmark (usage > low_wmark) and stopped
+ * by high_wmark (usage < high_wmark).
+ */
+int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
+				int charge_flags)
+{
+	long ret = 0;
+	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
+
+	VM_BUG_ON((charge_flags & flags) == flags);
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		ret = res_counter_under_low_wmark_limit(&mem->res);
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		ret = res_counter_under_high_wmark_limit(&mem->res);
+
+	return ret;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
-- 
1.7.3.1


* [PATCH V4 03/10] New APIs to adjust per-memcg wmarks
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
  2011-04-14 22:54 ` [PATCH V4 01/10] Add kswapd descriptor Ying Han
  2011-04-14 22:54 ` [PATCH V4 02/10] Add per memcg reclaim watermarks Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  0:25   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 04/10] Infrastructure to support per-memcg reclaim Ying Han
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Add per-memcg memory.low_wmark_distance, memory.high_wmark_distance and
memory.reclaim_wmarks APIs. The first two adjust the internal low/high wmark
calculation, and reclaim_wmarks exports the current values of the watermarks.

The low/high_wmark is calculated by subtracting the corresponding distance
from the hard_limit (limit_in_bytes).

$ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
$ cat /dev/cgroup/A/memory.limit_in_bytes
524288000

$ echo 50m >/dev/cgroup/A/memory.high_wmark_distance
$ echo 40m >/dev/cgroup/A/memory.low_wmark_distance

$ cat /dev/cgroup/A/memory.reclaim_wmarks
low_wmark 482344960
high_wmark 471859200

changelog v4..v3:
1. replace the "wmark_ratio" API with individual tunable for low/high_wmarks.

changelog v3..v2:
1. replace the "min_free_kbytes" API with "wmark_ratio". This was part of the
review feedback.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |   95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 95 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ec4014..685645c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3974,6 +3974,72 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 mem_cgroup_high_wmark_distance_read(struct cgroup *cgrp,
+					       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->high_wmark_distance;
+}
+
+static u64 mem_cgroup_low_wmark_distance_read(struct cgroup *cgrp,
+					      struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->low_wmark_distance;
+}
+
+static int mem_cgroup_high_wmark_distance_write(struct cgroup *cont,
+						struct cftype *cft,
+						const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	u64 low_wmark_distance = memcg->low_wmark_distance;
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if ((val >= limit) || (val < low_wmark_distance) ||
+	   (low_wmark_distance && val == low_wmark_distance))
+		return -EINVAL;
+
+	memcg->high_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
+static int mem_cgroup_low_wmark_distance_write(struct cgroup *cont,
+					       struct cftype *cft,
+					       const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	u64 high_wmark_distance = memcg->high_wmark_distance;
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if ((val >= limit) || (val > high_wmark_distance) ||
+	    (high_wmark_distance && val == high_wmark_distance))
+		return -EINVAL;
+
+	memcg->low_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4265,6 +4331,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 	mutex_unlock(&memcg_oom_mutex);
 }
 
+static int mem_cgroup_wmark_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	u64 low_wmark, high_wmark;
+
+	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+
+	cb->fill(cb, "low_wmark", low_wmark);
+	cb->fill(cb, "high_wmark", high_wmark);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4368,6 +4449,20 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "high_wmark_distance",
+		.write_string = mem_cgroup_high_wmark_distance_write,
+		.read_u64 = mem_cgroup_high_wmark_distance_read,
+	},
+	{
+		.name = "low_wmark_distance",
+		.write_string = mem_cgroup_low_wmark_distance_write,
+		.read_u64 = mem_cgroup_low_wmark_distance_read,
+	},
+	{
+		.name = "reclaim_wmarks",
+		.read_map = mem_cgroup_wmark_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1


* [PATCH V4 04/10] Infrastructure to support per-memcg reclaim.
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (2 preceding siblings ...)
  2011-04-14 22:54 ` [PATCH V4 03/10] New APIs to adjust per-memcg wmarks Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  0:34   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 05/10] Implement the select_victim_node within memcg Ying Han
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Add a kswapd_mem field to the kswapd descriptor which links the kswapd kernel
thread to a memcg. The per-memcg kswapd sleeps in the wait queue headed at the
kswapd_wait field of the kswapd descriptor.

The kswapd() function is now shared between the global and per-memcg kswapds.
It is passed the kswapd descriptor, which contains the information of either
the node or the memcg. The new function balance_mem_cgroup_pgdat() is invoked
for a per-memcg kswapd thread; its implementation is in a following patch.

changelog v4..v3:
1. fix up kswapd_run and kswapd_stop for online_pages() and offline_pages().
2. drop the PF_MEMALLOC flag for the memcg kswapd for now, per KAMEZAWA's request.

changelog v3..v2:
1. split off from the initial patch which includes all changes of the following
three patches.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    5 ++
 include/linux/swap.h       |    5 +-
 mm/memcontrol.c            |   29 ++++++++
 mm/memory_hotplug.c        |    4 +-
 mm/vmscan.c                |  157 ++++++++++++++++++++++++++++++--------------
 5 files changed, 147 insertions(+), 53 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3ece36d..f7ffd1f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+struct kswapd;
 
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
@@ -83,6 +84,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
+extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
+				  struct kswapd *kswapd_p);
+extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
+extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f43d406..17e0511 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -30,6 +30,7 @@ struct kswapd {
 	struct task_struct *kswapd_task;
 	wait_queue_head_t kswapd_wait;
 	pg_data_t *kswapd_pgdat;
+	struct mem_cgroup *kswapd_mem;
 };
 
 int kswapd(void *p);
@@ -303,8 +304,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
 }
 #endif
 
-extern int kswapd_run(int nid);
-extern void kswapd_stop(int nid);
+extern int kswapd_run(int nid, struct mem_cgroup *mem);
+extern void kswapd_stop(int nid, struct mem_cgroup *mem);
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 685645c..c4e1904 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -278,6 +278,8 @@ struct mem_cgroup {
 	 */
 	u64 high_wmark_distance;
 	u64 low_wmark_distance;
+
+	wait_queue_head_t *kswapd_wait;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -4664,6 +4666,33 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
 	return ret;
 }
 
+int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
+{
+	if (!mem || !kswapd_p)
+		return 0;
+
+	mem->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_mem = mem;
+
+	return css_id(&mem->css);
+}
+
+void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
+{
+	if (mem)
+		mem->kswapd_wait = NULL;
+
+	return;
+}
+
+wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return NULL;
+
+	return mem->kswapd_wait;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 321fc74..2f78ff6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -462,7 +462,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages)
 	setup_per_zone_wmarks();
 	calculate_zone_inactive_ratio(zone);
 	if (onlined_pages) {
-		kswapd_run(zone_to_nid(zone));
+		kswapd_run(zone_to_nid(zone), NULL);
 		node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
 	}
 
@@ -897,7 +897,7 @@ repeat:
 	calculate_zone_inactive_ratio(zone);
 	if (!node_present_pages(node)) {
 		node_clear_state(node, N_HIGH_MEMORY);
-		kswapd_stop(node);
+		kswapd_stop(node, NULL);
 	}
 
 	vm_total_pages = nr_free_pagecache_pages();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 77ac74f..4deb9c8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2242,6 +2242,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 }
 
 static DEFINE_SPINLOCK(kswapds_spinlock);
+#define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
 
 /* is kswapd sleeping prematurely? */
 static int sleeping_prematurely(struct kswapd *kswapd, int order,
@@ -2251,11 +2252,16 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
 	pg_data_t *pgdat = kswapd->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd->kswapd_mem;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
 
+	/* Doesn't support for per-memcg reclaim */
+	if (mem)
+		return false;
+
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
@@ -2598,19 +2604,25 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	 * go fully to sleep until explicitly woken up.
 	 */
 	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
-		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+		if (is_node_kswapd(kswapd_p)) {
+			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
-		/*
-		 * vmstat counters are not perfectly accurate and the estimated
-		 * value for counters such as NR_FREE_PAGES can deviate from the
-		 * true value by nr_online_cpus * threshold. To avoid the zone
-		 * watermarks being breached while under pressure, we reduce the
-		 * per-cpu vmstat threshold while kswapd is awake and restore
-		 * them before going back to sleep.
-		 */
-		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
-		schedule();
-		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+			/*
+			 * vmstat counters are not perfectly accurate and the
+			 * estimated value for counters such as NR_FREE_PAGES
+			 * can deviate from the true value by nr_online_cpus *
+			 * threshold. To avoid the zone watermarks being
+			 * breached while under pressure, we reduce the per-cpu
+			 * vmstat threshold while kswapd is awake and restore
+			 * them before going back to sleep.
+			 */
+			set_pgdat_percpu_threshold(pgdat,
+						   calculate_normal_threshold);
+			schedule();
+			set_pgdat_percpu_threshold(pgdat,
+						calculate_pressure_threshold);
+		} else
+			schedule();
 	} else {
 		if (remaining)
 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2620,6 +2632,12 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	finish_wait(wait_h, &wait);
 }
 
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+							int order)
+{
+	return 0;
+}
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2639,6 +2657,7 @@ int kswapd(void *p)
 	int classzone_idx;
 	struct kswapd *kswapd_p = (struct kswapd *)p;
 	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
 	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 
@@ -2649,10 +2668,12 @@ int kswapd(void *p)
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
-	BUG_ON(pgdat->kswapd_wait != wait_h);
-	cpumask = cpumask_of_node(pgdat->node_id);
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
+	if (is_node_kswapd(kswapd_p)) {
+		BUG_ON(pgdat->kswapd_wait != wait_h);
+		cpumask = cpumask_of_node(pgdat->node_id);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(tsk, cpumask);
+	}
 	current->reclaim_state = &reclaim_state;
 
 	/*
@@ -2667,7 +2688,10 @@ int kswapd(void *p)
 	 * us from recursively trying to free more memory as we're
 	 * trying to free the first piece of memory in the first place).
 	 */
-	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	if (is_node_kswapd(kswapd_p))
+		tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	else
+		tsk->flags |= PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
 	order = 0;
@@ -2677,24 +2701,29 @@ int kswapd(void *p)
 		int new_classzone_idx;
 		int ret;
 
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order = 0;
-		pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		if (order < new_order || classzone_idx > new_classzone_idx) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation or has tigher zone constraints
-			 */
-			order = new_order;
-			classzone_idx = new_classzone_idx;
-		} else {
-			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
-			order = pgdat->kswapd_max_order;
-			classzone_idx = pgdat->classzone_idx;
+		if (is_node_kswapd(kswapd_p)) {
+			new_order = pgdat->kswapd_max_order;
+			new_classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		}
+			if (order < new_order ||
+					classzone_idx > new_classzone_idx) {
+				/*
+				 * Don't sleep if someone wants a larger 'order'
+				 * allocation or has tigher zone constraints
+				 */
+				order = new_order;
+				classzone_idx = new_classzone_idx;
+			} else {
+				kswapd_try_to_sleep(kswapd_p, order,
+						    classzone_idx);
+				order = pgdat->kswapd_max_order;
+				classzone_idx = pgdat->classzone_idx;
+				pgdat->kswapd_max_order = 0;
+				pgdat->classzone_idx = MAX_NR_ZONES - 1;
+			}
+		} else
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -2705,8 +2734,13 @@ int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			order = balance_pgdat(pgdat, order, &classzone_idx);
+			if (is_node_kswapd(kswapd_p)) {
+				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
+								order);
+				order = balance_pgdat(pgdat, order,
+							&classzone_idx);
+			} else
+				balance_mem_cgroup_pgdat(mem, order);
 		}
 	}
 	return 0;
@@ -2853,30 +2887,53 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
  * This kswapd start function will be called by init and node-hot-add.
  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
  */
-int kswapd_run(int nid)
+int kswapd_run(int nid, struct mem_cgroup *mem)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
 	struct task_struct *kswapd_thr;
+	pg_data_t *pgdat = NULL;
 	struct kswapd *kswapd_p;
+	static char name[TASK_COMM_LEN];
+	int memcg_id;
 	int ret = 0;
 
-	if (pgdat->kswapd_wait)
-		return 0;
+	if (!mem) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kswapd_wait)
+			return ret;
+	}
 
 	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
 	if (!kswapd_p)
 		return -ENOMEM;
 
 	init_waitqueue_head(&kswapd_p->kswapd_wait);
-	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
-	kswapd_p->kswapd_pgdat = pgdat;
 
-	kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (!mem) {
+		pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+		kswapd_p->kswapd_pgdat = pgdat;
+		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
+	} else {
+		memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
+		if (!memcg_id) {
+			kfree(kswapd_p);
+			return ret;
+		}
+		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
+	}
+
+	kswapd_thr = kthread_run(kswapd, kswapd_p, name);
 	if (IS_ERR(kswapd_thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
-		printk("Failed to start kswapd on node %d\n",nid);
-		pgdat->kswapd_wait = NULL;
+		if (!mem) {
+			printk(KERN_ERR "Failed to start kswapd on node %d\n",
+								nid);
+			pgdat->kswapd_wait = NULL;
+		} else {
+			printk(KERN_ERR "Failed to start kswapd on memcg %d\n",
+								memcg_id);
+			mem_cgroup_clear_kswapd(mem);
+		}
 		kfree(kswapd_p);
 		ret = -1;
 	} else
@@ -2887,16 +2944,18 @@ int kswapd_run(int nid)
 /*
  * Called by memory hotplug when all memory in a node is offlined.
  */
-void kswapd_stop(int nid)
+void kswapd_stop(int nid, struct mem_cgroup *mem)
 {
 	struct task_struct *kswapd_thr = NULL;
 	struct kswapd *kswapd_p = NULL;
 	wait_queue_head_t *wait;
 
-	pg_data_t *pgdat = NODE_DATA(nid);
-
 	spin_lock(&kswapds_spinlock);
-	wait = pgdat->kswapd_wait;
+	if (!mem)
+		wait = NODE_DATA(nid)->kswapd_wait;
+	else
+		wait = mem_cgroup_kswapd_wait(mem);
+
 	if (wait) {
 		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
 		kswapd_thr = kswapd_p->kswapd_task;
@@ -2916,7 +2975,7 @@ static int __init kswapd_init(void)
 
 	swap_setup();
 	for_each_node_state(nid, N_HIGH_MEMORY)
- 		kswapd_run(nid);
+		kswapd_run(nid, NULL);
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
-- 
1.7.3.1


* [PATCH V4 05/10] Implement the select_victim_node within memcg.
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (3 preceding siblings ...)
  2011-04-14 22:54 ` [PATCH V4 04/10] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  0:40   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 06/10] Per-memcg background reclaim Ying Han
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This adds the node-selection mechanism for background reclaim: we remember the
last scanned node and always start from the next one each time. This simple
round-robin fashion provides fairness between nodes for each memcg (see the
illustration below).
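
A minimal illustration of the behaviour (not part of the patch; "mem" stands
for some memcg whose last_scanned_node starts at -1, and the machine is
assumed to have online nodes {0, 2}):

	nodemask_t nodes = node_states[N_ONLINE];	/* nodes 0 and 2 */
	int nid;

	nid = mem_cgroup_select_victim_node(mem, &nodes);	/* -1 -> 0 */
	nid = mem_cgroup_select_victim_node(mem, &nodes);	/*  0 -> 2 */
	nid = mem_cgroup_select_victim_node(mem, &nodes);	/* wraps to 0 */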

changelog v4..v3:
1. split off from the per-memcg background reclaim patch.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    3 +++
 mm/memcontrol.c            |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f7ffd1f..d4ff7f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
 				  struct kswapd *kswapd_p);
 extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
 extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
+extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
+extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
+					const nodemask_t *nodes);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4e1904..e22351a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -279,6 +279,11 @@ struct mem_cgroup {
 	u64 high_wmark_distance;
 	u64 low_wmark_distance;
 
+	/* While doing per cgroup background reclaim, we cache the
+	 * last node we reclaimed from
+	 */
+	int last_scanned_node;
+
 	wait_queue_head_t *kswapd_wait;
 };
 
@@ -1536,6 +1541,32 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
+ * Visit the first node after the last_scanned_node of @mem and use that to
+ * reclaim free pages from.
+ */
+int
+mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
+{
+	int next_nid;
+	int last_scanned;
+
+	last_scanned = mem->last_scanned_node;
+
+	/* Initial stage and start from node0 */
+	if (last_scanned == -1)
+		next_nid = 0;
+	else
+		next_nid = next_node(last_scanned, *nodes);
+
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(*nodes);
+
+	mem->last_scanned_node = next_nid;
+
+	return next_nid;
+}
+
+/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -4693,6 +4724,14 @@ wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
 	return mem->kswapd_wait;
 }
 
+int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return -1;
+
+	return mem->last_scanned_node;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4768,6 +4807,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->last_scanned_node = -1;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
 	if (parent)
-- 
1.7.3.1


* [PATCH V4 06/10] Per-memcg background reclaim.
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (4 preceding siblings ...)
  2011-04-14 22:54 ` [PATCH V4 05/10] Implement the select_victim_node within memcg Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  1:11   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 07/10] Add per-memcg zone "unreclaimable" Ying Han
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This is the main loop of per-memcg background reclaim, implemented in the
function balance_mem_cgroup_pgdat().

The function performs a priority loop similar to global reclaim. During each
iteration it invokes balance_pgdat_node() for every node on the system, which
is another new function that performs background reclaim for a single node.
After reclaiming each node, it checks mem_cgroup_watermark_ok() and breaks out
of the priority loop if it returns true.

changelog v4..v3:
1. split the select_victim_node and zone_unreclaimable changes into separate
patches.
2. remove the logic that tries to do zone balancing.

changelog v3..v2:
1. change mz->all_unreclaimable to be boolean.
2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
3. some more clean-up.

changelog v2..v1:
1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
2. share kswapd_run/kswapd_stop between per-memcg and global background
reclaim.
3. name the per-memcg kswapd thread "memcg_<id>" (css->id); the global kswapd
keeps its existing name.
4. fix a race in kswapd_stop where the per-memcg per-zone info could be
accessed after freeing.
5. add fairness in the zonelist where the memcg remembers the last zone
reclaimed from.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/vmscan.c |  161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 161 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4deb9c8..b8345d2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -47,6 +47,8 @@
 
 #include <linux/swapops.h>
 
+#include <linux/res_counter.h>
+
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -111,6 +113,8 @@ struct scan_control {
 	 * are scanned.
 	 */
 	nodemask_t	*nodemask;
+
+	int priority;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -2632,11 +2636,168 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	finish_wait(wait_h, &wait);
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * The function is used for per-memcg LRU. It scanns all the zones of the
+ * node and returns the nr_scanned and nr_reclaimed.
+ */
+static void balance_pgdat_node(pg_data_t *pgdat, int order,
+					struct scan_control *sc)
+{
+	int i;
+	unsigned long total_scanned = 0;
+	struct mem_cgroup *mem_cont = sc->mem_cgroup;
+	int priority = sc->priority;
+
+	/*
+	 * Now scan the zone in the dma->highmem direction, and we scan
+	 * every zones for each node.
+	 *
+	 * We do this because the page allocator works in the opposite
+	 * direction.  This prevents the page allocator from allocating
+	 * pages behind kswapd's direction of progress, which would
+	 * cause too much scanning of the lower zones.
+	 */
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		sc->nr_scanned = 0;
+		shrink_zone(priority, zone, sc);
+		total_scanned += sc->nr_scanned;
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
+			sc->may_writepage = 1;
+		}
+	}
+
+	sc->nr_scanned = total_scanned;
+	return;
+}
+
+/*
+ * Per cgroup background reclaim.
+ * TODO: Take off the order since memcg always do order 0
+ */
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+					      int order)
+{
+	int i, nid;
+	int start_node;
+	int priority;
+	bool wmark_ok;
+	int loop;
+	pg_data_t *pgdat;
+	nodemask_t do_nodes;
+	unsigned long total_scanned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = ULONG_MAX,
+		.swappiness = vm_swappiness,
+		.order = order,
+		.mem_cgroup = mem_cont,
+	};
+
+loop_again:
+	do_nodes = NODE_MASK_NONE;
+	sc.may_writepage = !laptop_mode;
+	sc.nr_reclaimed = 0;
+	total_scanned = 0;
+
+	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		sc.priority = priority;
+		wmark_ok = false;
+		loop = 0;
+
+		/* The swap token gets in the way of swapout... */
+		if (!priority)
+			disable_swap_token();
+
+		if (priority == DEF_PRIORITY)
+			do_nodes = node_states[N_ONLINE];
+
+		while (1) {
+			nid = mem_cgroup_select_victim_node(mem_cont,
+							&do_nodes);
+
+			/* Indicate we have cycled the nodelist once
+			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
+			 * kswapd burning cpu cycles.
+			 */
+			if (loop == 0) {
+				start_node = nid;
+				loop++;
+			} else if (nid == start_node)
+				break;
+
+			pgdat = NODE_DATA(nid);
+			balance_pgdat_node(pgdat, order, &sc);
+			total_scanned += sc.nr_scanned;
+
+			/* Set the node which has at least
+			 * one reclaimable zone
+			 */
+			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+				struct zone *zone = pgdat->node_zones + i;
+
+				if (!populated_zone(zone))
+					continue;
+			}
+			if (i < 0)
+				node_clear(nid, do_nodes);
+
+			if (mem_cgroup_watermark_ok(mem_cont,
+							CHARGE_WMARK_HIGH)) {
+				wmark_ok = true;
+				goto out;
+			}
+
+			if (nodes_empty(do_nodes)) {
+				wmark_ok = true;
+				goto out;
+			}
+		}
+
+		/* All the nodes are unreclaimable, kswapd is done */
+		if (nodes_empty(do_nodes)) {
+			wmark_ok = true;
+			goto out;
+		}
+
+		if (total_scanned && priority < DEF_PRIORITY - 2)
+			congestion_wait(WRITE, HZ/10);
+
+		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
+			break;
+	}
+out:
+	if (!wmark_ok) {
+		cond_resched();
+
+		try_to_freeze();
+
+		goto loop_again;
+	}
+
+	return sc.nr_reclaimed;
+}
+#else
 static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
 							int order)
 {
 	return 0;
 }
+#endif
 
 /*
  * The background pageout daemon, started as a kernel thread
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V4 07/10] Add per-memcg zone "unreclaimable"
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (5 preceding siblings ...)
  2011-04-14 22:54 ` [PATCH V4 06/10] Per-memcg background reclaim Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  1:32   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 08/10] Enable per-memcg background reclaim Ying Han
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

After reclaiming from each node per memcg, it checks mem_cgroup_watermark_ok()
and breaks out of the priority loop if it returns true. A per-memcg zone is
marked as "unreclaimable" if the scanning rate is much greater than the
reclaiming rate on the per-memcg LRU. The bit is cleared when a page charged
to the memcg is freed. Kswapd breaks out of the priority loop if all the
zones are marked as "unreclaimable".
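
Concretely, the heuristic boils down to the following condition (names as
introduced by this patch; shown here only as a sketch of the core check, the
full helpers are in the diff below):

	/* a per-memcg zone counts as reclaimable while its scan count stays
	 * below N times the size of its reclaimable LRUs */
	reclaimable = mz->pages_scanned <
			mem_cgroup_zone_reclaimable_pages(mz) *
			ZONE_RECLAIMABLE_RATE;	/* 6, shared with global reclaim */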

changelog v4..v3:
1. split off from the per-memcg background reclaim patch in V3.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |   30 ++++++++++++++
 include/linux/swap.h       |    2 +
 mm/memcontrol.c            |   96 ++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |   19 +++++++++
 4 files changed, 147 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d4ff7f2..a8159f5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -155,6 +155,12 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+				unsigned long nr_scanned);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
@@ -345,6 +351,25 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
+						struct zone *zone,
+						unsigned long nr_scanned)
+{
+}
+
+static inline void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem,
+							struct page *page)
+{
+}
+static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
+		struct zone *zone)
+{
+}
+static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
+						struct zone *zone)
+{
+	return false;
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
@@ -363,6 +388,11 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
 {
 }
 
+static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
+								int zid)
+{
+	return false;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 17e0511..319b800 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -160,6 +160,8 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
+#define ZONE_RECLAIMABLE_RATE 6
+
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e22351a..da6a130 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
 	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	unsigned long		pages_scanned;	/* since last reclaim */
+	bool			all_unreclaimable;	/* All pages pinned */
 };
+
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
@@ -1135,6 +1138,96 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
+static unsigned long mem_cgroup_zone_reclaimable_pages(
+					struct mem_cgroup_per_zone *mz)
+{
+	int nr;
+	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
+		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
+			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+
+	return nr;
+}
+
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+						unsigned long nr_scanned)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->pages_scanned += nr_scanned;
+}
+
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->pages_scanned <
+				mem_cgroup_zone_reclaimable_pages(mz) *
+				ZONE_RECLAIMABLE_RATE;
+	return 0;
+}
+
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return false;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->all_unreclaimable;
+
+	return false;
+}
+
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->all_unreclaimable = true;
+}
+
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return;
+
+	mz = page_cgroup_zoneinfo(mem, page);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -2801,6 +2894,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	 * special functions.
 	 */
 
+	mem_cgroup_clear_unreclaimable(mem, page);
 	unlock_page_cgroup(pc);
 	/*
 	 * even after unlock, we have mem->res.usage here and this memcg
@@ -4569,6 +4663,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
 	}
 	return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8345d2..c081112 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
+
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
+
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
@@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
@@ -2648,6 +2652,7 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
 	unsigned long total_scanned = 0;
 	struct mem_cgroup *mem_cont = sc->mem_cgroup;
 	int priority = sc->priority;
+	int nid = pgdat->node_id;
 
 	/*
 	 * Now scan the zone in the dma->highmem direction, and we scan
@@ -2664,10 +2669,20 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
 		if (!populated_zone(zone))
 			continue;
 
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+			priority != DEF_PRIORITY)
+			continue;
+
 		sc->nr_scanned = 0;
 		shrink_zone(priority, zone, sc);
 		total_scanned += sc->nr_scanned;
 
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
+			continue;
+
+		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
+			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
+
 		/*
 		 * If we've done a decent amount of scanning and
 		 * the reclaim ratio is low, start doing writepage
@@ -2752,6 +2767,10 @@ loop_again:
 
 				if (!populated_zone(zone))
 					continue;
+
+				if (!mem_cgroup_mz_unreclaimable(mem_cont,
+								zone))
+					break;
 			}
 			if (i < 0)
 				node_clear(nid, do_nodes);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V4 08/10] Enable per-memcg background reclaim.
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (6 preceding siblings ...)
  2011-04-14 22:54 ` [PATCH V4 07/10] Add per-memcg zone "unreclaimable" Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  1:34   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 09/10] Add API to export per-memcg kswapd pid Ying Han
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

By default the per-memcg background reclaim is disabled when the limit_in_bytes
is set to the maximum. kswapd_run() is called when the memcg is being resized,
and kswapd_stop() is called when the memcg is being deleted.

The per-memcg kswapd is woken up based on the usage and the low_wmark, which is
checked once per 1024 events per cpu. The memcg's kswapd is woken up if the
usage is larger than the low_wmark.
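
Roughly, the wakeup path added below works like this (a condensed call-flow
sketch of the hunks in this patch, not new code):

	memcg_check_events(mem, page)            /* charge/uncharge paths */
	  -> every WMARK_EVENTS_TARGET (1024) events on this cpu:
	       mem_cgroup_check_wmark(mem)
	         -> if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
	                wake_memcg_kswapd(mem);  /* kick this memcg's kswapd */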

changelog v4..v3:
1. move kswapd_stop to mem_cgroup_destroy based on comments from KAMEZAWA.
2. move kswapd_run to setup_per_memcg_wmarks, since the actual watermarks
determine whether per-memcg background reclaim is enabled.

changelog v3..v2:
1. some clean-ups

changelog v2..v1:
1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
2. remove checking the wmark from per-page charging. now it checks the wmark
periodically based on the event counter.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da6a130..1b23ff4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -105,10 +105,12 @@ enum mem_cgroup_events_index {
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
 	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_WMARK_EVENTS_THRESH,
 	MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
 #define SOFTLIMIT_EVENTS_TARGET (1024)
+#define WMARK_EVENTS_TARGET (1024)
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -370,6 +372,8 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 
+static void wake_memcg_kswapd(struct mem_cgroup *mem);
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
@@ -548,6 +552,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
+static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
+		wake_memcg_kswapd(mem);
+}
+
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -678,6 +688,9 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
 	case MEM_CGROUP_TARGET_SOFTLIMIT:
 		next = val + SOFTLIMIT_EVENTS_TARGET;
 		break;
+	case MEM_CGROUP_WMARK_EVENTS_THRESH:
+		next = val + WMARK_EVENTS_TARGET;
+		break;
 	default:
 		return;
 	}
@@ -701,6 +714,10 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
+		if (unlikely(__memcg_event_check(mem,
+			MEM_CGROUP_WMARK_EVENTS_THRESH))){
+			mem_cgroup_check_wmark(mem);
+		}
 	}
 }
 
@@ -845,6 +862,9 @@ static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
 
 		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
 		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+
+		if (!mem_cgroup_is_root(mem) && !mem->kswapd_wait)
+			kswapd_run(0, mem);
 	}
 }
 
@@ -4828,6 +4848,22 @@ int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
 	return mem->last_scanned_node;
 }
 
+static inline
+void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	wait_queue_head_t *wait;
+
+	if (!mem || !mem->high_wmark_distance)
+		return;
+
+	wait = mem->kswapd_wait;
+
+	if (!wait || !waitqueue_active(wait))
+		return;
+
+	wake_up_interruptible(wait);
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4931,6 +4967,7 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	kswapd_stop(0, mem);
 	mem_cgroup_put(mem);
 }
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V4 09/10] Add API to export per-memcg kswapd pid.
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (7 preceding siblings ...)
  2011-04-14 22:54 ` [PATCH V4 08/10] Enable per-memcg background reclaim Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  1:40   ` KAMEZAWA Hiroyuki
  2011-04-14 22:54 ` [PATCH V4 10/10] Add some per-memcg stats Ying Han
  2011-04-15  9:40 ` [PATCH V4 00/10] memcg: per cgroup background reclaim Michal Hocko
  10 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This adds an API which exports the per-memcg kswapd thread pid. The kswapd
thread is named "memcg_" + css_id, and the pid can be used to put the
kswapd thread into a cpu cgroup later.

$ mkdir /dev/cgroup/memory/A
$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_null 0

$ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
$ ps -ef | grep memcg
root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg

$ cat memory.kswapd_pid
memcg_3 6727

changelog v4..v3:
1. Add the API based on KAMEZAWA's request on patch v3.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/swap.h |    2 ++
 mm/memcontrol.c      |   33 +++++++++++++++++++++++++++++++++
 mm/vmscan.c          |    2 +-
 3 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 319b800..2d3e21a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -34,6 +34,8 @@ struct kswapd {
 };
 
 int kswapd(void *p);
+extern spinlock_t kswapds_spinlock;
+
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1b23ff4..606b680 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4493,6 +4493,35 @@ static int mem_cgroup_wmark_read(struct cgroup *cgrp,
 	return 0;
 }
 
+static int mem_cgroup_kswapd_pid_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	struct task_struct *kswapd_thr = NULL;
+	struct kswapd *kswapd_p = NULL;
+	wait_queue_head_t *wait;
+	char name[TASK_COMM_LEN];
+	pid_t pid = 0;
+
+	sprintf(name, "memcg_null");
+
+	spin_lock(&kswapds_spinlock);
+	wait = mem_cgroup_kswapd_wait(mem);
+	if (wait) {
+		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+		kswapd_thr = kswapd_p->kswapd_task;
+		if (kswapd_thr) {
+			get_task_comm(name, kswapd_thr);
+			pid = kswapd_thr->pid;
+		}
+	}
+	spin_unlock(&kswapds_spinlock);
+
+	cb->fill(cb, name, pid);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4610,6 +4639,10 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "reclaim_wmarks",
 		.read_map = mem_cgroup_wmark_read,
 	},
+	{
+		.name = "kswapd_pid",
+		.read_map = mem_cgroup_kswapd_pid_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c081112..df4e5dd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2249,7 +2249,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages > (present_pages >> 2);
 }
 
-static DEFINE_SPINLOCK(kswapds_spinlock);
+DEFINE_SPINLOCK(kswapds_spinlock);
 #define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
 
 /* is kswapd sleeping prematurely? */
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V4 10/10] Add some per-memcg stats
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (8 preceding siblings ...)
  2011-04-14 22:54 ` [PATCH V4 09/10] Add API to export per-memcg kswapd pid Ying Han
@ 2011-04-14 22:54 ` Ying Han
  2011-04-15  9:40 ` [PATCH V4 00/10] memcg: per cgroup background reclaim Michal Hocko
  10 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-14 22:54 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

A set of statistics is added to memory.stat to monitor per-cgroup
kswapd performance.

$cat /dev/cgroup/yinghan/memory.stat
kswapd_steal 12588994
pg_pgsteal 0
kswapd_pgscan 18629519
pg_scan 0
pgrefill 2893517
pgoutrun 5342267948
allocstall 0

changelog v2..v1:
1. account the stats using events instead of stat counters.
2. document the new stats in Documentation/cgroups/memory.txt.

Signed-off-by: Ying Han <yinghan@google.com>
---
 Documentation/cgroups/memory.txt |   14 +++++++
 include/linux/memcontrol.h       |   52 +++++++++++++++++++++++++++
 mm/memcontrol.c                  |   72 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                      |   28 ++++++++++++--
 4 files changed, 162 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b6ed61c..29dee73 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,13 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+kswapd_steal	- # of pages reclaimed by kswapd
+pg_pgsteal	- # of pages reclaimed by direct reclaim
+kswapd_pgscan	- # of pages scanned by kswapd
+pg_scan		- # of pages scanned by direct reclaim
+pgrefill	- # of pages scanned on the active list
+pgoutrun	- # of times kswapd was invoked
+allocstall	- # of times direct reclaim was invoked
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -406,6 +413,13 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_kswapd_steal	- sum of all children's "kswapd_steal"
+total_pg_pgsteal	- sum of all children's "pg_pgsteal"
+total_kswapd_pgscan	- sum of all children's "kswapd_pgscan"
+total_pg_scan		- sum of all children's "pg_scan"
+total_pgrefill		- sum of all children's "pgrefill"
+total_pgoutrun		- sum of all children's "pgoutrun"
+total_allocstall	- sum of all children's "allocstall"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a8159f5..0b7fb22 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -162,6 +162,15 @@ void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
 void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
 				unsigned long nr_scanned);
 
+/* background reclaim stats */
+void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pgrefill(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_outrun(struct mem_cgroup *memcg, int val);
+void mem_cgroup_alloc_stall(struct mem_cgroup *memcg, int val);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
 #endif
@@ -393,6 +402,49 @@ static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
 {
 	return false;
 }
+
+/* background reclaim stats */
+static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
+					   int val)
+{
+}
+
+static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
+				       int val)
+{
+}
+
+static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
+					    int val)
+{
+}
+
+static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
+					int val)
+{
+}
+
+static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
+				       int val)
+{
+}
+
+static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
+					int val)
+{
+}
+
+static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
+					  int val)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 606b680..7da0ebb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,13 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	MEM_CGROUP_EVENTS_KSWAPD_STEAL, /* # of pages reclaimed from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSTEAL, /* # of pages reclaimed from ttfp */
+	MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, /* # of pages scanned from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSCAN, /* # of pages scanned from ttfp */
+	MEM_CGROUP_EVENTS_PGREFILL, /* # of pages scanned on active list */
+	MEM_CGROUP_EVENTS_PGOUTRUN, /* # of triggers of background reclaim */
+	MEM_CGROUP_EVENTS_ALLOCSTALL, /* # of triggers of direct reclaim */
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -611,6 +618,41 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_kswapd_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_STEAL], val);
+}
+
+void mem_cgroup_pg_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSTEAL], val);
+}
+
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_PGSCAN], val);
+}
+
+void mem_cgroup_pg_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSCAN], val);
+}
+
+void mem_cgroup_pgrefill(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGREFILL], val);
+}
+
+void mem_cgroup_pg_outrun(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGOUTRUN], val);
+}
+
+void mem_cgroup_alloc_stall(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_ALLOCSTALL], val);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -3946,6 +3988,13 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_KSWAPD_STEAL,
+	MCS_PG_PGSTEAL,
+	MCS_KSWAPD_PGSCAN,
+	MCS_PG_PGSCAN,
+	MCS_PGREFILL,
+	MCS_PGOUTRUN,
+	MCS_ALLOCSTALL,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3968,6 +4017,13 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"kswapd_steal", "total_kswapd_steal"},
+	{"pg_pgsteal", "total_pg_pgsteal"},
+	{"kswapd_pgscan", "total_kswapd_pgscan"},
+	{"pg_scan", "total_pg_scan"},
+	{"pgrefill", "total_pgrefill"},
+	{"pgoutrun", "total_pgoutrun"},
+	{"allocstall", "total_allocstall"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3997,6 +4053,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
+	/* kswapd stat */
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_STEAL);
+	s->stat[MCS_KSWAPD_STEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSTEAL);
+	s->stat[MCS_PG_PGSTEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN);
+	s->stat[MCS_KSWAPD_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSCAN);
+	s->stat[MCS_PG_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGREFILL);
+	s->stat[MCS_PGREFILL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGOUTRUN);
+	s->stat[MCS_PGOUTRUN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_ALLOCSTALL);
+	s->stat[MCS_ALLOCSTALL] += val;
+
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df4e5dd..af15627 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1421,6 +1421,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_pgscan(sc->mem_cgroup, nr_scanned);
+		else
+			mem_cgroup_pg_pgscan(sc->mem_cgroup, nr_scanned);
 	}
 
 	if (nr_taken == 0) {
@@ -1441,9 +1445,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	local_irq_disable();
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	if (scanning_global_lru(sc)) {
+		if (current_is_kswapd())
+			__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+		__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	} else {
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
+		else
+			mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
+	}
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
@@ -1541,7 +1552,12 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
-	__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	if (scanning_global_lru(sc))
+		__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	else
+		mem_cgroup_pgrefill(sc->mem_cgroup, pgscanned);
+
+
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -nr_taken);
 	else
@@ -2054,6 +2070,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
+	else
+		mem_cgroup_alloc_stall(sc->mem_cgroup, 1);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
@@ -2729,6 +2747,8 @@ loop_again:
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
 
+	mem_cgroup_pg_outrun(mem_cont, 1);
+
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc.priority = priority;
 		wmark_ok = false;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 01/10] Add kswapd descriptor
  2011-04-14 22:54 ` [PATCH V4 01/10] Add kswapd descriptor Ying Han
@ 2011-04-15  0:04   ` KAMEZAWA Hiroyuki
  2011-04-15  3:35     ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  0:04 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:20 -0700
Ying Han <yinghan@google.com> wrote:

> There is a kswapd kernel thread for each numa node. We will add a different
> kswapd for each memcg. The kswapd is sleeping in the wait queue headed at
> kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
> information of node or memcg and it allows the global and per-memcg background
> reclaim to share common reclaim algorithms.
> 
> This patch adds the kswapd descriptor and moves the per-node kswapd to use the
> new structure.
> 

No objections to your direction but some comments.

> changelog v2..v1:
> 1. dynamic allocate kswapd descriptor and initialize the wait_queue_head of pgdat
> at kswapd_run.
> 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup kswapd
> descriptor.
> 
> changelog v3..v2:
> 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later patch.
> 2. rename thr in kswapd_run to something else.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/mmzone.h |    3 +-
>  include/linux/swap.h   |    7 ++++
>  mm/page_alloc.c        |    1 -
>  mm/vmscan.c            |   95 ++++++++++++++++++++++++++++++++++++------------
>  4 files changed, 80 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 628f07b..6cba7d2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -640,8 +640,7 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> -	wait_queue_head_t kswapd_wait;
> -	struct task_struct *kswapd;
> +	wait_queue_head_t *kswapd_wait;
>  	int kswapd_max_order;
>  	enum zone_type classzone_idx;

I think pg_data_t should include struct kswapd in it, as

	struct pglist_data {
	.....
		struct kswapd	kswapd;
	};
and you can add a macro as

#define kswapd_waitqueue(kswapd)	(&(kswapd)->kswapd_wait)
if it looks better.

I recommend this because I think it's better to have 'struct kswapd'
on the same page as pg_data_t or struct memcg.
Is there a benefit to kmalloc()ing struct kswapd on demand?



>  } pg_data_t;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ed6ebe6..f43d406 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
>  	return current->flags & PF_KSWAPD;
>  }
>  
> +struct kswapd {
> +	struct task_struct *kswapd_task;
> +	wait_queue_head_t kswapd_wait;
> +	pg_data_t *kswapd_pgdat;
> +};
> +
> +int kswapd(void *p);
>  /*
>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>   * be swapped to.  The swap type and the offset into that swap type are
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e1b52a..6340865 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  
>  	pgdat_resize_init(pgdat);
>  	pgdat->nr_zones = 0;
> -	init_waitqueue_head(&pgdat->kswapd_wait);
>  	pgdat->kswapd_max_order = 0;
>  	pgdat_page_cgroup_init(pgdat);
>  	
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 060e4c1..77ac74f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2241,13 +2241,16 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
>  	return balanced_pages > (present_pages >> 2);
>  }
>  
> +static DEFINE_SPINLOCK(kswapds_spinlock);
> +
Maybe better to explain what this lock is for.

It seems we need this because the kswapd descriptor is allocated after the
node comes online, right?

Thanks,
-Kame

>  /* is kswapd sleeping prematurely? */
> -static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> -					int classzone_idx)
> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
> +				long remaining, int classzone_idx)
>  {
>  	int i;
>  	unsigned long balanced = 0;
>  	bool all_zones_ok = true;
> +	pg_data_t *pgdat = kswapd->kswapd_pgdat;
>  
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
> @@ -2570,28 +2573,31 @@ out:
>  	return order;
>  }
>  
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> +static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
> +				int classzone_idx)
>  {
>  	long remaining = 0;
>  	DEFINE_WAIT(wait);
> +	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>  
>  	if (freezing(current) || kthread_should_stop())
>  		return;
>  
> -	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> +	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>  
>  	/* Try to sleep for a short interval */
> -	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
> +	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
>  		remaining = schedule_timeout(HZ/10);
> -		finish_wait(&pgdat->kswapd_wait, &wait);
> -		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> +		finish_wait(wait_h, &wait);
> +		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>  	}
>  
>  	/*
>  	 * After a short sleep, check if it was a premature sleep. If not, then
>  	 * go fully to sleep until explicitly woken up.
>  	 */
> -	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
> +	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
>  		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>  
>  		/*
> @@ -2611,7 +2617,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  		else
>  			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>  	}
> -	finish_wait(&pgdat->kswapd_wait, &wait);
> +	finish_wait(wait_h, &wait);
>  }
>  
>  /*
> @@ -2627,20 +2633,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>   * If there are applications that are active memory-allocators
>   * (most normal use), this basically shouldn't matter.
>   */
> -static int kswapd(void *p)
> +int kswapd(void *p)
>  {
>  	unsigned long order;
>  	int classzone_idx;
> -	pg_data_t *pgdat = (pg_data_t*)p;
> +	struct kswapd *kswapd_p = (struct kswapd *)p;
> +	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>  	struct task_struct *tsk = current;
>  
>  	struct reclaim_state reclaim_state = {
>  		.reclaimed_slab = 0,
>  	};
> -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +	const struct cpumask *cpumask;
>  
>  	lockdep_set_current_reclaim_state(GFP_KERNEL);
>  
> +	BUG_ON(pgdat->kswapd_wait != wait_h);
> +	cpumask = cpumask_of_node(pgdat->node_id);
>  	if (!cpumask_empty(cpumask))
>  		set_cpus_allowed_ptr(tsk, cpumask);
>  	current->reclaim_state = &reclaim_state;
> @@ -2679,7 +2689,7 @@ static int kswapd(void *p)
>  			order = new_order;
>  			classzone_idx = new_classzone_idx;
>  		} else {
> -			kswapd_try_to_sleep(pgdat, order, classzone_idx);
> +			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
>  			order = pgdat->kswapd_max_order;
>  			classzone_idx = pgdat->classzone_idx;
>  			pgdat->kswapd_max_order = 0;
> @@ -2719,13 +2729,13 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  		pgdat->kswapd_max_order = order;
>  		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
>  	}
> -	if (!waitqueue_active(&pgdat->kswapd_wait))
> +	if (!waitqueue_active(pgdat->kswapd_wait))
>  		return;
>  	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
>  		return;
>  
>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
> -	wake_up_interruptible(&pgdat->kswapd_wait);
> +	wake_up_interruptible(pgdat->kswapd_wait);
>  }
>  
>  /*
> @@ -2817,12 +2827,23 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  		for_each_node_state(nid, N_HIGH_MEMORY) {
>  			pg_data_t *pgdat = NODE_DATA(nid);
>  			const struct cpumask *mask;
> +			struct kswapd *kswapd_p;
> +			struct task_struct *kswapd_thr;
> +			wait_queue_head_t *wait;
>  
>  			mask = cpumask_of_node(pgdat->node_id);
>  
> +			spin_lock(&kswapds_spinlock);
> +			wait = pgdat->kswapd_wait;
> +			kswapd_p = container_of(wait, struct kswapd,
> +						kswapd_wait);
> +			kswapd_thr = kswapd_p->kswapd_task;
> +			spin_unlock(&kswapds_spinlock);
> +
>  			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>  				/* One of our CPUs online: restore mask */
> -				set_cpus_allowed_ptr(pgdat->kswapd, mask);
> +				if (kswapd_thr)
> +					set_cpus_allowed_ptr(kswapd_thr, mask);
>  		}
>  	}
>  	return NOTIFY_OK;
> @@ -2835,18 +2856,31 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  int kswapd_run(int nid)
>  {
>  	pg_data_t *pgdat = NODE_DATA(nid);
> +	struct task_struct *kswapd_thr;
> +	struct kswapd *kswapd_p;
>  	int ret = 0;
>  
> -	if (pgdat->kswapd)
> +	if (pgdat->kswapd_wait)
>  		return 0;
>  
> -	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> -	if (IS_ERR(pgdat->kswapd)) {
> +	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
> +	if (!kswapd_p)
> +		return -ENOMEM;
> +
> +	init_waitqueue_head(&kswapd_p->kswapd_wait);
> +	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +	kswapd_p->kswapd_pgdat = pgdat;
> +
> +	kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> +	if (IS_ERR(kswapd_thr)) {
>  		/* failure at boot is fatal */
>  		BUG_ON(system_state == SYSTEM_BOOTING);
>  		printk("Failed to start kswapd on node %d\n",nid);
> +		pgdat->kswapd_wait = NULL;
> +		kfree(kswapd_p);
>  		ret = -1;
> -	}
> +	} else
> +		kswapd_p->kswapd_task = kswapd_thr;
>  	return ret;
>  }
>  
> @@ -2855,10 +2889,25 @@ int kswapd_run(int nid)
>   */
>  void kswapd_stop(int nid)
>  {
> -	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
> +	struct task_struct *kswapd_thr = NULL;
> +	struct kswapd *kswapd_p = NULL;
> +	wait_queue_head_t *wait;
> +
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +
> +	spin_lock(&kswapds_spinlock);
> +	wait = pgdat->kswapd_wait;
> +	if (wait) {
> +		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +		kswapd_thr = kswapd_p->kswapd_task;
> +		kswapd_p->kswapd_task = NULL;
> +	}
> +	spin_unlock(&kswapds_spinlock);
> +
> +	if (kswapd_thr)
> +		kthread_stop(kswapd_thr);
>  
> -	if (kswapd)
> -		kthread_stop(kswapd);
> +	kfree(kswapd_p);
>  }
>  
>  static int __init kswapd_init(void)
> -- 
> 1.7.3.1
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 02/10] Add per memcg reclaim watermarks
  2011-04-14 22:54 ` [PATCH V4 02/10] Add per memcg reclaim watermarks Ying Han
@ 2011-04-15  0:16   ` KAMEZAWA Hiroyuki
  2011-04-15  3:45     ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  0:16 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:21 -0700
Ying Han <yinghan@google.com> wrote:

> There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
> The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
> is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
> until the usage is lower than the high_wmark.
> 
> Each watermark is calculated based on the hard_limit(limit_in_bytes) for each
> memcg. Each time the hard_limit is changed, the corresponding wmarks are
> re-calculated. Since memory controller charges only user pages, there is
> no need for a "min_wmark". The current calculation of wmarks is based on
> individual tunable low/high_wmark_distance, which are set to 0 by default.
> 
> changelog v4..v3:
> 1. remove legacy comments
> 2. rename the res_counter_check_under_high_wmark_limit
> 3. replace the wmark_ratio per-memcg by individual tunable for both wmarks.
> 4. add comments on low/high_wmark
> 5. add individual tunables for low/high_wmarks and remove wmark_ratio
> 6. replace the mem_cgroup_get_limit() call by res_count_read_u64(). The first
> one returns large value w/ swapon.
> 
> changelog v3..v2:
> 1. Add VM_BUG_ON() on couple of places.
> 2. Remove the spinlock on the min_free_kbytes since the consequence of reading
> stale data.
> 3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based on
> hard_limit.
> 
> changelog v2..v1:
> 1. Remove the res_counter_charge on wmark due to performance concern.
> 2. Move the new APIs min_free_kbytes, reclaim_wmarks into seperate commit.
> 3. Calculate the min_free_kbytes automatically based on the limit_in_bytes.
> 4. make the wmark to be consistant with core VM which checks the free pages
> instead of usage.
> 5. changed wmark to be boolean
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Some nitpicks below.



> ---
>  include/linux/memcontrol.h  |    1 +
>  include/linux/res_counter.h |   78 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/res_counter.c        |    6 +++
>  mm/memcontrol.c             |   48 ++++++++++++++++++++++++++
>  4 files changed, 133 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5a5ce70..3ece36d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index c9d625c..77eaaa9 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -39,6 +39,14 @@ struct res_counter {
>  	 */
>  	unsigned long long soft_limit;
>  	/*
> +	 * the limit that reclaim triggers.
> +	 */
> +	unsigned long long low_wmark_limit;
> +	/*
> +	 * the limit that reclaim stops.
> +	 */
> +	unsigned long long high_wmark_limit;
> +	/*
>  	 * the number of unsuccessful attempts to consume the resource
>  	 */
>  	unsigned long long failcnt;
> @@ -55,6 +63,9 @@ struct res_counter {
>  
>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
>  
> +#define CHARGE_WMARK_LOW	0x01
> +#define CHARGE_WMARK_HIGH	0x02
> +
>  /**
>   * Helpers to interact with userspace
>   * res_counter_read_u64() - returns the value of the specified member.
> @@ -92,6 +103,8 @@ enum {
>  	RES_LIMIT,
>  	RES_FAILCNT,
>  	RES_SOFT_LIMIT,
> +	RES_LOW_WMARK_LIMIT,
> +	RES_HIGH_WMARK_LIMIT
>  };
>  
>  /*
> @@ -147,6 +160,24 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
>  	return margin;
>  }
>  
> +static inline bool
> +res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->high_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +
> +static inline bool
> +res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->low_wmark_limit)
> +		return true;
> +
> +	return false;
> +}

I would prefer res_counter_under_low_wmark_limit_locked() to this name.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 03/10] New APIs to adjust per-memcg wmarks
  2011-04-14 22:54 ` [PATCH V4 03/10] New APIs to adjust per-memcg wmarks Ying Han
@ 2011-04-15  0:25   ` KAMEZAWA Hiroyuki
  2011-04-15  4:00     ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  0:25 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:22 -0700
Ying Han <yinghan@google.com> wrote:

> Add memory.low_wmark_distance, memory.high_wmark_distance and reclaim_wmarks
> APIs per-memcg. The first two adjust the internal low/high wmark calculation
> and the reclaim_wmarks exports the current value of watermarks.
> 
> By default, the low/high_wmark is calculated by subtracting the distance from
> the hard_limit(limit_in_bytes).
> 
> $ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
> $ cat /dev/cgroup/A/memory.limit_in_bytes
> 524288000
> 
> $ echo 50m >/dev/cgroup/A/memory.high_wmark_distance
> $ echo 40m >/dev/cgroup/A/memory.low_wmark_distance
> 
> $ cat /dev/cgroup/A/memory.reclaim_wmarks
> low_wmark 482344960
> high_wmark 471859200
> 
> changelog v4..v3:
> 1. replace the "wmark_ratio" API with individual tunable for low/high_wmarks.
> 
> changelog v3..v2:
> 1. replace the "min_free_kbytes" api with "wmark_ratio". This is part of
> feedbacks
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

But please add a sanity check (see below).



> ---
>  mm/memcontrol.c |   95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 95 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1ec4014..685645c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3974,6 +3974,72 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +static u64 mem_cgroup_high_wmark_distance_read(struct cgroup *cgrp,
> +					       struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	return memcg->high_wmark_distance;
> +}
> +
> +static u64 mem_cgroup_low_wmark_distance_read(struct cgroup *cgrp,
> +					      struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	return memcg->low_wmark_distance;
> +}
> +
> +static int mem_cgroup_high_wmark_distance_write(struct cgroup *cont,
> +						struct cftype *cft,
> +						const char *buffer)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> +	u64 low_wmark_distance = memcg->low_wmark_distance;
> +	unsigned long long val;
> +	u64 limit;
> +	int ret;
> +
> +	ret = res_counter_memparse_write_strategy(buffer, &val);
> +	if (ret)
> +		return -EINVAL;
> +
> +	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> +	if ((val >= limit) || (val < low_wmark_distance) ||
> +	   (low_wmark_distance && val == low_wmark_distance))
> +		return -EINVAL;
> +
> +	memcg->high_wmark_distance = val;
> +
> +	setup_per_memcg_wmarks(memcg);
> +	return 0;
> +}

IIUC, as with limit_in_bytes, 'distance' should not be settable on the ROOT
memcg, because it doesn't work there.
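
A minimal sketch of such a check, reusing the existing mem_cgroup_is_root()
helper, would be to bail out at the top of both _write handlers:

	if (mem_cgroup_is_root(memcg))
		return -EINVAL;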



> +
> +static int mem_cgroup_low_wmark_distance_write(struct cgroup *cont,
> +					       struct cftype *cft,
> +					       const char *buffer)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> +	u64 high_wmark_distance = memcg->high_wmark_distance;
> +	unsigned long long val;
> +	u64 limit;
> +	int ret;
> +
> +	ret = res_counter_memparse_write_strategy(buffer, &val);
> +	if (ret)
> +		return -EINVAL;
> +
> +	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> +	if ((val >= limit) || (val > high_wmark_distance) ||
> +	    (high_wmark_distance && val == high_wmark_distance))
> +		return -EINVAL;
> +
> +	memcg->low_wmark_distance = val;
> +
> +	setup_per_memcg_wmarks(memcg);
> +	return 0;
> +}
> +

Here, too.

I wonder whether we should have a way to hide unnecessary interfaces in the ROOT cgroup...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 04/10] Infrastructure to support per-memcg reclaim.
  2011-04-14 22:54 ` [PATCH V4 04/10] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-15  0:34   ` KAMEZAWA Hiroyuki
  2011-04-15  4:04     ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  0:34 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:23 -0700
Ying Han <yinghan@google.com> wrote:

> Add the kswapd_mem field in kswapd descriptor which links the kswapd
> kernel thread to a memcg. The per-memcg kswapd is sleeping in the wait
> queue headed at kswapd_wait field of the kswapd descriptor.
> 
> The kswapd() function is now shared between global and per-memcg kswapd. It
> is passed in with the kswapd descriptor which contains the information of
> either node or memcg. Then the new function balance_mem_cgroup_pgdat is
> invoked if it is per-mem kswapd thread, and the implementation of the function
> is on the following patch.
> 
> changelog v4..v3:
> 1. fix up the kswapd_run and kswapd_stop for online_pages() and offline_pages.
> 2. drop the PF_MEMALLOC flag for memcg kswapd for now per KAMAZAWA's request.
> 
> changelog v3..v2:
> 1. split off from the initial patch which includes all changes of the following
> three patches.
> 
> Signed-off-by: Ying Han <yinghan@google.com>


> ---
>  include/linux/memcontrol.h |    5 ++
>  include/linux/swap.h       |    5 +-
>  mm/memcontrol.c            |   29 ++++++++
>  mm/memory_hotplug.c        |    4 +-
>  mm/vmscan.c                |  157 ++++++++++++++++++++++++++++++--------------
>  5 files changed, 147 insertions(+), 53 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3ece36d..f7ffd1f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -24,6 +24,7 @@ struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
> +struct kswapd;
>  
>  /* Stats that can be updated by kernel. */
>  enum mem_cgroup_page_stat_item {
> @@ -83,6 +84,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
> +extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
> +				  struct kswapd *kswapd_p);
> +extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
> +extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f43d406..17e0511 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -30,6 +30,7 @@ struct kswapd {
>  	struct task_struct *kswapd_task;
>  	wait_queue_head_t kswapd_wait;
>  	pg_data_t *kswapd_pgdat;
> +	struct mem_cgroup *kswapd_mem;
>  };
>  
>  int kswapd(void *p);
> @@ -303,8 +304,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
>  }
>  #endif
>  
> -extern int kswapd_run(int nid);
> -extern void kswapd_stop(int nid);
> +extern int kswapd_run(int nid, struct mem_cgroup *mem);
> +extern void kswapd_stop(int nid, struct mem_cgroup *mem);
>  
>  #ifdef CONFIG_MMU
>  /* linux/mm/shmem.c */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 685645c..c4e1904 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -278,6 +278,8 @@ struct mem_cgroup {
>  	 */
>  	u64 high_wmark_distance;
>  	u64 low_wmark_distance;
> +
> +	wait_queue_head_t *kswapd_wait;
>  };

I think mem_cgroup can include 'struct kswapd' itself, so we don't need to
allocate it dynamically.
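
Roughly, the embedding being suggested could look like this (a sketch only,
keeping the field names used in the patch):

struct mem_cgroup {
	/* ... existing fields ... */
	u64 high_wmark_distance;
	u64 low_wmark_distance;

	/* embedded descriptor, no kmalloc() needed at kswapd_run() time */
	struct kswapd kswapd;
};

wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
{
	return &mem->kswapd.kswapd_wait;
}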

Other parts seem ok to me.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 05/10] Implement the select_victim_node within memcg.
  2011-04-14 22:54 ` [PATCH V4 05/10] Implement the select_victim_node within memcg Ying Han
@ 2011-04-15  0:40   ` KAMEZAWA Hiroyuki
  2011-04-15  4:36     ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  0:40 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:24 -0700
Ying Han <yinghan@google.com> wrote:

> This adds the mechanism for background reclaim, in which we remember the
> last scanned node and always start from the next one each time.
> The simple round-robin fashion provides fairness between nodes for
> each memcg.
> 
> changelog v4..v3:
> 1. split off from the per-memcg background reclaim patch.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Yeah, looks nice. Thank you for splitting.


> ---
>  include/linux/memcontrol.h |    3 +++
>  mm/memcontrol.c            |   40 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f7ffd1f..d4ff7f2 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
>  				  struct kswapd *kswapd_p);
>  extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
>  extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
> +extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
> +extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> +					const nodemask_t *nodes);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c4e1904..e22351a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -279,6 +279,11 @@ struct mem_cgroup {
>  	u64 high_wmark_distance;
>  	u64 low_wmark_distance;
>  
> +	/* While doing per cgroup background reclaim, we cache the
> +	 * last node we reclaimed from
> +	 */
> +	int last_scanned_node;
> +
>  	wait_queue_head_t *kswapd_wait;
>  };
>  
> @@ -1536,6 +1541,32 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  }
>  
>  /*
> + * Visit the first node after the last_scanned_node of @mem and use that to
> + * reclaim free pages from.
> + */
> +int
> +mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
> +{
> +	int next_nid;
> +	int last_scanned;
> +
> +	last_scanned = mem->last_scanned_node;
> +
> +	/* Initial stage and start from node0 */
> +	if (last_scanned == -1)
> +		next_nid = 0;
> +	else
> +		next_nid = next_node(last_scanned, *nodes);
> +

IIUC, mem->last_scanned_node should be initialized to MAX_NUMNODES.
Then we can remove the above check.
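
With last_scanned_node initialized to MAX_NUMNODES in mem_cgroup_create(), the
helper above reduces to the plain next_node()/first_node() wrap-around — a
sketch of the suggested simplification:

int
mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
{
	int next_nid;

	/* next_node() returns MAX_NUMNODES when it runs past the end,
	 * which also covers the freshly initialized case, so a single
	 * wrap-around handles everything. */
	next_nid = next_node(mem->last_scanned_node, *nodes);
	if (next_nid == MAX_NUMNODES)
		next_nid = first_node(*nodes);

	mem->last_scanned_node = next_nid;
	return next_nid;
}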

Thanks,
-Kame

> +	if (next_nid == MAX_NUMNODES)
> +		next_nid = first_node(*nodes);
> +
> +	mem->last_scanned_node = next_nid;
> +
> +	return next_nid;
> +}
> +
> +/*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
>   */
> @@ -4693,6 +4724,14 @@ wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
>  	return mem->kswapd_wait;
>  }
>  
> +int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
> +{
> +	if (!mem)
> +		return -1;
> +
> +	return mem->last_scanned_node;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> @@ -4768,6 +4807,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  		res_counter_init(&mem->memsw, NULL);
>  	}
>  	mem->last_scanned_child = 0;
> +	mem->last_scanned_node = -1;
>  	INIT_LIST_HEAD(&mem->oom_notify);
>  
>  	if (parent)
> -- 
> 1.7.3.1
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 06/10] Per-memcg background reclaim.
  2011-04-14 22:54 ` [PATCH V4 06/10] Per-memcg background reclaim Ying Han
@ 2011-04-15  1:11   ` KAMEZAWA Hiroyuki
  2011-04-15  6:08     ` Ying Han
  2011-04-15  6:26     ` Ying Han
  0 siblings, 2 replies; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  1:11 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:25 -0700
Ying Han <yinghan@google.com> wrote:

> This is the main loop of per-memcg background reclaim which is implemented in
> function balance_mem_cgroup_pgdat().
> 
> The function performs a priority loop similar to global reclaim. During each
> iteration it invokes balance_pgdat_node() for all nodes on the system, which
> is another new function that performs background reclaim per node. After reclaiming
> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
> it returns true.
> 
> changelog v4..v3:
> 1. split the select_victim_node and zone_unreclaimable into separate patches
> 2. remove the logic tries to do zone balancing.
> 
> changelog v3..v2:
> 1. change mz->all_unreclaimable to be boolean.
> 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
> 3. some more clean-up.
> 
> changelog v2..v1:
> 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> reclaim.
> 3. name the per-memcg memcg as "memcg-id" (css->id). And the global kswapd
> keeps the same name.
> 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
> after freeing.
> 5. add the fairness in zonelist where memcg remember the last zone reclaimed
> from.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  mm/vmscan.c |  161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 161 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4deb9c8..b8345d2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -47,6 +47,8 @@
>  
>  #include <linux/swapops.h>
>  
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -111,6 +113,8 @@ struct scan_control {
>  	 * are scanned.
>  	 */
>  	nodemask_t	*nodemask;
> +
> +	int priority;
>  };
>  
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> @@ -2632,11 +2636,168 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>  	finish_wait(wait_h, &wait);
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scans all the zones of the
> + * node and returns the nr_scanned and nr_reclaimed.
> + */
> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> +					struct scan_control *sc)
> +{
> +	int i;
> +	unsigned long total_scanned = 0;
> +	struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +	int priority = sc->priority;
> +
> +	/*
> +	 * Now scan the zone in the dma->highmem direction, and we scan
> +	 * every zones for each node.
> +	 *
> +	 * We do this because the page allocator works in the opposite
> +	 * direction.  This prevents the page allocator from allocating
> +	 * pages behind kswapd's direction of progress, which would
> +	 * cause too much scanning of the lower zones.
> +	 */

I guess this comment is a cut-and-paste from global kswapd. It applies when
alloc_page() stalls... hmm, I'd like to think about whether the dma->highmem
direction is right in this case.

As you know, memcg works against user memory, which should mostly sit in the
highmem zone. Memcg-kswapd is not for memory shortage, but for voluntary page
dropping by the _user_.

If memcg-kswapd drops pages from the lower zones first, that is still fine for
the system, because the memcg's pages should be in the higher zones when free
memory is available.

So I think the reason for the dma->highmem direction is different from global
kswapd's.




> +	for (i = 0; i < pgdat->nr_zones; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		sc->nr_scanned = 0;
> +		shrink_zone(priority, zone, sc);
> +		total_scanned += sc->nr_scanned;
> +
> +		/*
> +		 * If we've done a decent amount of scanning and
> +		 * the reclaim ratio is low, start doing writepage
> +		 * even in laptop mode
> +		 */
> +		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> +			sc->may_writepage = 1;
> +		}
> +	}
> +
> +	sc->nr_scanned = total_scanned;
> +	return;
> +}
> +
> +/*
> + * Per cgroup background reclaim.
> + * TODO: Take off the order since memcg always do order 0
> + */
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +					      int order)
> +{
> +	int i, nid;
> +	int start_node;
> +	int priority;
> +	bool wmark_ok;
> +	int loop;
> +	pg_data_t *pgdat;
> +	nodemask_t do_nodes;
> +	unsigned long total_scanned;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.nr_to_reclaim = ULONG_MAX,
> +		.swappiness = vm_swappiness,
> +		.order = order,
> +		.mem_cgroup = mem_cont,
> +	};
> +
> +loop_again:
> +	do_nodes = NODE_MASK_NONE;
> +	sc.may_writepage = !laptop_mode;

I think may_writepage should always start from '0'. We're not sure the
system is under memory shortage... we just want to release memory
voluntarily, and writepage will add huge costs, I guess.

For example,
	sc.may_writepage = !!loop
may be better for memcg.

BTW, you set nr_to_reclaim to ULONG_MAX here and never modify it later.

I think you should add some logic to set it to a sensible value.

For example, before calling shrink_zone():

sc->nr_to_reclaim = min(SWAP_CLUSTER_MAX, memcg_usage_in_this_zone() / 100);  /* 1% of this zone */

if we want fair pressure for each zone.
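
Taken together, the two suggestions amount to roughly the following sketch;
'passes' and memcg_usage_in_this_zone() are placeholder names, not code from
the patch:

	/* in balance_mem_cgroup_pgdat(): start read-only, allow writepage
	 * only after a full pass failed to reach the high watermark */
	sc.may_writepage = !!passes;

	/* in balance_pgdat_node(), before each shrink_zone(): cap the
	 * target at ~1% of the memcg's usage in this zone so that all
	 * zones see comparable pressure */
	sc->nr_to_reclaim = min_t(unsigned long, SWAP_CLUSTER_MAX,
				  memcg_usage_in_this_zone(mem_cont, zone) / 100);
	shrink_zone(priority, zone, sc);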






> +	sc.nr_reclaimed = 0;
> +	total_scanned = 0;
> +
> +	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +		sc.priority = priority;
> +		wmark_ok = false;
> +		loop = 0;
> +
> +		/* The swap token gets in the way of swapout... */
> +		if (!priority)
> +			disable_swap_token();
> +
> +		if (priority == DEF_PRIORITY)
> +			do_nodes = node_states[N_ONLINE];
> +
> +		while (1) {
> +			nid = mem_cgroup_select_victim_node(mem_cont,
> +							&do_nodes);
> +
> +			/* Indicate we have cycled the nodelist once
> +			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
> +			 * kswapd burning cpu cycles.
> +			 */
> +			if (loop == 0) {
> +				start_node = nid;
> +				loop++;
> +			} else if (nid == start_node)
> +				break;
> +
> +			pgdat = NODE_DATA(nid);
> +			balance_pgdat_node(pgdat, order, &sc);
> +			total_scanned += sc.nr_scanned;
> +
> +			/* Set the node which has at least
> +			 * one reclaimable zone
> +			 */
> +			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +				struct zone *zone = pgdat->node_zones + i;
> +
> +				if (!populated_zone(zone))
> +					continue;

How about checking whether the memcg has any pages on this node?

> +			}
> +			if (i < 0)
> +				node_clear(nid, do_nodes);
> +
> +			if (mem_cgroup_watermark_ok(mem_cont,
> +							CHARGE_WMARK_HIGH)) {
> +				wmark_ok = true;
> +				goto out;
> +			}
> +
> +			if (nodes_empty(do_nodes)) {
> +				wmark_ok = true;
> +				goto out;
> +			}
> +		}
> +
> +		/* All the nodes are unreclaimable, kswapd is done */
> +		if (nodes_empty(do_nodes)) {
> +			wmark_ok = true;
> +			goto out;
> +		}

Can this happen ?


> +
> +		if (total_scanned && priority < DEF_PRIORITY - 2)
> +			congestion_wait(WRITE, HZ/10);
> +
> +		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> +			break;
> +	}
> +out:
> +	if (!wmark_ok) {
> +		cond_resched();
> +
> +		try_to_freeze();
> +
> +		goto loop_again;
> +	}
> +
> +	return sc.nr_reclaimed;
> +}
> +#else
>  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
>  							int order)
>  {
>  	return 0;
>  }
> +#endif
>  


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 07/10] Add per-memcg zone "unreclaimable"
  2011-04-14 22:54 ` [PATCH V4 07/10] Add per-memcg zone "unreclaimable" Ying Han
@ 2011-04-15  1:32   ` KAMEZAWA Hiroyuki
  2012-03-19  8:27     ` Zhu Yanhai
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  1:32 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:26 -0700
Ying Han <yinghan@google.com> wrote:

> After reclaiming each node per memcg, it checks mem_cgroup_watermark_ok()
> and breaks the priority loop if it returns true. The per-memcg zone will
> be marked as "unreclaimable" if the scanning rate is much greater than the
> reclaiming rate on the per-memcg LRU. The bit is cleared when there is a
> page charged to the memcg being freed. Kswapd breaks the priority loop if
> all the zones are marked as "unreclaimable".
> 
> changelog v4..v3:
> 1. split off from the per-memcg background reclaim patch in V3.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h |   30 ++++++++++++++
>  include/linux/swap.h       |    2 +
>  mm/memcontrol.c            |   96 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |   19 +++++++++
>  4 files changed, 147 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d4ff7f2..a8159f5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -155,6 +155,12 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +				unsigned long nr_scanned);
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> @@ -345,6 +351,25 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  {
>  }
>  
> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
> +						struct zone *zone,
> +						unsigned long nr_scanned)
> +{
> +}
> +
> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
> +							struct zone *zone)
> +{
> +}
> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
> +		struct zone *zone)
> +{
> +}
> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
> +						struct zone *zone)
> +{
> +}
> +
>  static inline
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  					    gfp_t gfp_mask)
> @@ -363,6 +388,11 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
>  {
>  }
>  
> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
> +								int zid)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 17e0511..319b800 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -160,6 +160,8 @@ enum {
>  	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
>  };
>  
> +#define ZONE_RECLAIMABLE_RATE 6
> +
>  #define SWAP_CLUSTER_MAX 32
>  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e22351a..da6a130 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
>  	bool			on_tree;
>  	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
>  						/* use container_of	   */
> +	unsigned long		pages_scanned;	/* since last reclaim */
> +	bool			all_unreclaimable;	/* All pages pinned */
>  };
> +
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
>  
> @@ -1135,6 +1138,96 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>  	return &mz->reclaim_stat;
>  }
>  
> +static unsigned long mem_cgroup_zone_reclaimable_pages(
> +					struct mem_cgroup_per_zone *mz)
> +{
> +	int nr;
> +	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
> +		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
> +
> +	if (nr_swap_pages > 0)
> +		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
> +			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +						unsigned long nr_scanned)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->pages_scanned += nr_scanned;
> +}
> +
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +	if (!mem)
> +		return 0;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->pages_scanned <
> +				mem_cgroup_zone_reclaimable_pages(mz) *
> +				ZONE_RECLAIMABLE_RATE;
> +	return 0;
> +}
> +
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return false;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->all_unreclaimable;
> +
> +	return false;
> +}
> +
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->all_unreclaimable = true;
> +}
> +
> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +	if (!mem)
> +		return;
> +
> +	mz = page_cgroup_zoneinfo(mem, page);
> +	if (mz) {
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = false;
> +	}
> +
> +	return;
> +}
> +
>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -2801,6 +2894,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	 * special functions.
>  	 */
>  
> +	mem_cgroup_clear_unreclaimable(mem, page);

Hmm, this will easily cause cache ping-pong. (free_page() clears it after taking
zone->lock... in a batched manner.)

Could you consider a way to make this low cost?

One way is using memcg_check_event() with some low event trigger.
A second way is using memcg_batch.
In many cases, we can expect a chunk of freed pages to come from the same zone.
Then, add a new member to memcg_batch_info as

struct memcg_batch_info {
	.....
	struct zone *zone;	/* zone of the page last uncharged */
	...
}

Then,
==
static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
                                   unsigned int nr_pages,
+				   struct page *page,
                                   const enum charge_type ctype)
{
        struct memcg_batch_info *batch = NULL;
.....

	if (batch->zone != page_zone(page)) { 
		mem_cgroup_clear_unreclaimable(mem, page);
	}
direct_uncharge:
	mem_cgroup_clear_unreclaimable(mem, page);
....
}
==

This will reduce overhead dramatically.
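
Filling the sketch in a little (still only an illustration of the idea,
assuming the existing per-task memcg_batch field; not a tested change):

static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
				   unsigned int nr_pages,
				   struct page *page,
				   const enum charge_type ctype)
{
	struct memcg_batch_info *batch = &current->memcg_batch;

	/* ... existing checks deciding between batched and direct ... */

	/* A run of frees usually comes from one zone, so touch the
	 * per-zone flag only when the zone changes. */
	if (batch->zone != page_zone(page)) {
		mem_cgroup_clear_unreclaimable(mem, page);
		batch->zone = page_zone(page);
	}
	/* ... accumulate into the batch and return ... */
	return;

direct_uncharge:
	/* unbatched path: clear unconditionally, as the patch does now */
	mem_cgroup_clear_unreclaimable(mem, page);
	/* ... res_counter_uncharge() etc. ... */
}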



>  	unlock_page_cgroup(pc);
>  	/*
>  	 * even after unlock, we have mem->res.usage here and this memcg
> @@ -4569,6 +4663,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>  		mz->usage_in_excess = 0;
>  		mz->on_tree = false;
>  		mz->mem = mem;
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = false;
>  	}
>  	return 0;
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b8345d2..c081112 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  					ISOLATE_BOTH : ISOLATE_INACTIVE,
>  			zone, sc->mem_cgroup,
>  			0, file);
> +
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
> +
>  		/*
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
> @@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
>  		 */
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>  	}
>  
>  	reclaim_stat->recent_scanned[file] += nr_taken;
> @@ -2648,6 +2652,7 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
>  	unsigned long total_scanned = 0;
>  	struct mem_cgroup *mem_cont = sc->mem_cgroup;
>  	int priority = sc->priority;
> +	int nid = pgdat->node_id;
>  
>  	/*
>  	 * Now scan the zone in the dma->highmem direction, and we scan
> @@ -2664,10 +2669,20 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
>  		if (!populated_zone(zone))
>  			continue;
>  
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> +			priority != DEF_PRIORITY)
> +			continue;
> +
>  		sc->nr_scanned = 0;
>  		shrink_zone(priority, zone, sc);
>  		total_scanned += sc->nr_scanned;
>  
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
> +			continue;
> +
> +		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
> +			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
> +
>  		/*
>  		 * If we've done a decent amount of scanning and
>  		 * the reclaim ratio is low, start doing writepage
> @@ -2752,6 +2767,10 @@ loop_again:
>  
>  				if (!populated_zone(zone))
>  					continue;
> +
> +				if (!mem_cgroup_mz_unreclaimable(mem_cont,
> +								zone))
> +	

Ah, okay. This will work.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 08/10] Enable per-memcg background reclaim.
  2011-04-14 22:54 ` [PATCH V4 08/10] Enable per-memcg background reclaim Ying Han
@ 2011-04-15  1:34   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  1:34 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:27 -0700
Ying Han <yinghan@google.com> wrote:

> By default the per-memcg background reclaim is disabled when the limit_in_bytes
> is set to the maximum. kswapd_run() is called when the memcg is being resized,
> and kswapd_stop() is called when the memcg is being deleted.
> 
> The per-memcg kswapd is woken up based on the usage and the low_wmark, which is
> checked once per 1024 increments per cpu. The memcg's kswapd is woken up if the
> usage is larger than the low_wmark.
> 
> changelog v4..v3:
> 1. move kswapd_stop to mem_cgroup_destroy based on comments from KAMAZAWA
> 2. move kswapd_run to setup_mem_cgroup_wmark, since the actual watermarks
> determines whether or not enabling per-memcg background reclaim.
> 
> changelog v3..v2:
> 1. some clean-ups
> 
> changelog v2..v1:
> 1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
> 2. remove checking the wmark from per-page charging. now it checks the wmark
> periodically based on the event counter.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Ok, seems nice.

For now,
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I'll ack a later version.
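
For reference, the wakeup described in the changelog above boils down to a
check like the following in the periodic event path. This is a sketch under
assumptions: memcg_check_wmark() is a made-up name here, while
mem_cgroup_watermark_ok(), CHARGE_WMARK_LOW and mem_cgroup_kswapd_wait() come
from earlier patches in this series:

static void memcg_check_wmark(struct mem_cgroup *mem)
{
	wait_queue_head_t *wait;

	/* called from the existing per-cpu event counter path, i.e.
	 * roughly once every 1024 charges, not on every page */
	if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
		return;

	wait = mem_cgroup_kswapd_wait(mem);
	if (wait && waitqueue_active(wait))
		wake_up_interruptible(wait);
}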


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 09/10] Add API to export per-memcg kswapd pid.
  2011-04-14 22:54 ` [PATCH V4 09/10] Add API to export per-memcg kswapd pid Ying Han
@ 2011-04-15  1:40   ` KAMEZAWA Hiroyuki
  2011-04-15  4:47     ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  1:40 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 15:54:28 -0700
Ying Han <yinghan@google.com> wrote:

> This adds the API which exports the per-memcg kswapd thread pid. The kswapd
> thread is named "memcg_" + css_id, and the pid can be used to put the
> kswapd thread into a cpu cgroup later.
> 
> $ mkdir /dev/cgroup/memory/A
> $ cat /dev/cgroup/memory/A/memory.kswapd_pid
> memcg_null 0
> 
> $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> $ ps -ef | grep memcg
> root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
> root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg
> 
> $ cat memory.kswapd_pid
> memcg_3 6727
> 
> changelog v4..v3
> 1. Add the API based on KAMAZAWA's request on patch v3.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Thank you.

> ---
>  include/linux/swap.h |    2 ++
>  mm/memcontrol.c      |   33 +++++++++++++++++++++++++++++++++
>  mm/vmscan.c          |    2 +-
>  3 files changed, 36 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 319b800..2d3e21a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -34,6 +34,8 @@ struct kswapd {
>  };
>  
>  int kswapd(void *p);
> +extern spinlock_t kswapds_spinlock;
> +
>  /*
>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>   * be swapped to.  The swap type and the offset into that swap type are
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1b23ff4..606b680 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4493,6 +4493,35 @@ static int mem_cgroup_wmark_read(struct cgroup *cgrp,
>  	return 0;
>  }
>  
> +static int mem_cgroup_kswapd_pid_read(struct cgroup *cgrp,
> +	struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	struct task_struct *kswapd_thr = NULL;
> +	struct kswapd *kswapd_p = NULL;
> +	wait_queue_head_t *wait;
> +	char name[TASK_COMM_LEN];
> +	pid_t pid = 0;
> +

I think '0' is ... not very good. This '0' implies there is no kswapd,
but 0 is also the root pid. I have no better idea, though. Do you have any concerns?

Otherwise, the interface seems good.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




> +	sprintf(name, "memcg_null");
> +
> +	spin_lock(&kswapds_spinlock);
> +	wait = mem_cgroup_kswapd_wait(mem);
> +	if (wait) {
> +		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +		kswapd_thr = kswapd_p->kswapd_task;
> +		if (kswapd_thr) {
> +			get_task_comm(name, kswapd_thr);
> +			pid = kswapd_thr->pid;
> +		}
> +	}
> +	spin_unlock(&kswapds_spinlock);
> +
> +	cb->fill(cb, name, pid);
> +
> +	return 0;
> +}
> +
>  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>  	struct cftype *cft,  struct cgroup_map_cb *cb)
>  {
> @@ -4610,6 +4639,10 @@ static struct cftype mem_cgroup_files[] = {
>  		.name = "reclaim_wmarks",
>  		.read_map = mem_cgroup_wmark_read,
>  	},
> +	{
> +		.name = "kswapd_pid",
> +		.read_map = mem_cgroup_kswapd_pid_read,
> +	},
>  };
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c081112..df4e5dd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2249,7 +2249,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
>  	return balanced_pages > (present_pages >> 2);
>  }
>  
> -static DEFINE_SPINLOCK(kswapds_spinlock);
> +DEFINE_SPINLOCK(kswapds_spinlock);
>  #define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
>  
>  /* is kswapd sleeping prematurely? */
> -- 
> 1.7.3.1
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 01/10] Add kswapd descriptor
  2011-04-15  0:04   ` KAMEZAWA Hiroyuki
@ 2011-04-15  3:35     ` Ying Han
  2011-04-15  4:16       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-15  3:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 14, 2011 at 5:04 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:20 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > There is a kswapd kernel thread for each numa node. We will add a
> different
> > kswapd for each memcg. The kswapd is sleeping in the wait queue headed at
> > kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
> > information of node or memcg and it allows the global and per-memcg
> background
> > reclaim to share common reclaim algorithms.
> >
> > This patch adds the kswapd descriptor and moves the per-node kswapd to
> use the
> > new structure.
> >
>
> No objections to your direction but some comments.
>
> > changelog v2..v1:
> > 1. dynamic allocate kswapd descriptor and initialize the wait_queue_head
> of pgdat
> > at kswapd_run.
> > 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup
> kswapd
> > descriptor.
> >
> > changelog v3..v2:
> > 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later patch.
> > 2. rename thr in kswapd_run to something else.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/mmzone.h |    3 +-
> >  include/linux/swap.h   |    7 ++++
> >  mm/page_alloc.c        |    1 -
> >  mm/vmscan.c            |   95
> ++++++++++++++++++++++++++++++++++++------------
> >  4 files changed, 80 insertions(+), 26 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 628f07b..6cba7d2 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -640,8 +640,7 @@ typedef struct pglist_data {
> >       unsigned long node_spanned_pages; /* total size of physical page
> >                                            range, including holes */
> >       int node_id;
> > -     wait_queue_head_t kswapd_wait;
> > -     struct task_struct *kswapd;
> > +     wait_queue_head_t *kswapd_wait;
> >       int kswapd_max_order;
> >       enum zone_type classzone_idx;
>
> I think pg_data_t should include struct kswapd in it, as
>
>        struct pglist_data {
>        .....
>                struct kswapd   kswapd;
>        };
> and you can add a macro as
>
> #define kswapd_waitqueue(kswapd)        (&(kswapd)->kswapd_wait)
> if it looks better.
>
> Why I recommend this is that I think it's better to have 'struct kswapd'
> on the same page as pg_data_t or struct memcg.
> Do you see benefits to kmalloc()ing struct kswapd on demand?
>

So we don't end up having a kswapd struct on memcgs which don't have
per-memcg kswapd enabled. I don't see that either approach is strongly better
than the other. If it's ok, I would like to keep it as it is for this
version. Hope this is ok for now.


>
>
>
> >  } pg_data_t;
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index ed6ebe6..f43d406 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
> >       return current->flags & PF_KSWAPD;
> >  }
> >
> > +struct kswapd {
> > +     struct task_struct *kswapd_task;
> > +     wait_queue_head_t kswapd_wait;
> > +     pg_data_t *kswapd_pgdat;
> > +};
> > +
> > +int kswapd(void *p);
> >  /*
> >   * MAX_SWAPFILES defines the maximum number of swaptypes: things which
> can
> >   * be swapped to.  The swap type and the offset into that swap type are
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6e1b52a..6340865 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct
> pglist_data *pgdat,
> >
> >       pgdat_resize_init(pgdat);
> >       pgdat->nr_zones = 0;
> > -     init_waitqueue_head(&pgdat->kswapd_wait);
> >       pgdat->kswapd_max_order = 0;
> >       pgdat_page_cgroup_init(pgdat);
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 060e4c1..77ac74f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2241,13 +2241,16 @@ static bool pgdat_balanced(pg_data_t *pgdat,
> unsigned long balanced_pages,
> >       return balanced_pages > (present_pages >> 2);
> >  }
> >
> > +static DEFINE_SPINLOCK(kswapds_spinlock);
> > +
> Maybe better to explain what this lock is for.
>
> It seems we need this because we allocate the kswapd descriptor after the
> NODE is online, right?

True. I will put a comment there.

--Ying

Thanks,
> -Kame
>
> >  /* is kswapd sleeping prematurely? */
> > -static bool sleeping_prematurely(pg_data_t *pgdat, int order, long
> remaining,
> > -                                     int classzone_idx)
> > +static int sleeping_prematurely(struct kswapd *kswapd, int order,
> > +                             long remaining, int classzone_idx)
> >  {
> >       int i;
> >       unsigned long balanced = 0;
> >       bool all_zones_ok = true;
> > +     pg_data_t *pgdat = kswapd->kswapd_pgdat;
> >
> >       /* If a direct reclaimer woke kswapd within HZ/10, it's premature
> */
> >       if (remaining)
> > @@ -2570,28 +2573,31 @@ out:
> >       return order;
> >  }
> >
> > -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int
> classzone_idx)
> > +static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
> > +                             int classzone_idx)
> >  {
> >       long remaining = 0;
> >       DEFINE_WAIT(wait);
> > +     pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +     wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> >
> >       if (freezing(current) || kthread_should_stop())
> >               return;
> >
> > -     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > +     prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> >
> >       /* Try to sleep for a short interval */
> > -     if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
> {
> > +     if (!sleeping_prematurely(kswapd_p, order, remaining,
> classzone_idx)) {
> >               remaining = schedule_timeout(HZ/10);
> > -             finish_wait(&pgdat->kswapd_wait, &wait);
> > -             prepare_to_wait(&pgdat->kswapd_wait, &wait,
> TASK_INTERRUPTIBLE);
> > +             finish_wait(wait_h, &wait);
> > +             prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> >       }
> >
> >       /*
> >        * After a short sleep, check if it was a premature sleep. If not,
> then
> >        * go fully to sleep until explicitly woken up.
> >        */
> > -     if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
> {
> > +     if (!sleeping_prematurely(kswapd_p, order, remaining,
> classzone_idx)) {
> >               trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> >
> >               /*
> > @@ -2611,7 +2617,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat,
> int order, int classzone_idx)
> >               else
> >                       count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> >       }
> > -     finish_wait(&pgdat->kswapd_wait, &wait);
> > +     finish_wait(wait_h, &wait);
> >  }
> >
> >  /*
> > @@ -2627,20 +2633,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat,
> int order, int classzone_idx)
> >   * If there are applications that are active memory-allocators
> >   * (most normal use), this basically shouldn't matter.
> >   */
> > -static int kswapd(void *p)
> > +int kswapd(void *p)
> >  {
> >       unsigned long order;
> >       int classzone_idx;
> > -     pg_data_t *pgdat = (pg_data_t*)p;
> > +     struct kswapd *kswapd_p = (struct kswapd *)p;
> > +     pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +     wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> >       struct task_struct *tsk = current;
> >
> >       struct reclaim_state reclaim_state = {
> >               .reclaimed_slab = 0,
> >       };
> > -     const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> > +     const struct cpumask *cpumask;
> >
> >       lockdep_set_current_reclaim_state(GFP_KERNEL);
> >
> > +     BUG_ON(pgdat->kswapd_wait != wait_h);
> > +     cpumask = cpumask_of_node(pgdat->node_id);
> >       if (!cpumask_empty(cpumask))
> >               set_cpus_allowed_ptr(tsk, cpumask);
> >       current->reclaim_state = &reclaim_state;
> > @@ -2679,7 +2689,7 @@ static int kswapd(void *p)
> >                       order = new_order;
> >                       classzone_idx = new_classzone_idx;
> >               } else {
> > -                     kswapd_try_to_sleep(pgdat, order, classzone_idx);
> > +                     kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> >                       order = pgdat->kswapd_max_order;
> >                       classzone_idx = pgdat->classzone_idx;
> >                       pgdat->kswapd_max_order = 0;
> > @@ -2719,13 +2729,13 @@ void wakeup_kswapd(struct zone *zone, int order,
> enum zone_type classzone_idx)
> >               pgdat->kswapd_max_order = order;
> >               pgdat->classzone_idx = min(pgdat->classzone_idx,
> classzone_idx);
> >       }
> > -     if (!waitqueue_active(&pgdat->kswapd_wait))
> > +     if (!waitqueue_active(pgdat->kswapd_wait))
> >               return;
> >       if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0,
> 0))
> >               return;
> >
> >       trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone),
> order);
> > -     wake_up_interruptible(&pgdat->kswapd_wait);
> > +     wake_up_interruptible(pgdat->kswapd_wait);
> >  }
> >
> >  /*
> > @@ -2817,12 +2827,23 @@ static int __devinit cpu_callback(struct
> notifier_block *nfb,
> >               for_each_node_state(nid, N_HIGH_MEMORY) {
> >                       pg_data_t *pgdat = NODE_DATA(nid);
> >                       const struct cpumask *mask;
> > +                     struct kswapd *kswapd_p;
> > +                     struct task_struct *kswapd_thr;
> > +                     wait_queue_head_t *wait;
> >
> >                       mask = cpumask_of_node(pgdat->node_id);
> >
> > +                     spin_lock(&kswapds_spinlock);
> > +                     wait = pgdat->kswapd_wait;
> > +                     kswapd_p = container_of(wait, struct kswapd,
> > +                                             kswapd_wait);
> > +                     kswapd_thr = kswapd_p->kswapd_task;
> > +                     spin_unlock(&kswapds_spinlock);
> > +
> >                       if (cpumask_any_and(cpu_online_mask, mask) <
> nr_cpu_ids)
> >                               /* One of our CPUs online: restore mask */
> > -                             set_cpus_allowed_ptr(pgdat->kswapd, mask);
> > +                             if (kswapd_thr)
> > +                                     set_cpus_allowed_ptr(kswapd_thr,
> mask);
> >               }
> >       }
> >       return NOTIFY_OK;
> > @@ -2835,18 +2856,31 @@ static int __devinit cpu_callback(struct
> notifier_block *nfb,
> >  int kswapd_run(int nid)
> >  {
> >       pg_data_t *pgdat = NODE_DATA(nid);
> > +     struct task_struct *kswapd_thr;
> > +     struct kswapd *kswapd_p;
> >       int ret = 0;
> >
> > -     if (pgdat->kswapd)
> > +     if (pgdat->kswapd_wait)
> >               return 0;
> >
> > -     pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> > -     if (IS_ERR(pgdat->kswapd)) {
> > +     kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
> > +     if (!kswapd_p)
> > +             return -ENOMEM;
> > +
> > +     init_waitqueue_head(&kswapd_p->kswapd_wait);
> > +     pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> > +     kswapd_p->kswapd_pgdat = pgdat;
> > +
> > +     kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> > +     if (IS_ERR(kswapd_thr)) {
> >               /* failure at boot is fatal */
> >               BUG_ON(system_state == SYSTEM_BOOTING);
> >               printk("Failed to start kswapd on node %d\n",nid);
> > +             pgdat->kswapd_wait = NULL;
> > +             kfree(kswapd_p);
> >               ret = -1;
> > -     }
> > +     } else
> > +             kswapd_p->kswapd_task = kswapd_thr;
> >       return ret;
> >  }
> >
> > @@ -2855,10 +2889,25 @@ int kswapd_run(int nid)
> >   */
> >  void kswapd_stop(int nid)
> >  {
> > -     struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
> > +     struct task_struct *kswapd_thr = NULL;
> > +     struct kswapd *kswapd_p = NULL;
> > +     wait_queue_head_t *wait;
> > +
> > +     pg_data_t *pgdat = NODE_DATA(nid);
> > +
> > +     spin_lock(&kswapds_spinlock);
> > +     wait = pgdat->kswapd_wait;
> > +     if (wait) {
> > +             kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> > +             kswapd_thr = kswapd_p->kswapd_task;
> > +             kswapd_p->kswapd_task = NULL;
> > +     }
> > +     spin_unlock(&kswapds_spinlock);
> > +
> > +     if (kswapd_thr)
> > +             kthread_stop(kswapd_thr);
> >
> > -     if (kswapd)
> > -             kthread_stop(kswapd);
> > +     kfree(kswapd_p);
> >  }
> >
> >  static int __init kswapd_init(void)
> > --
> > 1.7.3.1
> >
>
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 02/10] Add per memcg reclaim watermarks
  2011-04-15  0:16   ` KAMEZAWA Hiroyuki
@ 2011-04-15  3:45     ` Ying Han
  0 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-15  3:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 14, 2011 at 5:16 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:21 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > There are two watermarks added per-memcg including "high_wmark" and
> "low_wmark".
> > The per-memcg kswapd is invoked when the memcg's memory
> usage(usage_in_bytes)
> > is higher than the low_wmark. Then the kswapd thread starts to reclaim
> pages
> > until the usage is lower than the high_wmark.
> >
> > Each watermark is calculated based on the hard_limit(limit_in_bytes) for
> each
> > memcg. Each time the hard_limit is changed, the corresponding wmarks are
> > re-calculated. Since memory controller charges only user pages, there is
> > no need for a "min_wmark". The current calculation of wmarks is based on
> > individual tunable low/high_wmark_distance, which are set to 0 by
> default.
> >
> > changelog v4..v3:
> > 1. remove legacy comments
> > 2. rename the res_counter_check_under_high_wmark_limit
> > 3. replace the wmark_ratio per-memcg by individual tunable for both
> wmarks.
> > 4. add comments on low/high_wmark
> > 5. add individual tunables for low/high_wmarks and remove wmark_ratio
> > 6. replace the mem_cgroup_get_limit() call by res_count_read_u64(). The
> first
> > one returns large value w/ swapon.
> >
> > changelog v3..v2:
> > 1. Add VM_BUG_ON() on couple of places.
> > 2. Remove the spinlock on the min_free_kbytes since the consequence of
> reading
> > stale data.
> > 3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based
> on
> > hard_limit.
> >
> > changelog v2..v1:
> > 1. Remove the res_counter_charge on wmark due to performance concern.
> > 2. Move the new APIs min_free_kbytes, reclaim_wmarks into seperate
> commit.
> > 3. Calculate the min_free_kbytes automatically based on the
> limit_in_bytes.
> > 4. make the wmark to be consistant with core VM which checks the free
> pages
> > instead of usage.
> > 5. changed wmark to be boolean
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> some nitpick below.
>
>
>
> > ---
> >  include/linux/memcontrol.h  |    1 +
> >  include/linux/res_counter.h |   78
> +++++++++++++++++++++++++++++++++++++++++++
> >  kernel/res_counter.c        |    6 +++
> >  mm/memcontrol.c             |   48 ++++++++++++++++++++++++++
> >  4 files changed, 133 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 5a5ce70..3ece36d 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const
> struct mem_cgroup *mem);
> >
> >  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page
> *page);
> >  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> > +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int
> charge_flags);
> >
> >  static inline
> >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index c9d625c..77eaaa9 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -39,6 +39,14 @@ struct res_counter {
> >        */
> >       unsigned long long soft_limit;
> >       /*
> > +      * the limit that reclaim triggers.
> > +      */
> > +     unsigned long long low_wmark_limit;
> > +     /*
> > +      * the limit that reclaim stops.
> > +      */
> > +     unsigned long long high_wmark_limit;
> > +     /*
> >        * the number of unsuccessful attempts to consume the resource
> >        */
> >       unsigned long long failcnt;
> > @@ -55,6 +63,9 @@ struct res_counter {
> >
> >  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
> >
> > +#define CHARGE_WMARK_LOW     0x01
> > +#define CHARGE_WMARK_HIGH    0x02
> > +
> >  /**
> >   * Helpers to interact with userspace
> >   * res_counter_read_u64() - returns the value of the specified member.
> > @@ -92,6 +103,8 @@ enum {
> >       RES_LIMIT,
> >       RES_FAILCNT,
> >       RES_SOFT_LIMIT,
> > +     RES_LOW_WMARK_LIMIT,
> > +     RES_HIGH_WMARK_LIMIT
> >  };
> >
> >  /*
> > @@ -147,6 +160,24 @@ static inline unsigned long long
> res_counter_margin(struct res_counter *cnt)
> >       return margin;
> >  }
> >
> > +static inline bool
> > +res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
> > +{
> > +     if (cnt->usage < cnt->high_wmark_limit)
> > +             return true;
> > +
> > +     return false;
> > +}
> > +
> > +static inline bool
> > +res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
> > +{
> > +     if (cnt->usage < cnt->low_wmark_limit)
> > +             return true;
> > +
> > +     return false;
> > +}
>
> I like res_counter_under_low_wmark_limit_locked() better than this name.
>

Thanks for the review. Will change this in the next post.

--Ying

>
> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 03/10] New APIs to adjust per-memcg wmarks
  2011-04-15  0:25   ` KAMEZAWA Hiroyuki
@ 2011-04-15  4:00     ` Ying Han
  0 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-15  4:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 14, 2011 at 5:25 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:22 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > Add memory.low_wmark_distance, memory.high_wmark_distance and
> reclaim_wmarks
> > APIs per-memcg. The first two adjust the internal low/high wmark
> calculation
> > and the reclaim_wmarks exports the current value of watermarks.
> >
> > By default, the low/high_wmark is calculated by subtracting the distance
> from
> > the hard_limit(limit_in_bytes).
> >
> > $ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
> > $ cat /dev/cgroup/A/memory.limit_in_bytes
> > 524288000
> >
> > $ echo 50m >/dev/cgroup/A/memory.high_wmark_distance
> > $ echo 40m >/dev/cgroup/A/memory.low_wmark_distance
> >
> > $ cat /dev/cgroup/A/memory.reclaim_wmarks
> > low_wmark 482344960
> > high_wmark 471859200
> >
> > changelog v4..v3:
> > 1. replace the "wmark_ratio" API with individual tunable for
> low/high_wmarks.
> >
> > changelog v3..v2:
> > 1. replace the "min_free_kbytes" api with "wmark_ratio". This is part of
> > feedbacks
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> But please add a sanity check (see below.)
>
>
>
> > ---
> >  mm/memcontrol.c |   95
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 95 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1ec4014..685645c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3974,6 +3974,72 @@ static int mem_cgroup_swappiness_write(struct
> cgroup *cgrp, struct cftype *cft,
> >       return 0;
> >  }
> >
> > +static u64 mem_cgroup_high_wmark_distance_read(struct cgroup *cgrp,
> > +                                            struct cftype *cft)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +
> > +     return memcg->high_wmark_distance;
> > +}
> > +
> > +static u64 mem_cgroup_low_wmark_distance_read(struct cgroup *cgrp,
> > +                                           struct cftype *cft)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +
> > +     return memcg->low_wmark_distance;
> > +}
> > +
> > +static int mem_cgroup_high_wmark_distance_write(struct cgroup *cont,
> > +                                             struct cftype *cft,
> > +                                             const char *buffer)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> > +     u64 low_wmark_distance = memcg->low_wmark_distance;
> > +     unsigned long long val;
> > +     u64 limit;
> > +     int ret;
> > +
> > +     ret = res_counter_memparse_write_strategy(buffer, &val);
> > +     if (ret)
> > +             return -EINVAL;
> > +
> > +     limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> > +     if ((val >= limit) || (val < low_wmark_distance) ||
> > +        (low_wmark_distance && val == low_wmark_distance))
> > +             return -EINVAL;
> > +
> > +     memcg->high_wmark_distance = val;
> > +
> > +     setup_per_memcg_wmarks(memcg);
> > +     return 0;
> > +}
>
> IIUC, as with limit_in_bytes, 'distance' should not be settable on the ROOT
> memcg
> because it doesn't work there.

Thanks for the review. Will change in the next post.
>
> > +
> > +static int mem_cgroup_low_wmark_distance_write(struct cgroup *cont,
> > +                                            struct cftype *cft,
> > +                                            const char *buffer)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> > +     u64 high_wmark_distance = memcg->high_wmark_distance;
> > +     unsigned long long val;
> > +     u64 limit;
> > +     int ret;
> > +
> > +     ret = res_counter_memparse_write_strategy(buffer, &val);
> > +     if (ret)
> > +             return -EINVAL;
> > +
> > +     limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> > +     if ((val >= limit) || (val > high_wmark_distance) ||
> > +         (high_wmark_distance && val == high_wmark_distance))
> > +             return -EINVAL;
> > +
> > +     memcg->low_wmark_distance = val;
> > +
> > +     setup_per_memcg_wmarks(memcg);
> > +     return 0;
> > +}
> > +
>
> Here, too.
>
> Will add


> I wonder we should have a method to hide unnecessary interfaces in ROOT
> cgroup...
>
Hmm, something to think about.

--Ying


> Thanks,
> -Kame
>
>


* Re: [PATCH V4 04/10] Infrastructure to support per-memcg reclaim.
  2011-04-15  0:34   ` KAMEZAWA Hiroyuki
@ 2011-04-15  4:04     ` Ying Han
  0 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-15  4:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 14, 2011 at 5:34 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:23 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > Add the kswapd_mem field in kswapd descriptor which links the kswapd
> > kernel thread to a memcg. The per-memcg kswapd is sleeping in the wait
> > queue headed at kswapd_wait field of the kswapd descriptor.
> >
> > The kswapd() function is now shared between global and per-memcg kswapd.
> It
> > is passed in with the kswapd descriptor which contains the information of
> > either node or memcg. Then the new function balance_mem_cgroup_pgdat is
> > invoked if it is per-mem kswapd thread, and the implementation of the
> function
> > is on the following patch.
> >
> > changelog v4..v3:
> > 1. fix up the kswapd_run and kswapd_stop for online_pages() and
> offline_pages.
> > 2. drop the PF_MEMALLOC flag for memcg kswapd for now per KAMAZAWA's
> request.
> >
> > changelog v3..v2:
> > 1. split off from the initial patch which includes all changes of the
> following
> > three patches.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
>
>
> > ---
> >  include/linux/memcontrol.h |    5 ++
> >  include/linux/swap.h       |    5 +-
> >  mm/memcontrol.c            |   29 ++++++++
> >  mm/memory_hotplug.c        |    4 +-
> >  mm/vmscan.c                |  157
> ++++++++++++++++++++++++++++++--------------
> >  5 files changed, 147 insertions(+), 53 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 3ece36d..f7ffd1f 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -24,6 +24,7 @@ struct mem_cgroup;
> >  struct page_cgroup;
> >  struct page;
> >  struct mm_struct;
> > +struct kswapd;
> >
> >  /* Stats that can be updated by kernel. */
> >  enum mem_cgroup_page_stat_item {
> > @@ -83,6 +84,10 @@ int task_in_mem_cgroup(struct task_struct *task, const
> struct mem_cgroup *mem);
> >  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page
> *page);
> >  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> >  extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int
> charge_flags);
> > +extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
> > +                               struct kswapd *kswapd_p);
> > +extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
> > +extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup
> *mem);
> >
> >  static inline
> >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index f43d406..17e0511 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -30,6 +30,7 @@ struct kswapd {
> >       struct task_struct *kswapd_task;
> >       wait_queue_head_t kswapd_wait;
> >       pg_data_t *kswapd_pgdat;
> > +     struct mem_cgroup *kswapd_mem;
> >  };
> >
> >  int kswapd(void *p);
> > @@ -303,8 +304,8 @@ static inline void
> scan_unevictable_unregister_node(struct node *node)
> >  }
> >  #endif
> >
> > -extern int kswapd_run(int nid);
> > -extern void kswapd_stop(int nid);
> > +extern int kswapd_run(int nid, struct mem_cgroup *mem);
> > +extern void kswapd_stop(int nid, struct mem_cgroup *mem);
> >
> >  #ifdef CONFIG_MMU
> >  /* linux/mm/shmem.c */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 685645c..c4e1904 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -278,6 +278,8 @@ struct mem_cgroup {
> >        */
> >       u64 high_wmark_distance;
> >       u64 low_wmark_distance;
> > +
> > +     wait_queue_head_t *kswapd_wait;
> >  };
>
> I think mem_cgroup can include 'struct kswapd' itself and don't need to
> alloc it dynamically.
>
> Other parts seems ok to me.
>

Same as for the previous patch: I would like to keep the current
implementation for the first version, unless one approach turns out to be
strongly better than the other. Hope that works.
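
(For reference, the embedding Kame suggests would look roughly like the sketch
below, with a waitqueue accessor similar to the one he proposed for pg_data_t.
The field placement is illustrative only, not part of this series.)

        struct mem_cgroup {
                ...
                u64 high_wmark_distance;
                u64 low_wmark_distance;

                struct kswapd kswapd;   /* embedded, no kmalloc() needed */
        };

        #define kswapd_waitqueue(kswapd_p)      (&(kswapd_p)->kswapd_wait)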

--Ying

>
> Thanks,
> -Kame
>
>


* Re: [PATCH V4 01/10] Add kswapd descriptor
  2011-04-15  3:35     ` Ying Han
@ 2011-04-15  4:16       ` KAMEZAWA Hiroyuki
  2011-04-15 21:46         ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  4:16 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 20:35:00 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, Apr 14, 2011 at 5:04 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 14 Apr 2011 15:54:20 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > There is a kswapd kernel thread for each numa node. We will add a
> > different
> > > kswapd for each memcg. The kswapd is sleeping in the wait queue headed at
> > > kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
> > > information of node or memcg and it allows the global and per-memcg
> > background
> > > reclaim to share common reclaim algorithms.
> > >
> > > This patch adds the kswapd descriptor and moves the per-node kswapd to
> > use the
> > > new structure.
> > >
> >
> > No objections to your direction but some comments.
> >
> > > changelog v2..v1:
> > > 1. dynamic allocate kswapd descriptor and initialize the wait_queue_head
> > of pgdat
> > > at kswapd_run.
> > > 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup
> > kswapd
> > > descriptor.
> > >
> > > changelog v3..v2:
> > > 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later patch.
> > > 2. rename thr in kswapd_run to something else.
> > >
> > > Signed-off-by: Ying Han <yinghan@google.com>
> > > ---
> > >  include/linux/mmzone.h |    3 +-
> > >  include/linux/swap.h   |    7 ++++
> > >  mm/page_alloc.c        |    1 -
> > >  mm/vmscan.c            |   95
> > ++++++++++++++++++++++++++++++++++++------------
> > >  4 files changed, 80 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 628f07b..6cba7d2 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -640,8 +640,7 @@ typedef struct pglist_data {
> > >       unsigned long node_spanned_pages; /* total size of physical page
> > >                                            range, including holes */
> > >       int node_id;
> > > -     wait_queue_head_t kswapd_wait;
> > > -     struct task_struct *kswapd;
> > > +     wait_queue_head_t *kswapd_wait;
> > >       int kswapd_max_order;
> > >       enum zone_type classzone_idx;
> >
> > I think pg_data_t should include struct kswapd in it, as
> >
> >        struct pglist_data {
> >        .....
> >                struct kswapd   kswapd;
> >        };
> > and you can add a macro as
> >
> > #define kswapd_waitqueue(kswapd)        (&(kswapd)->kswapd_wait)
> > if it looks better.
> >
> > Why I recommend this is I think it's better to have 'struct kswapd'
> > on the same page of pg_data_t or struct memcg.
> > Do you have benefits to kmalloc() struct kswapd on damand ?
> >
> 
> So we don't end up having a kswapd struct on memcgs which don't have
> per-memcg kswapd enabled. I don't see that one of the two approaches is
> strongly better than the other. If ok, I would like to keep it as it is for
> this version. Hope this is ok for now.
> 

My intention is to remove kswapd_spinlock. Can we remove it with
dynamic allocation? IOW, does static allocation still require the spinlock?

Thanks,
-Kame



* Re: [PATCH V4 05/10] Implement the select_victim_node within memcg.
  2011-04-15  0:40   ` KAMEZAWA Hiroyuki
@ 2011-04-15  4:36     ` Ying Han
  0 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-15  4:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 14, 2011 at 5:40 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:24 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This add the mechanism for background reclaim which we remember the
> > last scanned node and always starting from the next one each time.
> > The simple round-robin fasion provide the fairness between nodes for
> > each memcg.
> >
> > changelog v4..v3:
> > 1. split off from the per-memcg background reclaim patch.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> Yeah, looks nice. Thank you for splitting.
>
>
> > ---
> >  include/linux/memcontrol.h |    3 +++
> >  mm/memcontrol.c            |   40
> ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 43 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index f7ffd1f..d4ff7f2 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup
> *mem,
> >                                 struct kswapd *kswapd_p);
> >  extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
> >  extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup
> *mem);
> > +extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
> > +extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> > +                                     const nodemask_t *nodes);
> >
> >  static inline
> >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c4e1904..e22351a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -279,6 +279,11 @@ struct mem_cgroup {
> >       u64 high_wmark_distance;
> >       u64 low_wmark_distance;
> >
> > +     /* While doing per cgroup background reclaim, we cache the
> > +      * last node we reclaimed from
> > +      */
> > +     int last_scanned_node;
> > +
> >       wait_queue_head_t *kswapd_wait;
> >  };
> >
> > @@ -1536,6 +1541,32 @@ static int mem_cgroup_hierarchical_reclaim(struct
> mem_cgroup *root_mem,
> >  }
> >
> >  /*
> > + * Visit the first node after the last_scanned_node of @mem and use that
> to
> > + * reclaim free pages from.
> > + */
> > +int
> > +mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t
> *nodes)
> > +{
> > +     int next_nid;
> > +     int last_scanned;
> > +
> > +     last_scanned = mem->last_scanned_node;
> > +
> > +     /* Initial stage and start from node0 */
> > +     if (last_scanned == -1)
> > +             next_nid = 0;
> > +     else
> > +             next_nid = next_node(last_scanned, *nodes);
> > +
>
> IIUC, mem->last_scanned_node should be initialized to MAX_NUMNODES.
> Then, we can remove above check.
>

Makes sense. Will make the change in the next post.
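
Something like this minimal sketch, assuming mem->last_scanned_node gets
initialized to MAX_NUMNODES in mem_cgroup_create():

        int
        mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
        {
                int next_nid = next_node(mem->last_scanned_node, *nodes);

                if (next_nid == MAX_NUMNODES)
                        next_nid = first_node(*nodes);

                mem->last_scanned_node = next_nid;
                return next_nid;
        }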

--Ying

>
> Thanks,
> -Kame
>
> > +     if (next_nid == MAX_NUMNODES)
> > +             next_nid = first_node(*nodes);
> > +
> > +     mem->last_scanned_node = next_nid;
> > +
> > +     return next_nid;
> > +}
> > +
> > +/*
> >   * Check OOM-Killer is already running under our hierarchy.
> >   * If someone is running, return false.
> >   */
> > @@ -4693,6 +4724,14 @@ wait_queue_head_t *mem_cgroup_kswapd_wait(struct
> mem_cgroup *mem)
> >       return mem->kswapd_wait;
> >  }
> >
> > +int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
> > +{
> > +     if (!mem)
> > +             return -1;
> > +
> > +     return mem->last_scanned_node;
> > +}
> > +
> >  static int mem_cgroup_soft_limit_tree_init(void)
> >  {
> >       struct mem_cgroup_tree_per_node *rtpn;
> > @@ -4768,6 +4807,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> cgroup *cont)
> >               res_counter_init(&mem->memsw, NULL);
> >       }
> >       mem->last_scanned_child = 0;
> > +     mem->last_scanned_node = -1;
> >       INIT_LIST_HEAD(&mem->oom_notify);
> >
> >       if (parent)
> > --
> > 1.7.3.1
> >
> >
>
>


* Re: [PATCH V4 09/10] Add API to export per-memcg kswapd pid.
  2011-04-15  1:40   ` KAMEZAWA Hiroyuki
@ 2011-04-15  4:47     ` Ying Han
  0 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-15  4:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 14, 2011 at 6:40 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:28 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This add the API which exports per-memcg kswapd thread pid. The kswapd
> > thread is named as "memcg_" + css_id, and the pid can be used to put
> > kswapd thread into cpu cgroup later.
> >
> > $ mkdir /dev/cgroup/memory/A
> > $ cat /dev/cgroup/memory/A/memory.kswapd_pid
> > memcg_null 0
> >
> > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > $ ps -ef | grep memcg
> > root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
> > root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg
> >
> > $ cat memory.kswapd_pid
> > memcg_3 6727
> >
> > changelog v4..v3
> > 1. Add the API based on KAMAZAWA's request on patch v3.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> Thank you.
>
> > ---
> >  include/linux/swap.h |    2 ++
> >  mm/memcontrol.c      |   33 +++++++++++++++++++++++++++++++++
> >  mm/vmscan.c          |    2 +-
> >  3 files changed, 36 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 319b800..2d3e21a 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -34,6 +34,8 @@ struct kswapd {
> >  };
> >
> >  int kswapd(void *p);
> > +extern spinlock_t kswapds_spinlock;
> > +
> >  /*
> >   * MAX_SWAPFILES defines the maximum number of swaptypes: things which
> can
> >   * be swapped to.  The swap type and the offset into that swap type are
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1b23ff4..606b680 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4493,6 +4493,35 @@ static int mem_cgroup_wmark_read(struct cgroup
> *cgrp,
> >       return 0;
> >  }
> >
> > +static int mem_cgroup_kswapd_pid_read(struct cgroup *cgrp,
> > +     struct cftype *cft,  struct cgroup_map_cb *cb)
> > +{
> > +     struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> > +     struct task_struct *kswapd_thr = NULL;
> > +     struct kswapd *kswapd_p = NULL;
> > +     wait_queue_head_t *wait;
> > +     char name[TASK_COMM_LEN];
> > +     pid_t pid = 0;
> > +
>
> I think '0' is ... not very good. This '0' implies there is no kswapd.
> But 0 is root pid. I have no idea. Do you have no concern ?
>
> Otherewise, the interface seems good.
>

Thank you for the review. I will make the change so the pid is initialized
to "-1".

--Ying

>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
>
>
>
> > +     sprintf(name, "memcg_null");
> > +
> > +     spin_lock(&kswapds_spinlock);
> > +     wait = mem_cgroup_kswapd_wait(mem);
> > +     if (wait) {
> > +             kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> > +             kswapd_thr = kswapd_p->kswapd_task;
> > +             if (kswapd_thr) {
> > +                     get_task_comm(name, kswapd_thr);
> > +                     pid = kswapd_thr->pid;
> > +             }
> > +     }
> > +     spin_unlock(&kswapds_spinlock);
> > +
> > +     cb->fill(cb, name, pid);
> > +
> > +     return 0;
> > +}
> > +
> >  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
> >       struct cftype *cft,  struct cgroup_map_cb *cb)
> >  {
> > @@ -4610,6 +4639,10 @@ static struct cftype mem_cgroup_files[] = {
> >               .name = "reclaim_wmarks",
> >               .read_map = mem_cgroup_wmark_read,
> >       },
> > +     {
> > +             .name = "kswapd_pid",
> > +             .read_map = mem_cgroup_kswapd_pid_read,
> > +     },
> >  };
> >
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c081112..df4e5dd 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2249,7 +2249,7 @@ static bool pgdat_balanced(pg_data_t *pgdat,
> unsigned long balanced_pages,
> >       return balanced_pages > (present_pages >> 2);
> >  }
> >
> > -static DEFINE_SPINLOCK(kswapds_spinlock);
> > +DEFINE_SPINLOCK(kswapds_spinlock);
> >  #define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
> >
> >  /* is kswapd sleeping prematurely? */
> > --
> > 1.7.3.1
> >
> >
>
>


* Re: [PATCH V4 06/10] Per-memcg background reclaim.
  2011-04-15  1:11   ` KAMEZAWA Hiroyuki
@ 2011-04-15  6:08     ` Ying Han
  2011-04-15  8:14       ` KAMEZAWA Hiroyuki
  2011-04-15  6:26     ` Ying Han
  1 sibling, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-15  6:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 14, 2011 at 6:11 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:25 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This is the main loop of per-memcg background reclaim which is
> implemented in
> > function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During
> each
> > iteration it invokes balance_pgdat_node() for all nodes on the system,
> which
> > is another new function performs background reclaim per node. After
> reclaiming
> > each node, it checks mem_cgroup_watermark_ok() and breaks the priority
> loop if
> > it returns true.
> >
> > changelog v4..v3:
> > 1. split the select_victim_node and zone_unreclaimable to a seperate
> patches
> > 2. remove the logic tries to do zone balancing.
> >
> > changelog v3..v2:
> > 1. change mz->all_unreclaimable to be boolean.
> > 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg
> reclaim.
> > 3. some more clean-up.
> >
> > changelog v2..v1:
> > 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> > 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> > reclaim.
> > 3. name the per-memcg memcg as "memcg-id" (css->id). And the global
> kswapd
> > keeps the same name.
> > 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be
> accessed
> > after freeing.
> > 5. add the fairness in zonelist where memcg remember the last zone
> reclaimed
> > from.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  mm/vmscan.c |  161
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 161 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4deb9c8..b8345d2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -47,6 +47,8 @@
> >
> >  #include <linux/swapops.h>
> >
> > +#include <linux/res_counter.h>
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -111,6 +113,8 @@ struct scan_control {
> >        * are scanned.
> >        */
> >       nodemask_t      *nodemask;
> > +
> > +     int priority;
> >  };
> >
> >  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> > @@ -2632,11 +2636,168 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >       finish_wait(wait_h, &wait);
> >  }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * The function is used for per-memcg LRU. It scanns all the zones of
> the
> > + * node and returns the nr_scanned and nr_reclaimed.
> > + */
> > +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> > +                                     struct scan_control *sc)
> > +{
> > +     int i;
> > +     unsigned long total_scanned = 0;
> > +     struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > +     int priority = sc->priority;
> > +
> > +     /*
> > +      * Now scan the zone in the dma->highmem direction, and we scan
> > +      * every zones for each node.
> > +      *
> > +      * We do this because the page allocator works in the opposite
> > +      * direction.  This prevents the page allocator from allocating
> > +      * pages behind kswapd's direction of progress, which would
> > +      * cause too much scanning of the lower zones.
> > +      */
>
> I guess this comment is a cut-n-paste from global kswapd. It works when
> alloc_page() stalls....hmm, I'd like to think whether dma->highmem
> direction
> is good in this case.
>

This is a legacy comment and the actual logic of zone balancing has been
removed from this patch.

>
> As you know, memcg works against user's memory, memory should be in highmem
> zone.
> Memcg-kswapd is not for memory-shortage, but for voluntary page dropping by
> _user_.
>

In some sense, yes, but it is also related to memory shortage on fully
packed machines.

>
> If this memcg-kswapd drops pages from lower zones first, ah, ok, it's good
> for
> the system because memcg's pages should be on higher zone if we have free
> memory.
>
> So, I think the reason for dma->highmem is different from global kswapd.
>

Yes, I agree that the logic of the dma->highmem ordering is not exactly the
same for per-memcg kswapd and per-node kswapd. But the page allocation still
happens from the other direction, and this is still good for the system, as
you pointed out.

>
>
>
>
> > +     for (i = 0; i < pgdat->nr_zones; i++) {
> > +             struct zone *zone = pgdat->node_zones + i;
> > +
> > +             if (!populated_zone(zone))
> > +                     continue;
> > +
> > +             sc->nr_scanned = 0;
> > +             shrink_zone(priority, zone, sc);
> > +             total_scanned += sc->nr_scanned;
> > +
> > +             /*
> > +              * If we've done a decent amount of scanning and
> > +              * the reclaim ratio is low, start doing writepage
> > +              * even in laptop mode
> > +              */
> > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> 2) {
> > +                     sc->may_writepage = 1;
> > +             }
> > +     }
> > +
> > +     sc->nr_scanned = total_scanned;
> > +     return;
> > +}
> > +
> > +/*
> > + * Per cgroup background reclaim.
> > + * TODO: Take off the order since memcg always do order 0
> > + */
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> > +                                           int order)
> > +{
> > +     int i, nid;
> > +     int start_node;
> > +     int priority;
> > +     bool wmark_ok;
> > +     int loop;
> > +     pg_data_t *pgdat;
> > +     nodemask_t do_nodes;
> > +     unsigned long total_scanned;
> > +     struct scan_control sc = {
> > +             .gfp_mask = GFP_KERNEL,
> > +             .may_unmap = 1,
> > +             .may_swap = 1,
> > +             .nr_to_reclaim = ULONG_MAX,
> > +             .swappiness = vm_swappiness,
> > +             .order = order,
> > +             .mem_cgroup = mem_cont,
> > +     };
> > +
> > +loop_again:
> > +     do_nodes = NODE_MASK_NONE;
> > +     sc.may_writepage = !laptop_mode;
>
> I think may_writepage should start from '0' always. We're not sure
> the system is in memory shortage...we just want to release memory
> volunatary. write_page will add huge costs, I guess.
>
> For exmaple,
>        sc.may_writepage = !!loop
> may be better for memcg.
>
> BTW, you set nr_to_reclaim as ULONG_MAX here and doesn't modify it later.
>
> I think you should add some logic to fix it to right value.
>
> For example, before calling shrink_zone(),
>
> sc->nr_to_reclaim = min(SWAP_CLUSETR_MAX, memcg_usage_in_this_zone() /
> 100);  # 1% in this zone.
>
> if we love 'fair pressure for each zone'.
>

Hmm, I don't get it. Leaving nr_to_reclaim as ULONG_MAX in the kswapd
case is intended to apply equal memory pressure to each zone. So in
shrink_zone(), we won't bail out of the following loop early:


        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                                        nr[LRU_INACTIVE_FILE]) {
                ...
                if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
                        break;
        }

--Ying

>
>
>
>
> > +     sc.nr_reclaimed = 0;
> > +     total_scanned = 0;
> > +
> > +     for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > +             sc.priority = priority;
> > +             wmark_ok = false;
> > +             loop = 0;
> > +
> > +             /* The swap token gets in the way of swapout... */
> > +             if (!priority)
> > +                     disable_swap_token();
> > +
> > +             if (priority == DEF_PRIORITY)
> > +                     do_nodes = node_states[N_ONLINE];
> > +
> > +             while (1) {
> > +                     nid = mem_cgroup_select_victim_node(mem_cont,
> > +                                                     &do_nodes);
> > +
> > +                     /* Indicate we have cycled the nodelist once
> > +                      * TODO: we might add MAX_RECLAIM_LOOP for
> preventing
> > +                      * kswapd burning cpu cycles.
> > +                      */
> > +                     if (loop == 0) {
> > +                             start_node = nid;
> > +                             loop++;
> > +                     } else if (nid == start_node)
> > +                             break;
> > +
> > +                     pgdat = NODE_DATA(nid);
> > +                     balance_pgdat_node(pgdat, order, &sc);
> > +                     total_scanned += sc.nr_scanned;
> > +
> > +                     /* Set the node which has at least
> > +                      * one reclaimable zone
> > +                      */
> > +                     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +                             struct zone *zone = pgdat->node_zones + i;
> > +
> > +                             if (!populated_zone(zone))
> > +                                     continue;
>
> How about checking whether memcg has pages on this node ?
>



> > +                     }
> > +                     if (i < 0)
> > +                             node_clear(nid, do_nodes);
> > +
> > +                     if (mem_cgroup_watermark_ok(mem_cont,
> > +                                                     CHARGE_WMARK_HIGH))
> {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +
> > +                     if (nodes_empty(do_nodes)) {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +             }
> > +
> > +             /* All the nodes are unreclaimable, kswapd is done */
> > +             if (nodes_empty(do_nodes)) {
> > +                     wmark_ok = true;
> > +                     goto out;
> > +             }
>
> Can this happen ?
>
>
> > +
> > +             if (total_scanned && priority < DEF_PRIORITY - 2)
> > +                     congestion_wait(WRITE, HZ/10);
> > +
> > +             if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > +                     break;
> > +     }
> > +out:
> > +     if (!wmark_ok) {
> > +             cond_resched();
> > +
> > +             try_to_freeze();
> > +
> > +             goto loop_again;
> > +     }
> > +
> > +     return sc.nr_reclaimed;
> > +}
> > +#else
> >  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> >                                                       int order)
> >  {
> >       return 0;
> >  }
> > +#endif
> >
>
>
> Thanks,
> -Kame
>
>


* Re: [PATCH V4 06/10] Per-memcg background reclaim.
  2011-04-15  1:11   ` KAMEZAWA Hiroyuki
  2011-04-15  6:08     ` Ying Han
@ 2011-04-15  6:26     ` Ying Han
  1 sibling, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-15  6:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Thu, Apr 14, 2011 at 6:11 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:25 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This is the main loop of per-memcg background reclaim which is
> implemented in
> > function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During
> each
> > iteration it invokes balance_pgdat_node() for all nodes on the system,
> which
> > is another new function performs background reclaim per node. After
> reclaiming
> > each node, it checks mem_cgroup_watermark_ok() and breaks the priority
> loop if
> > it returns true.
> >
> > changelog v4..v3:
> > 1. split the select_victim_node and zone_unreclaimable to a seperate
> patches
> > 2. remove the logic tries to do zone balancing.
> >
> > changelog v3..v2:
> > 1. change mz->all_unreclaimable to be boolean.
> > 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg
> reclaim.
> > 3. some more clean-up.
> >
> > changelog v2..v1:
> > 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> > 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> > reclaim.
> > 3. name the per-memcg memcg as "memcg-id" (css->id). And the global
> kswapd
> > keeps the same name.
> > 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be
> accessed
> > after freeing.
> > 5. add the fairness in zonelist where memcg remember the last zone
> reclaimed
> > from.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  mm/vmscan.c |  161
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 161 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4deb9c8..b8345d2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -47,6 +47,8 @@
> >
> >  #include <linux/swapops.h>
> >
> > +#include <linux/res_counter.h>
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -111,6 +113,8 @@ struct scan_control {
> >        * are scanned.
> >        */
> >       nodemask_t      *nodemask;
> > +
> > +     int priority;
> >  };
> >
> >  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> > @@ -2632,11 +2636,168 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >       finish_wait(wait_h, &wait);
> >  }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * The function is used for per-memcg LRU. It scanns all the zones of
> the
> > + * node and returns the nr_scanned and nr_reclaimed.
> > + */
> > +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> > +                                     struct scan_control *sc)
> > +{
> > +     int i;
> > +     unsigned long total_scanned = 0;
> > +     struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > +     int priority = sc->priority;
> > +
> > +     /*
> > +      * Now scan the zone in the dma->highmem direction, and we scan
> > +      * every zones for each node.
> > +      *
> > +      * We do this because the page allocator works in the opposite
> > +      * direction.  This prevents the page allocator from allocating
> > +      * pages behind kswapd's direction of progress, which would
> > +      * cause too much scanning of the lower zones.
> > +      */
>
> I guess this comment is a cut-n-paste from global kswapd. It works when
> alloc_page() stalls....hmm, I'd like to think whether dma->highmem
> direction
> is good in this case.
>
> As you know, memcg works against user's memory, memory should be in highmem
> zone.
> Memcg-kswapd is not for memory-shortage, but for voluntary page dropping by
> _user_.
>
> If this memcg-kswapd drops pages from lower zones first, ah, ok, it's good
> for
> the system because memcg's pages should be on higher zone if we have free
> memory.
>
> So, I think the reason for dma->highmem is different from global kswapd.
>
>
>
>
> > +     for (i = 0; i < pgdat->nr_zones; i++) {
> > +             struct zone *zone = pgdat->node_zones + i;
> > +
> > +             if (!populated_zone(zone))
> > +                     continue;
> > +
> > +             sc->nr_scanned = 0;
> > +             shrink_zone(priority, zone, sc);
> > +             total_scanned += sc->nr_scanned;
> > +
> > +             /*
> > +              * If we've done a decent amount of scanning and
> > +              * the reclaim ratio is low, start doing writepage
> > +              * even in laptop mode
> > +              */
> > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> 2) {
> > +                     sc->may_writepage = 1;
> > +             }
> > +     }
> > +
> > +     sc->nr_scanned = total_scanned;
> > +     return;
> > +}
> > +
> > +/*
> > + * Per cgroup background reclaim.
> > + * TODO: Take off the order since memcg always do order 0
> > + */
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> > +                                           int order)
> > +{
> > +     int i, nid;
> > +     int start_node;
> > +     int priority;
> > +     bool wmark_ok;
> > +     int loop;
> > +     pg_data_t *pgdat;
> > +     nodemask_t do_nodes;
> > +     unsigned long total_scanned;
> > +     struct scan_control sc = {
> > +             .gfp_mask = GFP_KERNEL,
> > +             .may_unmap = 1,
> > +             .may_swap = 1,
> > +             .nr_to_reclaim = ULONG_MAX,
> > +             .swappiness = vm_swappiness,
> > +             .order = order,
> > +             .mem_cgroup = mem_cont,
> > +     };
> > +
> > +loop_again:
> > +     do_nodes = NODE_MASK_NONE;
> > +     sc.may_writepage = !laptop_mode;
>
> I think may_writepage should start from '0' always. We're not sure
> the system is in memory shortage...we just want to release memory
> volunatary. write_page will add huge costs, I guess.
>
> For exmaple,
>        sc.may_writepage = !!loop
> may be better for memcg.
>
> BTW, you set nr_to_reclaim as ULONG_MAX here and doesn't modify it later.
>
> I think you should add some logic to fix it to right value.
>
> For example, before calling shrink_zone(),
>
> sc->nr_to_reclaim = min(SWAP_CLUSETR_MAX, memcg_usage_in_this_zone() /
> 100);  # 1% in this zone.
>
> if we love 'fair pressure for each zone'.
>
>
>
>
>
>
> > +     sc.nr_reclaimed = 0;
> > +     total_scanned = 0;
> > +
> > +     for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > +             sc.priority = priority;
> > +             wmark_ok = false;
> > +             loop = 0;
> > +
> > +             /* The swap token gets in the way of swapout... */
> > +             if (!priority)
> > +                     disable_swap_token();
> > +
> > +             if (priority == DEF_PRIORITY)
> > +                     do_nodes = node_states[N_ONLINE];
> > +
> > +             while (1) {
> > +                     nid = mem_cgroup_select_victim_node(mem_cont,
> > +                                                     &do_nodes);
> > +
> > +                     /* Indicate we have cycled the nodelist once
> > +                      * TODO: we might add MAX_RECLAIM_LOOP for
> preventing
> > +                      * kswapd burning cpu cycles.
> > +                      */
> > +                     if (loop == 0) {
> > +                             start_node = nid;
> > +                             loop++;
> > +                     } else if (nid == start_node)
> > +                             break;
> > +
> > +                     pgdat = NODE_DATA(nid);
> > +                     balance_pgdat_node(pgdat, order, &sc);
> > +                     total_scanned += sc.nr_scanned;
> > +
> > +                     /* Set the node which has at least
> > +                      * one reclaimable zone
> > +                      */
> > +                     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +                             struct zone *zone = pgdat->node_zones + i;
> > +
> > +                             if (!populated_zone(zone))
> > +                                     continue;
>
> How about checking whether memcg has pages on this node ?
>

Well, I might be able to add the following logic:

        enum lru_list l;
        unsigned long scan = 0;

        for_each_evictable_lru(l)
                scan += zone_nr_lru_pages(zone, sc, l);

        if (!populated_zone(zone) || !scan)
                continue;



> > +                     }
> > +                     if (i < 0)
> > +                             node_clear(nid, do_nodes);
> > +
> > +                     if (mem_cgroup_watermark_ok(mem_cont,
> > +                                                     CHARGE_WMARK_HIGH))
> {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +
> > +                     if (nodes_empty(do_nodes)) {
> > +                             wmark_ok = true;
> > +                             goto out;
> > +                     }
> > +             }
> > +
> > +             /* All the nodes are unreclaimable, kswapd is done */
> > +             if (nodes_empty(do_nodes)) {
> > +                     wmark_ok = true;
> > +                     goto out;
> > +             }
>
> Can this happen ?
>

Hmm, this looks duplicated. I was thinking of the "break" case, but the
nodes_empty() check inside the while loop should already have captured that
case.

--Ying

>
>
> > +
> > +             if (total_scanned && priority < DEF_PRIORITY - 2)
> > +                     congestion_wait(WRITE, HZ/10);
> > +
> > +             if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > +                     break;
> > +     }
> > +out:
> > +     if (!wmark_ok) {
> > +             cond_resched();
> > +
> > +             try_to_freeze();
> > +
> > +             goto loop_again;
> > +     }
> > +
> > +     return sc.nr_reclaimed;
> > +}
> > +#else
> >  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> >                                                       int order)
> >  {
> >       return 0;
> >  }
> > +#endif
> >
>
>
> Thanks,
> -Kame
>
>


* Re: [PATCH V4 06/10] Per-memcg background reclaim.
  2011-04-15  6:08     ` Ying Han
@ 2011-04-15  8:14       ` KAMEZAWA Hiroyuki
  2011-04-15 18:00         ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-15  8:14 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 14 Apr 2011 23:08:40 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, Apr 14, 2011 at 6:11 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:

> >
> > As you know, memcg works against user's memory, memory should be in highmem
> > zone.
> > Memcg-kswapd is not for memory-shortage, but for voluntary page dropping by
> > _user_.
> >
> 
> In some sense, yes, but it is also related to memory shortage on fully
> packed machines.
> 

No. _At this point_, this is just for freeing memory voluntarily before
hitting the limit, to gain performance. Anyway, this understanding does not
affect the patch itself.

> >
> > If this memcg-kswapd drops pages from lower zones first, ah, ok, it's good
> > for
> > the system because memcg's pages should be on higher zone if we have free
> > memory.
> >
> > So, I think the reason for dma->highmem is different from global kswapd.
> >
> 
> Yes, I agree that the logic of the dma->highmem ordering is not exactly the
> same for per-memcg kswapd and per-node kswapd. But the page allocation still
> happens from the other direction, and this is still good for the system, as
> you pointed out.
> 
> >
> >
> >
> >
> > > +     for (i = 0; i < pgdat->nr_zones; i++) {
> > > +             struct zone *zone = pgdat->node_zones + i;
> > > +
> > > +             if (!populated_zone(zone))
> > > +                     continue;
> > > +
> > > +             sc->nr_scanned = 0;
> > > +             shrink_zone(priority, zone, sc);
> > > +             total_scanned += sc->nr_scanned;
> > > +
> > > +             /*
> > > +              * If we've done a decent amount of scanning and
> > > +              * the reclaim ratio is low, start doing writepage
> > > +              * even in laptop mode
> > > +              */
> > > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> > 2) {
> > > +                     sc->may_writepage = 1;
> > > +             }
> > > +     }
> > > +
> > > +     sc->nr_scanned = total_scanned;
> > > +     return;
> > > +}
> > > +
> > > +/*
> > > + * Per cgroup background reclaim.
> > > + * TODO: Take off the order since memcg always do order 0
> > > + */
> > > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> > *mem_cont,
> > > +                                           int order)
> > > +{
> > > +     int i, nid;
> > > +     int start_node;
> > > +     int priority;
> > > +     bool wmark_ok;
> > > +     int loop;
> > > +     pg_data_t *pgdat;
> > > +     nodemask_t do_nodes;
> > > +     unsigned long total_scanned;
> > > +     struct scan_control sc = {
> > > +             .gfp_mask = GFP_KERNEL,
> > > +             .may_unmap = 1,
> > > +             .may_swap = 1,
> > > +             .nr_to_reclaim = ULONG_MAX,
> > > +             .swappiness = vm_swappiness,
> > > +             .order = order,
> > > +             .mem_cgroup = mem_cont,
> > > +     };
> > > +
> > > +loop_again:
> > > +     do_nodes = NODE_MASK_NONE;
> > > +     sc.may_writepage = !laptop_mode;
> >
> > I think may_writepage should start from '0' always. We're not sure
> > the system is in memory shortage...we just want to release memory
> > volunatary. write_page will add huge costs, I guess.
> >
> > For exmaple,
> >        sc.may_writepage = !!loop
> > may be better for memcg.
> >
> > BTW, you set nr_to_reclaim as ULONG_MAX here and doesn't modify it later.
> >
> > I think you should add some logic to fix it to right value.
> >
> > For example, before calling shrink_zone(),
> >
> > sc->nr_to_reclaim = min(SWAP_CLUSETR_MAX, memcg_usage_in_this_zone() /
> > 100);  # 1% in this zone.
> >
> > if we love 'fair pressure for each zone'.
> >
> 
> Hmm, I don't get it. Leaving nr_to_reclaim as ULONG_MAX in the kswapd
> case is intended to apply equal memory pressure to each zone.

And it needs to reclaim memory from that zone.
memcg kswapd can visit other zones/nodes because it does not work on behalf
of a particular zone/pgdat.

> So in shrink_zone(), we won't bail out of the following loop early:
>
>	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>					nr[LRU_INACTIVE_FILE]) {
>		...
>		if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
>			break;
>	}

Yes. So, by setting nr_to_reclaim to a proper value for a zone,
we can visit the next zone/node sooner. memcg's kswapd is not requested to
free memory from a particular node/zone. (But we'll need a hint for bias
later.)

With nr_to_reclaim set to ULONG_MAX, to quit this loop we need to
keep looping until all nr[lru] reach 0. When memcg kswapd finds that the
memcg's usage is hard to push below high_wmark, priority rises dramatically
and we'll see a long loop in this zone if the zone is busy.

memcg kswapd can instead visit the next zone rather than loop more. Then
we'll be able to reduce the cpu usage and contention caused by memcg_kswapd.

I think this do-more/skip-and-go-next logic will be a difficult issue
that needs long-term tuning and research. For now, I bet
ULONG_MAX is not the right choice. As try_to_free_pages() usually does,
SWAP_CLUSTER_MAX will be enough. With that, we can visit the next node.
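
Roughly, something like this sketch in balance_pgdat_node() before calling
shrink_zone() (memcg_usage_in_this_zone() is just a placeholder name, not an
existing helper, and the 1% cap needs experiments):

        /* cap the per-zone target so memcg kswapd moves on to the next
         * zone/node instead of draining every LRU list here */
        sc->nr_to_reclaim = min_t(unsigned long, SWAP_CLUSTER_MAX,
                                  memcg_usage_in_this_zone(mem_cont, zone) / 100);
        shrink_zone(priority, zone, sc);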

Thanks,
-Kame




* Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
  2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (9 preceding siblings ...)
  2011-04-14 22:54 ` [PATCH V4 10/10] Add some per-memcg stats Ying Han
@ 2011-04-15  9:40 ` Michal Hocko
  2011-04-15 16:40   ` Ying Han
  10 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2011-04-15  9:40 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm

Hi Ying,
sorry that I am jumping into game that late but I was quite busy after
returning back from LSF and LFCS.

On Thu 14-04-11 15:54:19, Ying Han wrote:
> The current implementation of memcg supports targeting reclaim when the
> cgroup is reaching its hard_limit and we do direct reclaim per cgroup.
> Per cgroup background reclaim is needed which helps to spread out memory
> pressure over longer period of time and smoothes out the cgroup performance.
> 
> If the cgroup is configured to use per cgroup background reclaim, a kswapd
> thread is created which only scans the per-memcg LRU list. 

Hmm, I am wondering if this fits into the get-rid-of-the-global-LRU
strategy. If we make the background reclaim per-cgroup how do we balance
from the global/zone POV? We can end up with all groups over the high
limit while a memory zone is under this watermark. Or am I missing
something?
I thought that plans for the background reclaim were same as for direct
reclaim so that kswapd would just evict pages from groups in the
round-robin fashion (in first round just those that are under limit and
proportionally when it cannot reach high watermark after it got through
all groups).

> Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> background reclaim and stop it. The watermarks are calculated based on
> the cgroup's limit_in_bytes.

I didn't have time to look at the patch how does the calculation work
yet but we should be careful to match the zone's watermark expectations.

> By default, the per-memcg kswapd threads are running under root cgroup. There
> is a per-memcg API which exports the pid of each kswapd thread, and userspace
> can configure cpu cgroup seperately.
> 
> I run through dd test on large file and then cat the file. Then I compared
> the reclaim related stats in memory.stat.
> 
> Step1: Create a cgroup with 500M memory_limit.
> $ mkdir /dev/cgroup/memory/A
> $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> $ echo $$ >/dev/cgroup/memory/A/tasks
> 
> Step2: Test and set the wmarks.
> $ cat /dev/cgroup/memory/A/memory.low_wmark_distance
> 0
> $ cat /dev/cgroup/memory/A/memory.high_wmark_distance
> 0

I remember that there was a resistance against exporting watermarks as
they are kernel internal thing.

> 
> $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> low_wmark 524288000
> high_wmark 524288000
> 
> $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> $ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance
> 
> $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> low_wmark  482344960
> high_wmark 471859200

low_wmark is higher than high_wmark?

[...]
> Note:
> This is the first effort of enhancing the target reclaim into memcg. Here are
> the existing known issues and our plan:
> 
> 1. there are one kswapd thread per cgroup. the thread is created when the
> cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> removed. In some enviroment when thousand of cgroups are being configured on
> a single host, we will have thousand of kswapd threads. The memory consumption
> would be 8k*100 = 8M. We don't see a big issue for now if the host can host
> that many of cgroups.

I think that zone background reclaim is much bigger issue than 8k per
kernel thread and too many threads... 
I am not sure how much orthogonal per-cgroup-per-thread vs. zone
approaches are, though.  Maybe it makes some sense to do both per-cgroup
and zone background reclaim.  Anyway I think that we should start with
the zone reclaim first.

[...]

> 4. no hierarchical reclaim support in this patchset. I would like to get to
> after the basic stuff are being accepted.

Just an idea.
If we did that from zone's POV then we could call mem_cgroup_hierarchical_reclaim,
right?

[...]

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
  2011-04-15  9:40 ` [PATCH V4 00/10] memcg: per cgroup background reclaim Michal Hocko
@ 2011-04-15 16:40   ` Ying Han
  2011-04-18  9:13     ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-15 16:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm


On Fri, Apr 15, 2011 at 2:40 AM, Michal Hocko <mhocko@suse.cz> wrote:

> Hi Ying,
> sorry that I am jumping into game that late but I was quite busy after
> returning back from LSF and LFCS.
>

Sure. Nice meeting you guys there and thank you for looking into this patch
:)

>
> On Thu 14-04-11 15:54:19, Ying Han wrote:
> > The current implementation of memcg supports targeting reclaim when the
> > cgroup is reaching its hard_limit and we do direct reclaim per cgroup.
> > Per cgroup background reclaim is needed which helps to spread out memory
> > pressure over longer period of time and smoothes out the cgroup
> performance.
> >
> > If the cgroup is configured to use per cgroup background reclaim, a
> kswapd
> > thread is created which only scans the per-memcg LRU list.
>
> Hmm, I am wondering if this fits into the get-rid-of-the-global-LRU
> strategy. If we make the background reclaim per-cgroup how do we balance
> from the global/zone POV? We can end up with all groups over the high
> limit while a memory zone is under this watermark. Or am I missing
> something?
> I thought that plans for the background reclaim were same as for direct
> reclaim so that kswapd would just evict pages from groups in the
> round-robin fashion (in first round just those that are under limit and
> proportionally when it cannot reach high watermark after it got through
> all groups).
>

I think you are talking about the soft_limit reclaim, which I am going to look
at next. The soft_limit reclaim is triggered under global memory pressure and
does round-robin across memcgs. I will also cover zone balancing by keeping a
second list of memcgs that are under their soft_limit.

Here is the summary of our LSF discussion :)
http://permalink.gmane.org/gmane.linux.kernel.mm/60966

>
> > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > background reclaim and stop it. The watermarks are calculated based on
> > the cgroup's limit_in_bytes.
>
> I didn't have time to look at the patch how does the calculation work
> yet but we should be careful to match the zone's watermark expectations.
>

I have an API in a following patch which provides high/low_wmark_distance to
tune the wmarks individually. By default, they are set to 0, which turns off
the per-memcg kswapd. For now, we are ok since the global kswapd is
still doing per-zone scanning and reclaiming :)

>
> > By default, the per-memcg kswapd threads are running under root cgroup.
> There
> > is a per-memcg API which exports the pid of each kswapd thread, and
> userspace
> > can configure cpu cgroup seperately.
> >
> > I run through dd test on large file and then cat the file. Then I
> compared
> > the reclaim related stats in memory.stat.
> >
> > Step1: Create a cgroup with 500M memory_limit.
> > $ mkdir /dev/cgroup/memory/A
> > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/A/tasks
> >
> > Step2: Test and set the wmarks.
> > $ cat /dev/cgroup/memory/A/memory.low_wmark_distance
> > 0
> > $ cat /dev/cgroup/memory/A/memory.high_wmark_distance
> > 0
>
>
They are used to tune the high/low_wmarks based on the hard_limit. We might
need to export that configuration to the user/admin, especially on machines
where the hard_limits are over-committed.

>
> >
> > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > low_wmark 524288000
> > high_wmark 524288000
> >
> > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > $ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance
> >
> > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > low_wmark  482344960
> > high_wmark 471859200
>
> low_wmark is higher than high_wmark?
>

Hah, it is confusing. I have them documented. Basically, low_wmark triggers
reclaim and high_wmark stops the reclaim. And we have

high_wmark < usage < low_wmark.
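
(The numbers in the example above are consistent with the watermarks being
derived as wmark = limit_in_bytes - distance:

        limit      = 500m = 524288000
        high_wmark = limit - high_wmark_distance = 524288000 - 52428800 = 471859200
        low_wmark  = limit - low_wmark_distance  = 524288000 - 41943040 = 482344960

so high_wmark < low_wmark < limit: background reclaim kicks in once usage
rises above low_wmark and keeps going until usage drops back below
high_wmark.)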

>
> [...]
> > Note:
> > This is the first effort of enhancing the target reclaim into memcg. Here
> are
> > the existing known issues and our plan:
> >
> > 1. there are one kswapd thread per cgroup. the thread is created when the
> > cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> > removed. In some enviroment when thousand of cgroups are being configured
> on
> > a single host, we will have thousand of kswapd threads. The memory
> consumption
> > would be 8k*100 = 8M. We don't see a big issue for now if the host can
> host
> > that many of cgroups.
>
> I think that zone background reclaim is much bigger issue than 8k per
> kernel thread and too many threads...
>

yes.


> I am not sure how much orthogonal per-cgroup-per-thread vs. zone
> approaches are, though.  Maybe it makes some sense to do both per-cgroup
> and zone background reclaim.  Anyway I think that we should start with
> the zone reclaim first.
>

I missed the point here. Can you clarify what you mean by the zone reclaim?

>
> [...]
>
> > 4. no hierarchical reclaim support in this patchset. I would like to get
> to
> > after the basic stuff are being accepted.
>
> Just an idea.
> If we did that from zone's POV then we could call
> mem_cgroup_hierarchical_reclaim,
> right?
>
Maybe. I need to think through that; for this version I don't plan to
include hierarchical reclaim.

--Ying

> [...]
>
> Thanks
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

[-- Attachment #2: Type: text/html, Size: 7584 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 06/10] Per-memcg background reclaim.
  2011-04-15  8:14       ` KAMEZAWA Hiroyuki
@ 2011-04-15 18:00         ` Ying Han
  0 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-15 18:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5692 bytes --]

On Fri, Apr 15, 2011 at 1:14 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 23:08:40 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Thu, Apr 14, 2011 at 6:11 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > >
> > > As you know, memcg works against user's memory, memory should be in
> highmem
> > > zone.
> > > Memcg-kswapd is not for memory-shortage, but for voluntary page
> dropping by
> > > _user_.
> > >
> >
> > in some sense, yes. but it would also be related to memory shortage on fully
> > packed machines.
> >
>
> No. _at this point_, this is just for freeing voluntarily before hitting the
> limit to gain performance. Anyway, this understanding is not affecting the
> patch itself.
>
> > >
> > > If this memcg-kswapd drops pages from lower zones first, ah, ok, it's
> good
> > > for
> > > the system because memcg's pages should be on higher zone if we have
> free
> > > memory.
> > >
> > > So, I think the reason for dma->highmem is different from global
> kswapd.
> > >
> >
> > yes. I agree that the logic of dma->highmem ordering is not exactly the
> same
> > from per-memcg kswapd and per-node kswapd. But still the page allocation
> > happens on the other side, and this is still good for the system as you
> > pointed out.
> >
> > >
> > >
> > >
> > >
> > > > +     for (i = 0; i < pgdat->nr_zones; i++) {
> > > > +             struct zone *zone = pgdat->node_zones + i;
> > > > +
> > > > +             if (!populated_zone(zone))
> > > > +                     continue;
> > > > +
> > > > +             sc->nr_scanned = 0;
> > > > +             shrink_zone(priority, zone, sc);
> > > > +             total_scanned += sc->nr_scanned;
> > > > +
> > > > +             /*
> > > > +              * If we've done a decent amount of scanning and
> > > > +              * the reclaim ratio is low, start doing writepage
> > > > +              * even in laptop mode
> > > > +              */
> > > > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed
> /
> > > 2) {
> > > > +                     sc->may_writepage = 1;
> > > > +             }
> > > > +     }
> > > > +
> > > > +     sc->nr_scanned = total_scanned;
> > > > +     return;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Per cgroup background reclaim.
> > > > + * TODO: Take off the order since memcg always do order 0
> > > > + */
> > > > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> > > *mem_cont,
> > > > +                                           int order)
> > > > +{
> > > > +     int i, nid;
> > > > +     int start_node;
> > > > +     int priority;
> > > > +     bool wmark_ok;
> > > > +     int loop;
> > > > +     pg_data_t *pgdat;
> > > > +     nodemask_t do_nodes;
> > > > +     unsigned long total_scanned;
> > > > +     struct scan_control sc = {
> > > > +             .gfp_mask = GFP_KERNEL,
> > > > +             .may_unmap = 1,
> > > > +             .may_swap = 1,
> > > > +             .nr_to_reclaim = ULONG_MAX,
> > > > +             .swappiness = vm_swappiness,
> > > > +             .order = order,
> > > > +             .mem_cgroup = mem_cont,
> > > > +     };
> > > > +
> > > > +loop_again:
> > > > +     do_nodes = NODE_MASK_NONE;
> > > > +     sc.may_writepage = !laptop_mode;
> > >
> > > I think may_writepage should start from '0' always. We're not sure
> > > the system is in memory shortage...we just want to release memory
> > > volunatary. write_page will add huge costs, I guess.
> > >
> > > For exmaple,
> > >        sc.may_writepage = !!loop
> > > may be better for memcg.
> > >
> > > BTW, you set nr_to_reclaim as ULONG_MAX here and doesn't modify it
> later.
> > >
> > > I think you should add some logic to fix it to right value.
> > >
> > > For example, before calling shrink_zone(),
> > >
> > > sc->nr_to_reclaim = min(SWAP_CLUSETR_MAX, memcg_usage_in_this_zone() /
> > > 100);  # 1% in this zone.
> > >
> > > if we love 'fair pressure for each zone'.
> > >
> >
> > Hmm. I don't get it. Leaving the nr_to_reclaim to be ULONG_MAX in kswapd
> > case which is intended to add equal memory pressure for each zone.
>
> And it need to reclaim memory from the zone.
> memcg can visit other zone/node because it's not work for zone/pgdat.
>
> > So in the shrink_zone, we won't bail out in the following condition:
> >
> >
> > >-------while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> > > >------->------->------->------->-------nr[LRU_INACTIVE_FILE]) {
> > >
> >
> >  >------->-------if (nr_reclaimed >= nr_to_reclaim && priority <
> > DEF_PRIORITY)
> > >------->------->-------break;
> >
> > }
>
> Yes. So, by setting nr_to_reclaim to a proper value for a zone,
> we can visit the next zone/node sooner. memcg's kswapd is not requested to
> free memory from a node/zone. (But we'll need a hint for bias, later.)
>
> By making nr_reclaimed to be ULONG_MAX, to quit this loop, we need to
> loop until all nr[lru] to be 0. When memcg kswapd finds that memcg's usage
> is difficult to be reduced under high_wmark, priority goes up dramatically
> and we'll see long loop in this zone if zone is busy.
>
> For memcg kswapd, it can visit next zone rather than loop more. Then,
> we'll be able to reduce cpu usage and contention by memcg_kswapd.
>
> I think this do-more/skip-and-next logic will be a difficult issue
> and need to be maintained with long time research. For now, I bet
> ULONG_MAX is not a choice. As usual try_to_free_page() does,
> SWAP_CLUSTER_MAX will be enough. As it is, we can visit next node.
>

Fair enough, and that makes sense. I will make the change in the next post.
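
For the record, the agreed change would look roughly like this in
balance_mem_cgroup_pgdat() (a sketch against the V4 code quoted above, not
the final patch):

	struct scan_control sc = {
		.gfp_mask = GFP_KERNEL,
		.may_unmap = 1,
		.may_swap = 1,
		/* bounded batch instead of ULONG_MAX, so shrink_zone() can
		 * bail out early and we move on to the next zone/node */
		.nr_to_reclaim = SWAP_CLUSTER_MAX,
		.swappiness = vm_swappiness,
		.order = order,
		.mem_cgroup = mem_cont,
	};

loop_again:
	do_nodes = NODE_MASK_NONE;
	/* no writeback on the first pass; allow it only on later loops */
	sc.may_writepage = !!loop;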

--Ying

>
> Thanks,
> -Kame
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 7774 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 01/10] Add kswapd descriptor
  2011-04-15  4:16       ` KAMEZAWA Hiroyuki
@ 2011-04-15 21:46         ` Ying Han
  0 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-15 21:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3818 bytes --]

On Thu, Apr 14, 2011 at 9:16 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 20:35:00 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Thu, Apr 14, 2011 at 5:04 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Thu, 14 Apr 2011 15:54:20 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > There is a kswapd kernel thread for each numa node. We will add a
> > > different
> > > > kswapd for each memcg. The kswapd is sleeping in the wait queue
> headed at
> > > > kswapd_wait field of a kswapd descriptor. The kswapd descriptor
> stores
> > > > information of node or memcg and it allows the global and per-memcg
> > > background
> > > > reclaim to share common reclaim algorithms.
> > > >
> > > > This patch adds the kswapd descriptor and moves the per-node kswapd
> to
> > > use the
> > > > new structure.
> > > >
> > >
> > > No objections to your direction but some comments.
> > >
> > > > changelog v2..v1:
> > > > 1. dynamic allocate kswapd descriptor and initialize the
> wait_queue_head
> > > of pgdat
> > > > at kswapd_run.
> > > > 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup
> > > kswapd
> > > > descriptor.
> > > >
> > > > changelog v3..v2:
> > > > 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later
> patch.
> > > > 2. rename thr in kswapd_run to something else.
> > > >
> > > > Signed-off-by: Ying Han <yinghan@google.com>
> > > > ---
> > > >  include/linux/mmzone.h |    3 +-
> > > >  include/linux/swap.h   |    7 ++++
> > > >  mm/page_alloc.c        |    1 -
> > > >  mm/vmscan.c            |   95
> > > ++++++++++++++++++++++++++++++++++++------------
> > > >  4 files changed, 80 insertions(+), 26 deletions(-)
> > > >
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 628f07b..6cba7d2 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -640,8 +640,7 @@ typedef struct pglist_data {
> > > >       unsigned long node_spanned_pages; /* total size of physical
> page
> > > >                                            range, including holes */
> > > >       int node_id;
> > > > -     wait_queue_head_t kswapd_wait;
> > > > -     struct task_struct *kswapd;
> > > > +     wait_queue_head_t *kswapd_wait;
> > > >       int kswapd_max_order;
> > > >       enum zone_type classzone_idx;
> > >
> > > I think pg_data_t should include struct kswapd in it, as
> > >
> > >        struct pglist_data {
> > >        .....
> > >                struct kswapd   kswapd;
> > >        };
> > > and you can add a macro as
> > >
> > > #define kswapd_waitqueue(kswapd)        (&(kswapd)->kswapd_wait)
> > > if it looks better.
> > >
> > > Why I recommend this is I think it's better to have 'struct kswapd'
> > > on the same page of pg_data_t or struct memcg.
> > > Do you have benefits to kmalloc() struct kswapd on damand ?
> > >
> >
> > So we don't end of have kswapd struct on memcgs' which doesn't have
> > per-memcg kswapd enabled. I don't see one is strongly better than the
> other
> > for the two approaches. If ok, I would like to keep as it is for this
> > verion. Hope this is ok for now.
> >
>
> My intension is to remove kswapd_spinlock. Can we remove it with
> dynamic allocation ? IOW, static allocation still requires spinlock ?
>

Thank you for pointing that out, which made me think a little harder about
this. I don't think we need the spinlock in this patch.

This is something I inherited from another kswapd patch we did, where we
allow one kswapd to reclaim from multiple pgdats. We need the spinlock there
to protect the per-kswapd pgdat list. However, we have a one-to-one mapping
here, so we can get rid of the lock. I will remove it in the next post.
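
Concretely, the simplified descriptor would look something like this (a
sketch; the field names are my guesses, not the posted code):

	/* One descriptor serves exactly one pgdat or one memcg, so there
	 * is no pgdat list to protect and hence no spinlock. */
	struct kswapd {
		wait_queue_head_t	kswapd_wait;
		pg_data_t		*kswapd_pgdat;	/* per-node kswapd */
		struct mem_cgroup	*kswapd_mem;	/* per-memcg kswapd (later patch) */
		struct task_struct	*kswapd_task;
	};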

--Ying

>
> Thanks,
> -Kame
>
>
>

[-- Attachment #2: Type: text/html, Size: 5361 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
  2011-04-15 16:40   ` Ying Han
@ 2011-04-18  9:13     ` Michal Hocko
  2011-04-18 17:01       ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2011-04-18  9:13 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm

On Fri 15-04-11 09:40:54, Ying Han wrote:
> On Fri, Apr 15, 2011 at 2:40 AM, Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Hi Ying,
> > sorry that I am jumping into game that late but I was quite busy after
> > returning back from LSF and LFCS.
> >
> 
> Sure. Nice meeting you guys there and thank you for looking into this patch
> :)

Yes, nice meeting.

> 
> >
> > On Thu 14-04-11 15:54:19, Ying Han wrote:
> > > The current implementation of memcg supports targeting reclaim when the
> > > cgroup is reaching its hard_limit and we do direct reclaim per cgroup.
> > > Per cgroup background reclaim is needed which helps to spread out memory
> > > pressure over longer period of time and smoothes out the cgroup
> > performance.
> > >
> > > If the cgroup is configured to use per cgroup background reclaim, a
> > kswapd
> > > thread is created which only scans the per-memcg LRU list.
> >
> > Hmm, I am wondering if this fits into the get-rid-of-the-global-LRU
> > strategy. If we make the background reclaim per-cgroup how do we balance
> > from the global/zone POV? We can end up with all groups over the high
> > limit while a memory zone is under this watermark. Or am I missing
> > something?
> > I thought that plans for the background reclaim were same as for direct
> > reclaim so that kswapd would just evict pages from groups in the
> > round-robin fashion (in first round just those that are under limit and
> > proportionally when it cannot reach high watermark after it got through
> > all groups).
> >
> 
> I think you are talking about the soft_limit reclaim which I am gonna look
> at next. 

I see. I am just concerned whether a 3rd level of reclaim is a good idea.
We would need to do background reclaim anyway (and to preserve the
original semantics it has to be somehow watermark controlled). I am just
wondering why we have to implement it separately from kswapd. Cannot we
just simply trigger the global kswapd, which would reclaim all cgroups that
are under watermarks? [I am sorry for my ignorance if that is what is
implemented in the series - I haven't got to the patches yet]

> The soft_limit reclaim
> is triggered under global memory pressure and doing round-robin across
> memcgs. I will also cover the
> zone-balancing by having second list of memgs under their soft_limit.
> 
> Here is the summary of our LSF discussion :)
> http://permalink.gmane.org/gmane.linux.kernel.mm/60966

Yes, I have read it and thanks for putting it together.

> > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > background reclaim and stop it. The watermarks are calculated based on
> > > the cgroup's limit_in_bytes.
> >
> > I didn't have time to look at the patch how does the calculation work
> > yet but we should be careful to match the zone's watermark expectations.
> >
> 
> I have API on the following patch which provide high/low_wmark_distance to
> tune wmarks individually individually.  By default, they are set to 0 which
> turn off the per-memcg kswapd. For now, we are ok since the global kswapd is
> still doing per-zone scanning and reclaiming :)
> 
> >
> > > By default, the per-memcg kswapd threads are running under root cgroup.
> > There
> > > is a per-memcg API which exports the pid of each kswapd thread, and
> > userspace
> > > can configure cpu cgroup seperately.
> > >
> > > I run through dd test on large file and then cat the file. Then I
> > compared
> > > the reclaim related stats in memory.stat.
> > >
> > > Step1: Create a cgroup with 500M memory_limit.
> > > $ mkdir /dev/cgroup/memory/A
> > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > >
> > > Step2: Test and set the wmarks.
> > > $ cat /dev/cgroup/memory/A/memory.low_wmark_distance
> > > 0
> > > $ cat /dev/cgroup/memory/A/memory.high_wmark_distance
> > > 0
> >
> >
> They are used to tune the high/low_marks based on the hard_limit. We might
> need to export that configuration to user admin especially on machines where
> they over-commit by hard_limit.

I remember there was some resistance against tuning watermarks
separately.

> > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > low_wmark 524288000
> > > high_wmark 524288000
> > >
> > > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > > $ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance
> > >
> > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > low_wmark  482344960
> > > high_wmark 471859200
> >
> > low_wmark is higher than high_wmark?
> >
> 
> hah, it is confusing. I have them documented. Basically, low_wmark triggers
> reclaim and high_wmark stop the reclaim. And we have
> 
> high_wmark < usage < low_wmark.

OK, I will look at it.

[...]

> > I am not sure how much orthogonal per-cgroup-per-thread vs. zone
> > approaches are, though.  Maybe it makes some sense to do both per-cgroup
> > and zone background reclaim.  Anyway I think that we should start with
> > the zone reclaim first.
> >
> 
> I missed the point here. Can you clarify the zone reclaim here?

kswapd does the background zone reclaim and you are trying to do
per-cgroup reclaim, right? I am concerned about those two fighting with
slightly different goals.

I am still thinking about whether background reclaim would be sufficient,
though. We would get rid of the per-cgroup threads and wouldn't create a new
reclaim interface.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
  2011-04-18  9:13     ` Michal Hocko
@ 2011-04-18 17:01       ` Ying Han
  2011-04-18 18:42         ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-18 17:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 6890 bytes --]

On Mon, Apr 18, 2011 at 2:13 AM, Michal Hocko <mhocko@suse.cz> wrote:

> On Fri 15-04-11 09:40:54, Ying Han wrote:
> > On Fri, Apr 15, 2011 at 2:40 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > > Hi Ying,
> > > sorry that I am jumping into game that late but I was quite busy after
> > > returning back from LSF and LFCS.
> > >
> >
> > Sure. Nice meeting you guys there and thank you for looking into this
> patch
> > :)
>
> Yes, nice meeting.
>
> >
> > >
> > > On Thu 14-04-11 15:54:19, Ying Han wrote:
> > > > The current implementation of memcg supports targeting reclaim when
> the
> > > > cgroup is reaching its hard_limit and we do direct reclaim per
> cgroup.
> > > > Per cgroup background reclaim is needed which helps to spread out
> memory
> > > > pressure over longer period of time and smoothes out the cgroup
> > > performance.
> > > >
> > > > If the cgroup is configured to use per cgroup background reclaim, a
> > > kswapd
> > > > thread is created which only scans the per-memcg LRU list.
> > >
> > > Hmm, I am wondering if this fits into the get-rid-of-the-global-LRU
> > > strategy. If we make the background reclaim per-cgroup how do we
> balance
> > > from the global/zone POV? We can end up with all groups over the high
> > > limit while a memory zone is under this watermark. Or am I missing
> > > something?
> > > I thought that plans for the background reclaim were same as for direct
> > > reclaim so that kswapd would just evict pages from groups in the
> > > round-robin fashion (in first round just those that are under limit and
> > > proportionally when it cannot reach high watermark after it got through
> > > all groups).
> > >
> >
> > I think you are talking about the soft_limit reclaim which I am gonna
> look
> > at next.
>
> I see. I am just concerned whether 3rd level of reclaim is a good idea.
> We would need to do background reclaim anyway (and to preserve the
> original semantic it has to be somehow watermark controlled). I am just
> wondering why we have to implement it separately from kswapd. Cannot we
> just simply trigger global kswapd which would reclaim all cgroups that
> are under watermarks? [I am sorry for my ignorance if that is what is
> implemented in the series - I haven't got to the patches yes]
>

They are different: per-zone reclaim vs per-memcg reclaim. The first one is
triggered if the zone is under memory pressure and we need to free pages to
serve further page allocations. The second one is triggered if the memcg is
under memory pressure and we need to free pages to leave room (limit - usage)
for the memcg to grow.

Both of them are needed, and that is how it is implemented on the direct
reclaim path. The kswapd batches only try to smooth out the system and memcg
performance by reclaiming pages proactively. They don't affect the
functionality.
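
To put the two triggers side by side, a rough sketch (zone_needs_kswapd(),
memcg_needs_kswapd() and mem_cgroup_low_wmark() are illustrative names, not
the actual interfaces):

	/* 1) global/zone kswapd: a zone is running low on free pages */
	static bool zone_needs_kswapd(struct zone *zone)
	{
		return zone_page_state(zone, NR_FREE_PAGES) <=
						low_wmark_pages(zone);
	}

	/* 2) per-memcg kswapd: the cgroup's usage approaches its own limit */
	static bool memcg_needs_kswapd(struct mem_cgroup *mem)
	{
		u64 usage = res_counter_read_u64(&mem->res, RES_USAGE);

		return usage > mem_cgroup_low_wmark(mem); /* hypothetical helper */
	}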

>
> > The soft_limit reclaim
> > is triggered under global memory pressure and doing round-robin across
> > memcgs. I will also cover the
> > zone-balancing by having second list of memgs under their soft_limit.
> >
> > Here is the summary of our LSF discussion :)
> > http://permalink.gmane.org/gmane.linux.kernel.mm/60966
>
> Yes, I have read it and thanks for putting it together.
>
sure.

>
> > > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > > background reclaim and stop it. The watermarks are calculated based
> on
> > > > the cgroup's limit_in_bytes.
> > >
> > > I didn't have time to look at the patch how does the calculation work
> > > yet but we should be careful to match the zone's watermark
> expectations.
> > >
> >
> > I have API on the following patch which provide high/low_wmark_distance
> to
> > tune wmarks individually individually.  By default, they are set to 0
> which
> > turn off the per-memcg kswapd. For now, we are ok since the global kswapd
> is
> > still doing per-zone scanning and reclaiming :)
> >
> > >
> > > > By default, the per-memcg kswapd threads are running under root
> cgroup.
> > > There
> > > > is a per-memcg API which exports the pid of each kswapd thread, and
> > > userspace
> > > > can configure cpu cgroup seperately.
> > > >
> > > > I run through dd test on large file and then cat the file. Then I
> > > compared
> > > > the reclaim related stats in memory.stat.
> > > >
> > > > Step1: Create a cgroup with 500M memory_limit.
> > > > $ mkdir /dev/cgroup/memory/A
> > > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > > >
> > > > Step2: Test and set the wmarks.
> > > > $ cat /dev/cgroup/memory/A/memory.low_wmark_distance
> > > > 0
> > > > $ cat /dev/cgroup/memory/A/memory.high_wmark_distance
> > > > 0
> > >
> > >
> > They are used to tune the high/low_marks based on the hard_limit. We
> might
> > need to export that configuration to user admin especially on machines
> where
> > they over-commit by hard_limit.
>
> I remember there was some resistance against tuning watermarks
> separately.
>

This API is based on KAMEZAWA's request. :)

>
> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > low_wmark 524288000
> > > > high_wmark 524288000
> > > >
> > > > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > > > $ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance
> > > >
> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > low_wmark  482344960
> > > > high_wmark 471859200
> > >
> > > low_wmark is higher than high_wmark?
> > >
> >
> > hah, it is confusing. I have them documented. Basically, low_wmark
> triggers
> > reclaim and high_wmark stop the reclaim. And we have
> >
> > high_wmark < usage < low_wmark.
>
> OK, I will look at it.
>
> [...]
>
> > > I am not sure how much orthogonal per-cgroup-per-thread vs. zone
> > > approaches are, though.  Maybe it makes some sense to do both
> per-cgroup
> > > and zone background reclaim.  Anyway I think that we should start with
> > > the zone reclaim first.
> > >
> >
> > I missed the point here. Can you clarify the zone reclaim here?
>
> kswapd does the background zone reclaim and you are trying to do
> per-cgroup reclaim, right? I am concerned about those two fighting with
> slightly different goal.
>
> I am still thinking whether backgroup reclaim would be sufficient,
> though. We would get rid of per-cgroup thread and wouldn't create a new
> reclaim interface.
>

The per-zone reclaim will look at memcgs and their soft_limits, and the
criteria are different from per-memcg background reclaim, where we look at
the hard_limit. This is how direct reclaim works on both sides, and kswapd
is just doing the work proactively.

Later, when we change the soft_limit reclaim to work on per-zone memory
pressure, the same logic will be changed in the per-zone try_to_free_pages().

Thanks

--Ying

> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

[-- Attachment #2: Type: text/html, Size: 9313 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
  2011-04-18 17:01       ` Ying Han
@ 2011-04-18 18:42         ` Michal Hocko
  2011-04-18 22:27           ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2011-04-18 18:42 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm

On Mon 18-04-11 10:01:20, Ying Han wrote:
> On Mon, Apr 18, 2011 at 2:13 AM, Michal Hocko <mhocko@suse.cz> wrote:
[...]
> > I see. I am just concerned whether 3rd level of reclaim is a good idea.
> > We would need to do background reclaim anyway (and to preserve the
> > original semantic it has to be somehow watermark controlled). I am just
> > wondering why we have to implement it separately from kswapd. Cannot we
> > just simply trigger global kswapd which would reclaim all cgroups that
> > are under watermarks? [I am sorry for my ignorance if that is what is
> > implemented in the series - I haven't got to the patches yes]
> >
> 
> They are different on per-zone reclaim vs per-memcg reclaim. The first
> one is triggered if the zone is under memory pressure and we need
> to free pages to serve further page allocations.  The second one is
> triggered if the memcg is under memory pressure and we need to free
> pages to leave room (limit - usage) for the memcg to grow.

OK, I see.


> 
> Both of them are needed and that is how it is implemented on the direct
> reclaim path. The kswapd batches only try to
> smooth out the system and memcg performance by reclaiming pages proactively.
> It doesn't affecting the functionality.

I am still wondering, isn't this just a nice-to-have feature rather than a
must-have in order to get rid of the global LRU? Doesn't it make the
transition more complicated? I have noticed many if-elses in the kswapd path
to distinguish per-cgroup from the traditional global background reclaim.

[...]

> > > > > Step1: Create a cgroup with 500M memory_limit.
> > > > > $ mkdir /dev/cgroup/memory/A
> > > > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > > > >
> > > > > Step2: Test and set the wmarks.
> > > > > $ cat /dev/cgroup/memory/A/memory.low_wmark_distance
> > > > > 0
> > > > > $ cat /dev/cgroup/memory/A/memory.high_wmark_distance
> > > > > 0
> > > >
> > > >
> > > They are used to tune the high/low_marks based on the hard_limit. We
> > might
> > > need to export that configuration to user admin especially on machines
> > where
> > > they over-commit by hard_limit.
> >
> > I remember there was some resistance against tuning watermarks
> > separately.
> >
> 
> This API is based on KAMEZAWA's request. :)

This was just an FYI. Watermarks were considered an internal thing, so I
wouldn't be surprised if this turned out to be somewhat controversial.

> 
> >
> > > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > > low_wmark 524288000
> > > > > high_wmark 524288000
> > > > >
> > > > > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > > > > $ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance
> > > > >
> > > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > > low_wmark  482344960
> > > > > high_wmark 471859200
> > > >
> > > > low_wmark is higher than high_wmark?
> > > >
> > >
> > > hah, it is confusing. I have them documented. Basically, low_wmark
> > > triggers reclaim and high_wmark stop the reclaim. And we have
> > >
> > > high_wmark < usage < low_wmark.

OK, I see how you calculate those watermarks now, but it is really
confusing for those who are used to the traditional watermark semantics.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
  2011-04-18 18:42         ` Michal Hocko
@ 2011-04-18 22:27           ` Ying Han
  2011-04-19  2:48             ` Zhu Yanhai
  0 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2011-04-18 22:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4553 bytes --]

On Mon, Apr 18, 2011 at 11:42 AM, Michal Hocko <mhocko@suse.cz> wrote:

> On Mon 18-04-11 10:01:20, Ying Han wrote:
> > On Mon, Apr 18, 2011 at 2:13 AM, Michal Hocko <mhocko@suse.cz> wrote:
> [...]
> > > I see. I am just concerned whether 3rd level of reclaim is a good idea.
> > > We would need to do background reclaim anyway (and to preserve the
> > > original semantic it has to be somehow watermark controlled). I am just
> > > wondering why we have to implement it separately from kswapd. Cannot we
> > > just simply trigger global kswapd which would reclaim all cgroups that
> > > are under watermarks? [I am sorry for my ignorance if that is what is
> > > implemented in the series - I haven't got to the patches yes]
> > >
> >
> > They are different on per-zone reclaim vs per-memcg reclaim. The first
> > one is triggered if the zone is under memory pressure and we need
> > to free pages to serve further page allocations.  The second one is
> > triggered if the memcg is under memory pressure and we need to free
> > pages to leave room (limit - usage) for the memcg to grow.
>
> OK, I see.
>
>
> >
> > Both of them are needed and that is how it is implemented on the direct
> > reclaim path. The kswapd batches only try to
> > smooth out the system and memcg performance by reclaiming pages
> proactively.
> > It doesn't affecting the functionality.
>
> I am still wondering, isn't this just a nice to have feature rather
> than must to have in order to get rid of the global LRU?

The per-memcg kswapd is a must-have, and it is less related to the effort of
"get rid of the global LRU" than the next patch I am looking at, "enhance the
soft_limit reclaim". So this is the structure we will end up with:

background reclaim:
1. per-memcg: this patch
2. global: targeting reclaim, by replacing the per-zone reclaim with soft_limit reclaim

direct reclaim:
1. per-memcg: no change from today
2. global: targeting reclaim, by replacing the per-zone reclaim with soft_limit reclaim


> Doesn't it make transition more complicated. I have noticed many if-else in
> kswapd path to
> distinguish per-cgroup from the traditional global background reclaim.
>





>
> [...]
>
> > > > > > Step1: Create a cgroup with 500M memory_limit.
> > > > > > $ mkdir /dev/cgroup/memory/A
> > > > > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > > > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > > > > >
> > > > > > Step2: Test and set the wmarks.
> > > > > > $ cat /dev/cgroup/memory/A/memory.low_wmark_distance
> > > > > > 0
> > > > > > $ cat /dev/cgroup/memory/A/memory.high_wmark_distance
> > > > > > 0
> > > > >
> > > > >
> > > > They are used to tune the high/low_marks based on the hard_limit. We
> > > might
> > > > need to export that configuration to user admin especially on
> machines
> > > where
> > > > they over-commit by hard_limit.
> > >
> > > I remember there was some resistance against tuning watermarks
> > > separately.
> > >
> >
> > This API is based on KAMEZAWA's request. :)
>
> This was just as FYI. Watermarks were considered internal thing. So I
> wouldn't be surprised if this got somehow controversial.
>

We went back and forth on how to set the high/low wmarks for different
configurations (over-commit or not). So far, giving the user the ability to
set the wmarks seems the most feasible way of fulfilling the requirement.

>
> >
> > >
> > > > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > > > low_wmark 524288000
> > > > > > high_wmark 524288000
> > > > > >
> > > > > > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > > > > > $ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance
> > > > > >
> > > > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > > > low_wmark  482344960
> > > > > > high_wmark 471859200
> > > > >
> > > > > low_wmark is higher than high_wmark?
> > > > >
> > > >
> > > > hah, it is confusing. I have them documented. Basically, low_wmark
> > > > triggers reclaim and high_wmark stop the reclaim. And we have
> > > >
> > > > high_wmark < usage < low_wmark.
>
> OK, I see how you calculate those watermarks now but it is really
> confusing for those who are used to traditional watermark semantic.
>

that is true.  I adopted the initial comment from Mel, where we keep the same
logic of triggering and stopping kswapd with low/high_wmarks and also compare
the usage_in_bytes to the wmarks. Either way is confusing, and I guess we just
need to document it well.

--Ying

--
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

[-- Attachment #2: Type: text/html, Size: 6691 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
  2011-04-18 22:27           ` Ying Han
@ 2011-04-19  2:48             ` Zhu Yanhai
  2011-04-19  3:46               ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: Zhu Yanhai @ 2011-04-19  2:48 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, Dave Hansen,
	linux-mm

Hi,

2011/4/19 Ying Han <yinghan@google.com>:
>
> that is true.  I adopt the initial comment from Mel where we keep the same
> logic of triggering and stopping kswapd with low/high_wmarks and also
> comparing the usage_in_bytes to the wmarks. Either way is confusing and
> guess we just need to document it well.

IMO another thing that needs to be documented well is that a user must set up
high_wmark_distance before setting up low_wmark_distance to make it start
working, and must zero low_wmark_distance before zeroing high_wmark_distance
to stop it. Otherwise the write won't pass the sanity check, which is not
quite obvious.
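
To spell the constraint out, here is a guess at what the sanity check
enforces (standalone-C sketch, not the actual code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Presumed invariant: either both distances are 0 (per-memcg kswapd
 * disabled) or low < high must hold, so that
 * low_wmark = limit - low_distance stays above
 * high_wmark = limit - high_distance.
 */
static bool wmark_distances_valid(uint64_t low_distance,
                                  uint64_t high_distance)
{
        if (low_distance == 0 && high_distance == 0)
                return true;
        return low_distance < high_distance;
}

/* Hence the ordering: set high_wmark_distance first, then
 * low_wmark_distance; clear low_wmark_distance first, then
 * high_wmark_distance. */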

Thanks,
Zhu Yanhai

> --Ying
>>
>> --
>> Michal Hocko
>> SUSE Labs
>> SUSE LINUX s.r.o.
>> Lihovarska 1060/12
>> 190 00 Praha 9
>> Czech Republic
>
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
  2011-04-19  2:48             ` Zhu Yanhai
@ 2011-04-19  3:46               ` Ying Han
  0 siblings, 0 replies; 43+ messages in thread
From: Ying Han @ 2011-04-19  3:46 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: Michal Hocko, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, Dave Hansen,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 930 bytes --]

On Mon, Apr 18, 2011 at 7:48 PM, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:

> Hi,
>
> 2011/4/19 Ying Han <yinghan@google.com>:
> >
> > that is true.  I adopt the initial comment from Mel where we keep the
> same
> > logic of triggering and stopping kswapd with low/high_wmarks and also
> > comparing the usage_in_bytes to the wmarks. Either way is confusing and
> > guess we just need to document it well.
>
> IMO another thing need to document well is that a user must setup
> high_wmark_distance before setup low_wmark_distance to to make it
> start work, and zero  low_wmark_distance before zero
> high_wmark_distance to stop it. Otherwise it won't pass the sanity
> check, which is not quite obvious.
>

yes. I will add that to the documentation.

--Ying

>
> Thanks,
> Zhu Yanhai
>
> > --Ying
> >>
> >> --
> >> Michal Hocko
> >> SUSE Labs
> >> SUSE LINUX s.r.o.
> >> Lihovarska 1060/12
> >> 190 00 Praha 9
> >> Czech Republic
> >
> >
>

[-- Attachment #2: Type: text/html, Size: 1634 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 07/10] Add per-memcg zone "unreclaimable"
  2011-04-15  1:32   ` KAMEZAWA Hiroyuki
@ 2012-03-19  8:27     ` Zhu Yanhai
  2012-03-20  5:45       ` Ying Han
  0 siblings, 1 reply; 43+ messages in thread
From: Zhu Yanhai @ 2012-03-19  8:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, linux-mm

2011/4/15 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
> On Thu, 14 Apr 2011 15:54:26 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> After reclaiming each node per memcg, it checks mem_cgroup_watermark_ok()
>> and breaks the priority loop if it returns true. The per-memcg zone will
>> be marked as "unreclaimable" if the scanning rate is much greater than the
>> reclaiming rate on the per-memcg LRU. The bit is cleared when there is a
>> page charged to the memcg being freed. Kswapd breaks the priority loop if
>> all the zones are marked as "unreclaimable".
>>
>> changelog v4..v3:
>> 1. split off from the per-memcg background reclaim patch in V3.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  include/linux/memcontrol.h |   30 ++++++++++++++
>>  include/linux/swap.h       |    2 +
>>  mm/memcontrol.c            |   96 ++++++++++++++++++++++++++++++++++++++++++++
>>  mm/vmscan.c                |   19 +++++++++
>>  4 files changed, 147 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index d4ff7f2..a8159f5 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -155,6 +155,12 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                               gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>> +                             unsigned long nr_scanned);
>>
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
>> @@ -345,6 +351,25 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>  {
>>  }
>>
>> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
>> +                                             struct zone *zone,
>> +                                             unsigned long nr_scanned)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
>> +                                                     struct zone *zone)
>> +{
>> +}
>> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
>> +             struct zone *zone)
>> +{
>> +}
>> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
>> +                                             struct zone *zone)
>> +{
>> +}
>> +
>>  static inline
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                           gfp_t gfp_mask)
>> @@ -363,6 +388,11 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
>>  {
>>  }
>>
>> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
>> +                                                             int zid)
>> +{
>> +     return false;
>> +}
>>  #endif /* CONFIG_CGROUP_MEM_CONT */
>>
>>  #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 17e0511..319b800 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -160,6 +160,8 @@ enum {
>>       SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
>>  };
>>
>> +#define ZONE_RECLAIMABLE_RATE 6
>> +
>>  #define SWAP_CLUSTER_MAX 32
>>  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e22351a..da6a130 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
>>       bool                    on_tree;
>>       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
>>                                               /* use container_of        */
>> +     unsigned long           pages_scanned;  /* since last reclaim */
>> +     bool                    all_unreclaimable;      /* All pages pinned */
>>  };
>> +
>>  /* Macro for accessing counter */
>>  #define MEM_CGROUP_ZSTAT(mz, idx)    ((mz)->count[(idx)])
>>
>> @@ -1135,6 +1138,96 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>>       return &mz->reclaim_stat;
>>  }
>>
>> +static unsigned long mem_cgroup_zone_reclaimable_pages(
>> +                                     struct mem_cgroup_per_zone *mz)
>> +{
>> +     int nr;
>> +     nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
>> +             MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
>> +
>> +     if (nr_swap_pages > 0)
>> +             nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
>> +                     MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
>> +
>> +     return nr;
>> +}
>> +
>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>> +                                             unsigned long nr_scanned)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             mz->pages_scanned += nr_scanned;
>> +}
>> +
>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +
>> +     if (!mem)
>> +             return 0;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             return mz->pages_scanned <
>> +                             mem_cgroup_zone_reclaimable_pages(mz) *
>> +                             ZONE_RECLAIMABLE_RATE;
>> +     return 0;
>> +}
>> +
>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return false;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             return mz->all_unreclaimable;
>> +
>> +     return false;
>> +}
>> +
>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             mz->all_unreclaimable = true;
>> +}
>> +
>> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = page_cgroup_zoneinfo(mem, page);
>> +     if (mz) {
>> +             mz->pages_scanned = 0;
>> +             mz->all_unreclaimable = false;
>> +     }
>> +
>> +     return;
>> +}
>> +
>>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>                                       struct list_head *dst,
>>                                       unsigned long *scanned, int order,
>> @@ -2801,6 +2894,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>        * special functions.
>>        */
>>
>> +     mem_cgroup_clear_unreclaimable(mem, page);
>
> Hmm, this will easily cause cache ping-pong. (free_page() clears it after taking
> zone->lock....in batched manner.)
>
> Could you consider a way to make this low cost ?
>
> One way is using memcg_check_event() with some low event trigger.
> Second way is usign memcg_batch.
> In many case, we can expect a chunk of free pages are from the same zone.
> Then, add a new member to batch_memcg as
>
> struct memcg_batch_info {
>        .....
>        struct zone *zone;      # a zone page is last uncharged.
>        ...
> }
>
> Then,
> ==
> static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
>                                   unsigned int nr_pages,
> +                                  struct page *page,
>                                   const enum charge_type ctype)
> {
>        struct memcg_batch_info *batch = NULL;
> .....
>
>        if (batch->zone != page_zone(page)) {
>                mem_cgroup_clear_unreclaimable(mem, page);
>        }
> direct_uncharge:
>        mem_cgroup_clear_unreclaimable(mem, page);
> ....
> }
> ==
>
> This will reduce overhead dramatically.
>

Excuse me, but I don't quite understand this part. IMHO this is to avoid
calling mem_cgroup_clear_unreclaimable() for every single page during a
munmap()/free_pages() that frees many pages, which is unnecessary because
the zone turns 'reclaimable' again as soon as the first page is uncharged.
Then why can't we just say:

    if (mem_cgroup_zoneinfo(mem, page_to_nid(page),
                            page_zonenum(page))->all_unreclaimable) {
            mem_cgroup_clear_unreclaimable(mem, page);
    }

--
Thanks,
Zhu Yanhai


>
>
>>       unlock_page_cgroup(pc);
>>       /*
>>        * even after unlock, we have mem->res.usage here and this memcg
>> @@ -4569,6 +4663,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>>               mz->usage_in_excess = 0;
>>               mz->on_tree = false;
>>               mz->mem = mem;
>> +             mz->pages_scanned = 0;
>> +             mz->all_unreclaimable = false;
>>       }
>>       return 0;
>>  }
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b8345d2..c081112 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>>                                       ISOLATE_BOTH : ISOLATE_INACTIVE,
>>                       zone, sc->mem_cgroup,
>>                       0, file);
>> +
>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
>> +
>>               /*
>>                * mem_cgroup_isolate_pages() keeps track of
>>                * scanned pages on its own.
>> @@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>>                * mem_cgroup_isolate_pages() keeps track of
>>                * scanned pages on its own.
>>                */
>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>>       }
>>
>>       reclaim_stat->recent_scanned[file] += nr_taken;
>> @@ -2648,6 +2652,7 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
>>       unsigned long total_scanned = 0;
>>       struct mem_cgroup *mem_cont = sc->mem_cgroup;
>>       int priority = sc->priority;
>> +     int nid = pgdat->node_id;
>>
>>       /*
>>        * Now scan the zone in the dma->highmem direction, and we scan
>> @@ -2664,10 +2669,20 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
>>               if (!populated_zone(zone))
>>                       continue;
>>
>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
>> +                     priority != DEF_PRIORITY)
>> +                     continue;
>> +
>>               sc->nr_scanned = 0;
>>               shrink_zone(priority, zone, sc);
>>               total_scanned += sc->nr_scanned;
>>
>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
>> +                     continue;
>> +
>> +             if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
>> +                     mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
>> +
>>               /*
>>                * If we've done a decent amount of scanning and
>>                * the reclaim ratio is low, start doing writepage
>> @@ -2752,6 +2767,10 @@ loop_again:
>>
>>                               if (!populated_zone(zone))
>>                                       continue;
>> +
>> +                             if (!mem_cgroup_mz_unreclaimable(mem_cont,
>> +                                                             zone))
>> +
>
> Ah, okay. this will work.
>
> Thanks,
> -Kame
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 07/10] Add per-memcg zone "unreclaimable"
  2012-03-19  8:27     ` Zhu Yanhai
@ 2012-03-20  5:45       ` Ying Han
  2012-03-22  1:13         ` Zhu Yanhai
  0 siblings, 1 reply; 43+ messages in thread
From: Ying Han @ 2012-03-20  5:45 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
	Dave Hansen, linux-mm

On Mon, Mar 19, 2012 at 1:27 AM, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
> 2011/4/15 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
>> On Thu, 14 Apr 2011 15:54:26 -0700
>> Ying Han <yinghan@google.com> wrote:
>>
>>> After reclaiming each node per memcg, it checks mem_cgroup_watermark_ok()
>>> and breaks the priority loop if it returns true. The per-memcg zone will
>>> be marked as "unreclaimable" if the scanning rate is much greater than the
>>> reclaiming rate on the per-memcg LRU. The bit is cleared when there is a
>>> page charged to the memcg being freed. Kswapd breaks the priority loop if
>>> all the zones are marked as "unreclaimable".
>>>
>>> changelog v4..v3:
>>> 1. split off from the per-memcg background reclaim patch in V3.
>>>
>>> Signed-off-by: Ying Han <yinghan@google.com>
>>> ---
>>>  include/linux/memcontrol.h |   30 ++++++++++++++
>>>  include/linux/swap.h       |    2 +
>>>  mm/memcontrol.c            |   96 ++++++++++++++++++++++++++++++++++++++++++++
>>>  mm/vmscan.c                |   19 +++++++++
>>>  4 files changed, 147 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index d4ff7f2..a8159f5 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -155,6 +155,12 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>>                                               gfp_t gfp_mask);
>>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
>>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
>>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>>> +                             unsigned long nr_scanned);
>>>
>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
>>> @@ -345,6 +351,25 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>>  {
>>>  }
>>>
>>> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
>>> +                                             struct zone *zone,
>>> +                                             unsigned long nr_scanned)
>>> +{
>>> +}
>>> +
>>> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
>>> +                                                     struct zone *zone)
>>> +{
>>> +}
>>> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
>>> +             struct zone *zone)
>>> +{
>>> +}
>>> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
>>> +                                             struct zone *zone)
>>> +{
>>> +}
>>> +
>>>  static inline
>>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>>                                           gfp_t gfp_mask)
>>> @@ -363,6 +388,11 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
>>>  {
>>>  }
>>>
>>> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
>>> +                                                             int zid)
>>> +{
>>> +     return false;
>>> +}
>>>  #endif /* CONFIG_CGROUP_MEM_CONT */
>>>
>>>  #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 17e0511..319b800 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -160,6 +160,8 @@ enum {
>>>       SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
>>>  };
>>>
>>> +#define ZONE_RECLAIMABLE_RATE 6
>>> +
>>>  #define SWAP_CLUSTER_MAX 32
>>>  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index e22351a..da6a130 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
>>>       bool                    on_tree;
>>>       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
>>>                                               /* use container_of        */
>>> +     unsigned long           pages_scanned;  /* since last reclaim */
>>> +     bool                    all_unreclaimable;      /* All pages pinned */
>>>  };
>>> +
>>>  /* Macro for accessing counter */
>>>  #define MEM_CGROUP_ZSTAT(mz, idx)    ((mz)->count[(idx)])
>>>
>>> @@ -1135,6 +1138,96 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>>>       return &mz->reclaim_stat;
>>>  }
>>>
>>> +static unsigned long mem_cgroup_zone_reclaimable_pages(
>>> +                                     struct mem_cgroup_per_zone *mz)
>>> +{
>>> +     int nr;
>>> +     nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
>>> +             MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
>>> +
>>> +     if (nr_swap_pages > 0)
>>> +             nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
>>> +                     MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
>>> +
>>> +     return nr;
>>> +}
>>> +
>>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>>> +                                             unsigned long nr_scanned)
>>> +{
>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>> +     int nid = zone_to_nid(zone);
>>> +     int zid = zone_idx(zone);
>>> +
>>> +     if (!mem)
>>> +             return;
>>> +
>>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>>> +     if (mz)
>>> +             mz->pages_scanned += nr_scanned;
>>> +}
>>> +
>>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
>>> +{
>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>> +
>>> +     if (!mem)
>>> +             return 0;
>>> +
>>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>>> +     if (mz)
>>> +             return mz->pages_scanned <
>>> +                             mem_cgroup_zone_reclaimable_pages(mz) *
>>> +                             ZONE_RECLAIMABLE_RATE;
>>> +     return 0;
>>> +}
>>> +
>>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>>> +{
>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>> +     int nid = zone_to_nid(zone);
>>> +     int zid = zone_idx(zone);
>>> +
>>> +     if (!mem)
>>> +             return false;
>>> +
>>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>>> +     if (mz)
>>> +             return mz->all_unreclaimable;
>>> +
>>> +     return false;
>>> +}
>>> +
>>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>>> +{
>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>> +     int nid = zone_to_nid(zone);
>>> +     int zid = zone_idx(zone);
>>> +
>>> +     if (!mem)
>>> +             return;
>>> +
>>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>>> +     if (mz)
>>> +             mz->all_unreclaimable = true;
>>> +}
>>> +
>>> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
>>> +{
>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>> +
>>> +     if (!mem)
>>> +             return;
>>> +
>>> +     mz = page_cgroup_zoneinfo(mem, page);
>>> +     if (mz) {
>>> +             mz->pages_scanned = 0;
>>> +             mz->all_unreclaimable = false;
>>> +     }
>>> +
>>> +     return;
>>> +}
>>> +
>>>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>>                                       struct list_head *dst,
>>>                                       unsigned long *scanned, int order,
>>> @@ -2801,6 +2894,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>>        * special functions.
>>>        */
>>>
>>> +     mem_cgroup_clear_unreclaimable(mem, page);
>>
>> Hmm, this will easily cause cache ping-pong. (free_page() clears the zone's
>> all_unreclaimable flag after taking zone->lock, i.e. in a batched manner.)
>>
>> Could you consider a way to make this lower cost?
>>
>> One way is to use memcg_check_event() with some low event trigger.
>> A second way is to use memcg_batch.
>> In many cases, we can expect a chunk of freed pages to come from the same zone.
>> Then, add a new member to struct memcg_batch_info:
>>
>> struct memcg_batch_info {
>>        .....
>>        struct zone *zone;      # a zone page is last uncharged.
>>        ...
>> }
>>
>> Then,
>> ==
>> static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
>>                                   unsigned int nr_pages,
>> +                                  struct page *page,
>>                                   const enum charge_type ctype)
>> {
>>        struct memcg_batch_info *batch = NULL;
>> .....
>>
>>        if (batch->zone != page_zone(page)) {
>>                mem_cgroup_clear_unreclaimable(mem, page);
>>        }
>> direct_uncharge:
>>        mem_cgroup_clear_unreclaimable(mem, page);
>> ....
>> }
>> ==
>>
>> This will reduce overhead dramatically.
>>
>
> Excuse me, but I don't quite understand this part. IMHO this is to
> avoid calling mem_cgroup_clear_unreclaimable() for each single page
> during a munmap()/free_pages() that frees many pages, which is
> unnecessary because the zone turns 'reclaimable' again at the first
> page uncharged.
> Then why can't we just say,
>   if (mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page))->all_unreclaimable) {
>            mem_cgroup_clear_unreclaimable(mem, page);
>    }

Are you suggesting we replace the batching with the code above?

--Ying
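
For reference, here is a minimal sketch of the batched clearing KAMEZAWA
describes above. It is illustrative only, not code from the posted patch:
the helper name and the cached last_zone pointer (which would live in
struct memcg_batch_info) are assumptions; it reuses only
mem_cgroup_clear_unreclaimable() from the patch and page_zone() from
<linux/mm.h>.

==
/*
 * Illustrative sketch (not from the posted patch): clear the per-memcg
 * per-zone "unreclaimable" state once per zone change instead of once
 * per uncharged page.
 */
static void memcg_batch_clear_unreclaimable(struct mem_cgroup *mem,
                                            struct page *page,
                                            struct zone **last_zone)
{
        struct zone *zone = page_zone(page);

        /* Same zone as the previous uncharged page: nothing to do. */
        if (*last_zone == zone)
                return;

        /* Zone changed: clear the flag once and remember the new zone. */
        mem_cgroup_clear_unreclaimable(mem, page);
        *last_zone = zone;
}
==

The direct (unbatched) uncharge path would still clear unconditionally, as
the posted patch does today, so correctness is unchanged; only the number of
writes to mem_cgroup_per_zone during a large free shrinks.
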
> --
> Thanks,
> Zhu Yanhai
>
>
>>
>>
>>>       unlock_page_cgroup(pc);
>>>       /*
>>>        * even after unlock, we have mem->res.usage here and this memcg
>>> @@ -4569,6 +4663,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>>>               mz->usage_in_excess = 0;
>>>               mz->on_tree = false;
>>>               mz->mem = mem;
>>> +             mz->pages_scanned = 0;
>>> +             mz->all_unreclaimable = false;
>>>       }
>>>       return 0;
>>>  }
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index b8345d2..c081112 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>>>                                       ISOLATE_BOTH : ISOLATE_INACTIVE,
>>>                       zone, sc->mem_cgroup,
>>>                       0, file);
>>> +
>>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
>>> +
>>>               /*
>>>                * mem_cgroup_isolate_pages() keeps track of
>>>                * scanned pages on its own.
>>> @@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>>>                * mem_cgroup_isolate_pages() keeps track of
>>>                * scanned pages on its own.
>>>                */
>>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>>>       }
>>>
>>>       reclaim_stat->recent_scanned[file] += nr_taken;
>>> @@ -2648,6 +2652,7 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
>>>       unsigned long total_scanned = 0;
>>>       struct mem_cgroup *mem_cont = sc->mem_cgroup;
>>>       int priority = sc->priority;
>>> +     int nid = pgdat->node_id;
>>>
>>>       /*
>>>        * Now scan the zone in the dma->highmem direction, and we scan
>>> @@ -2664,10 +2669,20 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
>>>               if (!populated_zone(zone))
>>>                       continue;
>>>
>>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
>>> +                     priority != DEF_PRIORITY)
>>> +                     continue;
>>> +
>>>               sc->nr_scanned = 0;
>>>               shrink_zone(priority, zone, sc);
>>>               total_scanned += sc->nr_scanned;
>>>
>>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
>>> +                     continue;
>>> +
>>> +             if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
>>> +                     mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
>>> +
>>>               /*
>>>                * If we've done a decent amount of scanning and
>>>                * the reclaim ratio is low, start doing writepage
>>> @@ -2752,6 +2767,10 @@ loop_again:
>>>
>>>                               if (!populated_zone(zone))
>>>                                       continue;
>>> +
>>> +                             if (!mem_cgroup_mz_unreclaimable(mem_cont,
>>> +                                                             zone))
>>> +
>>
>> Ah, okay. this will work.
>>
>> Thanks,
>> -Kame
>>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V4 07/10] Add per-memcg zone "unreclaimable"
  2012-03-20  5:45       ` Ying Han
@ 2012-03-22  1:13         ` Zhu Yanhai
  0 siblings, 0 replies; 43+ messages in thread
From: Zhu Yanhai @ 2012-03-22  1:13 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
	Dave Hansen, linux-mm

2012/3/20 Ying Han <yinghan@google.com>:
> On Mon, Mar 19, 2012 at 1:27 AM, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
>> 2011/4/15 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
>>> On Thu, 14 Apr 2011 15:54:26 -0700
>>> Ying Han <yinghan@google.com> wrote:
>>>
>>>> After reclaiming each node per memcg, it checks mem_cgroup_watermark_ok()
>>>> and breaks the priority loop if it returns true. The per-memcg zone will
>>>> be marked as "unreclaimable" if the scanning rate is much greater than the
>>>> reclaiming rate on the per-memcg LRU. The bit is cleared when a page charged
>>>> to the memcg is freed. Kswapd breaks the priority loop if
>>>> all the zones are marked as "unreclaimable".
>>>>
>>>> changelog v4..v3:
>>>> 1. split off from the per-memcg background reclaim patch in V3.
>>>>
>>>> Signed-off-by: Ying Han <yinghan@google.com>
>>>> ---
>>>>  include/linux/memcontrol.h |   30 ++++++++++++++
>>>>  include/linux/swap.h       |    2 +
>>>>  mm/memcontrol.c            |   96 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  mm/vmscan.c                |   19 +++++++++
>>>>  4 files changed, 147 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>>> index d4ff7f2..a8159f5 100644
>>>> --- a/include/linux/memcontrol.h
>>>> +++ b/include/linux/memcontrol.h
>>>> @@ -155,6 +155,12 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>>>                                               gfp_t gfp_mask);
>>>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>>> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
>>>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
>>>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>>>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>>>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>>>> +                             unsigned long nr_scanned);
>>>>
>>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
>>>> @@ -345,6 +351,25 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>>>  {
>>>>  }
>>>>
>>>> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
>>>> +                                             struct zone *zone,
>>>> +                                             unsigned long nr_scanned)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem,
>>>> +                                                     struct page *page)
>>>> +{
>>>> +}
>>>> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
>>>> +             struct zone *zone)
>>>> +{
>>>> +}
>>>> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
>>>> +                                             struct zone *zone)
>>>> +{
>>>> +}
>>>> +
>>>>  static inline
>>>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>>>                                           gfp_t gfp_mask)
>>>> @@ -363,6 +388,11 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
>>>>  {
>>>>  }
>>>>
>>>> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
>>>> +                                                             int zid)
>>>> +{
>>>> +     return false;
>>>> +}
>>>>  #endif /* CONFIG_CGROUP_MEM_CONT */
>>>>
>>>>  #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 17e0511..319b800 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -160,6 +160,8 @@ enum {
>>>>       SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
>>>>  };
>>>>
>>>> +#define ZONE_RECLAIMABLE_RATE 6
>>>> +
>>>>  #define SWAP_CLUSTER_MAX 32
>>>>  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index e22351a..da6a130 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
>>>>       bool                    on_tree;
>>>>       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
>>>>                                               /* use container_of        */
>>>> +     unsigned long           pages_scanned;  /* since last reclaim */
>>>> +     bool                    all_unreclaimable;      /* All pages pinned */
>>>>  };
>>>> +
>>>>  /* Macro for accessing counter */
>>>>  #define MEM_CGROUP_ZSTAT(mz, idx)    ((mz)->count[(idx)])
>>>>
>>>> @@ -1135,6 +1138,96 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>>>>       return &mz->reclaim_stat;
>>>>  }
>>>>
>>>> +static unsigned long mem_cgroup_zone_reclaimable_pages(
>>>> +                                     struct mem_cgroup_per_zone *mz)
>>>> +{
>>>> +     int nr;
>>>> +     nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
>>>> +             MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
>>>> +
>>>> +     if (nr_swap_pages > 0)
>>>> +             nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
>>>> +                     MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
>>>> +
>>>> +     return nr;
>>>> +}
>>>> +
>>>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>>>> +                                             unsigned long nr_scanned)
>>>> +{
>>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>>> +     int nid = zone_to_nid(zone);
>>>> +     int zid = zone_idx(zone);
>>>> +
>>>> +     if (!mem)
>>>> +             return;
>>>> +
>>>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>>>> +     if (mz)
>>>> +             mz->pages_scanned += nr_scanned;
>>>> +}
>>>> +
>>>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
>>>> +{
>>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>>> +
>>>> +     if (!mem)
>>>> +             return 0;
>>>> +
>>>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>>>> +     if (mz)
>>>> +             return mz->pages_scanned <
>>>> +                             mem_cgroup_zone_reclaimable_pages(mz) *
>>>> +                             ZONE_RECLAIMABLE_RATE;
>>>> +     return 0;
>>>> +}
>>>> +
>>>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>>>> +{
>>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>>> +     int nid = zone_to_nid(zone);
>>>> +     int zid = zone_idx(zone);
>>>> +
>>>> +     if (!mem)
>>>> +             return false;
>>>> +
>>>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>>>> +     if (mz)
>>>> +             return mz->all_unreclaimable;
>>>> +
>>>> +     return false;
>>>> +}
>>>> +
>>>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>>>> +{
>>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>>> +     int nid = zone_to_nid(zone);
>>>> +     int zid = zone_idx(zone);
>>>> +
>>>> +     if (!mem)
>>>> +             return;
>>>> +
>>>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>>>> +     if (mz)
>>>> +             mz->all_unreclaimable = true;
>>>> +}
>>>> +
>>>> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
>>>> +{
>>>> +     struct mem_cgroup_per_zone *mz = NULL;
>>>> +
>>>> +     if (!mem)
>>>> +             return;
>>>> +
>>>> +     mz = page_cgroup_zoneinfo(mem, page);
>>>> +     if (mz) {
>>>> +             mz->pages_scanned = 0;
>>>> +             mz->all_unreclaimable = false;
>>>> +     }
>>>> +
>>>> +     return;
>>>> +}
>>>> +
>>>>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>>>                                       struct list_head *dst,
>>>>                                       unsigned long *scanned, int order,
>>>> @@ -2801,6 +2894,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>>>        * special functions.
>>>>        */
>>>>
>>>> +     mem_cgroup_clear_unreclaimable(mem, page);
>>>
>>> Hmm, this will easily cause cache ping-pong. (free_page() clears the zone's
>>> all_unreclaimable flag after taking zone->lock, i.e. in a batched manner.)
>>>
>>> Could you consider a way to make this lower cost?
>>>
>>> One way is to use memcg_check_event() with some low event trigger.
>>> A second way is to use memcg_batch.
>>> In many cases, we can expect a chunk of freed pages to come from the same zone.
>>> Then, add a new member to struct memcg_batch_info:
>>>
>>> struct memcg_batch_info {
>>>        .....
>>>        struct zone *zone;      # a zone page is last uncharged.
>>>        ...
>>> }
>>>
>>> Then,
>>> ==
>>> static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
>>>                                   unsigned int nr_pages,
>>> +                                  struct page *page,
>>>                                   const enum charge_type ctype)
>>> {
>>>        struct memcg_batch_info *batch = NULL;
>>> .....
>>>
>>>        if (batch->zone != page_zone(page)) {
>>>                mem_cgroup_clear_unreclaimable(mem, page);
>>>        }
>>> direct_uncharge:
>>>        mem_cgroup_clear_unreclaimable(mem, page);
>>> ....
>>> }
>>> ==
>>>
>>> This will reduce overhead dramatically.
>>>
>>
>> Excuse me, but I don't quite understand this part. IMHO this is to
>> avoid calling mem_cgroup_clear_unreclaimable() for each single page
>> during a munmap()/free_pages() that frees many pages, which is
>> unnecessary because the zone turns 'reclaimable' again at the first
>> page uncharged.
>> Then why can't we just say,
>>   if (mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page))->all_unreclaimable) {
>>            mem_cgroup_clear_unreclaimable(mem, page);
>>    }
>
> Are you suggesting we replace the batching with the code above?
Err... never mind, I got it: it was designed to avoid touching
mem_cgroup_per_zone and its flag altogether. Sorry for the noise :)
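
To spell that out: a guard that first reads mz->all_unreclaimable, as in the
snippet quoted above, still does the per-page zoneinfo lookup and read of
mem_cgroup_per_zone for every uncharged page, while the batched zone
comparison touches only the zone pointer cached in the uncharge batch. A
sketch of that read-first guard for contrast (illustrative only; the function
name is made up, the helpers are the ones quoted in this thread and the code
is assumed to sit in mm/memcontrol.c next to them):

==
/*
 * Illustrative per-page guard (not from the patch set): skips the store
 * when the flag is already clear, but mem_cgroup_per_zone is still
 * looked up and read on every uncharge.
 */
static void memcg_clear_unreclaimable_if_set(struct mem_cgroup *mem,
                                             struct page *page)
{
        struct mem_cgroup_per_zone *mz =
                mem_cgroup_zoneinfo(mem, page_to_nid(page),
                                    page_zonenum(page));

        if (mz && mz->all_unreclaimable)
                mem_cgroup_clear_unreclaimable(mem, page);
}
==

The batched zone check skips even that lookup while successive frees stay
within one zone, which is exactly the saving noted above.
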

--
Thanks
Zhu Yanhai

>
> --Ying
>> --
>> Thanks,
>> Zhu Yanhai
>>
>>
>>>
>>>
>>>>       unlock_page_cgroup(pc);
>>>>       /*
>>>>        * even after unlock, we have mem->res.usage here and this memcg
>>>> @@ -4569,6 +4663,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>>>>               mz->usage_in_excess = 0;
>>>>               mz->on_tree = false;
>>>>               mz->mem = mem;
>>>> +             mz->pages_scanned = 0;
>>>> +             mz->all_unreclaimable = false;
>>>>       }
>>>>       return 0;
>>>>  }
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index b8345d2..c081112 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>>>>                                       ISOLATE_BOTH : ISOLATE_INACTIVE,
>>>>                       zone, sc->mem_cgroup,
>>>>                       0, file);
>>>> +
>>>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
>>>> +
>>>>               /*
>>>>                * mem_cgroup_isolate_pages() keeps track of
>>>>                * scanned pages on its own.
>>>> @@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>>>>                * mem_cgroup_isolate_pages() keeps track of
>>>>                * scanned pages on its own.
>>>>                */
>>>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>>>>       }
>>>>
>>>>       reclaim_stat->recent_scanned[file] += nr_taken;
>>>> @@ -2648,6 +2652,7 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
>>>>       unsigned long total_scanned = 0;
>>>>       struct mem_cgroup *mem_cont = sc->mem_cgroup;
>>>>       int priority = sc->priority;
>>>> +     int nid = pgdat->node_id;
>>>>
>>>>       /*
>>>>        * Now scan the zone in the dma->highmem direction, and we scan
>>>> @@ -2664,10 +2669,20 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
>>>>               if (!populated_zone(zone))
>>>>                       continue;
>>>>
>>>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
>>>> +                     priority != DEF_PRIORITY)
>>>> +                     continue;
>>>> +
>>>>               sc->nr_scanned = 0;
>>>>               shrink_zone(priority, zone, sc);
>>>>               total_scanned += sc->nr_scanned;
>>>>
>>>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
>>>> +                     continue;
>>>> +
>>>> +             if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
>>>> +                     mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
>>>> +
>>>>               /*
>>>>                * If we've done a decent amount of scanning and
>>>>                * the reclaim ratio is low, start doing writepage
>>>> @@ -2752,6 +2767,10 @@ loop_again:
>>>>
>>>>                               if (!populated_zone(zone))
>>>>                                       continue;
>>>> +
>>>> +                             if (!mem_cgroup_mz_unreclaimable(mem_cont,
>>>> +                                                             zone))
>>>> +
>>>
>>> Ah, okay. this will work.
>>>
>>> Thanks,
>>> -Kame
>>>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2012-03-22  1:13 UTC | newest]

Thread overview: 43+ messages
2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup background reclaim Ying Han
2011-04-14 22:54 ` [PATCH V4 01/10] Add kswapd descriptor Ying Han
2011-04-15  0:04   ` KAMEZAWA Hiroyuki
2011-04-15  3:35     ` Ying Han
2011-04-15  4:16       ` KAMEZAWA Hiroyuki
2011-04-15 21:46         ` Ying Han
2011-04-14 22:54 ` [PATCH V4 02/10] Add per memcg reclaim watermarks Ying Han
2011-04-15  0:16   ` KAMEZAWA Hiroyuki
2011-04-15  3:45     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 03/10] New APIs to adjust per-memcg wmarks Ying Han
2011-04-15  0:25   ` KAMEZAWA Hiroyuki
2011-04-15  4:00     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 04/10] Infrastructure to support per-memcg reclaim Ying Han
2011-04-15  0:34   ` KAMEZAWA Hiroyuki
2011-04-15  4:04     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 05/10] Implement the select_victim_node within memcg Ying Han
2011-04-15  0:40   ` KAMEZAWA Hiroyuki
2011-04-15  4:36     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 06/10] Per-memcg background reclaim Ying Han
2011-04-15  1:11   ` KAMEZAWA Hiroyuki
2011-04-15  6:08     ` Ying Han
2011-04-15  8:14       ` KAMEZAWA Hiroyuki
2011-04-15 18:00         ` Ying Han
2011-04-15  6:26     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 07/10] Add per-memcg zone "unreclaimable" Ying Han
2011-04-15  1:32   ` KAMEZAWA Hiroyuki
2012-03-19  8:27     ` Zhu Yanhai
2012-03-20  5:45       ` Ying Han
2012-03-22  1:13         ` Zhu Yanhai
2011-04-14 22:54 ` [PATCH V4 08/10] Enable per-memcg background reclaim Ying Han
2011-04-15  1:34   ` KAMEZAWA Hiroyuki
2011-04-14 22:54 ` [PATCH V4 09/10] Add API to export per-memcg kswapd pid Ying Han
2011-04-15  1:40   ` KAMEZAWA Hiroyuki
2011-04-15  4:47     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 10/10] Add some per-memcg stats Ying Han
2011-04-15  9:40 ` [PATCH V4 00/10] memcg: per cgroup background reclaim Michal Hocko
2011-04-15 16:40   ` Ying Han
2011-04-18  9:13     ` Michal Hocko
2011-04-18 17:01       ` Ying Han
2011-04-18 18:42         ` Michal Hocko
2011-04-18 22:27           ` Ying Han
2011-04-19  2:48             ` Zhu Yanhai
2011-04-19  3:46               ` Ying Han
