* [PATCH V5 00/10] memcg: per cgroup background reclaim
@ 2011-04-15 23:23 Ying Han
  2011-04-15 23:23 ` [PATCH V5 01/10] Add kswapd descriptor Ying Han
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

The current memcg implementation supports target reclaim: when a cgroup reaches
its hard_limit, we do direct reclaim within that cgroup. Per-cgroup background
reclaim is needed to spread the memory pressure out over a longer period of time
and smooth out the cgroup's performance.

If the cgroup is configured to use per cgroup background reclaim, a kswapd
thread is created which scans only the per-memcg LRU list. Two watermarks
("high_wmark", "low_wmark") are added to trigger and stop the background
reclaim. The watermarks are calculated from the cgroup's limit_in_bytes.
By default, the per-memcg kswapd threads run under the root cgroup. A per-memcg
API exports the pid of each kswapd thread, so userspace can configure the cpu
cgroup separately.

I ran a dd test on a large file and then cat the file, and compared the
reclaim-related stats in memory.stat.

Step 1: Create a cgroup with a 500M memory limit.
$ mkdir /dev/cgroup/memory/A
$ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/A/tasks

Step 2: Check and set the wmarks.
$ cat /dev/cgroup/memory/A/memory.low_wmark_distance
0
$ cat /dev/cgroup/memory/A/memory.high_wmark_distance
0

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 524288000
high_wmark 524288000

$ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
$ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance

$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 482344960
high_wmark 471859200
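
For reference, these values follow directly from the 500M limit and the
distances set above:

  low_wmark  = 524288000 - (40 * 1024 * 1024) = 482344960
  high_wmark = 524288000 - (50 * 1024 * 1024) = 471859200

Reclaim is triggered once the usage rises above the low_wmark and stops once
the usage falls below the high_wmark, which is why the high_wmark ends up
numerically lower than the low_wmark.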

$ ps -ef | grep memcg
root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg

$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_3 18126

Step 3: Dirty the pages by creating a 20g file on the hard drive.
$ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1

Here is memory.stat with and without per-memcg background reclaim. Previously
all the pages were reclaimed via direct reclaim; now some of the pages are also
reclaimed in the background.

Only direct reclaim                      With background reclaim:

pgpgin 5243222                           pgpgin 5243267
pgpgout 5115252                          pgpgout 5127978
kswapd_steal 0                           kswapd_steal 2699807
pg_pgsteal 5115229                       pg_pgsteal 2428102
kswapd_pgscan 0                          kswapd_pgscan 10527319
pg_scan 5918875                          pg_scan 15533740
pgrefill 264761                          pgrefill 294801
pgoutrun 0                               pgoutrun 81097
allocstall 158406                        allocstall 73799

real   4m55.684s                         real    5m1.123s
user   0m1.227s                          user    0m1.205s
sys    1m7.793s                          sys     1m6.647s

throughput is 67.37 MB/sec               throughput is 68.04 MB/sec

Step 4: Cleanup
$ echo $$ >/dev/cgroup/memory/tasks
$ echo 1 > /dev/cgroup/memory/A/memory.force_empty
$ rmdir /dev/cgroup/memory/A
$ echo 3 >/proc/sys/vm/drop_caches

Step 5: Create the same cgroup and read the 20g file into pagecache.
$ cat /export/hdc3/dd/tf0 > /dev/zero

With per-cgroup background reclaim, all the pages are reclaimed in the
background instead of via direct reclaim.

Only direct reclaim                       With background reclaim:

pgpgin 5242937                            pgpgin 5242935
pgpgout 5114971                           pgpgout 5125504
kswapd_steal 0                            kswapd_steal 5125470
pg_pgsteal 5114941                        pg_pgsteal 0
kswapd_pgscan 0                           kswapd_pgscan 5125472
pg_scan 5114944                           pg_scan 0
pgrefill 0                                pgrefill 0
pgoutrun 0                                pgoutrun 160184
allocstall 159840                         allocstall 0

real    4m20.649s                         real    4m20.632s
user    0m0.193s                          user    0m0.280s
sys     0m32.266s                         sys     0m24.580s

Note:
This is the first effort to extend target reclaim into memcg. Here are the
known issues and our plan:

1. There is one kswapd thread per cgroup. The thread is created when the cgroup
changes its limit_in_bytes and is deleted when the cgroup is removed. In some
environments, where thousands of cgroups are configured on a single host, we
will have thousands of kswapd threads. The memory consumption would be roughly
8k * 1000 = 8M. We don't see a big issue for now if the host can support that
many cgroups.

2. Regarding the alternative of using a workqueue: it is more complicated and
we need to be very careful with work items in the workqueue. We have seen cases
where one work item gets stuck and the rest of the work items cannot proceed.
For example, in dirty page writeback, one heavy-writer cgroup could starve the
other cgroups from flushing dirty pages to the same disk. In the kswapd case, I
can imagine a similar scenario. How to prioritize the work items is another
problem: the order in which work items are added to the queue dictates the
order in which cgroups are reclaimed. We don't have that restriction currently,
but instead rely on the cpu scheduler to put kswapd on the right cpu core to
run. We might introduce priorities for reclaim later, and it is not clear how a
workqueue would deal with that.

3. There is potential lock contention between per-cgroup kswapds, and the worst
case depends on the number of cpu cores on the system. Basically, we are now
sharing the zone->lru_lock between the per-memcg LRU and the global LRU. I have
a plan to get rid of the global LRU eventually, which requires enhancing the
existing target reclaim (this patchset is part of that effort). I would like to
get there, at which point the lock contention problem will be solved naturally.

4. There is no hierarchical reclaim support in this patchset. I would like to
get to that after the basic pieces are accepted.

5. By default, the per-memcg kswapd thread runs under the root cgroup. If there
is a need to put the kswapd thread into a cpu cgroup, userspace can make that
change by reading the pid from the new API and echoing it into the cpu cgroup's
tasks file. On a non-preemptible kernel, we need to be careful about priority
inversion when restricting kswapd's cpu time while it holds a mutex.
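
A minimal sketch of that workflow, assuming the cpu controller is mounted at
/dev/cgroup/cpu and a cgroup named "kswapd_bg" has already been created there
(both the mount point and the cgroup name are only illustrative):

$ pid=$(awk '{print $2}' /dev/cgroup/memory/A/memory.kswapd_pid)
$ echo $pid > /dev/cgroup/cpu/kswapd_bg/tasks   # move the memcg kswapd thread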


Ying Han (10):
  Add kswapd descriptor
  Add per memcg reclaim watermarks
  New APIs to adjust per-memcg wmarks
  Infrastructure to support per-memcg reclaim.
  Implement the select_victim_node within memcg.
  Per-memcg background reclaim.
  Add per-memcg zone "unreclaimable"
  Enable per-memcg background reclaim.
  Add API to export per-memcg kswapd pid.
  Add some per-memcg stats

 Documentation/cgroups/memory.txt |   14 +
 include/linux/memcontrol.h       |  100 ++++++++
 include/linux/mmzone.h           |    3 +-
 include/linux/res_counter.h      |   78 ++++++
 include/linux/sched.h            |    1 +
 include/linux/swap.h             |   16 +-
 kernel/res_counter.c             |    6 +
 mm/memcontrol.c                  |  483 +++++++++++++++++++++++++++++++++++++-
 mm/memory_hotplug.c              |    4 +-
 mm/page_alloc.c                  |    1 -
 mm/vmscan.c                      |  429 ++++++++++++++++++++++++++++-----
 11 files changed, 1062 insertions(+), 73 deletions(-)

-- 
1.7.3.1


* [PATCH V5 01/10] Add kswapd descriptor
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-18  0:57   ` Minchan Kim
  2011-04-15 23:23 ` [PATCH V5 02/10] Add per memcg reclaim watermarks Ying Han
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

There is a kswapd kernel thread for each NUMA node. We will add a separate
kswapd for each memcg. Each kswapd sleeps on the wait queue headed by the
kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
information about the node or memcg and allows global and per-memcg background
reclaim to share common reclaim algorithms.

This patch adds the kswapd descriptor and converts the per-node kswapd to use
the new structure.

changelog v5..v4:
1. add comment on kswapds_spinlock
2. remove the kswapds_spinlock; we don't need it here since kswapd and pgdat
have a 1:1 mapping.

changelog v3..v2:
1. move the struct mem_cgroup *kswapd_mem in the kswapd struct to a later patch.
2. rename thr in kswapd_run to something else.

changelog v2..v1:
1. dynamically allocate the kswapd descriptor and initialize the
wait_queue_head of pgdat at kswapd_run.
2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup kswapd
descriptor.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/mmzone.h |    3 +-
 include/linux/swap.h   |    7 ++++
 mm/page_alloc.c        |    1 -
 mm/vmscan.c            |   89 +++++++++++++++++++++++++++++++++++------------
 4 files changed, 74 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 628f07b..6cba7d2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -640,8 +640,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	wait_queue_head_t kswapd_wait;
-	struct task_struct *kswapd;
+	wait_queue_head_t *kswapd_wait;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 } pg_data_t;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed6ebe6..f43d406 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
 	return current->flags & PF_KSWAPD;
 }
 
+struct kswapd {
+	struct task_struct *kswapd_task;
+	wait_queue_head_t kswapd_wait;
+	pg_data_t *kswapd_pgdat;
+};
+
+int kswapd(void *p);
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e1b52a..6340865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
-	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..61fb96e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2242,12 +2242,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 }
 
 /* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
-					int classzone_idx)
+static int sleeping_prematurely(struct kswapd *kswapd, int order,
+				long remaining, int classzone_idx)
 {
 	int i;
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
+	pg_data_t *pgdat = kswapd->kswapd_pgdat;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
@@ -2570,28 +2571,31 @@ out:
 	return order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
+				int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 
 	if (freezing(current) || kthread_should_stop())
 		return;
 
-	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
-		finish_wait(&pgdat->kswapd_wait, &wait);
-		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+		finish_wait(wait_h, &wait);
+		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
 	}
 
 	/*
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -2611,7 +2615,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		else
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
-	finish_wait(&pgdat->kswapd_wait, &wait);
+	finish_wait(wait_h, &wait);
 }
 
 /*
@@ -2627,20 +2631,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
  * If there are applications that are active memory-allocators
  * (most normal use), this basically shouldn't matter.
  */
-static int kswapd(void *p)
+int kswapd(void *p)
 {
 	unsigned long order;
 	int classzone_idx;
-	pg_data_t *pgdat = (pg_data_t*)p;
+	struct kswapd *kswapd_p = (struct kswapd *)p;
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
 	};
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	const struct cpumask *cpumask;
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
+	BUG_ON(pgdat->kswapd_wait != wait_h);
+	cpumask = cpumask_of_node(pgdat->node_id);
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 	current->reclaim_state = &reclaim_state;
@@ -2679,7 +2687,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx);
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
@@ -2719,13 +2727,13 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 		pgdat->kswapd_max_order = order;
 		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
 	}
-	if (!waitqueue_active(&pgdat->kswapd_wait))
+	if (!waitqueue_active(pgdat->kswapd_wait))
 		return;
 	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	wake_up_interruptible(&pgdat->kswapd_wait);
+	wake_up_interruptible(pgdat->kswapd_wait);
 }
 
 /*
@@ -2817,12 +2825,21 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 		for_each_node_state(nid, N_HIGH_MEMORY) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 			const struct cpumask *mask;
+			struct kswapd *kswapd_p;
+			struct task_struct *kswapd_thr;
+			wait_queue_head_t *wait;
 
 			mask = cpumask_of_node(pgdat->node_id);
 
+			wait = pgdat->kswapd_wait;
+			kswapd_p = container_of(wait, struct kswapd,
+						kswapd_wait);
+			kswapd_thr = kswapd_p->kswapd_task;
+
 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
 				/* One of our CPUs online: restore mask */
-				set_cpus_allowed_ptr(pgdat->kswapd, mask);
+				if (kswapd_thr)
+					set_cpus_allowed_ptr(kswapd_thr, mask);
 		}
 	}
 	return NOTIFY_OK;
@@ -2835,18 +2852,31 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 int kswapd_run(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
+	struct task_struct *kswapd_thr;
+	struct kswapd *kswapd_p;
 	int ret = 0;
 
-	if (pgdat->kswapd)
+	if (pgdat->kswapd_wait)
 		return 0;
 
-	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
-	if (IS_ERR(pgdat->kswapd)) {
+	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
+	if (!kswapd_p)
+		return -ENOMEM;
+
+	init_waitqueue_head(&kswapd_p->kswapd_wait);
+	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_pgdat = pgdat;
+
+	kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (IS_ERR(kswapd_thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
 		printk("Failed to start kswapd on node %d\n",nid);
+		pgdat->kswapd_wait = NULL;
+		kfree(kswapd_p);
 		ret = -1;
-	}
+	} else
+		kswapd_p->kswapd_task = kswapd_thr;
 	return ret;
 }
 
@@ -2855,10 +2885,23 @@ int kswapd_run(int nid)
  */
 void kswapd_stop(int nid)
 {
-	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+	struct task_struct *kswapd_thr = NULL;
+	struct kswapd *kswapd_p = NULL;
+	wait_queue_head_t *wait;
+
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	wait = pgdat->kswapd_wait;
+	if (wait) {
+		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+		kswapd_thr = kswapd_p->kswapd_task;
+		kswapd_p->kswapd_task = NULL;
+	}
+
+	if (kswapd_thr)
+		kthread_stop(kswapd_thr);
 
-	if (kswapd)
-		kthread_stop(kswapd);
+	kfree(kswapd_p);
 }
 
 static int __init kswapd_init(void)
-- 
1.7.3.1


* [PATCH V5 02/10] Add per memcg reclaim watermarks
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
  2011-04-15 23:23 ` [PATCH V5 01/10] Add kswapd descriptor Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-15 23:23 ` [PATCH V5 03/10] New APIs to adjust per-memcg wmarks Ying Han
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Two watermarks, "high_wmark" and "low_wmark", are added per-memcg. The
per-memcg kswapd is woken up when the memcg's memory usage (usage_in_bytes)
rises above the low_wmark, and the kswapd thread then reclaims pages until the
usage drops below the high_wmark.

Each watermark is calculated from the memcg's hard_limit (limit_in_bytes).
Each time the hard_limit is changed, the corresponding wmarks are recalculated.
Since the memory controller charges only user pages, there is no need for a
"min_wmark". The wmarks are currently derived from the individual tunables
low/high_wmark_distance, which default to 0.
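
In other words (see setup_per_memcg_wmarks() in the patch below):

  low_wmark  = limit_in_bytes - low_wmark_distance
  high_wmark = limit_in_bytes - high_wmark_distance

When high_wmark_distance is 0 (the default), both wmarks are simply set to the
limit.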

changelog v5..v4:
1. rename res_counter_low_wmark_limit_locked().
2. rename res_counter_high_wmark_limit_locked().

changelog v4..v3:
1. remove legacy comments
2. rename the res_counter_check_under_high_wmark_limit
3. replace the per-memcg wmark_ratio with individual tunables for both wmarks.
4. add comments on low/high_wmark
5. add individual tunables for low/high_wmarks and remove wmark_ratio
6. replace the mem_cgroup_get_limit() call with res_counter_read_u64(); the
former returns a large value with swap on.

changelog v3..v2:
1. Add VM_BUG_ON() in a couple of places.
2. Remove the spinlock on min_free_kbytes, since the only consequence is
reading stale data.
3. Remove the "min_free_kbytes" API and replace it with wmark_ratio based on
the hard_limit.

changelog v2..v1:
1. Remove the res_counter_charge on wmark due to performance concerns.
2. Move the new APIs min_free_kbytes and reclaim_wmarks into a separate commit.
3. Calculate min_free_kbytes automatically based on limit_in_bytes.
4. Make the wmark consistent with the core VM, which checks free pages instead
of usage.
5. Change the wmark to be boolean.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h  |    1 +
 include/linux/res_counter.h |   78 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    6 +++
 mm/memcontrol.c             |   48 ++++++++++++++++++++++++++
 4 files changed, 133 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..3ece36d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -82,6 +82,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..669f199 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -39,6 +39,14 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit that reclaim triggers.
+	 */
+	unsigned long long low_wmark_limit;
+	/*
+	 * the limit that reclaim stops.
+	 */
+	unsigned long long high_wmark_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -55,6 +63,9 @@ struct res_counter {
 
 #define RESOURCE_MAX (unsigned long long)LLONG_MAX
 
+#define CHARGE_WMARK_LOW	0x01
+#define CHARGE_WMARK_HIGH	0x02
+
 /**
  * Helpers to interact with userspace
  * res_counter_read_u64() - returns the value of the specified member.
@@ -92,6 +103,8 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_WMARK_LIMIT,
+	RES_HIGH_WMARK_LIMIT
 };
 
 /*
@@ -147,6 +160,24 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
 	return margin;
 }
 
+static inline bool
+res_counter_under_high_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->high_wmark_limit)
+		return true;
+
+	return false;
+}
+
+static inline bool
+res_counter_under_low_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->low_wmark_limit)
+		return true;
+
+	return false;
+}
+
 /**
  * Get the difference between the usage and the soft limit
  * @cnt: The counter
@@ -169,6 +200,30 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+static inline bool
+res_counter_under_low_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_under_low_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool
+res_counter_under_high_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_under_high_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -214,4 +269,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_high_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
+static inline int
+res_counter_set_low_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..206a724 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
+	counter->low_wmark_limit = RESOURCE_MAX;
+	counter->high_wmark_limit = RESOURCE_MAX;
 	counter->parent = parent;
 }
 
@@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_WMARK_LIMIT:
+		return &counter->low_wmark_limit;
+	case RES_HIGH_WMARK_LIMIT:
+		return &counter->high_wmark_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4407dd0..1ec4014 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -272,6 +272,12 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	/*
+	 * used to calculate the low/high_wmarks based on the limit_in_bytes.
+	 */
+	u64 high_wmark_distance;
+	u64 low_wmark_distance;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -813,6 +819,25 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	u64 limit;
+
+	limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+	if (mem->high_wmark_distance == 0) {
+		res_counter_set_low_wmark_limit(&mem->res, limit);
+		res_counter_set_high_wmark_limit(&mem->res, limit);
+	} else {
+		u64 low_wmark, high_wmark;
+
+		low_wmark = limit - mem->low_wmark_distance;
+		high_wmark = limit - mem->high_wmark_distance;
+
+		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
+		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+	}
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -3205,6 +3230,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3264,6 +3290,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -4521,6 +4548,27 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+/*
+ * We use low_wmark and high_wmark for triggering per-memcg kswapd.
+ * The reclaim is triggered by low_wmark (usage > low_wmark) and stopped
+ * by high_wmark (usage < high_wmark).
+ */
+int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
+				int charge_flags)
+{
+	long ret = 0;
+	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
+
+	VM_BUG_ON((charge_flags & flags) == flags);
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		ret = res_counter_under_low_wmark_limit(&mem->res);
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		ret = res_counter_under_high_wmark_limit(&mem->res);
+
+	return ret;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
-- 
1.7.3.1


* [PATCH V5 03/10] New APIs to adjust per-memcg wmarks
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
  2011-04-15 23:23 ` [PATCH V5 01/10] Add kswapd descriptor Ying Han
  2011-04-15 23:23 ` [PATCH V5 02/10] Add per memcg reclaim watermarks Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-15 23:23 ` [PATCH V5 04/10] Infrastructure to support per-memcg reclaim Ying Han
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Add the memory.low_wmark_distance, memory.high_wmark_distance and
memory.reclaim_wmarks APIs per-memcg. The first two adjust the internal
low/high wmark calculation, and reclaim_wmarks exports the current watermark
values.

The low/high_wmark is calculated by subtracting the corresponding distance from
the hard_limit (limit_in_bytes); both distances default to 0.

$ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
$ cat /dev/cgroup/A/memory.limit_in_bytes
524288000

$ echo 50m >/dev/cgroup/A/memory.high_wmark_distance
$ echo 40m >/dev/cgroup/A/memory.low_wmark_distance

$ cat /dev/cgroup/A/memory.reclaim_wmarks
low_wmark 482344960
high_wmark 471859200

changelog v5..v4:
1. add a sanity check for setting high/low_wmark_distance on the root cgroup.

changelog v4..v3:
1. replace the "wmark_ratio" API with individual tunable for low/high_wmarks.

changelog v3..v2:
1. replace the "min_free_kbytes" api with "wmark_ratio". This is part of
feedbacks

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |  101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ec4014..76ad009 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3974,6 +3974,78 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 mem_cgroup_high_wmark_distance_read(struct cgroup *cgrp,
+					       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->high_wmark_distance;
+}
+
+static u64 mem_cgroup_low_wmark_distance_read(struct cgroup *cgrp,
+					      struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->low_wmark_distance;
+}
+
+static int mem_cgroup_high_wmark_distance_write(struct cgroup *cont,
+						struct cftype *cft,
+						const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	u64 low_wmark_distance = memcg->low_wmark_distance;
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	if (!cont->parent)
+		return -EINVAL;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if ((val >= limit) || (val < low_wmark_distance) ||
+	   (low_wmark_distance && val == low_wmark_distance))
+		return -EINVAL;
+
+	memcg->high_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
+static int mem_cgroup_low_wmark_distance_write(struct cgroup *cont,
+					       struct cftype *cft,
+					       const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	u64 high_wmark_distance = memcg->high_wmark_distance;
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	if (!cont->parent)
+		return -EINVAL;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if ((val >= limit) || (val > high_wmark_distance) ||
+	    (high_wmark_distance && val == high_wmark_distance))
+		return -EINVAL;
+
+	memcg->low_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4265,6 +4337,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 	mutex_unlock(&memcg_oom_mutex);
 }
 
+static int mem_cgroup_wmark_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	u64 low_wmark, high_wmark;
+
+	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+
+	cb->fill(cb, "low_wmark", low_wmark);
+	cb->fill(cb, "high_wmark", high_wmark);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4368,6 +4455,20 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "high_wmark_distance",
+		.write_string = mem_cgroup_high_wmark_distance_write,
+		.read_u64 = mem_cgroup_high_wmark_distance_read,
+	},
+	{
+		.name = "low_wmark_distance",
+		.write_string = mem_cgroup_low_wmark_distance_write,
+		.read_u64 = mem_cgroup_low_wmark_distance_read,
+	},
+	{
+		.name = "reclaim_wmarks",
+		.read_map = mem_cgroup_wmark_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1


* [PATCH V5 04/10] Infrastructure to support per-memcg reclaim.
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (2 preceding siblings ...)
  2011-04-15 23:23 ` [PATCH V5 03/10] New APIs to adjust per-memcg wmarks Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-18  2:11   ` Minchan Kim
  2011-04-15 23:23 ` [PATCH V5 05/10] Implement the select_victim_node within memcg Ying Han
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

Add a kswapd_mem field to the kswapd descriptor, which links the kswapd kernel
thread to a memcg. The per-memcg kswapd sleeps on the wait queue headed by the
kswapd_wait field of the kswapd descriptor.

The kswapd() function is now shared between the global and per-memcg kswapd. It
is passed the kswapd descriptor, which contains the information for either a
node or a memcg. The new function balance_mem_cgroup_pgdat() is then invoked
for per-memcg kswapd threads; its implementation is in a following patch.

changelog v4..v3:
1. fix up kswapd_run and kswapd_stop for online_pages() and offline_pages().
2. drop the PF_MEMALLOC flag for the memcg kswapd for now, per KAMEZAWA's request.

changelog v3..v2:
1. split off from the initial patch, which included all the changes of the
following three patches.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    5 ++
 include/linux/swap.h       |    5 +-
 mm/memcontrol.c            |   29 ++++++++
 mm/memory_hotplug.c        |    4 +-
 mm/vmscan.c                |  156 ++++++++++++++++++++++++++++++--------------
 5 files changed, 147 insertions(+), 52 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3ece36d..f7ffd1f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+struct kswapd;
 
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
@@ -83,6 +84,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
+extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
+				  struct kswapd *kswapd_p);
+extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
+extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f43d406..17e0511 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -30,6 +30,7 @@ struct kswapd {
 	struct task_struct *kswapd_task;
 	wait_queue_head_t kswapd_wait;
 	pg_data_t *kswapd_pgdat;
+	struct mem_cgroup *kswapd_mem;
 };
 
 int kswapd(void *p);
@@ -303,8 +304,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
 }
 #endif
 
-extern int kswapd_run(int nid);
-extern void kswapd_stop(int nid);
+extern int kswapd_run(int nid, struct mem_cgroup *mem);
+extern void kswapd_stop(int nid, struct mem_cgroup *mem);
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 76ad009..8761a6f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -278,6 +278,8 @@ struct mem_cgroup {
 	 */
 	u64 high_wmark_distance;
 	u64 low_wmark_distance;
+
+	wait_queue_head_t *kswapd_wait;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -4670,6 +4672,33 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
 	return ret;
 }
 
+int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
+{
+	if (!mem || !kswapd_p)
+		return 0;
+
+	mem->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_mem = mem;
+
+	return css_id(&mem->css);
+}
+
+void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
+{
+	if (mem)
+		mem->kswapd_wait = NULL;
+
+	return;
+}
+
+wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return NULL;
+
+	return mem->kswapd_wait;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 321fc74..2f78ff6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -462,7 +462,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages)
 	setup_per_zone_wmarks();
 	calculate_zone_inactive_ratio(zone);
 	if (onlined_pages) {
-		kswapd_run(zone_to_nid(zone));
+		kswapd_run(zone_to_nid(zone), NULL);
 		node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
 	}
 
@@ -897,7 +897,7 @@ repeat:
 	calculate_zone_inactive_ratio(zone);
 	if (!node_present_pages(node)) {
 		node_clear_state(node, N_HIGH_MEMORY);
-		kswapd_stop(node);
+		kswapd_stop(node, NULL);
 	}
 
 	vm_total_pages = nr_free_pagecache_pages();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 61fb96e..06036d2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2241,6 +2241,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages > (present_pages >> 2);
 }
 
+#define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
+
 /* is kswapd sleeping prematurely? */
 static int sleeping_prematurely(struct kswapd *kswapd, int order,
 				long remaining, int classzone_idx)
@@ -2249,11 +2251,16 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
 	pg_data_t *pgdat = kswapd->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd->kswapd_mem;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
 
+	/* Doesn't support for per-memcg reclaim */
+	if (mem)
+		return false;
+
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
@@ -2596,19 +2603,25 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	 * go fully to sleep until explicitly woken up.
 	 */
 	if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
-		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+		if (is_node_kswapd(kswapd_p)) {
+			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
-		/*
-		 * vmstat counters are not perfectly accurate and the estimated
-		 * value for counters such as NR_FREE_PAGES can deviate from the
-		 * true value by nr_online_cpus * threshold. To avoid the zone
-		 * watermarks being breached while under pressure, we reduce the
-		 * per-cpu vmstat threshold while kswapd is awake and restore
-		 * them before going back to sleep.
-		 */
-		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
-		schedule();
-		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+			/*
+			 * vmstat counters are not perfectly accurate and the
+			 * estimated value for counters such as NR_FREE_PAGES
+			 * can deviate from the true value by nr_online_cpus *
+			 * threshold. To avoid the zone watermarks being
+			 * breached while under pressure, we reduce the per-cpu
+			 * vmstat threshold while kswapd is awake and restore
+			 * them before going back to sleep.
+			 */
+			set_pgdat_percpu_threshold(pgdat,
+						   calculate_normal_threshold);
+			schedule();
+			set_pgdat_percpu_threshold(pgdat,
+						calculate_pressure_threshold);
+		} else
+			schedule();
 	} else {
 		if (remaining)
 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2618,6 +2631,12 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	finish_wait(wait_h, &wait);
 }
 
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+							int order)
+{
+	return 0;
+}
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2637,6 +2656,7 @@ int kswapd(void *p)
 	int classzone_idx;
 	struct kswapd *kswapd_p = (struct kswapd *)p;
 	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
 	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 
@@ -2647,10 +2667,12 @@ int kswapd(void *p)
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
-	BUG_ON(pgdat->kswapd_wait != wait_h);
-	cpumask = cpumask_of_node(pgdat->node_id);
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
+	if (is_node_kswapd(kswapd_p)) {
+		BUG_ON(pgdat->kswapd_wait != wait_h);
+		cpumask = cpumask_of_node(pgdat->node_id);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(tsk, cpumask);
+	}
 	current->reclaim_state = &reclaim_state;
 
 	/*
@@ -2665,7 +2687,10 @@ int kswapd(void *p)
 	 * us from recursively trying to free more memory as we're
 	 * trying to free the first piece of memory in the first place).
 	 */
-	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	if (is_node_kswapd(kswapd_p))
+		tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	else
+		tsk->flags |= PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
 	order = 0;
@@ -2675,24 +2700,29 @@ int kswapd(void *p)
 		int new_classzone_idx;
 		int ret;
 
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order = 0;
-		pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		if (order < new_order || classzone_idx > new_classzone_idx) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation or has tigher zone constraints
-			 */
-			order = new_order;
-			classzone_idx = new_classzone_idx;
-		} else {
-			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
-			order = pgdat->kswapd_max_order;
-			classzone_idx = pgdat->classzone_idx;
+		if (is_node_kswapd(kswapd_p)) {
+			new_order = pgdat->kswapd_max_order;
+			new_classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = MAX_NR_ZONES - 1;
-		}
+			if (order < new_order ||
+					classzone_idx > new_classzone_idx) {
+				/*
+				 * Don't sleep if someone wants a larger 'order'
+				 * allocation or has tigher zone constraints
+				 */
+				order = new_order;
+				classzone_idx = new_classzone_idx;
+			} else {
+				kswapd_try_to_sleep(kswapd_p, order,
+						    classzone_idx);
+				order = pgdat->kswapd_max_order;
+				classzone_idx = pgdat->classzone_idx;
+				pgdat->kswapd_max_order = 0;
+				pgdat->classzone_idx = MAX_NR_ZONES - 1;
+			}
+		} else
+			kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -2703,8 +2733,13 @@ int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			order = balance_pgdat(pgdat, order, &classzone_idx);
+			if (is_node_kswapd(kswapd_p)) {
+				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
+								order);
+				order = balance_pgdat(pgdat, order,
+							&classzone_idx);
+			} else
+				balance_mem_cgroup_pgdat(mem, order);
 		}
 	}
 	return 0;
@@ -2849,30 +2884,53 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
  * This kswapd start function will be called by init and node-hot-add.
  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
  */
-int kswapd_run(int nid)
+int kswapd_run(int nid, struct mem_cgroup *mem)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
 	struct task_struct *kswapd_thr;
+	pg_data_t *pgdat = NULL;
 	struct kswapd *kswapd_p;
+	static char name[TASK_COMM_LEN];
+	int memcg_id;
 	int ret = 0;
 
-	if (pgdat->kswapd_wait)
-		return 0;
+	if (!mem) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kswapd_wait)
+			return ret;
+	}
 
 	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
 	if (!kswapd_p)
 		return -ENOMEM;
 
 	init_waitqueue_head(&kswapd_p->kswapd_wait);
-	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
-	kswapd_p->kswapd_pgdat = pgdat;
 
-	kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (!mem) {
+		pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+		kswapd_p->kswapd_pgdat = pgdat;
+		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
+	} else {
+		memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
+		if (!memcg_id) {
+			kfree(kswapd_p);
+			return ret;
+		}
+		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
+	}
+
+	kswapd_thr = kthread_run(kswapd, kswapd_p, name);
 	if (IS_ERR(kswapd_thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
-		printk("Failed to start kswapd on node %d\n",nid);
-		pgdat->kswapd_wait = NULL;
+		if (!mem) {
+			printk(KERN_ERR "Failed to start kswapd on node %d\n",
+								nid);
+			pgdat->kswapd_wait = NULL;
+		} else {
+			printk(KERN_ERR "Failed to start kswapd on memcg %d\n",
+								memcg_id);
+			mem_cgroup_clear_kswapd(mem);
+		}
 		kfree(kswapd_p);
 		ret = -1;
 	} else
@@ -2883,15 +2941,17 @@ int kswapd_run(int nid)
 /*
  * Called by memory hotplug when all memory in a node is offlined.
  */
-void kswapd_stop(int nid)
+void kswapd_stop(int nid, struct mem_cgroup *mem)
 {
 	struct task_struct *kswapd_thr = NULL;
 	struct kswapd *kswapd_p = NULL;
 	wait_queue_head_t *wait;
 
-	pg_data_t *pgdat = NODE_DATA(nid);
+	if (!mem)
+		wait = NODE_DATA(nid)->kswapd_wait;
+	else
+		wait = mem_cgroup_kswapd_wait(mem);
 
-	wait = pgdat->kswapd_wait;
 	if (wait) {
 		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
 		kswapd_thr = kswapd_p->kswapd_task;
@@ -2910,7 +2970,7 @@ static int __init kswapd_init(void)
 
 	swap_setup();
 	for_each_node_state(nid, N_HIGH_MEMORY)
- 		kswapd_run(nid);
+		kswapd_run(nid, NULL);
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
-- 
1.7.3.1


* [PATCH V5 05/10] Implement the select_victim_node within memcg.
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (3 preceding siblings ...)
  2011-04-15 23:23 ` [PATCH V5 04/10] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-18  2:22   ` Minchan Kim
  2011-04-15 23:23 ` [PATCH V5 06/10] Per-memcg background reclaim Ying Han
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This adds the node-selection mechanism for background reclaim: we remember the
last scanned node and always start from the next one each time. This simple
round-robin fashion provides fairness between nodes for each memcg.

changelog v5..v4:
1. initialize the last_scanned_node to MAX_NUMNODES.

changelog v4..v3:
1. split off from the per-memcg background reclaim patch.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    3 +++
 mm/memcontrol.c            |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f7ffd1f..d4ff7f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
 				  struct kswapd *kswapd_p);
 extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
 extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
+extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
+extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
+					const nodemask_t *nodes);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8761a6f..b92dc13 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -279,6 +279,11 @@ struct mem_cgroup {
 	u64 high_wmark_distance;
 	u64 low_wmark_distance;
 
+	/* While doing per cgroup background reclaim, we cache the
+	 * last node we reclaimed from
+	 */
+	int last_scanned_node;
+
 	wait_queue_head_t *kswapd_wait;
 };
 
@@ -1536,6 +1541,27 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
+ * Visit the first node after the last_scanned_node of @mem and use that to
+ * reclaim free pages from.
+ */
+int
+mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
+{
+	int next_nid;
+	int last_scanned;
+
+	last_scanned = mem->last_scanned_node;
+	next_nid = next_node(last_scanned, *nodes);
+
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(*nodes);
+
+	mem->last_scanned_node = next_nid;
+
+	return next_nid;
+}
+
+/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -4699,6 +4725,14 @@ wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
 	return mem->kswapd_wait;
 }
 
+int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return -1;
+
+	return mem->last_scanned_node;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4774,6 +4808,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
 	if (parent)
-- 
1.7.3.1


* [PATCH V5 06/10] Per-memcg background reclaim.
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (4 preceding siblings ...)
  2011-04-15 23:23 ` [PATCH V5 05/10] Implement the select_victim_node within memcg Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-18  3:51   ` Minchan Kim
  2011-04-15 23:23 ` [PATCH V5 07/10] Add per-memcg zone "unreclaimable" Ying Han
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This is the main loop of per-memcg background reclaim, implemented in
balance_mem_cgroup_pgdat().

The function performs a priority loop similar to global reclaim. During each
iteration it invokes balance_pgdat_node() for all nodes on the system, another
new function that performs background reclaim per node. After reclaiming each
node, it checks mem_cgroup_watermark_ok() and breaks out of the priority loop
if it returns true.

changelog v5..v4:
1. remove duplicate check on nodes_empty()
2. add logic to check if the per-memcg lru is empty on the zone.
3. make the per-memcg kswapd reclaim SWAP_CLUSTER_MAX per zone. This makes
sense since it helps to balance the pressure across zones within the memcg.

changelog v4..v3:
1. split select_victim_node and zone_unreclaimable into separate patches
2. remove the logic that tries to do zone balancing.

changelog v3..v2:
1. change mz->all_unreclaimable to be boolean.
2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
3. some more clean-up.

changelog v2..v1:
1. move the per-memcg per-zone clear_unreclaimable into the uncharge stage.
2. share kswapd_run/kswapd_stop between per-memcg and global background
reclaim.
3. name the per-memcg kswapd thread "memcg-id" (css->id); the global kswapd
keeps the same name.
4. fix a race in kswapd_stop where the per-memcg per-zone info could be
accessed after being freed.
5. add fairness in the zonelist: the memcg remembers the last zone reclaimed
from.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/vmscan.c |  157 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 157 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06036d2..39e6300 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -47,6 +47,8 @@
 
 #include <linux/swapops.h>
 
+#include <linux/res_counter.h>
+
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -111,6 +113,8 @@ struct scan_control {
 	 * are scanned.
 	 */
 	nodemask_t	*nodemask;
+
+	int priority;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -2631,11 +2635,164 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
 	finish_wait(wait_h, &wait);
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * The function is used for per-memcg LRU. It scanns all the zones of the
+ * node and returns the nr_scanned and nr_reclaimed.
+ */
+static void balance_pgdat_node(pg_data_t *pgdat, int order,
+					struct scan_control *sc)
+{
+	int i;
+	unsigned long total_scanned = 0;
+	struct mem_cgroup *mem_cont = sc->mem_cgroup;
+	int priority = sc->priority;
+	enum lru_list l;
+
+	/*
+	 * This dma->highmem order is consistant with global reclaim.
+	 * We do this because the page allocator works in the opposite
+	 * direction although memcg user pages are mostly allocated at
+	 * highmem.
+	 */
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+		unsigned long scan = 0;
+
+		for_each_evictable_lru(l)
+			scan += mem_cgroup_zone_nr_pages(mem_cont, zone, l);
+
+		if (!populated_zone(zone) || !scan)
+			continue;
+
+		sc->nr_scanned = 0;
+		shrink_zone(priority, zone, sc);
+		total_scanned += sc->nr_scanned;
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
+			sc->may_writepage = 1;
+		}
+	}
+
+	sc->nr_scanned = total_scanned;
+	return;
+}
+
+/*
+ * Per cgroup background reclaim.
+ * TODO: Take off the order since memcg always do order 0
+ */
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+					      int order)
+{
+	int i, nid;
+	int start_node;
+	int priority;
+	bool wmark_ok;
+	int loop;
+	pg_data_t *pgdat;
+	nodemask_t do_nodes;
+	unsigned long total_scanned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.swappiness = vm_swappiness,
+		.order = order,
+		.mem_cgroup = mem_cont,
+	};
+
+loop_again:
+	do_nodes = NODE_MASK_NONE;
+	sc.may_writepage = !laptop_mode;
+	sc.nr_reclaimed = 0;
+	total_scanned = 0;
+
+	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		sc.priority = priority;
+		wmark_ok = false;
+		loop = 0;
+
+		/* The swap token gets in the way of swapout... */
+		if (!priority)
+			disable_swap_token();
+
+		if (priority == DEF_PRIORITY)
+			do_nodes = node_states[N_ONLINE];
+
+		while (1) {
+			nid = mem_cgroup_select_victim_node(mem_cont,
+							&do_nodes);
+
+			/* Indicate we have cycled the nodelist once
+			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
+			 * kswapd burning cpu cycles.
+			 */
+			if (loop == 0) {
+				start_node = nid;
+				loop++;
+			} else if (nid == start_node)
+				break;
+
+			pgdat = NODE_DATA(nid);
+			balance_pgdat_node(pgdat, order, &sc);
+			total_scanned += sc.nr_scanned;
+
+			/* Set the node which has at least
+			 * one reclaimable zone
+			 */
+			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+				struct zone *zone = pgdat->node_zones + i;
+
+				if (!populated_zone(zone))
+					continue;
+			}
+			if (i < 0)
+				node_clear(nid, do_nodes);
+
+			if (mem_cgroup_watermark_ok(mem_cont,
+							CHARGE_WMARK_HIGH)) {
+				wmark_ok = true;
+				goto out;
+			}
+
+			if (nodes_empty(do_nodes)) {
+				wmark_ok = true;
+				goto out;
+			}
+		}
+
+		if (total_scanned && priority < DEF_PRIORITY - 2)
+			congestion_wait(WRITE, HZ/10);
+
+		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
+			break;
+	}
+out:
+	if (!wmark_ok) {
+		cond_resched();
+
+		try_to_freeze();
+
+		goto loop_again;
+	}
+
+	return sc.nr_reclaimed;
+}
+#else
 static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
 							int order)
 {
 	return 0;
 }
+#endif
 
 /*
  * The background pageout daemon, started as a kernel thread
-- 
1.7.3.1



* [PATCH V5 07/10] Add per-memcg zone "unreclaimable"
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (5 preceding siblings ...)
  2011-04-15 23:23 ` [PATCH V5 06/10] Per-memcg background reclaim Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-18  4:27   ` Minchan Kim
  2011-04-15 23:23 ` [PATCH V5 08/10] Enable per-memcg background reclaim Ying Han
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

After reclaiming each node per memcg, kswapd checks mem_cgroup_watermark_ok()
and breaks the priority loop if it returns true. A per-memcg zone is marked
"unreclaimable" when far more pages have been scanned than are reclaimable on
the per-memcg LRU of that zone. The bit is cleared when a page charged to the
memcg is freed. Kswapd breaks the priority loop once all the zones are marked
"unreclaimable".
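
Concretely, the check added below treats a per-memcg zone as still reclaimable
as long as

	pages_scanned < reclaimable LRU pages * ZONE_RECLAIMABLE_RATE (= 6)

so, to pick a hypothetical number, a zone holding 10,000 reclaimable pages for
this memcg gets flagged all_unreclaimable once more than 60,000 pages have
been scanned there without the counters being reset by an uncharge in that
zone.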

changelog v5..v4:
1. reduce the frequency of updating the mz->all_unreclaimable bit by using the
existing memcg uncharge batch in the task struct.
2. add a new function, mem_cgroup_mz_clear_unreclaimable(), for clearing the
bit on a whole zone.

changelog v4..v3:
1. split off from the per-memcg background reclaim patch in V3.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |   40 ++++++++++++++
 include/linux/sched.h      |    1 +
 include/linux/swap.h       |    2 +
 mm/memcontrol.c            |  130 +++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   19 +++++++
 5 files changed, 191 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d4ff7f2..b18435d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -155,6 +155,14 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
+void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+					struct zone *zone);
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+					unsigned long nr_scanned);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
@@ -345,6 +353,38 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
+								int zid)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
+						struct zone *zone)
+{
+	return false;
+}
+
+static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
+							struct zone *zone)
+{
+}
+
+static inline void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem,
+							struct page *page)
+{
+}
+
+static inline void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+							struct zone *zone)
+{
+}
+static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
+						struct zone *zone,
+						unsigned long nr_scanned)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 98fc7ed..3370c5a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1526,6 +1526,7 @@ struct task_struct {
 		struct mem_cgroup *memcg; /* target memcg of uncharge */
 		unsigned long nr_pages;	/* uncharged usage */
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
+		struct zone *zone; /* zone of the last uncharged page */
 	} memcg_batch;
 #endif
 };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 17e0511..319b800 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -160,6 +160,8 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
+#define ZONE_RECLAIMABLE_RATE 6
+
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b92dc13..0522d59 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -133,7 +133,10 @@ struct mem_cgroup_per_zone {
 	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	unsigned long		pages_scanned;	/* since last reclaim */
+	bool			all_unreclaimable;	/* All pages pinned */
 };
+
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
@@ -1135,6 +1138,115 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
+static unsigned long mem_cgroup_zone_reclaimable_pages(
+					struct mem_cgroup_per_zone *mz)
+{
+	int nr;
+	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
+		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
+			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+
+	return nr;
+}
+
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+						unsigned long nr_scanned)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->pages_scanned += nr_scanned;
+}
+
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->pages_scanned <
+				mem_cgroup_zone_reclaimable_pages(mz) *
+				ZONE_RECLAIMABLE_RATE;
+	return 0;
+}
+
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return false;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->all_unreclaimable;
+
+	return false;
+}
+
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->all_unreclaimable = true;
+}
+
+void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+				       struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return;
+
+	mz = page_cgroup_zoneinfo(mem, page);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -2682,6 +2794,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 
 static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 				   unsigned int nr_pages,
+				   struct page *page,
 				   const enum charge_type ctype)
 {
 	struct memcg_batch_info *batch = NULL;
@@ -2699,6 +2812,10 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 	 */
 	if (!batch->memcg)
 		batch->memcg = mem;
+
+	if (!batch->zone)
+		batch->zone = page_zone(page);
+
 	/*
 	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
 	 * In those cases, all pages freed continously can be expected to be in
@@ -2720,12 +2837,17 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 	 */
 	if (batch->memcg != mem)
 		goto direct_uncharge;
+
+	if (batch->zone != page_zone(page))
+		mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
+
 	/* remember freed charge and uncharge it later */
 	batch->nr_pages++;
 	if (uncharge_memsw)
 		batch->memsw_nr_pages++;
 	return;
 direct_uncharge:
+	mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
 	res_counter_uncharge(&mem->res, nr_pages * PAGE_SIZE);
 	if (uncharge_memsw)
 		res_counter_uncharge(&mem->memsw, nr_pages * PAGE_SIZE);
@@ -2807,7 +2929,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 		mem_cgroup_get(mem);
 	}
 	if (!mem_cgroup_is_root(mem))
-		mem_cgroup_do_uncharge(mem, nr_pages, ctype);
+		mem_cgroup_do_uncharge(mem, nr_pages, page, ctype);
 
 	return mem;
 
@@ -2875,6 +2997,10 @@ void mem_cgroup_uncharge_end(void)
 	if (batch->memsw_nr_pages)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
+	if (batch->zone)
+		mem_cgroup_mz_clear_unreclaimable(batch->memcg, batch->zone);
+	batch->zone = NULL;
+
 	memcg_oom_recover(batch->memcg);
 	/* forget this pointer (for sanity check) */
 	batch->memcg = NULL;
@@ -4570,6 +4696,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
 	}
 	return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 39e6300..0c76bd3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1414,6 +1414,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
+
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
+
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
@@ -1533,6 +1536,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
@@ -2648,6 +2652,7 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
 	struct mem_cgroup *mem_cont = sc->mem_cgroup;
 	int priority = sc->priority;
 	enum lru_list l;
+	int nid = pgdat->node_id;
 
 	/*
 	 * This dma->highmem order is consistent with global reclaim.
@@ -2665,10 +2670,20 @@ static void balance_pgdat_node(pg_data_t *pgdat, int order,
 		if (!populated_zone(zone) || !scan)
 			continue;
 
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+			priority != DEF_PRIORITY)
+			continue;
+
 		sc->nr_scanned = 0;
 		shrink_zone(priority, zone, sc);
 		total_scanned += sc->nr_scanned;
 
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
+			continue;
+
+		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
+			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
+
 		/*
 		 * If we've done a decent amount of scanning and
 		 * the reclaim ratio is low, start doing writepage
@@ -2753,6 +2768,10 @@ loop_again:
 
 				if (!populated_zone(zone))
 					continue;
+
+				if (!mem_cgroup_mz_unreclaimable(mem_cont,
+								zone))
+					break;
 			}
 			if (i < 0)
 				node_clear(nid, do_nodes);
-- 
1.7.3.1



* [PATCH V5 08/10] Enable per-memcg background reclaim.
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (6 preceding siblings ...)
  2011-04-15 23:23 ` [PATCH V5 07/10] Add per-memcg zone "unreclaimable" Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-15 23:23 ` [PATCH V5 09/10] Add API to export per-memcg kswapd pid Ying Han
  2011-04-15 23:23 ` [PATCH V5 10/10] Add some per-memcg stats Ying Han
  9 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

By default, per-memcg background reclaim is disabled when the limit_in_bytes
is set to the maximum. kswapd_run() is called when the memcg is resized, and
kswapd_stop() is called when the memcg is deleted.

The usage is checked against the low_wmark once per 1024 page charge/uncharge
events per cpu, and the memcg's kswapd is woken up if the usage is above the
low_wmark.
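
As a rough sense of scale, assuming 4KiB pages, 1024 page events per cpu
correspond to roughly 4MiB of charge/uncharge activity on that cpu between
successive watermark checks, which keeps the check cheap on the
charge/uncharge path.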

changelog v4..v3:
1. move kswapd_stop to mem_cgroup_destroy based on comments from KAMEZAWA
2. move kswapd_run to setup_mem_cgroup_wmark, since the actual watermarks
determine whether per-memcg background reclaim is enabled.

changelog v3..v2:
1. some clean-ups

changelog v2..v1:
1. start/stop the per-cgroup kswapd at cgroup create/delete time.
2. remove the wmark check from per-page charging; the wmark is now checked
periodically based on the event counter.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0522d59..52e8344 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -105,10 +105,12 @@ enum mem_cgroup_events_index {
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
 	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_WMARK_EVENTS_THRESH,
 	MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
 #define SOFTLIMIT_EVENTS_TARGET (1024)
+#define WMARK_EVENTS_TARGET (1024)
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -370,6 +372,8 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 
+static void wake_memcg_kswapd(struct mem_cgroup *mem);
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
@@ -548,6 +552,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
+static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
+		wake_memcg_kswapd(mem);
+}
+
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -678,6 +688,9 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
 	case MEM_CGROUP_TARGET_SOFTLIMIT:
 		next = val + SOFTLIMIT_EVENTS_TARGET;
 		break;
+	case MEM_CGROUP_WMARK_EVENTS_THRESH:
+		next = val + WMARK_EVENTS_TARGET;
+		break;
 	default:
 		return;
 	}
@@ -701,6 +714,10 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
+		if (unlikely(__memcg_event_check(mem,
+			MEM_CGROUP_WMARK_EVENTS_THRESH))){
+			mem_cgroup_check_wmark(mem);
+		}
 	}
 }
 
@@ -845,6 +862,9 @@ static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
 
 		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
 		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+
+		if (!mem_cgroup_is_root(mem) && !mem->kswapd_wait)
+			kswapd_run(0, mem);
 	}
 }
 
@@ -4861,6 +4881,22 @@ int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
 	return mem->last_scanned_node;
 }
 
+static inline
+void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	wait_queue_head_t *wait;
+
+	if (!mem || !mem->high_wmark_distance)
+		return;
+
+	wait = mem->kswapd_wait;
+
+	if (!wait || !waitqueue_active(wait))
+		return;
+
+	wake_up_interruptible(wait);
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4964,6 +5000,7 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	kswapd_stop(0, mem);
 	mem_cgroup_put(mem);
 }
 
-- 
1.7.3.1



* [PATCH V5 09/10] Add API to export per-memcg kswapd pid.
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (7 preceding siblings ...)
  2011-04-15 23:23 ` [PATCH V5 08/10] Enable per-memcg background reclaim Ying Han
@ 2011-04-15 23:23 ` Ying Han
  2011-04-18  5:01   ` Minchan Kim
  2011-04-15 23:23 ` [PATCH V5 10/10] Add some per-memcg stats Ying Han
  9 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This adds an API which exports the per-memcg kswapd thread pid. The kswapd
thread is named "memcg_" + css_id, and the pid can be used to put the kswapd
thread into a cpu cgroup later (see the usage sketch after the example below).

$ mkdir /dev/cgroup/memory/A
$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_null 0

$ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
$ ps -ef | grep memcg
root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg

$ cat memory.kswapd_pid
memcg_3 6727
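
As a usage sketch (the cpu cgroup mount point and group name below are
hypothetical, not created by this patch), the exported pid can then be moved
into a cpu cgroup so the memcg's background reclaim can be scheduled
separately:

$ mkdir /dev/cgroup/cpu/memcg_3_kswapd
$ echo 6727 > /dev/cgroup/cpu/memcg_3_kswapd/tasks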

changelog v5..v4
1. Initialize the memcg-kswapd pid to -1 instead of 0.
2. Remove the kswapds_spinlock.

changelog v4..v3
1. Add the API based on KAMEZAWA's request on patch v3.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/swap.h |    2 ++
 mm/memcontrol.c      |   31 +++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 319b800..2d3e21a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -34,6 +34,8 @@ struct kswapd {
 };
 
 int kswapd(void *p);
+extern spinlock_t kswapds_spinlock;
+
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 52e8344..347e861 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4526,6 +4526,33 @@ static int mem_cgroup_wmark_read(struct cgroup *cgrp,
 	return 0;
 }
 
+static int mem_cgroup_kswapd_pid_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	struct task_struct *kswapd_thr = NULL;
+	struct kswapd *kswapd_p = NULL;
+	wait_queue_head_t *wait;
+	char name[TASK_COMM_LEN];
+	pid_t pid = -1;
+
+	sprintf(name, "memcg_null");
+
+	wait = mem_cgroup_kswapd_wait(mem);
+	if (wait) {
+		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+		kswapd_thr = kswapd_p->kswapd_task;
+		if (kswapd_thr) {
+			get_task_comm(name, kswapd_thr);
+			pid = kswapd_thr->pid;
+		}
+	}
+
+	cb->fill(cb, name, pid);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4643,6 +4670,10 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "reclaim_wmarks",
 		.read_map = mem_cgroup_wmark_read,
 	},
+	{
+		.name = "kswapd_pid",
+		.read_map = mem_cgroup_kswapd_pid_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1



* [PATCH V5 10/10] Add some per-memcg stats
  2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
                   ` (8 preceding siblings ...)
  2011-04-15 23:23 ` [PATCH V5 09/10] Add API to export per-memcg kswapd pid Ying Han
@ 2011-04-15 23:23 ` Ying Han
  9 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-15 23:23 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

A bunch of statistics are added to memory.stat to monitor per-cgroup
kswapd performance.

$ cat /dev/cgroup/yinghan/memory.stat
kswapd_steal 12588994
pg_pgsteal 0
kswapd_pgscan 18629519
pg_scan 0
pgrefill 2893517
pgoutrun 5342267948
allocstall 0
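
Reading these together: in this run every reclaimed page came from the
memcg's kswapd (kswapd_steal accounts for all of the reclaim while pg_pgsteal
and allocstall stay at 0), i.e. background reclaim kept the cgroup from ever
falling into direct reclaim.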

changelog v2..v1:
1. change the stats to use events instead of stat counters.
2. add the stats to the Documentation.

Signed-off-by: Ying Han <yinghan@google.com>
---
 Documentation/cgroups/memory.txt |   14 +++++++
 include/linux/memcontrol.h       |   51 +++++++++++++++++++++++++++
 mm/memcontrol.c                  |   72 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                      |   28 ++++++++++++--
 4 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b6ed61c..29dee73 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,13 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+kswapd_steal	- # of pages reclaimed from kswapd
+pg_pgsteal	- # of pages reclaimed from direct reclaim
+kswapd_pgscan	- # of pages scanned from kswapd
+pg_scan		- # of pages scanned from direct reclaim
+pgrefill	- # of pages scanned on active list
+pgoutrun	- # of times triggering kswapd
+allocstall	- # of times triggering direct reclaim
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -406,6 +413,13 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_kswapd_steal	- sum of all children's "kswapd_steal"
+total_pg_pgsteal	- sum of all children's "pg_pgsteal"
+total_kswapd_pgscan	- sum of all children's "kswapd_pgscan"
+total_pg_scan		- sum of all children's "pg_scan"
+total_pgrefill		- sum of all children's "pgrefill"
+total_pgoutrun		- sum of all children's "pgoutrun"
+total_allocstall	- sum of all children's "allocstall"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b18435d..879bd98 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -164,6 +164,15 @@ void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
 void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
 					unsigned long nr_scanned);
 
+/* background reclaim stats */
+void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pgrefill(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_outrun(struct mem_cgroup *memcg, int val);
+void mem_cgroup_alloc_stall(struct mem_cgroup *memcg, int val);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
 #endif
@@ -403,6 +412,48 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
 {
 }
 
+/* background reclaim stats */
+static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
+					   int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
+				       int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
+					    int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
+					int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
+				       int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
+					int val)
+{
+	return;
+}
+
+static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
+					  int val)
+{
+	return;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 347e861..64977d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,13 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	MEM_CGROUP_EVENTS_KSWAPD_STEAL, /* # of pages reclaimed from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSTEAL, /* # of pages reclaimed from ttfp */
+	MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, /* # of pages scanned from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSCAN, /* # of pages scanned from ttfp */
+	MEM_CGROUP_EVENTS_PGREFILL, /* # of pages scanned on active list */
+	MEM_CGROUP_EVENTS_PGOUTRUN, /* # of triggers of background reclaim */
+	MEM_CGROUP_EVENTS_ALLOCSTALL, /* # of triggers of direct reclaim */
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -611,6 +618,41 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_kswapd_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_STEAL], val);
+}
+
+void mem_cgroup_pg_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSTEAL], val);
+}
+
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_PGSCAN], val);
+}
+
+void mem_cgroup_pg_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSCAN], val);
+}
+
+void mem_cgroup_pgrefill(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGREFILL], val);
+}
+
+void mem_cgroup_pg_outrun(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGOUTRUN], val);
+}
+
+void mem_cgroup_alloc_stall(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_ALLOCSTALL], val);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -3973,6 +4015,13 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_KSWAPD_STEAL,
+	MCS_PG_PGSTEAL,
+	MCS_KSWAPD_PGSCAN,
+	MCS_PG_PGSCAN,
+	MCS_PGREFILL,
+	MCS_PGOUTRUN,
+	MCS_ALLOCSTALL,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3995,6 +4044,13 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"kswapd_steal", "total_kswapd_steal"},
+	{"pg_pgsteal", "total_pg_pgsteal"},
+	{"kswapd_pgscan", "total_kswapd_pgscan"},
+	{"pg_scan", "total_pg_scan"},
+	{"pgrefill", "total_pgrefill"},
+	{"pgoutrun", "total_pgoutrun"},
+	{"allocstall", "total_allocstall"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -4024,6 +4080,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
+	/* kswapd stat */
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_STEAL);
+	s->stat[MCS_KSWAPD_STEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSTEAL);
+	s->stat[MCS_PG_PGSTEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN);
+	s->stat[MCS_KSWAPD_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSCAN);
+	s->stat[MCS_PG_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGREFILL);
+	s->stat[MCS_PGREFILL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGOUTRUN);
+	s->stat[MCS_PGOUTRUN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_ALLOCSTALL);
+	s->stat[MCS_ALLOCSTALL] += val;
+
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0c76bd3..e77ceb4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1421,6 +1421,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_pgscan(sc->mem_cgroup, nr_scanned);
+		else
+			mem_cgroup_pg_pgscan(sc->mem_cgroup, nr_scanned);
 	}
 
 	if (nr_taken == 0) {
@@ -1441,9 +1445,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	local_irq_disable();
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	if (scanning_global_lru(sc)) {
+		if (current_is_kswapd())
+			__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+		__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	} else {
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
+		else
+			mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
+	}
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
@@ -1541,7 +1552,12 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
-	__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	if (scanning_global_lru(sc))
+		__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	else
+		mem_cgroup_pgrefill(sc->mem_cgroup, pgscanned);
+
+
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -nr_taken);
 	else
@@ -2054,6 +2070,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
+	else
+		mem_cgroup_alloc_stall(sc->mem_cgroup, 1);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
@@ -2730,6 +2748,8 @@ loop_again:
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
 
+	mem_cgroup_pg_outrun(mem_cont, 1);
+
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc.priority = priority;
 		wmark_ok = false;
-- 
1.7.3.1



* Re: [PATCH V5 01/10] Add kswapd descriptor
  2011-04-15 23:23 ` [PATCH V5 01/10] Add kswapd descriptor Ying Han
@ 2011-04-18  0:57   ` Minchan Kim
  2011-04-18 18:09     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-04-18  0:57 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

Hi Ying,

I have some comments and nitpicks about coding style.

On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> There is a kswapd kernel thread for each numa node. We will add a different
> kswapd for each memcg. The kswapd is sleeping in the wait queue headed at

Why?

Many kernel developers raise an eyebrow at adding kernel threads, so you
should justify why we need a new kernel thread and why we can't handle this
with a workqueue.

Maybe you explained it already and I missed it; if so, sorry. But at the
least, a patch description that includes the _why_ is much easier for
maintainers to merge and for reviewers to review.

> kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
> information of node or memcg and it allows the global and per-memcg background
> reclaim to share common reclaim algorithms.
>
> This patch adds the kswapd descriptor and moves the per-node kswapd to use the
> new structure.
>
> changelog v5..v4:
> 1. add comment on kswapds_spinlock
> 2. remove the kswapds_spinlock. we don't need it here since the kswapd and pgdat
> have 1:1 mapping.
>
> changelog v3..v2:
> 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later patch.
> 2. rename thr in kswapd_run to something else.
>
> changelog v2..v1:
> 1. dynamic allocate kswapd descriptor and initialize the wait_queue_head of pgdat
> at kswapd_run.
> 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup kswapd
> descriptor.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/mmzone.h |    3 +-
>  include/linux/swap.h   |    7 ++++
>  mm/page_alloc.c        |    1 -
>  mm/vmscan.c            |   89 +++++++++++++++++++++++++++++++++++------------
>  4 files changed, 74 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 628f07b..6cba7d2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -640,8 +640,7 @@ typedef struct pglist_data {
>        unsigned long node_spanned_pages; /* total size of physical page
>                                             range, including holes */
>        int node_id;
> -       wait_queue_head_t kswapd_wait;
> -       struct task_struct *kswapd;
> +       wait_queue_head_t *kswapd_wait;

Personally, I prefer kswapd not kswapd_wait.
It's more readable and straightforward.

>        int kswapd_max_order;
>        enum zone_type classzone_idx;
>  } pg_data_t;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ed6ebe6..f43d406 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
>        return current->flags & PF_KSWAPD;
>  }
>
> +struct kswapd {
> +       struct task_struct *kswapd_task;
> +       wait_queue_head_t kswapd_wait;
> +       pg_data_t *kswapd_pgdat;
> +};
> +
> +int kswapd(void *p);
>  /*
>  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>  * be swapped to.  The swap type and the offset into that swap type are
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e1b52a..6340865 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>
>        pgdat_resize_init(pgdat);
>        pgdat->nr_zones = 0;
> -       init_waitqueue_head(&pgdat->kswapd_wait);
>        pgdat->kswapd_max_order = 0;
>        pgdat_page_cgroup_init(pgdat);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 060e4c1..61fb96e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2242,12 +2242,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
>  }
>
>  /* is kswapd sleeping prematurely? */
> -static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> -                                       int classzone_idx)
> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
> +                               long remaining, int classzone_idx)
>  {
>        int i;
>        unsigned long balanced = 0;
>        bool all_zones_ok = true;
> +       pg_data_t *pgdat = kswapd->kswapd_pgdat;
>
>        /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>        if (remaining)
> @@ -2570,28 +2571,31 @@ out:
>        return order;
>  }
>
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> +static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
> +                               int classzone_idx)
>  {
>        long remaining = 0;
>        DEFINE_WAIT(wait);
> +       pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +       wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;

kswapd_p? p means pointer?
wait_h? h means header?
Hmm.. Of course, it's trivial and easy to understand in this context, but we
haven't used such names before, so it reads rather awkwardly to me.

How about kswapd instead of kswapd_p, kswapd_wait instead of wait_h?

>
>        if (freezing(current) || kthread_should_stop())
>                return;
>
> -       prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> +       prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>
>        /* Try to sleep for a short interval */
> -       if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
> +       if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
>                remaining = schedule_timeout(HZ/10);
> -               finish_wait(&pgdat->kswapd_wait, &wait);
> -               prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> +               finish_wait(wait_h, &wait);
> +               prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>        }
>
>        /*
>         * After a short sleep, check if it was a premature sleep. If not, then
>         * go fully to sleep until explicitly woken up.
>         */
> -       if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
> +       if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
>                trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>
>                /*
> @@ -2611,7 +2615,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>                else
>                        count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>        }
> -       finish_wait(&pgdat->kswapd_wait, &wait);
> +       finish_wait(wait_h, &wait);
>  }
>
>  /*
> @@ -2627,20 +2631,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  * If there are applications that are active memory-allocators
>  * (most normal use), this basically shouldn't matter.
>  */
> -static int kswapd(void *p)
> +int kswapd(void *p)
>  {
>        unsigned long order;
>        int classzone_idx;
> -       pg_data_t *pgdat = (pg_data_t*)p;
> +       struct kswapd *kswapd_p = (struct kswapd *)p;
> +       pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +       wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>        struct task_struct *tsk = current;
>
>        struct reclaim_state reclaim_state = {
>                .reclaimed_slab = 0,
>        };
> -       const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +       const struct cpumask *cpumask;
>
>        lockdep_set_current_reclaim_state(GFP_KERNEL);
>
> +       BUG_ON(pgdat->kswapd_wait != wait_h);

If we include kswapd instead of kswapd_wait in pgdat, maybe we could
remove the check?

> +       cpumask = cpumask_of_node(pgdat->node_id);
>        if (!cpumask_empty(cpumask))
>                set_cpus_allowed_ptr(tsk, cpumask);
>        current->reclaim_state = &reclaim_state;
> @@ -2679,7 +2687,7 @@ static int kswapd(void *p)
>                        order = new_order;
>                        classzone_idx = new_classzone_idx;
>                } else {
> -                       kswapd_try_to_sleep(pgdat, order, classzone_idx);
> +                       kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
>                        order = pgdat->kswapd_max_order;
>                        classzone_idx = pgdat->classzone_idx;
>                        pgdat->kswapd_max_order = 0;
> @@ -2719,13 +2727,13 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>                pgdat->kswapd_max_order = order;
>                pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
>        }
> -       if (!waitqueue_active(&pgdat->kswapd_wait))
> +       if (!waitqueue_active(pgdat->kswapd_wait))
>                return;
>        if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
>                return;
>
>        trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
> -       wake_up_interruptible(&pgdat->kswapd_wait);
> +       wake_up_interruptible(pgdat->kswapd_wait);
>  }
>
>  /*
> @@ -2817,12 +2825,21 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>                for_each_node_state(nid, N_HIGH_MEMORY) {
>                        pg_data_t *pgdat = NODE_DATA(nid);
>                        const struct cpumask *mask;
> +                       struct kswapd *kswapd_p;
> +                       struct task_struct *kswapd_thr;
> +                       wait_queue_head_t *wait;
>
>                        mask = cpumask_of_node(pgdat->node_id);
>
> +                       wait = pgdat->kswapd_wait;
> +                       kswapd_p = container_of(wait, struct kswapd,
> +                                               kswapd_wait);
> +                       kswapd_thr = kswapd_p->kswapd_task;

kswapd_thr? thr means thread?
How about tsk?

> +
If we include kswapd instead of kswapd_wait in pgdat, wouldn't this be simpler?

struct kswapd *kswapd = pgdat->kswapd;
struct task_struct *kswapd_tsk = kswapd->kswapd_task;


>                        if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>                                /* One of our CPUs online: restore mask */
> -                               set_cpus_allowed_ptr(pgdat->kswapd, mask);
> +                               if (kswapd_thr)
> +                                       set_cpus_allowed_ptr(kswapd_thr, mask);
>                }
>        }
>        return NOTIFY_OK;
> @@ -2835,18 +2852,31 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  int kswapd_run(int nid)
>  {
>        pg_data_t *pgdat = NODE_DATA(nid);
> +       struct task_struct *kswapd_thr;
> +       struct kswapd *kswapd_p;
>        int ret = 0;
>
> -       if (pgdat->kswapd)
> +       if (pgdat->kswapd_wait)
>                return 0;
>
> -       pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> -       if (IS_ERR(pgdat->kswapd)) {
> +       kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
> +       if (!kswapd_p)
> +               return -ENOMEM;
> +
> +       init_waitqueue_head(&kswapd_p->kswapd_wait);
> +       pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +       kswapd_p->kswapd_pgdat = pgdat;
> +
> +       kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> +       if (IS_ERR(kswapd_thr)) {
>                /* failure at boot is fatal */
>                BUG_ON(system_state == SYSTEM_BOOTING);
>                printk("Failed to start kswapd on node %d\n",nid);
> +               pgdat->kswapd_wait = NULL;
> +               kfree(kswapd_p);
>                ret = -1;
> -       }
> +       } else
> +               kswapd_p->kswapd_task = kswapd_thr;
>        return ret;
>  }
>
> @@ -2855,10 +2885,23 @@ int kswapd_run(int nid)
>  */
>  void kswapd_stop(int nid)
>  {
> -       struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
> +       struct task_struct *kswapd_thr = NULL;
> +       struct kswapd *kswapd_p = NULL;
> +       wait_queue_head_t *wait;
> +
> +       pg_data_t *pgdat = NODE_DATA(nid);
> +
> +       wait = pgdat->kswapd_wait;
> +       if (wait) {
> +               kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +               kswapd_thr = kswapd_p->kswapd_task;
> +               kswapd_p->kswapd_task = NULL;
> +       }
> +
> +       if (kswapd_thr)
> +               kthread_stop(kswapd_thr);
>
> -       if (kswapd)
> -               kthread_stop(kswapd);
> +       kfree(kswapd_p);
>  }
>
>  static int __init kswapd_init(void)
> --
> 1.7.3.1
>
>

Hmm, I don't like kswapd_p, kswapd_thr, wait_h and kswapd_wait of pgdat.
But it's just my personal opinion. :)


-- 
Kind regards,
Minchan Kim



* Re: [PATCH V5 04/10] Infrastructure to support per-memcg reclaim.
  2011-04-15 23:23 ` [PATCH V5 04/10] Infrastructure to support per-memcg reclaim Ying Han
@ 2011-04-18  2:11   ` Minchan Kim
  2011-04-18 18:44     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-04-18  2:11 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> Add the kswapd_mem field in kswapd descriptor which links the kswapd
> kernel thread to a memcg. The per-memcg kswapd is sleeping in the wait
> queue headed at kswapd_wait field of the kswapd descriptor.
>
> The kswapd() function is now shared between global and per-memcg kswapd. It
> is passed in with the kswapd descriptor which contains the information of
> either node or memcg. Then the new function balance_mem_cgroup_pgdat is
> invoked if it is a per-memcg kswapd thread; the implementation of that
> function is in the following patch.
>
> changelog v4..v3:
> 1. fix up the kswapd_run and kswapd_stop for online_pages() and offline_pages.
> 2. drop the PF_MEMALLOC flag for memcg kswapd for now per KAMAZAWA's request.
>
> changelog v3..v2:
> 1. split off from the initial patch which includes all changes of the following
> three patches.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h |    5 ++
>  include/linux/swap.h       |    5 +-
>  mm/memcontrol.c            |   29 ++++++++
>  mm/memory_hotplug.c        |    4 +-
>  mm/vmscan.c                |  156 ++++++++++++++++++++++++++++++--------------
>  5 files changed, 147 insertions(+), 52 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3ece36d..f7ffd1f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -24,6 +24,7 @@ struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
> +struct kswapd;
>
>  /* Stats that can be updated by kernel. */
>  enum mem_cgroup_page_stat_item {
> @@ -83,6 +84,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
> +extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
> +                                 struct kswapd *kswapd_p);
> +extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
> +extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
>
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f43d406..17e0511 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -30,6 +30,7 @@ struct kswapd {
>        struct task_struct *kswapd_task;
>        wait_queue_head_t kswapd_wait;
>        pg_data_t *kswapd_pgdat;
> +       struct mem_cgroup *kswapd_mem;
>  };
>
>  int kswapd(void *p);
> @@ -303,8 +304,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
>  }
>  #endif
>
> -extern int kswapd_run(int nid);
> -extern void kswapd_stop(int nid);
> +extern int kswapd_run(int nid, struct mem_cgroup *mem);
> +extern void kswapd_stop(int nid, struct mem_cgroup *mem);
>
>  #ifdef CONFIG_MMU
>  /* linux/mm/shmem.c */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 76ad009..8761a6f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -278,6 +278,8 @@ struct mem_cgroup {
>         */
>        u64 high_wmark_distance;
>        u64 low_wmark_distance;
> +
> +       wait_queue_head_t *kswapd_wait;


Like I mentioned in [1/10], personally, I like including kswapd
instead of kswapd_wait. Just personal opinion. Feel free to ignore.

>  };
>
>  /* Stuffs for move charges at task migration. */
> @@ -4670,6 +4672,33 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
>        return ret;
>  }
>
> +int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
> +{
> +       if (!mem || !kswapd_p)
> +               return 0;
> +
> +       mem->kswapd_wait = &kswapd_p->kswapd_wait;
> +       kswapd_p->kswapd_mem = mem;
> +
> +       return css_id(&mem->css);
> +}
> +
> +void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
> +{
> +       if (mem)
> +               mem->kswapd_wait = NULL;
> +
> +       return;
> +}
> +
> +wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
> +{
> +       if (!mem)
> +               return NULL;
> +
> +       return mem->kswapd_wait;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>        struct mem_cgroup_tree_per_node *rtpn;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 321fc74..2f78ff6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -462,7 +462,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages)
>        setup_per_zone_wmarks();
>        calculate_zone_inactive_ratio(zone);
>        if (onlined_pages) {
> -               kswapd_run(zone_to_nid(zone));
> +               kswapd_run(zone_to_nid(zone), NULL);
>                node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
>        }
>
> @@ -897,7 +897,7 @@ repeat:
>        calculate_zone_inactive_ratio(zone);
>        if (!node_present_pages(node)) {
>                node_clear_state(node, N_HIGH_MEMORY);
> -               kswapd_stop(node);
> +               kswapd_stop(node, NULL);
>        }
>
>        vm_total_pages = nr_free_pagecache_pages();
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 61fb96e..06036d2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2241,6 +2241,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
>        return balanced_pages > (present_pages >> 2);
>  }
>
> +#define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)

In memcg, we already have a similar thing, "scanning_global_lru".
How about using the "global" term?

> +
>  /* is kswapd sleeping prematurely? */
>  static int sleeping_prematurely(struct kswapd *kswapd, int order,
>                                long remaining, int classzone_idx)
> @@ -2249,11 +2251,16 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
>        unsigned long balanced = 0;
>        bool all_zones_ok = true;
>        pg_data_t *pgdat = kswapd->kswapd_pgdat;
> +       struct mem_cgroup *mem = kswapd->kswapd_mem;
>
>        /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>        if (remaining)
>                return true;
>
> +       /* Doesn't support for per-memcg reclaim */
> +       if (mem)
> +               return false;
> +

How about is_global_kswapd instead of checking whether the mem field is NULL or not?

>        /* Check the watermark levels */
>        for (i = 0; i < pgdat->nr_zones; i++) {
>                struct zone *zone = pgdat->node_zones + i;
> @@ -2596,19 +2603,25 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>         * go fully to sleep until explicitly woken up.
>         */
>        if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
> -               trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +               if (is_node_kswapd(kswapd_p)) {
> +                       trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>
> -               /*
> -                * vmstat counters are not perfectly accurate and the estimated
> -                * value for counters such as NR_FREE_PAGES can deviate from the
> -                * true value by nr_online_cpus * threshold. To avoid the zone
> -                * watermarks being breached while under pressure, we reduce the
> -                * per-cpu vmstat threshold while kswapd is awake and restore
> -                * them before going back to sleep.
> -                */
> -               set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
> -               schedule();
> -               set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
> +                       /*
> +                        * vmstat counters are not perfectly accurate and the
> +                        * estimated value for counters such as NR_FREE_PAGES
> +                        * can deviate from the true value by nr_online_cpus *
> +                        * threshold. To avoid the zone watermarks being
> +                        * breached while under pressure, we reduce the per-cpu
> +                        * vmstat threshold while kswapd is awake and restore
> +                        * them before going back to sleep.
> +                        */
> +                       set_pgdat_percpu_threshold(pgdat,
> +                                                  calculate_normal_threshold);
> +                       schedule();
> +                       set_pgdat_percpu_threshold(pgdat,
> +                                               calculate_pressure_threshold);
> +               } else
> +                       schedule();
>        } else {
>                if (remaining)
>                        count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> @@ -2618,6 +2631,12 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>        finish_wait(wait_h, &wait);
>  }
>
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +                                                       int order)
> +{
> +       return 0;
> +}
> +
>  /*
>  * The background pageout daemon, started as a kernel thread
>  * from the init process.
> @@ -2637,6 +2656,7 @@ int kswapd(void *p)
>        int classzone_idx;
>        struct kswapd *kswapd_p = (struct kswapd *)p;
>        pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +       struct mem_cgroup *mem = kswapd_p->kswapd_mem;
>        wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>        struct task_struct *tsk = current;
>
> @@ -2647,10 +2667,12 @@ int kswapd(void *p)
>
>        lockdep_set_current_reclaim_state(GFP_KERNEL);
>
> -       BUG_ON(pgdat->kswapd_wait != wait_h);
> -       cpumask = cpumask_of_node(pgdat->node_id);
> -       if (!cpumask_empty(cpumask))
> -               set_cpus_allowed_ptr(tsk, cpumask);
> +       if (is_node_kswapd(kswapd_p)) {
> +               BUG_ON(pgdat->kswapd_wait != wait_h);
> +               cpumask = cpumask_of_node(pgdat->node_id);
> +               if (!cpumask_empty(cpumask))
> +                       set_cpus_allowed_ptr(tsk, cpumask);
> +       }
>        current->reclaim_state = &reclaim_state;
>
>        /*
> @@ -2665,7 +2687,10 @@ int kswapd(void *p)
>         * us from recursively trying to free more memory as we're
>         * trying to free the first piece of memory in the first place).
>         */
> -       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> +       if (is_node_kswapd(kswapd_p))
> +               tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> +       else
> +               tsk->flags |= PF_SWAPWRITE | PF_KSWAPD;
>        set_freezable();
>
>        order = 0;
> @@ -2675,24 +2700,29 @@ int kswapd(void *p)
>                int new_classzone_idx;
>                int ret;
>
> -               new_order = pgdat->kswapd_max_order;
> -               new_classzone_idx = pgdat->classzone_idx;
> -               pgdat->kswapd_max_order = 0;
> -               pgdat->classzone_idx = MAX_NR_ZONES - 1;
> -               if (order < new_order || classzone_idx > new_classzone_idx) {
> -                       /*
> -                        * Don't sleep if someone wants a larger 'order'
> -                        * allocation or has tigher zone constraints
> -                        */
> -                       order = new_order;
> -                       classzone_idx = new_classzone_idx;
> -               } else {
> -                       kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
> -                       order = pgdat->kswapd_max_order;
> -                       classzone_idx = pgdat->classzone_idx;
> +               if (is_node_kswapd(kswapd_p)) {
> +                       new_order = pgdat->kswapd_max_order;
> +                       new_classzone_idx = pgdat->classzone_idx;
>                        pgdat->kswapd_max_order = 0;
>                        pgdat->classzone_idx = MAX_NR_ZONES - 1;
> -               }
> +                       if (order < new_order ||
> +                                       classzone_idx > new_classzone_idx) {
> +                               /*
> +                                * Don't sleep if someone wants a larger 'order'
> +                                * allocation or has tigher zone constraints
> +                                */
> +                               order = new_order;
> +                               classzone_idx = new_classzone_idx;
> +                       } else {
> +                               kswapd_try_to_sleep(kswapd_p, order,
> +                                                   classzone_idx);
> +                               order = pgdat->kswapd_max_order;
> +                               classzone_idx = pgdat->classzone_idx;
> +                               pgdat->kswapd_max_order = 0;
> +                               pgdat->classzone_idx = MAX_NR_ZONES - 1;
> +                       }
> +               } else
> +                       kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
>
>                ret = try_to_freeze();
>                if (kthread_should_stop())
> @@ -2703,8 +2733,13 @@ int kswapd(void *p)
>                 * after returning from the refrigerator
>                 */
>                if (!ret) {
> -                       trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> -                       order = balance_pgdat(pgdat, order, &classzone_idx);
> +                       if (is_node_kswapd(kswapd_p)) {
> +                               trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> +                                                               order);
> +                               order = balance_pgdat(pgdat, order,
> +                                                       &classzone_idx);
> +                       } else
> +                               balance_mem_cgroup_pgdat(mem, order);
>                }
>        }
>        return 0;
> @@ -2849,30 +2884,53 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  * This kswapd start function will be called by init and node-hot-add.
>  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
>  */
> -int kswapd_run(int nid)
> +int kswapd_run(int nid, struct mem_cgroup *mem)
>  {
> -       pg_data_t *pgdat = NODE_DATA(nid);
>        struct task_struct *kswapd_thr;
> +       pg_data_t *pgdat = NULL;
>        struct kswapd *kswapd_p;
> +       static char name[TASK_COMM_LEN];
> +       int memcg_id;
>        int ret = 0;
>
> -       if (pgdat->kswapd_wait)
> -               return 0;
> +       if (!mem) {
> +               pgdat = NODE_DATA(nid);
> +               if (pgdat->kswapd_wait)
> +                       return ret;
> +       }
>
>        kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
>        if (!kswapd_p)
>                return -ENOMEM;
>
>        init_waitqueue_head(&kswapd_p->kswapd_wait);
> -       pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> -       kswapd_p->kswapd_pgdat = pgdat;
>
> -       kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> +       if (!mem) {
> +               pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +               kswapd_p->kswapd_pgdat = pgdat;
> +               snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
> +       } else {
> +               memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
> +               if (!memcg_id) {
> +                       kfree(kswapd_p);
> +                       return ret;
> +               }
> +               snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
> +       }
> +
> +       kswapd_thr = kthread_run(kswapd, kswapd_p, name);
>        if (IS_ERR(kswapd_thr)) {
>                /* failure at boot is fatal */
>                BUG_ON(system_state == SYSTEM_BOOTING);
> -               printk("Failed to start kswapd on node %d\n",nid);
> -               pgdat->kswapd_wait = NULL;
> +               if (!mem) {
> +                       printk(KERN_ERR "Failed to start kswapd on node %d\n",
> +                                                               nid);
> +                       pgdat->kswapd_wait = NULL;
> +               } else {
> +                       printk(KERN_ERR "Failed to start kswapd on memcg %d\n",
> +                                                               memcg_id);
> +                       mem_cgroup_clear_kswapd(mem);
> +               }
>                kfree(kswapd_p);
>                ret = -1;
>        } else
> @@ -2883,15 +2941,17 @@ int kswapd_run(int nid)
>  /*
>  * Called by memory hotplug when all memory in a node is offlined.
>  */
> -void kswapd_stop(int nid)
> +void kswapd_stop(int nid, struct mem_cgroup *mem)
>  {
>        struct task_struct *kswapd_thr = NULL;
>        struct kswapd *kswapd_p = NULL;
>        wait_queue_head_t *wait;
>
> -       pg_data_t *pgdat = NODE_DATA(nid);
> +       if (!mem)
> +               wait = NODE_DATA(nid)->kswapd_wait;
> +       else
> +               wait = mem_cgroup_kswapd_wait(mem);
>
> -       wait = pgdat->kswapd_wait;
>        if (wait) {
>                kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>                kswapd_thr = kswapd_p->kswapd_task;
> @@ -2910,7 +2970,7 @@ static int __init kswapd_init(void)
>
>        swap_setup();
>        for_each_node_state(nid, N_HIGH_MEMORY)
> -               kswapd_run(nid);
> +               kswapd_run(nid, NULL);
>        hotcpu_notifier(cpu_callback, 0);
>        return 0;
>  }
> --
> 1.7.3.1
>
>

Let me ask a question.

What's the effect of kswapd_try_to_sleep() in the memcg case?

As I read the code, sleeping_prematurely() always returns false for
memcg, so kswapd_try_to_sleep() sleeps for the short interval and then
sleeps again until the next wakeup. That has nothing to do with what
kswapd_try_to_sleep() is meant to achieve. So I would prefer you remove
the memcg special-casing from kswapd_try_to_sleep() and simply call
schedule() in the kswapd() function to put the memcg kswapd to sleep,
as sketched below.
But I haven't looked at the later patches in the series, so I may be
missing something.
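For reference, a rough sketch of that simplification inside kswapd()'s main loop, reusing the kswapd_p and wait_h variables from this series (untested, only an illustration of the suggestion):

	if (is_node_kswapd(kswapd_p)) {
		kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
	} else {
		DEFINE_WAIT(wait);

		/* memcg kswapd has no watermark heuristics: just sleep until woken */
		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
		if (!kthread_should_stop())
			schedule();
		finish_wait(wait_h, &wait);
	}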

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 05/10] Implement the select_victim_node within memcg.
  2011-04-15 23:23 ` [PATCH V5 05/10] Implement the select_victim_node within memcg Ying Han
@ 2011-04-18  2:22   ` Minchan Kim
  2011-04-18 17:11     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-04-18  2:22 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> This add the mechanism for background reclaim which we remember the
> last scanned node and always starting from the next one each time.
> The simple round-robin fasion provide the fairness between nodes for
> each memcg.
>
> changelog v5..v4:
> 1. initialize the last_scanned_node to MAX_NUMNODES.
>
> changelog v4..v3:
> 1. split off from the per-memcg background reclaim patch.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h |    3 +++
>  mm/memcontrol.c            |   35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f7ffd1f..d4ff7f2 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
>                                  struct kswapd *kswapd_p);
>  extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
>  extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
> +extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
> +extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> +                                       const nodemask_t *nodes);
>
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8761a6f..b92dc13 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -279,6 +279,11 @@ struct mem_cgroup {
>        u64 high_wmark_distance;
>        u64 low_wmark_distance;
>
> +       /* While doing per cgroup background reclaim, we cache the

Correct comment style.
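That is, the usual multi-line kernel comment style with the opening /* on its own line, roughly (the rest of the comment elided):

	/*
	 * While doing per cgroup background reclaim, we cache the
	 * ...
	 */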

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 06/10] Per-memcg background reclaim.
  2011-04-15 23:23 ` [PATCH V5 06/10] Per-memcg background reclaim Ying Han
@ 2011-04-18  3:51   ` Minchan Kim
  2011-04-18 21:38     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-04-18  3:51 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> This is the main loop of per-memcg background reclaim which is implemented in
> function balance_mem_cgroup_pgdat().
>
> The function performs a priority loop similar to global reclaim. During each
> iteration it invokes balance_pgdat_node() for all nodes on the system, which
> is another new function performs background reclaim per node. After reclaiming
> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
> it returns true.
>
> changelog v5..v4:
> 1. remove duplicate check on nodes_empty()
> 2. add logic to check if the per-memcg lru is empty on the zone.
> 3. make per-memcg kswapd to reclaim SWAP_CLUSTER_MAX per zone. It make senses
> since it helps to balance the pressure across zones within the memcg.
>
> changelog v4..v3:
> 1. split the select_victim_node and zone_unreclaimable to a seperate patches
> 2. remove the logic tries to do zone balancing.
>
> changelog v3..v2:
> 1. change mz->all_unreclaimable to be boolean.
> 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
> 3. some more clean-up.
>
> changelog v2..v1:
> 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> reclaim.
> 3. name the per-memcg memcg as "memcg-id" (css->id). And the global kswapd
> keeps the same name.
> 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
> after freeing.
> 5. add the fairness in zonelist where memcg remember the last zone reclaimed
> from.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  mm/vmscan.c |  157 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 157 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 06036d2..39e6300 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -47,6 +47,8 @@
>
>  #include <linux/swapops.h>
>
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>
>  #define CREATE_TRACE_POINTS
> @@ -111,6 +113,8 @@ struct scan_control {
>         * are scanned.
>         */
>        nodemask_t      *nodemask;
> +
> +       int priority;
>  };
>
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> @@ -2631,11 +2635,164 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>        finish_wait(wait_h, &wait);
>  }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scanns all the zones of the
> + * node and returns the nr_scanned and nr_reclaimed.
> + */
> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> +                                       struct scan_control *sc)
> +{
> +       int i;
> +       unsigned long total_scanned = 0;
> +       struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +       int priority = sc->priority;
> +       enum lru_list l;
> +
> +       /*
> +        * This dma->highmem order is consistant with global reclaim.
> +        * We do this because the page allocator works in the opposite
> +        * direction although memcg user pages are mostly allocated at
> +        * highmem.
> +        */
> +       for (i = 0; i < pgdat->nr_zones; i++) {
> +               struct zone *zone = pgdat->node_zones + i;
> +               unsigned long scan = 0;
> +
> +               for_each_evictable_lru(l)
> +                       scan += mem_cgroup_zone_nr_pages(mem_cont, zone, l);
> +
> +               if (!populated_zone(zone) || !scan)
> +                       continue;

Do we really need this double check?
Isn't the scan check alone enough?
And shouldn't we consider the non-swap case?
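In other words, something along these lines (only a sketch; an empty per-memcg LRU already implies there is nothing to do for the zone, and with sc->may_swap == 0 the anon LRUs arguably should not be counted at all):

	for_each_evictable_lru(l) {
		/* skip anon LRUs when this reclaim is not allowed to swap */
		if (!sc->may_swap && !is_file_lru(l))
			continue;
		scan += mem_cgroup_zone_nr_pages(mem_cont, zone, l);
	}

	/* an empty per-memcg LRU also covers the !populated_zone() case */
	if (!scan)
		continue;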

> +
> +               sc->nr_scanned = 0;
> +               shrink_zone(priority, zone, sc);
> +               total_scanned += sc->nr_scanned;
> +
> +               /*
> +                * If we've done a decent amount of scanning and
> +                * the reclaim ratio is low, start doing writepage
> +                * even in laptop mode
> +                */
> +               if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +                   total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> +                       sc->may_writepage = 1;

I don't want to add more random writes while we are not in a real memory
shortage. Do you have any reason to reclaim memory so urgently that
kswapd has to write out dirty pages itself? If we wait a little, the
flusher threads should write those pages out for us.

> +               }
> +       }
> +
> +       sc->nr_scanned = total_scanned;
> +       return;

unnecessary return.

> +}
> +
> +/*
> + * Per cgroup background reclaim.
> + * TODO: Take off the order since memcg always do order 0
> + */
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +                                             int order)
> +{
> +       int i, nid;
> +       int start_node;
> +       int priority;
> +       bool wmark_ok;
> +       int loop;
> +       pg_data_t *pgdat;
> +       nodemask_t do_nodes;
> +       unsigned long total_scanned;
> +       struct scan_control sc = {
> +               .gfp_mask = GFP_KERNEL,
> +               .may_unmap = 1,
> +               .may_swap = 1,
> +               .nr_to_reclaim = SWAP_CLUSTER_MAX,
> +               .swappiness = vm_swappiness,
> +               .order = order,
> +               .mem_cgroup = mem_cont,
> +       };
> +
> +loop_again:
> +       do_nodes = NODE_MASK_NONE;
> +       sc.may_writepage = !laptop_mode;

I think it should depend on urgency (i.e. priority, the reclaim ratio
(#reclaimed/#scanned), or something similar), not on laptop_mode, in the
memcg case.
As I said earlier, it would be better to avoid random writes.

> +       sc.nr_reclaimed = 0;
> +       total_scanned = 0;
> +
> +       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +               sc.priority = priority;
> +               wmark_ok = false;
> +               loop = 0;
> +
> +               /* The swap token gets in the way of swapout... */
> +               if (!priority)
> +                       disable_swap_token();
> +
> +               if (priority == DEF_PRIORITY)
> +                       do_nodes = node_states[N_ONLINE];
> +
> +               while (1) {
> +                       nid = mem_cgroup_select_victim_node(mem_cont,
> +                                                       &do_nodes);
> +
> +                       /* Indicate we have cycled the nodelist once

Fix comment style.

> +                        * TODO: we might add MAX_RECLAIM_LOOP for preventing
> +                        * kswapd burning cpu cycles.
> +                        */
> +                       if (loop == 0) {
> +                               start_node = nid;
> +                               loop++;
> +                       } else if (nid == start_node)
> +                               break;
> +
> +                       pgdat = NODE_DATA(nid);
> +                       balance_pgdat_node(pgdat, order, &sc);
> +                       total_scanned += sc.nr_scanned;
> +
> +                       /* Set the node which has at least

Fix comment style.

> +                        * one reclaimable zone
> +                        */
> +                       for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +                               struct zone *zone = pgdat->node_zones + i;
> +
> +                               if (!populated_zone(zone))
> +                                       continue;
> +                       }

I can't follow the comment and the logic here.
The comment mentions a reclaimable zone, but the loop only checks
populated_zone() and never breaks out. What is the intent?
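If the intent was to clear the node once no zone has anything left for this memcg to reclaim, the loop presumably needs a real test and a break, roughly like this (a sketch only, borrowing mem_cgroup_mz_unreclaimable() from patch 07 of this series):

	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone))
			continue;

		/* stop at the first zone this memcg can still reclaim from */
		if (!mem_cgroup_mz_unreclaimable(mem_cont, zone))
			break;
	}
	if (i < 0)
		node_clear(nid, do_nodes);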

> +                       if (i < 0)
> +                               node_clear(nid, do_nodes);
> +
> +                       if (mem_cgroup_watermark_ok(mem_cont,
> +                                                       CHARGE_WMARK_HIGH)) {
> +                               wmark_ok = true;
> +                               goto out;
> +                       }
> +
> +                       if (nodes_empty(do_nodes)) {
> +                               wmark_ok = true;
> +                               goto out;
> +                       }
> +               }
> +
> +               if (total_scanned && priority < DEF_PRIORITY - 2)
> +                       congestion_wait(WRITE, HZ/10);
> +
> +               if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> +                       break;
> +       }
> +out:
> +       if (!wmark_ok) {
> +               cond_resched();
> +
> +               try_to_freeze();
> +
> +               goto loop_again;
> +       }
> +
> +       return sc.nr_reclaimed;
> +}
> +#else
>  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
>                                                        int order)
>  {
>        return 0;
>  }
> +#endif
>
>  /*
>  * The background pageout daemon, started as a kernel thread
> --
> 1.7.3.1
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 07/10] Add per-memcg zone "unreclaimable"
  2011-04-15 23:23 ` [PATCH V5 07/10] Add per-memcg zone "unreclaimable" Ying Han
@ 2011-04-18  4:27   ` Minchan Kim
  2011-04-18 17:31     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-04-18  4:27 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> After reclaiming each node per memcg, it checks mem_cgroup_watermark_ok()
> and breaks the priority loop if it returns true. The per-memcg zone will
> be marked as "unreclaimable" if the scanning rate is much greater than the
> reclaiming rate on the per-memcg LRU. The bit is cleared when there is a
> page charged to the memcg being freed. Kswapd breaks the priority loop if
> all the zones are marked as "unreclaimable".
>
> changelog v5..v4:
> 1. reduce the frequency of updating mz->unreclaimable bit by using the existing
> memcg batch in task struct.
> 2. add new function mem_cgroup_mz_clear_unreclaimable() for recoganizing zone.
>
> changelog v4..v3:
> 1. split off from the per-memcg background reclaim patch in V3.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h |   40 ++++++++++++++
>  include/linux/sched.h      |    1 +
>  include/linux/swap.h       |    2 +
>  mm/memcontrol.c            |  130 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/vmscan.c                |   19 +++++++
>  5 files changed, 191 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d4ff7f2..b18435d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -155,6 +155,14 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
> +void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
> +                                       struct zone *zone);
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +                                       unsigned long nr_scanned);
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> @@ -345,6 +353,38 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  {
>  }
>
> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
> +                                                               int zid)
> +{
> +       return false;
> +}
> +
> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
> +                                               struct zone *zone)
> +{
> +       return false;
> +}
> +
> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
> +                                                       struct zone *zone)
> +{
> +}
> +
> +static inline void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem,
> +                                                       struct page *page)
> +{
> +}
> +
> +static inline void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
> +                                                       struct zone *zone);
> +{
> +}
> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
> +                                               struct zone *zone,
> +                                               unsigned long nr_scanned)
> +{
> +}
> +
>  static inline
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                            gfp_t gfp_mask)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 98fc7ed..3370c5a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1526,6 +1526,7 @@ struct task_struct {
>                struct mem_cgroup *memcg; /* target memcg of uncharge */
>                unsigned long nr_pages; /* uncharged usage */
>                unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
> +               struct zone *zone; /* a zone page is last uncharged */
>        } memcg_batch;
>  #endif
>  };
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 17e0511..319b800 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -160,6 +160,8 @@ enum {
>        SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
>  };
>
> +#define ZONE_RECLAIMABLE_RATE 6
> +

You can use ZONE_RECLAIMABLE_RATE in zone_reclaimable(), too.
If you want to keep separate rates for memcg and global reclaim, please
use a clearer macro name such as ZONE_MEMCG_RECLAIMABLE_RATE.
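For reference, the existing zone_reclaimable() helper in mm/vmscan.c open-codes the same factor of six, so the macro could simply be reused there (sketch):

	static bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned <
			zone_reclaimable_pages(zone) * ZONE_RECLAIMABLE_RATE;
	}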

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 09/10] Add API to export per-memcg kswapd pid.
  2011-04-15 23:23 ` [PATCH V5 09/10] Add API to export per-memcg kswapd pid Ying Han
@ 2011-04-18  5:01   ` Minchan Kim
  2011-04-18 17:41     ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-04-18  5:01 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> This add the API which exports per-memcg kswapd thread pid. The kswapd
> thread is named as "memcg_" + css_id, and the pid can be used to put
> kswapd thread into cpu cgroup later.
>
> $ mkdir /dev/cgroup/memory/A
> $ cat /dev/cgroup/memory/A/memory.kswapd_pid
> memcg_null 0
>
> $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> $ ps -ef | grep memcg
> root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
> root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg
>
> $ cat memory.kswapd_pid
> memcg_3 6727
>
> changelog v5..v4
> 1. Initialize the memcg-kswapd pid to -1 instead of 0.
> 2. Remove the kswapds_spinlock.
>
> changelog v4..v3
> 1. Add the API based on KAMAZAWA's request on patch v3.
>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/swap.h |    2 ++
>  mm/memcontrol.c      |   31 +++++++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 319b800..2d3e21a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -34,6 +34,8 @@ struct kswapd {
>  };
>
>  int kswapd(void *p);
> +extern spinlock_t kswapds_spinlock;

Remove this extern declaration; the changelog says kswapds_spinlock was dropped in v5.



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 05/10] Implement the select_victim_node within memcg.
  2011-04-18  2:22   ` Minchan Kim
@ 2011-04-18 17:11     ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-18 17:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Sun, Apr 17, 2011 at 7:22 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> > This add the mechanism for background reclaim which we remember the
> > last scanned node and always starting from the next one each time.
> > The simple round-robin fasion provide the fairness between nodes for
> > each memcg.
> >
> > changelog v5..v4:
> > 1. initialize the last_scanned_node to MAX_NUMNODES.
> >
> > changelog v4..v3:
> > 1. split off from the per-memcg background reclaim patch.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/memcontrol.h |    3 +++
> >  mm/memcontrol.c            |   35 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 38 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index f7ffd1f..d4ff7f2 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -88,6 +88,9 @@ extern int mem_cgroup_init_kswapd(struct mem_cgroup
> *mem,
> >                                  struct kswapd *kswapd_p);
> >  extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
> >  extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup
> *mem);
> > +extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
> > +extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> > +                                       const nodemask_t *nodes);
> >
> >  static inline
> >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 8761a6f..b92dc13 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -279,6 +279,11 @@ struct mem_cgroup {
> >        u64 high_wmark_distance;
> >        u64 low_wmark_distance;
> >
> > +       /* While doing per cgroup background reclaim, we cache the
>
> Correct comment style.
>
Thanks. Will change in the next post.

--Ying

> --
> Kind regards,
> Minchan Kim
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 07/10] Add per-memcg zone "unreclaimable"
  2011-04-18  4:27   ` Minchan Kim
@ 2011-04-18 17:31     ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-18 17:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Sun, Apr 17, 2011 at 9:27 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> > After reclaiming each node per memcg, it checks mem_cgroup_watermark_ok()
> > and breaks the priority loop if it returns true. The per-memcg zone will
> > be marked as "unreclaimable" if the scanning rate is much greater than
> the
> > reclaiming rate on the per-memcg LRU. The bit is cleared when there is a
> > page charged to the memcg being freed. Kswapd breaks the priority loop if
> > all the zones are marked as "unreclaimable".
> >
> > changelog v5..v4:
> > 1. reduce the frequency of updating mz->unreclaimable bit by using the
> existing
> > memcg batch in task struct.
> > 2. add new function mem_cgroup_mz_clear_unreclaimable() for recoganizing
> zone.
> >
> > changelog v4..v3:
> > 1. split off from the per-memcg background reclaim patch in V3.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/memcontrol.h |   40 ++++++++++++++
> >  include/linux/sched.h      |    1 +
> >  include/linux/swap.h       |    2 +
> >  mm/memcontrol.c            |  130
> +++++++++++++++++++++++++++++++++++++++++++-
> >  mm/vmscan.c                |   19 +++++++
> >  5 files changed, 191 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index d4ff7f2..b18435d 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -155,6 +155,14 @@ static inline void mem_cgroup_dec_page_stat(struct
> page *page,
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int
> order,
> >                                                gfp_t gfp_mask);
> >  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> > +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int
> zid);
> > +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone
> *zone);
> > +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone
> *zone);
> > +void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page
> *page);
> > +void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
> > +                                       struct zone *zone);
> > +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone*
> zone,
> > +                                       unsigned long nr_scanned);
> >
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> > @@ -345,6 +353,38 @@ static inline void mem_cgroup_dec_page_stat(struct
> page *page,
> >  {
> >  }
> >
> > +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem,
> int nid,
> > +                                                               int zid)
> > +{
> > +       return false;
> > +}
> > +
> > +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
> > +                                               struct zone *zone)
> > +{
> > +       return false;
> > +}
> > +
> > +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup
> *mem,
> > +                                                       struct zone
> *zone)
> > +{
> > +}
> > +
> > +static inline void mem_cgroup_clear_unreclaimable(struct mem_cgroup
> *mem,
> > +                                                       struct page
> *page)
> > +{
> > +}
> > +
> > +static inline void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup
> *mem,
> > +                                                       struct zone
> *zone);
> > +{
> > +}
> > +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
> > +                                               struct zone *zone,
> > +                                               unsigned long nr_scanned)
> > +{
> > +}
> > +
> >  static inline
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int
> order,
> >                                            gfp_t gfp_mask)
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 98fc7ed..3370c5a 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1526,6 +1526,7 @@ struct task_struct {
> >                struct mem_cgroup *memcg; /* target memcg of uncharge */
> >                unsigned long nr_pages; /* uncharged usage */
> >                unsigned long memsw_nr_pages; /* uncharged mem+swap usage
> */
> > +               struct zone *zone; /* a zone page is last uncharged */
> >        } memcg_batch;
> >  #endif
> >  };
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 17e0511..319b800 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -160,6 +160,8 @@ enum {
> >        SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
> >  };
> >
> > +#define ZONE_RECLAIMABLE_RATE 6
> > +
>
> You can use ZONE_RECLAIMABLE_RATE in zone_reclaimable, too.
> If you want to separate rate of memcg and global, please clear macro
> name like ZONE_MEMCG_RECLAIMABLE_RATE.
>

For now I will leave them as the same value. Will make the change in the
next post.

Thanks

--Ying

>
> --
> Kind regards,
> Minchan Kim
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 09/10] Add API to export per-memcg kswapd pid.
  2011-04-18  5:01   ` Minchan Kim
@ 2011-04-18 17:41     ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-18 17:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Sun, Apr 17, 2011 at 10:01 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> > This add the API which exports per-memcg kswapd thread pid. The kswapd
> > thread is named as "memcg_" + css_id, and the pid can be used to put
> > kswapd thread into cpu cgroup later.
> >
> > $ mkdir /dev/cgroup/memory/A
> > $ cat /dev/cgroup/memory/A/memory.kswapd_pid
> > memcg_null 0
> >
> > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > $ ps -ef | grep memcg
> > root      6727     2  0 14:32 ?        00:00:00 [memcg_3]
> > root      6729  6044  0 14:32 ttyS0    00:00:00 grep memcg
> >
> > $ cat memory.kswapd_pid
> > memcg_3 6727
> >
> > changelog v5..v4
> > 1. Initialize the memcg-kswapd pid to -1 instead of 0.
> > 2. Remove the kswapds_spinlock.
> >
> > changelog v4..v3
> > 1. Add the API based on KAMAZAWA's request on patch v3.
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/swap.h |    2 ++
> >  mm/memcontrol.c      |   31 +++++++++++++++++++++++++++++++
> >  2 files changed, 33 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 319b800..2d3e21a 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -34,6 +34,8 @@ struct kswapd {
> >  };
> >
> >  int kswapd(void *p);
> > +extern spinlock_t kswapds_spinlock;
>
> Remove spinlock.
>

Thanks. I will remove it in the next post.

--Ying

>
>
>
> --
> Kind regards,
> Minchan Kim
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 01/10] Add kswapd descriptor
  2011-04-18  0:57   ` Minchan Kim
@ 2011-04-18 18:09     ` Ying Han
  2011-04-19  5:35       ` Minchan Kim
  0 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-18 18:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm


On Sun, Apr 17, 2011 at 5:57 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi Ying,
>
> I have some comments and nitpick about coding style.
>

Hi Minchan, thank you for your comments and reviews.

>
> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> > There is a kswapd kernel thread for each numa node. We will add a
> different
> > kswapd for each memcg. The kswapd is sleeping in the wait queue headed at
>
> Why?
>
> Easily, many kernel developers raise an eyebrow to increase kernel thread.
> So you should justify why we need new kernel thread, why we can't
> handle it with workqueue.
>
> Maybe you explained it and I didn't know it. If it is, sorry.
> But at least, the patch description included _why_ is much mergeable
> to maintainers and helpful to review the code to reviewers.
>

Here are the replies I posted on an earlier version regarding the workqueue question.

"
I did some study on workqueues after posting V2. There was a comment
suggesting a workqueue instead of a per-memcg kswapd thread, since that
would cut the number of kernel threads created on a host with lots of
cgroups. Each kernel thread allocates about 8KB of stack, so roughly
8MB in total with a thousand cgroups.

The workqueue model merged in the 2.6.36 kernel is called "concurrency
managed workqueue" (cmwq), and is intended to provide flexible
concurrency without wasting resources. I studied it a bit and here is
what I found:

1. The workqueue is complicated and we need to be very careful with the
work items we put on it. We have seen cases where one work item gets
stuck and the remaining work items make no progress. For example, in
dirty page writeback one heavy-writer cgroup could starve the other
cgroups from flushing dirty pages to the same disk. In the kswapd case,
I can imagine a similar scenario.

2. How to prioritize the work items is another problem. The order in
which work items are added to the queue would dictate the order in
which cgroups get reclaimed. We don't have that restriction today; we
rely on the cpu scheduler to put each kswapd on the right cpu core. We
might introduce reclaim priorities later, and it is unclear how we
would handle that with a workqueue.

3. Based on what I have observed, not many callers have migrated to
cmwq yet, and I don't have much data on how well it works.

Back to the current model: on a machine with thousands of cgroups, a
thousand kswapd threads take about 8MB in total (8KB of stack per
thread). We run systems with fake NUMA, where each NUMA node already
has its own kswapd, and so far we haven't seen problems caused by
"lots of" kswapd threads. Also, a kernel thread that isn't running
adds essentially no overhead.

Given the complexity of the workqueue and the benefit it provides, I
would like to stick with the current model first. After the basic
infrastructure and the other targeted reclaim improvements are in, we
can come back to this. What do you think?
"

KAMEZAWA's reply:
"
Okay, fair enough. kthread_run() will win.

Then, I have another request. I'd like to be able to put the per-memcg
kswapd into some cpu cgroup to limit its cpu usage.

- Could you show the thread ID somewhere, and
 confirm that we can put it into a cpu cgroup?
 (Creating an automatic cpu cgroup for the memcg kswapd is one option, I think.)
"


> > kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
> > information of node or memcg and it allows the global and per-memcg
> background
> > reclaim to share common reclaim algorithms.
> >
> > This patch adds the kswapd descriptor and moves the per-node kswapd to
> use the
> > new structure.
> >
> > changelog v5..v4:
> > 1. add comment on kswapds_spinlock
> > 2. remove the kswapds_spinlock. we don't need it here since the kswapd
> and pgdat
> > have 1:1 mapping.
> >
> > changelog v3..v2:
> > 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later patch.
> > 2. rename thr in kswapd_run to something else.
> >
> > changelog v2..v1:
> > 1. dynamic allocate kswapd descriptor and initialize the wait_queue_head
> of pgdat
> > at kswapd_run.
> > 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup
> kswapd
> > descriptor.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/mmzone.h |    3 +-
> >  include/linux/swap.h   |    7 ++++
> >  mm/page_alloc.c        |    1 -
> >  mm/vmscan.c            |   89
> +++++++++++++++++++++++++++++++++++------------
> >  4 files changed, 74 insertions(+), 26 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 628f07b..6cba7d2 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -640,8 +640,7 @@ typedef struct pglist_data {
> >        unsigned long node_spanned_pages; /* total size of physical page
> >                                             range, including holes */
> >        int node_id;
> > -       wait_queue_head_t kswapd_wait;
> > -       struct task_struct *kswapd;
> > +       wait_queue_head_t *kswapd_wait;
>
> Personally, I prefer kswapd not kswapd_wait.
> It's more readable and straightforward.
>

Hmm, I would like to keep it as it is for this version and improve it
after the basic stuff is in. Hope that works for you?


> >        int kswapd_max_order;
> >        enum zone_type classzone_idx;
> >  } pg_data_t;
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index ed6ebe6..f43d406 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
> >        return current->flags & PF_KSWAPD;
> >  }
> >
> > +struct kswapd {
> > +       struct task_struct *kswapd_task;
> > +       wait_queue_head_t kswapd_wait;
> > +       pg_data_t *kswapd_pgdat;
> > +};
> > +
> > +int kswapd(void *p);
> >  /*
> >  * MAX_SWAPFILES defines the maximum number of swaptypes: things which
> can
> >  * be swapped to.  The swap type and the offset into that swap type are
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6e1b52a..6340865 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct
> pglist_data *pgdat,
> >
> >        pgdat_resize_init(pgdat);
> >        pgdat->nr_zones = 0;
> > -       init_waitqueue_head(&pgdat->kswapd_wait);
> >        pgdat->kswapd_max_order = 0;
> >        pgdat_page_cgroup_init(pgdat);
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 060e4c1..61fb96e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2242,12 +2242,13 @@ static bool pgdat_balanced(pg_data_t *pgdat,
> unsigned long balanced_pages,
> >  }
> >
> >  /* is kswapd sleeping prematurely? */
> > -static bool sleeping_prematurely(pg_data_t *pgdat, int order, long
> remaining,
> > -                                       int classzone_idx)
> > +static int sleeping_prematurely(struct kswapd *kswapd, int order,
> > +                               long remaining, int classzone_idx)
> >  {
> >        int i;
> >        unsigned long balanced = 0;
> >        bool all_zones_ok = true;
> > +       pg_data_t *pgdat = kswapd->kswapd_pgdat;
> >
> >        /* If a direct reclaimer woke kswapd within HZ/10, it's premature
> */
> >        if (remaining)
> > @@ -2570,28 +2571,31 @@ out:
> >        return order;
> >  }
> >
> > -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int
> classzone_idx)
> > +static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
> > +                               int classzone_idx)
> >  {
> >        long remaining = 0;
> >        DEFINE_WAIT(wait);
> > +       pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +       wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>
> kswapd_p? p means pointer?
>
yes,

> wait_h? h means header?
>
 yes,

> Hmm.. Of course, it's trivial and we can understand easily in such
> context but we don't have been used such words so it's rather awkward
> to me.
>
> How about kswapd instead of kswapd_p, kswapd_wait instead of wait_h?
>

That sounds OK to me for the rename. However, I would like to make it a
separate patch after the basic stuff is in. Is that OK?

>
> >
> >        if (freezing(current) || kthread_should_stop())
> >                return;
> >
> > -       prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > +       prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> >
> >        /* Try to sleep for a short interval */
> > -       if (!sleeping_prematurely(pgdat, order, remaining,
> classzone_idx)) {
> > +       if (!sleeping_prematurely(kswapd_p, order, remaining,
> classzone_idx)) {
> >                remaining = schedule_timeout(HZ/10);
> > -               finish_wait(&pgdat->kswapd_wait, &wait);
> > -               prepare_to_wait(&pgdat->kswapd_wait, &wait,
> TASK_INTERRUPTIBLE);
> > +               finish_wait(wait_h, &wait);
> > +               prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> >        }
> >
> >        /*
> >         * After a short sleep, check if it was a premature sleep. If not,
> then
> >         * go fully to sleep until explicitly woken up.
> >         */
> > -       if (!sleeping_prematurely(pgdat, order, remaining,
> classzone_idx)) {
> > +       if (!sleeping_prematurely(kswapd_p, order, remaining,
> classzone_idx)) {
> >                trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> >
> >                /*
> > @@ -2611,7 +2615,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat,
> int order, int classzone_idx)
> >                else
> >                        count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> >        }
> > -       finish_wait(&pgdat->kswapd_wait, &wait);
> > +       finish_wait(wait_h, &wait);
> >  }
> >
> >  /*
> > @@ -2627,20 +2631,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat,
> int order, int classzone_idx)
> >  * If there are applications that are active memory-allocators
> >  * (most normal use), this basically shouldn't matter.
> >  */
> > -static int kswapd(void *p)
> > +int kswapd(void *p)
> >  {
> >        unsigned long order;
> >        int classzone_idx;
> > -       pg_data_t *pgdat = (pg_data_t*)p;
> > +       struct kswapd *kswapd_p = (struct kswapd *)p;
> > +       pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +       wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> >        struct task_struct *tsk = current;
> >
> >        struct reclaim_state reclaim_state = {
> >                .reclaimed_slab = 0,
> >        };
> > -       const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> > +       const struct cpumask *cpumask;
> >
> >        lockdep_set_current_reclaim_state(GFP_KERNEL);
> >
> > +       BUG_ON(pgdat->kswapd_wait != wait_h);
>
> If we include kswapd instead of kswapd_wait in pgdat, maybe we could
> remove the check?
>
> > +       cpumask = cpumask_of_node(pgdat->node_id);
> >        if (!cpumask_empty(cpumask))
> >                set_cpus_allowed_ptr(tsk, cpumask);
> >        current->reclaim_state = &reclaim_state;
> > @@ -2679,7 +2687,7 @@ static int kswapd(void *p)
> >                        order = new_order;
> >                        classzone_idx = new_classzone_idx;
> >                } else {
> > -                       kswapd_try_to_sleep(pgdat, order, classzone_idx);
> > +                       kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> >                        order = pgdat->kswapd_max_order;
> >                        classzone_idx = pgdat->classzone_idx;
> >                        pgdat->kswapd_max_order = 0;
> > @@ -2719,13 +2727,13 @@ void wakeup_kswapd(struct zone *zone, int order,
> enum zone_type classzone_idx)
> >                pgdat->kswapd_max_order = order;
> >                pgdat->classzone_idx = min(pgdat->classzone_idx,
> classzone_idx);
> >        }
> > -       if (!waitqueue_active(&pgdat->kswapd_wait))
> > +       if (!waitqueue_active(pgdat->kswapd_wait))
> >                return;
> >        if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0,
> 0))
> >                return;
> >
> >        trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone),
> order);
> > -       wake_up_interruptible(&pgdat->kswapd_wait);
> > +       wake_up_interruptible(pgdat->kswapd_wait);
> >  }
> >
> >  /*
> > @@ -2817,12 +2825,21 @@ static int __devinit cpu_callback(struct
> notifier_block *nfb,
> >                for_each_node_state(nid, N_HIGH_MEMORY) {
> >                        pg_data_t *pgdat = NODE_DATA(nid);
> >                        const struct cpumask *mask;
> > +                       struct kswapd *kswapd_p;
> > +                       struct task_struct *kswapd_thr;
> > +                       wait_queue_head_t *wait;
> >
> >                        mask = cpumask_of_node(pgdat->node_id);
> >
> > +                       wait = pgdat->kswapd_wait;
> > +                       kswapd_p = container_of(wait, struct kswapd,
> > +                                               kswapd_wait);
> > +                       kswapd_thr = kswapd_p->kswapd_task;
>
> kswapd_thr? thr means thread?
> How about tsk?
>

OK. I made the change; it will be included in the next post.

>
> > +
> If we include kswapd instead of kswapd_wait in pgdat, don't we make this
> simple?
>
> struct kswapd *kswapd = pgdat->kswapd;
> struct task_struct *kswapd_tsk = kswapd->kswapd_task;
>
>
> >                        if (cpumask_any_and(cpu_online_mask, mask) <
> nr_cpu_ids)
> >                                /* One of our CPUs online: restore mask */
> > -                               set_cpus_allowed_ptr(pgdat->kswapd,
> mask);
> > +                               if (kswapd_thr)
> > +                                       set_cpus_allowed_ptr(kswapd_thr,
> mask);
> >                }
> >        }
> >        return NOTIFY_OK;
> > @@ -2835,18 +2852,31 @@ static int __devinit cpu_callback(struct
> notifier_block *nfb,
> >  int kswapd_run(int nid)
> >  {
> >        pg_data_t *pgdat = NODE_DATA(nid);
> > +       struct task_struct *kswapd_thr;
> > +       struct kswapd *kswapd_p;
> >        int ret = 0;
> >
> > -       if (pgdat->kswapd)
> > +       if (pgdat->kswapd_wait)
> >                return 0;
> >
> > -       pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> > -       if (IS_ERR(pgdat->kswapd)) {
> > +       kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
> > +       if (!kswapd_p)
> > +               return -ENOMEM;
> > +
> > +       init_waitqueue_head(&kswapd_p->kswapd_wait);
> > +       pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> > +       kswapd_p->kswapd_pgdat = pgdat;
> > +
> > +       kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> > +       if (IS_ERR(kswapd_thr)) {
> >                /* failure at boot is fatal */
> >                BUG_ON(system_state == SYSTEM_BOOTING);
> >                printk("Failed to start kswapd on node %d\n",nid);
> > +               pgdat->kswapd_wait = NULL;
> > +               kfree(kswapd_p);
> >                ret = -1;
> > -       }
> > +       } else
> > +               kswapd_p->kswapd_task = kswapd_thr;
> >        return ret;
> >  }
> >
> > @@ -2855,10 +2885,23 @@ int kswapd_run(int nid)
> >  */
> >  void kswapd_stop(int nid)
> >  {
> > -       struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
> > +       struct task_struct *kswapd_thr = NULL;
> > +       struct kswapd *kswapd_p = NULL;
> > +       wait_queue_head_t *wait;
> > +
> > +       pg_data_t *pgdat = NODE_DATA(nid);
> > +
> > +       wait = pgdat->kswapd_wait;
> > +       if (wait) {
> > +               kswapd_p = container_of(wait, struct kswapd,
> kswapd_wait);
> > +               kswapd_thr = kswapd_p->kswapd_task;
> > +               kswapd_p->kswapd_task = NULL;
> > +       }
> > +
> > +       if (kswapd_thr)
> > +               kthread_stop(kswapd_thr);
> >
> > -       if (kswapd)
> > -               kthread_stop(kswapd);
> > +       kfree(kswapd_p);
> >  }
> >
> >  static int __init kswapd_init(void)
> > --
> > 1.7.3.1
> >
> >
>
> Hmm, I don't like kswapd_p, kswapd_thr, wait_h and kswapd_wait of pgdat.
> But it's just my personal opinion. :)
>

Thank you for your comments :)

>
>
> --
> Kind regards,
> Minchan Kim
>

[-- Attachment #2: Type: text/html, Size: 24424 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 04/10] Infrastructure to support per-memcg reclaim.
  2011-04-18  2:11   ` Minchan Kim
@ 2011-04-18 18:44     ` Ying Han
  0 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2011-04-18 18:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 19767 bytes --]

On Sun, Apr 17, 2011 at 7:11 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> > Add the kswapd_mem field in kswapd descriptor which links the kswapd
> > kernel thread to a memcg. The per-memcg kswapd is sleeping in the wait
> > queue headed at kswapd_wait field of the kswapd descriptor.
> >
> > The kswapd() function is now shared between global and per-memcg kswapd.
> It
> > is passed in with the kswapd descriptor which contains the information of
> > either node or memcg. Then the new function balance_mem_cgroup_pgdat is
> > invoked if it is per-mem kswapd thread, and the implementation of the
> function
> > is on the following patch.
> >
> > changelog v4..v3:
> > 1. fix up the kswapd_run and kswapd_stop for online_pages() and
> offline_pages.
> > 2. drop the PF_MEMALLOC flag for memcg kswapd for now per KAMAZAWA's
> request.
> >
> > changelog v3..v2:
> > 1. split off from the initial patch which includes all changes of the
> following
> > three patches.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/memcontrol.h |    5 ++
> >  include/linux/swap.h       |    5 +-
> >  mm/memcontrol.c            |   29 ++++++++
> >  mm/memory_hotplug.c        |    4 +-
> >  mm/vmscan.c                |  156
> ++++++++++++++++++++++++++++++--------------
> >  5 files changed, 147 insertions(+), 52 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 3ece36d..f7ffd1f 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -24,6 +24,7 @@ struct mem_cgroup;
> >  struct page_cgroup;
> >  struct page;
> >  struct mm_struct;
> > +struct kswapd;
> >
> >  /* Stats that can be updated by kernel. */
> >  enum mem_cgroup_page_stat_item {
> > @@ -83,6 +84,10 @@ int task_in_mem_cgroup(struct task_struct *task, const
> struct mem_cgroup *mem);
> >  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page
> *page);
> >  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> >  extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int
> charge_flags);
> > +extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
> > +                                 struct kswapd *kswapd_p);
> > +extern void mem_cgroup_clear_kswapd(struct mem_cgroup *mem);
> > +extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup
> *mem);
> >
> >  static inline
> >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index f43d406..17e0511 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -30,6 +30,7 @@ struct kswapd {
> >        struct task_struct *kswapd_task;
> >        wait_queue_head_t kswapd_wait;
> >        pg_data_t *kswapd_pgdat;
> > +       struct mem_cgroup *kswapd_mem;
> >  };
> >
> >  int kswapd(void *p);
> > @@ -303,8 +304,8 @@ static inline void
> scan_unevictable_unregister_node(struct node *node)
> >  }
> >  #endif
> >
> > -extern int kswapd_run(int nid);
> > -extern void kswapd_stop(int nid);
> > +extern int kswapd_run(int nid, struct mem_cgroup *mem);
> > +extern void kswapd_stop(int nid, struct mem_cgroup *mem);
> >
> >  #ifdef CONFIG_MMU
> >  /* linux/mm/shmem.c */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 76ad009..8761a6f 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -278,6 +278,8 @@ struct mem_cgroup {
> >         */
> >        u64 high_wmark_distance;
> >        u64 low_wmark_distance;
> > +
> > +       wait_queue_head_t *kswapd_wait;
>
>
> Like I mentioned in [1/10], personally, I like including kswapd
> instead of kswapd_wait. Just personal opinion. Feel free to ignore.
>

Thank you for your comments.

If that makes more sense, I am OK to make the change. But I would like to
keep this as the current version and do a separate change after the basic
stuff is in. Hope that works.

>
> >  };
> >
> >  /* Stuffs for move charges at task migration. */
> > @@ -4670,6 +4672,33 @@ int mem_cgroup_watermark_ok(struct mem_cgroup
> *mem,
> >        return ret;
> >  }
> >
> > +int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd
> *kswapd_p)
> > +{
> > +       if (!mem || !kswapd_p)
> > +               return 0;
> > +
> > +       mem->kswapd_wait = &kswapd_p->kswapd_wait;
> > +       kswapd_p->kswapd_mem = mem;
> > +
> > +       return css_id(&mem->css);
> > +}
> > +
> > +void mem_cgroup_clear_kswapd(struct mem_cgroup *mem)
> > +{
> > +       if (mem)
> > +               mem->kswapd_wait = NULL;
> > +
> > +       return;
> > +}
> > +
> > +wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
> > +{
> > +       if (!mem)
> > +               return NULL;
> > +
> > +       return mem->kswapd_wait;
> > +}
> > +
> >  static int mem_cgroup_soft_limit_tree_init(void)
> >  {
> >        struct mem_cgroup_tree_per_node *rtpn;
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 321fc74..2f78ff6 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -462,7 +462,7 @@ int online_pages(unsigned long pfn, unsigned long
> nr_pages)
> >        setup_per_zone_wmarks();
> >        calculate_zone_inactive_ratio(zone);
> >        if (onlined_pages) {
> > -               kswapd_run(zone_to_nid(zone));
> > +               kswapd_run(zone_to_nid(zone), NULL);
> >                node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
> >        }
> >
> > @@ -897,7 +897,7 @@ repeat:
> >        calculate_zone_inactive_ratio(zone);
> >        if (!node_present_pages(node)) {
> >                node_clear_state(node, N_HIGH_MEMORY);
> > -               kswapd_stop(node);
> > +               kswapd_stop(node, NULL);
> >        }
> >
> >        vm_total_pages = nr_free_pagecache_pages();
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 61fb96e..06036d2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2241,6 +2241,8 @@ static bool pgdat_balanced(pg_data_t *pgdat,
> unsigned long balanced_pages,
> >        return balanced_pages > (present_pages >> 2);
> >  }
> >
> > +#define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
>
> In memcg, we already have a similar thing "scanning_global_lru".
> How about using "global" term?
>

hmm. ok.


>
> > +
> >  /* is kswapd sleeping prematurely? */
> >  static int sleeping_prematurely(struct kswapd *kswapd, int order,
> >                                long remaining, int classzone_idx)
> > @@ -2249,11 +2251,16 @@ static int sleeping_prematurely(struct kswapd
> *kswapd, int order,
> >        unsigned long balanced = 0;
> >        bool all_zones_ok = true;
> >        pg_data_t *pgdat = kswapd->kswapd_pgdat;
> > +       struct mem_cgroup *mem = kswapd->kswapd_mem;
> >
> >        /* If a direct reclaimer woke kswapd within HZ/10, it's premature
> */
> >        if (remaining)
> >                return true;
> >
> > +       /* Doesn't support for per-memcg reclaim */
> > +       if (mem)
> > +               return false;
> > +
>
> How about is_global_kswapd instead of checking whether the mem field is NULL or
> not?
>

Makes sense, will change.
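
For reference, a minimal sketch of the renamed helper being agreed on here
(illustrative only; the exact form in the next revision may differ):

#define is_global_kswapd(kswapd_p)	(!(kswapd_p)->kswapd_mem)

	/* in sleeping_prematurely(): the heuristic only applies to node kswapd */
	if (!is_global_kswapd(kswapd))
		return false;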


>
> >        /* Check the watermark levels */
> >        for (i = 0; i < pgdat->nr_zones; i++) {
> >                struct zone *zone = pgdat->node_zones + i;
> > @@ -2596,19 +2603,25 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >         * go fully to sleep until explicitly woken up.
> >         */
> >        if (!sleeping_prematurely(kswapd_p, order, remaining,
> classzone_idx)) {
> > -               trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +               if (is_node_kswapd(kswapd_p)) {
> > +                       trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> >
> > -               /*
> > -                * vmstat counters are not perfectly accurate and the
> estimated
> > -                * value for counters such as NR_FREE_PAGES can deviate
> from the
> > -                * true value by nr_online_cpus * threshold. To avoid the
> zone
> > -                * watermarks being breached while under pressure, we
> reduce the
> > -                * per-cpu vmstat threshold while kswapd is awake and
> restore
> > -                * them before going back to sleep.
> > -                */
> > -               set_pgdat_percpu_threshold(pgdat,
> calculate_normal_threshold);
> > -               schedule();
> > -               set_pgdat_percpu_threshold(pgdat,
> calculate_pressure_threshold);
> > +                       /*
> > +                        * vmstat counters are not perfectly accurate and
> the
> > +                        * estimated value for counters such as
> NR_FREE_PAGES
> > +                        * can deviate from the true value by
> nr_online_cpus *
> > +                        * threshold. To avoid the zone watermarks being
> > +                        * breached while under pressure, we reduce the
> per-cpu
> > +                        * vmstat threshold while kswapd is awake and
> restore
> > +                        * them before going back to sleep.
> > +                        */
> > +                       set_pgdat_percpu_threshold(pgdat,
> > +
>  calculate_normal_threshold);
> > +                       schedule();
> > +                       set_pgdat_percpu_threshold(pgdat,
> > +
> calculate_pressure_threshold);
> > +               } else
> > +                       schedule();
> >        } else {
> >                if (remaining)
> >                        count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > @@ -2618,6 +2631,12 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >        finish_wait(wait_h, &wait);
> >  }
> >
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> > +                                                       int order)
> > +{
> > +       return 0;
> > +}
> > +
> >  /*
> >  * The background pageout daemon, started as a kernel thread
> >  * from the init process.
> > @@ -2637,6 +2656,7 @@ int kswapd(void *p)
> >        int classzone_idx;
> >        struct kswapd *kswapd_p = (struct kswapd *)p;
> >        pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +       struct mem_cgroup *mem = kswapd_p->kswapd_mem;
> >        wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> >        struct task_struct *tsk = current;
> >
> > @@ -2647,10 +2667,12 @@ int kswapd(void *p)
> >
> >        lockdep_set_current_reclaim_state(GFP_KERNEL);
> >
> > -       BUG_ON(pgdat->kswapd_wait != wait_h);
> > -       cpumask = cpumask_of_node(pgdat->node_id);
> > -       if (!cpumask_empty(cpumask))
> > -               set_cpus_allowed_ptr(tsk, cpumask);
> > +       if (is_node_kswapd(kswapd_p)) {
> > +               BUG_ON(pgdat->kswapd_wait != wait_h);
> > +               cpumask = cpumask_of_node(pgdat->node_id);
> > +               if (!cpumask_empty(cpumask))
> > +                       set_cpus_allowed_ptr(tsk, cpumask);
> > +       }
> >        current->reclaim_state = &reclaim_state;
> >
> >        /*
> > @@ -2665,7 +2687,10 @@ int kswapd(void *p)
> >         * us from recursively trying to free more memory as we're
> >         * trying to free the first piece of memory in the first place).
> >         */
> > -       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > +       if (is_node_kswapd(kswapd_p))
> > +               tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > +       else
> > +               tsk->flags |= PF_SWAPWRITE | PF_KSWAPD;
> >        set_freezable();
> >
> >        order = 0;
> > @@ -2675,24 +2700,29 @@ int kswapd(void *p)
> >                int new_classzone_idx;
> >                int ret;
> >
> > -               new_order = pgdat->kswapd_max_order;
> > -               new_classzone_idx = pgdat->classzone_idx;
> > -               pgdat->kswapd_max_order = 0;
> > -               pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > -               if (order < new_order || classzone_idx >
> new_classzone_idx) {
> > -                       /*
> > -                        * Don't sleep if someone wants a larger 'order'
> > -                        * allocation or has tigher zone constraints
> > -                        */
> > -                       order = new_order;
> > -                       classzone_idx = new_classzone_idx;
> > -               } else {
> > -                       kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> > -                       order = pgdat->kswapd_max_order;
> > -                       classzone_idx = pgdat->classzone_idx;
> > +               if (is_node_kswapd(kswapd_p)) {
> > +                       new_order = pgdat->kswapd_max_order;
> > +                       new_classzone_idx = pgdat->classzone_idx;
> >                        pgdat->kswapd_max_order = 0;
> >                        pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > -               }
> > +                       if (order < new_order ||
> > +                                       classzone_idx >
> new_classzone_idx) {
> > +                               /*
> > +                                * Don't sleep if someone wants a larger
> 'order'
> > +                                * allocation or has tigher zone
> constraints
> > +                                */
> > +                               order = new_order;
> > +                               classzone_idx = new_classzone_idx;
> > +                       } else {
> > +                               kswapd_try_to_sleep(kswapd_p, order,
> > +                                                   classzone_idx);
> > +                               order = pgdat->kswapd_max_order;
> > +                               classzone_idx = pgdat->classzone_idx;
> > +                               pgdat->kswapd_max_order = 0;
> > +                               pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > +                       }
> > +               } else
> > +                       kswapd_try_to_sleep(kswapd_p, order,
> classzone_idx);
> >
> >                ret = try_to_freeze();
> >                if (kthread_should_stop())
> > @@ -2703,8 +2733,13 @@ int kswapd(void *p)
> >                 * after returning from the refrigerator
> >                 */
> >                if (!ret) {
> > -                       trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> order);
> > -                       order = balance_pgdat(pgdat, order,
> &classzone_idx);
> > +                       if (is_node_kswapd(kswapd_p)) {
> > +
> trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> > +                                                               order);
> > +                               order = balance_pgdat(pgdat, order,
> > +                                                       &classzone_idx);
> > +                       } else
> > +                               balance_mem_cgroup_pgdat(mem, order);
> >                }
> >        }
> >        return 0;
> > @@ -2849,30 +2884,53 @@ static int __devinit cpu_callback(struct
> notifier_block *nfb,
> >  * This kswapd start function will be called by init and node-hot-add.
> >  * On node-hot-add, kswapd will moved to proper cpus if cpus are
> hot-added.
> >  */
> > -int kswapd_run(int nid)
> > +int kswapd_run(int nid, struct mem_cgroup *mem)
> >  {
> > -       pg_data_t *pgdat = NODE_DATA(nid);
> >        struct task_struct *kswapd_thr;
> > +       pg_data_t *pgdat = NULL;
> >        struct kswapd *kswapd_p;
> > +       static char name[TASK_COMM_LEN];
> > +       int memcg_id;
> >        int ret = 0;
> >
> > -       if (pgdat->kswapd_wait)
> > -               return 0;
> > +       if (!mem) {
> > +               pgdat = NODE_DATA(nid);
> > +               if (pgdat->kswapd_wait)
> > +                       return ret;
> > +       }
> >
> >        kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
> >        if (!kswapd_p)
> >                return -ENOMEM;
> >
> >        init_waitqueue_head(&kswapd_p->kswapd_wait);
> > -       pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> > -       kswapd_p->kswapd_pgdat = pgdat;
> >
> > -       kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> > +       if (!mem) {
> > +               pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> > +               kswapd_p->kswapd_pgdat = pgdat;
> > +               snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
> > +       } else {
> > +               memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
> > +               if (!memcg_id) {
> > +                       kfree(kswapd_p);
> > +                       return ret;
> > +               }
> > +               snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
> > +       }
> > +
> > +       kswapd_thr = kthread_run(kswapd, kswapd_p, name);
> >        if (IS_ERR(kswapd_thr)) {
> >                /* failure at boot is fatal */
> >                BUG_ON(system_state == SYSTEM_BOOTING);
> > -               printk("Failed to start kswapd on node %d\n",nid);
> > -               pgdat->kswapd_wait = NULL;
> > +               if (!mem) {
> > +                       printk(KERN_ERR "Failed to start kswapd on node
> %d\n",
> > +                                                               nid);
> > +                       pgdat->kswapd_wait = NULL;
> > +               } else {
> > +                       printk(KERN_ERR "Failed to start kswapd on memcg
> %d\n",
> > +
> memcg_id);
> > +                       mem_cgroup_clear_kswapd(mem);
> > +               }
> >                kfree(kswapd_p);
> >                ret = -1;
> >        } else
> > @@ -2883,15 +2941,17 @@ int kswapd_run(int nid)
> >  /*
> >  * Called by memory hotplug when all memory in a node is offlined.
> >  */
> > -void kswapd_stop(int nid)
> > +void kswapd_stop(int nid, struct mem_cgroup *mem)
> >  {
> >        struct task_struct *kswapd_thr = NULL;
> >        struct kswapd *kswapd_p = NULL;
> >        wait_queue_head_t *wait;
> >
> > -       pg_data_t *pgdat = NODE_DATA(nid);
> > +       if (!mem)
> > +               wait = NODE_DATA(nid)->kswapd_wait;
> > +       else
> > +               wait = mem_cgroup_kswapd_wait(mem);
> >
> > -       wait = pgdat->kswapd_wait;
> >        if (wait) {
> >                kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> >                kswapd_thr = kswapd_p->kswapd_task;
> > @@ -2910,7 +2970,7 @@ static int __init kswapd_init(void)
> >
> >        swap_setup();
> >        for_each_node_state(nid, N_HIGH_MEMORY)
> > -               kswapd_run(nid);
> > +               kswapd_run(nid, NULL);
> >        hotcpu_notifier(cpu_callback, 0);
> >        return 0;
> >  }
> > --
> > 1.7.3.1
> >
> >
>
> Let me ask a question.
>
> What's the effect of kswapd_try_to_sleep in memcg?
>
> As I look at the code, sleeping_prematurely always returns false in the
> case of memcg.

that is right.


> So kswapd_try_to_sleep can sleep for a short time and then sleep
> until the next wakeup. That has nothing to do with
> kswapd_try_to_sleep's goal. So I hope you remove the memcg hack from
> kswapd_try_to_sleep and just call schedule() in the kswapd function to
> put the memcg kswapd to sleep.
>

> But I haven't looked at the further patches in the series, so I may be
> missing something.

No, you are not missing anything. The memcg kswapd doesn't have the
sleeping_prematurely logic in this patch. I was thinking of adding a similar
check on zone->all_unreclaimable, but checking all the zones per-memcg
sounds like too much overhead to me, so in this patch I simply call
schedule() for memcg and only changed the kswapd_try_to_sleep and
sleeping_prematurely APIs.

I might be able to revert those two API changes from this patch and add
them back later as part of the real memcg sleeping_prematurely patch.
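
For illustration, a rough sketch of the simpler memcg sleep path if those two
API changes are reverted (assumed from this discussion, not taken from a
later revision):

	/* in kswapd(): only the node kswapd uses the premature-sleep heuristic */
	if (is_node_kswapd(kswapd_p)) {
		kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
	} else {
		DEFINE_WAIT(wait);

		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
		if (!kthread_should_stop())
			schedule();
		finish_wait(wait_h, &wait);
	}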

--Ying


> --
> Kind regards,
> Minchan Kim
>

[-- Attachment #2: Type: text/html, Size: 24166 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 06/10] Per-memcg background reclaim.
  2011-04-18  3:51   ` Minchan Kim
@ 2011-04-18 21:38     ` Ying Han
  2011-04-18 23:32       ` Minchan Kim
  0 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-18 21:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 10432 bytes --]

On Sun, Apr 17, 2011 at 8:51 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> > This is the main loop of per-memcg background reclaim which is
> implemented in
> > function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During
> each
> > iteration it invokes balance_pgdat_node() for all nodes on the system,
> which
> > is another new function performs background reclaim per node. After
> reclaiming
> > each node, it checks mem_cgroup_watermark_ok() and breaks the priority
> loop if
> > it returns true.
> >
> > changelog v5..v4:
> > 1. remove duplicate check on nodes_empty()
> > 2. add logic to check if the per-memcg lru is empty on the zone.
> > 3. make per-memcg kswapd to reclaim SWAP_CLUSTER_MAX per zone. It make
> senses
> > since it helps to balance the pressure across zones within the memcg.
> >
> > changelog v4..v3:
> > 1. split the select_victim_node and zone_unreclaimable to a seperate
> patches
> > 2. remove the logic tries to do zone balancing.
> >
> > changelog v3..v2:
> > 1. change mz->all_unreclaimable to be boolean.
> > 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg
> reclaim.
> > 3. some more clean-up.
> >
> > changelog v2..v1:
> > 1. move the per-memcg per-zone clear_unreclaimable into uncharge stage.
> > 2. shared the kswapd_run/kswapd_stop for per-memcg and global background
> > reclaim.
> > 3. name the per-memcg memcg as "memcg-id" (css->id). And the global
> kswapd
> > keeps the same name.
> > 4. fix a race on kswapd_stop while the per-memcg-per-zone info could be
> accessed
> > after freeing.
> > 5. add the fairness in zonelist where memcg remember the last zone
> reclaimed
> > from.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  mm/vmscan.c |  157
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 157 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 06036d2..39e6300 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -47,6 +47,8 @@
> >
> >  #include <linux/swapops.h>
> >
> > +#include <linux/res_counter.h>
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -111,6 +113,8 @@ struct scan_control {
> >         * are scanned.
> >         */
> >        nodemask_t      *nodemask;
> > +
> > +       int priority;
> >  };
> >
> >  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> > @@ -2631,11 +2635,164 @@ static void kswapd_try_to_sleep(struct kswapd
> *kswapd_p, int order,
> >        finish_wait(wait_h, &wait);
> >  }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * The function is used for per-memcg LRU. It scanns all the zones of
> the
> > + * node and returns the nr_scanned and nr_reclaimed.
> > + */
> > +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> > +                                       struct scan_control *sc)
> > +{
> > +       int i;
> > +       unsigned long total_scanned = 0;
> > +       struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > +       int priority = sc->priority;
> > +       enum lru_list l;
> > +
> > +       /*
> > +        * This dma->highmem order is consistant with global reclaim.
> > +        * We do this because the page allocator works in the opposite
> > +        * direction although memcg user pages are mostly allocated at
> > +        * highmem.
> > +        */
> > +       for (i = 0; i < pgdat->nr_zones; i++) {
> > +               struct zone *zone = pgdat->node_zones + i;
> > +               unsigned long scan = 0;
> > +
> > +               for_each_evictable_lru(l)
> > +                       scan += mem_cgroup_zone_nr_pages(mem_cont, zone,
> l);
> > +
> > +               if (!populated_zone(zone) || !scan)
> > +                       continue;
>
> Do we really need this double check?
> Isn't only _scan_ check enough?
>

Yes, will change in the next post.


> And shouldn't we consider non-swap case?
>

Good point. We don't need to count the anon LRU in the non-swap case. A new
function will be added to count the reclaimable pages of the memcg per zone
(memcg_zone_reclaimable).
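
Roughly along these lines (a sketch only; the name and the exact accounting
are assumptions, not the actual follow-up patch):

/* Pages this memcg can actually reclaim on @zone: file LRUs always,
 * anon LRUs only when swap space is available. */
static unsigned long memcg_zone_reclaimable(struct mem_cgroup *mem,
					    struct zone *zone)
{
	unsigned long nr;

	nr = mem_cgroup_zone_nr_pages(mem, zone, LRU_INACTIVE_FILE) +
	     mem_cgroup_zone_nr_pages(mem, zone, LRU_ACTIVE_FILE);

	if (nr_swap_pages > 0)
		nr += mem_cgroup_zone_nr_pages(mem, zone, LRU_INACTIVE_ANON) +
		      mem_cgroup_zone_nr_pages(mem, zone, LRU_ACTIVE_ANON);

	return nr;
}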


>
> > +
> > +               sc->nr_scanned = 0;
> > +               shrink_zone(priority, zone, sc);
> > +               total_scanned += sc->nr_scanned;
> > +
> > +               /*
> > +                * If we've done a decent amount of scanning and
> > +                * the reclaim ratio is low, start doing writepage
> > +                * even in laptop mode
> > +                */
> > +               if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > +                   total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> 2) {
> > +                       sc->may_writepage = 1;
>
> I don't want to add more random write any more although we don't have
> a trouble of real memory shortage.
>


> Do you have any reason to reclaim memory urgently as writing dirty pages?
> Maybe if we wait a little bit of time, flusher would write out the page.
>

We would like to reduce writing dirty pages from page reclaim, especially
from direct reclaim. AFAIK, try_to_free_mem_cgroup_pages() still needs to
write dirty pages when there is a need; removing this from the per-memcg
kswapd will only add more pressure to the per-memcg direct reclaim, which
seems worse (stack overflow is one example we would like to get rid of).


>
> > +               }
> > +       }
> > +
> > +       sc->nr_scanned = total_scanned;
> > +       return;
>
> unnecessary return.
>

Removed.


> > +}
> > +
> > +/*
> > + * Per cgroup background reclaim.
> > + * TODO: Take off the order since memcg always do order 0
> > + */
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> > +                                             int order)
> > +{
> > +       int i, nid;
> > +       int start_node;
> > +       int priority;
> > +       bool wmark_ok;
> > +       int loop;
> > +       pg_data_t *pgdat;
> > +       nodemask_t do_nodes;
> > +       unsigned long total_scanned;
> > +       struct scan_control sc = {
> > +               .gfp_mask = GFP_KERNEL,
> > +               .may_unmap = 1,
> > +               .may_swap = 1,
> > +               .nr_to_reclaim = SWAP_CLUSTER_MAX,
> > +               .swappiness = vm_swappiness,
> > +               .order = order,
> > +               .mem_cgroup = mem_cont,
> > +       };
> > +
> > +loop_again:
> > +       do_nodes = NODE_MASK_NONE;
> > +       sc.may_writepage = !laptop_mode;
>
> I think it depends on urgency (i.e., priority, the reclaim
> ratio (#reclaim/#scanning), or something like that), not laptop_mode, in the
> case of memcg.
> As I said earlier, it would be better to avoid random writes.
>

I agree that we would like to avoid it, but I am not sure we should remove it
here, since it adds more pressure to the direct reclaim case.

>
> > +       sc.nr_reclaimed = 0;
> > +       total_scanned = 0;
> > +
> > +       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > +               sc.priority = priority;
> > +               wmark_ok = false;
> > +               loop = 0;
> > +
> > +               /* The swap token gets in the way of swapout... */
> > +               if (!priority)
> > +                       disable_swap_token();
> > +
> > +               if (priority == DEF_PRIORITY)
> > +                       do_nodes = node_states[N_ONLINE];
> > +
> > +               while (1) {
> > +                       nid = mem_cgroup_select_victim_node(mem_cont,
> > +                                                       &do_nodes);
> > +
> > +                       /* Indicate we have cycled the nodelist once
>
> Fix comment style.
>

Fixed.

>
> > +                        * TODO: we might add MAX_RECLAIM_LOOP for
> preventing
> > +                        * kswapd burning cpu cycles.
> > +                        */
> > +                       if (loop == 0) {
> > +                               start_node = nid;
> > +                               loop++;
> > +                       } else if (nid == start_node)
> > +                               break;
> > +
> > +                       pgdat = NODE_DATA(nid);
> > +                       balance_pgdat_node(pgdat, order, &sc);
> > +                       total_scanned += sc.nr_scanned;
> > +
> > +                       /* Set the node which has at least
>
> Fix comment style.
>

Fixed.


> > +                        * one reclaimable zone
> > +                        */
> > +                       for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +                               struct zone *zone = pgdat->node_zones +
> i;
> > +
> > +                               if (!populated_zone(zone))
> > +                                       continue;
> > +                       }
>
> I can't understand your comment and logic.
> The comment mentions a reclaimable zone but the logic only checks
> populated_zone(). What does it mean?
>

I will move the comment to the later patch which adds the per-memcg zone
"unreclaimable" logic.

--Ying

>
> > +                       if (i < 0)
> > +                               node_clear(nid, do_nodes);
> > +
> > +                       if (mem_cgroup_watermark_ok(mem_cont,
> > +
> CHARGE_WMARK_HIGH)) {
> > +                               wmark_ok = true;
> > +                               goto out;
> > +                       }
> > +
> > +                       if (nodes_empty(do_nodes)) {
> > +                               wmark_ok = true;
> > +                               goto out;
> > +                       }
> > +               }
> > +
> > +               if (total_scanned && priority < DEF_PRIORITY - 2)
> > +                       congestion_wait(WRITE, HZ/10);
> > +
> > +               if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > +                       break;
> > +       }
> > +out:
> > +       if (!wmark_ok) {
> > +               cond_resched();
> > +
> > +               try_to_freeze();
> > +
> > +               goto loop_again;
> > +       }
> > +
> > +       return sc.nr_reclaimed;
> > +}
> > +#else
> >  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> *mem_cont,
> >                                                        int order)
> >  {
> >        return 0;
> >  }
> > +#endif
> >
> >  /*
> >  * The background pageout daemon, started as a kernel thread
> > --
> > 1.7.3.1
> >
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
>

[-- Attachment #2: Type: text/html, Size: 14099 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 06/10] Per-memcg background reclaim.
  2011-04-18 21:38     ` Ying Han
@ 2011-04-18 23:32       ` Minchan Kim
  2011-04-19  2:42         ` Ying Han
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-04-18 23:32 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Tue, Apr 19, 2011 at 6:38 AM, Ying Han <yinghan@google.com> wrote:
>
>
> On Sun, Apr 17, 2011 at 8:51 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>
>> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
>> > +
>> > +               sc->nr_scanned = 0;
>> > +               shrink_zone(priority, zone, sc);
>> > +               total_scanned += sc->nr_scanned;
>> > +
>> > +               /*
>> > +                * If we've done a decent amount of scanning and
>> > +                * the reclaim ratio is low, start doing writepage
>> > +                * even in laptop mode
>> > +                */
>> > +               if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>> > +                   total_scanned > sc->nr_reclaimed + sc->nr_reclaimed
>> > / 2) {
>> > +                       sc->may_writepage = 1;
>>
>> I don't want to add more random write any more although we don't have
>> a trouble of real memory shortage.
>
>
>>
>> Do you have any reason to reclaim memory urgently as writing dirty pages?
>> Maybe if we wait a little bit of time, flusher would write out the page.
>
> We would like to reduce the writing dirty pages from page reclaim,
> especially from direct reclaim. AFAIK, the try_to_free_mem_cgroup_pages()
> still need to write dirty pages when there is a need. removing this from the
> per-memcg kswap will only add more pressure to the per-memcg direct reclaim,
> which seems to be worse. (stack overflow as one example which we would like
> to get rid of)
>

Stack overflow would be another topic.

Normal situation:

The soft-limit memory pressure of a memcg isn't real memory shortage, and
if there is a gap between the hard limit and the soft limit, periodic
writeback by the flusher would write the pages out before we reach the
hard limit. In the end, direct reclaim doesn't need to write them out.

Exceptional situation:

Of course, it doesn't work well under bdi congestion, sudden big memory
consumption in the memcg, or a wrong [hard/soft]limit configuration
(small gap) by the administrator.

I think we have to design for the normal situation.
The point is that the soft limit isn't real memory shortage, so we are
not in a hurry.

How about adding a new function which checks global memory pressure, and
if we are in trouble because of global memory pressure, we can set
may_writepage to 1 dynamically in memcg_kswapd?
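
Something like the following sketch, perhaps (the pressure test below is only
a placeholder assumption to illustrate the idea, not a worked-out heuristic):

/* Let the memcg kswapd issue writeback only when the host itself is low on
 * free memory; otherwise leave dirty pages to the flusher threads. */
static int memcg_kswapd_may_writepage(void)
{
	/* placeholder threshold: "low" means under ~1.5% of RAM free */
	return global_page_state(NR_FREE_PAGES) < totalram_pages / 64;
}

	/* in balance_mem_cgroup_pgdat(): */
	sc.may_writepage = memcg_kswapd_may_writepage();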

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 06/10] Per-memcg background reclaim.
  2011-04-18 23:32       ` Minchan Kim
@ 2011-04-19  2:42         ` Ying Han
  2011-04-19  5:50           ` Minchan Kim
  0 siblings, 1 reply; 27+ messages in thread
From: Ying Han @ 2011-04-19  2:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3152 bytes --]

On Mon, Apr 18, 2011 at 4:32 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

> On Tue, Apr 19, 2011 at 6:38 AM, Ying Han <yinghan@google.com> wrote:
> >
> >
> > On Sun, Apr 17, 2011 at 8:51 PM, Minchan Kim <minchan.kim@gmail.com>
> wrote:
> >>
> >> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
> >> > +
> >> > +               sc->nr_scanned = 0;
> >> > +               shrink_zone(priority, zone, sc);
> >> > +               total_scanned += sc->nr_scanned;
> >> > +
> >> > +               /*
> >> > +                * If we've done a decent amount of scanning and
> >> > +                * the reclaim ratio is low, start doing writepage
> >> > +                * even in laptop mode
> >> > +                */
> >> > +               if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> >> > +                   total_scanned > sc->nr_reclaimed +
> sc->nr_reclaimed
> >> > / 2) {
> >> > +                       sc->may_writepage = 1;
> >>
> >> I don't want to add more random write any more although we don't have
> >> a trouble of real memory shortage.
> >
> >
> >>
> >> Do you have any reason to reclaim memory urgently as writing dirty
> pages?
> >> Maybe if we wait a little bit of time, flusher would write out the page.
> >
> > We would like to reduce the writing dirty pages from page reclaim,
> > especially from direct reclaim. AFAIK, the try_to_free_mem_cgroup_pages()
> > still need to write dirty pages when there is a need. removing this from
> the
> > per-memcg kswap will only add more pressure to the per-memcg direct
> reclaim,
> > which seems to be worse. (stack overflow as one example which we would
> like
> > to get rid of)
> >
>
> Stack overflow would be another topic.
>
> Normal situation :
>
> The softlimit memory pressure of memcg isn't real memory shortage and
> if we have gap between hardlimit and softlimit, periodic writeback of
> flusher would write it out before reaching the hardlimit. In the end,
> direct reclaim don't need to write it out.
>
> Exceptional situation :
>
> Of course, it doesn't work well in congestion of bdi, sudden big
> memory consumption in memcg in wrong [hard/soft]limit(small gap)
> configuration of administrator.
>
> I think we have to design it by normal situation.
> The point is that softlimit isn't real memory shortage so that we are
> not urgent.
>

This patch is not dealing with the soft_limit, but with the hard_limit. The
soft_limit reclaim which we talked about during LSF is something I am
currently looking at right now. This patch does the per-memcg background
reclaim based on the watermarks calculated from the hard_limit. We don't
want the memcg to enter direct reclaim every time it approaches the
hard_limit, so we add the background reclaim, which reclaims pages
proactively.
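
To make that concrete, a hedged sketch of how the two watermarks relate to
the hard limit (an assumption based on the high/low_wmark_distance fields
quoted earlier, not the exact patch code):

	u64 limit      = res_counter_read_u64(&mem->res, RES_LIMIT);
	u64 low_wmark  = limit - mem->low_wmark_distance;  /* usage above: wake kswapd  */
	u64 high_wmark = limit - mem->high_wmark_distance; /* usage below: kswapd sleeps */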


> How about adding new function which checks global memory pressure and
> if we have a trouble by global memory pressure, we can change
> may_write with 1 dynamically in memcg_kswapd?
>

Like I mentioned, may_writepage is still needed in this case; otherwise we
just push this work further onto the per-memcg direct reclaim.


Thanks

--Ying

> --
> Kind regards,
> Minchan Kim
>

[-- Attachment #2: Type: text/html, Size: 4527 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 01/10] Add kswapd descriptor
  2011-04-18 18:09     ` Ying Han
@ 2011-04-19  5:35       ` Minchan Kim
  0 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-04-19  5:35 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Tue, Apr 19, 2011 at 3:09 AM, Ying Han <yinghan@google.com> wrote:
>
>
> On Sun, Apr 17, 2011 at 5:57 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>
>> Hi Ying,
>>
>> I have some comments and nitpick about coding style.
>
> Hi Minchan, thank you for your comments and reviews.
>>
>> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
>> > There is a kswapd kernel thread for each numa node. We will add a
>> > different
>> > kswapd for each memcg. The kswapd is sleeping in the wait queue headed
>> > at
>>
>> Why?
>>
>> Many kernel developers easily raise an eyebrow at adding kernel threads.
>> So you should justify why we need a new kernel thread and why we can't
>> handle it with a workqueue.
>>
>> Maybe you explained it already and I missed it. If so, sorry.
>> But at least, a patch description that includes the _why_ is much more
>> acceptable to maintainers and helpful to reviewers of the code.
>
> Here are the replies I posted on an earlier version regarding the workqueue.
> "
> I did some study on workqueues after posting V2. There was a
> comment suggesting a workqueue instead of per-memcg kswapd threads, since it
> would cut the number of kernel threads being created on a host with lots of
> cgroups. Each kernel thread allocates about 8K of stack, so 8M in total with
> a thousand cgroups.
> The current workqueue model merged in the 2.6.36 kernel is called "concurrency
> managed workqueue" (cmwq), which is intended to provide flexible concurrency
> without wasting resources. I studied it a bit and here it is:
>
> 1. The workqueue is complicated and we need to be very careful with work items
> in the workqueue. We've experienced cases where one work item gets stuck and
> the rest of the work items won't proceed. For example, in dirty page
> writeback, one heavy-writer cgroup could starve the other cgroups from
> flushing dirty pages to the same disk. In the kswapd case, I can imagine we
> might have a similar scenario.
>
> 2. How to prioritize the work items is another problem. The order of adding
> the work items to the queue reflects the order in which cgroups are
> reclaimed. We don't have that restriction currently; we rely on the cpu
> scheduler to put kswapd on the right cpu core to run. We "might" introduce a
> reclaim priority later, and how are we going to deal with that?
>
> 3. Based on what I observed, not many callers have migrated to cmwq and I
> don't have much data on how good it is.
> Back to the current model: on a machine with thousands of cgroups, a thousand
> kswapd threads will take 8M in total (8K of stack per thread). We are running
> systems with fakenuma where each numa node has a kswapd. So far we haven't
> noticed issues caused by "lots of" kswapd threads.
> Also, there shouldn't be any performance overhead for a kernel thread if it
> is not running.
>
> Based on the complexity of the workqueue and the benefit it provides, I
> would like to stick with the current model first. After we get the basic
> stuff in and the other targeted reclaim improvements, we can come back to
> this. What do you think?

Thanks for the good summary. I should study cmwq too, but I don't mean
we have to use only workqueues.

The problem with flooding the system with memcg kswapds is the lock and
scheduling overhead and the pid consumption as well as the memory
consumption, and it is not good at keeping each kswapd busy.
I don't think we have to keep as many kswapd threads as there are memcgs.

We could just keep a min/max pool of memcg-kswapd threads, like the old
pdflush. When memcg wmark pressure is high and the memcg-kswapd threads are
busy, a new memcg-kswapd pops up to handle it.
When memcg wmark pressure is low and a memcg-kswapd is idle, the old
memcg-kswapd exits.

> "
> KAMEZAWA's reply:
> "
> Okay, fair enough. kthread_run() will win.
>
> Then, I have another request. I'd like to put the kswapd-for-memcg into
> some cpu cgroup to limit its cpu usage.
>
> - Could you show the thread ID somewhere ? and
>  confirm we can put it into some cpu cgroup ?
>  (creating an auto cpu cgroup for memcg kswapd is a choice, I think.)
> "
>>
>> > kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores
>> > information of node or memcg and it allows the global and per-memcg
>> > background
>> > reclaim to share common reclaim algorithms.
>> >
>> > This patch adds the kswapd descriptor and moves the per-node kswapd to
>> > use the
>> > new structure.
>> >
>> > changelog v5..v4:
>> > 1. add comment on kswapds_spinlock
>> > 2. remove the kswapds_spinlock. we don't need it here since the kswapd
>> > and pgdat
>> > have 1:1 mapping.
>> >
>> > changelog v3..v2:
>> > 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later
>> > patch.
>> > 2. rename thr in kswapd_run to something else.
>> >
>> > changelog v2..v1:
>> > 1. dynamic allocate kswapd descriptor and initialize the wait_queue_head
>> > of pgdat
>> > at kswapd_run.
>> > 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup
>> > kswapd
>> > descriptor.
>> >
>> > Signed-off-by: Ying Han <yinghan@google.com>
>> > ---
>> >  include/linux/mmzone.h |    3 +-
>> >  include/linux/swap.h   |    7 ++++
>> >  mm/page_alloc.c        |    1 -
>> >  mm/vmscan.c            |   89
>> > +++++++++++++++++++++++++++++++++++------------
>> >  4 files changed, 74 insertions(+), 26 deletions(-)
>> >
>> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> > index 628f07b..6cba7d2 100644
>> > --- a/include/linux/mmzone.h
>> > +++ b/include/linux/mmzone.h
>> > @@ -640,8 +640,7 @@ typedef struct pglist_data {
>> >        unsigned long node_spanned_pages; /* total size of physical page
>> >                                             range, including holes */
>> >        int node_id;
>> > -       wait_queue_head_t kswapd_wait;
>> > -       struct task_struct *kswapd;
>> > +       wait_queue_head_t *kswapd_wait;
>>
>> Personally, I prefer kswapd not kswapd_wait.
>> It's more readable and straightforward.
>
> Hmm, I would like to keep it as it is for this version, and improve it after
> the basic stuff is in. Hope that works for you?

No problem.

>>
>> >        int kswapd_max_order;
>> >        enum zone_type classzone_idx;
>> >  } pg_data_t;
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index ed6ebe6..f43d406 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
>> >        return current->flags & PF_KSWAPD;
>> >  }
>> >
>> > +struct kswapd {
>> > +       struct task_struct *kswapd_task;
>> > +       wait_queue_head_t kswapd_wait;
>> > +       pg_data_t *kswapd_pgdat;
>> > +};
>> > +
>> > +int kswapd(void *p);
>> >  /*
>> >  * MAX_SWAPFILES defines the maximum number of swaptypes: things which
>> > can
>> >  * be swapped to.  The swap type and the offset into that swap type are
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 6e1b52a..6340865 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -4205,7 +4205,6 @@ static void __paginginit
>> > free_area_init_core(struct pglist_data *pgdat,
>> >
>> >        pgdat_resize_init(pgdat);
>> >        pgdat->nr_zones = 0;
>> > -       init_waitqueue_head(&pgdat->kswapd_wait);
>> >        pgdat->kswapd_max_order = 0;
>> >        pgdat_page_cgroup_init(pgdat);
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 060e4c1..61fb96e 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2242,12 +2242,13 @@ static bool pgdat_balanced(pg_data_t *pgdat,
>> > unsigned long balanced_pages,
>> >  }
>> >
>> >  /* is kswapd sleeping prematurely? */
>> > -static bool sleeping_prematurely(pg_data_t *pgdat, int order, long
>> > remaining,
>> > -                                       int classzone_idx)
>> > +static int sleeping_prematurely(struct kswapd *kswapd, int order,
>> > +                               long remaining, int classzone_idx)
>> >  {
>> >        int i;
>> >        unsigned long balanced = 0;
>> >        bool all_zones_ok = true;
>> > +       pg_data_t *pgdat = kswapd->kswapd_pgdat;
>> >
>> >        /* If a direct reclaimer woke kswapd within HZ/10, it's premature
>> > */
>> >        if (remaining)
>> > @@ -2570,28 +2571,31 @@ out:
>> >        return order;
>> >  }
>> >
>> > -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int
>> > classzone_idx)
>> > +static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>> > +                               int classzone_idx)
>> >  {
>> >        long remaining = 0;
>> >        DEFINE_WAIT(wait);
>> > +       pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
>> > +       wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>>
>> kswapd_p? p means pointer?
>
> yes,
>>
>> wait_h? h means header?
>
>  yes,
>>
>> Hmm.. Of course, it's trivial and we can understand easily in such
>> context but we don't have been used such words so it's rather awkward
>> to me.
>>
>> How about kswapd instead of kswapd_p, kswapd_wait instead of wait_h?
>
> That sounds OK to me for the change. However, I would like to make the
> change as a separate patch after the basic stuff is in. Is that ok?

Sure. It's not critical to merge the series. :)

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V5 06/10] Per-memcg background reclaim.
  2011-04-19  2:42         ` Ying Han
@ 2011-04-19  5:50           ` Minchan Kim
  0 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-04-19  5:50 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Tue, Apr 19, 2011 at 11:42 AM, Ying Han <yinghan@google.com> wrote:
>
>
> On Mon, Apr 18, 2011 at 4:32 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>
>> On Tue, Apr 19, 2011 at 6:38 AM, Ying Han <yinghan@google.com> wrote:
>> >
>> >
>> > On Sun, Apr 17, 2011 at 8:51 PM, Minchan Kim <minchan.kim@gmail.com>
>> > wrote:
>> >>
>> >> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@google.com> wrote:
>> >> > +
>> >> > +               sc->nr_scanned = 0;
>> >> > +               shrink_zone(priority, zone, sc);
>> >> > +               total_scanned += sc->nr_scanned;
>> >> > +
>> >> > +               /*
>> >> > +                * If we've done a decent amount of scanning and
>> >> > +                * the reclaim ratio is low, start doing writepage
>> >> > +                * even in laptop mode
>> >> > +                */
>> >> > +               if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>> >> > +                   total_scanned > sc->nr_reclaimed +
>> >> > sc->nr_reclaimed
>> >> > / 2) {
>> >> > +                       sc->may_writepage = 1;
>> >>
>> >> I don't want to add more random write any more although we don't have
>> >> a trouble of real memory shortage.
>> >
>> >
>> >>
>> >> Do you have any reason to reclaim memory urgently as writing dirty
>> >> pages?
>> >> Maybe if we wait a little bit of time, flusher would write out the
>> >> page.
>> >
>> > We would like to reduce the writing dirty pages from page reclaim,
>> > especially from direct reclaim. AFAIK,
>> > the try_to_free_mem_cgroup_pages()
>> > still need to write dirty pages when there is a need. removing this from
>> > the
>> > per-memcg kswap will only add more pressure to the per-memcg direct
>> > reclaim,
>> > which seems to be worse. (stack overflow as one example which we would
>> > like
>> > to get rid of)
>> >
>>
>> Stack overflow would be another topic.
>>
>> Normal situation :
>>
>> The softlimit memory pressure of memcg isn't real memory shortage and
>> if we have gap between hardlimit and softlimit, periodic writeback of
>> flusher would write it out before reaching the hardlimit. In the end,
>> direct reclaim don't need to write it out.
>>
>> Exceptional situation :
>>
>> Of course, it doesn't work well in congestion of bdi, sudden big
>> memory consumption in memcg in wrong [hard/soft]limit(small gap)
>> configuration of administrator.
>>
>> I think we have to design it by normal situation.
>> The point is that softlimit isn't real memory shortage so that we are
>> not urgent.
>
> This patch is not dealing with soft_limit, but hard_limit. The soft_limit
> reclaim which we talked about during LSF
> is something i am currently looking at right now. This patch is doing the
> per-memcg background reclaim which
> based on the watermarks calculated on the hard_limit. We don't have the
> memcg entering the direct reclaim each
> time it is reaching the hard_limit, so we add the background reclaim which
> reclaiming pages proactively.
>>
>> How about adding new function which checks global memory pressure and
>> if we have a trouble by global memory pressure, we can change
>> may_write with 1 dynamically in memcg_kswapd?
>
>
> Like I mentioned, the may_write is still needed in this case otherwise we
> are just put this further to per-memcg
> direct reclaim.

You're totally right; I misunderstood that point.
Thanks.

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2011-04-19  5:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-15 23:23 [PATCH V5 00/10] memcg: per cgroup background reclaim Ying Han
2011-04-15 23:23 ` [PATCH V5 01/10] Add kswapd descriptor Ying Han
2011-04-18  0:57   ` Minchan Kim
2011-04-18 18:09     ` Ying Han
2011-04-19  5:35       ` Minchan Kim
2011-04-15 23:23 ` [PATCH V5 02/10] Add per memcg reclaim watermarks Ying Han
2011-04-15 23:23 ` [PATCH V5 03/10] New APIs to adjust per-memcg wmarks Ying Han
2011-04-15 23:23 ` [PATCH V5 04/10] Infrastructure to support per-memcg reclaim Ying Han
2011-04-18  2:11   ` Minchan Kim
2011-04-18 18:44     ` Ying Han
2011-04-15 23:23 ` [PATCH V5 05/10] Implement the select_victim_node within memcg Ying Han
2011-04-18  2:22   ` Minchan Kim
2011-04-18 17:11     ` Ying Han
2011-04-15 23:23 ` [PATCH V5 06/10] Per-memcg background reclaim Ying Han
2011-04-18  3:51   ` Minchan Kim
2011-04-18 21:38     ` Ying Han
2011-04-18 23:32       ` Minchan Kim
2011-04-19  2:42         ` Ying Han
2011-04-19  5:50           ` Minchan Kim
2011-04-15 23:23 ` [PATCH V5 07/10] Add per-memcg zone "unreclaimable" Ying Han
2011-04-18  4:27   ` Minchan Kim
2011-04-18 17:31     ` Ying Han
2011-04-15 23:23 ` [PATCH V5 08/10] Enable per-memcg background reclaim Ying Han
2011-04-15 23:23 ` [PATCH V5 09/10] Add API to export per-memcg kswapd pid Ying Han
2011-04-18  5:01   ` Minchan Kim
2011-04-18 17:41     ` Ying Han
2011-04-15 23:23 ` [PATCH V5 10/10] Add some per-memcg stats Ying Han
