* [RFC][PATCH 0/4] memcg: per cgroup background reclaim
@ 2010-11-30  6:49 Ying Han
  2010-11-30  6:49 ` [PATCH 1/4] Add kswapd descriptor Ying Han
                   ` (5 more replies)
  0 siblings, 6 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30  6:49 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Tejun Heo
  Cc: linux-mm

The current implementation of memcg supports only direct reclaim and this
patchset adds the support for background reclaim. Per cgroup background reclaim
is needed to spread the memory pressure out over a longer period of time and
smooth out the system performance.

The current implementation is not a stable version, and it sometimes crashes on
my NUMA machine. Before going further with debugging, I would like to start the
discussion and hear feedback on the initial design.

Current status:
I ran some simple tests which read/write a large file and verify that the per
cgroup kswapd is triggered at the low_wmark. I also compared the
pg_steal/pg_scan ratio with and without background reclaim.

Step1: Create a cgroup with 500M memory_limit and set the min_free_kbytes to 1024.
$ mount -t cgroup -o cpuset,memory cpuset /dev/cgroup
$ mkdir /dev/cgroup/A
$ echo 0 >/dev/cgroup/A/cpuset.cpus
$ echo 0 >/dev/cgroup/A/cpuset.mems
$ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
$ echo 1024 >/dev/cgroup/A/memory.min_free_kbytes
$ echo $$ >/dev/cgroup/A/tasks
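
(Note: with patch 3 applied, the per cgroup kswapd thread is created lazily the
first time the cgroup's usage crosses the low watermark, and it is named after
the cgroup path. So once reclaim kicks in,
	$ ps -eo comm | grep kswapd
should list a thread named "kswapd" plus the cgroup path ("kswapd/A" here) next
to the per-node kswapd0.)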

Step2: Check the wmarks.
$ cat /dev/cgroup/A/memory.reclaim_wmarks
memcg_low_wmark 98304000
memcg_high_wmark 81920000
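
(For reference, these values follow from the watermark calculation introduced in
patch 2, setup_per_memcg_wmarks(), assuming 4K pages:

	limit      = 500M = 524288000 bytes
	page_min   = min_free_kbytes >> (PAGE_SHIFT - 10) = 1024 >> 2 = 256
	tmp        = page_min * limit / 2048 = 65536000
	low_wmark  = tmp + tmp/2 = 98304000
	high_wmark = tmp + tmp/4 = 81920000

Note that the low watermark is the larger number: these are usage thresholds, so
background reclaim is kicked off when usage rises above low_wmark and the per
cgroup kswapd stops once usage drops back below high_wmark.)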

Step3: Dirty the pages by creating a 20g file on the hard drive.
$ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1

Compared memory.stat with and without background reclaim. Previously all the
pages were reclaimed via direct reclaim; now about half of them are reclaimed
in the background. (Note: writing '0' to min_free_kbytes disables the per
cgroup kswapd.)

Only direct reclaim:                 With background reclaim:
kswapd_steal          0              kswapd_steal     2751822
pg_pgsteal      5100401              pg_pgsteal       2476676
kswapd_pgscan         0              kswapd_pgscan    6019373
pg_scan         5542464              pg_scan          3851281
pgrefill         304505              pgrefill          348077
pgoutrun              0              pgoutrun           44568
allocstall       159278              allocstall         75669

Step4: Cleanup
$ echo $$ >/dev/cgroup/tasks
$ echo 0 > /dev/cgroup/A/memory.force_empty

Step5: Read the 20g file into the pagecache.
$ cat /export/hdc3/dd/tf0 > /dev/zero;

Compared memory.stat with and without background reclaim. All the clean pages
are now reclaimed in the background instead of via direct reclaim.

Only direct reclaim:                 With background reclaim:
kswapd_steal          0              kswapd_steal     3512424
pg_pgsteal      3461280              pg_pgsteal             0
kswapd_pgscan         0              kswapd_pgscan    3512440
pg_scan         3461280              pg_scan                0
pgrefill              0              pgrefill               0
pgoutrun              0              pgoutrun           74973
allocstall       108165              allocstall             0


Ying Han (4):
  Add kswapd descriptor.
  Add per cgroup reclaim watermarks.
  Per cgroup background reclaim.
  Add more per memcg stats.

 include/linux/memcontrol.h  |  112 +++++++++++
 include/linux/mmzone.h      |    3 +-
 include/linux/res_counter.h |   88 +++++++++-
 include/linux/swap.h        |   10 +
 kernel/res_counter.c        |   26 ++-
 mm/memcontrol.c             |  447 ++++++++++++++++++++++++++++++++++++++++++-
 mm/mmzone.c                 |    2 +-
 mm/page_alloc.c             |   11 +-
 mm/vmscan.c                 |  346 ++++++++++++++++++++++++++++++----
 9 files changed, 994 insertions(+), 51 deletions(-)

-- 
1.7.3.1


* [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  6:49 [RFC][PATCH 0/4] memcg: per cgroup background reclaim Ying Han
@ 2010-11-30  6:49 ` Ying Han
  2010-11-30  7:08   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2010-11-30  6:49 ` [PATCH 2/4] Add per cgroup reclaim watermarks Ying Han
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30  6:49 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Tejun Heo
  Cc: linux-mm

There is a kswapd kernel thread for each memory node. We add a separate kswapd
for each cgroup. The kswapd thread sleeps on the wait queue headed at the
kswapd_wait field of a kswapd descriptor. The descriptor stores the node or
cgroup the thread works for, which allows the global and per cgroup background
reclaim to share common reclaim code.

This patch adds the kswapd descriptor and switches the per node kswapd_wait to
the common data structure.

Signed-off-by: Ying Han <yinghan@google.com>
---
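(For clarity -- an illustration of intended usage, not code from the patch: the
descriptor is meant to be filled in one of two ways, and kswapd() tells the two
apart by whether kswapd_pgdat is set:

	/* global per-node kswapd (this patch): statically allocated */
	struct kswapd *kswapd_p = &kswapds[nid];
	kswapd_p->kswapd_pgdat = NODE_DATA(nid);	/* kswapd_mem stays NULL */
	kthread_run(kswapd, kswapd_p, "kswapd%d", nid);

	/* per cgroup kswapd (patch 3): descriptor allocated at cgroup
	 * creation, thread started lazily on the first wakeup */
	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
	kswapd_p->kswapd_mem = mem;			/* kswapd_pgdat stays NULL */
	kthread_run(kswapd, kswapd_p, "kswapd%s", memcg_name);

A NULL kswapd_pgdat inside kswapd() therefore means the thread is working for a
memcg rather than a node.)
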
 include/linux/mmzone.h |    3 +-
 include/linux/swap.h   |   10 +++++
 mm/memcontrol.c        |    2 +
 mm/mmzone.c            |    2 +-
 mm/page_alloc.c        |    9 +++-
 mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
 6 files changed, 90 insertions(+), 34 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..c77dfa2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -642,8 +642,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	wait_queue_head_t kswapd_wait;
-	struct task_struct *kswapd;
+	wait_queue_head_t *kswapd_wait;
 	int kswapd_max_order;
 } pg_data_t;
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index eba53e7..2e6cb58 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
 	return current->flags & PF_KSWAPD;
 }
 
+struct kswapd {
+	struct task_struct *kswapd_task;
+	wait_queue_head_t kswapd_wait;
+	struct mem_cgroup *kswapd_mem;
+	pg_data_t *kswapd_pgdat;
+};
+
+#define MAX_KSWAPDS MAX_NUMNODES
+extern struct kswapd kswapds[MAX_KSWAPDS];
+int kswapd(void *p);
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4034b6..dca3590 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -263,6 +263,8 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	wait_queue_head_t *kswapd_wait;
 };
 
 /* Stuffs for move charges at task migration. */
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..c7cbed5 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -102,7 +102,7 @@ unsigned long zone_nr_free_pages(struct zone *zone)
 	 * free pages are low, get a better estimate for free pages
 	 */
 	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
+			!waitqueue_active(zone->zone_pgdat->kswapd_wait))
 		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
 	return nr_free_pages;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b48dea2..a15bc1c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4070,13 +4070,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
+	struct kswapd *kswapd_p;
 
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
-	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
-	
+
+	kswapd_p = &kswapds[nid];
+	init_waitqueue_head(&kswapd_p->kswapd_wait);
+	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_pgdat = pgdat;
+
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, memmap_pages;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8a6fdc..e08005e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 
 	return nr_reclaimed;
 }
+
 #endif
 
+DEFINE_SPINLOCK(kswapds_spinlock);
+struct kswapd kswapds[MAX_KSWAPDS];
+
 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static int sleeping_prematurely(struct kswapd *kswapd, int order,
+				long remaining)
 {
 	int i;
+	pg_data_t *pgdat = kswapd->kswapd_pgdat;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
@@ -2377,21 +2383,28 @@ out:
  * If there are applications that are active memory-allocators
  * (most normal use), this basically shouldn't matter.
  */
-static int kswapd(void *p)
+int kswapd(void *p)
 {
 	unsigned long order;
-	pg_data_t *pgdat = (pg_data_t*)p;
+	struct kswapd *kswapd_p = (struct kswapd *)p;
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 	DEFINE_WAIT(wait);
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
 	};
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	const struct cpumask *cpumask;
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
+	if (pgdat) {
+		BUG_ON(pgdat->kswapd_wait != wait_h);
+		cpumask = cpumask_of_node(pgdat->node_id);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(tsk, cpumask);
+	}
 	current->reclaim_state = &reclaim_state;
 
 	/*
@@ -2414,9 +2427,13 @@ static int kswapd(void *p)
 		unsigned long new_order;
 		int ret;
 
-		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
-		new_order = pgdat->kswapd_max_order;
-		pgdat->kswapd_max_order = 0;
+		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
+		if (pgdat) {
+			new_order = pgdat->kswapd_max_order;
+			pgdat->kswapd_max_order = 0;
+		} else
+			new_order = 0;
+
 		if (order < new_order) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
@@ -2428,10 +2445,12 @@ static int kswapd(void *p)
 				long remaining = 0;
 
 				/* Try to sleep for a short interval */
-				if (!sleeping_prematurely(pgdat, order, remaining)) {
+				if (!sleeping_prematurely(kswapd_p, order,
+							remaining)) {
 					remaining = schedule_timeout(HZ/10);
-					finish_wait(&pgdat->kswapd_wait, &wait);
-					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+					finish_wait(wait_h, &wait);
+					prepare_to_wait(wait_h, &wait,
+							TASK_INTERRUPTIBLE);
 				}
 
 				/*
@@ -2439,20 +2458,25 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining)) {
-					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+				if (!sleeping_prematurely(kswapd_p, order,
+								remaining)) {
+					if (pgdat)
+						trace_mm_vmscan_kswapd_sleep(
+								pgdat->node_id);
 					schedule();
 				} else {
 					if (remaining)
-						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
+						count_vm_event(
+						KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
-						count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
+						count_vm_event(
+						KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 				}
 			}
-
-			order = pgdat->kswapd_max_order;
+			if (pgdat)
+				order = pgdat->kswapd_max_order;
 		}
-		finish_wait(&pgdat->kswapd_wait, &wait);
+		finish_wait(wait_h, &wait);
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -2476,6 +2500,7 @@ static int kswapd(void *p)
 void wakeup_kswapd(struct zone *zone, int order)
 {
 	pg_data_t *pgdat;
+	wait_queue_head_t *wait;
 
 	if (!populated_zone(zone))
 		return;
@@ -2488,9 +2513,10 @@ void wakeup_kswapd(struct zone *zone, int order)
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
-	if (!waitqueue_active(&pgdat->kswapd_wait))
+	wait = pgdat->kswapd_wait;
+	if (!waitqueue_active(wait))
 		return;
-	wake_up_interruptible(&pgdat->kswapd_wait);
+	wake_up_interruptible(wait);
 }
 
 /*
@@ -2587,7 +2613,10 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 
 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
 				/* One of our CPUs online: restore mask */
-				set_cpus_allowed_ptr(pgdat->kswapd, mask);
+				if (kswapds[nid].kswapd_task)
+					set_cpus_allowed_ptr(
+						kswapds[nid].kswapd_task,
+						mask);
 		}
 	}
 	return NOTIFY_OK;
@@ -2599,19 +2628,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
  */
 int kswapd_run(int nid)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
+	struct task_struct *thr;
 	int ret = 0;
 
-	if (pgdat->kswapd)
+	if (kswapds[nid].kswapd_task)
 		return 0;
 
-	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
-	if (IS_ERR(pgdat->kswapd)) {
+	thr = kthread_run(kswapd, &kswapds[nid], "kswapd%d", nid);
+	if (IS_ERR(thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
 		printk("Failed to start kswapd on node %d\n",nid);
 		ret = -1;
 	}
+	kswapds[nid].kswapd_task = thr;
 	return ret;
 }
 
@@ -2620,10 +2650,20 @@ int kswapd_run(int nid)
  */
 void kswapd_stop(int nid)
 {
-	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+	struct task_struct *thr;
+	struct kswapd *kswapd_p;
+	wait_queue_head_t *wait;
+
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	spin_lock(&kswapds_spinlock);
+	wait = pgdat->kswapd_wait;
+	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+	thr = kswapd_p->kswapd_task;
+	spin_unlock(&kswapds_spinlock);
 
-	if (kswapd)
-		kthread_stop(kswapd);
+	if (thr)
+		kthread_stop(thr);
 }
 
 static int __init kswapd_init(void)
-- 
1.7.3.1


* [PATCH 2/4] Add per cgroup reclaim watermarks.
  2010-11-30  6:49 [RFC][PATCH 0/4] memcg: per cgroup background reclaim Ying Han
  2010-11-30  6:49 ` [PATCH 1/4] Add kswapd descriptor Ying Han
@ 2010-11-30  6:49 ` Ying Han
  2010-11-30  7:21   ` KAMEZAWA Hiroyuki
  2010-12-07 14:56   ` Mel Gorman
  2010-11-30  6:49 ` [PATCH 3/4] Per cgroup background reclaim Ying Han
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30  6:49 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Tejun Heo
  Cc: linux-mm

The per cgroup kswapd is invoked from mem_cgroup_charge() when the cgroup's
memory usage rises above a threshold, the low_wmark. The kswapd thread then
reclaims pages in a priority loop similar to the global algorithm, and it is
done once the memory usage drops below the high_wmark.

The per cgroup background reclaim is based on the per cgroup LRU and also adds
per cgroup watermarks. There are two watermarks, "low_wmark" and "high_wmark",
calculated from each cgroup's limit_in_bytes (hard limit). Each time the hard
limit is changed, the corresponding wmarks are re-calculated. Since the memory
controller charges only user pages, there is no need for a "min_wmark". The
current calculation of the wmarks is a function of "memory.min_free_kbytes",
which can be adjusted by writing different values into the new API. This is
added mainly for debugging purposes.

Signed-off-by: Ying Han <yinghan@google.com>
---
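(For clarity -- a condensed sketch of how these charge flags get used once
patch 3 wires up the charge path; this is a paraphrase, not the literal patch
code:

	/* opportunistic charge against the low watermark first */
	ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW, &fail_res);
	if (ret) {
		/* usage is above low_wmark: kick the per cgroup kswapd ... */
		wake_memcg_kswapd(mem_over_limit);
		/* ... and fall back to charging against the hard limit */
		ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN,
					 &fail_res);
	}

The background thread then keeps reclaiming until
mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH) returns true.)
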
 include/linux/memcontrol.h  |    1 +
 include/linux/res_counter.h |   88 ++++++++++++++++++++++++++++++-
 kernel/res_counter.c        |   26 ++++++++--
 mm/memcontrol.c             |  123 +++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                 |   10 ++++
 5 files changed, 238 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..90fe7fe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -76,6 +76,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index fcb9884..eed12c5 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -39,6 +39,16 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit that reclaim triggers. TODO: res_counter in mem
+	 * or wmark_limit.
+	 */
+	unsigned long long low_wmark_limit;
+	/*
+	 * the limit that reclaim stops. TODO: res_counter in mem or
+	 * wmark_limit.
+	 */
+	unsigned long long high_wmark_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -55,6 +65,10 @@ struct res_counter {
 
 #define RESOURCE_MAX (unsigned long long)LLONG_MAX
 
+#define CHARGE_WMARK_MIN	0x01
+#define CHARGE_WMARK_LOW	0x02
+#define CHARGE_WMARK_HIGH	0x04
+
 /**
  * Helpers to interact with userspace
  * res_counter_read_u64() - returns the value of the specified member.
@@ -92,6 +106,8 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_WMARK_LIMIT,
+	RES_HIGH_WMARK_LIMIT
 };
 
 /*
@@ -112,9 +128,10 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
  */
 
 int __must_check res_counter_charge_locked(struct res_counter *counter,
-		unsigned long val);
+		unsigned long val, int charge_flags);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+		unsigned long val, int charge_flags,
+		struct res_counter **limit_fail_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -145,6 +162,24 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
 	return false;
 }
 
+static inline bool
+res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->high_wmark_limit)
+		return true;
+
+	return false;
+}
+
+static inline bool
+res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->low_wmark_limit)
+		return true;
+
+	return false;
+}
+
 /**
  * Get the difference between the usage and the soft limit
  * @cnt: The counter
@@ -193,6 +228,30 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 	return ret;
 }
 
+static inline bool
+res_counter_check_under_low_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_low_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool
+res_counter_check_under_high_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_high_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -220,6 +279,8 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 	spin_lock_irqsave(&cnt->lock, flags);
 	if (cnt->usage <= limit) {
 		cnt->limit = limit;
+		cnt->low_wmark_limit = limit;
+		cnt->high_wmark_limit = limit;
 		ret = 0;
 	}
 	spin_unlock_irqrestore(&cnt->lock, flags);
@@ -238,4 +299,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_high_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
+static inline int
+res_counter_set_low_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index c7eaa37..a524349 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,12 +19,26 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
+	counter->low_wmark_limit = RESOURCE_MAX;
+	counter->high_wmark_limit = RESOURCE_MAX;
 	counter->parent = parent;
 }
 
-int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
+int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
+				int charge_flags)
 {
-	if (counter->usage + val > counter->limit) {
+	unsigned long long limit = 0;
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		limit = counter->low_wmark_limit;
+
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		limit = counter->high_wmark_limit;
+
+	if (charge_flags & CHARGE_WMARK_MIN)
+		limit = counter->limit;
+
+	if (counter->usage + val > limit) {
 		counter->failcnt++;
 		return -ENOMEM;
 	}
@@ -36,7 +50,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+			int charge_flags, struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
@@ -46,7 +60,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
-		ret = res_counter_charge_locked(c, val);
+		ret = res_counter_charge_locked(c, val, charge_flags);
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
@@ -103,6 +117,10 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_WMARK_LIMIT:
+		return &counter->low_wmark_limit;
+	case RES_HIGH_WMARK_LIMIT:
+		return &counter->high_wmark_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dca3590..a0c6ed9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -265,6 +265,7 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 
 	wait_queue_head_t *kswapd_wait;
+	unsigned long min_free_kbytes;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -370,6 +371,7 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
+static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -796,6 +798,32 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+void setup_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	u64 limit;
+	unsigned long min_free_kbytes;
+
+	min_free_kbytes = get_min_free_kbytes(mem);
+	limit = mem_cgroup_get_limit(mem);
+	if (min_free_kbytes == 0) {
+		res_counter_set_low_wmark_limit(&mem->res, limit);
+		res_counter_set_high_wmark_limit(&mem->res, limit);
+	} else {
+		unsigned long page_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+		unsigned long lowmem_pages = 2048;
+		unsigned long low_wmark, high_wmark;
+		u64 tmp;
+
+		tmp = (u64)page_min * limit;
+		do_div(tmp, lowmem_pages);
+
+		low_wmark = tmp + (tmp >> 1);
+		high_wmark = tmp + (tmp >> 2);
+		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
+		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+	}
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -1148,6 +1176,22 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+static unsigned long get_min_free_kbytes(struct mem_cgroup *memcg)
+{
+	struct cgroup *cgrp = memcg->css.cgroup;
+	unsigned long min_free_kbytes;
+
+	/* root ? */
+	if (cgrp == NULL || cgrp->parent == NULL)
+		return 0;
+
+	spin_lock(&memcg->reclaim_param_lock);
+	min_free_kbytes = memcg->min_free_kbytes;
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return min_free_kbytes;
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
@@ -1844,12 +1888,13 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	unsigned long flags = 0;
 	int ret;
 
-	ret = res_counter_charge(&mem->res, csize, &fail_res);
+	ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
 
 	if (likely(!ret)) {
 		if (!do_swap_account)
 			return CHARGE_OK;
-		ret = res_counter_charge(&mem->memsw, csize, &fail_res);
+		ret = res_counter_charge(&mem->memsw, csize, CHARGE_WMARK_MIN,
+					&fail_res);
 		if (likely(!ret))
 			return CHARGE_OK;
 
@@ -3733,6 +3778,37 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 mem_cgroup_min_free_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return get_min_free_kbytes(memcg);
+}
+
+static int mem_cgroup_min_free_write(struct cgroup *cgrp, struct cftype *cfg,
+				     u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+
+	parent = mem_cgroup_from_cont(cgrp->parent);
+
+	cgroup_lock();
+
+	spin_lock(&memcg->reclaim_param_lock);
+	memcg->min_free_kbytes = val;
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	cgroup_unlock();
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4024,6 +4100,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 	mutex_unlock(&memcg_oom_mutex);
 }
 
+static int mem_cgroup_wmark_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	unsigned long low_wmark, high_wmark;
+
+	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+
+	cb->fill(cb, "memcg_low_wmark", low_wmark);
+	cb->fill(cb, "memcg_high_wmark", high_wmark);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4127,6 +4218,15 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "min_free_kbytes",
+		.write_u64 = mem_cgroup_min_free_write,
+		.read_u64 = mem_cgroup_min_free_read,
+	},
+	{
+		.name = "reclaim_wmarks",
+		.read_map = mem_cgroup_wmark_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4308,6 +4408,19 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
+				int charge_flags)
+{
+	long ret = 0;
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		ret = res_counter_check_under_low_wmark_limit(&mem->res);
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		ret = res_counter_check_under_high_wmark_limit(&mem->res);
+
+	return ret;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4450,10 +4563,12 @@ static int mem_cgroup_do_precharge(unsigned long count)
 		 * are still under the same cgroup_mutex. So we can postpone
 		 * css_get().
 		 */
-		if (res_counter_charge(&mem->res, PAGE_SIZE * count, &dummy))
+		if (res_counter_charge(&mem->res, PAGE_SIZE * count,
+					CHARGE_WMARK_MIN, &dummy))
 			goto one_by_one;
 		if (do_swap_account && res_counter_charge(&mem->memsw,
-						PAGE_SIZE * count, &dummy)) {
+						PAGE_SIZE * count,
+						CHARGE_WMARK_MIN, &dummy)) {
 			res_counter_uncharge(&mem->res, PAGE_SIZE * count);
 			goto one_by_one;
 		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e08005e..6d5702b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -46,6 +46,8 @@
 
 #include <linux/swapops.h>
 
+#include <linux/res_counter.h>
+
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -2127,11 +2129,19 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
 {
 	int i;
 	pg_data_t *pgdat = kswapd->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd->kswapd_mem;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return 1;
 
+	if (mem) {
+		if (!mem_cgroup_watermark_ok(kswapd->kswapd_mem,
+						CHARGE_WMARK_HIGH))
+			return 1;
+		return 0;
+	}
+
 	/* If after HZ/10, a zone is below the high mark, it's premature */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
-- 
1.7.3.1


* [PATCH 3/4] Per cgroup background reclaim.
  2010-11-30  6:49 [RFC][PATCH 0/4] memcg: per cgroup background reclaim Ying Han
  2010-11-30  6:49 ` [PATCH 1/4] Add kswapd descriptor Ying Han
  2010-11-30  6:49 ` [PATCH 2/4] Add per cgroup reclaim watermarks Ying Han
@ 2010-11-30  6:49 ` Ying Han
  2010-11-30  7:51   ` KAMEZAWA Hiroyuki
  2010-12-01  2:18   ` KOSAKI Motohiro
  2010-11-30  6:49 ` [PATCH 4/4] Add more per memcg stats Ying Han
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30  6:49 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Tejun Heo
  Cc: linux-mm

The current implementation of memcg supports only direct reclaim and this
patch adds the support for background reclaim. Per cgroup background reclaim
is needed to spread the memory pressure out over a longer period of time and
smooth out the system performance.

There is a kswapd kernel thread for each memory node. We add a separate kswapd
for each cgroup. The kswapd thread sleeps on the wait queue headed at the
kswapd_wait field of a kswapd descriptor.

The kswapd() function is now shared between the global and per cgroup kswapd
threads. It is passed the kswapd descriptor, which contains the information of
either a node or a cgroup. The new function balance_mem_cgroup_pgdat() is
invoked if it is a per cgroup kswapd thread. balance_mem_cgroup_pgdat()
performs a priority loop similar to global reclaim. In each iteration it
invokes balance_pgdat_node() for every node on the system, a new function that
performs background reclaim for one node. After reclaiming each node, it checks
mem_cgroup_watermark_ok() and breaks out of the priority loop if it returns
true. A per memcg zone will be marked as "unreclaimable" if the scanning rate
is much greater than the reclaiming rate on the per cgroup LRU. The bit is
cleared when a page charged to the cgroup is freed. Kswapd breaks out of the
priority loop if all the zones are marked as "unreclaimable".

Signed-off-by: Ying Han <yinghan@google.com>
---
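(Roughly, the control flow this patch adds -- an outline rather than the
literal code:

	balance_mem_cgroup_pgdat(mem, order):
	    for (priority = DEF_PRIORITY; priority >= 0; priority--)
	        for each online node still set in the do_nodes bitmap:
	            balance_pgdat_node(pgdat, order, &sc);  /* shrink this
	                                          memcg's LRUs on that node */
	            clear the node from do_nodes once every populated zone is
	            mem_cgroup_mz_unreclaimable() for this memcg;
	            if mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH):
	                done -- usage is back under the high watermark
	        congestion_wait() at low priorities before trying again

A per memcg zone is flagged unreclaimable once mz->pages_scanned grows past
about six times the reclaimable pages on that memcg/zone LRU, and the flag is
cleared from the page-free path when a page charged to the cgroup is returned
to the buddy allocator.)
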
 include/linux/memcontrol.h |   30 +++++++
 mm/memcontrol.c            |  182 ++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c            |    2 +
 mm/vmscan.c                |  205 +++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 416 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 90fe7fe..dbed45d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -127,6 +127,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
+void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone);
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+					unsigned long nr_scanned);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -299,6 +305,25 @@ static inline void mem_cgroup_update_file_mapped(struct page *page,
 {
 }
 
+static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
+						struct zone *zone,
+						unsigned long nr_scanned)
+{
+}
+
+static inline void mem_cgroup_clear_unreclaimable(struct page *page,
+							struct zone *zone)
+{
+}
+static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
+		struct zone *zone)
+{
+}
+static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
+						struct zone *zone)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
@@ -312,6 +337,11 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
 	return 0;
 }
 
+static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
+								int zid)
+{
+	return false;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a0c6ed9..1d39b65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -48,6 +48,8 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/kthread.h>
+
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -118,7 +120,10 @@ struct mem_cgroup_per_zone {
 	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	unsigned long		pages_scanned;	/* since last reclaim */
+	int			all_unreclaimable;	/* All pages pinned */
 };
+
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
@@ -372,6 +377,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
+static inline void wake_memcg_kswapd(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1086,6 +1092,106 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(
+					struct mem_cgroup_per_zone *mz)
+{
+	int nr;
+	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
+		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
+			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+
+	return nr;
+}
+
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+						unsigned long nr_scanned)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->pages_scanned += nr_scanned;
+}
+
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->pages_scanned <
+				mem_cgroup_zone_reclaimable_pages(mz) * 6;
+	return 0;
+}
+
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->all_unreclaimable;
+
+	return 0;
+}
+
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->all_unreclaimable = 1;
+}
+
+void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	struct mem_cgroup *mem = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	struct page_cgroup *pc = lookup_page_cgroup(page);
+
+	if (unlikely(!pc))
+		return;
+
+	rcu_read_lock();
+	mem = pc->mem_cgroup;
+	rcu_read_unlock();
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = 0;
+	}
+
+	return;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -1887,6 +1993,20 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	struct res_counter *fail_res;
 	unsigned long flags = 0;
 	int ret;
+	unsigned long min_free_kbytes = 0;
+
+	min_free_kbytes = get_min_free_kbytes(mem);
+	if (min_free_kbytes) {
+		ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
+					&fail_res);
+		if (likely(!ret)) {
+			return CHARGE_OK;
+		} else {
+			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
+									res);
+			wake_memcg_kswapd(mem_over_limit);
+		}
+	}
 
 	ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
 
@@ -3037,6 +3157,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3046,7 +3167,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 						MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
-  		if (curusage >= oldusage)
+		if (curusage >= oldusage)
 			retry_count--;
 		else
 			oldusage = curusage;
@@ -3096,6 +3217,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -4352,6 +4474,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
+	struct kswapd *kswapd_p;
+	wait_queue_head_t *wait;
 
 	mem_cgroup_remove_from_trees(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
@@ -4360,6 +4484,15 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 		free_mem_cgroup_per_zone_info(mem, node);
 
 	free_percpu(mem->stat);
+
+	wait = mem->kswapd_wait;
+	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+	if (kswapd_p) {
+		if (kswapd_p->kswapd_task)
+			kthread_stop(kswapd_p->kswapd_task);
+		kfree(kswapd_p);
+	}
+
 	if (sizeof(struct mem_cgroup) < PAGE_SIZE)
 		kfree(mem);
 	else
@@ -4421,6 +4554,39 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
 	return ret;
 }
 
+static inline
+void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	wait_queue_head_t *wait;
+	struct kswapd *kswapd_p;
+	struct task_struct *thr;
+	static char memcg_name[PATH_MAX];
+
+	if (!mem)
+		return;
+
+	wait = mem->kswapd_wait;
+	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+	if (!kswapd_p->kswapd_task) {
+		if (mem->css.cgroup)
+			cgroup_path(mem->css.cgroup, memcg_name, PATH_MAX);
+		else
+			sprintf(memcg_name, "no_name");
+
+		thr = kthread_run(kswapd, kswapd_p, "kswapd%s", memcg_name);
+		if (IS_ERR(thr))
+			printk(KERN_INFO "Failed to start kswapd on memcg %d\n",
+				0);
+		else
+			kswapd_p->kswapd_task = thr;
+	}
+
+	if (!waitqueue_active(wait)) {
+		return;
+	}
+	wake_up_interruptible(wait);
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4452,6 +4618,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	struct mem_cgroup *mem, *parent;
 	long error = -ENOMEM;
 	int node;
+	struct kswapd *kswapd_p = NULL;
 
 	mem = mem_cgroup_alloc();
 	if (!mem)
@@ -4499,6 +4666,19 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	spin_lock_init(&mem->reclaim_param_lock);
 	INIT_LIST_HEAD(&mem->oom_notify);
 
+
+	if (!mem_cgroup_is_root(mem)) {
+		kswapd_p = kmalloc(sizeof(struct kswapd), GFP_KERNEL);
+		if (!kswapd_p) {
+			printk(KERN_INFO "Failed to kmalloc kswapd_p %d\n", 0);
+			goto free_out;
+		}
+		memset(kswapd_p, 0, sizeof(struct kswapd));
+		init_waitqueue_head(&kswapd_p->kswapd_wait);
+		mem->kswapd_wait = &kswapd_p->kswapd_wait;
+		kswapd_p->kswapd_mem = mem;
+	}
+
 	if (parent)
 		mem->swappiness = get_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a15bc1c..dc61f2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -615,6 +615,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 		do {
 			page = list_entry(list->prev, struct page, lru);
+			mem_cgroup_clear_unreclaimable(page, zone);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
@@ -632,6 +633,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
+	mem_cgroup_clear_unreclaimable(page, zone);
 
 	__free_one_page(page, zone, order, migratetype);
 	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d5702b..f8430c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -100,6 +100,8 @@ struct scan_control {
 	 * are scanned.
 	 */
 	nodemask_t	*nodemask;
+
+	int priority;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -2380,6 +2382,201 @@ out:
 	return sc.nr_reclaimed;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * TODO: the same function is used for global LRU and memcg LRU. For global
+ * LRU, the kswapd is done until all this node's zones are at
+ * high_wmark_pages(zone) or zone->all_unreclaimable.
+ */
+static void balance_pgdat_node(pg_data_t *pgdat, int order,
+					struct scan_control *sc)
+{
+	int i, end_zone;
+	unsigned long total_scanned;
+	struct mem_cgroup *mem_cont = sc->mem_cgroup;
+	int priority = sc->priority;
+	int nid = pgdat->node_id;
+
+	/*
+	 * Scan in the highmem->dma direction for the highest
+	 * zone which needs scanning
+	 */
+	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+				priority != DEF_PRIORITY)
+			continue;
+		/*
+		 * Do some background aging of the anon list, to give
+		 * pages a chance to be referenced before reclaiming.
+		 */
+		if (inactive_anon_is_low(zone, sc))
+			shrink_active_list(SWAP_CLUSTER_MAX, zone,
+							sc, priority, 0);
+
+		end_zone = i;
+		goto scan;
+	}
+	return;
+
+scan:
+	total_scanned = 0;
+	/*
+	 * Now scan the zone in the dma->highmem direction, stopping
+	 * at the last zone which needs scanning.
+	 *
+	 * We do this because the page allocator works in the opposite
+	 * direction.  This prevents the page allocator from allocating
+	 * pages behind kswapd's direction of progress, which would
+	 * cause too much scanning of the lower zones.
+	 */
+	for (i = 0; i <= end_zone; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+			priority != DEF_PRIORITY)
+			continue;
+
+		sc->nr_scanned = 0;
+		shrink_zone(priority, zone, sc);
+		total_scanned += sc->nr_scanned;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
+			continue;
+
+		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
+			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
+			sc->may_writepage = 1;
+		}
+	}
+
+	sc->nr_scanned = total_scanned;
+	return;
+}
+
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+					      int order)
+{
+	unsigned long total_scanned = 0;
+	int i;
+	int priority;
+	int wmark_ok, nid;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		/*
+		 * kswapd doesn't want to be bailed out while reclaim. because
+		 * we want to put equal scanning pressure on each zone.
+		 * TODO: this might not be true for the memcg background
+		 * reclaim.
+		 */
+		.nr_to_reclaim = ULONG_MAX,
+		.swappiness = vm_swappiness,
+		.order = order,
+		.mem_cgroup = mem_cont,
+	};
+	DECLARE_BITMAP(do_nodes, MAX_NUMNODES);
+
+	/*
+	 * bitmap to indicate which node to reclaim pages from. Initially we
+	 * assume all nodes need reclaim.
+	 */
+	bitmap_fill(do_nodes, MAX_NUMNODES);
+
+loop_again:
+	sc.may_writepage = !laptop_mode;
+	sc.nr_reclaimed = 0;
+	total_scanned = 0;
+
+	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		sc.priority = priority;
+
+		/* The swap token gets in the way of swapout... */
+		if (!priority)
+			disable_swap_token();
+
+
+		for_each_online_node(nid) {
+			pg_data_t *pgdat = NODE_DATA(nid);
+
+			wmark_ok = 1;
+
+			if (!test_bit(nid, do_nodes))
+				continue;
+
+			balance_pgdat_node(pgdat, order, &sc);
+			total_scanned += sc.nr_scanned;
+
+			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+				struct zone *zone = pgdat->node_zones + i;
+
+				if (!populated_zone(zone))
+					continue;
+
+				if (!mem_cgroup_mz_unreclaimable(mem_cont,
+								zone)) {
+					__set_bit(nid, do_nodes);
+					break;
+				}
+			}
+
+			if (i < 0)
+				__clear_bit(nid, do_nodes);
+
+			if (!mem_cgroup_watermark_ok(sc.mem_cgroup,
+							CHARGE_WMARK_HIGH))
+				wmark_ok = 0;
+
+			if (wmark_ok) {
+				goto out;
+			}
+		}
+
+		if (wmark_ok)
+			break;
+
+		if (total_scanned && priority < DEF_PRIORITY - 2)
+			congestion_wait(WRITE, HZ/10);
+
+		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
+			break;
+	}
+
+out:
+	if (!wmark_ok) {
+		cond_resched();
+
+		try_to_freeze();
+
+		goto loop_again;
+	}
+
+	return sc.nr_reclaimed;
+}
+#else
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+							int order)
+{
+	return 0;
+}
+#endif
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2497,8 +2694,12 @@ int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balance_pgdat(pgdat, order);
+			if (pgdat) {
+				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
+								order);
+				balance_pgdat(pgdat, order);
+			} else
+				balance_mem_cgroup_pgdat(mem, order);
 		}
 	}
 	return 0;
-- 
1.7.3.1


* [PATCH 4/4] Add more per memcg stats.
  2010-11-30  6:49 [RFC][PATCH 0/4] memcg: per cgroup background reclaim Ying Han
                   ` (2 preceding siblings ...)
  2010-11-30  6:49 ` [PATCH 3/4] Per cgroup background reclaim Ying Han
@ 2010-11-30  6:49 ` Ying Han
  2010-11-30  7:53   ` KAMEZAWA Hiroyuki
  2010-11-30  6:54 ` [RFC][PATCH 0/4] memcg: per cgroup background reclaim KOSAKI Motohiro
  2010-11-30  7:00 ` KAMEZAWA Hiroyuki
  5 siblings, 1 reply; 52+ messages in thread
From: Ying Han @ 2010-11-30  6:49 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Tejun Heo
  Cc: linux-mm

A set of statistics is added to memory.stat to monitor the per cgroup kswapd
performance.

Signed-off-by: Ying Han <yinghan@google.com>
---
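(For reference, with this patch applied the counters compared in the cover
letter can be read straight from the per cgroup stat file of the test cgroup,
e.g.:

	$ grep -E 'kswapd|pg_|allocstall|pgoutrun|pgrefill' /dev/cgroup/A/memory.stat

The kswapd_* counters account for work done by the per cgroup background
thread, and the pg_* counters for work done in direct reclaim.)
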
 include/linux/memcontrol.h |   81 +++++++++++++++++++++++++
 mm/memcontrol.c            |  140 ++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |   33 +++++++++-
 3 files changed, 250 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dbed45d..893ca62 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -127,6 +127,19 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
+/* background reclaim stats */
+void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pgrefill(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_outrun(struct mem_cgroup *memcg, int val);
+void mem_cgroup_alloc_stall(struct mem_cgroup *memcg, int val);
+void mem_cgroup_balance_wmark_ok(struct mem_cgroup *memcg, int val);
+void mem_cgroup_balance_swap_max(struct mem_cgroup *memcg, int val);
+void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *memcg, int val);
+void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *memcg, int val);
+
 void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone);
 bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
 bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
@@ -337,6 +350,74 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
 	return 0;
 }
 
+/* background reclaim stats */
+static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_balance_wmark_ok(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_balance_swap_max(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
+
+static inline void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *memcg,
+								int val)
+{
+	return 0;
+}
+
 static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
 								int zid)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1d39b65..97df6dd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -91,6 +91,21 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_KSWAPD_INVOKE, /* # of times invokes kswapd */
+	MEM_CGROUP_STAT_KSWAPD_STEAL, /* # of pages reclaimed from kswapd */
+	MEM_CGROUP_STAT_PG_PGSTEAL, /* # of pages reclaimed from ttfp */
+	MEM_CGROUP_STAT_KSWAPD_PGSCAN, /* # of pages scanned from kswapd */
+	MEM_CGROUP_STAT_PG_PGSCAN, /* # of pages scanned from ttfp */
+	MEM_CGROUP_STAT_PGREFILL, /* # of pages scanned on active list */
+	MEM_CGROUP_STAT_WMARK_LOW_OK,
+	MEM_CGROUP_STAT_KSWAP_CREAT,
+	MEM_CGROUP_STAT_PGOUTRUN,
+	MEM_CGROUP_STAT_ALLOCSTALL,
+	MEM_CGROUP_STAT_BALANCE_WMARK_OK,
+	MEM_CGROUP_STAT_BALANCE_SWAP_MAX,
+	MEM_CGROUP_STAT_WAITQUEUE,
+	MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE,
+	MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE,
 	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
 	/* incremented at every  pagein/pageout */
 	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
@@ -619,6 +634,62 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_kswapd_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_STEAL], val);
+}
+
+void mem_cgroup_pg_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PG_PGSTEAL], val);
+}
+
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_PGSCAN], val);
+}
+
+void mem_cgroup_pg_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PG_PGSCAN], val);
+}
+
+void mem_cgroup_pgrefill(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PGREFILL], val);
+}
+
+void mem_cgroup_pg_outrun(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PGOUTRUN], val);
+}
+
+void mem_cgroup_alloc_stall(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_ALLOCSTALL], val);
+}
+
+void mem_cgroup_balance_wmark_ok(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_BALANCE_WMARK_OK], val);
+}
+
+void mem_cgroup_balance_swap_max(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_BALANCE_SWAP_MAX], val);
+}
+
+void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE], val);
+}
+
+void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE],
+			val);
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -2000,8 +2071,14 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 		ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
 					&fail_res);
 		if (likely(!ret)) {
+			this_cpu_add(
+				mem->stat->count[MEM_CGROUP_STAT_WMARK_LOW_OK],
+				1);
 			return CHARGE_OK;
 		} else {
+			this_cpu_add(
+				mem->stat->count[MEM_CGROUP_STAT_KSWAPD_INVOKE],
+				1);
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									res);
 			wake_memcg_kswapd(mem_over_limit);
@@ -3723,6 +3800,21 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_KSWAPD_INVOKE,
+	MCS_KSWAPD_STEAL,
+	MCS_PG_PGSTEAL,
+	MCS_KSWAPD_PGSCAN,
+	MCS_PG_PGSCAN,
+	MCS_PGREFILL,
+	MCS_WMARK_LOW_OK,
+	MCS_KSWAP_CREAT,
+	MCS_PGOUTRUN,
+	MCS_ALLOCSTALL,
+	MCS_BALANCE_WMARK_OK,
+	MCS_BALANCE_SWAP_MAX,
+	MCS_WAITQUEUE,
+	MCS_KSWAPD_SHRINK_ZONE,
+	MCS_KSWAPD_MAY_WRITEPAGE,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3745,6 +3837,21 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"kswapd_invoke", "total_kswapd_invoke"},
+	{"kswapd_steal", "total_kswapd_steal"},
+	{"pg_pgsteal", "total_pg_pgsteal"},
+	{"kswapd_pgscan", "total_kswapd_pgscan"},
+	{"pg_scan", "total_pg_scan"},
+	{"pgrefill", "total_pgrefill"},
+	{"wmark_low_ok", "total_wmark_low_ok"},
+	{"kswapd_create", "total_kswapd_create"},
+	{"pgoutrun", "total_pgoutrun"},
+	{"allocstall", "total_allocstall"},
+	{"balance_wmark_ok", "total_balance_wmark_ok"},
+	{"balance_swap_max", "total_balance_swap_max"},
+	{"waitqueue", "total_waitqueue"},
+	{"kswapd_shrink_zone", "total_kswapd_shrink_zone"},
+	{"kswapd_may_writepage", "total_kswapd_may_writepage"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3773,6 +3880,37 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	/* kswapd stat */
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_INVOKE);
+	s->stat[MCS_KSWAPD_INVOKE] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_STEAL);
+	s->stat[MCS_KSWAPD_STEAL] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PG_PGSTEAL);
+	s->stat[MCS_PG_PGSTEAL] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_PGSCAN);
+	s->stat[MCS_KSWAPD_PGSCAN] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PG_PGSCAN);
+	s->stat[MCS_PG_PGSCAN] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGREFILL);
+	s->stat[MCS_PGREFILL] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WMARK_LOW_OK);
+	s->stat[MCS_WMARK_LOW_OK] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAP_CREAT);
+	s->stat[MCS_KSWAP_CREAT] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGOUTRUN);
+	s->stat[MCS_PGOUTRUN] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_ALLOCSTALL);
+	s->stat[MCS_ALLOCSTALL] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_BALANCE_WMARK_OK);
+	s->stat[MCS_BALANCE_WMARK_OK] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_BALANCE_SWAP_MAX);
+	s->stat[MCS_BALANCE_SWAP_MAX] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WAITQUEUE);
+	s->stat[MCS_WAITQUEUE] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE);
+	s->stat[MCS_KSWAPD_SHRINK_ZONE] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE);
+	s->stat[MCS_KSWAPD_MAY_WRITEPAGE] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
@@ -4579,9 +4717,11 @@ void wake_memcg_kswapd(struct mem_cgroup *mem)
 				0);
 		else
 			kswapd_p->kswapd_task = thr;
+		this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAP_CREAT], 1);
 	}
 
 	if (!waitqueue_active(wait)) {
+		this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_WAITQUEUE], 1);
 		return;
 	}
 	wake_up_interruptible(wait);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f8430c4..5b0c349 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1389,10 +1389,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, sc->mem_cgroup,
 			0, file);
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_pgscan(sc->mem_cgroup, nr_scanned);
+		else
+			mem_cgroup_pg_pgscan(sc->mem_cgroup, nr_scanned);
 	}
 
 	if (nr_taken == 0) {
@@ -1413,9 +1418,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	local_irq_disable();
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	if (scanning_global_lru(sc)) {
+		if (current_is_kswapd())
+			__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+		__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	} else {
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
+		else
+			mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
+	}
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
@@ -1508,11 +1520,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
-	__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	if (scanning_global_lru(sc))
+		__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	else
+		mem_cgroup_pgrefill(sc->mem_cgroup, pgscanned);
+
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -nr_taken);
 	else
@@ -1955,6 +1972,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
+	else
+		mem_cgroup_alloc_stall(sc->mem_cgroup, 1);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
@@ -2444,6 +2463,8 @@ scan:
 			priority != DEF_PRIORITY)
 			continue;
 
+		mem_cgroup_kswapd_shrink_zone(mem_cont, 1);
+
 		sc->nr_scanned = 0;
 		shrink_zone(priority, zone, sc);
 		total_scanned += sc->nr_scanned;
@@ -2462,6 +2483,7 @@ scan:
 		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
 		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
 			sc->may_writepage = 1;
+			mem_cgroup_kswapd_may_writepage(mem_cont, 1);
 		}
 	}
 
@@ -2504,6 +2526,8 @@ loop_again:
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
 
+	mem_cgroup_pg_outrun(mem_cont, 1);
+
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc.priority = priority;
 
@@ -2544,6 +2568,7 @@ loop_again:
 				wmark_ok = 0;
 
 			if (wmark_ok) {
+				mem_cgroup_balance_wmark_ok(sc.mem_cgroup, 1);
 				goto out;
 			}
 		}
-- 
1.7.3.1


* Re: [RFC][PATCH 0/4] memcg: per cgroup background reclaim
  2010-11-30  6:49 [RFC][PATCH 0/4] memcg: per cgroup background reclaim Ying Han
                   ` (3 preceding siblings ...)
  2010-11-30  6:49 ` [PATCH 4/4] Add more per memcg stats Ying Han
@ 2010-11-30  6:54 ` KOSAKI Motohiro
  2010-11-30  7:03   ` Ying Han
  2010-11-30  7:00 ` KAMEZAWA Hiroyuki
  5 siblings, 1 reply; 52+ messages in thread
From: KOSAKI Motohiro @ 2010-11-30  6:54 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Balbir Singh, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Andrew Morton, Mel Gorman, Johannes Weiner,
	Christoph Lameter, Wu Fengguang, Andi Kleen, Hugh Dickins,
	Rik van Riel, Tejun Heo, linux-mm

> The current implementation of memcg only supports direct reclaim and this
> patchset adds the support for background reclaim. Per cgroup background
> reclaim is needed which spreads out the memory pressure over longer period
> of time and smoothes out the system performance.
> 
> The current implementation is not a stable version, and it crashes sometimes
> on my NUMA machine. Before going further for debugging, I would like to start
> the discussion and hear the feedbacks of the initial design.

I haven't read your code at all. However, I agree with your claim that memcg
also needs background reclaim.

So if you post a high-level design memo, I'm happy.

> 
> Current status:
> I run through some simple tests which reads/writes a large file and makes sure
> it triggers per cgroup kswapd on the low_wmark. Also, I compared at
> pg_steal/pg_scan ratio w/o background reclaim.
> 
> Step1: Create a cgroup with 500M memory_limit and set the min_free_kbytes to 1024.
> $ mount -t cgroup -o cpuset,memory cpuset /dev/cgroup
> $ mkdir /dev/cgroup/A
> $ echo 0 >/dev/cgroup/A/cpuset.cpus
> $ echo 0 >/dev/cgroup/A/cpuset.mems
> $ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
> $ echo 1024 >/dev/cgroup/A/memory.min_free_kbytes
> $ echo $$ >/dev/cgroup/A/tasks
> 
> Step2: Check the wmarks.
> $ cat /dev/cgroup/A/memory.reclaim_wmarks
> memcg_low_wmark 98304000
> memcg_high_wmark 81920000
> 
> Step3: Dirty the pages by creating a 20g file on hard drive.
> $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
> 
> Checked the memory.stat w/o background reclaim. It used to be all the pages are
> reclaimed from direct reclaim, and now about half of them are reclaimed at
> background. (note: writing '0' to min_free_kbytes disables per cgroup kswapd)




* Re: [RFC][PATCH 0/4] memcg: per cgroup background reclaim
  2010-11-30  6:49 [RFC][PATCH 0/4] memcg: per cgroup background reclaim Ying Han
                   ` (4 preceding siblings ...)
  2010-11-30  6:54 ` [RFC][PATCH 0/4] memcg: per cgroup background reclaim KOSAKI Motohiro
@ 2010-11-30  7:00 ` KAMEZAWA Hiroyuki
  2010-11-30  9:05   ` Ying Han
  5 siblings, 1 reply; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  7:00 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, 29 Nov 2010 22:49:41 -0800
Ying Han <yinghan@google.com> wrote:

> The current implementation of memcg only supports direct reclaim and this
> patchset adds the support for background reclaim. Per cgroup background
> reclaim is needed which spreads out the memory pressure over longer period
> of time and smoothes out the system performance.
> 
> The current implementation is not a stable version, and it crashes sometimes
> on my NUMA machine. Before going further for debugging, I would like to start
> the discussion and hear the feedbacks of the initial design.
> 

It's welcome, but please wait until the dirty-ratio work is merged.
And please post again once you no longer see the crash....

A description of the design would be appreciated.
What is the cost of "kswapd" charged against if the cpu cgroup is used at the same time ?

> Current status:
> I run through some simple tests which reads/writes a large file and makes sure
> it triggers per cgroup kswapd on the low_wmark. Also, I compared at
> pg_steal/pg_scan ratio w/o background reclaim.
> 
>
> Step1: Create a cgroup with 500M memory_limit and set the min_free_kbytes to 1024.
> $ mount -t cgroup -o cpuset,memory cpuset /dev/cgroup
> $ mkdir /dev/cgroup/A
> $ echo 0 >/dev/cgroup/A/cpuset.cpus
> $ echo 0 >/dev/cgroup/A/cpuset.mems
> $ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
> $ echo 1024 >/dev/cgroup/A/memory.min_free_kbytes
> $ echo $$ >/dev/cgroup/A/tasks
> 
> Step2: Check the wmarks.
> $ cat /dev/cgroup/A/memory.reclaim_wmarks
> memcg_low_wmark 98304000
> memcg_high_wmark 81920000
> 
> Step3: Dirty the pages by creating a 20g file on hard drive.
> $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
> 
> Checked the memory.stat w/o background reclaim. It used to be all the pages are
> reclaimed from direct reclaim, and now about half of them are reclaimed at
> background. (note: writing '0' to min_free_kbytes disables per cgroup kswapd)
> 
> Only direct reclaim                                                With background reclaim:
> kswapd_steal 0                                                     kswapd_steal 2751822
> pg_pgsteal 5100401                                               pg_pgsteal 2476676
> kswapd_pgscan 0                                                  kswapd_pgscan 6019373
> pg_scan 5542464                                                   pg_scan 3851281
> pgrefill 304505                                                       pgrefill 348077
> pgoutrun 0                                                             pgoutrun 44568
> allocstall 159278                                                    allocstall 75669
> 
> Step4: Cleanup
> $ echo $$ >/dev/cgroup/tasks
> $ echo 0 > /dev/cgroup/A/memory.force_empty
> 
> Step5: Read the 20g file into the pagecache.
> $ cat /export/hdc3/dd/tf0 > /dev/zero;
> 
> Checked the memory.stat w/o background reclaim. All the clean pages are reclaimed at
> background instead of direct reclaim.
> 
> Only direct reclaim                                                With background reclaim
> kswapd_steal 0                                                      kswapd_steal 3512424
> pg_pgsteal 3461280                                               pg_pgsteal 0
> kswapd_pgscan 0                                                  kswapd_pgscan 3512440
> pg_scan 3461280                                                   pg_scan 0
> pgrefill 0                                                                pgrefill 0
> pgoutrun 0                                                             pgoutrun 74973
> allocstall 108165                                                    allocstall 0
> 

What is the trigger for starting background reclaim ?

Thanks,
-Kame


* Re: [RFC][PATCH 0/4] memcg: per cgroup background reclaim
  2010-11-30  6:54 ` [RFC][PATCH 0/4] memcg: per cgroup background reclaim KOSAKI Motohiro
@ 2010-11-30  7:03   ` Ying Han
  2010-12-02 14:41     ` Balbir Singh
  0 siblings, 1 reply; 52+ messages in thread
From: Ying Han @ 2010-11-30  7:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel, Tejun Heo,
	linux-mm

On Mon, Nov 29, 2010 at 10:54 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> The current implementation of memcg only supports direct reclaim and this
>> patchset adds the support for background reclaim. Per cgroup background
>> reclaim is needed which spreads out the memory pressure over longer period
>> of time and smoothes out the system performance.
>>
>> The current implementation is not a stable version, and it crashes sometimes
>> on my NUMA machine. Before going further for debugging, I would like to start
>> the discussion and hear the feedbacks of the initial design.
>
> I haven't read your code at all. However, I agree with your claim that memcg
> also needs background reclaim.

Thanks for your comment.
>
> So if you post a high-level design memo, I'm happy.

My high-level design is spread out across the individual patches, so
here is a consolidated version. It is nothing more than the commit
messages of the following patches glued together.

"
The current implementation of memcg only supports direct reclaim and this
patchset adds the support for background reclaim. Per cgroup background
reclaim is needed which spreads out the memory pressure over longer period
of time and smoothes out the system performance.

There is a kswapd kernel thread for each memory node. We add a different kswapd
for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
field of a kswapd descriptor. The kswapd descriptor stores information of node
or cgroup and it allows the global and per cgroup background reclaim to share
common reclaim algorithms. The per cgroup kswapd is invoked at mem_cgroup_charge
when the cgroup's memory usage goes above a threshold--low_wmark. Then the kswapd
thread starts to reclaim pages in a priority loop similar to the global algorithm.
The kswapd is done once the usage drops below a threshold--high_wmark.

The per cgroup background reclaim is based on the per cgroup LRU and also adds
per cgroup watermarks. There are two watermarks including "low_wmark" and
"high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
for each cgroup. Each time the hard_limit is changed, the corresponding wmarks
are re-calculated. Since the memory controller charges only user pages, there is
no need for a "min_wmark". The current calculation of wmarks is a function of
"memory.min_free_kbytes" which could be adjusted by writing different values
into the new api. This is added mainly for debugging purposes.

The kswapd() function is now shared between the global and per cgroup kswapd threads.
It is passed the kswapd descriptor, which contains the information of
either the node or the cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
if it is a per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
priority loop similar to global reclaim. In each iteration it invokes
balance_pgdat_node for all nodes on the system, a new function that performs
background reclaim per node. After reclaiming each node, it checks
mem_cgroup_watermark_ok() and breaks the priority loop if it returns true. A per
memcg zone will be marked as "unreclaimable" if the scanning rate is much
greater than the reclaiming rate on the per cgroup LRU. The bit is cleared when
a page charged to the cgroup is freed. Kswapd breaks the priority
loop if all the zones are marked as "unreclaimable".
"

Also, I am happy to add more descriptions if anything not clear :)

thanks

--Ying

>
>>
>> Current status:
>> I run through some simple tests which reads/writes a large file and makes sure
>> it triggers per cgroup kswapd on the low_wmark. Also, I compared at
>> pg_steal/pg_scan ratio w/o background reclaim.
>>
>> Step1: Create a cgroup with 500M memory_limit and set the min_free_kbytes to 1024.
>> $ mount -t cgroup -o cpuset,memory cpuset /dev/cgroup
>> $ mkdir /dev/cgroup/A
>> $ echo 0 >/dev/cgroup/A/cpuset.cpus
>> $ echo 0 >/dev/cgroup/A/cpuset.mems
>> $ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
>> $ echo 1024 >/dev/cgroup/A/memory.min_free_kbytes
>> $ echo $$ >/dev/cgroup/A/tasks
>>
>> Step2: Check the wmarks.
>> $ cat /dev/cgroup/A/memory.reclaim_wmarks
>> memcg_low_wmark 98304000
>> memcg_high_wmark 81920000
>>
>> Step3: Dirty the pages by creating a 20g file on hard drive.
>> $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
>>
>> Checked the memory.stat w/o background reclaim. It used to be all the pages are
>> reclaimed from direct reclaim, and now about half of them are reclaimed at
>> background. (note: writing '0' to min_free_kbytes disables per cgroup kswapd)
>
>
>
>


* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  6:49 ` [PATCH 1/4] Add kswapd descriptor Ying Han
@ 2010-11-30  7:08   ` KAMEZAWA Hiroyuki
  2010-11-30  8:15     ` Minchan Kim
  2010-11-30 20:17     ` Ying Han
  2010-12-07  6:52   ` Balbir Singh
  2010-12-07 12:33   ` Mel Gorman
  2 siblings, 2 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  7:08 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, 29 Nov 2010 22:49:42 -0800
Ying Han <yinghan@google.com> wrote:

> There is a kswapd kernel thread for each memory node. We add a different kswapd
> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
> field of a kswapd descriptor. The kswapd descriptor stores information of node
> or cgroup and it allows the global and per cgroup background reclaim to share
> common reclaim algorithms.
> 
> This patch addes the kswapd descriptor and changes per zone kswapd_wait to the
> common data structure.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/mmzone.h |    3 +-
>  include/linux/swap.h   |   10 +++++
>  mm/memcontrol.c        |    2 +
>  mm/mmzone.c            |    2 +-
>  mm/page_alloc.c        |    9 +++-
>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>  6 files changed, 90 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 39c24eb..c77dfa2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> -	wait_queue_head_t kswapd_wait;
> -	struct task_struct *kswapd;
> +	wait_queue_head_t *kswapd_wait;
>  	int kswapd_max_order;
>  } pg_data_t;
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index eba53e7..2e6cb58 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>  	return current->flags & PF_KSWAPD;
>  }
>  
> +struct kswapd {
> +	struct task_struct *kswapd_task;
> +	wait_queue_head_t kswapd_wait;
> +	struct mem_cgroup *kswapd_mem;
> +	pg_data_t *kswapd_pgdat;
> +};
> +
> +#define MAX_KSWAPDS MAX_NUMNODES
> +extern struct kswapd kswapds[MAX_KSWAPDS];
> +int kswapd(void *p);

Why is this required ? Can't we allocate this at boot (if necessary) ?
Why is the existing kswapd also controlled under this structure ?
At first look, this just seems to increase the size of the changes....

IMHO, implementing background-reclaim-for-memcg is cleaner than reusing kswapd..
kswapd has tons of unnecessary checks.

Regards,
-Kame

>  /*
>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>   * be swapped to.  The swap type and the offset into that swap type are
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a4034b6..dca3590 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -263,6 +263,8 @@ struct mem_cgroup {
>  	 */
>  	struct mem_cgroup_stat_cpu nocpu_base;
>  	spinlock_t pcp_counter_lock;
> +
> +	wait_queue_head_t *kswapd_wait;
>  };
>  
>  /* Stuffs for move charges at task migration. */
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index e35bfb8..c7cbed5 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -102,7 +102,7 @@ unsigned long zone_nr_free_pages(struct zone *zone)
>  	 * free pages are low, get a better estimate for free pages
>  	 */
>  	if (nr_free_pages < zone->percpu_drift_mark &&
> -			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> +			!waitqueue_active(zone->zone_pgdat->kswapd_wait))
>  		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  
>  	return nr_free_pages;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b48dea2..a15bc1c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4070,13 +4070,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  	int nid = pgdat->node_id;
>  	unsigned long zone_start_pfn = pgdat->node_start_pfn;
>  	int ret;
> +	struct kswapd *kswapd_p;
>  
>  	pgdat_resize_init(pgdat);
>  	pgdat->nr_zones = 0;
> -	init_waitqueue_head(&pgdat->kswapd_wait);
>  	pgdat->kswapd_max_order = 0;
>  	pgdat_page_cgroup_init(pgdat);
> -	
> +
> +	kswapd_p = &kswapds[nid];
> +	init_waitqueue_head(&kswapd_p->kswapd_wait);
> +	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +	kswapd_p->kswapd_pgdat = pgdat;
> +
>  	for (j = 0; j < MAX_NR_ZONES; j++) {
>  		struct zone *zone = pgdat->node_zones + j;
>  		unsigned long size, realsize, memmap_pages;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b8a6fdc..e08005e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  
>  	return nr_reclaimed;
>  }
> +
>  #endif
>  
> +DEFINE_SPINLOCK(kswapds_spinlock);
> +struct kswapd kswapds[MAX_KSWAPDS];
> +
>  /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
> +				long remaining)
>  {
>  	int i;
> +	pg_data_t *pgdat = kswapd->kswapd_pgdat;
>  
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
> @@ -2377,21 +2383,28 @@ out:
>   * If there are applications that are active memory-allocators
>   * (most normal use), this basically shouldn't matter.
>   */
> -static int kswapd(void *p)
> +int kswapd(void *p)
>  {
>  	unsigned long order;
> -	pg_data_t *pgdat = (pg_data_t*)p;
> +	struct kswapd *kswapd_p = (struct kswapd *)p;
> +	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
> +	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>  	struct task_struct *tsk = current;
>  	DEFINE_WAIT(wait);
>  	struct reclaim_state reclaim_state = {
>  		.reclaimed_slab = 0,
>  	};
> -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +	const struct cpumask *cpumask;
>  
>  	lockdep_set_current_reclaim_state(GFP_KERNEL);
>  
> -	if (!cpumask_empty(cpumask))
> -		set_cpus_allowed_ptr(tsk, cpumask);
> +	if (pgdat) {
> +		BUG_ON(pgdat->kswapd_wait != wait_h);
> +		cpumask = cpumask_of_node(pgdat->node_id);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(tsk, cpumask);
> +	}
>  	current->reclaim_state = &reclaim_state;
>  
>  	/*
> @@ -2414,9 +2427,13 @@ static int kswapd(void *p)
>  		unsigned long new_order;
>  		int ret;
>  
> -		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> -		new_order = pgdat->kswapd_max_order;
> -		pgdat->kswapd_max_order = 0;
> +		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> +		if (pgdat) {
> +			new_order = pgdat->kswapd_max_order;
> +			pgdat->kswapd_max_order = 0;
> +		} else
> +			new_order = 0;
> +
>  		if (order < new_order) {
>  			/*
>  			 * Don't sleep if someone wants a larger 'order'
> @@ -2428,10 +2445,12 @@ static int kswapd(void *p)
>  				long remaining = 0;
>  
>  				/* Try to sleep for a short interval */
> -				if (!sleeping_prematurely(pgdat, order, remaining)) {
> +				if (!sleeping_prematurely(kswapd_p, order,
> +							remaining)) {
>  					remaining = schedule_timeout(HZ/10);
> -					finish_wait(&pgdat->kswapd_wait, &wait);
> -					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> +					finish_wait(wait_h, &wait);
> +					prepare_to_wait(wait_h, &wait,
> +							TASK_INTERRUPTIBLE);
>  				}
>  
>  				/*
> @@ -2439,20 +2458,25 @@ static int kswapd(void *p)
>  				 * premature sleep. If not, then go fully
>  				 * to sleep until explicitly woken up
>  				 */
> -				if (!sleeping_prematurely(pgdat, order, remaining)) {
> -					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +				if (!sleeping_prematurely(kswapd_p, order,
> +								remaining)) {
> +					if (pgdat)
> +						trace_mm_vmscan_kswapd_sleep(
> +								pgdat->node_id);
>  					schedule();
>  				} else {
>  					if (remaining)
> -						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> +						count_vm_event(
> +						KSWAPD_LOW_WMARK_HIT_QUICKLY);
>  					else
> -						count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> +						count_vm_event(
> +						KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>  				}
>  			}
> -
> -			order = pgdat->kswapd_max_order;
> +			if (pgdat)
> +				order = pgdat->kswapd_max_order;
>  		}
> -		finish_wait(&pgdat->kswapd_wait, &wait);
> +		finish_wait(wait_h, &wait);
>  
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())
> @@ -2476,6 +2500,7 @@ static int kswapd(void *p)
>  void wakeup_kswapd(struct zone *zone, int order)
>  {
>  	pg_data_t *pgdat;
> +	wait_queue_head_t *wait;
>  
>  	if (!populated_zone(zone))
>  		return;
> @@ -2488,9 +2513,10 @@ void wakeup_kswapd(struct zone *zone, int order)
>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>  	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  		return;
> -	if (!waitqueue_active(&pgdat->kswapd_wait))
> +	wait = pgdat->kswapd_wait;
> +	if (!waitqueue_active(wait))
>  		return;
> -	wake_up_interruptible(&pgdat->kswapd_wait);
> +	wake_up_interruptible(wait);
>  }
>  
>  /*
> @@ -2587,7 +2613,10 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  
>  			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>  				/* One of our CPUs online: restore mask */
> -				set_cpus_allowed_ptr(pgdat->kswapd, mask);
> +				if (kswapds[nid].kswapd_task)
> +					set_cpus_allowed_ptr(
> +						kswapds[nid].kswapd_task,
> +						mask);
>  		}
>  	}
>  	return NOTIFY_OK;
> @@ -2599,19 +2628,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>   */
>  int kswapd_run(int nid)
>  {
> -	pg_data_t *pgdat = NODE_DATA(nid);
> +	struct task_struct *thr;
>  	int ret = 0;
>  
> -	if (pgdat->kswapd)
> +	if (kswapds[nid].kswapd_task)
>  		return 0;
>  
> -	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> -	if (IS_ERR(pgdat->kswapd)) {
> +	thr = kthread_run(kswapd, &kswapds[nid], "kswapd%d", nid);
> +	if (IS_ERR(thr)) {
>  		/* failure at boot is fatal */
>  		BUG_ON(system_state == SYSTEM_BOOTING);
>  		printk("Failed to start kswapd on node %d\n",nid);
>  		ret = -1;
>  	}
> +	kswapds[nid].kswapd_task = thr;
>  	return ret;
>  }
>  
> @@ -2620,10 +2650,20 @@ int kswapd_run(int nid)
>   */
>  void kswapd_stop(int nid)
>  {
> -	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
> +	struct task_struct *thr;
> +	struct kswapd *kswapd_p;
> +	wait_queue_head_t *wait;
> +
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +
> +	spin_lock(&kswapds_spinlock);
> +	wait = pgdat->kswapd_wait;
> +	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +	thr = kswapd_p->kswapd_task;
> +	spin_unlock(&kswapds_spinlock);
>  
> -	if (kswapd)
> -		kthread_stop(kswapd);
> +	if (thr)
> +		kthread_stop(thr);
>  }
>  
>  static int __init kswapd_init(void)
> -- 
> 1.7.3.1
> 
> 


* Re: [PATCH 2/4] Add per cgroup reclaim watermarks.
  2010-11-30  6:49 ` [PATCH 2/4] Add per cgroup reclaim watermarks Ying Han
@ 2010-11-30  7:21   ` KAMEZAWA Hiroyuki
  2010-11-30 20:44     ` Ying Han
  2010-12-07 14:56   ` Mel Gorman
  1 sibling, 1 reply; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  7:21 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, 29 Nov 2010 22:49:43 -0800
Ying Han <yinghan@google.com> wrote:

> The per cgroup kswapd is invoked at mem_cgroup_charge when the cgroup's memory
> usage above a threshold--low_wmark. Then the kswapd thread starts to reclaim
> pages in a priority loop similar to global algorithm. The kswapd is done if the
> memory usage below a threshold--high_wmark.
> 
> The per cgroup background reclaim is based on the per cgroup LRU and also adds
> per cgroup watermarks. There are two watermarks including "low_wmark" and
> "high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
> for each cgroup. Each time the hard_limit is change, the corresponding wmarks
> are re-calculated. Since memory controller charges only user pages, there is
> no need for a "min_wmark". The current calculation of wmarks is a function of
> "memory.min_free_kbytes" which could be adjusted by writing different values
> into the new api. This is added mainly for debugging purpose.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

A few points.

1. I can understand the motivation for including the low/high watermarks in
   res_counter. But, sadly, comparing them on every charge will make the counter slow.
   IMHO, as with the memory controller threshold-check or soft limit, checking the
   usage periodically based on an event counter is enough. It will be low cost.

2. min_free_kbytes must be calculated automatically.
   For example, max(3% of limit, 20MB) or similar (see the sketch after this list).

3. When you allow min_free_kbytes to be set by users, please compare
   it with the limit.
   I think the min_free_kbytes interface itself should be in another patch...
   interface code tends to make the patch bigger.
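
Point 2 could look something like this untested sketch (the helper name is
made up here; the 3%/20MB policy is just the example numbers above, not
anything in the patches):

/* untested sketch for point 2: derive min_free_kbytes from the memcg limit */
unsigned long memcg_auto_min_free_kbytes(unsigned long long limit_in_bytes)
{
	unsigned long long min_free = limit_in_bytes * 3 / 100;	/* 3% of limit */

	if (min_free < (20ULL << 20))				/* but at least 20MB */
		min_free = 20ULL << 20;

	return (unsigned long)(min_free >> 10);			/* bytes -> kbytes */
}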



> ---
>  include/linux/memcontrol.h  |    1 +
>  include/linux/res_counter.h |   88 ++++++++++++++++++++++++++++++-
>  kernel/res_counter.c        |   26 ++++++++--
>  mm/memcontrol.c             |  123 +++++++++++++++++++++++++++++++++++++++++--
>  mm/vmscan.c                 |   10 ++++
>  5 files changed, 238 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 159a076..90fe7fe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,6 +76,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index fcb9884..eed12c5 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -39,6 +39,16 @@ struct res_counter {
>  	 */
>  	unsigned long long soft_limit;
>  	/*
> +	 * the limit that reclaim triggers. TODO: res_counter in mem
> +	 * or wmark_limit.
> +	 */
> +	unsigned long long low_wmark_limit;
> +	/*
> +	 * the limit that reclaim stops. TODO: res_counter in mem or
> +	 * wmark_limit.
> +	 */
> +	unsigned long long high_wmark_limit;
> +	/*
>  	 * the number of unsuccessful attempts to consume the resource
>  	 */
>  	unsigned long long failcnt;
> @@ -55,6 +65,10 @@ struct res_counter {
>  
>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
>  
> +#define CHARGE_WMARK_MIN	0x01
> +#define CHARGE_WMARK_LOW	0x02
> +#define CHARGE_WMARK_HIGH	0x04
> +
>  /**
>   * Helpers to interact with userspace
>   * res_counter_read_u64() - returns the value of the specified member.
> @@ -92,6 +106,8 @@ enum {
>  	RES_LIMIT,
>  	RES_FAILCNT,
>  	RES_SOFT_LIMIT,
> +	RES_LOW_WMARK_LIMIT,
> +	RES_HIGH_WMARK_LIMIT
>  };
>  
>  /*
> @@ -112,9 +128,10 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
>   */
>  
>  int __must_check res_counter_charge_locked(struct res_counter *counter,
> -		unsigned long val);
> +		unsigned long val, int charge_flags);
>  int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> +		unsigned long val, int charge_flags,
> +		struct res_counter **limit_fail_at);
>  
>  /*
>   * uncharge - tell that some portion of the resource is released
> @@ -145,6 +162,24 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
>  	return false;
>  }
>  
> +static inline bool
> +res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->high_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +
> +static inline bool
> +res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->low_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +
>  /**
>   * Get the difference between the usage and the soft limit
>   * @cnt: The counter
> @@ -193,6 +228,30 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
>  	return ret;
>  }
>  
> +static inline bool
> +res_counter_check_under_low_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_low_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
> +static inline bool
> +res_counter_check_under_high_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_high_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
>  static inline void res_counter_reset_max(struct res_counter *cnt)
>  {
>  	unsigned long flags;
> @@ -220,6 +279,8 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
>  	spin_lock_irqsave(&cnt->lock, flags);
>  	if (cnt->usage <= limit) {
>  		cnt->limit = limit;
> +		cnt->low_wmark_limit = limit;
> +		cnt->high_wmark_limit = limit;
>  		ret = 0;
>  	}
>  	spin_unlock_irqrestore(&cnt->lock, flags);
> @@ -238,4 +299,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
>  	return 0;
>  }
>  
> +static inline int
> +res_counter_set_high_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->high_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
> +
> +static inline int
> +res_counter_set_low_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->low_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
>  #endif
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index c7eaa37..a524349 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -19,12 +19,26 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
>  	spin_lock_init(&counter->lock);
>  	counter->limit = RESOURCE_MAX;
>  	counter->soft_limit = RESOURCE_MAX;
> +	counter->low_wmark_limit = RESOURCE_MAX;
> +	counter->high_wmark_limit = RESOURCE_MAX;
>  	counter->parent = parent;
>  }
>  
> -int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> +int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
> +				int charge_flags)
>  {
> -	if (counter->usage + val > counter->limit) {
> +	unsigned long long limit = 0;
> +
> +	if (charge_flags & CHARGE_WMARK_LOW)
> +		limit = counter->low_wmark_limit;
> +
> +	if (charge_flags & CHARGE_WMARK_HIGH)
> +		limit = counter->high_wmark_limit;
> +
> +	if (charge_flags & CHARGE_WMARK_MIN)
> +		limit = counter->limit;
> +
> +	if (counter->usage + val > limit) {
>  		counter->failcnt++;
>  		return -ENOMEM;
>  	}
> @@ -36,7 +50,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>  }
>  
>  int res_counter_charge(struct res_counter *counter, unsigned long val,
> -			struct res_counter **limit_fail_at)
> +			int charge_flags, struct res_counter **limit_fail_at)
>  {
>  	int ret;
>  	unsigned long flags;
> @@ -46,7 +60,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
>  		spin_lock(&c->lock);
> -		ret = res_counter_charge_locked(c, val);
> +		ret = res_counter_charge_locked(c, val, charge_flags);
>  		spin_unlock(&c->lock);
>  		if (ret < 0) {
>  			*limit_fail_at = c;
> @@ -103,6 +117,10 @@ res_counter_member(struct res_counter *counter, int member)
>  		return &counter->failcnt;
>  	case RES_SOFT_LIMIT:
>  		return &counter->soft_limit;
> +	case RES_LOW_WMARK_LIMIT:
> +		return &counter->low_wmark_limit;
> +	case RES_HIGH_WMARK_LIMIT:
> +		return &counter->high_wmark_limit;
>  	};
>  
>  	BUG();
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index dca3590..a0c6ed9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -265,6 +265,7 @@ struct mem_cgroup {
>  	spinlock_t pcp_counter_lock;
>  
>  	wait_queue_head_t *kswapd_wait;
> +	unsigned long min_free_kbytes;
>  };
>  
>  /* Stuffs for move charges at task migration. */
> @@ -370,6 +371,7 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  static void drain_all_stock_async(void);
> +static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
>  
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -796,6 +798,32 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>  	return (mem == root_mem_cgroup);
>  }
>  
> +void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> +{
> +	u64 limit;
> +	unsigned long min_free_kbytes;
> +
> +	min_free_kbytes = get_min_free_kbytes(mem);
> +	limit = mem_cgroup_get_limit(mem);
> +	if (min_free_kbytes == 0) {
> +		res_counter_set_low_wmark_limit(&mem->res, limit);
> +		res_counter_set_high_wmark_limit(&mem->res, limit);
> +	} else {
> +		unsigned long page_min = min_free_kbytes >> (PAGE_SHIFT - 10);
> +		unsigned long lowmem_pages = 2048;
> +		unsigned long low_wmark, high_wmark;
> +		u64 tmp;
> +
> +		tmp = (u64)page_min * limit;
> +		do_div(tmp, lowmem_pages);
> +
> +		low_wmark = tmp + (tmp >> 1);
> +		high_wmark = tmp + (tmp >> 2);
> +		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> +		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> +	}
> +}
> +
>  /*
>   * Following LRU functions are allowed to be used without PCG_LOCK.
>   * Operations are called by routine of global LRU independently from memcg.
> @@ -1148,6 +1176,22 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +static unsigned long get_min_free_kbytes(struct mem_cgroup *memcg)
> +{
> +	struct cgroup *cgrp = memcg->css.cgroup;
> +	unsigned long min_free_kbytes;
> +
> +	/* root ? */
> +	if (cgrp == NULL || cgrp->parent == NULL)
> +		return 0;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	min_free_kbytes = memcg->min_free_kbytes;
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return min_free_kbytes;
> +}
> +
>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>  {
>  	int cpu;
> @@ -1844,12 +1888,13 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	unsigned long flags = 0;
>  	int ret;
>  
> -	ret = res_counter_charge(&mem->res, csize, &fail_res);
> +	ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
>  
>  	if (likely(!ret)) {
>  		if (!do_swap_account)
>  			return CHARGE_OK;
> -		ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> +		ret = res_counter_charge(&mem->memsw, csize, CHARGE_WMARK_MIN,
> +					&fail_res);
>  		if (likely(!ret))
>  			return CHARGE_OK;
>  
> @@ -3733,6 +3778,37 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +static u64 mem_cgroup_min_free_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	return get_min_free_kbytes(memcg);
> +}
> +
> +static int mem_cgroup_min_free_write(struct cgroup *cgrp, struct cftype *cfg,
> +				     u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	struct mem_cgroup *parent;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +
> +	parent = mem_cgroup_from_cont(cgrp->parent);
> +
> +	cgroup_lock();
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	memcg->min_free_kbytes = val;
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	cgroup_unlock();

Why is cgroup_lock required ?

Thanks,
-Kame

> +
> +	setup_per_memcg_wmarks(memcg);
> +	return 0;
> +
> +}
> +
>  static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
>  {
>  	struct mem_cgroup_threshold_ary *t;
> @@ -4024,6 +4100,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
>  	mutex_unlock(&memcg_oom_mutex);
>  }
>  
> +static int mem_cgroup_wmark_read(struct cgroup *cgrp,
> +	struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	unsigned long low_wmark, high_wmark;
> +
> +	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
> +	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> +
> +	cb->fill(cb, "memcg_low_wmark", low_wmark);
> +	cb->fill(cb, "memcg_high_wmark", high_wmark);
> +
> +	return 0;
> +}
> +
>  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>  	struct cftype *cft,  struct cgroup_map_cb *cb)
>  {
> @@ -4127,6 +4218,15 @@ static struct cftype mem_cgroup_files[] = {
>  		.unregister_event = mem_cgroup_oom_unregister_event,
>  		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>  	},
> +	{
> +		.name = "min_free_kbytes",
> +		.write_u64 = mem_cgroup_min_free_write,
> +		.read_u64 = mem_cgroup_min_free_read,
> +	},
> +	{
> +		.name = "reclaim_wmarks",
> +		.read_map = mem_cgroup_wmark_read,
> +	},
>  };
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -4308,6 +4408,19 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>  
> +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> +				int charge_flags)
> +{
> +	long ret = 0;
> +
> +	if (charge_flags & CHARGE_WMARK_LOW)
> +		ret = res_counter_check_under_low_wmark_limit(&mem->res);
> +	if (charge_flags & CHARGE_WMARK_HIGH)
> +		ret = res_counter_check_under_high_wmark_limit(&mem->res);
> +
> +	return ret;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> @@ -4450,10 +4563,12 @@ static int mem_cgroup_do_precharge(unsigned long count)
>  		 * are still under the same cgroup_mutex. So we can postpone
>  		 * css_get().
>  		 */
> -		if (res_counter_charge(&mem->res, PAGE_SIZE * count, &dummy))
> +		if (res_counter_charge(&mem->res, PAGE_SIZE * count,
> +					CHARGE_WMARK_MIN, &dummy))
>  			goto one_by_one;
>  		if (do_swap_account && res_counter_charge(&mem->memsw,
> -						PAGE_SIZE * count, &dummy)) {
> +						PAGE_SIZE * count,
> +						CHARGE_WMARK_MIN, &dummy)) {
>  			res_counter_uncharge(&mem->res, PAGE_SIZE * count);
>  			goto one_by_one;
>  		}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e08005e..6d5702b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -46,6 +46,8 @@
>  
>  #include <linux/swapops.h>
>  
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -2127,11 +2129,19 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
>  {
>  	int i;
>  	pg_data_t *pgdat = kswapd->kswapd_pgdat;
> +	struct mem_cgroup *mem = kswapd->kswapd_mem;
>  
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
>  		return 1;
>  
> +	if (mem) {
> +		if (!mem_cgroup_watermark_ok(kswapd->kswapd_mem,
> +						CHARGE_WMARK_HIGH))
> +			return 1;
> +		return 0;
> +	}
> +
>  	/* If after HZ/10, a zone is below the high mark, it's premature */
>  	for (i = 0; i < pgdat->nr_zones; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
> -- 
> 1.7.3.1
> 
> 


* Re: [PATCH 3/4] Per cgroup background reclaim.
  2010-11-30  6:49 ` [PATCH 3/4] Per cgroup background reclaim Ying Han
@ 2010-11-30  7:51   ` KAMEZAWA Hiroyuki
  2010-11-30  8:07     ` KAMEZAWA Hiroyuki
                       ` (2 more replies)
  2010-12-01  2:18   ` KOSAKI Motohiro
  1 sibling, 3 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  7:51 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, 29 Nov 2010 22:49:44 -0800
Ying Han <yinghan@google.com> wrote:

> The current implementation of memcg only supports direct reclaim and this
> patch adds the support for background reclaim. Per cgroup background reclaim
> is needed which spreads out the memory pressure over longer period of time
> and smoothes out the system performance.
> 
> There is a kswapd kernel thread for each memory node. We add a different kswapd
> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
> field of a kswapd descriptor.
> 
> The kswapd() function now is shared between global and per cgroup kswapd thread.
> It is passed in with the kswapd descriptor which contains the information of
> either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
> if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
> priority loop similar to global reclaim. In each iteration it invokes
> balance_pgdat_node for all nodes on the system, which is a new function performs
> background reclaim per node. After reclaiming each node, it checks
> mem_cgroup_watermark_ok() and breaks the priority loop if returns true. A per
> memcg zone will be marked as "unreclaimable" if the scanning rate is much
> greater than the reclaiming rate on the per cgroup LRU. The bit is cleared when
> there is a page charged to the cgroup being freed. Kswapd breaks the priority
> loop if all the zones are marked as "unreclaimable".
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h |   30 +++++++
>  mm/memcontrol.c            |  182 ++++++++++++++++++++++++++++++++++++++-
>  mm/page_alloc.c            |    2 +
>  mm/vmscan.c                |  205 +++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 416 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 90fe7fe..dbed45d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -127,6 +127,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>  
> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone);
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +					unsigned long nr_scanned);
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
>  
> @@ -299,6 +305,25 @@ static inline void mem_cgroup_update_file_mapped(struct page *page,
>  {
>  }
>  
> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
> +						struct zone *zone,
> +						unsigned long nr_scanned)
> +{
> +}
> +
> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
> +							struct zone *zone)
> +{
> +}
> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
> +		struct zone *zone)
> +{
> +}
> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
> +						struct zone *zone)
> +{
> +}
> +
>  static inline
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  					    gfp_t gfp_mask)
> @@ -312,6 +337,11 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>  	return 0;
>  }
>  
> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
> +								int zid)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a0c6ed9..1d39b65 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -48,6 +48,8 @@
>  #include <linux/page_cgroup.h>
>  #include <linux/cpu.h>
>  #include <linux/oom.h>
> +#include <linux/kthread.h>
> +
>  #include "internal.h"
>  
>  #include <asm/uaccess.h>
> @@ -118,7 +120,10 @@ struct mem_cgroup_per_zone {
>  	bool			on_tree;
>  	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
>  						/* use container_of	   */
> +	unsigned long		pages_scanned;	/* since last reclaim */
> +	int			all_unreclaimable;	/* All pages pinned */
>  };
> +
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
>  
> @@ -372,6 +377,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  static void drain_all_stock_async(void);
>  static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
> +static inline void wake_memcg_kswapd(struct mem_cgroup *mem);
>  
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -1086,6 +1092,106 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>  	return &mz->reclaim_stat;
>  }
>  
> +unsigned long mem_cgroup_zone_reclaimable_pages(
> +					struct mem_cgroup_per_zone *mz)
> +{
> +	int nr;
> +	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
> +		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
> +
> +	if (nr_swap_pages > 0)
> +		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
> +			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +						unsigned long nr_scanned)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->pages_scanned += nr_scanned;
> +}
> +
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +	if (!mem)
> +		return 0;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->pages_scanned <
> +				mem_cgroup_zone_reclaimable_pages(mz) * 6;
> +	return 0;
> +}
> +
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return 0;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->all_unreclaimable;
> +
> +	return 0;
> +}
> +
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->all_unreclaimable = 1;
> +}
> +
> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	struct mem_cgroup *mem = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +	struct page_cgroup *pc = lookup_page_cgroup(page);
> +
> +	if (unlikely(!pc))
> +		return;
> +
> +	rcu_read_lock();
> +	mem = pc->mem_cgroup;

This is incorrect. You have to do css_tryget(&mem->css) before rcu_read_unlock().
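i.e. something along these lines (untested sketch; the css_put() at the end
assumes nothing else holds a reference for you):

	rcu_read_lock();
	mem = pc->mem_cgroup;
	if (!mem || !css_tryget(&mem->css)) {
		rcu_read_unlock();
		return;
	}
	rcu_read_unlock();

	mz = mem_cgroup_zoneinfo(mem, nid, zid);
	if (mz) {
		mz->pages_scanned = 0;
		mz->all_unreclaimable = 0;
	}
	css_put(&mem->css);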

> +	rcu_read_unlock();
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz) {
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = 0;
> +	}
> +
> +	return;
> +}
> +
>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -1887,6 +1993,20 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	struct res_counter *fail_res;
>  	unsigned long flags = 0;
>  	int ret;
> +	unsigned long min_free_kbytes = 0;
> +
> +	min_free_kbytes = get_min_free_kbytes(mem);
> +	if (min_free_kbytes) {
> +		ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
> +					&fail_res);
> +		if (likely(!ret)) {
> +			return CHARGE_OK;
> +		} else {
> +			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> +									res);
> +			wake_memcg_kswapd(mem_over_limit);
> +		}
> +	}

I think this check can be moved out of the charge path and done as a periodic
check, like the threshold notifiers.
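i.e. something like this, driven from the same event-counter path that already
rate-limits the threshold/softlimit checks (sketch only; both helpers below are
from this patchset):

  if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
          wake_memcg_kswapd(mem);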



>  
>  	ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
>  
> @@ -3037,6 +3157,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  			else
>  				memcg->memsw_is_minimum = false;
>  		}
> +		setup_per_memcg_wmarks(memcg);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -3046,7 +3167,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  						MEM_CGROUP_RECLAIM_SHRINK);
>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>  		/* Usage is reduced ? */
> -  		if (curusage >= oldusage)
> +		if (curusage >= oldusage)
>  			retry_count--;
>  		else
>  			oldusage = curusage;

What's changed here ?

> @@ -3096,6 +3217,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  			else
>  				memcg->memsw_is_minimum = false;
>  		}
> +		setup_per_memcg_wmarks(memcg);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -4352,6 +4474,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  static void __mem_cgroup_free(struct mem_cgroup *mem)
>  {
>  	int node;
> +	struct kswapd *kswapd_p;
> +	wait_queue_head_t *wait;
>  
>  	mem_cgroup_remove_from_trees(mem);
>  	free_css_id(&mem_cgroup_subsys, &mem->css);
> @@ -4360,6 +4484,15 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>  		free_mem_cgroup_per_zone_info(mem, node);
>  
>  	free_percpu(mem->stat);
> +
> +	wait = mem->kswapd_wait;
> +	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +	if (kswapd_p) {
> +		if (kswapd_p->kswapd_task)
> +			kthread_stop(kswapd_p->kswapd_task);
> +		kfree(kswapd_p);
> +	}
> +
>  	if (sizeof(struct mem_cgroup) < PAGE_SIZE)
>  		kfree(mem);
>  	else
> @@ -4421,6 +4554,39 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
>  	return ret;
>  }
>  
> +static inline
> +void wake_memcg_kswapd(struct mem_cgroup *mem)
> +{
> +	wait_queue_head_t *wait;
> +	struct kswapd *kswapd_p;
> +	struct task_struct *thr;
> +	static char memcg_name[PATH_MAX];
> +
> +	if (!mem)
> +		return;
> +
> +	wait = mem->kswapd_wait;
> +	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +	if (!kswapd_p->kswapd_task) {
> +		if (mem->css.cgroup)
> +			cgroup_path(mem->css.cgroup, memcg_name, PATH_MAX);
> +		else
> +			sprintf(memcg_name, "no_name");
> +
> +		thr = kthread_run(kswapd, kswapd_p, "kswapd%s", memcg_name);

I don't think reusing the name "kswapd" is a good idea, and the name cannot be
as long as PATH_MAX... IIUC, this name ends up in the comm[] field, which is only
16 bytes long.

So, how about naming this as

  "memcg%d", mem->css.id ?

Exporting css.id will be okay if necessary.
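E.g. (one-line sketch; css_id() is the existing accessor and the id should
already be assigned by the time wake_memcg_kswapd() runs):

  thr = kthread_run(kswapd, kswapd_p, "memcg%d", css_id(&mem->css));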



> +		if (IS_ERR(thr))
> +			printk(KERN_INFO "Failed to start kswapd on memcg %d\n",
> +				0);
> +		else
> +			kswapd_p->kswapd_task = thr;
> +	}

Hmm, ok, so kswapd-for-memcg is created when someone goes over the watermark.
Why does this new kswapd not exit() until the memcg is destroyed?

I think there are several approaches.

  1. create/destroy a thread at memcg create/destroy
  2. create/destroy a thread at the watermarks.
  3. use a thread pool for the watermarks.
  4. use a workqueue for the watermarks.

The good point of "1" is that we can control the thread-for-kswapd with the cpu
controller, but it uses some resources even when idle.
The good point of "2" is that we can avoid unnecessary resource usage.

3 and 4 are not very good, I think.

I'd like to vote for "1"... I want to avoid a bad application in one container
using up memory and thereby "stealing" other containers' cpu.
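A rough sketch of "1" on top of the descriptor from patch 1/4 (the destroy side
is already in __mem_cgroup_free(); whether css_id() is usable this early is an
assumption):

  /* mem_cgroup_create(), right after kswapd_p/kswapd_wait are set up */
  kswapd_p->kswapd_task = kthread_run(kswapd, kswapd_p,
                                      "memcg%d", css_id(&mem->css));
  if (IS_ERR(kswapd_p->kswapd_task))
          kswapd_p->kswapd_task = NULL;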




> +
> +	if (!waitqueue_active(wait)) {
> +		return;
> +	}
> +	wake_up_interruptible(wait);
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> @@ -4452,6 +4618,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	struct mem_cgroup *mem, *parent;
>  	long error = -ENOMEM;
>  	int node;
> +	struct kswapd *kswapd_p = NULL;
>  
>  	mem = mem_cgroup_alloc();
>  	if (!mem)
> @@ -4499,6 +4666,19 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	spin_lock_init(&mem->reclaim_param_lock);
>  	INIT_LIST_HEAD(&mem->oom_notify);
>  
> +
> +	if (!mem_cgroup_is_root(mem)) {
> +		kswapd_p = kmalloc(sizeof(struct kswapd), GFP_KERNEL);
> +		if (!kswapd_p) {
> +			printk(KERN_INFO "Failed to kmalloc kswapd_p %d\n", 0);
> +			goto free_out;
> +		}
> +		memset(kswapd_p, 0, sizeof(struct kswapd));
> +		init_waitqueue_head(&kswapd_p->kswapd_wait);
> +		mem->kswapd_wait = &kswapd_p->kswapd_wait;
> +		kswapd_p->kswapd_mem = mem;
> +	}
> +
>  	if (parent)
>  		mem->swappiness = get_swappiness(parent);
>  	atomic_set(&mem->refcnt, 1);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a15bc1c..dc61f2a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -615,6 +615,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  
>  		do {
>  			page = list_entry(list->prev, struct page, lru);
> +			mem_cgroup_clear_unreclaimable(page, zone);
>  			/* must delete as __free_one_page list manipulates */
>  			list_del(&page->lru);
>  			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> @@ -632,6 +633,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
>  	spin_lock(&zone->lock);
>  	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
> +	mem_cgroup_clear_unreclaimable(page, zone);
>  
>  	__free_one_page(page, zone, order, migratetype);
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6d5702b..f8430c4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -100,6 +100,8 @@ struct scan_control {
>  	 * are scanned.
>  	 */
>  	nodemask_t	*nodemask;
> +
> +	int priority;
>  };
>  
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> @@ -2380,6 +2382,201 @@ out:
>  	return sc.nr_reclaimed;
>  }
>  

Because you rewrite all of this code below anyway, I don't think merging it with
kswapd is necessary.


> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * TODO: the same function is used for global LRU and memcg LRU. For global
> + * LRU, the kswapd is done until all this node's zones are at
> + * high_wmark_pages(zone) or zone->all_unreclaimable.
> + */
> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> +					struct scan_control *sc)
> +{
> +	int i, end_zone;
> +	unsigned long total_scanned;
> +	struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +	int priority = sc->priority;
> +	int nid = pgdat->node_id;
> +
> +	/*
> +	 * Scan in the highmem->dma direction for the highest
> +	 * zone which needs scanning
> +	 */
> +	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> +				priority != DEF_PRIORITY)
> +			continue;
> +		/*
> +		 * Do some background aging of the anon list, to give
> +		 * pages a chance to be referenced before reclaiming.
> +		 */
> +		if (inactive_anon_is_low(zone, sc))
> +			shrink_active_list(SWAP_CLUSTER_MAX, zone,
> +							sc, priority, 0);
> +
> +		end_zone = i;
> +		goto scan;
> +	}
> +	return;
> +
> +scan:
> +	total_scanned = 0;
> +	/*
> +	 * Now scan the zone in the dma->highmem direction, stopping
> +	 * at the last zone which needs scanning.
> +	 *
> +	 * We do this because the page allocator works in the opposite
> +	 * direction.  This prevents the page allocator from allocating
> +	 * pages behind kswapd's direction of progress, which would
> +	 * cause too much scanning of the lower zones.
> +	 */
> +	for (i = 0; i <= end_zone; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> +			priority != DEF_PRIORITY)
> +			continue;
> +
> +		sc->nr_scanned = 0;
> +		shrink_zone(priority, zone, sc);
> +		total_scanned += sc->nr_scanned;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
> +			continue;
> +
> +		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
> +			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
> +
> +		/*
> +		 * If we've done a decent amount of scanning and
> +		 * the reclaim ratio is low, start doing writepage
> +		 * even in laptop mode
> +		 */
> +		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> +			sc->may_writepage = 1;
> +		}
> +	}
> +
> +	sc->nr_scanned = total_scanned;
> +	return;
> +}
> +
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +					      int order)
> +{
> +	unsigned long total_scanned = 0;
> +	int i;
> +	int priority;
> +	int wmark_ok, nid;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		/*
> +		 * kswapd doesn't want to be bailed out while reclaim. because
> +		 * we want to put equal scanning pressure on each zone.
> +		 * TODO: this might not be true for the memcg background
> +		 * reclaim.
> +		 */
> +		.nr_to_reclaim = ULONG_MAX,
> +		.swappiness = vm_swappiness,
> +		.order = order,
> +		.mem_cgroup = mem_cont,
> +	};
> +	DECLARE_BITMAP(do_nodes, MAX_NUMNODES);
> +
> +	/*
> +	 * bitmap to indicate which node to reclaim pages from. Initially we
> +	 * assume all nodes need reclaim.
> +	 */
> +	bitmap_fill(do_nodes, MAX_NUMNODES);
> +

Hmm..

> +loop_again:
> +	sc.may_writepage = !laptop_mode;
> +	sc.nr_reclaimed = 0;
> +	total_scanned = 0;
> +
> +	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +		sc.priority = priority;
> +
> +		/* The swap token gets in the way of swapout... */
> +		if (!priority)
> +			disable_swap_token();
> +
> +
> +		for_each_online_node(nid) {
> +			pg_data_t *pgdat = NODE_DATA(nid);
> +
> +			wmark_ok = 1;
> +
> +			if (!test_bit(nid, do_nodes))
> +				continue;
> +

Then reclaim always starts from node "0"... that's not good.

If you use a bitmap, could you add fairness among the nodes?

as:
  node = select_next_victim_node(mem);

This function would select the next node to scan while keeping fairness
between nodes.
(Because memcg doesn't take care of NODE placement and only takes care of
 the "amount", we don't know which node is best to reclaim from.)


> +			balance_pgdat_node(pgdat, order, &sc);
> +			total_scanned += sc.nr_scanned;
> +
> +			for (i = pgdat->nr_zones - 1; i >= 0; i++) {
> +				struct zone *zone = pgdat->node_zones + i;
> +
> +				if (!populated_zone(zone))
> +					continue;
> +
> +				if (!mem_cgroup_mz_unreclaimable(mem_cont,
> +								zone)) {
> +					__set_bit(nid, do_nodes);
> +					break;
> +				}
> +			}
> +
> +			if (i < 0)
> +				__clear_bit(nid, do_nodes);
> +
> +			if (!mem_cgroup_watermark_ok(sc.mem_cgroup,
> +							CHARGE_WMARK_HIGH))
> +				wmark_ok = 0;
> +
> +			if (wmark_ok) {
> +				goto out;
> +			}
> +		}
> +
> +		if (wmark_ok)
> +			break;
> +
> +		if (total_scanned && priority < DEF_PRIORITY - 2)
> +			congestion_wait(WRITE, HZ/10);
> +
> +		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> +			break;
> +	}
> +
> +out:
> +	if (!wmark_ok) {
> +		cond_resched();
> +
> +		try_to_freeze();
> +
> +		goto loop_again;
> +	}
> +
> +	return sc.nr_reclaimed;
> +}
> +#else
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +							int order)
> +{
> +	return 0;
> +}
> +#endif
> +
>  /*
>   * The background pageout daemon, started as a kernel thread
>   * from the init process.
> @@ -2497,8 +2694,12 @@ int kswapd(void *p)
>  		 * after returning from the refrigerator
>  		 */
>  		if (!ret) {
> -			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> -			balance_pgdat(pgdat, order);
> +			if (pgdat) {
> +				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> +								order);
> +				balance_pgdat(pgdat, order);
> +			} else
> +				balance_mem_cgroup_pgdat(mem, order);

mem_cgroup's order is always 0.

Thanks,
-Kame


* Re: [PATCH 4/4] Add more per memcg stats.
  2010-11-30  6:49 ` [PATCH 4/4] Add more per memcg stats Ying Han
@ 2010-11-30  7:53   ` KAMEZAWA Hiroyuki
  2010-11-30 18:22     ` Ying Han
  0 siblings, 1 reply; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  7:53 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, 29 Nov 2010 22:49:45 -0800
Ying Han <yinghan@google.com> wrote:

> A bunch of statistics are added in memory.stat to monitor per cgroup
> kswapd performance.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

No objections. But please update the documentation and add more comments.

Thanks,
-Kame

> ---
>  include/linux/memcontrol.h |   81 +++++++++++++++++++++++++
>  mm/memcontrol.c            |  140 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |   33 +++++++++-
>  3 files changed, 250 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index dbed45d..893ca62 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -127,6 +127,19 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>  
> +/* background reclaim stats */
> +void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_pg_steal(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_pgrefill(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_pg_outrun(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_alloc_stall(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_balance_wmark_ok(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_balance_swap_max(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *memcg, int val);
> +void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *memcg, int val);
> +
>  void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone);
>  bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
>  bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> @@ -337,6 +350,74 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>  	return 0;
>  }
>  
> +/* background reclaim stats */
> +static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_balance_wmark_ok(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_balance_swap_max(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
> +
> +static inline void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *memcg,
> +								int val)
> +{
> +	return 0;
> +}
> +
>  static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
>  								int zid)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1d39b65..97df6dd 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -91,6 +91,21 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_KSWAPD_INVOKE, /* # of times invokes kswapd */
> +	MEM_CGROUP_STAT_KSWAPD_STEAL, /* # of pages reclaimed from kswapd */
> +	MEM_CGROUP_STAT_PG_PGSTEAL, /* # of pages reclaimed from ttfp */
> +	MEM_CGROUP_STAT_KSWAPD_PGSCAN, /* # of pages scanned from kswapd */
> +	MEM_CGROUP_STAT_PG_PGSCAN, /* # of pages scanned from ttfp */
> +	MEM_CGROUP_STAT_PGREFILL, /* # of pages scanned on active list */
> +	MEM_CGROUP_STAT_WMARK_LOW_OK,
> +	MEM_CGROUP_STAT_KSWAP_CREAT,
> +	MEM_CGROUP_STAT_PGOUTRUN,
> +	MEM_CGROUP_STAT_ALLOCSTALL,
> +	MEM_CGROUP_STAT_BALANCE_WMARK_OK,
> +	MEM_CGROUP_STAT_BALANCE_SWAP_MAX,
> +	MEM_CGROUP_STAT_WAITQUEUE,
> +	MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE,
> +	MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE,
>  	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>  	/* incremented at every  pagein/pageout */
>  	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
> @@ -619,6 +634,62 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
>  	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
>  }
>  
> +void mem_cgroup_kswapd_steal(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_STEAL], val);
> +}
> +
> +void mem_cgroup_pg_steal(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PG_PGSTEAL], val);
> +}
> +
> +void mem_cgroup_kswapd_pgscan(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_PGSCAN], val);
> +}
> +
> +void mem_cgroup_pg_pgscan(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PG_PGSCAN], val);
> +}
> +
> +void mem_cgroup_pgrefill(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PGREFILL], val);
> +}
> +
> +void mem_cgroup_pg_outrun(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PGOUTRUN], val);
> +}
> +
> +void mem_cgroup_alloc_stall(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_ALLOCSTALL], val);
> +}
> +
> +void mem_cgroup_balance_wmark_ok(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_BALANCE_WMARK_OK], val);
> +}
> +
> +void mem_cgroup_balance_swap_max(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_BALANCE_SWAP_MAX], val);
> +}
> +
> +void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE], val);
> +}
> +
> +void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *mem, int val)
> +{
> +	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE],
> +			val);
> +}
> +
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  					 struct page_cgroup *pc,
>  					 bool charge)
> @@ -2000,8 +2071,14 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  		ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
>  					&fail_res);
>  		if (likely(!ret)) {
> +			this_cpu_add(
> +				mem->stat->count[MEM_CGROUP_STAT_WMARK_LOW_OK],
> +				1);
>  			return CHARGE_OK;
>  		} else {
> +			this_cpu_add(
> +				mem->stat->count[MEM_CGROUP_STAT_KSWAPD_INVOKE],
> +				1);
>  			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
>  									res);
>  			wake_memcg_kswapd(mem_over_limit);
> @@ -3723,6 +3800,21 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_KSWAPD_INVOKE,
> +	MCS_KSWAPD_STEAL,
> +	MCS_PG_PGSTEAL,
> +	MCS_KSWAPD_PGSCAN,
> +	MCS_PG_PGSCAN,
> +	MCS_PGREFILL,
> +	MCS_WMARK_LOW_OK,
> +	MCS_KSWAP_CREAT,
> +	MCS_PGOUTRUN,
> +	MCS_ALLOCSTALL,
> +	MCS_BALANCE_WMARK_OK,
> +	MCS_BALANCE_SWAP_MAX,
> +	MCS_WAITQUEUE,
> +	MCS_KSWAPD_SHRINK_ZONE,
> +	MCS_KSWAPD_MAY_WRITEPAGE,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3745,6 +3837,21 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"kswapd_invoke", "total_kswapd_invoke"},
> +	{"kswapd_steal", "total_kswapd_steal"},
> +	{"pg_pgsteal", "total_pg_pgsteal"},
> +	{"kswapd_pgscan", "total_kswapd_pgscan"},
> +	{"pg_scan", "total_pg_scan"},
> +	{"pgrefill", "total_pgrefill"},
> +	{"wmark_low_ok", "total_wmark_low_ok"},
> +	{"kswapd_create", "total_kswapd_create"},
> +	{"pgoutrun", "total_pgoutrun"},
> +	{"allocstall", "total_allocstall"},
> +	{"balance_wmark_ok", "total_balance_wmark_ok"},
> +	{"balance_swap_max", "total_balance_swap_max"},
> +	{"waitqueue", "total_waitqueue"},
> +	{"kswapd_shrink_zone", "total_kswapd_shrink_zone"},
> +	{"kswapd_may_writepage", "total_kswapd_may_writepage"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3773,6 +3880,37 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	/* kswapd stat */
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_INVOKE);
> +	s->stat[MCS_KSWAPD_INVOKE] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_STEAL);
> +	s->stat[MCS_KSWAPD_STEAL] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PG_PGSTEAL);
> +	s->stat[MCS_PG_PGSTEAL] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_PGSCAN);
> +	s->stat[MCS_KSWAPD_PGSCAN] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PG_PGSCAN);
> +	s->stat[MCS_PG_PGSCAN] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGREFILL);
> +	s->stat[MCS_PGREFILL] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WMARK_LOW_OK);
> +	s->stat[MCS_WMARK_LOW_OK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAP_CREAT);
> +	s->stat[MCS_KSWAP_CREAT] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGOUTRUN);
> +	s->stat[MCS_PGOUTRUN] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_ALLOCSTALL);
> +	s->stat[MCS_ALLOCSTALL] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_BALANCE_WMARK_OK);
> +	s->stat[MCS_BALANCE_WMARK_OK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_BALANCE_SWAP_MAX);
> +	s->stat[MCS_BALANCE_SWAP_MAX] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WAITQUEUE);
> +	s->stat[MCS_WAITQUEUE] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE);
> +	s->stat[MCS_KSWAPD_SHRINK_ZONE] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE);
> +	s->stat[MCS_KSWAPD_MAY_WRITEPAGE] += val;
>  
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -4579,9 +4717,11 @@ void wake_memcg_kswapd(struct mem_cgroup *mem)
>  				0);
>  		else
>  			kswapd_p->kswapd_task = thr;
> +		this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAP_CREAT], 1);
>  	}
>  
>  	if (!waitqueue_active(wait)) {
> +		this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_WAITQUEUE], 1);
>  		return;
>  	}
>  	wake_up_interruptible(wait);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f8430c4..5b0c349 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1389,10 +1389,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  					ISOLATE_INACTIVE : ISOLATE_BOTH,
>  			zone, sc->mem_cgroup,
>  			0, file);
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
>  		/*
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
>  		 */
> +		if (current_is_kswapd())
> +			mem_cgroup_kswapd_pgscan(sc->mem_cgroup, nr_scanned);
> +		else
> +			mem_cgroup_pg_pgscan(sc->mem_cgroup, nr_scanned);
>  	}
>  
>  	if (nr_taken == 0) {
> @@ -1413,9 +1418,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  	}
>  
>  	local_irq_disable();
> -	if (current_is_kswapd())
> -		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> -	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
> +	if (scanning_global_lru(sc)) {
> +		if (current_is_kswapd())
> +			__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> +		__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
> +	} else {
> +		if (current_is_kswapd())
> +			mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
> +		else
> +			mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
> +	}
>  
>  	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
>  
> @@ -1508,11 +1520,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
>  		 */
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>  	}
>  
>  	reclaim_stat->recent_scanned[file] += nr_taken;
>  
> -	__count_zone_vm_events(PGREFILL, zone, pgscanned);
> +	if (scanning_global_lru(sc))
> +		__count_zone_vm_events(PGREFILL, zone, pgscanned);
> +	else
> +		mem_cgroup_pgrefill(sc->mem_cgroup, pgscanned);
> +
>  	if (file)
>  		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -nr_taken);
>  	else
> @@ -1955,6 +1972,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  
>  	if (scanning_global_lru(sc))
>  		count_vm_event(ALLOCSTALL);
> +	else
> +		mem_cgroup_alloc_stall(sc->mem_cgroup, 1);
>  
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>  		sc->nr_scanned = 0;
> @@ -2444,6 +2463,8 @@ scan:
>  			priority != DEF_PRIORITY)
>  			continue;
>  
> +		mem_cgroup_kswapd_shrink_zone(mem_cont, 1);
> +
>  		sc->nr_scanned = 0;
>  		shrink_zone(priority, zone, sc);
>  		total_scanned += sc->nr_scanned;
> @@ -2462,6 +2483,7 @@ scan:
>  		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>  		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
>  			sc->may_writepage = 1;
> +			mem_cgroup_kswapd_may_writepage(mem_cont, 1);
>  		}
>  	}
>  
> @@ -2504,6 +2526,8 @@ loop_again:
>  	sc.nr_reclaimed = 0;
>  	total_scanned = 0;
>  
> +	mem_cgroup_pg_outrun(mem_cont, 1);
> +
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>  		sc.priority = priority;
>  
> @@ -2544,6 +2568,7 @@ loop_again:
>  				wmark_ok = 0;
>  
>  			if (wmark_ok) {
> +				mem_cgroup_balance_wmark_ok(sc.mem_cgroup, 1);
>  				goto out;
>  			}
>  		}
> -- 
> 1.7.3.1
> 

* Re: [PATCH 3/4] Per cgroup background reclaim.
  2010-11-30  7:51   ` KAMEZAWA Hiroyuki
@ 2010-11-30  8:07     ` KAMEZAWA Hiroyuki
  2010-11-30 22:01       ` Ying Han
  2010-11-30 22:00     ` Ying Han
  2010-12-07  2:25     ` Ying Han
  2 siblings, 1 reply; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  8:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, 30 Nov 2010 16:51:42 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
 
> > +		if (IS_ERR(thr))
> > +			printk(KERN_INFO "Failed to start kswapd on memcg %d\n",
> > +				0);
> > +		else
> > +			kswapd_p->kswapd_task = thr;
> > +	}
> 
> Hmm, ok, so kswapd-for-memcg is created when someone goes over the watermark.
> Why does this new kswapd not exit() until the memcg is destroyed?
> 
> I think there are several approaches.
> 
>   1. create/destroy a thread at memcg create/destroy
>   2. create/destroy a thread at the watermarks.
>   3. use a thread pool for the watermarks.
>   4. use a workqueue for the watermarks.
> 
> The good point of "1" is that we can control the thread-for-kswapd with the cpu
> controller, but it uses some resources even when idle.
> The good point of "2" is that we can avoid unnecessary resource usage.
> 
> 3 and 4 are not very good, I think.
> 
> I'd like to vote for "1"... I want to avoid a bad application in one container
> using up memory and thereby "stealing" other containers' cpu.
> 

One more point: one thread per hierarchy is enough. So please check whether
memory.use_hierarchy == 1 when creating a thread.
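Roughly (sketch; use_hierarchy is the existing field, kswapd_wait is from this
patchset):

  /* in mem_cgroup_create() */
  if (parent && parent->use_hierarchy)
          /* share the parent's kswapd instead of spawning another thread */
          mem->kswapd_wait = parent->kswapd_wait;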

Thanks,
-kame


* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  7:08   ` KAMEZAWA Hiroyuki
@ 2010-11-30  8:15     ` Minchan Kim
  2010-11-30  8:27       ` KAMEZAWA Hiroyuki
  2010-11-30 20:26       ` Ying Han
  2010-11-30 20:17     ` Ying Han
  1 sibling, 2 replies; 52+ messages in thread
From: Minchan Kim @ 2010-11-30  8:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, Nov 30, 2010 at 4:08 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 29 Nov 2010 22:49:42 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor. The kswapd descriptor stores information of node
>> or cgroup and it allows the global and per cgroup background reclaim to share
>> common reclaim algorithms.
>>
>> This patch adds the kswapd descriptor and changes per zone kswapd_wait to the
>> common data structure.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  include/linux/mmzone.h |    3 +-
>>  include/linux/swap.h   |   10 +++++
>>  mm/memcontrol.c        |    2 +
>>  mm/mmzone.c            |    2 +-
>>  mm/page_alloc.c        |    9 +++-
>>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>>  6 files changed, 90 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 39c24eb..c77dfa2 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>>       unsigned long node_spanned_pages; /* total size of physical page
>>                                            range, including holes */
>>       int node_id;
>> -     wait_queue_head_t kswapd_wait;
>> -     struct task_struct *kswapd;
>> +     wait_queue_head_t *kswapd_wait;
>>       int kswapd_max_order;
>>  } pg_data_t;
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index eba53e7..2e6cb58 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>>       return current->flags & PF_KSWAPD;
>>  }
>>
>> +struct kswapd {
>> +     struct task_struct *kswapd_task;
>> +     wait_queue_head_t kswapd_wait;
>> +     struct mem_cgroup *kswapd_mem;
>> +     pg_data_t *kswapd_pgdat;
>> +};
>> +
>> +#define MAX_KSWAPDS MAX_NUMNODES
>> +extern struct kswapd kswapds[MAX_KSWAPDS];
>> +int kswapd(void *p);
>
> Why is this required? Can't we allocate this at boot (if necessary)?
> Why is the existing kswapd also controlled under this structure?
> At first look, this just seems to increase the size of the changes....
>
> IMHO, implementing background-reclaim-for-memcg is cleaner than reusing kswapd..
> kswapd has tons of unnecessary checks.

Ideally, I hope we can unify the global and memcg kswapd for easier
maintenance, if it's not a big problem.
When we make patches about LRU pages, we always have to consider what
we should do for memcg.
And when we review patches, we also have to consider what the patch is
missing for memcg.
That makes the maintenance cost big. Of course, if the memcg maintainers are
involved in all patches, it's no problem as it is.

If it is impossible due to the current kswapd's spaghetti, we can clean
it up first. I am not sure whether my suggestion makes sense or not.
Kame knows this much better than me. But please consider such voices.

>
> Regards,
> -Kame
>



-- 
Kind regards,
Minchan Kim


* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  8:15     ` Minchan Kim
@ 2010-11-30  8:27       ` KAMEZAWA Hiroyuki
  2010-11-30  8:54         ` KAMEZAWA Hiroyuki
  2010-11-30 20:26       ` Ying Han
  1 sibling, 1 reply; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  8:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Ying Han, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, 30 Nov 2010 17:15:37 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Ideally, I hope we can unify the global and memcg kswapd for easier
> maintenance, if it's not a big problem.
> When we make patches about LRU pages, we always have to consider what
> we should do for memcg.
> And when we review patches, we also have to consider what the patch is
> missing for memcg.
> That makes the maintenance cost big. Of course, if the memcg maintainers are
> involved in all patches, it's no problem as it is.
> 
I know it's not. But the thread control of kswapd will not have many points to
merge. And balance_pgdat() is fully replaced in patch 3. The effort for merging
seems not big.

> If it is impossible due to the current kswapd's spaghetti, we can clean
> it up first. I am not sure whether my suggestion makes sense or not.

Makes sense.

> Kame knows this much better than me. But please consider such voices.

Unifying is ok in general, but this patch seems uglier than I imagined.
Implementing a simple memcg one first and then considering how to merge it is
one way. But it's a long way.

For now, we have to check the design/function of the patch before worrying
about how beautiful it is.

Thanks,
-Kame


* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  8:27       ` KAMEZAWA Hiroyuki
@ 2010-11-30  8:54         ` KAMEZAWA Hiroyuki
  2010-11-30 20:40           ` Ying Han
  2010-12-07  6:15           ` Balbir Singh
  0 siblings, 2 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  8:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Ying Han, Balbir Singh, Daisuke Nishimura,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, 30 Nov 2010 17:27:10 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 30 Nov 2010 17:15:37 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > Ideally, I hope we can unify the global and memcg kswapd for easier
> > maintenance, if it's not a big problem.
> > When we make patches about LRU pages, we always have to consider what
> > we should do for memcg.
> > And when we review patches, we also have to consider what the patch is
> > missing for memcg.
> > That makes the maintenance cost big. Of course, if the memcg maintainers are
> > involved in all patches, it's no problem as it is.
> > 
> I know it's not. But the thread control of kswapd will not have many points to
> merge. And balance_pgdat() is fully replaced in patch 3. The effort for merging
> seems not big.
> 

kswapd's balance_pgdat() is for the following:
  - reclaiming pages within a node.
  - balancing the zones in a pgdat.

memcg's background reclaim needs the following:
  - reclaim pages within a memcg
  - reclaim pages from arbitrary zones; if it's fair, that's good, but it is
    not important which zone the pages are reclaimed from.
    (I'm not sure we can select "the oldest" pages from the divided LRUs.)

So merging would put 2 _very_ different functionalities into 1 function.

That's why I think it's simpler to implement

 1. a victim node selector (this algorithm will never be in kswapd), and
 2. a call to the _existing_ try_to_free_mem_cgroup_pages() with a node-local
    zonelist (see the sketch below).
 Sharing that much is enough.

The kswapd stop/go routine may be shareable. But this patch itself does not look
very good to me.
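Sketch of 1+2 (select_next_victim_node() is a hypothetical helper like the one
suggested for patch 3/4; the node-local zonelist is the same trick
try_to_free_mem_cgroup_pages() already uses for the current node):

  /* body of the per-memcg background reclaim loop */
  while (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
          int nid = select_next_victim_node(mem);
          struct zonelist *zl = &NODE_DATA(nid)->node_zonelists[0];

          if (!do_try_to_free_pages(zl, &sc))     /* same path as direct reclaim */
                  break;
  }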

Thanks,
-Kame


* Re: [RFC][PATCH 0/4] memcg: per cgroup background reclaim
  2010-11-30  7:00 ` KAMEZAWA Hiroyuki
@ 2010-11-30  9:05   ` Ying Han
  0 siblings, 0 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30  9:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, Nov 29, 2010 at 11:00 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 29 Nov 2010 22:49:41 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> The current implementation of memcg only supports direct reclaim and this
>> patchset adds the support for background reclaim. Per cgroup background
>> reclaim is needed which spreads out the memory pressure over longer period
>> of time and smoothes out the system performance.
>>
>> The current implementation is not a stable version, and it crashes sometimes
>> on my NUMA machine. Before going further for debugging, I would like to start
>> the discussion and hear the feedbacks of the initial design.
>>
>
> It's welcome, but please wait until the dirty-ratio work is merged.
> And please repost once you no longer see the crash....
Yeah, I will look into the crash and fix it. Besides, it runs fine so
far on my single node
system.

>
> A description of the design would be appreciated.
> Where is the cost of "kswapd" charged if the cpu cgroup is used at the same time?
There is no special treatment for that in the current implementation.
Ideally it would be nice to charge the kswapd time to the
corresponding cgroup. As a starting point, all the kswapd threads
cputime could be charged to root.

>> Current status:
>> I run through some simple tests which reads/writes a large file and makes sure
>> it triggers per cgroup kswapd on the low_wmark. Also, I compared at
>> pg_steal/pg_scan ratio w/o background reclaim.
>>
>>
>
>  Step1: Create a cgroup with 500M memory_limit and set the min_free_kbytes to 1024.
>> $ mount -t cgroup -o cpuset,memory cpuset /dev/cgroup
>> $ mkdir /dev/cgroup/A
>> $ echo 0 >/dev/cgroup/A/cpuset.cpus
>> $ echo 0 >/dev/cgroup/A/cpuset.mems
>> $ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
>> $ echo 1024 >/dev/cgroup/A/memory.min_free_kbytes
>> $ echo $$ >/dev/cgroup/A/tasks
>>
>> Step2: Check the wmarks.
>> $ cat /dev/cgroup/A/memory.reclaim_wmarks
>> memcg_low_wmark 98304000
>> memcg_high_wmark 81920000
>>
>> Step3: Dirty the pages by creating a 20g file on hard drive.
>> $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
>>
>> Checked the memory.stat w/o background reclaim. It used to be all the pages are
>> reclaimed from direct reclaim, and now about half of them are reclaimed at
>> background. (note: writing '0' to min_free_kbytes disables per cgroup kswapd)
>>
>> Only direct reclaim                                                With background reclaim:
>> kswapd_steal 0                                                     kswapd_steal 2751822
>> pg_pgsteal 5100401                                               pg_pgsteal 2476676
>> kswapd_pgscan 0                                                  kswapd_pgscan 6019373
>> pg_scan 5542464                                                   pg_scan 3851281
>> pgrefill 304505                                                       pgrefill 348077
>> pgoutrun 0                                                             pgoutrun 44568
>> allocstall 159278                                                    allocstall 75669
>>
>> Step4: Cleanup
>> $ echo $$ >/dev/cgroup/tasks
>> $ echo 0 > /dev/cgroup/A/memory.force_empty
>>
>> Step5: Read the 20g file into the pagecache.
>> $ cat /export/hdc3/dd/tf0 > /dev/zero;
>>
>> Checked the memory.stat w/o background reclaim. All the clean pages are reclaimed at
>> background instead of direct reclaim.
>>
>> Only direct reclaim                                                With background reclaim
>> kswapd_steal 0                                                      kswapd_steal 3512424
>> pg_pgsteal 3461280                                               pg_pgsteal 0
>> kswapd_pgscan 0                                                  kswapd_pgscan 3512440
>> pg_scan 3461280                                                   pg_scan 0
>> pgrefill 0                                                                pgrefill 0
>> pgoutrun 0                                                             pgoutrun 74973
>> allocstall 108165                                                    allocstall 0
>>
>
> What is the trigger for starting background reclaim ?

The background reclaim is triggered when usage_in_bytes goes above the low
watermark during a charge, in mem_cgroup_do_charge().

--Ying
>
> Thanks,
> -Kame
>
>


* Re: [PATCH 4/4] Add more per memcg stats.
  2010-11-30  7:53   ` KAMEZAWA Hiroyuki
@ 2010-11-30 18:22     ` Ying Han
  0 siblings, 0 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30 18:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, Nov 29, 2010 at 11:53 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 29 Nov 2010 22:49:45 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> A bunch of statistics are added in memory.stat to monitor per cgroup
>> kswapd performance.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>
> No objections. But please update the documentation and add more comments.

Sure. will do.

Thanks

--Ying
>
> Thanks,
> -Kame
>
>> ---
>>  include/linux/memcontrol.h |   81 +++++++++++++++++++++++++
>>  mm/memcontrol.c            |  140 ++++++++++++++++++++++++++++++++++++++++++++
>>  mm/vmscan.c                |   33 +++++++++-
>>  3 files changed, 250 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index dbed45d..893ca62 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -127,6 +127,19 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                               gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>
>> +/* background reclaim stats */
>> +void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_pg_steal(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_pgrefill(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_pg_outrun(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_alloc_stall(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_balance_wmark_ok(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_balance_swap_max(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *memcg, int val);
>> +void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *memcg, int val);
>> +
>>  void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone);
>>  bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
>>  bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> @@ -337,6 +350,74 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>>       return 0;
>>  }
>>
>> +/* background reclaim stats */
>> +static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_balance_wmark_ok(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_balance_swap_max(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>> +
>> +static inline void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *memcg,
>> +                                                             int val)
>> +{
>> +     return 0;
>> +}
>> +
>>  static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
>>                                                               int zid)
>>  {
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 1d39b65..97df6dd 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -91,6 +91,21 @@ enum mem_cgroup_stat_index {
>>       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
>>       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
>>       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>> +     MEM_CGROUP_STAT_KSWAPD_INVOKE, /* # of times invokes kswapd */
>> +     MEM_CGROUP_STAT_KSWAPD_STEAL, /* # of pages reclaimed from kswapd */
>> +     MEM_CGROUP_STAT_PG_PGSTEAL, /* # of pages reclaimed from ttfp */
>> +     MEM_CGROUP_STAT_KSWAPD_PGSCAN, /* # of pages scanned from kswapd */
>> +     MEM_CGROUP_STAT_PG_PGSCAN, /* # of pages scanned from ttfp */
>> +     MEM_CGROUP_STAT_PGREFILL, /* # of pages scanned on active list */
>> +     MEM_CGROUP_STAT_WMARK_LOW_OK,
>> +     MEM_CGROUP_STAT_KSWAP_CREAT,
>> +     MEM_CGROUP_STAT_PGOUTRUN,
>> +     MEM_CGROUP_STAT_ALLOCSTALL,
>> +     MEM_CGROUP_STAT_BALANCE_WMARK_OK,
>> +     MEM_CGROUP_STAT_BALANCE_SWAP_MAX,
>> +     MEM_CGROUP_STAT_WAITQUEUE,
>> +     MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE,
>> +     MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE,
>>       MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>>       /* incremented at every  pagein/pageout */
>>       MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
>> @@ -619,6 +634,62 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
>>       this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
>>  }
>>
>> +void mem_cgroup_kswapd_steal(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_STEAL], val);
>> +}
>> +
>> +void mem_cgroup_pg_steal(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PG_PGSTEAL], val);
>> +}
>> +
>> +void mem_cgroup_kswapd_pgscan(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_PGSCAN], val);
>> +}
>> +
>> +void mem_cgroup_pg_pgscan(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PG_PGSCAN], val);
>> +}
>> +
>> +void mem_cgroup_pgrefill(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PGREFILL], val);
>> +}
>> +
>> +void mem_cgroup_pg_outrun(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_PGOUTRUN], val);
>> +}
>> +
>> +void mem_cgroup_alloc_stall(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_ALLOCSTALL], val);
>> +}
>> +
>> +void mem_cgroup_balance_wmark_ok(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_BALANCE_WMARK_OK], val);
>> +}
>> +
>> +void mem_cgroup_balance_swap_max(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_BALANCE_SWAP_MAX], val);
>> +}
>> +
>> +void mem_cgroup_kswapd_shrink_zone(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE], val);
>> +}
>> +
>> +void mem_cgroup_kswapd_may_writepage(struct mem_cgroup *mem, int val)
>> +{
>> +     this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE],
>> +                     val);
>> +}
>> +
>>  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>>                                        struct page_cgroup *pc,
>>                                        bool charge)
>> @@ -2000,8 +2071,14 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>>               ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
>>                                       &fail_res);
>>               if (likely(!ret)) {
>> +                     this_cpu_add(
>> +                             mem->stat->count[MEM_CGROUP_STAT_WMARK_LOW_OK],
>> +                             1);
>>                       return CHARGE_OK;
>>               } else {
>> +                     this_cpu_add(
>> +                             mem->stat->count[MEM_CGROUP_STAT_KSWAPD_INVOKE],
>> +                             1);
>>                       mem_over_limit = mem_cgroup_from_res_counter(fail_res,
>>                                                                       res);
>>                       wake_memcg_kswapd(mem_over_limit);
>> @@ -3723,6 +3800,21 @@ enum {
>>       MCS_PGPGIN,
>>       MCS_PGPGOUT,
>>       MCS_SWAP,
>> +     MCS_KSWAPD_INVOKE,
>> +     MCS_KSWAPD_STEAL,
>> +     MCS_PG_PGSTEAL,
>> +     MCS_KSWAPD_PGSCAN,
>> +     MCS_PG_PGSCAN,
>> +     MCS_PGREFILL,
>> +     MCS_WMARK_LOW_OK,
>> +     MCS_KSWAP_CREAT,
>> +     MCS_PGOUTRUN,
>> +     MCS_ALLOCSTALL,
>> +     MCS_BALANCE_WMARK_OK,
>> +     MCS_BALANCE_SWAP_MAX,
>> +     MCS_WAITQUEUE,
>> +     MCS_KSWAPD_SHRINK_ZONE,
>> +     MCS_KSWAPD_MAY_WRITEPAGE,
>>       MCS_INACTIVE_ANON,
>>       MCS_ACTIVE_ANON,
>>       MCS_INACTIVE_FILE,
>> @@ -3745,6 +3837,21 @@ struct {
>>       {"pgpgin", "total_pgpgin"},
>>       {"pgpgout", "total_pgpgout"},
>>       {"swap", "total_swap"},
>> +     {"kswapd_invoke", "total_kswapd_invoke"},
>> +     {"kswapd_steal", "total_kswapd_steal"},
>> +     {"pg_pgsteal", "total_pg_pgsteal"},
>> +     {"kswapd_pgscan", "total_kswapd_pgscan"},
>> +     {"pg_scan", "total_pg_scan"},
>> +     {"pgrefill", "total_pgrefill"},
>> +     {"wmark_low_ok", "total_wmark_low_ok"},
>> +     {"kswapd_create", "total_kswapd_create"},
>> +     {"pgoutrun", "total_pgoutrun"},
>> +     {"allocstall", "total_allocstall"},
>> +     {"balance_wmark_ok", "total_balance_wmark_ok"},
>> +     {"balance_swap_max", "total_balance_swap_max"},
>> +     {"waitqueue", "total_waitqueue"},
>> +     {"kswapd_shrink_zone", "total_kswapd_shrink_zone"},
>> +     {"kswapd_may_writepage", "total_kswapd_may_writepage"},
>>       {"inactive_anon", "total_inactive_anon"},
>>       {"active_anon", "total_active_anon"},
>>       {"inactive_file", "total_inactive_file"},
>> @@ -3773,6 +3880,37 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
>>               val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>>               s->stat[MCS_SWAP] += val * PAGE_SIZE;
>>       }
>> +     /* kswapd stat */
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_INVOKE);
>> +     s->stat[MCS_KSWAPD_INVOKE] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_STEAL);
>> +     s->stat[MCS_KSWAPD_STEAL] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PG_PGSTEAL);
>> +     s->stat[MCS_PG_PGSTEAL] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_PGSCAN);
>> +     s->stat[MCS_KSWAPD_PGSCAN] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PG_PGSCAN);
>> +     s->stat[MCS_PG_PGSCAN] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGREFILL);
>> +     s->stat[MCS_PGREFILL] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WMARK_LOW_OK);
>> +     s->stat[MCS_WMARK_LOW_OK] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAP_CREAT);
>> +     s->stat[MCS_KSWAP_CREAT] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGOUTRUN);
>> +     s->stat[MCS_PGOUTRUN] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_ALLOCSTALL);
>> +     s->stat[MCS_ALLOCSTALL] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_BALANCE_WMARK_OK);
>> +     s->stat[MCS_BALANCE_WMARK_OK] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_BALANCE_SWAP_MAX);
>> +     s->stat[MCS_BALANCE_SWAP_MAX] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WAITQUEUE);
>> +     s->stat[MCS_WAITQUEUE] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_SHRINK_ZONE);
>> +     s->stat[MCS_KSWAPD_SHRINK_ZONE] += val;
>> +     val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_KSWAPD_MAY_WRITEPAGE);
>> +     s->stat[MCS_KSWAPD_MAY_WRITEPAGE] += val;
>>
>>       /* per zone stat */
>>       val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
>> @@ -4579,9 +4717,11 @@ void wake_memcg_kswapd(struct mem_cgroup *mem)
>>                               0);
>>               else
>>                       kswapd_p->kswapd_task = thr;
>> +             this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_KSWAP_CREAT], 1);
>>       }
>>
>>       if (!waitqueue_active(wait)) {
>> +             this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_WAITQUEUE], 1);
>>               return;
>>       }
>>       wake_up_interruptible(wait);
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f8430c4..5b0c349 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1389,10 +1389,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>>                                       ISOLATE_INACTIVE : ISOLATE_BOTH,
>>                       zone, sc->mem_cgroup,
>>                       0, file);
>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
>>               /*
>>                * mem_cgroup_isolate_pages() keeps track of
>>                * scanned pages on its own.
>>                */
>> +             if (current_is_kswapd())
>> +                     mem_cgroup_kswapd_pgscan(sc->mem_cgroup, nr_scanned);
>> +             else
>> +                     mem_cgroup_pg_pgscan(sc->mem_cgroup, nr_scanned);
>>       }
>>
>>       if (nr_taken == 0) {
>> @@ -1413,9 +1418,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>>       }
>>
>>       local_irq_disable();
>> -     if (current_is_kswapd())
>> -             __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
>> -     __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>> +     if (scanning_global_lru(sc)) {
>> +             if (current_is_kswapd())
>> +                     __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
>> +             __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>> +     } else {
>> +             if (current_is_kswapd())
>> +                     mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
>> +             else
>> +                     mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
>> +     }
>>
>>       putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
>>
>> @@ -1508,11 +1520,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>>                * mem_cgroup_isolate_pages() keeps track of
>>                * scanned pages on its own.
>>                */
>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>>       }
>>
>>       reclaim_stat->recent_scanned[file] += nr_taken;
>>
>> -     __count_zone_vm_events(PGREFILL, zone, pgscanned);
>> +     if (scanning_global_lru(sc))
>> +             __count_zone_vm_events(PGREFILL, zone, pgscanned);
>> +     else
>> +             mem_cgroup_pgrefill(sc->mem_cgroup, pgscanned);
>> +
>>       if (file)
>>               __mod_zone_page_state(zone, NR_ACTIVE_FILE, -nr_taken);
>>       else
>> @@ -1955,6 +1972,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>>
>>       if (scanning_global_lru(sc))
>>               count_vm_event(ALLOCSTALL);
>> +     else
>> +             mem_cgroup_alloc_stall(sc->mem_cgroup, 1);
>>
>>       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>>               sc->nr_scanned = 0;
>> @@ -2444,6 +2463,8 @@ scan:
>>                       priority != DEF_PRIORITY)
>>                       continue;
>>
>> +             mem_cgroup_kswapd_shrink_zone(mem_cont, 1);
>> +
>>               sc->nr_scanned = 0;
>>               shrink_zone(priority, zone, sc);
>>               total_scanned += sc->nr_scanned;
>> @@ -2462,6 +2483,7 @@ scan:
>>               if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>>                   total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
>>                       sc->may_writepage = 1;
>> +                     mem_cgroup_kswapd_may_writepage(mem_cont, 1);
>>               }
>>       }
>>
>> @@ -2504,6 +2526,8 @@ loop_again:
>>       sc.nr_reclaimed = 0;
>>       total_scanned = 0;
>>
>> +     mem_cgroup_pg_outrun(mem_cont, 1);
>> +
>>       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>>               sc.priority = priority;
>>
>> @@ -2544,6 +2568,7 @@ loop_again:
>>                               wmark_ok = 0;
>>
>>                       if (wmark_ok) {
>> +                             mem_cgroup_balance_wmark_ok(sc.mem_cgroup, 1);
>>                               goto out;
>>                       }
>>               }
>> --
>> 1.7.3.1
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  7:08   ` KAMEZAWA Hiroyuki
  2010-11-30  8:15     ` Minchan Kim
@ 2010-11-30 20:17     ` Ying Han
  2010-12-01  0:12       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 52+ messages in thread
From: Ying Han @ 2010-11-30 20:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, Nov 29, 2010 at 11:08 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 29 Nov 2010 22:49:42 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor. The kswapd descriptor stores information of node
>> or cgroup and it allows the global and per cgroup background reclaim to share
>> common reclaim algorithms.
>>
>> This patch addes the kswapd descriptor and changes per zone kswapd_wait to the
>> common data structure.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  include/linux/mmzone.h |    3 +-
>>  include/linux/swap.h   |   10 +++++
>>  mm/memcontrol.c        |    2 +
>>  mm/mmzone.c            |    2 +-
>>  mm/page_alloc.c        |    9 +++-
>>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>>  6 files changed, 90 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 39c24eb..c77dfa2 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>>       unsigned long node_spanned_pages; /* total size of physical page
>>                                            range, including holes */
>>       int node_id;
>> -     wait_queue_head_t kswapd_wait;
>> -     struct task_struct *kswapd;
>> +     wait_queue_head_t *kswapd_wait;
>>       int kswapd_max_order;
>>  } pg_data_t;
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index eba53e7..2e6cb58 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>>       return current->flags & PF_KSWAPD;
>>  }
>>
>> +struct kswapd {
>> +     struct task_struct *kswapd_task;
>> +     wait_queue_head_t kswapd_wait;
>> +     struct mem_cgroup *kswapd_mem;
>> +     pg_data_t *kswapd_pgdat;
>> +};
>> +
>> +#define MAX_KSWAPDS MAX_NUMNODES
>> +extern struct kswapd kswapds[MAX_KSWAPDS];
>> +int kswapd(void *p);
>
> Why this is required ? Can't we allocate this at boot (if necessary) ?

I can double-check that.

> Why exsiting kswapd is also controlled under this structure ?

Some of the reclaim algorithm could be shared once we unify the API for
global/memcg background reclaim. One example is the kswapd() daemon
function itself.

> At the 1st look, this just seem to increase the size of changes....
>
> IMHO, implementing background-reclaim-for-memcg is cleaner than reusing kswapd..
> kswapd has tons of unnecessary checks.

Sorry, I am not aware of "background-reclaim-for-memcg"; can you be a
bit more specific? Also, do the unnecessary checks here refer to
kswapd() or balance_pgdat()? If the latter, that logic is not shared at
all in patch 3.

--Ying

>
> Regards,
> -Kame
>
>>  /*
>>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>>   * be swapped to.  The swap type and the offset into that swap type are
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a4034b6..dca3590 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -263,6 +263,8 @@ struct mem_cgroup {
>>        */
>>       struct mem_cgroup_stat_cpu nocpu_base;
>>       spinlock_t pcp_counter_lock;
>> +
>> +     wait_queue_head_t *kswapd_wait;
>>  };
>>
>>  /* Stuffs for move charges at task migration. */
>> diff --git a/mm/mmzone.c b/mm/mmzone.c
>> index e35bfb8..c7cbed5 100644
>> --- a/mm/mmzone.c
>> +++ b/mm/mmzone.c
>> @@ -102,7 +102,7 @@ unsigned long zone_nr_free_pages(struct zone *zone)
>>        * free pages are low, get a better estimate for free pages
>>        */
>>       if (nr_free_pages < zone->percpu_drift_mark &&
>> -                     !waitqueue_active(&zone->zone_pgdat->kswapd_wait))
>> +                     !waitqueue_active(zone->zone_pgdat->kswapd_wait))
>>               return zone_page_state_snapshot(zone, NR_FREE_PAGES);
>>
>>       return nr_free_pages;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b48dea2..a15bc1c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4070,13 +4070,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>       int nid = pgdat->node_id;
>>       unsigned long zone_start_pfn = pgdat->node_start_pfn;
>>       int ret;
>> +     struct kswapd *kswapd_p;
>>
>>       pgdat_resize_init(pgdat);
>>       pgdat->nr_zones = 0;
>> -     init_waitqueue_head(&pgdat->kswapd_wait);
>>       pgdat->kswapd_max_order = 0;
>>       pgdat_page_cgroup_init(pgdat);
>> -
>> +
>> +     kswapd_p = &kswapds[nid];
>> +     init_waitqueue_head(&kswapd_p->kswapd_wait);
>> +     pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
>> +     kswapd_p->kswapd_pgdat = pgdat;
>> +
>>       for (j = 0; j < MAX_NR_ZONES; j++) {
>>               struct zone *zone = pgdat->node_zones + j;
>>               unsigned long size, realsize, memmap_pages;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b8a6fdc..e08005e 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>>
>>       return nr_reclaimed;
>>  }
>> +
>>  #endif
>>
>> +DEFINE_SPINLOCK(kswapds_spinlock);
>> +struct kswapd kswapds[MAX_KSWAPDS];
>> +
>>  /* is kswapd sleeping prematurely? */
>> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
>> +                             long remaining)
>>  {
>>       int i;
>> +     pg_data_t *pgdat = kswapd->kswapd_pgdat;
>>
>>       /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>>       if (remaining)
>> @@ -2377,21 +2383,28 @@ out:
>>   * If there are applications that are active memory-allocators
>>   * (most normal use), this basically shouldn't matter.
>>   */
>> -static int kswapd(void *p)
>> +int kswapd(void *p)
>>  {
>>       unsigned long order;
>> -     pg_data_t *pgdat = (pg_data_t*)p;
>> +     struct kswapd *kswapd_p = (struct kswapd *)p;
>> +     pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
>> +     struct mem_cgroup *mem = kswapd_p->kswapd_mem;
>> +     wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>>       struct task_struct *tsk = current;
>>       DEFINE_WAIT(wait);
>>       struct reclaim_state reclaim_state = {
>>               .reclaimed_slab = 0,
>>       };
>> -     const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>> +     const struct cpumask *cpumask;
>>
>>       lockdep_set_current_reclaim_state(GFP_KERNEL);
>>
>> -     if (!cpumask_empty(cpumask))
>> -             set_cpus_allowed_ptr(tsk, cpumask);
>> +     if (pgdat) {
>> +             BUG_ON(pgdat->kswapd_wait != wait_h);
>> +             cpumask = cpumask_of_node(pgdat->node_id);
>> +             if (!cpumask_empty(cpumask))
>> +                     set_cpus_allowed_ptr(tsk, cpumask);
>> +     }
>>       current->reclaim_state = &reclaim_state;
>>
>>       /*
>> @@ -2414,9 +2427,13 @@ static int kswapd(void *p)
>>               unsigned long new_order;
>>               int ret;
>>
>> -             prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> -             new_order = pgdat->kswapd_max_order;
>> -             pgdat->kswapd_max_order = 0;
>> +             prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>> +             if (pgdat) {
>> +                     new_order = pgdat->kswapd_max_order;
>> +                     pgdat->kswapd_max_order = 0;
>> +             } else
>> +                     new_order = 0;
>> +
>>               if (order < new_order) {
>>                       /*
>>                        * Don't sleep if someone wants a larger 'order'
>> @@ -2428,10 +2445,12 @@ static int kswapd(void *p)
>>                               long remaining = 0;
>>
>>                               /* Try to sleep for a short interval */
>> -                             if (!sleeping_prematurely(pgdat, order, remaining)) {
>> +                             if (!sleeping_prematurely(kswapd_p, order,
>> +                                                     remaining)) {
>>                                       remaining = schedule_timeout(HZ/10);
>> -                                     finish_wait(&pgdat->kswapd_wait, &wait);
>> -                                     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> +                                     finish_wait(wait_h, &wait);
>> +                                     prepare_to_wait(wait_h, &wait,
>> +                                                     TASK_INTERRUPTIBLE);
>>                               }
>>
>>                               /*
>> @@ -2439,20 +2458,25 @@ static int kswapd(void *p)
>>                                * premature sleep. If not, then go fully
>>                                * to sleep until explicitly woken up
>>                                */
>> -                             if (!sleeping_prematurely(pgdat, order, remaining)) {
>> -                                     trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>> +                             if (!sleeping_prematurely(kswapd_p, order,
>> +                                                             remaining)) {
>> +                                     if (pgdat)
>> +                                             trace_mm_vmscan_kswapd_sleep(
>> +                                                             pgdat->node_id);
>>                                       schedule();
>>                               } else {
>>                                       if (remaining)
>> -                                             count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>> +                                             count_vm_event(
>> +                                             KSWAPD_LOW_WMARK_HIT_QUICKLY);
>>                                       else
>> -                                             count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>> +                                             count_vm_event(
>> +                                             KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>>                               }
>>                       }
>> -
>> -                     order = pgdat->kswapd_max_order;
>> +                     if (pgdat)
>> +                             order = pgdat->kswapd_max_order;
>>               }
>> -             finish_wait(&pgdat->kswapd_wait, &wait);
>> +             finish_wait(wait_h, &wait);
>>
>>               ret = try_to_freeze();
>>               if (kthread_should_stop())
>> @@ -2476,6 +2500,7 @@ static int kswapd(void *p)
>>  void wakeup_kswapd(struct zone *zone, int order)
>>  {
>>       pg_data_t *pgdat;
>> +     wait_queue_head_t *wait;
>>
>>       if (!populated_zone(zone))
>>               return;
>> @@ -2488,9 +2513,10 @@ void wakeup_kswapd(struct zone *zone, int order)
>>       trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>>       if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>               return;
>> -     if (!waitqueue_active(&pgdat->kswapd_wait))
>> +     wait = pgdat->kswapd_wait;
>> +     if (!waitqueue_active(wait))
>>               return;
>> -     wake_up_interruptible(&pgdat->kswapd_wait);
>> +     wake_up_interruptible(wait);
>>  }
>>
>>  /*
>> @@ -2587,7 +2613,10 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>>
>>                       if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>>                               /* One of our CPUs online: restore mask */
>> -                             set_cpus_allowed_ptr(pgdat->kswapd, mask);
>> +                             if (kswapds[nid].kswapd_task)
>> +                                     set_cpus_allowed_ptr(
>> +                                             kswapds[nid].kswapd_task,
>> +                                             mask);
>>               }
>>       }
>>       return NOTIFY_OK;
>> @@ -2599,19 +2628,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>>   */
>>  int kswapd_run(int nid)
>>  {
>> -     pg_data_t *pgdat = NODE_DATA(nid);
>> +     struct task_struct *thr;
>>       int ret = 0;
>>
>> -     if (pgdat->kswapd)
>> +     if (kswapds[nid].kswapd_task)
>>               return 0;
>>
>> -     pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
>> -     if (IS_ERR(pgdat->kswapd)) {
>> +     thr = kthread_run(kswapd, &kswapds[nid], "kswapd%d", nid);
>> +     if (IS_ERR(thr)) {
>>               /* failure at boot is fatal */
>>               BUG_ON(system_state == SYSTEM_BOOTING);
>>               printk("Failed to start kswapd on node %d\n",nid);
>>               ret = -1;
>>       }
>> +     kswapds[nid].kswapd_task = thr;
>>       return ret;
>>  }
>>
>> @@ -2620,10 +2650,20 @@ int kswapd_run(int nid)
>>   */
>>  void kswapd_stop(int nid)
>>  {
>> -     struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
>> +     struct task_struct *thr;
>> +     struct kswapd *kswapd_p;
>> +     wait_queue_head_t *wait;
>> +
>> +     pg_data_t *pgdat = NODE_DATA(nid);
>> +
>> +     spin_lock(&kswapds_spinlock);
>> +     wait = pgdat->kswapd_wait;
>> +     kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>> +     thr = kswapd_p->kswapd_task;
>> +     spin_unlock(&kswapds_spinlock);
>>
>> -     if (kswapd)
>> -             kthread_stop(kswapd);
>> +     if (thr)
>> +             kthread_stop(thr);
>>  }
>>
>>  static int __init kswapd_init(void)
>> --
>> 1.7.3.1
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  8:15     ` Minchan Kim
  2010-11-30  8:27       ` KAMEZAWA Hiroyuki
@ 2010-11-30 20:26       ` Ying Han
  1 sibling, 0 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30 20:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, Nov 30, 2010 at 12:15 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Tue, Nov 30, 2010 at 4:08 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> On Mon, 29 Nov 2010 22:49:42 -0800
>> Ying Han <yinghan@google.com> wrote:
>>
>>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>>> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
>>> field of a kswapd descriptor. The kswapd descriptor stores information of node
>>> or cgroup and it allows the global and per cgroup background reclaim to share
>>> common reclaim algorithms.
>>>
>>> This patch addes the kswapd descriptor and changes per zone kswapd_wait to the
>>> common data structure.
>>>
>>> Signed-off-by: Ying Han <yinghan@google.com>
>>> ---
>>>  include/linux/mmzone.h |    3 +-
>>>  include/linux/swap.h   |   10 +++++
>>>  mm/memcontrol.c        |    2 +
>>>  mm/mmzone.c            |    2 +-
>>>  mm/page_alloc.c        |    9 +++-
>>>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>>>  6 files changed, 90 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index 39c24eb..c77dfa2 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>>>       unsigned long node_spanned_pages; /* total size of physical page
>>>                                            range, including holes */
>>>       int node_id;
>>> -     wait_queue_head_t kswapd_wait;
>>> -     struct task_struct *kswapd;
>>> +     wait_queue_head_t *kswapd_wait;
>>>       int kswapd_max_order;
>>>  } pg_data_t;
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index eba53e7..2e6cb58 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>>>       return current->flags & PF_KSWAPD;
>>>  }
>>>
>>> +struct kswapd {
>>> +     struct task_struct *kswapd_task;
>>> +     wait_queue_head_t kswapd_wait;
>>> +     struct mem_cgroup *kswapd_mem;
>>> +     pg_data_t *kswapd_pgdat;
>>> +};
>>> +
>>> +#define MAX_KSWAPDS MAX_NUMNODES
>>> +extern struct kswapd kswapds[MAX_KSWAPDS];
>>> +int kswapd(void *p);
>>
>> Why this is required ? Can't we allocate this at boot (if necessary) ?
>> Why exsiting kswapd is also controlled under this structure ?
>> At the 1st look, this just seem to increase the size of changes....
>>
>> IMHO, implementing background-reclaim-for-memcg is cleaner than reusing kswapd..
>> kswapd has tons of unnecessary checks.
>
> Ideally, I hope we unify global and memcg of kswapd for easy
> maintainance if it's not a big problem.

I intentionally did not do that in this patchset, since the algorithm
and the reclaim target are different for the global and the per-memcg
kswapd. I would prefer that the new changes not affect the existing
logic.

> When we make patches about lru pages, we always have to consider what
> I should do for memcg.
> And when we review patches, we also should consider what the patch is
> missing for memcg.
The per-memcg LRU is already there and needs to be considered
separately from the global one. This patchset doesn't change that part
but builds on it. I don't see how merging the kswapds would help
maintenance in that sense. Any later changes to the per-memcg LRU
should automatically take effect for the per-memcg kswapd as well.

> It makes maintainance cost big. Of course, if memcg maintainers is
> involved with all patches, it's no problem as it is.

>
> If it is impossible due to current kswapd's spaghetti, we can clean up
> it first. I am not sure whether my suggestion make sense or not.
> Kame can know it much rather than me. But please consider such the voice.

The global kswapd works on a node and the zones on that node. Its
target is to bring all the zones above their high wmarks unless the
zones are "unreclaimable". The logic is different for the per-memcg
kswapd, which scans all the nodes and zones on the system and tries to
bring the per-memcg usage back below its wmark. Most of the heuristics
are not shared at this moment, and I am not sure it is a good idea to
merge them.
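
To make the difference concrete, the per-memcg side is roughly the loop
below. This is only a simplified sketch, not the exact patch 3 code;
shrink_memcg_node() is a placeholder name for the per-node shrink step,
while mem_cgroup_watermark_ok() and CHARGE_WMARK_HIGH come from patch 2.

static void balance_mem_cgroup_pgdat(struct mem_cgroup *mem)
{
        struct scan_control sc = {
                .gfp_mask = GFP_KERNEL,
                .may_unmap = 1,
                .may_swap = 1,
                .mem_cgroup = mem,
        };
        int priority, nid;

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                /* walk every node; the per-memcg LRU spans all of them */
                for_each_node_state(nid, N_HIGH_MEMORY)
                        shrink_memcg_node(NODE_DATA(nid), priority, &sc);

                /* done once usage is back below the high wmark */
                if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
                        break;
        }
}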

--Ying
>
>>
>> Regards,
>> -Kame
>>
>
>
>
> --
> Kind regards,
> Minchan Kim
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  8:54         ` KAMEZAWA Hiroyuki
@ 2010-11-30 20:40           ` Ying Han
  2010-11-30 23:46             ` KAMEZAWA Hiroyuki
  2010-12-07  6:15           ` Balbir Singh
  1 sibling, 1 reply; 52+ messages in thread
From: Ying Han @ 2010-11-30 20:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, Nov 30, 2010 at 12:54 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 30 Nov 2010 17:27:10 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Tue, 30 Nov 2010 17:15:37 +0900
>> Minchan Kim <minchan.kim@gmail.com> wrote:
>>
>> > Ideally, I hope we unify global and memcg of kswapd for easy
>> > maintainance if it's not a big problem.
>> > When we make patches about lru pages, we always have to consider what
>> > I should do for memcg.
>> > And when we review patches, we also should consider what the patch is
>> > missing for memcg.
>> > It makes maintainance cost big. Of course, if memcg maintainers is
>> > involved with all patches, it's no problem as it is.
>> >
>> I know it's not. But thread control of kswapd will not have much merging point.
>> And balance_pgdat() is fully replaced in patch/3. The effort for merging seems
>> not big.

I intended to separate out the per-memcg kswapd logic and keep it from
interfering with the existing code. This should help with merging.

>>
>
> kswapd's balance_pgdat() is for following
>  - reclaim pages within a node.
>  - balancing zones in a pgdat.
>
> memcg's background reclaim needs followings.
>  - reclaim pages within a memcg
>  - reclaim pages from arbitrary zones, if it's fair, it's good.
>    But it's not important from which zone the pages are reclaimed from.
>    (I'm not sure we can select "the oldest" pages from divided LRU.)

The current implementation is simple: it iterates over all the nodes
and reclaims pages from the per-memcg-per-zone LRU. As soon as the
wmarks are ok, the kswapd is done. Meanwhile, in order not to waste
cputime on "unreclaimable" nodes (a node is unreclaimable if all of its
zones are unreclaimable), I used a nodemask to record that from the
last scan, and the bit is reset as soon as a page is returned back.
This is similar to the logic used in the global kswapd.

A potential improvement is to remember the last node we reclaimed
from, and to start from the next node on the next kswapd wake_up.
This avoids the case where all the memcg kswapds reclaim from the
small node ids on large NUMA machines.
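
A minimal sketch of that round-robin pick (assuming, just for
illustration, a last_scanned_node field and a scan_nodes nodemask of
still-reclaimable nodes in struct mem_cgroup) would be:

static int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
        int nid;

        /* continue after the node we reclaimed from last time */
        nid = next_node(mem->last_scanned_node, mem->scan_nodes);
        if (nid == MAX_NUMNODES)
                nid = first_node(mem->scan_nodes);
        /* every node was found unreclaimable during the last scan */
        if (nid == MAX_NUMNODES)
                return -1;

        mem->last_scanned_node = nid;
        return nid;
}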

>
> Then, merging will put 2 _very_ different functionalities into 1 function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/4] Add per cgroup reclaim watermarks.
  2010-11-30  7:21   ` KAMEZAWA Hiroyuki
@ 2010-11-30 20:44     ` Ying Han
  2010-12-01  0:27       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 52+ messages in thread
From: Ying Han @ 2010-11-30 20:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, Nov 29, 2010 at 11:21 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 29 Nov 2010 22:49:43 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> The per cgroup kswapd is invoked at mem_cgroup_charge when the cgroup's memory
>> usage above a threshold--low_wmark. Then the kswapd thread starts to reclaim
>> pages in a priority loop similar to global algorithm. The kswapd is done if the
>> memory usage below a threshold--high_wmark.
>>
>> The per cgroup background reclaim is based on the per cgroup LRU and also adds
>> per cgroup watermarks. There are two watermarks including "low_wmark" and
>> "high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
>> for each cgroup. Each time the hard_limit is change, the corresponding wmarks
>> are re-calculated. Since memory controller charges only user pages, there is
>> no need for a "min_wmark". The current calculation of wmarks is a function of
>> "memory.min_free_kbytes" which could be adjusted by writing different values
>> into the new api. This is added mainly for debugging purpose.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>
> A few points.
>
> 1. I can understand the motivation for including low/high watermark to
>   res_coutner. But, sadly, compareing all charge will make the counter slow.
>   IMHO, as memory controller threshold-check or soft limit, checking usage
>   periodically based on event counter is enough. It will be low cost.

If we have other limits using the event counter, this sounds like a
feasible approach for the wmarks. I can look into that.
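
Something along these lines is what I have in mind: count charges
cheaply and only test the watermark every N of them. All names below
are made up for illustration except mem_cgroup_watermark_ok(),
CHARGE_WMARK_LOW and wake_memcg_kswapd(), which are from this patchset.

#define WMARK_CHECK_EVENTS      (1 << 10)       /* charges between checks */

static void memcg_check_wmark(struct mem_cgroup *mem)
{
        /* mem->wmark_events is a hypothetical per-memcg atomic counter */
        if (atomic_inc_return(&mem->wmark_events) & (WMARK_CHECK_EVENTS - 1))
                return;

        if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
                wake_memcg_kswapd(mem);
}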

>
> 2. min_free_kbytes must be automatically calculated.
>   For example, max(3% of limit, 20MB) or some.

Right now the wmarks are automatically calculated based on the limit.
Adding min_free_kbytes gives us more flexibility to adjust the portion
of the threshold. This could just become a performance-tuning parameter
later. I need it for now, at least at the beginning, before figuring
out a reasonable calculation formula.
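
As a strawman for such a formula, your max(3% of limit, 20MB) suggestion
could be something like the helper below (mem_cgroup_default_min_free()
is a made-up name, and the unlimited RESOURCE_MAX case would still need
clamping against the machine size):

static u64 mem_cgroup_default_min_free(struct mem_cgroup *mem)
{
        u64 limit = mem_cgroup_get_limit(mem);

        /* max(3% of the limit, 20MB), but never more than the limit itself */
        return min(limit, max(limit / 100 * 3, (u64)20 << 20));
}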

>
> 3. When you allow min_free_kbytes to be set by users, please compare
>   it with the limit.
>   I think min_free_kbyte interface itself should be in another patch...
>   interface code tends to make patch bigger.

Sounds feasible.

--Ying
>
>
>> ---
>>  include/linux/memcontrol.h  |    1 +
>>  include/linux/res_counter.h |   88 ++++++++++++++++++++++++++++++-
>>  kernel/res_counter.c        |   26 ++++++++--
>>  mm/memcontrol.c             |  123 +++++++++++++++++++++++++++++++++++++++++--
>>  mm/vmscan.c                 |   10 ++++
>>  5 files changed, 238 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 159a076..90fe7fe 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -76,6 +76,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>>
>>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
>>
>>  static inline
>>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
>> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
>> index fcb9884..eed12c5 100644
>> --- a/include/linux/res_counter.h
>> +++ b/include/linux/res_counter.h
>> @@ -39,6 +39,16 @@ struct res_counter {
>>        */
>>       unsigned long long soft_limit;
>>       /*
>> +      * the limit that reclaim triggers. TODO: res_counter in mem
>> +      * or wmark_limit.
>> +      */
>> +     unsigned long long low_wmark_limit;
>> +     /*
>> +      * the limit that reclaim stops. TODO: res_counter in mem or
>> +      * wmark_limit.
>> +      */
>> +     unsigned long long high_wmark_limit;
>> +     /*
>>        * the number of unsuccessful attempts to consume the resource
>>        */
>>       unsigned long long failcnt;
>> @@ -55,6 +65,10 @@ struct res_counter {
>>
>>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
>>
>> +#define CHARGE_WMARK_MIN     0x01
>> +#define CHARGE_WMARK_LOW     0x02
>> +#define CHARGE_WMARK_HIGH    0x04
>> +
>>  /**
>>   * Helpers to interact with userspace
>>   * res_counter_read_u64() - returns the value of the specified member

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/4] Per cgroup background reclaim.
  2010-11-30  7:51   ` KAMEZAWA Hiroyuki
  2010-11-30  8:07     ` KAMEZAWA Hiroyuki
@ 2010-11-30 22:00     ` Ying Han
  2010-12-07  2:25     ` Ying Han
  2 siblings, 0 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30 22:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, Nov 29, 2010 at 11:51 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 29 Nov 2010 22:49:44 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> The current implementation of memcg only supports direct reclaim and this
>> patch adds the support for background reclaim. Per cgroup background reclaim
>> is needed which spreads out the memory pressure over longer period of time
>> and smoothes out the system performance.
>>
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor.
>>
>> The kswapd() function now is shared between global and per cgroup kswapd thread.
>> It is passed in with the kswapd descriptor which contains the information of
>> either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
>> if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
>> priority loop similar to global reclaim. In each iteration it invokes
>> balance_pgdat_node for all nodes on the system, which is a new function performs
>> background reclaim per node. After reclaiming each node, it checks
>> mem_cgroup_watermark_ok() and breaks the priority loop if returns true. A per
>> memcg zone will be marked as "unreclaimable" if the scanning rate is much
>> greater than the reclaiming rate on the per cgroup LRU. The bit is cleared when
>> there is a page charged to the cgroup being freed. Kswapd breaks the priority
>> loop if all the zones are marked as "unreclaimable".
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  include/linux/memcontrol.h |   30 +++++++
>>  mm/memcontrol.c            |  182 ++++++++++++++++++++++++++++++++++++++-
>>  mm/page_alloc.c            |    2 +
>>  mm/vmscan.c                |  205 +++++++++++++++++++++++++++++++++++++++++++-
>>  4 files changed, 416 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 90fe7fe..dbed45d 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -127,6 +127,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                               gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>
>> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone);
>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>> +                                     unsigned long nr_scanned);
>>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>>  struct mem_cgroup;
>>
>> @@ -299,6 +305,25 @@ static inline void mem_cgroup_update_file_mapped(struct page *page,
>>  {
>>  }
>>
>> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
>> +                                             struct zone *zone,
>> +                                             unsigned long nr_scanned)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
>> +                                                     struct zone *zone)
>> +{
>> +}
>> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
>> +             struct zone *zone)
>> +{
>> +}
>> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
>> +                                             struct zone *zone)
>> +{
>> +}
>> +
>>  static inline
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                           gfp_t gfp_mask)
>> @@ -312,6 +337,11 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>>       return 0;
>>  }
>>
>> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
>> +                                                             int zid)
>> +{
>> +     return false;
>> +}
>>  #endif /* CONFIG_CGROUP_MEM_CONT */
>>
>>  #endif /* _LINUX_MEMCONTROL_H */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a0c6ed9..1d39b65 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -48,6 +48,8 @@
>>  #include <linux/page_cgroup.h>
>>  #include <linux/cpu.h>
>>  #include <linux/oom.h>
>> +#include <linux/kthread.h>
>> +
>>  #include "internal.h"
>>
>>  #include <asm/uaccess.h>
>> @@ -118,7 +120,10 @@ struct mem_cgroup_per_zone {
>>       bool                    on_tree;
>>       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
>>                                               /* use container_of        */
>> +     unsigned long           pages_scanned;  /* since last reclaim */
>> +     int                     all_unreclaimable;      /* All pages pinned */
>>  };
>> +
>>  /* Macro for accessing counter */
>>  #define MEM_CGROUP_ZSTAT(mz, idx)    ((mz)->count[(idx)])
>>
>> @@ -372,6 +377,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
>>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>>  static void drain_all_stock_async(void);
>>  static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
>> +static inline void wake_memcg_kswapd(struct mem_cgroup *mem);
>>
>>  static struct mem_cgroup_per_zone *
>>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
>> @@ -1086,6 +1092,106 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>>       return &mz->reclaim_stat;
>>  }
>>
>> +unsigned long mem_cgroup_zone_reclaimable_pages(
>> +                                     struct mem_cgroup_per_zone *mz)
>> +{
>> +     int nr;
>> +     nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
>> +             MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
>> +
>> +     if (nr_swap_pages > 0)
>> +             nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
>> +                     MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
>> +
>> +     return nr;
>> +}
>> +
>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>> +                                             unsigned long nr_scanned)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             mz->pages_scanned += nr_scanned;
>> +}
>> +
>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +
>> +     if (!mem)
>> +             return 0;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             return mz->pages_scanned <
>> +                             mem_cgroup_zone_reclaimable_pages(mz) * 6;
>> +     return 0;
>> +}
>> +
>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return 0;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             return mz->all_unreclaimable;
>> +
>> +     return 0;
>> +}
>> +
>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             mz->all_unreclaimable = 1;
>> +}
>> +
>> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     struct mem_cgroup *mem = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +     struct page_cgroup *pc = lookup_page_cgroup(page);
>> +
>> +     if (unlikely(!pc))
>> +             return;
>> +
>> +     rcu_read_lock();
>> +     mem = pc->mem_cgroup;
>
> This is incorrect. you have to do css_tryget(&mem->css) before rcu_read_unlock.

Thanks. This will be changed in the next post.

>
>> +     rcu_read_unlock();
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz) {
>> +             mz->pages_scanned = 0;
>> +             mz->all_unreclaimable = 0;
>> +     }
>> +
>> +     return;
>> +}
>> +
>>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>                                       struct list_head *dst,
>>                                       unsigned long *scanned, int order,
>> @@ -1887,6 +1993,20 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>>       struct res_counter *fail_res;
>>       unsigned long flags = 0;
>>       int ret;
>> +     unsigned long min_free_kbytes = 0;
>> +
>> +     min_free_kbytes = get_min_free_kbytes(mem);
>> +     if (min_free_kbytes) {
>> +             ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
>> +                                     &fail_res);
>> +             if (likely(!ret)) {
>> +                     return CHARGE_OK;
>> +             } else {
>> +                     mem_over_limit = mem_cgroup_from_res_counter(fail_res,
>> +                                                                     res);
>> +                     wake_memcg_kswapd(mem_over_limit);
>> +             }
>> +     }
>
> I think this check can be moved out to periodic-check as threshould notifiers.

I have to check how the threshold notifier works. If the periodic check
delays triggering kswapd, we might end up relying on ttfp as we do now.


>
>
>
>>
>>       ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
>>
>> @@ -3037,6 +3157,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>>                       else
>>                               memcg->memsw_is_minimum = false;
>>               }
>> +             setup_per_memcg_wmarks(memcg);
>>               mutex_unlock(&set_limit_mutex);
>>
>>               if (!ret)
>> @@ -3046,7 +3167,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>>                                               MEM_CGROUP_RECLAIM_SHRINK);
>>               curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>>               /* Usage is reduced ? */
>> -             if (curusage >= oldusage)
>> +             if (curusage >= oldusage)
>>                       retry_count--;
>>               else
>>                       oldusage = curusage;
>
> What's changed here ?
Hmm, I will change this in the next patch.
>
>> @@ -3096,6 +3217,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>>                       else
>>                               memcg->memsw_is_minimum = false;
>>               }
>> +             setup_per_memcg_wmarks(memcg);
>>               mutex_unlock(&set_limit_mutex);
>>
>>               if (!ret)
>> @@ -4352,6 +4474,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>>  static void __mem_cgroup_free(struct mem_cgroup *mem)
>>  {
>>       int node;
>> +     struct kswapd *kswapd_p;
>> +     wait_queue_head_t *wait;
>>
>>       mem_cgroup_remove_from_trees(mem);
>>       free_css_id(&mem_cgroup_subsys, &mem->css);
>> @@ -4360,6 +4484,15 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>>               free_mem_cgroup_per_zone_info(mem, node);
>>
>>       free_percpu(mem->stat);
>> +
>> +     wait = mem->kswapd_wait;
>> +     kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>> +     if (kswapd_p) {
>> +             if (kswapd_p->kswapd_task)
>> +                     kthread_stop(kswapd_p->kswapd_task);
>> +             kfree(kswapd_p);
>> +     }
>> +
>>       if (sizeof(struct mem_cgroup) < PAGE_SIZE)
>>               kfree(mem);
>>       else
>> @@ -4421,6 +4554,39 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
>>       return ret;
>>  }
>>
>> +static inline
>> +void wake_memcg_kswapd(struct mem_cgroup *mem)
>> +{
>> +     wait_queue_head_t *wait;
>> +     struct kswapd *kswapd_p;
>> +     struct task_struct *thr;
>> +     static char memcg_name[PATH_MAX];
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     wait = mem->kswapd_wait;
>> +     kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>> +     if (!kswapd_p->kswapd_task) {
>> +             if (mem->css.cgroup)
>> +                     cgroup_path(mem->css.cgroup, memcg_name, PATH_MAX);
>> +             else
>> +                     sprintf(memcg_name, "no_name");
>> +
>> +             thr = kthread_run(kswapd, kswapd_p, "kswapd%s", memcg_name);
>
> I don't think reusing the name of "kswapd" isn't good. and this name cannot
> be long as PATH_MAX...IIUC, this name is for comm[] field which is 16bytes long.
>
> So, how about naming this as
>
>  "memcg%d", mem->css.id ?

No strong objection to the name. :)
>
> Exporing css.id will be okay if necessary.
>
>
>
>> +             if (IS_ERR(thr))
>> +                     printk(KERN_INFO "Failed to start kswapd on memcg %d\n",
>> +                             0);
>> +             else
>> +                     kswapd_p->kswapd_task = thr;
>> +     }
>
> Hmm, ok, then, kswapd-for-memcg is created when someone go over watermark

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/4] Per cgroup background reclaim.
  2010-11-30  8:07     ` KAMEZAWA Hiroyuki
@ 2010-11-30 22:01       ` Ying Han
  0 siblings, 0 replies; 52+ messages in thread
From: Ying Han @ 2010-11-30 22:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, Nov 30, 2010 at 12:07 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 30 Nov 2010 16:51:42 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> > +           if (IS_ERR(thr))
>> > +                   printk(KERN_INFO "Failed to start kswapd on memcg %d\n",
>> > +                           0);
>> > +           else
>> > +                   kswapd_p->kswapd_task = thr;
>> > +   }
>>
>> Hmm, ok, then, kswapd-for-memcg is created when someone go over watermark.
>> Why this new kswapd will not exit() until memcg destroy ?
>>
>> I think there are several approaches.
>>
>>   1. create/destroy a thread at memcg create/destroy
>>   2. create/destroy a thread at watermarks.
>>   3. use thread pool for watermarks.
>>   4. use workqueue for watermaks.
>>
>> The good point of "1" is that we can control a-thread-for-kswapd by cpu
>> controller but it will use some resource.
>> The good point of "2" is that we can avoid unnecessary resource usage.
>>
>> 3 and 4 is not very good, I think.
>>
>> I'd like to vote for "1"...I want to avoid "stealing" other container's cpu
>> by bad application in a container uses up memory.
>>
>
> One more point, one-thread-per-hierarchy is enough. So, please check
> memory.use_hierarchy==1 or not at creating a thread.

Thanks. Will take a look at it.

--Ying
>
> Thanks,
> -kame
>
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30 20:40           ` Ying Han
@ 2010-11-30 23:46             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30 23:46 UTC (permalink / raw)
  To: Ying Han
  Cc: Minchan Kim, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, 30 Nov 2010 12:40:16 -0800
Ying Han <yinghan@google.com> wrote:

> On Tue, Nov 30, 2010 at 12:54 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 30 Nov 2010 17:27:10 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> >> On Tue, 30 Nov 2010 17:15:37 +0900
> >> Minchan Kim <minchan.kim@gmail.com> wrote:
> >>
> >> > Ideally, I hope we unify global and memcg of kswapd for easy
> >> > maintainance if it's not a big problem.
> >> > When we make patches about lru pages, we always have to consider what
> >> > I should do for memcg.
> >> > And when we review patches, we also should consider what the patch is
> >> > missing for memcg.
> >> > It makes maintainance cost big. Of course, if memcg maintainers is
> >> > involved with all patches, it's no problem as it is.
> >> >
> >> I know it's not. But thread control of kswapd will not have much merging point.
> >> And balance_pgdat() is fully replaced in patch/3. The effort for merging seems
> >> not big.
> 
> I intended to separate out the per-memcg kswapd logic and keep it from
> interfering with the existing code. This should help with merging.
> 

yes.


> >>
> >
> > kswapd's balance_pgdat() is for following
> >  - reclaim pages within a node.
> >  - balancing zones in a pgdat.
> >
> > memcg's background reclaim needs followings.
> >  - reclaim pages within a memcg
> >  - reclaim pages from arbitrary zones, if it's fair, it's good.
> >    But it's not important from which zone the pages are reclaimed from.
> >    (I'm not sure we can select "the oldest" pages from divided LRU.)
> 
> The current implementation is simple: it iterates over all the nodes
> and reclaims pages from the per-memcg-per-zone LRU. As soon as the
> wmarks are ok, the kswapd is done. Meanwhile, in order not to waste
> cputime on "unreclaimable" nodes (a node is unreclaimable if all of its
> zones are unreclaimable), I used a nodemask to record that from the
> last scan, and the bit is reset as soon as a page is returned back.
> This is similar to the logic used in the global kswapd.
> 
> A potential improvement is to remember the last node we reclaimed
> from, and to start from the next node on the next kswapd wake_up.
> This avoids the case where all the memcg kswapds reclaim from the
> small node ids on large NUMA machines.
> 
Yes, that's helpful.

> >
> > Then, merging will put 2 _very_ different functionalities into 1 function.
> 
> Agree.
> 
> >
> > So, I thought it's simpler to implement
> >
> >  1. a victim node selector (This algorithm will never be in kswapd.)
> 
> Yeah, or round robin as I replied above ?
> 
I think it's good to have.

> >  2. call _existing_ try_to_free_pages_mem_cgroup() with node local zonelist.
> >     Sharing is enough.
> 
> That will in turn use direct reclaim logic which has no notion of wmarks.
> 

 do {
	node = select_victim_node();
	do_try_to_free_pages_mem_cgroup(node);
	/* check the watermark and stop once it is satisfied */
 } while (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH));

Or, if we need to check priority etc., your new balance_mem_cgroup_pgdat()
will be good.

> > kswapd stop/go routine may be able to be shared. But this patch itself seems not
> > very good to me.
> This looks feasible change, I will double check with it.

Thanks.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30 20:17     ` Ying Han
@ 2010-12-01  0:12       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-01  0:12 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, 30 Nov 2010 12:17:33 -0800
Ying Han <yinghan@google.com> wrote:

> > At the 1st look, this just seem to increase the size of changes....
> >
> > IMHO, implementing background-reclaim-for-memcg is cleaner than reusing kswapd..
> > kswapd has tons of unnecessary checks.
> 
> Sorry, I am not aware of "background-reclaim-for-memcg"; can you be a
> bit more specific? Also, do the unnecessary checks here refer to
> kswapd() or balance_pgdat()? If the latter, that logic is not shared at
> all in patch 3.
> 
Yes, now I have read patch 3, and I'm sorry to say that.


Some nits.

At first, I just couldn't understand the idea of the array of kswapd
descriptors. Hmm, isn't dynamic allocation possible? Something like:

==
struct kswapd_param {
	pg_data_t *pgdat;		/* NULL for a per-memcg kswapd */
	struct mem_cgroup *memcg;	/* NULL for a per-node kswapd */
	wait_queue_head_t *waitq;
};

int kswapd_run(int nid, struct mem_cgroup *memcg)
{
	struct kswapd_param *param;

	param = kzalloc(sizeof(*param), GFP_KERNEL); /* freed by kswapd */

	if (!memcg) { /* per-node kswapd */
		param->pgdat = NODE_DATA(nid);
		if (param->pgdat->kswapd)
			return 0;
		param->pgdat->kswapd = kthread_run(kswapd, param,
						   "kswapd%d", nid);
		.... /* fatal error check */
		return 0;
	}

	/* per-memcg kswapd */
	param->memcg = memcg;
	kthread_run(kswapd, param, "memcg%d", css_id(&memcg->css));
	return 0;
}
==

Secondly, I think some macro is necessary.

How about:
==
#define is_node_kswapd(param)	(!(param)->memcg)

int kswapd(void *p)
{
	struct kswapd_param *param = p;

	if (is_node_kswapd(param))
		param->waitq = &param->pgdat->kswapd_wait;
	else
		param->waitq = mem_cgroup_get_kswapd_waitq(param->memcg);
	/* Here, we can notify the memcg which thread is for it. */
	...
}
==

or something like that?

I think a macro like scanning_global_lru() is necessary.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/4] Add per cgroup reclaim watermarks.
  2010-11-30 20:44     ` Ying Han
@ 2010-12-01  0:27       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-01  0:27 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, 30 Nov 2010 12:44:13 -0800
Ying Han <yinghan@google.com> wrote:

> On Mon, Nov 29, 2010 at 11:21 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 29 Nov 2010 22:49:43 -0800
> > Ying Han <yinghan@google.com> wrote:
> >
> >> The per cgroup kswapd is invoked at mem_cgroup_charge when the cgroup's memory
> >> usage above a threshold--low_wmark. Then the kswapd thread starts to reclaim
> >> pages in a priority loop similar to global algorithm. The kswapd is done if the
> >> memory usage below a threshold--high_wmark.
> >>
> >> The per cgroup background reclaim is based on the per cgroup LRU and also adds
> >> per cgroup watermarks. There are two watermarks including "low_wmark" and
> >> "high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
> >> for each cgroup. Each time the hard_limit is change, the corresponding wmarks
> >> are re-calculated. Since memory controller charges only user pages, there is
> >> no need for a "min_wmark". The current calculation of wmarks is a function of
> >> "memory.min_free_kbytes" which could be adjusted by writing different values
> >> into the new api. This is added mainly for debugging purpose.
> >>
> >> Signed-off-by: Ying Han <yinghan@google.com>
> >
> > A few points.
> >
> > 1. I can understand the motivation for including low/high watermarks in
> >    res_counter. But, sadly, comparing against them on every charge will make
> >    the counter slow. IMHO, as with the memory controller threshold-check or
> >    soft limit, checking usage periodically based on an event counter is
> >    enough. It will be low cost.
> 
> If we have other limits using the event counter, this sounds like a
> feasible approach for the wmarks. I can look into that.
> 
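For discussion, here is a rough, untested sketch of such an event-counter based check. memcg_check_wmark(), the wmark_event_count field and WMARK_CHECK_EVENTS are made-up names; mem_cgroup_watermark_ok(), CHARGE_WMARK_LOW and wake_memcg_kswapd() are the helpers this patchset adds (the exact signature of the first is assumed here):

==
/*
 * Illustration only: bump a cheap counter on every charge and look at
 * the watermark only once per WMARK_CHECK_EVENTS events, instead of
 * comparing against the watermark on every charge.  This would sit in
 * mm/memcontrol.c next to the existing event/threshold code.
 */
#define WMARK_CHECK_EVENTS	1024

static void memcg_check_wmark(struct mem_cgroup *mem)
{
	/* mem->wmark_event_count is a hypothetical per-memcg atomic_t */
	if (atomic_inc_return(&mem->wmark_event_count) % WMARK_CHECK_EVENTS)
		return;

	/* slow path: read the res_counter once per WMARK_CHECK_EVENTS charges */
	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
		wake_memcg_kswapd(mem);
}
==

A per-cpu counter would be cheaper still than the atomic; the sketch is only meant to show the shape of the check.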
> >
> > 2. min_free_kbytes must be automatically calculated.
> >    For example, max(3% of limit, 20MB) or something like that.
> 
> Now the wmark is automatically calculated based on the limit. Adding
> the min_free_kbytes gives
> us more flexibility to adjust the portion of the threshold. This could
> just be a performance tuning
> parameter later. I need it now at least at the beginning before
> figuring out a reasonable calculation
> formula.
> 
mm/page_alloc.c::init_per_zone_wmark_min() can be reused.

My question is:

> >> +void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> >> +{
> >> +	u64 limit;
> >> +	unsigned long min_free_kbytes;
> >> +
> >> +	min_free_kbytes = get_min_free_kbytes(mem);
> >> +	limit = mem_cgroup_get_limit(mem);
> >> +	if (min_free_kbytes == 0) {

I think min_free_kbytes is always 0 until a user sets it.
Please set it automatically when the limit is changed.

I wonder whether
	struct mem_cgroup {

		unsigned long min_free_kbytes;
		unsigned long min_free_kbytes_user_set; /* use this always if set */
	}
may be necessary if we never adjust min_free_kbytes once a user sets it.
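
Something along these lines, called from mem_cgroup_resize_limit() after the new limit is committed, is what I mean (untested sketch; mem_cgroup_update_min_free_kbytes() is a made-up name, the two fields are the ones suggested above, and mem_cgroup_get_limit()/setup_per_memcg_wmarks() are from this patchset):

==
static void mem_cgroup_update_min_free_kbytes(struct mem_cgroup *mem)
{
	u64 limit_kb = mem_cgroup_get_limit(mem) >> 10;	/* bytes -> KB */
	u64 auto_kb;

	/* automatic value: max(3% of the limit, 20MB), as suggested above.
	 * (An unlimited memcg would need a sane cap, ignored here.) */
	auto_kb = max_t(u64, limit_kb * 3 / 100, 20 * 1024);

	if (mem->min_free_kbytes_user_set)	/* user setting always wins */
		mem->min_free_kbytes = mem->min_free_kbytes_user_set;
	else
		mem->min_free_kbytes = auto_kb;

	setup_per_memcg_wmarks(mem);
}
==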

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/4] Per cgroup background reclaim.
  2010-12-01  2:18   ` KOSAKI Motohiro
@ 2010-12-01  2:16     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-01  2:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ying Han, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, Tejun Heo, linux-mm

On Wed,  1 Dec 2010 11:18:45 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a15bc1c..dc61f2a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -615,6 +615,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  
> >  		do {
> >  			page = list_entry(list->prev, struct page, lru);
> > +			mem_cgroup_clear_unreclaimable(page, zone);
> >  			/* must delete as __free_one_page list manipulates */
> >  			list_del(&page->lru);
> >  			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> > @@ -632,6 +633,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> >  	spin_lock(&zone->lock);
> >  	zone->all_unreclaimable = 0;
> >  	zone->pages_scanned = 0;
> > +	mem_cgroup_clear_unreclaimable(page, zone);
> >  
> >  	__free_one_page(page, zone, order, migratetype);
> >  	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
> 
> Please don't do that. The page free path is one of the fast paths. We don't
> want to add additional overhead here.
> 
> So, first let me explain why we clear zone->all_unreclaimable in the page
> free path: zone free pages are maintained in NR_FREE_PAGES, and
> free_one_page() modifies it.
> 
> But free_one_page() is unrelated to the memory cgroup uncharge path. If nobody
> does a memcg uncharge, retrying reclaim is pointless, no? I think there is a
> better place for this than here.
> 
I agree. Should be done in uncharge or event counter.
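
To make "in uncharge" concrete, a rough sketch (the helper name is made up; it assumes it is called from __mem_cgroup_uncharge_common() while the page_cgroup is still locked, so pc->mem_cgroup is stable and no css_tryget() dance is needed), reusing the mem_cgroup_per_zone fields from patch 3/4:

==
/* A page charged to this memcg is being uncharged, so the per-memcg
 * "all_unreclaimable" state of its zone is stale again. */
static void mem_cgroup_uncharge_clear_unreclaimable(struct page_cgroup *pc,
						    struct page *page)
{
	struct mem_cgroup *mem = pc->mem_cgroup;
	struct mem_cgroup_per_zone *mz;

	if (!mem)
		return;

	mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
	if (mz) {
		mz->pages_scanned = 0;
		mz->all_unreclaimable = 0;
	}
}
==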

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/4] Per cgroup background reclaim.
  2010-11-30  6:49 ` [PATCH 3/4] Per cgroup background reclaim Ying Han
  2010-11-30  7:51   ` KAMEZAWA Hiroyuki
@ 2010-12-01  2:18   ` KOSAKI Motohiro
  2010-12-01  2:16     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 52+ messages in thread
From: KOSAKI Motohiro @ 2010-12-01  2:18 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Balbir Singh, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Andrew Morton, Mel Gorman, Johannes Weiner,
	Christoph Lameter, Wu Fengguang, Andi Kleen, Hugh Dickins,
	Rik van Riel, Tejun Heo, linux-mm

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a15bc1c..dc61f2a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -615,6 +615,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  
>  		do {
>  			page = list_entry(list->prev, struct page, lru);
> +			mem_cgroup_clear_unreclaimable(page, zone);
>  			/* must delete as __free_one_page list manipulates */
>  			list_del(&page->lru);
>  			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> @@ -632,6 +633,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
>  	spin_lock(&zone->lock);
>  	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
> +	mem_cgroup_clear_unreclaimable(page, zone);
>  
>  	__free_one_page(page, zone, order, migratetype);
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);

Please don't do that. The page free path is one of the fast paths. We don't
want to add additional overhead here.

So, first let me explain why we clear zone->all_unreclaimable in the page
free path: zone free pages are maintained in NR_FREE_PAGES, and
free_one_page() modifies it.

But free_one_page() is unrelated to the memory cgroup uncharge path. If nobody
does a memcg uncharge, retrying reclaim is pointless, no? I think there is a
better place for this than here.




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH 0/4] memcg: per cgroup background reclaim
  2010-11-30  7:03   ` Ying Han
@ 2010-12-02 14:41     ` Balbir Singh
  2010-12-07  2:29       ` Ying Han
  0 siblings, 1 reply; 52+ messages in thread
From: Balbir Singh @ 2010-12-02 14:41 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel, Tejun Heo,
	linux-mm

* Ying Han <yinghan@google.com> [2010-11-29 23:03:31]:

> On Mon, Nov 29, 2010 at 10:54 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> The current implementation of memcg only supports direct reclaim and this
> >> patchset adds the support for background reclaim. Per cgroup background
> >> reclaim is needed which spreads out the memory pressure over longer period
> >> of time and smoothes out the system performance.
> >>
> >> The current implementation is not a stable version, and it crashes sometimes
> >> on my NUMA machine. Before going further for debugging, I would like to start
> >> the discussion and hear the feedbacks of the initial design.
> >
> > I haven't read your code at all. However I agree your claim that memcg
> > also need background reclaim.
> 
> Thanks for your comment.
> >
> > So if you post high level design memo, I'm happy.
> 
> My high level design is kind of spread out across the patches, and
> here is the consolidated one. This is nothing more than gluing together
> the commit messages of the following patches.
> 
> "
> The current implementation of memcg only supports direct reclaim and this
> patchset adds the support for background reclaim. Per cgroup background
> reclaim is needed which spreads out the memory pressure over longer period
> of time and smoothes out the system performance.
> 
> There is a kswapd kernel thread for each memory node. We add a different kswapd
> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
> field of a kswapd descriptor. The kswapd descriptor stores information of node
> or cgroup and it allows the global and per cgroup background reclaim to share
> common reclaim algorithms. The per cgroup kswapd is invoked at mem_cgroup_charge
> when the cgroup's memory usage above a threshold--low_wmark. Then the kswapd
> thread starts to reclaim pages in a priority loop similar to global algorithm.
> The kswapd is done if the usage below a threshold--high_wmark.
>

So the logic is per-node/per-zone/per-cgroup right?
 
> The per cgroup background reclaim is based on the per cgroup LRU and also adds
> per cgroup watermarks. There are two watermarks including "low_wmark" and
> "high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
> for each cgroup. Each time the hard_limit is change, the corresponding wmarks
> are re-calculated. Since memory controller charges only user pages, there is

What about memsw limits, do they impact anything, I presume not.

> no need for a "min_wmark". The current calculation of wmarks is a function of
> "memory.min_free_kbytes" which could be adjusted by writing different values
> into the new api. This is added mainly for debugging purpose.

When you say debugging, can you elaborate?

> 
> The kswapd() function now is shared between global and per cgroup kswapd thread.
> It is passed in with the kswapd descriptor which contains the information of
> either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
> if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
> priority loop similar to global reclaim. In each iteration it invokes
> balance_pgdat_node for all nodes on the system, which is a new function performs
> background reclaim per node. After reclaiming each node, it checks
> mem_cgroup_watermark_ok() and breaks the priority loop if returns true. A per
> memcg zone will be marked as "unreclaimable" if the scanning rate is much
> greater than the reclaiming rate on the per cgroup LRU. The bit is cleared when
> there is a page charged to the cgroup being freed. Kswapd breaks the priority
> loop if all the zones are marked as "unreclaimable".
> "
> 
> Also, I am happy to add more descriptions if anything not clear :)
>

Thanks for explaining this in detail, it makes the review easier. 

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/4] Per cgroup background reclaim.
  2010-11-30  7:51   ` KAMEZAWA Hiroyuki
  2010-11-30  8:07     ` KAMEZAWA Hiroyuki
  2010-11-30 22:00     ` Ying Han
@ 2010-12-07  2:25     ` Ying Han
  2010-12-07  5:21       ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 52+ messages in thread
From: Ying Han @ 2010-12-07  2:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, Nov 29, 2010 at 11:51 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 29 Nov 2010 22:49:44 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> The current implementation of memcg only supports direct reclaim and this
>> patch adds the support for background reclaim. Per cgroup background reclaim
>> is needed which spreads out the memory pressure over longer period of time
>> and smoothes out the system performance.
>>
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor.
>>
>> The kswapd() function now is shared between global and per cgroup kswapd thread.
>> It is passed in with the kswapd descriptor which contains the information of
>> either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
>> if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
>> priority loop similar to global reclaim. In each iteration it invokes
>> balance_pgdat_node for all nodes on the system, which is a new function performs
>> background reclaim per node. After reclaiming each node, it checks
>> mem_cgroup_watermark_ok() and breaks the priority loop if returns true. A per
>> memcg zone will be marked as "unreclaimable" if the scanning rate is much
>> greater than the reclaiming rate on the per cgroup LRU. The bit is cleared when
>> there is a page charged to the cgroup being freed. Kswapd breaks the priority
>> loop if all the zones are marked as "unreclaimable".
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  include/linux/memcontrol.h |   30 +++++++
>>  mm/memcontrol.c            |  182 ++++++++++++++++++++++++++++++++++++++-
>>  mm/page_alloc.c            |    2 +
>>  mm/vmscan.c                |  205 +++++++++++++++++++++++++++++++++++++++++++-
>>  4 files changed, 416 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 90fe7fe..dbed45d 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -127,6 +127,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                               gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>
>> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone);
>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>> +                                     unsigned long nr_scanned);
>>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>>  struct mem_cgroup;
>>
>> @@ -299,6 +305,25 @@ static inline void mem_cgroup_update_file_mapped(struct page *page,
>>  {
>>  }
>>
>> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
>> +                                             struct zone *zone,
>> +                                             unsigned long nr_scanned)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
>> +                                                     struct zone *zone)
>> +{
>> +}
>> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
>> +             struct zone *zone)
>> +{
>> +}
>> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
>> +                                             struct zone *zone)
>> +{
>> +}
>> +
>>  static inline
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                           gfp_t gfp_mask)
>> @@ -312,6 +337,11 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>>       return 0;
>>  }
>>
>> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
>> +                                                             int zid)
>> +{
>> +     return false;
>> +}
>>  #endif /* CONFIG_CGROUP_MEM_CONT */
>>
>>  #endif /* _LINUX_MEMCONTROL_H */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a0c6ed9..1d39b65 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -48,6 +48,8 @@
>>  #include <linux/page_cgroup.h>
>>  #include <linux/cpu.h>
>>  #include <linux/oom.h>
>> +#include <linux/kthread.h>
>> +
>>  #include "internal.h"
>>
>>  #include <asm/uaccess.h>
>> @@ -118,7 +120,10 @@ struct mem_cgroup_per_zone {
>>       bool                    on_tree;
>>       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
>>                                               /* use container_of        */
>> +     unsigned long           pages_scanned;  /* since last reclaim */
>> +     int                     all_unreclaimable;      /* All pages pinned */
>>  };
>> +
>>  /* Macro for accessing counter */
>>  #define MEM_CGROUP_ZSTAT(mz, idx)    ((mz)->count[(idx)])
>>
>> @@ -372,6 +377,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
>>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>>  static void drain_all_stock_async(void);
>>  static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
>> +static inline void wake_memcg_kswapd(struct mem_cgroup *mem);
>>
>>  static struct mem_cgroup_per_zone *
>>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
>> @@ -1086,6 +1092,106 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>>       return &mz->reclaim_stat;
>>  }
>>
>> +unsigned long mem_cgroup_zone_reclaimable_pages(
>> +                                     struct mem_cgroup_per_zone *mz)
>> +{
>> +     int nr;
>> +     nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
>> +             MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
>> +
>> +     if (nr_swap_pages > 0)
>> +             nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
>> +                     MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
>> +
>> +     return nr;
>> +}
>> +
>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>> +                                             unsigned long nr_scanned)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             mz->pages_scanned += nr_scanned;
>> +}
>> +
>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +
>> +     if (!mem)
>> +             return 0;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             return mz->pages_scanned <
>> +                             mem_cgroup_zone_reclaimable_pages(mz) * 6;
>> +     return 0;
>> +}
>> +
>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return 0;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             return mz->all_unreclaimable;
>> +
>> +     return 0;
>> +}
>> +
>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             mz->all_unreclaimable = 1;
>> +}
>> +
>> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     struct mem_cgroup *mem = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +     struct page_cgroup *pc = lookup_page_cgroup(page);
>> +
>> +     if (unlikely(!pc))
>> +             return;
>> +
>> +     rcu_read_lock();
>> +     mem = pc->mem_cgroup;
>
> This is incorrect. you have to do css_tryget(&mem->css) before rcu_read_unlock.
>
>> +     rcu_read_unlock();
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz) {
>> +             mz->pages_scanned = 0;
>> +             mz->all_unreclaimable = 0;
>> +     }
>> +
>> +     return;
>> +}
>> +
>>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>                                       struct list_head *dst,
>>                                       unsigned long *scanned, int order,
>> @@ -1887,6 +1993,20 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>>       struct res_counter *fail_res;
>>       unsigned long flags = 0;
>>       int ret;
>> +     unsigned long min_free_kbytes = 0;
>> +
>> +     min_free_kbytes = get_min_free_kbytes(mem);
>> +     if (min_free_kbytes) {
>> +             ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
>> +                                     &fail_res);
>> +             if (likely(!ret)) {
>> +                     return CHARGE_OK;
>> +             } else {
>> +                     mem_over_limit = mem_cgroup_from_res_counter(fail_res,
>> +                                                                     res);
>> +                     wake_memcg_kswapd(mem_over_limit);
>> +             }
>> +     }
>
> I think this check can be moved out to periodic-check as threshould notifiers.

Yes. This will be changed in V2.

>
>
>
>>
>>       ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
>>
>> @@ -3037,6 +3157,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>>                       else
>>                               memcg->memsw_is_minimum = false;
>>               }
>> +             setup_per_memcg_wmarks(memcg);
>>               mutex_unlock(&set_limit_mutex);
>>
>>               if (!ret)
>> @@ -3046,7 +3167,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>>                                               MEM_CGROUP_RECLAIM_SHRINK);
>>               curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>>               /* Usage is reduced ? */
>> -             if (curusage >= oldusage)
>> +             if (curusage >= oldusage)
>>                       retry_count--;
>>               else
>>                       oldusage = curusage;
>
> What's changed here ?
>
>> @@ -3096,6 +3217,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>>                       else
>>                               memcg->memsw_is_minimum = false;
>>               }
>> +             setup_per_memcg_wmarks(memcg);
>>               mutex_unlock(&set_limit_mutex);
>>
>>               if (!ret)
>> @@ -4352,6 +4474,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>>  static void __mem_cgroup_free(struct mem_cgroup *mem)
>>  {
>>       int node;
>> +     struct kswapd *kswapd_p;
>> +     wait_queue_head_t *wait;
>>
>>       mem_cgroup_remove_from_trees(mem);
>>       free_css_id(&mem_cgroup_subsys, &mem->css);
>> @@ -4360,6 +4484,15 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>>               free_mem_cgroup_per_zone_info(mem, node);
>>
>>       free_percpu(mem->stat);
>> +
>> +     wait = mem->kswapd_wait;
>> +     kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>> +     if (kswapd_p) {
>> +             if (kswapd_p->kswapd_task)
>> +                     kthread_stop(kswapd_p->kswapd_task);
>> +             kfree(kswapd_p);
>> +     }
>> +
>>       if (sizeof(struct mem_cgroup) < PAGE_SIZE)
>>               kfree(mem);
>>       else
>> @@ -4421,6 +4554,39 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
>>       return ret;
>>  }
>>
>> +static inline
>> +void wake_memcg_kswapd(struct mem_cgroup *mem)
>> +{
>> +     wait_queue_head_t *wait;
>> +     struct kswapd *kswapd_p;
>> +     struct task_struct *thr;
>> +     static char memcg_name[PATH_MAX];
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     wait = mem->kswapd_wait;
>> +     kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>> +     if (!kswapd_p->kswapd_task) {
>> +             if (mem->css.cgroup)
>> +                     cgroup_path(mem->css.cgroup, memcg_name, PATH_MAX);
>> +             else
>> +                     sprintf(memcg_name, "no_name");
>> +
>> +             thr = kthread_run(kswapd, kswapd_p, "kswapd%s", memcg_name);
>
> > I don't think reusing the name "kswapd" is good, and this name cannot
> > be as long as PATH_MAX... IIUC, this name is for the comm[] field, which is 16 bytes long.
>
> So, how about naming this as
>
>  "memcg%d", mem->css.id ?
>
> > Exporting css.id will be okay if necessary.

I am not sure if that works, since mem->css hasn't been initialized yet
during mem_cgroup_create(). That is one of the reasons I put the kswapd
creation at wmark triggering instead of at cgroup creation, since I have
all that information ready by then.

However, I agree that doing it at cgroup creation is better from a
performance perspective, since we won't add overhead to the page
allocation path (although only on the first wmark trigger). Any
suggestions?

--Ying
>
>
>
>> +             if (IS_ERR(thr))
>> +                     printk(KERN_INFO "Failed to start kswapd on memcg %d\n",
>> +                             0);
>> +             else
>> +                     kswapd_p->kswapd_task = thr;
>> +     }
>
> Hmm, ok, then kswapd-for-memcg is created when someone goes over the watermark

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH 0/4] memcg: per cgroup background reclaim
  2010-12-02 14:41     ` Balbir Singh
@ 2010-12-07  2:29       ` Ying Han
  0 siblings, 0 replies; 52+ messages in thread
From: Ying Han @ 2010-12-07  2:29 UTC (permalink / raw)
  To: balbir
  Cc: KOSAKI Motohiro, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman, Johannes Weiner, Christoph Lameter,
	Wu Fengguang, Andi Kleen, Hugh Dickins, Rik van Riel, Tejun Heo,
	linux-mm

On Thu, Dec 2, 2010 at 6:41 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Ying Han <yinghan@google.com> [2010-11-29 23:03:31]:
>
>> On Mon, Nov 29, 2010 at 10:54 PM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >> The current implementation of memcg only supports direct reclaim and this
>> >> patchset adds the support for background reclaim. Per cgroup background
>> >> reclaim is needed which spreads out the memory pressure over longer period
>> >> of time and smoothes out the system performance.
>> >>
>> >> The current implementation is not a stable version, and it crashes sometimes
>> >> on my NUMA machine. Before going further for debugging, I would like to start
>> >> the discussion and hear the feedbacks of the initial design.
>> >
>> > I haven't read your code at all. However I agree your claim that memcg
>> > also need background reclaim.
>>
>> Thanks for your comment.
>> >
>> > So if you post high level design memo, I'm happy.
>>
>> My high level design is kind of spread out across the patches, and
>> here is the consolidated one. This is nothing more than gluing together
>> the commit messages of the following patches.
>>
>> "
>> The current implementation of memcg only supports direct reclaim and this
>> patchset adds the support for background reclaim. Per cgroup background
>> reclaim is needed which spreads out the memory pressure over longer period
>> of time and smoothes out the system performance.
>>
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor. The kswapd descriptor stores information of node
>> or cgroup and it allows the global and per cgroup background reclaim to share
>> common reclaim algorithms. The per cgroup kswapd is invoked at mem_cgroup_charge
>> when the cgroup's memory usage above a threshold--low_wmark. Then the kswapd
>> thread starts to reclaim pages in a priority loop similar to global algorithm.
>> The kswapd is done if the usage below a threshold--high_wmark.
>>
>
> So the logic is per-node/per-zone/per-cgroup right?

Thanks, Balbir, for your comments.

The kswapd thread is per-cgroup, and the scanning is per-node and
per-zone. The watermarks are calculated based on the per-cgroup
limit_in_bytes, and kswapd is done whenever the usage_in_bytes is
under the watermarks.
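
(Roughly, the check the per-cgroup kswapd keeps re-evaluating is the following; this is only an illustration of the logic, the real helper in the patchset is mem_cgroup_watermark_ok() and the actual wmark comes from the min_free_kbytes calculation, so take the arithmetic below as a placeholder:)

==
static bool memcg_usage_under_high_wmark(struct mem_cgroup *mem)
{
	u64 usage = res_counter_read_u64(&mem->res, RES_USAGE);
	u64 limit = res_counter_read_u64(&mem->res, RES_LIMIT);
	u64 min_bytes = (u64)mem->min_free_kbytes << 10;
	/* placeholder formula: high_wmark sits min_free_kbytes below the limit */
	u64 high_wmark = limit > min_bytes ? limit - min_bytes : 0;

	return usage < high_wmark;	/* true -> kswapd can go back to sleep */
}
==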
>
>> The per cgroup background reclaim is based on the per cgroup LRU and also adds
>> per cgroup watermarks. There are two watermarks including "low_wmark" and
>> "high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
>> for each cgroup. Each time the hard_limit is change, the corresponding wmarks
>> are re-calculated. Since memory controller charges only user pages, there is
>
> What about memsw limits, do they impact anything, I presume not.
>
>> no need for a "min_wmark". The current calculation of wmarks is a function of
>> "memory.min_free_kbytes" which could be adjusted by writing different values
>> into the new api. This is added mainly for debugging purpose.
>
> When you say debugging, can you elaborate?

I am not sure if we would like to keep memory.min_free_kbytes, which is
used to adjust the calculation of the per-cgroup wmarks, in the final
version. For now, I am adding it for performance testing purposes.

>
>>
>> The kswapd() function now is shared between global and per cgroup kswapd thread.
>> It is passed in with the kswapd descriptor which contains the information of
>> either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
>> if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
>> priority loop similar to global reclaim. In each iteration it invokes
>> balance_pgdat_node for all nodes on the system, which is a new function performs
>> background reclaim per node. After reclaiming each node, it checks
>> mem_cgroup_watermark_ok() and breaks the priority loop if returns true. A per
>> memcg zone will be marked as "unreclaimable" if the scanning rate is much
>> greater than the reclaiming rate on the per cgroup LRU. The bit is cleared when
>> there is a page charged to the cgroup being freed. Kswapd breaks the priority
>> loop if all the zones are marked as "unreclaimable".
>> "
>>
>> Also, I am happy to add more descriptions if anything not clear :)

Sure. :)

--Ying
>>
>
> Thanks for explaining this in detail, it makes the review easier.
>
> --
>        Three Cheers,
>        Balbir
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/4] Per cgroup background reclaim.
  2010-12-07  2:25     ` Ying Han
@ 2010-12-07  5:21       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-07  5:21 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, 6 Dec 2010 18:25:55 -0800
Ying Han <yinghan@google.com> wrote:

> On Mon, Nov 29, 2010 at 11:51 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 29 Nov 2010 22:49:44 -0800
> > Ying Han <yinghan@google.com> wrote:
> >
> >> The current implementation of memcg only supports direct reclaim and this
> >> patch adds the support for background reclaim. Per cgroup background reclaim
> >> is needed which spreads out the memory pressure over longer period of time
> >> and smoothes out the system performance.
> >>
> >> There is a kswapd kernel thread for each memory node. We add a different kswapd
> >> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
> >> field of a kswapd descriptor.
> >>
> >> The kswapd() function now is shared between global and per cgroup kswapd thread.
> >> It is passed in with the kswapd descriptor which contains the information of
> >> either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
> >> if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
> >> priority loop similar to global reclaim. In each iteration it invokes
> >> balance_pgdat_node for all nodes on the system, which is a new function performs
> >> background reclaim per node. After reclaiming each node, it checks
> >> mem_cgroup_watermark_ok() and breaks the priority loop if returns true. A per
> >> memcg zone will be marked as "unreclaimable" if the scanning rate is much
> >> greater than the reclaiming rate on the per cgroup LRU. The bit is cleared when
> >> there is a page charged to the cgroup being freed. Kswapd breaks the priority
> >> loop if all the zones are marked as "unreclaimable".
> >>
> >> Signed-off-by: Ying Han <yinghan@google.com>
> >> ---
> >>  include/linux/memcontrol.h |   30 +++++++
> >>  mm/memcontrol.c            |  182 ++++++++++++++++++++++++++++++++++++++-
> >>  mm/page_alloc.c            |    2 +
> >>  mm/vmscan.c                |  205 +++++++++++++++++++++++++++++++++++++++++++-
> >>  4 files changed, 416 insertions(+), 3 deletions(-)
> >>
> >> [...]
> >>
> >> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone)
> >> +{
> >> +	struct mem_cgroup_per_zone *mz = NULL;
> >> +	struct mem_cgroup *mem = NULL;
> >> +	int nid = zone_to_nid(zone);
> >> +	int zid = zone_idx(zone);
> >> +	struct page_cgroup *pc = lookup_page_cgroup(page);
> >> +
> >> +	if (unlikely(!pc))
> >> +		return;
> >> +
> >> +	rcu_read_lock();
> >> +	mem = pc->mem_cgroup;
> >
> > This is incorrect. you have to do css_tryget(&mem->css) before rcu_read_unlock.
> >
> >> +	rcu_read_unlock();
> >> +
> >> +	if (!mem)
> >> +		return;
> >> +
> >> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> >> +	if (mz) {
> >> +		mz->pages_scanned = 0;
> >> +		mz->all_unreclaimable = 0;
> >> +	}
> >> +
> >> +	return;
> >> +}
> >> +
> >>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> >>  					struct list_head *dst,
> >>  					unsigned long *scanned, int order,
> >> @@ -1887,6 +1993,20 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> >>  	struct res_counter *fail_res;
> >>  	unsigned long flags = 0;
> >>  	int ret;
> >> +	unsigned long min_free_kbytes = 0;
> >> +
> >> +	min_free_kbytes = get_min_free_kbytes(mem);
> >> +	if (min_free_kbytes) {
> >> +		ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
> >> +					&fail_res);
> >> +		if (likely(!ret)) {
> >> +			return CHARGE_OK;
> >> +		} else {
> >> +			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> >> +									res);
> >> +			wake_memcg_kswapd(mem_over_limit);
> >> +		}
> >> +	}
> >
> > I think this check can be moved out to periodic-check as threshould notifiers.
> 
> Yes. This will be changed in V2.
> 
> >
> >
> >
> >>
> >>  	ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
> >>
> >> @@ -3037,6 +3157,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> >>  			else
> >>  				memcg->memsw_is_minimum = false;
> >>  		}
> >> +		setup_per_memcg_wmarks(memcg);
> >>  		mutex_unlock(&set_limit_mutex);
> >>
> >>  		if (!ret)
> >> @@ -3046,7 +3167,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> >>  						MEM_CGROUP_RECLAIM_SHRINK);
> >>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> >>  		/* Usage is reduced ? */
> >> -		if (curusage >= oldusage)
> >> +		if (curusage >= oldusage)
> >>  			retry_count--;
> >>  		else
> >>  			oldusage = curusage;
> >
> > What's changed here ?
> >
> >> [...]
> >>
> >> @@ -4421,6 +4554,39 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> >>  	return ret;
> >>  }
> >>
> >> +static inline
> >> +void wake_memcg_kswapd(struct mem_cgroup *mem)
> >> +{
> >> +	wait_queue_head_t *wait;
> >> +	struct kswapd *kswapd_p;
> >> +	struct task_struct *thr;
> >> +	static char memcg_name[PATH_MAX];
> >> +
> >> +	if (!mem)
> >> +		return;
> >> +
> >> +	wait = mem->kswapd_wait;
> >> +	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> >> +	if (!kswapd_p->kswapd_task) {
> >> +		if (mem->css.cgroup)
> >> +			cgroup_path(mem->css.cgroup, memcg_name, PATH_MAX);
> >> +		else
> >> +			sprintf(memcg_name, "no_name");
> >> +
> >> +		thr = kthread_run(kswapd, kswapd_p, "kswapd%s", memcg_name);
> >
> > I don't think reusing the name "kswapd" is good, and this name cannot
> > be as long as PATH_MAX... IIUC, this name is for the comm[] field, which is 16 bytes long.
> >
> > So, how about naming this as
> >
> > A "memcg%d", mem->css.id ?
> >
> > Exporting css.id will be okay if necessary.
> 
> I am not sure if that works, since mem->css hasn't been initialized yet
> during mem_cgroup_create(). That is one of the reasons I put the kswapd
> creation at wmark triggering instead of at cgroup creation, since I have
> all that information ready by then.
> 
> However, I agree that doing it at cgroup creation is better from a
> performance perspective, since we won't add overhead to the page
> allocation path (although only on the first wmark trigger). Any
> suggestions?
> 

Hmm, my recommendation is to start the thread when the limit is set.
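
Something like the following, called from mem_cgroup_resize_limit() right after setup_per_memcg_wmarks() and under set_limit_mutex, is what I have in mind (rough, untested sketch; memcg_kswapd_run() and the "memcg_%d" naming are illustrative):

==
static int memcg_kswapd_run(struct mem_cgroup *mem)
{
	struct kswapd *kswapd_p;
	struct task_struct *thr;

	if (mem->kswapd_wait)		/* already running */
		return 0;

	kswapd_p = kzalloc(sizeof(*kswapd_p), GFP_KERNEL);
	if (!kswapd_p)
		return -ENOMEM;

	init_waitqueue_head(&kswapd_p->kswapd_wait);
	kswapd_p->kswapd_mem = mem;
	mem->kswapd_wait = &kswapd_p->kswapd_wait;

	/* comm[] is 16 bytes, so "memcg_%d" with the css id always fits */
	thr = kthread_run(kswapd, kswapd_p, "memcg_%d", css_id(&mem->css));
	if (IS_ERR(thr)) {
		mem->kswapd_wait = NULL;
		kfree(kswapd_p);
		return PTR_ERR(thr);
	}
	kswapd_p->kswapd_task = thr;
	return 0;
}
==

Then wake_memcg_kswapd() shrinks to a waitqueue_active() check plus wake_up_interruptible(mem->kswapd_wait), with no allocation or kthread_run() in the charge path.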

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  8:54         ` KAMEZAWA Hiroyuki
  2010-11-30 20:40           ` Ying Han
@ 2010-12-07  6:15           ` Balbir Singh
  2010-12-07  6:24             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 52+ messages in thread
From: Balbir Singh @ 2010-12-07  6:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Ying Han, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-11-30 17:54:43]:

> On Tue, 30 Nov 2010 17:27:10 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Tue, 30 Nov 2010 17:15:37 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> > 
> > > Ideally, I hope we unify the global and memcg kswapd for easier
> > > maintenance, if it's not a big problem.
> > > When we make patches about LRU pages, we always have to consider what
> > > we should do for memcg.
> > > And when we review patches, we also should consider what the patch is
> > > missing for memcg.
> > > It makes the maintenance cost big. Of course, if the memcg maintainers are
> > > involved with all patches, it's no problem as it is.
> > > 
> > I know it's not. But the thread control of kswapd will not have many merging points.
> > And balance_pgdat() is fully replaced in patch/3. The effort for merging seems
> > not big.
> > 
> 
> kswapd's balance_pgdat() is for the following:
>   - reclaiming pages within a node.
>   - balancing zones within a pgdat.
> 
> memcg's background reclaim needs the following:
>   - reclaiming pages within a memcg.
>   - reclaiming pages from arbitrary zones; if it's fair, that's good.
>     But it's not important which zone the pages are reclaimed from.
>     (I'm not sure we can select "the oldest" pages from a divided LRU.)
>

Yes, if it is fair, then we don't break what kswapd tries to do, so
fairness is quite important, in that we don't leave zones unbalanced
(at least not by very much) as we try to do background reclaim. But
sometimes it cannot be helped, especially if there are policies that
bias the allocation.
 
> Then, merging will put 2 _very_ different functionalities into 1 function.
> 
> So, I thought it's simpler to implement
> 
>  1. a victim node selector (This algorithm will never be in kswapd.)

A victim node selector per memcg? Could you clarify the context of
node here?

>  2. call _existing_ try_to_free_pages_mem_cgroup() with node local zonelist.
>  Sharing is enough.
> 
> kswapd stop/go routine may be able to be shared. But this patch itself seems not
> very good to me.
> 
> Thanks,
> -Kame
> 

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07  6:15           ` Balbir Singh
@ 2010-12-07  6:24             ` KAMEZAWA Hiroyuki
  2010-12-07  6:59               ` Balbir Singh
  0 siblings, 1 reply; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-07  6:24 UTC (permalink / raw)
  To: balbir
  Cc: Minchan Kim, Ying Han, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, 7 Dec 2010 11:45:03 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-11-30 17:54:43]:
> 
> > On Tue, 30 Nov 2010 17:27:10 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Tue, 30 Nov 2010 17:15:37 +0900
> > > Minchan Kim <minchan.kim@gmail.com> wrote:
> > > 
> > > > Ideally, I hope we unify the global and memcg kswapd for easier
> > > > maintenance, if it's not a big problem.
> > > > When we make patches about LRU pages, we always have to consider what
> > > > we should do for memcg.
> > > > And when we review patches, we also should consider what the patch is
> > > > missing for memcg.
> > > > It makes the maintenance cost big. Of course, if the memcg maintainers are
> > > > involved with all patches, it's no problem as it is.
> > > > 
> > > I know it's not. But the thread control of kswapd will not have many merging points.
> > > And balance_pgdat() is fully replaced in patch/3. The effort for merging seems
> > > not big.
> > > 
> > 
> > kswapd's balance_pgdat() is for the following:
> >   - reclaiming pages within a node.
> >   - balancing zones within a pgdat.
> > 
> > memcg's background reclaim needs the following:
> >   - reclaiming pages within a memcg.
> >   - reclaiming pages from arbitrary zones; if it's fair, that's good.
> >     But it's not important which zone the pages are reclaimed from.
> >     (I'm not sure we can select "the oldest" pages from a divided LRU.)
> >
> 
> Yes, if it is fair, then we don't break what kswapd tries to do, so
> fairness is quite important, in that we don't leave zones unbalanced
> (at least not by very much) as we try to do background reclaim. But
> sometimes it cannot be helped, especially if there are policies that
> bias the allocation.
>  
> > Then, merging will put 2 _very_ different functionalities into 1 function.
> > 
> > So, I thought it's simpler to implement
> > 
> >  1. a victim node selector (This algorithm will never be in kswapd.)
> 
> A victim node selector per memcg? Could you clarify the context of
> node here?
> 
An argument to balance_pgdat_for_memcg() or a start point of zonelist[].
i.e.
	zone_list = NODE_DATA(victim)->zonelist[0 or 1]

	for_each_zone_zonelist(z, zone_list)....

But this is just an example; we just need to determine where we reclaim
pages from before we start walking.
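
In code the idea could look roughly like this (illustration only; mem->last_scanned_node and the helper names are hypothetical, and the body of the loop is where the per-zone reclaim of patch 3/4 would go):

==
/* Pick the next node with memory, round-robin per memcg. */
static int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
	int nid = next_node(mem->last_scanned_node, node_states[N_HIGH_MEMORY]);

	if (nid == MAX_NUMNODES)
		nid = first_node(node_states[N_HIGH_MEMORY]);
	mem->last_scanned_node = nid;
	return nid;
}

/* Walk only the victim node's local zonelist. */
static void memcg_reclaim_from_node(struct mem_cgroup *mem, int nid)
{
	struct zonelist *zonelist = &NODE_DATA(nid)->node_zonelists[0];
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(GFP_KERNEL)) {
		if (!populated_zone(zone))
			continue;
		/* ... shrink this memcg's LRU on @zone here ... */
	}
}
==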



Thanks,
-Kame




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  6:49 ` [PATCH 1/4] Add kswapd descriptor Ying Han
  2010-11-30  7:08   ` KAMEZAWA Hiroyuki
@ 2010-12-07  6:52   ` Balbir Singh
  2010-12-07 19:21     ` Ying Han
  2010-12-07 12:33   ` Mel Gorman
  2 siblings, 1 reply; 52+ messages in thread
From: Balbir Singh @ 2010-12-07  6:52 UTC (permalink / raw)
  To: Ying Han
  Cc: Daisuke Nishimura, KAMEZAWA Hiroyuki, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

* Ying Han <yinghan@google.com> [2010-11-29 22:49:42]:

> There is a kswapd kernel thread for each memory node. We add a different kswapd
> for each cgroup.

Could you please elaborate on this: what is being added? A kernel thread per cgroup?

> The kswapd is sleeping in the wait queue headed at kswapd_wait
> field of a kswapd descriptor. The kswapd descriptor stores information of node
> or cgroup and it allows the global and per cgroup background reclaim to share
> common reclaim algorithms.
> 
> This patch addes the kswapd descriptor and changes per zone kswapd_wait to the
> common data structure.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---

The performance data you posted earlier is helpful; do you have any
additional insights on the CPU overhead, if any?

My overall comment is that this patch needs to be refactored
to bring out the changes it makes.

>  include/linux/mmzone.h |    3 +-
>  include/linux/swap.h   |   10 +++++
>  mm/memcontrol.c        |    2 +
>  mm/mmzone.c            |    2 +-
>  mm/page_alloc.c        |    9 +++-
>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>  6 files changed, 90 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 39c24eb..c77dfa2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> -	wait_queue_head_t kswapd_wait;
> -	struct task_struct *kswapd;
> +	wait_queue_head_t *kswapd_wait;
>  	int kswapd_max_order;
>  } pg_data_t;
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index eba53e7..2e6cb58 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>  	return current->flags & PF_KSWAPD;
>  }
> 
> +struct kswapd {
> +	struct task_struct *kswapd_task;
> +	wait_queue_head_t kswapd_wait;
> +	struct mem_cgroup *kswapd_mem;

Is this field being used anywhere in this patch?

> +	pg_data_t *kswapd_pgdat;
> +};
> +
> +#define MAX_KSWAPDS MAX_NUMNODES
> +extern struct kswapd kswapds[MAX_KSWAPDS];
> +int kswapd(void *p);
>  /*
>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>   * be swapped to.  The swap type and the offset into that swap type are
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a4034b6..dca3590 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -263,6 +263,8 @@ struct mem_cgroup {
>  	 */
>  	struct mem_cgroup_stat_cpu nocpu_base;
>  	spinlock_t pcp_counter_lock;
> +
> +	wait_queue_head_t *kswapd_wait;
>  };
> 
>  /* Stuffs for move charges at task migration. */
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index e35bfb8..c7cbed5 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -102,7 +102,7 @@ unsigned long zone_nr_free_pages(struct zone *zone)
>  	 * free pages are low, get a better estimate for free pages
>  	 */
>  	if (nr_free_pages < zone->percpu_drift_mark &&
> -			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> +			!waitqueue_active(zone->zone_pgdat->kswapd_wait))
>  		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
> 
>  	return nr_free_pages;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b48dea2..a15bc1c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4070,13 +4070,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  	int nid = pgdat->node_id;
>  	unsigned long zone_start_pfn = pgdat->node_start_pfn;
>  	int ret;
> +	struct kswapd *kswapd_p;

_p is sort of ugly, do we really need it?

> 
>  	pgdat_resize_init(pgdat);
>  	pgdat->nr_zones = 0;
> -	init_waitqueue_head(&pgdat->kswapd_wait);
>  	pgdat->kswapd_max_order = 0;
>  	pgdat_page_cgroup_init(pgdat);
> -	

Thanks for the whitespace cleanup, but I don't know if that should be
done here.

> +
> +	kswapd_p = &kswapds[nid];
> +	init_waitqueue_head(&kswapd_p->kswapd_wait);
> +	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +	kswapd_p->kswapd_pgdat = pgdat;
> +
>  	for (j = 0; j < MAX_NR_ZONES; j++) {
>  		struct zone *zone = pgdat->node_zones + j;
>  		unsigned long size, realsize, memmap_pages;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b8a6fdc..e08005e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> 
>  	return nr_reclaimed;
>  }
> +
>  #endif
> 
> +DEFINE_SPINLOCK(kswapds_spinlock);
> +struct kswapd kswapds[MAX_KSWAPDS];
> +
>  /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
> +				long remaining)
>  {
>  	int i;
> +	pg_data_t *pgdat = kswapd->kswapd_pgdat;
> 
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
> @@ -2377,21 +2383,28 @@ out:
>   * If there are applications that are active memory-allocators
>   * (most normal use), this basically shouldn't matter.
>   */
> -static int kswapd(void *p)
> +int kswapd(void *p)
>  {
>  	unsigned long order;
> -	pg_data_t *pgdat = (pg_data_t*)p;
> +	struct kswapd *kswapd_p = (struct kswapd *)p;
> +	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +	struct mem_cgroup *mem = kswapd_p->kswapd_mem;

Do we use mem anywhere?

> +	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;

_p, _h almost look like hungarian notation in reverse :)

>  	struct task_struct *tsk = current;
>  	DEFINE_WAIT(wait);
>  	struct reclaim_state reclaim_state = {
>  		.reclaimed_slab = 0,
>  	};
> -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +	const struct cpumask *cpumask;
> 
>  	lockdep_set_current_reclaim_state(GFP_KERNEL);
> 
> -	if (!cpumask_empty(cpumask))
> -		set_cpus_allowed_ptr(tsk, cpumask);
> +	if (pgdat) {
> +		BUG_ON(pgdat->kswapd_wait != wait_h);
> +		cpumask = cpumask_of_node(pgdat->node_id);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(tsk, cpumask);
> +	}
>  	current->reclaim_state = &reclaim_state;
> 
>  	/*
> @@ -2414,9 +2427,13 @@ static int kswapd(void *p)
>  		unsigned long new_order;
>  		int ret;
> 
> -		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> -		new_order = pgdat->kswapd_max_order;
> -		pgdat->kswapd_max_order = 0;
> +		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> +		if (pgdat) {
> +			new_order = pgdat->kswapd_max_order;
> +			pgdat->kswapd_max_order = 0;
> +		} else
> +			new_order = 0;
> +
>  		if (order < new_order) {
>  			/*
>  			 * Don't sleep if someone wants a larger 'order'
> @@ -2428,10 +2445,12 @@ static int kswapd(void *p)
>  				long remaining = 0;
> 
>  				/* Try to sleep for a short interval */
> -				if (!sleeping_prematurely(pgdat, order, remaining)) {
> +				if (!sleeping_prematurely(kswapd_p, order,
> +							remaining)) {
>  					remaining = schedule_timeout(HZ/10);
> -					finish_wait(&pgdat->kswapd_wait, &wait);
> -					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> +					finish_wait(wait_h, &wait);
> +					prepare_to_wait(wait_h, &wait,
> +							TASK_INTERRUPTIBLE);
>  				}
> 
>  				/*
> @@ -2439,20 +2458,25 @@ static int kswapd(void *p)
>  				 * premature sleep. If not, then go fully
>  				 * to sleep until explicitly woken up
>  				 */
> -				if (!sleeping_prematurely(pgdat, order, remaining)) {
> -					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +				if (!sleeping_prematurely(kswapd_p, order,
> +								remaining)) {
> +					if (pgdat)
> +						trace_mm_vmscan_kswapd_sleep(
> +								pgdat->node_id);
>  					schedule();
>  				} else {
>  					if (remaining)
> -						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> +						count_vm_event(
> +						KSWAPD_LOW_WMARK_HIT_QUICKLY);
>  					else
> -						count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> +						count_vm_event(
> +						KSWAPD_HIGH_WMARK_HIT_QUICKLY);

Sorry, but the coding style hits me here; do we really need to change
this?

>  				}
>  			}
> -
> -			order = pgdat->kswapd_max_order;
> +			if (pgdat)
> +				order = pgdat->kswapd_max_order;
>  		}
> -		finish_wait(&pgdat->kswapd_wait, &wait);
> +		finish_wait(wait_h, &wait);
> 
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())
> @@ -2476,6 +2500,7 @@ static int kswapd(void *p)
>  void wakeup_kswapd(struct zone *zone, int order)
>  {
>  	pg_data_t *pgdat;
> +	wait_queue_head_t *wait;
> 
>  	if (!populated_zone(zone))
>  		return;
> @@ -2488,9 +2513,10 @@ void wakeup_kswapd(struct zone *zone, int order)
>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>  	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  		return;
> -	if (!waitqueue_active(&pgdat->kswapd_wait))
> +	wait = pgdat->kswapd_wait;
> +	if (!waitqueue_active(wait))
>  		return;
> -	wake_up_interruptible(&pgdat->kswapd_wait);
> +	wake_up_interruptible(wait);
>  }
> 
>  /*
> @@ -2587,7 +2613,10 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
> 
>  			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>  				/* One of our CPUs online: restore mask */
> -				set_cpus_allowed_ptr(pgdat->kswapd, mask);
> +				if (kswapds[nid].kswapd_task)
> +					set_cpus_allowed_ptr(
> +						kswapds[nid].kswapd_task,
> +						mask);
>  		}
>  	}
>  	return NOTIFY_OK;
> @@ -2599,19 +2628,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>   */
>  int kswapd_run(int nid)
>  {
> -	pg_data_t *pgdat = NODE_DATA(nid);
> +	struct task_struct *thr;

thr is an ugly name for a task_struct instance

>  	int ret = 0;
> 
> -	if (pgdat->kswapd)
> +	if (kswapds[nid].kswapd_task)
>  		return 0;
> 
> -	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> -	if (IS_ERR(pgdat->kswapd)) {
> +	thr = kthread_run(kswapd, &kswapds[nid], "kswapd%d", nid);
> +	if (IS_ERR(thr)) {
>  		/* failure at boot is fatal */
>  		BUG_ON(system_state == SYSTEM_BOOTING);
>  		printk("Failed to start kswapd on node %d\n",nid);
>  		ret = -1;

What happens to the threads started?

>  	}
> +	kswapds[nid].kswapd_task = thr;
>  	return ret;
>  }
> 
> @@ -2620,10 +2650,20 @@ int kswapd_run(int nid)
>   */
>  void kswapd_stop(int nid)
>  {
> -	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
> +	struct task_struct *thr;
> +	struct kswapd *kswapd_p;
> +	wait_queue_head_t *wait;
> +
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +
> +	spin_lock(&kswapds_spinlock);
> +	wait = pgdat->kswapd_wait;
> +	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +	thr = kswapd_p->kswapd_task;

Sorry, but thr is just an ugly name to use.

> +	spin_unlock(&kswapds_spinlock);
> 
> -	if (kswapd)
> -		kthread_stop(kswapd);
> +	if (thr)
> +		kthread_stop(thr);
>  }
> 
>  static int __init kswapd_init(void)
> -- 
> 1.7.3.1
> 

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07  6:24             ` KAMEZAWA Hiroyuki
@ 2010-12-07  6:59               ` Balbir Singh
  2010-12-07  8:00                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 52+ messages in thread
From: Balbir Singh @ 2010-12-07  6:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Ying Han, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-12-07 15:24:23]:

> On Tue, 7 Dec 2010 11:45:03 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-11-30 17:54:43]:
> > 
> > > On Tue, 30 Nov 2010 17:27:10 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Tue, 30 Nov 2010 17:15:37 +0900
> > > > Minchan Kim <minchan.kim@gmail.com> wrote:
> > > > 
> > > > > Ideally, I hope we unify global and memcg of kswapd for easy
> > > > > maintenance if it's not a big problem.
> > > > > When we make patches about lru pages, we always have to consider what
> > > > > I should do for memcg.
> > > > > And when we review patches, we also should consider what the patch is
> > > > > missing for memcg.
> > > > > It makes the maintenance cost big. Of course, if the memcg maintainers are
> > > > > involved with all patches, it's no problem as it is.
> > > > > 
> > > > I know it's not. But the thread control of kswapd will not have many merging points.
> > > > And balance_pgdat() is fully replaced in patch 3. The effort for merging seems
> > > > small.
> > > > 
> > > 
> > > kswapd's balance_pgdat() is for the following:
> > >   - reclaim pages within a node.
> > >   - balancing zones in a pgdat.
> > > 
> > > memcg's background reclaim needs the following:
> > >   - reclaim pages within a memcg
> > >   - reclaim pages from arbitrary zones, if it's fair, it's good.
> > >     But it's not important which zone the pages are reclaimed from.
> > >     (I'm not sure we can select "the oldest" pages from divided LRU.)
> > >
> > 
> > Yes, if it is fair, then we don't break what kswapd tries to do, so
> > fairness is quite important, in that we don't leave zones unbalanced
> > (at least by very much) as we try to do background reclaim. But
> > sometimes it cannot be helped, especially if there are policies that
> > bias the allocation.
> >  
> > > Then, merging will put 2 _very_ different functionalities into 1 function.
> > > 
> > > So, I thought it's simpler to implement
> > > 
> > >  1. a victim node selector (This algorithm will never be in kswapd.)
> > 
> > A victim node selector per memcg? Could you clarify the context of
> > node here?
> > 
> An argument to balance_pgdat_for_memcg() or a start point of zonelist[].
> i.e.
> 	zone_list = NODE_DATA(victim)->zonelist[0 or 1]
> 
> 	for_each_zone_zonelist(z, zone_list)....
> 
> But this is just an example; we just need to determine where to reclaim
> pages from before we start walking.
>

OK, I understand. BTW, I am not against integration with kswapd for
watermark-based reclaim; the advantage I see is that as we balance
zone/node watermarks, we also balance the per-memcg watermarks. The cost
would be proportional to the size of the memcgs that have allocated from
that zone/node. kswapd is not a fast path and is already optimized in terms
of when to wake up, so it makes sense to reuse all of that.

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07  6:59               ` Balbir Singh
@ 2010-12-07  8:00                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-07  8:00 UTC (permalink / raw)
  To: balbir
  Cc: Minchan Kim, Ying Han, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, 7 Dec 2010 12:29:29 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-12-07 15:24:23]:
> > An argument to balance_pgdat_for_memcg() or a start point of zonelist[].
> > i.e.
> > 	zone_list = NODE_DATA(victim)->zonelist[0 or 1]
> > 
> > 	for_each_zone_zonelist(z, zone_list)....
> > 
> > But this is just an example; we just need to determine where to reclaim
> > pages from before we start walking.
> >
> 
> OK, I understand. BTW, I am not against integration with kswapd for
> watermark-based reclaim; the advantage I see is that as we balance
> zone/node watermarks, we also balance the per-memcg watermarks. The cost
> would be proportional to the size of the memcgs that have allocated from
> that zone/node. kswapd is not a fast path and is already optimized in terms
> of when to wake up, so it makes sense to reuse all of that.
> 

But we cannot use balance_pgdat() as it is, because we don't need almost
any of the checks in it, and I don't want to add hooks into it because it
is updated frequently. And I doubt how cleanly we could do the merging.

As Ying Han did, adding balance_pgdat_for_memcg() is a clean way for now.
The kswapd wakeup and sleep routines may be able to be reused.
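
Roughly the shape I mean, just as a skeleton (not the code from Ying Han's
patch 3; the actual per-memcg, per-zone scanning is elided):

static void balance_pgdat_for_memcg(struct mem_cgroup *mem, int victim_nid)
{
	struct zonelist *zonelist = &NODE_DATA(victim_nid)->node_zonelists[0];
	struct zoneref *z;
	struct zone *zone;
	int priority;

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
			/*
			 * Scan only @mem's LRU lists for this zone here,
			 * reusing the existing shrink_zone() machinery
			 * rather than duplicating balance_pgdat().
			 */
		}
		if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
			break;
	}
}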


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-11-30  6:49 ` [PATCH 1/4] Add kswapd descriptor Ying Han
  2010-11-30  7:08   ` KAMEZAWA Hiroyuki
  2010-12-07  6:52   ` Balbir Singh
@ 2010-12-07 12:33   ` Mel Gorman
  2010-12-07 17:28     ` Ying Han
                       ` (2 more replies)
  2 siblings, 3 replies; 52+ messages in thread
From: Mel Gorman @ 2010-12-07 12:33 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Mon, Nov 29, 2010 at 10:49:42PM -0800, Ying Han wrote:
> There is a kswapd kernel thread for each memory node. We add a different kswapd
> for each cgroup.

What is considered a normal number of cgroups in production? 10, 50, 10000? If
it's a really large number and all the cgroups kswapds wake at the same time,
the zone LRU lock will be very heavily contended.  Potentially there will
also be a very large number of new IO sources. I confess I haven't read the
thread yet so maybe this has already been thought of but it might make sense
to have a 1:N relationship between kswapd and memcgroups and cycle between
containers. The difficulty will be a latency between when kswapd wakes up
and when a particular container is scanned. The closer the ratio is to 1:1,
the less the latency will be but the higher the contention on the LRU lock
and IO will be.
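
To sketch the 1:N idea (untested and purely illustrative; the work list, the
lock and the helper below are made-up names, not anything from this series),
a single kswapd could rotate through the memcgs assigned to it on each wakeup:

struct memcg_kswapd_work {
	struct list_head list;
	struct mem_cgroup *mem;
};

static LIST_HEAD(memcg_kswapd_list);
static DEFINE_SPINLOCK(memcg_kswapd_lock);

/* called from the shared kswapd loop on each wakeup */
static void kswapd_service_next_memcg(void)
{
	struct memcg_kswapd_work *work = NULL;

	spin_lock(&memcg_kswapd_lock);
	if (!list_empty(&memcg_kswapd_list)) {
		work = list_first_entry(&memcg_kswapd_list,
					struct memcg_kswapd_work, list);
		/* rotate so successive wakeups visit different memcgs */
		list_move_tail(&work->list, &memcg_kswapd_list);
	}
	spin_unlock(&memcg_kswapd_lock);

	if (work) {
		/*
		 * Reclaim work->mem down to its high watermark here,
		 * i.e. the per-memcg balance loop from this series.
		 * (Taking a reference on the memcg is glossed over.)
		 */
	}
}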

> The kswapd is sleeping in the wait queue headed at kswapd_wait
> field of a kswapd descriptor. The kswapd descriptor stores information of node
> or cgroup and it allows the global and per cgroup background reclaim to share
> common reclaim algorithms.
> 
> This patch adds the kswapd descriptor and changes the per-zone kswapd_wait to the
> common data structure.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/mmzone.h |    3 +-
>  include/linux/swap.h   |   10 +++++
>  mm/memcontrol.c        |    2 +
>  mm/mmzone.c            |    2 +-
>  mm/page_alloc.c        |    9 +++-
>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>  6 files changed, 90 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 39c24eb..c77dfa2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> -	wait_queue_head_t kswapd_wait;
> -	struct task_struct *kswapd;
> +	wait_queue_head_t *kswapd_wait;
>  	int kswapd_max_order;
>  } pg_data_t;
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index eba53e7..2e6cb58 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>  	return current->flags & PF_KSWAPD;
>  }
>  
> +struct kswapd {
> +	struct task_struct *kswapd_task;
> +	wait_queue_head_t kswapd_wait;
> +	struct mem_cgroup *kswapd_mem;
> +	pg_data_t *kswapd_pgdat;
> +};
> +
> +#define MAX_KSWAPDS MAX_NUMNODES
> +extern struct kswapd kswapds[MAX_KSWAPDS];

This is potentially very large for a static structure. Can they not be
dynamically allocated and kept on a list? Yes, there will be a list walk
involved if you need a particular structure but that looks like it's a
rare operation at this point.
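
For example, something like the below (sketch only; it adds a list linkage to
the descriptor, reuses your kswapds_spinlock to protect the list, moves the
pgdat wiring out of free_area_init_core() for brevity, and the error handling
differs from the posted patch):

struct kswapd {
	struct list_head	kswapd_list;	/* new: linkage on a global list */
	struct task_struct	*kswapd_task;
	wait_queue_head_t	kswapd_wait;
	struct mem_cgroup	*kswapd_mem;
	pg_data_t		*kswapd_pgdat;
};

static LIST_HEAD(kswapd_descriptors);	/* protected by kswapds_spinlock */

int kswapd_run(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	struct kswapd *kswapd_p;
	struct task_struct *tsk;

	kswapd_p = kzalloc(sizeof(*kswapd_p), GFP_KERNEL);
	if (!kswapd_p)
		return -ENOMEM;

	init_waitqueue_head(&kswapd_p->kswapd_wait);
	kswapd_p->kswapd_pgdat = pgdat;
	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;

	tsk = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
	if (IS_ERR(tsk)) {
		kfree(kswapd_p);
		return PTR_ERR(tsk);
	}
	kswapd_p->kswapd_task = tsk;

	spin_lock(&kswapds_spinlock);
	list_add(&kswapd_p->kswapd_list, &kswapd_descriptors);
	spin_unlock(&kswapds_spinlock);

	return 0;
}

kswapd_stop() would then find the descriptor via container_of() on
pgdat->kswapd_wait (or by walking the list), unlink it and free it.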

> +int kswapd(void *p);
>  /*
>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>   * be swapped to.  The swap type and the offset into that swap type are
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a4034b6..dca3590 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -263,6 +263,8 @@ struct mem_cgroup {
>  	 */
>  	struct mem_cgroup_stat_cpu nocpu_base;
>  	spinlock_t pcp_counter_lock;
> +
> +	wait_queue_head_t *kswapd_wait;
>  };
>  
>  /* Stuffs for move charges at task migration. */
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index e35bfb8..c7cbed5 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -102,7 +102,7 @@ unsigned long zone_nr_free_pages(struct zone *zone)
>  	 * free pages are low, get a better estimate for free pages
>  	 */
>  	if (nr_free_pages < zone->percpu_drift_mark &&
> -			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> +			!waitqueue_active(zone->zone_pgdat->kswapd_wait))
>  		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  
>  	return nr_free_pages;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b48dea2..a15bc1c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4070,13 +4070,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  	int nid = pgdat->node_id;
>  	unsigned long zone_start_pfn = pgdat->node_start_pfn;
>  	int ret;
> +	struct kswapd *kswapd_p;
>  
>  	pgdat_resize_init(pgdat);
>  	pgdat->nr_zones = 0;
> -	init_waitqueue_head(&pgdat->kswapd_wait);
>  	pgdat->kswapd_max_order = 0;
>  	pgdat_page_cgroup_init(pgdat);
> -	
> +
> +	kswapd_p = &kswapds[nid];
> +	init_waitqueue_head(&kswapd_p->kswapd_wait);
> +	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +	kswapd_p->kswapd_pgdat = pgdat;
> +
>  	for (j = 0; j < MAX_NR_ZONES; j++) {
>  		struct zone *zone = pgdat->node_zones + j;
>  		unsigned long size, realsize, memmap_pages;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b8a6fdc..e08005e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  
>  	return nr_reclaimed;
>  }
> +
>  #endif
>  

Unnecessary whitespace there.

> +DEFINE_SPINLOCK(kswapds_spinlock);
> +struct kswapd kswapds[MAX_KSWAPDS];
> +
>  /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
> +				long remaining)
>  {
>  	int i;
> +	pg_data_t *pgdat = kswapd->kswapd_pgdat;
>  

This will behave strangely. You are using information from a *node* to
determine if the kswapd belonging to a cgroup should sleep or not. The
risk is that a cgroup kswapd never goes to sleep because even when all
of its pages are discarded, the node itself is still not balanced.

>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
> @@ -2377,21 +2383,28 @@ out:
>   * If there are applications that are active memory-allocators
>   * (most normal use), this basically shouldn't matter.
>   */
> -static int kswapd(void *p)
> +int kswapd(void *p)
>  {
>  	unsigned long order;
> -	pg_data_t *pgdat = (pg_data_t*)p;
> +	struct kswapd *kswapd_p = (struct kswapd *)p;
> +	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
> +	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>  	struct task_struct *tsk = current;
>  	DEFINE_WAIT(wait);
>  	struct reclaim_state reclaim_state = {
>  		.reclaimed_slab = 0,
>  	};
> -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +	const struct cpumask *cpumask;
>  
>  	lockdep_set_current_reclaim_state(GFP_KERNEL);
>  
> -	if (!cpumask_empty(cpumask))
> -		set_cpus_allowed_ptr(tsk, cpumask);
> +	if (pgdat) {
> +		BUG_ON(pgdat->kswapd_wait != wait_h);
> +		cpumask = cpumask_of_node(pgdat->node_id);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(tsk, cpumask);
> +	}
>  	current->reclaim_state = &reclaim_state;
>  
>  	/*
> @@ -2414,9 +2427,13 @@ static int kswapd(void *p)
>  		unsigned long new_order;
>  		int ret;
>  
> -		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> -		new_order = pgdat->kswapd_max_order;
> -		pgdat->kswapd_max_order = 0;
> +		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> +		if (pgdat) {
> +			new_order = pgdat->kswapd_max_order;
> +			pgdat->kswapd_max_order = 0;
> +		} else
> +			new_order = 0;
> +
>  		if (order < new_order) {
>  			/*
>  			 * Don't sleep if someone wants a larger 'order'
> @@ -2428,10 +2445,12 @@ static int kswapd(void *p)
>  				long remaining = 0;
>  
>  				/* Try to sleep for a short interval */
> -				if (!sleeping_prematurely(pgdat, order, remaining)) {
> +				if (!sleeping_prematurely(kswapd_p, order,
> +							remaining)) {
>  					remaining = schedule_timeout(HZ/10);
> -					finish_wait(&pgdat->kswapd_wait, &wait);
> -					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> +					finish_wait(wait_h, &wait);
> +					prepare_to_wait(wait_h, &wait,
> +							TASK_INTERRUPTIBLE);

It would be nice if patch 1 did nothing but move the wait queue outside of
the node structure without any other functional change. It will then be
far easier to review a patch that introduces background reclaim for containers.

>  				}
>  
>  				/*
> @@ -2439,20 +2458,25 @@ static int kswapd(void *p)
>  				 * premature sleep. If not, then go fully
>  				 * to sleep until explicitly woken up
>  				 */
> -				if (!sleeping_prematurely(pgdat, order, remaining)) {
> -					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +				if (!sleeping_prematurely(kswapd_p, order,
> +								remaining)) {
> +					if (pgdat)
> +						trace_mm_vmscan_kswapd_sleep(
> +								pgdat->node_id);
>  					schedule();
>  				} else {
>  					if (remaining)
> -						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> +						count_vm_event(
> +						KSWAPD_LOW_WMARK_HIT_QUICKLY);
>  					else
> -						count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> +						count_vm_event(
> +						KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>  				}
>  			}
> -
> -			order = pgdat->kswapd_max_order;
> +			if (pgdat)
> +				order = pgdat->kswapd_max_order;
>  		}
> -		finish_wait(&pgdat->kswapd_wait, &wait);
> +		finish_wait(wait_h, &wait);
>  
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())
> @@ -2476,6 +2500,7 @@ static int kswapd(void *p)
>  void wakeup_kswapd(struct zone *zone, int order)
>  {
>  	pg_data_t *pgdat;
> +	wait_queue_head_t *wait;
>  
>  	if (!populated_zone(zone))
>  		return;
> @@ -2488,9 +2513,10 @@ void wakeup_kswapd(struct zone *zone, int order)
>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>  	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  		return;
> -	if (!waitqueue_active(&pgdat->kswapd_wait))
> +	wait = pgdat->kswapd_wait;
> +	if (!waitqueue_active(wait))
>  		return;
> -	wake_up_interruptible(&pgdat->kswapd_wait);
> +	wake_up_interruptible(wait);
>  }
>  
>  /*
> @@ -2587,7 +2613,10 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  
>  			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>  				/* One of our CPUs online: restore mask */
> -				set_cpus_allowed_ptr(pgdat->kswapd, mask);
> +				if (kswapds[nid].kswapd_task)
> +					set_cpus_allowed_ptr(
> +						kswapds[nid].kswapd_task,
> +						mask);
>  		}
>  	}
>  	return NOTIFY_OK;
> @@ -2599,19 +2628,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>   */
>  int kswapd_run(int nid)
>  {
> -	pg_data_t *pgdat = NODE_DATA(nid);
> +	struct task_struct *thr;
>  	int ret = 0;
>  
> -	if (pgdat->kswapd)
> +	if (kswapds[nid].kswapd_task)
>  		return 0;
>  
> -	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> -	if (IS_ERR(pgdat->kswapd)) {
> +	thr = kthread_run(kswapd, &kswapds[nid], "kswapd%d", nid);
> +	if (IS_ERR(thr)) {
>  		/* failure at boot is fatal */
>  		BUG_ON(system_state == SYSTEM_BOOTING);
>  		printk("Failed to start kswapd on node %d\n",nid);
>  		ret = -1;
>  	}
> +	kswapds[nid].kswapd_task = thr;
>  	return ret;
>  }
>  
> @@ -2620,10 +2650,20 @@ int kswapd_run(int nid)
>   */
>  void kswapd_stop(int nid)
>  {
> -	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
> +	struct task_struct *thr;
> +	struct kswapd *kswapd_p;
> +	wait_queue_head_t *wait;
> +
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +
> +	spin_lock(&kswapds_spinlock);
> +	wait = pgdat->kswapd_wait;
> +	kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> +	thr = kswapd_p->kswapd_task;
> +	spin_unlock(&kswapds_spinlock);
>  
> -	if (kswapd)
> -		kthread_stop(kswapd);
> +	if (thr)
> +		kthread_stop(thr);
>  }
>  
>  static int __init kswapd_init(void)
> -- 
> 1.7.3.1
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/4] Add per cgroup reclaim watermarks.
  2010-11-30  6:49 ` [PATCH 2/4] Add per cgroup reclaim watermarks Ying Han
  2010-11-30  7:21   ` KAMEZAWA Hiroyuki
@ 2010-12-07 14:56   ` Mel Gorman
  1 sibling, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2010-12-07 14:56 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Mon, Nov 29, 2010 at 10:49:43PM -0800, Ying Han wrote:
> The per cgroup kswapd is invoked at mem_cgroup_charge when the cgroup's memory
> usage is above a threshold--low_wmark. Then the kswapd thread starts to reclaim
> pages in a priority loop similar to the global algorithm. The kswapd is done if the
> memory usage is below a threshold--high_wmark.
> 
> The per cgroup background reclaim is based on the per cgroup LRU and also adds
> per cgroup watermarks. There are two watermarks including "low_wmark" and
> "high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
> for each cgroup. Each time the hard_limit is changed, the corresponding wmarks
> are re-calculated. Since memory controller charges only user pages, there is
> no need for a "min_wmark". The current calculation of wmarks is a function of
> "memory.min_free_kbytes" which could be adjusted by writing different values
> into the new API. This is added mainly for debugging purposes.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  include/linux/memcontrol.h  |    1 +
>  include/linux/res_counter.h |   88 ++++++++++++++++++++++++++++++-
>  kernel/res_counter.c        |   26 ++++++++--
>  mm/memcontrol.c             |  123 +++++++++++++++++++++++++++++++++++++++++--
>  mm/vmscan.c                 |   10 ++++
>  5 files changed, 238 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 159a076..90fe7fe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,6 +76,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
>  

bool. I know zone_watermark_ok is int, but it should be bool too.

>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index fcb9884..eed12c5 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -39,6 +39,16 @@ struct res_counter {
>  	 */
>  	unsigned long long soft_limit;
>  	/*
> +	 * the limit that reclaim triggers. TODO: res_counter in mem
> +	 * or wmark_limit.
> +	 */
> +	unsigned long long low_wmark_limit;
> +	/*
> +	 * the limit that reclaim stops. TODO: res_counter in mem or
> +	 * wmark_limit.
> +	 */
> +	unsigned long long high_wmark_limit;
> +	/*
>  	 * the number of unsuccessful attempts to consume the resource
>  	 */
>  	unsigned long long failcnt;
> @@ -55,6 +65,10 @@ struct res_counter {
>  
>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
>  
> +#define CHARGE_WMARK_MIN	0x01

Comment that CHARGE_WMARK_MIN is the maximum limit of the container.

> +#define CHARGE_WMARK_LOW	0x02
> +#define CHARGE_WMARK_HIGH	0x04
> +
>  /**
>   * Helpers to interact with userspace
>   * res_counter_read_u64() - returns the value of the specified member.
> @@ -92,6 +106,8 @@ enum {
>  	RES_LIMIT,
>  	RES_FAILCNT,
>  	RES_SOFT_LIMIT,
> +	RES_LOW_WMARK_LIMIT,
> +	RES_HIGH_WMARK_LIMIT
>  };
>  

I'm a little concerned that memcg watermarks feel like the opposite of the core
VM watermarks. In the core VM, the number of "free pages" are compared against
a watermark and we are wary of "going below" a watermark.  In memory containers,
the number of "used pages" is checked and should not "be above" a
watermark.

It means one has to think differently about the core VM and containers.
The more examples of this that exist, the harder maintainership will be
in the future.
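
FWIW the two views are convertible; a check phrased in the core VM's "enough
free headroom" direction would look something like this (sketch only, invented
helper name, equivalent to the usage < wmark_limit tests in this patch):

/* caller holds cnt->lock; assumes usage <= limit and wmark_limit <= limit */
static bool res_counter_headroom_ok(struct res_counter *cnt,
				    unsigned long long wmark_limit)
{
	unsigned long long headroom = cnt->limit - cnt->usage;

	/* keep at least (limit - wmark_limit) uncharged, like a zone watermark */
	return headroom > cnt->limit - wmark_limit;
}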

>  /*
> @@ -112,9 +128,10 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
>   */
>  
>  int __must_check res_counter_charge_locked(struct res_counter *counter,
> -		unsigned long val);
> +		unsigned long val, int charge_flags);
>  int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> +		unsigned long val, int charge_flags,
> +		struct res_counter **limit_fail_at);
>  
>  /*
>   * uncharge - tell that some portion of the resource is released
> @@ -145,6 +162,24 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
>  	return false;
>  }
>  
> +static inline bool
> +res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
> +{

You only call this from one place and it takes the lock. Just collapse
the functions together. Instead of postfixing this with _locked, you
could have prefixed it with __ . There is a loose convention that
similarly named functions with __ imply that the caller is expected to
acquire the necessary locks.

Second, the naming of this function does not give the reader a clue as
to what the function is for. It needs a comment explaining what "true"
means. However, collapsing res_counter_check_under_low_wmark_limit and
this function together would give a better clue.

It'd probably be easier overall if you used the same pattern as
zone_watermark_ok and passed in the watermark as a parameter.
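
Something along these lines, say (sketch only; it reuses the CHARGE_WMARK_*
flags from this patch as the selector and only distinguishes high vs. low here):

static bool res_counter_under_wmark(struct res_counter *cnt, int wmark)
{
	unsigned long long limit;
	unsigned long flags;
	bool ret;

	spin_lock_irqsave(&cnt->lock, flags);
	limit = (wmark == CHARGE_WMARK_HIGH) ? cnt->high_wmark_limit :
					       cnt->low_wmark_limit;
	ret = cnt->usage < limit;
	spin_unlock_irqrestore(&cnt->lock, flags);

	return ret;
}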

> +	if (cnt->usage < cnt->high_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +
> +static inline bool
> +res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->low_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +
>  /**
>   * Get the difference between the usage and the soft limit
>   * @cnt: The counter
> @@ -193,6 +228,30 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
>  	return ret;
>  }
>  
> +static inline bool
> +res_counter_check_under_low_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_low_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
> +static inline bool
> +res_counter_check_under_high_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_high_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
>  static inline void res_counter_reset_max(struct res_counter *cnt)
>  {
>  	unsigned long flags;
> @@ -220,6 +279,8 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
>  	spin_lock_irqsave(&cnt->lock, flags);
>  	if (cnt->usage <= limit) {
>  		cnt->limit = limit;
> +		cnt->low_wmark_limit = limit;
> +		cnt->high_wmark_limit = limit;
>  		ret = 0;
>  	}
>  	spin_unlock_irqrestore(&cnt->lock, flags);
> @@ -238,4 +299,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
>  	return 0;
>  }
>  
> +static inline int
> +res_counter_set_high_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->high_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);

Is a full IRQ-safe lock here *really* necessary? Do interrupts call this
function? Even then, is a spinlock necessary at all? What are the consequences
if a parallel reader sees a temporarily stale value?

> +	return 0;
> +}

The return value is never anything but 0 so why return anything?

> +
> +static inline int
> +res_counter_set_low_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->low_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
>  #endif
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index c7eaa37..a524349 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -19,12 +19,26 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
>  	spin_lock_init(&counter->lock);
>  	counter->limit = RESOURCE_MAX;
>  	counter->soft_limit = RESOURCE_MAX;
> +	counter->low_wmark_limit = RESOURCE_MAX;
> +	counter->high_wmark_limit = RESOURCE_MAX;
>  	counter->parent = parent;
>  }
>  
> -int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> +int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
> +				int charge_flags)
>  {
> -	if (counter->usage + val > counter->limit) {
> +	unsigned long long limit = 0;
> +
> +	if (charge_flags & CHARGE_WMARK_LOW)
> +		limit = counter->low_wmark_limit;
> +
> +	if (charge_flags & CHARGE_WMARK_HIGH)
> +		limit = counter->high_wmark_limit;
> +
> +	if (charge_flags & CHARGE_WMARK_MIN)
> +		limit = counter->limit;
> +

Similar to zones, you can make the watermarks an array and use the WMARK
flags in charge_flags to index it. It'll reduce the number of branches.
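
i.e. something like this (illustrative sketch only; the enum, the array and
the changed signature are invented here, and the hard limit is folded into
the same array):

enum res_wmark {
	RES_WMARK_MIN,		/* the hard limit */
	RES_WMARK_LOW,
	RES_WMARK_HIGH,
	NR_RES_WMARKS
};

struct res_counter {
	/* ... usage, max_usage, soft_limit, failcnt, lock, parent ... */
	unsigned long long wmark[NR_RES_WMARKS];
};

int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
			      enum res_wmark wmark)
{
	if (counter->usage + val > counter->wmark[wmark]) {
		counter->failcnt++;
		return -ENOMEM;
	}
	counter->usage += val;
	if (counter->usage > counter->max_usage)
		counter->max_usage = counter->usage;
	return 0;
}

The flag-to-limit branches then disappear entirely.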

> +	if (counter->usage + val > limit) {
>  		counter->failcnt++;
>  		return -ENOMEM;
>  	}
> @@ -36,7 +50,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>  }
>  
>  int res_counter_charge(struct res_counter *counter, unsigned long val,
> -			struct res_counter **limit_fail_at)
> +			int charge_flags, struct res_counter **limit_fail_at)
>  {
>  	int ret;
>  	unsigned long flags;
> @@ -46,7 +60,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
>  		spin_lock(&c->lock);
> -		ret = res_counter_charge_locked(c, val);
> +		ret = res_counter_charge_locked(c, val, charge_flags);
>  		spin_unlock(&c->lock);
>  		if (ret < 0) {
>  			*limit_fail_at = c;
> @@ -103,6 +117,10 @@ res_counter_member(struct res_counter *counter, int member)
>  		return &counter->failcnt;
>  	case RES_SOFT_LIMIT:
>  		return &counter->soft_limit;
> +	case RES_LOW_WMARK_LIMIT:
> +		return &counter->low_wmark_limit;
> +	case RES_HIGH_WMARK_LIMIT:
> +		return &counter->high_wmark_limit;
>  	};
>  
>  	BUG();
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index dca3590..a0c6ed9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -265,6 +265,7 @@ struct mem_cgroup {
>  	spinlock_t pcp_counter_lock;
>  
>  	wait_queue_head_t *kswapd_wait;
> +	unsigned long min_free_kbytes;
>  };
>  
>  /* Stuffs for move charges at task migration. */
> @@ -370,6 +371,7 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  static void drain_all_stock_async(void);
> +static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
>  
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -796,6 +798,32 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>  	return (mem == root_mem_cgroup);
>  }
>  
> +void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> +{
> +	u64 limit;
> +	unsigned long min_free_kbytes;
> +
> +	min_free_kbytes = get_min_free_kbytes(mem);
> +	limit = mem_cgroup_get_limit(mem);
> +	if (min_free_kbytes == 0) {
> +		res_counter_set_low_wmark_limit(&mem->res, limit);
> +		res_counter_set_high_wmark_limit(&mem->res, limit);

This needs a comment stating that a min_free_kbytes of 0 means that
kswapd is never woken up.

> +	} else {
> +		unsigned long page_min = min_free_kbytes >> (PAGE_SHIFT - 10);
> +		unsigned long lowmem_pages = 2048;

What if the container is less than 8M?

> +		unsigned long low_wmark, high_wmark;
> +		u64 tmp;
> +
> +		tmp = (u64)page_min * limit;
> +		do_div(tmp, lowmem_pages);
> +
> +		low_wmark = tmp + (tmp >> 1);
> +		high_wmark = tmp + (tmp >> 2);
> +		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> +		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> +	}

If min_free_kbytes happens to be larger than the container, it may trigger
OOM. Mind you, the core VM suffers the same problem but in the case of tuning
containers it might be a lot easier for sysadmins to fall into the trap.

> +}
> +
>  /*
>   * Following LRU functions are allowed to be used without PCG_LOCK.
>   * Operations are called by routine of global LRU independently from memcg.
> @@ -1148,6 +1176,22 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +static unsigned long get_min_free_kbytes(struct mem_cgroup *memcg)
> +{
> +	struct cgroup *cgrp = memcg->css.cgroup;
> +	unsigned long min_free_kbytes;
> +
> +	/* root ? */
> +	if (cgrp == NULL || cgrp->parent == NULL)
> +		return 0;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	min_free_kbytes = memcg->min_free_kbytes;
> +	spin_unlock(&memcg->reclaim_param_lock);
> +

Is the lock really necessary? Again, reading a stale value seems harmless.

> +	return min_free_kbytes;
> +}
> +
>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>  {
>  	int cpu;
> @@ -1844,12 +1888,13 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	unsigned long flags = 0;
>  	int ret;
>  
> -	ret = res_counter_charge(&mem->res, csize, &fail_res);
> +	ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
>  
>  	if (likely(!ret)) {
>  		if (!do_swap_account)
>  			return CHARGE_OK;
> -		ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> +		ret = res_counter_charge(&mem->memsw, csize, CHARGE_WMARK_MIN,
> +					&fail_res);
>  		if (likely(!ret))
>  			return CHARGE_OK;
>  
> @@ -3733,6 +3778,37 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +static u64 mem_cgroup_min_free_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	return get_min_free_kbytes(memcg);
> +}
> +
> +static int mem_cgroup_min_free_write(struct cgroup *cgrp, struct cftype *cfg,
> +				     u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	struct mem_cgroup *parent;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +
> +	parent = mem_cgroup_from_cont(cgrp->parent);
> +
> +	cgroup_lock();
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	memcg->min_free_kbytes = val;
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	cgroup_unlock();
> +
> +	setup_per_memcg_wmarks(memcg);
> +	return 0;
> +
> +}
> +
>  static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
>  {
>  	struct mem_cgroup_threshold_ary *t;
> @@ -4024,6 +4100,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
>  	mutex_unlock(&memcg_oom_mutex);
>  }
>  
> +static int mem_cgroup_wmark_read(struct cgroup *cgrp,
> +	struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	unsigned long low_wmark, high_wmark;
> +
> +	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
> +	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> +
> +	cb->fill(cb, "memcg_low_wmark", low_wmark);
> +	cb->fill(cb, "memcg_high_wmark", high_wmark);
> +
> +	return 0;
> +}
> +
>  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>  	struct cftype *cft,  struct cgroup_map_cb *cb)
>  {
> @@ -4127,6 +4218,15 @@ static struct cftype mem_cgroup_files[] = {
>  		.unregister_event = mem_cgroup_oom_unregister_event,
>  		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>  	},
> +	{
> +		.name = "min_free_kbytes",
> +		.write_u64 = mem_cgroup_min_free_write,
> +		.read_u64 = mem_cgroup_min_free_read,
> +	},
> +	{
> +		.name = "reclaim_wmarks",
> +		.read_map = mem_cgroup_wmark_read,
> +	},
>  };
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -4308,6 +4408,19 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>  
> +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> +				int charge_flags)
> +{
> +	long ret = 0;
> +
> +	if (charge_flags & CHARGE_WMARK_LOW)
> +		ret = res_counter_check_under_low_wmark_limit(&mem->res);
> +	if (charge_flags & CHARGE_WMARK_HIGH)
> +		ret = res_counter_check_under_high_wmark_limit(&mem->res);
> +

The naming here is odd as well. We are using flags called CHARGE in
situations where nothing is being charged at all. If the enum
zone_watermark was reused and the watermarks were checked in a similar
way to the core VM (preferably identically), it would be easier to
understand both core VM and container behaviour.

> +	return ret;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> @@ -4450,10 +4563,12 @@ static int mem_cgroup_do_precharge(unsigned long count)
>  		 * are still under the same cgroup_mutex. So we can postpone
>  		 * css_get().
>  		 */
> -		if (res_counter_charge(&mem->res, PAGE_SIZE * count, &dummy))
> +		if (res_counter_charge(&mem->res, PAGE_SIZE * count,
> +					CHARGE_WMARK_MIN, &dummy))
>  			goto one_by_one;
>  		if (do_swap_account && res_counter_charge(&mem->memsw,
> -						PAGE_SIZE * count, &dummy)) {
> +						PAGE_SIZE * count,
> +						CHARGE_WMARK_MIN, &dummy)) {
>  			res_counter_uncharge(&mem->res, PAGE_SIZE * count);
>  			goto one_by_one;
>  		}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e08005e..6d5702b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -46,6 +46,8 @@
>  
>  #include <linux/swapops.h>
>  
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -2127,11 +2129,19 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
>  {
>  	int i;
>  	pg_data_t *pgdat = kswapd->kswapd_pgdat;
> +	struct mem_cgroup *mem = kswapd->kswapd_mem;
>  
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
>  		return 1;
>  
> +	if (mem) {
> +		if (!mem_cgroup_watermark_ok(kswapd->kswapd_mem,
> +						CHARGE_WMARK_HIGH))
> +			return 1;
> +		return 0;
> +	}
> +
>  	/* If after HZ/10, a zone is below the high mark, it's premature */
>  	for (i = 0; i < pgdat->nr_zones; i++) {
>  		struct zone *zone = pgdat->node_zones + i;

Very broadly speaking, I'd be happier if watermarks for containers
behaved similarly to watermarks in the core VM.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07 12:33   ` Mel Gorman
@ 2010-12-07 17:28     ` Ying Han
  2010-12-08  0:39       ` KAMEZAWA Hiroyuki
  2010-12-08  7:21       ` KOSAKI Motohiro
  2010-12-07 18:50     ` Ying Han
  2010-12-08  7:22     ` KOSAKI Motohiro
  2 siblings, 2 replies; 52+ messages in thread
From: Ying Han @ 2010-12-07 17:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Nov 29, 2010 at 10:49:42PM -0800, Ying Han wrote:
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup.
>
> What is considered a normal number of cgroups in production? 10, 50, 10000?
Normally it is less than 100. I assume there is a cap on the number of
cgroups that can be created per system.

> If it's a really large number and all the cgroups kswapds wake at the same time,
> the zone LRU lock will be very heavily contended.

Thanks for reviewing the patch~

Agree. The zone->lru_lock is another thing we are looking at. Eventually,
we need to break that lock up into per-zone, per-memcg LRU locks.

> Potentially there will
> also be a very large number of new IO sources. I confess I haven't read the
> thread yet so maybe this has already been thought of but it might make sense
> to have a 1:N relationship between kswapd and memcgroups and cycle between
> containers. The difficulty will be a latency between when kswapd wakes up
> and when a particular container is scanned. The closer the ratio is to 1:1,
> the less the latency will be but the higher the contention on the LRU lock
> and IO will be.

No, we haven't talked about the mapping anywhere in the thread. Having many
kswapd threads at the same time isn't a problem as long as there is no locking
contention (e.g., 1k kswapd threads on a 1k-node fake NUMA system). So breaking
up the zone->lru_lock should work.
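
(To be concrete, "breaking the lock" would roughly mean giving each per-memcg,
per-zone LRU its own lock, e.g. a hypothetical field like this, instead of
funnelling every memcg through zone->lru_lock:)

struct mem_cgroup_per_zone {
	spinlock_t		lru_lock;	/* hypothetical: protects the
						 * lists below instead of
						 * zone->lru_lock */
	struct list_head	lists[NR_LRU_LISTS];
	unsigned long		count[NR_LRU_LISTS];
	/* ... rest of the existing fields ... */
};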

>
>> The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor. The kswapd descriptor stores information of node
>> or cgroup and it allows the global and per cgroup background reclaim to share
>> common reclaim algorithms.
>>
>> This patch adds the kswapd descriptor and changes the per-zone kswapd_wait to the
>> common data structure.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  include/linux/mmzone.h |    3 +-
>>  include/linux/swap.h   |   10 +++++
>>  mm/memcontrol.c        |    2 +
>>  mm/mmzone.c            |    2 +-
>>  mm/page_alloc.c        |    9 +++-
>>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>>  6 files changed, 90 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 39c24eb..c77dfa2 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>>       unsigned long node_spanned_pages; /* total size of physical page
>>                                            range, including holes */
>>       int node_id;
>> -     wait_queue_head_t kswapd_wait;
>> -     struct task_struct *kswapd;
>> +     wait_queue_head_t *kswapd_wait;
>>       int kswapd_max_order;
>>  } pg_data_t;
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index eba53e7..2e6cb58 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>>       return current->flags & PF_KSWAPD;
>>  }
>>
>> +struct kswapd {
>> +     struct task_struct *kswapd_task;
>> +     wait_queue_head_t kswapd_wait;
>> +     struct mem_cgroup *kswapd_mem;
>> +     pg_data_t *kswapd_pgdat;
>> +};
>> +
>> +#define MAX_KSWAPDS MAX_NUMNODES
>> +extern struct kswapd kswapds[MAX_KSWAPDS];
>
> This is potentially very large for a static structure. Can they not be
> dynamically allocated and kept on a list? Yes, there will be a list walk
> involved if yonu need a particular structure but that looks like it's a
> rare operation at this point.
>
>> +int kswapd(void *p);
>>  /*
>>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>>   * be swapped to.  The swap type and the offset into that swap type are
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a4034b6..dca3590 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -263,6 +263,8 @@ struct mem_cgroup {
>>        */
>>       struct mem_cgroup_stat_cpu nocpu_base;
>>       spinlock_t pcp_counter_lock;
>> +
>> +     wait_queue_head_t *kswapd_wait;
>>  };
>>
>>  /* Stuffs for move charges at task migration. */
>> diff --git a/mm/mmzone.c b/mm/mmzone.c
>> index e35bfb8..c7cbed5 100644
>> --- a/mm/mmzone.c
>> +++ b/mm/mmzone.c
>> @@ -102,7 +102,7 @@ unsigned long zone_nr_free_pages(struct zone *zone)
>>        * free pages are low, get a better estimate for free pages
>>        */
>>       if (nr_free_pages < zone->percpu_drift_mark &&
>> -                     !waitqueue_active(&zone->zone_pgdat->kswapd_wait))
>> +                     !waitqueue_active(zone->zone_pgdat->kswapd_wait))
>>               return zone_page_state_snapshot(zone, NR_FREE_PAGES);
>>
>>       return nr_free_pages;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b48dea2..a15bc1c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4070,13 +4070,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>       int nid = pgdat->node_id;
>>       unsigned long zone_start_pfn = pgdat->node_start_pfn;
>>       int ret;
>> +     struct kswapd *kswapd_p;
>>
>>       pgdat_resize_init(pgdat);
>>       pgdat->nr_zones = 0;
>> -     init_waitqueue_head(&pgdat->kswapd_wait);
>>       pgdat->kswapd_max_order = 0;
>>       pgdat_page_cgroup_init(pgdat);
>> -
>> +
>> +     kswapd_p = &kswapds[nid];
>> +     init_waitqueue_head(&kswapd_p->kswapd_wait);
>> +     pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
>> +     kswapd_p->kswapd_pgdat = pgdat;
>> +
>>       for (j = 0; j < MAX_NR_ZONES; j++) {
>>               struct zone *zone = pgdat->node_zones + j;
>>               unsigned long size, realsize, memmap_pages;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b8a6fdc..e08005e 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>>
>>       return nr_reclaimed;
>>  }
>> +
>>  #endif
>>
>
> Unnecessary whitespace there.
>
>> +DEFINE_SPINLOCK(kswapds_spinlock);
>> +struct kswapd kswapds[MAX_KSWAPDS];
>> +
>>  /* is kswapd sleeping prematurely? */
>> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
>> +                             long remaining)
>>  {
>>       int i;
>> +     pg_data_t *pgdat = kswapd->kswapd_pgdat;
>>
>
> This will behave strangely. You are using information from a *node* to
> determine if the kswapd belonging to a cgroup should sleep or not.


> The
> risk is that a cgroup kswapd never goes to sleep because even when all
> of its pages are discarded, the node itself is still not balanced.

There is one kswapd descriptor per node and one per cgroup. I believe I have
the logic in a later patch to separate them out, and the per-cgroup kswapd uses
the wmarks calculated from the limits. Like this:

static int sleeping_prematurely(struct kswapd *kswapd, int order,
				long remaining)

	if (mem) {
		if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
			return 1;
		return 0;
	}

>
>>       /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>>       if (remaining)
>> @@ -2377,21 +2383,28 @@ out:
>>   * If there are applications that are active memory-allocators
>>   * (most normal use), this basically shouldn't matter.
>>   */
>> -static int kswapd(void *p)
>> +int kswapd(void *p)
>>  {
>>       unsigned long order;
>> -     pg_data_t *pgdat = (pg_data_t*)p;
>> +     struct kswapd *kswapd_p = (struct kswapd *)p;
>> +     pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
>> +     struct mem_cgroup *mem = kswapd_p->kswapd_mem;
>> +     wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>>       struct task_struct *tsk = current;
>>       DEFINE_WAIT(wait);
>>       struct reclaim_state reclaim_state = {
>>               .reclaimed_slab = 0,
>>       };
>> -     const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>> +     const struct cpumask *cpumask;
>>
>>       lockdep_set_current_reclaim_state(GFP_KERNEL);
>>
>> -     if (!cpumask_empty(cpumask))
>> -             set_cpus_allowed_ptr(tsk, cpumask);
>> +     if (pgdat) {
>> +             BUG_ON(pgdat->kswapd_wait != wait_h);
>> +             cpumask = cpumask_of_node(pgdat->node_id);
>> +             if (!cpumask_empty(cpumask))
>> +                     set_cpus_allowed_ptr(tsk, cpumask);
>> +     }
>>       current->reclaim_state = &reclaim_state;
>>
>>       /*
>> @@ -2414,9 +2427,13 @@ static int kswapd(void *p)
>>               unsigned long new_order;
>>               int ret;
>>
>> -             prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> -             new_order = pgdat->kswapd_max_order;
>> -             pgdat->kswapd_max_order = 0;
>> +             prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>> +             if (pgdat) {
>> +                     new_order = pgdat->kswapd_max_order;
>> +                     pgdat->kswapd_max_order = 0;
>> +             } else
>> +                     new_order = 0;
>> +
>>               if (order < new_order) {
>>                       /*
>>                        * Don't sleep if someone wants a larger 'order'
>> @@ -2428,10 +2445,12 @@ static int kswapd(void *p)
>>                               long remaining = 0;
>>
>>                               /* Try to sleep for a short interval */
>> -                             if (!sleeping_prematurely(pgdat, order, remaining)) {
>> +                             if (!sleeping_prematurely(kswapd_p, order,
>> +                                                     remaining)) {
>>                                       remaining = schedule_timeout(HZ/10);
>> -                                     finish_wait(&pgdat->kswapd_wait, &wait);
>> -                                     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> +                                     finish_wait(wait_h, &wait);
>> +                                     prepare_to_wait(wait_h, &wait,
>> +                                                     TASK_INTERRUPTIBLE);
>
> It would be nice if patch 1 did nothing but move the wait queue outside of
> the node structure without any other functional change. It'll then be
> far easier to review a patch that introduces background reclaim for containers.

I will look into splitting this patch into two.

--Ying

>
>>                               }
>>
>>                               /*
>> @@ -2439,20 +2458,25 @@ static int kswapd(void *p)
>>                                * premature sleep. If not, then go fully
>>                                * to sleep until explicitly woken up
>>                                */
>> -                             if (!sleeping_prematurely(pgdat, order, remaining)) {
>> -                                     trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>> +                             if (!sleeping_prematurely(kswapd_p, order,
>> +                                                             remaining)) {
>> +                                     if (pgdat)
>> +                                             trace_mm_vmscan_kswapd_sleep(
>> +                                                             pgdat->node_id);
>>                                       schedule();
>>                               } else {
>>                                       if (remaining)
>> -                                             count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>> +                                             count_vm_event(
>> +                                             KSWAPD_LOW_WMARK_HIT_QUICKLY);
>>                                       else
>> -                                             count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>> +                                             count_vm_event(
>> +                                             KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>>                               }
>>                       }
>> -
>> -                     order = pgdat->kswapd_max_order;
>> +                     if (pgdat)
>> +                             order = pgdat->kswapd_max_order;
>>               }
>> -             finish_wait(&pgdat->kswapd_wait, &wait);
>> +             finish_wait(wait_h, &wait);
>>
>>               ret = try_to_freeze();
>>               if (kthread_should_stop())
>> @@ -2476,6 +2500,7 @@ static int kswapd(void *p)
>>  void wakeup_kswapd(struct zone *zone, int order)
>>  {
>>       pg_data_t *pgdat;
>> +     wait_queue_head_t *wait;
>>
>>       if (!populated_zone(zone))
>>               return;
>> @@ -2488,9 +2513,10 @@ void wakeup_kswapd(struct zone *zone, int order)
>>       trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>>       if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>               return;
>> -     if (!waitqueue_active(&pgdat->kswapd_wait))
>> +     wait = pgdat->kswapd_wait;
>> +     if (!waitqueue_active(wait))
>>               return;
>> -     wake_up_interruptible(&pgdat->kswapd_wait);
>> +     wake_up_interruptible(wait);
>>  }
>>
>>  /*
>> @@ -2587,7 +2613,10 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>>
>>                       if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>>                               /* One of our CPUs online: restore mask */
>> -                             set_cpus_allowed_ptr(pgdat->kswapd, mask);
>> +                             if (kswapds[nid].kswapd_task)
>> +                                     set_cpus_allowed_ptr(
>> +                                             kswapds[nid].kswapd_task,
>> +                                             mask);
>>               }
>>       }
>>       return NOTIFY_OK;
>> @@ -2599,19 +2628,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>>   */
>>  int kswapd_run(int nid)
>>  {
>> -     pg_data_t *pgdat = NODE_DATA(nid);
>> +     struct task_struct *thr;
>>       int ret = 0;
>>
>> -     if (pgdat->kswapd)
>> +     if (kswapds[nid].kswapd_task)
>>               return 0;
>>
>> -     pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
>> -     if (IS_ERR(pgdat->kswapd)) {
>> +     thr = kthread_run(kswapd, &kswapds[nid], "kswapd%d", nid);
>> +     if (IS_ERR(thr)) {
>>               /* failure at boot is fatal */
>>               BUG_ON(system_state == SYSTEM_BOOTING);
>>               printk("Failed to start kswapd on node %d\n",nid);
>>               ret = -1;
>>       }
>> +     kswapds[nid].kswapd_task = thr;
>>       return ret;
>>  }
>>
>> @@ -2620,10 +2650,20 @@ int kswapd_run(int nid)
>>   */
>>  void kswapd_stop(int nid)
>>  {
>> -     struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
>> +     struct task_struct *thr;
>> +     struct kswapd *kswapd_p;
>> +     wait_queue_head_t *wait;
>> +
>> +     pg_data_t *pgdat = NODE_DATA(nid);
>> +
>> +     spin_lock(&kswapds_spinlock);
>> +     wait = pgdat->kswapd_wait;
>> +     kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>> +     thr = kswapd_p->kswapd_task;
>> +     spin_unlock(&kswapds_spinlock);
>>
>> -     if (kswapd)
>> -             kthread_stop(kswapd);
>> +     if (thr)
>> +             kthread_stop(thr);
>>  }
>>
>>  static int __init kswapd_init(void)
>> --
>> 1.7.3.1
>>
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07 12:33   ` Mel Gorman
  2010-12-07 17:28     ` Ying Han
@ 2010-12-07 18:50     ` Ying Han
  2010-12-08  7:22     ` KOSAKI Motohiro
  2 siblings, 0 replies; 52+ messages in thread
From: Ying Han @ 2010-12-07 18:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Andrew Morton, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Nov 29, 2010 at 10:49:42PM -0800, Ying Han wrote:
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup.
>
> What is considered a normal number of cgroups in production? 10, 50, 10000? If
> it's a really large number and all the cgroup kswapds wake at the same time,
> the zone LRU lock will be very heavily contended.  Potentially there will
> also be a very large number of new IO sources. I confess I haven't read the
> thread yet so maybe this has already been thought of but it might make sense
> to have a 1:N relationship between kswapd and memcgroups and cycle between
> containers. The difficulty will be a latency between when kswapd wakes up
> and when a particular container is scanned. The closer the ratio is to 1:1,
> the less the latency will be but the higher the contention on the LRU lock
> and IO will be.
>
>> The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor. The kswapd descriptor stores information of node
>> or cgroup and it allows the global and per cgroup background reclaim to share
>> common reclaim algorithms.
>>
>> This patch adds the kswapd descriptor and changes per zone kswapd_wait to the
>> common data structure.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  include/linux/mmzone.h |    3 +-
>>  include/linux/swap.h   |   10 +++++
>>  mm/memcontrol.c        |    2 +
>>  mm/mmzone.c            |    2 +-
>>  mm/page_alloc.c        |    9 +++-
>>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>>  6 files changed, 90 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 39c24eb..c77dfa2 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>>       unsigned long node_spanned_pages; /* total size of physical page
>>                                            range, including holes */
>>       int node_id;
>> -     wait_queue_head_t kswapd_wait;
>> -     struct task_struct *kswapd;
>> +     wait_queue_head_t *kswapd_wait;
>>       int kswapd_max_order;
>>  } pg_data_t;
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index eba53e7..2e6cb58 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>>       return current->flags & PF_KSWAPD;
>>  }
>>
>> +struct kswapd {
>> +     struct task_struct *kswapd_task;
>> +     wait_queue_head_t kswapd_wait;
>> +     struct mem_cgroup *kswapd_mem;
>> +     pg_data_t *kswapd_pgdat;
>> +};
>> +
>> +#define MAX_KSWAPDS MAX_NUMNODES
>> +extern struct kswapd kswapds[MAX_KSWAPDS];
>
> This is potentially very large for a static structure. Can they not be
> dynamically allocated and kept on a list? Yes, there will be a list walk
> involved if you need a particular structure but that looks like it's a
> rare operation at this point.

This has been changed to dynamic allocation in the V2 I am working on. The
kswapd descriptor is dynamically allocated at kswapd_run() for both the
per-node and per-cgroup kswapds, and there is no list walking since the
descriptor can be found with container_of(wait, struct kswapd, kswapd_wait);
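
As a rough illustration of that V2 direction (a sketch only, with the
already-running check and locking trimmed; this is not the code that will
be posted):

int kswapd_run(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	struct kswapd *kswapd_p;
	struct task_struct *thr;

	/* one descriptor per node, allocated when the thread is started */
	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
	if (!kswapd_p)
		return -ENOMEM;

	init_waitqueue_head(&kswapd_p->kswapd_wait);
	kswapd_p->kswapd_pgdat = pgdat;
	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;

	thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
	if (IS_ERR(thr)) {
		/* failure at boot is fatal, as in the current code */
		BUG_ON(system_state == SYSTEM_BOOTING);
		kfree(kswapd_p);
		return -1;
	}
	kswapd_p->kswapd_task = thr;
	return 0;
}

The descriptor is then recovered from the wait queue head wherever only
the pgdat (or the memcg) is at hand:

	struct kswapd *kswapd_p = container_of(pgdat->kswapd_wait,
					       struct kswapd, kswapd_wait);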

--Ying
>
>> +int kswapd(void *p);
>>  /*
>>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>>   * be swapped to.  The swap type and the offset into that swap type are
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a4034b6..dca3590 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -263,6 +263,8 @@ struct mem_cgroup {
>>        */
>>       struct mem_cgroup_stat_cpu nocpu_base;
>>       spinlock_t pcp_counter_lock;
>> +
>> +     wait_queue_head_t *kswapd_wait;
>>  };
>>
>>  /* Stuffs for move charges at task migration. */
>> diff --git a/mm/mmzone.c b/mm/mmzone.c
>> index e35bfb8..c7cbed5 100644
>> --- a/mm/mmzone.c
>> +++ b/mm/mmzone.c
>> @@ -102,7 +102,7 @@ unsigned long zone_nr_free_pages(struct zone *zone)
>>        * free pages are low, get a better estimate for free pages
>>        */
>>       if (nr_free_pages < zone->percpu_drift_mark &&
>> -                     !waitqueue_active(&zone->zone_pgdat->kswapd_wait))
>> +                     !waitqueue_active(zone->zone_pgdat->kswapd_wait))
>>               return zone_page_state_snapshot(zone, NR_FREE_PAGES);
>>
>>       return nr_free_pages;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b48dea2..a15bc1c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4070,13 +4070,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>       int nid = pgdat->node_id;
>>       unsigned long zone_start_pfn = pgdat->node_start_pfn;
>>       int ret;
>> +     struct kswapd *kswapd_p;
>>
>>       pgdat_resize_init(pgdat);
>>       pgdat->nr_zones = 0;
>> -     init_waitqueue_head(&pgdat->kswapd_wait);
>>       pgdat->kswapd_max_order = 0;
>>       pgdat_page_cgroup_init(pgdat);
>> -
>> +
>> +     kswapd_p = &kswapds[nid];
>> +     init_waitqueue_head(&kswapd_p->kswapd_wait);
>> +     pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
>> +     kswapd_p->kswapd_pgdat = pgdat;
>> +
>>       for (j = 0; j < MAX_NR_ZONES; j++) {
>>               struct zone *zone = pgdat->node_zones + j;
>>               unsigned long size, realsize, memmap_pages;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b8a6fdc..e08005e 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>>
>>       return nr_reclaimed;
>>  }
>> +
>>  #endif
>>
>
> Unnecessary whitespace there.
>
>> +DEFINE_SPINLOCK(kswapds_spinlock);
>> +struct kswapd kswapds[MAX_KSWAPDS];
>> +
>>  /* is kswapd sleeping prematurely? */
>> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
>> +                             long remaining)
>>  {
>>       int i;
>> +     pg_data_t *pgdat = kswapd->kswapd_pgdat;
>>
>
> This will behave strangely. You are using information from a *node* to
> determine if the kswapd belonging to a cgroup should sleep or not. The
> risk is that a cgroup kswapd never goes to sleep because even when all
> of its pages are discarded, the node itself is still not balanced.
>
>>       /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>>       if (remaining)
>> @@ -2377,21 +2383,28 @@ out:
>>   * If there are applications that are active memory-allocators
>>   * (most normal use), this basically shouldn't matter.
>>   */
>> -static int kswapd(void *p)
>> +int kswapd(void *p)
>>  {
>>       unsigned long order;
>> -     pg_data_t *pgdat = (pg_data_t*)p;
>> +     struct kswapd *kswapd_p = (struct kswapd *)p;
>> +     pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
>> +     struct mem_cgroup *mem = kswapd_p->kswapd_mem;
>> +     wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>>       struct task_struct *tsk = current;
>>       DEFINE_WAIT(wait);
>>       struct reclaim_state reclaim_state = {
>>               .reclaimed_slab = 0,
>>       };
>> -     const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>> +     const struct cpumask *cpumask;
>>
>>       lockdep_set_current_reclaim_state(GFP_KERNEL);
>>
>> -     if (!cpumask_empty(cpumask))
>> -             set_cpus_allowed_ptr(tsk, cpumask);
>> +     if (pgdat) {
>> +             BUG_ON(pgdat->kswapd_wait != wait_h);
>> +             cpumask = cpumask_of_node(pgdat->node_id);
>> +             if (!cpumask_empty(cpumask))
>> +                     set_cpus_allowed_ptr(tsk, cpumask);
>> +     }
>>       current->reclaim_state = &reclaim_state;
>>
>>       /*
>> @@ -2414,9 +2427,13 @@ static int kswapd(void *p)
>>               unsigned long new_order;
>>               int ret;
>>
>> -             prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> -             new_order = pgdat->kswapd_max_order;
>> -             pgdat->kswapd_max_order = 0;
>> +             prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>> +             if (pgdat) {
>> +                     new_order = pgdat->kswapd_max_order;
>> +                     pgdat->kswapd_max_order = 0;
>> +             } else
>> +                     new_order = 0;
>> +
>>               if (order < new_order) {
>>                       /*
>>                        * Don't sleep if someone wants a larger 'order'
>> @@ -2428,10 +2445,12 @@ static int kswapd(void *p)
>>                               long remaining = 0;
>>
>>                               /* Try to sleep for a short interval */
>> -                             if (!sleeping_prematurely(pgdat, order, remaining)) {
>> +                             if (!sleeping_prematurely(kswapd_p, order,
>> +                                                     remaining)) {
>>                                       remaining = schedule_timeout(HZ/10);
>> -                                     finish_wait(&pgdat->kswapd_wait, &wait);
>> -                                     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> +                                     finish_wait(wait_h, &wait);
>> +                                     prepare_to_wait(wait_h, &wait,
>> +                                                     TASK_INTERRUPTIBLE);
>
> It would be nice if patch 1 did nothing but move the wait queue outside of
> the node structure without any other functional change. It'll then be
> far easier to review a patch that introduces background reclaim for containers.
>
>>                               }
>>
>>                               /*
>> @@ -2439,20 +2458,25 @@ static int kswapd(void *p)
>>                                * premature sleep. If not, then go fully
>>                                * to sleep until explicitly woken up
>>                                */
>> -                             if (!sleeping_prematurely(pgdat, order, remaining)) {
>> -                                     trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>> +                             if (!sleeping_prematurely(kswapd_p, order,
>> +                                                             remaining)) {
>> +                                     if (pgdat)
>> +                                             trace_mm_vmscan_kswapd_sleep(
>> +                                                             pgdat->node_id);
>>                                       schedule();
>>                               } else {
>>                                       if (remaining)
>> -                                             count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>> +                                             count_vm_event(
>> +                                             KSWAPD_LOW_WMARK_HIT_QUICKLY);
>>                                       else
>> -                                             count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>> +                                             count_vm_event(
>> +                                             KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>>                               }
>>                       }
>> -
>> -                     order = pgdat->kswapd_max_order;
>> +                     if (pgdat)
>> +                             order = pgdat->kswapd_max_order;
>>               }
>> -             finish_wait(&pgdat->kswapd_wait, &wait);
>> +             finish_wait(wait_h, &wait);
>>
>>               ret = try_to_freeze();
>>               if (kthread_should_stop())
>> @@ -2476,6 +2500,7 @@ static int kswapd(void *p)
>>  void wakeup_kswapd(struct zone *zone, int order)
>>  {
>>       pg_data_t *pgdat;
>> +     wait_queue_head_t *wait;
>>
>>       if (!populated_zone(zone))
>>               return;
>> @@ -2488,9 +2513,10 @@ void wakeup_kswapd(struct zone *zone, int order)
>>       trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>>       if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>               return;
>> -     if (!waitqueue_active(&pgdat->kswapd_wait))
>> +     wait = pgdat->kswapd_wait;
>> +     if (!waitqueue_active(wait))
>>               return;
>> -     wake_up_interruptible(&pgdat->kswapd_wait);
>> +     wake_up_interruptible(wait);
>>  }
>>
>>  /*
>> @@ -2587,7 +2613,10 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>>
>>                       if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>>                               /* One of our CPUs online: restore mask */
>> -                             set_cpus_allowed_ptr(pgdat->kswapd, mask);
>> +                             if (kswapds[nid].kswapd_task)
>> +                                     set_cpus_allowed_ptr(
>> +                                             kswapds[nid].kswapd_task,
>> +                                             mask);
>>               }
>>       }
>>       return NOTIFY_OK;
>> @@ -2599,19 +2628,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>>   */
>>  int kswapd_run(int nid)
>>  {
>> -     pg_data_t *pgdat = NODE_DATA(nid);
>> +     struct task_struct *thr;
>>       int ret = 0;
>>
>> -     if (pgdat->kswapd)
>> +     if (kswapds[nid].kswapd_task)
>>               return 0;
>>
>> -     pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
>> -     if (IS_ERR(pgdat->kswapd)) {
>> +     thr = kthread_run(kswapd, &kswapds[nid], "kswapd%d", nid);
>> +     if (IS_ERR(thr)) {
>>               /* failure at boot is fatal */
>>               BUG_ON(system_state == SYSTEM_BOOTING);
>>               printk("Failed to start kswapd on node %d\n",nid);
>>               ret = -1;
>>       }
>> +     kswapds[nid].kswapd_task = thr;
>>       return ret;
>>  }
>>
>> @@ -2620,10 +2650,20 @@ int kswapd_run(int nid)
>>   */
>>  void kswapd_stop(int nid)
>>  {
>> -     struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
>> +     struct task_struct *thr;
>> +     struct kswapd *kswapd_p;
>> +     wait_queue_head_t *wait;
>> +
>> +     pg_data_t *pgdat = NODE_DATA(nid);
>> +
>> +     spin_lock(&kswapds_spinlock);
>> +     wait = pgdat->kswapd_wait;
>> +     kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>> +     thr = kswapd_p->kswapd_task;
>> +     spin_unlock(&kswapds_spinlock);
>>
>> -     if (kswapd)
>> -             kthread_stop(kswapd);
>> +     if (thr)
>> +             kthread_stop(thr);
>>  }
>>
>>  static int __init kswapd_init(void)
>> --
>> 1.7.3.1
>>
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07  6:52   ` Balbir Singh
@ 2010-12-07 19:21     ` Ying Han
  0 siblings, 0 replies; 52+ messages in thread
From: Ying Han @ 2010-12-07 19:21 UTC (permalink / raw)
  To: balbir
  Cc: Daisuke Nishimura, KAMEZAWA Hiroyuki, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Mon, Dec 6, 2010 at 10:52 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Ying Han <yinghan@google.com> [2010-11-29 22:49:42]:
>
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup.
>
> Could you please elaborate on this: what is being added? Creating a thread?

OK, I will write a better description in V2.

>
> The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor. The kswapd descriptor stores information of node
>> or cgroup and it allows the global and per cgroup background reclaim to share
>> common reclaim algorithms.
>>
>> This patch adds the kswapd descriptor and changes per zone kswapd_wait to the
>> common data structure.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>
> The performance data you posted earlier is helpful. Do you have any
> additional insights on the CPU overheads, if any?

I haven't measured the kswapd CPU-time overhead; numbers will be posted
with the next patch.

>
> My general overall comment is that this patch needs to be refactored
> to bring out the change the patch makes.
>
>>  include/linux/mmzone.h |    3 +-
>>  include/linux/swap.h   |   10 +++++
>>  mm/memcontrol.c        |    2 +
>>  mm/mmzone.c            |    2 +-
>>  mm/page_alloc.c        |    9 +++-
>>  mm/vmscan.c            |   98 +++++++++++++++++++++++++++++++++--------------
>>  6 files changed, 90 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 39c24eb..c77dfa2 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -642,8 +642,7 @@ typedef struct pglist_data {
>>       unsigned long node_spanned_pages; /* total size of physical page
>>                                            range, including holes */
>>       int node_id;
>> -     wait_queue_head_t kswapd_wait;
>> -     struct task_struct *kswapd;
>> +     wait_queue_head_t *kswapd_wait;
>>       int kswapd_max_order;
>>  } pg_data_t;
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index eba53e7..2e6cb58 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -26,6 +26,16 @@ static inline int current_is_kswapd(void)
>>       return current->flags & PF_KSWAPD;
>>  }
>>
>> +struct kswapd {
>> +     struct task_struct *kswapd_task;
>> +     wait_queue_head_t kswapd_wait;
>> +     struct mem_cgroup *kswapd_mem;
>
> Is this field being used anywhere in this patch?

I will move this to patch 3.

>
>> +     pg_data_t *kswapd_pgdat;
>> +};
>> +
>> +#define MAX_KSWAPDS MAX_NUMNODES
>> +extern struct kswapd kswapds[MAX_KSWAPDS];
>> +int kswapd(void *p);
>>  /*
>>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>>   * be swapped to.  The swap type and the offset into that swap type are
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a4034b6..dca3590 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -263,6 +263,8 @@ struct mem_cgroup {
>>        */
>>       struct mem_cgroup_stat_cpu nocpu_base;
>>       spinlock_t pcp_counter_lock;
>> +
>> +     wait_queue_head_t *kswapd_wait;
>>  };
>>
>>  /* Stuffs for move charges at task migration. */
>> diff --git a/mm/mmzone.c b/mm/mmzone.c
>> index e35bfb8..c7cbed5 100644
>> --- a/mm/mmzone.c
>> +++ b/mm/mmzone.c
>> @@ -102,7 +102,7 @@ unsigned long zone_nr_free_pages(struct zone *zone)
>>        * free pages are low, get a better estimate for free pages
>>        */
>>       if (nr_free_pages < zone->percpu_drift_mark &&
>> -                     !waitqueue_active(&zone->zone_pgdat->kswapd_wait))
>> +                     !waitqueue_active(zone->zone_pgdat->kswapd_wait))
>>               return zone_page_state_snapshot(zone, NR_FREE_PAGES);
>>
>>       return nr_free_pages;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b48dea2..a15bc1c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4070,13 +4070,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>       int nid = pgdat->node_id;
>>       unsigned long zone_start_pfn = pgdat->node_start_pfn;
>>       int ret;
>> +     struct kswapd *kswapd_p;
>
> _p is sort of ugly, do we really need it?

will change.

>
>>
>>       pgdat_resize_init(pgdat);
>>       pgdat->nr_zones = 0;
>> -     init_waitqueue_head(&pgdat->kswapd_wait);
>>       pgdat->kswapd_max_order = 0;
>>       pgdat_page_cgroup_init(pgdat);
>> -
>
> Thanks for the whitespace cleanup, but I don't know if that should be
> done here.

done.

>
>> +
>> +     kswapd_p = &kswapds[nid];
>> +     init_waitqueue_head(&kswapd_p->kswapd_wait);
>> +     pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
>> +     kswapd_p->kswapd_pgdat = pgdat;
>> +
>>       for (j = 0; j < MAX_NR_ZONES; j++) {
>>               struct zone *zone = pgdat->node_zones + j;
>>               unsigned long size, realsize, memmap_pages;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b8a6fdc..e08005e 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>>
>>       return nr_reclaimed;
>>  }
>> +
>>  #endif
>>
>> +DEFINE_SPINLOCK(kswapds_spinlock);
>> +struct kswapd kswapds[MAX_KSWAPDS];
>> +
>>  /* is kswapd sleeping prematurely? */
>> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>> +static int sleeping_prematurely(struct kswapd *kswapd, int order,
>> +                             long remaining)
>>  {
>>       int i;
>> +     pg_data_t *pgdat = kswapd->kswapd_pgdat;
>>
>>       /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>>       if (remaining)
>> @@ -2377,21 +2383,28 @@ out:
>>   * If there are applications that are active memory-allocators
>>   * (most normal use), this basically shouldn't matter.
>>   */
>> -static int kswapd(void *p)
>> +int kswapd(void *p)
>>  {
>>       unsigned long order;
>> -     pg_data_t *pgdat = (pg_data_t*)p;
>> +     struct kswapd *kswapd_p = (struct kswapd *)p;
>> +     pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
>> +     struct mem_cgroup *mem = kswapd_p->kswapd_mem;
>
> Do we use mem anywhere?

I will move it to patch 3.
>
>> +     wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>
> _p, _h almost look like hungarian notation in reverse :)
>
>>       struct task_struct *tsk = current;
>>       DEFINE_WAIT(wait);
>>       struct reclaim_state reclaim_state = {
>>               .reclaimed_slab = 0,
>>       };
>> -     const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>> +     const struct cpumask *cpumask;
>>
>>       lockdep_set_current_reclaim_state(GFP_KERNEL);
>>
>> -     if (!cpumask_empty(cpumask))
>> -             set_cpus_allowed_ptr(tsk, cpumask);
>> +     if (pgdat) {
>> +             BUG_ON(pgdat->kswapd_wait != wait_h);
>> +             cpumask = cpumask_of_node(pgdat->node_id);
>> +             if (!cpumask_empty(cpumask))
>> +                     set_cpus_allowed_ptr(tsk, cpumask);
>> +     }
>>       current->reclaim_state = &reclaim_state;
>>
>>       /*
>> @@ -2414,9 +2427,13 @@ static int kswapd(void *p)
>>               unsigned long new_order;
>>               int ret;
>>
>> -             prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> -             new_order = pgdat->kswapd_max_order;
>> -             pgdat->kswapd_max_order = 0;
>> +             prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
>> +             if (pgdat) {
>> +                     new_order = pgdat->kswapd_max_order;
>> +                     pgdat->kswapd_max_order = 0;
>> +             } else
>> +                     new_order = 0;
>> +
>>               if (order < new_order) {
>>                       /*
>>                        * Don't sleep if someone wants a larger 'order'
>> @@ -2428,10 +2445,12 @@ static int kswapd(void *p)
>>                               long remaining = 0;
>>
>>                               /* Try to sleep for a short interval */
>> -                             if (!sleeping_prematurely(pgdat, order, remaining)) {
>> +                             if (!sleeping_prematurely(kswapd_p, order,
>> +                                                     remaining)) {
>>                                       remaining = schedule_timeout(HZ/10);
>> -                                     finish_wait(&pgdat->kswapd_wait, &wait);
>> -                                     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> +                                     finish_wait(wait_h, &wait);
>> +                                     prepare_to_wait(wait_h, &wait,
>> +                                                     TASK_INTERRUPTIBLE);
>>                               }
>>
>>                               /*
>> @@ -2439,20 +2458,25 @@ static int kswapd(void *p)
>>                                * premature sleep. If not, then go fully
>>                                * to sleep until explicitly woken up
>>                                */
>> -                             if (!sleeping_prematurely(pgdat, order, remaining)) {
>> -                                     trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>> +                             if (!sleeping_prematurely(kswapd_p, order,
>> +                                                             remaining)) {
>> +                                     if (pgdat)
>> +                                             trace_mm_vmscan_kswapd_sleep(
>> +                                                             pgdat->node_id);
>>                                       schedule();
>>                               } else {
>>                                       if (remaining)
>> -                                             count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>> +                                             count_vm_event(
>> +                                             KSWAPD_LOW_WMARK_HIT_QUICKLY);
>>                                       else
>> -                                             count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>> +                                             count_vm_event(
>> +                                             KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>
> Sorry, but the coding style hits me here; do we really need to change
> this?
done.
>
>>                               }
>>                       }
>> -
>> -                     order = pgdat->kswapd_max_order;
>> +                     if (pgdat)
>> +                             order = pgdat->kswapd_max_order;
>>               }
>> -             finish_wait(&pgdat->kswapd_wait, &wait);
>> +             finish_wait(wait_h, &wait);
>>
>>               ret = try_to_freeze();
>>               if (kthread_should_stop())
>> @@ -2476,6 +2500,7 @@ static int kswapd(void *p)
>>  void wakeup_kswapd(struct zone *zone, int order)
>>  {
>>       pg_data_t *pgdat;
>> +     wait_queue_head_t *wait;
>>
>>       if (!populated_zone(zone))
>>               return;
>> @@ -2488,9 +2513,10 @@ void wakeup_kswapd(struct zone *zone, int order)
>>       trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
>>       if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>               return;
>> -     if (!waitqueue_active(&pgdat->kswapd_wait))
>> +     wait = pgdat->kswapd_wait;
>> +     if (!waitqueue_active(wait))
>>               return;
>> -     wake_up_interruptible(&pgdat->kswapd_wait);
>> +     wake_up_interruptible(wait);
>>  }
>>
>>  /*
>> @@ -2587,7 +2613,10 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>>
>>                       if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>>                               /* One of our CPUs online: restore mask */
>> -                             set_cpus_allowed_ptr(pgdat->kswapd, mask);
>> +                             if (kswapds[nid].kswapd_task)
>> +                                     set_cpus_allowed_ptr(
>> +                                             kswapds[nid].kswapd_task,
>> +                                             mask);
>>               }
>>       }
>>       return NOTIFY_OK;
>> @@ -2599,19 +2628,20 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>>   */
>>  int kswapd_run(int nid)
>>  {
>> -     pg_data_t *pgdat = NODE_DATA(nid);
>> +     struct task_struct *thr;
>
> thr is an ugly name for task_struct instance

>
>>       int ret = 0;
>>
>> -     if (pgdat->kswapd)
>> +     if (kswapds[nid].kswapd_task)
>>               return 0;
>>
>> -     pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
>> -     if (IS_ERR(pgdat->kswapd)) {
>> +     thr = kthread_run(kswapd, &kswapds[nid], "kswapd%d", nid);
>> +     if (IS_ERR(thr)) {
>>               /* failure at boot is fatal */
>>               BUG_ON(system_state == SYSTEM_BOOTING);
>>               printk("Failed to start kswapd on node %d\n",nid);
>>               ret = -1;
>
> What happens to the threads started?

Can you elaborate on this a little bit more?
>
>>       }
>> +     kswapds[nid].kswapd_task = thr;
>>       return ret;
>>  }
>>
>> @@ -2620,10 +2650,20 @@ int kswapd_run(int nid)
>>   */
>>  void kswapd_stop(int nid)
>>  {
>> -     struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
>> +     struct task_struct *thr;
>> +     struct kswapd *kswapd_p;
>> +     wait_queue_head_t *wait;
>> +
>> +     pg_data_t *pgdat = NODE_DATA(nid);
>> +
>> +     spin_lock(&kswapds_spinlock);
>> +     wait = pgdat->kswapd_wait;
>> +     kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>> +     thr = kswapd_p->kswapd_task;
>
> Sorry, but thr is just an ugly name to use.

>
>> +     spin_unlock(&kswapds_spinlock);
>>
>> -     if (kswapd)
>> -             kthread_stop(kswapd);
>> +     if (thr)
>> +             kthread_stop(thr);
>>  }
>>
>>  static int __init kswapd_init(void)
>> --
>> 1.7.3.1
>>
>
> --
>        Three Cheers,
>        Balbir
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07 17:28     ` Ying Han
@ 2010-12-08  0:39       ` KAMEZAWA Hiroyuki
  2010-12-08  1:24         ` Ying Han
  2010-12-08  7:21       ` KOSAKI Motohiro
  1 sibling, 1 reply; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-08  0:39 UTC (permalink / raw)
  To: Ying Han
  Cc: Mel Gorman, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, 7 Dec 2010 09:28:01 -0800
Ying Han <yinghan@google.com> wrote:

> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:

> Potentially there will
> > also be a very large number of new IO sources. I confess I haven't read the
> > thread yet so maybe this has already been thought of but it might make sense
> > to have a 1:N relationship between kswapd and memcgroups and cycle between
> > containers. The difficulty will be a latency between when kswapd wakes up
> > and when a particular container is scanned. The closer the ratio is to 1:1,
> > the less the latency will be but the higher the contention on the LRU lock
> > and IO will be.
> 
> No, we haven't talked about that mapping anywhere in the thread. Having
> many kswapd threads at the same time isn't a problem as long as there is
> no locking contention (e.g., 1k kswapd threads on a 1k fake-NUMA-node
> system). So breaking up the zone->lru_lock should work.
> 

I am the one who made zone->lru_lock shared. A per-memcg lock will make
the maintenance of memcg very painful and will add many races.
Otherwise we would need to make memcg's LRU not synchronized with the
zone's LRU; IOW, we would need a completely independent LRU.

I'd like to limit the number of kswapd-for-memcg if zone->lru lock contention
is problematic. memcg _can_ work without background reclaim.

How about adding a per-node kswapd-for-memcg that will reclaim pages at a
memcg's request? Something like:

	memcg_wake_kswapd(struct mem_cgroup *mem) 
	{
		do {
			nid = select_victim_node(mem);
			/* ask kswapd to reclaim memcg's memory */
			ret = memcg_kswapd_queue_work(nid, mem); /* may return -EBUSY if very busy*/
		} while()
	}

This will keep lock contention to a minimum. Anyway, using too much CPU for this
unnecessary_but_good_for_performance function is bad. Throttling is required.
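
Just to illustrate the throttling side, the queueing helper could look
roughly like the sketch below. Every name here (memcg_kswapd_queues[],
the queue fields) is made up for illustration and is not existing code:

#define MEMCG_KSWAPD_QLEN	16

struct memcg_kswapd_queue {
	spinlock_t		lock;
	struct mem_cgroup	*req[MEMCG_KSWAPD_QLEN];
	int			head, tail, count;
	wait_queue_head_t	wait;	/* node's kswapd-for-memcg sleeps here */
};

static struct memcg_kswapd_queue memcg_kswapd_queues[MAX_NUMNODES];

static int memcg_kswapd_queue_work(int nid, struct mem_cgroup *mem)
{
	struct memcg_kswapd_queue *q = &memcg_kswapd_queues[nid];
	int ret = 0;

	spin_lock(&q->lock);
	if (q->count == MEMCG_KSWAPD_QLEN) {
		/* the worker is already backlogged: throttle the requester */
		ret = -EBUSY;
	} else {
		q->req[q->tail] = mem;
		q->tail = (q->tail + 1) % MEMCG_KSWAPD_QLEN;
		q->count++;
	}
	spin_unlock(&q->lock);

	if (!ret)
		wake_up_interruptible(&q->wait);
	return ret;
}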

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-08  0:39       ` KAMEZAWA Hiroyuki
@ 2010-12-08  1:24         ` Ying Han
  2010-12-08  1:28           ` KAMEZAWA Hiroyuki
  2010-12-08 12:19           ` Mel Gorman
  0 siblings, 2 replies; 52+ messages in thread
From: Ying Han @ 2010-12-08  1:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, Dec 7, 2010 at 4:39 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 7 Dec 2010 09:28:01 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>
>> Potentially there will
>> > also be a very large number of new IO sources. I confess I haven't read the
>> > thread yet so maybe this has already been thought of but it might make sense
>> > to have a 1:N relationship between kswapd and memcgroups and cycle between
>> > containers. The difficulty will be a latency between when kswapd wakes up
>> > and when a particular container is scanned. The closer the ratio is to 1:1,
>> > the less the latency will be but the higher the contention on the LRU lock
>> > and IO will be.
>>
>> No, we weren't talked about the mapping anywhere in the thread. Having
>> many kswapd threads
>> at the same time isn't a problem as long as no locking contention (
>> ext, 1k kswapd threads on
>> 1k fake numa node system). So breaking the zone->lru_lock should work.
>>
>
> That's me who make zone->lru_lock be shared. And per-memcg lock will makes
> > the maintenance of memcg very bad. That will add many races.
> Or we need to make memcg's LRU not synchronized with zone's LRU, IOW, we need
> to have completely independent LRU.
>
> I'd like to limit the number of kswapd-for-memcg if zone->lru lock contention
> is problematic. memcg _can_ work without background reclaim.

>
> How about adding per-node kswapd-for-memcg it will reclaim pages by a memcg's
> request ? as
>
>        memcg_wake_kswapd(struct mem_cgroup *mem)
>        {
>                do {
>                        nid = select_victim_node(mem);
>                        /* ask kswapd to reclaim memcg's memory */
>                        ret = memcg_kswapd_queue_work(nid, mem); /* may return -EBUSY if very busy*/
>                } while()
>        }
>
> This will make lock contention minimum. Anyway, using too much cpu for this
> unnecessary_but_good_for_performance_function is bad. Throttling is required.

I don't see the problem of one-kswapd-per-cgroup here since there will
be no performance cost if they are not running.

I haven't measured the lock contention and CPU time for each running
kswapd. Theoretically it would be a problem if thousands of cgroups are
configured on the host and all of them are under memory pressure.

We can either optimize the locking or make each kswapd smarter (hold the
lock for less time). My current plan is to keep one-kswapd-per-cgroup in
the V2 patch, together with select_victim_node(), and the locking
optimization will come in a follow-up patchset.
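
For reference, one way select_victim_node() could work is a simple
round-robin over the nodes that have memory, roughly like this (a sketch
only; 'last_scanned_node' is an assumed new field in struct mem_cgroup,
and this is not the code that will be posted):

int select_victim_node(struct mem_cgroup *mem)
{
	int nid = mem->last_scanned_node;

	/* rotate through the nodes that actually have memory */
	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
	if (nid == MAX_NUMNODES)
		nid = first_node(node_states[N_HIGH_MEMORY]);

	mem->last_scanned_node = nid;
	return nid;
}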

--Ying




>
> Thanks,
> -Kame
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-08  1:24         ` Ying Han
@ 2010-12-08  1:28           ` KAMEZAWA Hiroyuki
  2010-12-08  2:10             ` Ying Han
  2010-12-08 12:19           ` Mel Gorman
  1 sibling, 1 reply; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-08  1:28 UTC (permalink / raw)
  To: Ying Han
  Cc: Mel Gorman, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, 7 Dec 2010 17:24:12 -0800
Ying Han <yinghan@google.com> wrote:

> On Tue, Dec 7, 2010 at 4:39 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 7 Dec 2010 09:28:01 -0800
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >> Potentially there will
> >> > also be a very large number of new IO sources. I confess I haven't read the
> >> > thread yet so maybe this has already been thought of but it might make sense
> >> > to have a 1:N relationship between kswapd and memcgroups and cycle between
> >> > containers. The difficulty will be a latency between when kswapd wakes up
> >> > and when a particular container is scanned. The closer the ratio is to 1:1,
> >> > the less the latency will be but the higher the contention on the LRU lock
> >> > and IO will be.
> >>
> >> No, we weren't talked about the mapping anywhere in the thread. Having
> >> many kswapd threads
> >> at the same time isn't a problem as long as no locking contention (
> >> ext, 1k kswapd threads on
> >> 1k fake numa node system). So breaking the zone->lru_lock should work.
> >>
> >
> > That's me who make zone->lru_lock be shared. And per-memcg lock will makes
> > the maintainance of memcg very bad. That will add many races.
> > Or we need to make memcg's LRU not synchronized with zone's LRU, IOW, we need
> > to have completely independent LRU.
> >
> > I'd like to limit the number of kswapd-for-memcg if zone->lru lock contention
> > is problematic. memcg _can_ work without background reclaim.
> 
> >
> > How about adding per-node kswapd-for-memcg it will reclaim pages by a memcg's
> > request ? as
> >
> >        memcg_wake_kswapd(struct mem_cgroup *mem)
> >        {
> >                do {
> >                        nid = select_victim_node(mem);
> >                        /* ask kswapd to reclaim memcg's memory */
> >                        ret = memcg_kswapd_queue_work(nid, mem); /* may return -EBUSY if very busy*/
> >                } while()
> >        }
> >
> > This will make lock contention minimum. Anyway, using too much cpu for this
> > unnecessary_but_good_for_performance_function is bad. Throttoling is required.
> 
> I don't see the problem of one-kswapd-per-cgroup here since there will
> be no performance cost if they are not running.
> 
Yes. But a year ago we got a report from a user who runs 2000+ cgroups on his
host (on the libcgroup mailing list).

So, running 2000+ such kernel threads will be bad; it has a cost.
In theory, the number of memcgs can be 65534.

> I haven't measured the lock contention and cputime for each kswapd
> running. Theoretically it would be a problem
> if thousands of cgroups are configured on the host and all of them
> are under memory pressure.
> 
I think that's a configuration mistake. 

> We can either optimize the locking or make each kswapd smarter (hold
> the lock less time). My current plan is to have the
> one-kswapd-per-cgroup on the V2 patch w/ select_victim_node, and the
> optimization for this comes as following patchset.
> 

My point above is that holding a remote node's lock and touching a remote
node's pages increases the memory reclaim cost very much. That is why I
like the per-node approach.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-08  1:28           ` KAMEZAWA Hiroyuki
@ 2010-12-08  2:10             ` Ying Han
  2010-12-08  2:13               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 52+ messages in thread
From: Ying Han @ 2010-12-08  2:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, Dec 7, 2010 at 5:28 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 7 Dec 2010 17:24:12 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> On Tue, Dec 7, 2010 at 4:39 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Tue, 7 Dec 2010 09:28:01 -0800
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> >> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>> >
>> >> Potentially there will
>> >> > also be a very large number of new IO sources. I confess I haven't read the
>> >> > thread yet so maybe this has already been thought of but it might make sense
>> >> > to have a 1:N relationship between kswapd and memcgroups and cycle between
>> >> > containers. The difficulty will be a latency between when kswapd wakes up
>> >> > and when a particular container is scanned. The closer the ratio is to 1:1,
>> >> > the less the latency will be but the higher the contention on the LRU lock
>> >> > and IO will be.
>> >>
>> >> No, we weren't talked about the mapping anywhere in the thread. Having
>> >> many kswapd threads
>> >> at the same time isn't a problem as long as no locking contention (
>> >> ext, 1k kswapd threads on
>> >> 1k fake numa node system). So breaking the zone->lru_lock should work

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-08  2:10             ` Ying Han
@ 2010-12-08  2:13               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-08  2:13 UTC (permalink / raw)
  To: Ying Han
  Cc: Mel Gorman, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, 7 Dec 2010 18:10:11 -0800
Ying Han <yinghan@google.com> wrote:

> >
> >> I haven't measured the lock contention and cputime for each kswapd
> >> running. Theoretically it would be a problem
> >> if thousands of cgroups are configured on the host and all of them
> >> are under memory pressure.
> >>
> > I think that's a configuration mistake.
> >
> >> We can either optimize the locking or make each kswapd smarter (hold
> >> the lock less time). My current plan is to have the
> >> one-kswapd-per-cgroup on the V2 patch w/ select_victim_node, and the
> >> optimization for this comes as following patchset.
> >>
> >
> > My point above is holding remove node's lock, touching remote node's page
> > increases memory reclaim cost very much. Then, I like per-node approach.
> 
> So in the case of one physical node and thousands of cgroups, we would be
> queuing all the work onto a single kswapd, which is doing the global
> background reclaim as well. This could be a problem on a multi-core system,
> where all the cgroups queue behind the current work and get throttled
> unnecessarily.

A per-cpu thread is enough. And there is direct reclaim, so the absence of
kswapd will not be critical (because memcg doesn't need 'zone balancing').
And as you said, 'usual' users will not use 100+ cgroups. Queueing will
not be fatal, I think.

> 
> I am not sure which way is better at this point. I would like to keep
> the current implementation for the V2 post, since smaller changes
> between versions sound better to me.
> 
Yes, please go ahead. I'm not against the functionality itself.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07 17:28     ` Ying Han
  2010-12-08  0:39       ` KAMEZAWA Hiroyuki
@ 2010-12-08  7:21       ` KOSAKI Motohiro
  1 sibling, 0 replies; 52+ messages in thread
From: KOSAKI Motohiro @ 2010-12-08  7:21 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Mel Gorman, Balbir Singh, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Andrew Morton, Johannes Weiner,
	Christoph Lameter, Wu Fengguang, Andi Kleen, Hugh Dickins,
	Rik van Riel, Tejun Heo, linux-mm

> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Mon, Nov 29, 2010 at 10:49:42PM -0800, Ying Han wrote:
> >> There is a kswapd kernel thread for each memory node. We add a different kswapd
> >> for each cgroup.
> >
> > What is considered a normal number of cgroups in production? 10, 50, 10000?
> Normally it is less than 100. I assume there is a cap on the number of
> cgroups that can be created per system.
> 
> > If it's a really large number and all the cgroup kswapds wake at the same time,
> > the zone LRU lock will be very heavily contended.
> 
> Thanks for reviewing the patch~
> 
> Agreed. The zone->lru_lock is another thing we are looking at. Eventually,
> we need to break that lock into per-zone, per-memcg LRU locks.

This may lead to the following bad scenario; that's the reason why we are using zone->lru_lock now.

1) memcg reclaim starts.
2) It finds that the page at the memcg LRU tail has the pte access bit set.
3) memcg reclaim decides to move the page to the active list of the memcg LRU.
    The pte access bit is also cleared, but the page still remains on the inactive list of the global LRU.
4) Sadly, global reclaim discards the page quickly because its accessed bit was already cleared by memcg reclaim.

But if we have to modify both the memcg and global LRUs, we can't avoid zone->lru_lock anyway.
So we don't use a memcg-specific lock.

Thanks.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-07 12:33   ` Mel Gorman
  2010-12-07 17:28     ` Ying Han
  2010-12-07 18:50     ` Ying Han
@ 2010-12-08  7:22     ` KOSAKI Motohiro
  2010-12-08  7:37       ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 52+ messages in thread
From: KOSAKI Motohiro @ 2010-12-08  7:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Ying Han, Balbir Singh, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Andrew Morton, Johannes Weiner,
	Christoph Lameter, Wu Fengguang, Andi Kleen, Hugh Dickins,
	Rik van Riel, Tejun Heo, linux-mm

> > +struct kswapd {
> > +	struct task_struct *kswapd_task;
> > +	wait_queue_head_t kswapd_wait;
> > +	struct mem_cgroup *kswapd_mem;
> > +	pg_data_t *kswapd_pgdat;
> > +};
> > +
> > +#define MAX_KSWAPDS MAX_NUMNODES
> > +extern struct kswapd kswapds[MAX_KSWAPDS];
> 
> This is potentially very large for a static structure. Can they not be
> dynamically allocated and kept on a list? Yes, there will be a list walk
> involved if you need a particular structure but that looks like it's a
> rare operation at this point.

Why can't we use the normal workqueue mechanism?
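
For illustration, such a workqueue-based variant could look roughly like
the sketch below; memcg_bg_reclaim() and the kswapd_work field are
placeholders, not existing kernel code:

struct memcg_kswapd_work {
	struct work_struct	work;
	struct mem_cgroup	*mem;
};

static struct workqueue_struct *memcg_kswapd_wq;

static void memcg_kswapd_work_fn(struct work_struct *work)
{
	struct memcg_kswapd_work *mw =
		container_of(work, struct memcg_kswapd_work, work);

	/* reclaim this memcg until its high watermark is satisfied */
	memcg_bg_reclaim(mw->mem);
}

static void memcg_wake_kswapd(struct mem_cgroup *mem)
{
	/* mem->kswapd_work is assumed to be INIT_WORK()ed at memcg creation */
	queue_work(memcg_kswapd_wq, &mem->kswapd_work.work);
}

The number of concurrent reclaimers would then be bounded by the
workqueue's max_active setting rather than by the number of cgroups.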




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-08  7:22     ` KOSAKI Motohiro
@ 2010-12-08  7:37       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-08  7:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Ying Han, Balbir Singh, Daisuke Nishimura,
	Andrew Morton, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, Tejun Heo, linux-mm

On Wed,  8 Dec 2010 16:22:30 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > +struct kswapd {
> > > +	struct task_struct *kswapd_task;
> > > +	wait_queue_head_t kswapd_wait;
> > > +	struct mem_cgroup *kswapd_mem;
> > > +	pg_data_t *kswapd_pgdat;
> > > +};
> > > +
> > > +#define MAX_KSWAPDS MAX_NUMNODES
> > > +extern struct kswapd kswapds[MAX_KSWAPDS];
> > 
> > This is potentially very large for a static structure. Can they not be
> > dynamically allocated and kept on a list? Yes, there will be a list walk
> > involved if yonu need a particular structure but that looks like it's a
> > rare operation at this point.
> 
> Why can't we use normal workqueue mechanism?
> 
Sounds much simpler than self-management of threads.

My only concern is CPU time being stolen by background work, but ....

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] Add kswapd descriptor.
  2010-12-08  1:24         ` Ying Han
  2010-12-08  1:28           ` KAMEZAWA Hiroyuki
@ 2010-12-08 12:19           ` Mel Gorman
  1 sibling, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2010-12-08 12:19 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura,
	Andrew Morton, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, Dec 07, 2010 at 05:24:12PM -0800, Ying Han wrote:
> On Tue, Dec 7, 2010 at 4:39 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 7 Dec 2010 09:28:01 -0800
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >> > Potentially there will
> >> > also be a very large number of new IO sources. I confess I haven't read the
> >> > thread yet so maybe this has already been thought of but it might make sense
> >> > to have a 1:N relationship between kswapd and memcgroups and cycle between
> >> > containers. The difficulty will be the latency between when kswapd wakes up
> >> > and when a particular container is scanned. The closer the ratio is to 1:1,
> >> > the less the latency will be but the higher the contention on the LRU lock
> >> > and IO will be.
> >>
> >> No, we haven't discussed that mapping anywhere in the thread. Having
> >> many kswapd threads at the same time isn't a problem as long as there
> >> is no locking contention (e.g. 1k kswapd threads on a 1k fake NUMA
> >> node system). So breaking up the zone->lru_lock should work.
> >>
> >
> > I'm the one who made zone->lru_lock shared. A per-memcg lock would make
> > the maintenance of memcg very bad and would add many races. Otherwise we
> > would need to make memcg's LRU not synchronized with the zone's LRU, IOW,
> > we would need a completely independent LRU.
> >
> > I'd like to limit the number of kswapd-for-memcg threads if zone->lru_lock
> > contention is problematic. memcg _can_ work without background reclaim.
> 
> >
> > How about adding a per-node kswapd-for-memcg which reclaims pages at a
> > memcg's request? Something like:
> >
> >        memcg_wake_kswapd(struct mem_cgroup *mem)
> >        {
> >                do {
> >                        nid = select_victim_node(mem);
> >                        /* ask kswapd to reclaim memcg's memory */
> >                        ret = memcg_kswapd_queue_work(nid, mem); /* may return -EBUSY if very busy*/
> >                } while()
> >        }
> >
> > This will keep lock contention to a minimum. Anyway, using too much CPU for this
> > unnecessary_but_good_for_performance_function is bad. Throttling is required.
> 
> I don't see a problem with one-kswapd-per-cgroup here since there will
> be no performance cost if they are not running.
> 

*If* they are not running. There is potentially a massive cost here.

> I haven't measured the lock contention and CPU time for each kswapd
> running. Theoretically it would be a problem if thousands of cgroups
> are configured on the host and all of them are under memory pressure.
> 

It's not just the locking. If all of these kswapds are running and each
container has a small number of dirty pages, we potentially have tens or
hundreds of kswapd each queueing a small number of pages for IO.  Granted,
if we reach the point where these IO sources are delegated to flusher threads
it would be less of a problem but it's not how things currently behave.

> We can either optimize the locking or make each kswapd smarter (hold
> the lock for less time).

Holding the lock for less time might allow other kswapd instances to make small
amounts of progress, but they'll still be wasting a lot of CPU spinning on
the lock. It's not a simple issue, which is why I think we need either a)
a means of telling kswapd which containers it should be reclaiming from,
or b) a 1:N mapping of kswapd instances to containers from the outset.
Otherwise users with large numbers of containers will see severe slowdowns
under memory pressure, whereas previously they would have experienced stalls
in individual containers.
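
To make option (b) concrete, a small fixed pool of background reclaim threads
could be shared by all containers, with each pressured memcg queued to its
assigned thread. Everything below (the pool size, the bg_list member assumed
in struct mem_cgroup, and the shrink_mem_cgroup() helper) is a sketch under
those assumptions, not code from this patchset.

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/memcontrol.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

#define NR_MEMCG_KSWAPD 4       /* pool size, independent of cgroup count */

struct memcg_kswapd {
        struct task_struct      *task;
        wait_queue_head_t       wait;
        struct list_head        memcgs; /* pressured memcgs assigned here */
        spinlock_t              lock;   /* protects the memcgs list */
};

static struct memcg_kswapd memcg_kswapd_pool[NR_MEMCG_KSWAPD];

/* Map a memcg to one of the pool threads (trivial pointer hash). */
static struct memcg_kswapd *memcg_to_kswapd(struct mem_cgroup *mem)
{
        return &memcg_kswapd_pool[((unsigned long)mem / sizeof(void *)) %
                                  NR_MEMCG_KSWAPD];
}

/* Charge path: queue the memcg and kick its assigned reclaim thread. */
static void memcg_bg_reclaim_request(struct mem_cgroup *mem)
{
        struct memcg_kswapd *kd = memcg_to_kswapd(mem);

        spin_lock(&kd->lock);
        if (list_empty(&mem->bg_list))          /* assumed list_head in memcg */
                list_add_tail(&mem->bg_list, &kd->memcgs);
        spin_unlock(&kd->lock);
        wake_up_interruptible(&kd->wait);
}

static int memcg_kswapd_thread(void *data)
{
        struct memcg_kswapd *kd = data;

        while (!kthread_should_stop()) {
                struct mem_cgroup *mem = NULL;

                wait_event_interruptible(kd->wait,
                        !list_empty(&kd->memcgs) || kthread_should_stop());

                /* Pop one pressured memcg; refcounting omitted for brevity. */
                spin_lock(&kd->lock);
                if (!list_empty(&kd->memcgs)) {
                        mem = list_first_entry(&kd->memcgs,
                                               struct mem_cgroup, bg_list);
                        list_del_init(&mem->bg_list);
                }
                spin_unlock(&kd->lock);

                if (mem) {
                        shrink_mem_cgroup(mem); /* assumed reclaim helper */
                        cond_resched();
                }
        }
        return 0;
}

With N much smaller than the number of containers, thread count and lock
contention stay bounded while pressured memcgs still get asynchronous reclaim;
the cost is added latency when many memcgs queue to the same thread, which is
exactly the ratio trade-off described above.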

> My current plan is to have one-kswapd-per-cgroup in the V2 patch w/
> select_victim_node, and the optimizations for this will come in a
> following patchset.
> 

Will read when they come out :)
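
For what it's worth, the select_victim_node() referenced above could plausibly
be a simple round-robin over nodes where the memcg actually has pages. The
last_scanned_node field and the mem_cgroup_node_nr_lru_pages() helper are
assumptions for this sketch only.

#include <linux/memcontrol.h>
#include <linux/nodemask.h>

/*
 * Round-robin node selection for a memcg: start after the node scanned
 * last time, skip nodes where the memcg has nothing on its LRUs, and
 * wrap around at most once.
 */
static int select_victim_node(struct mem_cgroup *mem)
{
        int nid = mem->last_scanned_node;       /* assumed memcg field */

        do {
                nid = next_node(nid, node_states[N_HIGH_MEMORY]);
                if (nid == MAX_NUMNODES)
                        nid = first_node(node_states[N_HIGH_MEMORY]);
                /* mem_cgroup_node_nr_lru_pages(): assumed helper */
        } while (!mem_cgroup_node_nr_lru_pages(mem, nid) &&
                 nid != mem->last_scanned_node);

        mem->last_scanned_node = nid;
        return nid;
}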

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2010-12-08 12:20 UTC | newest]

Thread overview: 52+ messages
2010-11-30  6:49 [RFC][PATCH 0/4] memcg: per cgroup background reclaim Ying Han
2010-11-30  6:49 ` [PATCH 1/4] Add kswapd descriptor Ying Han
2010-11-30  7:08   ` KAMEZAWA Hiroyuki
2010-11-30  8:15     ` Minchan Kim
2010-11-30  8:27       ` KAMEZAWA Hiroyuki
2010-11-30  8:54         ` KAMEZAWA Hiroyuki
2010-11-30 20:40           ` Ying Han
2010-11-30 23:46             ` KAMEZAWA Hiroyuki
2010-12-07  6:15           ` Balbir Singh
2010-12-07  6:24             ` KAMEZAWA Hiroyuki
2010-12-07  6:59               ` Balbir Singh
2010-12-07  8:00                 ` KAMEZAWA Hiroyuki
2010-11-30 20:26       ` Ying Han
2010-11-30 20:17     ` Ying Han
2010-12-01  0:12       ` KAMEZAWA Hiroyuki
2010-12-07  6:52   ` Balbir Singh
2010-12-07 19:21     ` Ying Han
2010-12-07 12:33   ` Mel Gorman
2010-12-07 17:28     ` Ying Han
2010-12-08  0:39       ` KAMEZAWA Hiroyuki
2010-12-08  1:24         ` Ying Han
2010-12-08  1:28           ` KAMEZAWA Hiroyuki
2010-12-08  2:10             ` Ying Han
2010-12-08  2:13               ` KAMEZAWA Hiroyuki
2010-12-08 12:19           ` Mel Gorman
2010-12-08  7:21       ` KOSAKI Motohiro
2010-12-07 18:50     ` Ying Han
2010-12-08  7:22     ` KOSAKI Motohiro
2010-12-08  7:37       ` KAMEZAWA Hiroyuki
2010-11-30  6:49 ` [PATCH 2/4] Add per cgroup reclaim watermarks Ying Han
2010-11-30  7:21   ` KAMEZAWA Hiroyuki
2010-11-30 20:44     ` Ying Han
2010-12-01  0:27       ` KAMEZAWA Hiroyuki
2010-12-07 14:56   ` Mel Gorman
2010-11-30  6:49 ` [PATCH 3/4] Per cgroup background reclaim Ying Han
2010-11-30  7:51   ` KAMEZAWA Hiroyuki
2010-11-30  8:07     ` KAMEZAWA Hiroyuki
2010-11-30 22:01       ` Ying Han
2010-11-30 22:00     ` Ying Han
2010-12-07  2:25     ` Ying Han
2010-12-07  5:21       ` KAMEZAWA Hiroyuki
2010-12-01  2:18   ` KOSAKI Motohiro
2010-12-01  2:16     ` KAMEZAWA Hiroyuki
2010-11-30  6:49 ` [PATCH 4/4] Add more per memcg stats Ying Han
2010-11-30  7:53   ` KAMEZAWA Hiroyuki
2010-11-30 18:22     ` Ying Han
2010-11-30  6:54 ` [RFC][PATCH 0/4] memcg: per cgroup background reclaim KOSAKI Motohiro
2010-11-30  7:03   ` Ying Han
2010-12-02 14:41     ` Balbir Singh
2010-12-07  2:29       ` Ying Han
2010-11-30  7:00 ` KAMEZAWA Hiroyuki
2010-11-30  9:05   ` Ying Han
