* [PATCH 0/7] memcg background reclaim , yet another one.
@ 2011-04-25  9:25 KAMEZAWA Hiroyuki
  2011-04-25  9:28 ` [PATCH 1/7] memcg: add high/low watermark to res_counter KAMEZAWA Hiroyuki
                   ` (10 more replies)
  0 siblings, 11 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:25 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko


This patch set is based on Ying Han's work at its origin, but I have changed too much ;)
So, I'm starting this as a new thread.

(*) This work is not related at all to the "rewriting global LRU using memcg"
    discussion. This kind of high/low watermark has been planned since
    memcg was born.

First of all, per-memcg background reclaim is used for
  - helping memory reclaim and avoiding direct reclaim.
  - setting a soft (not hard) limit on memory usage.

For example, assume a memcg has a hard limit of 500M bytes.
Then, set the high watermark to 400M. Memory usage can exceed 400M up to 500M,
but it will be reduced back to 400M automatically as time goes by.

This is useful when a user wants to limit memory usage to 400M but doesn't want
to see a big performance regression from hitting the limit when a memory usage
spike happens.

1) == hard limit = 400M ==
[root@rhel6-test hilow]# time cp ./tmpfile xxx                
real    0m7.353s
user    0m0.009s
sys     0m3.280s

2) == hard limit 500M/ hi_watermark = 400M ==
[root@rhel6-test hilow]# time cp ./tmpfile xxx

real    0m6.421s
user    0m0.059s
sys     0m2.707s

The above is a brief result on a VM and needs more study, but my impression is
positive. I'd like to use a bigger real machine next time.

Here is a short list of updates from Ying Han's version.

 1. use a workqueue and visit memcgs in round-robin.
 2. only allow setting the high watermark; the low watermark is determined
    automatically. This is good for avoiding excessive cpu usage by background
    reclaim.
 3. totally rewrote the shrink_mem_cgroup algorithm for round-robin.
 4. fixed get_scan_count(), which was problematic.
 5. added some statistics, which I think are necessary.
 6. added documentation.

So, the algorithm is not a cut-and-paste from kswapd. I think kswapd itself
should be updated...and 'priority' in vmscan.c seems to be an enemy of memcg ;)


Thanks
-Kame






* [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
@ 2011-04-25  9:28 ` KAMEZAWA Hiroyuki
  2011-04-26 17:54   ` Ying Han
                     ` (2 more replies)
  2011-04-25  9:29 ` [PATCH 2/7] memcg high watermark interface KAMEZAWA Hiroyuki
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

Two watermarks are added per memcg: "high_wmark" and "low_wmark".
The per-memcg kswapd is invoked when the memcg's memory usage (usage_in_bytes)
is higher than the low_wmark. Then the kswapd starts to reclaim pages
until the usage is lower than the high_wmark.

Each watermark is calculated based on the hard limit (limit_in_bytes) of the
memcg. Each time the hard limit is changed, the corresponding wmarks are
re-calculated. Since the memory controller charges only user pages, there is
no need for a "min_wmark". The calculation of the wmarks is based on the
per-memcg tunable high_wmark_distance, which is set to 0 by default.
The low_wmark is calculated automatically.
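
For illustration only (not part of the patch), here is a minimal user-space
sketch of the watermark arithmetic, assuming the 64-bit HILOW_DISTANCE of 4MB
and a 4KB page size as in the hunk below, and plugging in the 500M-limit /
100M-distance example from the cover letter:

#include <stdio.h>

#define MB		(1024ULL * 1024ULL)
#define PAGE_SZ		4096ULL
#define HILOW_DISTANCE	(4 * MB)	/* 64-bit value from the hunk below */

int main(void)
{
	unsigned long long limit = 500 * MB;	/* limit_in_bytes */
	unsigned long long distance = 100 * MB;	/* high_wmark_distance */
	unsigned long long low_distance, high_wmark, low_wmark;

	if (distance == 0) {
		/* watermarks disabled: both sit at the hard limit */
		high_wmark = low_wmark = limit;
	} else {
		low_distance = (distance <= HILOW_DISTANCE) ?
				distance / 2 : HILOW_DISTANCE;
		if (low_distance < PAGE_SZ * 2)
			low_distance = PAGE_SZ * 2;
		high_wmark = limit - distance;		/* reclaim stops here */
		low_wmark = limit - low_distance;	/* reclaim starts here */
	}
	printf("high_wmark=%lluM low_wmark=%lluM\n",
	       high_wmark / MB, low_wmark / MB);
	return 0;
}

This prints high_wmark=400M low_wmark=496M: background reclaim starts just
below the limit and pushes usage back down to 400M.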

Changelog: v8b...v7
1. set low_wmark_distance automatically, using a fixed HILOW_DISTANCE.

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h  |    1 
 include/linux/res_counter.h |   78 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    6 +++
 mm/memcontrol.c             |   69 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 154 insertions(+)

Index: memcg/include/linux/memcontrol.h
===================================================================
--- memcg.orig/include/linux/memcontrol.h
+++ memcg/include/linux/memcontrol.h
@@ -84,6 +84,7 @@ int task_in_mem_cgroup(struct task_struc
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
Index: memcg/include/linux/res_counter.h
===================================================================
--- memcg.orig/include/linux/res_counter.h
+++ memcg/include/linux/res_counter.h
@@ -39,6 +39,14 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit that reclaim triggers.
+	 */
+	unsigned long long low_wmark_limit;
+	/*
+	 * the limit that reclaim stops.
+	 */
+	unsigned long long high_wmark_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -55,6 +63,9 @@ struct res_counter {
 
 #define RESOURCE_MAX (unsigned long long)LLONG_MAX
 
+#define CHARGE_WMARK_LOW	0x01
+#define CHARGE_WMARK_HIGH	0x02
+
 /**
  * Helpers to interact with userspace
  * res_counter_read_u64() - returns the value of the specified member.
@@ -92,6 +103,8 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_WMARK_LIMIT,
+	RES_HIGH_WMARK_LIMIT
 };
 
 /*
@@ -147,6 +160,24 @@ static inline unsigned long long res_cou
 	return margin;
 }
 
+static inline bool
+res_counter_under_high_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->high_wmark_limit)
+		return true;
+
+	return false;
+}
+
+static inline bool
+res_counter_under_low_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->low_wmark_limit)
+		return true;
+
+	return false;
+}
+
 /**
  * Get the difference between the usage and the soft limit
  * @cnt: The counter
@@ -169,6 +200,30 @@ res_counter_soft_limit_excess(struct res
 	return excess;
 }
 
+static inline bool
+res_counter_under_low_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_under_low_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool
+res_counter_under_high_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_under_high_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -214,4 +269,27 @@ res_counter_set_soft_limit(struct res_co
 	return 0;
 }
 
+static inline int
+res_counter_set_high_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
+static inline int
+res_counter_set_low_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
 #endif
Index: memcg/kernel/res_counter.c
===================================================================
--- memcg.orig/kernel/res_counter.c
+++ memcg/kernel/res_counter.c
@@ -19,6 +19,8 @@ void res_counter_init(struct res_counter
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
+	counter->low_wmark_limit = RESOURCE_MAX;
+	counter->high_wmark_limit = RESOURCE_MAX;
 	counter->parent = parent;
 }
 
@@ -103,6 +105,10 @@ res_counter_member(struct res_counter *c
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_WMARK_LIMIT:
+		return &counter->low_wmark_limit;
+	case RES_HIGH_WMARK_LIMIT:
+		return &counter->high_wmark_limit;
 	};
 
 	BUG();
Index: memcg/mm/memcontrol.c
===================================================================
--- memcg.orig/mm/memcontrol.c
+++ memcg/mm/memcontrol.c
@@ -278,6 +278,11 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	/*
+	 * used to calculate the low/high_wmarks based on the limit_in_bytes.
+	 */
+	u64 high_wmark_distance;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -867,6 +872,44 @@ out:
 EXPORT_SYMBOL(mem_cgroup_count_vm_event);
 
 /*
+ * If the Hi-Low distance is too big, background reclaim tends to hog the cpu.
+ * If the Hi-Low distance is too small, a small memory usage spike (caused by
+ * temporary shell scripts) triggers background reclaim and makes things worse.
+ * But such a spike can be avoided by setting the high wmark a bit higher.
+ * We use a fixed size for the Hi-Low distance; this is easy to use.
+ */
+#ifdef CONFIG_64BIT /* object size tends to be twice as large */
+#define HILOW_DISTANCE	(4 * 1024 * 1024)
+#else
+#define HILOW_DISTANCE	(2 * 1024 * 1024)
+#endif
+
+static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	u64 limit;
+
+	limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+	if (mem->high_wmark_distance == 0) {
+		res_counter_set_low_wmark_limit(&mem->res, limit);
+		res_counter_set_high_wmark_limit(&mem->res, limit);
+	} else {
+		u64 low_wmark, high_wmark, low_distance;
+		if (mem->high_wmark_distance <= HILOW_DISTANCE)
+			low_distance = mem->high_wmark_distance / 2;
+		else
+			low_distance = HILOW_DISTANCE;
+		if (low_distance < PAGE_SIZE * 2)
+			low_distance = PAGE_SIZE * 2;
+
+		low_wmark = limit - low_distance;
+		high_wmark = limit - mem->high_wmark_distance;
+
+		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
+		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+	}
+}
+
+/*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
  * What we have to take care of here is validness of pc->mem_cgroup.
@@ -3264,6 +3307,7 @@ static int mem_cgroup_resize_limit(struc
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3324,6 +3368,7 @@ static int mem_cgroup_resize_memsw_limit
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		setup_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -4603,6 +4648,30 @@ static void __init enable_swap_cgroup(vo
 }
 #endif
 
+/*
+ * We use low_wmark and high_wmark for triggering per-memcg kswapd.
+ * The reclaim is triggered by low_wmark (usage > low_wmark) and stopped
+ * by high_wmark (usage < high_wmark).
+ */
+int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
+				int charge_flags)
+{
+	long ret = 0;
+	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
+
+	if (!mem->high_wmark_distance)
+		return 1;
+
+	VM_BUG_ON((charge_flags & flags) == flags);
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		ret = res_counter_under_low_wmark_limit(&mem->res);
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		ret = res_counter_under_high_wmark_limit(&mem->res);
+
+	return ret;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;


* [PATCH 2/7] memcg high watermark interface
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
  2011-04-25  9:28 ` [PATCH 1/7] memcg: add high/low watermark to res_counter KAMEZAWA Hiroyuki
@ 2011-04-25  9:29 ` KAMEZAWA Hiroyuki
  2011-04-25 22:36   ` Ying Han
  2011-04-25  9:31 ` [PATCH 3/7] memcg: select victim node in round robin KAMEZAWA Hiroyuki
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

Add memory.high_wmark_distance and memory.reclaim_wmarks APIs per memcg.
The first adjusts the internal low/high wmark calculation, and
reclaim_wmarks exports the current values of the watermarks.
The low_wmark is calculated automatically.

$ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
$ cat /dev/cgroup/A/memory.limit_in_bytes
524288000

$ echo 50m >/dev/cgroup/A/memory.high_wmark_distance

$ cat /dev/cgroup/A/memory.reclaim_wmarks
low_wmark 476053504
high_wmark 471859200

Changelog: v8a..v7
   1. removed low_wmark_distance; it's now automatic.
   2. added documentation.

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |   43 ++++++++++++++++++++++++++++
 mm/memcontrol.c                  |   58 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+), 1 deletion(-)

Index: memcg/mm/memcontrol.c
===================================================================
--- memcg.orig/mm/memcontrol.c
+++ memcg/mm/memcontrol.c
@@ -4074,6 +4074,40 @@ static int mem_cgroup_swappiness_write(s
 	return 0;
 }
 
+static u64 mem_cgroup_high_wmark_distance_read(struct cgroup *cgrp,
+					       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->high_wmark_distance;
+}
+
+static int mem_cgroup_high_wmark_distance_write(struct cgroup *cont,
+						struct cftype *cft,
+						const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	unsigned long long val;
+	u64 limit;
+	int ret;
+
+	if (!cont->parent)
+		return -EINVAL;
+
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return -EINVAL;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+	if (val >= limit)
+		return -EINVAL;
+
+	memcg->high_wmark_distance = val;
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4365,6 +4399,21 @@ static void mem_cgroup_oom_unregister_ev
 	mutex_unlock(&memcg_oom_mutex);
 }
 
+static int mem_cgroup_wmark_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	u64 low_wmark, high_wmark;
+
+	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+
+	cb->fill(cb, "low_wmark", low_wmark);
+	cb->fill(cb, "high_wmark", high_wmark);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4468,6 +4517,15 @@ static struct cftype mem_cgroup_files[] 
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "high_wmark_distance",
+		.write_string = mem_cgroup_high_wmark_distance_write,
+		.read_u64 = mem_cgroup_high_wmark_distance_read,
+	},
+	{
+		.name = "reclaim_wmarks",
+		.read_map = mem_cgroup_wmark_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
Index: memcg/Documentation/cgroups/memory.txt
===================================================================
--- memcg.orig/Documentation/cgroups/memory.txt
+++ memcg/Documentation/cgroups/memory.txt
@@ -68,6 +68,8 @@ Brief summary of control files.
 				 (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
  memory.oom_control		 # set/show oom controls.
+ memory.high_wmark_distance	 # set/show watermark control
+ memory.reclaim_wmarks		 # show watermark details.
 
 1. History
 
@@ -501,6 +503,7 @@ NOTE2: When panic_on_oom is set to "2", 
        case of an OOM event in any cgroup.
 
 7. Soft limits
+(See Watermarks, too.)
 
 Soft limits allow for greater sharing of memory. The idea behind soft limits
 is to allow control groups to use as much of the memory as needed, provided
@@ -649,7 +652,45 @@ At reading, current status of OOM is sho
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
 
-11. TODO
+11. Watermarks
+
+A task incurs big overhead when it hits the memory limit because it needs to
+scan memory and free pages. To avoid that, background memory freeing by the
+kernel is helpful. The memory cgroup supports background memory freeing via
+thresholds called watermarks. This can be used for fuzzy limiting of memory.
+
+For example, if you have a 1G limit and set
+  - high_watermark ....980M
+  - low_watermark  ....984M
+memory freeing work by the kernel starts when usage goes over 984M and stops
+when usage goes down to 980M. Of course, this consumes CPU, so the kernel
+controls this work to avoid too much cpu hogging.
+
+11.1 memory.high_wmark_distance
+
+This is the interface for high_wmark. You can specify the distance between
+the memory limit and high_watermark here. For example, in a memory cgroup
+with a 1G limit,
+  # echo 20M > memory.high_wmark_distance
+will set high_watermark to 980M. low_watermark is determined _automatically_
+because a big distance between the high and low watermarks tends to use too
+much CPU and it is difficult for users to determine low_watermark.
+
+With this, memory usage will be reduced to 980M as time goes by.
+After setting memory.high_wmark_distance to 20M, assume you update
+memory.limit_in_bytes to 2G. In this case, high_watermark becomes 1980M.
+
+From another point of view, assume you set memory.limit_in_bytes to 1G
+and memory.high_wmark_distance to 300M. Then you can limit memory usage
+to 700M in a moderate way, while still limiting it to 1G with the hard
+limit.
+
+11.2 memory.reclaim_wmarks
+
+This interface shows high_watermark and low_watermark in bytes. It may be
+useful for comparing usage with the watermarks.
+
+12. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first


* [PATCH 3/7] memcg: select victim node in round robin.
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
  2011-04-25  9:28 ` [PATCH 1/7] memcg: add high/low watermark to res_counter KAMEZAWA Hiroyuki
  2011-04-25  9:29 ` [PATCH 2/7] memcg high watermark interface KAMEZAWA Hiroyuki
@ 2011-04-25  9:31 ` KAMEZAWA Hiroyuki
  2011-04-25  9:34 ` [PATCH 4/7] memcg fix scan ratio with small memcg KAMEZAWA Hiroyuki
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

Not changed from Ying's.
==
This adds the mechanism for background reclaim in which we remember the
last scanned node and always start from the next one each time.
The simple round-robin fashion provides fairness between nodes for
each memcg.
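
As an aside (not part of the patch), here is a toy user-space sketch of the
round-robin selection described above; it uses a plain online[] array instead
of nodemask_t/next_node()/first_node(), so the names are illustrative only:

#include <stdio.h>

#define MAX_NUMNODES 8

static int last_scanned_node = MAX_NUMNODES - 1;	/* so node 0 is tried first */

static int select_victim_node(const int online[MAX_NUMNODES])
{
	int nid = last_scanned_node;
	int i;

	/* walk forward from the last scanned node, wrapping around */
	for (i = 0; i < MAX_NUMNODES; i++) {
		nid = (nid + 1) % MAX_NUMNODES;
		if (online[nid])
			break;
	}
	last_scanned_node = nid;
	return nid;
}

int main(void)
{
	int online[MAX_NUMNODES] = { 1, 0, 1, 1, 0, 0, 0, 0 };
	int i;

	for (i = 0; i < 5; i++)
		printf("victim node: %d\n", select_victim_node(online));
	/* prints 0 2 3 0 2: a fair rotation over the nodes that have memory */
	return 0;
}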

From: Ying Han <yinghan@google.com>
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    3 +++
 mm/memcontrol.c            |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

Index: memcg/include/linux/memcontrol.h
===================================================================
--- memcg.orig/include/linux/memcontrol.h
+++ memcg/include/linux/memcontrol.h
@@ -85,6 +85,9 @@ int task_in_mem_cgroup(struct task_struc
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
+extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
+extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
+					const nodemask_t *nodes);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
Index: memcg/mm/memcontrol.c
===================================================================
--- memcg.orig/mm/memcontrol.c
+++ memcg/mm/memcontrol.c
@@ -283,6 +283,12 @@ struct mem_cgroup {
 	 * used to calculate the low/high_wmarks based on the limit_in_bytes.
 	 */
 	u64 high_wmark_distance;
+
+	/*
+	 * While doing per cgroup background reclaim, we cache the
+	 * last node we reclaimed from
+	 */
+	int last_scanned_node;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -1611,6 +1617,27 @@ static int mem_cgroup_hierarchical_recla
 }
 
 /*
+ * Visit the first node after the last_scanned_node of @mem and use that to
+ * reclaim free pages from.
+ */
+int
+mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
+{
+	int next_nid;
+	int last_scanned;
+
+	last_scanned = mem->last_scanned_node;
+	next_nid = next_node(last_scanned, *nodes);
+
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(*nodes);
+
+	mem->last_scanned_node = next_nid;
+
+	return next_nid;
+}
+
+/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -4730,6 +4757,14 @@ int mem_cgroup_watermark_ok(struct mem_c
 	return ret;
 }
 
+int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return -1;
+
+	return mem->last_scanned_node;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4805,6 +4840,7 @@ mem_cgroup_create(struct cgroup_subsys *
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
 	if (parent)


* [PATCH 4/7] memcg fix scan ratio with small memcg.
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2011-04-25  9:31 ` [PATCH 3/7] memcg: select victim node in round robin KAMEZAWA Hiroyuki
@ 2011-04-25  9:34 ` KAMEZAWA Hiroyuki
  2011-04-25 17:35   ` Ying Han
  2011-04-25  9:36 ` [PATCH 5/7] memcg bgreclaim core KAMEZAWA Hiroyuki
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko


During memcg memory reclaim, get_scan_count() may return [0, 0, 0, 0]
and no scan is issued at the current reclaim priority.

The reason is that the memory cgroup may not be big enough to hold a
number of pages greater than 1 << priority.

Because priority affects many routines in vmscan.c, it's better
to scan memory even if usage >> priority is 0.
From another point of view, if a memcg's zone doesn't have enough memory to
meet the priority, it should be skipped. So, this patch creates a temporary
priority in get_scan_count() and scans some amount of pages even when the
usage is small. By this, memcg's reclaim goes more smoothly without
using too high a priority, which would cause unnecessary congestion_wait(), etc.
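
For illustration only (not part of the patch), a small user-space sketch of
the "temporary priority" adjustment with a worked example; the constants
mirror the kernel's DEF_PRIORITY and SWAP_CLUSTER_MAX, which is an assumption
of the sketch:

#include <stdio.h>

#define DEF_PRIORITY		12
#define SWAP_CLUSTER_MAX	32

int main(void)
{
	unsigned long long usage_pages = 2048;	/* an 8MB memcg with 4KB pages */
	int priority = DEF_PRIORITY;

	/* same loop shape as the get_scan_count() hunk below */
	while (priority && (usage_pages >> priority) < SWAP_CLUSTER_MAX)
		priority--;

	printf("effective priority %d, scan target %llu pages\n",
	       priority, usage_pages >> priority);
	/*
	 * At DEF_PRIORITY, 2048 >> 12 is 0 and nothing would be scanned;
	 * the loop settles at priority 6, giving a target of 32 pages.
	 */
	return 0;
}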

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    6 ++++++
 mm/memcontrol.c            |    5 +++++
 mm/vmscan.c                |   11 +++++++++++
 3 files changed, 22 insertions(+)

Index: memcg/include/linux/memcontrol.h
===================================================================
--- memcg.orig/include/linux/memcontrol.h
+++ memcg/include/linux/memcontrol.h
@@ -152,6 +152,7 @@ unsigned long mem_cgroup_soft_limit_recl
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+u64 mem_cgroup_get_usage(struct mem_cgroup *mem);
 
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -357,6 +358,11 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 	return 0;
 }
 
+static inline u64 mem_cgroup_get_usage(struct mem_cgroup *mem)
+{
+	return 0;
+}
+
 static inline void mem_cgroup_split_huge_fixup(struct page *head,
 						struct page *tail)
 {
Index: memcg/mm/memcontrol.c
===================================================================
--- memcg.orig/mm/memcontrol.c
+++ memcg/mm/memcontrol.c
@@ -1483,6 +1483,11 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 	return min(limit, memsw);
 }
 
+u64 mem_cgroup_get_usage(struct mem_cgroup *memcg)
+{
+	return res_counter_read_u64(&memcg->res, RES_USAGE);
+}
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
Index: memcg/mm/vmscan.c
===================================================================
--- memcg.orig/mm/vmscan.c
+++ memcg/mm/vmscan.c
@@ -1762,6 +1762,17 @@ static void get_scan_count(struct zone *
 			denominator = 1;
 			goto out;
 		}
+	} else {
+		u64 usage;
+		/*
+		 * When the memcg is small enough, anon+file >> priority
+		 * can be 0 and we'll do no scan. Adjust priority to a proper
+		 * value against the usage. If this zone's usage is small
+		 * enough, scanning will ignore it until priority goes down.
+		 */
+		for (usage = mem_cgroup_get_usage(sc->mem_cgroup) >> PAGE_SHIFT;
+		     priority && ((usage >> priority) < SWAP_CLUSTER_MAX);
+		     priority--);
 	}
 
 	/*


* [PATCH 5/7] memcg bgreclaim core.
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2011-04-25  9:34 ` [PATCH 4/7] memcg fix scan ratio with small memcg KAMEZAWA Hiroyuki
@ 2011-04-25  9:36 ` KAMEZAWA Hiroyuki
  2011-04-26  4:59   ` Ying Han
  2011-04-26 18:37   ` Ying Han
  2011-04-25  9:40 ` [PATCH 6/7] memcg add zone_all_unreclaimable KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

A following patch will change the logic. This is the core.
==
This is the main loop of per-memcg background reclaim, implemented in
shrink_mem_cgroup().

The function performs a priority loop similar to global reclaim. During each
iteration it frees memory from a selected victim node.
After reclaiming enough pages or scanning enough pages, it returns and finds
the next work in round-robin fashion.
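
As an illustration only (not part of the patch), a user-space sketch of the
scan budget and emulated priority used by shrink_mem_cgroup() below; the node
count and per-node scan numbers are made up:

#include <stdio.h>

#define DEF_PRIORITY		12
#define SWAP_CLUSTER_MAX	32
#define MEMCG_BGSCAN_LIMIT	2048

int main(void)
{
	int priority = DEF_PRIORITY;
	int nr_online_nodes = 4;			/* made-up node count */
	unsigned long total_scanned = 0, per_node_scan = 300;
	unsigned long next_prio;

	/* min(SWAP_CLUSTER_MAX * nr_nodes, MEMCG_BGSCAN_LIMIT/8) == 128 here */
	next_prio = SWAP_CLUSTER_MAX * nr_online_nodes;
	if (next_prio > MEMCG_BGSCAN_LIMIT / 8)
		next_prio = MEMCG_BGSCAN_LIMIT / 8;

	while (total_scanned < MEMCG_BGSCAN_LIMIT) {
		total_scanned += per_node_scan;		/* one node visited */
		if (total_scanned > next_prio) {
			priority--;			/* emulated priority drop */
			next_prio <<= 1;
		}
		printf("scanned=%lu priority=%d\n", total_scanned, priority);
	}
	/* the work stops after at most MEMCG_BGSCAN_LIMIT scanned pages */
	return 0;
}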

changelog v8b..v7
1. reworked to use a workqueue rather than threads.
2. changed the shrink_mem_cgroup algorithm to fit the workqueue. In short,
   avoid long-running work, allow quick round-robin, and avoid unnecessary
   page writeback. When a thread dirties pages continuously, writing them
   back via the flusher is far faster than writeback by background reclaim.
   This detail will be fixed when dirty_ratio is implemented. The logic
   around this will be revisited in a following patch.

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   11 ++++
 mm/memcontrol.c            |   44 ++++++++++++++---
 mm/vmscan.c                |  115 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 162 insertions(+), 8 deletions(-)

Index: memcg/include/linux/memcontrol.h
===================================================================
--- memcg.orig/include/linux/memcontrol.h
+++ memcg/include/linux/memcontrol.h
@@ -89,6 +89,8 @@ extern int mem_cgroup_last_scanned_node(
 extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
 					const nodemask_t *nodes);
 
+unsigned long shrink_mem_cgroup(struct mem_cgroup *mem);
+
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
 {
@@ -112,6 +114,9 @@ extern void mem_cgroup_end_migration(str
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+				int nid, int zone_idx);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru);
@@ -310,6 +315,12 @@ mem_cgroup_inactive_file_is_low(struct m
 }
 
 static inline unsigned long
+mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zone_idx)
+{
+	return 0;
+}
+
+static inline unsigned long
 mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
 			 enum lru_list lru)
 {
Index: memcg/mm/memcontrol.c
===================================================================
--- memcg.orig/mm/memcontrol.c
+++ memcg/mm/memcontrol.c
@@ -1166,6 +1166,23 @@ int mem_cgroup_inactive_file_is_low(stru
 	return (active > inactive);
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+						int nid, int zone_idx)
+{
+	int nr;
+	struct mem_cgroup_per_zone *mz =
+		mem_cgroup_zoneinfo(memcg, nid, zone_idx);
+
+	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
+	     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
+		      MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
+
+	return nr;
+}
+
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru)
@@ -1286,7 +1303,7 @@ static unsigned long mem_cgroup_margin(s
 	return margin >> PAGE_SHIFT;
 }
 
-static unsigned int get_swappiness(struct mem_cgroup *memcg)
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
 	struct cgroup *cgrp = memcg->css.cgroup;
 
@@ -1595,14 +1612,15 @@ static int mem_cgroup_hierarchical_recla
 		/* we use swappiness of local cgroup */
 		if (check_soft) {
 			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
-				noswap, get_swappiness(victim), zone,
+				noswap, mem_cgroup_swappiness(victim), zone,
 				&nr_scanned);
 			*total_scanned += nr_scanned;
 			mem_cgroup_soft_steal(victim, ret);
 			mem_cgroup_soft_scan(victim, nr_scanned);
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, get_swappiness(victim));
+						noswap,
+						mem_cgroup_swappiness(victim));
 		css_put(&victim->css);
 		/*
 		 * At shrinking usage, we can't check we should stop here or
@@ -1628,15 +1646,25 @@ static int mem_cgroup_hierarchical_recla
 int
 mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
 {
-	int next_nid;
+	int next_nid, i;
 	int last_scanned;
 
 	last_scanned = mem->last_scanned_node;
-	next_nid = next_node(last_scanned, *nodes);
+	next_nid = last_scanned;
+rescan:
+	next_nid = next_node(next_nid, *nodes);
 
 	if (next_nid == MAX_NUMNODES)
 		next_nid = first_node(*nodes);
 
+	/* If no page on this node, skip */
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		if (mem_cgroup_zone_reclaimable_pages(mem, next_nid, i))
+			break;
+
+	if (next_nid != last_scanned && (i == MAX_NR_ZONES))
+		goto rescan;
+
 	mem->last_scanned_node = next_nid;
 
 	return next_nid;
@@ -3649,7 +3677,7 @@ try_to_free:
 			goto out;
 		}
 		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
-						false, get_swappiness(mem));
+					false, mem_cgroup_swappiness(mem));
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
@@ -4073,7 +4101,7 @@ static u64 mem_cgroup_swappiness_read(st
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 
-	return get_swappiness(memcg);
+	return mem_cgroup_swappiness(memcg);
 }
 
 static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
@@ -4849,7 +4877,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	INIT_LIST_HEAD(&mem->oom_notify);
 
 	if (parent)
-		mem->swappiness = get_swappiness(parent);
+		mem->swappiness = mem_cgroup_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
Index: memcg/mm/vmscan.c
===================================================================
--- memcg.orig/mm/vmscan.c
+++ memcg/mm/vmscan.c
@@ -42,6 +42,7 @@
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
 #include <linux/oom.h>
+#include <linux/res_counter.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2308,6 +2309,120 @@ static bool sleeping_prematurely(pg_data
 		return !all_zones_ok;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * The function is used for the per-memcg LRU. It scans all the zones of the
+ * node and returns the nr_scanned and nr_reclaimed.
+ */
+/*
+ * Limit of scanning per iteration. For round-robin.
+ */
+#define MEMCG_BGSCAN_LIMIT	(2048)
+
+static void
+shrink_memcg_node(int nid, int priority, struct scan_control *sc)
+{
+	unsigned long total_scanned = 0;
+	struct mem_cgroup *mem_cont = sc->mem_cgroup;
+	int i;
+
+	/*
+	 * This dma->highmem order is consistent with global reclaim.
+	 * We do this because the page allocator works in the opposite
+	 * direction although memcg user pages are mostly allocated at
+	 * highmem.
+	 */
+	for (i = 0;
+	     (i < NODE_DATA(nid)->nr_zones) &&
+	     (total_scanned < MEMCG_BGSCAN_LIMIT);
+	     i++) {
+		struct zone *zone = NODE_DATA(nid)->node_zones + i;
+		struct zone_reclaim_stat *zrs;
+		unsigned long scan, rotate;
+
+		if (!populated_zone(zone))
+			continue;
+		scan = mem_cgroup_zone_reclaimable_pages(mem_cont, nid, i);
+		if (!scan)
+			continue;
+		/* If recent memory reclaim on this zone didn't make good progress */
+		zrs = get_reclaim_stat(zone, sc);
+		scan = zrs->recent_scanned[0] + zrs->recent_scanned[1];
+		rotate = zrs->recent_rotated[0] + zrs->recent_rotated[1];
+
+		if (rotate > scan/2)
+			sc->may_writepage = 1;
+
+		sc->nr_scanned = 0;
+		shrink_zone(priority, zone, sc);
+		total_scanned += sc->nr_scanned;
+		sc->may_writepage = 0;
+	}
+	sc->nr_scanned = total_scanned;
+}
+
+/*
+ * Per cgroup background reclaim.
+ */
+unsigned long shrink_mem_cgroup(struct mem_cgroup *mem)
+{
+	int nid, priority, next_prio;
+	nodemask_t nodes;
+	unsigned long total_scanned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_HIGHUSER_MOVABLE,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.order = 0,
+		.mem_cgroup = mem,
+	};
+
+	sc.may_writepage = 0;
+	sc.nr_reclaimed = 0;
+	total_scanned = 0;
+	nodes = node_states[N_HIGH_MEMORY];
+	sc.swappiness = mem_cgroup_swappiness(mem);
+
+	current->flags |= PF_SWAPWRITE;
+	/*
+	 * Unlike kswapd, we need to traverse cgroups one by one. So, we don't
+	 * use the full priority range. Just scan a small number of pages and
+	 * visit the next cgroup. Now, we scan at most MEMCG_BGSCAN_LIMIT pages
+	 * per invocation and emulate priority decay from the scan count.
+	 */
+	next_prio = min(SWAP_CLUSTER_MAX * num_node_state(N_HIGH_MEMORY),
+			MEMCG_BGSCAN_LIMIT/8);
+	priority = DEF_PRIORITY;
+	while ((total_scanned < MEMCG_BGSCAN_LIMIT) &&
+	       !nodes_empty(nodes) &&
+	       (sc.nr_to_reclaim > sc.nr_reclaimed)) {
+
+		nid = mem_cgroup_select_victim_node(mem, &nodes);
+		shrink_memcg_node(nid, priority, &sc);
+		/*
+		 * The node seems to have no pages.
+		 * Skip it for a while.
+		 */
+		if (!sc.nr_scanned)
+			node_clear(nid, nodes);
+		total_scanned += sc.nr_scanned;
+		if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
+			break;
+		/* emulate priority */
+		if (total_scanned > next_prio) {
+			priority--;
+			next_prio <<= 1;
+		}
+		if (sc.nr_scanned &&
+		    total_scanned > sc.nr_reclaimed * 2)
+			congestion_wait(WRITE, HZ/10);
+	}
+	current->flags &= ~PF_SWAPWRITE;
+	return sc.nr_reclaimed;
+}
+#endif
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at high_wmark_pages(zone).


* [PATCH 6/7] memcg add zone_all_unreclaimable.
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2011-04-25  9:36 ` [PATCH 5/7] memcg bgreclaim core KAMEZAWA Hiroyuki
@ 2011-04-25  9:40 ` KAMEZAWA Hiroyuki
  2011-04-25  9:42 ` [PATCH 7/7] memcg watermark reclaim workqueue KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko


After reclaiming from each node of a memcg, it checks mem_cgroup_watermark_ok()
and breaks the priority loop if that returns true. A per-memcg zone is
marked "unreclaimable" if the scanning rate is much greater than the
reclaiming rate on the per-memcg LRU. The bit is cleared when a page charged
to the memcg is freed. The per-memcg kswapd breaks the priority loop if
all the zones are marked "unreclaimable".
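
For illustration only (not part of the patch), a user-space sketch of the
heuristic with made-up numbers; ZONE_RECLAIMABLE_RATE matches the value added
to swap.h below:

#include <stdbool.h>
#include <stdio.h>

#define ZONE_RECLAIMABLE_RATE 6

struct mz_state {
	unsigned long pages_scanned;	/* since the last page was freed */
	unsigned long reclaimable;	/* file + (anon, if swap is available) */
	bool all_unreclaimable;
};

static bool mz_reclaimable(const struct mz_state *mz)
{
	return mz->pages_scanned < mz->reclaimable * ZONE_RECLAIMABLE_RATE;
}

int main(void)
{
	struct mz_state mz = { .reclaimable = 100 };	/* made-up LRU size */

	while (mz_reclaimable(&mz))
		mz.pages_scanned += 150;		/* one scan pass */
	mz.all_unreclaimable = true;			/* give up on this zone */

	printf("gave up after scanning %lu pages (threshold %u)\n",
	       mz.pages_scanned, 100 * ZONE_RECLAIMABLE_RATE);
	/* freeing (uncharging) a page would reset pages_scanned and the flag */
	return 0;
}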

changelog v8a..v7
  removed the use of priority.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |   40 ++++++++++++++
 include/linux/sched.h      |    1 
 include/linux/swap.h       |    2 
 mm/memcontrol.c            |  126 +++++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                |   13 ++++
 5 files changed, 177 insertions(+), 5 deletions(-)

Index: memcg/include/linux/memcontrol.h
===================================================================
--- memcg.orig/include/linux/memcontrol.h
+++ memcg/include/linux/memcontrol.h
@@ -158,6 +158,14 @@ unsigned long mem_cgroup_soft_limit_recl
 						unsigned long *total_scanned);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 u64 mem_cgroup_get_usage(struct mem_cgroup *mem);
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, struct zone *zone);
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page);
+void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+					struct zone *zone);
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+					unsigned long nr_scanned);
 
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -355,6 +363,38 @@ static inline void mem_cgroup_dec_page_s
 {
 }
 
+static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem,
+					       struct zone *zone)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
+						struct zone *zone)
+{
+	return false;
+}
+
+static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
+							struct zone *zone)
+{
+}
+
+static inline void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem,
+							struct page *page)
+{
+}
+
+static inline void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+							struct zone *zone)
+{
+}
+static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
+						struct zone *zone,
+						unsigned long nr_scanned)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask,
Index: memcg/include/linux/sched.h
===================================================================
--- memcg.orig/include/linux/sched.h
+++ memcg/include/linux/sched.h
@@ -1540,6 +1540,7 @@ struct task_struct {
 		struct mem_cgroup *memcg; /* target memcg of uncharge */
 		unsigned long nr_pages;	/* uncharged usage */
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
+		struct zone *zone; /* a zone page is last uncharged */
 	} memcg_batch;
 #endif
 };
Index: memcg/include/linux/swap.h
===================================================================
--- memcg.orig/include/linux/swap.h
+++ memcg/include/linux/swap.h
@@ -152,6 +152,8 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
+#define ZONE_RECLAIMABLE_RATE 6
+
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
Index: memcg/mm/memcontrol.c
===================================================================
--- memcg.orig/mm/memcontrol.c
+++ memcg/mm/memcontrol.c
@@ -139,7 +139,10 @@ struct mem_cgroup_per_zone {
 	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	unsigned long		pages_scanned;	/* since last reclaim */
+	bool			all_unreclaimable;	/* All pages pinned */
 };
+
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
@@ -1166,12 +1169,15 @@ int mem_cgroup_inactive_file_is_low(stru
 	return (active > inactive);
 }
 
-unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *mem,
 						int nid, int zone_idx)
 {
 	int nr;
-	struct mem_cgroup_per_zone *mz =
-		mem_cgroup_zoneinfo(memcg, nid, zone_idx);
+	struct mem_cgroup_per_zone *mz;
+
+	if (!mem)
+		return 0;
+	mz = mem_cgroup_zoneinfo(mem, nid, zone_idx);
 
 	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
 	     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
@@ -1222,6 +1228,102 @@ mem_cgroup_get_reclaim_stat_from_page(st
 	return &mz->reclaim_stat;
 }
 
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone *zone,
+						unsigned long nr_scanned)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->pages_scanned += nr_scanned;
+}
+
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+
+	return mz->pages_scanned <
+			mem_cgroup_zone_reclaimable_pages(mem, nid, zid) *
+			ZONE_RECLAIMABLE_RATE;
+}
+
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return false;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->all_unreclaimable;
+
+	return false;
+}
+
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->all_unreclaimable = true;
+}
+
+void mem_cgroup_mz_clear_unreclaimable(struct mem_cgroup *mem,
+				       struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
+void mem_cgroup_clear_unreclaimable(struct mem_cgroup *mem, struct page *page)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return;
+
+	mz = page_cgroup_zoneinfo(mem, page);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
+	}
+
+	return;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -2791,6 +2893,7 @@ void mem_cgroup_cancel_charge_swapin(str
 
 static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
 				   unsigned int nr_pages,
+				   struct page *page,
 				   const enum charge_type ctype)
 {
 	struct memcg_batch_info *batch = NULL;
@@ -2808,6 +2911,10 @@ static void mem_cgroup_do_uncharge(struc
 	 */
 	if (!batch->memcg)
 		batch->memcg = mem;
+
+	if (!batch->zone)
+		batch->zone = page_zone(page);
+
 	/*
 	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
 	 * In those cases, all pages freed continuously can be expected to be in
@@ -2829,12 +2936,17 @@ static void mem_cgroup_do_uncharge(struc
 	 */
 	if (batch->memcg != mem)
 		goto direct_uncharge;
+
+	if (batch->zone != page_zone(page))
+		mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
+
 	/* remember freed charge and uncharge it later */
 	batch->nr_pages++;
 	if (uncharge_memsw)
 		batch->memsw_nr_pages++;
 	return;
 direct_uncharge:
+	mem_cgroup_mz_clear_unreclaimable(mem, page_zone(page));
 	res_counter_uncharge(&mem->res, nr_pages * PAGE_SIZE);
 	if (uncharge_memsw)
 		res_counter_uncharge(&mem->memsw, nr_pages * PAGE_SIZE);
@@ -2916,7 +3028,7 @@ __mem_cgroup_uncharge_common(struct page
 		mem_cgroup_get(mem);
 	}
 	if (!mem_cgroup_is_root(mem))
-		mem_cgroup_do_uncharge(mem, nr_pages, ctype);
+		mem_cgroup_do_uncharge(mem, nr_pages, page, ctype);
 
 	return mem;
 
@@ -2984,6 +3096,10 @@ void mem_cgroup_uncharge_end(void)
 	if (batch->memsw_nr_pages)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
+	if (batch->zone)
+		mem_cgroup_mz_clear_unreclaimable(batch->memcg, batch->zone);
+	batch->zone = NULL;
+
 	memcg_oom_recover(batch->memcg);
 	/* forget this pointer (for sanity check) */
 	batch->memcg = NULL;
@@ -4659,6 +4775,8 @@ static int alloc_mem_cgroup_per_zone_inf
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = false;
 	}
 	return 0;
 }
Index: memcg/mm/vmscan.c
===================================================================
--- memcg.orig/mm/vmscan.c
+++ memcg/mm/vmscan.c
@@ -1412,6 +1412,9 @@ shrink_inactive_list(unsigned long nr_to
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
+
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
+
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
@@ -1531,6 +1534,7 @@ static void shrink_active_list(unsigned 
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
@@ -1998,7 +2002,8 @@ static void shrink_zones(int priority, s
 
 static bool zone_reclaimable(struct zone *zone)
 {
-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+	return zone->pages_scanned < zone_reclaimable_pages(zone) *
+					ZONE_RECLAIMABLE_RATE;
 }
 
 /* All zones in zonelist are unreclaimable? */
@@ -2343,6 +2348,10 @@ shrink_memcg_node(int nid, int priority,
 		scan = mem_cgroup_zone_reclaimable_pages(mem_cont, nid, i);
 		if (!scan)
 			continue;
+		/* we would like to reclaim memory from where it is easy */
+		if ((sc->nr_reclaimed >= total_scanned/4) &&
+		     mem_cgroup_mz_unreclaimable(mem_cont, zone))
+			continue;
 		/* If recent memory reclaim on this zone doesn't get good */
 		zrs = get_reclaim_stat(zone, sc);
 		scan = zrs->recent_scanned[0] + zrs->recent_scanned[1];
@@ -2355,6 +2364,8 @@ shrink_memcg_node(int nid, int priority,
 		shrink_zone(priority, zone, sc);
 		total_scanned += sc->nr_scanned;
 		sc->may_writepage = 0;
+		if (!mem_cgroup_zone_reclaimable(mem_cont, zone))
+			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
 	}
 	sc->nr_scanned = total_scanned;
 }


* [PATCH 7/7] memcg watermark reclaim workqueue.
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2011-04-25  9:40 ` [PATCH 6/7] memcg add zone_all_unreclaimable KAMEZAWA Hiroyuki
@ 2011-04-25  9:42 ` KAMEZAWA Hiroyuki
  2011-04-26 23:19   ` Ying Han
  2011-04-25  9:43 ` [PATCH 8/7] memcg : reclaim statistics KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

By default, per-memcg background reclaim is disabled when limit_in_bytes
is set to the maximum. kswapd_run() is called when the memcg is resized,
and kswapd_stop() is called when the memcg is deleted.

The per-memcg kswapd is woken up based on the usage and the low_wmark, which
is checked once per 1024 charge events per cpu. The memcg's kswapd is woken
up if the usage is larger than the low_wmark.

At each iteration of work, the work frees at most 2048 pages of memory and
switches to the next work for round-robin. If the memcg seems congested, it
adds a delay before the next work.
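
For illustration only (not part of the patch), a user-space sketch of the
event-driven wakeup: queue_work() below is a stub standing in for
queue_delayed_work() on the memcg workqueue, and the sizes are made up:

#include <stdbool.h>
#include <stdio.h>

#define WMARK_EVENTS_TARGET 1024

static unsigned long events, next_target = WMARK_EVENTS_TARGET;
static unsigned long long usage, low_wmark = 400ULL << 20;	/* 400MB */
static bool work_pending;

static void queue_work(void)	/* stub for queueing the delayed work item */
{
	work_pending = true;
	printf("background reclaim queued at usage %lluMB\n", usage >> 20);
}

static void charge_one_page(void)
{
	usage += 4096;			/* one 4KB page charged */
	if (++events < next_target)
		return;			/* cheap path, taken 1023 times in 1024 */
	next_target += WMARK_EVENTS_TARGET;
	if (usage > low_wmark && !work_pending)
		queue_work();		/* usage crossed low_wmark: wake reclaim */
}

int main(void)
{
	int i;

	/* charge ~430MB worth of pages; the wakeup fires once usage has
	 * crossed low_wmark and the next event threshold is reached */
	for (i = 0; i < 110000; i++)
		charge_one_page();
	return 0;
}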

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    2 -
 mm/memcontrol.c            |   86 +++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |   23 +++++++-----
 3 files changed, 102 insertions(+), 9 deletions(-)

Index: memcg/mm/memcontrol.c
===================================================================
--- memcg.orig/mm/memcontrol.c
+++ memcg/mm/memcontrol.c
@@ -111,10 +111,12 @@ enum mem_cgroup_events_index {
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
 	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_WMARK_EVENTS_THRESH,
 	MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
 #define SOFTLIMIT_EVENTS_TARGET (1024)
+#define WMARK_EVENTS_TARGET (1024)
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -267,6 +269,11 @@ struct mem_cgroup {
 	struct list_head oom_notify;
 
 	/*
+ 	 * For high/low watermark.
+ 	 */
+	bool			bgreclaim_resched;
+	struct delayed_work	bgreclaim_work;
+	/*
 	 * Should we move charges of a task when a task is moved into this
 	 * mem_cgroup ? And what type of charges should we move ?
 	 */
@@ -374,6 +381,8 @@ static void mem_cgroup_put(struct mem_cg
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 
+static void wake_memcg_kswapd(struct mem_cgroup *mem);
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
@@ -552,6 +561,12 @@ mem_cgroup_largest_soft_limit_node(struc
 	return mz;
 }
 
+static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
+		wake_memcg_kswapd(mem);
+}
+
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -702,6 +717,9 @@ static void __mem_cgroup_target_update(s
 	case MEM_CGROUP_TARGET_SOFTLIMIT:
 		next = val + SOFTLIMIT_EVENTS_TARGET;
 		break;
+	case MEM_CGROUP_WMARK_EVENTS_THRESH:
+		next = val + WMARK_EVENTS_TARGET;
+		break;
 	default:
 		return;
 	}
@@ -725,6 +743,10 @@ static void memcg_check_events(struct me
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
+		if (unlikely(__memcg_event_check(mem,
+			MEM_CGROUP_WMARK_EVENTS_THRESH))){
+			mem_cgroup_check_wmark(mem);
+		}
 	}
 }
 
@@ -3661,6 +3683,67 @@ unsigned long mem_cgroup_soft_limit_recl
 	return nr_reclaimed;
 }
 
+struct workqueue_struct *memcg_bgreclaimq;
+
+static int memcg_bgreclaim_init(void)
+{
+	/*
+	 * use UNBOUND workqueue because we traverse nodes (no locality) and
+	 * the work is cpu-intensive.
+	 */
+	memcg_bgreclaimq = alloc_workqueue("memcg",
+			WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
+	return 0;
+}
+module_init(memcg_bgreclaim_init);
+
+static void memcg_bgreclaim(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct mem_cgroup *mem =
+		container_of(dw, struct mem_cgroup, bgreclaim_work);
+	int delay = 0;
+	unsigned long long required, usage, hiwat;
+
+	hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+	usage = res_counter_read_u64(&mem->res, RES_USAGE);
+	required = usage - hiwat;
+	if (required >= 0)  {
+		required = ((usage - hiwat) >> PAGE_SHIFT) + 1;
+		delay = shrink_mem_cgroup(mem, (long)required);
+	}
+	if (!mem->bgreclaim_resched  ||
+		mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
+		cgroup_release_and_wakeup_rmdir(&mem->css);
+		return;
+	}
+	/* need reschedule */
+	if (!queue_delayed_work(memcg_bgreclaimq, &mem->bgreclaim_work, delay))
+		cgroup_release_and_wakeup_rmdir(&mem->css);
+}
+
+static void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	if (delayed_work_pending(&mem->bgreclaim_work))
+		return;
+	cgroup_exclude_rmdir(&mem->css);
+	if (!queue_delayed_work(memcg_bgreclaimq, &mem->bgreclaim_work, 0))
+		cgroup_release_and_wakeup_rmdir(&mem->css);
+	return;
+}
+
+static void stop_memcg_kswapd(struct mem_cgroup *mem)
+{
+	/*
+	 * At destroy(), there is no task, so we don't need to care about new
+	 * bgreclaim work being queued. But we need to prevent the pending work
+	 * from rescheduling itself; bgreclaim_resched tells it not to.
+	 */
+	mem->bgreclaim_resched = false;
+	flush_delayed_work(&mem->bgreclaim_work);
+	mem->bgreclaim_resched = true;
+}
+
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -3742,6 +3825,7 @@ move_account:
 		ret = -EBUSY;
 		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
 			goto out;
+		stop_memcg_kswapd(mem);
 		ret = -EINTR;
 		if (signal_pending(current))
 			goto out;
@@ -4804,6 +4888,8 @@ static struct mem_cgroup *mem_cgroup_all
 	if (!mem->stat)
 		goto out_free;
 	spin_lock_init(&mem->pcp_counter_lock);
+	INIT_DELAYED_WORK(&mem->bgreclaim_work, memcg_bgreclaim);
+	mem->bgreclaim_resched = true;
 	return mem;
 
 out_free:
Index: memcg/include/linux/memcontrol.h
===================================================================
--- memcg.orig/include/linux/memcontrol.h
+++ memcg/include/linux/memcontrol.h
@@ -89,7 +89,7 @@ extern int mem_cgroup_last_scanned_node(
 extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
 					const nodemask_t *nodes);
 
-unsigned long shrink_mem_cgroup(struct mem_cgroup *mem);
+int shrink_mem_cgroup(struct mem_cgroup *mem, long required);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
Index: memcg/mm/vmscan.c
===================================================================
--- memcg.orig/mm/vmscan.c
+++ memcg/mm/vmscan.c
@@ -2373,20 +2373,19 @@ shrink_memcg_node(int nid, int priority,
 /*
  * Per cgroup background reclaim.
  */
-unsigned long shrink_mem_cgroup(struct mem_cgroup *mem)
+int shrink_mem_cgroup(struct mem_cgroup *mem, long required)
 {
-	int nid, priority, next_prio;
+	int nid, priority, next_prio, delay;
 	nodemask_t nodes;
 	unsigned long total_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_HIGHUSER_MOVABLE,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.order = 0,
 		.mem_cgroup = mem,
 	};
-
+	/* writepage will be set later per zone */
 	sc.may_writepage = 0;
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
@@ -2400,9 +2399,12 @@ unsigned long shrink_mem_cgroup(struct m
 	 * Now, we scan MEMCG_BGRECLAIM_SCAN_LIMIT pages per scan.
 	 * We use static priority 0.
 	 */
+	sc.nr_to_reclaim = min(required, (long)MEMCG_BGSCAN_LIMIT/2);
 	next_prio = min(SWAP_CLUSTER_MAX * num_node_state(N_HIGH_MEMORY),
 			MEMCG_BGSCAN_LIMIT/8);
 	priority = DEF_PRIORITY;
+	/* delay for next work at congestion */
+	delay = HZ/10;
 	while ((total_scanned < MEMCG_BGSCAN_LIMIT) &&
 	       !nodes_empty(nodes) &&
 	       (sc.nr_to_reclaim > sc.nr_reclaimed)) {
@@ -2423,12 +2425,17 @@ unsigned long shrink_mem_cgroup(struct m
 			priority--;
 			next_prio <<= 1;
 		}
-		if (sc.nr_scanned &&
-		    total_scanned > sc.nr_reclaimed * 2)
-			congestion_wait(WRITE, HZ/10);
+		/* give up early ? */
+		if (total_scanned > MEMCG_BGSCAN_LIMIT/8 &&
+		    total_scanned > sc.nr_reclaimed * 4)
+			goto out;
 	}
+	/* We scanned enough...If we reclaimed half of requested, no delay */
+	if (sc.nr_reclaimed > sc.nr_to_reclaim/2)
+		delay = 0;
+out:
 	current->flags &= ~PF_SWAPWRITE;
-	return sc.nr_reclaimed;
+	return delay;
 }
 #endif
 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 8/7] memcg : reclaim statistics
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (6 preceding siblings ...)
  2011-04-25  9:42 ` [PATCH 7/7] memcg watermark reclaim workqueue KAMEZAWA Hiroyuki
@ 2011-04-25  9:43 ` KAMEZAWA Hiroyuki
  2011-04-26  5:35   ` Ying Han
  2011-04-25  9:49 ` [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

When tuning memcg background reclaim, the cpu usage of each memcg's work is
interesting information because some amount of a shared resource is consumed
(background reclaim runs on a workqueue). Other information such as the number
of pages scanned and reclaimed is important as well.

This patch exposes these via memory.stat: cpu usage for direct reclaim,
soft limit reclaim and watermark reclaim, plus page scan/free statistics.


 # cat /cgroup/memory/A/memory.stat
 ....
 direct_elapsed_ns 0
 soft_elapsed_ns 0
 wmark_elapsed_ns 103566424
 direct_scanned 0
 soft_scanned 0
 wmark_scanned 29303
 direct_freed 0
 soft_freed 0
 wmark_freed 29290
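
With these, the efficiency of watermark reclaim (pages freed per page scanned)
and its cpu cost can be pulled out of memory.stat directly. A minimal sketch,
assuming the /cgroup/memory mount point and the group A used above:

 # cd /cgroup/memory/A
 # scanned=`awk '/^wmark_scanned/ {print $2}' memory.stat`
 # freed=`awk '/^wmark_freed/ {print $2}' memory.stat`
 # elapsed=`awk '/^wmark_elapsed_ns/ {print $2}' memory.stat`
 # echo "wmark reclaim: $freed freed / $scanned scanned, $elapsed ns of cpu"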


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |   18 +++++++++
 include/linux/memcontrol.h       |    6 +++
 include/linux/swap.h             |    7 +++
 mm/memcontrol.c                  |   77 +++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                      |   15 +++++++
 5 files changed, 120 insertions(+), 3 deletions(-)

Index: memcg/mm/memcontrol.c
===================================================================
--- memcg.orig/mm/memcontrol.c
+++ memcg/mm/memcontrol.c
@@ -274,6 +274,17 @@ struct mem_cgroup {
 	bool			bgreclaim_resched;
 	struct delayed_work	bgreclaim_work;
 	/*
+	 * reclaim statistics (not per zone, node)
+	 */
+	spinlock_t		elapsed_lock;
+	u64			bgreclaim_elapsed;
+	u64			direct_elapsed;
+	u64			soft_elapsed;
+
+	u64			reclaim_scan[NR_RECLAIM_CONTEXTS];
+	u64			reclaim_freed[NR_RECLAIM_CONTEXTS];
+
+	/*
 	 * Should we move charges of a task when a task is moved into this
 	 * mem_cgroup ? And what type of charges should we move ?
 	 */
@@ -1346,6 +1357,18 @@ void mem_cgroup_clear_unreclaimable(stru
 	return;
 }
 
+void mem_cgroup_reclaim_statistics(struct mem_cgroup *mem,
+		int context, unsigned long scanned,
+		unsigned long freed)
+{
+	if (!mem)
+		return;
+	spin_lock(&mem->elapsed_lock);
+	mem->reclaim_scan[context] += scanned;
+	mem->reclaim_freed[context] += freed;
+	spin_unlock(&mem->elapsed_lock);
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -1692,6 +1715,7 @@ static int mem_cgroup_hierarchical_recla
 	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	unsigned long excess;
 	unsigned long nr_scanned;
+	s64 start, end;
 
 	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
 
@@ -1735,16 +1759,27 @@ static int mem_cgroup_hierarchical_recla
 		}
 		/* we use swappiness of local cgroup */
 		if (check_soft) {
+			start = sched_clock();
 			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
 				noswap, mem_cgroup_swappiness(victim), zone,
 				&nr_scanned);
 			*total_scanned += nr_scanned;
+			end = sched_clock();
+			spin_lock(&victim->elapsed_lock);
+			victim->soft_elapsed += end - start;
+			spin_unlock(&victim->elapsed_lock);
 			mem_cgroup_soft_steal(victim, ret);
 			mem_cgroup_soft_scan(victim, nr_scanned);
-		} else
+		} else {
+			start = sched_clock();
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
 						noswap,
 						mem_cgroup_swappiness(victim));
+			end = sched_clock();
+			spin_lock(&victim->elapsed_lock);
+			victim->direct_elapsed += end - start;
+			spin_unlock(&victim->elapsed_lock);
+		}
 		css_put(&victim->css);
 		/*
 		 * At shrinking usage, we can't check we should stop here or
@@ -3702,15 +3737,22 @@ static void memcg_bgreclaim(struct work_
 	struct delayed_work *dw = to_delayed_work(work);
 	struct mem_cgroup *mem =
 		container_of(dw, struct mem_cgroup, bgreclaim_work);
-	int delay = 0;
+	int delay;
 	unsigned long long required, usage, hiwat;
 
+	delay = 0;
 	hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
 	usage = res_counter_read_u64(&mem->res, RES_USAGE);
 	required = usage - hiwat;
 	if (required >= 0)  {
+		u64 start, end;
 		required = ((usage - hiwat) >> PAGE_SHIFT) + 1;
+		start = sched_clock();
 		delay = shrink_mem_cgroup(mem, (long)required);
+		end = sched_clock();
+		spin_lock(&mem->elapsed_lock);
+		mem->bgreclaim_elapsed += end - start;
+		spin_unlock(&mem->elapsed_lock);
 	}
 	if (!mem->bgreclaim_resched  ||
 		mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
@@ -4152,6 +4194,15 @@ enum {
 	MCS_INACTIVE_FILE,
 	MCS_ACTIVE_FILE,
 	MCS_UNEVICTABLE,
+	MCS_DIRECT_ELAPSED,
+	MCS_SOFT_ELAPSED,
+	MCS_WMARK_ELAPSED,
+	MCS_DIRECT_SCANNED,
+	MCS_SOFT_SCANNED,
+	MCS_WMARK_SCANNED,
+	MCS_DIRECT_FREED,
+	MCS_SOFT_FREED,
+	MCS_WMARK_FREED,
 	NR_MCS_STAT,
 };
 
@@ -4177,7 +4228,16 @@ struct {
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
 	{"active_file", "total_active_file"},
-	{"unevictable", "total_unevictable"}
+	{"unevictable", "total_unevictable"},
+	{"direct_elapsed_ns", "total_direct_elapsed_ns"},
+	{"soft_elapsed_ns", "total_soft_elapsed_ns"},
+	{"wmark_elapsed_ns", "total_wmark_elapsed_ns"},
+	{"direct_scanned", "total_direct_scanned"},
+	{"soft_scanned", "total_soft_scanned"},
+	{"wmark_scanned", "total_wmark_scanned"},
+	{"direct_freed", "total_direct_freed"},
+	{"soft_freed", "total_soft_freed"},
+	{"wmark_freed", "total_wamrk_freed"}
 };
 
 
@@ -4185,6 +4245,7 @@ static void
 mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 {
 	s64 val;
+	int i;
 
 	/* per cpu stat */
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
@@ -4221,6 +4282,15 @@ mem_cgroup_get_local_stat(struct mem_cgr
 	s->stat[MCS_ACTIVE_FILE] += val * PAGE_SIZE;
 	val = mem_cgroup_get_local_zonestat(mem, LRU_UNEVICTABLE);
 	s->stat[MCS_UNEVICTABLE] += val * PAGE_SIZE;
+
+	/* reclaim stats */
+	s->stat[MCS_DIRECT_ELAPSED] += mem->direct_elapsed;
+	s->stat[MCS_SOFT_ELAPSED] += mem->soft_elapsed;
+	s->stat[MCS_WMARK_ELAPSED] += mem->bgreclaim_elapsed;
+	for (i = 0; i < NR_RECLAIM_CONTEXTS; i++) {
+		s->stat[i + MCS_DIRECT_SCANNED] += mem->reclaim_scan[i];
+		s->stat[i + MCS_DIRECT_FREED] += mem->reclaim_freed[i];
+	}
 }
 
 static void
@@ -4889,6 +4959,7 @@ static struct mem_cgroup *mem_cgroup_all
 		goto out_free;
 	spin_lock_init(&mem->pcp_counter_lock);
 	INIT_DELAYED_WORK(&mem->bgreclaim_work, memcg_bgreclaim);
+	spin_lock_init(&mem->elapsed_lock);
 	mem->bgreclaim_resched = true;
 	return mem;
 
Index: memcg/include/linux/memcontrol.h
===================================================================
--- memcg.orig/include/linux/memcontrol.h
+++ memcg/include/linux/memcontrol.h
@@ -90,6 +90,8 @@ extern int mem_cgroup_select_victim_node
 					const nodemask_t *nodes);
 
 int shrink_mem_cgroup(struct mem_cgroup *mem, long required);
+void mem_cgroup_reclaim_statistics(struct mem_cgroup *mem, int context,
+			unsigned long scanned, unsigned long freed);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
@@ -423,6 +425,10 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+void mem_cgroup_reclaim_statistics(struct mem_cgroup *mem, int context,
+				unsigned long scanned, unsigned long freed)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
Index: memcg/include/linux/swap.h
===================================================================
--- memcg.orig/include/linux/swap.h
+++ memcg/include/linux/swap.h
@@ -250,6 +250,13 @@ static inline void lru_cache_add_file(st
 #define ISOLATE_ACTIVE 1	/* Isolate active pages. */
 #define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
 
+/* context for memory reclaim (comes from memory cgroup). */
+enum {
+	RECLAIM_DIRECT,		/* under direct reclaim */
+	RECLAIM_KSWAPD,		/* under global kswapd's soft limit */
+	RECLAIM_WMARK,		/* under background reclaim by watermark */
+	NR_RECLAIM_CONTEXTS
+};
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
Index: memcg/mm/vmscan.c
===================================================================
--- memcg.orig/mm/vmscan.c
+++ memcg/mm/vmscan.c
@@ -72,6 +72,9 @@ typedef unsigned __bitwise__ reclaim_mod
 #define RECLAIM_MODE_LUMPYRECLAIM	((__force reclaim_mode_t)0x08u)
 #define RECLAIM_MODE_COMPACTION		((__force reclaim_mode_t)0x10u)
 
+/* 3 reclaim contexts for memcg statistics. */
+enum {DIRECT_RECLAIM, KSWAPD_RECLAIM, WMARK_RECLAIM};
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -107,6 +110,7 @@ struct scan_control {
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
+	int	reclaim_context;
 
 	/*
 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -2116,6 +2120,10 @@ out:
 	delayacct_freepages_end();
 	put_mems_allowed();
 
+	if (!scanning_global_lru(sc))
+		mem_cgroup_reclaim_statistics(sc->mem_cgroup,
+			sc->reclaim_context, total_scanned, sc->nr_reclaimed);
+
 	if (sc->nr_reclaimed)
 		return sc->nr_reclaimed;
 
@@ -2178,6 +2186,7 @@ unsigned long mem_cgroup_shrink_node_zon
 		.swappiness = swappiness,
 		.order = 0,
 		.mem_cgroup = mem,
+		.reclaim_context = RECLAIM_KSWAPD,
 	};
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2198,6 +2207,8 @@ unsigned long mem_cgroup_shrink_node_zon
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
+	mem_cgroup_reclaim_statistics(sc.mem_cgroup,
+			sc.reclaim_context, sc.nr_scanned, sc.nr_reclaimed);
 	*nr_scanned = sc.nr_scanned;
 	return sc.nr_reclaimed;
 }
@@ -2217,6 +2228,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.swappiness = swappiness,
 		.order = 0,
 		.mem_cgroup = mem_cont,
+		.reclaim_context = RECLAIM_DIRECT,
 		.nodemask = NULL, /* we don't care the placement */
 	};
 
@@ -2384,6 +2396,7 @@ int shrink_mem_cgroup(struct mem_cgroup 
 		.may_swap = 1,
 		.order = 0,
 		.mem_cgroup = mem,
+		.reclaim_context = RECLAIM_WMARK,
 	};
 	/* writepage will be set later per zone */
 	sc.may_writepage = 0;
@@ -2434,6 +2447,8 @@ int shrink_mem_cgroup(struct mem_cgroup 
 	if (sc.nr_reclaimed > sc.nr_to_reclaim/2)
 		delay = 0;
 out:
+	mem_cgroup_reclaim_statistics(sc.mem_cgroup, sc.reclaim_context,
+			total_scanned, sc.nr_reclaimed);
 	current->flags &= ~PF_SWAPWRITE;
 	return delay;
 }
Index: memcg/Documentation/cgroups/memory.txt
===================================================================
--- memcg.orig/Documentation/cgroups/memory.txt
+++ memcg/Documentation/cgroups/memory.txt
@@ -398,6 +398,15 @@ active_anon	- # of bytes of anonymous an
 inactive_file	- # of bytes of file-backed memory on inactive LRU list.
 active_file	- # of bytes of file-backed memory on active LRU list.
 unevictable	- # of bytes of memory that cannot be reclaimed (mlocked etc).
+direct_elapsed_ns  - cpu time spent in hard limit reclaim (ns)
+soft_elapsed_ns  - cpu time spent in soft limit reclaim (ns)
+wmark_elapsed_ns  - cpu time spent in hi/low watermark reclaim (ns)
+direct_scanned	- # of pages scanned by hard limit reclaim
+soft_scanned	- # of pages scanned by soft limit reclaim
+wmark_scanned	- # of pages scanned by hi/low watermark reclaim
+direct_freed	- # of pages freed by hard limit reclaim
+soft_freed	- # of pages freed by soft limit reclaim
+wmark_freed	- # of pages freed by hi/low watermark reclaim
 
 # status considering hierarchy (see memory.use_hierarchy settings)
 
@@ -421,6 +430,15 @@ total_active_anon	- sum of all children'
 total_inactive_file	- sum of all children's "inactive_file"
 total_active_file	- sum of all children's "active_file"
 total_unevictable	- sum of all children's "unevictable"
+total_direct_elapsed_ns - sum of all children's "direct_elapsed_ns"
+total_soft_elapsed_ns	- sum of all children's "soft_elapsed_ns"
+total_wmark_elapsed_ns	- sum of all children's "wmark_elapsed_ns"
+total_direct_scanned	- sum of all children's "direct_scanned"
+total_soft_scanned	- sum of all children's "soft_scanned"
+total_wmark_scanned	- sum of all children's "wmark_scanned"
+total_direct_freed	- sum of all children's "direct_freed"
+total_soft_freed	- sum of all children's "soft_freed"
+total_wmark_freed	- sum of all children's "wmark_freed"
 
 # The following additional stats are dependent on CONFIG_DEBUG_VM.
 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (7 preceding siblings ...)
  2011-04-25  9:43 ` [PATCH 8/7] memcg : reclaim statistics KAMEZAWA Hiroyuki
@ 2011-04-25  9:49 ` KAMEZAWA Hiroyuki
  2011-04-25 10:14 ` KAMEZAWA Hiroyuki
  2011-05-02  6:09 ` Balbir Singh
  10 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25  9:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Mon, 25 Apr 2011 18:25:29 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 1) == hard limit = 400M ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx                
> real    0m7.353s
> user    0m0.009s
> sys     0m3.280s
> 

Sorry, the size of the tmpfile is 400M here.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (8 preceding siblings ...)
  2011-04-25  9:49 ` [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
@ 2011-04-25 10:14 ` KAMEZAWA Hiroyuki
  2011-04-25 22:21   ` Ying Han
  2011-05-02  6:09 ` Balbir Singh
  10 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-25 10:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Mon, 25 Apr 2011 18:25:29 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:


> 2) == hard limit 500M/ hi_watermark = 400M ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx
> 
> real    0m6.421s
> user    0m0.059s
> sys     0m2.707s
> 

When doing this, we see usage changes as
(sec) (bytes)
   0: 401408        <== cp start
   1: 98603008
   2: 262705152
   3: 433491968     <== wmark reclaim triggerd.
   4: 486502400
   5: 507748352
   6: 524189696     <== cp ends (and hit limits)
   7: 501231616
   8: 499511296
   9: 477118464
  10: 417980416     <== usage goes below watermark.
  11: 417980416
 .....

If we had dirty_ratio, this result would be somewhat different
(and the flusher thread would start working sooner...).
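
A trace like this can be reproduced by just sampling usage once per second;
a minimal sketch (the group name A and the /cgroup/memory mount point are
assumptions):

== watch_usage.sh ==
#!/bin/sh
# sample memory.usage_in_bytes once per second, in the same format as above
for i in `seq 0 60`; do
        echo "$i: `cat /cgroup/memory/A/memory.usage_in_bytes`"
        sleep 1
done
==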


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/7] memcg fix scan ratio with small memcg.
  2011-04-25  9:34 ` [PATCH 4/7] memcg fix scan ratio with small memcg KAMEZAWA Hiroyuki
@ 2011-04-25 17:35   ` Ying Han
  2011-04-26  1:43     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-04-25 17:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Mon, Apr 25, 2011 at 2:34 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

>
> During memcg memory reclaim, get_scan_count() may return [0, 0, 0, 0]
> and no scan is issued at that reclaim priority.
>
> The reason is that the memory cgroup may not be big enough to have
> a number of pages greater than 1 << priority.
>
> Because priority affects many routines in vmscan.c, it's better
> to scan some memory even if usage >> priority is 0.
> From another point of view, if a memcg's zone doesn't have enough memory
> to meet the priority, it should be skipped. So, this patch creates a
> temporary priority in get_scan_count() and scans some amount of pages
> even when usage is small. By this, memcg's reclaim goes more smoothly
> without too high a priority, which would cause unnecessary
> congestion_wait(), etc.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    6 ++++++
>  mm/memcontrol.c            |    5 +++++
>  mm/vmscan.c                |   11 +++++++++++
>  3 files changed, 22 insertions(+)
>
> Index: memcg/include/linux/memcontrol.h
> ===================================================================
> --- memcg.orig/include/linux/memcontrol.h
> +++ memcg/include/linux/memcontrol.h
> @@ -152,6 +152,7 @@ unsigned long mem_cgroup_soft_limit_recl
>                                                gfp_t gfp_mask,
>                                                unsigned long
> *total_scanned);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> +u64 mem_cgroup_get_usage(struct mem_cgroup *mem);
>
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item
> idx);
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -357,6 +358,11 @@ u64 mem_cgroup_get_limit(struct mem_cgro
>        return 0;
>  }
>
> +static inline u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
> +{
> +       return 0;
> +}
> +
>

should be  mem_cgroup_get_usage()


 static inline void mem_cgroup_split_huge_fixup(struct page *head,
>                                                struct page *tail)
>  {
> Index: memcg/mm/memcontrol.c
> ===================================================================
> --- memcg.orig/mm/memcontrol.c
> +++ memcg/mm/memcontrol.c
> @@ -1483,6 +1483,11 @@ u64 mem_cgroup_get_limit(struct mem_cgro
>        return min(limit, memsw);
>  }
>
> +u64 mem_cgroup_get_usage(struct mem_cgroup *memcg)
> +{
> +       return res_counter_read_u64(&memcg->res, RES_USAGE);
> +}
> +
>  /*
>  * Visit the first child (need not be the first child as per the ordering
>  * of the cgroup list, since we track last_scanned_child) of @mem and use
> Index: memcg/mm/vmscan.c
> ===================================================================
> --- memcg.orig/mm/vmscan.c
> +++ memcg/mm/vmscan.c
> @@ -1762,6 +1762,17 @@ static void get_scan_count(struct zone *
>                        denominator = 1;
>                        goto out;
>                }
> +       } else {
> +               u64 usage;
> +               /*
> +                * When memcg is enough small, anon+file >> priority
> +                * can be 0 and we'll do no scan. Adjust it to proper
> +                * value against its usage. If this zone's usage is enough
> +                * small, scan will ignore this zone until priority goes
> down.
> +                */
> +               for (usage = mem_cgroup_get_usage(sc->mem_cgroup) >>
> PAGE_SHIFT;
> +                    priority && ((usage >> priority) < SWAP_CLUSTER_MAX);
> +                    priority--);
>        }
>
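
As a back-of-the-envelope check of what this loop does (the numbers below are
assumptions: 4K pages, DEF_PRIORITY=12, SWAP_CLUSTER_MAX=32):

 $ echo $(( (64 * 1024 * 1024 / 4096) >> 12 ))    # a 64M memcg at DEF_PRIORITY
 4
 $ echo $(( (64 * 1024 * 1024 / 4096) >> 9 ))     # where the loop stops
 32

So for such a group the temporary priority drops to 9, the first value where
usage >> priority reaches SWAP_CLUSTER_MAX.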

--Ying

>
>        /*
>
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-25 10:14 ` KAMEZAWA Hiroyuki
@ 2011-04-25 22:21   ` Ying Han
  2011-04-26  1:38     ` KAMEZAWA Hiroyuki
  2011-05-02  7:02     ` Balbir Singh
  0 siblings, 2 replies; 68+ messages in thread
From: Ying Han @ 2011-04-25 22:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

Kame:

Thank you for putting time into implementing the patch. I think it is
definitely a good idea to have the two alternatives on the table since
people have asked the question. Before going down that track, I have
thought about the two approaches and also discussed them with Greg and Hugh
(cc-ed); I would like to clarify some of the pros and cons of both
approaches. In general, I think the workqueue is not the right answer
for this purpose.

The thread-pool model
Cons:
1. There is no isolation between memcg background reclaims, since the
memcg threads are shared. That isolation covers all the resources
that per-memcg background reclaim needs to access, like cpu
time. One thing we are missing with the shared worker model is the
ability to schedule cpu time individually. We need the ability to isolate
and account resource consumption per memcg, including how much
cputime is used and where the per-memcg kswapd thread runs.

2. It is bad for visibility and debuggability. We have had a lot of
experience with kswapds running crazy, and we need a
straightforward way to identify which cgroup is causing the reclaim. Yes,
we can add more per-memcg stats to sort of give that visibility, but
I can tell they involve more overhead in the change. Why
introduce the overhead if a per-memcg kswapd thread can offer it
naturally?

3. Potential priority inversion for some memcgs. Let's say we have two
memcgs A and B on a single core machine, and A has a big chunk of work
and B has a small chunk of work. Now B's work is queued up after A. In
the workqueue model, we won't process B until we finish A's work,
since we only have one worker on the single core host. However, in the
per-memcg kswapd model, B gets a chance to run when A calls
cond_resched(). Well, we might not have exactly that problem if we
don't constrain the number of workers, and in the worst case we'll have the
same number of workers as the number of memcgs. If so, it would be the
same model as per-memcg kswapd.

4. The kswapd threads are created and destroyed dynamically. Are we
talking about allocating 8k of stack for kswapd while we are under
memory pressure? In the other model, all of that memory is preallocated.

5. The workqueue is scary and might introduce issues sooner or later.
Also, why do we think background reclaim fits the workqueue
model? To be more specific, how does it share the same logic as other
parts of the system that use workqueues?

Pros:
1. Saves SOME memory resource.

The per-memcg-per-kswapd model
Cons:
1. Memory overhead per thread: the memory consumption would be
8k*1000 = 8M with 1k cgroups. This is NOT a problem; at least we haven't
seen it in our production. We have cases where 2k kernel threads are
created, and we haven't noticed them causing resource consumption
problems or performance issues. On those systems, we
might have ~100 cgroups running at a time.

2. We see lots of threads in 'ps -elf'. Well, is that really a problem
that requires us to change the threading model?

Overall, the per-memcg-per-kswapd thread model is simple enough to
provide better isolation (predictability & debuggability). The number
of threads we might potentially have on the system is not a real
problem. We already have systems running that many threads (even
more) and we haven't seen problems from that. Also, I can imagine it will
make our life easier for some other extensions of the memcg work.

For now, I would like to stick with the simple model. At the same time I
am willing to look into changes and fixes once we have seen
problems later.

Comments?

Thanks

--Ying

On Mon, Apr 25, 2011 at 3:14 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 25 Apr 2011 18:25:29 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>
>> 2) == hard limit 500M/ hi_watermark = 400M ==
>> [root@rhel6-test hilow]# time cp ./tmpfile xxx
>>
>> real    0m6.421s
>> user    0m0.059s
>> sys     0m2.707s
>>
>
> When doing this, we see usage changes as
> (sec) (bytes)
>   0: 401408        <== cp start
>   1: 98603008
>   2: 262705152
>   3: 433491968     <== wmark reclaim triggerd.
>   4: 486502400
>   5: 507748352
>   6: 524189696     <== cp ends (and hit limits)
>   7: 501231616
>   8: 499511296
>   9: 477118464
>  10: 417980416     <== usage goes below watermark.
>  11: 417980416
>  .....
>
> If we have dirty_ratio, this result will be some different.
> (and flusher thread will work sooner...)
>
>
> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/7] memcg high watermark interface
  2011-04-25  9:29 ` [PATCH 2/7] memcg high watermark interface KAMEZAWA Hiroyuki
@ 2011-04-25 22:36   ` Ying Han
  0 siblings, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-04-25 22:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Mon, Apr 25, 2011 at 2:29 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Add memory.high_wmark_distance and reclaim_wmarks APIs per memcg.
> The first adjusts the internal low/high wmark calculation and
> reclaim_wmarks exports the current value of the watermarks.
> low_wmark is calculated automatically.
>
> $ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
> $ cat /dev/cgroup/A/memory.limit_in_bytes
> 524288000
>
> $ echo 50m >/dev/cgroup/A/memory.high_wmark_distance
>
> $ cat /dev/cgroup/A/memory.reclaim_wmarks
> low_wmark 476053504
> high_wmark 471859200
>
> Change v8a..v7
>   1. removed low_wmark_distance; it's now automatic.
>   2. added Documentation.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  Documentation/cgroups/memory.txt |   43 ++++++++++++++++++++++++++++
>  mm/memcontrol.c                  |   58 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 100 insertions(+), 1 deletion(-)
>
> Index: memcg/mm/memcontrol.c
> ===================================================================
> --- memcg.orig/mm/memcontrol.c
> +++ memcg/mm/memcontrol.c
> @@ -4074,6 +4074,40 @@ static int mem_cgroup_swappiness_write(s
>        return 0;
>  }
>
> +static u64 mem_cgroup_high_wmark_distance_read(struct cgroup *cgrp,
> +                                              struct cftype *cft)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +       return memcg->high_wmark_distance;
> +}
> +
> +static int mem_cgroup_high_wmark_distance_write(struct cgroup *cont,
> +                                               struct cftype *cft,
> +                                               const char *buffer)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> +       unsigned long long val;
> +       u64 limit;
> +       int ret;
> +
> +       if (!cont->parent)
> +               return -EINVAL;
> +
> +       ret = res_counter_memparse_write_strategy(buffer, &val);
> +       if (ret)
> +               return -EINVAL;
> +
> +       limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> +       if (val >= limit)
> +               return -EINVAL;
> +
> +       memcg->high_wmark_distance = val;
> +
> +       setup_per_memcg_wmarks(memcg);
> +       return 0;
> +}
> +
>  static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
>  {
>        struct mem_cgroup_threshold_ary *t;
> @@ -4365,6 +4399,21 @@ static void mem_cgroup_oom_unregister_ev
>        mutex_unlock(&memcg_oom_mutex);
>  }
>
> +static int mem_cgroup_wmark_read(struct cgroup *cgrp,
> +       struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +       u64 low_wmark, high_wmark;
> +
> +       low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
> +       high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> +
> +       cb->fill(cb, "low_wmark", low_wmark);
> +       cb->fill(cb, "high_wmark", high_wmark);
> +
> +       return 0;
> +}
> +
>  static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>        struct cftype *cft,  struct cgroup_map_cb *cb)
>  {
> @@ -4468,6 +4517,15 @@ static struct cftype mem_cgroup_files[]
>                .unregister_event = mem_cgroup_oom_unregister_event,
>                .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>        },
> +       {
> +               .name = "high_wmark_distance",
> +               .write_string = mem_cgroup_high_wmark_distance_write,
> +               .read_u64 = mem_cgroup_high_wmark_distance_read,
> +       },
> +       {
> +               .name = "reclaim_wmarks",
> +               .read_map = mem_cgroup_wmark_read,
> +       },
>  };
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> Index: memcg/Documentation/cgroups/memory.txt
> ===================================================================
> --- memcg.orig/Documentation/cgroups/memory.txt
> +++ memcg/Documentation/cgroups/memory.txt
> @@ -68,6 +68,8 @@ Brief summary of control files.
>                                 (See sysctl's vm.swappiness)
>  memory.move_charge_at_immigrate # set/show controls of moving charges
>  memory.oom_control             # set/show oom controls.
> + memory.high_wmark_distance     # set/show watermark control
> + memory.reclaim_wmarks          # show watermark details.
>
>  1. History
>
> @@ -501,6 +503,7 @@ NOTE2: When panic_on_oom is set to "2",
>        case of an OOM event in any cgroup.
>
>  7. Soft limits
> +(See Watermarks, too.)
>
>  Soft limits allow for greater sharing of memory. The idea behind soft limits
>  is to allow control groups to use as much of the memory as needed, provided
> @@ -649,7 +652,45 @@ At reading, current status of OOM is sho
>        under_oom        0 or 1 (if 1, the memory cgroup is under OOM, tasks may
>                                 be stopped.)
>
> -11. TODO
> +11. Watermarks
> +
> +A task gets a big overhead when it hits the memory limit because it needs to
> +scan memory and free pages itself. To avoid that, background memory freeing by
> +the kernel is helpful. The memory cgroup supports background memory freeing
> +via thresholds called watermarks. It can be used for fuzzy limiting of memory.
> +
> +For example, if you have a 1G limit and set
> +  - high_watermark ....980M
> +  - low_watermark  ....984M
> +memory freeing work by the kernel starts when usage goes over 984M and runs
> +until memory usage goes down to 980M. Of course, this consumes CPU. So, the
> +kernel controls this work to avoid hogging too much cpu.
> +
> +11.1 memory.high_wmark_distance
> +
> +This is an interface for high_wmark. You can specify the distance between
> +the memory limit and high_watermark here. For example, in a memory cgroup
> +with a 1G limit,
> +  # echo 20M > memory.high_wmark_distance
> +will set high_watermark to 980M. low_watermark is determined _automatically_
> +because a big distance between the high and low watermarks tends to use too
> +much CPU, and it's difficult for users to determine low_watermark.
> +
> +With this, memory usage will be reduced to 980M as time goes by.
> +After setting memory.high_wmark_distance to 20M, assume you update
> +memory.limit_in_bytes to 2G bytes. In this case, high_watermark is 1980M.
> +
> +As another example, assume you set memory.limit_in_bytes to 1G.
> +Then, set memory.high_wmark_distance to 300M. Now you limit memory
> +usage to about 700M in a moderate way, while still limiting it to 1G with
> +the hard limit.
> +
> +11.2 memory.reclaim_wmarks
> +
> +This interface shows high_watermark and low_watermark in bytes. It may be
> +useful for comparing usage with the watermarks.
> +
> +12. TODO
>
>  1. Add support for accounting huge pages (as a separate controller)
>  2. Make per-cgroup scanner reclaim not-shared pages first
>
Thank you, and this looks good to me; I can certainly apply that in
the next post.
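
Concretely, the second example in 11.1 maps onto a setup like this (a sketch;
the group name B and the /cgroup/memory mount point are assumptions):

 # mkdir /cgroup/memory/B
 # echo 1G > /cgroup/memory/B/memory.limit_in_bytes
 # echo 300M > /cgroup/memory/B/memory.high_wmark_distance
 # cat /cgroup/memory/B/memory.reclaim_wmarks    # high_wmark = limit - 300M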

--Ying


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-25 22:21   ` Ying Han
@ 2011-04-26  1:38     ` KAMEZAWA Hiroyuki
  2011-04-26  7:19       ` Ying Han
  2011-05-02  7:02     ` Balbir Singh
  1 sibling, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-26  1:38 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

On Mon, 25 Apr 2011 15:21:21 -0700
Ying Han <yinghan@google.com> wrote:

> Kame:
> 
> Thank you for putting time on implementing the patch. I think it is
> definitely a good idea to have the two alternatives on the table since
> people has asked the questions. Before going down to the track, i have
> thought about the two approaches and also discussed with Greg and Hugh
> (cc-ed),  i would like to clarify some of the pros and cons on both
> approaches.  In general, I think the workqueue is not the right answer
> for this purpose.
> 
> The thread-pool model
> Cons:
> 1. there is no isolation between memcg background reclaim, since the
> memcg threads are shared. That isolation including all the resources
> that the per-memcg background reclaim will need to access, like cpu
> time. One thing we are missing for the shared worker model is the
> individual cpu scheduling ability. We need the ability to isolate and
> count the resource assumption per memcg, and including how much
> cputime and where to run the per-memcg kswapd thread.
> 

IIUC, new threads for the workqueue will be created automatically when necessary.



> 2. it is hard for visibility and debugability. We have been
> experiencing a lot when some kswapds running creazy and we need a
> stright-forward way to identify which cgroup causing the reclaim. yes,
> we can add more stats per-memcg to sort of giving that visibility, but
> I can tell they are involved w/ more overhead of the change. Why
> introduce the over-head if the per-memcg kswapd thread can offer that
> maturely.
> 

I added counters and time consumption statistics with low overhead.


> 3. potential priority inversion for some memcgs. Let's say we have two
> memcgs A and B on a single core machine, and A has big chuck of work
> and B has small chuck of work. Now B's work is queued up after A. In
> the workqueue model, we won't process B unless we finish A's work
> since we only have one worker on the single core host. However, in the
> per-memcg kswapd model, B got chance to run when A calls
> cond_resched(). Well, we might not having the exact problem if we
> don't constrain the workers number, and the worst case we'll have the
> same number of workers as the number of memcgs. If so, it would be the
> same model as per-memcg kswapd.
> 

I implemented a static scan rate round-robin. I think you didn't read the patches.
And the fact that the per-memcg thread model switches only when it calls
cond_resched() means it will not be rescheduled until it consumes enough
vruntime. I guess the static scan rate round-robin wins when discussing fairness.

And IIUC, the workqueue invokes enough threads to do the service.

> 4. the kswapd threads are created and destroyed dynamically. are we
> talking about allocating 8k of stack for kswapd when we are under
> memory pressure? In the other case, all the memory are preallocated.
> 

I think the workqueue is there precisely to avoid creating kthreads dynamically.
We can save a lot of code.

> 5. the workqueue is scary and might introduce issues sooner or later.
> Also, why we think the background reclaim fits into the workqueue
> model, and be more specific, how that share the same logic of other
> parts of the system using workqueue.
> 

Ok, with using the workqueue:

  1. The number of threads can be changed dynamically according to the system
     workload without adding any code. The workqueue exists for this kind of
     background job. gcwq has hooks into the scheduler and it works well.
     With the per-memcg thread model, we'll never be able to do that.

  2. We can avoid having unnecessary threads.
     If a thread sleeps most of the time, why do we need to keep it? No, it's
     unnecessary. It should be on-demand. freezer() etc. need to stop all
     threads, and thousands of sleeping threads will be harmful.
     You can see how 'ps -elf' gets slow when the number of threads increases.


=== When we have a small number of threads ==
[root@rhel6-test hilow]# time ps -elf | wc -l
128

real    0m0.058s
user    0m0.010s
sys     0m0.051s
  
== When we have 2000 'sleeping' tasks. ==
[root@rhel6-test hilow]# time ps -elf | wc -l
2128

real    0m0.881s
user    0m0.055s
sys     0m0.972s

Awesome, it costs nearly 1 sec.
We should keep the number of threads as small as possible. Having threads has a cost.


  3. We need to refine the memcg reclaim code to make it consume less time.
     With the per-memcg-thread model, we'll use cut-and-paste code, pass the
     whole job to the scheduler, consume more time, and reclaim slowly.

     BTW, the static scan rate round-robin implemented in this patch is a fair routine.


On a 4-cpu KVM guest, I created 100M-limit/90M-hiwat cgroups 1,2,3,4,5, and ran
'cat' of a 400M file to /dev/null in a loop in each cgroup for 60 secs.
==
[kamezawa@rhel6-test ~]$ cat /cgroup/memory/[1-5]/memory.stat | grep elapse | grep -v total
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 792377873
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 811053756
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 799196613
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 806502820
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 790071307
==

No one dives into direct reclaim, and the time consumed by background reclaim is
fair across the same jobs.

==
[kamezawa@rhel6-test ~]$ cat /cgroup/memory/[1-5]/memory.stat | grep wmark_scanned | grep -v total
wmark_scanned 225881
wmark_scanned 225563
wmark_scanned 226848
wmark_scanned 225458
wmark_scanned 226137
==
Ah, yes, the scan rate is fair, even when we had 5 active 'cat's + 5 works.

BTW, without bgreclaim,
==
[kamezawa@rhel6-test ~]$ cat /cgroup/memory/[1-5]/memory.stat | grep direct_elapsed | grep -v total
direct_elapsed_ns 786049957
direct_elapsed_ns 782150545
direct_elapsed_ns 805222327
direct_elapsed_ns 782563391
direct_elapsed_ns 782431424
==

direct reclaim uses the same amount of time.

==
[kamezawa@rhel6-test ~]$ cat /cgroup/memory/[1-5]/memory.stat | grep direct_scan | grep -v total
direct_scanned 224501
direct_scanned 224448
direct_scanned 224448
direct_scanned 224448
direct_scanned 224448
==

CFS seems to work fairly ;) (Note: there is a 10M difference between bgreclaim/direct.)

With 10 groups: 10 threads + 10 works.
==
[kamezawa@rhel6-test hilow]$ cat /cgroup/memory/[0-9]/memory.stat | grep elapsed_ns | grep -v total | grep -v soft
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 81856013
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 350538700
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 340384072
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 344776087
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 322237832
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 337741658
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 261018174
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 316675784
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 257009865
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 154339039

==
No one dives into direct reclaim. (But 'cat' itself is slow...because of read()?)
From bgreclaim's point of view, this is fair because no direct reclaim happens.
Maybe I need to use the blkio cgroup for more tests of this kind.

I attach the test scripts below.

  4. We can see how round-robin works and what we need to modify.
     Maybe good for future work, and we'll have a good chance to reuse code.

  5. With per-memcg threads, in a bad case we'll see thousands of threads trying
     to reclaim memory at once. That's never good.
     In this patch, I left max_active of the workqueue at its default. If we
     need to fix/tune it, we just adjust max_active.

  6. If it turns out to be better to have a thread pool for memcg,
     we can switch to a thread-pool model seamlessly. But the delayed_work
     implementation will be difficult ;) And managing the number of active
     works will be difficult. I bet we'll never use a thread pool.

  7. We'll never see cpu cache misses caused by frequent thread-stack switches.


> Pros:
> 1. Saves SOME memory resource.
> 
and CPU resources. The per-memcg-thread model tends to use more cpu time than the
workqueue, which is required to be designed as a short-term round-robin.


> The per-memcg-per-kswapd model
> Cons:
> 1. memory overhead per thread, and The memory consumption would be
> 8k*1000 = 8M with 1k cgroup. This is NOT a problem as least we haven't
> seen it in our production. We have cases that 2k of kernel threads
> being created, and we haven't noticed it is causing resource
> consumption problem as well as performance issue. On those systems, we
> might have ~100 cgroup running at a time.
> 
> 2. we see lots of threads at 'ps -elf'. well, is that really a problem
> that we need to change the threading model?
> 
> Overall, the per-memcg-per-kswapd thread model is simple enough to
> provide better isolation (predictability & debug ability). The number
> of threads we might potentially have on the system is not a real
> problem. We already have systems running that much of threads (even
> more) and we haven't seen problem of that. Also, i can imagine it will
> make our life easier for some other extensions on memcg works.
> 
> For now, I would like to stick on the simple model. At the same time I
> am willing to looking into changes and fixes whence we have seen
> problems later.
> 
> Comments?
> 


2-3 years ago, I implemented a per-memcg-thread model and got a NACK which
said "you should use workqueue" ;) Now, the workqueue has been reworked and seems
easier to use for cpu-intensive workloads. If I need more tweaks to the workqueue,
I'll add patches for the workqueue. But I don't see that need now.

And using the per-memcg thread model tends to lead us to brain-dead code,
cut-and-pasted from kswapd, which never fits memcg. Later, at removing the global
LRU, we'll need some kind of round-robin again, and checking how round-robin works
and what good round-robin code looks like is an interesting study. For example,
I noticed I need patch 4 soon.


I'd like to use the workqueue and refine the whole routine to fit a short-term
round-robin. Having sleeping threads has a cost; round-robin can work in a fair way.


Thanks,
-Kame

== test.sh ==
#!/bin/bash -x

for i in `seq 0 9`; do
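        # 100M hard limit with a 10M high_wmark_distance -> 90M high watermark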
        mkdir /cgroup/memory/$i
        echo 100M > /cgroup/memory/$i/memory.limit_in_bytes
        echo 10M > /cgroup/memory/$i/memory.high_wmark_distance
done

for i in `seq 0 9`; do
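        # run one file-reading loop per group, attached to its memcg via cgexec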
        cgexec -g memory:$i ./loop.sh ./tmpfile$i &
done

sleep 60;

pkill loop.sh

== loop.sh ==
#!/bin/sh

while true; do
        cat $1 > /dev/null
done
==




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/7] memcg fix scan ratio with small memcg.
  2011-04-25 17:35   ` Ying Han
@ 2011-04-26  1:43     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-26  1:43 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Mon, 25 Apr 2011 10:35:39 -0700
Ying Han <yinghan@google.com> wrote:

> On Mon, Apr 25, 2011 at 2:34 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> >
> > During memcg memory reclaim, get_scan_count() may return [0, 0, 0, 0]
> > and no scan is issued at that reclaim priority.
> >
> > The reason is that the memory cgroup may not be big enough to have
> > a number of pages greater than 1 << priority.
> >
> > Because priority affects many routines in vmscan.c, it's better
> > to scan some memory even if usage >> priority is 0.
> > From another point of view, if a memcg's zone doesn't have enough memory
> > to meet the priority, it should be skipped. So, this patch creates a
> > temporary priority in get_scan_count() and scans some amount of pages
> > even when usage is small. By this, memcg's reclaim goes more smoothly
> > without too high a priority, which would cause unnecessary
> > congestion_wait(), etc.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |    6 ++++++
> >  mm/memcontrol.c            |    5 +++++
> >  mm/vmscan.c                |   11 +++++++++++
> >  3 files changed, 22 insertions(+)
> >
> > Index: memcg/include/linux/memcontrol.h
> > ===================================================================
> > --- memcg.orig/include/linux/memcontrol.h
> > +++ memcg/include/linux/memcontrol.h
> > @@ -152,6 +152,7 @@ unsigned long mem_cgroup_soft_limit_recl
> >                                                gfp_t gfp_mask,
> >                                                unsigned long
> > *total_scanned);
> >  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> > +u64 mem_cgroup_get_usage(struct mem_cgroup *mem);
> >
> >  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item
> > idx);
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > @@ -357,6 +358,11 @@ u64 mem_cgroup_get_limit(struct mem_cgro
> >        return 0;
> >  }
> >
> > +static inline u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
> > +{
> > +       return 0;
> > +}
> > +
> >
> 
> should be  mem_cgroup_get_usage()
> 

Ah, yes. thanks.

-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/7] memcg bgreclaim core.
  2011-04-25  9:36 ` [PATCH 5/7] memcg bgreclaim core KAMEZAWA Hiroyuki
@ 2011-04-26  4:59   ` Ying Han
  2011-04-26  5:08     ` KAMEZAWA Hiroyuki
  2011-04-26 18:37   ` Ying Han
  1 sibling, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-04-26  4:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Mon, Apr 25, 2011 at 2:36 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> The following patch will change the logic. This is the core.
> ==
> This is the main loop of per-memcg background reclaim, which is implemented in
> the function balance_mem_cgroup_pgdat().
>
> The function performs a priority loop similar to global reclaim. During each
> iteration it frees memory from a selected victim node.
> After reclaiming or scanning enough pages, it returns and the next work is
> found via round-robin.
>
> changelog v8b..v7
> 1. reworked to use a workqueue rather than threads.
> 2. changed the shrink_mem_cgroup algorithm to fit the workqueue. In short, avoid
>   long-running work, allow quick round-robin and avoid unnecessary writepage.
>   When a thread makes pages dirty continuously, writing them back via the
>   flusher is far faster than writeback by background reclaim. This detail will
>   be fixed when dirty_ratio is implemented. The logic around this will be
>   revisited in a following patch.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |   11 ++++
>  mm/memcontrol.c            |   44 ++++++++++++++---
>  mm/vmscan.c                |  115 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 162 insertions(+), 8 deletions(-)
>
> Index: memcg/include/linux/memcontrol.h
> ===================================================================
> --- memcg.orig/include/linux/memcontrol.h
> +++ memcg/include/linux/memcontrol.h
> @@ -89,6 +89,8 @@ extern int mem_cgroup_last_scanned_node(
>  extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
>                                        const nodemask_t *nodes);
>
> +unsigned long shrink_mem_cgroup(struct mem_cgroup *mem);
> +
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
>  {
> @@ -112,6 +114,9 @@ extern void mem_cgroup_end_migration(str
>  */
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +                               int nid, int zone_idx);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                       struct zone *zone,
>                                       enum lru_list lru);
> @@ -310,6 +315,12 @@ mem_cgroup_inactive_file_is_low(struct m
>  }
>
>  static inline unsigned long
> +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zone_idx)
> +{
> +       return 0;
> +}
> +
> +static inline unsigned long
>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>                         enum lru_list lru)
>  {
> Index: memcg/mm/memcontrol.c
> ===================================================================
> --- memcg.orig/mm/memcontrol.c
> +++ memcg/mm/memcontrol.c
> @@ -1166,6 +1166,23 @@ int mem_cgroup_inactive_file_is_low(stru
>        return (active > inactive);
>  }
>
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +                                               int nid, int zone_idx)
> +{
> +       int nr;
> +       struct mem_cgroup_per_zone *mz =
> +               mem_cgroup_zoneinfo(memcg, nid, zone_idx);
> +
> +       nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> +            MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> +
> +       if (nr_swap_pages > 0)
> +               nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> +                     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> +
> +       return nr;
> +}
> +
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                       struct zone *zone,
>                                       enum lru_list lru)
> @@ -1286,7 +1303,7 @@ static unsigned long mem_cgroup_margin(s
>        return margin >> PAGE_SHIFT;
>  }
>
> -static unsigned int get_swappiness(struct mem_cgroup *memcg)
> +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
>        struct cgroup *cgrp = memcg->css.cgroup;
>
> @@ -1595,14 +1612,15 @@ static int mem_cgroup_hierarchical_recla
>                /* we use swappiness of local cgroup */
>                if (check_soft) {
>                        ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> -                               noswap, get_swappiness(victim), zone,
> +                               noswap, mem_cgroup_swappiness(victim), zone,
>                                &nr_scanned);
>                        *total_scanned += nr_scanned;
>                        mem_cgroup_soft_steal(victim, ret);
>                        mem_cgroup_soft_scan(victim, nr_scanned);
>                } else
>                        ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> -                                               noswap, get_swappiness(victim));
> +                                               noswap,
> +                                               mem_cgroup_swappiness(victim));
>                css_put(&victim->css);
>                /*
>                 * At shrinking usage, we can't check we should stop here or
> @@ -1628,15 +1646,25 @@ static int mem_cgroup_hierarchical_recla
>  int
>  mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
>  {
> -       int next_nid;
> +       int next_nid, i;
>        int last_scanned;
>
>        last_scanned = mem->last_scanned_node;
> -       next_nid = next_node(last_scanned, *nodes);
> +       next_nid = last_scanned;
> +rescan:
> +       next_nid = next_node(next_nid, *nodes);
>
>        if (next_nid == MAX_NUMNODES)
>                next_nid = first_node(*nodes);
>
> +       /* If no page on this node, skip */
> +       for (i = 0; i < MAX_NR_ZONES; i++)
> +               if (mem_cgroup_zone_reclaimable_pages(mem, next_nid, i))
> +                       break;
> +
> +       if (next_nid != last_scanned && (i == MAX_NR_ZONES))
> +               goto rescan;
> +
>        mem->last_scanned_node = next_nid;
>
>        return next_nid;
> @@ -3649,7 +3677,7 @@ try_to_free:
>                        goto out;
>                }
>                progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> -                                               false, get_swappiness(mem));
> +                                       false, mem_cgroup_swappiness(mem));
>                if (!progress) {
>                        nr_retries--;
>                        /* maybe some writeback is necessary */
> @@ -4073,7 +4101,7 @@ static u64 mem_cgroup_swappiness_read(st
>  {
>        struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>
> -       return get_swappiness(memcg);
> +       return mem_cgroup_swappiness(memcg);
>  }
>
>  static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> @@ -4849,7 +4877,7 @@ mem_cgroup_create(struct cgroup_subsys *
>        INIT_LIST_HEAD(&mem->oom_notify);
>
>        if (parent)
> -               mem->swappiness = get_swappiness(parent);
> +               mem->swappiness = mem_cgroup_swappiness(parent);
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> Index: memcg/mm/vmscan.c
> ===================================================================
> --- memcg.orig/mm/vmscan.c
> +++ memcg/mm/vmscan.c
> @@ -42,6 +42,7 @@
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
>  #include <linux/oom.h>
> +#include <linux/res_counter.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -2308,6 +2309,120 @@ static bool sleeping_prematurely(pg_data
>                return !all_zones_ok;
>  }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scans all the zones of the
> + * node and accumulates nr_scanned and nr_reclaimed into the scan_control.
> + */
> +/*
> + * Limit of scanning per iteration. For round-robin.
> + */
> +#define MEMCG_BGSCAN_LIMIT     (2048)
> +
> +static void
> +shrink_memcg_node(int nid, int priority, struct scan_control *sc)
> +{
> +       unsigned long total_scanned = 0;
> +       struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +       int i;
> +
> +       /*
> +        * This dma->highmem order is consistent with global reclaim.
> +        * We do this because the page allocator works in the opposite
> +        * direction although memcg user pages are mostly allocated at
> +        * highmem.
> +        */
> +       for (i = 0;
> +            (i < NODE_DATA(nid)->nr_zones) &&
> +            (total_scanned < MEMCG_BGSCAN_LIMIT);
> +            i++) {
> +               struct zone *zone = NODE_DATA(nid)->node_zones + i;
> +               struct zone_reclaim_stat *zrs;
> +               unsigned long scan, rotate;
> +
> +               if (!populated_zone(zone))
> +                       continue;
> +               scan = mem_cgroup_zone_reclaimable_pages(mem_cont, nid, i);
> +               if (!scan)
> +                       continue;
> +               /* If recent reclaim on this zone hasn't made good progress, allow writepage */
> +               zrs = get_reclaim_stat(zone, sc);
> +               scan = zrs->recent_scanned[0] + zrs->recent_scanned[1];
> +               rotate = zrs->recent_rotated[0] + zrs->recent_rotated[1];
> +
> +               if (rotate > scan/2)
> +                       sc->may_writepage = 1;
> +
> +               sc->nr_scanned = 0;
> +               shrink_zone(priority, zone, sc);
> +               total_scanned += sc->nr_scanned;
> +               sc->may_writepage = 0;
> +       }
> +       sc->nr_scanned = total_scanned;
> +}

I see that MEMCG_BGSCAN_LIMIT is a newly defined macro since the previous
post. So now the number of pages to scan is capped at 2k (2048 pages) for each
memcg; does that make a difference between big and small cgroups?

--Ying

> +/*
> + * Per cgroup background reclaim.
> + */
> +unsigned long shrink_mem_cgroup(struct mem_cgroup *mem)
> +{
> +       int nid, priority, next_prio;
> +       nodemask_t nodes;
> +       unsigned long total_scanned;
> +       struct scan_control sc = {
> +               .gfp_mask = GFP_HIGHUSER_MOVABLE,
> +               .may_unmap = 1,
> +               .may_swap = 1,
> +               .nr_to_reclaim = SWAP_CLUSTER_MAX,
> +               .order = 0,
> +               .mem_cgroup = mem,
> +       };
> +
> +       sc.may_writepage = 0;
> +       sc.nr_reclaimed = 0;
> +       total_scanned = 0;
> +       nodes = node_states[N_HIGH_MEMORY];
> +       sc.swappiness = mem_cgroup_swappiness(mem);
> +
> +       current->flags |= PF_SWAPWRITE;
> +       /*
> +        * Unlike kswapd, we need to traverse cgroups one by one. So, we don't
> +        * use the full priority loop. Just scan a small number of pages and visit
> +        * the next cgroup. We scan at most MEMCG_BGSCAN_LIMIT pages per invocation
> +        * and emulate the priority drop as the scan count grows.
> +        */
> +       next_prio = min(SWAP_CLUSTER_MAX * num_node_state(N_HIGH_MEMORY),
> +                       MEMCG_BGSCAN_LIMIT/8);
> +       priority = DEF_PRIORITY;
> +       while ((total_scanned < MEMCG_BGSCAN_LIMIT) &&
> +              !nodes_empty(nodes) &&
> +              (sc.nr_to_reclaim > sc.nr_reclaimed)) {
> +
> +               nid = mem_cgroup_select_victim_node(mem, &nodes);
> +               shrink_memcg_node(nid, priority, &sc);
> +               /*
> +                * the node seems to have no pages.
> +                * skip this for a while
> +                */
> +               if (!sc.nr_scanned)
> +                       node_clear(nid, nodes);
> +               total_scanned += sc.nr_scanned;
> +               if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
> +                       break;
> +               /* emulate priority */
> +               if (total_scanned > next_prio) {
> +                       priority--;
> +                       next_prio <<= 1;
> +               }
> +               if (sc.nr_scanned &&
> +                   total_scanned > sc.nr_reclaimed * 2)
> +                       congestion_wait(WRITE, HZ/10);
> +       }
> +       current->flags &= ~PF_SWAPWRITE;
> +       return sc.nr_reclaimed;
> +}
> +#endif
> +
>  /*
>  * For kswapd, balance_pgdat() will work across all this node's zones until
>  * they are all at high_wmark_pages(zone).
>
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/7] memcg bgreclaim core.
  2011-04-26  4:59   ` Ying Han
@ 2011-04-26  5:08     ` KAMEZAWA Hiroyuki
  2011-04-26 23:15       ` Ying Han
  0 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-26  5:08 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Mon, 25 Apr 2011 21:59:06 -0700
Ying Han <yinghan@google.com> wrote:

> On Mon, Apr 25, 2011 at 2:36 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > The following patch will change the logic. This is the core.
> > ==
> > This is the main loop of per-memcg background reclaim which is implemented in
> > function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During each
> > iteration it frees memory from a selected victim node.
> > After reclaiming enough pages or scanning enough pages, it returns and finds
> > the next work in round-robin order.
> >
> > changelog v8b..v7
> > 1. reworked to use a workqueue rather than threads.
> > 2. changed the shrink_mem_cgroup algorithm to fit the workqueue. In short, avoid
> >    long-running work and unnecessary writepage, and allow quick round-robin.
> >    When a thread makes pages dirty continuously, writing them back via the flusher
> >    is far faster than writeback by background reclaim. This detail will be
> >    fixed when dirty_ratio is implemented. The logic around this will be
> >    revisited in a following patch.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |   11 ++++
> >  mm/memcontrol.c            |   44 ++++++++++++++---
> >  mm/vmscan.c                |  115 +++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 162 insertions(+), 8 deletions(-)
> >
> > [patch hunks snipped]
> 
> I see that MEMCG_BGSCAN_LIMIT is a newly defined macro since the previous
> post. So now the number of pages to scan is capped at 2k (2048 pages) for each
> memcg; does that make a difference between big and small cgroups?
> 

Now, no difference. One reason is that low_watermark - high_watermark is
limited to 4MB at most. It will be a static 4MB in many cases, and 2048 pages
corresponds to scanning 8MB, twice the low_wmark - high_wmark distance. Another
reason is that I haven't had enough time to consider tuning this.
With MEMCG_BGSCAN_LIMIT, the round-robin can be kept simply fair, and I think
it's a good starting point.
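
(As a rough check, assuming 4KB pages:
    MEMCG_BGSCAN_LIMIT = 2048 pages = 2048 * 4KB = 8MB
                       = 2 * 4MB, i.e. twice the maximum low_wmark - high_wmark distance.)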

If the memory eater is slow enough (because its threads need to do some
work on the allocated memory), this shrink_mem_cgroup() works fine and
helps to avoid hitting the limit. Here, the amount of dirty pages is the
troublesome part.

The penalty for a cpu-eating (hard-to-reclaim) cgroup is given by 'delay'
(see patch 7). This patch's congestion_wait() is too crude and will be replaced
by 'delay' in patch 7. In short, if a memcg's scanning doesn't seem successful,
it gets an HZ/10 delay before its next work item runs.
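
A minimal sketch of that 'delay' idea (not the actual patch 7 code; memcg_bgreclaim_wq is a
placeholder name, and the "not successful" test just reuses the congestion_wait condition
from the hunk quoted above):
==
	/* sketch: back off instead of calling congestion_wait() */
	unsigned long delay = 0;

	if (sc.nr_scanned && total_scanned > sc.nr_reclaimed * 2)
		delay = HZ / 10;	/* little progress: penalize this memcg */

	/* re-queue this memcg's background work after the penalty */
	queue_delayed_work(memcg_bgreclaim_wq, &mem->bgreclaim_work, delay);
==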

If we get dirty_ratio plus IO-less dirty throttling, I think we'll see much
better fairness in this watermark-reclaim round robin.


Thanks,
-Kame




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 8/7] memcg : reclaim statistics
  2011-04-25  9:43 ` [PATCH 8/7] memcg : reclaim statistics KAMEZAWA Hiroyuki
@ 2011-04-26  5:35   ` Ying Han
  0 siblings, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-04-26  5:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko


On Mon, Apr 25, 2011 at 2:43 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> When tuning memcg background reclaim, the cpu usage of each memcg's work is
> interesting information because some amount of a shared resource is used
> (i.e. background reclaim uses a workqueue). Other information such as
> pgscan and pgreclaim counts is important, too.
>
> This patch shows them via memory.stat: cpu usage for direct reclaim and
> soft limit reclaim, plus page scan statistics.
>
>
>  # cat /cgroup/memory/A/memory.stat
>  ....
>  direct_elapsed_ns 0
>  soft_elapsed_ns 0
>  wmark_elapsed_ns 103566424
>  direct_scanned 0
>  soft_scanned 0
>  wmark_scanned 29303
>  direct_freed 0
>  soft_freed 0
>  wmark_freed 29290
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  Documentation/cgroups/memory.txt |   18 +++++++++
>  include/linux/memcontrol.h       |    6 +++
>  include/linux/swap.h             |    7 +++
>  mm/memcontrol.c                  |   77
> +++++++++++++++++++++++++++++++++++++--
>  mm/vmscan.c                      |   15 +++++++
>  5 files changed, 120 insertions(+), 3 deletions(-)
>
> Index: memcg/mm/memcontrol.c
> ===================================================================
> --- memcg.orig/mm/memcontrol.c
> +++ memcg/mm/memcontrol.c
> @@ -274,6 +274,17 @@ struct mem_cgroup {
>        bool                    bgreclaim_resched;
>        struct delayed_work     bgreclaim_work;
>        /*
> +        * reclaim statistics (not per zone, node)
> +        */
> +       spinlock_t              elapsed_lock;
> +       u64                     bgreclaim_elapsed;
> +       u64                     direct_elapsed;
> +       u64                     soft_elapsed;
> +
> +       u64                     reclaim_scan[NR_RECLAIM_CONTEXTS];
> +       u64                     reclaim_freed[NR_RECLAIM_CONTEXTS];
> +
> +       /*
>         * Should we move charges of a task when a task is moved into this
>         * mem_cgroup ? And what type of charges should we move ?
>         */
> @@ -1346,6 +1357,18 @@ void mem_cgroup_clear_unreclaimable(stru
>        return;
>  }
>
> +void mem_cgroup_reclaim_statistics(struct mem_cgroup *mem,
> +               int context, unsigned long scanned,
> +               unsigned long freed)
> +{
> +       if (!mem)
> +               return;
> +       spin_lock(&mem->elapsed_lock);
> +       mem->reclaim_scan[context] += scanned;
> +       mem->reclaim_freed[context] += freed;
> +       spin_unlock(&mem->elapsed_lock);
> +}
> +
>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>                                        struct list_head *dst,
>                                        unsigned long *scanned, int order,
> @@ -1692,6 +1715,7 @@ static int mem_cgroup_hierarchical_recla
>        bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
>        unsigned long excess;
>        unsigned long nr_scanned;
> +       s64 start, end;
>
>        excess = res_counter_soft_limit_excess(&root_mem->res) >>
> PAGE_SHIFT;
>
> @@ -1735,16 +1759,27 @@ static int mem_cgroup_hierarchical_recla
>                }
>                /* we use swappiness of local cgroup */
>                if (check_soft) {
> +                       start = sched_clock();
>                        ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
>                                noswap, mem_cgroup_swappiness(victim), zone,
>                                &nr_scanned);
>                        *total_scanned += nr_scanned;
> +                       end = sched_clock();
> +                       spin_lock(&victim->elapsed_lock);
> +                       victim->soft_elapsed += end - start;
> +                       spin_unlock(&victim->elapsed_lock);
>                        mem_cgroup_soft_steal(victim, ret);
>                        mem_cgroup_soft_scan(victim, nr_scanned);
> -               } else
> +               } else {
> +                       start = sched_clock();
>                        ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
>                                                noswap,
>
>  mem_cgroup_swappiness(victim));
> +                       end = sched_clock();
> +                       spin_lock(&victim->elapsed_lock);
> +                       victim->direct_elapsed += end - start;
> +                       spin_unlock(&victim->elapsed_lock);
> +               }
>                css_put(&victim->css);
>                /*
>                 * At shrinking usage, we can't check we should stop here or
> @@ -3702,15 +3737,22 @@ static void memcg_bgreclaim(struct work_
>        struct delayed_work *dw = to_delayed_work(work);
>        struct mem_cgroup *mem =
>                container_of(dw, struct mem_cgroup, bgreclaim_work);
> -       int delay = 0;
> +       int delay;
>        unsigned long long required, usage, hiwat;
>
> +       delay = 0;
>        hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
>        usage = res_counter_read_u64(&mem->res, RES_USAGE);
>        required = usage - hiwat;
>        if (required >= 0)  {
> +               u64 start, end;
>                required = ((usage - hiwat) >> PAGE_SHIFT) + 1;
> +               start = sched_clock();
>                delay = shrink_mem_cgroup(mem, (long)required);
> +               end = sched_clock();
> +               spin_lock(&mem->elapsed_lock);
> +               mem->bgreclaim_elapsed += end - start;
> +               spin_unlock(&mem->elapsed_lock);
>        }
>        if (!mem->bgreclaim_resched  ||
>                mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
> @@ -4152,6 +4194,15 @@ enum {
>        MCS_INACTIVE_FILE,
>        MCS_ACTIVE_FILE,
>        MCS_UNEVICTABLE,
> +       MCS_DIRECT_ELAPSED,
> +       MCS_SOFT_ELAPSED,
> +       MCS_WMARK_ELAPSED,
> +       MCS_DIRECT_SCANNED,
> +       MCS_SOFT_SCANNED,
> +       MCS_WMARK_SCANNED,
> +       MCS_DIRECT_FREED,
> +       MCS_SOFT_FREED,
> +       MCS_WMARK_FREED,
>        NR_MCS_STAT,
>  };
>
> @@ -4177,7 +4228,16 @@ struct {
>        {"active_anon", "total_active_anon"},
>        {"inactive_file", "total_inactive_file"},
>        {"active_file", "total_active_file"},
> -       {"unevictable", "total_unevictable"}
> +       {"unevictable", "total_unevictable"},
> +       {"direct_elapsed_ns", "total_direct_elapsed_ns"},
> +       {"soft_elapsed_ns", "total_soft_elapsed_ns"},
> +       {"wmark_elapsed_ns", "total_wmark_elapsed_ns"},
> +       {"direct_scanned", "total_direct_scanned"},
> +       {"soft_scanned", "total_soft_scanned"},
> +       {"wmark_scanned", "total_wmark_scanned"},
> +       {"direct_freed", "total_direct_freed"},
> +       {"soft_freed", "total_soft_freed"},
> +       {"wmark_freed", "total_wamrk_freed"}
>  };
>
>
> @@ -4185,6 +4245,7 @@ static void
>  mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat
> *s)
>  {
>        s64 val;
> +       int i;
>
>        /* per cpu stat */
>        val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> @@ -4221,6 +4282,15 @@ mem_cgroup_get_local_stat(struct mem_cgr
>        s->stat[MCS_ACTIVE_FILE] += val * PAGE_SIZE;
>        val = mem_cgroup_get_local_zonestat(mem, LRU_UNEVICTABLE);
>        s->stat[MCS_UNEVICTABLE] += val * PAGE_SIZE;
> +
> +       /* reclaim stats */
> +       s->stat[MCS_DIRECT_ELAPSED] += mem->direct_elapsed;
> +       s->stat[MCS_SOFT_ELAPSED] += mem->soft_elapsed;
> +       s->stat[MCS_WMARK_ELAPSED] += mem->bgreclaim_elapsed;
> +       for (i = 0; i < NR_RECLAIM_CONTEXTS; i++) {
> +               s->stat[i + MCS_DIRECT_SCANNED] += mem->reclaim_scan[i];
> +               s->stat[i + MCS_DIRECT_FREED] += mem->reclaim_freed[i];
> +       }
>  }
>
>  static void
> @@ -4889,6 +4959,7 @@ static struct mem_cgroup *mem_cgroup_all
>                goto out_free;
>        spin_lock_init(&mem->pcp_counter_lock);
>        INIT_DELAYED_WORK(&mem->bgreclaim_work, memcg_bgreclaim);
> +       spin_lock_init(&mem->elapsed_lock);
>        mem->bgreclaim_resched = true;
>        return mem;
>
> Index: memcg/include/linux/memcontrol.h
> ===================================================================
> --- memcg.orig/include/linux/memcontrol.h
> +++ memcg/include/linux/memcontrol.h
> @@ -90,6 +90,8 @@ extern int mem_cgroup_select_victim_node
>                                        const nodemask_t *nodes);
>
>  int shrink_mem_cgroup(struct mem_cgroup *mem, long required);
> +void mem_cgroup_reclaim_statistics(struct mem_cgroup *mem, int context,
> +                       unsigned long scanned, unsigned long freed);
>
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> @@ -423,6 +425,10 @@ static inline
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item
> idx)
>  {
>  }
> +void mem_cgroup_reclaim_statistics(struct mem_cgroup *mem, int context,
> +                               unsigned long scanned, unsigned long freed)
> +{
> +}
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
> Index: memcg/include/linux/swap.h
> ===================================================================
> --- memcg.orig/include/linux/swap.h
> +++ memcg/include/linux/swap.h
> @@ -250,6 +250,13 @@ static inline void lru_cache_add_file(st
>  #define ISOLATE_ACTIVE 1       /* Isolate active pages. */
>  #define ISOLATE_BOTH 2         /* Isolate both active and inactive pages.
> */
>
> +/* context for memory reclaim (comes from the memory cgroup). */
> +enum {
> +       RECLAIM_DIRECT,         /* under direct reclaim */
> +       RECLAIM_KSWAPD,         /* under global kswapd's soft limit */
> +       RECLAIM_WMARK,          /* under background reclaim by watermark */
> +       NR_RECLAIM_CONTEXTS
> +};
>  /* linux/mm/vmscan.c */
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int
> order,
>                                        gfp_t gfp_mask, nodemask_t *mask);
> Index: memcg/mm/vmscan.c
> ===================================================================
> --- memcg.orig/mm/vmscan.c
> +++ memcg/mm/vmscan.c
> @@ -72,6 +72,9 @@ typedef unsigned __bitwise__ reclaim_mod
>  #define RECLAIM_MODE_LUMPYRECLAIM      ((__force reclaim_mode_t)0x08u)
>  #define RECLAIM_MODE_COMPACTION                ((__force
> reclaim_mode_t)0x10u)
>
> +/* 3 reclaim contexts for memcg statistics. */
> +enum {DIRECT_RECLAIM, KSWAPD_RECLAIM, WMARK_RECLAIM};
> +
>  struct scan_control {
>        /* Incremented by the number of inactive pages that were scanned */
>        unsigned long nr_scanned;
> @@ -107,6 +110,7 @@ struct scan_control {
>
>        /* Which cgroup do we reclaim from */
>        struct mem_cgroup *mem_cgroup;
> +       int     reclaim_context;
>
>        /*
>         * Nodemask of nodes allowed by the caller. If NULL, all nodes
> @@ -2116,6 +2120,10 @@ out:
>        delayacct_freepages_end();
>        put_mems_allowed();
>
> +       if (!scanning_global_lru(sc))
> +               mem_cgroup_reclaim_statistics(sc->mem_cgroup,
> +                       sc->reclaim_context, total_scanned,
> sc->nr_reclaimed);
> +
>        if (sc->nr_reclaimed)
>                return sc->nr_reclaimed;
>
> @@ -2178,6 +2186,7 @@ unsigned long mem_cgroup_shrink_node_zon
>                .swappiness = swappiness,
>                .order = 0,
>                .mem_cgroup = mem,
> +               .reclaim_context = RECLAIM_KSWAPD,
>        };
>
>        sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> @@ -2198,6 +2207,8 @@ unsigned long mem_cgroup_shrink_node_zon
>
>        trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> +       mem_cgroup_reclaim_statistics(sc.mem_cgroup,
> +                       sc.reclaim_context, sc.nr_scanned,
> sc.nr_reclaimed);
>        *nr_scanned = sc.nr_scanned;
>        return sc.nr_reclaimed;
>  }
> @@ -2217,6 +2228,7 @@ unsigned long try_to_free_mem_cgroup_pag
>                .swappiness = swappiness,
>                .order = 0,
>                .mem_cgroup = mem_cont,
> +               .reclaim_context = RECLAIM_DIRECT,
>                .nodemask = NULL, /* we don't care the placement */
>        };
>
> @@ -2384,6 +2396,7 @@ int shrink_mem_cgroup(struct mem_cgroup
>                .may_swap = 1,
>                .order = 0,
>                .mem_cgroup = mem,
> +               .reclaim_context = RECLAIM_WMARK,
>        };
>        /* writepage will be set later per zone */
>        sc.may_writepage = 0;
> @@ -2434,6 +2447,8 @@ int shrink_mem_cgroup(struct mem_cgroup
>        if (sc.nr_reclaimed > sc.nr_to_reclaim/2)
>                delay = 0;
>  out:
> +       mem_cgroup_reclaim_statistics(sc.mem_cgroup, sc.reclaim_context,
> +                       total_scanned, sc.nr_reclaimed);
>        current->flags &= ~PF_SWAPWRITE;
>        return delay;
>  }
> Index: memcg/Documentation/cgroups/memory.txt
> ===================================================================
> --- memcg.orig/Documentation/cgroups/memory.txt
> +++ memcg/Documentation/cgroups/memory.txt
> @@ -398,6 +398,15 @@ active_anon        - # of bytes of anonymous an
>  inactive_file  - # of bytes of file-backed memory on inactive LRU list.
>  active_file    - # of bytes of file-backed memory on active LRU list.
>  unevictable    - # of bytes of memory that cannot be reclaimed (mlocked
> etc).
> +direct_elapsed_ns - cpu time elapsed in hard limit reclaim (ns)
> +soft_elapsed_ns  - cpu time elapsed in soft limit reclaim (ns)
> +wmark_elapsed_ns - cpu time elapsed in hi/low watermark reclaim (ns)
> +direct_scanned - # of pages scanned at hard limit reclaim
> +soft_scanned   - # of pages scanned at soft limit reclaim
> +wmark_scanned  - # of pages scanned at hi/low watermark reclaim
> +direct_freed   - # of pages freed at hard limit reclaim
> +soft_freed     - # of pages freed at soft limit reclaim
> +wmark_freed    - # of pages freed at hi/low watermark reclaim
>
>  # status considering hierarchy (see memory.use_hierarchy settings)
>
> @@ -421,6 +430,15 @@ total_active_anon  - sum of all children'
>  total_inactive_file    - sum of all children's "inactive_file"
>  total_active_file      - sum of all children's "active_file"
>  total_unevictable      - sum of all children's "unevictable"
> +total_direct_elapsed_ns - sum of all children's "direct_elapsed_ns"
> +total_soft_elapsed_ns  - sum of all children's "soft_elapsed_ns"
> +total_wmark_elapsed_ns - sum of all children's "wmark_elapsed_ns"
> +total_direct_scanned   - sum of all children's "direct_scanned"
> +total_soft_scanned     - sum of all children's "soft_scanned"
> +total_wmark_scanned    - sum of all children's "wmark_scanned"
> +total_direct_freed     - sum of all children's "direct_freed"
> +total_soft_freed       - sum of all children's "soft_freed"
> +total_wmark_freed      - sum of all children's "wmark_freed"
>
>  # The following additional stats are dependent on CONFIG_DEBUG_VM.
>
Those stats look good to me. Thanks.

--Ying


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-26  1:38     ` KAMEZAWA Hiroyuki
@ 2011-04-26  7:19       ` Ying Han
  2011-04-26  7:43         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-04-26  7:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

On Mon, Apr 25, 2011 at 6:38 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 25 Apr 2011 15:21:21 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> Kame:
>>
>> Thank you for putting time on implementing the patch. I think it is
>> definitely a good idea to have the two alternatives on the table since
>> people has asked the questions. Before going down to the track, i have
>> thought about the two approaches and also discussed with Greg and Hugh
>> (cc-ed),  i would like to clarify some of the pros and cons on both
>> approaches.  In general, I think the workqueue is not the right answer
>> for this purpose.
>>
>> The thread-pool model
>> Pros:
>> 1. there is no isolation between memcg background reclaim, since the
>> memcg threads are shared. That isolation including all the resources
>> that the per-memcg background reclaim will need to access, like cpu
>> time. One thing we are missing for the shared worker model is the
>> individual cpu scheduling ability. We need the ability to isolate and
>> count the resource assumption per memcg, and including how much
>> cputime and where to run the per-memcg kswapd thread.
>>
>
> IIUC, new threads for the workqueue will be created automatically if necessary.
>
I read your patches today, but I might have missed some details while
reading them. I will read them through tomorrow.

The questions I was wondering about here are:
1. how to apply a per-memcg cpu cgroup limit that includes the kswapd time.
2. how to do NUMA-aware cpu scheduling if I want to set a cpumask so the
memcg-kswapd runs close to the NUMA node where the memcg's pages are
allocated.

I guess the second one should already be covered; if not, it shouldn't be a
big effort to fix. Any suggestions on the first one?

>
>> 2. it is hard for visibility and debugability. We have been
>> experiencing a lot when some kswapds running creazy and we need a
>> stright-forward way to identify which cgroup causing the reclaim. yes,
>> we can add more stats per-memcg to sort of giving that visibility, but
>> I can tell they are involved w/ more overhead of the change. Why
>> introduce the over-head if the per-memcg kswapd thread can offer that
>> maturely.
>>
>
> I added counters and time consumption statistics with low overhead.

I looked at the patch and the stats look good to me. Thanks.

>
>
>> 3. potential priority inversion for some memcgs. Let's say we have two
>> memcgs A and B on a single core machine, and A has big chuck of work
>> and B has small chuck of work. Now B's work is queued up after A. In
>> the workqueue model, we won't process B unless we finish A's work
>> since we only have one worker on the single core host. However, in the
>> per-memcg kswapd model, B got chance to run when A calls
>> cond_resched(). Well, we might not having the exact problem if we
>> don't constrain the workers number, and the worst case we'll have the
>> same number of workers as the number of memcgs. If so, it would be the
>> same model as per-memcg kswapd.
>>
>
> I implemented a static scan rate round-robin. I think you didn't read the patches.
> And the fact that the per-memcg thread model only switches when it calls cond_resched()
> means it will not be rescheduled until it consumes enough vruntime. I guess static
> scan rate round-robin wins when discussing fairness.
>
> And IIUC, workqueue invokes enough threads to do the service.

So, instead of having a dedicated thread reclaiming down to the wmark based on
priority, we do a small amount of work per memcg and round-robin across them.
This sounds like it might help with the counter-example I gave above, and it
shares similar logic with calling cond_resched().


>
>> 4. the kswapd threads are created and destroyed dynamically. are we
>> talking about allocating 8k of stack for kswapd when we are under
>> memory pressure? In the other case, all the memory are preallocated.
>>
>
> I think workqueue is there for avoiding 'making kthread dynamically'.
> We can save much codes.

So right now the workqueue is configured as unbound, which means that in the
worst case we might create as many workers as there are memcgs (if each memcg
takes a long time to do its reclaim). This might not be a problem,
but I would like to confirm.

>
>> 5. the workqueue is scary and might introduce issues sooner or later.
>> Also, why we think the background reclaim fits into the workqueue
>> model, and be more specific, how that share the same logic of other
>> parts of the system using workqueue.
>>
>
> Ok, with using workqueue.
>
>  1. The number of threads can be changed dynamically with regard to system
>     workload without adding any code. workqueue is there for this kind of
>     background job. gcwq has hooks into the scheduler and it works well.
>     With the per-memcg thread model, we'll never be able to do that.
>
>  2. We can avoid having unnecessary threads.
>     If it sleeps most of the time, why do we need to keep it? No, it's unnecessary.
>     It should be on-demand. freezer() etc. need to stop all threads, and
>     thousands of sleeping threads will be harmful.
>     You can see how 'ps -elf' gets slow when the number of threads increases.

In general, I am not strongly against the workqueue, but I am trying to
understand the pros and cons of the two approaches. The first
one is definitely simpler and more straightforward, and I was
suggesting starting with something simple and improving it later if we
see problems. But I will read your patches through tomorrow and am also
willing to see comments from others.

Thank you for the efforts!

--Ying

>
>
> === When we have small threads ==
> [root@rhel6-test hilow]# time ps -elf | wc -l
> 128
>
> real    0m0.058s
> user    0m0.010s
> sys     0m0.051s
>
> == When we have 2000 'sleeping' tasks. ==
> [root@rhel6-test hilow]# time ps -elf | wc -l
> 2128
>
> real    0m0.881s
> user    0m0.055s
> sys     0m0.972s
>
> Awesome, it costs nearly 1 sec.
> We should keep the number of threads as small as possible. Having threads has a cost.
>
>
>  3. We need to refine the reclaim code for memcg to make it consume less time.
>     With the per-memcg-thread model, we'll use cut-n-paste code, pass the whole job
>     to the scheduler, consume more time, and reclaim slowly.
>
>     BTW, the static scan rate round robin implemented in this patch is a fair routine.
>
>
> On a 4-cpu KVM guest, create 100M-limit / 90M-hiwat cgroups 1,2,3,4,5 and run
> 'cat' of a 400M file to /dev/null in each cgroup in a loop for 60 secs.
> ==
> [kamezawa@rhel6-test ~]$ cat /cgroup/memory/[1-5]/memory.stat | grep elapse | grep -v total
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 792377873
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 811053756
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 799196613
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 806502820
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 790071307
> ==
>
> No one dives into direct reclaim, and the time consumed by background reclaim is fair
> across the same jobs.
>
> ==
> [kamezawa@rhel6-test ~]$ cat /cgroup/memory/[1-5]/memory.stat | grep wmark_scanned | grep -v total
> wmark_scanned 225881
> wmark_scanned 225563
> wmark_scanned 226848
> wmark_scanned 225458
> wmark_scanned 226137
> ==
> Ah, yes. scan rate is fair. Even when we had 5 active cat + 5 works.
>
> BTW, without bgreclaim,
> ==
> [kamezawa@rhel6-test ~]$ cat /cgroup/memory/[1-5]/memory.stat | grep direct_elapsed | grep -v total
> direct_elapsed_ns 786049957
> direct_elapsed_ns 782150545
> direct_elapsed_ns 805222327
> direct_elapsed_ns 782563391
> direct_elapsed_ns 782431424
> ==
>
> direct reclaim uses the same amount of time.
>
> ==
> [kamezawa@rhel6-test ~]$ cat /cgroup/memory/[1-5]/memory.stat | grep direct_scan | grep -v total
> direct_scanned 224501
> direct_scanned 224448
> direct_scanned 224448
> direct_scanned 224448
> direct_scanned 224448
> ==
>
> CFS seems to work fairly ;) (Note: there is a 10M difference between bgreclaim/direct).
>
> with 10 groups. 10threads + 10works.
> ==
> [kamezawa@rhel6-test hilow]$ cat /cgroup/memory/[0-9]/memory.stat | grep elapsed_ns | grep -v total | grep -v soft
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 81856013
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 350538700
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 340384072
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 344776087
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 322237832
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 337741658
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 261018174
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 316675784
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 257009865
> direct_elapsed_ns 0
> soft_elapsed_ns 0
> wmark_elapsed_ns 154339039
>
> ==
> No one dives into direct reclaim. (But 'cat' itself is slow... because of read()?)
> From bgreclaim's point of view, this is fair because no direct reclaim happens.
> Maybe I need to use blkio cgroup for more tests of this kind.
>
> I attaches the test script below.
>
>  4. We can see how round-robin works and see what we need to modify.
>     Maybe good for future work, and we'll have a good chance to reuse code.
>
>  5. With per-memcg-thread, in the bad case we'll see thousands of threads trying
>     to reclaim memory at once. That's never good.
>     In this patch, I left the workqueue's max_active at its default. If we need to fix/tune,
>     we just fix max_active.
>
>  6. If it seems that it's better to have a thread pool for memcg,
>     we can switch to the thread-pool model seamlessly. But the delayed_work implementation
>     will be difficult ;) And management of the number of active works will be
>     difficult. I bet we'll never use a thread pool.
>
>  7. We'll never see cpu cache misses caused by frequent thread stack switches.
>
>
>> Cons:
>> 1. save SOME memory resource.
>>
> and CPU resource. A per-memcg thread tends to use more cpu time than the workqueue,
> which is required to be designed as a short-term round robin.
>
>
>> The per-memcg-per-kswapd model
>> Pros:
>> 1. memory overhead per thread, and The memory consumption would be
>> 8k*1000 = 8M with 1k cgroup. This is NOT a problem as least we haven't
>> seen it in our production. We have cases that 2k of kernel threads
>> being created, and we haven't noticed it is causing resource
>> consumption problem as well as performance issue. On those systems, we
>> might have ~100 cgroup running at a time.
>>
>> 2. we see lots of threads at 'ps -elf'. well, is that really a problem
>> that we need to change the threading model?
>>
>> Overall, the per-memcg-per-kswapd thread model is simple enough to
>> provide better isolation (predictability & debug ability). The number
>> of threads we might potentially have on the system is not a real
>> problem. We already have systems running that much of threads (even
>> more) and we haven't seen problem of that. Also, i can imagine it will
>> make our life easier for some other extensions on memcg works.
>>
>> For now, I would like to stick on the simple model. At the same time I
>> am willing to looking into changes and fixes whence we have seen
>> problems later.
>>
>> Comments?
>>
>
>
> 2-3 years ago, I implemented a per-memcg-thread model and got NACKed; I was
> told "you should use workqueue" ;) Now, workqueue has been renewed and seems easier to
> use for cpu-intensive workloads. If I need more tweaks for workqueue, I'll add
> patches for workqueue. But it's unseen now.
>
> And using the per-memcg thread model tends to lead us to brain-dead code, as it means
> cut-n-pasting code from kswapd, which never fits memcg. Later, at LRU removal, we'll need
> some kind of round-robin again, and checking how round-robin works and what good
> round-robin code looks like is an interesting study. For example, I noticed I need patch 4, soon.
>
>
> I'd like to use a workqueue and refine the whole routine to fit a short-term round-robin.
> Having sleeping threads is a cost. Round robin can work in a fair way.
>
>
> Thanks,
> -Kame
>
> == test.sh ==
> #!/bin/bash -x
>
> for i in `seq 0 9`; do
>        mkdir /cgroup/memory/$i
>        echo 100M > /cgroup/memory/$i/memory.limit_in_bytes
>        echo 10M > /cgroup/memory/$i/memory.high_wmark_distance
> done
>
> for i in `seq 0 9`; do
>        cgexec -g memory:$i ./loop.sh ./tmpfile$i &
> done
>
> sleep 60;
>
> pkill loop.sh
>
> == loop.sh ==
> #!/bin/sh
>
> while true; do
>        cat $1 > /dev/null
> done
> ==
>
>
>
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-26  7:19       ` Ying Han
@ 2011-04-26  7:43         ` KAMEZAWA Hiroyuki
  2011-04-26  8:43           ` Ying Han
  0 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-26  7:43 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

On Tue, 26 Apr 2011 00:19:46 -0700
Ying Han <yinghan@google.com> wrote:

> On Mon, Apr 25, 2011 at 6:38 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 25 Apr 2011 15:21:21 -0700
> > Ying Han <yinghan@google.com> wrote:

> >> Thank you for putting time on implementing the patch. I think it is
> >> definitely a good idea to have the two alternatives on the table since
> >> people has asked the questions. Before going down to the track, i have
> >> thought about the two approaches and also discussed with Greg and Hugh
> >> (cc-ed), i would like to clarify some of the pros and cons on both
> >> approaches. In general, I think the workqueue is not the right answer
> >> for this purpose.
> >>
> >> The thread-pool model
> >> Pros:
> >> 1. there is no isolation between memcg background reclaim, since the
> >> memcg threads are shared. That isolation including all the resources
> >> that the per-memcg background reclaim will need to access, like cpu
> >> time. One thing we are missing for the shared worker model is the
> >> individual cpu scheduling ability. We need the ability to isolate and
> >> count the resource assumption per memcg, and including how much
> >> cputime and where to run the per-memcg kswapd thread.
> >>
> >
> > IIUC, new threads for workqueue will be created if necessary in automatic.
> >
> I read your patches today, but i might missed some details while I was
> reading it. I will read them through tomorrow.
> 

Thank you.

> The question I was wondering here is
> 1. how to do cpu cgroup limit per-memcg including the kswapd time.

I'd like to add some limitation based on elapsed time. For example,
only allow it to run for 10ms within each 1sec; it's a background job and
should be limited. Or simply add a static delay per memcg at queue_delayed_work();
then the user can limit scan/sec. But what I wonder now is what the
good interface is... msec/sec? scan/sec? free/sec? etc.
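
A minimal sketch of the elapsed-time variant, reusing the start/end accounting from the
statistics patch ('elapsed' is the end - start measured around shrink_mem_cgroup();
bgreclaim_window_start, bgreclaim_budget and memcg_bgreclaim_wq are hypothetical names,
and 10ms/1sec are just the example numbers above):
==
	/* sketch: allow roughly 10ms of reclaim work per 1sec window per memcg */
	u64 now = sched_clock();
	unsigned long delay = 0;

	if (now - mem->bgreclaim_window_start > NSEC_PER_SEC) {
		mem->bgreclaim_window_start = now;	/* start a new 1sec window */
		mem->bgreclaim_budget = 10 * NSEC_PER_MSEC;
	}
	if (elapsed >= mem->bgreclaim_budget)
		delay = HZ;		/* budget spent, wait for the next window */
	else
		mem->bgreclaim_budget -= elapsed;

	queue_delayed_work(memcg_bgreclaim_wq, &mem->bgreclaim_work, delay);
==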


> 2. how to do numa awareness cpu scheduling if i want to do cpumask on
> the memcg-kswapd close to the numa node where all the pages of the
> memcg allocated.
> 
> I guess the second one should have been covered. If not, it shouldn't
> be a big effort to fix that. And any suggestions on the first one.
> 

Interesting. If we use WQ_CPU_INTENSIVE + queue_work_on() instead
of WQ_UNBOUND, we can control which cpu the job runs on.

"The default cpu" to run wmark-reclaim could be calculated by
css_id(&mem->css) % num_online_cpus(), or by some round robin at
memcg creation. Anyway, we'd need to use WQ_CPU_INTENSIVE; it may
give us better results than WQ_UNBOUND...
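
Roughly, as a sketch (memcg_wmark_wq is a hypothetical workqueue created with
WQ_CPU_INTENSIVE, and the modulo mapping assumes contiguous cpu ids):
==
	/* sketch: pick a "default" cpu for this memcg's wmark work */
	int cpu = css_id(&mem->css) % num_online_cpus();

	queue_delayed_work_on(cpu, memcg_wmark_wq, &mem->bgreclaim_work, delay);
==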

Adding an interface for limiting cpu is... hmm. Per memcg? Or as a
generic memcg param? It would be a memcg parameter, not a thread's.


> >
> >> 4. the kswapd threads are created and destroyed dynamically. are we
> >> talking about allocating 8k of stack for kswapd when we are under
> >> memory pressure? In the other case, all the memory are preallocated.
> >>
> >
> > I think workqueue is there for avoiding 'making kthread dynamically'.
> > We can save much codes.
> 
> So right now, the workqueue is configured as unbounded. which means
> the worse case we might create
> the same number of workers as the number of memcgs. ( if each memcg
> takes long time to do the reclaim). So this might not be a problem,
> but I would like to confirm.
> 
From the documentation, the default max_active for an unbound workqueue is
==
Currently, for a bound wq, the maximum limit for @max_active is 512
and the default value used when 0 is specified is 256.  For an unbound
wq, the limit is higher of 512 and 4 * num_possible_cpus().  These
values are chosen sufficiently high such that they are not the
limiting factor while providing protection in runaway cases.
==
512? If wmark-reclaim burns cpu (and gets rescheduled), a new kthread will
be created.


> >
> >> 5. the workqueue is scary and might introduce issues sooner or later.
> >> Also, why we think the background reclaim fits into the workqueue
> >> model, and be more specific, how that share the same logic of other
> >> parts of the system using workqueue.
> >>
> >
> > Ok, with using workqueue.
> >
> >  1. The number of threads can be changed dynamically with regard to system
> >     workload without adding any codes. workqueue is for this kind of
> >     background jobs. gcwq has a hooks to scheduler and it works well.
> >     With per-memcg thread model, we'll never be able to do such.
> >
> >  2. We can avoid having unncessary threads.
> >     If it sleeps most of time, why we need to keep it ? No, it's unnecessary.
> >     It should be on-demand. freezer() etc need to stop all threads and
> >     thousands of sleeping threads will be harmful.
> >     You can see how 'ps -elf' gets slow when the number of threads increases.
> 
> In general, i am not strongly against the workqueue but trying to
> understand the procs and cons between the two approaches. The first
> one is definitely simpler and more straight-forward, and I was
> suggesting to start with something simple and improve it later if we
> see problems. But I will read your path through tomorrow and also
> willing to see comments from others.
> 
> Thank you for the efforts!
> 

You, too.

Anyway, get_scan_count() seems to be a big problem and I'll cut it out
as an independent patch.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-26  7:43         ` KAMEZAWA Hiroyuki
@ 2011-04-26  8:43           ` Ying Han
  2011-04-26  8:47             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-04-26  8:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins


On Tue, Apr 26, 2011 at 12:43 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 26 Apr 2011 00:19:46 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Mon, Apr 25, 2011 at 6:38 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 25 Apr 2011 15:21:21 -0700
> > > Ying Han <yinghan@google.com> wrote:
>
> > >> Thank you for putting time on implementing the patch. I think it is
> > >> definitely a good idea to have the two alternatives on the table since
> > >> people has asked the questions. Before going down to the track, i have
> > >> thought about the two approaches and also discussed with Greg and Hugh
> > >> (cc-ed),  i would like to clarify some of the pros and cons on both
> > >> approaches.  In general, I think the workqueue is not the right answer
> > >> for this purpose.
> > >>
> > >> The thread-pool model
> > >> Pros:
> > >> 1. there is no isolation between memcg background reclaim, since the
> > >> memcg threads are shared. That isolation including all the resources
> > >> that the per-memcg background reclaim will need to access, like cpu
> > >> time. One thing we are missing for the shared worker model is the
> > >> individual cpu scheduling ability. We need the ability to isolate and
> > >> count the resource assumption per memcg, and including how much
> > >> cputime and where to run the per-memcg kswapd thread.
> > >>
> > >
> > > IIUC, new threads for workqueue will be created if necessary in
> automatic.
> > >
> > I read your patches today, but i might missed some details while I was
> > reading it. I will read them through tomorrow.
> >
>
> Thank you.
>
> > The question I was wondering here is
> > 1. how to do cpu cgroup limit per-memcg including the kswapd time.
>
> I'd like to add some limitation based on elapsed time. For example,
> only allow it to run 10ms within 1sec; it's a background job and should be
> limited. Or, simply add a static delay per memcg at queue_delayed_work().
> Then, the user can limit scan/sec. But what I wonder now is what the
> good interface is....msec/sec ? scan/sec ? free/sec ? etc...
>
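(A minimal sketch of the "static delay per memcg" idea above. memcg_bgreclaimq
and bgreclaim_work are the names used in patch 7; reclaim_delay_msec is a
hypothetical per-memcg tunable shown only for illustration.)

#include <linux/workqueue.h>
#include <linux/jiffies.h>

static void requeue_memcg_bgreclaim(struct mem_cgroup *mem)
{
	/*
	 * reclaim_delay_msec is hypothetical: a static delay here caps how
	 * often this memcg's background reclaim work can run.
	 */
	unsigned long delay = msecs_to_jiffies(mem->reclaim_delay_msec);

	queue_delayed_work(memcg_bgreclaimq, &mem->bgreclaim_work, delay);
}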
>
> > 2. how to do numa awareness cpu scheduling if i want to do cpumask on
> > the memcg-kswapd close to the numa node where all the pages of the
> > memcg allocated.
> >
> > I guess the second one should have been covered. If not, it shouldn't
> > be a big effort to fix that. And any suggestions on the first one.
> >
>
> Interesting. If we use WQ_CPU_INTENSIVE + queue_work_on() instead
> of WQ_UNBOUND, we can control which cpu does the jobs.
>
> "The default cpu" to run wmark-reclaim can be calculated by
> css_id(&mem->css) % num_online_cpus() or some round robin at
> memcg creation. Anyway, we'll need to use WQ_CPU_INTENSIVE.
> It may give us better results than WQ_UNBOUND...
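(A rough sketch of the queue_work_on() idea above, assuming a bound
WQ_CPU_INTENSIVE workqueue and a hypothetical wmark_work item embedded in
struct mem_cgroup; it only illustrates picking a "default cpu" from css_id,
it is not what the posted patches do.)

static struct workqueue_struct *memcg_wmark_wq;

static int __init memcg_wmark_wq_init(void)
{
	memcg_wmark_wq = alloc_workqueue("memcg_wmark",
					 WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 0);
	return memcg_wmark_wq ? 0 : -ENOMEM;
}

static void queue_wmark_reclaim(struct mem_cgroup *mem)
{
	/* spread memcgs over cpus; assumes online cpu ids are contiguous */
	int cpu = css_id(&mem->css) % num_online_cpus();

	queue_work_on(cpu, memcg_wmark_wq, &mem->wmark_work);
}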
>
> Adding an interface for limiting cpu is...hmm. Per memcg ? Or
> as a generic memcg param ? It will be a memcg parameter, not
> a thread's.
>


To clarify a bit, my question was meant to account for it but not necessarily
to limit it. We can use the existing cpu cgroup to do the cpu limiting, and I
am just wondering how to configure it for the memcg kswapd thread.

Let's say in the per-memcg-kswapd model, I can echo the kswapd thread pid
into the cpu cgroup (the same set of processes as the memcg, but in a cpu
limiting cgroup instead). If the kswapd is shared, we might need extra work
to account the cpu cycles correspondingly.

> >
> > >> 4. the kswapd threads are created and destroyed dynamically. are we
> > >> talking about allocating 8k of stack for kswapd when we are under
> > >> memory pressure? In the other case, all the memory are preallocated.
> > >>
> > >
> > > I think workqueue is there for avoiding 'making kthread dynamically'.
> > > We can save much codes.
> >
> > So right now, the workqueue is configured as unbound, which means in
> > the worst case we might create the same number of workers as the number
> > of memcgs (if each memcg takes a long time to do the reclaim). So this
> > might not be a problem, but I would like to confirm.
> >
> From documenation, max_active unbound workqueue (default) is
> ==
> Currently, for a bound wq, the maximum limit for @max_active is 512
> and the default value used when 0 is specified is 256.  For an unbound
> wq, the limit is higher of 512 and 4 * num_possible_cpus().  These
> values are chosen sufficiently high such that they are not the
> limiting factor while providing protection in runaway cases.
> ==
> 512 ?  If wmark-reclaim burns cpu (and gets rescheduled), a new kthread will
> be created.
>
Ok, so we have at most max(512, 4 * num_possible_cpus()) execution contexts
for the unbound workqueue, and the number actually used should be less than
or equal to the number of memcgs on the system (since we have one work item
per memcg).
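(For illustration only: the number of in-flight reclaim works could also be
capped explicitly via max_active instead of relying on the unbound default
quoted above; the value 8 below is an arbitrary assumption.)

	memcg_bgreclaimq = alloc_workqueue("memcg",
			WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE,
			8 /* arbitrary cap on concurrent reclaim works */);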

>
> > >
> > >> 5. the workqueue is scary and might introduce issues sooner or later.
> > >> Also, why we think the background reclaim fits into the workqueue
> > >> model, and be more specific, how that share the same logic of other
> > >> parts of the system using workqueue.
> > >>
> > >
> > > Ok, with using workqueue.
> > >
> > >  1. The number of threads can be changed dynamically with regard to
> > >     system workload without adding any code. workqueue is for this kind
> > >     of background jobs. gcwq has hooks into the scheduler and it works
> > >     well. With the per-memcg thread model, we'll never be able to do that.
> > >
> > >  2. We can avoid having unnecessary threads.
> > >     If it sleeps most of the time, why do we need to keep it ? No, it's
> > >     unnecessary. It should be on-demand. freezer() etc need to stop all
> > >     threads and thousands of sleeping threads will be harmful.
> > >     You can see how 'ps -elf' gets slow when the number of threads
> > >     increases.
> >
> > In general, I am not strongly against the workqueue but am trying to
> > understand the pros and cons of the two approaches. The first
> > one is definitely simpler and more straight-forward, and I was
> > suggesting to start with something simple and improve it later if we
> > see problems. But I will read your patch through tomorrow and am also
> > willing to see comments from others.
> >
> > Thank you for the efforts!
> >
>
> You, too.
>
> Anyway, get_scan_count() seems to be a big problem, and I'll cut it out
> as an independent patch.
>

sounds good to me.

--Ying


> Thanks,
> -Kame
>
>
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 8517 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-26  8:43           ` Ying Han
@ 2011-04-26  8:47             ` KAMEZAWA Hiroyuki
  2011-04-26 23:08               ` Ying Han
  2011-04-28  3:55               ` Ying Han
  0 siblings, 2 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-26  8:47 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

On Tue, 26 Apr 2011 01:43:17 -0700
Ying Han <yinghan@google.com> wrote:

> On Tue, Apr 26, 2011 at 12:43 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Tue, 26 Apr 2011 00:19:46 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > On Mon, Apr 25, 2011 at 6:38 PM, KAMEZAWA Hiroyuki
> > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > On Mon, 25 Apr 2011 15:21:21 -0700
> > > > Ying Han <yinghan@google.com> wrote:

> 
> To clarify a bit, my question was meant to account for it but not necessarily
> to limit it. We can use the existing cpu cgroup to do the cpu limiting, and I
> am just wondering how to configure it for the memcg kswapd thread.
> 
> Let's say in the per-memcg-kswapd model, I can echo the kswapd thread pid
> into the cpu cgroup (the same set of processes as the memcg, but in a cpu
> limiting cgroup instead). If the kswapd is shared, we might need extra work
> to account the cpu cycles correspondingly.

Hmm ? Aren't statistics of elapsed_time enough ?

Now, I think a scan/sec limiting interface is more promising than time
or thread controls. It's easier to understand.
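(Only to make the scan/sec idea concrete, a sketch of a per-second scan
budget; scan_limit_per_sec, scan_budget and budget_refresh are hypothetical
per-memcg fields, not part of these patches.)

#include <linux/jiffies.h>

/* returns the delay (in jiffies) to apply before re-queueing the work */
static unsigned long memcg_bgreclaim_delay(struct mem_cgroup *mem,
					   unsigned long scanned)
{
	if (time_after(jiffies, mem->budget_refresh)) {
		/* refill the budget once per second */
		mem->budget_refresh = jiffies + HZ;
		mem->scan_budget = mem->scan_limit_per_sec;
	}
	if (scanned >= mem->scan_budget) {
		mem->scan_budget = 0;
		/* over budget: wait for the next one-second window */
		return mem->budget_refresh - jiffies;
	}
	mem->scan_budget -= scanned;
	return 0;
}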

BTW, I think it's better to avoid calling the watermark reclaim work "kswapd".
It's confusing because we've talked about global reclaim at LSF.


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-04-25  9:28 ` [PATCH 1/7] memcg: add high/low watermark to res_counter KAMEZAWA Hiroyuki
@ 2011-04-26 17:54   ` Ying Han
  2011-04-29 13:33   ` Michal Hocko
  2011-05-02  9:07   ` Balbir Singh
  2 siblings, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-04-26 17:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 10567 bytes --]

On Mon, Apr 25, 2011 at 2:28 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> There are two watermarks added per-memcg including "high_wmark" and
> "low_wmark".
> The per-memcg kswapd is invoked when the memcg's memory
> usage(usage_in_bytes)
> is higher than the low_wmark. Then the kswapd thread starts to reclaim
> pages
> until the usage is lower than the high_wmark.
>
> Each watermark is calculated based on the hard_limit(limit_in_bytes) for
> each
> memcg. Each time the hard_limit is changed, the corresponding wmarks are
> re-calculated. Since memory controller charges only user pages, there is
> no need for a "min_wmark". The current calculation of wmarks is based on
> individual tunable high_wmark_distance, which are set to 0 by default.
> low_wmark is calculated in automatic way.
>
> Changelog:v8b...v7
> 1. set low_wmark_distance in automatic using fixed HILOW_DISTANCE.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h  |    1
>  include/linux/res_counter.h |   78
> ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/res_counter.c        |    6 +++
>  mm/memcontrol.c             |   69 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 154 insertions(+)
>
> Index: memcg/include/linux/memcontrol.h
> ===================================================================
> --- memcg.orig/include/linux/memcontrol.h
> +++ memcg/include/linux/memcontrol.h
> @@ -84,6 +84,7 @@ int task_in_mem_cgroup(struct task_struc
>
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int
> charge_flags);
>
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> Index: memcg/include/linux/res_counter.h
> ===================================================================
> --- memcg.orig/include/linux/res_counter.h
> +++ memcg/include/linux/res_counter.h
> @@ -39,6 +39,14 @@ struct res_counter {
>         */
>        unsigned long long soft_limit;
>        /*
> +        * the limit that reclaim triggers.
> +        */
> +       unsigned long long low_wmark_limit;
> +       /*
> +        * the limit that reclaim stops.
> +        */
> +       unsigned long long high_wmark_limit;
> +       /*
>         * the number of unsuccessful attempts to consume the resource
>         */
>        unsigned long long failcnt;
> @@ -55,6 +63,9 @@ struct res_counter {
>
>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
>
> +#define CHARGE_WMARK_LOW       0x01
> +#define CHARGE_WMARK_HIGH      0x02
> +
>  /**
>  * Helpers to interact with userspace
>  * res_counter_read_u64() - returns the value of the specified member.
> @@ -92,6 +103,8 @@ enum {
>        RES_LIMIT,
>        RES_FAILCNT,
>        RES_SOFT_LIMIT,
> +       RES_LOW_WMARK_LIMIT,
> +       RES_HIGH_WMARK_LIMIT
>  };
>
>  /*
> @@ -147,6 +160,24 @@ static inline unsigned long long res_cou
>        return margin;
>  }
>
> +static inline bool
> +res_counter_under_high_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +       if (cnt->usage < cnt->high_wmark_limit)
> +               return true;
> +
> +       return false;
> +}
> +
> +static inline bool
> +res_counter_under_low_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +       if (cnt->usage < cnt->low_wmark_limit)
> +               return true;
> +
> +       return false;
> +}
> +
>  /**
>  * Get the difference between the usage and the soft limit
>  * @cnt: The counter
> @@ -169,6 +200,30 @@ res_counter_soft_limit_excess(struct res
>        return excess;
>  }
>
> +static inline bool
> +res_counter_under_low_wmark_limit(struct res_counter *cnt)
> +{
> +       bool ret;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&cnt->lock, flags);
> +       ret = res_counter_under_low_wmark_limit_check_locked(cnt);
> +       spin_unlock_irqrestore(&cnt->lock, flags);
> +       return ret;
> +}
> +
> +static inline bool
> +res_counter_under_high_wmark_limit(struct res_counter *cnt)
> +{
> +       bool ret;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&cnt->lock, flags);
> +       ret = res_counter_under_high_wmark_limit_check_locked(cnt);
> +       spin_unlock_irqrestore(&cnt->lock, flags);
> +       return ret;
> +}
> +
>  static inline void res_counter_reset_max(struct res_counter *cnt)
>  {
>        unsigned long flags;
> @@ -214,4 +269,27 @@ res_counter_set_soft_limit(struct res_co
>        return 0;
>  }
>
> +static inline int
> +res_counter_set_high_wmark_limit(struct res_counter *cnt,
> +                               unsigned long long wmark_limit)
> +{
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&cnt->lock, flags);
> +       cnt->high_wmark_limit = wmark_limit;
> +       spin_unlock_irqrestore(&cnt->lock, flags);
> +       return 0;
> +}
> +
> +static inline int
> +res_counter_set_low_wmark_limit(struct res_counter *cnt,
> +                               unsigned long long wmark_limit)
> +{
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&cnt->lock, flags);
> +       cnt->low_wmark_limit = wmark_limit;
> +       spin_unlock_irqrestore(&cnt->lock, flags);
> +       return 0;
> +}
>  #endif
> Index: memcg/kernel/res_counter.c
> ===================================================================
> --- memcg.orig/kernel/res_counter.c
> +++ memcg/kernel/res_counter.c
> @@ -19,6 +19,8 @@ void res_counter_init(struct res_counter
>        spin_lock_init(&counter->lock);
>        counter->limit = RESOURCE_MAX;
>        counter->soft_limit = RESOURCE_MAX;
> +       counter->low_wmark_limit = RESOURCE_MAX;
> +       counter->high_wmark_limit = RESOURCE_MAX;
>        counter->parent = parent;
>  }
>
> @@ -103,6 +105,10 @@ res_counter_member(struct res_counter *c
>                return &counter->failcnt;
>        case RES_SOFT_LIMIT:
>                return &counter->soft_limit;
> +       case RES_LOW_WMARK_LIMIT:
> +               return &counter->low_wmark_limit;
> +       case RES_HIGH_WMARK_LIMIT:
> +               return &counter->high_wmark_limit;
>        };
>
>        BUG();
> Index: memcg/mm/memcontrol.c
> ===================================================================
> --- memcg.orig/mm/memcontrol.c
> +++ memcg/mm/memcontrol.c
> @@ -278,6 +278,11 @@ struct mem_cgroup {
>         */
>        struct mem_cgroup_stat_cpu nocpu_base;
>        spinlock_t pcp_counter_lock;
> +
> +       /*
> +        * used to calculate the low/high_wmarks based on the
> limit_in_bytes.
> +        */
> +       u64 high_wmark_distance;
>  };
>
>  /* Stuffs for move charges at task migration. */
> @@ -867,6 +872,44 @@ out:
>  EXPORT_SYMBOL(mem_cgroup_count_vm_event);
>
>  /*
> + * If Hi-Low distance is too big, background reclaim tend to be cpu
> hogging.
> + * If Hi-Low distance is too small, small memory usage spike (by temporal
> + * shell scripts) causes background reclaim and make thing worse. But
> memory
> + * spike can be avoided by setting high-wmark a bit higier. We use fixed
> size
> + * size of HiLow Distance, this will be easy to use.
> + */
> +#ifdef CONFIG_64BIT /* object size tend do be twice */
> +#define HILOW_DISTANCE (4 * 1024 * 1024)
> +#else
> +#define HILOW_DISTANCE (2 * 1024 * 1024)
> +#endif
> +
> +static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> +{
> +       u64 limit;
> +
> +       limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> +       if (mem->high_wmark_distance == 0) {
> +               res_counter_set_low_wmark_limit(&mem->res, limit);
> +               res_counter_set_high_wmark_limit(&mem->res, limit);
> +       } else {
> +               u64 low_wmark, high_wmark, low_distance;
> +               if (mem->high_wmark_distance <= HILOW_DISTANCE)
> +                       low_distance = mem->high_wmark_distance / 2;
> +               else
> +                       low_distance = HILOW_DISTANCE;
> +               if (low_distance < PAGE_SIZE * 2)
> +                       low_distance = PAGE_SIZE * 2;
> +
> +               low_wmark = limit - low_distance;
>

So the low_distance here is the distance between the limit and the low_wmark.
Then I missed the point where we control the distance between the high and
low wmarks, as described in the comment. So here we might have
mem->high_wmark_distance = 4M + 1 page
low_distance = 4M

--Ying
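(For concreteness, plugging numbers into the quoted setup_per_memcg_wmarks()
above, on 64-bit where HILOW_DISTANCE = 4M and assuming an arbitrary 1G limit:

  high_wmark_distance = 4M + 4K  ->  low_distance = HILOW_DISTANCE = 4M
  low_wmark  = 1G - 4M
  high_wmark = 1G - 4M - 4K

so low_wmark - high_wmark is a single page, while limit - low_wmark is 4M;
the fixed HILOW_DISTANCE bounds the limit-to-low_wmark gap, not the
high/low gap.)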


> +               high_wmark = limit - mem->high_wmark_distance;
> +
> +               res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> +               res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> +       }
> +}
> +
> +/*
>  * Following LRU functions are allowed to be used without PCG_LOCK.
>  * Operations are called by routine of global LRU independently from memcg.
>  * What we have to take care of here is validness of pc->mem_cgroup.
> @@ -3264,6 +3307,7 @@ static int mem_cgroup_resize_limit(struc
>                        else
>                                memcg->memsw_is_minimum = false;
>                }
> +               setup_per_memcg_wmarks(memcg);
>                mutex_unlock(&set_limit_mutex);
>
>                if (!ret)
> @@ -3324,6 +3368,7 @@ static int mem_cgroup_resize_memsw_limit
>                        else
>                                memcg->memsw_is_minimum = false;
>                }
> +               setup_per_memcg_wmarks(memcg);
>                mutex_unlock(&set_limit_mutex);
>
>                if (!ret)
> @@ -4603,6 +4648,30 @@ static void __init enable_swap_cgroup(vo
>  }
>  #endif
>
> +/*
> + * We use low_wmark and high_wmark for triggering per-memcg kswapd.
> + * The reclaim is triggered by low_wmark (usage > low_wmark) and stopped
> + * by high_wmark (usage < high_wmark).
> + */
> +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> +                               int charge_flags)
> +{
> +       long ret = 0;
> +       int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
> +
> +       if (!mem->high_wmark_distance)
> +               return 1;
> +
> +       VM_BUG_ON((charge_flags & flags) == flags);
> +
> +       if (charge_flags & CHARGE_WMARK_LOW)
> +               ret = res_counter_under_low_wmark_limit(&mem->res);
> +       if (charge_flags & CHARGE_WMARK_HIGH)
> +               ret = res_counter_under_high_wmark_limit(&mem->res);
> +
> +       return ret;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>        struct mem_cgroup_tree_per_node *rtpn;
>
>

[-- Attachment #2: Type: text/html, Size: 12478 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/7] memcg bgreclaim core.
  2011-04-25  9:36 ` [PATCH 5/7] memcg bgreclaim core KAMEZAWA Hiroyuki
  2011-04-26  4:59   ` Ying Han
@ 2011-04-26 18:37   ` Ying Han
  1 sibling, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-04-26 18:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 12640 bytes --]

On Mon, Apr 25, 2011 at 2:36 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Following patch will chagnge the logic. This is a core.
> ==
> This is the main loop of per-memcg background reclaim which is implemented
> in
> function balance_mem_cgroup_pgdat().
>
> The function performs a priority loop similar to global reclaim. During
> each
> iteration it frees memory from a selected victim node.
> After reclaiming enough pages or scanning enough pages, it returns and find
> next work with round-robin.
>
> changelog v8b..v7
> 1. reworked for using work_queue rather than threads.
> 2. changed shrink_mem_cgroup algorithm to fit workqueue. In short, avoid
>   long running and allow quick round-robin and unnecessary write page.
>   When a thread make pages dirty continuously, write back them by flusher
>   is far faster than writeback by background reclaim. This detail will
>   be fixed when dirty_ratio implemented. The logic around this will be
>   revisited in following patche.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |   11 ++++
>  mm/memcontrol.c            |   44 ++++++++++++++---
>  mm/vmscan.c                |  115
> +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 162 insertions(+), 8 deletions(-)
>
> Index: memcg/include/linux/memcontrol.h
> ===================================================================
> --- memcg.orig/include/linux/memcontrol.h
> +++ memcg/include/linux/memcontrol.h
> @@ -89,6 +89,8 @@ extern int mem_cgroup_last_scanned_node(
>  extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
>                                        const nodemask_t *nodes);
>
> +unsigned long shrink_mem_cgroup(struct mem_cgroup *mem);
> +
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
>  {
> @@ -112,6 +114,9 @@ extern void mem_cgroup_end_migration(str
>  */
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +                               int nid, int zone_idx);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                       struct zone *zone,
>                                       enum lru_list lru);
> @@ -310,6 +315,12 @@ mem_cgroup_inactive_file_is_low(struct m
>  }
>
>  static inline unsigned long
> +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int
> zone_idx)
> +{
> +       return 0;
> +}
> +
> +static inline unsigned long
>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>                         enum lru_list lru)
>  {
> Index: memcg/mm/memcontrol.c
> ===================================================================
> --- memcg.orig/mm/memcontrol.c
> +++ memcg/mm/memcontrol.c
> @@ -1166,6 +1166,23 @@ int mem_cgroup_inactive_file_is_low(stru
>        return (active > inactive);
>  }
>
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> +                                               int nid, int zone_idx)
> +{
> +       int nr;
> +       struct mem_cgroup_per_zone *mz =
> +               mem_cgroup_zoneinfo(memcg, nid, zone_idx);
> +
> +       nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> +            MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> +
> +       if (nr_swap_pages > 0)
> +               nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> +                     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> +
> +       return nr;
> +}
> +
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                       struct zone *zone,
>                                       enum lru_list lru)
> @@ -1286,7 +1303,7 @@ static unsigned long mem_cgroup_margin(s
>        return margin >> PAGE_SHIFT;
>  }
>
> -static unsigned int get_swappiness(struct mem_cgroup *memcg)
> +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
>        struct cgroup *cgrp = memcg->css.cgroup;
>
> @@ -1595,14 +1612,15 @@ static int mem_cgroup_hierarchical_recla
>                /* we use swappiness of local cgroup */
>                if (check_soft) {
>                        ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> -                               noswap, get_swappiness(victim), zone,
> +                               noswap, mem_cgroup_swappiness(victim),
> zone,
>                                &nr_scanned);
>                        *total_scanned += nr_scanned;
>                        mem_cgroup_soft_steal(victim, ret);
>                        mem_cgroup_soft_scan(victim, nr_scanned);
>                } else
>                        ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> -                                               noswap,
> get_swappiness(victim));
> +                                               noswap,
> +
> mem_cgroup_swappiness(victim));
>                css_put(&victim->css);
>                /*
>                 * At shrinking usage, we can't check we should stop here or
> @@ -1628,15 +1646,25 @@ static int mem_cgroup_hierarchical_recla
>  int
>  mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t
> *nodes)
>  {
> -       int next_nid;
> +       int next_nid, i;
>        int last_scanned;
>
>        last_scanned = mem->last_scanned_node;
> -       next_nid = next_node(last_scanned, *nodes);
> +       next_nid = last_scanned;
> +rescan:
> +       next_nid = next_node(next_nid, *nodes);
>
>        if (next_nid == MAX_NUMNODES)
>                next_nid = first_node(*nodes);
>
> +       /* If no page on this node, skip */
> +       for (i = 0; i < MAX_NR_ZONES; i++)
> +               if (mem_cgroup_zone_reclaimable_pages(mem, next_nid, i))
> +                       break;
> +
> +       if (next_nid != last_scanned && (i == MAX_NR_ZONES))
> +               goto rescan;
> +
>        mem->last_scanned_node = next_nid;
>
>        return next_nid;
> @@ -3649,7 +3677,7 @@ try_to_free:
>                        goto out;
>                }
>                progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> -                                               false,
> get_swappiness(mem));
> +                                       false, mem_cgroup_swappiness(mem));
>                if (!progress) {
>                        nr_retries--;
>                        /* maybe some writeback is necessary */
> @@ -4073,7 +4101,7 @@ static u64 mem_cgroup_swappiness_read(st
>  {
>        struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>
> -       return get_swappiness(memcg);
> +       return mem_cgroup_swappiness(memcg);
>  }
>
>  static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype
> *cft,
> @@ -4849,7 +4877,7 @@ mem_cgroup_create(struct cgroup_subsys *
>        INIT_LIST_HEAD(&mem->oom_notify);
>
>        if (parent)
> -               mem->swappiness = get_swappiness(parent);
> +               mem->swappiness = mem_cgroup_swappiness(parent);
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> Index: memcg/mm/vmscan.c
> ===================================================================
> --- memcg.orig/mm/vmscan.c
> +++ memcg/mm/vmscan.c
> @@ -42,6 +42,7 @@
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
>  #include <linux/oom.h>
> +#include <linux/res_counter.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -2308,6 +2309,120 @@ static bool sleeping_prematurely(pg_data
>                return !all_zones_ok;
>  }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scanns all the zones of the
> + * node and returns the nr_scanned and nr_reclaimed.
> + */
> +/*
> + * Limit of scanning per iteration. For round-robin.
> + */
> +#define MEMCG_BGSCAN_LIMIT     (2048)
> +
> +static void
> +shrink_memcg_node(int nid, int priority, struct scan_control *sc)
> +{
> +       unsigned long total_scanned = 0;
> +       struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +       int i;
> +
> +       /*
> +        * This dma->highmem order is consistant with global reclaim.
> +        * We do this because the page allocator works in the opposite
> +        * direction although memcg user pages are mostly allocated at
> +        * highmem.
> +        */
> +       for (i = 0;
> +            (i < NODE_DATA(nid)->nr_zones) &&
> +            (total_scanned < MEMCG_BGSCAN_LIMIT);
> +            i++) {
> +               struct zone *zone = NODE_DATA(nid)->node_zones + i;
> +               struct zone_reclaim_stat *zrs;
> +               unsigned long scan, rotate;
> +
> +               if (!populated_zone(zone))
> +                       continue;
> +               scan = mem_cgroup_zone_reclaimable_pages(mem_cont, nid, i);
> +               if (!scan)
> +                       continue;
> +               /* If recent memory reclaim on this zone doesn't get good
> */
> +               zrs = get_reclaim_stat(zone, sc);
> +               scan = zrs->recent_scanned[0] + zrs->recent_scanned[1];
> +               rotate = zrs->recent_rotated[0] + zrs->recent_rotated[1];
> +
> +               if (rotate > scan/2)
> +                       sc->may_writepage = 1;
> +
> +               sc->nr_scanned = 0;
> +               shrink_zone(priority, zone, sc);
> +               total_scanned += sc->nr_scanned;
> +               sc->may_writepage = 0;
> +       }
> +       sc->nr_scanned = total_scanned;
> +}
> +
> +/*
> + * Per cgroup background reclaim.
> + */
> +unsigned long shrink_mem_cgroup(struct mem_cgroup *mem)
> +{
> +       int nid, priority, next_prio;
> +       nodemask_t nodes;
> +       unsigned long total_scanned;
> +       struct scan_control sc = {
> +               .gfp_mask = GFP_HIGHUSER_MOVABLE,
>

I noticed this is changed from GFP_KERNEL in the previous patch, and it also
seems memcg reclaim uses this flag on other reclaim paths as well. So it
should be an ok change.

> +               .may_unmap = 1,
> +               .may_swap = 1,
> +               .nr_to_reclaim = SWAP_CLUSTER_MAX,
> +               .order = 0,
> +               .mem_cgroup = mem,
> +       };
> +
> +       sc.may_writepage = 0;
> +       sc.nr_reclaimed = 0;
> +       total_scanned = 0;
> +       nodes = node_states[N_HIGH_MEMORY];
> +       sc.swappiness = mem_cgroup_swappiness(mem);
> +
> +       current->flags |= PF_SWAPWRITE;
>
Why do we set the flag here instead of in the main kswapd function
memcg_bgreclaim()?

> +       /*
> +        * Unlike kswapd, we need to traverse cgroups one by one. So, we
> don't
> +        * use full priority. Just scan small number of pages and visit
> next.
> +        * Now, we scan MEMCG_BGRECLAIM_SCAN_LIMIT pages per scan.
> +        * We use static priority 0.
> +        */
>
This comment is a bit confusing since we are doing reclaim for a single
memcg in this function.

> +       next_prio = min(SWAP_CLUSTER_MAX * num_node_state(N_HIGH_MEMORY),
> +                       MEMCG_BGSCAN_LIMIT/8);
> +       priority = DEF_PRIORITY;
> +       while ((total_scanned < MEMCG_BGSCAN_LIMIT) &&
> +              !nodes_empty(nodes) &&
> +              (sc.nr_to_reclaim > sc.nr_reclaimed)) {
> +
> +               nid = mem_cgroup_select_victim_node(mem, &nodes);
> +               shrink_memcg_node(nid, priority, &sc);
> +               /*
> +                * the node seems to have no pages.
> +                * skip this for a while
> +                */
> +               if (!sc.nr_scanned)
> +                       node_clear(nid, nodes);
> +               total_scanned += sc.nr_scanned;
> +               if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
> +                       break;
> +               /* emulate priority */
> +               if (total_scanned > next_prio) {
> +                       priority--;
> +                       next_prio <<= 1;
> +               }
> +               if (sc.nr_scanned &&
> +                   total_scanned > sc.nr_reclaimed * 2)
> +                       congestion_wait(WRITE, HZ/10);
> +       }
> +       current->flags &= ~PF_SWAPWRITE;
>

hmm, the same question above. why we need to set this flag each time?

--Ying

> +       return sc.nr_reclaimed;
> +}
> +#endif
> +
>  /*
>  * For kswapd, balance_pgdat() will work across all this node's zones until
>  * they are all at high_wmark_pages(zone).
>
>

[-- Attachment #2: Type: text/html, Size: 15043 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-26  8:47             ` KAMEZAWA Hiroyuki
@ 2011-04-26 23:08               ` Ying Han
  2011-04-27  0:34                 ` KAMEZAWA Hiroyuki
  2011-04-28  3:55               ` Ying Han
  1 sibling, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-04-26 23:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 1766 bytes --]

On Tue, Apr 26, 2011 at 1:47 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 26 Apr 2011 01:43:17 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Tue, Apr 26, 2011 at 12:43 AM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Tue, 26 Apr 2011 00:19:46 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > On Mon, Apr 25, 2011 at 6:38 PM, KAMEZAWA Hiroyuki
> > > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > On Mon, 25 Apr 2011 15:21:21 -0700
> > > > > Ying Han <yinghan@google.com> wrote:
>
> >
> > To clarify a bit, my question was meant to account for it but not
> > necessarily to limit it. We can use the existing cpu cgroup to do the cpu
> > limiting, and I am just wondering how to configure it for the memcg kswapd
> > thread.
> >
> > Let's say in the per-memcg-kswapd model, I can echo the kswapd thread pid
> > into the cpu cgroup (the same set of processes as the memcg, but in a cpu
> > limiting cgroup instead). If the kswapd is shared, we might need extra work
> > to account the cpu cycles correspondingly.
> >
>
> Hmm ? Aren't statistics of elapsed_time enough ?
>

I think the stats work for cpu-charging, although we might need to do extra
work to account them for each work item and also charge them to the cpu
cgroup. But it should work for now.
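(A minimal sketch of per-work-item elapsed-time accounting as discussed
above; bgreclaim_time_ns is a hypothetical per-memcg counter, and
shrink_mem_cgroup() is called with the signature from patch 7.)

#include <linux/ktime.h>

static void memcg_bgreclaim_timed(struct mem_cgroup *mem, long required)
{
	ktime_t start = ktime_get();

	shrink_mem_cgroup(mem, required);
	/* accumulate elapsed wall time per memcg (hypothetical field) */
	mem->bgreclaim_time_ns += ktime_to_ns(ktime_sub(ktime_get(), start));
}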

>
> Now, I think a scan/sec limiting interface is more promising than time
> or thread controls. It's easier to understand.

Adding monitoring stats is a good start, like what you have in the
last patch.



> BTW, I think it's better to avoid calling the watermark reclaim work "kswapd".
> It's confusing because we've talked about global reclaim at LSF.
>

Can you clarify that?

--Ying

>
>
> Thanks,
> -Kame
>
>

[-- Attachment #2: Type: text/html, Size: 3010 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/7] memcg bgreclaim core.
  2011-04-26  5:08     ` KAMEZAWA Hiroyuki
@ 2011-04-26 23:15       ` Ying Han
  2011-04-27  0:10         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-04-26 23:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 12847 bytes --]

On Mon, Apr 25, 2011 at 10:08 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 25 Apr 2011 21:59:06 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Mon, Apr 25, 2011 at 2:36 AM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > Following patch will chagnge the logic. This is a core.
> > > ==
> > > This is the main loop of per-memcg background reclaim which is
> implemented in
> > > function balance_mem_cgroup_pgdat().
> > >
> > > The function performs a priority loop similar to global reclaim. During
> each
> > > iteration it frees memory from a selected victim node.
> > > After reclaiming enough pages or scanning enough pages, it returns and
> find
> > > next work with round-robin.
> > >
> > > changelog v8b..v7
> > > 1. reworked for using work_queue rather than threads.
> > > 2. changed shrink_mem_cgroup algorithm to fit workqueue. In short,
> avoid
> > >   long running and allow quick round-robin and unnecessary write page.
> > >   When a thread make pages dirty continuously, write back them by
> flusher
> > >   is far faster than writeback by background reclaim. This detail will
> > >   be fixed when dirty_ratio implemented. The logic around this will be
> > >   revisited in following patche.
> > >
> > > Signed-off-by: Ying Han <yinghan@google.com>
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  include/linux/memcontrol.h |   11 ++++
> > >  mm/memcontrol.c            |   44 ++++++++++++++---
> > >  mm/vmscan.c                |  115
> +++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 162 insertions(+), 8 deletions(-)
> > >
> > > Index: memcg/include/linux/memcontrol.h
> > > ===================================================================
> > > --- memcg.orig/include/linux/memcontrol.h
> > > +++ memcg/include/linux/memcontrol.h
> > > @@ -89,6 +89,8 @@ extern int mem_cgroup_last_scanned_node(
> > >  extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> > >                                        const nodemask_t *nodes);
> > >
> > > +unsigned long shrink_mem_cgroup(struct mem_cgroup *mem);
> > > +
> > >  static inline
> > >  int mm_match_cgroup(const struct mm_struct *mm, const struct
> mem_cgroup *cgroup)
> > >  {
> > > @@ -112,6 +114,9 @@ extern void mem_cgroup_end_migration(str
> > >  */
> > >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> > >  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> > > +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
> > > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup
> *memcg,
> > > +                               int nid, int zone_idx);
> > >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> > >                                       struct zone *zone,
> > >                                       enum lru_list lru);
> > > @@ -310,6 +315,12 @@ mem_cgroup_inactive_file_is_low(struct m
> > >  }
> > >
> > >  static inline unsigned long
> > > +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid,
> int zone_idx)
> > > +{
> > > +       return 0;
> > > +}
> > > +
> > > +static inline unsigned long
> > >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> > >                         enum lru_list lru)
> > >  {
> > > Index: memcg/mm/memcontrol.c
> > > ===================================================================
> > > --- memcg.orig/mm/memcontrol.c
> > > +++ memcg/mm/memcontrol.c
> > > @@ -1166,6 +1166,23 @@ int mem_cgroup_inactive_file_is_low(stru
> > >        return (active > inactive);
> > >  }
> > >
> > > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup
> *memcg,
> > > +                                               int nid, int zone_idx)
> > > +{
> > > +       int nr;
> > > +       struct mem_cgroup_per_zone *mz =
> > > +               mem_cgroup_zoneinfo(memcg, nid, zone_idx);
> > > +
> > > +       nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> > > +            MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> > > +
> > > +       if (nr_swap_pages > 0)
> > > +               nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> > > +                     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> > > +
> > > +       return nr;
> > > +}
> > > +
> > >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> > >                                       struct zone *zone,
> > >                                       enum lru_list lru)
> > > @@ -1286,7 +1303,7 @@ static unsigned long mem_cgroup_margin(s
> > >        return margin >> PAGE_SHIFT;
> > >  }
> > >
> > > -static unsigned int get_swappiness(struct mem_cgroup *memcg)
> > > +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> > >  {
> > >        struct cgroup *cgrp = memcg->css.cgroup;
> > >
> > > @@ -1595,14 +1612,15 @@ static int mem_cgroup_hierarchical_recla
> > >                /* we use swappiness of local cgroup */
> > >                if (check_soft) {
> > >                        ret = mem_cgroup_shrink_node_zone(victim,
> gfp_mask,
> > > -                               noswap, get_swappiness(victim), zone,
> > > +                               noswap, mem_cgroup_swappiness(victim),
> zone,
> > >                                &nr_scanned);
> > >                        *total_scanned += nr_scanned;
> > >                        mem_cgroup_soft_steal(victim, ret);
> > >                        mem_cgroup_soft_scan(victim, nr_scanned);
> > >                } else
> > >                        ret = try_to_free_mem_cgroup_pages(victim,
> gfp_mask,
> > > -                                               noswap,
> get_swappiness(victim));
> > > +                                               noswap,
> > > +
> mem_cgroup_swappiness(victim));
> > >                css_put(&victim->css);
> > >                /*
> > >                 * At shrinking usage, we can't check we should stop
> here or
> > > @@ -1628,15 +1646,25 @@ static int mem_cgroup_hierarchical_recla
> > >  int
> > >  mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t
> *nodes)
> > >  {
> > > -       int next_nid;
> > > +       int next_nid, i;
> > >        int last_scanned;
> > >
> > >        last_scanned = mem->last_scanned_node;
> > > -       next_nid = next_node(last_scanned, *nodes);
> > > +       next_nid = last_scanned;
> > > +rescan:
> > > +       next_nid = next_node(next_nid, *nodes);
> > >
> > >        if (next_nid == MAX_NUMNODES)
> > >                next_nid = first_node(*nodes);
> > >
> > > +       /* If no page on this node, skip */
> > > +       for (i = 0; i < MAX_NR_ZONES; i++)
> > > +               if (mem_cgroup_zone_reclaimable_pages(mem, next_nid,
> i))
> > > +                       break;
> > > +
> > > +       if (next_nid != last_scanned && (i == MAX_NR_ZONES))
> > > +               goto rescan;
> > > +
> > >        mem->last_scanned_node = next_nid;
> > >
> > >        return next_nid;
> > > @@ -3649,7 +3677,7 @@ try_to_free:
> > >                        goto out;
> > >                }
> > >                progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> > > -                                               false,
> get_swappiness(mem));
> > > +                                       false,
> mem_cgroup_swappiness(mem));
> > >                if (!progress) {
> > >                        nr_retries--;
> > >                        /* maybe some writeback is necessary */
> > > @@ -4073,7 +4101,7 @@ static u64 mem_cgroup_swappiness_read(st
> > >  {
> > >        struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > >
> > > -       return get_swappiness(memcg);
> > > +       return mem_cgroup_swappiness(memcg);
> > >  }
> > >
> > >  static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct
> cftype *cft,
> > > @@ -4849,7 +4877,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > >        INIT_LIST_HEAD(&mem->oom_notify);
> > >
> > >        if (parent)
> > > -               mem->swappiness = get_swappiness(parent);
> > > +               mem->swappiness = mem_cgroup_swappiness(parent);
> > >        atomic_set(&mem->refcnt, 1);
> > >        mem->move_charge_at_immigrate = 0;
> > >        mutex_init(&mem->thresholds_lock);
> > > Index: memcg/mm/vmscan.c
> > > ===================================================================
> > > --- memcg.orig/mm/vmscan.c
> > > +++ memcg/mm/vmscan.c
> > > @@ -42,6 +42,7 @@
> > >  #include <linux/delayacct.h>
> > >  #include <linux/sysctl.h>
> > >  #include <linux/oom.h>
> > > +#include <linux/res_counter.h>
> > >
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/div64.h>
> > > @@ -2308,6 +2309,120 @@ static bool sleeping_prematurely(pg_data
> > >                return !all_zones_ok;
> > >  }
> > >
> > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > +/*
> > > + * The function is used for per-memcg LRU. It scanns all the zones of
> the
> > > + * node and returns the nr_scanned and nr_reclaimed.
> > > + */
> > > +/*
> > > + * Limit of scanning per iteration. For round-robin.
> > > + */
> > > +#define MEMCG_BGSCAN_LIMIT     (2048)
> > > +
> > > +static void
> > > +shrink_memcg_node(int nid, int priority, struct scan_control *sc)
> > > +{
> > > +       unsigned long total_scanned = 0;
> > > +       struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > > +       int i;
> > > +
> > > +       /*
> > > +        * This dma->highmem order is consistant with global reclaim.
> > > +        * We do this because the page allocator works in the opposite
> > > +        * direction although memcg user pages are mostly allocated at
> > > +        * highmem.
> > > +        */
> > > +       for (i = 0;
> > > +            (i < NODE_DATA(nid)->nr_zones) &&
> > > +            (total_scanned < MEMCG_BGSCAN_LIMIT);
> > > +            i++) {
> > > +               struct zone *zone = NODE_DATA(nid)->node_zones + i;
> > > +               struct zone_reclaim_stat *zrs;
> > > +               unsigned long scan, rotate;
> > > +
> > > +               if (!populated_zone(zone))
> > > +                       continue;
> > > +               scan = mem_cgroup_zone_reclaimable_pages(mem_cont, nid,
> i);
> > > +               if (!scan)
> > > +                       continue;
> > > +               /* If recent memory reclaim on this zone doesn't get
> good */
> > > +               zrs = get_reclaim_stat(zone, sc);
> > > +               scan = zrs->recent_scanned[0] + zrs->recent_scanned[1];
> > > +               rotate = zrs->recent_rotated[0] +
> zrs->recent_rotated[1];
> > > +
> > > +               if (rotate > scan/2)
> > > +                       sc->may_writepage = 1;
> > > +
> > > +               sc->nr_scanned = 0;
> > > +               shrink_zone(priority, zone, sc);
> > > +               total_scanned += sc->nr_scanned;
> > > +               sc->may_writepage = 0;
> > > +       }
> > > +       sc->nr_scanned = total_scanned;
> > > +}
> >
> > I see the MEMCG_BGSCAN_LIMIT is a newly defined macro from previous
> > post. So, now the number of pages to scan is capped on 2k for each
> > memcg, and does it make difference on big vs small cgroup?
> >
>
> Now, no difference. One reason is that low_watermark - high_watermark is
> limited to 4MB at most. It should be a static 4MB in many cases, and 2048
> pages is for scanning 8MB, twice the low_wmark - high_wmark gap. Another
> reason is that I didn't have enough time to consider tuning this.
> With MEMCG_BGSCAN_LIMIT, round-robin can be simply fair and I think it's a
> good starting point.
>

I can see a problem here with being "fair" to each memcg. Containers have
different sizes and run different workloads. Some of them are more sensitive
to latency than others, so they are willing to pay more cpu cycles for
background reclaim.

So here we fix the amount of work per memcg, and the performance of those
jobs will be hurt. If I understand correctly, we only have one work item on
the workqueue per memcg, which means we can only reclaim that amount of pages
per iteration. And if the queue is big, those jobs (heavy memory allocators
willing to pay cpu for bg reclaim) will hit direct reclaim more than
necessary.

--Ying
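(One possible way to address the fairness point above, purely as an
illustration: scale the per-pass scan budget by a hypothetical per-memcg
weight instead of using a fixed MEMCG_BGSCAN_LIMIT for every group.)

static unsigned long memcg_bgscan_budget(struct mem_cgroup *mem)
{
	/*
	 * bgreclaim_weight is a hypothetical tunable (say 1..8, default 1):
	 * groups willing to spend more cpu on background reclaim get a
	 * larger scan budget per round-robin pass.
	 */
	return MEMCG_BGSCAN_LIMIT * mem->bgreclaim_weight;
}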

>
> If memory eater enough slow (because the threads needs to do some
> work on allocated memory), this shrink_mem_cgroup() works fine and
> helps to avoid hitting limit. Here, the amount of dirty pages is
> troublesome.
>
> The penaly for cpu eating (hard-to-reclaim) cgroup is given by 'delay'.
> (see patch 7.) This patch's congestion_wait is too bad and will be replaced
> in patch 7 as 'delay'. In short, if memcg scanning seems to be not
> successful,
> it gets HZ/10 delay until the next work.
>
> If we have dirty_ratio + I/O less dirty throttling, I think we'll see much
> better fairness on this watermark reclaim round robin.
>
>
> Thanks,
> -Kame
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 16066 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/7] memcg watermark reclaim workqueue.
  2011-04-25  9:42 ` [PATCH 7/7] memcg watermark reclaim workqueue KAMEZAWA Hiroyuki
@ 2011-04-26 23:19   ` Ying Han
  2011-04-27  0:31     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-04-26 23:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 10524 bytes --]

On Mon, Apr 25, 2011 at 2:42 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> By default the per-memcg background reclaim is disabled when the
> limit_in_bytes
> is set the maximum. The kswapd_run() is called when the memcg is being
> resized,
> and kswapd_stop() is called when the memcg is being deleted.
>
> The per-memcg kswapd is waked up based on the usage and low_wmark, which is
> checked once per 1024 increments per cpu. The memcg's kswapd is waked up if
> the
> usage is larger than the low_wmark.
>
> At each iteration of work, the work frees memory at most 2048 pages and
> switch
> to next work for round robin. And if the memcg seems congested, it adds
> delay for the next work.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    2 -
>  mm/memcontrol.c            |   86
> +++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |   23 +++++++-----
>  3 files changed, 102 insertions(+), 9 deletions(-)
>
> Index: memcg/mm/memcontrol.c
> ===================================================================
> --- memcg.orig/mm/memcontrol.c
> +++ memcg/mm/memcontrol.c
> @@ -111,10 +111,12 @@ enum mem_cgroup_events_index {
>  enum mem_cgroup_events_target {
>        MEM_CGROUP_TARGET_THRESH,
>        MEM_CGROUP_TARGET_SOFTLIMIT,
> +       MEM_CGROUP_WMARK_EVENTS_THRESH,
>        MEM_CGROUP_NTARGETS,
>  };
>  #define THRESHOLDS_EVENTS_TARGET (128)
>  #define SOFTLIMIT_EVENTS_TARGET (1024)
> +#define WMARK_EVENTS_TARGET (1024)
>
>  struct mem_cgroup_stat_cpu {
>        long count[MEM_CGROUP_STAT_NSTATS];
> @@ -267,6 +269,11 @@ struct mem_cgroup {
>        struct list_head oom_notify;
>
>        /*
> +        * For high/low watermark.
> +        */
> +       bool                    bgreclaim_resched;
> +       struct delayed_work     bgreclaim_work;
> +       /*
>         * Should we move charges of a task when a task is moved into this
>         * mem_cgroup ? And what type of charges should we move ?
>         */
> @@ -374,6 +381,8 @@ static void mem_cgroup_put(struct mem_cg
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  static void drain_all_stock_async(void);
>
> +static void wake_memcg_kswapd(struct mem_cgroup *mem);
> +
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
>  {
> @@ -552,6 +561,12 @@ mem_cgroup_largest_soft_limit_node(struc
>        return mz;
>  }
>
> +static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
> +{
> +       if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
> +               wake_memcg_kswapd(mem);
> +}
> +
>  /*
>  * Implementation Note: reading percpu statistics for memcg.
>  *
> @@ -702,6 +717,9 @@ static void __mem_cgroup_target_update(s
>        case MEM_CGROUP_TARGET_SOFTLIMIT:
>                next = val + SOFTLIMIT_EVENTS_TARGET;
>                break;
> +       case MEM_CGROUP_WMARK_EVENTS_THRESH:
> +               next = val + WMARK_EVENTS_TARGET;
> +               break;
>        default:
>                return;
>        }
> @@ -725,6 +743,10 @@ static void memcg_check_events(struct me
>                        __mem_cgroup_target_update(mem,
>                                MEM_CGROUP_TARGET_SOFTLIMIT);
>                }
> +               if (unlikely(__memcg_event_check(mem,
> +                       MEM_CGROUP_WMARK_EVENTS_THRESH))){
> +                       mem_cgroup_check_wmark(mem);
> +               }
>        }
>  }
>
> @@ -3661,6 +3683,67 @@ unsigned long mem_cgroup_soft_limit_recl
>        return nr_reclaimed;
>  }
>
> +struct workqueue_struct *memcg_bgreclaimq;
> +
> +static int memcg_bgreclaim_init(void)
> +{
> +       /*
> +        * use UNBOUND workqueue because we traverse nodes (no locality)
> and
> +        * the work is cpu-intensive.
> +        */
> +       memcg_bgreclaimq = alloc_workqueue("memcg",
> +                       WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
> +       return 0;
> +}
>

I read the workqueue documentation. So WQ_UNBOUND supports up to 512
execution contexts (or 4 * num_possible_cpus(), if larger). Does "execution
context" mean a thread?

I think I understand the motivation of that flag: we can get more concurrency
for the bg reclaim work items. But one question is about the workqueue
scheduling mechanism. If we can queue an item anywhere as long as it is
inserted in the queue, do we have a mechanism to support load balancing like
the system scheduler does? The scenario I am thinking of is that one CPU has
512 work items and another one has 1.

I don't think this is a directly related issue for this patch; I just hope
the workqueue mechanism already supports something like that for load
balancing.

--Ying



> +module_init(memcg_bgreclaim_init);
> +
> +static void memcg_bgreclaim(struct work_struct *work)
> +{
> +       struct delayed_work *dw = to_delayed_work(work);
> +       struct mem_cgroup *mem =
> +               container_of(dw, struct mem_cgroup, bgreclaim_work);
> +       int delay = 0;
> +       unsigned long long required, usage, hiwat;
> +
> +       hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
> +       usage = res_counter_read_u64(&mem->res, RES_USAGE);
> +       required = usage - hiwat;
> +       if (required >= 0)  {
> +               required = ((usage - hiwat) >> PAGE_SHIFT) + 1;
> +               delay = shrink_mem_cgroup(mem, (long)required);
> +       }
> +       if (!mem->bgreclaim_resched  ||
> +               mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH)) {
> +               cgroup_release_and_wakeup_rmdir(&mem->css);
> +               return;
> +       }
> +       /* need reschedule */
> +       if (!queue_delayed_work(memcg_bgreclaimq, &mem->bgreclaim_work,
> delay))
> +               cgroup_release_and_wakeup_rmdir(&mem->css);
> +}
> +
> +static void wake_memcg_kswapd(struct mem_cgroup *mem)
> +{
> +       if (delayed_work_pending(&mem->bgreclaim_work))
> +               return;
> +       cgroup_exclude_rmdir(&mem->css);
> +       if (!queue_delayed_work(memcg_bgreclaimq, &mem->bgreclaim_work, 0))
> +               cgroup_release_and_wakeup_rmdir(&mem->css);
> +       return;
> +}
> +
> +static void stop_memcg_kswapd(struct mem_cgroup *mem)
> +{
> +       /*
> +        * at destroy(), there is no task and we don't need to take care of
> +        * new bgreclaim work queued. But we need to prevent it from
> reschedule
> +        * use bgreclaim_resched to tell no more reschedule.
> +        */
> +       mem->bgreclaim_resched = false;
> +       flush_delayed_work(&mem->bgreclaim_work);
> +       mem->bgreclaim_resched = true;
> +}
> +
>  /*
>  * This routine traverse page_cgroup in given list and drop them all.
>  * *And* this routine doesn't reclaim page itself, just removes
> page_cgroup.
> @@ -3742,6 +3825,7 @@ move_account:
>                ret = -EBUSY;
>                if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
>                        goto out;
> +               stop_memcg_kswapd(mem);
>                ret = -EINTR;
>                if (signal_pending(current))
>                        goto out;
> @@ -4804,6 +4888,8 @@ static struct mem_cgroup *mem_cgroup_all
>        if (!mem->stat)
>                goto out_free;
>        spin_lock_init(&mem->pcp_counter_lock);
> +       INIT_DELAYED_WORK(&mem->bgreclaim_work, memcg_bgreclaim);
> +       mem->bgreclaim_resched = true;
>        return mem;
>
>  out_free:
> Index: memcg/include/linux/memcontrol.h
> ===================================================================
> --- memcg.orig/include/linux/memcontrol.h
> +++ memcg/include/linux/memcontrol.h
> @@ -89,7 +89,7 @@ extern int mem_cgroup_last_scanned_node(
>  extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
>                                        const nodemask_t *nodes);
>
> -unsigned long shrink_mem_cgroup(struct mem_cgroup *mem);
> +int shrink_mem_cgroup(struct mem_cgroup *mem, long required);
>
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup
> *cgroup)
> Index: memcg/mm/vmscan.c
> ===================================================================
> --- memcg.orig/mm/vmscan.c
> +++ memcg/mm/vmscan.c
> @@ -2373,20 +2373,19 @@ shrink_memcg_node(int nid, int priority,
>  /*
>  * Per cgroup background reclaim.
>  */
> -unsigned long shrink_mem_cgroup(struct mem_cgroup *mem)
> +int shrink_mem_cgroup(struct mem_cgroup *mem, long required)
>  {
> -       int nid, priority, next_prio;
> +       int nid, priority, next_prio, delay;
>        nodemask_t nodes;
>        unsigned long total_scanned;
>        struct scan_control sc = {
>                .gfp_mask = GFP_HIGHUSER_MOVABLE,
>                .may_unmap = 1,
>                .may_swap = 1,
> -               .nr_to_reclaim = SWAP_CLUSTER_MAX,
>                .order = 0,
>                .mem_cgroup = mem,
>        };
> -
> +       /* writepage will be set later per zone */
>        sc.may_writepage = 0;
>        sc.nr_reclaimed = 0;
>        total_scanned = 0;
> @@ -2400,9 +2399,12 @@ unsigned long shrink_mem_cgroup(struct m
>         * Now, we scan MEMCG_BGRECLAIM_SCAN_LIMIT pages per scan.
>         * We use static priority 0.
>         */
> +       sc.nr_to_reclaim = min(required, (long)MEMCG_BGSCAN_LIMIT/2);
>        next_prio = min(SWAP_CLUSTER_MAX * num_node_state(N_HIGH_MEMORY),
>                        MEMCG_BGSCAN_LIMIT/8);
>        priority = DEF_PRIORITY;
> +       /* delay for next work at congestion */
> +       delay = HZ/10;
>        while ((total_scanned < MEMCG_BGSCAN_LIMIT) &&
>               !nodes_empty(nodes) &&
>               (sc.nr_to_reclaim > sc.nr_reclaimed)) {
> @@ -2423,12 +2425,17 @@ unsigned long shrink_mem_cgroup(struct m
>                        priority--;
>                        next_prio <<= 1;
>                }
> -               if (sc.nr_scanned &&
> -                   total_scanned > sc.nr_reclaimed * 2)
> -                       congestion_wait(WRITE, HZ/10);
> +               /* give up early ? */
> +               if (total_scanned > MEMCG_BGSCAN_LIMIT/8 &&
> +                   total_scanned > sc.nr_reclaimed * 4)
> +                       goto out;
>        }
> +       /* We scanned enough...If we reclaimed half of requested, no delay */
> +       if (sc.nr_reclaimed > sc.nr_to_reclaim/2)
> +               delay = 0;
> +out:
>        current->flags &= ~PF_SWAPWRITE;
> -       return sc.nr_reclaimed;
> +       return delay;
>  }
>  #endif
>
>
>


* Re: [PATCH 5/7] memcg bgreclaim core.
  2011-04-26 23:15       ` Ying Han
@ 2011-04-27  0:10         ` KAMEZAWA Hiroyuki
  2011-04-27  1:01           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-27  0:10 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Tue, 26 Apr 2011 16:15:04 -0700
Ying Han <yinghan@google.com> wrote:

> On Mon, Apr 25, 2011 at 10:08 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > > I see the MEMCG_BGSCAN_LIMIT is a newly defined macro from previous
> > > post. So, now the number of pages to scan is capped on 2k for each
> > > memcg, and does it make difference on big vs small cgroup?
> > >
> >
> > Now, no difference. One reason is because low_watermark - high_watermark is
> > limited to 4MB, at most. It should be static 4MB in many cases and 2048
> > pages
> > is for scanning 8MB, twice of low_wmark - high_wmark. Another reason is
> > that I didn't have enough time for considering to tune this.
> > By MEMCG_BGSCAN_LIMIT, round-robin can be simply fair and I think it's a
> > good start point.
> >
> 
> I can see a problem here to be "fair" to each memcg. Each container has
> different sizes and running with
> different workloads. Some of them are more sensitive with latency than the
> other, so they are willing to pay
> more cpu cycles to do background reclaim.
> 

Hmm, I think support for that can be added easily. But...

> So, here we fix the amount of work per-memcg, and the performance for those
> jobs will be hurt. If i understand
> correctly, we only have one workitem on the workqueue per memcg. So which
> means we can only reclaim those amount of pages for each iteration. And if
> the queue is big, those jobs(heavy memory allocating, and willing to pay cpu
> to do bg reclaim) will hit direct reclaim more than necessary.
> 

But, from measurements, we cannot reclaim enough memory in time if the workload
is busy. Do you think bgreclaim could keep 'make -j 8' from hitting the limit?

'Working hard' just adds more CPU consumption and results in more latency.
From my point of view, if direct reclaim has problematic costs, bgreclaim is
not easy and is slow, too. Then, 'working harder' cannot help. And a spike of
memory consumption can be very rapid: if an application execs an application
which does malloc(2G) under a 1G-limit memcg, we cannot avoid direct reclaim.

I think the user can set the limit higher and make the distance between limit <-> wmark large.
Then, he can gain more time and avoid hitting direct reclaim. How about enlarging the
limit <-> wmark range for performance-intensive jobs?
The amount of work per memcg is the limit <-> wmark range, I guess.
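
For reference, a rough sketch of how that headroom could map to one bgreclaim
iteration. This is only illustrative (the real work function is in patch 5/7
and not quoted here); the 'required' calculation below is an assumption of
mine, not taken from the patch:

  /*
   * Illustrative only: derive this iteration's reclaim target from the
   * usage <-> high watermark headroom (assuming usage is above hiwat)
   * and hand it to shrink_mem_cgroup(), which caps it internally.
   */
  u64 usage = res_counter_read_u64(&mem->res, RES_USAGE);
  u64 hiwat = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
  long required = (long)((usage - hiwat) >> PAGE_SHIFT);
  int delay = shrink_mem_cgroup(mem, required);
  /* requeue the delayed work with 'delay' jiffies; 0 means no congestion */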

Thanks,
-Kame












* Re: [PATCH 7/7] memcg watermark reclaim workqueue.
  2011-04-26 23:19   ` Ying Han
@ 2011-04-27  0:31     ` KAMEZAWA Hiroyuki
  2011-04-27  3:40       ` Ying Han
  0 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-27  0:31 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Tue, 26 Apr 2011 16:19:41 -0700
Ying Han <yinghan@google.com> wrote:

> On Mon, Apr 25, 2011 at 2:42 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > @@ -3661,6 +3683,67 @@ unsigned long mem_cgroup_soft_limit_recl
> >        return nr_reclaimed;
> >  }
> >
> > +struct workqueue_struct *memcg_bgreclaimq;
> > +
> > +static int memcg_bgreclaim_init(void)
> > +{
> > +       /*
> > +        * use UNBOUND workqueue because we traverse nodes (no locality) and
> > +        * the work is cpu-intensive.
> > +        */
> > +       memcg_bgreclaimq = alloc_workqueue("memcg",
> > +                       WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
> > +       return 0;
> > +}
> >
> 
> I read about the documentation of workqueue. So the WQ_UNBOUND support the
> max 512 execution contexts per CPU. Does the execution context means thread?
> 
> I think I understand the motivation of that flag, so we can have more
> concurrency of bg reclaim workitems. But one question is on the workqueue
> scheduling mechanism. If we can queue the item anywhere as long as they are
> inserted in the queue, do we have mechanism to support the load balancing
> like the system scheduler? The scenario I am thinking is that one CPU has
> 512 work items and the other one has 1.
> 
IIUC, an UNBOUND workqueue doesn't have a cpumask and its work can be scheduled anywhere.
So, the scheduler's load balancing works well.

Because unbound_gcwq_nr_running is always 0 (if I believe the comment in the source),
__need_more_worker() always returns true and
need_to_create_worker() returns true if there is no idle thread.

Then, I think a new kthread is always created if there is work.

I wonder if I should use WQ_CPU_INTENSIVE and spread the jobs to each cpu per memcg. But
I don't see a problem with the UNBOUND wq yet.
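
For illustration, the WQ_CPU_INTENSIVE variant would only change the allocation
flags; this is just a sketch of the alternative mentioned above, not something
from the patch:

  /* hypothetical alternative: per-cpu bound queue marked cpu-intensive */
  memcg_bgreclaimq = alloc_workqueue("memcg",
                  WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE | WQ_FREEZABLE, 0);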


> I don't think this is directly related issue for this patch, and I just hope
> the workqueue mechanism already support something like that for load
> balancing.
> 
If not, we can add it.

Thanks,
-Kame


* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-26 23:08               ` Ying Han
@ 2011-04-27  0:34                 ` KAMEZAWA Hiroyuki
  2011-04-27  1:19                   ` Ying Han
  0 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-27  0:34 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

On Tue, 26 Apr 2011 16:08:38 -0700
Ying Han <yinghan@google.com> wrote:

> On Tue, Apr 26, 2011 at 1:47 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > BTW, I think it's better to avoid the watermark reclaim work as kswapd.
> > It's confusing because we've talked about global reclaim at LSF.
> >
> 
> Can you clarify that?
> 

Maybe I should have written "it's better to avoid calling the watermark work 'kswapd'".

Many guys talk about soft-limit and removing the LRU when talking about kswapd or
background reclaim ;)


Thanks,
-Kame


* Re: [PATCH 5/7] memcg bgreclaim core.
  2011-04-27  0:10         ` KAMEZAWA Hiroyuki
@ 2011-04-27  1:01           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-27  1:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Wed, 27 Apr 2011 09:10:30 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 26 Apr 2011 16:15:04 -0700
> Ying Han <yinghan@google.com> wrote:
> > So, here we fix the amount of work per-memcg, and the performance for those
> > jobs will be hurt. If i understand
> > correctly, we only have one workitem on the workqueue per memcg. So which
> > means we can only reclaim those amount of pages for each iteration. And if
> > the queue is big, those jobs(heavy memory allocating, and willing to pay cpu
> > to do bg reclaim) will hit direct reclaim more than necessary.
> > 
> 
> But, from measurements, we cannot reclaim enough memory on time if the work
> is busy. Can you think of 'make -j 8' doesn't hit the limit by bgreclaim ?
> 
> 'Working hard' just adds more CPU consumption and results more latency.
> From my point of view, if direct reclaim has problematic costs, bgreclaim is
> not easy and slow, too. Then, 'work harder' cannot be help. And spike of
> memory consumption can be very rapid. If an application exec an application
> which does malloc(2G), under 1G limit memcg, we cannot avoid direct reclaim.
> 
> I think the user can set limit higher and distance between limit <-> wmark large.
> Then, he can gain more time and avoid hitting direct relcaim. How about enlarging
> limit <-> wmark range for performance intensive jobs ?
> Amount of work per memcg is limit <-> wmark range, I guess.
> 

BTW, as another idea, I wonder if I should limit the work items by reducing max_active,
because they may burn cpu. If we need to, we can have 2 workqueues of high/low priority:
the high workqueue gets a big max_active (0?) and the low workqueue a small max_active.
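
A minimal sketch of that two-queue idea, purely for illustration; the queue
names and max_active values below are made up, not taken from the patch:

  static struct workqueue_struct *memcg_bgreclaimq_hi;
  static struct workqueue_struct *memcg_bgreclaimq_lo;

  static int memcg_bgreclaim_init(void)
  {
          /* latency-sensitive memcgs: no artificial concurrency cap */
          memcg_bgreclaimq_hi = alloc_workqueue("memcg_hi",
                          WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
          /* best-effort memcgs: small max_active to bound cpu usage */
          memcg_bgreclaimq_lo = alloc_workqueue("memcg_lo",
                          WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 2);
          if (!memcg_bgreclaimq_hi || !memcg_bgreclaimq_lo)
                  return -ENOMEM;
          return 0;
  }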

Thanks,
-Kame




* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-27  0:34                 ` KAMEZAWA Hiroyuki
@ 2011-04-27  1:19                   ` Ying Han
  0 siblings, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-04-27  1:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

On Tue, Apr 26, 2011 at 5:34 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 26 Apr 2011 16:08:38 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Tue, Apr 26, 2011 at 1:47 AM, KAMEZAWA Hiroyuki <
>> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> > BTW, I think it's better to avoid the watermark reclaim work as kswapd.
>> > It's confusing because we've talked about global reclaim at LSF.
>> >
>>
>> Can you clarify that?
>>
>
> Maybe I should write "it's better to avoid calling watermark work as kswapd"
>
> Many guys talk about soft-limit and removing LRU at talking about kswapd or
> bacground reclaim ;)

Ok, thanks :)

--Ying
>
>
> Thanks,
> -Kame
>
>


* Re: [PATCH 7/7] memcg watermark reclaim workqueue.
  2011-04-27  0:31     ` KAMEZAWA Hiroyuki
@ 2011-04-27  3:40       ` Ying Han
  0 siblings, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-04-27  3:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Tue, Apr 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 26 Apr 2011 16:19:41 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Mon, Apr 25, 2011 at 2:42 AM, KAMEZAWA Hiroyuki <
>> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> > @@ -3661,6 +3683,67 @@ unsigned long mem_cgroup_soft_limit_recl
>> >        return nr_reclaimed;
>> >  }
>> >
>> > +struct workqueue_struct *memcg_bgreclaimq;
>> > +
>> > +static int memcg_bgreclaim_init(void)
>> > +{
>> > +       /*
> >> > +        * use UNBOUND workqueue because we traverse nodes (no locality) and
>> > +        * the work is cpu-intensive.
>> > +        */
>> > +       memcg_bgreclaimq = alloc_workqueue("memcg",
>> > +                       WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
>> > +       return 0;
>> > +}
>> >
>>
>> I read about the documentation of workqueue. So the WQ_UNBOUND support the
>> max 512 execution contexts per CPU. Does the execution context means thread?
>>
>> I think I understand the motivation of that flag, so we can have more
>> concurrency of bg reclaim workitems. But one question is on the workqueue
>> scheduling mechanism. If we can queue the item anywhere as long as they are
>> inserted in the queue, do we have mechanism to support the load balancing
>> like the system scheduler? The scenario I am thinking is that one CPU has
>> 512 work items and the other one has 1.
>>
> IIUC, UNBOUND workqueue doesn't have cpumask and it can be scheduled anywhere.
> So, scheduler's load balancing works well.
>
> Because unbound_gcwq_nr_running == 0 always (If I believe comment on source),
>  __need_more_worker() always returns true and
> need_to_create_worker() returns true if no idle thread.
>
> Then, I think new kthread is created always if there is a work.

Ah, ok. Then this works better than I thought, so we can use the
scheduler to put threads onto the CPUs.

>
> I wonder I shoud use WQ_CPU_INTENSIVE and spread jobs to each cpu per memcg. But
> I don't see problem with UNBOUND wq, yet.

I think the UNBOUND is good to start with.

>
>
>> I don't think this is directly related issue for this patch, and I just hope
>> the workqueue mechanism already support something like that for load
>> balancing.
>>
> If not, we can add it.

So, you might have already answered my question. The load balancing is done
by the system scheduler since we fork a new thread for the queued items.

--Ying
>
> Thanks,
> -Kame
>
>


* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-26  8:47             ` KAMEZAWA Hiroyuki
  2011-04-26 23:08               ` Ying Han
@ 2011-04-28  3:55               ` Ying Han
  2011-04-28  4:05                 ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-04-28  3:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

On Tue, Apr 26, 2011 at 1:47 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 26 Apr 2011 01:43:17 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Tue, Apr 26, 2011 at 12:43 AM, KAMEZAWA Hiroyuki <
>> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> > On Tue, 26 Apr 2011 00:19:46 -0700
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> > > On Mon, Apr 25, 2011 at 6:38 PM, KAMEZAWA Hiroyuki
>> > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > > > On Mon, 25 Apr 2011 15:21:21 -0700
>> > > > Ying Han <yinghan@google.com> wrote:
>
>>
>> > To clarify a bit, my question was meant to account it but not necessary to
>> > limit it. We can use existing cpu cgroup to do the cpu limiting, and I am
>> >
>> just wondering how to configure it for the memcg kswapd thread.
>>
>>    Let's say in the per-memcg-kswapd model, i can echo the kswapd thread pid
>> into the cpu cgroup ( the same set of process of memcg, but in a cpu
>> limiting cgroup instead).  If the kswapd is shared, we might need extra work
>> to account the cpu cycles correspondingly.
>>
>
> Hm ? statistics of elapsed_time isn't enough ?
>
> Now, I think limiting scan/sec interface is more promissing rather than time
> or thread controls. It's easier to understand.

I think that will work for the cpu accounting, by recording the
elapsed_time per memcg work item.

But we might still need the cpu throttling as well. To give one use
case from google, we'd rather kill a low-priority job for running
tight on memory than have its reclaim thread affect the latency of a
high-priority job. It is quite easy to see how to accomplish that in
the per-memcg-per-kswapd model, but harder in the shared workqueue
model: it is straightforward to read the cpu usage via cpuacct.usage*
and limit the cpu usage by setting cpu.shares. One concern we have
here is that the scan/sec implementation will make things quite complex.

--Ying

>
> BTW, I think it's better to avoid the watermark reclaim work as kswapd.
> It's confusing because we've talked about global reclaim at LSF.
>
>
> Thanks,
> -Kame
>
>


* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-28  3:55               ` Ying Han
@ 2011-04-28  4:05                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-28  4:05 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

On Wed, 27 Apr 2011 20:55:49 -0700
Ying Han <yinghan@google.com> wrote:

> On Tue, Apr 26, 2011 at 1:47 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 26 Apr 2011 01:43:17 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Tue, Apr 26, 2011 at 12:43 AM, KAMEZAWA Hiroyuki <
> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >>
> >> > On Tue, 26 Apr 2011 00:19:46 -0700
> >> > Ying Han <yinghan@google.com> wrote:
> >> >
> >> > > On Mon, Apr 25, 2011 at 6:38 PM, KAMEZAWA Hiroyuki
> >> > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > > > On Mon, 25 Apr 2011 15:21:21 -0700
> >> > > > Ying Han <yinghan@google.com> wrote:
> >
> >>
> >> > To clarify a bit, my question was meant to account it but not necessary to
> >> > limit it. We can use existing cpu cgroup to do the cpu limiting, and I am
> >> >
> >> just wondering how to configure it for the memcg kswapd thread.
> >>
> >>    Let's say in the per-memcg-kswapd model, i can echo the kswapd thread pid
> >> into the cpu cgroup ( the same set of process of memcg, but in a cpu
> >> limiting cgroup instead).  If the kswapd is shared, we might need extra work
> >> to account the cpu cycles correspondingly.
> >>
> >
> > Hm ? statistics of elapsed_time isn't enough ?
> >
> > Now, I think limiting scan/sec interface is more promissing rather than time
> > or thread controls. It's easier to understand.
> 
> I think it will work on the cpu accounting by recording the
> elapsed_time per memcg workitem.
> 
> But, we might still need the cpu throttling as well. To give one use
> cases from google, we'd rather kill a low priority job for running
> tight on memory rather than having its reclaim thread affecting the
> latency of high priority job. It is quite easy to understand how to
> accomplish that in per-memcg-per-kswapd model, but harder in the
> shared workqueue model. It is straight-forward to read  the cpu usage
> by the cpuacct.usage* and limit the cpu usage by setting cpu.shares.
> One concern we have here is the scan/sec implementation will make
> things quite complex.
> 

I think you should check how the distance between limit <-> hiwater works
before jumping onto the cpu scheduler. If you see that a memcg's bgreclaim is
cpu hogging, you can stop it easily by setting limit == hiwat. Per-memcg
statistics seem enough for me. I don't like splitting features up between
cgroups any further. "To reduce cpu usage by a memcg, please check the
cpu cgroup and...." how complex that is! Do you remember what Hugh Dickins
pointed out at LSF? It's a big concern.

Setting up a combination of cgroup subsystems is too complex.
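
As a reference for the limit == hiwat point, the relevant piece of
setup_per_memcg_wmarks() from patch 1/7 (quoted in full later in this thread)
shows why a zero distance disables background reclaim entirely:

  if (mem->high_wmark_distance == 0) {
          /* both watermarks equal the limit, so the trigger never fires */
          res_counter_set_low_wmark_limit(&mem->res, limit);
          res_counter_set_high_wmark_limit(&mem->res, limit);
  }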

Thanks,
-Kame




* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-04-25  9:28 ` [PATCH 1/7] memcg: add high/low watermark to res_counter KAMEZAWA Hiroyuki
  2011-04-26 17:54   ` Ying Han
@ 2011-04-29 13:33   ` Michal Hocko
  2011-05-01  6:06     ` KOSAKI Motohiro
  2011-05-02  9:07   ` Balbir Singh
  2 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2011-04-29 13:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
> There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
> The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
> is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
> until the usage is lower than the high_wmark.

I have mentioned this during Ying's patchsets already, but do we really
want to have this confusing naming? High and low watermarks have the
opposite semantics for zones.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-04-29 13:33   ` Michal Hocko
@ 2011-05-01  6:06     ` KOSAKI Motohiro
  2011-05-03  6:49       ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: KOSAKI Motohiro @ 2011-05-01  6:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Ying Han, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

> On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
> > There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
> > The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
> > is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
> > until the usage is lower than the high_wmark.
> 
> I have mentioned this during Ying's patchsets already, but do we really
> want to have this confusing naming? High and low watermarks have
> opposite semantic for zones.

Can you please clarify this? I don't feel the semantics are opposite.




* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
                   ` (9 preceding siblings ...)
  2011-04-25 10:14 ` KAMEZAWA Hiroyuki
@ 2011-05-02  6:09 ` Balbir Singh
  10 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2011-05-02  6:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-04-25 18:25:29]:

> 
> This patch is based on Ying Han's one....at its origin, but I changed too much ;)
> Then, start this as new thread.
> 
> (*) This work is not related to the topic "rewriting global LRU using memcg"
>     discussion, at all. This kind of hi/low watermark has been planned since
>     memcg was born. 
> 
> At first, per-memcg background reclaim is used for
>   - helping memory reclaim and avoid direct reclaim.
>   - set a not-hard limit of memory usage.
> 
> For example, assume a memcg has its hard-limit as 500M bytes.
> Then, set high-watermark as 400M. Here, memory usage can exceed 400M up to 500M
> but memory usage will be reduced automatically to 400M as time goes by.
> 
> This is useful when a user want to limit memory usage to 400M but don't want to
> see big performance regression by hitting limit when memory usage spike happens.
> 
> 1) == hard limit = 400M ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx                
> real    0m7.353s
> user    0m0.009s
> sys     0m3.280s
>

What do the stats look like (graphed during this period)?
 
> 2) == hard limit 500M/ hi_watermark = 400M ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx
> 
> real    0m6.421s
> user    0m0.059s
> sys     0m2.707s
> 
What do the stats look like (graphed during this period) for
comparison? Does the usage extend beyond 400M very often?

> Above is a brief result on VM and needs more study. But my impression is positive.
> I'd like to use bigger real machine in the next time.
> 
> Here is a short list of updates from Ying Han's one.
> 
>  1. use workqueue and visit memcg in round robin.
>  2. only allow setting hi watermark. low-watermark is automatically determined.
>     This is good for avoiding bad cpu usage by background reclaim.
>  3. totally rewrite algorithm of shrink_mem_cgroup for round-robin.
>  4. fixed get_scan_count() , this was problematic.
>  5. added some statistics, which I think necessary.
>  6. added documenation
> 
> Then, the algorithm is not a cut-n-paste from kswapd. I thought kswapd should be
> updated...and 'priority' in vmscan.c seems to be an enemy of memcg ;)
>

Thanks for looking into this. 

-- 
	Three Cheers,
	Balbir


* Re: [PATCH 0/7] memcg background reclaim , yet another one.
  2011-04-25 22:21   ` Ying Han
  2011-04-26  1:38     ` KAMEZAWA Hiroyuki
@ 2011-05-02  7:02     ` Balbir Singh
  1 sibling, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2011-05-02  7:02 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, linux-mm, kosaki.motohiro, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko, Greg Thelen,
	Hugh Dickins

* Ying Han <yinghan@google.com> [2011-04-25 15:21:21]:

> Kame:
> 
> Thank you for putting time on implementing the patch. I think it is
> definitely a good idea to have the two alternatives on the table since
> people has asked the questions. Before going down to the track, i have
> thought about the two approaches and also discussed with Greg and Hugh
> (cc-ed),  i would like to clarify some of the pros and cons on both
> approaches.  In general, I think the workqueue is not the right answer
> for this purpose.
> 
> The thread-pool model
> Pros:
> 1. there is no isolation between memcg background reclaim, since the
> memcg threads are shared. That isolation including all the resources
> that the per-memcg background reclaim will need to access, like cpu
> time. One thing we are missing for the shared worker model is the
> individual cpu scheduling ability. We need the ability to isolate and
> count the resource assumption per memcg, and including how much
> cputime and where to run the per-memcg kswapd thread.
> 

Fair enough, but I think your suggestion is very container-specific. I
am not sure that binding CPU and memory resources together is a good
idea, unless proven. My concern is the growth in the number of kernel threads.

> 2. it is hard for visibility and debugability. We have been
> experiencing a lot when some kswapds running creazy and we need a
> stright-forward way to identify which cgroup causing the reclaim. yes,
> we can add more stats per-memcg to sort of giving that visibility, but
> I can tell they are involved w/ more overhead of the change. Why
> introduce the over-head if the per-memcg kswapd thread can offer that
> maturely.
> 
> 3. potential priority inversion for some memcgs. Let's say we have two
> memcgs A and B on a single core machine, and A has big chuck of work
> and B has small chuck of work. Now B's work is queued up after A. In
> the workqueue model, we won't process B unless we finish A's work
> since we only have one worker on the single core host. However, in the
> per-memcg kswapd model, B got chance to run when A calls
> cond_resched(). Well, we might not having the exact problem if we
> don't constrain the workers number, and the worst case we'll have the
> same number of workers as the number of memcgs. If so, it would be the
> same model as per-memcg kswapd.
> 
> 4. the kswapd threads are created and destroyed dynamically. are we
> talking about allocating 8k of stack for kswapd when we are under
> memory pressure? In the other case, all the memory are preallocated.
> 
> 5. the workqueue is scary and might introduce issues sooner or later.
> Also, why we think the background reclaim fits into the workqueue
> model, and be more specific, how that share the same logic of other
> parts of the system using workqueue.
> 
> Cons:
> 1. save SOME memory resource.
> 
> The per-memcg-per-kswapd model
> Pros:
> 1. memory overhead per thread, and The memory consumption would be
> 8k*1000 = 8M with 1k cgroup. This is NOT a problem as least we haven't
> seen it in our production. We have cases that 2k of kernel threads
> being created, and we haven't noticed it is causing resource
> consumption problem as well as performance issue. On those systems, we
> might have ~100 cgroup running at a time.
> 
> 2. we see lots of threads at 'ps -elf'. well, is that really a problem
> that we need to change the threading model?
> 
> Overall, the per-memcg-per-kswapd thread model is simple enough to
> provide better isolation (predictability & debug ability). The number
> of threads we might potentially have on the system is not a real
> problem. We already have systems running that much of threads (even
> more) and we haven't seen problem of that. Also, i can imagine it will
> make our life easier for some other extensions on memcg works.
> 
> For now, I would like to stick on the simple model. At the same time I
> am willing to looking into changes and fixes whence we have seen
> problems later.
>

On second thought, ksm and THP have gone their own kernel-thread way, but
the number of threads is limited. With workqueues, won't @max_active
help cover some of the issues you mentioned? I know it does not help
with per-cgroup association of workqueue threads, but if they execute
in process context, we should still have some control... no?

-- 
	Three Cheers,
	Balbir


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-04-25  9:28 ` [PATCH 1/7] memcg: add high/low watermark to res_counter KAMEZAWA Hiroyuki
  2011-04-26 17:54   ` Ying Han
  2011-04-29 13:33   ` Michal Hocko
@ 2011-05-02  9:07   ` Balbir Singh
  2011-05-06  5:30     ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 68+ messages in thread
From: Balbir Singh @ 2011-05-02  9:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, linux-mm, kosaki.motohiro, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-04-25 18:28:49]:

> There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
> The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
> is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
> until the usage is lower than the high_wmark.
> 
> Each watermark is calculated based on the hard_limit(limit_in_bytes) for each
> memcg. Each time the hard_limit is changed, the corresponding wmarks are
> re-calculated. Since memory controller charges only user pages, there is
> no need for a "min_wmark". The current calculation of wmarks is based on
> individual tunable high_wmark_distance, which are set to 0 by default.
> low_wmark is calculated in automatic way.
> 
> Changelog:v8b...v7
> 1. set low_wmark_distance in automatic using fixed HILOW_DISTANCE.
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h  |    1 
>  include/linux/res_counter.h |   78 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/res_counter.c        |    6 +++
>  mm/memcontrol.c             |   69 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 154 insertions(+)
> 
> Index: memcg/include/linux/memcontrol.h
> ===================================================================
> --- memcg.orig/include/linux/memcontrol.h
> +++ memcg/include/linux/memcontrol.h
> @@ -84,6 +84,7 @@ int task_in_mem_cgroup(struct task_struc
> 
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
> 
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> Index: memcg/include/linux/res_counter.h
> ===================================================================
> --- memcg.orig/include/linux/res_counter.h
> +++ memcg/include/linux/res_counter.h
> @@ -39,6 +39,14 @@ struct res_counter {
>  	 */
>  	unsigned long long soft_limit;
>  	/*
> +	 * the limit that reclaim triggers.
> +	 */
> +	unsigned long long low_wmark_limit;
> +	/*
> +	 * the limit that reclaim stops.
> +	 */
> +	unsigned long long high_wmark_limit;
> +	/*
>  	 * the number of unsuccessful attempts to consume the resource
>  	 */
>  	unsigned long long failcnt;
> @@ -55,6 +63,9 @@ struct res_counter {
> 
>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
> 
> +#define CHARGE_WMARK_LOW	0x01
> +#define CHARGE_WMARK_HIGH	0x02
> +
>  /**
>   * Helpers to interact with userspace
>   * res_counter_read_u64() - returns the value of the specified member.
> @@ -92,6 +103,8 @@ enum {
>  	RES_LIMIT,
>  	RES_FAILCNT,
>  	RES_SOFT_LIMIT,
> +	RES_LOW_WMARK_LIMIT,
> +	RES_HIGH_WMARK_LIMIT
>  };
> 
>  /*
> @@ -147,6 +160,24 @@ static inline unsigned long long res_cou
>  	return margin;
>  }
> 
> +static inline bool
> +res_counter_under_high_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->high_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +
> +static inline bool
> +res_counter_under_low_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	if (cnt->usage < cnt->low_wmark_limit)
> +		return true;
> +
> +	return false;
> +}
> +
>  /**
>   * Get the difference between the usage and the soft limit
>   * @cnt: The counter
> @@ -169,6 +200,30 @@ res_counter_soft_limit_excess(struct res
>  	return excess;
>  }
> 
> +static inline bool
> +res_counter_under_low_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_under_low_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
> +static inline bool
> +res_counter_under_high_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_under_high_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
>  static inline void res_counter_reset_max(struct res_counter *cnt)
>  {
>  	unsigned long flags;
> @@ -214,4 +269,27 @@ res_counter_set_soft_limit(struct res_co
>  	return 0;
>  }
> 
> +static inline int
> +res_counter_set_high_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->high_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
> +
> +static inline int
> +res_counter_set_low_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->low_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
>  #endif
> Index: memcg/kernel/res_counter.c
> ===================================================================
> --- memcg.orig/kernel/res_counter.c
> +++ memcg/kernel/res_counter.c
> @@ -19,6 +19,8 @@ void res_counter_init(struct res_counter
>  	spin_lock_init(&counter->lock);
>  	counter->limit = RESOURCE_MAX;
>  	counter->soft_limit = RESOURCE_MAX;
> +	counter->low_wmark_limit = RESOURCE_MAX;
> +	counter->high_wmark_limit = RESOURCE_MAX;
>  	counter->parent = parent;
>  }
> 
> @@ -103,6 +105,10 @@ res_counter_member(struct res_counter *c
>  		return &counter->failcnt;
>  	case RES_SOFT_LIMIT:
>  		return &counter->soft_limit;
> +	case RES_LOW_WMARK_LIMIT:
> +		return &counter->low_wmark_limit;
> +	case RES_HIGH_WMARK_LIMIT:
> +		return &counter->high_wmark_limit;
>  	};
> 
>  	BUG();
> Index: memcg/mm/memcontrol.c
> ===================================================================
> --- memcg.orig/mm/memcontrol.c
> +++ memcg/mm/memcontrol.c
> @@ -278,6 +278,11 @@ struct mem_cgroup {
>  	 */
>  	struct mem_cgroup_stat_cpu nocpu_base;
>  	spinlock_t pcp_counter_lock;
> +
> +	/*
> +	 * used to calculate the low/high_wmarks based on the limit_in_bytes.
> +	 */
> +	u64 high_wmark_distance;
>  };
> 
>  /* Stuffs for move charges at task migration. */
> @@ -867,6 +872,44 @@ out:
>  EXPORT_SYMBOL(mem_cgroup_count_vm_event);
>

Hmm... I wonder if we can start looking at the read side of
usage_in_bytes using RCU and reduce lock contention on cnt->lock. Maybe
an optimization for later. I still have my old per-cpu counter
patches for usage_in_bytes that add some fuzz factor but help improve
speed. I should rebase them and try.

 
>  /*
> + * If the Hi-Low distance is too big, background reclaim tends to be cpu hogging.
> + * If the Hi-Low distance is too small, a small memory usage spike (by temporary
> + * shell scripts) causes background reclaim and makes things worse. But a memory
> + * spike can be avoided by setting the high-wmark a bit higher. We use a fixed
> + * HiLow distance, which will be easy to use.
> + */
> +#ifdef CONFIG_64BIT /* object size tend do be twice */
> +#define HILOW_DISTANCE	(4 * 1024 * 1024)
> +#else
> +#define HILOW_DISTANCE	(2 * 1024 * 1024)
> +#endif
> +
> +static void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> +{
> +	u64 limit;
> +
> +	limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> +	if (mem->high_wmark_distance == 0) {
> +		res_counter_set_low_wmark_limit(&mem->res, limit);
> +		res_counter_set_high_wmark_limit(&mem->res, limit);
> +	} else {
> +		u64 low_wmark, high_wmark, low_distance;
> +		if (mem->high_wmark_distance <= HILOW_DISTANCE)
> +			low_distance = mem->high_wmark_distance / 2;
> +		else
> +			low_distance = HILOW_DISTANCE;
> +		if (low_distance < PAGE_SIZE * 2)
> +			low_distance = PAGE_SIZE * 2;
> +
> +		low_wmark = limit - low_distance;
> +		high_wmark = limit - mem->high_wmark_distance;
> +
> +		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> +		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> +	}
> +}
> +

I've not seen the documentation patch, but it might be good to have
some comments about what to expect the watermarks to be and who sets
up high_wmark_distance.

> +/*
>   * Following LRU functions are allowed to be used without PCG_LOCK.
>   * Operations are called by routine of global LRU independently from memcg.
>   * What we have to take care of here is validness of pc->mem_cgroup.
> @@ -3264,6 +3307,7 @@ static int mem_cgroup_resize_limit(struc
>  			else
>  				memcg->memsw_is_minimum = false;
>  		}
> +		setup_per_memcg_wmarks(memcg);
>  		mutex_unlock(&set_limit_mutex);
> 
>  		if (!ret)
> @@ -3324,6 +3368,7 @@ static int mem_cgroup_resize_memsw_limit
>  			else
>  				memcg->memsw_is_minimum = false;
>  		}
> +		setup_per_memcg_wmarks(memcg);
>  		mutex_unlock(&set_limit_mutex);
> 
>  		if (!ret)
> @@ -4603,6 +4648,30 @@ static void __init enable_swap_cgroup(vo
>  }
>  #endif
> 
> +/*
> + * We use low_wmark and high_wmark for triggering per-memcg kswapd.
> + * The reclaim is triggered by low_wmark (usage > low_wmark) and stopped
> + * by high_wmark (usage < high_wmark).
> + */
> +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> +				int charge_flags)
> +{
> +	long ret = 0;
> +	int flags = CHARGE_WMARK_LOW | CHARGE_WMARK_HIGH;
> +
> +	if (!mem->high_wmark_distance)
> +		return 1;
> +
> +	VM_BUG_ON((charge_flags & flags) == flags);
> +
> +	if (charge_flags & CHARGE_WMARK_LOW)
> +		ret = res_counter_under_low_wmark_limit(&mem->res);
> +	if (charge_flags & CHARGE_WMARK_HIGH)
> +		ret = res_counter_under_high_wmark_limit(&mem->res);
> +
> +	return ret;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> 

-- 
	Three Cheers,
	Balbir


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-01  6:06     ` KOSAKI Motohiro
@ 2011-05-03  6:49       ` Michal Hocko
  2011-05-03  7:45         ` KOSAKI Motohiro
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2011-05-03  6:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
> > > There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
> > > The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
> > > is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
> > > until the usage is lower than the high_wmark.
> > 
> > I have mentioned this during Ying's patchsets already, but do we really
> > want to have this confusing naming? High and low watermarks have
> > opposite semantic for zones.
> 
> Can you please clarify this? I feel it is not opposite semantics.

In global reclaim the low watermark represents the point where we _start_
background reclaim while the high watermark is the _stopper_. Those watermarks
are based on free memory, while this proposal bases them on used memory.
I understand that the result is the same in the end, but it is really
confusing because you have to switch your mindset from free to used and
from under the limit to above the limit.
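
A side-by-side sketch of the two conventions being discussed, with hypothetical
helper names just to make the difference concrete (neither is from the patch;
locking is omitted):

  /* this patchset: trigger when usage climbs above the low watermark */
  static bool need_bgreclaim_usage(struct res_counter *cnt)
  {
          return cnt->usage > cnt->low_wmark_limit;
  }

  /* zone-style view: trigger when the remaining room drops below 'low' */
  static bool need_bgreclaim_free(struct res_counter *cnt)
  {
          unsigned long long free = cnt->limit - cnt->usage;

          return free < cnt->limit - cnt->low_wmark_limit;
  }
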
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-03  6:49       ` Michal Hocko
@ 2011-05-03  7:45         ` KOSAKI Motohiro
  2011-05-03  8:25           ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: KOSAKI Motohiro @ 2011-05-03  7:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

2011/5/3 Michal Hocko <mhocko@suse.cz>:
> On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
>> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
>> > > There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
>> > > The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
>> > > is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
>> > > until the usage is lower than the high_wmark.
>> >
>> > I have mentioned this during Ying's patchsets already, but do we really
>> > want to have this confusing naming? High and low watermarks have
>> > opposite semantic for zones.
>>
>> Can you please clarify this? I feel it is not opposite semantics.
>
> In the global reclaim low watermark represents the point when we _start_
> background reclaim while high watermark is the _stopper_. Watermarks are
> based on the free memory while this proposal makes it based on the used
> memory.
> I understand that the result is same in the end but it is really
> confusing because you have to switch your mindset from free to used and
> from under the limit to above the limit.

Ah, right. So, do you have an alternative idea?


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-03  7:45         ` KOSAKI Motohiro
@ 2011-05-03  8:25           ` Michal Hocko
  2011-05-03 17:01             ` Ying Han
  2011-05-04  3:55             ` KOSAKI Motohiro
  0 siblings, 2 replies; 68+ messages in thread
From: Michal Hocko @ 2011-05-03  8:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

On Tue 03-05-11 16:45:23, KOSAKI Motohiro wrote:
> 2011/5/3 Michal Hocko <mhocko@suse.cz>:
> > On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
> >> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
> >> > > There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
> >> > > The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
> >> > > is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
> >> > > until the usage is lower than the high_wmark.
> >> >
> >> > I have mentioned this during Ying's patchsets already, but do we really
> >> > want to have this confusing naming? High and low watermarks have
> >> > opposite semantic for zones.
> >>
> >> Can you please clarify this? I feel it is not opposite semantics.
> >
> > In the global reclaim low watermark represents the point when we _start_
> > background reclaim while high watermark is the _stopper_. Watermarks are
> > based on the free memory while this proposal makes it based on the used
> > memory.
> > I understand that the result is same in the end but it is really
> > confusing because you have to switch your mindset from free to used and
> > from under the limit to above the limit.
> 
> Ah, right. So, do you have an alternative idea?

Why can't we just keep the global reclaim semantics and make it based on free
memory (hard_limit - usage_in_bytes), with the low limit as the trigger
for reclaiming?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-03  8:25           ` Michal Hocko
@ 2011-05-03 17:01             ` Ying Han
  2011-05-04  8:58               ` Michal Hocko
  2011-05-04  3:55             ` KOSAKI Motohiro
  1 sibling, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-05-03 17:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura,
	akpm, Johannes Weiner, minchan.kim

On Tue, May 3, 2011 at 1:25 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 03-05-11 16:45:23, KOSAKI Motohiro wrote:
>> 2011/5/3 Michal Hocko <mhocko@suse.cz>:
>> > On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
>> >> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
>> >> > > There are two watermarks added per-memcg including "high_wmark" and "low_wmark".
>> >> > > The per-memcg kswapd is invoked when the memcg's memory usage(usage_in_bytes)
>> >> > > is higher than the low_wmark. Then the kswapd thread starts to reclaim pages
>> >> > > until the usage is lower than the high_wmark.
>> >> >
>> >> > I have mentioned this during Ying's patchsets already, but do we really
>> >> > want to have this confusing naming? High and low watermarks have
>> >> > opposite semantic for zones.
>> >>
>> >> Can you please clarify this? I feel it is not opposite semantics.
>> >
>> > In the global reclaim low watermark represents the point when we _start_
>> > background reclaim while high watermark is the _stopper_. Watermarks are
>> > based on the free memory while this proposal makes it based on the used
>> > memory.
>> > I understand that the result is same in the end but it is really
>> > confusing because you have to switch your mindset from free to used and
>> > from under the limit to above the limit.
>>
>> Ah, right. So, do you have an alternative idea?
>
> Why cannot we just keep the global reclaim semantic and make it free
> memory (hard_limit - usage_in_bytes) based with low limit as the trigger
> for reclaiming?

Hmm, that was my initial implementation. But then I got a comment to switch to
the current scheme, which is based on the usage. The initial comment was that
using "free" is confusing... :)

The current scheme is closer to the global bg reclaim, in which the low
watermark triggers reclaim and the high watermark stops it. And we can only
use the "usage" to keep the same API.

--Ying

>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-03  8:25           ` Michal Hocko
  2011-05-03 17:01             ` Ying Han
@ 2011-05-04  3:55             ` KOSAKI Motohiro
  2011-05-04  8:55               ` Michal Hocko
  1 sibling, 1 reply; 68+ messages in thread
From: KOSAKI Motohiro @ 2011-05-04  3:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

>> Ah, right. So, do you have an alternative idea?
>
> Why cannot we just keep the global reclaim semantic and make it free
> memory (hard_limit - usage_in_bytes) based with low limit as the trigger
> for reclaiming?

Because it's not free memory. The cgroup hasn't reached its limit, but....


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-04  3:55             ` KOSAKI Motohiro
@ 2011-05-04  8:55               ` Michal Hocko
  2011-05-09  3:24                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2011-05-04  8:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

On Wed 04-05-11 12:55:19, KOSAKI Motohiro wrote:
> >> Ah, right. So, do you have an alternative idea?
> >
> > Why cannot we just keep the global reclaim semantic and make it free
> > memory (hard_limit - usage_in_bytes) based with low limit as the trigger
> > for reclaiming?
> 
> Because it's not free memory. 

In some sense it is, because it defines the available memory for the group.

> the cgroup doesn't reach a limit. but....

The same way we do not get down to zero free memory globally (due to reserves
etc.). Or am I missing something?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-03 17:01             ` Ying Han
@ 2011-05-04  8:58               ` Michal Hocko
  2011-05-04 17:16                 ` Ying Han
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2011-05-04  8:58 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura,
	akpm, Johannes Weiner, minchan.kim

On Tue 03-05-11 10:01:27, Ying Han wrote:
> On Tue, May 3, 2011 at 1:25 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Tue 03-05-11 16:45:23, KOSAKI Motohiro wrote:
> >> 2011/5/3 Michal Hocko <mhocko@suse.cz>:
> >> > On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
> >> >> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
[...]
> >> >> Can you please clarify this? I feel it is not opposite semantics.
> >> >
> >> > In the global reclaim low watermark represents the point when we _start_
> >> > background reclaim while high watermark is the _stopper_. Watermarks are
> >> > based on the free memory while this proposal makes it based on the used
> >> > memory.
> >> > I understand that the result is same in the end but it is really
> >> > confusing because you have to switch your mindset from free to used and
> >> > from under the limit to above the limit.
> >>
> >> Ah, right. So, do you have an alternative idea?
> >
> > Why cannot we just keep the global reclaim semantic and make it free
> > memory (hard_limit - usage_in_bytes) based with low limit as the trigger
> > for reclaiming?
> 
[...]
> The current scheme 

What is the current scheme?

> is closer to the global bg reclaim which the low is triggering reclaim
> and high is stopping reclaim. And we can only use the "usage" to keep
> the same API.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-04  8:58               ` Michal Hocko
@ 2011-05-04 17:16                 ` Ying Han
  2011-05-05  6:59                   ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-05-04 17:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura,
	akpm, Johannes Weiner, minchan.kim

On Wed, May 4, 2011 at 1:58 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 03-05-11 10:01:27, Ying Han wrote:
>> On Tue, May 3, 2011 at 1:25 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > On Tue 03-05-11 16:45:23, KOSAKI Motohiro wrote:
>> >> 2011/5/3 Michal Hocko <mhocko@suse.cz>:
>> >> > On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
>> >> >> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
> [...]
>> >> >> Can you please clarify this? I feel it is not opposite semantics.
>> >> >
>> >> > In the global reclaim low watermark represents the point when we _start_
>> >> > background reclaim while high watermark is the _stopper_. Watermarks are
>> >> > based on the free memory while this proposal makes it based on the used
>> >> > memory.
>> >> > I understand that the result is same in the end but it is really
>> >> > confusing because you have to switch your mindset from free to used and
>> >> > from under the limit to above the limit.
>> >>
>> >> Ah, right. So, do you have an alternative idea?
>> >
>> > Why cannot we just keep the global reclaim semantic and make it free
>> > memory (hard_limit - usage_in_bytes) based with low limit as the trigger
>> > for reclaiming?
>>
> [...]
>> The current scheme
>
> What is the current scheme?

using the "usage_in_bytes" instead of "free"

--Ying
>
>> is closer to the global bg reclaim which the low is triggering reclaim
>> and high is stopping reclaim. And we can only use the "usage" to keep
>> the same API.
>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-04 17:16                 ` Ying Han
@ 2011-05-05  6:59                   ` Michal Hocko
  2011-05-06  5:28                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2011-05-05  6:59 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura,
	akpm, Johannes Weiner, minchan.kim

On Wed 04-05-11 10:16:39, Ying Han wrote:
> On Wed, May 4, 2011 at 1:58 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Tue 03-05-11 10:01:27, Ying Han wrote:
> >> On Tue, May 3, 2011 at 1:25 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >> > On Tue 03-05-11 16:45:23, KOSAKI Motohiro wrote:
> >> >> 2011/5/3 Michal Hocko <mhocko@suse.cz>:
> >> >> > On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
> >> >> >> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
> > [...]
> >> >> >> Can you please clarify this? I feel it is not opposite semantics.
> >> >> >
> >> >> > In the global reclaim low watermark represents the point when we _start_
> >> >> > background reclaim while high watermark is the _stopper_. Watermarks are
> >> >> > based on the free memory while this proposal makes it based on the used
> >> >> > memory.
> >> >> > I understand that the result is same in the end but it is really
> >> >> > confusing because you have to switch your mindset from free to used and
> >> >> > from under the limit to above the limit.
> >> >>
> >> >> Ah, right. So, do you have an alternative idea?
> >> >
> >> > Why cannot we just keep the global reclaim semantic and make it free
> >> > memory (hard_limit - usage_in_bytes) based with low limit as the trigger
> >> > for reclaiming?
> >>
> > [...]
> >> The current scheme
> >
> > What is the current scheme?
> 
> using the "usage_in_bytes" instead of "free"
> 
> >> is closer to the global bg reclaim which the low is triggering reclaim
> >> and high is stopping reclaim. And we can only use the "usage" to keep
> >> the same API.

And how is this closer to the global reclaim semantic, which is based on
the available memory?
What I am trying to say here is that this new watermark concept doesn't
fit in with the global reclaim. Well, a standard user might not be aware
of the zone watermarks at all because they cannot be set. But if you are
analyzing your memory usage you still check and compare free memory to
the min/low/high watermarks to find out what the current memory pressure
is.
If we had another concept for cgroups you would need to switch your
mindset to analyze things.

I am sorry, but I still do not see any reason why those cgroup watermarks
cannot be based on (total - usage).
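
To make the two framings concrete, here is a small self-contained sketch
(plain C, invented names, not code from this series) showing that a trigger
written against used bytes and one written against free bytes, i.e.
(limit - usage), fire at exactly the same point once the thresholds are
chosen consistently:

#include <assert.h>

typedef unsigned long long u64;

/* usage-based check, as proposed in this series: start background
 * reclaim once usage climbs above the (usage-based) low watermark */
static int start_reclaim_usage(u64 usage, u64 low_wmark)
{
	return usage > low_wmark;
}

/* free-based check, global-reclaim style: start background reclaim
 * once the headroom (limit - usage) drops below a low threshold */
static int start_reclaim_free(u64 usage, u64 limit, u64 low_free)
{
	return limit - usage < low_free;
}

int main(void)
{
	u64 limit = 300 << 20;		/* 300M, as in the A/B example below  */
	u64 low_free = 50 << 20;	/* reclaim once less than 50M is left */
	u64 low_wmark = limit - low_free;
	u64 usage;

	for (usage = 0; usage <= limit; usage += 1 << 20)
		assert(start_reclaim_usage(usage, low_wmark) ==
		       start_reclaim_free(usage, limit, low_free));
	return 0;
}

Only the bookkeeping differs (used vs. free, above vs. below the mark); the
reclaim behaviour is identical, which is why the argument is about which
mindset the interface should expose.
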
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-05  6:59                   ` Michal Hocko
@ 2011-05-06  5:28                     ` KAMEZAWA Hiroyuki
  2011-05-06 14:22                       ` Johannes Weiner
  2011-05-09  5:40                       ` Ying Han
  0 siblings, 2 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-06  5:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ying Han, KOSAKI Motohiro, linux-mm, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

On Thu, 5 May 2011 08:59:01 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Wed 04-05-11 10:16:39, Ying Han wrote:
> > On Wed, May 4, 2011 at 1:58 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > > On Tue 03-05-11 10:01:27, Ying Han wrote:
> > >> On Tue, May 3, 2011 at 1:25 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > >> > On Tue 03-05-11 16:45:23, KOSAKI Motohiro wrote:
> > >> >> 2011/5/3 Michal Hocko <mhocko@suse.cz>:
> > >> >> > On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
> > >> >> >> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
> > > [...]
> > >> >> >> Can you please clarify this? I feel it is not opposite semantics.
> > >> >> >
> > >> >> > In the global reclaim low watermark represents the point when we _start_
> > >> >> > background reclaim while high watermark is the _stopper_. Watermarks are
> > >> >> > based on the free memory while this proposal makes it based on the used
> > >> >> > memory.
> > >> >> > I understand that the result is same in the end but it is really
> > >> >> > confusing because you have to switch your mindset from free to used and
> > >> >> > from under the limit to above the limit.
> > >> >>
> > >> >> Ah, right. So, do you have an alternative idea?
> > >> >
> > >> > Why cannot we just keep the global reclaim semantic and make it free
> > >> > memory (hard_limit - usage_in_bytes) based with low limit as the trigger
> > >> > for reclaiming?
> > >>
> > > [...]
> > >> The current scheme
> > >
> > > What is the current scheme?
> > 
> > using the "usage_in_bytes" instead of "free"
> > 
> > >> is closer to the global bg reclaim which the low is triggering reclaim
> > >> and high is stopping reclaim. And we can only use the "usage" to keep
> > >> the same API.
> 

Sorry for the long absence.

> And how is this closer to the global reclaim semantic which is based on
> the available memory?

It will never be the same feature, and it is not even a similar feature, I think.

> What I am trying to say here is that this new watermark concept doesn't
> fit in with the global reclaim. Well, standard user might not be aware
> of the zone watermarks at all because they cannot be set. But still if
> you are analyzing your memory usage you still check and compare free
> memory to min/low/high watermarks to find out what is the current memory
> pressure.
> If we had another concept with cgroups you would need to switch your 
> mindset to analyze things.
> 
> I am sorry, but I still do not see any reason why those cgroup watermaks
> cannot be based on total-usage.

Hmm, so, the interface should be

  memory.watermark  --- the total usage which kernel's memory shrinker starts.

?

I'm okay with this. And I think this parameter should be fully independent from
the limit.

Memcg can work without watermark reclaim. I think my patch just adds a new
_limit_ with which a user can shrink memory usage on demand with the kernel's
help. Memory reclaim works in the background, but this is not kswapd at all.

I guess the performance benefit of using a watermark under a cgroup which has
a limit is very small, and I think this is not a performance tuning parameter.
This is just a new limit.

Comparing 2 cases,

 cgroup A)
   - has a limit of 300M, no watermarks.
 cgroup B)
   - has a limit of UNLIMITED, watermarks=300M

A) has a hard limit; the memory reclaim cost is paid by user threads, and it
has a risk of OOM under memcg.
B) has no hard limit; the memory reclaim cost is paid by kernel threads, and it
will not have a risk of OOM under memcg, but it can burn CPU.

I think this should be called a soft-limit ;) But we already have another
soft-limit, so I call this a watermark. This will be useful for resizing memory
usage online, because the application will not hit the limit and see big latency
even while an admin makes the watermark smaller.

Hmm, maybe I should allow watermark > limit setting ;).
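
To illustrate the trade-off between the two configurations, here is a toy
model (plain C, invented names, not code from this series) of a watermark
that is a standalone knob, fully independent of the hard limit:

#include <stdio.h>

typedef unsigned long long u64;
#define UNLIMITED ((u64)-1)

struct group {
	u64 usage;
	u64 limit;	/* hard limit, may be UNLIMITED            */
	u64 watermark;	/* 0 means "no watermark reclaim"          */
};

/* charging path: only the hard limit can make the caller reclaim
 * (or OOM) synchronously */
static int charge(struct group *g, u64 bytes)
{
	if (g->limit != UNLIMITED && g->usage + bytes > g->limit)
		return -1;	/* caller must reclaim/OOM: cgroup A's cost */
	g->usage += bytes;
	return 0;
}

/* background worker: pays the reclaim cost in kernel context instead,
 * never OOMing the group but burning CPU: cgroup B's cost */
static void background_shrink(struct group *g)
{
	while (g->watermark && g->usage > g->watermark)
		g->usage -= 1 << 20;	/* pretend we reclaimed 1M */
}

int main(void)
{
	struct group a = { .usage = 0, .limit = 300 << 20, .watermark = 0 };
	struct group b = { .usage = 0, .limit = UNLIMITED, .watermark = 300 << 20 };

	charge(&a, 310 << 20);		/* fails: hits the hard limit */
	charge(&b, 310 << 20);		/* succeeds: no hard limit    */
	background_shrink(&b);		/* usage pushed back to 300M  */
	printf("A: %llu, B: %llu\n", a.usage, b.usage);
	return 0;
}

In the model, cgroup A pays the reclaim cost in the charging (user) context
and can fail the charge, while cgroup B never fails a charge but keeps a
kernel worker busy pushing usage back to the watermark, which is the
CPU-burning cost described above.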

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-02  9:07   ` Balbir Singh
@ 2011-05-06  5:30     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-06  5:30 UTC (permalink / raw)
  To: balbir
  Cc: Ying Han, linux-mm, kosaki.motohiro, nishimura, akpm,
	Johannes Weiner, minchan.kim, Michal Hocko

On Mon, 2 May 2011 14:37:41 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-04-25 18:28:49]:
> > +		res_counter_set_high_wmark_limit(&mem->res, limit);
> > +	} else {
> > +		u64 low_wmark, high_wmark, low_distance;
> > +		if (mem->high_wmark_distance <= HILOW_DISTANCE)
> > +			low_distance = mem->high_wmark_distance / 2;
> > +		else
> > +			low_distance = HILOW_DISTANCE;
> > +		if (low_distance < PAGE_SIZE * 2)
> > +			low_distance = PAGE_SIZE * 2;
> > +
> > +		low_wmark = limit - low_distance;
> > +		high_wmark = limit - mem->high_wmark_distance;
> > +
> > +		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> > +		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> > +	}
> > +}
> > +
> 
> I've not seen the documentation patch, but it might be good to have
> some comments with what to expect the watermarks to be and who sets up
> high_wmark_distance. 
> 

I'll refine these names.
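
For reference, the quoted hunk works out as follows with concrete numbers
(this is only an illustration; HILOW_DISTANCE is assumed to be the 4MB batch
distance mentioned later in this thread, and PAGE_SIZE to be 4KB):

/* Same computation as the quoted hunk, with invented example values. */
#include <stdio.h>

typedef unsigned long long u64;
#define HILOW_DISTANCE	(4ULL << 20)	/* assumption: 4MB batch distance */
#define PAGE_SIZE	4096ULL		/* assumption: 4KB pages          */

int main(void)
{
	u64 limit = 300ULL << 20;		/* hard limit: 300M           */
	u64 high_wmark_distance = 20ULL << 20;	/* user asks for 20M headroom */
	u64 low_distance, low_wmark, high_wmark;

	if (high_wmark_distance <= HILOW_DISTANCE)
		low_distance = high_wmark_distance / 2;
	else
		low_distance = HILOW_DISTANCE;
	if (low_distance < PAGE_SIZE * 2)
		low_distance = PAGE_SIZE * 2;

	low_wmark = limit - low_distance;		/* 296M */
	high_wmark = limit - high_wmark_distance;	/* 280M */

	printf("low_wmark=%lluM high_wmark=%lluM\n",
	       low_wmark >> 20, high_wmark >> 20);
	return 0;
}

So with these example values the group may grow to about 296M before
background reclaim starts and is then pushed back to 280M (taking the low
mark as the trigger and the high mark as the stop point, as discussed
elsewhere in this thread).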

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-06  5:28                     ` KAMEZAWA Hiroyuki
@ 2011-05-06 14:22                       ` Johannes Weiner
  2011-05-09  0:21                         ` KAMEZAWA Hiroyuki
  2011-05-09  5:40                       ` Ying Han
  1 sibling, 1 reply; 68+ messages in thread
From: Johannes Weiner @ 2011-05-06 14:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, Ying Han, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Fri, May 06, 2011 at 02:28:34PM +0900, KAMEZAWA Hiroyuki wrote:
> Hmm, so, the interface should be
> 
>   memory.watermark  --- the total usage which kernel's memory shrinker starts.
> 
> ?
> 
> I'm okay with this. And I think this parameter should be fully independent from
> the limit.
> 
> Memcg can work without watermark reclaim. I think my patch just adds a new
> _limit_ which a user can shrink usage of memory on deamand with kernel's help.
> Memory reclaim works in background but this is not a kswapd, at all.
> 
> I guess performance benefit of using watermark under a cgroup which has limit
> is very small and I think this is not for a performance tuning parameter. 
> This is just a new limit.
> 
> Comparing 2 cases,
> 
>  cgroup A)
>    - has limit of 300M, no watermaks.
>  cgroup B)
>    - has limit of UNLIMITED, watermarks=300M
> 
> A) has hard limit and memory reclaim cost is paid by user threads, and have
> risks of OOM under memcg.
> B) has no hard limit and memory reclaim cost is paid by kernel threads, and
> will not have risk of OOM under memcg, but can be CPU burning.
> 
> I think this should be called as soft-limit ;) But we have another soft-limit now.
> Then, I call this as watermark. This will be useful to resize usage of memory
> in online because application will not hit limit and get big latency even while
> an admin makes watermark smaller.

I have two thoughts to this:

1. Even though the memcg will not hit the limit and the application
will not be forced to do memcg target reclaim, the watermark reclaim
will steal pages from the memcg and the application will suffer the
page faults, so it's not an unconditional win.

2. I understand how the feature is supposed to work, but I don't
understand or see a use case for the watermark being configurable.
Don't get me wrong, I completely agree with watermark reclaim, it's a
good latency optimization.  But I don't see why you would want to
manually push back a memcg by changing the watermark.

Ying wrote in another email that she wants to do this to make room for
another job that is about to get launched.  My reply to that was that
you should just launch the job and let global memory pressure push
back that memcg instead.  So instead of lowering the watermark, you
could lower the soft limit and don't do any reclaim at all until real
pressure arises.  You said yourself that the new feature should be
called soft limit.  And I think it is because it is a reimplementation
of the soft limit!

I am sorry that I am such a drag regarding this, please convince me so
I can crawl back to my cave ;)

	Hannes


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-06 14:22                       ` Johannes Weiner
@ 2011-05-09  0:21                         ` KAMEZAWA Hiroyuki
  2011-05-09  5:47                           ` Ying Han
  2011-05-09  9:58                           ` Johannes Weiner
  0 siblings, 2 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-09  0:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Ying Han, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Fri, 6 May 2011 16:22:57 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, May 06, 2011 at 02:28:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > Hmm, so, the interface should be
> > 
> >   memory.watermark  --- the total usage which kernel's memory shrinker starts.
> > 
> > ?
> > 
> > I'm okay with this. And I think this parameter should be fully independent from
> > the limit.
> > 
> > Memcg can work without watermark reclaim. I think my patch just adds a new
> > _limit_ which a user can shrink usage of memory on deamand with kernel's help.
> > Memory reclaim works in background but this is not a kswapd, at all.
> > 
> > I guess performance benefit of using watermark under a cgroup which has limit
> > is very small and I think this is not for a performance tuning parameter. 
> > This is just a new limit.
> > 
> > Comparing 2 cases,
> > 
> >  cgroup A)
> >    - has limit of 300M, no watermaks.
> >  cgroup B)
> >    - has limit of UNLIMITED, watermarks=300M
> > 
> > A) has hard limit and memory reclaim cost is paid by user threads, and have
> > risks of OOM under memcg.
> > B) has no hard limit and memory reclaim cost is paid by kernel threads, and
> > will not have risk of OOM under memcg, but can be CPU burning.
> > 
> > I think this should be called as soft-limit ;) But we have another soft-limit now.
> > Then, I call this as watermark. This will be useful to resize usage of memory
> > in online because application will not hit limit and get big latency even while
> > an admin makes watermark smaller.
> 
> I have two thoughts to this:
> 
> 1. Even though the memcg will not hit the limit and the application
> will not be forced to do memcg target reclaim, the watermark reclaim
> will steal pages from the memcg and the application will suffer the
> page faults, so it's not an unconditional win.
> 

Considering the whole system, I don't think this watermark can ever be a
performance help. This consumes the same amount of cpu as a memory-freeing
thread uses. In a realistic situation, in a busy memcg, several threads hit
the limit at the same time and help from a single thread will not be much help.

> 2. I understand how the feature is supposed to work, but I don't
> understand or see a use case for the watermark being configurable.
> Don't get me wrong, I completely agree with watermark reclaim, it's a
> good latency optimization.  But I don't see why you would want to
> manually push back a memcg by changing the watermark.
> 

For keeping free memory when the system is not busy.

> Ying wrote in another email that she wants to do this to make room for
> another job that is about to get launched.  My reply to that was that
> you should just launch the job and let global memory pressure push
> back that memcg instead.  So instead of lowering the watermark, you
> could lower the soft limit and don't do any reclaim at all until real
> pressure arises.  You said yourself that the new feature should be
> called soft limit.  And I think it is because it is a reimplementation
> of the soft limit!
> 

Soft limit works only when the system is in memory shortage. It means the
system needs to use cpu for memory reclaim when the system is very busy.
This feature, on the other hand, always works when an admin wants it to. This
difference will affect page allocation latency and the execution time of
applications. For some customers, when they want to start up an application in
1 sec, it must start in 1 sec. As you know, kswapd's memory reclaim itself is
too slow against a rapid big allocation or a burst of network packet
allocations, and direct reclaim always runs. So it is not avoidable to
reclaim/scan memory when the system is busy.  This feature allows admins to
schedule memory reclaim when the system is calm. It's like controlling when GC
is scheduled.

IIRC, there was a trial to free memory when idle() runs....but it doesn't meet
current system requirements, as idle() should be idle. What I am thinking of is
a feature like that, with the help of memcg.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-04  8:55               ` Michal Hocko
@ 2011-05-09  3:24                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  3:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Ying Han, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

> On Wed 04-05-11 12:55:19, KOSAKI Motohiro wrote:
> > >> Ah, right. So, do you have an alternative idea?
> > >
> > > Why cannot we just keep the global reclaim semantic and make it free
> > > memory (hard_limit - usage_in_bytes) based with low limit as the trigger
> > > for reclaiming?
> > 
> > Because it's not free memory. 
> 
> In some sense it is because it defines the available memory for a group.
> 
> > the cgroup doesn't reach a limit. but....
> 
> Same way how we do not get down to no free memory (due to reserves
> etc.). Or am I missing something.

Of course, it's possible. The only two problems are 1) it needs a lot of
trivial rewriting of existing code and 2) the naming issue (it's not _free_).

So, I'm going away from this discussion. ;-) I don't have a strong opinion on
this. I only wrote down the reason for the current decision. I don't dislike
your idea either.

Thanks.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-06  5:28                     ` KAMEZAWA Hiroyuki
  2011-05-06 14:22                       ` Johannes Weiner
@ 2011-05-09  5:40                       ` Ying Han
  2011-05-09  7:10                         ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-05-09  5:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, KOSAKI Motohiro, linux-mm, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

On Thu, May 5, 2011 at 10:28 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 5 May 2011 08:59:01 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
>
>> On Wed 04-05-11 10:16:39, Ying Han wrote:
>> > On Wed, May 4, 2011 at 1:58 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > > On Tue 03-05-11 10:01:27, Ying Han wrote:
>> > >> On Tue, May 3, 2011 at 1:25 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > >> > On Tue 03-05-11 16:45:23, KOSAKI Motohiro wrote:
>> > >> >> 2011/5/3 Michal Hocko <mhocko@suse.cz>:
>> > >> >> > On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
>> > >> >> >> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
>> > > [...]
>> > >> >> >> Can you please clarify this? I feel it is not opposite semantics.
>> > >> >> >
>> > >> >> > In the global reclaim low watermark represents the point when we _start_
>> > >> >> > background reclaim while high watermark is the _stopper_. Watermarks are
>> > >> >> > based on the free memory while this proposal makes it based on the used
>> > >> >> > memory.
>> > >> >> > I understand that the result is same in the end but it is really
>> > >> >> > confusing because you have to switch your mindset from free to used and
>> > >> >> > from under the limit to above the limit.
>> > >> >>
>> > >> >> Ah, right. So, do you have an alternative idea?
>> > >> >
>> > >> > Why cannot we just keep the global reclaim semantic and make it free
>> > >> > memory (hard_limit - usage_in_bytes) based with low limit as the trigger
>> > >> > for reclaiming?
>> > >>
>> > > [...]
>> > >> The current scheme
>> > >
>> > > What is the current scheme?
>> >
>> > using the "usage_in_bytes" instead of "free"
>> >
>> > >> is closer to the global bg reclaim which the low is triggering reclaim
>> > >> and high is stopping reclaim. And we can only use the "usage" to keep
>> > >> the same API.
>>
>
> Sorry for long absence.
>
>> And how is this closer to the global reclaim semantic which is based on
>> the available memory?
>
> It's never be the same feature and not a similar feature, I think.
>
>> What I am trying to say here is that this new watermark concept doesn't
>> fit in with the global reclaim. Well, standard user might not be aware
>> of the zone watermarks at all because they cannot be set. But still if
>> you are analyzing your memory usage you still check and compare free
>> memory to min/low/high watermarks to find out what is the current memory
>> pressure.
>> If we had another concept with cgroups you would need to switch your
>> mindset to analyze things.
>>
>> I am sorry, but I still do not see any reason why those cgroup watermaks
>> cannot be based on total-usage.
>
> Hmm, so, the interface should be
>
>  memory.watermark  --- the total usage which kernel's memory shrinker starts.
>
> ?


>
> I'm okay with this. And I think this parameter should be fully independent from
> the limit.

We need two watermarks like high/low, where one is used to trigger the
background reclaim and the other one is for stopping it. Using the
limit to calculate the wmarks is straightforward, since doing
background reclaim reduces the latency spikes of direct reclaim, and
direct reclaim is triggered when the usage hits the limit.

This is different from the "soft_limit", which is based on the usage,
and we don't want to reinvent the soft_limit implementation.
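
A rough model of that trigger/stop pair (invented names; it assumes the
assignment used in this series, where reclaim starts above the low mark and
stops at the high mark, both expressed in used bytes below the hard limit):

typedef unsigned long long u64;

struct wmarks {
	u64 low;	/* trigger: start once usage > low (just under the limit)  */
	u64 high;	/* target:  stop once usage <= high (further from the limit) */
};

/* one pass of a hypothetical background worker */
static u64 background_pass(u64 usage, const struct wmarks *wm)
{
	u64 batch = 1 << 20;		/* pretend we free 1M per step       */

	if (usage <= wm->low)		/* below the trigger: nothing to do  */
		return usage;
	while (usage > wm->high)	/* push usage back to the stop mark  */
		usage -= (usage - wm->high < batch) ? usage - wm->high : batch;
	return usage;
}

The gap between the two marks is what keeps the worker from flapping on and
off around a single threshold; the names mirror the free-memory watermarks of
global reclaim, which is why they read as inverted when expressed in used
bytes.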

--Ying

>
> Memcg can work without watermark reclaim. I think my patch just adds a new
> _limit_ which a user can shrink usage of memory on deamand with kernel's help.
> Memory reclaim works in background but this is not a kswapd, at all.
>
> I guess performance benefit of using watermark under a cgroup which has limit
> is very small and I think this is not for a performance tuning parameter.
> This is just a new limit.
>
> Comparing 2 cases,
>
>  cgroup A)
>   - has limit of 300M, no watermaks.
>  cgroup B)
>   - has limit of UNLIMITED, watermarks=300M
>
> A) has hard limit and memory reclaim cost is paid by user threads, and have
> risks of OOM under memcg.
> B) has no hard limit and memory reclaim cost is paid by kernel threads, and
> will not have risk of OOM under memcg, but can be CPU burning.
>
> I think this should be called as soft-limit ;) But we have another soft-limit now.
> Then, I call this as watermark. This will be useful to resize usage of memory
> in online because application will not hit limit and get big latency even while
> an admin makes watermark smaller.
>
> Hmm, maybe I should allow watermark > limit setting ;).
>
> Thanks,
> -Kame
>
>
>
>
>
>
> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09  0:21                         ` KAMEZAWA Hiroyuki
@ 2011-05-09  5:47                           ` Ying Han
  2011-05-09  9:58                           ` Johannes Weiner
  1 sibling, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-05-09  5:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Sun, May 8, 2011 at 5:21 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 6 May 2011 16:22:57 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
>> On Fri, May 06, 2011 at 02:28:34PM +0900, KAMEZAWA Hiroyuki wrote:
>> > Hmm, so, the interface should be
>> >
>> >   memory.watermark  --- the total usage which kernel's memory shrinker starts.
>> >
>> > ?
>> >
>> > I'm okay with this. And I think this parameter should be fully independent from
>> > the limit.
>> >
>> > Memcg can work without watermark reclaim. I think my patch just adds a new
>> > _limit_ which a user can shrink usage of memory on deamand with kernel's help.
>> > Memory reclaim works in background but this is not a kswapd, at all.
>> >
>> > I guess performance benefit of using watermark under a cgroup which has limit
>> > is very small and I think this is not for a performance tuning parameter.
>> > This is just a new limit.
>> >
>> > Comparing 2 cases,
>> >
>> >  cgroup A)
>> >    - has limit of 300M, no watermaks.
>> >  cgroup B)
>> >    - has limit of UNLIMITED, watermarks=300M
>> >
>> > A) has hard limit and memory reclaim cost is paid by user threads, and have
>> > risks of OOM under memcg.
>> > B) has no hard limit and memory reclaim cost is paid by kernel threads, and
>> > will not have risk of OOM under memcg, but can be CPU burning.
>> >
>> > I think this should be called as soft-limit ;) But we have another soft-limit now.
>> > Then, I call this as watermark. This will be useful to resize usage of memory
>> > in online because application will not hit limit and get big latency even while
>> > an admin makes watermark smaller.
>>
>> I have two thoughts to this:
>>
>> 1. Even though the memcg will not hit the limit and the application
>> will not be forced to do memcg target reclaim, the watermark reclaim
>> will steal pages from the memcg and the application will suffer the
>> page faults, so it's not an unconditional win.
>>
>
> Considering the whole system, I never think this watermark can be a performance
> help. This consumes the same amount of cpu as a memory freeing thread uses.
> In realistic situaion, in busy memcy, several threads hits limit at the same
> time and a help by a thread will not be much help.
>
>> 2. I understand how the feature is supposed to work, but I don't
>> understand or see a use case for the watermark being configurable.
>> Don't get me wrong, I completely agree with watermark reclaim, it's a
>> good latency optimization.  But I don't see why you would want to
>> manually push back a memcg by changing the watermark.
>>
>
> For keeping free memory, when the system is not busy.
>
>> Ying wrote in another email that she wants to do this to make room for
>> another job that is about to get launched.  My reply to that was that
>> you should just launch the job and let global memory pressure push
>> back that memcg instead.  So instead of lowering the watermark, you
>> could lower the soft limit and don't do any reclaim at all until real
>> pressure arises.  You said yourself that the new feature should be
>> called soft limit.  And I think it is because it is a reimplementation
>> of the soft limit!
>>
>
> Soft limit works only when the system is in memory shortage. It means the
> system need to use cpu for memory reclaim when the system is very busy.
> This works always an admin wants. This difference will affects page allocation
> latency and execution time of application. In some customer, when he wants to
> start up an application in 1 sec, it must be in 1 sec. As you know, kswapd's
> memory reclaim itself is too slow against rapid big allocation or burst of
> network packet allocation and direct reclaim runs always. Then, it's not
> avoidable to reclaim/scan memory when the system is busy.  This feature allows
> admins to schedule memory reclaim when the systen is calm. It's like control of
> scheduling GC.

Agree on this. For the configurable per-memcg wmarks, one difference from
adjusting the soft_limit is that we would like to trigger the per-memcg bg
reclaim before the whole system is under memory pressure. The concept of
soft_limit is quite different from the wmarks: the former can be used to
over-commit the system efficiently, which has nothing to do with per-memcg
background reclaim.

--Ying

>
> IIRC, there was a trial to free memory when idle() runs....but it doesn't meet
> current system requirement as idle() should be idle. What I think is a feature
> like a that with a help of memcg.
>
> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09  5:40                       ` Ying Han
@ 2011-05-09  7:10                         ` KAMEZAWA Hiroyuki
  2011-05-09 10:18                           ` Johannes Weiner
  0 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-09  7:10 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, KOSAKI Motohiro, linux-mm, balbir, nishimura, akpm,
	Johannes Weiner, minchan.kim

On Sun, 8 May 2011 22:40:47 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, May 5, 2011 at 10:28 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 5 May 2011 08:59:01 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> >
> >> On Wed 04-05-11 10:16:39, Ying Han wrote:
> >> > On Wed, May 4, 2011 at 1:58 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >> > > On Tue 03-05-11 10:01:27, Ying Han wrote:
> >> > >> On Tue, May 3, 2011 at 1:25 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >> > >> > On Tue 03-05-11 16:45:23, KOSAKI Motohiro wrote:
> >> > >> >> 2011/5/3 Michal Hocko <mhocko@suse.cz>:
> >> > >> >> > On Sun 01-05-11 15:06:02, KOSAKI Motohiro wrote:
> >> > >> >> >> > On Mon 25-04-11 18:28:49, KAMEZAWA Hiroyuki wrote:
> >> > > [...]
> >> > >> >> >> Can you please clarify this? I feel it is not opposite semantics.
> >> > >> >> >
> >> > >> >> > In the global reclaim low watermark represents the point when we _start_
> >> > >> >> > background reclaim while high watermark is the _stopper_. Watermarks are
> >> > >> >> > based on the free memory while this proposal makes it based on the used
> >> > >> >> > memory.
> >> > >> >> > I understand that the result is same in the end but it is really
> >> > >> >> > confusing because you have to switch your mindset from free to used and
> >> > >> >> > from under the limit to above the limit.
> >> > >> >>
> >> > >> >> Ah, right. So, do you have an alternative idea?
> >> > >> >
> >> > >> > Why cannot we just keep the global reclaim semantic and make it free
> >> > >> > memory (hard_limit - usage_in_bytes) based with low limit as the trigger
> >> > >> > for reclaiming?
> >> > >>
> >> > > [...]
> >> > >> The current scheme
> >> > >
> >> > > What is the current scheme?
> >> >
> >> > using the "usage_in_bytes" instead of "free"
> >> >
> >> > >> is closer to the global bg reclaim which the low is triggering reclaim
> >> > >> and high is stopping reclaim. And we can only use the "usage" to keep
> >> > >> the same API.
> >>
> >
> > Sorry for long absence.
> >
> >> And how is this closer to the global reclaim semantic which is based on
> >> the available memory?
> >
> > It's never be the same feature and not a similar feature, I think.
> >
> >> What I am trying to say here is that this new watermark concept doesn't
> >> fit in with the global reclaim. Well, standard user might not be aware
> >> of the zone watermarks at all because they cannot be set. But still if
> >> you are analyzing your memory usage you still check and compare free
> >> memory to min/low/high watermarks to find out what is the current memory
> >> pressure.
> >> If we had another concept with cgroups you would need to switch your
> >> mindset to analyze things.
> >>
> >> I am sorry, but I still do not see any reason why those cgroup watermaks
> >> cannot be based on total-usage.
> >
> > Hmm, so, the interface should be
> >
> >   memory.watermark  --- the total usage which kernel's memory shrinker starts.
> >
> > ?
> 
> 
> >
> > I'm okay with this. And I think this parameter should be fully independent from
> > the limit.
> 
> We need two watermarks like high/low where one is used to trigger the
> background reclaim and the other one is for stopping it. 

To avoid confusion, I'll use other words: "shrink_to" and "shrink_over".
When the usage goes over "shrink_over", the kernel reduces the usage to "shrink_to".


IMHO, determining the shrink_over-shrink_to distance is both difficult and easy.
It's difficult because it depends on the workload, and if the distance is too
large, it will consume more cpu time than expected. It's easy because some small
shrink_over-shrink_to distance works well for usual use, as I set 4MB in my series.
(The shrink_over - shrink_to distance is meaningless for users, I think.)

I think shrink_over-shrink_to is an implementation detail just for avoiding
frequently switching memory reclaim on/off, IOW, for doing the job in a batched
manner.

So, my patch hides "shrink_over" and just shows "shrink_to".
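
A minimal sketch of that arrangement (invented names; the 4MB figure is the
batch distance mentioned above):

typedef unsigned long long u64;

#define SHRINK_BATCH	(4ULL << 20)	/* internal batch distance, not a user knob */

/* the hidden "shrink_over" is derived from the only exposed value */
static int should_start_shrinking(u64 usage, u64 shrink_to)
{
	return usage > shrink_to + SHRINK_BATCH;
}

static int should_stop_shrinking(u64 usage, u64 shrink_to)
{
	return usage <= shrink_to;
}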


> Using the
> limit to calculate the wmarks is straight-forward since doing
> background reclaim reduces the latency spikes under direct reclaim.
> The direct reclaim is triggered while the usage is hitting the limit.
> 
> This is different from the "soft_limit" which is based on the usage
> and we don't want to reinvent the soft_limit implementation.
> 
Yes, this is a different feature.


The discussion here is how to make APIs for "shrink_to" and "shrink_over", ok ?

I think there are 3 candidates.

  1. using distance to limit.
     memory.shrink_to_distance
           - memory will be freed to 'limit - shrink_to_distance'.
     memory.shrink_over_distance
           - memory will be freed when usage > 'limit - shrink_over_distance'

     Pros.
      - Both of shrink_over and shrink_to can be determined by users.
      - Can keep stable distance to limit even when limit is changed.
     Cons.
      - complicated and seems unnatural.
      - hierarchy support will be very difficult.

  2. using bare value
     memory.shrink_to
           - memory will be freed to this 'shrink_to'
     memory.shrink_from
           - memory will be freed when usage over this value.
     Pros.
      - Both of shrink_over and shrink_to can be determined by users.
      - easy to understand, straightforward.
      - hierarchy support will be easy.
     Cons.
      - The user may need to change this value when he changes the limit.


  3. using only 'shrink_to'
     memory.shrink_to
           - memory will be freed to this value when the usage goes over this value
             to some extent (determined by the system.)

     Pros.
      - easy interface.
      - hierarchy support will be easy.
      - bad configuration check is very easy. 
     Cons.
      - The user may need to change this value when he changes the limit.


Then, I now vote for 3 because hierarchy support is easiest and it is handy
enough for real use.
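
For comparison, all three candidates boil down to the same internal pair; a
small illustrative sketch (names invented, SHRINK_BATCH standing for the
internal 4MB distance):

typedef unsigned long long u64;

struct shrink_range { u64 over, to; };

#define SHRINK_BATCH	(4ULL << 20)

/* 1: distances from the hard limit; tracks the limit automatically */
static struct shrink_range candidate1(u64 limit, u64 to_dist, u64 over_dist)
{
	return (struct shrink_range){ limit - over_dist, limit - to_dist };
}

/* 2: two bare byte values, independent of the limit */
static struct shrink_range candidate2(u64 shrink_from, u64 shrink_to)
{
	return (struct shrink_range){ shrink_from, shrink_to };
}

/* 3: one bare value; the start point is derived internally */
static struct shrink_range candidate3(u64 shrink_to)
{
	return (struct shrink_range){ shrink_to + SHRINK_BATCH, shrink_to };
}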

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09  0:21                         ` KAMEZAWA Hiroyuki
  2011-05-09  5:47                           ` Ying Han
@ 2011-05-09  9:58                           ` Johannes Weiner
  2011-05-09  9:59                             ` KAMEZAWA Hiroyuki
  2011-05-10  4:43                             ` Ying Han
  1 sibling, 2 replies; 68+ messages in thread
From: Johannes Weiner @ 2011-05-09  9:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, Ying Han, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, May 09, 2011 at 09:21:12AM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 6 May 2011 16:22:57 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Fri, May 06, 2011 at 02:28:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > Hmm, so, the interface should be
> > > 
> > >   memory.watermark  --- the total usage which kernel's memory shrinker starts.
> > > 
> > > ?
> > > 
> > > I'm okay with this. And I think this parameter should be fully independent from
> > > the limit.
> > > 
> > > Memcg can work without watermark reclaim. I think my patch just adds a new
> > > _limit_ which a user can shrink usage of memory on deamand with kernel's help.
> > > Memory reclaim works in background but this is not a kswapd, at all.
> > > 
> > > I guess performance benefit of using watermark under a cgroup which has limit
> > > is very small and I think this is not for a performance tuning parameter. 
> > > This is just a new limit.
> > > 
> > > Comparing 2 cases,
> > > 
> > >  cgroup A)
> > >    - has limit of 300M, no watermaks.
> > >  cgroup B)
> > >    - has limit of UNLIMITED, watermarks=300M
> > > 
> > > A) has hard limit and memory reclaim cost is paid by user threads, and have
> > > risks of OOM under memcg.
> > > B) has no hard limit and memory reclaim cost is paid by kernel threads, and
> > > will not have risk of OOM under memcg, but can be CPU burning.
> > > 
> > > I think this should be called as soft-limit ;) But we have another soft-limit now.
> > > Then, I call this as watermark. This will be useful to resize usage of memory
> > > in online because application will not hit limit and get big latency even while
> > > an admin makes watermark smaller.
> > 
> > I have two thoughts to this:
> > 
> > 1. Even though the memcg will not hit the limit and the application
> > will not be forced to do memcg target reclaim, the watermark reclaim
> > will steal pages from the memcg and the application will suffer the
> > page faults, so it's not an unconditional win.
> > 
> 
> Considering the whole system, I never think this watermark can be a performance
> help. This consumes the same amount of cpu as a memory freeing thread uses.
> In realistic situaion, in busy memcy, several threads hits limit at the same
> time and a help by a thread will not be much help.
> 
> > 2. I understand how the feature is supposed to work, but I don't
> > understand or see a use case for the watermark being configurable.
> > Don't get me wrong, I completely agree with watermark reclaim, it's a
> > good latency optimization.  But I don't see why you would want to
> > manually push back a memcg by changing the watermark.
> > 
> 
> For keeping free memory, when the system is not busy.
> 
> > Ying wrote in another email that she wants to do this to make room for
> > another job that is about to get launched.  My reply to that was that
> > you should just launch the job and let global memory pressure push
> > back that memcg instead.  So instead of lowering the watermark, you
> > could lower the soft limit and don't do any reclaim at all until real
> > pressure arises.  You said yourself that the new feature should be
> > called soft limit.  And I think it is because it is a reimplementation
> > of the soft limit!
> > 
> 
> Soft limit works only when the system is in memory shortage. It means the
> system need to use cpu for memory reclaim when the system is very busy.
> This works always an admin wants. This difference will affects page allocation
> latency and execution time of application. In some customer, when he wants to
> start up an application in 1 sec, it must be in 1 sec. As you know, kswapd's
> memory reclaim itself is too slow against rapid big allocation or burst of
> network packet allocation and direct reclaim runs always. Then, it's not
> avoidable to reclaim/scan memory when the system is busy.  This feature allows
> admins to schedule memory reclaim when the systen is calm. It's like control of
> scheduling GC.
> 
> IIRC, there was a trial to free memory when idle() runs....but it doesn't meet
> current system requirement as idle() should be idle. What I think is a feature
> like a that with a help of memcg.

Thanks a lot for the explanation, this certainly makes sense.

How about this: we put in memcg watermark reclaim first, as a pure
best-effort latency optimization, without the watermark configurable
from userspace.  It's not a new concept, we have it with kswapd on a
global level.

And on top of that, as a separate changeset, userspace gets a knob to
kick off async memcg reclaim when the system is idle.  With the
justification you wrote above.  We can still discuss the exact
mechanism, but the async memcg reclaim feature has value in itself and
should not have to wait until this second step is all figured out.

Would this be acceptable?

Thanks again.

	Hannes


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09  9:58                           ` Johannes Weiner
@ 2011-05-09  9:59                             ` KAMEZAWA Hiroyuki
  2011-05-10  4:43                             ` Ying Han
  1 sibling, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-09  9:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Ying Han, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, 9 May 2011 11:58:04 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Mon, May 09, 2011 at 09:21:12AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Fri, 6 May 2011 16:22:57 +0200
> > Johannes Weiner <hannes@cmpxchg.org> wrote:

> Thanks a lot for the explanation, this certainly makes sense.
> 
> How about this: we put in memcg watermark reclaim first, as a pure
> best-effort latency optimization, without the watermark configurable
> from userspace.  It's not a new concept, we have it with kswapd on a
> global level.
> 
> And on top of that, as a separate changeset, userspace gets a knob to
> kick off async memcg reclaim when the system is idle.  With the
> justification you wrote above.  We can still discuss the exact
> mechanism, but the async memcg reclaim feature has value in itself and
> should not have to wait until this second step is all figured out.
> 
> Would this be acceptable?
> 

It's okay for me. I'll change the order of the patches and merge patches from
the core parts.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09  7:10                         ` KAMEZAWA Hiroyuki
@ 2011-05-09 10:18                           ` Johannes Weiner
  2011-05-09 12:49                             ` Michal Hocko
  2011-05-10  4:51                             ` Ying Han
  0 siblings, 2 replies; 68+ messages in thread
From: Johannes Weiner @ 2011-05-09 10:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, Michal Hocko, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, May 09, 2011 at 04:10:47PM +0900, KAMEZAWA Hiroyuki wrote:
> On Sun, 8 May 2011 22:40:47 -0700
> Ying Han <yinghan@google.com> wrote:
> > Using the
> > limit to calculate the wmarks is straight-forward since doing
> > background reclaim reduces the latency spikes under direct reclaim.
> > The direct reclaim is triggered while the usage is hitting the limit.
> > 
> > This is different from the "soft_limit" which is based on the usage
> > and we don't want to reinvent the soft_limit implementation.
> > 
> Yes, this is a different feature.
> 
> 
> The discussion here is how to make APIs for "shrink_to" and "shrink_over", ok ?
> 
> I think there are 3 candidates.
> 
>   1. using distance to limit.
>      memory.shrink_to_distance
>            - memory will be freed to 'limit - shrink_to_distance'.
>      memory.shrink_over_distance
>            - memory will be freed when usage > 'limit - shrink_over_distance'
> 
>      Pros.
>       - Both of shrink_over and shirnk_to can be determined by users.
>       - Can keep stable distance to limit even when limit is changed.
>      Cons.
>       - complicated and seems not natural.
>       - hierarchy support will be very difficult.
> 
>   2. using bare value
>      memory.shrink_to
>            - memory will be freed to this 'shirnk_to'
>      memory.shrink_from
>            - memory will be freed when usage over this value.
>      Pros.
>       - Both of shrink_over and shrink)to can be determined by users.
>       - easy to understand, straightforward.
>       - hierarchy support will be easy.
>      Cons.
>       - The user may need to change this value when he changes the limit.
> 
> 
>   3. using only 'shrink_to'
>      memory.shrink_to
>            - memory will be freed to this value when the usage goes over this vaue
>              to some extent (determined by the system.)
> 
>      Pros.
>       - easy interface.
>       - hierarchy support will be easy.
>       - bad configuration check is very easy. 
>      Cons.
>       - The user may beed to change this value when he changes the limit.
> 
> 
> Then, I now vote for 3 because hierarchy support is easiest and enough handy for
> real use.

3. looks best to me as well.

What I am wondering, though: we already have a limit to push back
memcgs when we need memory, the soft limit.  The 'need for memory' is
currently defined as global memory pressure, which we know may be too
late.  The problem is not having no limit, the problem is that we want
to control the time of when this limit is enforced.  So instead of
adding another limit, could we instead add a knob like

	memory.force_async_soft_reclaim

that asynchronously pushes back to the soft limit instead of having
another, separate limit to configure?

Pros:
- easy interface
- limit already existing
- hierarchy support already existing
- bad configuration check already existing
Cons:
- ?
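
In outline, such a knob might behave like this (a model with invented names,
sketching the proposal rather than any existing interface):

typedef unsigned long long u64;

struct group_model {
	u64 usage;
	u64 soft_limit;
	int async_reclaim_queued;
};

/* a write to the knob would just queue work... */
static void force_async_soft_reclaim(struct group_model *g)
{
	if (g->usage > g->soft_limit)
		g->async_reclaim_queued = 1;
}

/* ...and the worker reuses the existing soft limit as its target, so no
 * new limit has to be configured or kept in sync with the hard limit */
static void async_worker(struct group_model *g)
{
	u64 batch = 1 << 20;		/* pretend we reclaim 1M per step */

	if (!g->async_reclaim_queued)
		return;
	while (g->usage > g->soft_limit)
		g->usage -= (g->usage - g->soft_limit < batch) ?
			     g->usage - g->soft_limit : batch;
	g->async_reclaim_queued = 0;
}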

	Hannes


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09 10:18                           ` Johannes Weiner
@ 2011-05-09 12:49                             ` Michal Hocko
  2011-05-09 23:49                               ` KAMEZAWA Hiroyuki
  2011-05-10  4:51                             ` Ying Han
  1 sibling, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2011-05-09 12:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Ying Han, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Mon 09-05-11 12:18:17, Johannes Weiner wrote:
> On Mon, May 09, 2011 at 04:10:47PM +0900, KAMEZAWA Hiroyuki wrote:
[...]
> What I am wondering, though: we already have a limit to push back
> memcgs when we need memory, the soft limit.  The 'need for memory' is
> currently defined as global memory pressure, which we know may be too
> late.  The problem is not having no limit, the problem is that we want
> to control the time of when this limit is enforced.  So instead of
> adding another limit, could we instead add a knob like
> 
> 	memory.force_async_soft_reclaim
> 
> that asynchroneously pushes back to the soft limit instead of having
> another, separate limit to configure?

Sounds much better than a separate watermark to me. I am just wondering
how we would implement soft-unlimited groups with background reclaim.
Btw. is anybody relying on such a configuration? To me it sounds like
something should be either limited or unlimited, and making it half of
both is hacky.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09 12:49                             ` Michal Hocko
@ 2011-05-09 23:49                               ` KAMEZAWA Hiroyuki
  2011-05-10  4:39                                 ` Ying Han
  0 siblings, 1 reply; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-09 23:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Ying Han, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, 9 May 2011 14:49:17 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Mon 09-05-11 12:18:17, Johannes Weiner wrote:
> > On Mon, May 09, 2011 at 04:10:47PM +0900, KAMEZAWA Hiroyuki wrote:
> [...]
> > What I am wondering, though: we already have a limit to push back
> > memcgs when we need memory, the soft limit.  The 'need for memory' is
> > currently defined as global memory pressure, which we know may be too
> > late.  The problem is not having no limit, the problem is that we want
> > to control the time of when this limit is enforced.  So instead of
> > adding another limit, could we instead add a knob like
> > 
> > 	memory.force_async_soft_reclaim
> > 
> > that asynchroneously pushes back to the soft limit instead of having
> > another, separate limit to configure?
> 

Hmm, ok to me. 

> Sound much better than a separate watermark to me. I am just wondering
> how we would implement soft unlimited groups with background reclaim.
> Btw. is anybody relying on such configuration? To me it sounds like
> something should be either limited or unlimited and making it half of
> both is hacky.

I'm not thinking of a soft-unlimited configuration. I don't want to handle it
in some automatic way.

Anyway, I'll add
  - _automatic_ background reclaim against the memory limit, which works
    regardless of the softlimit.
  - An interface to force softlimit reclaim.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09 23:49                               ` KAMEZAWA Hiroyuki
@ 2011-05-10  4:39                                 ` Ying Han
  0 siblings, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-05-10  4:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, Johannes Weiner, KOSAKI Motohiro, linux-mm, balbir,
	nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, May 9, 2011 at 4:49 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 9 May 2011 14:49:17 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
>
>> On Mon 09-05-11 12:18:17, Johannes Weiner wrote:
>> > On Mon, May 09, 2011 at 04:10:47PM +0900, KAMEZAWA Hiroyuki wrote:
>> [...]
>> > What I am wondering, though: we already have a limit to push back
>> > memcgs when we need memory, the soft limit.  The 'need for memory' is
>> > currently defined as global memory pressure, which we know may be too
>> > late.  The problem is not having no limit, the problem is that we want
>> > to control the time of when this limit is enforced.  So instead of
>> > adding another limit, could we instead add a knob like
>> >
>> >     memory.force_async_soft_reclaim
>> >
>> > that asynchronously pushes back to the soft limit instead of having
>> > another, separate limit to configure?
>>
>
> Hmm, ok to me.

I don't have a problem with the actual tunable for this, but I don't think
setting the soft_limit as the target for per-memcg background reclaim
is feasible in some cases. That would be more aggressive than necessary.
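
To put rough numbers on it: with, say, a 500M hard limit and a watermark a few
percent below it, each background reclaim episode only needs to trim a few tens
of megabytes, whereas pushing the group all the way down to a 300M soft limit
would mean reclaiming on the order of 200M at once. The figures are made up,
but they show how much more aggressive a soft_limit target would be.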

>
>> Sounds much better than a separate watermark to me. I am just wondering
>> how we would implement soft-unlimited groups with background reclaim.
>> Btw. is anybody relying on such a configuration? To me it sounds like
>> something should be either limited or unlimited; making it half of
>> both is hacky.
>
> I don't think of soft-unlimited configuration. I don't want to handle it
> in some automatic way.
>
> Anyway, I'll add
>  - _automatic_ background reclaim against the limit of memory, which works
>    regardless of the soft limit.

I agree with adding the background reclaim first w/ automatic watermark
setting and then adding a configurable knob on top of that. So I
assume we keep the same concept of high/low_wmarks; what's the
suggested default value for the watermarks? The default value now is
equal to the hard_limit, which disables the per-memcg background reclaim.
Under the new scheme, where we remove the configurable tunable, we need
to set it internally based on the hard_limit.
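
To make the question concrete, here is one way the kernel could derive both
marks internally from the hard limit alone, so removing the tunable still
leaves background reclaim enabled. The neutral trigger/target names, the 1/32
fraction and the 4MB..64MB clamp are illustrative guesses, not the values the
series uses.

#include <stdio.h>

#define MB (1024ULL * 1024ULL)

struct wmarks {
        unsigned long long trigger;     /* start background reclaim above this */
        unsigned long long target;      /* stop background reclaim below this */
};

/*
 * Derive default watermarks from the hard limit alone.  The gap is
 * clamped so tiny groups do not get a zero-sized band and huge groups
 * do not reclaim an excessive amount per wakeup.
 */
struct wmarks default_wmarks(unsigned long long hard_limit)
{
        unsigned long long gap = hard_limit / 32;

        if (gap < 4 * MB)
                gap = 4 * MB;
        if (gap > 64 * MB)
                gap = 64 * MB;
        if (gap > hard_limit / 2)       /* keep tiny groups sane */
                gap = hard_limit / 2;

        return (struct wmarks){
                .trigger = hard_limit - gap,
                .target  = hard_limit - gap - gap / 2,
        };
}

int main(void)
{
        struct wmarks w = default_wmarks(500 * MB);

        printf("trigger=%lluM target=%lluM\n", w.trigger / MB, w.target / MB);
        return 0;
}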

--Ying

>  - an interface to force soft-limit reclaim.
>
> Thanks,
> -Kame
>
>


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09  9:58                           ` Johannes Weiner
  2011-05-09  9:59                             ` KAMEZAWA Hiroyuki
@ 2011-05-10  4:43                             ` Ying Han
  1 sibling, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-05-10  4:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, KOSAKI Motohiro, linux-mm,
	balbir, nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, May 9, 2011 at 2:58 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Mon, May 09, 2011 at 09:21:12AM +0900, KAMEZAWA Hiroyuki wrote:
>> On Fri, 6 May 2011 16:22:57 +0200
>> Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> > On Fri, May 06, 2011 at 02:28:34PM +0900, KAMEZAWA Hiroyuki wrote:
>> > > Hmm, so, the interface should be
>> > >
>> > >   memory.watermark  --- the total usage which kernel's memory shrinker starts.
>> > >
>> > > ?
>> > >
>> > > I'm okay with this. And I think this parameter should be fully independent from
>> > > the limit.
>> > >
>> > > Memcg can work without watermark reclaim. I think my patch just adds a new
>> > > _limit_ which a user can shrink usage of memory on demand with kernel's help.
>> > > Memory reclaim works in background but this is not a kswapd, at all.
>> > >
>> > > I guess performance benefit of using watermark under a cgroup which has limit
>> > > is very small and I think this is not for a performance tuning parameter.
>> > > This is just a new limit.
>> > >
>> > > Comparing 2 cases,
>> > >
>> > >  cgroup A)
>> > >    - has limit of 300M, no watermarks.
>> > >  cgroup B)
>> > >    - has limit of UNLIMITED, watermarks=300M
>> > >
>> > > A) has hard limit and memory reclaim cost is paid by user threads, and have
>> > > risks of OOM under memcg.
>> > > B) has no hard limit and memory reclaim cost is paid by kernel threads, and
>> > > will not have risk of OOM under memcg, but can be CPU burning.
>> > >
>> > > I think this should be called as soft-limit ;) But we have another soft-limit now.
>> > > Then, I call this as watermark. This will be useful to resize usage of memory
>> > > in online because application will not hit limit and get big latency even while
>> > > an admin makes watermark smaller.
>> >
>> > I have two thoughts to this:
>> >
>> > 1. Even though the memcg will not hit the limit and the application
>> > will not be forced to do memcg target reclaim, the watermark reclaim
>> > will steal pages from the memcg and the application will suffer the
>> > page faults, so it's not an unconditional win.
>> >
>>
>> Considering the whole system, I don't think this watermark can be a performance
>> help. This consumes the same amount of cpu as a memory-freeing thread uses.
>> In a realistic situation, in a busy memcg, several threads hit the limit at the
>> same time, and help from one thread will not be much help.
>>
>> > 2. I understand how the feature is supposed to work, but I don't
>> > understand or see a use case for the watermark being configurable.
>> > Don't get me wrong, I completely agree with watermark reclaim, it's a
>> > good latency optimization.  But I don't see why you would want to
>> > manually push back a memcg by changing the watermark.
>> >
>>
>> For keeping free memory, when the system is not busy.
>>
>> > Ying wrote in another email that she wants to do this to make room for
>> > another job that is about to get launched.  My reply to that was that
>> > you should just launch the job and let global memory pressure push
>> > back that memcg instead.  So instead of lowering the watermark, you
>> > could lower the soft limit and don't do any reclaim at all until real
>> > pressure arises.  You said yourself that the new feature should be
>> > called soft limit.  And I think it is because it is a reimplementation
>> > of the soft limit!
>> >
>>
>> Soft limit works only when the system is in memory shortage. That means the
>> system needs to use cpu for memory reclaim when the system is very busy.
>> This works whenever an admin wants. This difference affects page allocation
>> latency and the execution time of applications. For some customers, when they
>> want to start up an application in 1 sec, it must be in 1 sec. As you know,
>> kswapd's memory reclaim itself is too slow against a rapid big allocation or a
>> burst of network packet allocations, and direct reclaim always runs. Then, it's
>> unavoidable to reclaim/scan memory when the system is busy.  This feature allows
>> admins to schedule memory reclaim when the system is calm. It's like controlling
>> when GC is scheduled.
>>
>> IIRC, there was a trial to free memory when idle() runs....but it doesn't meet
>> current system requirements, as idle() should be idle. What I have in mind is a
>> feature like that, with the help of memcg.
>
> Thanks a lot for the explanation, this certainly makes sense.
>
> How about this: we put in memcg watermark reclaim first, as a pure
> best-effort latency optimization, without the watermark configurable
> from userspace.  It's not a new concept, we have it with kswapd on a
> global level.
>
> And on top of that, as a separate changeset, userspace gets a knob to
> kick off async memcg reclaim when the system is idle.  With the
> justification you wrote above.  We can still discuss the exact
> mechanism, but the async memcg reclaim feature has value in itself and
> should not have to wait until this second step is all figured out.
>
> Would this be acceptable?

Agreed on this. Although we have users for a configurable tunable for
the watermarks, in most cases the default watermarks
calculated by the kernel should be enough.

--Ying
>
> Thanks again.
>
>        Hannes
>


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-09 10:18                           ` Johannes Weiner
  2011-05-09 12:49                             ` Michal Hocko
@ 2011-05-10  4:51                             ` Ying Han
  2011-05-10  6:27                               ` Johannes Weiner
  1 sibling, 1 reply; 68+ messages in thread
From: Ying Han @ 2011-05-10  4:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, KOSAKI Motohiro, linux-mm,
	balbir, nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, May 9, 2011 at 3:18 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Mon, May 09, 2011 at 04:10:47PM +0900, KAMEZAWA Hiroyuki wrote:
>> On Sun, 8 May 2011 22:40:47 -0700
>> Ying Han <yinghan@google.com> wrote:
>> > Using the
>> > limit to calculate the wmarks is straight-forward since doing
>> > background reclaim reduces the latency spikes under direct reclaim.
>> > The direct reclaim is triggered while the usage is hitting the limit.
>> >
>> > This is different from the "soft_limit" which is based on the usage
>> > and we don't want to reinvent the soft_limit implementation.
>> >
>> Yes, this is a different feature.
>>
>>
>> The discussion here is how to make APIs for "shrink_to" and "shrink_over", ok ?
>>
>> I think there are 3 candidates.
>>
>>   1. using distance to limit.
>>      memory.shrink_to_distance
>>            - memory will be freed to 'limit - shrink_to_distance'.
>>      memory.shrink_over_distance
>>            - memory will be freed when usage > 'limit - shrink_over_distance'
>>
>>      Pros.
>>       - Both of shrink_over and shrink_to can be determined by users.
>>       - Can keep stable distance to limit even when limit is changed.
>>      Cons.
>>       - complicated and seems not natural.
>>       - hierarchy support will be very difficult.
>>
>>   2. using bare value
>>      memory.shrink_to
>>            - memory will be freed to this 'shrink_to'
>>      memory.shrink_from
>>            - memory will be freed when usage over this value.
>>      Pros.
>>       - Both of shrink_over and shrink_to can be determined by users.
>>       - easy to understand, straightforward.
>>       - hierarchy support will be easy.
>>      Cons.
>>       - The user may need to change this value when he changes the limit.
>>
>>
>>   3. using only 'shrink_to'
>>      memory.shrink_to
>>            - memory will be freed to this value when the usage goes over this value
>>              to some extent (determined by the system.)
>>
>>      Pros.
>>       - easy interface.
>>       - hierarchy support will be easy.
>>       - bad configuration check is very easy.
>>      Cons.
>>       - The user may need to change this value when he changes the limit.
>>
>>
>> Then, I now vote for 3 because hierarchy support is easiest and handy enough for
>> real use.
>
> 3. looks best to me as well.
>
> What I am wondering, though: we already have a limit to push back
> memcgs when we need memory, the soft limit.  The 'need for memory' is
> currently defined as global memory pressure, which we know may be too
> late.  The problem is not having no limit, the problem is that we want
> to control the time of when this limit is enforced.  So instead of
> adding another limit, could we instead add a knob like
>
>        memory.force_async_soft_reclaim
>
> that asynchronously pushes back to the soft limit instead of having
> another, separate limit to configure?
>
> Pros:
> - easy interface
> - limit already existing
> - hierarchy support already existing
> - bad configuration check already existing
> Cons:
> - ?

Are we proposing to set the target of per-memcg background reclaim to
be the soft_limit? If so, I would strongly question that. The
logic of background reclaim is to start reclaiming memory before
reaching the hard_limit, and to stop once
it makes enough progress. The motivation is to reduce how often a
memcg hits direct reclaim, and that is quite different from
the design of the soft_limit. The soft_limit is designed to serve the
over-commit environment, where memory can be shared across memcgs
until there is global memory pressure. There is no correlation between
that and the watermark-based background reclaim.

Making the soft_limit the target for background reclaim will create extra
memory pressure when it is not necessary. So I have no issue with adding
the tunable later and setting the watermark equal to the soft_limit, but
using it as an alternative to the watermarks is not straightforward to
me at this point.
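
A compact model of that stopping rule (illustrative names only, not the patch
code): reclaim only until usage is back under the watermark, which sits well
above the soft limit, and give up if no progress is being made.

#include <stdbool.h>

struct memcg_model {
        unsigned long long usage;
        unsigned long long wmark_target;        /* stop once usage is back below this */
};

/* Stand-in for one pass of shrinking; returns bytes actually freed. */
typedef unsigned long long (*shrink_fn)(struct memcg_model *m,
                                        unsigned long long want);

/*
 * Reclaim only until usage is back under the watermark, and give up if
 * no progress is being made.  The soft limit never appears here --
 * reclaiming all the way down to it would be far more aggressive than
 * necessary.
 */
bool background_reclaim(struct memcg_model *m, shrink_fn shrink)
{
        int stalls = 0;

        while (m->usage > m->wmark_target) {
                unsigned long long freed = shrink(m, m->usage - m->wmark_target);

                m->usage -= freed;
                if (freed == 0) {
                        if (++stalls >= 3)
                                return false;   /* retry on the next wakeup */
                } else {
                        stalls = 0;
                }
        }
        return true;
}

/* Trivial driver: pretend each pass frees everything that was asked for. */
static unsigned long long fake_shrink(struct memcg_model *m,
                                      unsigned long long want)
{
        (void)m;
        return want;
}

int main(void)
{
        struct memcg_model m = { .usage = 500ULL << 20,
                                 .wmark_target = 400ULL << 20 };

        return background_reclaim(&m, fake_shrink) ? 0 : 1;
}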

Thanks

--Ying

>        Hannes
>


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-10  4:51                             ` Ying Han
@ 2011-05-10  6:27                               ` Johannes Weiner
  2011-05-10  7:09                                 ` Ying Han
  0 siblings, 1 reply; 68+ messages in thread
From: Johannes Weiner @ 2011-05-10  6:27 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, KOSAKI Motohiro, linux-mm,
	balbir, nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, May 09, 2011 at 09:51:43PM -0700, Ying Han wrote:
> On Mon, May 9, 2011 at 3:18 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Mon, May 09, 2011 at 04:10:47PM +0900, KAMEZAWA Hiroyuki wrote:
> >> On Sun, 8 May 2011 22:40:47 -0700
> >> Ying Han <yinghan@google.com> wrote:
> >> > Using the
> >> > limit to calculate the wmarks is straight-forward since doing
> >> > background reclaim reduces the latency spikes under direct reclaim.
> >> > The direct reclaim is triggered while the usage is hitting the limit.
> >> >
> >> > This is different from the "soft_limit" which is based on the usage
> >> > and we don't want to reinvent the soft_limit implementation.
> >> >
> >> Yes, this is a different feature.
> >>
> >>
> >> The discussion here is how to make APIs for "shrink_to" and "shrink_over", ok ?
> >>
> >> I think there are 3 candidates.
> >>
> >>   1. using distance to limit.
> >>      memory.shrink_to_distance
> >>            - memory will be freed to 'limit - shrink_to_distance'.
> >>      memory.shrink_over_distance
> >>            - memory will be freed when usage > 'limit - shrink_over_distance'
> >>
> >>      Pros.
> >>       - Both of shrink_over and shrink_to can be determined by users.
> >>       - Can keep stable distance to limit even when limit is changed.
> >>      Cons.
> >>       - complicated and seems not natural.
> >>       - hierarchy support will be very difficult.
> >>
> >>   2. using bare value
> >>      memory.shrink_to
> >>            - memory will be freed to this 'shrink_to'
> >>      memory.shrink_from
> >>            - memory will be freed when usage over this value.
> >>      Pros.
> >>       - Both of shrink_over and shrink_to can be determined by users.
> >>       - easy to understand, straightforward.
> >>       - hierarchy support will be easy.
> >>      Cons.
> >>       - The user may need to change this value when he changes the limit.
> >>
> >>
> >>   3. using only 'shrink_to'
> >>      memory.shrink_to
> >>            - memory will be freed to this value when the usage goes over this value
> >>              to some extent (determined by the system.)
> >>
> >>      Pros.
> >>       - easy interface.
> >>       - hierarchy support will be easy.
> >>       - bad configuration check is very easy.
> >>      Cons.
> >>       - The user may need to change this value when he changes the limit.
> >>
> >>
> >> Then, I now vote for 3 because hierarchy support is easiest and handy enough for
> >> real use.
> >
> > 3. looks best to me as well.
> >
> > What I am wondering, though: we already have a limit to push back
> > memcgs when we need memory, the soft limit.  The 'need for memory' is
> > currently defined as global memory pressure, which we know may be too
> > late.  The problem is not having no limit, the problem is that we want
> > to control the time of when this limit is enforced.  So instead of
> > adding another limit, could we instead add a knob like
> >
> >        memory.force_async_soft_reclaim
> >
> > that asynchronously pushes back to the soft limit instead of having
> > another, separate limit to configure?
> >
> > Pros:
> > - easy interface
> > - limit already existing
> > - hierarchy support already existing
> > - bad configuration check already existing
> > Cons:
> > - ?
> 
> Are we proposing to set the target of per-memcg background reclaim to
> be the soft_limit?

Yes, if memory.force_async_soft_reclaim is set.

> If so, I would strongly question that. The
> logic of background reclaim is to start reclaiming memory before
> reaching the hard_limit, and to stop once
> it makes enough progress. The motivation is to reduce how often a
> memcg hits direct reclaim, and that is quite different from
> the design of the soft_limit. The soft_limit is designed to serve the
> over-commit environment, where memory can be shared across memcgs
> until there is global memory pressure. There is no correlation between
> that and the watermark-based background reclaim.

Your whole argument for the knob so far has been that you want to use
it to proactively reduce memory usage, and I argued that it has
nothing to do with watermark reclaim.  This is exactly why I have been
fighting against making the watermark configurable and abusing it for that.

> Making the soft_limit the target for background reclaim will create extra
> memory pressure when it is not necessary. So I have no issue with adding
> the tunable later and setting the watermark equal to the soft_limit, but
> using it as an alternative to the watermarks is not straightforward to
> me at this point.

Please read my above proposal again; no one suggested replacing the
watermark with the soft limit.

1. The watermark is always in place, computed by the kernel alone, and
triggers background targeted reclaim when breached.

2. The soft limit is enforced as usual upon memory pressure.

3. In addition, the soft limit is enforced by background reclaim if
memory.force_async_soft_reclaim is set.

Thus, you can use 3. if you foresee memory pressure and want to
proactively push back a memcg to the soft limit.
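
Putting the three points together, a small illustrative model of the target the
background worker would aim for (placeholder names and example numbers, not
actual kernel code):

#include <stdbool.h>
#include <stdio.h>

struct memcg_model {
        unsigned long long usage;
        unsigned long long hard_limit;
        unsigned long long soft_limit;
        unsigned long long wmark;       /* kernel-computed, just below hard_limit */
        bool force_async_soft_reclaim;  /* the proposed knob */
};

/*
 * Target usage for background reclaim under the proposal:
 *  1. always push back below the kernel-computed watermark;
 *  2. the soft limit on its own is still only enforced under global
 *     memory pressure (not modelled here);
 *  3. if the knob is set, additionally push back to the soft limit.
 */
unsigned long long background_target(const struct memcg_model *m)
{
        unsigned long long target = m->wmark;

        if (m->force_async_soft_reclaim && m->soft_limit < target)
                target = m->soft_limit;
        return target;
}

int main(void)
{
        struct memcg_model m = {
                .usage = 480ULL << 20, .hard_limit = 500ULL << 20,
                .soft_limit = 300ULL << 20, .wmark = 460ULL << 20,
                .force_async_soft_reclaim = false,
        };

        printf("target without knob: %lluM\n", background_target(&m) >> 20);
        m.force_async_soft_reclaim = true;
        printf("target with knob:    %lluM\n", background_target(&m) >> 20);
        return 0;
}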


* Re: [PATCH 1/7] memcg: add high/low watermark to res_counter
  2011-05-10  6:27                               ` Johannes Weiner
@ 2011-05-10  7:09                                 ` Ying Han
  0 siblings, 0 replies; 68+ messages in thread
From: Ying Han @ 2011-05-10  7:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, KOSAKI Motohiro, linux-mm,
	balbir, nishimura, akpm, Johannes Weiner, minchan.kim

On Mon, May 9, 2011 at 11:27 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Mon, May 09, 2011 at 09:51:43PM -0700, Ying Han wrote:
>> On Mon, May 9, 2011 at 3:18 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > On Mon, May 09, 2011 at 04:10:47PM +0900, KAMEZAWA Hiroyuki wrote:
>> >> On Sun, 8 May 2011 22:40:47 -0700
>> >> Ying Han <yinghan@google.com> wrote:
>> >> > Using the
>> >> > limit to calculate the wmarks is straight-forward since doing
>> >> > background reclaim reduces the latency spikes under direct reclaim.
>> >> > The direct reclaim is triggered while the usage is hitting the limit.
>> >> >
>> >> > This is different from the "soft_limit" which is based on the usage
>> >> > and we don't want to reinvent the soft_limit implementation.
>> >> >
>> >> Yes, this is a different feature.
>> >>
>> >>
>> >> The discussion here is how to make APIs for "shrink_to" and "shrink_over", ok ?
>> >>
>> >> I think there are 3 candidates.
>> >>
>> >>   1. using distance to limit.
>> >>      memory.shrink_to_distance
>> >>            - memory will be freed to 'limit - shrink_to_distance'.
>> >>      memory.shrink_over_distance
>> >>            - memory will be freed when usage > 'limit - shrink_over_distance'
>> >>
>> >>      Pros.
>> >>       - Both of shrink_over and shrink_to can be determined by users.
>> >>       - Can keep stable distance to limit even when limit is changed.
>> >>      Cons.
>> >>       - complicated and seems not natural.
>> >>       - hierarchy support will be very difficult.
>> >>
>> >>   2. using bare value
>> >>      memory.shrink_to
>> >>            - memory will be freed to this 'shrink_to'
>> >>      memory.shrink_from
>> >>            - memory will be freed when usage over this value.
>> >>      Pros.
>> >>       - Both of shrink_over and shrink_to can be determined by users.
>> >>       - easy to understand, straightforward.
>> >>       - hierarchy support will be easy.
>> >>      Cons.
>> >>       - The user may need to change this value when he changes the limit.
>> >>
>> >>
>> >>   3. using only 'shrink_to'
>> >>      memory.shrink_to
>> >>            - memory will be freed to this value when the usage goes over this value
>> >>              to some extent (determined by the system.)
>> >>
>> >>      Pros.
>> >>       - easy interface.
>> >>       - hierarchy support will be easy.
>> >>       - bad configuration check is very easy.
>> >>      Cons.
>> >>       - The user may need to change this value when he changes the limit.
>> >>
>> >>
>> >> Then, I now vote for 3 because hierarchy support is easiest and handy enough for
>> >> real use.
>> >
>> > 3. looks best to me as well.
>> >
>> > What I am wondering, though: we already have a limit to push back
>> > memcgs when we need memory, the soft limit.  The 'need for memory' is
>> > currently defined as global memory pressure, which we know may be too
>> > late.  The problem is not having no limit, the problem is that we want
>> > to control the time of when this limit is enforced.  So instead of
>> > adding another limit, could we instead add a knob like
>> >
>> >        memory.force_async_soft_reclaim
>> >
>> > that asynchronously pushes back to the soft limit instead of having
>> > another, separate limit to configure?
>> >
>> > Pros:
>> > - easy interface
>> > - limit already existing
>> > - hierarchy support already existing
>> > - bad configuration check already existing
>> > Cons:
>> > - ?
>>
>> Are we proposing to set the target of per-memcg background reclaim to
>> be the soft_limit?
>
> Yes, if memory.force_async_soft_reclaim is set.
>
>> If so, I would strongly question that. The
>> logic of background reclaim is to start reclaiming memory before
>> reaching the hard_limit, and to stop once
>> it makes enough progress. The motivation is to reduce how often a
>> memcg hits direct reclaim, and that is quite different from
>> the design of the soft_limit. The soft_limit is designed to serve the
>> over-commit environment, where memory can be shared across memcgs
>> until there is global memory pressure. There is no correlation between
>> that and the watermark-based background reclaim.
>
> Your whole argument for the knob so far has been that you want to use
> it to proactively reduce memory usage, and I argued that it has
> nothing to do with watermark reclaim.  This is exactly why I have been
> fighting against making the watermark configurable and abusing it for that.



>
>> Making the soft_limit the target for background reclaim will create extra
>> memory pressure when it is not necessary. So I have no issue with adding
>> the tunable later and setting the watermark equal to the soft_limit, but
>> using it as an alternative to the watermarks is not straightforward to
>> me at this point.
>
> Please read my above proposal again; no one suggested replacing the
> watermark with the soft limit.
>
> 1. The watermark is always in place, computed by the kernel alone, and
> triggers background targeted reclaim when breached.

>
> 2. The soft limit is enforced as usual upon memory pressure.

>
> 3. In addition, the soft limit is enforced by background reclaim if
> memory.force_async_soft_reclaim is set.
>
> Thus, you can use 3. if you foresee memory pressure and want to
> proactively push back a memcg to the soft limit.

Ok, thanks a lot for the clarification. I was confused at the
beginning, thinking force_async_soft_reclaim was being proposed as an
alternative to the background reclaim watermarks. So, the proposal looks
good to me, and the watermarks computed from the hard_limit by default
should work most of the time w/o the configurable tunable.

--Ying

>


Thread overview: 68+ messages
2011-04-25  9:25 [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
2011-04-25  9:28 ` [PATCH 1/7] memcg: add high/low watermark to res_counter KAMEZAWA Hiroyuki
2011-04-26 17:54   ` Ying Han
2011-04-29 13:33   ` Michal Hocko
2011-05-01  6:06     ` KOSAKI Motohiro
2011-05-03  6:49       ` Michal Hocko
2011-05-03  7:45         ` KOSAKI Motohiro
2011-05-03  8:25           ` Michal Hocko
2011-05-03 17:01             ` Ying Han
2011-05-04  8:58               ` Michal Hocko
2011-05-04 17:16                 ` Ying Han
2011-05-05  6:59                   ` Michal Hocko
2011-05-06  5:28                     ` KAMEZAWA Hiroyuki
2011-05-06 14:22                       ` Johannes Weiner
2011-05-09  0:21                         ` KAMEZAWA Hiroyuki
2011-05-09  5:47                           ` Ying Han
2011-05-09  9:58                           ` Johannes Weiner
2011-05-09  9:59                             ` KAMEZAWA Hiroyuki
2011-05-10  4:43                             ` Ying Han
2011-05-09  5:40                       ` Ying Han
2011-05-09  7:10                         ` KAMEZAWA Hiroyuki
2011-05-09 10:18                           ` Johannes Weiner
2011-05-09 12:49                             ` Michal Hocko
2011-05-09 23:49                               ` KAMEZAWA Hiroyuki
2011-05-10  4:39                                 ` Ying Han
2011-05-10  4:51                             ` Ying Han
2011-05-10  6:27                               ` Johannes Weiner
2011-05-10  7:09                                 ` Ying Han
2011-05-04  3:55             ` KOSAKI Motohiro
2011-05-04  8:55               ` Michal Hocko
2011-05-09  3:24                 ` KOSAKI Motohiro
2011-05-02  9:07   ` Balbir Singh
2011-05-06  5:30     ` KAMEZAWA Hiroyuki
2011-04-25  9:29 ` [PATCH 2/7] memcg high watermark interface KAMEZAWA Hiroyuki
2011-04-25 22:36   ` Ying Han
2011-04-25  9:31 ` [PATCH 3/7] memcg: select victim node in round robin KAMEZAWA Hiroyuki
2011-04-25  9:34 ` [PATCH 4/7] memcg fix scan ratio with small memcg KAMEZAWA Hiroyuki
2011-04-25 17:35   ` Ying Han
2011-04-26  1:43     ` KAMEZAWA Hiroyuki
2011-04-25  9:36 ` [PATCH 5/7] memcg bgreclaim core KAMEZAWA Hiroyuki
2011-04-26  4:59   ` Ying Han
2011-04-26  5:08     ` KAMEZAWA Hiroyuki
2011-04-26 23:15       ` Ying Han
2011-04-27  0:10         ` KAMEZAWA Hiroyuki
2011-04-27  1:01           ` KAMEZAWA Hiroyuki
2011-04-26 18:37   ` Ying Han
2011-04-25  9:40 ` [PATCH 6/7] memcg add zone_all_unreclaimable KAMEZAWA Hiroyuki
2011-04-25  9:42 ` [PATCH 7/7] memcg watermark reclaim workqueue KAMEZAWA Hiroyuki
2011-04-26 23:19   ` Ying Han
2011-04-27  0:31     ` KAMEZAWA Hiroyuki
2011-04-27  3:40       ` Ying Han
2011-04-25  9:43 ` [PATCH 8/7] memcg : reclaim statistics KAMEZAWA Hiroyuki
2011-04-26  5:35   ` Ying Han
2011-04-25  9:49 ` [PATCH 0/7] memcg background reclaim , yet another one KAMEZAWA Hiroyuki
2011-04-25 10:14 ` KAMEZAWA Hiroyuki
2011-04-25 22:21   ` Ying Han
2011-04-26  1:38     ` KAMEZAWA Hiroyuki
2011-04-26  7:19       ` Ying Han
2011-04-26  7:43         ` KAMEZAWA Hiroyuki
2011-04-26  8:43           ` Ying Han
2011-04-26  8:47             ` KAMEZAWA Hiroyuki
2011-04-26 23:08               ` Ying Han
2011-04-27  0:34                 ` KAMEZAWA Hiroyuki
2011-04-27  1:19                   ` Ying Han
2011-04-28  3:55               ` Ying Han
2011-04-28  4:05                 ` KAMEZAWA Hiroyuki
2011-05-02  7:02     ` Balbir Singh
2011-05-02  6:09 ` Balbir Singh
