* [PATCH 0/4] Memory controller soft limit patches (v3)
@ 2009-03-01  6:29 Balbir Singh
  2009-03-01  6:30 ` [PATCH 1/4] Memory controller soft limit documentation (v3) Balbir Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-01  6:29 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v3...v2
1. Implemented several review comments from Kosaki-San and Kamezawa-San
   Please see individual changelogs for changes

Changelog v2...v1
1. Soft limits now support hierarchies
2. Use spinlocks instead of mutexes for synchronization of the RB tree

Here is v3 of the new soft limit implementation. Soft limits are a new feature
for the memory resource controller; something similar has existed in the
group scheduler in the form of shares, though the CPU controller's
interpretation of shares is very different.

Soft limits are the most useful feature to have for environments where
the administrator wants to overcommit the system, such that only on memory
contention do the limits become active. The current soft limits implementation
provides a soft_limit_in_bytes interface for the memory controller, but not
for the memory+swap controller. The implementation maintains an RB-Tree of
groups that exceed their soft limit and starts reclaiming from the group that
exceeds this limit by the maximum amount.

If there are no major objections to the patches, I would like to get them
included in -mm.

TODOs

1. The current implementation maintains the delta from the soft limit
   and pushes back groups to their soft limits; a ratio of delta/soft_limit
   might be more useful (see the sketch below)
2. It would be nice to have a more targeted reclaim interface (in terms of
   pages to reclaim), so that groups are pushed back close to their soft
   limits.

Tests
-----

I've run two memory-intensive workloads with differing soft limits and
seen that they are pushed back to their soft limits on contention. Their usage
was their soft limit plus additional memory that they were able to grab
on the system. Soft limits can take a while before we see the expected
results.
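
For reference, a minimal sketch of such a setup (the group names and the
/cgroups mount point are hypothetical; memory.soft_limit_in_bytes is the
interface added in patch 2/4):

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/stat.h>

	/* write a soft limit to a group under an already mounted hierarchy */
	static void set_soft_limit(const char *group, const char *limit)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/cgroups/%s/memory.soft_limit_in_bytes", group);
		f = fopen(path, "w");
		if (!f) {
			perror(path);
			exit(1);
		}
		fprintf(f, "%s\n", limit);
		fclose(f);
	}

	int main(void)
	{
		mkdir("/cgroups/a", 0755);
		mkdir("/cgroups/b", 0755);
		set_soft_limit("a", "256M");	/* pushed back further on contention */
		set_soft_limit("b", "1G");
		/* ... run one memory-intensive workload in each group ... */
		return 0;
	}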

Please review, comment.

Series
------

memcg-soft-limit-documentation.patch
memcg-add-soft-limit-interface.patch
memcg-organize-over-soft-limit-groups.patch
memcg-soft-limit-reclaim-on-contention.patch



-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/4] Memory controller soft limit documentation (v3)
  2009-03-01  6:29 [PATCH 0/4] Memory controller soft limit patches (v3) Balbir Singh
@ 2009-03-01  6:30 ` Balbir Singh
  2009-03-01  6:30 ` [PATCH 2/4] Memory controller soft limit interface (v3) Balbir Singh
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-01  6:30 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

Add documentation for soft limit feature support.

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt |   27 ++++++++++++++++++++++++++-
 1 files changed, 26 insertions(+), 1 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index a98a7fe..812cb74 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -360,7 +360,32 @@ cgroups created below it.
 
 NOTE2: This feature can be enabled/disabled per subtree.
 
-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory (kswapd is woken up),
+control groups are pushed back to their soft limits. If the soft limit of each
+control group is very high, they are pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+7.1 Interface
+
+Soft limits can be set up using the following commands (in this example we
+assume a soft limit of 256 megabytes):
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+
+8. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first

-- 
	Balbir

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/4] Memory controller soft limit interface (v3)
  2009-03-01  6:29 [PATCH 0/4] Memory controller soft limit patches (v3) Balbir Singh
  2009-03-01  6:30 ` [PATCH 1/4] Memory controller soft limit documentation (v3) Balbir Singh
@ 2009-03-01  6:30 ` Balbir Singh
  2009-03-02  2:03   ` KAMEZAWA Hiroyuki
  2009-03-01  6:30 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v3) Balbir Singh
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-01  6:30 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v2...v1
1. Add support for res_counter_check_soft_limit_locked. This is used
   by the hierarchy code.

Add an interface to allow get/set of soft limits. Soft limits for the memory
plus swap controller (memsw) are currently not supported. Resource counters
have been enhanced to support soft limits and a new type RES_SOFT_LIMIT has
been added. Unlike hard limits, soft limits can be set directly and do not
need any reclaim or checks before being set to a newer value.

Kamezawa-San raised a question as to whether soft limit should belong
to res_counter. Since all resources understand the basic concepts of
hard and soft limits, it is justified to add soft limits here. Soft limits
are a generic resource usage feature, even file system quotas support
soft limits.
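
To illustrate the contrast (a sketch, not code from this patch;
try_to_shrink() is a hypothetical stand-in for the reclaim loop in
mem_cgroup_resize_limit()):

	/*
	 * A hard limit must not leave usage above the new value, so
	 * resizing retries reclaim and can fail with -EBUSY ...
	 */
	static int resize_hard_limit(struct mem_cgroup *memcg,
				     unsigned long long val)
	{
		int ret;

		while ((ret = res_counter_set_limit(&memcg->res, val)) == -EBUSY)
			if (!try_to_shrink(memcg))
				break;
		return ret;
	}

	/* ... while a soft limit is a plain store under the counter lock */
	static int resize_soft_limit(struct mem_cgroup *memcg,
				     unsigned long long val)
	{
		return res_counter_set_soft_limit(&memcg->res, val);
	}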

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |   47 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    3 +++
 mm/memcontrol.c             |   20 ++++++++++++++++++
 3 files changed, 70 insertions(+), 0 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..b5f14fa 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
 	 */
 	unsigned long long limit;
 	/*
+	 * the limit that usage can exceed
+	 */
+	unsigned long long soft_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -85,6 +89,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_LIMIT,
 	RES_FAILCNT,
+	RES_SOFT_LIMIT,
 };
 
 /*
@@ -130,6 +135,36 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 	return false;
 }
 
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->soft_limit)
+		return true;
+
+	return false;
+}
+
+/**
+ * res_counter_soft_limit_excess - get the difference between usage and soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to the soft limit;
+ * otherwise, the difference between usage and the soft limit.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->soft_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 /*
  * Helper function to detect if the cgroup is within it's limit or
  * not. It's currently called from cgroup_rss_prepare()
@@ -178,4 +213,16 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 	return ret;
 }
 
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+				unsigned long long soft_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->soft_limit = soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..4e6dafe 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->soft_limit = (unsigned long long)LLONG_MAX;
 	counter->parent = parent;
 }
 
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->limit;
 	case RES_FAILCNT:
 		return &counter->failcnt;
+	case RES_SOFT_LIMIT:
+		return &counter->soft_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7bb14fd..75a7b1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1939,6 +1939,20 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case RES_SOFT_LIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		/*
+		 * For memsw, soft limits are hard to implement in terms
+		 * of semantics; for now, we support soft limits only for
+		 * the memory controller without swap
+		 */
+		if (type == _MEM)
+			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else
+			ret = -EINVAL;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -2188,6 +2202,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read,
 	},
 	{
+		.name = "soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,

-- 
	Balbir

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/4] Memory controller soft limit organize cgroups (v3)
  2009-03-01  6:29 [PATCH 0/4] Memory controller soft limit patches (v3) Balbir Singh
  2009-03-01  6:30 ` [PATCH 1/4] Memory controller soft limit documentation (v3) Balbir Singh
  2009-03-01  6:30 ` [PATCH 2/4] Memory controller soft limit interface (v3) Balbir Singh
@ 2009-03-01  6:30 ` Balbir Singh
  2009-03-01  6:30 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v3) Balbir Singh
  2009-03-02  0:24 ` [PATCH 0/4] Memory controller soft limit patches (v3) KAMEZAWA Hiroyuki
  4 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-01  6:30 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v3...v2
1. Add only the ancestor to the RB-Tree
2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put

Changelog v2...v1
1. Add support for hierarchies
2. The res_counter that is highest in the hierarchy is returned on soft
   limit being exceeded. Since we do hierarchical reclaim and add all
   groups exceeding their soft limits, this approach seems to work well
   in practice.

This patch introduces an RB-Tree for storing memory cgroups that are over
their soft limit. The overall goal is to

1. Add a memory cgroup to the RB-Tree when its soft limit is exceeded.
   We are careful about updates; updates take place only after a particular
   time interval has passed
2. Remove the node from the RB-Tree when the usage goes below the soft
   limit

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.
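
The update rule, distilled from mem_cgroup_check_and_update_tree() in the
patch below (a sketch of the existing logic, not new code):

	static void update_tree(struct mem_cgroup *mem)
	{
		unsigned long long prev_excess = mem->usage_in_excess;
		unsigned long long new_excess =
			res_counter_soft_limit_excess(&mem->res);
		unsigned long next_update = mem->last_tree_update +
			MEM_CGROUP_TREE_UPDATE_INTERVAL;

		if (new_excess && time_after(jiffies, next_update)) {
			if (prev_excess)
				mem_cgroup_remove_exceeded(mem); /* drop stale slot */
			mem_cgroup_insert_exceeded(mem); /* keyed by new excess */
		} else if (prev_excess && !new_excess) {
			mem_cgroup_remove_exceeded(mem); /* under the limit again */
		}
	}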

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |    3 +
 kernel/res_counter.c        |   12 ++++-
 mm/memcontrol.c             |  111 +++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 120 insertions(+), 6 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index b5f14fa..e526ab6 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+		unsigned long val, struct res_counter **limit_fail_at,
+		struct res_counter **soft_limit_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4e6dafe..08b7614 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+			struct res_counter **limit_fail_at,
+			struct res_counter **soft_limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
 	struct res_counter *c, *u;
 
 	*limit_fail_at = NULL;
+	if (soft_limit_fail_at)
+		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
+		/*
+		 * With soft limits, we return the highest ancestor
+		 * that exceeds its soft limit
+		 */
+		if (soft_limit_fail_at &&
+			!res_counter_soft_limit_check_locked(c))
+			*soft_limit_fail_at = c;
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75a7b1a..0a2f855 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
 #include <linux/rcupdate.h>
 #include <linux/limits.h>
 #include <linux/mutex.h>
+#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
@@ -129,6 +130,14 @@ struct mem_cgroup_lru_info {
 };
 
 /*
+ * Cgroups above their soft limits are maintained in an RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+static struct rb_root mem_cgroup_soft_limit_tree;
+static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
+
+/*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
  * statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -176,12 +185,20 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	struct rb_node mem_cgroup_node;		/* RB tree node */
+	unsigned long long usage_in_excess;	/* Set to the value by which */
+						/* the soft limit is exceeded*/
+	unsigned long last_tree_update;		/* Last time the tree was */
+						/* updated in jiffies     */
+
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
 };
 
+#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
+
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -214,6 +231,41 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct mem_cgroup *mem_node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	while (*p) {
+		parent = *p;
+		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
+		if (mem->usage_in_excess < mem_node->usage_in_excess)
+			p = &(*p)->rb_left;
+		/*
+		 * Mem cgroups can be over their soft limit by the
+		 * same amount; ties go to the right subtree
+		 */
+		else if (mem->usage_in_excess >= mem_node->usage_in_excess)
+			p = &(*p)->rb_right;
+	}
+	rb_link_node(&mem->mem_cgroup_node, parent, p);
+	rb_insert_color(&mem->mem_cgroup_node,
+			&mem_cgroup_soft_limit_tree);
+	mem->last_tree_update = jiffies;
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -897,6 +949,40 @@ static void record_last_oom(struct mem_cgroup *mem)
 	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
 }
 
+static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
+						bool time_check)
+{
+	unsigned long long prev_usage_in_excess, new_usage_in_excess;
+	bool updated_tree = false;
+	unsigned long next_update = 0;
+	unsigned long flags;
+
+	if (!css_tryget(&mem->css))
+		return;
+	prev_usage_in_excess = mem->usage_in_excess;
+	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+
+	if (time_check)
+		next_update = mem->last_tree_update +
+				MEM_CGROUP_TREE_UPDATE_INTERVAL;
+	if (new_usage_in_excess && time_after(jiffies, next_update)) {
+		if (prev_usage_in_excess)
+			mem_cgroup_remove_exceeded(mem);
+		mem_cgroup_insert_exceeded(mem);
+		updated_tree = true;
+	} else if (prev_usage_in_excess && !new_usage_in_excess) {
+		mem_cgroup_remove_exceeded(mem);
+		updated_tree = true;
+	}
+
+	if (updated_tree) {
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		mem->last_tree_update = jiffies;
+		mem->usage_in_excess = new_usage_in_excess;
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	}
+	css_put(&mem->css);
+}
 
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
@@ -906,9 +992,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
 			bool oom)
 {
-	struct mem_cgroup *mem, *mem_over_limit;
+	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+	struct res_counter *fail_res, *soft_fail_res = NULL;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -938,12 +1024,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		int ret;
 		bool noswap = false;
 
-		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+						&soft_fail_res);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
-							&fail_res);
+							&fail_res, NULL);
 			if (likely(!ret))
 				break;
 			/* mem+swap counter fails */
@@ -985,6 +1072,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			goto nomem;
 		}
 	}
+
+	/*
+	 * Insert just the ancestor; we should trickle down to the correct
+	 * cgroup for reclaim, since the other nodes will be below their
+	 * soft limit
+	 */
+	if (soft_fail_res) {
+		mem_over_soft_limit =
+			mem_cgroup_from_res_counter(soft_fail_res, res);
+		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
+	}
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -1422,6 +1520,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
+	mem_cgroup_check_and_update_tree(mem, true);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		css_put(&mem->css);
@@ -2346,6 +2445,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
+	mem_cgroup_check_and_update_tree(mem, false);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -2412,6 +2512,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
 		parent = NULL;
+		mem_cgroup_soft_limit_tree = RB_ROOT;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
@@ -2432,6 +2533,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->usage_in_excess = 0;
+	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)

-- 
	Balbir

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 4/4] Memory controller soft limit reclaim on contention (v3)
  2009-03-01  6:29 [PATCH 0/4] Memory controller soft limit patches (v3) Balbir Singh
                   ` (2 preceding siblings ...)
  2009-03-01  6:30 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v3) Balbir Singh
@ 2009-03-01  6:30 ` Balbir Singh
  2009-03-02  3:08   ` KOSAKI Motohiro
  2009-03-02  0:24 ` [PATCH 0/4] Memory controller soft limit patches (v3) KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-01  6:30 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v3...v2
1. Convert several arguments to hierarchical reclaim to flags, thereby
   consolidating them
2. The reclaim for soft limits is now triggered from kswapd
3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument

Changelog v2...v1
1. Added support for hierarchical soft limits

This patch allows reclaim from memory cgroups on contention (via the
kswapd() path) only if the order is 0.

Memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest amount, reclaims pages from it, and then reinserts the
cgroup into its correct place in the RB-tree.
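
The cycle, simplified from mem_cgroup_soft_limit_reclaim() in this patch
(tree locking and the usage_in_excess bookkeeping are elided):

	unsigned long soft_limit_reclaim(struct zonelist *zonelist, gfp_t gfp_mask)
	{
		unsigned long nr_reclaimed = 0;
		struct mem_cgroup *mem;

		do {
			/* rightmost node: largest excess; removed from the tree */
			mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
			if (!mem)
				break;
			nr_reclaimed += mem_cgroup_hierarchical_reclaim(mem,
					zonelist, gfp_mask, MEM_CGROUP_RECLAIM_SOFT);
			/* reinsert at the position matching the new excess */
			if (res_counter_soft_limit_excess(&mem->res))
				mem_cgroup_insert_exceeded(mem);
			css_put(&mem->css);
		} while (!nr_reclaimed);
		return nr_reclaimed;
	}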

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |    2 +
 include/linux/swap.h       |    1 
 mm/memcontrol.c            |  134 ++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c                |    8 ++-
 4 files changed, 125 insertions(+), 20 deletions(-)


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..bf12451 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,6 +116,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
+extern unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zonelist,
+							gfp_t gfp_mask);
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 989eb53..d8c1c76 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,7 @@ static inline void lru_cache_add_active_file(struct page *page)
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+						  struct zonelist *zonelist,
 						  gfp_t gfp_mask, bool noswap,
 						  unsigned int swappiness);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0a2f855..06a4c68 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -191,6 +191,7 @@ struct mem_cgroup {
 	unsigned long last_tree_update;		/* Last time the tree was */
 						/* updated in jiffies     */
 
+	bool on_tree;				/* Is the node on tree? */
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -227,18 +228,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
+/*
+ * Flags used for hierarchical reclaim
+ */
+#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
+#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
+#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
+#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
+#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
+
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
-static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
 {
 	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
 	struct rb_node *parent = NULL;
 	struct mem_cgroup *mem_node;
-	unsigned long flags;
 
-	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	if (mem->on_tree)
+		return;
+
 	while (*p) {
 		parent = *p;
 		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
@@ -255,6 +267,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
 	rb_insert_color(&mem->mem_cgroup_node,
 			&mem_cgroup_soft_limit_tree);
 	mem->last_tree_update = jiffies;
+	mem->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+	if (!mem->on_tree)
+		return;
+	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	mem->on_tree = false;
+}
+
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	__mem_cgroup_insert_exceeded(mem);
 	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
 }
 
@@ -262,10 +291,36 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
-	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	__mem_cgroup_remove_exceeded(mem);
 	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
 }
 
+static struct mem_cgroup *mem_cgroup_get_largest_soft_limit_exceeding_node(void)
+{
+	struct rb_node *rightmost = NULL;
+	struct mem_cgroup *mem = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+retry:
+	rightmost = rb_last(&mem_cgroup_soft_limit_tree);
+	if (!rightmost)
+		goto done;		/* Nothing to reclaim from */
+
+	mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
+	/*
+	 * Remove the node now but someone else can add it back;
+	 * we will add it back at the end of reclaim to its correct
+	 * position in the tree.
+	 */
+	__mem_cgroup_remove_exceeded(mem);
+	if (!css_tryget(&mem->css))
+		goto retry;
+done:
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	return mem;
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -888,11 +943,16 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
  * If shrink==true, for avoiding to free too much, this returns immedieately.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
-				   gfp_t gfp_mask, bool noswap, bool shrink)
+						struct zonelist *zonelist,
+						gfp_t gfp_mask,
+						unsigned long flags)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
 	int loop = 0;
+	bool noswap = flags & MEM_CGROUP_RECLAIM_NOSWAP;
+	bool shrink = flags & MEM_CGROUP_RECLAIM_SHRINK;
+	bool check_soft = flags & MEM_CGROUP_RECLAIM_SOFT;
 
 	while (loop < 2) {
 		victim = mem_cgroup_select_victim(root_mem);
@@ -904,8 +964,9 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 			continue;
 		}
 		/* we use swappiness of local cgroup */
-		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
-						   get_swappiness(victim));
+		ret = try_to_free_mem_cgroup_pages(victim, zonelist, gfp_mask,
+							noswap,
+							get_swappiness(victim));
 		css_put(&victim->css);
 		/*
 		 * At shrinking usage, we can't check we should stop here or
@@ -915,7 +976,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		if (shrink)
 			return ret;
 		total += ret;
-		if (mem_cgroup_check_under_limit(root_mem))
+		if (check_soft) {
+			if (res_counter_soft_limit_excess(&root_mem->res))
+				return total;
+		} else if (mem_cgroup_check_under_limit(root_mem))
 			return 1 + total;
 	}
 	return total;
@@ -1022,7 +1086,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 
 	while (1) {
 		int ret;
-		bool noswap = false;
+		unsigned long flags = 0;
 
 		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
 						&soft_fail_res);
@@ -1035,7 +1099,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 				break;
 			/* mem+swap counter fails */
 			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			noswap = true;
+			flags = MEM_CGROUP_RECLAIM_NOSWAP;
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									memsw);
 		} else
@@ -1046,8 +1110,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
-		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
-							noswap, false);
+		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
+							gfp_mask, flags);
 		if (ret)
 			continue;
 
@@ -1692,8 +1756,8 @@ int mem_cgroup_shrink_usage(struct page *page,
 		return 0;
 
 	do {
-		progress = mem_cgroup_hierarchical_reclaim(mem,
-					gfp_mask, true, false);
+		progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
+					gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
 		progress += mem_cgroup_check_under_limit(mem);
 	} while (!progress && --retry);
 
@@ -1747,8 +1811,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
-						   false, true);
+		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
+						GFP_KERNEL,
+						MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -1796,7 +1861,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
+		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
+						MEM_CGROUP_RECLAIM_NOSWAP |
+						MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -1807,6 +1874,35 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zonelist,
+						gfp_t gfp_mask)
+{
+	unsigned long nr_reclaimed = 0;
+	struct mem_cgroup *mem;
+	unsigned long flags;
+
+	do {
+		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
+		if (!mem)
+			break;
+		nr_reclaimed += mem_cgroup_hierarchical_reclaim(mem, zonelist,
+						gfp_mask,
+						MEM_CGROUP_RECLAIM_SOFT);
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+		/*
+		 * We need to remove and reinsert the node in its correct
+		 * position
+		 */
+		__mem_cgroup_remove_exceeded(mem);
+		if (mem->usage_in_excess)
+			__mem_cgroup_insert_exceeded(mem);
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+		css_put(&mem->css);
+	} while (!nr_reclaimed);
+	return nr_reclaimed;
+}
+
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -1930,7 +2026,7 @@ try_to_free:
 			ret = -EINTR;
 			goto out;
 		}
-		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
+		progress = try_to_free_mem_cgroup_pages(mem, NULL, GFP_KERNEL,
 						false, get_swappiness(mem));
 		if (!progress) {
 			nr_retries--;
@@ -2535,6 +2631,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	mem->usage_in_excess = 0;
 	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
+	mem->on_tree = false;
+
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 83f2ca4..0c28596 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1714,6 +1714,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
+					   struct zonelist *zonelist,
 					   gfp_t gfp_mask,
 					   bool noswap,
 					   unsigned int swappiness)
@@ -1727,14 +1728,14 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
-	struct zonelist *zonelist;
 
 	if (noswap)
 		sc.may_unmap = 0;
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
-	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
+	if (!zonelist)
+		zonelist = NODE_DATA(numa_node_id())->node_zonelists;
 	return do_try_to_free_pages(zonelist, &sc);
 }
 #endif
@@ -2015,9 +2016,12 @@ static int kswapd(void *p)
 		finish_wait(&pgdat->kswapd_wait, &wait);
 
 		if (!try_to_freeze()) {
+			struct zonelist *zl = pgdat->node_zonelists;
 			/* We can speed up thawing tasks if we don't call
 			 * balance_pgdat after returning from the refrigerator
 			 */
+			if (!order)
+				mem_cgroup_soft_limit_reclaim(zl, GFP_KERNEL);
 			balance_pgdat(pgdat, order);
 		}
 	}

-- 
	Balbir

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-01  6:29 [PATCH 0/4] Memory controller soft limit patches (v3) Balbir Singh
                   ` (3 preceding siblings ...)
  2009-03-01  6:30 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v3) Balbir Singh
@ 2009-03-02  0:24 ` KAMEZAWA Hiroyuki
  2009-03-02  4:40   ` Balbir Singh
  4 siblings, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  0:24 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Sun, 01 Mar 2009 11:59:59 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v3...v2
> 1. Implemented several review comments from Kosaki-San and Kamezawa-San
>    Please see individual changelogs for changes
> 
> Changelog v2...v1
> 1. Soft limits now support hierarchies
> 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> 
> Here is v3 of the new soft limit implementation. Soft limits are a new feature
> for the memory resource controller; something similar has existed in the
> group scheduler in the form of shares, though the CPU controller's
> interpretation of shares is very different.
> 
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller, but not
> for the memory+swap controller. The implementation maintains an RB-Tree of
> groups that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
> 
> If there are no major objections to the patches, I would like to get them
> included in -mm.
> 
> TODOs
> 
> 1. The current implementation maintains the delta from the soft limit
>    and pushes back groups to their soft limits; a ratio of delta/soft_limit
>    might be more useful
> 2. It would be nice to have a more targeted reclaim interface (in terms of
>    pages to reclaim), so that groups are pushed back close to their soft
>    limits.
> 
> Tests
> -----
> 
> I've run two memory-intensive workloads with differing soft limits and
> seen that they are pushed back to their soft limits on contention. Their usage
> was their soft limit plus additional memory that they were able to grab
> on the system. Soft limits can take a while before we see the expected
> results.
> 
> Please review, comment.
> 
Please forgive me for saying... the code itself is getting better but is far
from what I want. Maybe I have to show my own implementation to convey my
idea; the answer is probably between yours and mine. If this were last year, I
would have enough time until the distro's target kernel and might welcome any
innovative patches even if they gave me concerns, but I have to be
conservative now.

At first, it's said "when cgroup people add something, the kernel gets slow".
This is my starting point for reviewing. Below are comments on this version of
the patch.

 1. I think it's bad to add more hooks to res_counter. It's already slow
    enough that we should give up adding more fancy things.

 2. please avoid adding hooks to the hot path. In your patch, the hook in
    mem_cgroup_uncharge_common() especially annoys me.

 3. please avoid using a global spinlock more.
    No lock is best; a mutex is better, maybe.

 4. The RB-tree seems broken. The following is an example (please note you do
    all ops in a lazy manner, once per HZ/4):

   i) while running, the tree is constructed as follows
 
             R           R=exceed=300M
            / \ 
           A   B      A=exceed=200M  B=exceed=400M
   ii) Process B exits, and its usage goes down.

   iii)      R          R=exceed=300M
            / \
           A   B      A=exceed=200M  B=exceed=10M

   iv) A new node is inserted
             R         R=exceed=300M
            / \       
           A   B       A=exceed=200M B=exceed=10M
              / \
             nil C     C=exceed=310M

   v) The timer expires, "R" is removed, and a rotation is done.

   Hmm? Is the above state allowed? I'm sorry if I misunderstand RB-trees.

I'll post my own version this week (a more conservative version, maybe).
Please discuss and compare trade-offs.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/4] Memory controller soft limit interface (v3)
  2009-03-01  6:30 ` [PATCH 2/4] Memory controller soft limit interface (v3) Balbir Singh
@ 2009-03-02  2:03   ` KAMEZAWA Hiroyuki
  2009-03-02  4:46     ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  2:03 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Sun, 01 Mar 2009 12:00:11 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v2...v1
> 1. Add support for res_counter_check_soft_limit_locked. This is used
>    by the hierarchy code.
> 
> Add an interface to allow get/set of soft limits. Soft limits for the memory
> plus swap controller (memsw) are currently not supported. Resource counters
> have been enhanced to support soft limits and a new type RES_SOFT_LIMIT has
> been added. Unlike hard limits, soft limits can be set directly and do not
> need any reclaim or checks before being set to a newer value.
> 
> Kamezawa-San raised a question as to whether soft limit should belong
> to res_counter. Since all resources understand the basic concepts of
> hard and soft limits, it is justified to add soft limits here. Soft limits
> are a generic resource usage feature, even file system quotas support
> soft limits.
> 
I'm not convinced that adding more logic to res_counter is a good thing to do, yet.


Thanks,
-Kame





^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v3)
  2009-03-01  6:30 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v3) Balbir Singh
@ 2009-03-02  3:08   ` KOSAKI Motohiro
  2009-03-02  4:44     ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KOSAKI Motohiro @ 2009-03-02  3:08 UTC (permalink / raw)
  To: Balbir Singh
  Cc: kosaki.motohiro, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

Hi Balbir,

> @@ -2015,9 +2016,12 @@ static int kswapd(void *p)
>  		finish_wait(&pgdat->kswapd_wait, &wait);
>  
>  		if (!try_to_freeze()) {
> +			struct zonelist *zl = pgdat->node_zonelists;
>  			/* We can speed up thawing tasks if we don't call
>  			 * balance_pgdat after returning from the refrigerator
>  			 */
> +			if (!order)
> +				mem_cgroup_soft_limit_reclaim(zl, GFP_KERNEL);
>  			balance_pgdat(pgdat, order);
>  		}
>  	}

kswapd's role is to increase free pages up to zone->pages_high in its "own
node". mem_cgroup_soft_limit_reclaim() frees one (or more) of the exceeding
pages in any node.

Oh, well.
I think that is not consistent.

If mem_cgroup_soft_limit_reclaim() were aware of the target node and its
pages_high, I'd be glad.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  0:24 ` [PATCH 0/4] Memory controller soft limit patches (v3) KAMEZAWA Hiroyuki
@ 2009-03-02  4:40   ` Balbir Singh
  2009-03-02  5:32     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-02  4:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 09:24:04]:

> On Sun, 01 Mar 2009 11:59:59 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v3...v2
> > 1. Implemented several review comments from Kosaki-San and Kamezawa-San
> >    Please see individual changelogs for changes
> > 
> > Changelog v2...v1
> > 1. Soft limits now support hierarchies
> > 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> > 
> > Here is v3 of the new soft limit implementation. Soft limits are a new feature
> > for the memory resource controller; something similar has existed in the
> > group scheduler in the form of shares, though the CPU controller's
> > interpretation of shares is very different.
> > 
> > Soft limits are the most useful feature to have for environments where
> > the administrator wants to overcommit the system, such that only on memory
> > contention do the limits become active. The current soft limits implementation
> > provides a soft_limit_in_bytes interface for the memory controller, but not
> > for the memory+swap controller. The implementation maintains an RB-Tree of
> > groups that exceed their soft limit and starts reclaiming from the group that
> > exceeds this limit by the maximum amount.
> > 
> > If there are no major objections to the patches, I would like to get them
> > included in -mm.
> > 
> > TODOs
> > 
> > 1. The current implementation maintains the delta from the soft limit
> >    and pushes back groups to their soft limits; a ratio of delta/soft_limit
> >    might be more useful
> > 2. It would be nice to have a more targeted reclaim interface (in terms of
> >    pages to reclaim), so that groups are pushed back close to their soft
> >    limits.
> > 
> > Tests
> > -----
> > 
> > I've run two memory-intensive workloads with differing soft limits and
> > seen that they are pushed back to their soft limits on contention. Their usage
> > was their soft limit plus additional memory that they were able to grab
> > on the system. Soft limits can take a while before we see the expected
> > results.
> > 
> > Please review, comment.
> > 
> Please forgive me for saying... the code itself is getting better but is far
> from what I want. Maybe I have to show my own implementation to convey my
> idea; the answer is probably between yours and mine. If this were last year,
> I would have enough time until the distro's target kernel and might welcome
> any innovative patches even if they gave me concerns, but I have to be
> conservative now.

I am not asking for an immediate push to mainline, but for integration
into -mm and more testing. Let me address your concerns below.

> 
> At first, it's said "when cgroup people add something, the kernel gets slow".
> This is my starting point for reviewing. Below are comments on this version
> of the patch.
> 
>  1. I think it's bad to add more hooks to res_counter. It's already slow
>     enough that we should give up adding more fancy things.

res_counters was designed to be extensible; why would adding anything to it
make it slow, unless we turn on soft limits?

> 
>  2. please avoid adding hooks to the hot path. In your patch, the hook in
>     mem_cgroup_uncharge_common() especially annoys me.

If soft limits are not enabled, the function does a small check and
leaves. 

> 
>  3. please avoid using a global spinlock more.
>     No lock is best; a mutex is better, maybe.
> 

No lock to update a tree which is updated concurrently?

>  4. The RB-tree seems broken. The following is an example (please note you
>     do all ops in a lazy manner, once per HZ/4):
> 
>    i) while running, the tree is constructed as follows
> 
>              R           R=exceed=300M
>             / \ 
>            A   B      A=exceed=200M  B=exceed=400M
>    ii) Process B exits, and its usage goes down.

That is why we have the hook in uncharge. Even if we update and the
usage goes down, the tree is ordered by usage_in_excess which is
updated only when the tree is updated. So what you show below does not
occur. I think I should document the design better.

> 
>    iii)      R          R=exceed=300M
>             / \
>            A   B      A=exceed=200M  B=exceed=10M
> 
>    iv) A new node is inserted
>              R         R=exceed=300M
>             / \       
>            A   B       A=exceed=200M B=exceed=10M
>               / \
>              nil C     C=exceed=310M
> 
>    v) The timer expires, "R" is removed, and a rotation is done.
> 
>    Hmm? Is the above state allowed? I'm sorry if I misunderstand RB-trees.
> 
> I'll post my own version this week (a more conservative version, maybe).
> Please discuss and compare trade-offs.
> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v3)
  2009-03-02  3:08   ` KOSAKI Motohiro
@ 2009-03-02  4:44     ` Balbir Singh
  2009-03-03  2:43       ` KOSAKI Motohiro
  0 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-02  4:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-02 12:08:01]:

> Hi Balbir,
> 
> > @@ -2015,9 +2016,12 @@ static int kswapd(void *p)
> >  		finish_wait(&pgdat->kswapd_wait, &wait);
> >  
> >  		if (!try_to_freeze()) {
> > +			struct zonelist *zl = pgdat->node_zonelists;
> >  			/* We can speed up thawing tasks if we don't call
> >  			 * balance_pgdat after returning from the refrigerator
> >  			 */
> > +			if (!order)
> > +				mem_cgroup_soft_limit_reclaim(zl, GFP_KERNEL);
> >  			balance_pgdat(pgdat, order);
> >  		}
> >  	}
> 
> kswapd's role is to increase free pages up to zone->pages_high in its "own
> node". mem_cgroup_soft_limit_reclaim() frees one (or more) of the exceeding
> pages in any node.
> 
> Oh, well.
> I think that is not consistent.
> 
> If mem_cgroup_soft_limit_reclaim() were aware of the target node and its
> pages_high, I'd be glad.
>

Yes, correct: the role of kswapd is to keep increasing free pages until
zone->pages_high, and the first set of pages to consider is from memory
controllers over their soft limits. We pass the zonelist to ensure that
while doing soft reclaim, we focus on the zonelist associated with the
node. Kamezawa had concerns over calling the soft limit reclaim from
__alloc_pages_internal(); would you prefer that call path?

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/4] Memory controller soft limit interface (v3)
  2009-03-02  2:03   ` KAMEZAWA Hiroyuki
@ 2009-03-02  4:46     ` Balbir Singh
  2009-03-02  5:35       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-02  4:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 11:03:23]:

> On Sun, 01 Mar 2009 12:00:11 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v2...v1
> > 1. Add support for res_counter_check_soft_limit_locked. This is used
> >    by the hierarchy code.
> > 
> > Add an interface to allow get/set of soft limits. Soft limits for the
> > memory plus swap controller (memsw) are currently not supported. Resource
> > counters have been enhanced to support soft limits and a new type
> > RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be set
> > directly and do not need any reclaim or checks before being set to a newer
> > value.
> > 
> > Kamezawa-San raised a question as to whether soft limit should belong
> > to res_counter. Since all resources understand the basic concepts of
> > hard and soft limits, it is justified to add soft limits here. Soft limits
> > are a generic resource usage feature, even file system quotas support
> > soft limits.
> > 
> I'm not convinced that adding more logic to res_counter is a good thing to do, yet.
>

Even though it is extensible and you pay the cost only when soft
limits are turned on? Can you show me why you are not convinced?
 
-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  4:40   ` Balbir Singh
@ 2009-03-02  5:32     ` KAMEZAWA Hiroyuki
  2009-03-02  6:05       ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  5:32 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 10:10:43 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 09:24:04]:
> 
> > On Sun, 01 Mar 2009 11:59:59 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > 
> > > From: Balbir Singh <balbir@linux.vnet.ibm.com>

> > 
> > At first, it's said "when cgroup people add something, the kernel gets slow".
> > This is my starting point for reviewing. Below are comments on this version
> > of the patch.
> > 
> >  1. I think it's bad to add more hooks to res_counter. It's already slow
> >     enough that we should give up adding more fancy things.
> 
> res_counters was designed to be extensible; why would adding anything to it
> make it slow, unless we turn on soft limits?
> 
You inserted new "if" logic in the core loop.
(What I want to say here is not that this is definitely bad, but to ask whether
there is an alternative with less overhead.)


> > 
> >  2. please avoid adding hooks to the hot path. In your patch, the hook in
> >     mem_cgroup_uncharge_common() especially annoys me.
> 
> If soft limits are not enabled, the function does a small check and
> leaves. 
> 
&soft_fail_res is always passed, even if memory.soft_limit==ULONG_MAX.
res_counter_soft_limit_excess() adds one more function call, a spinlock, and an irq-off.
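
For reference, a simplified sketch of the pattern in question (field names
assumed, not the exact patch code); every call takes the counter lock with
interrupts disabled just to compute the excess:
==
/* Sketch: per-call cost is lock + irq-off, even when there is no excess. */
static unsigned long long soft_limit_excess(struct res_counter *cnt)
{
        unsigned long flags;
        unsigned long long excess = 0;

        spin_lock_irqsave(&cnt->lock, flags);  /* irq-off on every call */
        if (cnt->usage > cnt->soft_limit)
                excess = cnt->usage - cnt->soft_limit;
        spin_unlock_irqrestore(&cnt->lock, flags);
        return excess;
}
==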

> > 
> >  3. please avoid using a global spinlock any more than necessary.
> >     No lock is best; a mutex is better, maybe.
> > 
> 
> No lock to update a tree which is updated concurrently?
> 
Using tree/sort itself is nonsense, I believe.


> >  4. The RB-tree seems broken. Here is an example. (please note you do all ops
> >     in a lazy manner, once per HZ/4.)
> > 
> >    i)  while running, the tree is constructed as follows
> > 
> >              R           R=exceed=300M
> >             / \ 
> >            A   B      A=exceed=200M  B=exceed=400M
> >    ii) Process B exits, and usage goes down.
> 
> That is why we have the hook in uncharge. Even if we update and the
> usage goes down, the tree is ordered by usage_in_excess which is
> updated only when the tree is updated. So what you show below does not
> occur. I think I should document the design better.
> 

time_check==true. So, update-tree at uncharge() only happens once in HZ/4
==
@@ -1422,6 +1520,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
+	mem_cgroup_check_and_update_tree(mem, true);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		css_put(&mem->css);
==
Then, a not-sorted RB-tree can result.

BTW,
   time_after(jiffies, 0)
is buggy (see the definition). If you want to make this always true, use
   time_after(jiffies, jiffies +1)

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/4] Memory controller soft limit interface (v3)
  2009-03-02  4:46     ` Balbir Singh
@ 2009-03-02  5:35       ` KAMEZAWA Hiroyuki
  2009-03-02  6:07         ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  5:35 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 10:16:31 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 11:03:23]:
> 
> > On Sun, 01 Mar 2009 12:00:11 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > 
> > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > 
> > > Changelog v2...v1
> > > 1. Add support for res_counter_check_soft_limit_locked. This is used
> > >    by the hierarchy code.
> > > 
> > > Add an interface to allow get/set of soft limits. Soft limits for memory plus
> > > swap controller (memsw) are currently not supported. Resource counters have
> > > been enhanced to support soft limits, and a new type RES_SOFT_LIMIT has been
> > > added. Unlike hard limits, soft limits can be directly set and do not
> > > need any reclaim or checks before being set to a new value.
> > > 
> > > Kamezawa-San raised a question as to whether soft limit should belong
> > > to res_counter. Since all resources understand the basic concepts of
> > > hard and soft limits, it is justified to add soft limits here. Soft limits
> > > are a generic resource usage feature, even file system quotas support
> > > soft limits.
> > > 
> > I'm not yet convinced that adding more logic to res_counter is a good thing to do.
> >
> 
> Even though it is extensible and you pay the cost only when soft
> limits are turned on? Can you show me why you are not convinced?
>  
Inserting more code (like "if") into res_counter itself is not welcome.
I think res_counter is already too complex as a counter.

I'm now searching a way to reduce res_counter->lock ping-pong but have no
good idea.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  5:32     ` KAMEZAWA Hiroyuki
@ 2009-03-02  6:05       ` Balbir Singh
  2009-03-02  6:18         ` KAMEZAWA Hiroyuki
  2009-03-02  6:21         ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-02  6:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 14:32:50]:

> On Mon, 2 Mar 2009 10:10:43 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 09:24:04]:
> > 
> > > On Sun, 01 Mar 2009 11:59:59 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > 
> > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> > > 
> > > At first, it's said "When cgroup people add something, the kernel gets slow".
> > > This is my starting point for reviewing. Below are comments on this version of the patch.
> > > 
> > >  1. I think it's bad to add more hooks to res_counter. It's already slow enough
> > >     that we should give up adding more fancy things.
> > 
> > res_counters was designed to be extensible; why would adding anything to
> > it make it slow, unless we turn on soft_limits?
> > 
> You inserted new "if" logic in the core loop.
> (What I want to say here is not that this is definitely bad, but to ask:
>  isn't there an alternative with less overhead?)
> 
> 
> > > 
> > >  2. please avoid adding hooks to the hot path. In your patch, the hook in
> > >     mem_cgroup_uncharge_common() especially annoys me.
> > 
> > If soft limits are not enabled, the function does a small check and
> > leaves. 
> > 
> &soft_fail_res is always passed, even if memory.soft_limit==ULONG_MAX.
> res_counter_soft_limit_excess() adds one more function call, a spinlock, and an irq-off.
>

OK, I see that overhead.. I'll figure out a way to work around it.
 
> > > 
> > >  3. please avoid using a global spinlock any more than necessary.
> > >     No lock is best; a mutex is better, maybe.
> > > 
> > 
> > No lock to update a tree which is updated concurrently?
> > 
> Using tree/sort itself is nonsense, I believe.
> 

I tried using prio trees in the past, but they are not easy to update
either. I wouldn't mind suggestions for a data structure that
scales well and allows quick insert/delete and search.

> 
> > >  4. The RB-tree seems broken. Here is an example. (please note you do all ops
> > >     in a lazy manner, once per HZ/4.)
> > > 
> > >    i)  while running, the tree is constructed as follows
> > > 
> > >              R           R=exceed=300M
> > >             / \ 
> > >            A   B      A=exceed=200M  B=exceed=400M
> > >    ii) Process B exits, and usage goes down.
> > 
> > That is why we have the hook in uncharge. Even if we update and the
> > usage goes down, the tree is ordered by usage_in_excess which is
> > updated only when the tree is updated. So what you show below does not
> > occur. I think I should document the design better.
> > 
> 
> time_check==true. So, update-tree at uncharge() only happens once in HZ/4


No.. you are missing the point

==
        if (updated_tree) {
                spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
                mem->last_tree_update = jiffies;
                mem->usage_in_excess = new_usage_in_excess;
                spin_unlock_irqrestore(&memcg_soft_limit_tree_lock,
flags);
        }
==

mem->usage_in_excess is the key for the RB-Tree and is updated only
when the tree is updated.
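
As a structural sketch of what I mean (field names assumed, not the patch
itself), insertion always compares the cached key, never the live usage:
==
/* Sketch: the tree is ordered by the cached usage_in_excess, which is
 * written only under the tree lock, so ordering cannot rot in between. */
static void sketch_insert_exceeded(struct mem_cgroup *mem, struct rb_root *root)
{
        struct rb_node **p = &root->rb_node;
        struct rb_node *parent = NULL;
        struct mem_cgroup *cur;

        while (*p) {
                parent = *p;
                cur = rb_entry(parent, struct mem_cgroup, tree_node);
                if (mem->usage_in_excess < cur->usage_in_excess)
                        p = &(*p)->rb_left;
                else
                        p = &(*p)->rb_right;
        }
        rb_link_node(&mem->tree_node, parent, p);
        rb_insert_color(&mem->tree_node, root);
}
==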

> ==
> @@ -1422,6 +1520,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	mz = page_cgroup_zoneinfo(pc);
>  	unlock_page_cgroup(pc);
> 
> +	mem_cgroup_check_and_update_tree(mem, true);
>  	/* at swapout, this memcg will be accessed to record to swap */
>  	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>  		css_put(&mem->css);
> ==
> Then, a not-sorted RB-tree can result.
> 
> BTW,
>    time_after(jiffies, 0)
> is buggy (see the definition). If you want to make this always true, use
>    time_after(jiffies, jiffies +1)
>

HZ/4 is 250/4 jiffies in the worst case (62). We have
time_after(jiffies, next_update_interval) and next_update_interval is
set to last_tree_update + 62. Not sure if I got what you are pointing
to.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/4] Memory controller soft limit interface (v3)
  2009-03-02  5:35       ` KAMEZAWA Hiroyuki
@ 2009-03-02  6:07         ` Balbir Singh
  2009-03-02  6:19           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-02  6:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 14:35:18]:

> On Mon, 2 Mar 2009 10:16:31 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 11:03:23]:
> > 
> > > On Sun, 01 Mar 2009 12:00:11 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > 
> > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > 
> > > > Changelog v2...v1
> > > > 1. Add support for res_counter_check_soft_limit_locked. This is used
> > > >    by the hierarchy code.
> > > > 
> > > > Add an interface to allow get/set of soft limits. Soft limits for memory plus
> > > > swap controller (memsw) are currently not supported. Resource counters have
> > > > been enhanced to support soft limits, and a new type RES_SOFT_LIMIT has been
> > > > added. Unlike hard limits, soft limits can be directly set and do not
> > > > need any reclaim or checks before being set to a new value.
> > > > 
> > > > Kamezawa-San raised a question as to whether soft limit should belong
> > > > to res_counter. Since all resources understand the basic concepts of
> > > > hard and soft limits, it is justified to add soft limits here. Soft limits
> > > > are a generic resource usage feature, even file system quotas support
> > > > soft limits.
> > > > 
> > > I'm not yet convinced that adding more logic to res_counter is a good thing to do.
> > >
> > 
> > Even though it is extensible and you pay the cost only when soft
> > limits are turned on? Can you show me why you are not convinced?
> >  
> Inserting more code (like "if") into res_counter itself is not welcome.
> I think res_counter is already too complex as a counter.
>

Darn.. we better stop all code development!
 
> I'm now searching a way to reduce res_counter->lock ping-pong but have no
> good idea.
> 

I've been doing some detailed analysis on the memory controller
overhead. I have a set of patches that I am testing right now. I'll
share the details once I am done with them.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  6:05       ` Balbir Singh
@ 2009-03-02  6:18         ` KAMEZAWA Hiroyuki
  2009-03-02 17:52           ` Balbir Singh
  2009-03-02  6:21         ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  6:18 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 11:35:19 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > Then, a not-sorted RB-tree can result.
> > 
> > BTW,
> >    time_after(jiffies, 0)
> > is buggy (see the definition). If you want to make this always true, use
> >    time_after(jiffies, jiffies +1)
> >
> 
> HZ/4 is 250/4 jiffies in the worst case (62). We have
> time_after(jiffies, next_update_interval) and next_update_interval is
> set to last_tree_update + 62. Not sure if I got what you are pointing
> to.
> 
+	unsigned long next_update = 0;
+	unsigned long flags;
+
+	if (!css_tryget(&mem->css))
+		return;
+	prev_usage_in_excess = mem->usage_in_excess;
+	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+
+	if (time_check)
+		next_update = mem->last_tree_update +
+				MEM_CGROUP_TREE_UPDATE_INTERVAL;
+	if (new_usage_in_excess && time_after(jiffies, next_update)) {
+		if (prev_usage_in_excess)
+			mem_cgroup_remove_exceeded(mem);
+		mem_cgroup_insert_exceeded(mem);
+		updated_tree = true;
+	} else if (prev_usage_in_excess && !new_usage_in_excess) {
+		mem_cgroup_remove_exceeded(mem);
+		updated_tree = true;
+	}

My point is what happens if time_check==false.
time_after(jiffies, 0) is buggy.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/4] Memory controller soft limit interface (v3)
  2009-03-02  6:07         ` Balbir Singh
@ 2009-03-02  6:19           ` KAMEZAWA Hiroyuki
  2009-03-02  6:29             ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  6:19 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 11:37:26 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 14:35:18]:
> 
> > On Mon, 2 Mar 2009 10:16:31 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 11:03:23]:
> > > 
> > > > On Sun, 01 Mar 2009 12:00:11 +0530
> > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > 
> > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > > 
> > > > > Changelog v2...v1
> > > > > 1. Add support for res_counter_check_soft_limit_locked. This is used
> > > > >    by the hierarchy code.
> > > > > 
> > > > > Add an interface to allow get/set of soft limits. Soft limits for memory plus
> > > > > swap controller (memsw) are currently not supported. Resource counters have
> > > > > been enhanced to support soft limits, and a new type RES_SOFT_LIMIT has been
> > > > > added. Unlike hard limits, soft limits can be directly set and do not
> > > > > need any reclaim or checks before being set to a new value.
> > > > > 
> > > > > Kamezawa-San raised a question as to whether soft limit should belong
> > > > > to res_counter. Since all resources understand the basic concepts of
> > > > > hard and soft limits, it is justified to add soft limits here. Soft limits
> > > > > are a generic resource usage feature, even file system quotas support
> > > > > soft limits.
> > > > > 
> > > > I'm not yet convinced that adding more logic to res_counter is a good thing to do.
> > > >
> > > 
> > > Even though it is extensible and you pay the cost only when soft
> > > limits are turned on? Can you show me why you are not convinced?
> > >  
> > Inserting more code (like "if") into res_counter itself is not welcome.
> > I think res_counter is already too complex as a counter.
> >
> 
> Darn.. we better stop all code development!
>  
I'm not saying such a thing. My point is we have to keep res_counter as lightweight
as possible. If there are alternatives, we should use them.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  6:05       ` Balbir Singh
  2009-03-02  6:18         ` KAMEZAWA Hiroyuki
@ 2009-03-02  6:21         ` KAMEZAWA Hiroyuki
  2009-03-02  6:36           ` Balbir Singh
  1 sibling, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  6:21 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 11:35:19 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 14:32:50]:
> 
> > On Mon, 2 Mar 2009 10:10:43 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 09:24:04]:
> > > 
> > > > On Sun, 01 Mar 2009 11:59:59 +0530
> > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > 
> > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > > > 
> > > > At first, it's said "When cgroup people add something, the kernel gets slow".
> > > > This is my starting point for reviewing. Below are comments on this version of the patch.
> > > > 
> > > >  1. I think it's bad to add more hooks to res_counter. It's already slow enough
> > > >     that we should give up adding more fancy things.
> > > 
> > > res_counters was designed to be extensible; why would adding anything to
> > > it make it slow, unless we turn on soft_limits?
> > > 
> > You inserted new "if" logic in the core loop.
> > (What I want to say here is not that this is definitely bad, but to ask:
> >  isn't there an alternative with less overhead?)
> > 
> > 
> > > > 
> > > >  2. please avoid adding hooks to the hot path. In your patch, the hook in
> > > >     mem_cgroup_uncharge_common() especially annoys me.
> > > 
> > > If soft limits are not enabled, the function does a small check and
> > > leaves. 
> > > 
> > &soft_fail_res is always passed, even if memory.soft_limit==ULONG_MAX.
> > res_counter_soft_limit_excess() adds one more function call, a spinlock, and an irq-off.
> >
> 
> OK, I see that overhead.. I'll figure out a way to work around it.
>  
> > > > 
> > > >  3. please avoid using a global spinlock any more than necessary.
> > > >     No lock is best; a mutex is better, maybe.
> > > > 
> > > 
> > > No lock to update a tree which is updated concurrently?
> > > 
> > Using tree/sort itself is nonsense, I believe.
> > 
> 
> I tried using prio trees in the past, but they are not easy to update
> either. I wouldn't mind suggestions for a data structure that
> scales well and allows quick insert/delete and search.
> 
Now, because the routine is called by kswapd() not by try_to_free.....

It's not necessary to be very very fast. That's my point.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/4] Memory controller soft limit interface (v3)
  2009-03-02  6:19           ` KAMEZAWA Hiroyuki
@ 2009-03-02  6:29             ` Balbir Singh
  0 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-02  6:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 15:19:53]:

> On Mon, 2 Mar 2009 11:37:26 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 14:35:18]:
> > 
> > > On Mon, 2 Mar 2009 10:16:31 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 11:03:23]:
> > > > 
> > > > > On Sun, 01 Mar 2009 12:00:11 +0530
> > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > 
> > > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > > > 
> > > > > > Changelog v2...v1
> > > > > > 1. Add support for res_counter_check_soft_limit_locked. This is used
> > > > > >    by the hierarchy code.
> > > > > > 
> > > > > > Add an interface to allow get/set of soft limits. Soft limits for memory plus
> > > > > > swap controller (memsw) are currently not supported. Resource counters have
> > > > > > been enhanced to support soft limits, and a new type RES_SOFT_LIMIT has been
> > > > > > added. Unlike hard limits, soft limits can be directly set and do not
> > > > > > need any reclaim or checks before being set to a new value.
> > > > > > 
> > > > > > Kamezawa-San raised a question as to whether soft limit should belong
> > > > > > to res_counter. Since all resources understand the basic concepts of
> > > > > > hard and soft limits, it is justified to add soft limits here. Soft limits
> > > > > > are a generic resource usage feature, even file system quotas support
> > > > > > soft limits.
> > > > > > 
> > > > > I'm not yet convinced that adding more logic to res_counter is a good thing to do.
> > > > >
> > > > 
> > > > Even though it is extensible and you pay the cost only when soft
> > > > limits are turned on? Can you show me why you are not convinced?
> > > >  
> > > Inserting more code (like "if") into res_counter itself is not welcome.
> > > I think res_counter is already too complex as a counter.
> > >
> > 
> > Darn.. we better stop all code development!
> >  
> I'm not saying such a thing. My point is we have to keep res_counter as lightweight
> as possible. If there are alternatives, we should use them.
>

Any sort of new feature like this needs support from res_counters, we
need to extend them to remain consistent with our design and code.
Yes, if there are better alternatives, I would use them. BTW, I am
working on a newer scheme to change res_counter locking, but not sure
if that should get in the way of this development.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  6:21         ` KAMEZAWA Hiroyuki
@ 2009-03-02  6:36           ` Balbir Singh
  2009-03-02  7:06             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-02  6:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 15:21:28]:

> On Mon, 2 Mar 2009 11:35:19 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 14:32:50]:
> > 
> > > On Mon, 2 Mar 2009 10:10:43 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 09:24:04]:
> > > > 
> > > > > On Sun, 01 Mar 2009 11:59:59 +0530
> > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > 
> > > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > 
> > > > > 
> > > > > At first, it's said "When cgroup people add something, the kernel gets slow".
> > > > > This is my starting point for reviewing. Below are comments on this version of the patch.
> > > > > 
> > > > >  1. I think it's bad to add more hooks to res_counter. It's already slow enough
> > > > >     that we should give up adding more fancy things.
> > > > 
> > > > res_counters was designed to be extensible; why would adding anything to
> > > > it make it slow, unless we turn on soft_limits?
> > > > 
> > > You inserted new "if" logic in the core loop.
> > > (What I want to say here is not that this is definitely bad, but to ask:
> > >  isn't there an alternative with less overhead?)
> > > 
> > > 
> > > > > 
> > > > >  2. please avoid adding hooks to the hot path. In your patch, the hook in
> > > > >     mem_cgroup_uncharge_common() especially annoys me.
> > > > 
> > > > If soft limits are not enabled, the function does a small check and
> > > > leaves. 
> > > > 
> > > &soft_fail_res is always passed, even if memory.soft_limit==ULONG_MAX.
> > > res_counter_soft_limit_excess() adds one more function call, a spinlock, and an irq-off.
> > >
> > 
> > OK, I see that overhead.. I'll figure out a way to work around it.
> >  
> > > > > 
> > > > >  3. please avoid using a global spinlock any more than necessary.
> > > > >     No lock is best; a mutex is better, maybe.
> > > > > 
> > > > 
> > > > No lock to update a tree which is updated concurrently?
> > > > 
> > > Using tree/sort itself is nonsense, I believe.
> > > 
> > 
> > I tried using prio trees in the past, but they are not easy to update
> > either. I wouldn't mind suggestions for a data structure that
> > scales well and allows quick insert/delete and search.
> > 
> Now, because the routine is called by kswapd() not by try_to_free.....
> 
> It's not necessary to be very very fast. That's my point.
>

OK, I get your point, but why does that make the RB-Tree data structure nonsense?
 
-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  6:36           ` Balbir Singh
@ 2009-03-02  7:06             ` KAMEZAWA Hiroyuki
  2009-03-02  7:17               ` KAMEZAWA Hiroyuki
  2009-03-02 12:42               ` Balbir Singh
  0 siblings, 2 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  7:06 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 12:06:49 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 15:21:28]:
> 
> > On Mon, 2 Mar 2009 11:35:19 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 14:32:50]:
> > > 
> > > > On Mon, 2 Mar 2009 10:10:43 +0530
> > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 09:24:04]:
> > > > > 
> > > > > > On Sun, 01 Mar 2009 11:59:59 +0530
> > > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > > > 
> > > > > > > 
> > > > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > 
> > > > > > 
> > > > > > At first, it's said "When cgroup people add something, the kernel gets slow".
> > > > > > This is my starting point for reviewing. Below are comments on this version of the patch.
> > > > > > 
> > > > > >  1. I think it's bad to add more hooks to res_counter. It's already slow enough
> > > > > >     that we should give up adding more fancy things.
> > > > > 
> > > > > res_counters was designed to be extensible; why would adding anything to
> > > > > it make it slow, unless we turn on soft_limits?
> > > > > 
> > > > You inserted new "if" logic in the core loop.
> > > > (What I want to say here is not that this is definitely bad, but to ask:
> > > >  isn't there an alternative with less overhead?)
> > > > 
> > > > 
> > > > > > 
> > > > > >  2. please avoid adding hooks to the hot path. In your patch, the hook in
> > > > > >     mem_cgroup_uncharge_common() especially annoys me.
> > > > > 
> > > > > If soft limits are not enabled, the function does a small check and
> > > > > leaves. 
> > > > > 
> > > > &soft_fail_res is always passed, even if memory.soft_limit==ULONG_MAX.
> > > > res_counter_soft_limit_excess() adds one more function call, a spinlock, and an irq-off.
> > > >
> > > 
> > > OK, I see that overhead.. I'll figure out a way to work around it.
> > >  
> > > > > > 
> > > > > >  3. please avoid using a global spinlock any more than necessary.
> > > > > >     No lock is best; a mutex is better, maybe.
> > > > > > 
> > > > > 
> > > > > No lock to update a tree which is updated concurrently?
> > > > > 
> > > > Using tree/sort itself is nonsense, I believe.
> > > > 
> > > 
> > > I tried using prio trees in the past, but they are not easy to update
> > > either. I wouldn't mind suggestions for a data structure that
> > > scales well and allows quick insert/delete and search.
> > > 
> > Now, because the routine is called by kswapd() not by try_to_free.....
> > 
> > It's not necessary to be very very fast. That's my point.
> >
> 
> OK, I get your point, but why does that make the RB-Tree data structure nonsense?
>  

 1. Until memory shortage, the rb-tree is kept updated and the user (the kernel)
    has to pay its maintenance/check cost, which is unnecessary.
    Considering the trade-off, paying the cost only when memory shortage happens
    tends to be the reasonable way.

 2. Current "exceed" just shows "How much we got over my soft limit" but doesn't
    tell any information per-node/zone. Considering this, this rb-tree
    information will not be able to help kswapd (on NUMA).
    But maintain per-node information uses too much resource.

 Considering above 2, it's not bad to find victim by proper logic
 from balance_pgdat() by using mem_cgroup_select_victim().
 like this:
==
 struct mem_cgroup *select_victim_at_soft_limit_via_balance_pgdat(int nid, int zid)
 {
     while (?) {
        victim = mem_cgroup_select_victim(init_mem_cgroup);  #need some modification.
        if (victim is not over soft-limit)
             continue;
        /* Ok this is candidate */
        usage = mem_cgroup_nid_zid_usage(mem, nid, zid); #get sum of active/inactive
        if (usage_is_enough_big)
              return victim;
     }
 }
 balance_pgdat()
 ...... find target zone....
 ...
 mem = select_victim_at_soft_limit_via_balance_pgdat(nid, zid)
 if (mem)
   sc->mem = mem;
 shrink_zone();
 if (mem) {
   sc->mem = NULL;
   css_put(&mem->css);
 }
==

 We have to pay a scan cost, but it will not be too big (if there are not thousands of memcgs).
 Under the above, round-robin rotation is used rather than a sort.
 Maybe I can show you sample.....(but I'm a bit busy.)
 
Thanks,
-Kame


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  7:06             ` KAMEZAWA Hiroyuki
@ 2009-03-02  7:17               ` KAMEZAWA Hiroyuki
  2009-03-02 12:42               ` Balbir Singh
  1 sibling, 0 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02  7:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 16:06:02 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>  Considering above 2, it's not bad to find victim by proper logic
>  from balance_pgdat() by using mem_cgroup_select_victim().
>  like this:
> ==
>  struct mem_cgroup *select_victim_at_soft_limit_via_balance_pgdat(int nid, int zid)
>  {
>      while (?) {
>         victim = mem_cgroup_select_victim(init_mem_cgroup);  #need some modification.

Another choice is that when a user sets a soft limit via the interface, we add it to
our own private list and scan that here. (of course, removing it at rmdir.)
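
A rough sketch of that private-list idea (hypothetical names, untested;
assumes a list_head member added to struct mem_cgroup):
==
static LIST_HEAD(soft_limited_memcgs);
static DEFINE_SPINLOCK(soft_limited_lock);

/* called when the user writes a soft limit */
static void memcg_soft_limit_register(struct mem_cgroup *mem)
{
        spin_lock(&soft_limited_lock);
        if (list_empty(&mem->soft_limit_node))
                list_add_tail(&mem->soft_limit_node, &soft_limited_memcgs);
        spin_unlock(&soft_limited_lock);
}

/* called at rmdir/free time */
static void memcg_soft_limit_unregister(struct mem_cgroup *mem)
{
        spin_lock(&soft_limited_lock);
        list_del_init(&mem->soft_limit_node);
        spin_unlock(&soft_limited_lock);
}
==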

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  7:06             ` KAMEZAWA Hiroyuki
  2009-03-02  7:17               ` KAMEZAWA Hiroyuki
@ 2009-03-02 12:42               ` Balbir Singh
  2009-03-02 14:04                 ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-02 12:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 16:06:02]:

> On Mon, 2 Mar 2009 12:06:49 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 15:21:28]:
> > 
> > > On Mon, 2 Mar 2009 11:35:19 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 14:32:50]:
> > > > 
> > > > > On Mon, 2 Mar 2009 10:10:43 +0530
> > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 09:24:04]:
> > > > > > 
> > > > > > > On Sun, 01 Mar 2009 11:59:59 +0530
> > > > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > > > > 
> > > > > > > > 
> > > > > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > > 
> > > > > > > 
> > > > > > > At first, it's said "When cgroup people add something, the kernel gets slow".
> > > > > > > This is my starting point for reviewing. Below are comments on this version of the patch.
> > > > > > > 
> > > > > > >  1. I think it's bad to add more hooks to res_counter. It's already slow enough
> > > > > > >     that we should give up adding more fancy things.
> > > > > > 
> > > > > > res_counters was designed to be extensible; why would adding anything to
> > > > > > it make it slow, unless we turn on soft_limits?
> > > > > > 
> > > > > You inserted new "if" logic in the core loop.
> > > > > (What I want to say here is not that this is definitely bad, but to ask:
> > > > >  isn't there an alternative with less overhead?)
> > > > > 
> > > > > 
> > > > > > > 
> > > > > > >  2. please avoid adding hooks to the hot path. In your patch, the hook in
> > > > > > >     mem_cgroup_uncharge_common() especially annoys me.
> > > > > > 
> > > > > > If soft limits are not enabled, the function does a small check and
> > > > > > leaves. 
> > > > > > 
> > > > > &soft_fail_res is always passed, even if memory.soft_limit==ULONG_MAX.
> > > > > res_counter_soft_limit_excess() adds one more function call, a spinlock, and an irq-off.
> > > > >
> > > > 
> > > > OK, I see that overhead.. I'll figure out a way to work around it.
> > > >  
> > > > > > > 
> > > > > > >  3. please avoid using a global spinlock any more than necessary.
> > > > > > >     No lock is best; a mutex is better, maybe.
> > > > > > > 
> > > > > > 
> > > > > > No lock to update a tree which is updated concurrently?
> > > > > > 
> > > > > Using tree/sort itself is nonsense, I believe.
> > > > > 
> > > > 
> > > > I tried using prio trees in the past, but they are not easy to update
> > > > either. I wouldn't mind suggestions for a data structure that
> > > > scales well and allows quick insert/delete and search.
> > > > 
> > > Now, because the routine is called by kswapd() not by try_to_free.....
> > > 
> > > It's not necessary to be very very fast. That's my point.
> > >
> > 
> > OK, I get your point, but why does that make the RB-Tree data structure nonsense?
> >  
> 
>  1. Until memory shortage, the rb-tree is kept updated and the user (the kernel)
>     has to pay its maintenance/check cost, which is unnecessary.
>     Considering the trade-off, paying the cost only when memory shortage happens
>     tends to be the reasonable way.
As you've seen in the code, the cost is only at an interval HZ/2
currently. The other overhead is the calculation of excess, I can try
and see if we can get rid of it.

> 
>  2. Current "exceed" just shows "How much we got over my soft limit" but doesn't
>     tell any information per-node/zone. Considering this, this rb-tree
>     information will not be able to help kswapd (on NUMA).
>     But maintain per-node information uses too much resource.

Yes, kswapd is per-node and we try to free all pages belonging to a
zonelist as specified by pgdat->node_zonelists for the memory control
groups that are over their soft limit. Keeping this information per
node makes no sense (exceeds information).

> 
>  Considering above 2, it's not bad to find victim by proper logic
>  from balance_pgdat() by using mem_cgroup_select_victim().
>  like this:
> ==
>  struct mem_cgroup *select_victim_at_soft_limit_via_balance_pgdat(int nid, int zid)
>  {
>      while (?) {
>         victim = mem_cgroup_select_victim(init_mem_cgroup);  #need some modification.
>         if (victim is not over soft-limit)
>              continue;
>         /* Ok this is candidate */
>         usage = mem_cgroup_nid_zid_usage(mem, nid, zid); #get sum of active/inactive
>         if (usage_is_enough_big)
>               return victim;

We currently track overall usage, so we split into per nid, zid
information and use that? Is that your suggestion? The soft limit is
also an aggregate limit, how do we define usage_is_big_enough or
usage_is_enough_big? Through some heuristics?

>      }
>  }
>  balance_pgdat()
>  ...... find target zone....
>  ...
>  mem = select_victim_at_soft_limit_via_balance_pgdat(nid, zid)
>  if (mem)
>    sc->mem = mem;
>  shrink_zone();
>  if (mem) {
>    sc->mem = NULL;
>    css_put(&mem->css);
>  }
> ==
> 
>  We have to pay a scan cost, but it will not be too big (if there are not thousands of memcgs).
>  Under the above, round-robin rotation is used rather than a sort.

Yes, we sort, but not at every page fault; only at a
specified interval.

>  Maybe I can show you sample.....(but I'm a bit busy.)
>

Explanation and review are good, but I don't see how not sorting will
help. I need something that can help me point to the culprits quickly
enough during soft limit reclaim, and the RB-Tree works very well for me.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02 12:42               ` Balbir Singh
@ 2009-03-02 14:04                 ` KAMEZAWA Hiroyuki
  2009-03-02 17:41                   ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02 14:04 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton

Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02
> 16:06:02]:
>
>> On Mon, 2 Mar 2009 12:06:49 +0530
>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> > OK, I get your point, but why does that make the RB-Tree data structure
>> nonsense?
>> >
>>
>>  1. Until memory shortage, the rb-tree is kept updated and the
>> user (the kernel)
>>     has to pay its maintenance/check cost, which is unnecessary.
>>     Considering the trade-off, paying the cost only when memory shortage
>> happens
>>     tends to be the reasonable way.
> As you've seen in the code, the cost is only at an interval HZ/2
> currently. The other overhead is the calculation of excess, I can try
> and see if we can get rid of it.
>
>>
>>  2. Current "exceed" just shows "How much we got over my soft limit" but
>> doesn't
>>     tell any information per-node/zone. Considering this, this rb-tree
>>     information will not be able to help kswapd (on NUMA).
>>     But maintain per-node information uses too much resource.
>
> Yes, kswapd is per-node and we try to free all pages belonging to a
> zonelist as specified by pgdat->node_zonelists for the memory control
> groups that are over their soft limit. Keeping this information per
> node makes no sense (exceeds information).
>
>>
>>  Considering above 2, it's not bad to find victim by proper logic
>>  from balance_pgdat() by using mem_cgroup_select_victim().
>>  like this:
>> ==
>>  struct mem_cgroup *select_victim_at_soft_limit_via_balance_pgdat(int
>> nid, int zid)
>>  {
>>      while (?) {
>>         victim = mem_cgroup_select_victim(init_mem_cgroup);  #need some
>> modification.
>>         if (victim is not over soft-limit)
>>              continue;
>>         /* Ok this is candidate */
>>         usage = mem_cgroup_nid_zid_usage(mem, nid, zid); #get sum of
>> active/inactive
>>         if (usage_is_enough_big)
>>               return victim;
>
> We currently track overall usage, so we split into per nid, zid
> information and use that? Is that your suggestion?

My suggestion is that current per-zone statistics interface of memcg
already holds all necessary information. And aggregate usage information
is not worth tracking because it's of no help to kswapd.

>  The soft limit is
> also an aggregate limit, how do we define usage_is_big_enough or
> usage_is_enough_big? Through some heuristics?
>
I think that if memcg/zone's page usage is not 0, it's enough big.
(and round-robin rotation, as in hierarchical reclaim, can be used.)

There may be some threshold to try.

For example)
   need_to_reclaim = zone->high - zone->free.
   if (usage_in_this_zone_of_memcg > need_to_reclaim/4)
         select this.

Maybe we can adjust that later.
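
As a more concrete sketch of that heuristic (helper and field names assumed;
the /4 divisor is arbitrary and would need tuning):
==
/* Sketch: select a memcg only if it holds a meaningful share of the deficit. */
static bool memcg_worth_selecting(struct mem_cgroup *mem, struct zone *zone,
                                  int nid, int zid)
{
        unsigned long need_to_reclaim, usage;

        need_to_reclaim = zone->pages_high - zone_page_state(zone, NR_FREE_PAGES);
        usage = mem_cgroup_nid_zid_usage(mem, nid, zid);

        return usage > need_to_reclaim / 4;
}
==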

>>      }
>>  }
>>  balance_pgdat()
>>  ...... find target zone....
>>  ...
>>  mem = select_victim_at_soft_limit_via_balance_pgdat(nid, zid)
>>  if (mem)
>>    sc->mem = mem;
>>  shrink_zone();
>>  if (mem) {
>>    sc->mem = NULL;
>>    css_put(&mem->css);
>>  }
>> ==
>>
>>  We have to pay a scan cost, but it will not be too big (if there are not
>> thousands of memcgs).
>>  Under the above, round-robin rotation is used rather than a sort.
>
> Yes, we sort, but not at every page fault; only at a
> specified interval.
>
>>  Maybe I can show you sample.....(but I'm a bit busy.)
>>
>
> Explanation and review are good, but I don't see how not sorting will
> help. I need something that can help me point to the culprits quickly
> enough during soft limit reclaim, and the RB-Tree works very well for me.
>

I don't think "tracking memcgs which exceed their soft limit" is worth
doing in a synchronous way. It can be done in a lazy way, when necessary,
with simpler logic.

BTW, did you test setting the soft limit to zero followed by rmdir()?
At a quick review, the memcg will never be removed from the RB tree, because
force_empty moves accounting from children to parent, but there are no tree ops there.
Please see mem_cgroup_move_account().

I'm sorry if I missed something.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02 14:04                 ` KAMEZAWA Hiroyuki
@ 2009-03-02 17:41                   ` Balbir Singh
  2009-03-02 23:59                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-02 17:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 23:04:34]:

> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02
> > 16:06:02]:
> >
> >> On Mon, 2 Mar 2009 12:06:49 +0530
> >> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> > OK, I get your point, but why does that make the RB-Tree data structure
> >> nonsense?
> >> >
> >>
> >>  1. Until memory shortage, the rb-tree is kept updated and the
> >> user (the kernel)
> >>     has to pay its maintenance/check cost, which is unnecessary.
> >>     Considering the trade-off, paying the cost only when memory shortage
> >> happens
> >>     tends to be the reasonable way.
> > As you've seen in the code, the cost is only at an interval HZ/2
> > currently. The other overhead is the calculation of excess, I can try
> > and see if we can get rid of it.
> >
> >>
> >>  2. Current "exceed" just shows "How much we got over my soft limit" but
> >> doesn't
> >>     tell any information per-node/zone. Considering this, this rb-tree
> >>     information will not be able to help kswapd (on NUMA).
> >>     But maintain per-node information uses too much resource.
> >
> > Yes, kswapd is per-node and we try to free all pages belonging to a
> > zonelist as specified by pgdat->node_zonelists for the memory control
> > groups that are over their soft limit. Keeping this information per
> > node makes no sense (exceeds information).
> >
> >>
> >>  Considering above 2, it's not bad to find victim by proper logic
> >>  from balance_pgdat() by using mem_cgroup_select_victim().
> >>  like this:
> >> ==
> >>  struct mem_cgroup *select_victim_at_soft_limit_via_balance_pgdat(int
> >> nid, int zid)
> >>  {
> >>      while (?) {
> >>         victim = mem_cgroup_select_victim(init_mem_cgroup);  #need some
> >> modification.
> >>         if (victim is not over soft-limit)
> >>              continue;
> >>         /* Ok this is candidate */
> >>         usage = mem_cgroup_nid_zid_usage(mem, nid, zid); #get sum of
> >> active/inactive
> >>         if (usage_is_enough_big)
> >>               return victim;
> >
> > We currently track overall usage, so we split into per nid, zid
> > information and use that? Is that your suggestion?
> 
> My suggestion is that current per-zone statistics interface of memcg
> already holds all necessary information. And aggregate usage information
> is not worth tracking because it's of no help to kswapd.
>

We have that data, but we need aggregate data to see who exceeded the
limit.
 
> >  The soft limit is
> > also an aggregate limit, how do we define usage_is_big_enough or
> > usage_is_enough_big? Through some heuristics?
> >
> I think that if memcg/zone's page usage is not 0, it's enough big.
> (and round-robin rotation, as in hierarchical reclaim, can be used.)
> 
> There may be some threshold to try.
> 
> For example)
>    need_to_reclaim = zone->high - zone->free.
>    if (usage_in_this_zone_of_memcg > need_to_reclaim/4)
>          select this.
> 
> Maybe we can adjust that later.
> 

No... this looks broken by design. Even if the administrator sets a
large enough limit and no soft limits, the cgroup gets reclaimed from?


> >>      }
> >>  }
> >>  balance_pgdat()
> >>  ...... find target zone....
> >>  ...
> >>  mem = select_victim_at_soft_limit_via_balance_pgdat(nid, zid)
> >>  if (mem)
> >>    sc->mem = mem;
> >>  shrink_zone();
> >>  if (mem) {
> >>    sc->mem = NULL;
> >>    css_put(&mem->css);
> >>  }
> >> ==
> >>
> >>  We have to pay a scan cost, but it will not be too big (if there are not
> >> thousands of memcgs).
> >>  Under the above, round-robin rotation is used rather than a sort.
> >
> > Yes, we sort, but not at every page fault; only at a
> > specified interval.
> >
> >>  Maybe I can show you sample.....(but I'm a bit busy.)
> >>
> >
> > Explanation and review are good, but I don't see how not sorting will
> > help. I need something that can help me point to the culprits quickly
> > enough during soft limit reclaim, and the RB-Tree works very well for me.
> >
> 
> I don't think "tracking memcgs which exceed their soft limit" is worth
> doing in a synchronous way. It can be done in a lazy way, when necessary,
> with simpler logic.
>

The synchronous way can be harmful if we do it on every page fault. The
current logic is quite simple.... no?
 
> BTW, did you test setting the soft limit to zero followed by rmdir()?
> At a quick review, the memcg will never be removed from the RB tree, because
> force_empty moves accounting from children to parent, but there are no tree ops there.
> Please see mem_cgroup_move_account().
> 

__mem_cgroup_free() has tree ops, shouldn't that catch this scenario?

> I'm sorry if I missed something.
>

Thanks for the review and discussion. 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02  6:18         ` KAMEZAWA Hiroyuki
@ 2009-03-02 17:52           ` Balbir Singh
  2009-03-03  0:03             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-02 17:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 15:18:30]:

> On Mon, 2 Mar 2009 11:35:19 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > > Then, a not-sorted RB-tree can result.
> > > 
> > > BTW,
> > >    time_after(jiffies, 0)
> > > is buggy (see the definition). If you want to make this always true, use
> > >    time_after(jiffies, jiffies +1)
> > >
> > 
> > HZ/4 is 250/4 jiffies in the worst case (62). We have
> > time_after(jiffies, next_update_interval) and next_update_interval is
> > set to last_tree_update + 62. Not sure if I got what you are pointing
> > to.
> > 
> +	unsigned long next_update = 0;
> +	unsigned long flags;
> +
> +	if (!css_tryget(&mem->css))
> +		return;
> +	prev_usage_in_excess = mem->usage_in_excess;
> +	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> +
> +	if (time_check)
> +		next_update = mem->last_tree_update +
> +				MEM_CGROUP_TREE_UPDATE_INTERVAL;
> +	if (new_usage_in_excess && time_after(jiffies, next_update)) {
> +		if (prev_usage_in_excess)
> +			mem_cgroup_remove_exceeded(mem);
> +		mem_cgroup_insert_exceeded(mem);
> +		updated_tree = true;
> +	} else if (prev_usage_in_excess && !new_usage_in_excess) {
> +		mem_cgroup_remove_exceeded(mem);
> +		updated_tree = true;
> +	}
> 
> My point is what happens if time_check==false.
> time_after(jiffies, 0) is buggy.
>

I see your point now, but the idea behind doing so is that
time_after(jiffies, 0) will always return false, which forces the
prev_usage_in_excess and !new_usage_in_excess check to execute. We set
the value to false only from __mem_cgroup_free().

Are you suggesting that calling time_after(jiffies, 0) is buggy?
The comment

  Do this with "<0" and ">=0" to only test the sign of the result. A
 
I think refers to the comparison check and not to the parameters. I
hope I am reading this right.
-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02 17:41                   ` Balbir Singh
@ 2009-03-02 23:59                     ` KAMEZAWA Hiroyuki
  2009-03-03 11:12                       ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02 23:59 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 23:11:56 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 23:04:34]:
> 
> > Balbir Singh wrote:
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02
> > > 16:06:02]:
> > >
> > >> On Mon, 2 Mar 2009 12:06:49 +0530
> > >> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >> > OK, I get your point, but why does that make the RB-Tree data structure
> > >> nonsense?
> > >> >
> > >>
> > >>  1. Until memory shortage, the rb-tree is kept updated and the
> > >> user (the kernel)
> > >>     has to pay its maintenance/check cost, which is unnecessary.
> > >>     Considering the trade-off, paying the cost only when memory shortage
> > >> happens
> > >>     tends to be the reasonable way.
> > > As you've seen in the code, the cost is only at an interval HZ/2
> > > currently. The other overhead is the calculation of excess, I can try
> > > and see if we can get rid of it.
> > >
> > >>
> > >>  2. Current "exceed" just shows "How much we got over my soft limit" but
> > >> doesn't
> > >>     tell any information per-node/zone. Considering this, this rb-tree
> > >>     information will not be able to help kswapd (on NUMA).
> > >>     But maintain per-node information uses too much resource.
> > >
> > > Yes, kswapd is per-node and we try to free all pages belonging to a
> > > zonelist as specified by pgdat->node_zonelists for the memory control
> > > groups that are over their soft limit. Keeping this information per
> > > node makes no sense (exceeds information).
> > >
> > >>
> > >>  Considering above 2, it's not bad to find victim by proper logic
> > >>  from balance_pgdat() by using mem_cgroup_select_victim().
> > >>  like this:
> > >> ==
> > >>  struct mem_cgroup *select_victim_at_soft_limit_via_balance_pgdat(int
> > >> nid, int zid)
> > >>  {
> > >>      while (?) {
> > >>         victim = mem_cgroup_select_victim(init_mem_cgroup);  #need some
> > >> modification.
> > >>         if (victim is not over soft-limit)
> > >>              continue;
> > >>         /* Ok this is candidate */
> > >>         usage = mem_cgroup_nid_zid_usage(mem, nid, zid); #get sum of
> > >> active/inactive
> > >>         if (usage_is_enough_big)
> > >>               return victim;
> > >
> > > We currently track overall usage, so we split into per nid, zid
> > > information and use that? Is that your suggestion?
> > 
> > My suggestion is that current per-zone statistics interface of memcg
> > already holds all necessary information. And aggregate usage information
> > is not worth tracking because it's of no help to kswapd.
> >
> 
> We have that data, but we need aggregate data to see who exceeded the
> limit.
>  
Aggregate data is in res_counter, already.



> > >  The soft limit is
> > > also an aggregate limit, how do we define usage_is_big_enough or
> > > usage_is_enough_big? Through some heuristics?
> > >
> > I think that if memcg/zone's page usage is not 0, it's enough big.
> > (and round-robin rotation, as in hierarchical reclaim, can be used.)
> > 
> > There may be some threshold to try.
> > 
> > For example)
> >    need_to_reclaim = zone->high - zone->free.
> >    if (usage_in_this_zone_of_memcg > need_to_reclaim/4)
> >          select this.
> > 
> > Maybe we can adjust that later.
> > 
> 
> No... this looks broken by design. Even if the administrator sets a
> large enough limit and no soft limits, the cgroup gets reclaimed from?
> 
I wrote
==
 if (victim is not over soft-limit)
==
....Maybe this discussion style is bad and I should explain my approach in a patch.
(I can't write code today, sorry.)


> 
> > >>      }
> > >>  }
> > >>  balance_pgdat()
> > >>  ...... find target zone....
> > >>  ...
> > >>  mem = select_victim_at_soft_limit_via_balance_pgdat(nid, zid)
> > >>  if (mem)
> > >>    sc->mem = mem;
> > >>  shrink_zone();
> > >>  if (mem) {
> > >>    sc->mem = NULL;
> > >>    css_put(&mem->css);
> > >>  }
> > >> ==
> > >>
> > >>  We have to pay a scan cost, but it will not be too big (if there are not
> > >> thousands of memcgs).
> > >>  Under the above, round-robin rotation is used rather than a sort.
> > >
> > > Yes, we sort, but not at every page fault; only at a
> > > specified interval.
> > >
> > >>  Maybe I can show you sample.....(but I'm a bit busy.)
> > >>
> > >
> > > Explanation and review are good, but I don't see how not sorting will
> > > help. I need something that can help me point to the culprits quickly
> > > enough during soft limit reclaim, and the RB-Tree works very well for me.
> > >
> > 
> > I don't think "tracking memcgs which exceed their soft limit" is worth
> > doing in a synchronous way. It can be done in a lazy way, when necessary,
> > with simpler logic.
> >
> 
> The synchronous way can be harmful if we do it on every page fault. The
> current logic is quite simple.... no?
From my point of view, no.

For example, I would never be able to explain why HZ/4 is best, or
why we have to maintain the tree while there is no memory shortage.

IMHO, under a well-controlled system with cgroups, problematic applications
and very heavy file cache users are under limits. Memory shortage can be
a rare event after all.

But, on NUMA, because memcg just checks "usage" and doesn't check
"usage-per-node", there can be memory shortage and this kind of soft-limit
sounds attractive for me.

>  
> > BTW, did you test setting the soft limit to zero followed by rmdir()?
> > At a quick review, the memcg will never be removed from the RB tree, because
> > force_empty moves accounting from children to parent, but there are no tree ops there.
> > Please see mem_cgroup_move_account().
> > 
> 
> __mem_cgroup_free() has tree ops, shouldn't that catch this scenario?
> 
Ok, I missed that. Thank you for clarification.

Regards.
-Kame


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02 17:52           ` Balbir Singh
@ 2009-03-03  0:03             ` KAMEZAWA Hiroyuki
  2009-03-03 11:23               ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-03  0:03 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 2 Mar 2009 23:22:35 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 15:18:30]:
> 
> > On Mon, 2 Mar 2009 11:35:19 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > > Then, a not-sorted RB-tree can result.
> > > > 
> > > > BTW,
> > > >    time_after(jiffies, 0)
> > > > is buggy (see definition). If you want to make this true always, use
> > > >    time_after(jiffies, jiffies + 1)
> > > >
> > > 
> > > HZ/4 is 250/4 jiffies in the worst case (62). We have
> > > time_after(jiffies, next_update_interval) and next_update_interval is
> > > set to last_tree_update + 62. Not sure if I got what you are pointing
> > > to.
> > > 
> > +	unsigned long next_update = 0;
> > +	unsigned long flags;
> > +
> > +	if (!css_tryget(&mem->css))
> > +		return;
> > +	prev_usage_in_excess = mem->usage_in_excess;
> > +	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > +
> > +	if (time_check)
> > +		next_update = mem->last_tree_update +
> > +				MEM_CGROUP_TREE_UPDATE_INTERVAL;
> > +	if (new_usage_in_excess && time_after(jiffies, next_update)) {
> > +		if (prev_usage_in_excess)
> > +			mem_cgroup_remove_exceeded(mem);
> > +		mem_cgroup_insert_exceeded(mem);
> > +		updated_tree = true;
> > +	} else if (prev_usage_in_excess && !new_usage_in_excess) {
> > +		mem_cgroup_remove_exceeded(mem);
> > +		updated_tree = true;
> > +	}
> > 
> > My point is what happens if time_check==false.
> > time_after(jiffies, 0) is buggy.
> >
> 
> I see your point now, but the idea behind doing so is that
> time_after(jiffies, 0) will always return false, which forces the
> prev_usage_in_excess and !new_usage_in_excess check to execute. We set
> the value to false only from __mem_cgroup_free().
> 
> Are you suggesting that calling time_after(jiffies, 0) is buggy?
> The comment
> 
>   Do this with "<0" and ">=0" to only test the sign of the result. A
>  
> I think refers to the comparison check and not to the parameters. I
> hope I am reading this right.

 106 #define time_after(a,b)         \
 107         (typecheck(unsigned long, a) && \
 108          typecheck(unsigned long, b) && \
 109          ((long)(b) - (long)(a) < 0))

Reading the above:

  if b == 0:
     if (long)a < 0  -> false
     if (long)a > 0  -> true

jiffies is an unsigned value; please think of the bit pattern of signed vs. unsigned values.
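
To illustrate (a minimal userspace sketch of the same arithmetic; the kernel's
typecheck() is dropped, so this is not the real macro):
==
#include <stdio.h>
#include <limits.h>

/* userspace copy of the arithmetic inside the kernel's time_after() */
#define time_after(a, b)  ((long)(b) - (long)(a) < 0)

int main(void)
{
	unsigned long j = 1000;			/* (long)j > 0 */
	printf("%d\n", time_after(j, 0UL));	/* prints 1: "after" 0 */

	j = (unsigned long)LONG_MAX + 1000;	/* top bit set, (long)j < 0 */
	printf("%d\n", time_after(j, 0UL));	/* prints 0: "before" 0 */
	return 0;
}
==
So with b == 0 the result flips depending on which half of the counter's range
jiffies is in; 0 is not a safe "always false" sentinel.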


Thanks,
-Kame







^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v3)
  2009-03-02  4:44     ` Balbir Singh
@ 2009-03-03  2:43       ` KOSAKI Motohiro
  2009-03-03 11:17         ` Balbir Singh
  0 siblings, 1 reply; 40+ messages in thread
From: KOSAKI Motohiro @ 2009-03-03  2:43 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-02 12:08:01]:
> 
> > Hi Balbir,
> > 
> > > @@ -2015,9 +2016,12 @@ static int kswapd(void *p)
> > >  		finish_wait(&pgdat->kswapd_wait, &wait);
> > >  
> > >  		if (!try_to_freeze()) {
> > > +			struct zonelist *zl = pgdat->node_zonelists;
> > >  			/* We can speed up thawing tasks if we don't call
> > >  			 * balance_pgdat after returning from the refrigerator
> > >  			 */
> > > +			if (!order)
> > > +				mem_cgroup_soft_limit_reclaim(zl, GFP_KERNEL);
> > >  			balance_pgdat(pgdat, order);
> > >  		}
> > >  	}
> > 
> > kswapd's role is to increase free pages up to zone->pages_high in its "own node".
> > mem_cgroup_soft_limit_reclaim() frees one (or more) excess pages in any node.
> > 
> > Oh, well.
> > I think it is not consistent.
> > 
> > if mem_cgroup_soft_limit_reclaim() is aware of the target node and its pages_high,
> > I'm glad.
> 
> Yes, correct: the role of kswapd is to keep increasing free pages until
> zone->pages_high, and the first set of pages to consider is the memory
> controllers over their soft limits. We pass the zonelist to ensure that
> while doing soft reclaim, we focus on the zonelist associated with the
> node. Kamezawa had concerns over calling the soft limit reclaim from
> __alloc_pages_internal(); did you prefer that call path?

I read your patch again.
So it seems a better place to call mem_cgroup_soft_limit_reclaim() is in balance_pgdat().

Please imagine the worst scenario.
CPU0 (kswapd) keeps shrinking.
CPU1 does other activity and charges the memcg continuously.
In that case, balance_pgdat() doesn't exit for a very long time, so
mem_cgroup_soft_limit_reclaim() is never called.

Ideally, if another CPU takes another charge, kswapd should shrink by
soft limit again.


btw, I don't like the "if (!order)" condition. The memcg soft limit should
always be shrunk regardless of the order, because the wakeup_kswapd()
order argument is merely a hint.

Another process may want another order.




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-02 23:59                     ` KAMEZAWA Hiroyuki
@ 2009-03-03 11:12                       ` Balbir Singh
  2009-03-03 11:50                         ` KAMEZAWA Hiroyuki
  2009-03-05  9:04                         ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-03 11:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-03 08:59:14]:

> On Mon, 2 Mar 2009 23:11:56 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 23:04:34]:
> > 
> > > Balbir Singh wrote:
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02
> > > > 16:06:02]:
> > > >
> > > >> On Mon, 2 Mar 2009 12:06:49 +0530
> > > >> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > >> > OK, I get your point, but why does that make the RB-Tree data structure
> > > >> nonsense?
> > > >> >
> > > >>
> > > >>  1. Until memory shortage, the rb-tree is kept updated and the
> > > >>     users (kernel) have to pay its maintenance/check cost, which is
> > > >>     unnecessary. Considering the trade-off, paying the cost only when
> > > >>     memory shortage happens tends to be the reasonable way.
> > > > As you've seen in the code, the cost is only at an interval HZ/2
> > > > currently. The other overhead is the calculation of excess, I can try
> > > > and see if we can get rid of it.
> > > >
> > > >>
> > > >>  2. The current "exceed" just shows "how much we got over my soft limit"
> > > >>     but doesn't tell any information per node/zone. Considering this,
> > > >>     the rb-tree information will not be able to help kswapd (on NUMA).
> > > >>     But maintaining per-node information uses too much resource.
> > > >
> > > > Yes, kswapd is per-node and we try to free all pages belonging to a
> > > > zonelist as specified by pgdat->node_zonelists for the memory control
> > > > groups that are over their soft limit. Keeping this (exceeds) information
> > > > per node makes no sense.
> > > >
> > > >>
> > > >>  Considering the above 2, it's not bad to find the victim by proper logic
> > > >>  from balance_pgdat() by using mem_cgroup_select_victim().
> > > >>  like this:
> > > >> ==
> > > >>  struct mem_cgroup *select_victim_at_soft_limit_via_balance_pgdat(int
> > > >> nid, int zid)
> > > >>  {
> > > >>      while (?) {
> > > >>         victim = mem_cgroup_select_victim(init_mem_cgroup);  #need some
> > > >> modification.
> > > >>         if (victim is not over soft-limit)
> > > >>              continue;
> > > >>         /* Ok this is candidate */
> > > >>         usage = mem_cgroup_nid_zid_usage(mem, nid, zid); #get sum of
> > > >> active/inactive
> > > >>         if (usage_is_enough_big)
> > > >>               return victim;
> > > >
> > > > We currently track overall usage, so should we split it into per-nid,
> > > > per-zid information and use that? Is that your suggestion?
> > > 
> > > My suggestion is that the current per-zone statistics interface of memcg
> > > already holds all necessary information, and aggregate usage information
> > > is not worth tracking because it's no help for kswapd.
> > >
> > 
> > We have that data, but we need aggregate data to see who exceeded the
> > limit.
> >  
> Aggregate data is in res_counter, already.
> 
> 
> 
> > > >  The soft limit is
> > > > also an aggregate limit, how do we define usage_is_big_enough or
> > > > usage_is_enough_big? Through some heuristics?
> > > >
> > > I think that if a memcg/zone's page usage is not 0, it's big enough.
> > > (And round-robin rotation, as in hierarchical reclaim, can be used.)
> > > 
> > > There may be some threshold to try.
> > > 
> > > For example)
> > >    need_to_reclaim = zone->high - zone->free.
> > >    if (usage_in_this_zone_of_memcg > need_to_reclaim/4)
> > >          select this.
> > > 
> > > Maybe we can adjust that later.
> > > 
> > 
> > No... this looks broken by design. Even if the administrator sets a
> > large enough limit and no soft limits, the cgroup gets reclaimed from?
> > 
> I wrote
> ==
>  if (victim is not over soft-limit)
> ==
> > ....Maybe this discussion style is bad and I should explain my approach in a patch.
> (I can't write code today, sorry.)
> 
> 
> > 
> > > >>      }
> > > >>  }
> > > >>  balance_pgdat()
> > > >>  ...... find target zone....
> > > >>  ...
> > > >>  mem = select_victim_at_soft_limit_via_balance_pgdat(nid, zid)
> > > >>  if (mem)
> > > >>    sc->mem = mem;
> > > >>  shrink_zone();
> > > >>  if (mem) {
> > > >>    sc->mem = NULL;
> > > >>    css_put(&mem->css);
> > > >>  }
> > > >> ==
> > > >>
> > > >>  We have to pay the scan cost but it will not be too big (if there are
> > > >> not thousands of memcgs).
> > > >>  Under the above, round-robin rotation is used rather than sorting.
> > > >
> > > > Yes, we sort, but not at every page fault; only at a
> > > > specified interval.
> > > >
> > > >>  Maybe I can show you sample.....(but I'm a bit busy.)
> > > >>
> > > >
> > > > Explanation and review are good, but I don't see how not-sorting will
> > > > help. I need something that can point me to the culprits quickly
> > > > enough during soft limit reclaim, and the RB-Tree works very well for me.
> > > >
> > > 
> > > I don't think "tracking which memcg exceeds its soft limit" is worth
> > > doing in a synchronous way. It can be done lazily, when it's necessary,
> > > with simpler logic.
> > >
> > 
> > The synchronous way can be harmful if we do it at every page fault. The
> > current logic is quite simple....no?
> In my point of view, No.
> 
> For example, I can never explain why HZ/4 is the best interval and
> why we have to maintain the tree while there is no memory shortage.
> 

Why do we need to track pages even when no hard limits are set up?
Every feature comes with a price when enabled.

> IMHO, under a well-controlled system with cgroups, problematic applications
> and very heavy file cache users are under limits. Memory shortage can be
> a rare event after all.

Yes, and that is why hard limits make no sense there. Soft limits make
more sense: in the rare event of shortage, they kick in.

> 
> But, on NUMA, because memcg just checks "usage" and doesn't check
> "usage-per-node", there can be memory shortage, and this kind of soft-limit
> sounds attractive to me.
> 


Could you please elaborate further on this?

> >  
> > > BTW, did you do the set-softlimit-zero and rmdir() test?
> > > At a quick review, a memcg will never be removed from the RB tree because
> > > force_empty moves accounting from children to parent, but there are no
> > > tree ops there. Plz see mem_cgroup_move_account().
> > > 
> > 
> > __mem_cgroup_free() has tree ops, shouldn't that catch this scenario?
> > 
> Ok, I missed that. Thank you for clarification.
> 
> Regards.
> -Kame
> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v3)
  2009-03-03  2:43       ` KOSAKI Motohiro
@ 2009-03-03 11:17         ` Balbir Singh
  2009-03-04  0:07           ` KOSAKI Motohiro
  0 siblings, 1 reply; 40+ messages in thread
From: Balbir Singh @ 2009-03-03 11:17 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-03 11:43:49]:

> > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-02 12:08:01]:
> > 
> > > Hi Balbir,
> > > 
> > > > @@ -2015,9 +2016,12 @@ static int kswapd(void *p)
> > > >  		finish_wait(&pgdat->kswapd_wait, &wait);
> > > >  
> > > >  		if (!try_to_freeze()) {
> > > > +			struct zonelist *zl = pgdat->node_zonelists;
> > > >  			/* We can speed up thawing tasks if we don't call
> > > >  			 * balance_pgdat after returning from the refrigerator
> > > >  			 */
> > > > +			if (!order)
> > > > +				mem_cgroup_soft_limit_reclaim(zl, GFP_KERNEL);
> > > >  			balance_pgdat(pgdat, order);
> > > >  		}
> > > >  	}
> > > 
> > > > kswapd's role is to increase free pages up to zone->pages_high in its "own node".
> > > > mem_cgroup_soft_limit_reclaim() frees one (or more) excess pages in any node.
> > > > 
> > > > Oh, well.
> > > > I think it is not consistent.
> > > > 
> > > > if mem_cgroup_soft_limit_reclaim() is aware of the target node and its pages_high,
> > > > I'm glad.
> > 
> > Yes, correct: the role of kswapd is to keep increasing free pages until
> > zone->pages_high, and the first set of pages to consider is the memory
> > controllers over their soft limits. We pass the zonelist to ensure that
> > while doing soft reclaim, we focus on the zonelist associated with the
> > node. Kamezawa had concerns over calling the soft limit reclaim from
> > __alloc_pages_internal(); did you prefer that call path?
> 
> I read your patch again.
> So it seems a better place to call mem_cgroup_soft_limit_reclaim() is in balance_pgdat().
> 
> Please imagine the worst scenario.
> CPU0 (kswapd) keeps shrinking.
> CPU1 does other activity and charges the memcg continuously.
> In that case, balance_pgdat() doesn't exit for a very long time, so
> mem_cgroup_soft_limit_reclaim() is never called.
> 

Yes, true... that is why I added the hooks in __alloc_pages_internal()
in the first two revisions, but Kamezawa objected to them. In the
scenario you mention, where balance_pgdat() is busy, if we are
under global system memory pressure, then even after freeing memory from
soft limited cgroups, we don't have sufficient free memory. We need to
go reclaim from the whole system. An administrator can easily avoid
the above scenario by using hard limits on the cgroup running on CPU1.

> Ideally, if another CPU takes another charge, kswapd should shrink by
> soft limit again.
>

Could you please elaborate further?
 
> 
> btw, I don't like the "if (!order)" condition. The memcg soft limit should
> always be shrunk regardless of the order, because the wakeup_kswapd()
> order argument is merely a hint.
> 
> Another process may want another order.
> 

Agreed, I'll remove the check.
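
For reference, the hunk would then reduce to something like this (a sketch
against the diff quoted above, untested):
==
		if (!try_to_freeze()) {
			struct zonelist *zl = pgdat->node_zonelists;
			/*
			 * Run soft limit reclaim on every wakeup; the order
			 * passed to wakeup_kswapd() is merely a hint, so do
			 * not gate this on !order.
			 */
			mem_cgroup_soft_limit_reclaim(zl, GFP_KERNEL);
			balance_pgdat(pgdat, order);
		}
==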

> 
> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-03  0:03             ` KAMEZAWA Hiroyuki
@ 2009-03-03 11:23               ` Balbir Singh
  0 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-03 11:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-03 09:03:03]:

> On Mon, 2 Mar 2009 23:22:35 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-02 15:18:30]:
> > 
> > > On Mon, 2 Mar 2009 11:35:19 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > > Then, not-sorted RB-tree can be there.
> > > > > 
> > > > > BTW,
> > > > >    time_after(jiffies, 0)
> > > > > is buggy (see definition). If you want to make this true always, use
> > > > >    time_after(jiffies, jiffies + 1)
> > > > >
> > > > 
> > > > HZ/4 is 250/4 jiffies in the worst case (62). We have
> > > > time_after(jiffies, next_update_interval) and next_update_interval is
> > > > set to last_tree_update + 62. Not sure if I got what you are pointing
> > > > to.
> > > > 
> > > +	unsigned long next_update = 0;
> > > +	unsigned long flags;
> > > +
> > > +	if (!css_tryget(&mem->css))
> > > +		return;
> > > +	prev_usage_in_excess = mem->usage_in_excess;
> > > +	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > > +
> > > +	if (time_check)
> > > +		next_update = mem->last_tree_update +
> > > +				MEM_CGROUP_TREE_UPDATE_INTERVAL;
> > > +	if (new_usage_in_excess && time_after(jiffies, next_update)) {
> > > +		if (prev_usage_in_excess)
> > > +			mem_cgroup_remove_exceeded(mem);
> > > +		mem_cgroup_insert_exceeded(mem);
> > > +		updated_tree = true;
> > > +	} else if (prev_usage_in_excess && !new_usage_in_excess) {
> > > +		mem_cgroup_remove_exceeded(mem);
> > > +		updated_tree = true;
> > > +	}
> > > 
> > > My point is what happens if time_check==false.
> > > time_after(jiffies, 0) is buggy.
> > >
> > 
> > I see your point now, but the idea behind doing so is that
> > time_after(jiffies, 0) will always return false, which forces the
> > prev_usage_in_excess and !new_usage_in_excess check to execute. We set
> > the value to false only from __mem_cgroup_free().
> > 
> > Are you suggesting that calling time_after(jiffies, 0) is buggy?
> > The comment
> > 
> >   Do this with "<0" and ">=0" to only test the sign of the result. A
> >  
> > I think refers to the comparison check and not to the parameters. I
> > hope I am reading this right.
> 
>  106 #define time_after(a,b)         \
>  107         (typecheck(unsigned long, a) && \
>  108          typecheck(unsigned long, b) && \
>  109          ((long)(b) - (long)(a) < 0))
> 
> Reading the above:
> 
>   if b == 0:
>      if (long)a < 0  -> false
>      if (long)a > 0  -> true
> 
> jiffies is an unsigned value; please think of the bit pattern of signed vs. unsigned values.

Fair enough, the cast to long will be an issue. I'll fix it.
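
One possible shape of the fix (a sketch only, reusing the variables from the
hunk quoted above; untested):
==
	bool time_ok = false;

	if (!css_tryget(&mem->css))
		return;
	prev_usage_in_excess = mem->usage_in_excess;
	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);

	/* gate the comparison on time_check instead of abusing 0 as sentinel */
	if (time_check)
		time_ok = time_after(jiffies, mem->last_tree_update +
					      MEM_CGROUP_TREE_UPDATE_INTERVAL);
	if (new_usage_in_excess && time_ok) {
		if (prev_usage_in_excess)
			mem_cgroup_remove_exceeded(mem);
		mem_cgroup_insert_exceeded(mem);
		updated_tree = true;
	} else if (prev_usage_in_excess && !new_usage_in_excess) {
		mem_cgroup_remove_exceeded(mem);
		updated_tree = true;
	}
==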

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-03 11:12                       ` Balbir Singh
@ 2009-03-03 11:50                         ` KAMEZAWA Hiroyuki
  2009-03-03 13:14                           ` Balbir Singh
  2009-03-05  9:04                         ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-03 11:50 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton

Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-03
> 08:59:14]:
>> But, on NUMA, because memcg just checks "usage" and doesn't check
>> "usage-per-node", there can be memory shortage and this kind of
>> soft-limit
>> sounds attractive for me.
>>
>
>
> Could you please elaborate further on this?
>
I'll try to explain with an artificial example.
Assume a system with 4 nodes, and 1G of memory per node.
==
     Node0 -- 1G
     Node1 -- 1G
     Node2 -- 1G
     Node3 -- 1G
==
And assume there are 3 memory cgroups of following hard-limit.
==
    GroupA -- 1G
    GroupB -- 0.6G
    GroupC -- 0.6G
==
If the machine is non-NUMA and 4G SMP, we expect 1.8G of free memory and
we can assume "global memory shortage" is a rare event.

But on NUMA, memory usage can be the following.
==
     GroupA -- 950M of usage
     GroupB -- 550M of usage
     GroupC -- 550M of usage
and
     Node0 -- usage=1G
     Node1 -- usage=1G
     Node2 -- usage=50M
     Node3 -- usage=0
==
In this case, kswapd will work on Node0 and Node1, so the softlimit will
have a chance to work. If the user declares GroupA's softlimit is 800M,
GroupA will be the victim in this case.

But we have to admit this is a hard-to-use scheduling parameter. Almost no
administrator will be able to set a proper value.
A useful case I can think of is creating some "victim" group to guard the
other groups from global memory reclaim. I think I need some study about
how to use the softlimit. But we'll need this kind of parameter anyway, and
I have no objection to adding this kind of scheduling parameter.
But the implementation should be simple at this stage, and we should
find the best scheduling algorithm for the use cases eventually...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-03 11:50                         ` KAMEZAWA Hiroyuki
@ 2009-03-03 13:14                           ` Balbir Singh
  0 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-03 13:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-03 20:50:56]:

> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-03
> > 08:59:14]:
> >> But, on NUMA, because memcg just checks "usage" and doesn't check
> >> "usage-per-node", there can be memory shortage and this kind of
> >> soft-limit
> >> sounds attractive for me.
> >>
> >
> >
> > Could you please elaborate further on this?
> >
> I'll try to explain with an artificial example.
> Assume a system with 4 nodes, and 1G of memory per node.
> ==
>      Node0 -- 1G
>      Node1 -- 1G
>      Node2 -- 1G
>      Node3 -- 1G
> ==
> And assume there are 3 memory cgroups of following hard-limit.
> ==
>     GroupA -- 1G
>     GroupB -- 0.6G
>     GroupC -- 0.6G
> ==
> If the machine is non-NUMA and 4G SMP, we expect 1.8G of free memory and
> we can assume "global memory shortage" is a rare event.
> 
> But on NUMA, memory usage can be the following.
> ==
>      GroupA -- 950M of usage
>      GroupB -- 550M of usage
>      GroupC -- 550M of usage
> and
>      Node0 -- usage=1G
>      Node1 -- usage=1G
>      Node2 -- usage=50M
>      Node3 -- usage=0
> ==
> In this case, kswapd will work on Node0 and Node1, so the softlimit will
> have a chance to work. If the user declares GroupA's softlimit is 800M,
> GroupA will be the victim in this case.
>

Yes, GroupA is the victim, but if GroupA has not allocated from a
particular node, we can ensure that we don't reclaim from that node
even while doing soft limit reclaim.
 
> But we have to admit this is a hard-to-use scheduling parameter. Almost no
> administrator will be able to set a proper value.
> A useful case I can think of is creating some "victim" group to guard the
> other groups from global memory reclaim. I think I need some study about
> how to use the softlimit. But we'll need this kind of parameter anyway, and
> I have no objection to adding this kind of scheduling parameter.
> But the implementation should be simple at this stage, and we should
> find the best scheduling algorithm for the use cases eventually...
> 

Yes, and it should be correct and reliable, not based on heuristics
that are hard to prove correct or even acceptable. Let me work on
the comments so far and refresh the patches.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v3)
  2009-03-03 11:17         ` Balbir Singh
@ 2009-03-04  0:07           ` KOSAKI Motohiro
  0 siblings, 0 replies; 40+ messages in thread
From: KOSAKI Motohiro @ 2009-03-04  0:07 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

Hi Balbir

> > > > kswapd's role is to increase free pages up to zone->pages_high in its "own node".
> > > > mem_cgroup_soft_limit_reclaim() frees one (or more) excess pages in any node.
> > > > 
> > > > Oh, well.
> > > > I think it is not consistent.
> > > > 
> > > > if mem_cgroup_soft_limit_reclaim() is aware of the target node and its pages_high,
> > > > I'm glad.
> > > 
> > > Yes, correct: the role of kswapd is to keep increasing free pages until
> > > zone->pages_high, and the first set of pages to consider is the memory
> > > controllers over their soft limits. We pass the zonelist to ensure that
> > > while doing soft reclaim, we focus on the zonelist associated with the
> > > node. Kamezawa had concerns over calling the soft limit reclaim from
> > > __alloc_pages_internal(); did you prefer that call path?
> > 
> > I read your patch again.
> > So it seems a better place to call mem_cgroup_soft_limit_reclaim() is in balance_pgdat().
> > 
> > Please imagine the worst scenario.
> > CPU0 (kswapd) keeps shrinking.
> > CPU1 does other activity and charges the memcg continuously.
> > In that case, balance_pgdat() doesn't exit for a very long time, so
> > mem_cgroup_soft_limit_reclaim() is never called.
> > 
> 
> Yes, true... that is why I added the hooks in __alloc_pages_internal()
> in the first two revisions, but Kamezawa objected to them. In the
> scenario you mention, where balance_pgdat() is busy, if we are
> under global system memory pressure, then even after freeing memory from
> soft limited cgroups, we don't have sufficient free memory. We need to
> go reclaim from the whole system. An administrator can easily avoid
> the above scenario by using hard limits on the cgroup running on CPU1.

I agree that the soft limit implementation is difficult.

but I still don't like putting soft limit reclaim in __alloc_pages_internal().
If it goes there, kswapd reclaims pages from the global LRU *before*
any soft limit shrinking happens.

again, linux reclaim policy is

	free < pages_low:  run kswapd
	free < pages_min:  foreground reclaim via __alloc_pages_internal()

then, if soft limit reclaim is put into __alloc_pages_internal(),

	free < pages_low:  run kswapd
	free < pages_min:  soft limit reclaim and 
                           foreground reclaim via __alloc_pages_internal()

this seems like unintentional behavior.
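
In pseudo-code (illustrative only; these are not the exact allocator calls):
==
	if (free < zone->pages_low)
		wakeup_kswapd();		/* background reclaim starts */
	if (free < zone->pages_min) {
		/* foreground, latency-critical slow path */
		mem_cgroup_soft_limit_reclaim(...);	/* would sit here */
		try_to_free_pages(...);
	}
==
so the soft limit victims would only be touched after background reclaim has
already been eating the global LRU, which is the inversion I mean.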

In addition, I still strongly oppose the global lock, even if soft limit
shrinking is not put into __alloc_pages_internal().
I think that objection doesn't depend on the caller's place.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-03 11:12                       ` Balbir Singh
  2009-03-03 11:50                         ` KAMEZAWA Hiroyuki
@ 2009-03-05  9:04                         ` KAMEZAWA Hiroyuki
  2009-03-05  9:13                           ` KAMEZAWA Hiroyuki
  2009-03-05 15:26                           ` Balbir Singh
  1 sibling, 2 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-05  9:04 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Tue, 3 Mar 2009 16:42:44 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > I wrote
> > ==
> >  if (victim is not over soft-limit)
> > ==
> > ....Maybe this discussion style is bad and I should explain my approach in patch.
> > (I can't write code today, sorry.)
> > 

This is an example of my direction, "do it lazy" softlimit.

Maybe this is not perfect, but it addresses almost all my concerns.
I hope this will be an input for you.
I intentionally didn't divide the patch into small pieces, to show the big picture.
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

An example patch. Don't trust me, this patch may have bugs.

Memory Cgroup Softlimit support(Yet another one)

 memory cgroup accounts and limits usage of memory in the system, but it
 doesn't take NUMA memory usage into account. Cpuset can be a help for NUMA
 users, but for typical users an easier knob to control "how to avoid memory
 shortage" is better.
 When the user sets a softlimit on a cgroup, kswapd() first checks for victim
 groups whose usage is over their softlimit and reclaims memory from them.

 The victim selection condition is
	if (memcg is over its softlimit && it has pages in the zone)
 The selection algorithm is round-robin.
 And if priority goes high (i.e. its value gets small), this softlimit logic
 will not run and the memory cgroup will not disturb kswapd.

 All information is maintained per node and there are no global locks.
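
 In sketch form, the intended kswapd flow is (a simplified, non-compilable
 summary of the diff below):
==
 balance_pgdat(pgdat):
	mem_cgroup_update_softlimit_hint(nid);	/* refresh per-node hint */
	for each priority, for each zone below its watermark:
		if (!(priority < DEF_PRIORITY - 2))
			softlimit_shrink_zone(nid, zid, priority, zone, &sc);
		else
			shrink_zone(priority, zone, &sc);
==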

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   16 +++
 mm/memcontrol.c            |  215 +++++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                |   36 +++++++
 3 files changed, 259 insertions(+), 8 deletions(-)

Index: mmotm-2.6.29-Mar3/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar3.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar3/mm/memcontrol.c
@@ -175,12 +175,18 @@ struct mem_cgroup {
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
-
+	/*
+	 * Softlimit information.
+	 */
+	u64 softlimit;
+	nodemask_t softlimit_mask;
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
 };
+static struct mem_cgroup *init_mem_cgroup __read_mostly;
+
 
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -210,6 +216,8 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
+#define SOFTLIMIT               (0x10)
+
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -325,16 +333,13 @@ static bool mem_cgroup_is_obsolete(struc
 /*
  * Call callback function against all cgroup under hierarchy tree.
  */
-static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
+static int __mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
 			  int (*func)(struct mem_cgroup *, void *))
 {
 	int found, ret, nextid;
 	struct cgroup_subsys_state *css;
 	struct mem_cgroup *mem;
 
-	if (!root->use_hierarchy)
-		return (*func)(root, data);
-
 	nextid = 1;
 	do {
 		ret = 0;
@@ -357,6 +362,22 @@ static int mem_cgroup_walk_tree(struct m
 	return ret;
 }
 
+static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
+			  int (*func)(struct mem_cgroup *, void *))
+{
+
+	if (!root->use_hierarchy)
+		return (*func)(root, data);
+
+	return __mem_cgroup_walk_tree(root, data, func);
+}
+
+static int mem_cgroup_walk_all(void *data,
+			       int (*func)(struct mem_cgroup *, void *))
+{
+	return __mem_cgroup_walk_tree(init_mem_cgroup, data, func);
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -1618,6 +1639,168 @@ void mem_cgroup_end_migration(struct mem
 }
 
 /*
+ * Calls for softlimit.
+ */
+/*
+ * When vmscan.c::balance_pgdat() calls into this softlimit code, a zone in
+ * the system is under memory shortage and we select an appropriate victim
+ * group. All softlimit calls are made by kswapd() per node, and each node's
+ * status is independent of the others.
+ */
+
+struct softlimit_pernode_info {
+	int last_victim;	    /* ID of mem_cgroup last visited */
+	int count;                  /* # of softlimit reclaim candidates */
+};
+/* Control information per node */
+struct softlimit_pernode_info softlimit_control[MAX_NUMNODES];
+
+static unsigned long
+memcg_evictable_usage(struct mem_cgroup *mem, int nid, int zid)
+{
+	unsigned long total = 0;
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
+
+	if (nr_swap_pages) { /* We have swap ? */
+		total += MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+		total += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON);
+	}
+	total += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE);
+	total += MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+	return total;
+}
+
+/*
+ * Return true if the memcg or one of its ancestors hits its softlimit.
+ */
+static bool memcg_hit_soft_limit(struct mem_cgroup *mem)
+{
+	unsigned long usage;
+	struct cgroup *cgroup;
+
+	do {
+		usage = res_counter_read_u64(&mem->res, RES_USAGE);
+		if (usage > mem->softlimit)
+			return true;
+		cgroup = mem->css.cgroup;
+		if (cgroup->parent) {
+			cgroup = cgroup->parent;
+			mem = mem_cgroup_from_cont(cgroup);
+		} else
+			mem = NULL;
+	} while (mem && mem->use_hierarchy);
+
+	return false;
+}
+
+static int
+__mem_cgroup_update_soft_limit_hint(struct mem_cgroup *mem, void *data)
+{
+	long nid = (long)data;
+	struct softlimit_pernode_info *spi = &softlimit_control[nid];
+	unsigned long usage = 0;
+	int zid;
+
+	/* First, remove this node from the candidate mask. */
+	node_clear(nid, mem->softlimit_mask);
+
+	/* Calculating from the highest zone down is better... */
+	for (usage = 0, zid = MAX_NR_ZONES - 1;
+	     !usage && zid >= 0;
+	     zid--)
+		usage += memcg_evictable_usage(mem, nid, zid);
+
+	/* This group has no evictable pages on this node; go to the next. */
+	if (!usage)
+		return 0;
+
+	/* If one of ancestor hits limit, this is reclaim candidate */
+	if (memcg_hit_soft_limit(mem)) {
+		node_set(nid, mem->softlimit_mask);
+		spi->count++;
+	}
+	return 0;
+}
+/*
+ * Calculate the hint once per appropriate kswapd iteration. Because
+ * balance_pgdat() visits a zone/node several times repeatedly, this hint is
+ * helpful for reducing unnecessary work.
+ */
+void mem_cgroup_update_softlimit_hint(int nid)
+{
+	struct softlimit_pernode_info *spi = &softlimit_control[nid];
+
+	spi->count = 0;
+	/* Visit all and fill per-node hint. */
+	mem_cgroup_walk_all(((void *)(long)nid),
+			    __mem_cgroup_update_soft_limit_hint);
+}
+
+/*
+ * Scan the mem_cgroups and return a candidate for softlimit reclaim. Reclaim
+ * information is maintained per node, and this routine checks the evictable
+ * usage of the memcg in the specified zone.
+ * The mem->softlimit_mask nodemask is used as a hint to show whether
+ * reclaiming memory from this memcg will help or not.
+ */
+struct mem_cgroup *mem_cgroup_get_victim(int nid, int zid)
+{
+	struct softlimit_pernode_info *spi = &softlimit_control[nid];
+	struct mem_cgroup *ret = NULL;
+	struct mem_cgroup *mem = NULL;
+	int checked, nextid, found;
+	struct cgroup_subsys_state *css;
+	if (spi->count == 0)
+		return NULL;
+
+	checked = 0;
+	while (checked < spi->count) {
+
+		mem = NULL;
+		rcu_read_lock();
+		nextid  = spi->last_victim + 1;
+		css = css_get_next(&mem_cgroup_subsys,
+				   nextid, &init_mem_cgroup->css, &found);
+		if (css && css_tryget(css))
+			mem = container_of(css, struct mem_cgroup, css);
+		rcu_read_unlock();
+
+		if (!css) {
+			spi->last_victim = 0;
+			continue;
+		}
+
+		spi->last_victim = found;
+		if (!mem)
+			continue;
+
+		/* check hint status of this memcg */
+		if (!node_isset(nid, mem->softlimit_mask)) {
+			css_put(css);
+			continue;
+		}
+		checked++;
+
+		if (!memcg_hit_soft_limit(mem)) {
+			/* the hint is obsolete....*/
+			node_clear(nid, mem->softlimit_mask);
+			spi->count--;
+		} else if (memcg_evictable_usage(mem, nid, zid)) {
+			ret = mem;
+			break;
+		}
+		css_put(css);
+	}
+	return ret;
+}
+
+void mem_cgroup_put_victim(struct mem_cgroup *mem)
+{
+	if (mem)
+		css_put(&mem->css);
+}
+
+/*
  * A call to try to shrink memory usage under specified resource controller.
  * This is typically used for page reclaiming for shmem for reducing side
  * effect of page allocation from shmem, which is used by some mem_cgroup.
@@ -1949,7 +2132,10 @@ static u64 mem_cgroup_read(struct cgroup
 	name = MEMFILE_ATTR(cft->private);
 	switch (type) {
 	case _MEM:
-		val = res_counter_read_u64(&mem->res, name);
+		if (name == SOFTLIMIT)
+			val = mem->softlimit;
+		else
+			val = res_counter_read_u64(&mem->res, name);
 		break;
 	case _MEMSWAP:
 		if (do_swap_account)
@@ -1986,6 +2172,12 @@ static int mem_cgroup_write(struct cgrou
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case SOFTLIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->softlimit = val;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -2241,6 +2433,12 @@ static struct cftype mem_cgroup_files[] 
 		.read_u64 = mem_cgroup_read,
 	},
 	{
+		.name = "softlimit",
+		.private = MEMFILE_PRIVATE(_MEM, SOFTLIMIT),
+		.read_u64 = mem_cgroup_read,
+		.write_string = mem_cgroup_write,
+	},
+	{
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
@@ -2464,6 +2662,11 @@ mem_cgroup_create(struct cgroup_subsys *
 	if (parent)
 		mem->swappiness = get_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
+
+	/* Record the root */
+	if (!init_mem_cgroup)
+		init_mem_cgroup = mem;
+
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
Index: mmotm-2.6.29-Mar3/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.29-Mar3.orig/include/linux/memcontrol.h
+++ mmotm-2.6.29-Mar3/include/linux/memcontrol.h
@@ -116,6 +116,10 @@ static inline bool mem_cgroup_disabled(v
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
+/* SoftLimit stuff */
+extern void mem_cgroup_update_softlimit_hint(int nid);
+extern struct mem_cgroup *mem_cgroup_get_victim(int nid, int zid);
+extern void mem_cgroup_put_victim(struct mem_cgroup *mem);
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
@@ -264,6 +268,18 @@ mem_cgroup_print_oom_info(struct mem_cgr
 {
 }
 
+static inline void mem_cgroup_update_softlimit_hint(int nid)
+{
+}
+
+static inline struct mem_cgroup *mem_cgroup_get_victim(int nid, int zid)
+{
+	return NULL;
+}
+static inline void mem_cgroup_put_victim(struct mem_cgroup *mem)
+{
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-2.6.29-Mar3/mm/vmscan.c
===================================================================
--- mmotm-2.6.29-Mar3.orig/mm/vmscan.c
+++ mmotm-2.6.29-Mar3/mm/vmscan.c
@@ -1734,6 +1734,26 @@ unsigned long try_to_free_mem_cgroup_pag
 #endif
 
 /*
+ * When mem_cgroup's softlimit is used, capture kswapd and reclaim against the
+ * softlimit victims. This will never be called when priority is bad.
+ */
+
+void softlimit_shrink_zone(int nid, int zid, int priority, struct zone *zone,
+			      struct scan_control *sc)
+{
+	sc->mem_cgroup = mem_cgroup_get_victim(nid, zid);
+	if (sc->mem_cgroup) {
+		sc->isolate_pages = mem_cgroup_isolate_pages;
+		/* Should we use memcg's swappiness here ? */
+		shrink_zone(priority, zone, sc);
+		mem_cgroup_put_victim(sc->mem_cgroup);
+		sc->mem_cgroup = NULL;
+		sc->isolate_pages = isolate_pages_global;
+	} else
+		shrink_zone(priority, zone, sc);
+}
+
+/*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
  *
@@ -1758,6 +1778,7 @@ static unsigned long balance_pgdat(pg_da
 {
 	int all_zones_ok;
 	int priority;
+	int nid = pgdat->node_id;
 	int i;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1785,6 +1806,8 @@ loop_again:
 	for (i = 0; i < pgdat->nr_zones; i++)
 		temp_priority[i] = DEF_PRIORITY;
 
+	mem_cgroup_update_softlimit_hint(nid);
+
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -1863,8 +1886,17 @@ loop_again:
 			 * zone has way too many pages free already.
 			 */
 			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
-						end_zone, 0))
-				shrink_zone(priority, zone, &sc);
+					       end_zone, 0)) {
+				/*
+				 * If priority is not so bad, try softlimit
+				 * of memcg.
+				 */
+				if (!(priority < DEF_PRIORITY - 2))
+					softlimit_shrink_zone(nid, i,
+							 priority, zone, &sc);
+				else
+					shrink_zone(priority, zone, &sc);
+			}
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 						lru_pages);





^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-05  9:04                         ` KAMEZAWA Hiroyuki
@ 2009-03-05  9:13                           ` KAMEZAWA Hiroyuki
  2009-03-05 15:26                           ` Balbir Singh
  1 sibling, 0 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-05  9:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton


I wrote that almost all concerns are addressed, but the following part is a concern.

On Thu, 5 Mar 2009 18:04:10 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> @@ -1758,6 +1778,7 @@ static unsigned long balance_pgdat(pg_da
>  {
>  	int all_zones_ok;
>  	int priority;
> +	int nid = pgdat->node_id;
>  	int i;
>  	unsigned long total_scanned;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -1785,6 +1806,8 @@ loop_again:
>  	for (i = 0; i < pgdat->nr_zones; i++)
>  		temp_priority[i] = DEF_PRIORITY;
>  
> +	mem_cgroup_update_softlimit_hint(nid);
> +
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>  		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
>  		unsigned long lru_pages = 0;
> @@ -1863,8 +1886,17 @@ loop_again:
>  			 * zone has way too many pages free already.
>  			 */
>  			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
> -						end_zone, 0))
> -				shrink_zone(priority, zone, &sc);
> +					       end_zone, 0)) {
> +				/*
> +				 * If priority is not so bad, try softlimit
> +				 * of memcg.
> +				 */
> +				if (!(priority < DEF_PRIORITY - 2))
> +					softlimit_shrink_zone(nid, i,
> +							 priority, zone, &sc);
> +				else
> +					shrink_zone(priority, zone, &sc);
> +			}
>  			reclaim_state->reclaimed_slab = 0;
>  			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
>  						lru_pages);

This may raise priority to an unexpected level, so it may be better to call
softlimit_shrink_zone() before diving into this priority loop of balance_pgdat().

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-05  9:04                         ` KAMEZAWA Hiroyuki
  2009-03-05  9:13                           ` KAMEZAWA Hiroyuki
@ 2009-03-05 15:26                           ` Balbir Singh
  2009-03-05 23:53                             ` KAMEZAWA Hiroyuki
  2009-03-06  3:23                             ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 40+ messages in thread
From: Balbir Singh @ 2009-03-05 15:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-05 18:04:10]:

> On Tue, 3 Mar 2009 16:42:44 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > > I wrote
> > > ==
> > >  if (victim is not over soft-limit)
> > > ==
> > > ....Maybe this discussion style is bad and I should explain my approach in patch.
> > > (I can't write code today, sorry.)
> > > 
> 
> This is an example of my direction, "do it lazy" softlimit.
> 
> Maybe this is not perfect, but it addresses almost all my concerns.
> I hope this will be an input for you.
> I intentionally didn't divide the patch into small pieces, to show the big picture.
> Thanks,
> -Kame
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> An example patch. Don't trust me, this patch may have bugs.
>

Well, this is not "do it lazy": all memcgs are scanned and a tree is built
every time kswapd invokes soft limit reclaim. With 100 cgroups and 5 nodes,
we'll end up scanning cgroups 500 times. There is no ordering of selected
victims, so the largest victim might still be running unaffected.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-05 15:26                           ` Balbir Singh
@ 2009-03-05 23:53                             ` KAMEZAWA Hiroyuki
  2009-03-06  3:23                             ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-05 23:53 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Thu, 5 Mar 2009 20:56:42 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-05 18:04:10]:
> 
> > On Tue, 3 Mar 2009 16:42:44 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > > I wrote
> > > > ==
> > > >  if (victim is not over soft-limit)
> > > > ==
> > > > ....Maybe this discussion style is bad and I should explain my approach in patch.
> > > > (I can't write code today, sorry.)
> > > > 
> > 
> > This is an example of my direction, "do it lazy" softlimit.
> > 
> > Maybe this is not perfect, but it addresses almost all my concerns.
> > I hope this will be an input for you.
> > I intentionally didn't divide the patch into small pieces, to show the big picture.
> > Thanks,
> > -Kame
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > An example patch. Don't trust me, this patch may have bugs.
> >
> 
> Well, this is not "do it lazy": all memcgs are scanned and a tree is built
> every time kswapd invokes soft limit reclaim.
A tree is built? No, there is no tree. And this is lazy: there is no impact
until kswapd runs.

> With 100 cgroups and 5 nodes, we'll
> end up scanning cgroups 500 times.
No: 100 cgroups. (kswapd works per node, and the kswapds don't all run at once.)

> There is no ordering of selected victims, 
I don't think this is necessary, but if you want it you can add it easily.

> so the largest victim might still be running unaffected.
> 
No problem from my point of view.

"Soft limit" is a hint from the user to show "if usage is larger than this,
recalaim from this cgroup is appropriate"

Thanks,
-Kame







^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v3)
  2009-03-05 15:26                           ` Balbir Singh
  2009-03-05 23:53                             ` KAMEZAWA Hiroyuki
@ 2009-03-06  3:23                             ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-06  3:23 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Thu, 5 Mar 2009 20:56:42 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-05 18:04:10]:
> 
> > On Tue, 3 Mar 2009 16:42:44 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > > I wrote
> > > > ==
> > > >  if (victim is not over soft-limit)
> > > > ==
> > > > ....Maybe this discussion style is bad and I should explain my approach in patch.
> > > > (I can't write code today, sorry.)
> > > > 
> > 
> > This is an example of my direction, "do it lazy" softlimit.
> > 
> > Maybe this is not perfect, but it addresses almost all my concerns.
> > I hope this will be an input for you.
> > I intentionally didn't divide the patch into small pieces, to show the big picture.
> > Thanks,
> > -Kame
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > An example patch. Don't trust me, this patch may have bugs.
> >
> 
> Well, this is not "do it lazy": all memcgs are scanned and a tree is built
> every time kswapd invokes soft limit reclaim. With 100 cgroups and 5 nodes,
> we'll end up scanning cgroups 500 times. There is no ordering of selected
> victims, so the largest victim might still be running unaffected.
> 
I've thought of a more reasonable one. I'll post it today if it goes well.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2009-03-06  3:24 UTC | newest]

Thread overview: 40+ messages
2009-03-01  6:29 [PATCH 0/4] Memory controller soft limit patches (v3) Balbir Singh
2009-03-01  6:30 ` [PATCH 1/4] Memory controller soft limit documentation (v3) Balbir Singh
2009-03-01  6:30 ` [PATCH 2/4] Memory controller soft limit interface (v3) Balbir Singh
2009-03-02  2:03   ` KAMEZAWA Hiroyuki
2009-03-02  4:46     ` Balbir Singh
2009-03-02  5:35       ` KAMEZAWA Hiroyuki
2009-03-02  6:07         ` Balbir Singh
2009-03-02  6:19           ` KAMEZAWA Hiroyuki
2009-03-02  6:29             ` Balbir Singh
2009-03-01  6:30 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v3) Balbir Singh
2009-03-01  6:30 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v3) Balbir Singh
2009-03-02  3:08   ` KOSAKI Motohiro
2009-03-02  4:44     ` Balbir Singh
2009-03-03  2:43       ` KOSAKI Motohiro
2009-03-03 11:17         ` Balbir Singh
2009-03-04  0:07           ` KOSAKI Motohiro
2009-03-02  0:24 ` [PATCH 0/4] Memory controller soft limit patches (v3) KAMEZAWA Hiroyuki
2009-03-02  4:40   ` Balbir Singh
2009-03-02  5:32     ` KAMEZAWA Hiroyuki
2009-03-02  6:05       ` Balbir Singh
2009-03-02  6:18         ` KAMEZAWA Hiroyuki
2009-03-02 17:52           ` Balbir Singh
2009-03-03  0:03             ` KAMEZAWA Hiroyuki
2009-03-03 11:23               ` Balbir Singh
2009-03-02  6:21         ` KAMEZAWA Hiroyuki
2009-03-02  6:36           ` Balbir Singh
2009-03-02  7:06             ` KAMEZAWA Hiroyuki
2009-03-02  7:17               ` KAMEZAWA Hiroyuki
2009-03-02 12:42               ` Balbir Singh
2009-03-02 14:04                 ` KAMEZAWA Hiroyuki
2009-03-02 17:41                   ` Balbir Singh
2009-03-02 23:59                     ` KAMEZAWA Hiroyuki
2009-03-03 11:12                       ` Balbir Singh
2009-03-03 11:50                         ` KAMEZAWA Hiroyuki
2009-03-03 13:14                           ` Balbir Singh
2009-03-05  9:04                         ` KAMEZAWA Hiroyuki
2009-03-05  9:13                           ` KAMEZAWA Hiroyuki
2009-03-05 15:26                           ` Balbir Singh
2009-03-05 23:53                             ` KAMEZAWA Hiroyuki
2009-03-06  3:23                             ` KAMEZAWA Hiroyuki
