* [RFC][PATCH 0/5] Memory controller soft limit patches (v8)
@ 2009-07-09 17:14 Balbir Singh
  2009-07-09 17:14 ` [RFC][PATCH 1/5] Memory controller soft limit documentation (v8) Balbir Singh
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Balbir Singh @ 2009-07-09 17:14 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: linux-mm, Balbir Singh, lizf, KOSAKI Motohiro


From: Balbir Singh <balbir@linux.vnet.ibm.com>

New Feature: Soft limits for memory resource controller.

Here is v8 of the new soft limit implementation. Soft limits are a new feature
for the memory resource controller; something similar has existed in the
group scheduler in the form of shares, though the CPU controller's
interpretation of shares is very different.

Soft limits are most useful for environments where the administrator wants to
overcommit the system, such that the limits become active only under memory
contention. The current soft limits implementation provides a
soft_limit_in_bytes interface for the memory controller but not for the
memory+swap controller. The implementation maintains an RB-Tree of groups
that exceed their soft limit and starts reclaiming from the group that
exceeds this limit by the largest amount.
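
(The following is an illustrative, stand-alone sketch of the policy described
above; it is not part of the series and models the behaviour in plain
userspace C rather than with the kernel's per-node-per-zone RB-Tree.)

#include <stdio.h>

struct group {
	const char *name;
	unsigned long long usage, soft_limit;
};

/* 0 when under the soft limit, usage - soft_limit otherwise */
static unsigned long long excess(const struct group *g)
{
	return g->usage <= g->soft_limit ? 0 : g->usage - g->soft_limit;
}

int main(void)
{
	struct group groups[] = {
		{ "A", 900ULL << 20, 512ULL << 20 },
		{ "B", 400ULL << 20, 256ULL << 20 },
		{ "C", 100ULL << 20, 256ULL << 20 },	/* under its soft limit */
	};
	struct group *victim = NULL;
	unsigned int i;

	/* reclaim from the group that exceeds its soft limit the most */
	for (i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
		if (excess(&groups[i]) > (victim ? excess(victim) : 0))
			victim = &groups[i];

	if (victim)
		printf("reclaim from %s (excess %llu bytes)\n",
		       victim->name, excess(victim));
	return 0;
}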

v8 comes after a long gap; we were held back by bug fixes (most notably the
swap cache leak fix), and Kamezawa-San had his own series of patches for
soft limits. Kamezawa-San asked me to refactor these patches to make the
data structure per-node, per-zone.

TODOs

1. The current implementation maintains the delta from the soft limit
   and pushes back groups to their soft limits, a ratio of delta/soft_limit
   might be more useful
2. Small optimizations that I intend to push in v9, if the v8 design looks
   good and acceptable.

Tests
-----

I've run two memory-intensive workloads with differing soft limits and
seen that they are pushed back to their soft limits on contention. Their usage
was their soft limit plus any additional memory that they were able to grab
on the system. Soft limits can take a while before the expected results
are visible.

Please review, comment.

Series
------

memcg-soft-limits-documentation.patch
memcg-soft-limits-interface.patch
memcg-soft-limits-organize.patch
memcg-soft-limits-refactor-reclaim-bits
memcg-soft-limits-reclaim-on-contention.patch


-- 
	Balbir


* [RFC][PATCH 1/5] Memory controller soft limit documentation (v8)
  2009-07-09 17:14 [RFC][PATCH 0/5] Memory controller soft limit patches (v8) Balbir Singh
@ 2009-07-09 17:14 ` Balbir Singh
  2009-07-10  5:32   ` KAMEZAWA Hiroyuki
  2009-07-09 17:14 ` [RFC][PATCH 2/5] Memory controller soft limit interface (v8) Balbir Singh
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-07-09 17:14 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: linux-mm, Balbir Singh, lizf, KOSAKI Motohiro

Feature: Add documentation for soft limits

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt |   31 ++++++++++++++++++++++++++++++-
 1 files changed, 30 insertions(+), 1 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ab0a021..b47815c 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -379,7 +379,36 @@ cgroups created below it.
 
 NOTE2: This feature can be enabled/disabled per subtree.
 
-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory, control groups
+are pushed back to their soft limits. If the soft limit of each control
+group is very high, they are pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+7.1 Interface
+
+Soft limits can be setup by using the following commands (in this example we
+assume a soft limit of 256 megabytes)
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+NOTE1: Soft limits take effect over a long period of time, since they involve
+       reclaiming memory for balancing between memory cgroups
+NOTE2: It is recommended to always set the soft limit below the hard limit,
+       otherwise the hard limit will take precedence.
+
+8. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first

-- 
	Balbir


* [RFC][PATCH 2/5] Memory controller soft limit interface (v8)
  2009-07-09 17:14 [RFC][PATCH 0/5] Memory controller soft limit patches (v8) Balbir Singh
  2009-07-09 17:14 ` [RFC][PATCH 1/5] Memory controller soft limit documentation (v8) Balbir Singh
@ 2009-07-09 17:14 ` Balbir Singh
  2009-07-09 17:15 ` [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8) Balbir Singh
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2009-07-09 17:14 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: linux-mm, Balbir Singh, lizf, KOSAKI Motohiro

Feature: Add soft limits interface to resource counters

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v2...v1
1. Add support for res_counter_check_soft_limit_locked. This is used
   by the hierarchy code.

Add an interface to allow getting/setting of soft limits. Soft limits for the
memory plus swap controller (memsw) are currently not supported. Resource
counters have been enhanced to support soft limits and a new type,
RES_SOFT_LIMIT, has been added. Unlike hard limits, soft limits can be set
directly and do not need any reclaim or checks before being set to a new value.

Kamezawa-San raised a question as to whether soft limits should belong in
res_counter. Since all resources understand the basic concepts of hard and
soft limits, it is justified to add soft limits here. Soft limits are a
generic resource usage feature; even file system quotas support soft limits.
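
(Illustrative only, not part of the patch: a tiny stand-alone model of the
semantics being added to res_counter. Usage is allowed to go past the soft
limit; what later consumers care about is the excess, which is 0 when under
the soft limit and usage - soft_limit otherwise, mirroring
res_counter_soft_limit_excess() below without the locking.)

#include <stdio.h>

struct counter {
	unsigned long long usage, limit, soft_limit;
};

static unsigned long long soft_limit_excess(const struct counter *c)
{
	return c->usage <= c->soft_limit ? 0 : c->usage - c->soft_limit;
}

int main(void)
{
	struct counter c = {
		.usage		= 300ULL << 20,	/* 300M in use */
		.limit		= 1ULL << 30,	/* 1G hard limit */
		.soft_limit	= 256ULL << 20,	/* 256M soft limit */
	};

	printf("over soft limit by %llu bytes\n", soft_limit_excess(&c));
	return 0;
}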

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |   58 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    3 ++
 mm/memcontrol.c             |   20 +++++++++++++++
 3 files changed, 81 insertions(+), 0 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 511f42f..fcb9884 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
 	 */
 	unsigned long long limit;
 	/*
+	 * the limit that usage can exceed
+	 */
+	unsigned long long soft_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -87,6 +91,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_LIMIT,
 	RES_FAILCNT,
+	RES_SOFT_LIMIT,
 };
 
 /*
@@ -132,6 +137,36 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 	return false;
 }
 
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->soft_limit)
+		return true;
+
+	return false;
+}
+
+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to soft limit
+ * The difference between usage and soft limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->soft_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 /*
  * Helper function to detect if the cgroup is within it's limit or
  * not. It's currently called from cgroup_rss_prepare()
@@ -147,6 +182,17 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
 	return ret;
 }
 
+static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_soft_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -180,4 +226,16 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 	return ret;
 }
 
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+				unsigned long long soft_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->soft_limit = soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index e1338f0..bcdabf3 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
+	counter->soft_limit = RESOURCE_MAX;
 	counter->parent = parent;
 }
 
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->limit;
 	case RES_FAILCNT:
 		return &counter->failcnt;
+	case RES_SOFT_LIMIT:
+		return &counter->soft_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4da70c9..3c9292b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2106,6 +2106,20 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case RES_SOFT_LIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		/*
+		 * For memsw, soft limits are hard to implement in terms
+		 * of semantics; for now, we support soft limits only for
+		 * memory controlled without swap
+		 */
+		if (type == _MEM)
+			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else
+			ret = -EINVAL;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -2359,6 +2373,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read,
 	},
 	{
+		.name = "soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,

-- 
	Balbir


* [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8)
  2009-07-09 17:14 [RFC][PATCH 0/5] Memory controller soft limit patches (v8) Balbir Singh
  2009-07-09 17:14 ` [RFC][PATCH 1/5] Memory controller soft limit documentation (v8) Balbir Singh
  2009-07-09 17:14 ` [RFC][PATCH 2/5] Memory controller soft limit interface (v8) Balbir Singh
@ 2009-07-09 17:15 ` Balbir Singh
  2009-07-10  5:21   ` KAMEZAWA Hiroyuki
  2009-07-09 17:15 ` [RFC][PATCH 4/5] Memory controller soft limit refactor reclaim flags (v8) Balbir Singh
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-07-09 17:15 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: linux-mm, Balbir Singh, lizf, KOSAKI Motohiro

Feature: Organize cgroups over soft limit in a RB-Tree

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v8...v7
1. Make the data structures per zone per node

Changelog v7...v6
1. Refactor the check and update logic. The goal is to allow the
   check logic to be modular, so that it can be revisited in the future
   if something more appropriate is found to be useful.

Changelog v6...v5
1. Update the key before inserting into RB tree. Without the current change
   it could take an additional iteration to get the key correct.

Changelog v5...v4
1. res_counter_uncharge has an additional parameter to indicate if the
   counter was over its soft limit, before uncharge.

Changelog v4...v3
1. Optimizations to ensure we don't unnecessarily get res_counter values
2. Fixed a bug in usage of time_after()

Changelog v3...v2
1. Add only the ancestor to the RB-Tree
2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put

Changelog v2...v1
1. Add support for hierarchies
2. The res_counter that is highest in the hierarchy is returned on soft
   limit being exceeded. Since we do hierarchical reclaim and add all
   groups exceeding their soft limits, this approach seems to work well
   in practice.

This patch introduces a RB-Tree for storing memory cgroups that are over their
soft limit. The overall goal is to

1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
   We are careful about updates, updates take place only after a particular
   time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
   limit

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.
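
(Illustrative only, not part of the patch: a stand-alone sketch of the rate
limiting applied to tree updates. The patch uses jiffies and
MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ/4); wall-clock seconds are used here only
to keep the model self-contained.)

#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define TREE_UPDATE_INTERVAL	1	/* seconds here, HZ/4 jiffies in the patch */

/* a group's position in the tree is reconsidered at most this often */
static bool tree_update_due(time_t last_update)
{
	return time(NULL) > last_update + TREE_UPDATE_INTERVAL;
}

int main(void)
{
	time_t last = time(NULL);

	printf("update due immediately? %d\n", tree_update_due(last));
	sleep(2);
	printf("update due after 2s?    %d\n", tree_update_due(last));
	return 0;
}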

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |    6 +
 kernel/res_counter.c        |   18 ++-
 mm/memcontrol.c             |  304 +++++++++++++++++++++++++++++++++++++------
 3 files changed, 281 insertions(+), 47 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index fcb9884..731af71 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -114,7 +114,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+		unsigned long val, struct res_counter **limit_fail_at,
+		struct res_counter **soft_limit_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -127,7 +128,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+				bool *was_soft_limit_excess);
 
 static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 {
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4434236..dbdade0 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+			struct res_counter **limit_fail_at,
+			struct res_counter **soft_limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
 	struct res_counter *c, *u;
 
 	*limit_fail_at = NULL;
+	if (soft_limit_fail_at)
+		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
+		/*
+		 * With soft limits, we return the highest ancestor
+		 * that exceeds its soft limit
+		 */
+		if (soft_limit_fail_at &&
+			!res_counter_soft_limit_check_locked(c))
+			*soft_limit_fail_at = c;
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
@@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+				bool *was_soft_limit_excess)
 {
 	unsigned long flags;
 	struct res_counter *c;
@@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
+		if (was_soft_limit_excess)
+			*was_soft_limit_excess =
+				!res_counter_soft_limit_check_locked(c);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3c9292b..036032b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
 #include <linux/rcupdate.h>
 #include <linux/limits.h>
 #include <linux/mutex.h>
+#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
@@ -118,6 +119,11 @@ struct mem_cgroup_per_zone {
 	unsigned long		count[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
+	struct rb_node		tree_node;	/* RB tree node */
+	unsigned long 		last_tree_update;/* Last time the tree was */
+						/* updated in jiffies     */
+	unsigned long long	usage_in_excess;/* Set to the value by which */
+						/* the soft limit is exceeded*/
 };
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
@@ -131,6 +137,28 @@ struct mem_cgroup_lru_info {
 };
 
 /*
+ * Cgroups above their limits are maintained in a RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+struct mem_cgroup_soft_limit_tree_per_zone {
+	struct rb_root rb_root;
+	spinlock_t lock;
+};
+
+struct mem_cgroup_soft_limit_tree_per_node {
+	struct mem_cgroup_soft_limit_tree_per_zone
+		rb_tree_per_zone[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_soft_limit_tree {
+	struct mem_cgroup_soft_limit_tree_per_node
+		*rb_tree_per_node[MAX_NUMNODES];
+};
+
+static struct mem_cgroup_soft_limit_tree soft_limit_tree;
+
+/*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
  * statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -187,6 +215,8 @@ struct mem_cgroup {
 	struct mem_cgroup_stat stat;
 };
 
+#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
+
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -215,6 +245,164 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
+static struct mem_cgroup_per_zone *
+mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
+{
+	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
+}
+
+static struct mem_cgroup_per_zone *
+page_cgroup_zoneinfo(struct page_cgroup *pc)
+{
+	struct mem_cgroup *mem = pc->mem_cgroup;
+	int nid = page_cgroup_nid(pc);
+	int zid = page_cgroup_zid(pc);
+
+	if (!mem)
+		return NULL;
+
+	return mem_cgroup_zoneinfo(mem, nid, zid);
+}
+
+static struct mem_cgroup_soft_limit_tree_per_zone *
+soft_limit_tree_node_zone(int nid, int zid)
+{
+	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+}
+
+static struct mem_cgroup_soft_limit_tree_per_zone *
+page_cgroup_soft_limit_tree(struct page_cgroup *pc)
+{
+	int nid = page_cgroup_nid(pc);
+	int zid = page_cgroup_zid(pc);
+
+	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+}
+
+static void
+mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_soft_limit_tree_per_zone *stz)
+{
+	struct rb_node **p = &stz->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct mem_cgroup_per_zone *mz_node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&stz->lock, flags);
+	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+	while (*p) {
+		parent = *p;
+		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
+					tree_node);
+		if (mz->usage_in_excess < mz_node->usage_in_excess)
+			p = &(*p)->rb_left;
+		/*
+		 * We can't avoid mem cgroups that are over their soft
+		 * limit by the same amount
+		 */
+		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
+			p = &(*p)->rb_right;
+	}
+	rb_link_node(&mz->tree_node, parent, p);
+	rb_insert_color(&mz->tree_node, &stz->rb_root);
+	mz->last_tree_update = jiffies;
+	spin_unlock_irqrestore(&stz->lock, flags);
+}
+
+static void
+mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_soft_limit_tree_per_zone *stz)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&stz->lock, flags);
+	rb_erase(&mz->tree_node, &stz->rb_root);
+	spin_unlock_irqrestore(&stz->lock, flags);
+}
+
+static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
+					bool over_soft_limit,
+					struct page *page)
+{
+	unsigned long next_update;
+	struct page_cgroup *pc;
+	struct mem_cgroup_per_zone *mz;
+
+	if (!over_soft_limit)
+		return false;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return false;
+	mz = mem_cgroup_zoneinfo(mem, page_cgroup_nid(pc), page_cgroup_zid(pc));
+
+	next_update = mz->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
+	if (time_after(jiffies, next_update))
+		return true;
+
+	return false;
+}
+
+static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
+{
+	unsigned long long prev_usage_in_excess, new_usage_in_excess;
+	bool updated_tree = false;
+	unsigned long flags;
+	struct page_cgroup *pc;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_soft_limit_tree_per_zone *stz;
+
+	/*
+	 * As long as the page is around, pc's are always
+	 * around and so is the mz, in the remove path
+	 * we are yet to do the css_put(). I don't think
+	 * we need to hold page cgroup lock.
+	 */
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+	mz = mem_cgroup_zoneinfo(mem, page_cgroup_nid(pc), page_cgroup_zid(pc));
+	stz = page_cgroup_soft_limit_tree(pc);
+
+	/*
+	 * We do updates in lazy mode, mem's are removed
+	 * lazily from the per-zone, per-node rb tree
+	 */
+	prev_usage_in_excess = mz->usage_in_excess;
+
+	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+	if (prev_usage_in_excess) {
+		mem_cgroup_remove_exceeded(mem, mz, stz);
+		updated_tree = true;
+	}
+	if (!new_usage_in_excess)
+		goto done;
+	mem_cgroup_insert_exceeded(mem, mz, stz);
+
+done:
+	if (updated_tree) {
+		spin_lock_irqsave(&stz->lock, flags);
+		mz->usage_in_excess = new_usage_in_excess;
+		spin_unlock_irqrestore(&stz->lock, flags);
+	}
+}
+
+static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
+{
+	int node, zone;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_soft_limit_tree_per_zone *stz;
+
+	for_each_node_state(node, N_POSSIBLE) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			mz = mem_cgroup_zoneinfo(mem, node, zone);
+			stz = soft_limit_tree_node_zone(node, zone);
+			mem_cgroup_remove_exceeded(mem, mz, stz);
+		}
+	}
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -239,25 +427,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 	put_cpu();
 }
 
-static struct mem_cgroup_per_zone *
-mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
-{
-	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
-}
-
-static struct mem_cgroup_per_zone *
-page_cgroup_zoneinfo(struct page_cgroup *pc)
-{
-	struct mem_cgroup *mem = pc->mem_cgroup;
-	int nid = page_cgroup_nid(pc);
-	int zid = page_cgroup_zid(pc);
-
-	if (!mem)
-		return NULL;
-
-	return mem_cgroup_zoneinfo(mem, nid, zid);
-}
-
 static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
 					enum lru_list idx)
 {
@@ -972,11 +1141,11 @@ done:
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
-			bool oom)
+			bool oom, struct page *page)
 {
-	struct mem_cgroup *mem, *mem_over_limit;
+	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+	struct res_counter *fail_res, *soft_fail_res = NULL;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -1006,16 +1175,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		int ret;
 		bool noswap = false;
 
-		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+						&soft_fail_res);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
-							&fail_res);
+							&fail_res, NULL);
 			if (likely(!ret))
 				break;
 			/* mem+swap counter fails */
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
 			noswap = true;
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									memsw);
@@ -1053,13 +1223,24 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			goto nomem;
 		}
 	}
+	/*
+	 * Insert just the ancestor, we should trickle down to the correct
+	 * cgroup for reclaim, since the other nodes will be below their
+	 * soft limit
+	 */
+	if (soft_fail_res) {
+		mem_over_soft_limit =
+			mem_cgroup_from_res_counter(soft_fail_res, res);
+		if (mem_cgroup_soft_limit_check(mem_over_soft_limit, true,
+							page))
+			mem_cgroup_update_tree(mem_over_soft_limit, page);
+	}
 	return 0;
 nomem:
 	css_put(&mem->css);
 	return -ENOMEM;
 }
 
-
 /*
  * A helper function to get mem_cgroup from ID. must be called under
  * rcu_read_lock(). The caller must check css_is_removed() or some if
@@ -1126,9 +1307,9 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
 		css_put(&mem->css);
 		return;
 	}
@@ -1205,7 +1386,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 	if (pc->mem_cgroup != from)
 		goto out;
 
-	res_counter_uncharge(&from->res, PAGE_SIZE);
+	res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
 	mem_cgroup_charge_statistics(from, pc, false);
 
 	page = pc->page;
@@ -1225,7 +1406,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 	}
 
 	if (do_swap_account)
-		res_counter_uncharge(&from->memsw, PAGE_SIZE);
+		res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
 	css_put(&from->css);
 
 	css_get(&to->css);
@@ -1259,7 +1440,7 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
 	parent = mem_cgroup_from_cont(pcg);
 
 
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
 	if (ret || !parent)
 		return ret;
 
@@ -1289,9 +1470,9 @@ uncharge:
 	/* drop extra refcnt by try_charge() */
 	css_put(&parent->css);
 	/* uncharge if move fails */
-	res_counter_uncharge(&parent->res, PAGE_SIZE);
+	res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
 	if (do_swap_account)
-		res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+		res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
 	return ret;
 }
 
@@ -1316,7 +1497,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	prefetchw(pc);
 
 	mem = memcg;
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
 	if (ret || !mem)
 		return ret;
 
@@ -1435,14 +1616,14 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!mem)
 		goto charge_cur_mm;
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
 	/* drop extra refcnt from tryget */
 	css_put(&mem->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
 }
 
 static void
@@ -1479,7 +1660,7 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
 			 * This recorded memcg can be obsolete one. So, avoid
 			 * calling css_tryget
 			 */
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+			res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
 			mem_cgroup_put(memcg);
 		}
 		rcu_read_unlock();
@@ -1500,9 +1681,9 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 		return;
 	if (!mem)
 		return;
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
 	if (do_swap_account)
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
 	css_put(&mem->css);
 }
 
@@ -1516,6 +1697,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	bool soft_limit_excess = false;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1554,9 +1736,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 		break;
 	}
 
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
 	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
 	mem_cgroup_charge_statistics(mem, pc, false);
 
 	ClearPageCgroupUsed(pc);
@@ -1570,6 +1752,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
+	if (mem_cgroup_soft_limit_check(mem, soft_limit_excess, page))
+		mem_cgroup_update_tree(mem, page);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		css_put(&mem->css);
@@ -1645,7 +1829,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 		 * We uncharge this because swap is freed.
 		 * This memcg can be obsolete one. We avoid calling css_tryget
 		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
 		mem_cgroup_put(memcg);
 	}
 	rcu_read_unlock();
@@ -1674,7 +1858,8 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 	unlock_page_cgroup(pc);
 
 	if (mem) {
-		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
+		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
+						page);
 		css_put(&mem->css);
 	}
 	*ptr = mem;
@@ -2177,6 +2362,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 			res_counter_reset_failcnt(&mem->memsw);
 		break;
 	}
+
 	return 0;
 }
 
@@ -2472,6 +2658,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->lists[l]);
+		mz->last_tree_update = 0;
+		mz->usage_in_excess = 0;
 	}
 	return 0;
 }
@@ -2517,6 +2705,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
+	mem_cgroup_remove_from_trees(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -2565,6 +2754,31 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+static int mem_cgroup_soft_limit_tree_init(void)
+{
+	struct mem_cgroup_soft_limit_tree_per_node *rtpn;
+	struct mem_cgroup_soft_limit_tree_per_zone *rtpz;
+	int tmp, node, zone;
+
+	for_each_node_state(node, N_POSSIBLE) {
+		tmp = node;
+		if (!node_state(node, N_NORMAL_MEMORY))
+			tmp = -1;
+		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
+		if (!rtpn)
+			return 1;
+
+		soft_limit_tree.rb_tree_per_node[node] = rtpn;
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			rtpz = &rtpn->rb_tree_per_zone[zone];
+			rtpz->rb_root = RB_ROOT;
+			spin_lock_init(&rtpz->lock);
+		}
+	}
+	return 0;
+}
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -2579,11 +2793,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
 			goto free_out;
+
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = mem;
+		if (mem_cgroup_soft_limit_tree_init())
+			goto free_out;
+
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;

-- 
	Balbir


* [RFC][PATCH 4/5] Memory controller soft limit refactor reclaim flags (v8)
  2009-07-09 17:14 [RFC][PATCH 0/5] Memory controller soft limit patches (v8) Balbir Singh
                   ` (2 preceding siblings ...)
  2009-07-09 17:15 ` [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8) Balbir Singh
@ 2009-07-09 17:15 ` Balbir Singh
  2009-07-09 17:15 ` [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8) Balbir Singh
  2009-07-10  4:53 ` [RFC][PATCH 0/5] Memory controller soft limit patches (v8) KAMEZAWA Hiroyuki
  5 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2009-07-09 17:15 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: linux-mm, Balbir Singh, lizf, KOSAKI Motohiro

Impact: Refactor mem_cgroup_hierarchical_reclaim()

From: Balbir Singh <balbir@linux.vnet.ibm.com>

This patch refactors the arguments passed to
mem_cgroup_hierarchical_reclaim() into a flags word, so that new options
do not require additional parameters as the reclaim routine becomes more
flexible.
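
(Illustrative only: the usual pattern of packing boolean arguments into a
single flags word, shown with shortened names; the actual flags introduced
below are MEM_CGROUP_RECLAIM_NOSWAP and MEM_CGROUP_RECLAIM_SHRINK.)

#include <stdbool.h>
#include <stdio.h>

#define RECLAIM_NOSWAP	(1UL << 0)
#define RECLAIM_SHRINK	(1UL << 1)

/* callers pass one word; the callee unpacks whichever bits it cares about */
static void reclaim(unsigned long reclaim_options)
{
	bool noswap = reclaim_options & RECLAIM_NOSWAP;
	bool shrink = reclaim_options & RECLAIM_SHRINK;

	printf("noswap=%d shrink=%d\n", noswap, shrink);
}

int main(void)
{
	reclaim(RECLAIM_NOSWAP | RECLAIM_SHRINK);
	return 0;
}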

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |   25 +++++++++++++++++++------
 1 files changed, 19 insertions(+), 6 deletions(-)


diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13a7696..ca9c257 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -240,6 +240,14 @@ enum charge_type {
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
+/*
+ * Reclaim flags for mem_cgroup_hierarchical_reclaim
+ */
+#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
+#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
+#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
+#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -1030,11 +1038,14 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
  * If shrink==true, for avoiding to free too much, this returns immedieately.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
-				   gfp_t gfp_mask, bool noswap, bool shrink)
+						gfp_t gfp_mask,
+						unsigned long reclaim_options)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
 	int loop = 0;
+	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
+	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
 
 	/* If memsw_is_minimum==1, swap-out is of-no-use. */
 	if (root_mem->memsw_is_minimum)
@@ -1172,7 +1183,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 
 	while (1) {
 		int ret;
-		bool noswap = false;
+		unsigned long flags = 0;
 
 		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
 						&soft_fail_res);
@@ -1185,7 +1196,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 				break;
 			/* mem+swap counter fails */
 			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
-			noswap = true;
+			flags |= MEM_CGROUP_RECLAIM_NOSWAP;
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									memsw);
 		} else
@@ -1197,7 +1208,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			goto nomem;
 
 		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
-							noswap, false);
+							flags);
 		if (ret)
 			continue;
 
@@ -1992,7 +2003,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			break;
 
 		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
-						   false, true);
+						   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -2044,7 +2055,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
+		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
+						MEM_CGROUP_RECLAIM_NOSWAP |
+						MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)

-- 
	Balbir


* [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8)
  2009-07-09 17:14 [RFC][PATCH 0/5] Memory controller soft limit patches (v8) Balbir Singh
                   ` (3 preceding siblings ...)
  2009-07-09 17:15 ` [RFC][PATCH 4/5] Memory controller soft limit refactor reclaim flags (v8) Balbir Singh
@ 2009-07-09 17:15 ` Balbir Singh
  2009-07-10  5:30   ` KAMEZAWA Hiroyuki
  2009-07-10  4:53 ` [RFC][PATCH 0/5] Memory controller soft limit patches (v8) KAMEZAWA Hiroyuki
  5 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-07-09 17:15 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: linux-mm, Balbir Singh, lizf, KOSAKI Motohiro

Feature: Implement reclaim from groups over their soft limit

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v8...v7
1. Soft limit reclaim takes an order parameter and does no reclaim for
   order > 0. This ensures that we don't do double reclaim for order > 0
2. Make the data structures more scalable, move the reclaim logic
   to a new function mem_cgroup_shrink_node_zone that does per node
   per zone reclaim.
3. Reclaim has moved back to kswapd (balance_pgdat)

Changelog v7...v6
1. Refactored out reclaim_options patch into a separate patch
2. Added additional checks for all swap off condition in
   mem_cgroup_hierarchical_reclaim()

Changelog v6...v5
1. Reclaim arguments to hierarchical reclaim have been merged into one
   parameter called reclaim_options.
2. Check if we failed to reclaim from one cgroup during soft reclaim, if
   so move on to the next one. This can be very useful if the zonelist
   passed to soft limit reclaim has no allocations from the selected
   memory cgroup
3. Coding style cleanups

Changelog v5...v4

1. Throttling is removed, earlier we throttled tasks over their soft limit
2. Reclaim has been moved back to __alloc_pages_internal, several experiments
   and tests showed that it was the best place to reclaim memory. kswapd has
   a different goal, that does not work with a single soft limit for the memory
   cgroup.
3. Soft limit reclaim is more targeted and the pages reclaimed depend on the
   amount by which the soft limit is exceeded.

Changelog v4...v3
1. soft_reclaim is now called from balance_pgdat
2. soft_reclaim is aware of nodes and zones
3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
   and at the same time trying to allocate pages and exceed its soft limit.
4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
   particular to a mem cgroup.

Changelog v3...v2
1. Convert several arguments to hierarchical reclaim to flags, thereby
   consolidating them
2. The reclaim for soft limits is now triggered from kswapd
3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument


Changelog v2...v1
1. Added support for hierarchical soft limits

This patch allows reclaim from memory cgroups on contention (via the
direct reclaim path).

memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest number of pages and reclaims pages from it and then reinserts the
cgroup into its correct place in the rbtree.

Added additional checks to mem_cgroup_hierarchical_reclaim() to detect
long loops in case all swap is turned off. The code has been refactored
and the loop check (loop < 2) has been enhanced for soft limits. For soft
limits, we try to do more targeted reclaim. Instead of bailing out after
two loops, the routine now reclaims memory proportional to the amount by
which the soft limit is exceeded. The proportion has been empirically
determined.
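
(To make the "proportional" part concrete, here is a stand-alone sketch,
illustrative only, of the bail-out condition used below: reclaim passes
continue until roughly a quarter of the excess has been reclaimed, with a
large loop count as a safety net.)

#include <stdio.h>

#define MAX_RECLAIM_LOOPS	10000	/* safety net, as in the patch */

/* stand-in for one pass of hierarchical reclaim; the real per-pass result
 * varies, a fixed number of pages is used for the model */
static unsigned long reclaim_pass(void)
{
	return 32;
}

int main(void)
{
	unsigned long excess = 4096;	/* pages over the soft limit */
	unsigned long total = 0, loop = 0;

	while (total < (excess >> 2) && loop++ < MAX_RECLAIM_LOOPS)
		total += reclaim_pass();

	printf("reclaimed %lu of %lu excess pages\n", total, excess);
	return 0;
}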

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |   11 ++
 include/linux/swap.h       |    5 +
 mm/memcontrol.c            |  224 +++++++++++++++++++++++++++++++++++++++++---
 mm/vmscan.c                |   39 +++++++-
 4 files changed, 262 insertions(+), 17 deletions(-)


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..cf20acc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -118,6 +118,9 @@ static inline bool mem_cgroup_disabled(void)
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
 void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
+unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+						gfp_t gfp_mask, int nid,
+						int zid, int priority);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -276,6 +279,14 @@ static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
 {
 }
 
+static inline
+unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+						gfp_t gfp_mask, int nid,
+						int zid, int priority)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6c990e6..afc0721 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -217,6 +217,11 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
 						  unsigned int swappiness);
+extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+						gfp_t gfp_mask, bool noswap,
+						unsigned int swappiness,
+						struct zone *zone,
+						int nid, int priority);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ca9c257..e7a1cf4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -124,6 +124,9 @@ struct mem_cgroup_per_zone {
 						/* updated in jiffies     */
 	unsigned long long	usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
+	bool on_tree;				/* Is the node on tree? */
+	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
+						/* use container_of	   */
 };
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
@@ -216,6 +219,13 @@ struct mem_cgroup {
 
 #define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
 
+/*
+ * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
+ * limit reclaim to prevent infinite loops, if they ever occur.
+ */
+#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(10000)
+#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
+
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -247,6 +257,8 @@ enum charge_type {
 #define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
+#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
 
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
@@ -287,16 +299,17 @@ page_cgroup_soft_limit_tree(struct page_cgroup *pc)
 }
 
 static void
-mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
 				struct mem_cgroup_per_zone *mz,
 				struct mem_cgroup_soft_limit_tree_per_zone *stz)
 {
 	struct rb_node **p = &stz->rb_root.rb_node;
 	struct rb_node *parent = NULL;
 	struct mem_cgroup_per_zone *mz_node;
-	unsigned long flags;
 
-	spin_lock_irqsave(&stz->lock, flags);
+	if (mz->on_tree)
+		return;
+
 	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
 	while (*p) {
 		parent = *p;
@@ -314,6 +327,29 @@ mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
 	rb_link_node(&mz->tree_node, parent, p);
 	rb_insert_color(&mz->tree_node, &stz->rb_root);
 	mz->last_tree_update = jiffies;
+	mz->on_tree = true;
+}
+
+static void
+__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_soft_limit_tree_per_zone *stz)
+{
+	if (!mz->on_tree)
+		return;
+	rb_erase(&mz->tree_node, &stz->rb_root);
+	mz->on_tree = false;
+}
+
+static void
+mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_soft_limit_tree_per_zone *stz)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&stz->lock, flags);
+	__mem_cgroup_insert_exceeded(mem, mz, stz);
 	spin_unlock_irqrestore(&stz->lock, flags);
 }
 
@@ -324,7 +360,7 @@ mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
 {
 	unsigned long flags;
 	spin_lock_irqsave(&stz->lock, flags);
-	rb_erase(&mz->tree_node, &stz->rb_root);
+	__mem_cgroup_remove_exceeded(mem, mz, stz);
 	spin_unlock_irqrestore(&stz->lock, flags);
 }
 
@@ -410,6 +446,52 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
 	}
 }
 
+unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
+{
+	unsigned long excess;
+	excess = res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
+	return (excess > ULONG_MAX) ? ULONG_MAX : excess;
+}
+
+static struct mem_cgroup_per_zone *
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_soft_limit_tree_per_zone
+					*stz)
+{
+	struct rb_node *rightmost = NULL;
+	struct mem_cgroup_per_zone *mz = NULL;
+
+retry:
+	rightmost = rb_last(&stz->rb_root);
+	if (!rightmost)
+		goto done;		/* Nothing to reclaim from */
+
+	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
+	/*
+	 * Remove the node now but someone else can add it back,
+	 * we will add it back at the end of reclaim to its correct
+	 * position in the tree.
+	 */
+	__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
+	if (!css_tryget(&mz->mem->css) ||
+		!res_counter_soft_limit_excess(&mz->mem->res))
+		goto retry;
+done:
+	return mz;
+}
+
+static struct mem_cgroup_per_zone *
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_soft_limit_tree_per_zone
+					*stz)
+{
+	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
+
+	spin_lock_irqsave(&stz->lock, flags);
+	mz = __mem_cgroup_largest_soft_limit_node(stz);
+	spin_unlock_irqrestore(&stz->lock, flags);
+	return mz;
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -1038,31 +1120,59 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
  * If shrink==true, for avoiding to free too much, this returns immedieately.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
+						struct zone *zone,
 						gfp_t gfp_mask,
-						unsigned long reclaim_options)
+						unsigned long reclaim_options,
+						int priority)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
 	int loop = 0;
 	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
 	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
+	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+	unsigned long excess = mem_cgroup_get_excess(root_mem);
 
 	/* If memsw_is_minimum==1, swap-out is of-no-use. */
 	if (root_mem->memsw_is_minimum)
 		noswap = true;
 
-	while (loop < 2) {
+	while (1) {
 		victim = mem_cgroup_select_victim(root_mem);
-		if (victim == root_mem)
+		if (victim == root_mem) {
 			loop++;
+			if (loop >= 2) {
+				/*
+				 * If we have not been able to reclaim
+				 * If we have not been able to reclaim
+				 * anything, it might be because there are
+				 * no reclaimable pages under this hierarchy
+				if (!check_soft || !total)
+					break;
+				/*
+				 * We want to do more targetted reclaim.
+				 * We want to do more targeted reclaim.
+				 * excess >> 2 is not too aggressive (we don't
+				 * reclaim too much) nor too small (we don't
+				 * keep coming back to reclaim from this cgroup)
+				if (total >= (excess >> 2) ||
+					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
+					break;
+			}
+		}
 		if (!mem_cgroup_local_usage(&victim->stat)) {
 			/* this cgroup's local usage == 0 */
 			css_put(&victim->css);
 			continue;
 		}
 		/* we use swappiness of local cgroup */
-		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
-						   get_swappiness(victim));
+		if (check_soft)
+			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
+				noswap, get_swappiness(victim), zone,
+				zone->zone_pgdat->node_id, priority);
+		else
+			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
+						noswap, get_swappiness(victim));
 		css_put(&victim->css);
 		/*
 		 * At shrinking usage, we can't check we should stop here or
@@ -1072,7 +1182,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		if (shrink)
 			return ret;
 		total += ret;
-		if (mem_cgroup_check_under_limit(root_mem))
+		if (check_soft) {
+			if (res_counter_check_under_soft_limit(&root_mem->res))
+				return total;
+		} else if (mem_cgroup_check_under_limit(root_mem))
 			return 1 + total;
 	}
 	return total;
@@ -1207,8 +1320,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
-		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
-							flags);
+		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
+							gfp_mask, flags, -1);
 		if (ret)
 			continue;
 
@@ -2002,8 +2115,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
-						   MEM_CGROUP_RECLAIM_SHRINK);
+		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
+						GFP_KERNEL,
+						MEM_CGROUP_RECLAIM_SHRINK, -1);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -2055,9 +2169,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
+		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_NOSWAP |
-						MEM_CGROUP_RECLAIM_SHRINK);
+						MEM_CGROUP_RECLAIM_SHRINK, -1);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -2068,6 +2182,82 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
+unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+						gfp_t gfp_mask, int nid,
+						int zid, int priority)
+{
+	unsigned long nr_reclaimed = 0;
+	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
+	unsigned long flags;
+	unsigned long reclaimed;
+	int loop = 0;
+	struct mem_cgroup_soft_limit_tree_per_zone *stz;
+
+	if (order > 0)
+		return 0;
+
+	stz = soft_limit_tree_node_zone(nid, zid);
+	/*
+	 * This loop can run a while, especially if mem_cgroups continuously
+	 * keep exceeding their soft limit and putting the system under
+	 * pressure
+	 */
+	do {
+		if (next_mz)
+			mz = next_mz;
+		else
+			mz = mem_cgroup_largest_soft_limit_node(stz);
+		if (!mz)
+			break;
+
+		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
+						gfp_mask,
+						MEM_CGROUP_RECLAIM_SOFT,
+						priority);
+		nr_reclaimed += reclaimed;
+		spin_lock_irqsave(&stz->lock, flags);
+
+		/*
+		 * If we failed to reclaim anything from this memory cgroup
+		 * it is time to move on to the next cgroup
+		 */
+		next_mz = NULL;
+		if (!reclaimed) {
+			do {
+				/*
+				 * By the time we get the soft_limit lock
+				 * again, someone might have added the
+				 * group back on the RB tree. Iterate to
+				 * make sure we get a different mem.
+				 * mem_cgroup_largest_soft_limit_node returns
+				 * NULL if no other cgroup is present on
+				 * the tree
+				 */
+				next_mz =
+				__mem_cgroup_largest_soft_limit_node(stz);
+			} while (next_mz == mz);
+		}
+		mz->usage_in_excess =
+			res_counter_soft_limit_excess(&mz->mem->res);
+		__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
+		if (mz->usage_in_excess)
+			__mem_cgroup_insert_exceeded(mz->mem, mz, stz);
+		spin_unlock_irqrestore(&stz->lock, flags);
+		css_put(&mz->mem->css);
+		loop++;
+		/*
+		 * Could not reclaim anything and there are no more
+		 * mem cgroups to try or we seem to be looping without
+		 * reclaiming anything.
+		 */
+		if (!nr_reclaimed &&
+			(next_mz == NULL ||
+			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
+			break;
+	} while (!nr_reclaimed);
+	return nr_reclaimed;
+}
+
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -2671,6 +2861,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 			INIT_LIST_HEAD(&mz->lists[l]);
 		mz->last_tree_update = 0;
 		mz->usage_in_excess = 0;
+		mz->on_tree = false;
+		mz->mem = mem;
 	}
 	return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86dc0c3..d0f5c4d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1780,11 +1780,39 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
+unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+						gfp_t gfp_mask, bool noswap,
+						unsigned int swappiness,
+						struct zone *zone, int nid,
+						int priority)
+{
+	struct scan_control sc = {
+		.may_writepage = !laptop_mode,
+		.may_unmap = 1,
+		.may_swap = !noswap,
+		.swap_cluster_max = SWAP_CLUSTER_MAX,
+		.swappiness = swappiness,
+		.order = 0,
+		.mem_cgroup = mem,
+		.isolate_pages = mem_cgroup_isolate_pages,
+	};
+	nodemask_t nm  = nodemask_of_node(nid);
+
+	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
+	sc.nodemask = &nm;
+	sc.nr_reclaimed = 0;
+	sc.nr_scanned = 0;
+	shrink_zone(priority, zone, &sc);
+	return sc.nr_reclaimed;
+}
+
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
 					   bool noswap,
 					   unsigned int swappiness)
 {
+	struct zonelist *zonelist;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
@@ -1796,7 +1824,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
 		.nodemask = NULL, /* we don't care the placement */
 	};
-	struct zonelist *zonelist;
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -1918,6 +1945,7 @@ loop_again:
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;
+			int nid, zid;
 
 			if (!populated_zone(zone))
 				continue;
@@ -1932,6 +1960,15 @@ loop_again:
 			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
 			note_zone_scanning_priority(zone, priority);
+
+			nid = pgdat->node_id;
+			zid = zone_idx(zone);
+			/*
+			 * Call soft limit reclaim before calling shrink_zone.
+			 * For now we ignore the return value
+			 */
+			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask,
+							nid, zid, priority);
 			/*
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/5] Memory controller soft limit patches (v8)
  2009-07-09 17:14 [RFC][PATCH 0/5] Memory controller soft limit patches (v8) Balbir Singh
                   ` (4 preceding siblings ...)
  2009-07-09 17:15 ` [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8) Balbir Singh
@ 2009-07-10  4:53 ` KAMEZAWA Hiroyuki
  2009-07-10  5:53   ` Balbir Singh
  5 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10  4:53 UTC (permalink / raw)
  To: Balbir Singh; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

On Thu, 09 Jul 2009 22:44:41 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> New Feature: Soft limits for memory resource controller.
> 
> Here is v8 of the new soft limit implementation. Soft limits is a new feature
> for the memory resource controller, something similar has existed in the
> group scheduler in the form of shares. The CPU controllers interpretation
> of shares is very different though. 
> 
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
> 
> v8 has come out after a long duration, we were held back by bug fixes
> (most notably swap cache leak fix) and Kamezawa-San has his series of
> patches for soft limits. Kamezawa-San asked me to refactor these patches
> to make the data structure per-node-per-zone.
> 
> TODOs
> 
> 1. The current implementation maintains the delta from the soft limit
>    and pushes back groups to their soft limits, a ratio of delta/soft_limit
>    might be more useful
> 2. Small optimizations that I intend to push in v9, if the v8 design looks
>    good and acceptable.
> 
> Tests
> -----
> 
> I've run two memory intensive workloads with differing soft limits and
> seen that they are pushed back to their soft limit on contention. Their usage
> was their soft limit plus additional memory that they were able to grab
> on the system. Soft limit can take a while before we see the expected
> results.
> 

Before pointing out nitpicks, here are my impressions.
 
 1. seems good in general.

 2. Documentation is not enough. I think it's necessary to add a caveat along
    the lines of: "soft limits are built on top of the memory management
    system's complex behavior, so they may not always work as you expect. In
    many cases they work well; please treat this as a best-effort service."
    or something similar.

 3. Using "jiffies" again is not good. Please use some other check or an event
    counter (a rough sketch of what I mean follows after this list).

 4. I think it's better to allow a soft limit only on the root of a hierarchy
    (use_hierarchy=1). I can't explain how the system behaves if several soft
    limits are set on both the root and its children within one hierarchy.

 5. I'd be glad if you extracted patch 4/5 as an independent cleanup patch.

 6. no overheads ?
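
    A rough, deliberately simple sketch of the event-counter idea from point 3.
    The softlimit_events field, the helper name and the threshold are made up
    for illustration, and a real version would probably want to be per-cpu:

	#define MEM_CGROUP_SOFTLIMIT_EVENTS_THRESH	(1024)

	/* called on charge/uncharge instead of the time_after(jiffies, ...) check */
	static bool mem_cgroup_soft_limit_event_check(struct mem_cgroup *mem)
	{
		/* hypothetical atomic_t counter of charge/uncharge events */
		if (atomic_inc_return(&mem->softlimit_events) <
				MEM_CGROUP_SOFTLIMIT_EVENTS_THRESH)
			return false;

		atomic_set(&mem->softlimit_events, 0);
		return true;
	}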

Other comments are in my replies to the individual patches.

Thanks,
-Kame


> Please review, comment.
> 
> Series
> ------
> 
> memcg-soft-limits-documentation.patch
> memcg-soft-limits-interface.patch
> memcg-soft-limits-organize.patch
> memcg-soft-limits-refactor-reclaim-bits
> memcg-soft-limits-reclaim-on-contention.patch
> 
> 
> -- 
> 	Balbir
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8)
  2009-07-09 17:15 ` [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8) Balbir Singh
@ 2009-07-10  5:21   ` KAMEZAWA Hiroyuki
  2009-07-10  6:47     ` Balbir Singh
  2009-07-10  8:05     ` Balbir Singh
  0 siblings, 2 replies; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10  5:21 UTC (permalink / raw)
  To: Balbir Singh; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

On Thu, 09 Jul 2009 22:45:01 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Feature: Organize cgroups over soft limit in a RB-Tree
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v8...v7
> 1. Make the data structures per zone per node
> 
> Changelog v7...v6
> 1. Refactor the check and update logic. The goal is to allow the
>    check logic to be modular, so that it can be revisited in the future
>    if something more appropriate is found to be useful.
> 
> Changelog v6...v5
> 1. Update the key before inserting into RB tree. Without the current change
>    it could take an additional iteration to get the key correct.
> 
> Changelog v5...v4
> 1. res_counter_uncharge has an additional parameter to indicate if the
>    counter was over its soft limit, before uncharge.
> 
> Changelog v4...v3
> 1. Optimizations to ensure we don't uncessarily get res_counter values
> 2. Fixed a bug in usage of time_after()
> 
> Changelog v3...v2
> 1. Add only the ancestor to the RB-Tree
> 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
> 
> Changelog v2...v1
> 1. Add support for hierarchies
> 2. The res_counter that is highest in the hierarchy is returned on soft
>    limit being exceeded. Since we do hierarchical reclaim and add all
>    groups exceeding their soft limits, this approach seems to work well
>    in practice.
> 
> This patch introduces a RB-Tree for storing memory cgroups that are over their
> soft limit. The overall goal is to
> 
> 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
>    We are careful about updates, updates take place only after a particular
>    time interval has passed
> 2. We remove the node from the RB-Tree when the usage goes below the soft
>    limit
> 
> The next set of patches will exploit the RB-Tree to get the group that is
> over its soft limit by the largest amount and reclaim from it, when we
> face memory contention.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  include/linux/res_counter.h |    6 +
>  kernel/res_counter.c        |   18 ++-
>  mm/memcontrol.c             |  304 +++++++++++++++++++++++++++++++++++++------
>  3 files changed, 281 insertions(+), 47 deletions(-)
> 
> 
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index fcb9884..731af71 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -114,7 +114,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
>  int __must_check res_counter_charge_locked(struct res_counter *counter,
>  		unsigned long val);
>  int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> +		unsigned long val, struct res_counter **limit_fail_at,
> +		struct res_counter **soft_limit_at);
>  
>  /*
>   * uncharge - tell that some portion of the resource is released
> @@ -127,7 +128,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
>   */
>  
>  void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> +				bool *was_soft_limit_excess);
>  
>  static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
>  {
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 4434236..dbdade0 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>  }
>  
>  int res_counter_charge(struct res_counter *counter, unsigned long val,
> -			struct res_counter **limit_fail_at)
> +			struct res_counter **limit_fail_at,
> +			struct res_counter **soft_limit_fail_at)
>  {
>  	int ret;
>  	unsigned long flags;
>  	struct res_counter *c, *u;
>  
>  	*limit_fail_at = NULL;
> +	if (soft_limit_fail_at)
> +		*soft_limit_fail_at = NULL;
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
>  		spin_lock(&c->lock);
>  		ret = res_counter_charge_locked(c, val);
> +		/*
> +		 * With soft limits, we return the highest ancestor
> +		 * that exceeds its soft limit
> +		 */
> +		if (soft_limit_fail_at &&
> +			!res_counter_soft_limit_check_locked(c))
> +			*soft_limit_fail_at = c;
>  		spin_unlock(&c->lock);
>  		if (ret < 0) {
>  			*limit_fail_at = c;
> @@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
>  	counter->usage -= val;
>  }
>  
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> +				bool *was_soft_limit_excess)
>  {
>  	unsigned long flags;
>  	struct res_counter *c;
> @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
>  		spin_lock(&c->lock);
> +		if (was_soft_limit_excess)
> +			*was_soft_limit_excess =
> +				!res_counter_soft_limit_check_locked(c);
>  		res_counter_uncharge_locked(c, val);
>  		spin_unlock(&c->lock);
>  	}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3c9292b..036032b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -29,6 +29,7 @@
>  #include <linux/rcupdate.h>
>  #include <linux/limits.h>
>  #include <linux/mutex.h>
> +#include <linux/rbtree.h>
>  #include <linux/slab.h>
>  #include <linux/swap.h>
>  #include <linux/spinlock.h>
> @@ -118,6 +119,11 @@ struct mem_cgroup_per_zone {
>  	unsigned long		count[NR_LRU_LISTS];
>  
>  	struct zone_reclaim_stat reclaim_stat;
> +	struct rb_node		tree_node;	/* RB tree node */
> +	unsigned long 		last_tree_update;/* Last time the tree was */
> +						/* updated in jiffies     */
> +	unsigned long long	usage_in_excess;/* Set to the value by which */
> +						/* the soft limit is exceeded*/
>  };

As pointed out several times, please avoid using jiffies.



>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> @@ -131,6 +137,28 @@ struct mem_cgroup_lru_info {
>  };
>  
>  /*
> + * Cgroups above their limits are maintained in a RB-Tree, independent of
> + * their hierarchy representation
> + */
> +
> +struct mem_cgroup_soft_limit_tree_per_zone {
> +	struct rb_root rb_root;
> +	spinlock_t lock;
> +};
> +
> +struct mem_cgroup_soft_limit_tree_per_node {
> +	struct mem_cgroup_soft_limit_tree_per_zone
> +		rb_tree_per_zone[MAX_NR_ZONES];
> +};
> +
> +struct mem_cgroup_soft_limit_tree {
> +	struct mem_cgroup_soft_limit_tree_per_node
> +		*rb_tree_per_node[MAX_NUMNODES];
> +};
> +
> +static struct mem_cgroup_soft_limit_tree soft_limit_tree;
> +
__read_mostly ?

Isn't that structure name too long ?



> +/*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
>   * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> @@ -187,6 +215,8 @@ struct mem_cgroup {
>  	struct mem_cgroup_stat stat;
>  };
>  
> +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
> +
>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -215,6 +245,164 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  
> +static struct mem_cgroup_per_zone *
> +mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> +{
> +	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
> +}
> +
> +static struct mem_cgroup_per_zone *
> +page_cgroup_zoneinfo(struct page_cgroup *pc)
> +{
> +	struct mem_cgroup *mem = pc->mem_cgroup;
> +	int nid = page_cgroup_nid(pc);
> +	int zid = page_cgroup_zid(pc);
> +
> +	if (!mem)
> +		return NULL;
> +
> +	return mem_cgroup_zoneinfo(mem, nid, zid);
> +}

I'm sorry, but why is this function rewritten? Is there any difference?

> +
> +static struct mem_cgroup_soft_limit_tree_per_zone *
> +soft_limit_tree_node_zone(int nid, int zid)
> +{
> +	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> +}
> +
> +static struct mem_cgroup_soft_limit_tree_per_zone *
> +page_cgroup_soft_limit_tree(struct page_cgroup *pc)
> +{
> +	int nid = page_cgroup_nid(pc);
> +	int zid = page_cgroup_zid(pc);
> +
> +	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> +}
> +

Hm, I think it's better to use "page" rather than "page_cgroup" as the argument.
This pointer doesn't depend on pc->mem_cgroup, and the zid/nid information is
originally taken from "page" anyway. (And we can reduce the footprint.)
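
Something like this, I mean (just a sketch; the function name is made up, the
rest reuses the structures from the hunk above):

	static struct mem_cgroup_soft_limit_tree_per_zone *
	soft_limit_tree_from_page(struct page *page)
	{
		int nid = page_to_nid(page);
		int zid = page_zonenum(page);

		return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
	}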




> +static void
> +mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> +				struct mem_cgroup_per_zone *mz,
> +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> +{
> +	struct rb_node **p = &stz->rb_root.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct mem_cgroup_per_zone *mz_node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&stz->lock, flags);
> +	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);

Hmm, can't this be
 	
	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
	spin_lock_irqsave(&stz->lock, flags);
? There would be no deadlock, but I just don't like taking a spinlock under
another spinlock.

BTW, why spin_lock_irqsave()? Isn't spin_lock() enough? If you need to disable
IRQs, writing down the reason somewhere would help readers understand the code.


> +	while (*p) {

I feel *p should be loaded after taking the spinlock (&stz->lock), rather than
at the top of the function. No?

> +		parent = *p;
> +		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
> +					tree_node);
> +		if (mz->usage_in_excess < mz_node->usage_in_excess)
> +			p = &(*p)->rb_left;
> +		/*
> +		 * We can't avoid mem cgroups that are over their soft
> +		 * limit by the same amount
> +		 */
> +		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
> +			p = &(*p)->rb_right;
> +	}
> +	rb_link_node(&mz->tree_node, parent, p);
> +	rb_insert_color(&mz->tree_node, &stz->rb_root);
> +	mz->last_tree_update = jiffies;
> +	spin_unlock_irqrestore(&stz->lock, flags);
> +}
> +
> +static void
> +mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> +				struct mem_cgroup_per_zone *mz,
> +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> +{
> +	unsigned long flags;
> +	spin_lock_irqsave(&stz->lock, flags);
Why irqsave? (Same question as above.)

> +	rb_erase(&mz->tree_node, &stz->rb_root);
> +	spin_unlock_irqrestore(&stz->lock, flags);
> +}
> +
> +static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
> +					bool over_soft_limit,
> +					struct page *page)
> +{
> +	unsigned long next_update;
> +	struct page_cgroup *pc;
> +	struct mem_cgroup_per_zone *mz;
> +
> +	if (!over_soft_limit)
> +		return false;
> +
> +	pc = lookup_page_cgroup(page);
> +	if (unlikely(!pc))
> +		return false;
> +	mz = mem_cgroup_zoneinfo(mem, page_cgroup_nid(pc), page_cgroup_zid(pc));

mz = page_cgroup_zoneinfo(pc)
or
mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zid(page))

> +
> +	next_update = mz->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> +	if (time_after(jiffies, next_update))
> +		return true;
> +
> +	return false;
> +}
> +
> +static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
> +{
> +	unsigned long long prev_usage_in_excess, new_usage_in_excess;
> +	bool updated_tree = false;
> +	unsigned long flags;
> +	struct page_cgroup *pc;
> +	struct mem_cgroup_per_zone *mz;
> +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> +
> +	/*
> +	 * As long as the page is around, pc's are always
> +	 * around and so is the mz, in the remove path
> +	 * we are yet to do the css_put(). I don't think
> +	 * we need to hold page cgroup lock.
> +	 */
IIUC, at the time we update the tree we have already grabbed this page, which
is about to be mapped or about to be put into the radix tree. If so, there is
no need to worry about it.

> +	pc = lookup_page_cgroup(page);
> +	if (unlikely(!pc))
> +		return;

I bet this can be BUG_ON().

> +	mz = mem_cgroup_zoneinfo(mem, page_cgroup_nid(pc), page_cgroup_zid(pc));
mz = page_cgroup_zoneinfo(pc);

> +	stz = page_cgroup_soft_limit_tree(pc);
> +
> +	/*
> +	 * We do updates in lazy mode, mem's are removed
> +	 * lazily from the per-zone, per-node rb tree
> +	 */
> +	prev_usage_in_excess = mz->usage_in_excess;
> +
> +	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> +	if (prev_usage_in_excess) {
> +		mem_cgroup_remove_exceeded(mem, mz, stz);
> +		updated_tree = true;
> +	}
IIUC, mz->usage_in_excess can't be used to find out whether mz is on the tree.
I think you introduce "bool on_tree" in patch 5/5; please use it here.



> +	if (!new_usage_in_excess)
> +		goto done;
> +	mem_cgroup_insert_exceeded(mem, mz, stz);
> +
> +done:
> +	if (updated_tree) {
> +		spin_lock_irqsave(&stz->lock, flags);
> +		mz->usage_in_excess = new_usage_in_excess;
> +		spin_unlock_irqrestore(&stz->lock, flags);
> +	}
> +}
> +
> +static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
> +{
> +	int node, zone;
> +	struct mem_cgroup_per_zone *mz;
> +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> +
> +	for_each_node_state(node, N_POSSIBLE) {
> +		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +			mz = mem_cgroup_zoneinfo(mem, node, zone);
> +			stz = soft_limit_tree_node_zone(node, zone);
> +			mem_cgroup_remove_exceeded(mem, mz, stz);
> +		}
> +	}
> +}
> +
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  					 struct page_cgroup *pc,
>  					 bool charge)
> @@ -239,25 +427,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  	put_cpu();
>  }
>  
> -static struct mem_cgroup_per_zone *
> -mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> -{
> -	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
> -}
> -
> -static struct mem_cgroup_per_zone *
> -page_cgroup_zoneinfo(struct page_cgroup *pc)
> -{
> -	struct mem_cgroup *mem = pc->mem_cgroup;
> -	int nid = page_cgroup_nid(pc);
> -	int zid = page_cgroup_zid(pc);
> -
> -	if (!mem)
> -		return NULL;
> -
> -	return mem_cgroup_zoneinfo(mem, nid, zid);
> -}
> -
>  static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
>  					enum lru_list idx)
>  {
> @@ -972,11 +1141,11 @@ done:
>   */
>  static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  			gfp_t gfp_mask, struct mem_cgroup **memcg,
> -			bool oom)
> +			bool oom, struct page *page)
>  {


> -	struct mem_cgroup *mem, *mem_over_limit;
> +	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> -	struct res_counter *fail_res;
> +	struct res_counter *fail_res, *soft_fail_res = NULL;
>  
>  	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
>  		/* Don't account this! */
> @@ -1006,16 +1175,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		int ret;
>  		bool noswap = false;
>  
> -		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> +		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> +						&soft_fail_res);
>  		if (likely(!ret)) {
>  			if (!do_swap_account)
>  				break;
>  			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> -							&fail_res);
> +							&fail_res, NULL);
>  			if (likely(!ret))
>  				break;
>  			/* mem+swap counter fails */
> -			res_counter_uncharge(&mem->res, PAGE_SIZE);
> +			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
>  			noswap = true;
>  			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
>  									memsw);
> @@ -1053,13 +1223,24 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  			goto nomem;
>  		}
>  	}
> +	/*
> +	 * Insert just the ancestor, we should trickle down to the correct
> +	 * cgroup for reclaim, since the other nodes will be below their
> +	 * soft limit
> +	 */
> +	if (soft_fail_res) {
> +		mem_over_soft_limit =
> +			mem_cgroup_from_res_counter(soft_fail_res, res);
> +		if (mem_cgroup_soft_limit_check(mem_over_soft_limit, true,
> +							page))
> +			mem_cgroup_update_tree(mem_over_soft_limit, page);
> +	}



>  	return 0;
>  nomem:
>  	css_put(&mem->css);
>  	return -ENOMEM;
>  }
>  
> -
>  /*
>   * A helper function to get mem_cgroup from ID. must be called under
>   * rcu_read_lock(). The caller must check css_is_removed() or some if
> @@ -1126,9 +1307,9 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
>  		unlock_page_cgroup(pc);
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> +		res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
>  		if (do_swap_account)
> -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
>  		css_put(&mem->css);
>  		return;
>  	}
> @@ -1205,7 +1386,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>  	if (pc->mem_cgroup != from)
>  		goto out;
>  
> -	res_counter_uncharge(&from->res, PAGE_SIZE);
> +	res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
>  	mem_cgroup_charge_statistics(from, pc, false);
>  
>  	page = pc->page;
> @@ -1225,7 +1406,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>  	}
>  
>  	if (do_swap_account)
> -		res_counter_uncharge(&from->memsw, PAGE_SIZE);
> +		res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
>  	css_put(&from->css);
>  
>  	css_get(&to->css);
> @@ -1259,7 +1440,7 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
>  	parent = mem_cgroup_from_cont(pcg);
>  
>  
> -	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
> +	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
>  	if (ret || !parent)
>  		return ret;
>  
> @@ -1289,9 +1470,9 @@ uncharge:
>  	/* drop extra refcnt by try_charge() */
>  	css_put(&parent->css);
>  	/* uncharge if move fails */
> -	res_counter_uncharge(&parent->res, PAGE_SIZE);
> +	res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
>  	if (do_swap_account)
> -		res_counter_uncharge(&parent->memsw, PAGE_SIZE);
> +		res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
>  	return ret;
>  }
>  
> @@ -1316,7 +1497,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
>  	prefetchw(pc);
>  
>  	mem = memcg;
> -	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
> +	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
>  	if (ret || !mem)
>  		return ret;
>  
> @@ -1435,14 +1616,14 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  	if (!mem)
>  		goto charge_cur_mm;
>  	*ptr = mem;
> -	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
> +	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
>  	/* drop extra refcnt from tryget */
>  	css_put(&mem->css);
>  	return ret;
>  charge_cur_mm:
>  	if (unlikely(!mm))
>  		mm = &init_mm;
> -	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> +	return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
>  }
>  
>  static void
> @@ -1479,7 +1660,7 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
>  			 * This recorded memcg can be obsolete one. So, avoid
>  			 * calling css_tryget
>  			 */
> -			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> +			res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
>  			mem_cgroup_put(memcg);
>  		}
>  		rcu_read_unlock();
> @@ -1500,9 +1681,9 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
>  		return;
>  	if (!mem)
>  		return;
> -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
>  	if (do_swap_account)
> -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +		res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
>  	css_put(&mem->css);
>  }
>  
> @@ -1516,6 +1697,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	struct page_cgroup *pc;
>  	struct mem_cgroup *mem = NULL;
>  	struct mem_cgroup_per_zone *mz;
> +	bool soft_limit_excess = false;
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -1554,9 +1736,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  		break;
>  	}
>  
> -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
>  	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +		res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
>  	mem_cgroup_charge_statistics(mem, pc, false);
>  
>  	ClearPageCgroupUsed(pc);
> @@ -1570,6 +1752,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	mz = page_cgroup_zoneinfo(pc);
>  	unlock_page_cgroup(pc);
>  
> +	if (mem_cgroup_soft_limit_check(mem, soft_limit_excess, page))
> +		mem_cgroup_update_tree(mem, page);
>  	/* at swapout, this memcg will be accessed to record to swap */
>  	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>  		css_put(&mem->css);
> @@ -1645,7 +1829,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
>  		 * We uncharge this because swap is freed.
>  		 * This memcg can be obsolete one. We avoid calling css_tryget
>  		 */
> -		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> +		res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
>  		mem_cgroup_put(memcg);
>  	}
>  	rcu_read_unlock();
> @@ -1674,7 +1858,8 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
>  	unlock_page_cgroup(pc);
>  
>  	if (mem) {
> -		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
> +		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
> +						page);
>  		css_put(&mem->css);
>  	}
>  	*ptr = mem;
> @@ -2177,6 +2362,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
>  			res_counter_reset_failcnt(&mem->memsw);
>  		break;
>  	}
> +
>  	return 0;
>  }
Noise here (an unrelated blank-line change).


>  
> @@ -2472,6 +2658,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>  		mz = &pn->zoneinfo[zone];
>  		for_each_lru(l)
>  			INIT_LIST_HEAD(&mz->lists[l]);
> +		mz->last_tree_update = 0;
> +		mz->usage_in_excess = 0;
>  	}
>  	return 0;
>  }
> @@ -2517,6 +2705,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>  {
>  	int node;
>  
> +	mem_cgroup_remove_from_trees(mem);
>  	free_css_id(&mem_cgroup_subsys, &mem->css);
>  
>  	for_each_node_state(node, N_POSSIBLE)
> @@ -2565,6 +2754,31 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>  
> +static int mem_cgroup_soft_limit_tree_init(void)
> +{
> +	struct mem_cgroup_soft_limit_tree_per_node *rtpn;
> +	struct mem_cgroup_soft_limit_tree_per_zone *rtpz;
> +	int tmp, node, zone;
> +
> +	for_each_node_state(node, N_POSSIBLE) {
> +		tmp = node;
> +		if (!node_state(node, N_NORMAL_MEMORY))
> +			tmp = -1;
> +		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
> +		if (!rtpn)
> +			return 1;
> +
> +		soft_limit_tree.rb_tree_per_node[node] = rtpn;
> +
> +		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +			rtpz = &rtpn->rb_tree_per_zone[zone];
> +			rtpz->rb_root = RB_ROOT;
> +			spin_lock_init(&rtpz->lock);
> +		}
> +	}
> +	return 0;
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -2579,11 +2793,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	for_each_node_state(node, N_POSSIBLE)
>  		if (alloc_mem_cgroup_per_zone_info(mem, node))
>  			goto free_out;
> +
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
>  		parent = NULL;
>  		root_mem_cgroup = mem;
> +		if (mem_cgroup_soft_limit_tree_init())
> +			goto free_out;
> +
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		mem->use_hierarchy = parent->use_hierarchy;
> 

Thx,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8)
  2009-07-09 17:15 ` [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8) Balbir Singh
@ 2009-07-10  5:30   ` KAMEZAWA Hiroyuki
  2009-07-10  6:53     ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10  5:30 UTC (permalink / raw)
  To: Balbir Singh; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

On Thu, 09 Jul 2009 22:45:12 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Feature: Implement reclaim from groups over their soft limit
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v8 ..v7
> 1. Soft limit reclaim takes an order parameter and does no reclaim for
>    order > 0. This ensures that we don't do double reclaim for order > 0
> 2. Make the data structures more scalable, move the reclaim logic
>    to a new function mem_cgroup_shrink_node_zone that does per node
>    per zone reclaim.
> 3. Reclaim has moved back to kswapd (balance_pgdat)
> 
> Changelog v7...v6
> 1. Refactored out reclaim_options patch into a separate patch
> 2. Added additional checks for all swap off condition in
>    mem_cgroup_hierarchical_reclaim()
> 
> Changelog v6...v5
> 1. Reclaim arguments to hierarchical reclaim have been merged into one
>    parameter called reclaim_options.
> 2. Check if we failed to reclaim from one cgroup during soft reclaim, if
>    so move on to the next one. This can be very useful if the zonelist
>    passed to soft limit reclaim has no allocations from the selected
>    memory cgroup
> 3. Coding style cleanups
> 
> Changelog v5...v4
> 
> 1. Throttling is removed, earlier we throttled tasks over their soft limit
> 2. Reclaim has been moved back to __alloc_pages_internal, several experiments
>    and tests showed that it was the best place to reclaim memory. kswapd has
>    a different goal, that does not work with a single soft limit for the memory
>    cgroup.
> 3. Soft limit reclaim is more targetted and the pages reclaim depend on the
>    amount by which the soft limit is exceeded.
> 
> Changelog v4...v3
> 1. soft_reclaim is now called from balance_pgdat
> 2. soft_reclaim is aware of nodes and zones
> 3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
>    and at the same time trying to allocate pages and exceed its soft limit.
> 4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
>    particular to a mem cgroup.
> 
> Changelog v3...v2
> 1. Convert several arguments to hierarchical reclaim to flags, thereby
>    consolidating them
> 2. The reclaim for soft limits is now triggered from kswapd
> 3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument
> 
> 
> Changelog v2...v1
> 1. Added support for hierarchical soft limits
> 
> This patch allows reclaim from memory cgroups on contention (via the
> direct reclaim path).
> 
> memory cgroup soft limit reclaim finds the group that exceeds its soft limit
> by the largest number of pages and reclaims pages from it and then reinserts the
> cgroup into its correct place in the rbtree.
> 
> Added additional checks to mem_cgroup_hierarchical_reclaim() to detect
> long loops in case all swap is turned off. The code has been refactored
> and the loop check (loop < 2) has been enhanced for soft limits. For soft
> limits, we try to do more targetted reclaim. Instead of bailing out after
> two loops, the routine now reclaims memory proportional to the size by
> which the soft limit is exceeded. The proportion has been empirically
> determined.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  include/linux/memcontrol.h |   11 ++
>  include/linux/swap.h       |    5 +
>  mm/memcontrol.c            |  224 +++++++++++++++++++++++++++++++++++++++++---
>  mm/vmscan.c                |   39 +++++++-
>  4 files changed, 262 insertions(+), 17 deletions(-)
> 
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e46a073..cf20acc 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -118,6 +118,9 @@ static inline bool mem_cgroup_disabled(void)
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
>  void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
> +unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> +						gfp_t gfp_mask, int nid,
> +						int zid, int priority);
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
>  
> @@ -276,6 +279,14 @@ static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
>  {
>  }
>  
> +static inline
> +unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> +						gfp_t gfp_mask, int nid,
> +						int zid, int priority)
> +{
> +	return 0;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 6c990e6..afc0721 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -217,6 +217,11 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>  						  gfp_t gfp_mask, bool noswap,
>  						  unsigned int swappiness);
> +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> +						gfp_t gfp_mask, bool noswap,
> +						unsigned int swappiness,
> +						struct zone *zone,
> +						int nid, int priority);
>  extern int __isolate_lru_page(struct page *page, int mode, int file);
>  extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ca9c257..e7a1cf4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -124,6 +124,9 @@ struct mem_cgroup_per_zone {
>  						/* updated in jiffies     */
>  	unsigned long long	usage_in_excess;/* Set to the value by which */
>  						/* the soft limit is exceeded*/
> +	bool on_tree;				/* Is the node on tree? */
> +	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
> +						/* use container_of	   */
>  };
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> @@ -216,6 +219,13 @@ struct mem_cgroup {
>  
>  #define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
>  
> +/*
> + * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> + * limit reclaim to prevent infinite loops, if they ever occur.
> + */
> +#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(10000)
> +#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
> +
>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -247,6 +257,8 @@ enum charge_type {
>  #define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
>  #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
>  #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> +#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
> +#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
>  
>  static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
> @@ -287,16 +299,17 @@ page_cgroup_soft_limit_tree(struct page_cgroup *pc)
>  }
>  
>  static void
> -mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> +__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
>  				struct mem_cgroup_per_zone *mz,
>  				struct mem_cgroup_soft_limit_tree_per_zone *stz)
>  {
>  	struct rb_node **p = &stz->rb_root.rb_node;
>  	struct rb_node *parent = NULL;
>  	struct mem_cgroup_per_zone *mz_node;
> -	unsigned long flags;
>  
> -	spin_lock_irqsave(&stz->lock, flags);
> +	if (mz->on_tree)
> +		return;
> +
>  	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
>  	while (*p) {
>  		parent = *p;
> @@ -314,6 +327,29 @@ mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
>  	rb_link_node(&mz->tree_node, parent, p);
>  	rb_insert_color(&mz->tree_node, &stz->rb_root);
>  	mz->last_tree_update = jiffies;
> +	mz->on_tree = true;
> +}
> +
> +static void
> +__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> +				struct mem_cgroup_per_zone *mz,
> +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> +{
> +	if (!mz->on_tree)
> +		return;
> +	rb_erase(&mz->tree_node, &stz->rb_root);
> +	mz->on_tree = false;
> +}
> +
> +static void
> +mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> +				struct mem_cgroup_per_zone *mz,
> +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&stz->lock, flags);
> +	__mem_cgroup_insert_exceeded(mem, mz, stz);
>  	spin_unlock_irqrestore(&stz->lock, flags);
>  }
>  
> @@ -324,7 +360,7 @@ mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
>  {
>  	unsigned long flags;
>  	spin_lock_irqsave(&stz->lock, flags);
> -	rb_erase(&mz->tree_node, &stz->rb_root);
> +	__mem_cgroup_remove_exceeded(mem, mz, stz);
>  	spin_unlock_irqrestore(&stz->lock, flags);
>  }
>  
> @@ -410,6 +446,52 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
>  	}
>  }
>  
> +unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
> +{
> +	unsigned long excess;
> +	excess = res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
> +	return (excess > ULONG_MAX) ? ULONG_MAX : excess;
> +}
> +
What does this mean? Can excess be bigger than ULONG_MAX even after
>> PAGE_SHIFT? As written, excess is already an unsigned long, so the
comparison can never be true.
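
A minimal sketch of what I guess was intended, doing the shift and the clamp
on the full u64 value before truncating to unsigned long:

	unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
	{
		/* keep the full 64-bit value until after the clamp */
		u64 excess = res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;

		return (excess > ULONG_MAX) ? ULONG_MAX : (unsigned long)excess;
	}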



> +static struct mem_cgroup_per_zone *
> +__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_soft_limit_tree_per_zone
> +					*stz)
> +{
> +	struct rb_node *rightmost = NULL;
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +retry:
> +	rightmost = rb_last(&stz->rb_root);
> +	if (!rightmost)
> +		goto done;		/* Nothing to reclaim from */
> +
> +	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
> +	/*
> +	 * Remove the node now but someone else can add it back,
> +	 * we will to add it back at the end of reclaim to its correct
> +	 * position in the tree.
> +	 */
> +	__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
> +	if (!css_tryget(&mz->mem->css) ||
> +		!res_counter_soft_limit_excess(&mz->mem->res))
> +		goto retry;
This leaks the css's refcount: if css_tryget() succeeds but the excess check
fails, we goto retry without a css_put(). Please invert the order:

	if (!res_counter_xxxxx() || !css_tryget())
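
i.e. filling in the names from the hunk above (just a sketch):

	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
	    !css_tryget(&mz->mem->css))
		goto retry;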



> +done:
> +	return mz;
> +}
> +
> +static struct mem_cgroup_per_zone *
> +mem_cgroup_largest_soft_limit_node(struct mem_cgroup_soft_limit_tree_per_zone
> +					*stz)
> +{
> +	struct mem_cgroup_per_zone *mz;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&stz->lock, flags);
> +	mz = __mem_cgroup_largest_soft_limit_node(stz);
> +	spin_unlock_irqrestore(&stz->lock, flags);
> +	return mz;
> +}
> +
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  					 struct page_cgroup *pc,
>  					 bool charge)
> @@ -1038,31 +1120,59 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
>   * If shrink==true, for avoiding to free too much, this returns immedieately.
>   */
>  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> +						struct zone *zone,
>  						gfp_t gfp_mask,
> -						unsigned long reclaim_options)
> +						unsigned long reclaim_options,
> +						int priority)
>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
>  	int loop = 0;
>  	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
>  	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> +	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> +	unsigned long excess = mem_cgroup_get_excess(root_mem);
>  
>  	/* If memsw_is_minimum==1, swap-out is of-no-use. */
>  	if (root_mem->memsw_is_minimum)
>  		noswap = true;
>  
> -	while (loop < 2) {
> +	while (1) {
>  		victim = mem_cgroup_select_victim(root_mem);
> -		if (victim == root_mem)
> +		if (victim == root_mem) {
>  			loop++;
> +			if (loop >= 2) {
> +				/*
> +				 * If we have not been able to reclaim
> +				 * anything, it might because there are
> +				 * no reclaimable pages under this hierarchy
> +				 */
> +				if (!check_soft || !total)
> +					break;
> +				/*
> +				 * We want to do more targetted reclaim.
> +				 * excess >> 2 is not to excessive so as to
> +				 * reclaim too much, nor too less that we keep
> +				 * coming back to reclaim from this cgroup
> +				 */
> +				if (total >= (excess >> 2) ||
> +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> +					break;
> +			}
> +		}

Hmm... this logic is very unclear to me. Why not just bail out here, in the
same way the usual (non-soft-limit) reclaim does?



>  		if (!mem_cgroup_local_usage(&victim->stat)) {
>  			/* this cgroup's local usage == 0 */
>  			css_put(&victim->css);
>  			continue;
>  		}
>  		/* we use swappiness of local cgroup */
> -		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> -						   get_swappiness(victim));
> +		if (check_soft)
> +			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> +				noswap, get_swappiness(victim), zone,
> +				zone->zone_pgdat->node_id, priority);
> +		else
> +			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> +						noswap, get_swappiness(victim));

Do we need 2 functions ?

>  		css_put(&victim->css);
>  		/*
>  		 * At shrinking usage, we can't check we should stop here or
> @@ -1072,7 +1182,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  		if (shrink)
>  			return ret;
>  		total += ret;
> -		if (mem_cgroup_check_under_limit(root_mem))
> +		if (check_soft) {
> +			if (res_counter_check_under_soft_limit(&root_mem->res))
> +				return total;
> +		} else if (mem_cgroup_check_under_limit(root_mem))
>  			return 1 + total;
>  	}
>  	return total;
> @@ -1207,8 +1320,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		if (!(gfp_mask & __GFP_WAIT))
>  			goto nomem;
>  
> -		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> -							flags);
> +		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> +							gfp_mask, flags, -1);
>  		if (ret)
>  			continue;
>  
> @@ -2002,8 +2115,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> -						   MEM_CGROUP_RECLAIM_SHRINK);
> +		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
> +						GFP_KERNEL,
> +						MEM_CGROUP_RECLAIM_SHRINK, -1);

What does this -1 mean?

>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>  		/* Usage is reduced ? */
>    		if (curusage >= oldusage)
> @@ -2055,9 +2169,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> +		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
>  						MEM_CGROUP_RECLAIM_NOSWAP |
> -						MEM_CGROUP_RECLAIM_SHRINK);
> +						MEM_CGROUP_RECLAIM_SHRINK, -1);
Again, what does the -1 mean?

>  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
> @@ -2068,6 +2182,82 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  	return ret;
>  }
>  
> +unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> +						gfp_t gfp_mask, int nid,
> +						int zid, int priority)
> +{
> +	unsigned long nr_reclaimed = 0;
> +	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
> +	unsigned long flags;
> +	unsigned long reclaimed;
> +	int loop = 0;
> +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> +
> +	if (order > 0)
> +		return 0;
> +
> +	stz = soft_limit_tree_node_zone(nid, zid);
> +	/*
> +	 * This loop can run a while, specially if mem_cgroup's continuously
> +	 * keep exceeding their soft limit and putting the system under
> +	 * pressure
> +	 */
> +	do {
> +		if (next_mz)
> +			mz = next_mz;
> +		else
> +			mz = mem_cgroup_largest_soft_limit_node(stz);
> +		if (!mz)
> +			break;
> +
> +		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> +						gfp_mask,
> +						MEM_CGROUP_RECLAIM_SOFT,
> +						priority);
> +		nr_reclaimed += reclaimed;
> +		spin_lock_irqsave(&stz->lock, flags);
> +
> +		/*
> +		 * If we failed to reclaim anything from this memory cgroup
> +		 * it is time to move on to the next cgroup
> +		 */
> +		next_mz = NULL;
> +		if (!reclaimed) {
> +			do {
> +				/*
> +				 * By the time we get the soft_limit lock
> +				 * again, someone might have aded the
> +				 * group back on the RB tree. Iterate to
> +				 * make sure we get a different mem.
> +				 * mem_cgroup_largest_soft_limit_node returns
> +				 * NULL if no other cgroup is present on
> +				 * the tree
> +				 */
> +				next_mz =
> +				__mem_cgroup_largest_soft_limit_node(stz);
> +			} while (next_mz == mz);
> +		}
> +		mz->usage_in_excess =
> +			res_counter_soft_limit_excess(&mz->mem->res);
> +		__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
> +		if (mz->usage_in_excess)
> +			__mem_cgroup_insert_exceeded(mz->mem, mz, stz);

Please don't push "mz" back onto the tree if !reclaimed.
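
Something like this, I mean (a sketch reusing the code from the hunk above):

	__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
	if (reclaimed) {
		mz->usage_in_excess =
			res_counter_soft_limit_excess(&mz->mem->res);
		if (mz->usage_in_excess)
			__mem_cgroup_insert_exceeded(mz->mem, mz, stz);
	}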



> +		spin_unlock_irqrestore(&stz->lock, flags);
> +		css_put(&mz->mem->css);
> +		loop++;
> +		/*
> +		 * Could not reclaim anything and there are no more
> +		 * mem cgroups to try or we seem to be looping without
> +		 * reclaiming anything.
> +		 */
> +		if (!nr_reclaimed &&
> +			(next_mz == NULL ||
> +			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> +			break;
> +	} while (!nr_reclaimed);
> +	return nr_reclaimed;
> +}
> +
>  /*
>   * This routine traverse page_cgroup in given list and drop them all.
>   * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> @@ -2671,6 +2861,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>  			INIT_LIST_HEAD(&mz->lists[l]);
>  		mz->last_tree_update = 0;
>  		mz->usage_in_excess = 0;
> +		mz->on_tree = false;
> +		mz->mem = mem;
>  	}
>  	return 0;
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 86dc0c3..d0f5c4d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1780,11 +1780,39 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  
> +unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> +						gfp_t gfp_mask, bool noswap,
> +						unsigned int swappiness,
> +						struct zone *zone, int nid,
> +						int priority)
> +{
> +	struct scan_control sc = {
> +		.may_writepage = !laptop_mode,
> +		.may_unmap = 1,
> +		.may_swap = !noswap,
> +		.swap_cluster_max = SWAP_CLUSTER_MAX,
> +		.swappiness = swappiness,
> +		.order = 0,
> +		.mem_cgroup = mem,
> +		.isolate_pages = mem_cgroup_isolate_pages,
> +	};
> +	nodemask_t nm  = nodemask_of_node(nid);
> +
> +	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> +			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> +	sc.nodemask = &nm;
> +	sc.nr_reclaimed = 0;
> +	sc.nr_scanned = 0;
> +	shrink_zone(priority, zone, &sc);
> +	return sc.nr_reclaimed;
> +}
> +
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  					   gfp_t gfp_mask,
>  					   bool noswap,
>  					   unsigned int swappiness)
>  {
> +	struct zonelist *zonelist;
>  	struct scan_control sc = {
>  		.may_writepage = !laptop_mode,
>  		.may_unmap = 1,
> @@ -1796,7 +1824,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  		.isolate_pages = mem_cgroup_isolate_pages,
>  		.nodemask = NULL, /* we don't care the placement */
>  	};
> -	struct zonelist *zonelist;
>  
>  	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>  			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> @@ -1918,6 +1945,7 @@ loop_again:
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
>  			int nr_slab;
> +			int nid, zid;
>  
>  			if (!populated_zone(zone))
>  				continue;
> @@ -1932,6 +1960,15 @@ loop_again:
>  			temp_priority[i] = priority;
>  			sc.nr_scanned = 0;
>  			note_zone_scanning_priority(zone, priority);
> +
> +			nid = pgdat->node_id;
> +			zid = zone_idx(zone);
> +			/*
> +			 * Call soft limit reclaim before calling shrink_zone.
> +			 * For now we ignore the return value
> +			 */
> +			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask,
> +							nid, zid, priority);
>  			/*
>  			 * We put equal pressure on every zone, unless one
>  			 * zone has way too many pages free already.
> 


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 1/5] Memory controller soft limit documentation (v8)
  2009-07-09 17:14 ` [RFC][PATCH 1/5] Memory controller soft limit documentation (v8) Balbir Singh
@ 2009-07-10  5:32   ` KAMEZAWA Hiroyuki
  2009-07-10  6:48     ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10  5:32 UTC (permalink / raw)
  To: Balbir Singh; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

On Thu, 09 Jul 2009 22:44:49 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Feature: Add documentation for soft limits
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  Documentation/cgroups/memory.txt |   31 ++++++++++++++++++++++++++++++-
>  1 files changed, 30 insertions(+), 1 deletions(-)
> 
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index ab0a021..b47815c 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -379,7 +379,36 @@ cgroups created below it.
>  
>  NOTE2: This feature can be enabled/disabled per subtree.
>  
> -7. TODO
> +7. Soft limits
> +
> +Soft limits allow for greater sharing of memory. The idea behind soft limits
> +is to allow control groups to use as much of the memory as needed, provided
> +
> +a. There is no memory contention
> +b. They do not exceed their hard limit
> +
> +When the system detects memory contention or low memory control groups
> +are pushed back to their soft limits. If the soft limit of each control
> +group is very high, they are pushed back as much as possible to make
> +sure that one control group does not starve the others of memory.
> +

It would be better to state explicitly that this is a best-effort service. We
only add the hook to kswapd, and how successful it is also depends on the zone.

Thanks,
-Kame

> +7.1 Interface
> +
> +Soft limits can be setup by using the following commands (in this example we
> +assume a soft limit of 256 megabytes)
> +
> +# echo 256M > memory.soft_limit_in_bytes
> +
> +If we want to change this to 1G, we can at any time use
> +
> +# echo 1G > memory.soft_limit_in_bytes
> +
> +NOTE1: Soft limits take effect over a long period of time, since they involve
> +       reclaiming memory for balancing between memory cgroups
> +NOTE2: It is recommended to set the soft limit always below the hard limit,
> +       otherwise the hard limit will take precedence.
> +
> +8. TODO
>  
>  1. Add support for accounting huge pages (as a separate controller)
>  2. Make per-cgroup scanner reclaim not-shared pages first
> 
> -- 
> 	Balbir
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/5] Memory controller soft limit patches (v8)
  2009-07-10  4:53 ` [RFC][PATCH 0/5] Memory controller soft limit patches (v8) KAMEZAWA Hiroyuki
@ 2009-07-10  5:53   ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2009-07-10  5:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

On Fri, Jul 10, 2009 at 10:23 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 09 Jul 2009 22:44:41 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>>
>> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>>
>> New Feature: Soft limits for memory resource controller.
>>
>> Here is v8 of the new soft limit implementation. Soft limits is a new feature
>> for the memory resource controller, something similar has existed in the
>> group scheduler in the form of shares. The CPU controllers interpretation
>> of shares is very different though.
>>
>> Soft limits are the most useful feature to have for environments where
>> the administrator wants to overcommit the system, such that only on memory
>> contention do the limits become active. The current soft limits implementation
>> provides a soft_limit_in_bytes interface for the memory controller and not
>> for memory+swap controller. The implementation maintains an RB-Tree of groups
>> that exceed their soft limit and starts reclaiming from the group that
>> exceeds this limit by the maximum amount.
>>
>> v8 has come out after a long duration, we were held back by bug fixes
>> (most notably swap cache leak fix) and Kamezawa-San has his series of
>> patches for soft limits. Kamezawa-San asked me to refactor these patches
>> to make the data structure per-node-per-zone.
>>
>> TODOs
>>
>> 1. The current implementation maintains the delta from the soft limit
>>    and pushes back groups to their soft limits, a ratio of delta/soft_limit
>>    might be more useful
>> 2. Small optimizations that I intend to push in v9, if the v8 design looks
>>    good and acceptable.
>>
>> Tests
>> -----
>>
>> I've run two memory intensive workloads with differing soft limits and
>> seen that they are pushed back to their soft limit on contention. Their usage
>> was their soft limit plus additional memory that they were able to grab
>> on the system. Soft limit can take a while before we see the expected
>> results.
>>
>
> Before pointing out nitpicks, here are my impressions.
>
>  1. seems good in general.
>

Thanks

>  2. Documentation is not enough. I think it's necessary to write "excuse" as
>    "soft-limit is built on complex memory management system's behavior, then,
>     this may not work as you expect. But in many case, this works well.
>     please take this as best-effort service" or some.
>

Sure, I'll revisit it and update.

>  3. Using "jiffies" again is not good. plz use other check or event counter.
>

Yes, I considered event-based sampling and updates. I wrote the code, but then
realized that it only works really well if I keep the sampling per-cpu;
otherwise it does not scale well. My problem with per-cpu sampling is that the
view we get could vary drastically if we migrate CPUs, or if the task migrates
to a different node and allocates memory there.

>  4. I think it's better to limit soltlimit only against root of hierarcy node.
>    (use_hierarchy=1) I can't explain how the system works if several soft limits
>    are set to root and its children under a hierarchy.
>

The idea is that if we add a node that has children and that node goes above
its soft limit, we'll do hierarchical reclaim from the children underneath,
almost like normal reclaim, where the unused pages get reclaimed. Having said
that, I am open to your suggestion; my concern is that the semantics can get a
bit confusing as to when the administrator can set up soft limits. We can come
up with guidelines and recommend your approach.

>  5. I'm glad if you extract patch 4/5 as an independent clean up patch.
>

Thanks,

>  6. no overheads ?
>

I ran some tests and saw no additional overheads; I'll test some more and post
results. There are some cleanups pending, like the ones you pointed out, where
we can use the page_to_* routines instead of the pc_* ones. I did not clean
them up because I wanted to get the RFC out soon with working functionality,
and to post v9 with those cleaned up.

Thanks for the review.
Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8)
  2009-07-10  5:21   ` KAMEZAWA Hiroyuki
@ 2009-07-10  6:47     ` Balbir Singh
  2009-07-10  7:16       ` KAMEZAWA Hiroyuki
  2009-07-10  8:05     ` Balbir Singh
  1 sibling, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-07-10  6:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:21:35]:

> On Thu, 09 Jul 2009 22:45:01 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Feature: Organize cgroups over soft limit in a RB-Tree
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v8...v7
> > 1. Make the data structures per zone per node
> > 
> > Changelog v7...v6
> > 1. Refactor the check and update logic. The goal is to allow the
> >    check logic to be modular, so that it can be revisited in the future
> >    if something more appropriate is found to be useful.
> > 
> > Changelog v6...v5
> > 1. Update the key before inserting into RB tree. Without the current change
> >    it could take an additional iteration to get the key correct.
> > 
> > Changelog v5...v4
> > 1. res_counter_uncharge has an additional parameter to indicate if the
> >    counter was over its soft limit, before uncharge.
> > 
> > Changelog v4...v3
> > 1. Optimizations to ensure we don't uncessarily get res_counter values
> > 2. Fixed a bug in usage of time_after()
> > 
> > Changelog v3...v2
> > 1. Add only the ancestor to the RB-Tree
> > 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
> > 
> > Changelog v2...v1
> > 1. Add support for hierarchies
> > 2. The res_counter that is highest in the hierarchy is returned on soft
> >    limit being exceeded. Since we do hierarchical reclaim and add all
> >    groups exceeding their soft limits, this approach seems to work well
> >    in practice.
> > 
> > This patch introduces a RB-Tree for storing memory cgroups that are over their
> > soft limit. The overall goal is to
> > 
> > 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> >    We are careful about updates, updates take place only after a particular
> >    time interval has passed
> > 2. We remove the node from the RB-Tree when the usage goes below the soft
> >    limit
> > 
> > The next set of patches will exploit the RB-Tree to get the group that is
> > over its soft limit by the largest amount and reclaim from it, when we
> > face memory contention.
> > 
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > ---
> > 
> >  include/linux/res_counter.h |    6 +
> >  kernel/res_counter.c        |   18 ++-
> >  mm/memcontrol.c             |  304 +++++++++++++++++++++++++++++++++++++------
> >  3 files changed, 281 insertions(+), 47 deletions(-)
> > 
> > 
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index fcb9884..731af71 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -114,7 +114,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> >  int __must_check res_counter_charge_locked(struct res_counter *counter,
> >  		unsigned long val);
> >  int __must_check res_counter_charge(struct res_counter *counter,
> > -		unsigned long val, struct res_counter **limit_fail_at);
> > +		unsigned long val, struct res_counter **limit_fail_at,
> > +		struct res_counter **soft_limit_at);
> >  
> >  /*
> >   * uncharge - tell that some portion of the resource is released
> > @@ -127,7 +128,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
> >   */
> >  
> >  void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > +				bool *was_soft_limit_excess);
> >  
> >  static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> >  {
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 4434236..dbdade0 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> >  }
> >  
> >  int res_counter_charge(struct res_counter *counter, unsigned long val,
> > -			struct res_counter **limit_fail_at)
> > +			struct res_counter **limit_fail_at,
> > +			struct res_counter **soft_limit_fail_at)
> >  {
> >  	int ret;
> >  	unsigned long flags;
> >  	struct res_counter *c, *u;
> >  
> >  	*limit_fail_at = NULL;
> > +	if (soft_limit_fail_at)
> > +		*soft_limit_fail_at = NULL;
> >  	local_irq_save(flags);
> >  	for (c = counter; c != NULL; c = c->parent) {
> >  		spin_lock(&c->lock);
> >  		ret = res_counter_charge_locked(c, val);
> > +		/*
> > +		 * With soft limits, we return the highest ancestor
> > +		 * that exceeds its soft limit
> > +		 */
> > +		if (soft_limit_fail_at &&
> > +			!res_counter_soft_limit_check_locked(c))
> > +			*soft_limit_fail_at = c;
> >  		spin_unlock(&c->lock);
> >  		if (ret < 0) {
> >  			*limit_fail_at = c;
> > @@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
> >  	counter->usage -= val;
> >  }
> >  
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > +				bool *was_soft_limit_excess)
> >  {
> >  	unsigned long flags;
> >  	struct res_counter *c;
> > @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> >  	local_irq_save(flags);
> >  	for (c = counter; c != NULL; c = c->parent) {
> >  		spin_lock(&c->lock);
> > +		if (was_soft_limit_excess)
> > +			*was_soft_limit_excess =
> > +				!res_counter_soft_limit_check_locked(c);
> >  		res_counter_uncharge_locked(c, val);
> >  		spin_unlock(&c->lock);
> >  	}
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 3c9292b..036032b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/rcupdate.h>
> >  #include <linux/limits.h>
> >  #include <linux/mutex.h>
> > +#include <linux/rbtree.h>
> >  #include <linux/slab.h>
> >  #include <linux/swap.h>
> >  #include <linux/spinlock.h>
> > @@ -118,6 +119,11 @@ struct mem_cgroup_per_zone {
> >  	unsigned long		count[NR_LRU_LISTS];
> >  
> >  	struct zone_reclaim_stat reclaim_stat;
> > +	struct rb_node		tree_node;	/* RB tree node */
> > +	unsigned long 		last_tree_update;/* Last time the tree was */
> > +						/* updated in jiffies     */
> > +	unsigned long long	usage_in_excess;/* Set to the value by which */
> > +						/* the soft limit is exceeded*/
> >  };
> 
> As pointed out in several times, plz avoid using jiffies.
> 
> 
> 
> >  /* Macro for accessing counter */
> >  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> > @@ -131,6 +137,28 @@ struct mem_cgroup_lru_info {
> >  };
> >  
> >  /*
> > + * Cgroups above their limits are maintained in a RB-Tree, independent of
> > + * their hierarchy representation
> > + */
> > +
> > +struct mem_cgroup_soft_limit_tree_per_zone {
> > +	struct rb_root rb_root;
> > +	spinlock_t lock;
> > +};
> > +
> > +struct mem_cgroup_soft_limit_tree_per_node {
> > +	struct mem_cgroup_soft_limit_tree_per_zone
> > +		rb_tree_per_zone[MAX_NR_ZONES];
> > +};
> > +
> > +struct mem_cgroup_soft_limit_tree {
> > +	struct mem_cgroup_soft_limit_tree_per_node
> > +		*rb_tree_per_node[MAX_NUMNODES];
> > +};
> > +
> > +static struct mem_cgroup_soft_limit_tree soft_limit_tree;
> > +
> __read_mostly ?
> 

Yep, good point

> Isn't that structure name too long ?
>

I'll look at making it smaller.
 
> 
> 
> > +/*
> >   * The memory controller data structure. The memory controller controls both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> >   * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> > @@ -187,6 +215,8 @@ struct mem_cgroup {
> >  	struct mem_cgroup_stat stat;
> >  };
> >  
> > +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
> > +
> >  enum charge_type {
> >  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> >  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> > @@ -215,6 +245,164 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> >  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >  
> > +static struct mem_cgroup_per_zone *
> > +mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> > +{
> > +	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
> > +}
> > +
> > +static struct mem_cgroup_per_zone *
> > +page_cgroup_zoneinfo(struct page_cgroup *pc)
> > +{
> > +	struct mem_cgroup *mem = pc->mem_cgroup;
> > +	int nid = page_cgroup_nid(pc);
> > +	int zid = page_cgroup_zid(pc);
> > +
> > +	if (!mem)
> > +		return NULL;
> > +
> > +	return mem_cgroup_zoneinfo(mem, nid, zid);
> > +}
> 
> I'm sorry but why this function is rewritten ? Any difference ?
>

No, it was just moved up.
 
> > +
> > +static struct mem_cgroup_soft_limit_tree_per_zone *
> > +soft_limit_tree_node_zone(int nid, int zid)
> > +{
> > +	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> > +}
> > +
> > +static struct mem_cgroup_soft_limit_tree_per_zone *
> > +page_cgroup_soft_limit_tree(struct page_cgroup *pc)
> > +{
> > +	int nid = page_cgroup_nid(pc);
> > +	int zid = page_cgroup_zid(pc);
> > +
> > +	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> > +}
> > +
> 
> Hm, I think it's better to use "page" rather than "page_cgroup" as arguments.
> This pointer doesn't depends on pc->mem_cgroup and zid, nid information is
> gotten from "page", originally. (and we can reduce foot print)
> 

Absolutely! It is on my todo list.

> 
> 
> 
> > +static void
> > +mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> > +				struct mem_cgroup_per_zone *mz,
> > +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> > +{
> > +	struct rb_node **p = &stz->rb_root.rb_node;
> > +	struct rb_node *parent = NULL;
> > +	struct mem_cgroup_per_zone *mz_node;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&stz->lock, flags);
> > +	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> 
> Hmm, can't this be
>  	
> 	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> 	spin_lock_irqsave(&stz->lock, flags);
> ? There will be no deadlock but I just don't like spinlock under spinlock.
>

I don't think that is avoidable; we need to get the resource counter
values there (see the next patch).
 
> BTW, why spin_lock_irqsave() ? spin_lock() isn't enough ? If you need to
> disable IRQ, writing reason somewhere is helpful to undestand the code.
> 

Good catch; with reclaim now moving to balance_pgdat(), _irqsave is
not needed.

> 
> > +	while (*p) {
> 
> I feel this *p should be loaded after taking spinlock(&stz->lock) rather than top
> of function. No?

No.. since the root remains constant once loaded. Am I missing
something?


> 
> > +		parent = *p;
> > +		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
> > +					tree_node);
> > +		if (mz->usage_in_excess < mz_node->usage_in_excess)
> > +			p = &(*p)->rb_left;
> > +		/*
> > +		 * We can't avoid mem cgroups that are over their soft
> > +		 * limit by the same amount
> > +		 */
> > +		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
> > +			p = &(*p)->rb_right;
> > +	}
> > +	rb_link_node(&mz->tree_node, parent, p);
> > +	rb_insert_color(&mz->tree_node, &stz->rb_root);
> > +	mz->last_tree_update = jiffies;
> > +	spin_unlock_irqrestore(&stz->lock, flags);
> > +}
> > +
> > +static void
> > +mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> > +				struct mem_cgroup_per_zone *mz,
> > +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> > +{
> > +	unsigned long flags;
> > +	spin_lock_irqsave(&stz->lock, flags);
> why IRQ save ? again.
>

Will remove
 
> > +	rb_erase(&mz->tree_node, &stz->rb_root);
> > +	spin_unlock_irqrestore(&stz->lock, flags);
> > +}
> > +
> > +static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
> > +					bool over_soft_limit,
> > +					struct page *page)
> > +{
> > +	unsigned long next_update;
> > +	struct page_cgroup *pc;
> > +	struct mem_cgroup_per_zone *mz;
> > +
> > +	if (!over_soft_limit)
> > +		return false;
> > +
> > +	pc = lookup_page_cgroup(page);
> > +	if (unlikely(!pc))
> > +		return false;
> > +	mz = mem_cgroup_zoneinfo(mem, page_cgroup_nid(pc), page_cgroup_zid(pc));
> 
> mz = page_cgroup_zoneinfo(pc)
> or
> mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zid(page))
>

Will change it.
 
> > +
> > +	next_update = mz->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> > +	if (time_after(jiffies, next_update))
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> > +static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
> > +{
> > +	unsigned long long prev_usage_in_excess, new_usage_in_excess;
> > +	bool updated_tree = false;
> > +	unsigned long flags;
> > +	struct page_cgroup *pc;
> > +	struct mem_cgroup_per_zone *mz;
> > +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> > +
> > +	/*
> > +	 * As long as the page is around, pc's are always
> > +	 * around and so is the mz, in the remove path
> > +	 * we are yet to do the css_put(). I don't think
> > +	 * we need to hold page cgroup lock.
> > +	 */
> IIUC, at updating tree,we grab this page which is near-to-be-mapped or
> near-to-be-in-radix-treee. If so, not necessary to be annoyied.

Not sure I understand your comment about annoyied (annoyed?)

> 
> > +	pc = lookup_page_cgroup(page);
> > +	if (unlikely(!pc))
> > +		return;
> 
> I bet this can be BUG_ON().

In the new version we will not need pc

> 
> > +	mz = mem_cgroup_zoneinfo(mem, page_cgroup_nid(pc), page_cgroup_zid(pc));
> mz = page_cgroup_zoneinfo(pc);
> 
In the new version we will not need pc
> > +	stz = page_cgroup_soft_limit_tree(pc);
> > +
> > +	/*
> > +	 * We do updates in lazy mode, mem's are removed
> > +	 * lazily from the per-zone, per-node rb tree
> > +	 */
> > +	prev_usage_in_excess = mz->usage_in_excess;
> > +
> > +	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > +	if (prev_usage_in_excess) {
> > +		mem_cgroup_remove_exceeded(mem, mz, stz);
> > +		updated_tree = true;
> > +	}
> IIUC, mz->usage_in_excess can't be used to find out mz is on-tree.
> I think you use "bool on_tree" in patch 5/5. plz use it here.
> 

OK, I'll move that part of the code to this patch.

> 
> 
> > +	if (!new_usage_in_excess)
> > +		goto done;
> > +	mem_cgroup_insert_exceeded(mem, mz, stz);
> > +
> > +done:
> > +	if (updated_tree) {
> > +		spin_lock_irqsave(&stz->lock, flags);
> > +		mz->usage_in_excess = new_usage_in_excess;
> > +		spin_unlock_irqrestore(&stz->lock, flags);
> > +	}
> > +}
> > +
> > +static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
> > +{
> > +	int node, zone;
> > +	struct mem_cgroup_per_zone *mz;
> > +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> > +
> > +	for_each_node_state(node, N_POSSIBLE) {
> > +		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> > +			mz = mem_cgroup_zoneinfo(mem, node, zone);
> > +			stz = soft_limit_tree_node_zone(node, zone);
> > +			mem_cgroup_remove_exceeded(mem, mz, stz);
> > +		}
> > +	}
> > +}
> > +
> >  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> >  					 struct page_cgroup *pc,
> >  					 bool charge)
> > @@ -239,25 +427,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> >  	put_cpu();
> >  }
> >  
> > -static struct mem_cgroup_per_zone *
> > -mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> > -{
> > -	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
> > -}
> > -
> > -static struct mem_cgroup_per_zone *
> > -page_cgroup_zoneinfo(struct page_cgroup *pc)
> > -{
> > -	struct mem_cgroup *mem = pc->mem_cgroup;
> > -	int nid = page_cgroup_nid(pc);
> > -	int zid = page_cgroup_zid(pc);
> > -
> > -	if (!mem)
> > -		return NULL;
> > -
> > -	return mem_cgroup_zoneinfo(mem, nid, zid);
> > -}
> > -
> >  static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
> >  					enum lru_list idx)
> >  {
> > @@ -972,11 +1141,11 @@ done:
> >   */
> >  static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  			gfp_t gfp_mask, struct mem_cgroup **memcg,
> > -			bool oom)
> > +			bool oom, struct page *page)
> >  {
> 
> 
> > -	struct mem_cgroup *mem, *mem_over_limit;
> > +	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
> >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > -	struct res_counter *fail_res;
> > +	struct res_counter *fail_res, *soft_fail_res = NULL;
> >  
> >  	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> >  		/* Don't account this! */
> > @@ -1006,16 +1175,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  		int ret;
> >  		bool noswap = false;
> >  
> > -		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > +		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> > +						&soft_fail_res);
> >  		if (likely(!ret)) {
> >  			if (!do_swap_account)
> >  				break;
> >  			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> > -							&fail_res);
> > +							&fail_res, NULL);
> >  			if (likely(!ret))
> >  				break;
> >  			/* mem+swap counter fails */
> > -			res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
> >  			noswap = true;
> >  			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> >  									memsw);
> > @@ -1053,13 +1223,24 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  			goto nomem;
> >  		}
> >  	}
> > +	/*
> > +	 * Insert just the ancestor, we should trickle down to the correct
> > +	 * cgroup for reclaim, since the other nodes will be below their
> > +	 * soft limit
> > +	 */
> > +	if (soft_fail_res) {
> > +		mem_over_soft_limit =
> > +			mem_cgroup_from_res_counter(soft_fail_res, res);
> > +		if (mem_cgroup_soft_limit_check(mem_over_soft_limit, true,
> > +							page))
> > +			mem_cgroup_update_tree(mem_over_soft_limit, page);
> > +	}
> 
> 
> 
> >  	return 0;
> >  nomem:
> >  	css_put(&mem->css);
> >  	return -ENOMEM;
> >  }
> >  
> > -
> >  /*
> >   * A helper function to get mem_cgroup from ID. must be called under
> >   * rcu_read_lock(). The caller must check css_is_removed() or some if
> > @@ -1126,9 +1307,9 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  	lock_page_cgroup(pc);
> >  	if (unlikely(PageCgroupUsed(pc))) {
> >  		unlock_page_cgroup(pc);
> > -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +		res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
> >  		if (do_swap_account)
> > -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > +			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> >  		css_put(&mem->css);
> >  		return;
> >  	}
> > @@ -1205,7 +1386,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> >  	if (pc->mem_cgroup != from)
> >  		goto out;
> >  
> > -	res_counter_uncharge(&from->res, PAGE_SIZE);
> > +	res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
> >  	mem_cgroup_charge_statistics(from, pc, false);
> >  
> >  	page = pc->page;
> > @@ -1225,7 +1406,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> >  	}
> >  
> >  	if (do_swap_account)
> > -		res_counter_uncharge(&from->memsw, PAGE_SIZE);
> > +		res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
> >  	css_put(&from->css);
> >  
> >  	css_get(&to->css);
> > @@ -1259,7 +1440,7 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
> >  	parent = mem_cgroup_from_cont(pcg);
> >  
> >  
> > -	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
> > +	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
> >  	if (ret || !parent)
> >  		return ret;
> >  
> > @@ -1289,9 +1470,9 @@ uncharge:
> >  	/* drop extra refcnt by try_charge() */
> >  	css_put(&parent->css);
> >  	/* uncharge if move fails */
> > -	res_counter_uncharge(&parent->res, PAGE_SIZE);
> > +	res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
> >  	if (do_swap_account)
> > -		res_counter_uncharge(&parent->memsw, PAGE_SIZE);
> > +		res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
> >  	return ret;
> >  }
> >  
> > @@ -1316,7 +1497,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> >  	prefetchw(pc);
> >  
> >  	mem = memcg;
> > -	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
> > +	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
> >  	if (ret || !mem)
> >  		return ret;
> >  
> > @@ -1435,14 +1616,14 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> >  	if (!mem)
> >  		goto charge_cur_mm;
> >  	*ptr = mem;
> > -	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
> > +	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
> >  	/* drop extra refcnt from tryget */
> >  	css_put(&mem->css);
> >  	return ret;
> >  charge_cur_mm:
> >  	if (unlikely(!mm))
> >  		mm = &init_mm;
> > -	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > +	return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
> >  }
> >  
> >  static void
> > @@ -1479,7 +1660,7 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> >  			 * This recorded memcg can be obsolete one. So, avoid
> >  			 * calling css_tryget
> >  			 */
> > -			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > +			res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
> >  			mem_cgroup_put(memcg);
> >  		}
> >  		rcu_read_unlock();
> > @@ -1500,9 +1681,9 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> >  		return;
> >  	if (!mem)
> >  		return;
> > -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +	res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
> >  	if (do_swap_account)
> > -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > +		res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> >  	css_put(&mem->css);
> >  }
> >  
> > @@ -1516,6 +1697,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> >  	struct page_cgroup *pc;
> >  	struct mem_cgroup *mem = NULL;
> >  	struct mem_cgroup_per_zone *mz;
> > +	bool soft_limit_excess = false;
> >  
> >  	if (mem_cgroup_disabled())
> >  		return NULL;
> > @@ -1554,9 +1736,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> >  		break;
> >  	}
> >  
> > -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +	res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
> >  	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> > -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > +		res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> >  	mem_cgroup_charge_statistics(mem, pc, false);
> >  
> >  	ClearPageCgroupUsed(pc);
> > @@ -1570,6 +1752,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> >  	mz = page_cgroup_zoneinfo(pc);
> >  	unlock_page_cgroup(pc);
> >  
> > +	if (mem_cgroup_soft_limit_check(mem, soft_limit_excess, page))
> > +		mem_cgroup_update_tree(mem, page);
> >  	/* at swapout, this memcg will be accessed to record to swap */
> >  	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> >  		css_put(&mem->css);
> > @@ -1645,7 +1829,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
> >  		 * We uncharge this because swap is freed.
> >  		 * This memcg can be obsolete one. We avoid calling css_tryget
> >  		 */
> > -		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > +		res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
> >  		mem_cgroup_put(memcg);
> >  	}
> >  	rcu_read_unlock();
> > @@ -1674,7 +1858,8 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
> >  	unlock_page_cgroup(pc);
> >  
> >  	if (mem) {
> > -		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
> > +		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
> > +						page);
> >  		css_put(&mem->css);
> >  	}
> >  	*ptr = mem;
> > @@ -2177,6 +2362,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> >  			res_counter_reset_failcnt(&mem->memsw);
> >  		break;
> >  	}
> > +
> >  	return 0;
> >  }
> noise here.
> 
> 
> >  
> > @@ -2472,6 +2658,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> >  		mz = &pn->zoneinfo[zone];
> >  		for_each_lru(l)
> >  			INIT_LIST_HEAD(&mz->lists[l]);
> > +		mz->last_tree_update = 0;
> > +		mz->usage_in_excess = 0;
> >  	}
> >  	return 0;
> >  }
> > @@ -2517,6 +2705,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
> >  {
> >  	int node;
> >  
> > +	mem_cgroup_remove_from_trees(mem);
> >  	free_css_id(&mem_cgroup_subsys, &mem->css);
> >  
> >  	for_each_node_state(node, N_POSSIBLE)
> > @@ -2565,6 +2754,31 @@ static void __init enable_swap_cgroup(void)
> >  }
> >  #endif
> >  
> > +static int mem_cgroup_soft_limit_tree_init(void)
> > +{
> > +	struct mem_cgroup_soft_limit_tree_per_node *rtpn;
> > +	struct mem_cgroup_soft_limit_tree_per_zone *rtpz;
> > +	int tmp, node, zone;
> > +
> > +	for_each_node_state(node, N_POSSIBLE) {
> > +		tmp = node;
> > +		if (!node_state(node, N_NORMAL_MEMORY))
> > +			tmp = -1;
> > +		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
> > +		if (!rtpn)
> > +			return 1;
> > +
> > +		soft_limit_tree.rb_tree_per_node[node] = rtpn;
> > +
> > +		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> > +			rtpz = &rtpn->rb_tree_per_zone[zone];
> > +			rtpz->rb_root = RB_ROOT;
> > +			spin_lock_init(&rtpz->lock);
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> >  static struct cgroup_subsys_state * __ref
> >  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  {
> > @@ -2579,11 +2793,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  	for_each_node_state(node, N_POSSIBLE)
> >  		if (alloc_mem_cgroup_per_zone_info(mem, node))
> >  			goto free_out;
> > +
> >  	/* root ? */
> >  	if (cont->parent == NULL) {
> >  		enable_swap_cgroup();
> >  		parent = NULL;
> >  		root_mem_cgroup = mem;
> > +		if (mem_cgroup_soft_limit_tree_init())
> > +			goto free_out;
> > +
> >  	} else {
> >  		parent = mem_cgroup_from_cont(cont->parent);
> >  		mem->use_hierarchy = parent->use_hierarchy;
> > 
> 
> Thx,
> -Kame
>

Thanks for the review, 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 1/5] Memory controller soft limit documentation (v8)
  2009-07-10  5:32   ` KAMEZAWA Hiroyuki
@ 2009-07-10  6:48     ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2009-07-10  6:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:32:16]:

> On Thu, 09 Jul 2009 22:44:49 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Feature: Add documentation for soft limits
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > ---
> > 
> >  Documentation/cgroups/memory.txt |   31 ++++++++++++++++++++++++++++++-
> >  1 files changed, 30 insertions(+), 1 deletions(-)
> > 
> > 
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index ab0a021..b47815c 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -379,7 +379,36 @@ cgroups created below it.
> >  
> >  NOTE2: This feature can be enabled/disabled per subtree.
> >  
> > -7. TODO
> > +7. Soft limits
> > +
> > +Soft limits allow for greater sharing of memory. The idea behind soft limits
> > +is to allow control groups to use as much of the memory as needed, provided
> > +
> > +a. There is no memory contention
> > +b. They do not exceed their hard limit
> > +
> > +When the system detects memory contention or low memory control groups
> > +are pushed back to their soft limits. If the soft limit of each control
> > +group is very high, they are pushed back as much as possible to make
> > +sure that one control group does not starve the others of memory.
> > +
> 
> It's better to write "this is best-effort service". We add hook only to kswapd.
> And hou successfull this work depends on ZONE.
>

Will do, Thanks for the review. 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8)
  2009-07-10  5:30   ` KAMEZAWA Hiroyuki
@ 2009-07-10  6:53     ` Balbir Singh
  2009-07-10  7:30       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-07-10  6:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:30:26]:

> On Thu, 09 Jul 2009 22:45:12 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Feature: Implement reclaim from groups over their soft limit
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v8 ..v7
> > 1. Soft limit reclaim takes an order parameter and does no reclaim for
> >    order > 0. This ensures that we don't do double reclaim for order > 0
> > 2. Make the data structures more scalable, move the reclaim logic
> >    to a new function mem_cgroup_shrink_node_zone that does per node
> >    per zone reclaim.
> > 3. Reclaim has moved back to kswapd (balance_pgdat)
> > 
> > Changelog v7...v6
> > 1. Refactored out reclaim_options patch into a separate patch
> > 2. Added additional checks for all swap off condition in
> >    mem_cgroup_hierarchical_reclaim()
> > 
> > Changelog v6...v5
> > 1. Reclaim arguments to hierarchical reclaim have been merged into one
> >    parameter called reclaim_options.
> > 2. Check if we failed to reclaim from one cgroup during soft reclaim, if
> >    so move on to the next one. This can be very useful if the zonelist
> >    passed to soft limit reclaim has no allocations from the selected
> >    memory cgroup
> > 3. Coding style cleanups
> > 
> > Changelog v5...v4
> > 
> > 1. Throttling is removed, earlier we throttled tasks over their soft limit
> > 2. Reclaim has been moved back to __alloc_pages_internal, several experiments
> >    and tests showed that it was the best place to reclaim memory. kswapd has
> >    a different goal, that does not work with a single soft limit for the memory
> >    cgroup.
> > 3. Soft limit reclaim is more targetted and the pages reclaim depend on the
> >    amount by which the soft limit is exceeded.
> > 
> > Changelog v4...v3
> > 1. soft_reclaim is now called from balance_pgdat
> > 2. soft_reclaim is aware of nodes and zones
> > 3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
> >    and at the same time trying to allocate pages and exceed its soft limit.
> > 4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
> >    particular to a mem cgroup.
> > 
> > Changelog v3...v2
> > 1. Convert several arguments to hierarchical reclaim to flags, thereby
> >    consolidating them
> > 2. The reclaim for soft limits is now triggered from kswapd
> > 3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument
> > 
> > 
> > Changelog v2...v1
> > 1. Added support for hierarchical soft limits
> > 
> > This patch allows reclaim from memory cgroups on contention (via the
> > direct reclaim path).
> > 
> > memory cgroup soft limit reclaim finds the group that exceeds its soft limit
> > by the largest number of pages and reclaims pages from it and then reinserts the
> > cgroup into its correct place in the rbtree.
> > 
> > Added additional checks to mem_cgroup_hierarchical_reclaim() to detect
> > long loops in case all swap is turned off. The code has been refactored
> > and the loop check (loop < 2) has been enhanced for soft limits. For soft
> > limits, we try to do more targetted reclaim. Instead of bailing out after
> > two loops, the routine now reclaims memory proportional to the size by
> > which the soft limit is exceeded. The proportion has been empirically
> > determined.
> > 
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > ---
> > 
> >  include/linux/memcontrol.h |   11 ++
> >  include/linux/swap.h       |    5 +
> >  mm/memcontrol.c            |  224 +++++++++++++++++++++++++++++++++++++++++---
> >  mm/vmscan.c                |   39 +++++++-
> >  4 files changed, 262 insertions(+), 17 deletions(-)
> > 
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e46a073..cf20acc 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -118,6 +118,9 @@ static inline bool mem_cgroup_disabled(void)
> >  
> >  extern bool mem_cgroup_oom_called(struct task_struct *task);
> >  void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > +						gfp_t gfp_mask, int nid,
> > +						int zid, int priority);
> >  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> >  struct mem_cgroup;
> >  
> > @@ -276,6 +279,14 @@ static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
> >  {
> >  }
> >  
> > +static inline
> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > +						gfp_t gfp_mask, int nid,
> > +						int zid, int priority)
> > +{
> > +	return 0;
> > +}
> > +
> >  #endif /* CONFIG_CGROUP_MEM_CONT */
> >  
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 6c990e6..afc0721 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -217,6 +217,11 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> >  						  gfp_t gfp_mask, bool noswap,
> >  						  unsigned int swappiness);
> > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > +						gfp_t gfp_mask, bool noswap,
> > +						unsigned int swappiness,
> > +						struct zone *zone,
> > +						int nid, int priority);
> >  extern int __isolate_lru_page(struct page *page, int mode, int file);
> >  extern unsigned long shrink_all_memory(unsigned long nr_pages);
> >  extern int vm_swappiness;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ca9c257..e7a1cf4 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -124,6 +124,9 @@ struct mem_cgroup_per_zone {
> >  						/* updated in jiffies     */
> >  	unsigned long long	usage_in_excess;/* Set to the value by which */
> >  						/* the soft limit is exceeded*/
> > +	bool on_tree;				/* Is the node on tree? */
> > +	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
> > +						/* use container_of	   */
> >  };
> >  /* Macro for accessing counter */
> >  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> > @@ -216,6 +219,13 @@ struct mem_cgroup {
> >  
> >  #define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
> >  
> > +/*
> > + * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> > + * limit reclaim to prevent infinite loops, if they ever occur.
> > + */
> > +#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(10000)
> > +#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
> > +
> >  enum charge_type {
> >  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> >  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> > @@ -247,6 +257,8 @@ enum charge_type {
> >  #define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> >  #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
> >  #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> > +#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
> > +#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
> >  
> >  static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> > @@ -287,16 +299,17 @@ page_cgroup_soft_limit_tree(struct page_cgroup *pc)
> >  }
> >  
> >  static void
> > -mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> > +__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> >  				struct mem_cgroup_per_zone *mz,
> >  				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> >  {
> >  	struct rb_node **p = &stz->rb_root.rb_node;
> >  	struct rb_node *parent = NULL;
> >  	struct mem_cgroup_per_zone *mz_node;
> > -	unsigned long flags;
> >  
> > -	spin_lock_irqsave(&stz->lock, flags);
> > +	if (mz->on_tree)
> > +		return;
> > +
> >  	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> >  	while (*p) {
> >  		parent = *p;
> > @@ -314,6 +327,29 @@ mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> >  	rb_link_node(&mz->tree_node, parent, p);
> >  	rb_insert_color(&mz->tree_node, &stz->rb_root);
> >  	mz->last_tree_update = jiffies;
> > +	mz->on_tree = true;
> > +}
> > +
> > +static void
> > +__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> > +				struct mem_cgroup_per_zone *mz,
> > +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> > +{
> > +	if (!mz->on_tree)
> > +		return;
> > +	rb_erase(&mz->tree_node, &stz->rb_root);
> > +	mz->on_tree = false;
> > +}
> > +
> > +static void
> > +mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> > +				struct mem_cgroup_per_zone *mz,
> > +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&stz->lock, flags);
> > +	__mem_cgroup_insert_exceeded(mem, mz, stz);
> >  	spin_unlock_irqrestore(&stz->lock, flags);
> >  }
> >  
> > @@ -324,7 +360,7 @@ mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> >  {
> >  	unsigned long flags;
> >  	spin_lock_irqsave(&stz->lock, flags);
> > -	rb_erase(&mz->tree_node, &stz->rb_root);
> > +	__mem_cgroup_remove_exceeded(mem, mz, stz);
> >  	spin_unlock_irqrestore(&stz->lock, flags);
> >  }
> >  
> > @@ -410,6 +446,52 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
> >  	}
> >  }
> >  
> > +unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
> > +{
> > +	unsigned long excess;
> > +	excess = res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
> > +	return (excess > ULONG_MAX) ? ULONG_MAX : excess;
> > +}
> > +
> What this means ? excess can be bigger than ULONG_MAX even after >> PAGE_SHIFT ?
>

Good catch, ideally no.
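
A possible fix, assuming res_counter_soft_limit_excess() returns a
64-bit value, would be to do the comparison before narrowing the type
(so the clamp is meaningful on 32-bit):

	unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
	{
		u64 excess = res_counter_soft_limit_excess(&mem->res)
				>> PAGE_SHIFT;

		/* compare while still 64-bit, only then narrow */
		return excess > ULONG_MAX ?
			ULONG_MAX : (unsigned long)excess;
	}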
 
> 
> 
> > +static struct mem_cgroup_per_zone *
> > +__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_soft_limit_tree_per_zone
> > +					*stz)
> > +{
> > +	struct rb_node *rightmost = NULL;
> > +	struct mem_cgroup_per_zone *mz = NULL;
> > +
> > +retry:
> > +	rightmost = rb_last(&stz->rb_root);
> > +	if (!rightmost)
> > +		goto done;		/* Nothing to reclaim from */
> > +
> > +	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
> > +	/*
> > +	 * Remove the node now but someone else can add it back,
> > +	 * we will to add it back at the end of reclaim to its correct
> > +	 * position in the tree.
> > +	 */
> > +	__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
> > +	if (!css_tryget(&mz->mem->css) ||
> > +		!res_counter_soft_limit_excess(&mz->mem->res))
> > +		goto retry;
> This leaks css's refcnt. plz invert order as
> 
> 	if (!res_counter_xxxxx() || !css_tryget())
> 
>

Yep, good idea
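
Something along these lines, keeping the function as quoted above and
only inverting the retry condition so the css reference is taken last:

	retry:
		rightmost = rb_last(&stz->rb_root);
		if (!rightmost)
			goto done;	/* nothing to reclaim from */

		mz = rb_entry(rightmost, struct mem_cgroup_per_zone,
				tree_node);
		__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
		/*
		 * Check the soft limit excess first; css_tryget() is
		 * only called on a group we will actually use, so no
		 * reference is leaked on the retry path.
		 */
		if (!res_counter_soft_limit_excess(&mz->mem->res) ||
		    !css_tryget(&mz->mem->css))
			goto retry;
	done:
		return mz;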
 
> 
> > +done:
> > +	return mz;
> > +}
> > +
> > +static struct mem_cgroup_per_zone *
> > +mem_cgroup_largest_soft_limit_node(struct mem_cgroup_soft_limit_tree_per_zone
> > +					*stz)
> > +{
> > +	struct mem_cgroup_per_zone *mz;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&stz->lock, flags);
> > +	mz = __mem_cgroup_largest_soft_limit_node(stz);
> > +	spin_unlock_irqrestore(&stz->lock, flags);
> > +	return mz;
> > +}
> > +
> >  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> >  					 struct page_cgroup *pc,
> >  					 bool charge)
> > @@ -1038,31 +1120,59 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> >   * If shrink==true, for avoiding to free too much, this returns immedieately.
> >   */
> >  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > +						struct zone *zone,
> >  						gfp_t gfp_mask,
> > -						unsigned long reclaim_options)
> > +						unsigned long reclaim_options,
> > +						int priority)
> >  {
> >  	struct mem_cgroup *victim;
> >  	int ret, total = 0;
> >  	int loop = 0;
> >  	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> >  	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> > +	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> > +	unsigned long excess = mem_cgroup_get_excess(root_mem);
> >  
> >  	/* If memsw_is_minimum==1, swap-out is of-no-use. */
> >  	if (root_mem->memsw_is_minimum)
> >  		noswap = true;
> >  
> > -	while (loop < 2) {
> > +	while (1) {
> >  		victim = mem_cgroup_select_victim(root_mem);
> > -		if (victim == root_mem)
> > +		if (victim == root_mem) {
> >  			loop++;
> > +			if (loop >= 2) {
> > +				/*
> > +				 * If we have not been able to reclaim
> > +				 * anything, it might because there are
> > +				 * no reclaimable pages under this hierarchy
> > +				 */
> > +				if (!check_soft || !total)
> > +					break;
> > +				/*
> > +				 * We want to do more targetted reclaim.
> > +				 * excess >> 2 is not to excessive so as to
> > +				 * reclaim too much, nor too less that we keep
> > +				 * coming back to reclaim from this cgroup
> > +				 */
> > +				if (total >= (excess >> 2) ||
> > +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> > +					break;
> > +			}
> > +		}
> 
> Hmm..this logic is very unclear for me. Why just exit back as usual reclaim ?
>

Basically what this check does is: once loop >= 2, we exit as in the
previous case (when soft limits were not supported), and we also exit
if the total reclaimed is 0 (perhaps because we are running with swap
turned off). Otherwise, we check whether we have reclaimed a certain
portion of the amount by which the soft limit is exceeded, or whether
the loop count has grown too large, and exit in either case. I hope
this clarifies it.
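
Restated compactly, the exit logic in the quoted hunk amounts to:

	if (victim == root_mem && ++loop >= 2) {
		/*
		 * Hard-limit reclaim exits here as before; soft-limit
		 * reclaim also exits if nothing could be reclaimed.
		 */
		if (!check_soft || !total)
			break;
		/*
		 * Otherwise stop once roughly a quarter of the excess
		 * has been reclaimed, or after too many iterations.
		 */
		if (total >= (excess >> 2) ||
			loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)
			break;
	}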
 
> 
> 
> >  		if (!mem_cgroup_local_usage(&victim->stat)) {
> >  			/* this cgroup's local usage == 0 */
> >  			css_put(&victim->css);
> >  			continue;
> >  		}
> >  		/* we use swappiness of local cgroup */
> > -		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> > -						   get_swappiness(victim));
> > +		if (check_soft)
> > +			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> > +				noswap, get_swappiness(victim), zone,
> > +				zone->zone_pgdat->node_id, priority);
> > +		else
> > +			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> > +						noswap, get_swappiness(victim));
> 
> Do we need 2 functions ?
>

Yes, one does zonelist based reclaim, the other one does shrinking of
a particular zone in a particular node - as identified by
balance_pgdat.
 
> >  		css_put(&victim->css);
> >  		/*
> >  		 * At shrinking usage, we can't check we should stop here or
> > @@ -1072,7 +1182,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> >  		if (shrink)
> >  			return ret;
> >  		total += ret;
> > -		if (mem_cgroup_check_under_limit(root_mem))
> > +		if (check_soft) {
> > +			if (res_counter_check_under_soft_limit(&root_mem->res))
> > +				return total;
> > +		} else if (mem_cgroup_check_under_limit(root_mem))
> >  			return 1 + total;
> >  	}
> >  	return total;
> > @@ -1207,8 +1320,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  		if (!(gfp_mask & __GFP_WAIT))
> >  			goto nomem;
> >  
> > -		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> > -							flags);
> > +		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > +							gfp_mask, flags, -1);
> >  		if (ret)
> >  			continue;
> >  
> > @@ -2002,8 +2115,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> >  		if (!ret)
> >  			break;
> >  
> > -		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > -						   MEM_CGROUP_RECLAIM_SHRINK);
> > +		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
> > +						GFP_KERNEL,
> > +						MEM_CGROUP_RECLAIM_SHRINK, -1);
> 
> What this -1 means ?
>

-1 means don't care; I should clarify that via comments.
 
> >  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> >  		/* Usage is reduced ? */
> >    		if (curusage >= oldusage)
> > @@ -2055,9 +2169,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> >  		if (!ret)
> >  			break;
> >  
> > -		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > +		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> >  						MEM_CGROUP_RECLAIM_NOSWAP |
> > -						MEM_CGROUP_RECLAIM_SHRINK);
> > +						MEM_CGROUP_RECLAIM_SHRINK, -1);
> again.
> 
> >  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> >  		/* Usage is reduced ? */
> >  		if (curusage >= oldusage)
> > @@ -2068,6 +2182,82 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> >  	return ret;
> >  }
> >  
> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > +						gfp_t gfp_mask, int nid,
> > +						int zid, int priority)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
> > +	unsigned long flags;
> > +	unsigned long reclaimed;
> > +	int loop = 0;
> > +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> > +
> > +	if (order > 0)
> > +		return 0;
> > +
> > +	stz = soft_limit_tree_node_zone(nid, zid);
> > +	/*
> > +	 * This loop can run a while, specially if mem_cgroup's continuously
> > +	 * keep exceeding their soft limit and putting the system under
> > +	 * pressure
> > +	 */
> > +	do {
> > +		if (next_mz)
> > +			mz = next_mz;
> > +		else
> > +			mz = mem_cgroup_largest_soft_limit_node(stz);
> > +		if (!mz)
> > +			break;
> > +
> > +		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> > +						gfp_mask,
> > +						MEM_CGROUP_RECLAIM_SOFT,
> > +						priority);
> > +		nr_reclaimed += reclaimed;
> > +		spin_lock_irqsave(&stz->lock, flags);
> > +
> > +		/*
> > +		 * If we failed to reclaim anything from this memory cgroup
> > +		 * it is time to move on to the next cgroup
> > +		 */
> > +		next_mz = NULL;
> > +		if (!reclaimed) {
> > +			do {
> > +				/*
> > +				 * By the time we get the soft_limit lock
> > +				 * again, someone might have aded the
> > +				 * group back on the RB tree. Iterate to
> > +				 * make sure we get a different mem.
> > +				 * mem_cgroup_largest_soft_limit_node returns
> > +				 * NULL if no other cgroup is present on
> > +				 * the tree
> > +				 */
> > +				next_mz =
> > +				__mem_cgroup_largest_soft_limit_node(stz);
> > +			} while (next_mz == mz);
> > +		}
> > +		mz->usage_in_excess =
> > +			res_counter_soft_limit_excess(&mz->mem->res);
> > +		__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
> > +		if (mz->usage_in_excess)
> > +			__mem_cgroup_insert_exceeded(mz->mem, mz, stz);
> 
> plz don't push back "mz" if !reclaimd.
>

We need to do that; what if someone does a swapoff -a and swapon -a
in between? We still need to give mz a chance. No?
 
> 
> 
> > +		spin_unlock_irqrestore(&stz->lock, flags);
> > +		css_put(&mz->mem->css);
> > +		loop++;
> > +		/*
> > +		 * Could not reclaim anything and there are no more
> > +		 * mem cgroups to try or we seem to be looping without
> > +		 * reclaiming anything.
> > +		 */
> > +		if (!nr_reclaimed &&
> > +			(next_mz == NULL ||
> > +			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> > +			break;
> > +	} while (!nr_reclaimed);
> > +	return nr_reclaimed;
> > +}
> > +
> >  /*
> >   * This routine traverse page_cgroup in given list and drop them all.
> >   * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> > @@ -2671,6 +2861,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> >  			INIT_LIST_HEAD(&mz->lists[l]);
> >  		mz->last_tree_update = 0;
> >  		mz->usage_in_excess = 0;
> > +		mz->on_tree = false;
> > +		mz->mem = mem;
> >  	}
> >  	return 0;
> >  }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 86dc0c3..d0f5c4d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1780,11 +1780,39 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  
> > +unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > +						gfp_t gfp_mask, bool noswap,
> > +						unsigned int swappiness,
> > +						struct zone *zone, int nid,
> > +						int priority)
> > +{
> > +	struct scan_control sc = {
> > +		.may_writepage = !laptop_mode,
> > +		.may_unmap = 1,
> > +		.may_swap = !noswap,
> > +		.swap_cluster_max = SWAP_CLUSTER_MAX,
> > +		.swappiness = swappiness,
> > +		.order = 0,
> > +		.mem_cgroup = mem,
> > +		.isolate_pages = mem_cgroup_isolate_pages,
> > +	};
> > +	nodemask_t nm  = nodemask_of_node(nid);
> > +
> > +	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> > +			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> > +	sc.nodemask = &nm;
> > +	sc.nr_reclaimed = 0;
> > +	sc.nr_scanned = 0;
> > +	shrink_zone(priority, zone, &sc);
> > +	return sc.nr_reclaimed;
> > +}
> > +
> >  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >  					   gfp_t gfp_mask,
> >  					   bool noswap,
> >  					   unsigned int swappiness)
> >  {
> > +	struct zonelist *zonelist;
> >  	struct scan_control sc = {
> >  		.may_writepage = !laptop_mode,
> >  		.may_unmap = 1,
> > @@ -1796,7 +1824,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >  		.isolate_pages = mem_cgroup_isolate_pages,
> >  		.nodemask = NULL, /* we don't care the placement */
> >  	};
> > -	struct zonelist *zonelist;
> >  
> >  	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> >  			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> > @@ -1918,6 +1945,7 @@ loop_again:
> >  		for (i = 0; i <= end_zone; i++) {
> >  			struct zone *zone = pgdat->node_zones + i;
> >  			int nr_slab;
> > +			int nid, zid;
> >  
> >  			if (!populated_zone(zone))
> >  				continue;
> > @@ -1932,6 +1960,15 @@ loop_again:
> >  			temp_priority[i] = priority;
> >  			sc.nr_scanned = 0;
> >  			note_zone_scanning_priority(zone, priority);
> > +
> > +			nid = pgdat->node_id;
> > +			zid = zone_idx(zone);
> > +			/*
> > +			 * Call soft limit reclaim before calling shrink_zone.
> > +			 * For now we ignore the return value
> > +			 */
> > +			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask,
> > +							nid, zid, priority);
> >  			/*
> >  			 * We put equal pressure on every zone, unless one
> >  			 * zone has way too many pages free already.
> > 
> 
> 
> Thanks,
> -Kame
>

Thanks for the review! 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8)
  2009-07-10  6:47     ` Balbir Singh
@ 2009-07-10  7:16       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10  7:16 UTC (permalink / raw)
  To: balbir; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

On Fri, 10 Jul 2009 12:17:23 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:21:35]:
> 
> > On Thu, 09 Jul 2009 22:45:01 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > Feature: Organize cgroups over soft limit in a RB-Tree
> > > 
> > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > > +	while (*p) {
> > 
> > I feel this *p should be loaded after taking spinlock(&stz->lock) rather than top
> > of function. No?
> 
> No.. since the root remains constant once loaded. Am I missing
> something?
> 
No, I just missed it.


> 
> > 
> > > +		parent = *p;
> > > +		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
> > > +					tree_node);
> > > +		if (mz->usage_in_excess < mz_node->usage_in_excess)
> > > +			p = &(*p)->rb_left;
> > > +		/*
> > > +		 * We can't avoid mem cgroups that are over their soft
> > > +		 * limit by the same amount
> > > +		 */
> > > +		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
> > > +			p = &(*p)->rb_right;
> > > +	}
> > > +	rb_link_node(&mz->tree_node, parent, p);
> > > +	rb_insert_color(&mz->tree_node, &stz->rb_root);
> > > +	mz->last_tree_update = jiffies;
> > > +	spin_unlock_irqrestore(&stz->lock, flags);
> > > +}
> > > +
> > > +static void
> > > +mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> > > +				struct mem_cgroup_per_zone *mz,
> > > +				struct mem_cgroup_soft_limit_tree_per_zone *stz)
> > > +{
> > > +	unsigned long flags;
> > > +	spin_lock_irqsave(&stz->lock, flags);
> > why IRQ save ? again.
> >
> 
> Will remove
>  
> > > +	rb_erase(&mz->tree_node, &stz->rb_root);
> > > +	spin_unlock_irqrestore(&stz->lock, flags);
> > > +}
> > > +
> > > +static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
> > > +					bool over_soft_limit,
> > > +					struct page *page)
> > > +{
> > > +	unsigned long next_update;
> > > +	struct page_cgroup *pc;
> > > +	struct mem_cgroup_per_zone *mz;
> > > +
> > > +	if (!over_soft_limit)
> > > +		return false;
> > > +
> > > +	pc = lookup_page_cgroup(page);
> > > +	if (unlikely(!pc))
> > > +		return false;
> > > +	mz = mem_cgroup_zoneinfo(mem, page_cgroup_nid(pc), page_cgroup_zid(pc));
> > 
> > mz = page_cgroup_zoneinfo(pc)
> > or
> > mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zid(page))
> >
> 
> Will change it.
>  
> > > +
> > > +	next_update = mz->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> > > +	if (time_after(jiffies, next_update))
> > > +		return true;
> > > +
> > > +	return false;
> > > +}
> > > +
> > > +static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
> > > +{
> > > +	unsigned long long prev_usage_in_excess, new_usage_in_excess;
> > > +	bool updated_tree = false;
> > > +	unsigned long flags;
> > > +	struct page_cgroup *pc;
> > > +	struct mem_cgroup_per_zone *mz;
> > > +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> > > +
> > > +	/*
> > > +	 * As long as the page is around, pc's are always
> > > +	 * around and so is the mz, in the remove path
> > > +	 * we are yet to do the css_put(). I don't think
> > > +	 * we need to hold page cgroup lock.
> > > +	 */
> > IIUC, at updating tree,we grab this page which is near-to-be-mapped or
> > near-to-be-in-radix-treee. If so, not necessary to be annoyied.
> 
> Not sure I understand your comment about annoyied (annoyed?)
> 
Ah, sorry, I wanted to say "pc is always valid here"

> > 
> > > +	pc = lookup_page_cgroup(page);
> > > +	if (unlikely(!pc))
> > > +		return;
> > 
> > I bet this can be BUG_ON().
> 
> In the new version we will not need pc
> 
ok.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8)
  2009-07-10  6:53     ` Balbir Singh
@ 2009-07-10  7:30       ` KAMEZAWA Hiroyuki
  2009-07-10  7:49         ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10  7:30 UTC (permalink / raw)
  To: balbir; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

On Fri, 10 Jul 2009 12:23:06 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:30:26]:
> 
> > On Thu, 09 Jul 2009 22:45:12 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > Feature: Implement reclaim from groups over their soft limit
> > > 
> > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > -	while (loop < 2) {
> > > +	while (1) {
> > >  		victim = mem_cgroup_select_victim(root_mem);
> > > -		if (victim == root_mem)
> > > +		if (victim == root_mem) {
> > >  			loop++;
> > > +			if (loop >= 2) {
> > > +				/*
> > > +				 * If we have not been able to reclaim
> > > +				 * anything, it might because there are
> > > +				 * no reclaimable pages under this hierarchy
> > > +				 */
> > > +				if (!check_soft || !total)
> > > +					break;
> > > +				/*
> > > +				 * We want to do more targetted reclaim.
> > > +				 * excess >> 2 is not to excessive so as to
> > > +				 * reclaim too much, nor too less that we keep
> > > +				 * coming back to reclaim from this cgroup
> > > +				 */
> > > +				if (total >= (excess >> 2) ||
> > > +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> > > +					break;
> > > +			}
> > > +		}
> > 
> > Hmm.. this logic is very unclear to me. Why not just exit, as the usual reclaim does?
> >
> 
> Basically, this check does the following: once loop >= 2, we exit as in
> the previous case (when soft limits were not supported), or exit if the
> total reclaimed is 0 (perhaps because we are running with swap turned
> off). Otherwise, we exit once we have reclaimed a certain portion of the
> amount by which we exceed the soft limit, or if the loop count grows too
> large. I hope this clarifies.
>  
+#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(10000)
+#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
+
.....too big. 

IMO,
> > > +				if (total >= (excess >> 2) ||
> > > +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> > > +					break;
is unnecessary. Do you want to block kswapd here for such a long time ?
loops > 2 is definitely enough, I believe.
If you find out loops>2 is not enough later, just retrying soft limit is enough.
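
(For illustration, one reading of this suggestion -- a sketch only, not
the posted patch -- is to keep the loop termination as simple as before
and rely on a later soft limit reclaim pass rather than looping here:)

		if (victim == root_mem) {
			loop++;
			/*
			 * Two passes over the hierarchy are enough; if
			 * that did not reclaim enough, soft limit reclaim
			 * will simply be retried later.
			 */
			if (loop >= 2)
				break;
		}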



> > 
> > 
> > >  		if (!mem_cgroup_local_usage(&victim->stat)) {
> > >  			/* this cgroup's local usage == 0 */
> > >  			css_put(&victim->css);
> > >  			continue;
> > >  		}
> > >  		/* we use swappiness of local cgroup */
> > > -		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> > > -						   get_swappiness(victim));
> > > +		if (check_soft)
> > > +			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> > > +				noswap, get_swappiness(victim), zone,
> > > +				zone->zone_pgdat->node_id, priority);
> > > +		else
> > > +			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> > > +						noswap, get_swappiness(victim));
> > 
> > Do we need 2 functions ?
> >
> 
> Yes, one does zonelist based reclaim, the other one does shrinking of
> a particular zone in a particular node - as identified by
> balance_pgdat.
>  
> > >  		css_put(&victim->css);
> > >  		/*
> > >  		 * At shrinking usage, we can't check we should stop here or
> > > @@ -1072,7 +1182,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > >  		if (shrink)
> > >  			return ret;
> > >  		total += ret;
> > > -		if (mem_cgroup_check_under_limit(root_mem))
> > > +		if (check_soft) {
> > > +			if (res_counter_check_under_soft_limit(&root_mem->res))
> > > +				return total;
> > > +		} else if (mem_cgroup_check_under_limit(root_mem))
> > >  			return 1 + total;
> > >  	}
> > >  	return total;
> > > @@ -1207,8 +1320,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > >  		if (!(gfp_mask & __GFP_WAIT))
> > >  			goto nomem;
> > >  
> > > -		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> > > -							flags);
> > > +		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > > +							gfp_mask, flags, -1);
> > >  		if (ret)
> > >  			continue;
> > >  
> > > @@ -2002,8 +2115,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> > >  		if (!ret)
> > >  			break;
> > >  
> > > -		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > > -						   MEM_CGROUP_RECLAIM_SHRINK);
> > > +		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
> > > +						GFP_KERNEL,
> > > +						MEM_CGROUP_RECLAIM_SHRINK, -1);
> > 
> > What does this -1 mean?
> >
> 
> -1 means don't care, I should clarify that via comments.
>  

Hmm, rather than a comment,
#define DONT_CARE_PRIORITY	(-1)
or something like that is self-explanatory.
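
(For illustration, the suggestion would read roughly as below; the call
site is taken from the quoted hunk, the macro is the reviewer's, with the
spelling corrected:)

#define DONT_CARE_PRIORITY	(-1)	/* callee ignores the priority */

		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
						GFP_KERNEL,
						MEM_CGROUP_RECLAIM_SHRINK,
						DONT_CARE_PRIORITY);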


> > >  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> > >  		/* Usage is reduced ? */
> > >    		if (curusage >= oldusage)
> > > @@ -2055,9 +2169,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> > >  		if (!ret)
> > >  			break;
> > >  
> > > -		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > > +		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> > >  						MEM_CGROUP_RECLAIM_NOSWAP |
> > > -						MEM_CGROUP_RECLAIM_SHRINK);
> > > +						MEM_CGROUP_RECLAIM_SHRINK, -1);
> > again.
> > 
> > >  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> > >  		/* Usage is reduced ? */
> > >  		if (curusage >= oldusage)
> > > @@ -2068,6 +2182,82 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> > >  	return ret;
> > >  }
> > >  
> > > +unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > > +						gfp_t gfp_mask, int nid,
> > > +						int zid, int priority)
> > > +{
> > > +	unsigned long nr_reclaimed = 0;
> > > +	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
> > > +	unsigned long flags;
> > > +	unsigned long reclaimed;
> > > +	int loop = 0;
> > > +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> > > +
> > > +	if (order > 0)
> > > +		return 0;
> > > +
> > > +	stz = soft_limit_tree_node_zone(nid, zid);
> > > +	/*
> > > +	 * This loop can run for a while, especially if mem_cgroups
> > > +	 * continuously keep exceeding their soft limit and putting the
> > > +	 * system under pressure
> > > +	 */
> > > +	do {
> > > +		if (next_mz)
> > > +			mz = next_mz;
> > > +		else
> > > +			mz = mem_cgroup_largest_soft_limit_node(stz);
> > > +		if (!mz)
> > > +			break;
> > > +
> > > +		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> > > +						gfp_mask,
> > > +						MEM_CGROUP_RECLAIM_SOFT,
> > > +						priority);
> > > +		nr_reclaimed += reclaimed;
> > > +		spin_lock_irqsave(&stz->lock, flags);
> > > +
> > > +		/*
> > > +		 * If we failed to reclaim anything from this memory cgroup
> > > +		 * it is time to move on to the next cgroup
> > > +		 */
> > > +		next_mz = NULL;
> > > +		if (!reclaimed) {
> > > +			do {
> > > +				/*
> > > +				 * By the time we get the soft_limit lock
> > > +				 * again, someone might have added the
> > > +				 * group back on the RB tree. Iterate to
> > > +				 * make sure we get a different mem.
> > > +				 * mem_cgroup_largest_soft_limit_node returns
> > > +				 * NULL if no other cgroup is present on
> > > +				 * the tree
> > > +				 */
> > > +				next_mz =
> > > +				__mem_cgroup_largest_soft_limit_node(stz);
> > > +			} while (next_mz == mz);
> > > +		}
> > > +		mz->usage_in_excess =
> > > +			res_counter_soft_limit_excess(&mz->mem->res);
> > > +		__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
> > > +		if (mz->usage_in_excess)
> > > +			__mem_cgroup_insert_exceeded(mz->mem, mz, stz);
> > 
> > plz don't push back "mz" if !reclaimed.
> >
> 
> We need to do that; what if someone does a swapoff -a and swapon -a in
> between? We still need to give mz a chance, no?
>  
kswapd's original behavior will work well in such a special case, no?

In the !reclaimed case, the cost of pushing it back is larger than the benefit, I think.
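
(For illustration, one reading of the reviewer's request -- a rough sketch
against the quoted hunk, not the final code -- is to re-insert the group
only when this pass actually made progress:)

		__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
		mz->usage_in_excess =
			res_counter_soft_limit_excess(&mz->mem->res);
		/* only queue the group again if something was reclaimed */
		if (reclaimed && mz->usage_in_excess)
			__mem_cgroup_insert_exceeded(mz->mem, mz, stz);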

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8)
  2009-07-10  7:30       ` KAMEZAWA Hiroyuki
@ 2009-07-10  7:49         ` Balbir Singh
  2009-07-10 10:56           ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-07-10  7:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 16:30:56]:

> On Fri, 10 Jul 2009 12:23:06 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:30:26]:
> > 
> > > On Thu, 09 Jul 2009 22:45:12 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > Feature: Implement reclaim from groups over their soft limit
> > > > 
> > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > -	while (loop < 2) {
> > > > +	while (1) {
> > > >  		victim = mem_cgroup_select_victim(root_mem);
> > > > -		if (victim == root_mem)
> > > > +		if (victim == root_mem) {
> > > >  			loop++;
> > > > +			if (loop >= 2) {
> > > > +				/*
> > > > +				 * If we have not been able to reclaim
> > > > +				 * anything, it might be because there are
> > > > +				 * no reclaimable pages under this hierarchy
> > > > +				 */
> > > > +				if (!check_soft || !total)
> > > > +					break;
> > > > +				/*
> > > > +				 * We want to do more targeted reclaim.
> > > > +				 * excess >> 2 is not too excessive, so we
> > > > +				 * don't reclaim too much, nor too little, so
> > > > +				 * we don't keep coming back to reclaim from
> > > > +				 * this cgroup
> > > > +				 */
> > > > +				if (total >= (excess >> 2) ||
> > > > +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> > > > +					break;
> > > > +			}
> > > > +		}
> > > 
> > > Hmm.. this logic is very unclear to me. Why not just exit, as the usual reclaim does?
> > >
> > 
> > Basically, this check does the following: once loop >= 2, we exit as in
> > the previous case (when soft limits were not supported), or exit if the
> > total reclaimed is 0 (perhaps because we are running with swap turned
> > off). Otherwise, we exit once we have reclaimed a certain portion of the
> > amount by which we exceed the soft limit, or if the loop count grows too
> > large. I hope this clarifies.
> >  
> +#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(10000)
> +#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
> +
> .....too big. 
> 

Agreed, will cut it short

> IMO,
> > > > +				if (total >= (excess >> 2) ||
> > > > +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> > > > +					break;
> is unnecessary. Do you want to block kswapd here for such a long time ?
> loops > 2 is definitely enough, I believe.
> If you find out loops>2 is not enough later, just retrying soft limit is enough.
> 


Yes, worth experimenting with, I'll redo with the special code
removed.
> 
> 
> > > 
> > > 
> > > >  		if (!mem_cgroup_local_usage(&victim->stat)) {
> > > >  			/* this cgroup's local usage == 0 */
> > > >  			css_put(&victim->css);
> > > >  			continue;
> > > >  		}
> > > >  		/* we use swappiness of local cgroup */
> > > > -		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> > > > -						   get_swappiness(victim));
> > > > +		if (check_soft)
> > > > +			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> > > > +				noswap, get_swappiness(victim), zone,
> > > > +				zone->zone_pgdat->node_id, priority);
> > > > +		else
> > > > +			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> > > > +						noswap, get_swappiness(victim));
> > > 
> > > Do we need 2 functions ?
> > >
> > 
> > Yes, one does zonelist based reclaim, the other one does shrinking of
> > a particular zone in a particular node - as identified by
> > balance_pgdat.
> >  
> > > >  		css_put(&victim->css);
> > > >  		/*
> > > >  		 * At shrinking usage, we can't check we should stop here or
> > > > @@ -1072,7 +1182,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > > >  		if (shrink)
> > > >  			return ret;
> > > >  		total += ret;
> > > > -		if (mem_cgroup_check_under_limit(root_mem))
> > > > +		if (check_soft) {
> > > > +			if (res_counter_check_under_soft_limit(&root_mem->res))
> > > > +				return total;
> > > > +		} else if (mem_cgroup_check_under_limit(root_mem))
> > > >  			return 1 + total;
> > > >  	}
> > > >  	return total;
> > > > @@ -1207,8 +1320,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > >  		if (!(gfp_mask & __GFP_WAIT))
> > > >  			goto nomem;
> > > >  
> > > > -		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> > > > -							flags);
> > > > +		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > > > +							gfp_mask, flags, -1);
> > > >  		if (ret)
> > > >  			continue;
> > > >  
> > > > @@ -2002,8 +2115,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> > > >  		if (!ret)
> > > >  			break;
> > > >  
> > > > -		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > > > -						   MEM_CGROUP_RECLAIM_SHRINK);
> > > > +		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
> > > > +						GFP_KERNEL,
> > > > +						MEM_CGROUP_RECLAIM_SHRINK, -1);
> > > 
> > > What does this -1 mean?
> > >
> > 
> > -1 means don't care, I should clarify that via comments.
> >  
> 
> Hmm, rather than a comment,
> #define DONT_CARE_PRIORITY	(-1)
> or something like that is self-explanatory.
>

Sure, will do
 
> 
> > > >  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> > > >  		/* Usage is reduced ? */
> > > >    		if (curusage >= oldusage)
> > > > @@ -2055,9 +2169,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> > > >  		if (!ret)
> > > >  			break;
> > > >  
> > > > -		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > > > +		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> > > >  						MEM_CGROUP_RECLAIM_NOSWAP |
> > > > -						MEM_CGROUP_RECLAIM_SHRINK);
> > > > +						MEM_CGROUP_RECLAIM_SHRINK, -1);
> > > again.
> > > 
> > > >  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> > > >  		/* Usage is reduced ? */
> > > >  		if (curusage >= oldusage)
> > > > @@ -2068,6 +2182,82 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> > > >  	return ret;
> > > >  }
> > > >  
> > > > +unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > > > +						gfp_t gfp_mask, int nid,
> > > > +						int zid, int priority)
> > > > +{
> > > > +	unsigned long nr_reclaimed = 0;
> > > > +	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
> > > > +	unsigned long flags;
> > > > +	unsigned long reclaimed;
> > > > +	int loop = 0;
> > > > +	struct mem_cgroup_soft_limit_tree_per_zone *stz;
> > > > +
> > > > +	if (order > 0)
> > > > +		return 0;
> > > > +
> > > > +	stz = soft_limit_tree_node_zone(nid, zid);
> > > > +	/*
> > > > +	 * This loop can run for a while, especially if mem_cgroups
> > > > +	 * continuously keep exceeding their soft limit and putting the
> > > > +	 * system under pressure
> > > > +	 */
> > > > +	do {
> > > > +		if (next_mz)
> > > > +			mz = next_mz;
> > > > +		else
> > > > +			mz = mem_cgroup_largest_soft_limit_node(stz);
> > > > +		if (!mz)
> > > > +			break;
> > > > +
> > > > +		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> > > > +						gfp_mask,
> > > > +						MEM_CGROUP_RECLAIM_SOFT,
> > > > +						priority);
> > > > +		nr_reclaimed += reclaimed;
> > > > +		spin_lock_irqsave(&stz->lock, flags);
> > > > +
> > > > +		/*
> > > > +		 * If we failed to reclaim anything from this memory cgroup
> > > > +		 * it is time to move on to the next cgroup
> > > > +		 */
> > > > +		next_mz = NULL;
> > > > +		if (!reclaimed) {
> > > > +			do {
> > > > +				/*
> > > > +				 * By the time we get the soft_limit lock
> > > > +				 * again, someone might have added the
> > > > +				 * group back on the RB tree. Iterate to
> > > > +				 * make sure we get a different mem.
> > > > +				 * mem_cgroup_largest_soft_limit_node returns
> > > > +				 * NULL if no other cgroup is present on
> > > > +				 * the tree
> > > > +				 */
> > > > +				next_mz =
> > > > +				__mem_cgroup_largest_soft_limit_node(stz);
> > > > +			} while (next_mz == mz);
> > > > +		}
> > > > +		mz->usage_in_excess =
> > > > +			res_counter_soft_limit_excess(&mz->mem->res);
> > > > +		__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
> > > > +		if (mz->usage_in_excess)
> > > > +			__mem_cgroup_insert_exceeded(mz->mem, mz, stz);
> > > 
> > > plz don't push back "mz" if !reclaimed.
> > >
> > 
> > We need to do that; what if someone does a swapoff -a and swapon -a in
> > between? We still need to give mz a chance, no?
> >  
> kswapd's original behavior will work well in such a special case, no?
> 
> In the !reclaimed case, the cost of pushing it back is larger than the benefit, I think.
>

OK, I'll try it out. 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8)
  2009-07-10  5:21   ` KAMEZAWA Hiroyuki
  2009-07-10  6:47     ` Balbir Singh
@ 2009-07-10  8:05     ` Balbir Singh
  2009-07-10  8:14       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-07-10  8:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:21:35]:

> 
> As pointed out in several times, plz avoid using jiffies.

Sorry, I forgot to respond to this part. Are you suggesting we avoid
jiffies (and use ktime_t instead), or avoid the time-based approach
altogether? I responded to the time-based versus scanning approach in
the earlier mail.

-- 
	Balbir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8)
  2009-07-10  8:05     ` Balbir Singh
@ 2009-07-10  8:14       ` KAMEZAWA Hiroyuki
  2009-07-10  8:20         ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10  8:14 UTC (permalink / raw)
  To: balbir; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

On Fri, 10 Jul 2009 13:35:57 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:21:35]:
> 
> > 
> > As pointed out in several times, plz avoid using jiffies.
> 
> Sorry, I forgot to respond to this part. Are you suggesting we avoid
> jiffies (and use ktime_t instead), or avoid the time-based approach
> altogether? I responded to the time-based versus scanning approach in
> the earlier mail.
> 
> -

IIUC, it was a comment on old patches, "don't use jiffies, count events"
(by Andrew Morton?). I fully agree with that.
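
(For illustration, "count events" would mean something along the lines of
the sketch below: bump a counter on every charge/uncharge and update the
tree every N events rather than comparing jiffies. The field and constant
names here are hypothetical, not from the patch.)

#define SOFTLIMIT_EVENTS_THRESH	(1000)		/* hypothetical value */

static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
{
	/*
	 * mem->events would be a new counter, per-cpu in practice;
	 * races are ignored here for brevity.
	 */
	if (++mem->events < SOFTLIMIT_EVENTS_THRESH)
		return false;

	mem->events = 0;
	return true;	/* time to update the soft limit tree */
}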

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8)
  2009-07-10  8:14       ` KAMEZAWA Hiroyuki
@ 2009-07-10  8:20         ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2009-07-10  8:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 17:14:57]:

> On Fri, 10 Jul 2009 13:35:57 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:21:35]:
> > 
> > > 
> > > As pointed out in several times, plz avoid using jiffies.
> > 
> > Sorry, I forgot to respond to this part. Are you suggesting we avoid
> > jiffies (and use ktime_t instead), or avoid the time-based approach
> > altogether? I responded to the time-based versus scanning approach in
> > the earlier mail.
> > 
> > -
> 
> IIUC, it was a comment on old patches, "don't use jiffies, count events"
> (by Andrew Morton?). I fully agree with that.
>

OK, so I played around with it and found that per-cpu events did not
give me good behaviour. Let me retry my approach and experiments and
see if it works well.


-- 
	Balbir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8)
  2009-07-10  7:49         ` Balbir Singh
@ 2009-07-10 10:56           ` Balbir Singh
  2009-07-10 14:15             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-07-10 10:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

* Balbir Singh <balbir@linux.vnet.ibm.com> [2009-07-10 13:19:06]:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 16:30:56]:
> 
> > On Fri, 10 Jul 2009 12:23:06 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 14:30:26]:
> > > 
> > > > On Thu, 09 Jul 2009 22:45:12 +0530
> > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > Feature: Implement reclaim from groups over their soft limit
> > > > > 
> > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > > -	while (loop < 2) {
> > > > > +	while (1) {
> > > > >  		victim = mem_cgroup_select_victim(root_mem);
> > > > > -		if (victim == root_mem)
> > > > > +		if (victim == root_mem) {
> > > > >  			loop++;
> > > > > +			if (loop >= 2) {
> > > > > +				/*
> > > > > +				 * If we have not been able to reclaim
> > > > > +				 * anything, it might be because there are
> > > > > +				 * no reclaimable pages under this hierarchy
> > > > > +				 */
> > > > > +				if (!check_soft || !total)
> > > > > +					break;
> > > > > +				/*
> > > > > +				 * We want to do more targeted reclaim.
> > > > > +				 * excess >> 2 is not too excessive, so we
> > > > > +				 * don't reclaim too much, nor too little, so
> > > > > +				 * we don't keep coming back to reclaim from
> > > > > +				 * this cgroup
> > > > > +				 */
> > > > > +				if (total >= (excess >> 2) ||
> > > > > +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> > > > > +					break;
> > > > > +			}
> > > > > +		}
> > > > 
> > > > Hmm.. this logic is very unclear to me. Why not just exit, as the usual reclaim does?
> > > >
> > > 
> > > Basically, this check does the following: once loop >= 2, we exit as in
> > > the previous case (when soft limits were not supported), or exit if the
> > > total reclaimed is 0 (perhaps because we are running with swap turned
> > > off). Otherwise, we exit once we have reclaimed a certain portion of the
> > > amount by which we exceed the soft limit, or if the loop count grows too
> > > large. I hope this clarifies.
> > >  
> > +#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(10000)
> > +#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
> > +
> > .....too big. 
> > 
> 
> Agreed, will cut it short
> 
> > IMO,
> > > > > +				if (total >= (excess >> 2) ||
> > > > > +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> > > > > +					break;
> > is unnecessary. Do you want to block kswapd here for such a long time ?
> > loops > 2 is definitely enough, I believe.
> > If you find out loops>2 is not enough later, just retrying soft limit is enough.
> > 
> 
> 
> Yes, worth experimenting with, I'll redo with the special code
> removed.


OK, so I experimented with it, I found the following behaviour

1. We try to reclaim; the priority value is high, so the number of pages
   scanned is low, and hence memory cgroup zone reclaim returns 0 (no
   pages could be reclaimed).
2. Now regular reclaim from balance_pgdat() is called; it is able to
   shrink the global LRU, and hence some other mem cgroup, thus breaking
   soft limit semantics.
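
(For reference, point 1 above follows from the way the per-zone scan
target shrinks with the priority value; paraphrased from shrink_zone(),
not the exact kernel code:)

	/* lru_pages_in_zone stands for this cgroup's LRU size in the zone */
	unsigned long scan = lru_pages_in_zone;
	if (priority)
		scan >>= priority;	/* at priority 12 a small LRU scans ~0 pages */
	/* scan == 0 means nothing is reclaimed from a small mem cgroup */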


> > > > > +			res_counter_soft_limit_excess(&mz->mem->res);
> > > > > +		__mem_cgroup_remove_exceeded(mz->mem, mz, stz);
> > > > > +		if (mz->usage_in_excess)
> > > > > +			__mem_cgroup_insert_exceeded(mz->mem, mz, stz);
> > > > 
> > > > plz don't push back "mz" if !reclaimed.
> > > >
> > > 
> > > We need to do that; what if someone does a swapoff -a and swapon -a in
> > > between? We still need to give mz a chance, no?
> > >  
> > kswapd's original behavior will work well in such a special case, no?
> > 
> > In the !reclaimed case, the cost of pushing it back is larger than the benefit, I think.
> >
> 
> OK, I'll try it out. 
>

I tried, it did not work out well, please see above. 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8)
  2009-07-10 10:56           ` Balbir Singh
@ 2009-07-10 14:15             ` KAMEZAWA Hiroyuki
  2009-07-10 14:22               ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10 14:15 UTC (permalink / raw)
  To: balbir; +Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

Balbir Singh wrote:
> * Balbir Singh <balbir@linux.vnet.ibm.com> [2009-07-10 13:19:06]:
>>
>> Yes, worth experimenting with, I'll redo with the special code
>> removed.
>
>
> OK, so I experimented with it, I found the following behaviour
>
> 1. We try to reclaim, priority is high, scanned pages are low and
>    hence memory cgroup zone reclaim returns 0 (no pages could be
>    reclaimed).
> 2. Now regular reclaim from balance_pgdat() is called, it is able
>    to shrink from global LRU and hence some other mem cgroup, thus
>    breaking soft limit semantics.
>
IMO, "breaking soft limit" cannot be an excuse for delaying kswapd too much.


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8)
  2009-07-10 14:15             ` KAMEZAWA Hiroyuki
@ 2009-07-10 14:22               ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2009-07-10 14:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, linux-mm, lizf, KOSAKI Motohiro

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-07-10 23:15:20]:

> Balbir Singh wrote:
> > * Balbir Singh <balbir@linux.vnet.ibm.com> [2009-07-10 13:19:06]:
> >>
> >> Yes, worth experimenting with, I'll redo with the special code
> >> removed.
> >
> >
> > OK, so I experimented with it, I found the following behaviour
> >
> > 1. We try to reclaim, priority is high, scanned pages are low and
> >    hence memory cgroup zone reclaim returns 0 (no pages could be
> >    reclaimed).
> > 2. Now regular reclaim from balance_pgdat() is called, it is able
> >    to shrink from global LRU and hence some other mem cgroup, thus
> >    breaking soft limit semantics.
> >
> IMO, "breaking soft limit" cannot be an excuse for delaying kswapd too much.
>

Hmmm... I agree in principle, but if soft limits are set, they override
where we should be reclaiming from. The delay IMHO is not very high; I've
run tests without setting any soft limits but with the soft limits
feature enabled, and I don't see anything going bad or any overhead.

I've just posted v9 without the changes. I'll do some runs with your
suggestion and see what the complete impact is.
 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2009-07-10 13:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-09 17:14 [RFC][PATCH 0/5] Memory controller soft limit patches (v8) Balbir Singh
2009-07-09 17:14 ` [RFC][PATCH 1/5] Memory controller soft limit documentation (v8) Balbir Singh
2009-07-10  5:32   ` KAMEZAWA Hiroyuki
2009-07-10  6:48     ` Balbir Singh
2009-07-09 17:14 ` [RFC][PATCH 2/5] Memory controller soft limit interface (v8) Balbir Singh
2009-07-09 17:15 ` [RFC][PATCH 3/5] Memory controller soft limit organize cgroups (v8) Balbir Singh
2009-07-10  5:21   ` KAMEZAWA Hiroyuki
2009-07-10  6:47     ` Balbir Singh
2009-07-10  7:16       ` KAMEZAWA Hiroyuki
2009-07-10  8:05     ` Balbir Singh
2009-07-10  8:14       ` KAMEZAWA Hiroyuki
2009-07-10  8:20         ` Balbir Singh
2009-07-09 17:15 ` [RFC][PATCH 4/5] Memory controller soft limit refactor reclaim flags (v8) Balbir Singh
2009-07-09 17:15 ` [RFC][PATCH 5/5] Memory controller soft limit reclaim on contention (v8) Balbir Singh
2009-07-10  5:30   ` KAMEZAWA Hiroyuki
2009-07-10  6:53     ` Balbir Singh
2009-07-10  7:30       ` KAMEZAWA Hiroyuki
2009-07-10  7:49         ` Balbir Singh
2009-07-10 10:56           ` Balbir Singh
2009-07-10 14:15             ` KAMEZAWA Hiroyuki
2009-07-10 14:22               ` Balbir Singh
2009-07-10  4:53 ` [RFC][PATCH 0/5] Memory controller soft limit patches (v8) KAMEZAWA Hiroyuki
2009-07-10  5:53   ` Balbir Singh
