All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] Memory controller soft limit patches (v5)
@ 2009-03-12 17:56 Balbir Singh
  2009-03-12 17:56 ` [PATCH 1/4] Memory controller soft limit documentation (v5) Balbir Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-12 17:56 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
	Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

New Feature: Soft limits for memory resource controller.

Changelog v5...v4
1. Several changes to the reclaim logic, please see the patch 4 (reclaim on
   contention). I've experimented with several possibilities for reclaim
   and chose to come back to this due to the excellent behaviour seen while
   testing the patchset.
2. Reduced the overhead of soft limits on resource counters very significantly.
   Reaim benchmark now shows almost no drop in performance.

Changelog v4...v3
1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
   while doing soft limit reclaim. We don't record priorities while
   doing soft reclaim
2. Some of the overheads associated with soft limits (like calculating
   excess each time) is eliminated
3. The time_after(jiffies, 0) bug has been fixed
4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
   and at the same time tasks are increasing the memory footprint and causing
   the mem cgroup to exceed its soft limit.

Changelog v3...v2
1. Implemented several review comments from Kosaki-San and Kamezawa-San
   Please see individual changelogs for changes

Changelog v2...v1
1. Soft limits now support hierarchies
2. Use spinlocks instead of mutexes for synchronization of the RB tree

Here is v5 of the new soft limit implementation. Soft limits is a new feature
for the memory resource controller, something similar has existed in the
group scheduler in the form of shares. The CPU controllers interpretation
of shares is very different though. 

Soft limits are the most useful feature to have for environments where
the administrator wants to overcommit the system, such that only on memory
contention do the limits become active. The current soft limits implementation
provides a soft_limit_in_bytes interface for the memory controller and not
for memory+swap controller. The implementation maintains an RB-Tree of groups
that exceed their soft limit and starts reclaiming from the group that
exceeds this limit by the maximum amount.

Kamezawa-San has another patchset for soft limits, but I don't like the reclaim logic of watermark based balancing of zones for global memory cgroup limits.
I also don't like the data structures, a list does not scale well. Kamezawa's
objection to this patch is the cost of sorting, which is really negligible,
since the updates happen at a fixed interval (curently four times a second).
I however do like the priority feature in Kamezawa's patchset. The feature
can be easily adopted to this incrementally.

TODOs

1. The current implementation maintains the delta from the soft limit
   and pushes back groups to their soft limits, a ratio of delta/soft_limit
   might be more useful
2. It would be nice to have more targetted reclaim (in terms of pages to
   recalim) interface. So that groups are pushed back, close to their soft
   limits.

Tests
-----

I've run two memory intensive workloads with differing soft limits and
seen that they are pushed back to their soft limit on contention. Their usage
was their soft limit plus additional memory that they were able to grab
on the system. Soft limit can take a while before we see the expected
results.

The other tests I've run are
1. Deletion of groups while soft limit is in progress in the hierarchy
2. Setting the soft limit to zero and running other groups with non-zero
   soft limits.
3. Setting the soft limit to zero and testing if the mem cgroup is able
   to use available memory

Please review, comment.

Series
------

memcg-soft-limit-documentation.patch
memcg-add-soft-limit-interface.patch
memcg-organize-over-soft-limit-groups.patch
memcg-soft-limit-reclaim-on-contention.patch



-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/4] Memory controller soft limit documentation (v5)
  2009-03-12 17:56 [PATCH 0/4] Memory controller soft limit patches (v5) Balbir Singh
@ 2009-03-12 17:56 ` Balbir Singh
  2009-03-12 17:56 ` [PATCH 2/4] Memory controller soft limit interface (v5) Balbir Singh
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-12 17:56 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
	Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki

Feature: Add documentation for soft limits

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt |   31 ++++++++++++++++++++++++++++++-
 1 files changed, 30 insertions(+), 1 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index a98a7fe..c5f73d9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -360,7 +360,36 @@ cgroups created below it.
 
 NOTE2: This feature can be enabled/disabled per subtree.
 
-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory control groups
+are pushed back to their soft limits. If the soft limit of each control
+group is very high, they are pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+7.1 Interface
+
+Soft limits can be setup by using the following commands (in this example we
+assume a soft limit of 256 megabytes)
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+NOTE1: Soft limits take effect over a long period of time, since they involve
+       reclaiming memory for balancing between memory cgroups
+NOTE2: It is recommended to set the soft limit always below the hard limit,
+       otherwise the hard limit will take precedence.
+
+8. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/4] Memory controller soft limit interface (v5)
  2009-03-12 17:56 [PATCH 0/4] Memory controller soft limit patches (v5) Balbir Singh
  2009-03-12 17:56 ` [PATCH 1/4] Memory controller soft limit documentation (v5) Balbir Singh
@ 2009-03-12 17:56 ` Balbir Singh
  2009-03-12 22:59   ` Andrew Morton
  2009-03-12 17:56 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v5) Balbir Singh
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2009-03-12 17:56 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
	Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki

Feature: Add soft limits interface to resource counters

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v2...v1
1. Add support for res_counter_check_soft_limit_locked. This is used
   by the hierarchy code.

Add an interface to allow get/set of soft limits. Soft limits for memory plus
swap controller (memsw) is currently not supported. Resource counters have
been enhanced to support soft limits and new type RES_SOFT_LIMIT has been
added. Unlike hard limits, soft limits can be directly set and do not
need any reclaim or checks before setting them to a newer value.

Kamezawa-San raised a question as to whether soft limit should belong
to res_counter. Since all resources understand the basic concepts of
hard and soft limits, it is justified to add soft limits here. Soft limits
are a generic resource usage feature, even file system quotas support
soft limits.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |   58 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    3 ++
 mm/memcontrol.c             |   20 +++++++++++++++
 3 files changed, 81 insertions(+), 0 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..5c821fd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
 	 */
 	unsigned long long limit;
 	/*
+	 * the limit that usage can be exceed
+	 */
+	unsigned long long soft_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -85,6 +89,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_LIMIT,
 	RES_FAILCNT,
+	RES_SOFT_LIMIT,
 };
 
 /*
@@ -130,6 +135,36 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 	return false;
 }
 
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->soft_limit)
+		return true;
+
+	return false;
+}
+
+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to soft limit
+ * The difference between usage and soft limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->soft_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 /*
  * Helper function to detect if the cgroup is within it's limit or
  * not. It's currently called from cgroup_rss_prepare()
@@ -145,6 +180,17 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
 	return ret;
 }
 
+static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_soft_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -178,4 +224,16 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 	return ret;
 }
 
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+				unsigned long long soft_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->soft_limit = soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..4e6dafe 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->soft_limit = (unsigned long long)LLONG_MAX;
 	counter->parent = parent;
 }
 
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->limit;
 	case RES_FAILCNT:
 		return &counter->failcnt;
+	case RES_SOFT_LIMIT:
+		return &counter->soft_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 804bbae..b4f5b15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1986,6 +1986,20 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case RES_SOFT_LIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		/*
+		 * For memsw, soft limits are hard to implement in terms
+		 * of semantics, for now, we support soft limits for
+		 * control without swap
+		 */
+		if (type == _MEM)
+			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else
+			ret = -EINVAL;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -2235,6 +2249,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read,
 	},
 	{
+		.name = "soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-12 17:56 [PATCH 0/4] Memory controller soft limit patches (v5) Balbir Singh
  2009-03-12 17:56 ` [PATCH 1/4] Memory controller soft limit documentation (v5) Balbir Singh
  2009-03-12 17:56 ` [PATCH 2/4] Memory controller soft limit interface (v5) Balbir Singh
@ 2009-03-12 17:56 ` Balbir Singh
  2009-03-12 23:04   ` Andrew Morton
                     ` (2 more replies)
  2009-03-12 17:56 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v5) Balbir Singh
  2009-03-13  7:02 ` [PATCH 0/4] Memory controller soft limit patches (v5) KAMEZAWA Hiroyuki
  4 siblings, 3 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-12 17:56 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
	Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki

Feature: Organize cgroups over soft limit in a RB-Tree

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v5...v4
1. res_counter_uncharge has an additional parameter to indicate if the
   counter was over its soft limit, before uncharge.

Changelog v4...v3
1. Optimizations to ensure we don't uncessarily get res_counter values
2. Fixed a bug in usage of time_after()

Changelog v3...v2
1. Add only the ancestor to the RB-Tree
2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put

Changelog v2...v1
1. Add support for hierarchies
2. The res_counter that is highest in the hierarchy is returned on soft
   limit being exceeded. Since we do hierarchical reclaim and add all
   groups exceeding their soft limits, this approach seems to work well
   in practice.

This patch introduces a RB-Tree for storing memory cgroups that are over their
soft limit. The overall goal is to

1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
   We are careful about updates, updates take place only after a particular
   time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
   limit

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |    6 +-
 kernel/res_counter.c        |   18 +++++
 mm/memcontrol.c             |  141 ++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 143 insertions(+), 22 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 5c821fd..5bbf8b1 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+		unsigned long val, struct res_counter **limit_fail_at,
+		struct res_counter **soft_limit_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+				bool *was_soft_limit_excess);
 
 static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 {
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4e6dafe..51ec438 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+			struct res_counter **limit_fail_at,
+			struct res_counter **soft_limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
 	struct res_counter *c, *u;
 
 	*limit_fail_at = NULL;
+	if (soft_limit_fail_at)
+		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
+		/*
+		 * With soft limits, we return the highest ancestor
+		 * that exceeds its soft limit
+		 */
+		if (soft_limit_fail_at &&
+			!res_counter_soft_limit_check_locked(c))
+			*soft_limit_fail_at = c;
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
@@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+				bool *was_soft_limit_excess)
 {
 	unsigned long flags;
 	struct res_counter *c;
@@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
+		if (c == counter && was_soft_limit_excess)
+			*was_soft_limit_excess =
+				!res_counter_soft_limit_check_locked(c);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b4f5b15..4e2a79e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
 #include <linux/rcupdate.h>
 #include <linux/limits.h>
 #include <linux/mutex.h>
+#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
@@ -129,6 +130,14 @@ struct mem_cgroup_lru_info {
 };
 
 /*
+ * Cgroups above their limits are maintained in a RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+static struct rb_root mem_cgroup_soft_limit_tree;
+static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
+
+/*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
  * statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -176,12 +185,20 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	struct rb_node mem_cgroup_node;		/* RB tree node */
+	unsigned long long usage_in_excess;	/* Set to the value by which */
+						/* the soft limit is exceeded*/
+	unsigned long last_tree_update;		/* Last time the tree was */
+						/* updated in jiffies     */
+
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
 };
 
+#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
+
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -214,6 +231,41 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct mem_cgroup *mem_node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	while (*p) {
+		parent = *p;
+		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
+		if (mem->usage_in_excess < mem_node->usage_in_excess)
+			p = &(*p)->rb_left;
+		/*
+		 * We can't avoid mem cgroups that are over their soft
+		 * limit by the same amount
+		 */
+		else if (mem->usage_in_excess >= mem_node->usage_in_excess)
+			p = &(*p)->rb_right;
+	}
+	rb_link_node(&mem->mem_cgroup_node, parent, p);
+	rb_insert_color(&mem->mem_cgroup_node,
+			&mem_cgroup_soft_limit_tree);
+	mem->last_tree_update = jiffies;
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -897,6 +949,40 @@ static void record_last_oom(struct mem_cgroup *mem)
 	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
 }
 
+static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
+						bool time_check)
+{
+	unsigned long long prev_usage_in_excess, new_usage_in_excess;
+	bool updated_tree = false;
+	unsigned long next_update = 0;
+	unsigned long flags;
+
+	prev_usage_in_excess = mem->usage_in_excess;
+
+	if (time_check)
+		next_update = mem->last_tree_update +
+				MEM_CGROUP_TREE_UPDATE_INTERVAL;
+
+	if (!time_check || time_after(jiffies, next_update)) {
+		new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+		if (prev_usage_in_excess) {
+			mem_cgroup_remove_exceeded(mem);
+			updated_tree = true;
+		}
+		if (!new_usage_in_excess)
+			goto done;
+		mem_cgroup_insert_exceeded(mem);
+		updated_tree = true;
+	}
+
+done:
+	if (updated_tree) {
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		mem->last_tree_update = jiffies;
+		mem->usage_in_excess = new_usage_in_excess;
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	}
+}
 
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
@@ -906,9 +992,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
 			bool oom)
 {
-	struct mem_cgroup *mem, *mem_over_limit;
+	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+	struct res_counter *fail_res, *soft_fail_res = NULL;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -938,16 +1024,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		int ret;
 		bool noswap = false;
 
-		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+						&soft_fail_res);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
-							&fail_res);
+							&fail_res, NULL);
 			if (likely(!ret))
 				break;
 			/* mem+swap counter fails */
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
 			noswap = true;
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									memsw);
@@ -985,6 +1072,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			goto nomem;
 		}
 	}
+
+	/*
+	 * Insert just the ancestor, we should trickle down to the correct
+	 * cgroup for reclaim, since the other nodes will be below their
+	 * soft limit
+	 */
+	if (soft_fail_res) {
+		mem_over_soft_limit =
+			mem_cgroup_from_res_counter(soft_fail_res, res);
+		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
+	}
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -1045,9 +1143,9 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
 		css_put(&mem->css);
 		return;
 	}
@@ -1100,10 +1198,10 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 	if (pc->mem_cgroup != from)
 		goto out;
 
-	res_counter_uncharge(&from->res, PAGE_SIZE);
+	res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (do_swap_account)
-		res_counter_uncharge(&from->memsw, PAGE_SIZE);
+		res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
 	css_put(&from->css);
 
 	css_get(&to->css);
@@ -1167,9 +1265,9 @@ uncharge:
 	/* drop extra refcnt by try_charge() */
 	css_put(&parent->css);
 	/* uncharge if move fails */
-	res_counter_uncharge(&parent->res, PAGE_SIZE);
+	res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
 	if (do_swap_account)
-		res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+		res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
 	return ret;
 }
 
@@ -1298,7 +1396,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 			 * Recorded ID can be obsolete. We avoid calling
 			 * css_tryget()
 			 */
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
 			mem_cgroup_put(mem);
 		}
 		rcu_read_unlock();
@@ -1377,7 +1475,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
 			 * This recorded memcg can be obsolete one. So, avoid
 			 * calling css_tryget
 			 */
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+			res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
 			mem_cgroup_put(memcg);
 		}
 		rcu_read_unlock();
@@ -1392,9 +1490,9 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 		return;
 	if (!mem)
 		return;
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
 	if (do_swap_account)
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
 	css_put(&mem->css);
 }
 
@@ -1408,6 +1506,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	bool soft_limit_excess = false;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1445,9 +1544,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 		break;
 	}
 
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
 	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
 	mem_cgroup_charge_statistics(mem, pc, false);
 
 	ClearPageCgroupUsed(pc);
@@ -1461,6 +1560,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
+	if (soft_limit_excess)
+		mem_cgroup_check_and_update_tree(mem, true);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		css_put(&mem->css);
@@ -1529,7 +1630,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 		 * We uncharge this because swap is freed.
 		 * This memcg can be obsolete one. We avoid calling css_tryget
 		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
 		mem_cgroup_put(memcg);
 	}
 	rcu_read_unlock();
@@ -2393,6 +2494,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
+	mem_cgroup_check_and_update_tree(mem, false);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -2459,6 +2561,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
 		parent = NULL;
+		mem_cgroup_soft_limit_tree = RB_ROOT;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
@@ -2479,6 +2582,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->usage_in_excess = 0;
+	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-12 17:56 [PATCH 0/4] Memory controller soft limit patches (v5) Balbir Singh
                   ` (2 preceding siblings ...)
  2009-03-12 17:56 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v5) Balbir Singh
@ 2009-03-12 17:56 ` Balbir Singh
  2009-03-12 23:34   ` Andrew Morton
                     ` (2 more replies)
  2009-03-13  7:02 ` [PATCH 0/4] Memory controller soft limit patches (v5) KAMEZAWA Hiroyuki
  4 siblings, 3 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-12 17:56 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
	Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki

Feature: Implement reclaim from groups over their soft limit

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v5...v4

1. Throttling is removed, earlier we throttled tasks over their soft limit
2. Reclaim has been moved back to __alloc_pages_internal, several experiments
   and tests showed that it was the best place to reclaim memory. kswapd has
   a different goal, that does not work with a single soft limit for the memory
   cgroup.
3. Soft limit reclaim is more targetted and the pages reclaim depend on the
   amount by which the soft limit is exceeded.

Changelog v4...v3
1. soft_reclaim is now called from balance_pgdat
2. soft_reclaim is aware of nodes and zones
3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
   and at the same time trying to allocate pages and exceed its soft limit.
4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
   particular to a mem cgroup.

Changelog v3...v2
1. Convert several arguments to hierarchical reclaim to flags, thereby
   consolidating them
2. The reclaim for soft limits is now triggered from kswapd
3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument


Changelog v2...v1
1. Added support for hierarchical soft limits

This patch allows reclaim from memory cgroups on contention (via the
kswapd() path) only if the order is 0.

memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest amount and reclaims pages from it and then reinserts the
cgroup into its correct place in the rbtree.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |    7 +-
 include/linux/swap.h       |    1 
 mm/memcontrol.c            |  180 +++++++++++++++++++++++++++++++++++++++-----
 mm/page_alloc.c            |    9 ++
 mm/vmscan.c                |    5 +
 5 files changed, 179 insertions(+), 23 deletions(-)


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..c0aeab9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,7 +116,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
-
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl,
+						gfp_t gfp_mask);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -264,6 +265,10 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
+{
+	return 0;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 989eb53..c128337 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,7 @@ static inline void lru_cache_add_active_file(struct page *page)
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+						  struct zonelist *zl,
 						  gfp_t gfp_mask, bool noswap,
 						  unsigned int swappiness);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e2a79e..941d57e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,7 @@
 #include <linux/res_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
+#include <linux/completion.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/smp.h>
@@ -191,6 +192,7 @@ struct mem_cgroup {
 	unsigned long last_tree_update;		/* Last time the tree was */
 						/* updated in jiffies     */
 
+	bool on_tree;				/* Is the node on tree? */
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -227,18 +229,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
+/*
+ * Bits used for hierarchical reclaim bits
+ */
+#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
+#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
+#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
+#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
+#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
+
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
-static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
 {
 	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
 	struct rb_node *parent = NULL;
 	struct mem_cgroup *mem_node;
-	unsigned long flags;
 
-	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	if (mem->on_tree)
+		return;
+
 	while (*p) {
 		parent = *p;
 		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
@@ -255,6 +268,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
 	rb_insert_color(&mem->mem_cgroup_node,
 			&mem_cgroup_soft_limit_tree);
 	mem->last_tree_update = jiffies;
+	mem->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+	if (!mem->on_tree)
+		return;
+	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	mem->on_tree = false;
+}
+
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	__mem_cgroup_insert_exceeded(mem);
 	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
 }
 
@@ -262,8 +292,53 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
-	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	__mem_cgroup_remove_exceeded(mem);
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+	unsigned long long excess;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	excess = mem->usage_in_excess >> PAGE_SHIFT;
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	return (excess > ULONG_MAX) ? ULONG_MAX : excess;
+}
+
+static struct mem_cgroup *__mem_cgroup_largest_soft_limit_node(void)
+{
+	struct rb_node *rightmost = NULL;
+	struct mem_cgroup *mem = NULL;
+
+retry:
+	rightmost = rb_last(&mem_cgroup_soft_limit_tree);
+	if (!rightmost)
+		goto done;		/* Nothing to reclaim from */
+
+	mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
+	/*
+	 * Remove the node now but someone else can add it back,
+	 * we will to add it back at the end of reclaim to its correct
+	 * position in the tree.
+	 */
+	__mem_cgroup_remove_exceeded(mem);
+	if (!css_tryget(&mem->css) || !res_counter_soft_limit_excess(&mem->res))
+		goto retry;
+done:
+	return mem;
+}
+
+static struct mem_cgroup *mem_cgroup_largest_soft_limit_node(void)
+{
+	struct mem_cgroup *mem;
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	mem = __mem_cgroup_largest_soft_limit_node();
 	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	return mem;
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
@@ -888,14 +963,39 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
  * If shrink==true, for avoiding to free too much, this returns immedieately.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
-				   gfp_t gfp_mask, bool noswap, bool shrink)
+						struct zonelist *zl,
+						gfp_t gfp_mask,
+						unsigned long flags)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
 	int loop = 0;
+	bool noswap = flags & MEM_CGROUP_RECLAIM_NOSWAP;
+	bool shrink = flags & MEM_CGROUP_RECLAIM_SHRINK;
+	bool check_soft = flags & MEM_CGROUP_RECLAIM_SOFT;
+	unsigned long excess = mem_cgroup_get_excess(root_mem);
 
-	while (loop < 2) {
+	while (1) {
+		if (loop >= 2) {
+			/*
+			 * With soft limits, do more targetted reclaim
+			 */
+			if (check_soft && (total >= (excess >> 4)))
+				break;
+			else if (!check_soft)
+				break;
+		}
 		victim = mem_cgroup_select_victim(root_mem);
+		/*
+		 * In the first loop, don't reclaim from victims below
+		 * their soft limit
+		 */
+		if (!loop && res_counter_check_under_soft_limit(&victim->res)) {
+			if (victim == root_mem)
+				loop++;
+			css_put(&victim->css);
+			continue;
+		}
 		if (victim == root_mem)
 			loop++;
 		if (!mem_cgroup_local_usage(&victim->stat)) {
@@ -904,8 +1004,9 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 			continue;
 		}
 		/* we use swappiness of local cgroup */
-		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
-						   get_swappiness(victim));
+		ret = try_to_free_mem_cgroup_pages(victim, zl, gfp_mask,
+							noswap,
+							get_swappiness(victim));
 		css_put(&victim->css);
 		/*
 		 * At shrinking usage, we can't check we should stop here or
@@ -915,7 +1016,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		if (shrink)
 			return ret;
 		total += ret;
-		if (mem_cgroup_check_under_limit(root_mem))
+		if (check_soft) {
+			if (res_counter_check_under_soft_limit(&root_mem->res))
+				return total;
+		} else if (mem_cgroup_check_under_limit(root_mem))
 			return 1 + total;
 	}
 	return total;
@@ -1022,7 +1126,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 
 	while (1) {
 		int ret;
-		bool noswap = false;
+		unsigned long flags = 0;
 
 		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
 						&soft_fail_res);
@@ -1035,7 +1139,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 				break;
 			/* mem+swap counter fails */
 			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
-			noswap = true;
+			flags = MEM_CGROUP_RECLAIM_NOSWAP;
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									memsw);
 		} else
@@ -1046,8 +1150,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
-		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
-							noswap, false);
+		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
+							gfp_mask, flags);
 		if (ret)
 			continue;
 
@@ -1741,8 +1845,8 @@ int mem_cgroup_shrink_usage(struct page *page,
 		return 0;
 
 	do {
-		progress = mem_cgroup_hierarchical_reclaim(mem,
-					gfp_mask, true, false);
+		progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
+					gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
 		progress += mem_cgroup_check_under_limit(mem);
 	} while (!progress && --retry);
 
@@ -1796,8 +1900,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
-						   false, true);
+		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
+						GFP_KERNEL,
+						MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -1845,7 +1950,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
+		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
+						MEM_CGROUP_RECLAIM_NOSWAP |
+						MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -1856,6 +1963,39 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
+{
+	unsigned long nr_reclaimed = 0;
+	struct mem_cgroup *mem;
+	unsigned long flags;
+	unsigned long reclaimed;
+
+	/*
+	 * This loop can run a while, specially if mem_cgroup's continuously
+	 * keep exceeding their soft limit and putting the system under
+	 * pressure
+	 */
+	do {
+		mem = mem_cgroup_largest_soft_limit_node();
+		if (!mem)
+			break;
+
+		reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
+						gfp_mask,
+						MEM_CGROUP_RECLAIM_SOFT);
+		nr_reclaimed += reclaimed;
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+		__mem_cgroup_remove_exceeded(mem);
+		if (mem->usage_in_excess)
+			__mem_cgroup_insert_exceeded(mem);
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+		css_put(&mem->css);
+		cond_resched();
+	} while (!nr_reclaimed);
+	return nr_reclaimed;
+}
+
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -1979,7 +2119,7 @@ try_to_free:
 			ret = -EINTR;
 			goto out;
 		}
-		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
+		progress = try_to_free_mem_cgroup_pages(mem, NULL, GFP_KERNEL,
 						false, get_swappiness(mem));
 		if (!progress) {
 			nr_retries--;
@@ -2584,6 +2724,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	mem->usage_in_excess = 0;
 	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
+	mem->on_tree = false;
+
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46bd24c..b49c90f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1583,7 +1583,14 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+	/*
+	 * Try to free up some pages from the memory controllers soft
+	 * limit queue.
+	 */
+	did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
+	if (!order || !did_some_progress)
+		did_some_progress += try_to_free_pages(zonelist, order,
+							gfp_mask);
 
 	p->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bfd853b..f212b30 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
+					   struct zonelist *zonelist,
 					   gfp_t gfp_mask,
 					   bool noswap,
 					   unsigned int swappiness)
@@ -1721,14 +1722,14 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
-	struct zonelist *zonelist;
 
 	if (noswap)
 		sc.may_unmap = 0;
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
-	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
+	if (!zonelist)
+		zonelist = NODE_DATA(numa_node_id())->node_zonelists;
 	return do_try_to_free_pages(zonelist, &sc);
 }
 #endif

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] Memory controller soft limit interface (v5)
  2009-03-12 17:56 ` [PATCH 2/4] Memory controller soft limit interface (v5) Balbir Singh
@ 2009-03-12 22:59   ` Andrew Morton
  2009-03-13  4:58     ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2009-03-12 22:59 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, yamamoto, lizf, kosaki.motohiro, riel, kamezawa.hiroyu

On Thu, 12 Mar 2009 23:26:20 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> +/**
> + * Get the difference between the usage and the soft limit
> + * @cnt: The counter
> + *
> + * Returns 0 if usage is less than or equal to soft limit
> + * The difference between usage and soft limit, otherwise.
> + */
> +static inline unsigned long long
> +res_counter_soft_limit_excess(struct res_counter *cnt)
> +{
> +	unsigned long long excess;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	if (cnt->usage <= cnt->soft_limit)
> +		excess = 0;
> +	else
> +		excess = cnt->usage - cnt->soft_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return excess;
> +}
>
> ...
>  
> +static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_soft_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
>
> ...
>
> +static inline int
> +res_counter_set_soft_limit(struct res_counter *cnt,
> +				unsigned long long soft_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->soft_limit = soft_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}

These functions look too large to be inlined?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-12 17:56 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v5) Balbir Singh
@ 2009-03-12 23:04   ` Andrew Morton
  2009-03-13  5:03     ` Balbir Singh
  2009-03-13  0:47   ` KOSAKI Motohiro
  2009-03-13  6:59   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2009-03-12 23:04 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, yamamoto, lizf, kosaki.motohiro, riel, kamezawa.hiroyu

On Thu, 12 Mar 2009 23:26:25 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Feature: Organize cgroups over soft limit in a RB-Tree
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v5...v4
> 1. res_counter_uncharge has an additional parameter to indicate if the
>    counter was over its soft limit, before uncharge.
> 
> Changelog v4...v3
> 1. Optimizations to ensure we don't uncessarily get res_counter values
> 2. Fixed a bug in usage of time_after()
> 
> Changelog v3...v2
> 1. Add only the ancestor to the RB-Tree
> 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
> 
> Changelog v2...v1
> 1. Add support for hierarchies
> 2. The res_counter that is highest in the hierarchy is returned on soft
>    limit being exceeded. Since we do hierarchical reclaim and add all
>    groups exceeding their soft limits, this approach seems to work well
>    in practice.
> 
> This patch introduces a RB-Tree for storing memory cgroups that are over their
> soft limit. The overall goal is to
> 
> 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
>    We are careful about updates, updates take place only after a particular
>    time interval has passed
> 2. We remove the node from the RB-Tree when the usage goes below the soft
>    limit
> 
> The next set of patches will exploit the RB-Tree to get the group that is
> over its soft limit by the largest amount and reclaim from it, when we
> face memory contention.
> 
>
> ...
>
> +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)

Wall-clock time is a quite poor way of tracking system activity. 
There's little correlation between the two things.

>From a general design point of view it would be better to pace this
polling activity in a manner which correlates with the amount of system
activity.  For example, "once per 100,000 pages scanned" is much more
adaptive than "once per 250 milliseconds".

>
> ...
>> @@ -2459,6 +2561,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
>  		parent = NULL;
> +		mem_cgroup_soft_limit_tree = RB_ROOT;

This can be done at compile time?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-12 17:56 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v5) Balbir Singh
@ 2009-03-12 23:34   ` Andrew Morton
  2009-03-13  7:53     ` Balbir Singh
  2009-03-13  1:36   ` KOSAKI Motohiro
  2009-03-13  6:51   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2009-03-12 23:34 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, yamamoto, lizf, kosaki.motohiro, riel, kamezawa.hiroyu

On Thu, 12 Mar 2009 23:26:31 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Feature: Implement reclaim from groups over their soft limit
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v5...v4
> 
> 1. Throttling is removed, earlier we throttled tasks over their soft limit
> 2. Reclaim has been moved back to __alloc_pages_internal, several experiments
>    and tests showed that it was the best place to reclaim memory. kswapd has
>    a different goal, that does not work with a single soft limit for the memory
>    cgroup.
> 3. Soft limit reclaim is more targetted and the pages reclaim depend on the
>    amount by which the soft limit is exceeded.
> 
> Changelog v4...v3
> 1. soft_reclaim is now called from balance_pgdat
> 2. soft_reclaim is aware of nodes and zones
> 3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
>    and at the same time trying to allocate pages and exceed its soft limit.
> 4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
>    particular to a mem cgroup.
> 
> Changelog v3...v2
> 1. Convert several arguments to hierarchical reclaim to flags, thereby
>    consolidating them
> 2. The reclaim for soft limits is now triggered from kswapd
> 3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument
> 
> 
> Changelog v2...v1
> 1. Added support for hierarchical soft limits
> 
> This patch allows reclaim from memory cgroups on contention (via the
> kswapd() path) only if the order is 0.

Why for order-0 only?

What are the implications of not handling higher-order pages?

Why kswapd only?

What are the implications of omitting this from direct reclaim?

> memory cgroup soft limit reclaim finds the group that exceeds its soft limit
> by the largest amount and reclaims pages from it and then reinserts the
> cgroup into its correct place in the rbtree.

Why trim the single worst-case group rather than (say) trimming all
groups by a common proportion?  Or other things.


When you say "by the largest amount", is that the "by largest number of
pages" or "by the largest percentage"?

What are the implications of <whichever you chose>?

>
> ...
>
>  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> -				   gfp_t gfp_mask, bool noswap, bool shrink)
> +						struct zonelist *zl,
> +						gfp_t gfp_mask,
> +						unsigned long flags)
>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
>  	int loop = 0;
> +	bool noswap = flags & MEM_CGROUP_RECLAIM_NOSWAP;
> +	bool shrink = flags & MEM_CGROUP_RECLAIM_SHRINK;
> +	bool check_soft = flags & MEM_CGROUP_RECLAIM_SOFT;

`flags' wasn't a great choice of identifier.  It's a bit misleading,
and you'll be in a pickle if you later try to add a
spin_lock_irqsave(lock, flags) to this function.  Maybe choose a more
specific name?

> +	unsigned long excess = mem_cgroup_get_excess(root_mem);
>  
> -	while (loop < 2) {
> +	while (1) {
> +		if (loop >= 2) {
> +			/*
> +			 * With soft limits, do more targetted reclaim
> +			 */
> +			if (check_soft && (total >= (excess >> 4)))
> +				break;
> +			else if (!check_soft)
> +				break;

maybe..

			if (!check_soft)
				break;
			if (total >= (excess >> 4))
				break;

dunno.

The ">> 4" magic number would benefit from an explanatory comment.  Why
not ">> 3"???


> +		}
>  		victim = mem_cgroup_select_victim(root_mem);
> +		/*
> +		 * In the first loop, don't reclaim from victims below
> +		 * their soft limit
> +		 */
>
> ...
>
> +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
> +{
> +	unsigned long nr_reclaimed = 0;
> +	struct mem_cgroup *mem;
> +	unsigned long flags;
> +	unsigned long reclaimed;
> +
> +	/*
> +	 * This loop can run a while, specially if mem_cgroup's continuously
> +	 * keep exceeding their soft limit and putting the system under
> +	 * pressure
> +	 */
> +	do {
> +		mem = mem_cgroup_largest_soft_limit_node();
> +		if (!mem)
> +			break;
> +
> +		reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
> +						gfp_mask,
> +						MEM_CGROUP_RECLAIM_SOFT);
> +		nr_reclaimed += reclaimed;
> +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> +		__mem_cgroup_remove_exceeded(mem);
> +		if (mem->usage_in_excess)
> +			__mem_cgroup_insert_exceeded(mem);
> +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +		css_put(&mem->css);
> +		cond_resched();

spin_lock_irq() would suffice here.  Or the cond_resched() is a bug.

There's a decent argument that spin_lock_irq() is dangerous, and its
saving is so small that it's better to use the more robust
spin_lock_irqsave() all the time.  But was that the intent here?


> +	} while (!nr_reclaimed);
> +	return nr_reclaimed;
> +}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-12 17:56 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v5) Balbir Singh
  2009-03-12 23:04   ` Andrew Morton
@ 2009-03-13  0:47   ` KOSAKI Motohiro
  2009-03-13  5:04     ` Balbir Singh
  2009-03-13  6:59   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-03-13  0:47 UTC (permalink / raw)
  To: Balbir Singh
  Cc: kosaki.motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

>  /*
> + * Cgroups above their limits are maintained in a RB-Tree, independent of
> + * their hierarchy representation
> + */
> +
> +static struct rb_root mem_cgroup_soft_limit_tree;
> +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);

I have objection to this.
Please don't use global spin lock.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-12 17:56 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v5) Balbir Singh
  2009-03-12 23:34   ` Andrew Morton
@ 2009-03-13  1:36   ` KOSAKI Motohiro
  2009-03-13  4:13     ` Balbir Singh
  2009-03-13  6:51   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-03-13  1:36 UTC (permalink / raw)
  To: Balbir Singh
  Cc: kosaki.motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

Hi

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 46bd24c..b49c90f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1583,7 +1583,14 @@ nofail_alloc:
>  	reclaim_state.reclaimed_slab = 0;
>  	p->reclaim_state = &reclaim_state;
>  
> -	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> +	/*
> +	 * Try to free up some pages from the memory controllers soft
> +	 * limit queue.
> +	 */
> +	did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> +	if (!order || !did_some_progress)
> +		did_some_progress += try_to_free_pages(zonelist, order,
> +							gfp_mask);

I have two objection to this.

- "if (!order || !did_some_progress)" mean no call try_to_free_pages()
  in order>0 and did_some_progress>0 case.
  but mem_cgroup_soft_limit_reclaim() don't have lumpy reclaim.
  then, it break high order reclaim.
- in global reclaim view, foreground reclaim and background reclaim's
  reclaim rate is about 1:9 typically.
  then, kswapd reclaim the pages by global lru order before proceccing
  this logic.
  IOW, this soft limit is not SOFT.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  1:36   ` KOSAKI Motohiro
@ 2009-03-13  4:13     ` Balbir Singh
  2009-03-13  4:31       ` KOSAKI Motohiro
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  4:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 10:36:31]:

> Hi
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 46bd24c..b49c90f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1583,7 +1583,14 @@ nofail_alloc:
> >  	reclaim_state.reclaimed_slab = 0;
> >  	p->reclaim_state = &reclaim_state;
> >  
> > -	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > +	/*
> > +	 * Try to free up some pages from the memory controllers soft
> > +	 * limit queue.
> > +	 */
> > +	did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > +	if (!order || !did_some_progress)
> > +		did_some_progress += try_to_free_pages(zonelist, order,
> > +							gfp_mask);
> 
> I have two objection to this.
> 
> - "if (!order || !did_some_progress)" mean no call try_to_free_pages()
>   in order>0 and did_some_progress>0 case.
>   but mem_cgroup_soft_limit_reclaim() don't have lumpy reclaim.
>   then, it break high order reclaim.

I am sending a fix for this right away. Thanks, the check should be
if (order || !did_some_progress)

> - in global reclaim view, foreground reclaim and background reclaim's
>   reclaim rate is about 1:9 typically.
>   then, kswapd reclaim the pages by global lru order before proceccing
>   this logic.
>   IOW, this soft limit is not SOFT.

It depends on what you mean by soft. I call them soft since they are
imposed only when there is contention. If you mean kswapd runs more
often than direct reclaim, that is true, but it does not impact this
code extensively since the high water mark is a very small compared to
the pages present on the system.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  4:13     ` Balbir Singh
@ 2009-03-13  4:31       ` KOSAKI Motohiro
  2009-03-13  4:50         ` KOSAKI Motohiro
  2009-03-13  4:58         ` Balbir Singh
  0 siblings, 2 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-03-13  4:31 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

> > > -	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > +	/*
> > > +	 * Try to free up some pages from the memory controllers soft
> > > +	 * limit queue.
> > > +	 */
> > > +	did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > > +	if (!order || !did_some_progress)
> > > +		did_some_progress += try_to_free_pages(zonelist, order,
> > > +							gfp_mask);
> > 
> > I have two objection to this.
> > 
> > - "if (!order || !did_some_progress)" mean no call try_to_free_pages()
> >   in order>0 and did_some_progress>0 case.
> >   but mem_cgroup_soft_limit_reclaim() don't have lumpy reclaim.
> >   then, it break high order reclaim.
> 
> I am sending a fix for this right away. Thanks, the check should be
> if (order || !did_some_progress)

No.

it isn't enough.
after is does, order-1 allocation case twrice reclaim (soft limit shrinking
and normal try_to_free_pages()).
then, order-1 reclaim makes slower about 2 times.

unfortunately, order-1 allocation is very frequent. it is used for
kernel stack.


> > - in global reclaim view, foreground reclaim and background reclaim's
> >   reclaim rate is about 1:9 typically.
> >   then, kswapd reclaim the pages by global lru order before proceccing
> >   this logic.
> >   IOW, this soft limit is not SOFT.
> 
> It depends on what you mean by soft. I call them soft since they are
> imposed only when there is contention. If you mean kswapd runs more
> often than direct reclaim, that is true, but it does not impact this
> code extensively since the high water mark is a very small compared to
> the pages present on the system.

No.

My point is, contention case kswapd wakeup. and kswapd reclaim by
global lru order before soft limit shrinking.
Therefore, In typical usage, mem_cgroup_soft_limit_reclaim() almost
don't call properly.

soft limit shrinking should run before processing global reclaim.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  4:31       ` KOSAKI Motohiro
@ 2009-03-13  4:50         ` KOSAKI Motohiro
  2009-03-13  5:07           ` Balbir Singh
  2009-03-13  5:26           ` Balbir Singh
  2009-03-13  4:58         ` Balbir Singh
  1 sibling, 2 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-03-13  4:50 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

> > > I have two objection to this.
> > > 
> > > - "if (!order || !did_some_progress)" mean no call try_to_free_pages()
> > >   in order>0 and did_some_progress>0 case.
> > >   but mem_cgroup_soft_limit_reclaim() don't have lumpy reclaim.
> > >   then, it break high order reclaim.
> > 
> > I am sending a fix for this right away. Thanks, the check should be
> > if (order || !did_some_progress)
> 
> No.
> 
> it isn't enough.
> after is does, order-1 allocation case twrice reclaim (soft limit shrinking
> and normal try_to_free_pages()).
> then, order-1 reclaim makes slower about 2 times.
> 
> unfortunately, order-1 allocation is very frequent. it is used for
> kernel stack.

in normal order-1 reclaim is:

1. try_to_free_pages()
2. get_page_from_freelist()
3. retry if order-1 page don't exist

Coundn't you use the same logic?

> > > - in global reclaim view, foreground reclaim and background reclaim's
> > >   reclaim rate is about 1:9 typically.
> > >   then, kswapd reclaim the pages by global lru order before proceccing
> > >   this logic.
> > >   IOW, this soft limit is not SOFT.
> > 
> > It depends on what you mean by soft. I call them soft since they are
> > imposed only when there is contention. If you mean kswapd runs more
> > often than direct reclaim, that is true, but it does not impact this
> > code extensively since the high water mark is a very small compared to
> > the pages present on the system.
> 
> No.
> 
> My point is, contention case kswapd wakeup. and kswapd reclaim by
> global lru order before soft limit shrinking.
> Therefore, In typical usage, mem_cgroup_soft_limit_reclaim() almost
> don't call properly.
> 
> soft limit shrinking should run before processing global reclaim.

Do you have the reason of disliking call from kswapd ?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  4:31       ` KOSAKI Motohiro
  2009-03-13  4:50         ` KOSAKI Motohiro
@ 2009-03-13  4:58         ` Balbir Singh
  1 sibling, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  4:58 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 13:31:41]:

> > > > -	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > > +	/*
> > > > +	 * Try to free up some pages from the memory controllers soft
> > > > +	 * limit queue.
> > > > +	 */
> > > > +	did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > > > +	if (!order || !did_some_progress)
> > > > +		did_some_progress += try_to_free_pages(zonelist, order,
> > > > +							gfp_mask);
> > > 
> > > I have two objection to this.
> > > 
> > > - "if (!order || !did_some_progress)" mean no call try_to_free_pages()
> > >   in order>0 and did_some_progress>0 case.
> > >   but mem_cgroup_soft_limit_reclaim() don't have lumpy reclaim.
> > >   then, it break high order reclaim.
> > 
> > I am sending a fix for this right away. Thanks, the check should be
> > if (order || !did_some_progress)
> 
> No.
> 
> it isn't enough.
> after is does, order-1 allocation case twrice reclaim (soft limit shrinking
> and normal try_to_free_pages()).
> then, order-1 reclaim makes slower about 2 times.

My benchmarks don't show any degredation...  this slowdown will occur *iff*
soft limits are enabled and groups are over their soft limit. Even if
soft limit reclaim were to be initiated through kswapd (which is
through my experimentation, a bad place to do it), you'd have delays
incurred since you would have increased contention on zone lru lock.

Anyway, lets boil it down your comment to

The issue you claim occurs when the cgroups are over their soft
limit and there is memory contention.

> 
> unfortunately, order-1 allocation is very frequent. it is used for
> kernel stack.
>
> 
> > > - in global reclaim view, foreground reclaim and background reclaim's
> > >   reclaim rate is about 1:9 typically.
> > >   then, kswapd reclaim the pages by global lru order before proceccing
> > >   this logic.
> > >   IOW, this soft limit is not SOFT.
> > 
> > It depends on what you mean by soft. I call them soft since they are
> > imposed only when there is contention. If you mean kswapd runs more
> > often than direct reclaim, that is true, but it does not impact this
> > code extensively since the high water mark is a very small compared to
> > the pages present on the system.
> 
> No.
> 
> My point is, contention case kswapd wakeup. and kswapd reclaim by
> global lru order before soft limit shrinking.

I've seen the same even if kswapd is used for reclaim, since we have
no control over priority and the length to scan. shrink_zone() does
more work than soft limit reclaim.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] Memory controller soft limit interface (v5)
  2009-03-12 22:59   ` Andrew Morton
@ 2009-03-13  4:58     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  4:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, yamamoto, lizf, kosaki.motohiro, riel, kamezawa.hiroyu

* Andrew Morton <akpm@linux-foundation.org> [2009-03-12 15:59:05]:

> On Thu, 12 Mar 2009 23:26:20 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > +/**
> > + * Get the difference between the usage and the soft limit
> > + * @cnt: The counter
> > + *
> > + * Returns 0 if usage is less than or equal to soft limit
> > + * The difference between usage and soft limit, otherwise.
> > + */
> > +static inline unsigned long long
> > +res_counter_soft_limit_excess(struct res_counter *cnt)
> > +{
> > +	unsigned long long excess;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&cnt->lock, flags);
> > +	if (cnt->usage <= cnt->soft_limit)
> > +		excess = 0;
> > +	else
> > +		excess = cnt->usage - cnt->soft_limit;
> > +	spin_unlock_irqrestore(&cnt->lock, flags);
> > +	return excess;
> > +}
> >
> > ...
> >  
> > +static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
> > +{
> > +	bool ret;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&cnt->lock, flags);
> > +	ret = res_counter_soft_limit_check_locked(cnt);
> > +	spin_unlock_irqrestore(&cnt->lock, flags);
> > +	return ret;
> > +}
> >
> > ...
> >
> > +static inline int
> > +res_counter_set_soft_limit(struct res_counter *cnt,
> > +				unsigned long long soft_limit)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&cnt->lock, flags);
> > +	cnt->soft_limit = soft_limit;
> > +	spin_unlock_irqrestore(&cnt->lock, flags);
> > +	return 0;
> > +}
> 
> These functions look too large to be inlined?
>

I'll send a patch to fix it and move them to res_counter.c 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-12 23:04   ` Andrew Morton
@ 2009-03-13  5:03     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  5:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, yamamoto, lizf, kosaki.motohiro, riel, kamezawa.hiroyu

* Andrew Morton <akpm@linux-foundation.org> [2009-03-12 16:04:24]:

> On Thu, 12 Mar 2009 23:26:25 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Feature: Organize cgroups over soft limit in a RB-Tree
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v5...v4
> > 1. res_counter_uncharge has an additional parameter to indicate if the
> >    counter was over its soft limit, before uncharge.
> > 
> > Changelog v4...v3
> > 1. Optimizations to ensure we don't uncessarily get res_counter values
> > 2. Fixed a bug in usage of time_after()
> > 
> > Changelog v3...v2
> > 1. Add only the ancestor to the RB-Tree
> > 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
> > 
> > Changelog v2...v1
> > 1. Add support for hierarchies
> > 2. The res_counter that is highest in the hierarchy is returned on soft
> >    limit being exceeded. Since we do hierarchical reclaim and add all
> >    groups exceeding their soft limits, this approach seems to work well
> >    in practice.
> > 
> > This patch introduces a RB-Tree for storing memory cgroups that are over their
> > soft limit. The overall goal is to
> > 
> > 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> >    We are careful about updates, updates take place only after a particular
> >    time interval has passed
> > 2. We remove the node from the RB-Tree when the usage goes below the soft
> >    limit
> > 
> > The next set of patches will exploit the RB-Tree to get the group that is
> > over its soft limit by the largest amount and reclaim from it, when we
> > face memory contention.
> > 
> >
> > ...
> >
> > +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
> 
> Wall-clock time is a quite poor way of tracking system activity. 
> There's little correlation between the two things.
> 
> >From a general design point of view it would be better to pace this
> polling activity in a manner which correlates with the amount of system
> activity.  For example, "once per 100,000 pages scanned" is much more
> adaptive than "once per 250 milliseconds".
>

Good suggestion, I want to keep the tree up-to-date before we hit
scanning for reclaim. I can experiment with the idea and try.
 
> >
> > ...
> >> @@ -2459,6 +2561,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  	if (cont->parent == NULL) {
> >  		enable_swap_cgroup();
> >  		parent = NULL;
> > +		mem_cgroup_soft_limit_tree = RB_ROOT;
> 
> This can be done at compile time?
>

As in declaring using RB_ROOT, sure can be done. I'll send a series of
fixes with review comments incorporated. 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-13  0:47   ` KOSAKI Motohiro
@ 2009-03-13  5:04     ` Balbir Singh
  2009-03-13  5:22       ` KOSAKI Motohiro
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  5:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 09:47:31]:

> >  /*
> > + * Cgroups above their limits are maintained in a RB-Tree, independent of
> > + * their hierarchy representation
> > + */
> > +
> > +static struct rb_root mem_cgroup_soft_limit_tree;
> > +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> 
> I have objection to this.
> Please don't use global spin lock.

We need a global data structure, per node, per zone is no good, since
the limits (soft limit in this case) is for the entire cgroup.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  4:50         ` KOSAKI Motohiro
@ 2009-03-13  5:07           ` Balbir Singh
  2009-03-13  6:54             ` KOSAKI Motohiro
  2009-03-13  5:26           ` Balbir Singh
  1 sibling, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  5:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 13:50:26]:

> > > > I have two objection to this.
> > > > 
> > > > - "if (!order || !did_some_progress)" mean no call try_to_free_pages()
> > > >   in order>0 and did_some_progress>0 case.
> > > >   but mem_cgroup_soft_limit_reclaim() don't have lumpy reclaim.
> > > >   then, it break high order reclaim.
> > > 
> > > I am sending a fix for this right away. Thanks, the check should be
> > > if (order || !did_some_progress)
> > 
> > No.
> > 
> > it isn't enough.
> > after is does, order-1 allocation case twrice reclaim (soft limit shrinking
> > and normal try_to_free_pages()).
> > then, order-1 reclaim makes slower about 2 times.
> > 
> > unfortunately, order-1 allocation is very frequent. it is used for
> > kernel stack.
> 
> in normal order-1 reclaim is:
> 
> 1. try_to_free_pages()
> 2. get_page_from_freelist()
> 3. retry if order-1 page don't exist
> 
> Coundn't you use the same logic?
> 
> > > > - in global reclaim view, foreground reclaim and background reclaim's
> > > >   reclaim rate is about 1:9 typically.
> > > >   then, kswapd reclaim the pages by global lru order before proceccing
> > > >   this logic.
> > > >   IOW, this soft limit is not SOFT.
> > > 
> > > It depends on what you mean by soft. I call them soft since they are
> > > imposed only when there is contention. If you mean kswapd runs more
> > > often than direct reclaim, that is true, but it does not impact this
> > > code extensively since the high water mark is a very small compared to
> > > the pages present on the system.
> > 
> > No.
> > 
> > My point is, contention case kswapd wakeup. and kswapd reclaim by
> > global lru order before soft limit shrinking.
> > Therefore, In typical usage, mem_cgroup_soft_limit_reclaim() almost
> > don't call properly.
> > 
> > soft limit shrinking should run before processing global reclaim.
> 
> Do you have the reason of disliking call from kswapd ?
>

Yes, I sent that reason out as comments to Kame's patches. kswapd or
balance_pgdat controls the zones, priority and in effect how many
pages we scan while doing reclaim. I did lots of experiments and found
that if soft limit reclaim occurred from kswapd, soft_limit_reclaim
would almost always fail and shrink_zone() would succeed, since it
looks at the whole zone and is always able to find some pages at all
priority levels. It also does not allow for targetted reclaim based on
how much we exceed the soft limit by. 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-13  5:04     ` Balbir Singh
@ 2009-03-13  5:22       ` KOSAKI Motohiro
  2009-03-13  8:20         ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-03-13  5:22 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

> > >  /*
> > > + * Cgroups above their limits are maintained in a RB-Tree, independent of
> > > + * their hierarchy representation
> > > + */
> > > +
> > > +static struct rb_root mem_cgroup_soft_limit_tree;
> > > +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> > 
> > I have objection to this.
> > Please don't use global spin lock.
> 
> We need a global data structure, per node, per zone is no good, since
> the limits (soft limit in this case) is for the entire cgroup.

this smell the data structure is wrong.

rb-tree soring is one of efficient reclaiming technique.
but global lock bust due to this patch's good side.

if its updating is really rare, rcu is better?
or couldn't you select another data structure?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  4:50         ` KOSAKI Motohiro
  2009-03-13  5:07           ` Balbir Singh
@ 2009-03-13  5:26           ` Balbir Singh
  2009-03-13  5:34             ` KOSAKI Motohiro
  1 sibling, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  5:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 13:50:26]:

> > > > I have two objection to this.
> > > > 
> > > > - "if (!order || !did_some_progress)" mean no call try_to_free_pages()
> > > >   in order>0 and did_some_progress>0 case.
> > > >   but mem_cgroup_soft_limit_reclaim() don't have lumpy reclaim.
> > > >   then, it break high order reclaim.
> > > 
> > > I am sending a fix for this right away. Thanks, the check should be
> > > if (order || !did_some_progress)
> > 
> > No.
> > 
> > it isn't enough.
> > after is does, order-1 allocation case twrice reclaim (soft limit shrinking
> > and normal try_to_free_pages()).
> > then, order-1 reclaim makes slower about 2 times.
> > 
> > unfortunately, order-1 allocation is very frequent. it is used for
> > kernel stack.
> 
> in normal order-1 reclaim is:
> 
> 1. try_to_free_pages()
> 2. get_page_from_freelist()
> 3. retry if order-1 page don't exist
> 
> Coundn't you use the same logic?

Sorry, forgot to answer this question earlier.

I assume that by order-1 you mean 2 pages (2^1).

Are you suggesting that if soft limit reclaim fails, we retry? Not
sure if I understand your suggestion completely.


-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  5:26           ` Balbir Singh
@ 2009-03-13  5:34             ` KOSAKI Motohiro
  0 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-03-13  5:34 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 13:50:26]:
> 
> > > > > I have two objection to this.
> > > > > 
> > > > > - "if (!order || !did_some_progress)" mean no call try_to_free_pages()
> > > > >   in order>0 and did_some_progress>0 case.
> > > > >   but mem_cgroup_soft_limit_reclaim() don't have lumpy reclaim.
> > > > >   then, it break high order reclaim.
> > > > 
> > > > I am sending a fix for this right away. Thanks, the check should be
> > > > if (order || !did_some_progress)
> > > 
> > > No.
> > > 
> > > it isn't enough.
> > > after is does, order-1 allocation case twrice reclaim (soft limit shrinking
> > > and normal try_to_free_pages()).
> > > then, order-1 reclaim makes slower about 2 times.
> > > 
> > > unfortunately, order-1 allocation is very frequent. it is used for
> > > kernel stack.
> > 
> > in normal order-1 reclaim is:
> > 
> > 1. try_to_free_pages()
> > 2. get_page_from_freelist()
> > 3. retry if order-1 page don't exist
> > 
> > Coundn't you use the same logic?
> 
> Sorry, forgot to answer this question earlier.
> 
> I assume that by order-1 you mean 2 pages (2^1).

Yes, almost architecutre's kernel stack use 2 pages.


> Are you suggesting that if soft limit reclaim fails, we retry? Not
> sure if I understand your suggestion completely.

sorry. my last explanation is too poor.

if we need only 2pages, soft_limit_shrink() typically success
reclaim it.
then

1. mem_cgroup_soft_limit_reclaim()
2. get_page_from_freelist() if mem_cgroup_soft_limit_reclaim() return >0.
3. goto got_pg if if get_page_from_freelist() is successed.
4. try_to_free_pages()
5. get_page_from_freelist() if try_to_free_pages() return >0.
6. goto 1 if order-1 page don't exist

obiously, this logic make slower more higher order because
(2) is often fail in higher order.
but that's ok. higher order reclaim is very slow. additional
overhead don't observe (maybe).

What do you think?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-12 17:56 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v5) Balbir Singh
  2009-03-12 23:34   ` Andrew Morton
  2009-03-13  1:36   ` KOSAKI Motohiro
@ 2009-03-13  6:51   ` KAMEZAWA Hiroyuki
  2009-03-13  7:15     ` Balbir Singh
  2 siblings, 1 reply; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-13  6:51 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

Balbir Singh wrote:
> Feature: Implement reclaim from groups over their soft limit
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>

> +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t
> gfp_mask)
> +{
> +	unsigned long nr_reclaimed = 0;
> +	struct mem_cgroup *mem;
> +	unsigned long flags;
> +	unsigned long reclaimed;
> +
> +	/*
> +	 * This loop can run a while, specially if mem_cgroup's continuously
> +	 * keep exceeding their soft limit and putting the system under
> +	 * pressure
> +	 */
> +	do {
> +		mem = mem_cgroup_largest_soft_limit_node();
> +		if (!mem)
> +			break;
> +
> +		reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
> +						gfp_mask,
> +						MEM_CGROUP_RECLAIM_SOFT);
> +		nr_reclaimed += reclaimed;
> +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> +		__mem_cgroup_remove_exceeded(mem);
> +		if (mem->usage_in_excess)
> +			__mem_cgroup_insert_exceeded(mem);
> +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +		css_put(&mem->css);
> +		cond_resched();
> +	} while (!nr_reclaimed);
> +	return nr_reclaimed;
> +}
> +
 Why do you never consider bad corner case....
 As I wrote many times, "order of global usage" doesn't mean the
 biggest user of memcg containes memory in zones which we want.
 So, please don't pust "mem" back to RB-tree if reclaimed is 0.

 This routine seems toooo bad as v4.
 Nack again.

Regards,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  5:07           ` Balbir Singh
@ 2009-03-13  6:54             ` KOSAKI Motohiro
  2009-03-13  7:03               ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-03-13  6:54 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

> > > My point is, contention case kswapd wakeup. and kswapd reclaim by
> > > global lru order before soft limit shrinking.
> > > Therefore, In typical usage, mem_cgroup_soft_limit_reclaim() almost
> > > don't call properly.
> > > 
> > > soft limit shrinking should run before processing global reclaim.
> > 
> > Do you have the reason of disliking call from kswapd ?
> >
> 
> Yes, I sent that reason out as comments to Kame's patches. kswapd or
> balance_pgdat controls the zones, priority and in effect how many
> pages we scan while doing reclaim. I did lots of experiments and found
> that if soft limit reclaim occurred from kswapd, soft_limit_reclaim
> would almost always fail and shrink_zone() would succeed, since it
> looks at the whole zone and is always able to find some pages at all
> priority levels. It also does not allow for targetted reclaim based on
> how much we exceed the soft limit by. 

hm
I read past discussion. so, I think we discuss many aspect at once.
So, my current thinking is below, 

(1) if the group don't have any soft limit shrinking page, 
    mem_cgroup_soft_limit_reclaim() spent time unnecessary.
    -> right.
      actually, past global reclaim had similar problem.
      then zone_is_all_unreclaimable() was introduced.
      maybe we can use similar technique to memcg.

(2) mem_cgroup_soft_limit_reclaim() should be called from?
    -> under discussion.
       we should solve (1) at first for proper constructive
       discussion.

(3) What's "fairness" of soft limit?
    -> perfectly another aspect.

So, I'd like to discuss (1) at first.
Although we don't kswapd shrinking, (1) is problem.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-12 17:56 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v5) Balbir Singh
  2009-03-12 23:04   ` Andrew Morton
  2009-03-13  0:47   ` KOSAKI Motohiro
@ 2009-03-13  6:59   ` KAMEZAWA Hiroyuki
  2009-03-13  7:09     ` Balbir Singh
  2 siblings, 1 reply; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-13  6:59 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

Balbir Singh wrote:
> Feature: Organize cgroups over soft limit in a RB-Tree
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Changelog v5...v4
> 1. res_counter_uncharge has an additional parameter to indicate if the
>    counter was over its soft limit, before uncharge.
>
> Changelog v4...v3
> 1. Optimizations to ensure we don't uncessarily get res_counter values
> 2. Fixed a bug in usage of time_after()
>
> Changelog v3...v2
> 1. Add only the ancestor to the RB-Tree
> 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
>
> Changelog v2...v1
> 1. Add support for hierarchies
> 2. The res_counter that is highest in the hierarchy is returned on soft
>    limit being exceeded. Since we do hierarchical reclaim and add all
>    groups exceeding their soft limits, this approach seems to work well
>    in practice.
>
> This patch introduces a RB-Tree for storing memory cgroups that are over
> their
> soft limit. The overall goal is to
>
> 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
>    We are careful about updates, updates take place only after a
> particular
>    time interval has passed
> 2. We remove the node from the RB-Tree when the usage goes below the soft
>    limit
>
> The next set of patches will exploit the RB-Tree to get the group that is
> over its soft limit by the largest amount and reclaim from it, when we
> face memory contention.
>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
>  include/linux/res_counter.h |    6 +-
>  kernel/res_counter.c        |   18 +++++
>  mm/memcontrol.c             |  141
> ++++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 143 insertions(+), 22 deletions(-)
>
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 5c821fd..5bbf8b1 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter,
> struct res_counter *parent);
>  int __must_check res_counter_charge_locked(struct res_counter *counter,
>  		unsigned long val);
>  int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> +		unsigned long val, struct res_counter **limit_fail_at,
> +		struct res_counter **soft_limit_at);
>
>  /*
>   * uncharge - tell that some portion of the resource is released
> @@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter
> *counter,
>   */
>
>  void res_counter_uncharge_locked(struct res_counter *counter, unsigned
> long val);
> -void res_counter_uncharge(struct res_counter *counter, unsigned long
> val);
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> +				bool *was_soft_limit_excess);
>
>  static inline bool res_counter_limit_check_locked(struct res_counter
> *cnt)
>  {
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 4e6dafe..51ec438 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter
> *counter, unsigned long val)
>  }
>
>  int res_counter_charge(struct res_counter *counter, unsigned long val,
> -			struct res_counter **limit_fail_at)
> +			struct res_counter **limit_fail_at,
> +			struct res_counter **soft_limit_fail_at)
>  {
>  	int ret;
>  	unsigned long flags;
>  	struct res_counter *c, *u;
>
>  	*limit_fail_at = NULL;
> +	if (soft_limit_fail_at)
> +		*soft_limit_fail_at = NULL;
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
>  		spin_lock(&c->lock);
>  		ret = res_counter_charge_locked(c, val);
> +		/*
> +		 * With soft limits, we return the highest ancestor
> +		 * that exceeds its soft limit
> +		 */
> +		if (soft_limit_fail_at &&
> +			!res_counter_soft_limit_check_locked(c))
> +			*soft_limit_fail_at = c;
>  		spin_unlock(&c->lock);
>  		if (ret < 0) {
>  			*limit_fail_at = c;

Do we need these all check at every call ?
If you just check tree update once per HZ/?? , please check
this onece per HZ/??. And please help people who doesn't use softlimit
i.e. who doesn't mount cgroup but have to use configured kernel.



> +static struct rb_root mem_cgroup_soft_limit_tree;
> +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> +
Can't we breakd down this lock to per node or per cpu or...?

> +/*
>   * The memory controller data structure. The memory controller controls
> both
>   * page cache and RSS per cgroup. We would eventually like to provide
>   * statistics based on the statistics developed by Rik Van Riel for
> clock-pro,
> @@ -176,12 +185,20 @@ struct mem_cgroup {
>
>  	unsigned int	swappiness;
>
> +	struct rb_node mem_cgroup_node;		/* RB tree node */
> +	unsigned long long usage_in_excess;	/* Set to the value by which */
> +						/* the soft limit is exceeded*/
> +	unsigned long last_tree_update;		/* Last time the tree was */
> +						/* updated in jiffies     */
> +
>  	/*
>  	 * statistics. This must be placed at the end of memcg.
>  	 */
>  	struct mem_cgroup_stat stat;
>  };
>
> +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
> +

Why HZ/4 again.

>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -214,6 +231,41 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>
> +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +{
> +	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct mem_cgroup *mem_node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	while (*p) {
> +		parent = *p;
> +		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> +		if (mem->usage_in_excess < mem_node->usage_in_excess)
> +			p = &(*p)->rb_left;
> +		/*
> +		 * We can't avoid mem cgroups that are over their soft
> +		 * limit by the same amount
> +		 */
> +		else if (mem->usage_in_excess >= mem_node->usage_in_excess)
> +			p = &(*p)->rb_right;
> +	}
> +	rb_link_node(&mem->mem_cgroup_node, parent, p);
> +	rb_insert_color(&mem->mem_cgroup_node,
> +			&mem_cgroup_soft_limit_tree);
> +	mem->last_tree_update = jiffies;
> +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +}
> +
> +static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> +{
> +	unsigned long flags;
> +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +}
> +
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  					 struct page_cgroup *pc,
>  					 bool charge)
> @@ -897,6 +949,40 @@ static void record_last_oom(struct mem_cgroup *mem)
>  	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
>  }
>
> +static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
> +						bool time_check)
> +{
> +	unsigned long long prev_usage_in_excess, new_usage_in_excess;
> +	bool updated_tree = false;
> +	unsigned long next_update = 0;
> +	unsigned long flags;
> +
> +	prev_usage_in_excess = mem->usage_in_excess;
> +
> +	if (time_check)
> +		next_update = mem->last_tree_update +
> +				MEM_CGROUP_TREE_UPDATE_INTERVAL;
> +
> +	if (!time_check || time_after(jiffies, next_update)) {
> +		new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> +		if (prev_usage_in_excess) {
> +			mem_cgroup_remove_exceeded(mem);
> +			updated_tree = true;
> +		}
> +		if (!new_usage_in_excess)
> +			goto done;
> +		mem_cgroup_insert_exceeded(mem);
> +		updated_tree = true;
> +	}
> +
> +done:
> +	if (updated_tree) {
> +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +		mem->last_tree_update = jiffies;
> +		mem->usage_in_excess = new_usage_in_excess;
> +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +	}
> +}
Why update key parameter after inserting tree ? Is this bug ?
Maybe RB-tree will be not sorted.

Bye,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v5)
  2009-03-12 17:56 [PATCH 0/4] Memory controller soft limit patches (v5) Balbir Singh
                   ` (3 preceding siblings ...)
  2009-03-12 17:56 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v5) Balbir Singh
@ 2009-03-13  7:02 ` KAMEZAWA Hiroyuki
  2009-03-13  7:07   ` KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-13  7:02 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

Balbir Singh さんは書きました:
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> New Feature: Soft limits for memory resource controller.
>
> Changelog v5...v4
> 1. Several changes to the reclaim logic, please see the patch 4 (reclaim
> on
>    contention). I've experimented with several possibilities for reclaim
>    and chose to come back to this due to the excellent behaviour seen
> while
>    testing the patchset.
> 2. Reduced the overhead of soft limits on resource counters very
> significantly.
>    Reaim benchmark now shows almost no drop in performance.
>
It seems there are no changes to answer my last comments.

Nack again. I'll update my own version again.

Thanks,
-Kame

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  6:54             ` KOSAKI Motohiro
@ 2009-03-13  7:03               ` Balbir Singh
  2009-03-13  7:17                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  7:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 15:54:03]:

> > > > My point is, contention case kswapd wakeup. and kswapd reclaim by
> > > > global lru order before soft limit shrinking.
> > > > Therefore, In typical usage, mem_cgroup_soft_limit_reclaim() almost
> > > > don't call properly.
> > > > 
> > > > soft limit shrinking should run before processing global reclaim.
> > > 
> > > Do you have the reason of disliking call from kswapd ?
> > >
> > 
> > Yes, I sent that reason out as comments to Kame's patches. kswapd or
> > balance_pgdat controls the zones, priority and in effect how many
> > pages we scan while doing reclaim. I did lots of experiments and found
> > that if soft limit reclaim occurred from kswapd, soft_limit_reclaim
> > would almost always fail and shrink_zone() would succeed, since it
> > looks at the whole zone and is always able to find some pages at all
> > priority levels. It also does not allow for targetted reclaim based on
> > how much we exceed the soft limit by. 
> 
> hm
> I read past discussion. so, I think we discuss many aspect at once.
> So, my current thinking is below, 
> 
> (1) if the group don't have any soft limit shrinking page, 
>     mem_cgroup_soft_limit_reclaim() spent time unnecessary.
>     -> right.

If the soft limit RB tree is empty, we don't spend any time at all.
Are you referring to something else? Am I missing something? The tree
will be empty if no group is over the soft limit.

>       actually, past global reclaim had similar problem.
>       then zone_is_all_unreclaimable() was introduced.
>       maybe we can use similar technique to memcg.
> 
> (2) mem_cgroup_soft_limit_reclaim() should be called from?
>     -> under discussion.
>        we should solve (1) at first for proper constructive
>        discussion.
> 
> (3) What's "fairness" of soft limit?
>     -> perfectly another aspect.
> 
> So, I'd like to discuss (1) at first.
> Although we don't kswapd shrinking, (1) is problem.
> 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v5)
  2009-03-13  7:02 ` [PATCH 0/4] Memory controller soft limit patches (v5) KAMEZAWA Hiroyuki
@ 2009-03-13  7:07   ` KAMEZAWA Hiroyuki
  2009-03-13  7:15     ` Andrew Morton
  2009-03-13  7:18     ` Balbir Singh
  0 siblings, 2 replies; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-13  7:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro,
	Rik van Riel, Andrew Morton

KAMEZAWA Hiroyuki さんは書きました:
> Balbir Singh さんは書きました:
>>
>> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>>
>> New Feature: Soft limits for memory resource controller.
>>
>> Changelog v5...v4
>> 1. Several changes to the reclaim logic, please see the patch 4 (reclaim
>> on
>>    contention). I've experimented with several possibilities for reclaim
>>    and chose to come back to this due to the excellent behaviour seen
>> while
>>    testing the patchset.
>> 2. Reduced the overhead of soft limits on resource counters very
>> significantly.
>>    Reaim benchmark now shows almost no drop in performance.
>>
> It seems there are no changes to answer my last comments.
>
> Nack again. I'll update my own version again.
>
Sigh, this is in -mm ? okay...I'll update onto -mm as much as I can.
Very heavy work, maybe.
Thanks,
-Kame



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-13  6:59   ` KAMEZAWA Hiroyuki
@ 2009-03-13  7:09     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  7:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
	Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-13 15:59:49]:

> Balbir Singh wrote:
> > Feature: Organize cgroups over soft limit in a RB-Tree
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > Changelog v5...v4
> > 1. res_counter_uncharge has an additional parameter to indicate if the
> >    counter was over its soft limit, before uncharge.
> >
> > Changelog v4...v3
> > 1. Optimizations to ensure we don't uncessarily get res_counter values
> > 2. Fixed a bug in usage of time_after()
> >
> > Changelog v3...v2
> > 1. Add only the ancestor to the RB-Tree
> > 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
> >
> > Changelog v2...v1
> > 1. Add support for hierarchies
> > 2. The res_counter that is highest in the hierarchy is returned on soft
> >    limit being exceeded. Since we do hierarchical reclaim and add all
> >    groups exceeding their soft limits, this approach seems to work well
> >    in practice.
> >
> > This patch introduces a RB-Tree for storing memory cgroups that are over
> > their
> > soft limit. The overall goal is to
> >
> > 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> >    We are careful about updates, updates take place only after a
> > particular
> >    time interval has passed
> > 2. We remove the node from the RB-Tree when the usage goes below the soft
> >    limit
> >
> > The next set of patches will exploit the RB-Tree to get the group that is
> > over its soft limit by the largest amount and reclaim from it, when we
> > face memory contention.
> >
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > ---
> >
> >  include/linux/res_counter.h |    6 +-
> >  kernel/res_counter.c        |   18 +++++
> >  mm/memcontrol.c             |  141
> > ++++++++++++++++++++++++++++++++++++++-----
> >  3 files changed, 143 insertions(+), 22 deletions(-)
> >
> >
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index 5c821fd..5bbf8b1 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter,
> > struct res_counter *parent);
> >  int __must_check res_counter_charge_locked(struct res_counter *counter,
> >  		unsigned long val);
> >  int __must_check res_counter_charge(struct res_counter *counter,
> > -		unsigned long val, struct res_counter **limit_fail_at);
> > +		unsigned long val, struct res_counter **limit_fail_at,
> > +		struct res_counter **soft_limit_at);
> >
> >  /*
> >   * uncharge - tell that some portion of the resource is released
> > @@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter
> > *counter,
> >   */
> >
> >  void res_counter_uncharge_locked(struct res_counter *counter, unsigned
> > long val);
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long
> > val);
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > +				bool *was_soft_limit_excess);
> >
> >  static inline bool res_counter_limit_check_locked(struct res_counter
> > *cnt)
> >  {
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 4e6dafe..51ec438 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter
> > *counter, unsigned long val)
> >  }
> >
> >  int res_counter_charge(struct res_counter *counter, unsigned long val,
> > -			struct res_counter **limit_fail_at)
> > +			struct res_counter **limit_fail_at,
> > +			struct res_counter **soft_limit_fail_at)
> >  {
> >  	int ret;
> >  	unsigned long flags;
> >  	struct res_counter *c, *u;
> >
> >  	*limit_fail_at = NULL;
> > +	if (soft_limit_fail_at)
> > +		*soft_limit_fail_at = NULL;
> >  	local_irq_save(flags);
> >  	for (c = counter; c != NULL; c = c->parent) {
> >  		spin_lock(&c->lock);
> >  		ret = res_counter_charge_locked(c, val);
> > +		/*
> > +		 * With soft limits, we return the highest ancestor
> > +		 * that exceeds its soft limit
> > +		 */
> > +		if (soft_limit_fail_at &&
> > +			!res_counter_soft_limit_check_locked(c))
> > +			*soft_limit_fail_at = c;
> >  		spin_unlock(&c->lock);
> >  		if (ret < 0) {
> >  			*limit_fail_at = c;
> 
> Do we need these all check at every call ?
> If you just check tree update once per HZ/?? , please check
> this onece per HZ/??. And please help people who doesn't use softlimit
> i.e. who doesn't mount cgroup but have to use configured kernel.
> 
> 
> 
> > +static struct rb_root mem_cgroup_soft_limit_tree;
> > +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> > +
> Can't we breakd down this lock to per node or per cpu or...?
>

I responded to the same comment by Kosaki.
 
> > +/*
> >   * The memory controller data structure. The memory controller controls
> > both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> >   * statistics based on the statistics developed by Rik Van Riel for
> > clock-pro,
> > @@ -176,12 +185,20 @@ struct mem_cgroup {
> >
> >  	unsigned int	swappiness;
> >
> > +	struct rb_node mem_cgroup_node;		/* RB tree node */
> > +	unsigned long long usage_in_excess;	/* Set to the value by which */
> > +						/* the soft limit is exceeded*/
> > +	unsigned long last_tree_update;		/* Last time the tree was */
> > +						/* updated in jiffies     */
> > +
> >  	/*
> >  	 * statistics. This must be placed at the end of memcg.
> >  	 */
> >  	struct mem_cgroup_stat stat;
> >  };
> >
> > +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
> > +
> 
> Why HZ/4 again.
> 

Good question. HZ/4 is not too aggressive or not to lax. The value is
empirical.


> >  enum charge_type {
> >  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> >  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> > @@ -214,6 +231,41 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> >  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >
> > +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +{
> > +	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
> > +	struct rb_node *parent = NULL;
> > +	struct mem_cgroup *mem_node;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	while (*p) {
> > +		parent = *p;
> > +		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> > +		if (mem->usage_in_excess < mem_node->usage_in_excess)
> > +			p = &(*p)->rb_left;
> > +		/*
> > +		 * We can't avoid mem cgroups that are over their soft
> > +		 * limit by the same amount
> > +		 */
> > +		else if (mem->usage_in_excess >= mem_node->usage_in_excess)
> > +			p = &(*p)->rb_right;
> > +	}
> > +	rb_link_node(&mem->mem_cgroup_node, parent, p);
> > +	rb_insert_color(&mem->mem_cgroup_node,
> > +			&mem_cgroup_soft_limit_tree);
> > +	mem->last_tree_update = jiffies;
> > +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +}
> > +
> > +static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> > +{
> > +	unsigned long flags;
> > +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> > +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +}
> > +
> >  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> >  					 struct page_cgroup *pc,
> >  					 bool charge)
> > @@ -897,6 +949,40 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> >  }
> >
> > +static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
> > +						bool time_check)
> > +{
> > +	unsigned long long prev_usage_in_excess, new_usage_in_excess;
> > +	bool updated_tree = false;
> > +	unsigned long next_update = 0;
> > +	unsigned long flags;
> > +
> > +	prev_usage_in_excess = mem->usage_in_excess;
> > +
> > +	if (time_check)
> > +		next_update = mem->last_tree_update +
> > +				MEM_CGROUP_TREE_UPDATE_INTERVAL;
> > +
> > +	if (!time_check || time_after(jiffies, next_update)) {
> > +		new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > +		if (prev_usage_in_excess) {
> > +			mem_cgroup_remove_exceeded(mem);
> > +			updated_tree = true;
> > +		}
> > +		if (!new_usage_in_excess)
> > +			goto done;
> > +		mem_cgroup_insert_exceeded(mem);
> > +		updated_tree = true;
> > +	}
> > +
> > +done:
> > +	if (updated_tree) {
> > +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +		mem->last_tree_update = jiffies;
> > +		mem->usage_in_excess = new_usage_in_excess;
> > +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +	}
> > +}
> Why update key parameter after inserting tree ? Is this bug ?
> Maybe RB-tree will be not sorted.
>

Hmm.. I'll revisit. Thanks!

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  6:51   ` KAMEZAWA Hiroyuki
@ 2009-03-13  7:15     ` Balbir Singh
  2009-03-13  8:41       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  7:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
	Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-13 15:51:25]:

> Balbir Singh wrote:
> > Feature: Implement reclaim from groups over their soft limit
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t
> > gfp_mask)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +	struct mem_cgroup *mem;
> > +	unsigned long flags;
> > +	unsigned long reclaimed;
> > +
> > +	/*
> > +	 * This loop can run a while, specially if mem_cgroup's continuously
> > +	 * keep exceeding their soft limit and putting the system under
> > +	 * pressure
> > +	 */
> > +	do {
> > +		mem = mem_cgroup_largest_soft_limit_node();
> > +		if (!mem)
> > +			break;
> > +
> > +		reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
> > +						gfp_mask,
> > +						MEM_CGROUP_RECLAIM_SOFT);
> > +		nr_reclaimed += reclaimed;
> > +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > +		__mem_cgroup_remove_exceeded(mem);
> > +		if (mem->usage_in_excess)
> > +			__mem_cgroup_insert_exceeded(mem);
> > +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +		css_put(&mem->css);
> > +		cond_resched();
> > +	} while (!nr_reclaimed);
> > +	return nr_reclaimed;
> > +}
> > +
>  Why do you never consider bad corner case....
>  As I wrote many times, "order of global usage" doesn't mean the
>  biggest user of memcg containes memory in zones which we want.
>  So, please don't pust "mem" back to RB-tree if reclaimed is 0.
> 
>  This routine seems toooo bad as v4.

Are you talking about cases where a particular mem cgroup never
allocated from a node? Thanks.. let me take a look at it

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v5)
  2009-03-13  7:07   ` KAMEZAWA Hiroyuki
@ 2009-03-13  7:15     ` Andrew Morton
  2009-03-13  7:29       ` Balbir Singh
  2009-03-13  7:18     ` Balbir Singh
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2009-03-13  7:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro,
	Rik van Riel

On Fri, 13 Mar 2009 16:07:35 +0900 (JST) "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > Nack again. I'll update my own version again.
> >
> Sigh, this is in -mm ? okay...I'll update onto -mm as much as I can.
> Very heavy work, maybe.

I dropped them all again.  it appears that quite a few changes are needed
and I don't think we want these patches interfering with other cgroup
and general MM development.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  7:03               ` Balbir Singh
@ 2009-03-13  7:17                 ` KOSAKI Motohiro
  2009-03-13  7:26                   ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-03-13  7:17 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

> > hm
> > I read past discussion. so, I think we discuss many aspect at once.
> > So, my current thinking is below, 
> > 
> > (1) if the group don't have any soft limit shrinking page, 
> >     mem_cgroup_soft_limit_reclaim() spent time unnecessary.
> >     -> right.
> 
> If the soft limit RB tree is empty, we don't spend any time at all.
> Are you referring to something else? Am I missing something? The tree
> will be empty if no group is over the soft limit.

maybe, I am missing anything.
May I ask your following paragraph meaning?


> I experimented a *lot* with zone reclaim and found it to be not so
> effective. Here is why
> 
> 1. We have no control over priority or how much to scan, that is
> controlled by balance_pgdat(). If we find that we are unable to scan
> anything, we continue scanning with the scan > 0 check, but we scan
> the same pages and the same number, because shrink_zone does scan >>
> priority.

I thought this sentense mean soft-limit-shrinking spent a lot of time.
if not, could you please tell me what makes so slower?

and, you wrote:

> 
> Yes, I sent that reason out as comments to Kame's patches. kswapd or
> balance_pgdat controls the zones, priority and in effect how many
> pages we scan while doing reclaim. I did lots of experiments and found
> that if soft limit reclaim occurred from kswapd, soft_limit_reclaim
> would almost always fail and shrink_zone() would succeed, since it
> looks at the whole zone and is always able to find some pages at all
> priority levels. It also does not allow for targetted reclaim based on
> how much we exceed the soft limit by. 

but, if "soft_limit_reclaim fail and shrink_zone() succeed" don't cause
any performance degression, I don't find why kswapd is wrong.

I guess you and kamezawa-san know it detail. but my understanding don't reach it.
Could you please tell me what so slowness.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v5)
  2009-03-13  7:07   ` KAMEZAWA Hiroyuki
  2009-03-13  7:15     ` Andrew Morton
@ 2009-03-13  7:18     ` Balbir Singh
  1 sibling, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  7:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
	Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-13 16:07:35]:

> KAMEZAWA Hiroyuki ?$B$5$s$O=q$-$^$7$?!'
> > Balbir Singh ?$B$5$s$O=q$-$^$7$?!'
> >>
> >> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >>
> >> New Feature: Soft limits for memory resource controller.
> >>
> >> Changelog v5...v4
> >> 1. Several changes to the reclaim logic, please see the patch 4 (reclaim
> >> on
> >>    contention). I've experimented with several possibilities for reclaim
> >>    and chose to come back to this due to the excellent behaviour seen
> >> while
> >>    testing the patchset.
> >> 2. Reduced the overhead of soft limits on resource counters very
> >> significantly.
> >>    Reaim benchmark now shows almost no drop in performance.
> >>
> > It seems there are no changes to answer my last comments.
> >
> > Nack again. I'll update my own version again.
> >
> Sigh, this is in -mm ? okay...I'll update onto -mm as much as I can.
> Very heavy work, maybe.

Andrew just dropped it from -mm, so don't rush on updating. I was
about to send fixes to address review comments, but I'll just merge
that in -v6.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  7:17                 ` KOSAKI Motohiro
@ 2009-03-13  7:26                   ` Balbir Singh
  2009-03-13  8:37                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  7:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 16:17:25]:

> > > hm
> > > I read past discussion. so, I think we discuss many aspect at once.
> > > So, my current thinking is below, 
> > > 
> > > (1) if the group don't have any soft limit shrinking page, 
> > >     mem_cgroup_soft_limit_reclaim() spent time unnecessary.
> > >     -> right.
> > 
> > If the soft limit RB tree is empty, we don't spend any time at all.
> > Are you referring to something else? Am I missing something? The tree
> > will be empty if no group is over the soft limit.
> 
> maybe, I am missing anything.
> May I ask your following paragraph meaning?
> 
> 
> > I experimented a *lot* with zone reclaim and found it to be not so
> > effective. Here is why
> > 
> > 1. We have no control over priority or how much to scan, that is
> > controlled by balance_pgdat(). If we find that we are unable to scan
> > anything, we continue scanning with the scan > 0 check, but we scan
> > the same pages and the same number, because shrink_zone does scan >>
> > priority.
> 
> I thought this sentense mean soft-limit-shrinking spent a lot of time.
> if not, could you please tell me what makes so slower?
> 

Let me summarize below


> and, you wrote:
> 
> > 
> > Yes, I sent that reason out as comments to Kame's patches. kswapd or
> > balance_pgdat controls the zones, priority and in effect how many
> > pages we scan while doing reclaim. I did lots of experiments and found
> > that if soft limit reclaim occurred from kswapd, soft_limit_reclaim
> > would almost always fail and shrink_zone() would succeed, since it
> > looks at the whole zone and is always able to find some pages at all
> > priority levels. It also does not allow for targetted reclaim based on
> > how much we exceed the soft limit by. 
> 
> but, if "soft_limit_reclaim fail and shrink_zone() succeed" don't cause
> any performance degression, I don't find why kswapd is wrong.
> 
> I guess you and kamezawa-san know it detail. but my understanding don't reach it.
> Could you please tell me what so slowness.

Here is what I saw in my experiments

1. Kame's scan logic, selects shrink_zone for the mem cgroup, but the
   pages scanned and reclaimed from depend on priority and watermarks
   of the zone and *not* at all on the soft limit parameters.
2. Because soft limit reclaim fails to reclaim anythoing (due to 1),
   shrink_zone which is called, does reclaiming indepedent of any
   knowledge of soft limits, which does not work as expected.



-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v5)
  2009-03-13  7:15     ` Andrew Morton
@ 2009-03-13  7:29       ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  7:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf,
	KOSAKI Motohiro, Rik van Riel

* Andrew Morton <akpm@linux-foundation.org> [2009-03-13 00:15:14]:

> On Fri, 13 Mar 2009 16:07:35 +0900 (JST) "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > Nack again. I'll update my own version again.
> > >
> > Sigh, this is in -mm ? okay...I'll update onto -mm as much as I can.
> > Very heavy work, maybe.
> 
> I dropped them all again.  it appears that quite a few changes are needed
> and I don't think we want these patches interfering with other cgroup
> and general MM development.
>

Thanks Andrew and I'll send an updated patchset later. I'll fix most
review comments and post v6. Hopefully, Kame's and my thought
processes would merge and we'll be ready for inclusion soon again. 

I don't think v6 will be ready for inclusion yet.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-12 23:34   ` Andrew Morton
@ 2009-03-13  7:53     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  7:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, yamamoto, lizf, kosaki.motohiro, riel, kamezawa.hiroyu

* Andrew Morton <akpm@linux-foundation.org> [2009-03-12 16:34:25]:

> On Thu, 12 Mar 2009 23:26:31 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Feature: Implement reclaim from groups over their soft limit
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v5...v4
> > 
> > 1. Throttling is removed, earlier we throttled tasks over their soft limit
> > 2. Reclaim has been moved back to __alloc_pages_internal, several experiments
> >    and tests showed that it was the best place to reclaim memory. kswapd has
> >    a different goal, that does not work with a single soft limit for the memory
> >    cgroup.
> > 3. Soft limit reclaim is more targetted and the pages reclaim depend on the
> >    amount by which the soft limit is exceeded.
> > 
> > Changelog v4...v3
> > 1. soft_reclaim is now called from balance_pgdat
> > 2. soft_reclaim is aware of nodes and zones
> > 3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
> >    and at the same time trying to allocate pages and exceed its soft limit.
> > 4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
> >    particular to a mem cgroup.
> > 
> > Changelog v3...v2
> > 1. Convert several arguments to hierarchical reclaim to flags, thereby
> >    consolidating them
> > 2. The reclaim for soft limits is now triggered from kswapd
> > 3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument
> > 
> > 
> > Changelog v2...v1
> > 1. Added support for hierarchical soft limits
> > 
> > This patch allows reclaim from memory cgroups on contention (via the
> > kswapd() path) only if the order is 0.
> 
> Why for order-0 only?
>

Mem cgroup does not allocate order > 1 pages.
 
> What are the implications of not handling higher-order pages?
> 
> Why kswapd only?
> 
> What are the implications of omitting this from direct reclaim?
> 
> > memory cgroup soft limit reclaim finds the group that exceeds its soft limit
> > by the largest amount and reclaims pages from it and then reinserts the
> > cgroup into its correct place in the rbtree.
> 
> Why trim the single worst-case group rather than (say) trimming all
> groups by a common proportion?  Or other things.
> 
> 
> When you say "by the largest amount", is that the "by largest number of
> pages" or "by the largest percentage"?
> 
> What are the implications of <whichever you chose>?
>

I'll update the changelog in the next post.

 
> >
> > ...
> >
> >  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > -				   gfp_t gfp_mask, bool noswap, bool shrink)
> > +						struct zonelist *zl,
> > +						gfp_t gfp_mask,
> > +						unsigned long flags)
> >  {
> >  	struct mem_cgroup *victim;
> >  	int ret, total = 0;
> >  	int loop = 0;
> > +	bool noswap = flags & MEM_CGROUP_RECLAIM_NOSWAP;
> > +	bool shrink = flags & MEM_CGROUP_RECLAIM_SHRINK;
> > +	bool check_soft = flags & MEM_CGROUP_RECLAIM_SOFT;
> 
> `flags' wasn't a great choice of identifier.  It's a bit misleading,
> and you'll be in a pickle if you later try to add a
> spin_lock_irqsave(lock, flags) to this function.  Maybe choose a more
> specific name?

Yes.. good point, I should call it reclaim_options

> 
> > +	unsigned long excess = mem_cgroup_get_excess(root_mem);
> >  
> > -	while (loop < 2) {
> > +	while (1) {
> > +		if (loop >= 2) {
> > +			/*
> > +			 * With soft limits, do more targetted reclaim
> > +			 */
> > +			if (check_soft && (total >= (excess >> 4)))
> > +				break;
> > +			else if (!check_soft)
> > +				break;
> 
> maybe..
> 
> 			if (!check_soft)
> 				break;
> 			if (total >= (excess >> 4))
> 				break;
> 
> dunno.
> 
> The ">> 4" magic number would benefit from an explanatory comment.  Why
> not ">> 3"???
> 

Done.. will fix it.

> 
> > +		}
> >  		victim = mem_cgroup_select_victim(root_mem);
> > +		/*
> > +		 * In the first loop, don't reclaim from victims below
> > +		 * their soft limit
> > +		 */
> >
> > ...
> >
> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +	struct mem_cgroup *mem;
> > +	unsigned long flags;
> > +	unsigned long reclaimed;
> > +
> > +	/*
> > +	 * This loop can run a while, specially if mem_cgroup's continuously
> > +	 * keep exceeding their soft limit and putting the system under
> > +	 * pressure
> > +	 */
> > +	do {
> > +		mem = mem_cgroup_largest_soft_limit_node();
> > +		if (!mem)
> > +			break;
> > +
> > +		reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
> > +						gfp_mask,
> > +						MEM_CGROUP_RECLAIM_SOFT);
> > +		nr_reclaimed += reclaimed;
> > +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > +		__mem_cgroup_remove_exceeded(mem);
> > +		if (mem->usage_in_excess)
> > +			__mem_cgroup_insert_exceeded(mem);
> > +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +		css_put(&mem->css);
> > +		cond_resched();
> 
> spin_lock_irq() would suffice here.  Or the cond_resched() is a bug.
> 
> There's a decent argument that spin_lock_irq() is dangerous, and its
> saving is so small that it's better to use the more robust
> spin_lock_irqsave() all the time.  But was that the intent here?
>

spin_lock_irqsave allows more control over where the routine can be
called fromi (that does not matter right now, since we have just one
call path(. I could just use spin_lock_irq and that would be fine for
now. Like you said the benefits are really small.

cond_resched() can go away, I'll remove it.
 
> 
> > +	} while (!nr_reclaimed);
> > +	return nr_reclaimed;
> > +}
> 
> 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v5)
  2009-03-13  5:22       ` KOSAKI Motohiro
@ 2009-03-13  8:20         ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2009-03-13  8:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 14:22:28]:

> > > >  /*
> > > > + * Cgroups above their limits are maintained in a RB-Tree, independent of
> > > > + * their hierarchy representation
> > > > + */
> > > > +
> > > > +static struct rb_root mem_cgroup_soft_limit_tree;
> > > > +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> > > 
> > > I have objection to this.
> > > Please don't use global spin lock.
> > 
> > We need a global data structure, per node, per zone is no good, since
> > the limits (soft limit in this case) is for the entire cgroup.
> 
> this smell the data structure is wrong.
> 
> rb-tree soring is one of efficient reclaiming technique.
> but global lock bust due to this patch's good side.
>

memory cgroup is a global data structure. Can we have the same mem
cgorup in several RB-Trees?
 
> if its updating is really rare, rcu is better?
> or couldn't you select another data structure?

RCU for RB-Trees? We'll need to RCU'fy all the links and the core
RB-Tree. RB-Trees are being used in the scheduler and for
io-scheduling, hrtimers, etc. What is your concern with the data
structure?


-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  7:26                   ` Balbir Singh
@ 2009-03-13  8:37                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-13  8:37 UTC (permalink / raw)
  To: balbir
  Cc: KOSAKI Motohiro, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki

Balbir Singh wrote:
> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-13 16:17:25]:

> 1. Kame's scan logic, selects shrink_zone for the mem cgroup, but the
>    pages scanned and reclaimed from depend on priority and watermarks
>    of the zone and *not* at all on the soft limit parameters.
What means "not at all" ? My test result was illusion ?
My routine reclaims memory from memcg which over soft limit....
What modification is necessary ?
(Anyway, I'll remove priority and introduce something more intellegent here.)

> 2. Because soft limit reclaim fails to reclaim anythoing (due to 1),
>    shrink_zone which is called, does reclaiming indepedent of any
>    knowledge of soft limits, which does not work as expected.
>
I agree that we need some hook to loop in of shrink_zone to taking
care of softlimit.

Thanks,
-Kame


>
> --
> 	Balbir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v5)
  2009-03-13  7:15     ` Balbir Singh
@ 2009-03-13  8:41       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-13  8:41 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf,
	KOSAKI Motohiro, Rik van Riel, Andrew Morton

Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-13
> 15:51:25]:
>
>> Balbir Singh wrote:
>> > Feature: Implement reclaim from groups over their soft limit
>> >
>> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
>>
>> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl,
>> gfp_t
>> > gfp_mask)
>> > +{
>> > +	unsigned long nr_reclaimed = 0;
>> > +	struct mem_cgroup *mem;
>> > +	unsigned long flags;
>> > +	unsigned long reclaimed;
>> > +
>> > +	/*
>> > +	 * This loop can run a while, specially if mem_cgroup's continuously
>> > +	 * keep exceeding their soft limit and putting the system under
>> > +	 * pressure
>> > +	 */
>> > +	do {
>> > +		mem = mem_cgroup_largest_soft_limit_node();
>> > +		if (!mem)
>> > +			break;
>> > +
>> > +		reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
>> > +						gfp_mask,
>> > +						MEM_CGROUP_RECLAIM_SOFT);
>> > +		nr_reclaimed += reclaimed;
>> > +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
>> > +		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
>> > +		__mem_cgroup_remove_exceeded(mem);
>> > +		if (mem->usage_in_excess)
>> > +			__mem_cgroup_insert_exceeded(mem);
>> > +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
>> > +		css_put(&mem->css);
>> > +		cond_resched();
>> > +	} while (!nr_reclaimed);
>> > +	return nr_reclaimed;
>> > +}
>> > +
>>  Why do you never consider bad corner case....
>>  As I wrote many times, "order of global usage" doesn't mean the
>>  biggest user of memcg containes memory in zones which we want.
>>  So, please don't pust "mem" back to RB-tree if reclaimed is 0.
>>
>>  This routine seems toooo bad as v4.
>
> Are you talking about cases where a particular mem cgroup never
> allocated from a node? Thanks.. let me take a look at it
>
Using cpuset to test and limiting nodes for memory is an easy way,

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2009-03-13  8:41 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-12 17:56 [PATCH 0/4] Memory controller soft limit patches (v5) Balbir Singh
2009-03-12 17:56 ` [PATCH 1/4] Memory controller soft limit documentation (v5) Balbir Singh
2009-03-12 17:56 ` [PATCH 2/4] Memory controller soft limit interface (v5) Balbir Singh
2009-03-12 22:59   ` Andrew Morton
2009-03-13  4:58     ` Balbir Singh
2009-03-12 17:56 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v5) Balbir Singh
2009-03-12 23:04   ` Andrew Morton
2009-03-13  5:03     ` Balbir Singh
2009-03-13  0:47   ` KOSAKI Motohiro
2009-03-13  5:04     ` Balbir Singh
2009-03-13  5:22       ` KOSAKI Motohiro
2009-03-13  8:20         ` Balbir Singh
2009-03-13  6:59   ` KAMEZAWA Hiroyuki
2009-03-13  7:09     ` Balbir Singh
2009-03-12 17:56 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v5) Balbir Singh
2009-03-12 23:34   ` Andrew Morton
2009-03-13  7:53     ` Balbir Singh
2009-03-13  1:36   ` KOSAKI Motohiro
2009-03-13  4:13     ` Balbir Singh
2009-03-13  4:31       ` KOSAKI Motohiro
2009-03-13  4:50         ` KOSAKI Motohiro
2009-03-13  5:07           ` Balbir Singh
2009-03-13  6:54             ` KOSAKI Motohiro
2009-03-13  7:03               ` Balbir Singh
2009-03-13  7:17                 ` KOSAKI Motohiro
2009-03-13  7:26                   ` Balbir Singh
2009-03-13  8:37                     ` KAMEZAWA Hiroyuki
2009-03-13  5:26           ` Balbir Singh
2009-03-13  5:34             ` KOSAKI Motohiro
2009-03-13  4:58         ` Balbir Singh
2009-03-13  6:51   ` KAMEZAWA Hiroyuki
2009-03-13  7:15     ` Balbir Singh
2009-03-13  8:41       ` KAMEZAWA Hiroyuki
2009-03-13  7:02 ` [PATCH 0/4] Memory controller soft limit patches (v5) KAMEZAWA Hiroyuki
2009-03-13  7:07   ` KAMEZAWA Hiroyuki
2009-03-13  7:15     ` Andrew Morton
2009-03-13  7:29       ` Balbir Singh
2009-03-13  7:18     ` Balbir Singh

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.