linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] Memory controller soft limit patches (v4)
@ 2009-03-06  9:23 Balbir Singh
  2009-03-06  9:23 ` [PATCH 1/4] Memory controller soft limit documentation (v4) Balbir Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Balbir Singh @ 2009-03-06  9:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

New Feature: Soft limits for memory resource controller.

Changelog v4...v3
1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
   while doing soft limit reclaim. We don't record priorities while
   doing soft reclaim
2. Some of the overheads associated with soft limits (like calculating
   excess each time) is eliminated
3. The time_after(jiffies, 0) bug has been fixed
4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
   and at the same time tasks are increasing the memory footprint and causing
   the mem cgroup to exceed its soft limit.

Changelog v3...v2
1. Implemented several review comments from Kosaki-San and Kamezawa-San
   Please see individual changelogs for changes

Changelog v2...v1
1. Soft limits now support hierarchies
2. Use spinlocks instead of mutexes for synchronization of the RB tree

Here is v4 of the new soft limit implementation. Soft limits is a new feature
for the memory resource controller, something similar has existed in the
group scheduler in the form of shares. The CPU controllers interpretation
of shares is very different though. 

Soft limits are the most useful feature to have for environments where
the administrator wants to overcommit the system, such that only on memory
contention do the limits become active. The current soft limits implementation
provides a soft_limit_in_bytes interface for the memory controller and not
for memory+swap controller. The implementation maintains an RB-Tree of groups
that exceed their soft limit and starts reclaiming from the group that
exceeds this limit by the maximum amount.

If there are no major objections to the patches, I would like to get them
included in -mm.

TODOs

1. The current implementation maintains the delta from the soft limit
   and pushes back groups to their soft limits, a ratio of delta/soft_limit
   might be more useful
2. It would be nice to have more targetted reclaim (in terms of pages to
   recalim) interface. So that groups are pushed back, close to their soft
   limits.

Tests
-----

I've run two memory intensive workloads with differing soft limits and
seen that they are pushed back to their soft limit on contention. Their usage
was their soft limit plus additional memory that they were able to grab
on the system. Soft limit can take a while before we see the expected
results.

The other tests I've run are
1. Deletion of groups while soft limit is in progress in the hierarchy
2. Setting the soft limit to zero and running other groups with non-zero
   soft limits.
3. Setting the soft limit to zero and testing if the mem cgroup is able
   to use available memory

Please review, comment.

Series
------

memcg-soft-limit-documentation.patch
memcg-add-soft-limit-interface.patch
memcg-organize-over-soft-limit-groups.patch
memcg-soft-limit-reclaim-on-contention.patch



-- 
	Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] Memory controller soft limit documentation (v4)
  2009-03-06  9:23 [PATCH 0/4] Memory controller soft limit patches (v4) Balbir Singh
@ 2009-03-06  9:23 ` Balbir Singh
  2009-03-06  9:23 ` [PATCH 2/4] Memory controller soft limit interface (v4) Balbir Singh
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2009-03-06  9:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

Feature: Add documentation for soft limits

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt |   27 ++++++++++++++++++++++++++-
 1 files changed, 26 insertions(+), 1 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index a98a7fe..812cb74 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -360,7 +360,32 @@ cgroups created below it.
 
 NOTE2: This feature can be enabled/disabled per subtree.
 
-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory (kswapd is woken up)
+control groups are pushed back to their soft limits. If the soft limit of each
+control group is very high, they are pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+7.1 Interface
+
+Soft limits can be setup by using the following commands (in this example we
+assume a soft limit of 256 megabytes)
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+
+8. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first

-- 
	Balbir

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/4] Memory controller soft limit interface (v4)
  2009-03-06  9:23 [PATCH 0/4] Memory controller soft limit patches (v4) Balbir Singh
  2009-03-06  9:23 ` [PATCH 1/4] Memory controller soft limit documentation (v4) Balbir Singh
@ 2009-03-06  9:23 ` Balbir Singh
  2009-03-06  9:23 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v4) Balbir Singh
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2009-03-06  9:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

Feature: Add soft limits interface to resource counters

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v2...v1
1. Add support for res_counter_check_soft_limit_locked. This is used
   by the hierarchy code.

Add an interface to allow get/set of soft limits. Soft limits for memory plus
swap controller (memsw) is currently not supported. Resource counters have
been enhanced to support soft limits and new type RES_SOFT_LIMIT has been
added. Unlike hard limits, soft limits can be directly set and do not
need any reclaim or checks before setting them to a newer value.

Kamezawa-San raised a question as to whether soft limit should belong
to res_counter. Since all resources understand the basic concepts of
hard and soft limits, it is justified to add soft limits here. Soft limits
are a generic resource usage feature, even file system quotas support
soft limits.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |   58 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    3 ++
 mm/memcontrol.c             |   20 +++++++++++++++
 3 files changed, 81 insertions(+), 0 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..5c821fd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
 	 */
 	unsigned long long limit;
 	/*
+	 * the limit that usage can be exceed
+	 */
+	unsigned long long soft_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -85,6 +89,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_LIMIT,
 	RES_FAILCNT,
+	RES_SOFT_LIMIT,
 };
 
 /*
@@ -130,6 +135,36 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 	return false;
 }
 
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->soft_limit)
+		return true;
+
+	return false;
+}
+
+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to soft limit
+ * The difference between usage and soft limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->soft_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 /*
  * Helper function to detect if the cgroup is within it's limit or
  * not. It's currently called from cgroup_rss_prepare()
@@ -145,6 +180,17 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
 	return ret;
 }
 
+static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_soft_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -178,4 +224,16 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 	return ret;
 }
 
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+				unsigned long long soft_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->soft_limit = soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..4e6dafe 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->soft_limit = (unsigned long long)LLONG_MAX;
 	counter->parent = parent;
 }
 
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->limit;
 	case RES_FAILCNT:
 		return &counter->failcnt;
+	case RES_SOFT_LIMIT:
+		return &counter->soft_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7bb14fd..75a7b1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1939,6 +1939,20 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case RES_SOFT_LIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		/*
+		 * For memsw, soft limits are hard to implement in terms
+		 * of semantics, for now, we support soft limits for
+		 * control without swap
+		 */
+		if (type == _MEM)
+			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else
+			ret = -EINVAL;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -2188,6 +2202,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read,
 	},
 	{
+		.name = "soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,

-- 
	Balbir

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/4] Memory controller soft limit organize cgroups (v4)
  2009-03-06  9:23 [PATCH 0/4] Memory controller soft limit patches (v4) Balbir Singh
  2009-03-06  9:23 ` [PATCH 1/4] Memory controller soft limit documentation (v4) Balbir Singh
  2009-03-06  9:23 ` [PATCH 2/4] Memory controller soft limit interface (v4) Balbir Singh
@ 2009-03-06  9:23 ` Balbir Singh
  2009-03-06  9:23 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v4) Balbir Singh
  2009-03-06  9:54 ` [PATCH 0/4] Memory controller soft limit patches (v4) KAMEZAWA Hiroyuki
  4 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2009-03-06  9:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

Feature: Organize cgroups over soft limit in a RB-Tree

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v4...v3
1. Optimizations to ensure we don't uncessarily get res_counter values
2. Fixed a bug in usage of time_after()

Changelog v3...v2
1. Add only the ancestor to the RB-Tree
2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put

Changelog v2...v1
1. Add support for hierarchies
2. The res_counter that is highest in the hierarchy is returned on soft
   limit being exceeded. Since we do hierarchical reclaim and add all
   groups exceeding their soft limits, this approach seems to work well
   in practice.

This patch introduces a RB-Tree for storing memory cgroups that are over their
soft limit. The overall goal is to

1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
   We are careful about updates, updates take place only after a particular
   time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
   limit

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |    3 +
 kernel/res_counter.c        |   12 ++++-
 mm/memcontrol.c             |  114 +++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 123 insertions(+), 6 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 5c821fd..d16613b 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+		unsigned long val, struct res_counter **limit_fail_at,
+		struct res_counter **soft_limit_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4e6dafe..08b7614 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+			struct res_counter **limit_fail_at,
+			struct res_counter **soft_limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
 	struct res_counter *c, *u;
 
 	*limit_fail_at = NULL;
+	if (soft_limit_fail_at)
+		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
+		/*
+		 * With soft limits, we return the highest ancestor
+		 * that exceeds its soft limit
+		 */
+		if (soft_limit_fail_at &&
+			!res_counter_soft_limit_check_locked(c))
+			*soft_limit_fail_at = c;
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75a7b1a..d548dd2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
 #include <linux/rcupdate.h>
 #include <linux/limits.h>
 #include <linux/mutex.h>
+#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
@@ -129,6 +130,14 @@ struct mem_cgroup_lru_info {
 };
 
 /*
+ * Cgroups above their limits are maintained in a RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+static struct rb_root mem_cgroup_soft_limit_tree;
+static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
+
+/*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
  * statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -176,12 +185,20 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	struct rb_node mem_cgroup_node;		/* RB tree node */
+	unsigned long long usage_in_excess;	/* Set to the value by which */
+						/* the soft limit is exceeded*/
+	unsigned long last_tree_update;		/* Last time the tree was */
+						/* updated in jiffies     */
+
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
 };
 
+#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
+
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -214,6 +231,41 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct mem_cgroup *mem_node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	while (*p) {
+		parent = *p;
+		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
+		if (mem->usage_in_excess < mem_node->usage_in_excess)
+			p = &(*p)->rb_left;
+		/*
+		 * We can't avoid mem cgroups that are over their soft
+		 * limit by the same amount
+		 */
+		else if (mem->usage_in_excess >= mem_node->usage_in_excess)
+			p = &(*p)->rb_right;
+	}
+	rb_link_node(&mem->mem_cgroup_node, parent, p);
+	rb_insert_color(&mem->mem_cgroup_node,
+			&mem_cgroup_soft_limit_tree);
+	mem->last_tree_update = jiffies;
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -897,6 +949,43 @@ static void record_last_oom(struct mem_cgroup *mem)
 	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
 }
 
+static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
+						bool time_check)
+{
+	unsigned long long prev_usage_in_excess, new_usage_in_excess;
+	bool updated_tree = false;
+	unsigned long next_update = 0;
+	unsigned long flags;
+
+	if (!css_tryget(&mem->css))
+		return;
+	prev_usage_in_excess = mem->usage_in_excess;
+
+	if (time_check)
+		next_update = mem->last_tree_update +
+				MEM_CGROUP_TREE_UPDATE_INTERVAL;
+
+	if (!time_check || time_after(jiffies, next_update)) {
+		new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+		if (prev_usage_in_excess) {
+			mem_cgroup_remove_exceeded(mem);
+			updated_tree = true;
+		}
+		if (!new_usage_in_excess)
+			goto done;
+		mem_cgroup_insert_exceeded(mem);
+		updated_tree = true;
+	}
+
+done:
+	if (updated_tree) {
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		mem->last_tree_update = jiffies;
+		mem->usage_in_excess = new_usage_in_excess;
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	}
+	css_put(&mem->css);
+}
 
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
@@ -906,9 +995,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
 			bool oom)
 {
-	struct mem_cgroup *mem, *mem_over_limit;
+	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+	struct res_counter *fail_res, *soft_fail_res = NULL;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -938,12 +1027,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		int ret;
 		bool noswap = false;
 
-		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+						&soft_fail_res);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
-							&fail_res);
+							&fail_res, NULL);
 			if (likely(!ret))
 				break;
 			/* mem+swap counter fails */
@@ -985,6 +1075,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			goto nomem;
 		}
 	}
+
+	/*
+	 * Insert just the ancestor, we should trickle down to the correct
+	 * cgroup for reclaim, since the other nodes will be below their
+	 * soft limit
+	 */
+	if (soft_fail_res) {
+		mem_over_soft_limit =
+			mem_cgroup_from_res_counter(soft_fail_res, res);
+		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
+	}
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -1422,6 +1523,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
+	mem_cgroup_check_and_update_tree(mem, true);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		css_put(&mem->css);
@@ -2346,6 +2448,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
+	mem_cgroup_check_and_update_tree(mem, false);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -2412,6 +2515,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
 		parent = NULL;
+		mem_cgroup_soft_limit_tree = RB_ROOT;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
@@ -2432,6 +2536,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->usage_in_excess = 0;
+	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)

-- 
	Balbir

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/4] Memory controller soft limit reclaim on contention (v4)
  2009-03-06  9:23 [PATCH 0/4] Memory controller soft limit patches (v4) Balbir Singh
                   ` (2 preceding siblings ...)
  2009-03-06  9:23 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v4) Balbir Singh
@ 2009-03-06  9:23 ` Balbir Singh
  2009-03-06  9:51   ` KAMEZAWA Hiroyuki
  2009-03-06  9:54 ` [PATCH 0/4] Memory controller soft limit patches (v4) KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2009-03-06  9:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

Feature: Implement reclaim from groups over their soft limit

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v4...v3
1. soft_reclaim is now called from balance_pgdat
2. soft_reclaim is aware of nodes and zones
3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
   and at the same time trying to allocate pages and exceed its soft limit.
4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
   particular to a mem cgroup.

Changelog v3...v2
1. Convert several arguments to hierarchical reclaim to flags, thereby
   consolidating them
2. The reclaim for soft limits is now triggered from kswapd
3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument


Changelog v2...v1
1. Added support for hierarchical soft limits

This patch allows reclaim from memory cgroups on contention (via the
kswapd() path) only if the order is 0.

memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest amount and reclaims pages from it and then reinserts the
cgroup into its correct place in the rbtree.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |    9 ++
 include/linux/swap.h       |    5 +
 mm/memcontrol.c            |  223 +++++++++++++++++++++++++++++++++++++++++---
 mm/vmscan.c                |   26 +++++
 4 files changed, 245 insertions(+), 18 deletions(-)


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..16343d0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,6 +116,9 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
+unsigned long
+mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
+				gfp_t gfp_mask);
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
@@ -264,6 +267,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+static inline unsigned long
+mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
+				gfp_t gfp_mask)
+{
+	return 0;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 989eb53..37bc2a9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -217,6 +217,11 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
 						  unsigned int swappiness);
+extern unsigned long mem_cgroup_shrink_zone(struct mem_cgroup *mem,
+						struct zone *zone,
+						gfp_t gfp_mask,
+						unsigned int swappiness,
+						int priority);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d548dd2..3be1f27 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,7 @@
 #include <linux/res_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
+#include <linux/completion.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/smp.h>
@@ -191,6 +192,14 @@ struct mem_cgroup {
 	unsigned long last_tree_update;		/* Last time the tree was */
 						/* updated in jiffies     */
 
+	bool on_tree;				/* Is the node on tree? */
+	struct completion wait_on_soft_reclaim;
+	/*
+	 * Set to > 0, when reclaim is initiated due to 
+	 * the soft limit being exceeded. It adds an additional atomic
+	 * operation to page fault path.
+	 */
+	int soft_limit_reclaim_count;
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -227,18 +236,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
+/*
+ * Bits used for hierarchical reclaim bits
+ */
+#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
+#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
+#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
+#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
+#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
+
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
-static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
 {
 	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
 	struct rb_node *parent = NULL;
 	struct mem_cgroup *mem_node;
-	unsigned long flags;
 
-	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	if (mem->on_tree)
+		return;
+
 	while (*p) {
 		parent = *p;
 		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
@@ -255,6 +275,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
 	rb_insert_color(&mem->mem_cgroup_node,
 			&mem_cgroup_soft_limit_tree);
 	mem->last_tree_update = jiffies;
+	mem->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+	if (!mem->on_tree)
+		return;
+	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	mem->on_tree = false;
+}
+
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	__mem_cgroup_insert_exceeded(mem);
 	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
 }
 
@@ -262,8 +299,34 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
-	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+	__mem_cgroup_remove_exceeded(mem);
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+static struct mem_cgroup *mem_cgroup_get_largest_soft_limit_exceeding_node(void)
+{
+	struct rb_node *rightmost = NULL;
+	struct mem_cgroup *mem = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+retry:
+	rightmost = rb_last(&mem_cgroup_soft_limit_tree);
+	if (!rightmost)
+		goto done;		/* Nothing to reclaim from */
+
+	mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
+	/*
+	 * Remove the node now but someone else can add it back,
+	 * we will to add it back at the end of reclaim to its correct
+	 * position in the tree.
+	 */
+	__mem_cgroup_remove_exceeded(mem);
+	if (!css_tryget(&mem->css) || !res_counter_soft_limit_excess(&mem->res))
+		goto retry;
+done:
 	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	return mem;
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
@@ -324,6 +387,27 @@ static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
 	return total;
 }
 
+static unsigned long long
+mem_cgroup_get_node_zone_usage(struct mem_cgroup *mem, struct zone *zone,
+				int nid)
+{
+	int l;
+	unsigned long long total = 0;
+	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
+
+	/*
+	 * Is holding the zone LRU lock being overly protective?
+	 * This routine is not invoked from the hot path anyway.
+	 */
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	mz = mem_cgroup_zoneinfo(mem, nid, zone_idx(zone));
+	for_each_evictable_lru(l)
+		total += MEM_CGROUP_ZSTAT(mz, l);
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+	return total * PAGE_SIZE;
+}
+
 static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
 {
 	return container_of(cgroup_subsys_state(cont,
@@ -888,14 +972,30 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
  * If shrink==true, for avoiding to free too much, this returns immedieately.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
-				   gfp_t gfp_mask, bool noswap, bool shrink)
+						struct zone *zone,
+						gfp_t gfp_mask,
+						unsigned long flags,
+						int priority)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
 	int loop = 0;
+	bool noswap = flags & MEM_CGROUP_RECLAIM_NOSWAP;
+	bool shrink = flags & MEM_CGROUP_RECLAIM_SHRINK;
+	bool check_soft = flags & MEM_CGROUP_RECLAIM_SOFT;
 
 	while (loop < 2) {
 		victim = mem_cgroup_select_victim(root_mem);
+		/*
+		 * In the first loop, don't reclaim from victims below
+		 * their soft limit
+		 */
+		if (!loop && res_counter_check_under_soft_limit(&victim->res)) {
+			if (victim == root_mem)
+				loop++;
+			css_put(&victim->css);
+			continue;
+		}
 		if (victim == root_mem)
 			loop++;
 		if (!mem_cgroup_local_usage(&victim->stat)) {
@@ -904,8 +1004,14 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 			continue;
 		}
 		/* we use swappiness of local cgroup */
-		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
-						   get_swappiness(victim));
+		if (!check_soft)
+			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
+							noswap,
+							get_swappiness(victim));
+		else
+			ret = mem_cgroup_shrink_zone(victim, zone, gfp_mask,
+							get_swappiness(victim),
+							priority);
 		css_put(&victim->css);
 		/*
 		 * At shrinking usage, we can't check we should stop here or
@@ -915,7 +1021,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		if (shrink)
 			return ret;
 		total += ret;
-		if (mem_cgroup_check_under_limit(root_mem))
+		if (check_soft) {
+			if (res_counter_check_under_soft_limit(&root_mem->res))
+				return total;
+		} else if (mem_cgroup_check_under_limit(root_mem))
 			return 1 + total;
 	}
 	return total;
@@ -1025,7 +1134,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 
 	while (1) {
 		int ret;
-		bool noswap = false;
+		unsigned long flags = 0;
 
 		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
 						&soft_fail_res);
@@ -1038,7 +1147,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 				break;
 			/* mem+swap counter fails */
 			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			noswap = true;
+			flags = MEM_CGROUP_RECLAIM_NOSWAP;
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									memsw);
 		} else
@@ -1049,8 +1158,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
-		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
-							noswap, false);
+		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
+							gfp_mask, flags, 0);
 		if (ret)
 			continue;
 
@@ -1082,9 +1191,29 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 * soft limit
 	 */
 	if (soft_fail_res) {
+		/*
+		 * Throttle the task here, if it is undergoing soft limit
+		 * reclaim and failing soft limits
+		 */
+		unsigned long flags;
+		bool wait = false;
+
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		if (mem->soft_limit_reclaim_count) {
+			INIT_COMPLETION(mem->wait_on_soft_reclaim);
+			wait = true;
+		}
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
 		mem_over_soft_limit =
 			mem_cgroup_from_res_counter(soft_fail_res, res);
 		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
+		/*
+		 * We hold the mmap_sem and throttle, I don't think there
+		 * should be corner cases, but this part could use more
+		 * review
+		 */
+		if (wait)
+			wait_for_completion(&mem->wait_on_soft_reclaim);
 	}
 	return 0;
 nomem:
@@ -1695,8 +1824,8 @@ int mem_cgroup_shrink_usage(struct page *page,
 		return 0;
 
 	do {
-		progress = mem_cgroup_hierarchical_reclaim(mem,
-					gfp_mask, true, false);
+		progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
+					gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP, 0);
 		progress += mem_cgroup_check_under_limit(mem);
 	} while (!progress && --retry);
 
@@ -1750,8 +1879,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
-						   false, true);
+		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
+						GFP_KERNEL,
+						MEM_CGROUP_RECLAIM_SHRINK, 0);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -1799,7 +1929,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
+		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
+						MEM_CGROUP_RECLAIM_NOSWAP |
+						MEM_CGROUP_RECLAIM_SHRINK, 0);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -1810,6 +1942,59 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
+unsigned long
+mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
+				gfp_t gfp_mask)
+{
+	unsigned long nr_reclaimed = 0;
+	struct mem_cgroup *mem;
+	unsigned long flags;
+	unsigned long long usage;
+
+	/*
+	 * This loop can run a while, specially if mem_cgroup's continuously
+	 * keep exceeding their soft limit and putting the system under
+	 * pressure
+	 */
+	do {
+		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
+		if (!mem)
+			break;
+		usage = mem_cgroup_get_node_zone_usage(mem, zone, nid);
+		if (!usage)
+			goto skip_reclaim;
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		mem->soft_limit_reclaim_count++;
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+		nr_reclaimed += mem_cgroup_hierarchical_reclaim(mem, zone,
+						gfp_mask,
+						MEM_CGROUP_RECLAIM_SOFT,
+						priority);
+skip_reclaim:
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		/*
+		 * Don't wake up throttled tasks unless we reclaimed from
+		 * them
+		 */
+		if (usage)
+			mem->soft_limit_reclaim_count--;
+		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+		/*
+		 * We need to remove and reinsert the node in its correct
+		 * position
+		 */
+		__mem_cgroup_remove_exceeded(mem);
+		if (mem->usage_in_excess)
+			__mem_cgroup_insert_exceeded(mem);
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+		if (usage)
+			complete_all(&mem->wait_on_soft_reclaim);
+		css_put(&mem->css);
+		cond_resched();
+	} while (!nr_reclaimed);
+	return nr_reclaimed;
+}
+
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -2538,6 +2723,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	mem->usage_in_excess = 0;
 	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
+	mem->on_tree = false;
+	init_completion(&mem->wait_on_soft_reclaim);
+	mem->soft_limit_reclaim_count = 0;
+
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 83f2ca4..2b8cdad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1737,6 +1737,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
 	return do_try_to_free_pages(zonelist, &sc);
 }
+
+unsigned long mem_cgroup_shrink_zone(struct mem_cgroup *mem,
+					struct zone *zone, gfp_t gfp_mask,
+					unsigned int swappiness,
+					int priority)
+{
+	struct scan_control sc = {
+		.may_writepage = !laptop_mode,
+		.may_unmap = 1,
+		.swap_cluster_max = SWAP_CLUSTER_MAX,
+		.swappiness = swappiness,
+		.order = 0,
+		.mem_cgroup = mem,
+		.isolate_pages = mem_cgroup_isolate_pages,
+	};
+	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
+	shrink_zone(priority, zone, &sc);
+	return sc.nr_reclaimed;
+}
 #endif
 
 /*
@@ -1870,7 +1890,11 @@ loop_again:
 			 */
 			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
 						end_zone, 0))
-				shrink_zone(priority, zone, &sc);
+				if ((priority >= DEF_PRIORITY/2) ||
+					!mem_cgroup_soft_limit_reclaim(priority,
+						zone, pgdat->node_id,
+						GFP_KERNEL))
+					shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 						lru_pages);

-- 
	Balbir

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v4)
  2009-03-06  9:23 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v4) Balbir Singh
@ 2009-03-06  9:51   ` KAMEZAWA Hiroyuki
  2009-03-06 10:01     ` Balbir Singh
  0 siblings, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-06  9:51 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Fri, 06 Mar 2009 14:53:53 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> ---
> 
>  include/linux/memcontrol.h |    9 ++
>  include/linux/swap.h       |    5 +
>  mm/memcontrol.c            |  223 +++++++++++++++++++++++++++++++++++++++++---
>  mm/vmscan.c                |   26 +++++
>  4 files changed, 245 insertions(+), 18 deletions(-)
> 
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..16343d0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -116,6 +116,9 @@ static inline bool mem_cgroup_disabled(void)
>  }
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> +unsigned long
> +mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
> +				gfp_t gfp_mask);
>  
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
> @@ -264,6 +267,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> +static inline unsigned long
> +mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
> +				gfp_t gfp_mask)
> +{
> +	return 0;
> +}
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 989eb53..37bc2a9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -217,6 +217,11 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>  						  gfp_t gfp_mask, bool noswap,
>  						  unsigned int swappiness);
> +extern unsigned long mem_cgroup_shrink_zone(struct mem_cgroup *mem,
> +						struct zone *zone,
> +						gfp_t gfp_mask,
> +						unsigned int swappiness,
> +						int priority);
>  extern int __isolate_lru_page(struct page *page, int mode, int file);
>  extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d548dd2..3be1f27 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -20,6 +20,7 @@
>  #include <linux/res_counter.h>
>  #include <linux/memcontrol.h>
>  #include <linux/cgroup.h>
> +#include <linux/completion.h>
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/smp.h>
> @@ -191,6 +192,14 @@ struct mem_cgroup {
>  	unsigned long last_tree_update;		/* Last time the tree was */
>  						/* updated in jiffies     */
>  
> +	bool on_tree;				/* Is the node on tree? */
> +	struct completion wait_on_soft_reclaim;
> +	/*
> +	 * Set to > 0, when reclaim is initiated due to 
> +	 * the soft limit being exceeded. It adds an additional atomic
> +	 * operation to page fault path.
> +	 */
> +	int soft_limit_reclaim_count;
>  	/*
>  	 * statistics. This must be placed at the end of memcg.
>  	 */
> @@ -227,18 +236,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
>  #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
>  #define MEMFILE_ATTR(val)	((val) & 0xffff)
>  
> +/*
> + * Bits used for hierarchical reclaim bits
> + */
> +#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
> +#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> +#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
> +#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> +#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
> +#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
> +
>  static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  
> -static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
>  {
>  	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
>  	struct rb_node *parent = NULL;
>  	struct mem_cgroup *mem_node;
> -	unsigned long flags;
>  
> -	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	if (mem->on_tree)
> +		return;
> +
>  	while (*p) {
>  		parent = *p;
>  		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> @@ -255,6 +275,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
>  	rb_insert_color(&mem->mem_cgroup_node,
>  			&mem_cgroup_soft_limit_tree);
>  	mem->last_tree_update = jiffies;
> +	mem->on_tree = true;
> +}
> +
> +static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> +{
> +	if (!mem->on_tree)
> +		return;
> +	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> +	mem->on_tree = false;
> +}
> +
> +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	__mem_cgroup_insert_exceeded(mem);
>  	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
>  }
>  
> @@ -262,8 +299,34 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
>  {
>  	unsigned long flags;
>  	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> -	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> +	__mem_cgroup_remove_exceeded(mem);
> +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +}
> +
> +static struct mem_cgroup *mem_cgroup_get_largest_soft_limit_exceeding_node(void)
> +{
> +	struct rb_node *rightmost = NULL;
> +	struct mem_cgroup *mem = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +retry:
> +	rightmost = rb_last(&mem_cgroup_soft_limit_tree);
> +	if (!rightmost)
> +		goto done;		/* Nothing to reclaim from */
> +
> +	mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
> +	/*
> +	 * Remove the node now but someone else can add it back,
> +	 * we will to add it back at the end of reclaim to its correct
> +	 * position in the tree.
> +	 */
> +	__mem_cgroup_remove_exceeded(mem);
> +	if (!css_tryget(&mem->css) || !res_counter_soft_limit_excess(&mem->res))
> +		goto retry;
> +done:
>  	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +	return mem;
>  }
>  
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> @@ -324,6 +387,27 @@ static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
>  	return total;
>  }
>  
> +static unsigned long long
> +mem_cgroup_get_node_zone_usage(struct mem_cgroup *mem, struct zone *zone,
> +				int nid)
> +{
> +	int l;
> +	unsigned long long total = 0;
> +	struct mem_cgroup_per_zone *mz;
> +	unsigned long flags;
> +
> +	/*
> +	 * Is holding the zone LRU lock being overly protective?
> +	 * This routine is not invoked from the hot path anyway.
> +	 */
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	mz = mem_cgroup_zoneinfo(mem, nid, zone_idx(zone));
> +	for_each_evictable_lru(l)
> +		total += MEM_CGROUP_ZSTAT(mz, l);
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +	return total * PAGE_SIZE;
> +}
> +
>  static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
>  {
>  	return container_of(cgroup_subsys_state(cont,
> @@ -888,14 +972,30 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
>   * If shrink==true, for avoiding to free too much, this returns immedieately.
>   */
>  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> -				   gfp_t gfp_mask, bool noswap, bool shrink)
> +						struct zone *zone,
> +						gfp_t gfp_mask,
> +						unsigned long flags,
> +						int priority)
>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
>  	int loop = 0;
> +	bool noswap = flags & MEM_CGROUP_RECLAIM_NOSWAP;
> +	bool shrink = flags & MEM_CGROUP_RECLAIM_SHRINK;
> +	bool check_soft = flags & MEM_CGROUP_RECLAIM_SOFT;
>  
>  	while (loop < 2) {
>  		victim = mem_cgroup_select_victim(root_mem);
> +		/*
> +		 * In the first loop, don't reclaim from victims below
> +		 * their soft limit
> +		 */
> +		if (!loop && res_counter_check_under_soft_limit(&victim->res)) {
> +			if (victim == root_mem)
> +				loop++;
> +			css_put(&victim->css);
> +			continue;
> +		}
>  		if (victim == root_mem)
>  			loop++;
>  		if (!mem_cgroup_local_usage(&victim->stat)) {
> @@ -904,8 +1004,14 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  			continue;
>  		}
>  		/* we use swappiness of local cgroup */
> -		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> -						   get_swappiness(victim));
> +		if (!check_soft)
> +			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> +							noswap,
> +							get_swappiness(victim));
> +		else
> +			ret = mem_cgroup_shrink_zone(victim, zone, gfp_mask,
> +							get_swappiness(victim),
> +							priority);
>  		css_put(&victim->css);
>  		/*
>  		 * At shrinking usage, we can't check we should stop here or
> @@ -915,7 +1021,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  		if (shrink)
>  			return ret;
>  		total += ret;
> -		if (mem_cgroup_check_under_limit(root_mem))
> +		if (check_soft) {
> +			if (res_counter_check_under_soft_limit(&root_mem->res))
> +				return total;
> +		} else if (mem_cgroup_check_under_limit(root_mem))
>  			return 1 + total;
>  	}
>  	return total;
> @@ -1025,7 +1134,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  
>  	while (1) {
>  		int ret;
> -		bool noswap = false;
> +		unsigned long flags = 0;
>  
>  		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
>  						&soft_fail_res);
> @@ -1038,7 +1147,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  				break;
>  			/* mem+swap counter fails */
>  			res_counter_uncharge(&mem->res, PAGE_SIZE);
> -			noswap = true;
> +			flags = MEM_CGROUP_RECLAIM_NOSWAP;
>  			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
>  									memsw);
>  		} else
> @@ -1049,8 +1158,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		if (!(gfp_mask & __GFP_WAIT))
>  			goto nomem;
>  
> -		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> -							noswap, false);
> +		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> +							gfp_mask, flags, 0);
>  		if (ret)
>  			continue;
>  
> @@ -1082,9 +1191,29 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  	 * soft limit
>  	 */
>  	if (soft_fail_res) {
> +		/*
> +		 * Throttle the task here, if it is undergoing soft limit
> +		 * reclaim and failing soft limits
> +		 */
> +		unsigned long flags;
> +		bool wait = false;
> +
> +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +		if (mem->soft_limit_reclaim_count) {
> +			INIT_COMPLETION(mem->wait_on_soft_reclaim);
> +			wait = true;
> +		}
> +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
>  		mem_over_soft_limit =
>  			mem_cgroup_from_res_counter(soft_fail_res, res);
>  		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
> +		/*
> +		 * We hold the mmap_sem and throttle, I don't think there
> +		 * should be corner cases, but this part could use more
> +		 * review
> +		 */
> +		if (wait)
> +			wait_for_completion(&mem->wait_on_soft_reclaim);
>  	}
What ???? Why we have to wait here...holding mmap->sem...This is too bad.



>  	return 0;
>  nomem:
> @@ -1695,8 +1824,8 @@ int mem_cgroup_shrink_usage(struct page *page,
>  		return 0;
>  
>  	do {
> -		progress = mem_cgroup_hierarchical_reclaim(mem,
> -					gfp_mask, true, false);
> +		progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
> +					gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP, 0);
>  		progress += mem_cgroup_check_under_limit(mem);
>  	} while (!progress && --retry);
>  
> @@ -1750,8 +1879,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> -						   false, true);
> +		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
> +						GFP_KERNEL,
> +						MEM_CGROUP_RECLAIM_SHRINK, 0);
>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>  		/* Usage is reduced ? */
>    		if (curusage >= oldusage)
> @@ -1799,7 +1929,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
> +		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> +						MEM_CGROUP_RECLAIM_NOSWAP |
> +						MEM_CGROUP_RECLAIM_SHRINK, 0);
>  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
> @@ -1810,6 +1942,59 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  	return ret;
>  }
>  
> +unsigned long
> +mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
> +				gfp_t gfp_mask)
> +{
> +	unsigned long nr_reclaimed = 0;
> +	struct mem_cgroup *mem;
> +	unsigned long flags;
> +	unsigned long long usage;
> +
> +	/*
> +	 * This loop can run a while, specially if mem_cgroup's continuously
> +	 * keep exceeding their soft limit and putting the system under
> +	 * pressure
> +	 */
> +	do {
> +		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
> +		if (!mem)
> +			break;
> +		usage = mem_cgroup_get_node_zone_usage(mem, zone, nid);
> +		if (!usage)
> +			goto skip_reclaim;

Why this works well ? if "mem" is the laragest, it will be inserted again
as the largest. Do I miss any ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v4)
  2009-03-06  9:23 [PATCH 0/4] Memory controller soft limit patches (v4) Balbir Singh
                   ` (3 preceding siblings ...)
  2009-03-06  9:23 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v4) Balbir Singh
@ 2009-03-06  9:54 ` KAMEZAWA Hiroyuki
  2009-03-06 10:05   ` Balbir Singh
  2009-03-06 10:34   ` [RFC][PATCH 0/3] memory controller soft limit (Yet Another One) v1 KAMEZAWA Hiroyuki
  4 siblings, 2 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-06  9:54 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Fri, 06 Mar 2009 14:53:23 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> New Feature: Soft limits for memory resource controller.
> 
> Changelog v4...v3
> 1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
>    while doing soft limit reclaim. We don't record priorities while
>    doing soft reclaim
> 2. Some of the overheads associated with soft limits (like calculating
>    excess each time) is eliminated
> 3. The time_after(jiffies, 0) bug has been fixed
> 4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
>    and at the same time tasks are increasing the memory footprint and causing
>    the mem cgroup to exceed its soft limit.
> 
I don't think this "4" is necessary.


> Changelog v3...v2
> 1. Implemented several review comments from Kosaki-San and Kamezawa-San
>    Please see individual changelogs for changes
> 
> Changelog v2...v1
> 1. Soft limits now support hierarchies
> 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> 
> Here is v4 of the new soft limit implementation. Soft limits is a new feature
> for the memory resource controller, something similar has existed in the
> group scheduler in the form of shares. The CPU controllers interpretation
> of shares is very different though. 
> 
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
> 
> If there are no major objections to the patches, I would like to get them
> included in -mm.
> 
You got Nack from me, again ;) And you know why.
I'll post my one later, I hope that one will be good input for you.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v4)
  2009-03-06  9:51   ` KAMEZAWA Hiroyuki
@ 2009-03-06 10:01     ` Balbir Singh
  2009-03-06 10:14       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2009-03-06 10:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-06 18:51:24]:

> On Fri, 06 Mar 2009 14:53:53 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > ---
> > 
> >  include/linux/memcontrol.h |    9 ++
> >  include/linux/swap.h       |    5 +
> >  mm/memcontrol.c            |  223 +++++++++++++++++++++++++++++++++++++++++---
> >  mm/vmscan.c                |   26 +++++
> >  4 files changed, 245 insertions(+), 18 deletions(-)
> > 
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 18146c9..16343d0 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -116,6 +116,9 @@ static inline bool mem_cgroup_disabled(void)
> >  }
> >  
> >  extern bool mem_cgroup_oom_called(struct task_struct *task);
> > +unsigned long
> > +mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
> > +				gfp_t gfp_mask);
> >  
> >  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> >  struct mem_cgroup;
> > @@ -264,6 +267,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> >  {
> >  }
> >  
> > +static inline unsigned long
> > +mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
> > +				gfp_t gfp_mask)
> > +{
> > +	return 0;
> > +}
> >  #endif /* CONFIG_CGROUP_MEM_CONT */
> >  
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 989eb53..37bc2a9 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -217,6 +217,11 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> >  						  gfp_t gfp_mask, bool noswap,
> >  						  unsigned int swappiness);
> > +extern unsigned long mem_cgroup_shrink_zone(struct mem_cgroup *mem,
> > +						struct zone *zone,
> > +						gfp_t gfp_mask,
> > +						unsigned int swappiness,
> > +						int priority);
> >  extern int __isolate_lru_page(struct page *page, int mode, int file);
> >  extern unsigned long shrink_all_memory(unsigned long nr_pages);
> >  extern int vm_swappiness;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d548dd2..3be1f27 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/res_counter.h>
> >  #include <linux/memcontrol.h>
> >  #include <linux/cgroup.h>
> > +#include <linux/completion.h>
> >  #include <linux/mm.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/smp.h>
> > @@ -191,6 +192,14 @@ struct mem_cgroup {
> >  	unsigned long last_tree_update;		/* Last time the tree was */
> >  						/* updated in jiffies     */
> >  
> > +	bool on_tree;				/* Is the node on tree? */
> > +	struct completion wait_on_soft_reclaim;
> > +	/*
> > +	 * Set to > 0, when reclaim is initiated due to 
> > +	 * the soft limit being exceeded. It adds an additional atomic
> > +	 * operation to page fault path.
> > +	 */
> > +	int soft_limit_reclaim_count;
> >  	/*
> >  	 * statistics. This must be placed at the end of memcg.
> >  	 */
> > @@ -227,18 +236,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> >  #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
> >  #define MEMFILE_ATTR(val)	((val) & 0xffff)
> >  
> > +/*
> > + * Bits used for hierarchical reclaim bits
> > + */
> > +#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
> > +#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> > +#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
> > +#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> > +#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
> > +#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
> > +
> >  static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> >  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >  
> > -static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> >  {
> >  	struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
> >  	struct rb_node *parent = NULL;
> >  	struct mem_cgroup *mem_node;
> > -	unsigned long flags;
> >  
> > -	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	if (mem->on_tree)
> > +		return;
> > +
> >  	while (*p) {
> >  		parent = *p;
> >  		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> > @@ -255,6 +275,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> >  	rb_insert_color(&mem->mem_cgroup_node,
> >  			&mem_cgroup_soft_limit_tree);
> >  	mem->last_tree_update = jiffies;
> > +	mem->on_tree = true;
> > +}
> > +
> > +static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> > +{
> > +	if (!mem->on_tree)
> > +		return;
> > +	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> > +	mem->on_tree = false;
> > +}
> > +
> > +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	__mem_cgroup_insert_exceeded(mem);
> >  	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> >  }
> >  
> > @@ -262,8 +299,34 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> >  {
> >  	unsigned long flags;
> >  	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > -	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> > +	__mem_cgroup_remove_exceeded(mem);
> > +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +}
> > +
> > +static struct mem_cgroup *mem_cgroup_get_largest_soft_limit_exceeding_node(void)
> > +{
> > +	struct rb_node *rightmost = NULL;
> > +	struct mem_cgroup *mem = NULL;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +retry:
> > +	rightmost = rb_last(&mem_cgroup_soft_limit_tree);
> > +	if (!rightmost)
> > +		goto done;		/* Nothing to reclaim from */
> > +
> > +	mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
> > +	/*
> > +	 * Remove the node now but someone else can add it back,
> > +	 * we will to add it back at the end of reclaim to its correct
> > +	 * position in the tree.
> > +	 */
> > +	__mem_cgroup_remove_exceeded(mem);
> > +	if (!css_tryget(&mem->css) || !res_counter_soft_limit_excess(&mem->res))
> > +		goto retry;
> > +done:
> >  	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +	return mem;
> >  }
> >  
> >  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> > @@ -324,6 +387,27 @@ static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
> >  	return total;
> >  }
> >  
> > +static unsigned long long
> > +mem_cgroup_get_node_zone_usage(struct mem_cgroup *mem, struct zone *zone,
> > +				int nid)
> > +{
> > +	int l;
> > +	unsigned long long total = 0;
> > +	struct mem_cgroup_per_zone *mz;
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * Is holding the zone LRU lock being overly protective?
> > +	 * This routine is not invoked from the hot path anyway.
> > +	 */
> > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > +	mz = mem_cgroup_zoneinfo(mem, nid, zone_idx(zone));
> > +	for_each_evictable_lru(l)
> > +		total += MEM_CGROUP_ZSTAT(mz, l);
> > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +	return total * PAGE_SIZE;
> > +}
> > +
> >  static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
> >  {
> >  	return container_of(cgroup_subsys_state(cont,
> > @@ -888,14 +972,30 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> >   * If shrink==true, for avoiding to free too much, this returns immedieately.
> >   */
> >  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > -				   gfp_t gfp_mask, bool noswap, bool shrink)
> > +						struct zone *zone,
> > +						gfp_t gfp_mask,
> > +						unsigned long flags,
> > +						int priority)
> >  {
> >  	struct mem_cgroup *victim;
> >  	int ret, total = 0;
> >  	int loop = 0;
> > +	bool noswap = flags & MEM_CGROUP_RECLAIM_NOSWAP;
> > +	bool shrink = flags & MEM_CGROUP_RECLAIM_SHRINK;
> > +	bool check_soft = flags & MEM_CGROUP_RECLAIM_SOFT;
> >  
> >  	while (loop < 2) {
> >  		victim = mem_cgroup_select_victim(root_mem);
> > +		/*
> > +		 * In the first loop, don't reclaim from victims below
> > +		 * their soft limit
> > +		 */
> > +		if (!loop && res_counter_check_under_soft_limit(&victim->res)) {
> > +			if (victim == root_mem)
> > +				loop++;
> > +			css_put(&victim->css);
> > +			continue;
> > +		}
> >  		if (victim == root_mem)
> >  			loop++;
> >  		if (!mem_cgroup_local_usage(&victim->stat)) {
> > @@ -904,8 +1004,14 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> >  			continue;
> >  		}
> >  		/* we use swappiness of local cgroup */
> > -		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> > -						   get_swappiness(victim));
> > +		if (!check_soft)
> > +			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> > +							noswap,
> > +							get_swappiness(victim));
> > +		else
> > +			ret = mem_cgroup_shrink_zone(victim, zone, gfp_mask,
> > +							get_swappiness(victim),
> > +							priority);
> >  		css_put(&victim->css);
> >  		/*
> >  		 * At shrinking usage, we can't check we should stop here or
> > @@ -915,7 +1021,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> >  		if (shrink)
> >  			return ret;
> >  		total += ret;
> > -		if (mem_cgroup_check_under_limit(root_mem))
> > +		if (check_soft) {
> > +			if (res_counter_check_under_soft_limit(&root_mem->res))
> > +				return total;
> > +		} else if (mem_cgroup_check_under_limit(root_mem))
> >  			return 1 + total;
> >  	}
> >  	return total;
> > @@ -1025,7 +1134,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  
> >  	while (1) {
> >  		int ret;
> > -		bool noswap = false;
> > +		unsigned long flags = 0;
> >  
> >  		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> >  						&soft_fail_res);
> > @@ -1038,7 +1147,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  				break;
> >  			/* mem+swap counter fails */
> >  			res_counter_uncharge(&mem->res, PAGE_SIZE);
> > -			noswap = true;
> > +			flags = MEM_CGROUP_RECLAIM_NOSWAP;
> >  			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> >  									memsw);
> >  		} else
> > @@ -1049,8 +1158,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  		if (!(gfp_mask & __GFP_WAIT))
> >  			goto nomem;
> >  
> > -		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> > -							noswap, false);
> > +		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > +							gfp_mask, flags, 0);
> >  		if (ret)
> >  			continue;
> >  
> > @@ -1082,9 +1191,29 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  	 * soft limit
> >  	 */
> >  	if (soft_fail_res) {
> > +		/*
> > +		 * Throttle the task here, if it is undergoing soft limit
> > +		 * reclaim and failing soft limits
> > +		 */
> > +		unsigned long flags;
> > +		bool wait = false;
> > +
> > +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +		if (mem->soft_limit_reclaim_count) {
> > +			INIT_COMPLETION(mem->wait_on_soft_reclaim);
> > +			wait = true;
> > +		}
> > +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> >  		mem_over_soft_limit =
> >  			mem_cgroup_from_res_counter(soft_fail_res, res);
> >  		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
> > +		/*
> > +		 * We hold the mmap_sem and throttle, I don't think there
> > +		 * should be corner cases, but this part could use more
> > +		 * review
> > +		 */
> > +		if (wait)
> > +			wait_for_completion(&mem->wait_on_soft_reclaim);
> >  	}
> What ???? Why we have to wait here...holding mmap->sem...This is too bad.
>

Since mmap_sem is no longer used for pthread_mutex*, I was not sure.
That is why I added the comment asking for more review and see what
people think about it. We get here only when

1. The memcg is over its soft limit
2. Tasks/threads belonging to memcg are faulting in more pages

The idea is to throttle them. If we did reclaim inline, like we do for
hard limits, we can still end up holding mmap_sem for a long time.

 
> 
> 
> >  	return 0;
> >  nomem:
> > @@ -1695,8 +1824,8 @@ int mem_cgroup_shrink_usage(struct page *page,
> >  		return 0;
> >  
> >  	do {
> > -		progress = mem_cgroup_hierarchical_reclaim(mem,
> > -					gfp_mask, true, false);
> > +		progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
> > +					gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP, 0);
> >  		progress += mem_cgroup_check_under_limit(mem);
> >  	} while (!progress && --retry);
> >  
> > @@ -1750,8 +1879,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> >  		if (!ret)
> >  			break;
> >  
> > -		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > -						   false, true);
> > +		progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
> > +						GFP_KERNEL,
> > +						MEM_CGROUP_RECLAIM_SHRINK, 0);
> >  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> >  		/* Usage is reduced ? */
> >    		if (curusage >= oldusage)
> > @@ -1799,7 +1929,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> >  		if (!ret)
> >  			break;
> >  
> > -		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
> > +		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> > +						MEM_CGROUP_RECLAIM_NOSWAP |
> > +						MEM_CGROUP_RECLAIM_SHRINK, 0);
> >  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> >  		/* Usage is reduced ? */
> >  		if (curusage >= oldusage)
> > @@ -1810,6 +1942,59 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> >  	return ret;
> >  }
> >  
> > +unsigned long
> > +mem_cgroup_soft_limit_reclaim(int priority, struct zone *zone, int nid,
> > +				gfp_t gfp_mask)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +	struct mem_cgroup *mem;
> > +	unsigned long flags;
> > +	unsigned long long usage;
> > +
> > +	/*
> > +	 * This loop can run a while, specially if mem_cgroup's continuously
> > +	 * keep exceeding their soft limit and putting the system under
> > +	 * pressure
> > +	 */
> > +	do {
> > +		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
> > +		if (!mem)
> > +			break;
> > +		usage = mem_cgroup_get_node_zone_usage(mem, zone, nid);
> > +		if (!usage)
> > +			goto skip_reclaim;
> 
> Why this works well ? if "mem" is the laragest, it will be inserted again
> as the largest. Do I miss any ?
>

No that is correct, but when reclaim is initiated from a different
zone/node combination, we still want mem to show up. Consider a simple
test case I run

Run "a" with soft_limit to zero and "b" with soft_limit to a 2G. When
I hit memory contention or low watermarks for the zone, I want to
reclaim from "a" most if not all the time. Removing "a" will not allow
other zone/node reclaims to find "a". 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/4] Memory controller soft limit patches (v4)
  2009-03-06  9:54 ` [PATCH 0/4] Memory controller soft limit patches (v4) KAMEZAWA Hiroyuki
@ 2009-03-06 10:05   ` Balbir Singh
  2009-03-06 10:34   ` [RFC][PATCH 0/3] memory controller soft limit (Yet Another One) v1 KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2009-03-06 10:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-06 18:54:40]:

> On Fri, 06 Mar 2009 14:53:23 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > New Feature: Soft limits for memory resource controller.
> > 
> > Changelog v4...v3
> > 1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
> >    while doing soft limit reclaim. We don't record priorities while
> >    doing soft reclaim
> > 2. Some of the overheads associated with soft limits (like calculating
> >    excess each time) is eliminated
> > 3. The time_after(jiffies, 0) bug has been fixed
> > 4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
> >    and at the same time tasks are increasing the memory footprint and causing
> >    the mem cgroup to exceed its soft limit.
> > 
> I don't think this "4" is necessary.
>

I responded to it and I had asked for review for this. Lets discuss it
there. I am open to doing this or not.
 
> 
> > Changelog v3...v2
> > 1. Implemented several review comments from Kosaki-San and Kamezawa-San
> >    Please see individual changelogs for changes
> > 
> > Changelog v2...v1
> > 1. Soft limits now support hierarchies
> > 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> > 
> > Here is v4 of the new soft limit implementation. Soft limits is a new feature
> > for the memory resource controller, something similar has existed in the
> > group scheduler in the form of shares. The CPU controllers interpretation
> > of shares is very different though. 
> > 
> > Soft limits are the most useful feature to have for environments where
> > the administrator wants to overcommit the system, such that only on memory
> > contention do the limits become active. The current soft limits implementation
> > provides a soft_limit_in_bytes interface for the memory controller and not
> > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > that exceed their soft limit and starts reclaiming from the group that
> > exceeds this limit by the maximum amount.
> > 
> > If there are no major objections to the patches, I would like to get them
> > included in -mm.
> > 
> You got Nack from me, again ;) And you know why.
> I'll post my one later, I hope that one will be good input for you.
>

Lets discuss the patches and your objections. I suspect it is because
of 4 above, but I don't want to keep guessing. 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v4)
  2009-03-06 10:01     ` Balbir Singh
@ 2009-03-06 10:14       ` KAMEZAWA Hiroyuki
  2009-03-06 10:41         ` Balbir Singh
  0 siblings, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-06 10:14 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Fri, 6 Mar 2009 15:31:55 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:


> > > +		if (wait)
> > > +			wait_for_completion(&mem->wait_on_soft_reclaim);
> > >  	}
> > What ???? Why we have to wait here...holding mmap->sem...This is too bad.
> >
> 
> Since mmap_sem is no longer used for pthread_mutex*, I was not sure.
> That is why I added the comment asking for more review and see what
> people think about it. We get here only when
> 
> 1. The memcg is over its soft limit
> 2. Tasks/threads belonging to memcg are faulting in more pages
> 
> The idea is to throttle them. If we did reclaim inline, like we do for
> hard limits, we can still end up holding mmap_sem for a long time.
> 
This "throttle" is hard to measuer the effect and IIUC, not implemneted in
vmscan.c ...for global try_to_free_pages() yet.
Under memory shortage. before reaching here, the thread already called
try_to_free_pages() or check some memory shorage conditions because
it called alloc_pages(). So, waiting here is redundant and gives it
too much penaly.


> > > +	/*
> > > +	 * This loop can run a while, specially if mem_cgroup's continuously
> > > +	 * keep exceeding their soft limit and putting the system under
> > > +	 * pressure
> > > +	 */
> > > +	do {
> > > +		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
> > > +		if (!mem)
> > > +			break;
> > > +		usage = mem_cgroup_get_node_zone_usage(mem, zone, nid);
> > > +		if (!usage)
> > > +			goto skip_reclaim;
> > 
> > Why this works well ? if "mem" is the laragest, it will be inserted again
> > as the largest. Do I miss any ?
> >
> 
> No that is correct, but when reclaim is initiated from a different
> zone/node combination, we still want mem to show up. 
....
your logic is
==
   nr_reclaimd = 0;
   do {
      mem = select victim.
      remvoe victim from the RBtree (the largest usage one is selected)
      if (victim is not good)
          goto  skip this.
      reclaimed += shirnk_zone.
      
skip_this:
      if (mem is still exceeds soft limit)
           insert RB tree again.
   } while(!nr_reclaimed)
==
When this exits loop ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC][PATCH 0/3] memory controller soft limit (Yet Another One) v1
  2009-03-06  9:54 ` [PATCH 0/4] Memory controller soft limit patches (v4) KAMEZAWA Hiroyuki
  2009-03-06 10:05   ` Balbir Singh
@ 2009-03-06 10:34   ` KAMEZAWA Hiroyuki
  2009-03-06 10:36     ` [RFC][PATCH 1/3] soft limit interface (Yet Another One) KAMEZAWA Hiroyuki
                       ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-06 10:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton


I don't say this should go but there are big distance between Balbir and me, so
showing what I'm thinking of in a patch. 

[1/3] interface of softlimit.
[2/3] recalaim logic of softlimit
[3/3] documenation.

Characteristic is.

  1. No hook to fast path.
  2. memory.softlimit_priority file is used in addtion to memory.softlimit file.
  3. vicitm cgroup at softlimit depends on priority given by user.
  4. softlimit can be set to any cgroup even if it's children in hierarchy.
  5. has some logic to sync with kswapd()'s balance_pgdat().

This patch should be sophisticated to some extent.(and may have bug.)

Example) Assume group_A which uses hierarchy and childrsn 01, 02, 03.
         The lower number priority, the less memory is reclaimd. 

   /group_A/    softlimit=300M      priority=0  (priority0 is ignored)
            01/ softlimit=unlimited priority=1
            02/ softlimit=unlimited priority=3
            03/ softlimit=unlimited priority=3
 
  1. When kswapd runs, memory will be reclaimed by 02 and 03 in round-robin.
  2. If no memory can be reclaimed from 02 and 03, memory will be reclaimed from 01
  3. If no memory can be reclaimed from 01,02,03, global shrink_zone() is called.

I'm sorry if my response is too slow.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC][PATCH 1/3]  soft limit interface (Yet Another One)
  2009-03-06 10:34   ` [RFC][PATCH 0/3] memory controller soft limit (Yet Another One) v1 KAMEZAWA Hiroyuki
@ 2009-03-06 10:36     ` KAMEZAWA Hiroyuki
  2009-03-06 10:37     ` [RFC][PATCH 2/3] memcg sotlimit logic " KAMEZAWA Hiroyuki
  2009-03-06 10:38     ` [RFC][PATCH 3/3] memcg documenation soft limit " KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-06 10:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This is a part of softlimit patch series for memcg.(1/3)

An interface for softlimit of memcg
This adds following params to memcg.

  - softlimit -- local softlimit of the memcg. exported as
		memory.softlimit file

  - softlimit_priority -- local softlimit priority of the memcg. exported as
	        memory.softlimit_priority
		high number is low priority...0 means "don't use soft limit"

  - min_softlimit_governor -- A memcg which has min softlimit in ancestors.

By this patch, following customization of memcg tree can be done (by users)
Example A)
    groupA softlimit = unlimited,prio=0    governor is group A.
     |- groupB softlimit = 1G,prio=1   governor is group B.
	  |- group C softlimit = unlimited,prio=3   governor is group B.
	  |- group D softlimit = unlimited,prio=2   governor is group B.
	  |- group E softlimit = unlimited,prio=3   governor is group B.

In above, group C and D,E 's its own softlimit is not set but under hierarchy,
it's dominated by groupB's one. Because Group C and E's priority is  lower
than GroupD's, they will be the first victim. (selection between C and E is
done by round-robin.)

Documentation will be added by following patches.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  146 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 143 insertions(+), 3 deletions(-)

Index: mmotm-2.6.29-Mar3/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar3.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar3/mm/memcontrol.c
@@ -175,7 +175,13 @@ struct mem_cgroup {
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
-
+	/*
+	 * softlimit
+	 */
+	u64 softlimit;
+	struct mem_cgroup *min_softlimit_governor;
+	int softlimit_priority;
+	struct list_head softlimit_list;
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -210,6 +216,10 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
+#define MEM_SOFTLIMIT           (0x10)
+#define MEM_SOFTLIMIT_PRIO      (0x11)
+
+
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -1537,6 +1547,44 @@ void mem_cgroup_uncharge_swap(swp_entry_
 #endif
 
 /*
+ * For softlimit handling.
+ */
+
+static DECLARE_RWSEM(softlimit_sem);
+#define SOFTLIMIT_MAX_PRIO  (4)
+
+struct {
+	struct list_head list[SOFTLIMIT_MAX_PRIO];
+} softlimit_head;
+
+static void __init init_softlimit(void)
+{
+	int i;
+	for (i = 0; i < SOFTLIMIT_MAX_PRIO; i++)
+		INIT_LIST_HEAD(&softlimit_head.list[i]);
+}
+
+static void softlimit_add_list_locked(struct mem_cgroup *mem)
+{
+	int level = mem->softlimit_priority;
+	list_add(&mem->softlimit_list, &softlimit_head.list[level]);
+}
+
+static void softlimit_del_list_locked(struct mem_cgroup *mem)
+{
+	if (!list_empty(&mem->softlimit_list))
+		list_del_init(&mem->softlimit_list);
+}
+
+static void softlimit_del_list(struct mem_cgroup *mem)
+{
+	down_write(&softlimit_sem);
+	softlimit_del_list_locked(mem);
+	up_write(&softlimit_sem);
+}
+
+
+/*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
  */
@@ -1939,6 +1987,56 @@ static int mem_cgroup_hierarchy_write(st
 	return retval;
 }
 
+static int __memcg_update_softlimit(struct mem_cgroup *mem, void *val)
+{
+	struct mem_cgroup *tmp = mem;
+	struct mem_cgroup *governor = NULL;
+	u64 min_softlimit = ULLONG_MAX;
+	struct cgroup *cg;
+
+	do {
+		if (min_softlimit > tmp->softlimit) {
+			min_softlimit = tmp->softlimit;
+			governor = tmp;
+		}
+
+		cg = tmp->css.cgroup;
+		if (!cg->parent)
+			break;
+		tmp = mem_cgroup_from_cont(cg->parent);
+	} while (tmp->use_hierarchy);
+
+	mem->min_softlimit_governor = governor;
+	return 0;
+}
+
+static int mem_cgroup_resize_softlimit(struct mem_cgroup *memcg,
+				       u64 val)
+{
+
+	down_write(&softlimit_sem);
+	memcg->softlimit = val;
+	/* Updates all children's governor information */
+	mem_cgroup_walk_tree(memcg, NULL, __memcg_update_softlimit);
+	up_write(&softlimit_sem);
+	return 0;
+}
+
+static int mem_cgroup_set_softlimit_prio(struct mem_cgroup *memcg,
+					 int prio)
+{
+	if ((prio < 0) || (prio >= SOFTLIMIT_MAX_PRIO))
+		return -EINVAL;
+
+	down_write(&softlimit_sem);
+	softlimit_del_list_locked(memcg);
+	memcg->softlimit_priority = prio;
+	if (prio)
+		softlimit_add_list_locked(memcg);
+	up_write(&softlimit_sem);
+	return 0;
+}
+
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
@@ -1949,7 +2047,12 @@ static u64 mem_cgroup_read(struct cgroup
 	name = MEMFILE_ATTR(cft->private);
 	switch (type) {
 	case _MEM:
-		val = res_counter_read_u64(&mem->res, name);
+		if (name == MEM_SOFTLIMIT)
+			val = mem->softlimit;
+		else if (name == MEM_SOFTLIMIT_PRIO)
+			val = mem->softlimit_priority;
+		else
+			val = res_counter_read_u64(&mem->res, name);
 		break;
 	case _MEMSWAP:
 		if (do_swap_account)
@@ -1986,6 +2089,12 @@ static int mem_cgroup_write(struct cgrou
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case MEM_SOFTLIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		ret = mem_cgroup_resize_softlimit(memcg, val);
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -2176,6 +2285,14 @@ static int mem_control_stat_show(struct 
 	return 0;
 }
 
+static int mem_cgroup_write_softlimit_priority(struct cgroup *cgrp,
+					       struct cftype *cft,
+					       u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	return mem_cgroup_set_softlimit_prio(memcg, (int)val);
+}
+
 static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
@@ -2235,6 +2352,18 @@ static struct cftype mem_cgroup_files[] 
 		.read_u64 = mem_cgroup_read,
 	},
 	{
+		.name = "softlimit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, MEM_SOFTLIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "softlimit_priority",
+		.private = MEMFILE_PRIVATE(_MEM, MEM_SOFTLIMIT_PRIO),
+		.write_u64 = mem_cgroup_write_softlimit_priority,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,
@@ -2438,12 +2567,16 @@ mem_cgroup_create(struct cgroup_subsys *
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_softlimit();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
 	}
-
+	INIT_LIST_HEAD(&mem->softlimit_list);
+	mem->softlimit = ULLONG_MAX;
+	/* This mutex is against softlimit */
+	down_write(&softlimit_sem);
 	if (parent && parent->use_hierarchy) {
 		res_counter_init(&mem->res, &parent->res);
 		res_counter_init(&mem->memsw, &parent->memsw);
@@ -2454,10 +2587,16 @@ mem_cgroup_create(struct cgroup_subsys *
 		 * mem_cgroup(see mem_cgroup_put).
 		 */
 		mem_cgroup_get(parent);
+		/* Inherit softlimit governor */
+		mem->min_softlimit_governor = parent->min_softlimit_governor;
+		mem->softlimit_priority = parent->softlimit_priority;
+		softlimit_add_list_locked(mem);
 	} else {
 		res_counter_init(&mem->res, NULL);
 		res_counter_init(&mem->memsw, NULL);
 	}
+	up_write(&softlimit_sem);
+
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);
 
@@ -2483,6 +2622,7 @@ static void mem_cgroup_destroy(struct cg
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	softlimit_del_list(mem);
 	mem_cgroup_put(mem);
 }
 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC][PATCH 2/3] memcg sotlimit logic (Yet Another One)
  2009-03-06 10:34   ` [RFC][PATCH 0/3] memory controller soft limit (Yet Another One) v1 KAMEZAWA Hiroyuki
  2009-03-06 10:36     ` [RFC][PATCH 1/3] soft limit interface (Yet Another One) KAMEZAWA Hiroyuki
@ 2009-03-06 10:37     ` KAMEZAWA Hiroyuki
  2009-03-06 10:38     ` [RFC][PATCH 3/3] memcg documenation soft limit " KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-06 10:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Patch for Mem cgroup softlimit (2/3)

This patch implements the core logic.

Memory cgroups with softlimit are linked to list of there own priority.
Balance_pgdat() scan it and try to remove pages.
Scanning status is recorded per node (means per kswapd) and cgroups
on the same softlimit level is scanned in round-robin manner.

If no softlimit hits, returns NULL.

balance_pgdat() work as following.
  1. at start, reset status and start scanning from the lowest priority.
     (= SOFTLIMIT_MAX_PRIORITY = 3)
  2. if priority is 0, ignore softlimit.
  2. Scan list of the priority and get victim.
  3. If no victim on the list, decrement priority (goto 2.)

the number fo scanning under softlimit is limited by balance_pgdat()
w.r.t scanning priority and target.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: mmotm-2.6.29-Mar3/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar3.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar3/mm/memcontrol.c
@@ -599,6 +599,21 @@ unsigned long mem_cgroup_zone_nr_pages(s
 	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
+unsigned long
+mem_cgroup_evictable_usage(struct mem_cgroup *memcg, int nid, int zid)
+{
+	unsigned long total = 0;
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	if (nr_swap_pages) {
+		total += MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+		total += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON);
+	}
+	total +=  MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+	total +=  MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE);
+	return total;
+}
+
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone)
 {
@@ -1583,6 +1598,134 @@ static void softlimit_del_list(struct me
 	up_write(&softlimit_sem);
 }
 
+static bool mem_cgroup_hit_softlimit(struct mem_cgroup *mem)
+{
+	struct mem_cgroup *governor = mem->min_softlimit_governor;
+	u64 usage;
+
+	usage = res_counter_read_u64(&governor->res, RES_USAGE);
+	return (usage > governor->softlimit);
+}
+
+/*
+ * Softlimit Handler. Softlimit is called by kswapd() and kswapd exists per
+ * node. Then, control structs for softlimit exists per node.
+ * Only user of this struct is kswapd. No lock is necessary.
+ */
+struct softlimit_control {
+	int prio;
+	struct mem_cgroup *iter[SOFTLIMIT_MAX_PRIO];
+};
+struct softlimit_control softlimit_ctrl[MAX_NUMNODES][MAX_NR_ZONES];
+
+/*
+ * Called when balance_pgdat() enters new turn and reset priority
+ * information recorded.
+ */
+void mem_cgroup_start_softlimit_reclaim(int nid)
+{
+	int zid;
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++)
+		softlimit_ctrl[nid][zid].prio = SOFTLIMIT_MAX_PRIO - 1;
+}
+/*
+ * Seatch victim in specified priority level. If not found, retruns NULL.
+ * For implemnting round-robin, list_for_each_entry_from() is used.
+ */
+struct mem_cgroup *__mem_cgroup_get_vicitm_prio(int nid, int zid,
+				struct mem_cgroup *start, int prio)
+{
+	struct list_head *list = &softlimit_head.list[prio];
+	struct mem_cgroup *mem, *ret;
+	int loop = 0;
+
+	if (!start && list_empty(list))
+		return NULL;
+
+	if (!start) /* start from the head of list */
+		start = list_entry(list->next,
+				   struct mem_cgroup, softlimit_list);
+	mem = start;
+	ret = NULL;
+retry:  /* round robin */
+	list_for_each_entry_from(mem, list, softlimit_list) {
+		if (loop == 1 && mem == start)
+			break;
+		if (!css_tryget(&mem->css))
+			continue;
+		if (mem_cgroup_hit_softlimit(mem) &&
+		    mem_cgroup_evictable_usage(mem, nid, zid)) {
+			ret = mem;
+			break;
+		}
+		css_put(&mem->css);
+	}
+	if (!ret && loop++ == 0) {
+		/* restart from the head of list */
+		mem = list_entry(list->next,
+				 struct mem_cgroup, softlimit_list);
+		goto retry;
+	}
+	return ret;
+}
+
+struct mem_cgroup *mem_cgroup_get_victim(int nid, int zid)
+{
+	struct softlimit_control *slc = &softlimit_ctrl[nid][zid];
+	struct mem_cgroup *mem, *ret;
+	int prio;
+	ret = NULL;
+
+	/* before enter round-robin, check it's worth to try or not. */
+	if (slc->prio == 0)
+		return NULL;
+	prio = slc->prio;
+	/* Try read-lock */
+	if (!down_read_trylock(&softlimit_sem))
+		return NULL;
+new_prio:
+	/* At first check start point marker */
+	mem = slc->iter[prio];
+	if (mem) {
+		if (css_is_removed(&mem->css) ||
+		    mem->softlimit_priority != prio) {
+			mem_cgroup_put(mem);
+		}
+	}
+	slc->iter[prio] = NULL;
+	ret = __mem_cgroup_get_vicitm_prio(nid, zid, mem, prio);
+	if (mem) {
+		mem_cgroup_put(mem);
+		mem = NULL;
+	}
+	if (!ret) {
+		prio--;
+		if (prio > 0)
+			goto new_prio;
+	}
+	if (ret) { /* Remember the "next" position */
+		prio = ret->softlimit_priority;
+		if (ret->softlimit_list.next != &softlimit_head.list[prio]) {
+			mem = list_entry(ret->softlimit_list.next,
+					 struct mem_cgroup, softlimit_list);
+			slc->iter[prio] = mem;
+			mem_cgroup_get(mem);
+		} else
+			slc->iter[prio] = NULL;
+	} else {
+		/* We have no candidates. ignore softlimit in this turn */
+		slc->prio = 0;
+	}
+	up_read(&softlimit_sem);
+	return ret;
+}
+
+void mem_cgroup_put_victim(struct mem_cgroup *mem)
+{
+	if (mem)
+		mem_cgroup_put(mem);
+}
 
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
Index: mmotm-2.6.29-Mar3/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.29-Mar3.orig/include/linux/memcontrol.h
+++ mmotm-2.6.29-Mar3/include/linux/memcontrol.h
@@ -117,6 +117,10 @@ static inline bool mem_cgroup_disabled(v
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
 
+extern void mem_cgroup_start_softlimit_reclaim(int nid);
+extern struct mem_cgroup *mem_cgroup_get_victim(int nid, int zid);
+extern void mem_cgroup_put_vicitm(struct mem_cgroup *mem);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -264,6 +268,20 @@ mem_cgroup_print_oom_info(struct mem_cgr
 {
 }
 
+
+static void mem_cgroup_start_softlimit_reclaim(int nid)
+{
+}
+
+static struct mem_cgroup *mem_cgroup_get_vicitm(int nid, int zid)
+{
+	return NULL;
+}
+
+static void mem_cgroup_put_vicitm(struct mem_cgroup *mem)
+{
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-2.6.29-Mar3/mm/vmscan.c
===================================================================
--- mmotm-2.6.29-Mar3.orig/mm/vmscan.c
+++ mmotm-2.6.29-Mar3/mm/vmscan.c
@@ -1733,6 +1733,39 @@ unsigned long try_to_free_mem_cgroup_pag
 }
 #endif
 
+#define SOFTLIMIT_SCAN_MAX (512)
+void shrink_zone_softlimit(struct scan_control *sc, struct zone *zone,
+			   int order, int priority, int target, int end_zone)
+{
+	struct mem_cgroup *mem;
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	int scan = SWAP_CLUSTER_MAX;
+
+	scan <<= (DEF_PRIORITY - priority);
+	if (scan > (target * 2))
+		scan = target * 2;
+retry:
+	mem = mem_cgroup_get_victim(nid, zid);
+	if (!mem)
+		return;
+
+	sc->nr_scanned = 0;
+	sc->mem_cgroup = mem;
+	sc->isolate_pages = mem_cgroup_isolate_pages;
+
+	shrink_zone(priority, zone, sc);
+	sc->mem_cgroup = NULL;
+	sc->isolate_pages = isolate_pages_global;
+	if (zone_watermark_ok(zone, order, target, end_zone, 0))
+		return;
+	scan -= sc->nr_scanned;
+	/* We should avoid too much scanning against this priority level */
+	if (scan > 0)
+		goto retry;
+	return;
+}
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
@@ -1776,6 +1809,7 @@ static unsigned long balance_pgdat(pg_da
 	 */
 	int temp_priority[MAX_NR_ZONES];
 
+	mem_cgroup_start_softlimit_reclaim(pgdat->node_id);
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
@@ -1856,6 +1890,11 @@ loop_again:
 					       end_zone, 0))
 				all_zones_ok = 0;
 			temp_priority[i] = priority;
+
+			/* try soft limit of memory cgroup */
+			shrink_zone_softlimit(&sc, zone, order, priority,
+				      8 * zone->pages_high, end_zone);
+
 			sc.nr_scanned = 0;
 			note_zone_scanning_priority(zone, priority);
 			/*


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC][PATCH 3/3] memcg documenation soft limit (Yet Another One)
  2009-03-06 10:34   ` [RFC][PATCH 0/3] memory controller soft limit (Yet Another One) v1 KAMEZAWA Hiroyuki
  2009-03-06 10:36     ` [RFC][PATCH 1/3] soft limit interface (Yet Another One) KAMEZAWA Hiroyuki
  2009-03-06 10:37     ` [RFC][PATCH 2/3] memcg sotlimit logic " KAMEZAWA Hiroyuki
@ 2009-03-06 10:38     ` KAMEZAWA Hiroyuki
  2009-03-06 16:47       ` Randy Dunlap
  2 siblings, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-06 10:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Documentation for softlimit (3/3)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Index: mmotm-2.6.29-Mar3/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-2.6.29-Mar3.orig/Documentation/cgroups/memory.txt
+++ mmotm-2.6.29-Mar3/Documentation/cgroups/memory.txt
@@ -322,6 +322,25 @@ will be charged as a new owner of it.
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 softlimit
+  Memory cgroup supports softlimit and has 2 params for control.
+    - memory.softlimit_in_bytes
+	- softlimit to this cgroup.
+    - memory.softlimit_priority.
+	- priority of this cgroup at softlimit reclaim.
+	  Allowed priority level is 3-0 and 3 is the lowest.
+	  If 0, this cgroup will not be target of softlimit.
+
+  At memory shortage of the system (or local node/zone), softlimit helps
+  kswapd(), a global memory recalim kernel thread, and inform victim cgroup
+  to be shrinked to kswapd.
+
+  Victim selection logic:
+  The kernel searches from the lowest priroty(3) up to the highest(1).
+  If it find a cgroup witch has memory larger than softlimit, steal memory
+  from it.
+  If multiple cgroups are on the same priority, each cgroup wil be a
+  victim in turn.
 
 6. Hierarchy support
 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v4)
  2009-03-06 10:14       ` KAMEZAWA Hiroyuki
@ 2009-03-06 10:41         ` Balbir Singh
  0 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2009-03-06 10:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-06 19:14:36]:

> On Fri, 6 Mar 2009 15:31:55 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> 
> > > > +		if (wait)
> > > > +			wait_for_completion(&mem->wait_on_soft_reclaim);
> > > >  	}
> > > What ???? Why we have to wait here...holding mmap->sem...This is too bad.
> > >
> > 
> > Since mmap_sem is no longer used for pthread_mutex*, I was not sure.
> > That is why I added the comment asking for more review and see what
> > people think about it. We get here only when
> > 
> > 1. The memcg is over its soft limit
> > 2. Tasks/threads belonging to memcg are faulting in more pages
> > 
> > The idea is to throttle them. If we did reclaim inline, like we do for
> > hard limits, we can still end up holding mmap_sem for a long time.
> > 
> This "throttle" is hard to measuer the effect and IIUC, not implemneted in
> vmscan.c ...for global try_to_free_pages() yet.
> Under memory shortage. before reaching here, the thread already called
> try_to_free_pages() or check some memory shorage conditions because
> it called alloc_pages(). So, waiting here is redundant and gives it
> too much penaly.

The reason for adding it consider the the following scenario

1. Create cgroup "a", give it a soft limit of 0
2. Create cgroup "b", give it a soft limit of 3G.

With both "a' and "b" running, reclaiming from "a" makes no sense, it
goes and does a bulk allocation and increases it usage again. It does
not make sense to reclaim from "b" until it crosses 3G.

Throttling is not implemented in the main VM, but we have seen several
patches for it. This is a special case for soft limits.

> 
> 
> > > > +	/*
> > > > +	 * This loop can run a while, specially if mem_cgroup's continuously
> > > > +	 * keep exceeding their soft limit and putting the system under
> > > > +	 * pressure
> > > > +	 */
> > > > +	do {
> > > > +		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
> > > > +		if (!mem)
> > > > +			break;
> > > > +		usage = mem_cgroup_get_node_zone_usage(mem, zone, nid);
> > > > +		if (!usage)
> > > > +			goto skip_reclaim;
> > > 
> > > Why this works well ? if "mem" is the laragest, it will be inserted again
> > > as the largest. Do I miss any ?
> > >
> > 
> > No that is correct, but when reclaim is initiated from a different
> > zone/node combination, we still want mem to show up. 
> ....
> your logic is
> ==
>    nr_reclaimd = 0;
>    do {
>       mem = select victim.
>       remvoe victim from the RBtree (the largest usage one is selected)
>       if (victim is not good)
>           goto  skip this.
>       reclaimed += shirnk_zone.
>       
> skip_this:
>       if (mem is still exceeds soft limit)
>            insert RB tree again.
>    } while(!nr_reclaimed)
> ==
> When this exits loop ?
>

This is spill over from the main code without zones and nodes. Since
there, there was no concept of 0 usage and having a mem_cgroup on the
tree with highest usage. In practice, if we hit soft limit reclaim,
for each zone, kswapd will be called, at-least for one of the
node/zones that the mem we dequeud from has memory usage in. At that
point, the necessary changes to the RB-Tree will happen. However, you
have found a potential problem and I'll fix it in the next iteration.
 
> Thanks,
> -Kame
> 
> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH 3/3] memcg documenation soft limit (Yet Another One)
  2009-03-06 10:38     ` [RFC][PATCH 3/3] memcg documenation soft limit " KAMEZAWA Hiroyuki
@ 2009-03-06 16:47       ` Randy Dunlap
  2009-03-08 23:44         ` KAMEZAWA Hiroyuki
  2009-03-08 23:45         ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 18+ messages in thread
From: Randy Dunlap @ 2009-03-06 16:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Documentation for softlimit (3/3)
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  Documentation/cgroups/memory.txt |   19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> Index: mmotm-2.6.29-Mar3/Documentation/cgroups/memory.txt
> ===================================================================
> --- mmotm-2.6.29-Mar3.orig/Documentation/cgroups/memory.txt
> +++ mmotm-2.6.29-Mar3/Documentation/cgroups/memory.txt
> @@ -322,6 +322,25 @@ will be charged as a new owner of it.
>    - a cgroup which uses hierarchy and it has child cgroup.
>    - a cgroup which uses hierarchy and not the root of hierarchy.
>  
> +5.4 softlimit
> +  Memory cgroup supports softlimit and has 2 params for control.

                                                parameters

> +    - memory.softlimit_in_bytes
> +	- softlimit to this cgroup.
         softlimit for this cgroup
		(i.e., no beginning '-' and no ending '.')


> +    - memory.softlimit_priority.
> +	- priority of this cgroup at softlimit reclaim.
	 priority of this cgroup at softlimit reclaim

> +	  Allowed priority level is 3-0 and 3 is the lowest.

	Not very user friendly...

> +	  If 0, this cgroup will not be target of softlimit.
> +
> +  At memory shortage of the system (or local node/zone), softlimit helps
> +  kswapd(), a global memory recalim kernel thread, and inform victim cgroup

                               reclaim                    informs

> +  to be shrinked to kswapd.
> +
> +  Victim selection logic:
> +  The kernel searches from the lowest priroty(3) up to the highest(1).

                                         priority                     0 ?? (from above)

> +  If it find a cgroup witch has memory larger than softlimit, steal memory

           finds         which

> +  from it.
> +  If multiple cgroups are on the same priority, each cgroup wil be a

                                                               will

> +  victim in turn.
>  
>  6. Hierarchy support


-- 
~Randy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH 3/3] memcg documenation soft limit (Yet Another One)
  2009-03-06 16:47       ` Randy Dunlap
@ 2009-03-08 23:44         ` KAMEZAWA Hiroyuki
  2009-03-08 23:45         ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-08 23:44 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Balbir Singh, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton

On Fri, 06 Mar 2009 08:47:31 -0800
Randy Dunlap <randy.dunlap@oracle.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thank you. (and sorry
-Kame

> > 
> > Documentation for softlimit (3/3)
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  Documentation/cgroups/memory.txt |   19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> > 
> > Index: mmotm-2.6.29-Mar3/Documentation/cgroups/memory.txt
> > ===================================================================
> > --- mmotm-2.6.29-Mar3.orig/Documentation/cgroups/memory.txt
> > +++ mmotm-2.6.29-Mar3/Documentation/cgroups/memory.txt
> > @@ -322,6 +322,25 @@ will be charged as a new owner of it.
> >    - a cgroup which uses hierarchy and it has child cgroup.
> >    - a cgroup which uses hierarchy and not the root of hierarchy.
> >  
> > +5.4 softlimit
> > +  Memory cgroup supports softlimit and has 2 params for control.
> 
>                                                 parameters
> 
> > +    - memory.softlimit_in_bytes
> > +	- softlimit to this cgroup.
>          softlimit for this cgroup
> 		(i.e., no beginning '-' and no ending '.')
> 
> 
> > +    - memory.softlimit_priority.
> > +	- priority of this cgroup at softlimit reclaim.
> 	 priority of this cgroup at softlimit reclaim
> 
> > +	  Allowed priority level is 3-0 and 3 is the lowest.
> 
> 	Not very user friendly...
> 
> > +	  If 0, this cgroup will not be target of softlimit.
> > +
> > +  At memory shortage of the system (or local node/zone), softlimit helps
> > +  kswapd(), a global memory recalim kernel thread, and inform victim cgroup
> 
>                                reclaim                    informs
> 
> > +  to be shrinked to kswapd.
> > +
> > +  Victim selection logic:
> > +  The kernel searches from the lowest priroty(3) up to the highest(1).
> 
>                                          priority                     0 ?? (from above)
> 
> > +  If it find a cgroup witch has memory larger than softlimit, steal memory
> 
>            finds         which
> 
> > +  from it.
> > +  If multiple cgroups are on the same priority, each cgroup wil be a
> 
>                                                                will
> 
> > +  victim in turn.
> >  
> >  6. Hierarchy support
> 
> 
> -- 
> ~Randy
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH 3/3] memcg documenation soft limit (Yet Another One)
  2009-03-06 16:47       ` Randy Dunlap
  2009-03-08 23:44         ` KAMEZAWA Hiroyuki
@ 2009-03-08 23:45         ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-08 23:45 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Balbir Singh, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, KOSAKI Motohiro,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Rik van Riel,
	Andrew Morton

On Fri, 06 Mar 2009 08:47:31 -0800
Randy Dunlap <randy.dunlap@oracle.com> wrote:
> > +	  Allowed priority level is 3-0 and 3 is the lowest.
> 
> 	Not very user friendly...
> 
Ok, then, 0(lowest) - 8(highest) in the next version.

-Kame

> > +	  If 0, this cgroup will not be target of softlimit.
> > +
> > +  At memory shortage of the system (or local node/zone), softlimit helps
> > +  kswapd(), a global memory recalim kernel thread, and inform victim cgroup
> 
>                                reclaim                    informs
> 
> > +  to be shrinked to kswapd.
> > +
> > +  Victim selection logic:
> > +  The kernel searches from the lowest priroty(3) up to the highest(1).
> 
>                                          priority                     0 ?? (from above)
> 
> > +  If it find a cgroup witch has memory larger than softlimit, steal memory
> 
>            finds         which
> 
> > +  from it.
> > +  If multiple cgroups are on the same priority, each cgroup wil be a
> 
>                                                                will
> 
> > +  victim in turn.
> >  
> >  6. Hierarchy support
> 
> 
> -- 
> ~Randy
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2009-03-08 23:47 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-06  9:23 [PATCH 0/4] Memory controller soft limit patches (v4) Balbir Singh
2009-03-06  9:23 ` [PATCH 1/4] Memory controller soft limit documentation (v4) Balbir Singh
2009-03-06  9:23 ` [PATCH 2/4] Memory controller soft limit interface (v4) Balbir Singh
2009-03-06  9:23 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v4) Balbir Singh
2009-03-06  9:23 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v4) Balbir Singh
2009-03-06  9:51   ` KAMEZAWA Hiroyuki
2009-03-06 10:01     ` Balbir Singh
2009-03-06 10:14       ` KAMEZAWA Hiroyuki
2009-03-06 10:41         ` Balbir Singh
2009-03-06  9:54 ` [PATCH 0/4] Memory controller soft limit patches (v4) KAMEZAWA Hiroyuki
2009-03-06 10:05   ` Balbir Singh
2009-03-06 10:34   ` [RFC][PATCH 0/3] memory controller soft limit (Yet Another One) v1 KAMEZAWA Hiroyuki
2009-03-06 10:36     ` [RFC][PATCH 1/3] soft limit interface (Yet Another One) KAMEZAWA Hiroyuki
2009-03-06 10:37     ` [RFC][PATCH 2/3] memcg sotlimit logic " KAMEZAWA Hiroyuki
2009-03-06 10:38     ` [RFC][PATCH 3/3] memcg documenation soft limit " KAMEZAWA Hiroyuki
2009-03-06 16:47       ` Randy Dunlap
2009-03-08 23:44         ` KAMEZAWA Hiroyuki
2009-03-08 23:45         ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).