* [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs
@ 2012-02-10 21:36 Aneesh Kumar K.V
  2012-02-10 21:36 ` [RFC PATCH 1/6] hugetlb: Add a new hugetlb cgroup Aneesh Kumar K.V
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-10 21:36 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aneesh.kumar

Hi,

This patchset implements a cgroup resource controller for HugeTLB pages.
It is similar to the existing hugetlb quota support in that the limit is
enforced at mmap(2) time and not at fault time; hugetlb quota, however,
limits the number of huge pages that can be allocated per superblock,
while this controller limits them per cgroup.

For shared mappings we track the region mapped by a task along with the
hugetlb cgroup in the inode's region list. We keep the hugetlb cgroup
charged even after the task that did the mmap(2) exits; the uncharge
happens during file truncation. For private mappings we charge and
uncharge the current task's cgroup.

The current patchset doesn't support cgroup hierarchies. We also don't
allow task migration across cgroups.
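
To illustrate what mmap(2)-time enforcement means for applications (this is
only a sketch, not part of the patchset; the hugetlbfs path is an assumption
and len is assumed to be a multiple of the huge page size), a program can
treat a failed mmap(2) of a hugetlbfs file as the signal to fall back to
normal pages:

	/*
	 * Hypothetical fallback pattern: the hugetlb cgroup charge happens
	 * at mmap() time, so failure is visible before any fault occurs.
	 */
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <stddef.h>

	static void *map_with_fallback(size_t len)
	{
		void *p = MAP_FAILED;
		int fd = open("/mnt/hugetlbfs/data", O_CREAT | O_RDWR, 0600);

		if (fd >= 0) {
			/* charged against the task's hugetlb cgroup here */
			p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
			close(fd);
		}
		if (p == MAP_FAILED)	/* over the limit: use small pages */
			p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		return p == MAP_FAILED ? NULL : p;
	}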

-aneesh


* [RFC PATCH 1/6] hugetlb: Add a new hugetlb cgroup
  2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
@ 2012-02-10 21:36 ` Aneesh Kumar K.V
  2012-02-10 21:36 ` [RFC PATCH 2/6] hugetlbfs: Add usage and max usage files to " Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-10 21:36 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aneesh.kumar

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The hugetlb controller limits the number of hugepages a cgroup can
allocate. We enforce the limit at mmap(2) time and NOT at fault time. The
behaviour is similar to hugetlb quota support, except that quota enforces
the limit per hugetlbfs mount point.
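
All per-hstate control files share a single read handler and a single write
handler; as a condensed fragment of what the patch does (illustrative only),
cft->private packs the hstate index and the res_counter member so the
handlers can recover both halves again:

	/* e.g. for the first hstate's limit file (in the setup code): */
	cft->private = MEMFILE_PRIVATE(0, RES_LIMIT);

	/* ...and the shared handlers unpack the same value: */
	idx  = MEMFILE_TYPE(cft->private);	/* which hstate/res_counter */
	name = MEMFILE_ATTR(cft->private);	/* RES_LIMIT, RES_USAGE, ... */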

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/Makefile          |    1 +
 fs/hugetlbfs/hugetlb_cgroup.c  |  207 ++++++++++++++++++++++++++++++++++++++++
 include/linux/cgroup_subsys.h  |    6 +
 include/linux/hugetlb.h        |   12 ++-
 include/linux/hugetlb_cgroup.h |   21 ++++
 init/Kconfig                   |   10 ++
 mm/hugetlb.c                   |   58 +++++++++++-
 7 files changed, 312 insertions(+), 3 deletions(-)
 create mode 100644 fs/hugetlbfs/hugetlb_cgroup.c
 create mode 100644 include/linux/hugetlb_cgroup.h

diff --git a/fs/hugetlbfs/Makefile b/fs/hugetlbfs/Makefile
index 6adf870..986c778 100644
--- a/fs/hugetlbfs/Makefile
+++ b/fs/hugetlbfs/Makefile
@@ -5,3 +5,4 @@
 obj-$(CONFIG_HUGETLBFS) += hugetlbfs.o
 
 hugetlbfs-objs := inode.o
+hugetlbfs-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
new file mode 100644
index 0000000..f6521ee
--- /dev/null
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -0,0 +1,207 @@
+/*
+ *
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/res_counter.h>
+
+/* lifted from mem control */
+#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
+#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val)	((val) & 0xffff)
+
+struct hugetlb_cgroup {
+	struct cgroup_subsys_state css;
+	/*
+	 * the counter to account for hugepages from hugetlb.
+	 */
+	struct res_counter memhuge[HUGE_MAX_HSTATE];
+};
+
+struct cgroup_subsys hugetlb_subsys __read_mostly;
+struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+
+static inline
+struct hugetlb_cgroup *css_to_hugetlbcgroup(struct cgroup_subsys_state *s)
+{
+	return container_of(s, struct hugetlb_cgroup, css);
+}
+
+static inline
+struct hugetlb_cgroup *cgroup_to_hugetlbcgroup(struct cgroup *cgroup)
+{
+	return css_to_hugetlbcgroup(cgroup_subsys_state(cgroup,
+							hugetlb_subsys_id));
+}
+
+static inline
+struct hugetlb_cgroup *task_hugetlbcgroup(struct task_struct *task)
+{
+	return css_to_hugetlbcgroup(task_subsys_state(task, hugetlb_subsys_id));
+}
+
+static inline int hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
+{
+	return (h_cg == root_h_cgroup);
+}
+
+u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft)
+{
+	int name, idx;
+	unsigned long long val;
+	struct hugetlb_cgroup *h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+
+	idx = MEMFILE_TYPE(cft->private);
+	name = MEMFILE_ATTR(cft->private);
+
+	val = res_counter_read_u64(&h_cgroup->memhuge[idx], name);
+	return val;
+}
+
+int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
+			 const char *buffer)
+{
+	int name, ret, idx;
+	unsigned long long val;
+	struct hugetlb_cgroup *h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+
+	/* This function does all necessary parse...reuse it */
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return ret;
+
+	idx = MEMFILE_TYPE(cft->private);
+	name = MEMFILE_ATTR(cft->private);
+
+	switch (name) {
+	case RES_LIMIT:
+		ret = res_counter_set_limit(&h_cgroup->memhuge[idx], val);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int hugetlbcgroup_can_attach(struct cgroup_subsys *ss,
+				    struct cgroup *new_cgrp,
+				    struct cgroup_taskset *set)
+{
+	struct hugetlb_cgroup *h_cg;
+	struct task_struct *task = cgroup_taskset_first(set);
+	/*
+	 * Make sure all the tasks in the set are in the root cgroup.
+	 * We only allow moving from the root cgroup to another cgroup.
+	 */
+	while (task != NULL) {
+		rcu_read_lock();
+		h_cg = task_hugetlbcgroup(task);
+		if (!hugetlb_cgroup_is_root(h_cg)) {
+			rcu_read_unlock();
+			return -EOPNOTSUPP;
+		}
+		rcu_read_unlock();
+		task = cgroup_taskset_next(set);
+	}
+	return 0;
+}
+
+/*
+ * called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+hugetlbcgroup_create(struct cgroup_subsys *ss, struct cgroup *cgroup)
+{
+	int idx, ret;
+	struct cgroup *parent_cgroup;
+	struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
+
+	h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
+	if (!h_cgroup)
+		return ERR_PTR(-ENOMEM);
+
+	parent_cgroup = cgroup->parent;
+	if (parent_cgroup) {
+		parent_h_cgroup = cgroup_to_hugetlbcgroup(parent_cgroup);
+		/*
+		 * We don't support subdirectories
+		 */
+		if (!hugetlb_cgroup_is_root(parent_h_cgroup)) {
+			ret = -EINVAL;
+			goto err_out;
+		}
+		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+			res_counter_init(&h_cgroup->memhuge[idx],
+					 &parent_h_cgroup->memhuge[idx]);
+	} else {
+		root_h_cgroup = h_cgroup;
+		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+			res_counter_init(&h_cgroup->memhuge[idx], NULL);
+	}
+	return &h_cgroup->css;
+
+err_out:
+	kfree(h_cgroup);
+	return ERR_PTR(ret);
+}
+
+static int hugetlbcgroup_pre_destroy(struct cgroup_subsys *ss,
+				     struct cgroup *cgroup)
+{
+	u64 val;
+	int idx;
+	struct hugetlb_cgroup *h_cgroup;
+
+	h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+	/*
+	 * We don't allow cgroup deletion if it has some
+	 * resource charged against it.
+	 */
+	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
+		val = res_counter_read_u64(&h_cgroup->memhuge[idx], RES_USAGE);
+		if (val)
+			return -EBUSY;
+	}
+	return 0;
+}
+
+static void hugetlbcgroup_destroy(struct cgroup_subsys *ss,
+				  struct cgroup *cgroup)
+{
+	struct hugetlb_cgroup *h_cgroup;
+
+	h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+	kfree(h_cgroup);
+}
+
+static int hugetlbcgroup_populate(struct cgroup_subsys *ss,
+				  struct cgroup *cgroup)
+{
+	return register_hugetlb_cgroup_files(ss, cgroup);
+}
+
+struct cgroup_subsys hugetlb_subsys = {
+	.name = "hugetlb",
+	.can_attach = hugetlbcgroup_can_attach,
+	.create     = hugetlbcgroup_create,
+	.pre_destroy = hugetlbcgroup_pre_destroy,
+	.destroy    = hugetlbcgroup_destroy,
+	.populate   = hugetlbcgroup_populate,
+	.subsys_id  = hugetlb_subsys_id,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0bd390c..895923a 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -72,3 +72,9 @@ SUBSYS(net_prio)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+SUBSYS(hugetlb)
+#endif
+
+/* */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9d6c86..2b6b231 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
 #include <linux/mm_types.h>
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
+#include <linux/cgroup.h>
 
 struct ctl_table;
 struct user_struct;
@@ -68,7 +69,8 @@ int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 void hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
-
+int register_hugetlb_cgroup_files(struct cgroup_subsys *ss,
+				  struct cgroup *cgroup);
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline int PageHuge(struct page *page)
@@ -109,7 +111,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
 }
 
 #define hugetlb_change_protection(vma, address, end, newprot)
-
+static inline int register_hugetlb_cgroup_files(struct cgroup_subsys *ss,
+						 struct cgroup *cgroup)
+{
+	return 0;
+}
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"
@@ -220,6 +226,8 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+	/* cgroup control files */
+	struct cftype cgroup_limit_file;
 	char name[HSTATE_NAME_LEN];
 };
 
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
new file mode 100644
index 0000000..2330dd0
--- /dev/null
+++ b/include/linux/hugetlb_cgroup.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#ifndef _LINUX_HUGETLB_CGROUP_H
+#define _LINUX_HUGETLB_CGROUP_H
+
+extern u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft);
+extern int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
+				const char *buffer);
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..78d4961 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -673,6 +673,16 @@ config CGROUP_MEM_RES_CTLR
 	  This config option also selects MM_OWNER config option, which
 	  could in turn add some fork/exit overhead.
 
+config CGROUP_HUGETLB_RES_CTLR
+	bool "HugeTLB Resource Controller for Control Groups"
+	depends on RESOURCE_COUNTERS && HUGETLBFS
+	help
+	  Provides a simple cgroup Resource Controller for HugeTLB pages.
+	  The controller limit is enforced at mmap(2) time, so that
+	  applications can fall back to allocations using a smaller page size
+	  if the cgroup resource limit prevents them from allocating HugeTLB
+	  pages.
+
 config CGROUP_MEM_RES_CTLR_SWAP
 	bool "Memory Resource Controller Swap Extension"
 	depends on CGROUP_MEM_RES_CTLR && SWAP
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f34bd8..f643f72 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -28,6 +28,11 @@
 
 #include <linux/hugetlb.h>
 #include <linux/node.h>
+#include <linux/cgroup.h>
+#include <linux/hugetlb_cgroup.h>
+#include <linux/res_counter.h>
+#include <linux/page_cgroup.h>
+
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1798,6 +1803,57 @@ static int __init hugetlb_init(void)
 }
 module_init(hugetlb_init);
 
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+int register_hugetlb_cgroup_files(struct cgroup_subsys *ss,
+				  struct cgroup *cgroup)
+{
+	int ret = 0;
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		ret = cgroup_add_file(cgroup, ss, &h->cgroup_limit_file);
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
+static char *mem_fmt(char *buf, unsigned long n)
+{
+	if (n >= (1UL << 30))
+		sprintf(buf, "%luGB", n >> 30);
+	else if (n >= (1UL << 20))
+		sprintf(buf, "%luMB", n >> 20);
+	else
+		sprintf(buf, "%luKB", n >> 10);
+	return buf;
+}
+
+static int hugetlb_cgroup_file_init(struct hstate *h, int idx)
+{
+	char buf[32];
+	struct cftype *cft;
+
+	/* format the size */
+	mem_fmt(buf, huge_page_size(h));
+
+	/* Add the limit file */
+	cft = &h->cgroup_limit_file;
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+	cft->read_u64 = hugetlb_cgroup_read;
+	cft->write_string = hugetlb_cgroup_write;
+
+	return 0;
+}
+#else
+static int hugetlb_cgroup_file_init(struct hstate *h, int idx)
+{
+	return 0;
+}
+#endif
+
 /* Should be called on processing a hugepagesz=... option */
 void __init hugetlb_add_hstate(unsigned order)
 {
@@ -1821,7 +1877,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
-
+	hugetlb_cgroup_file_init(h, max_hstate - 1);
 	parsed_hstate = h;
 }
 
-- 
1.7.9


* [RFC PATCH 2/6] hugetlbfs: Add usage and max usage files to hugetlb cgroup
  2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
  2012-02-10 21:36 ` [RFC PATCH 1/6] hugetlb: Add a new hugetlb cgroup Aneesh Kumar K.V
@ 2012-02-10 21:36 ` Aneesh Kumar K.V
  2012-02-10 21:36 ` [RFC PATCH 3/6] hugetlbfs: Add new region handling functions Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-10 21:36 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aneesh.kumar

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c  |   12 ++++++++++++
 include/linux/hugetlb.h        |    2 ++
 include/linux/hugetlb_cgroup.h |    1 +
 mm/hugetlb.c                   |   21 +++++++++++++++++++++
 4 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index f6521ee..f2368ed 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -99,6 +99,18 @@ int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
 	return ret;
 }
 
+int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
+{
+	int name, idx;
+	struct hugetlb_cgroup *h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+
+	idx = MEMFILE_TYPE(event);
+	name = MEMFILE_ATTR(event);
+
+	res_counter_reset_max(&h_cgroup->memhuge[idx]);
+	return 0;
+}
+
 static int hugetlbcgroup_can_attach(struct cgroup_subsys *ss,
 				    struct cgroup *new_cgrp,
 				    struct cgroup_taskset *set)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2b6b231..4392b6a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -228,6 +228,8 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 	/* cgroup control files */
 	struct cftype cgroup_limit_file;
+	struct cftype cgroup_usage_file;
+	struct cftype cgroup_max_usage_file;
 	char name[HSTATE_NAME_LEN];
 };
 
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 2330dd0..11cd6c4 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -18,4 +18,5 @@
 extern u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft);
 extern int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
 				const char *buffer);
+extern int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event);
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f643f72..865b41f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1814,6 +1814,13 @@ int register_hugetlb_cgroup_files(struct cgroup_subsys *ss,
 		ret = cgroup_add_file(cgroup, ss, &h->cgroup_limit_file);
 		if (ret)
 			return ret;
+		ret = cgroup_add_file(cgroup, ss, &h->cgroup_usage_file);
+		if (ret)
+			return ret;
+		ret = cgroup_add_file(cgroup, ss, &h->cgroup_max_usage_file);
+		if (ret)
+			return ret;
+
 	}
 	return ret;
 }
@@ -1845,6 +1852,20 @@ static int hugetlb_cgroup_file_init(struct hstate *h, int idx)
 	cft->read_u64 = hugetlb_cgroup_read;
 	cft->write_string = hugetlb_cgroup_write;
 
+	/* Add the usage file */
+	cft = &h->cgroup_usage_file;
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
+	cft->private  = MEMFILE_PRIVATE(idx, RES_USAGE);
+	cft->read_u64 = hugetlb_cgroup_read;
+
+	/* Add the MAX usage file */
+	cft = &h->cgroup_max_usage_file;
+	snprintf(cft->name, MAX_CFTYPE_NAME,
+		 "%s.max_usage_in_bytes", buf);
+	cft->private  = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+	cft->trigger  = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read;
+
 	return 0;
 }
 #else
-- 
1.7.9


* [RFC PATCH 3/6] hugetlbfs: Add new region handling functions.
  2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
  2012-02-10 21:36 ` [RFC PATCH 1/6] hugetlb: Add a new hugetlb cgroup Aneesh Kumar K.V
  2012-02-10 21:36 ` [RFC PATCH 2/6] hugetlbfs: Add usage and max usage files to " Aneesh Kumar K.V
@ 2012-02-10 21:36 ` Aneesh Kumar K.V
  2012-02-10 21:36 ` [RFC PATCH 4/6] hugetlbfs: Add controller support for shared mapping Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-10 21:36 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aneesh.kumar

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

These functions take an extra argument and only merge regions if the data
value matches. This helps us build regions with different hugetlb cgroup
values. The last patch in the series merges this into the existing region
code; keeping it separate for now allows us to add cgroup support for
shared and private mappings in separate patches.
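
A worked example of the intended semantics (the numbers are illustrative):
if a task in cgroup A has already reserved huge-page offsets [0, 4) of a
shared file and a task in cgroup B then maps [2, 6),
region_chg_with_same() returns a charge of 2 for B (only the
non-overlapping [4, 6) part), and region_add_with_same() later records
[4, 6) against B while [0, 4) stays attributed to A.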

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c |  127 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 127 insertions(+), 0 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index f2368ed..c4934c7 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -31,9 +31,136 @@ struct hugetlb_cgroup {
 	struct res_counter memhuge[HUGE_MAX_HSTATE];
 };
 
+struct file_region_with_data {
+	struct list_head link;
+	long from;
+	long to;
+	unsigned long data;
+};
+
 struct cgroup_subsys hugetlb_subsys __read_mostly;
 struct hugetlb_cgroup *root_h_cgroup __read_mostly;
 
+/*
+ * A variant of region_chg that merges regions only if the data
+ * matches.
+ */
+static long region_chg_with_same(struct list_head *head,
+				 long f, long t, unsigned long data)
+{
+	long chg = 0;
+	struct file_region_with_data *rg, *nrg, *trg;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+	/*
+	 * If we are below the current region then a new region is required.
+	 * Subtle, allocate a new region at the position but make it zero
+	 * size such that we can guarantee to record the reservation.
+	 */
+	if (&rg->link == head || t < rg->from) {
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (!nrg)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to = f;
+		nrg->data = data;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+		return t - f;
+	}
+	/*
+	 * f rg->from t rg->to
+	 */
+	if (f < rg->from && data != rg->data) {
+		/* we need to allocate a new region */
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (!nrg)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to = f;
+		nrg->data = data;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+	}
+
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+	chg = t - f;
+
+	/* Check for and consume any regions we now overlap with. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > t)
+			return chg;
+		/*
+		 * rg->from f rg->to t
+		 */
+		if (t > rg->to && data != rg->data) {
+			/* we need to allocate a new region */
+			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+			if (!nrg)
+				return -ENOMEM;
+			nrg->from = rg->to;
+			nrg->to  = rg->to;
+			nrg->data = data;
+			INIT_LIST_HEAD(&nrg->link);
+			list_add(&nrg->link, &rg->link);
+		}
+		/*
+		 * update charge
+		 */
+		if (rg->to > t) {
+			chg += rg->to - t;
+			t = rg->to;
+		}
+		chg -= rg->to - rg->from;
+	}
+	return chg;
+}
+
+static void region_add_with_same(struct list_head *head,
+				 long f, long t, unsigned long data)
+{
+	struct file_region_with_data *rg, *nrg, *trg;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+
+		if (rg->from > t)
+			return;
+		if (&rg->link == head)
+			return;
+
+		/*FIXME!! this can possibly delete few regions */
+		/* We need to worry only if we match data */
+		if (rg->data == data) {
+			if (f < rg->from)
+				rg->from = f;
+			if (t > rg->to) {
+				/* if we are the last entry */
+				if (rg->link.next == head) {
+					rg->to = t;
+					break;
+				} else {
+					nrg = list_entry(rg->link.next,
+							 typeof(*nrg), link);
+					rg->to = nrg->from;
+				}
+			}
+		}
+		f = rg->to;
+	}
+}
+
 static inline
 struct hugetlb_cgroup *css_to_hugetlbcgroup(struct cgroup_subsys_state *s)
 {
-- 
1.7.9


* [RFC PATCH 4/6] hugetlbfs: Add controller support for shared mapping
  2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2012-02-10 21:36 ` [RFC PATCH 3/6] hugetlbfs: Add new region handling functions Aneesh Kumar K.V
@ 2012-02-10 21:36 ` Aneesh Kumar K.V
  2012-02-10 21:36 ` [RFC PATCH 5/6] hugetlbfs: Add controller support for private mapping Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-10 21:36 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aneesh.kumar

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The HugeTLB controller is different from the memory controller in that we
charge the controller at mmap(2) time and not at fault time. This makes
sure userspace can fall back to non-hugepage allocation when mmap fails due
to the controller limit.

For shared mappings we need to track the hugetlb cgroup along with the range.
If two tasks in two different cgroups map the same area, only the
non-overlapping part should be charged to the second task; hence we track the
cgroup along with the range. We always charge during mmap(2) and uncharge
during truncate. The region list is kept in inode->i_mapping->private_list.
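
In terms of the call flow added below (a sketch of the intended behaviour):
hugetlb_page_charge() runs region_chg_with_same() on the inode's region list
and then res_counter_charge()s the current task's cgroup; on success
vma_commit_reservation()/hugetlb_reserve_pages() record the range together
with the cgroup via hugetlb_commit_page_charge(); hugetlb_unreserve_pages()
later calls hugetlb_truncate_cgroup_charge(), which drops the regions and
uncharges whichever cgroup each region was attributed to.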

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c  |  114 ++++++++++++++++++++++++++++++++++++++++
 fs/hugetlbfs/inode.c           |    1 +
 include/linux/hugetlb_cgroup.h |   40 ++++++++++++++
 mm/hugetlb.c                   |   85 +++++++++++++++++++++---------
 4 files changed, 214 insertions(+), 26 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index c4934c7..c478fb0 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -17,6 +17,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/res_counter.h>
+#include <linux/list.h>
 
 /* lifted from mem control */
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -344,3 +345,116 @@ struct cgroup_subsys hugetlb_subsys = {
 	.populate   = hugetlbcgroup_populate,
 	.subsys_id  = hugetlb_subsys_id,
 };
+
+long hugetlb_page_charge(struct list_head *head,
+			struct hstate *h, long f, long t)
+{
+	long chg;
+	int ret = 0, idx;
+	unsigned long csize;
+	struct hugetlb_cgroup *h_cg;
+	struct res_counter *fail_res;
+
+	/*
+	 * Get the task cgroup under rcu_read_lock and also take a
+	 * cgroup reference to make sure a cgroup destroy won't
+	 * race with page_charge. We don't allow a cgroup destroy
+	 * when the cgroup has some charge against it.
+	 */
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+	css_get(&h_cg->css);
+	rcu_read_unlock();
+
+	chg = region_chg_with_same(head, f, t, (unsigned long)h_cg);
+	if (chg < 0)
+		goto err_out;
+
+	if (hugetlb_cgroup_is_root(h_cg))
+		goto err_out;
+
+	csize = chg * huge_page_size(h);
+	idx = h - hstates;
+	ret = res_counter_charge(&h_cg->memhuge[idx], csize, &fail_res);
+
+err_out:
+	/* Now that we have charged we can drop cgroup reference */
+	css_put(&h_cg->css);
+	if (!ret)
+		return chg;
+
+	/* We don't worry about region_uncharge */
+	return ret;
+}
+
+void hugetlb_page_uncharge(struct list_head *head, int idx, int nr_pages)
+{
+	struct hugetlb_cgroup *h_cg;
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+
+	if (!hugetlb_cgroup_is_root(h_cg))
+		res_counter_uncharge(&h_cg->memhuge[idx], csize);
+	rcu_read_unlock();
+	/*
+	 * We could ideally remove zero size regions from
+	 * resv map hcg_regions here
+	 */
+	return;
+}
+
+void hugetlb_commit_page_charge(struct list_head *head, long f, long t)
+{
+	struct hugetlb_cgroup *h_cg;
+
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+	region_add_with_same(head, f, t, (unsigned long)h_cg);
+	rcu_read_unlock();
+	return;
+}
+
+long  hugetlb_truncate_cgroup_charge(struct hstate *h,
+				     struct list_head *head, long end)
+{
+	long chg = 0, csize;
+	int idx = h - hstates;
+	struct hugetlb_cgroup *h_cg;
+	struct file_region_with_data *rg, *trg;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (end <= rg->to)
+			break;
+	if (&rg->link == head)
+		return 0;
+
+	/* If we are in the middle of a region then adjust it. */
+	if (end > rg->from) {
+		chg = rg->to - end;
+		rg->to = end;
+		h_cg = (struct hugetlb_cgroup *)rg->data;
+		if (!hugetlb_cgroup_is_root(h_cg)) {
+			csize = chg * huge_page_size(h);
+			res_counter_uncharge(&h_cg->memhuge[idx], csize);
+		}
+		rg = list_entry(rg->link.next, typeof(*rg), link);
+	}
+
+	/* Drop any remaining regions. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		chg += rg->to - rg->from;
+		h_cg = (struct hugetlb_cgroup *)rg->data;
+		if (!hugetlb_cgroup_is_root(h_cg)) {
+			csize = (rg->to - rg->from) * huge_page_size(h);
+			res_counter_uncharge(&h_cg->memhuge[idx], csize);
+		}
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return chg;
+}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1e85a7a..2680578 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/magic.h>
 #include <linux/migrate.h>
+#include <linux/hugetlb_cgroup.h>
 
 #include <asm/uaccess.h>
 
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 11cd6c4..3131d62 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -15,8 +15,48 @@
 #ifndef _LINUX_HUGETLB_CGROUP_H
 #define _LINUX_HUGETLB_CGROUP_H
 
+extern long region_add(struct list_head *head, long f, long t);
+extern long region_chg(struct list_head *head, long f, long t);
+extern long region_truncate(struct list_head *head, long end);
+extern long region_count(struct list_head *head, long f, long t);
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
 extern u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft);
 extern int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
 				const char *buffer);
 extern int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event);
+extern long hugetlb_page_charge(struct list_head *head,
+				struct hstate *h, long f, long t);
+extern void hugetlb_page_uncharge(struct list_head *head,
+				  int idx, int nr_pages);
+extern void hugetlb_commit_page_charge(struct list_head *head, long f, long t);
+extern long hugetlb_truncate_cgroup_charge(struct hstate *h,
+					   struct list_head *head, long from);
+#else
+static inline long hugetlb_page_charge(struct list_head *head,
+				       struct hstate *h, long f, long t)
+{
+	return region_chg(head, f, t);
+}
+
+static inline void hugetlb_page_uncharge(struct list_head *head,
+					 int idx, int nr_pages)
+{
+	return;
+}
+
+static inline void hugetlb_commit_page_charge(struct list_head *head,
+					      long f, long t)
+{
+	region_add(head, f, t);
+	return;
+}
+
+static inline long hugetlb_truncate_cgroup_charge(struct hstate *h,
+						  struct list_head *head,
+						  long from)
+{
+	return region_truncate(head, from);
+}
+#endif /* CONFIG_CGROUP_HUGETLB_RES_CTLR */
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 865b41f..102410f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -78,7 +78,7 @@ struct file_region {
 	long to;
 };
 
-static long region_add(struct list_head *head, long f, long t)
+long region_add(struct list_head *head, long f, long t)
 {
 	struct file_region *rg, *nrg, *trg;
 
@@ -114,7 +114,7 @@ static long region_add(struct list_head *head, long f, long t)
 	return 0;
 }
 
-static long region_chg(struct list_head *head, long f, long t)
+long region_chg(struct list_head *head, long f, long t)
 {
 	struct file_region *rg, *nrg;
 	long chg = 0;
@@ -163,7 +163,7 @@ static long region_chg(struct list_head *head, long f, long t)
 	return chg;
 }
 
-static long region_truncate(struct list_head *head, long end)
+long region_truncate(struct list_head *head, long end)
 {
 	struct file_region *rg, *trg;
 	long chg = 0;
@@ -193,7 +193,7 @@ static long region_truncate(struct list_head *head, long end)
 	return chg;
 }
 
-static long region_count(struct list_head *head, long f, long t)
+long region_count(struct list_head *head, long f, long t)
 {
 	struct file_region *rg;
 	long chg = 0;
@@ -983,11 +983,11 @@ static long vma_needs_reservation(struct hstate *h,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
+
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		return region_chg(&inode->i_mapping->private_list,
-							idx, idx + 1);
-
+		return hugetlb_page_charge(&inode->i_mapping->private_list,
+					   h, idx, idx + 1);
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		return 1;
 
@@ -1002,16 +1002,33 @@ static long vma_needs_reservation(struct hstate *h,
 		return 0;
 	}
 }
+
+static void vma_uncharge_reservation(struct hstate *h,
+				     struct vm_area_struct *vma,
+				     unsigned long chg)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+
+	if (vma->vm_flags & VM_MAYSHARE) {
+		return hugetlb_page_uncharge(&inode->i_mapping->private_list,
+					     h - hstates,
+					     chg << huge_page_order(h));
+	}
+}
+
 static void vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
+
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		region_add(&inode->i_mapping->private_list, idx, idx + 1);
-
+		hugetlb_commit_page_charge(&inode->i_mapping->private_list,
+					   idx, idx + 1);
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *reservations = vma_resv_map(vma);
@@ -1040,9 +1057,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
 		return ERR_PTR(-VM_FAULT_OOM);
-	if (chg)
-		if (hugetlb_get_quota(inode->i_mapping, chg))
+	if (chg) {
+		if (hugetlb_get_quota(inode->i_mapping, chg)) {
+			vma_uncharge_reservation(h, vma, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
+		}
+	}
 
 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
@@ -1051,7 +1071,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (!page) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
-			hugetlb_put_quota(inode->i_mapping, chg);
+			if (chg) {
+				vma_uncharge_reservation(h, vma, chg);
+				hugetlb_put_quota(inode->i_mapping, chg);
+			}
 			return ERR_PTR(-VM_FAULT_SIGBUS);
 		}
 	}
@@ -1059,7 +1082,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	set_page_private(page, (unsigned long) mapping);
 
 	vma_commit_reservation(h, vma, addr);
-
 	return page;
 }
 
@@ -2961,9 +2983,10 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		chg = region_chg(&inode->i_mapping->private_list, from, to);
-	else {
+	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+		chg = hugetlb_page_charge(&inode->i_mapping->private_list,
+					  h, from, to);
+	} else {
 		struct resv_map *resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
@@ -2978,19 +3001,17 @@ int hugetlb_reserve_pages(struct inode *inode,
 		return chg;
 
 	/* There must be enough filesystem quota for the mapping */
-	if (hugetlb_get_quota(inode->i_mapping, chg))
-		return -ENOSPC;
-
+	if (hugetlb_get_quota(inode->i_mapping, chg)) {
+		ret = -ENOSPC;
+		goto err_quota;
+	}
 	/*
 	 * Check enough hugepages are available for the reservation.
 	 * Hand back the quota if there are not
 	 */
 	ret = hugetlb_acct_memory(h, chg);
-	if (ret < 0) {
-		hugetlb_put_quota(inode->i_mapping, chg);
-		return ret;
-	}
-
+	if (ret < 0)
+		goto err_acct_mem;
 	/*
 	 * Account for the reservations made. Shared mappings record regions
 	 * that have reservations as they are shared by multiple VMAs.
@@ -3003,14 +3024,26 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&inode->i_mapping->private_list, from, to);
+		hugetlb_commit_page_charge(&inode->i_mapping->private_list,
+					   from, to);
 	return 0;
+err_acct_mem:
+	hugetlb_put_quota(inode->i_mapping, chg);
+err_quota:
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
+		hugetlb_page_uncharge(&inode->i_mapping->private_list,
+				      h - hstates, chg << huge_page_order(h));
+	return ret;
+
 }
 
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
+	long chg;
 	struct hstate *h = hstate_inode(inode);
-	long chg = region_truncate(&inode->i_mapping->private_list, offset);
+
+	chg = hugetlb_truncate_cgroup_charge(h, &inode->i_mapping->private_list,
+					     offset);
 
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
-- 
1.7.9


* [RFC PATCH 5/6] hugetlbfs: Add controller support for private mapping
  2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2012-02-10 21:36 ` [RFC PATCH 4/6] hugetlbfs: Add controller support for shared mapping Aneesh Kumar K.V
@ 2012-02-10 21:36 ` Aneesh Kumar K.V
  2012-02-17  5:22   ` bill4carson
  2012-02-10 21:36 ` [RFC PATCH 6/6] hugetlbfs: Switch to new region APIs Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-10 21:36 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aneesh.kumar

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The HugeTLB controller is different from the memory controller in that we
charge the controller at mmap(2) time and not at fault time. This makes
sure userspace can fall back to non-hugepage allocation when mmap fails due
to the controller limit.

For private mappings we always charge/uncharge the current task's cgroup.
Charging happens during mmap(2) and uncharging happens in
vm_operations->close when the resv_map refcount reaches zero. The uncharge
count is stored in struct resv_map. For a child task after fork, charging
happens at fault time in alloc_huge_page. We also need to make sure every
private hugeTLB vma has a struct resv_map allocated so that we can store the
uncharge count in it.
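
A sketch of the resulting flow: hugetlb_reserve_pages() allocates the
resv_map, charges via hugetlb_priv_page_charge() and records the charged
page count in resv_map->nr_pages[]; for a child that inherited the VMA,
hugetlb_vm_op_open() allocates a resv_map and vma_needs_reservation()
charges at fault time; when the last reference is dropped,
resv_map_release() calls hugetlb_priv_page_uncharge() for each hstate.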

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c  |   50 ++++++++++++++++++++++++++++++++
 include/linux/hugetlb.h        |    7 ++++
 include/linux/hugetlb_cgroup.h |   16 ++++++++++
 mm/hugetlb.c                   |   62 ++++++++++++++++++++++++++++++++--------
 4 files changed, 123 insertions(+), 12 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index c478fb0..f828fb2 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -458,3 +458,53 @@ long  hugetlb_truncate_cgroup_charge(struct hstate *h,
 	}
 	return chg;
 }
+
+int hugetlb_priv_page_charge(struct resv_map *map, struct hstate *h, long chg)
+{
+	long csize;
+	int idx, ret;
+	struct hugetlb_cgroup *h_cg;
+	struct res_counter *fail_res;
+
+	/*
+	 * Get the task cgroup under rcu_read_lock and also take a
+	 * cgroup reference to make sure a cgroup destroy won't
+	 * race with page_charge. We don't allow a cgroup destroy
+	 * when the cgroup has some charge against it.
+	 */
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+	css_get(&h_cg->css);
+	rcu_read_unlock();
+
+	if (hugetlb_cgroup_is_root(h_cg)) {
+		ret = chg;
+		goto err_out;
+	}
+
+	csize = chg * huge_page_size(h);
+	idx = h - hstates;
+	ret = res_counter_charge(&h_cg->memhuge[idx], csize, &fail_res);
+	if (!ret) {
+		map->nr_pages[idx] += chg << huge_page_order(h);
+		ret = chg;
+	}
+err_out:
+	css_put(&h_cg->css);
+	return ret;
+}
+
+void hugetlb_priv_page_uncharge(struct resv_map *map, int idx, int nr_pages)
+{
+	struct hugetlb_cgroup *h_cg;
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+	if (!hugetlb_cgroup_is_root(h_cg)) {
+		res_counter_uncharge(&h_cg->memhuge[idx], csize);
+		map->nr_pages[idx] -= nr_pages;
+	}
+	rcu_read_unlock();
+	return;
+}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4392b6a..e2ba381 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -233,6 +233,12 @@ struct hstate {
 	char name[HSTATE_NAME_LEN];
 };
 
+struct resv_map {
+	struct kref refs;
+	int nr_pages[HUGE_MAX_HSTATE];
+	struct list_head regions;
+};
+
 struct huge_bootmem_page {
 	struct list_head list;
 	struct hstate *hstate;
@@ -323,6 +329,7 @@ static inline unsigned hstate_index_to_shift(unsigned index)
 
 #else
 struct hstate {};
+struct resv_map {};
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 3131d62..c3738df 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -32,6 +32,10 @@ extern void hugetlb_page_uncharge(struct list_head *head,
 extern void hugetlb_commit_page_charge(struct list_head *head, long f, long t);
 extern long hugetlb_truncate_cgroup_charge(struct hstate *h,
 					   struct list_head *head, long from);
+extern int hugetlb_priv_page_charge(struct resv_map *map,
+				    struct hstate *h, long chg);
+extern void hugetlb_priv_page_uncharge(struct resv_map *map,
+				       int idx, int nr_pages);
 #else
 static inline long hugetlb_page_charge(struct list_head *head,
 				       struct hstate *h, long f, long t)
@@ -58,5 +62,17 @@ static inline long hugetlb_truncate_cgroup_charge(struct hstate *h,
 {
 	return region_truncate(head, from);
 }
+
+static inline int hugetlb_priv_page_charge(struct resv_map *map,
+					   struct hstate *h, long chg)
+{
+	return chg;
+}
+
+static inline void hugetlb_priv_page_uncharge(struct resv_map *map,
+					      int idx, int nr_pages)
+{
+	return;
+}
 #endif /* CONFIG_CGROUP_HUGETLB_RES_CTLR */
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 102410f..5a91838 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -303,14 +303,9 @@ static void set_vma_private_data(struct vm_area_struct *vma,
 	vma->vm_private_data = (void *)value;
 }
 
-struct resv_map {
-	struct kref refs;
-	struct list_head regions;
-};
-
 static struct resv_map *resv_map_alloc(void)
 {
-	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
+	struct resv_map *resv_map = kzalloc(sizeof(*resv_map), GFP_KERNEL);
 	if (!resv_map)
 		return NULL;
 
@@ -322,10 +317,16 @@ static struct resv_map *resv_map_alloc(void)
 
 static void resv_map_release(struct kref *ref)
 {
+	int idx;
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
 	region_truncate(&resv_map->regions, 0);
+	/* drop the hugetlb cgroup charge */
+	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
+		hugetlb_priv_page_uncharge(resv_map, idx,
+					   resv_map->nr_pages[idx]);
+	}
 	kfree(resv_map);
 }
 
@@ -989,9 +990,20 @@ static long vma_needs_reservation(struct hstate *h,
 		return hugetlb_page_charge(&inode->i_mapping->private_list,
 					   h, idx, idx + 1);
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
-		return 1;
-
+		struct resv_map *resv_map = vma_resv_map(vma);
+		if (!resv_map) {
+			/*
+			 * We didn't allocate resv_map for this vma.
+			 * Allocate it here.
+			 */
+			resv_map = resv_map_alloc();
+			if (!resv_map)
+				return -ENOMEM;
+			set_vma_resv_map(vma, resv_map);
+		}
+		return hugetlb_priv_page_charge(resv_map, h, 1);
 	} else  {
+		/* We did the priv page charging in mmap call */
 		long err;
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *reservations = vma_resv_map(vma);
@@ -1007,14 +1019,20 @@ static void vma_uncharge_reservation(struct hstate *h,
 				     struct vm_area_struct *vma,
 				     unsigned long chg)
 {
+	int idx = h - hstates;
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		return hugetlb_page_uncharge(&inode->i_mapping->private_list,
-					     h - hstates,
-					     chg << huge_page_order(h));
+					     idx, chg << huge_page_order(h));
+	} else {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		return hugetlb_priv_page_uncharge(resv_map,
+						  idx,
+						  chg << huge_page_order(h));
 	}
 }
 
@@ -2165,6 +2183,22 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 	 */
 	if (reservations)
 		kref_get(&reservations->refs);
+	else if (!(vma->vm_flags & VM_MAYSHARE)) {
+		/*
+		 * for non shared vma we need resv map to track
+		 * hugetlb cgroup usage. Allocate it here. Charging
+		 * the cgroup will take place in fault path.
+		 */
+		struct resv_map *resv_map = resv_map_alloc();
+		/*
+		 * If we fail to allocate resv_map here. We will allocate
+		 * one when we do alloc_huge_page. So we don't handle
+		 * ENOMEM here. The function also return void. So there is
+		 * nothing much we can do.
+		 */
+		if (resv_map)
+			set_vma_resv_map(vma, resv_map);
+	}
 }
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
@@ -2968,7 +3002,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 {
 	long ret, chg;
 	struct hstate *h = hstate_inode(inode);
-
+	struct resv_map *resv_map = NULL;
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
 	 * attempt will be made for VM_NORESERVE to allocate a page
@@ -2987,7 +3021,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 		chg = hugetlb_page_charge(&inode->i_mapping->private_list,
 					  h, from, to);
 	} else {
-		struct resv_map *resv_map = resv_map_alloc();
+		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
 
@@ -2995,6 +3029,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 
 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
+		chg = hugetlb_priv_page_charge(resv_map, h, chg);
 	}
 
 	if (chg < 0)
@@ -3033,6 +3068,9 @@ err_quota:
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		hugetlb_page_uncharge(&inode->i_mapping->private_list,
 				      h - hstates, chg << huge_page_order(h));
+	else
+		hugetlb_priv_page_uncharge(resv_map, h - hstates,
+					   chg << huge_page_order(h));
 	return ret;
 
 }
-- 
1.7.9


* [RFC PATCH 6/6] hugetlbfs: Switch to new region APIs
  2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2012-02-10 21:36 ` [RFC PATCH 5/6] hugetlbfs: Add controller support for private mapping Aneesh Kumar K.V
@ 2012-02-10 21:36 ` Aneesh Kumar K.V
  2012-02-11 12:37 ` [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Hillf Danton
  2012-02-14  6:58 ` KAMEZAWA Hiroyuki
  7 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-10 21:36 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aneesh.kumar

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Switch the remaining callers to the new region APIs, move the region code to
fs/hugetlbfs/region.c, and remove the old code, which is no longer used.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/Makefile          |    2 +-
 fs/hugetlbfs/hugetlb_cgroup.c  |  135 +--------------------------
 fs/hugetlbfs/region.c          |  202 ++++++++++++++++++++++++++++++++++++++++
 include/linux/hugetlb_cgroup.h |   17 +++-
 mm/hugetlb.c                   |  163 +--------------------------------
 5 files changed, 222 insertions(+), 297 deletions(-)
 create mode 100644 fs/hugetlbfs/region.c

diff --git a/fs/hugetlbfs/Makefile b/fs/hugetlbfs/Makefile
index 986c778..3c544fe 100644
--- a/fs/hugetlbfs/Makefile
+++ b/fs/hugetlbfs/Makefile
@@ -4,5 +4,5 @@
 
 obj-$(CONFIG_HUGETLBFS) += hugetlbfs.o
 
-hugetlbfs-objs := inode.o
+hugetlbfs-objs := inode.o region.o
 hugetlbfs-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index f828fb2..b30db96 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -18,6 +18,8 @@
 #include <linux/hugetlb.h>
 #include <linux/res_counter.h>
 #include <linux/list.h>
+#include <linux/hugetlb_cgroup.h>
+
 
 /* lifted from mem control */
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -32,136 +34,9 @@ struct hugetlb_cgroup {
 	struct res_counter memhuge[HUGE_MAX_HSTATE];
 };
 
-struct file_region_with_data {
-	struct list_head link;
-	long from;
-	long to;
-	unsigned long data;
-};
-
 struct cgroup_subsys hugetlb_subsys __read_mostly;
 struct hugetlb_cgroup *root_h_cgroup __read_mostly;
 
-/*
- * A variant of region_chg that merges regions only if the data
- * matches.
- */
-static long region_chg_with_same(struct list_head *head,
-				 long f, long t, unsigned long data)
-{
-	long chg = 0;
-	struct file_region_with_data *rg, *nrg, *trg;
-
-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-	/*
-	 * If we are below the current region then a new region is required.
-	 * Subtle, allocate a new region at the position but make it zero
-	 * size such that we can guarantee to record the reservation.
-	 */
-	if (&rg->link == head || t < rg->from) {
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
-		nrg->from = f;
-		nrg->to = f;
-		nrg->data = data;
-		INIT_LIST_HEAD(&nrg->link);
-		list_add(&nrg->link, rg->link.prev);
-		return t - f;
-	}
-	/*
-	 * f rg->from t rg->to
-	 */
-	if (f < rg->from && data != rg->data) {
-		/* we need to allocate a new region */
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
-		nrg->from = f;
-		nrg->to = f;
-		nrg->data = data;
-		INIT_LIST_HEAD(&nrg->link);
-		list_add(&nrg->link, rg->link.prev);
-	}
-
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-	chg = t - f;
-
-	/* Check for and consume any regions we now overlap with. */
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			return chg;
-		/*
-		 * rg->from f rg->to t
-		 */
-		if (t > rg->to && data != rg->data) {
-			/* we need to allocate a new region */
-			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-			if (!nrg)
-				return -ENOMEM;
-			nrg->from = rg->to;
-			nrg->to  = rg->to;
-			nrg->data = data;
-			INIT_LIST_HEAD(&nrg->link);
-			list_add(&nrg->link, &rg->link);
-		}
-		/*
-		 * update charge
-		 */
-		if (rg->to > t) {
-			chg += rg->to - t;
-			t = rg->to;
-		}
-		chg -= rg->to - rg->from;
-	}
-	return chg;
-}
-
-static void region_add_with_same(struct list_head *head,
-				 long f, long t, unsigned long data)
-{
-	struct file_region_with_data *rg, *nrg, *trg;
-
-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-
-		if (rg->from > t)
-			return;
-		if (&rg->link == head)
-			return;
-
-		/*FIXME!! this can possibly delete few regions */
-		/* We need to worry only if we match data */
-		if (rg->data == data) {
-			if (f < rg->from)
-				rg->from = f;
-			if (t > rg->to) {
-				/* if we are the last entry */
-				if (rg->link.next == head) {
-					rg->to = t;
-					break;
-				} else {
-					nrg = list_entry(rg->link.next,
-							 typeof(*nrg), link);
-					rg->to = nrg->from;
-				}
-			}
-		}
-		f = rg->to;
-	}
-}
-
 static inline
 struct hugetlb_cgroup *css_to_hugetlbcgroup(struct cgroup_subsys_state *s)
 {
@@ -366,7 +241,7 @@ long hugetlb_page_charge(struct list_head *head,
 	css_get(&h_cg->css);
 	rcu_read_unlock();
 
-	chg = region_chg_with_same(head, f, t, (unsigned long)h_cg);
+	chg = region_chg(head, f, t, (unsigned long)h_cg);
 	if (chg < 0)
 		goto err_out;
 
@@ -411,7 +286,7 @@ void hugetlb_commit_page_charge(struct list_head *head, long f, long t)
 
 	rcu_read_lock();
 	h_cg = task_hugetlbcgroup(current);
-	region_add_with_same(head, f, t, (unsigned long)h_cg);
+	region_add(head, f, t, (unsigned long)h_cg);
 	rcu_read_unlock();
 	return;
 }
@@ -422,7 +297,7 @@ long  hugetlb_truncate_cgroup_charge(struct hstate *h,
 	long chg = 0, csize;
 	int idx = h - hstates;
 	struct hugetlb_cgroup *h_cg;
-	struct file_region_with_data *rg, *trg;
+	struct file_region *rg, *trg;
 
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
diff --git a/fs/hugetlbfs/region.c b/fs/hugetlbfs/region.c
new file mode 100644
index 0000000..d2445fb
--- /dev/null
+++ b/fs/hugetlbfs/region.c
@@ -0,0 +1,202 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/list.h>
+#include <linux/hugetlb_cgroup.h>
+
+/*
+ * Region tracking -- allows tracking of reservations and instantiated pages
+ *                    across the pages in a mapping.
+ *
+ * The region data structures are protected by a combination of the mmap_sem
+ * and the hugetlb_instantion_mutex.  To access or modify a region the caller
+ * must either hold the mmap_sem for write, or the mmap_sem for read and
+ * the hugetlb_instantiation mutex:
+ *
+ *	down_write(&mm->mmap_sem);
+ * or
+ *	down_read(&mm->mmap_sem);
+ *	mutex_lock(&hugetlb_instantiation_mutex);
+ */
+
+long region_chg(struct list_head *head, long f, long t, unsigned long data)
+{
+	long chg = 0;
+	struct file_region *rg, *nrg, *trg;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+	/*
+	 * If we are below the current region then a new region is required.
+	 * Subtle, allocate a new region at the position but make it zero
+	 * size such that we can guarantee to record the reservation.
+	 */
+	if (&rg->link == head || t < rg->from) {
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (!nrg)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to = f;
+		nrg->data = data;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+		return t - f;
+	}
+	/*
+	 * f rg->from t rg->to
+	 */
+	if (f < rg->from && data != rg->data) {
+		/* we need to allocate a new region */
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (!nrg)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to = f;
+		nrg->data = data;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+	}
+
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+	chg = t - f;
+
+	/* Check for and consume any regions we now overlap with. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > t)
+			return chg;
+		/*
+		 * rg->from f rg->to t
+		 */
+		if (t > rg->to && data != rg->data) {
+			/* we need to allocate a new region */
+			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+			if (!nrg)
+				return -ENOMEM;
+			nrg->from = rg->to;
+			nrg->to  = rg->to;
+			nrg->data = data;
+			INIT_LIST_HEAD(&nrg->link);
+			list_add(&nrg->link, &rg->link);
+		}
+		/*
+		 * update charge
+		 */
+		if (rg->to > t) {
+			chg += rg->to - t;
+			t = rg->to;
+		}
+		chg -= rg->to - rg->from;
+	}
+	return chg;
+}
+
+void region_add(struct list_head *head, long f, long t, unsigned long data)
+{
+	struct file_region *rg, *nrg, *trg;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+
+		if (rg->from > t)
+			return;
+		if (&rg->link == head)
+			return;
+
+		/*FIXME!! this can possibly delete few regions */
+		/* We need to worry only if we match data */
+		if (rg->data == data) {
+			if (f < rg->from)
+				rg->from = f;
+			if (t > rg->to) {
+				/* if we are the last entry */
+				if (rg->link.next == head) {
+					rg->to = t;
+					break;
+				} else {
+					nrg = list_entry(rg->link.next,
+							 typeof(*nrg), link);
+					rg->to = nrg->from;
+				}
+			}
+		}
+		f = rg->to;
+	}
+}
+
+long region_truncate(struct list_head *head, long end)
+{
+	struct file_region *rg, *trg;
+	long chg = 0;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (end <= rg->to)
+			break;
+	if (&rg->link == head)
+		return 0;
+
+	/* If we are in the middle of a region then adjust it. */
+	if (end > rg->from) {
+		chg = rg->to - end;
+		rg->to = end;
+		rg = list_entry(rg->link.next, typeof(*rg), link);
+	}
+
+	/* Drop any remaining regions. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		chg += rg->to - rg->from;
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return chg;
+}
+
+long region_count(struct list_head *head, long f, long t)
+{
+	struct file_region *rg;
+	long chg = 0;
+
+	/* Locate each segment we overlap with, and count that overlap. */
+	list_for_each_entry(rg, head, link) {
+		int seg_from;
+		int seg_to;
+
+		if (rg->to <= f)
+			continue;
+		if (rg->from >= t)
+			break;
+
+		seg_from = max(rg->from, f);
+		seg_to = min(rg->to, t);
+
+		chg += seg_to - seg_from;
+	}
+
+	return chg;
+}
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index c3738df..c2256ee 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -15,8 +15,16 @@
 #ifndef _LINUX_HUGETLB_CGROUP_H
 #define _LINUX_HUGETLB_CGROUP_H
 
-extern long region_add(struct list_head *head, long f, long t);
-extern long region_chg(struct list_head *head, long f, long t);
+struct file_region {
+	long from, to;
+	unsigned long data;
+	struct list_head link;
+};
+
+extern long region_chg(struct list_head *head, long f, long t,
+		       unsigned long data);
+extern void region_add(struct list_head *head, long f, long t,
+		       unsigned long data);
 extern long region_truncate(struct list_head *head, long end);
 extern long region_count(struct list_head *head, long f, long t);
 
@@ -40,7 +48,7 @@ extern void hugetlb_priv_page_uncharge(struct resv_map *map,
 static inline long hugetlb_page_charge(struct list_head *head,
 				       struct hstate *h, long f, long t)
 {
-	return region_chg(head, f, t);
+	return region_chg(head, f, t, 0);
 }
 
 static inline void hugetlb_page_uncharge(struct list_head *head,
@@ -52,8 +60,7 @@ static inline void hugetlb_page_uncharge(struct list_head *head,
 static inline void hugetlb_commit_page_charge(struct list_head *head,
 					      long f, long t)
 {
-	region_add(head, f, t);
-	return;
+	region_add(head, f, t, 0);
 }
 
 static inline long hugetlb_truncate_cgroup_charge(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5a91838..950793f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -59,165 +59,6 @@ static unsigned long __initdata default_hstate_size;
 static DEFINE_SPINLOCK(hugetlb_lock);
 
 /*
- * Region tracking -- allows tracking of reservations and instantiated pages
- *                    across the pages in a mapping.
- *
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantion_mutex.  To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
- *
- *	down_write(&mm->mmap_sem);
- * or
- *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
- */
-struct file_region {
-	struct list_head link;
-	long from;
-	long to;
-};
-
-long region_add(struct list_head *head, long f, long t)
-{
-	struct file_region *rg, *nrg, *trg;
-
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-
-	/* Check for and consume any regions we now overlap with. */
-	nrg = rg;
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			break;
-
-		/* If this area reaches higher then extend our area to
-		 * include it completely.  If this is not the first area
-		 * which we intend to reuse, free it. */
-		if (rg->to > t)
-			t = rg->to;
-		if (rg != nrg) {
-			list_del(&rg->link);
-			kfree(rg);
-		}
-	}
-	nrg->from = f;
-	nrg->to = t;
-	return 0;
-}
-
-long region_chg(struct list_head *head, long f, long t)
-{
-	struct file_region *rg, *nrg;
-	long chg = 0;
-
-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	/* If we are below the current region then a new region is required.
-	 * Subtle, allocate a new region at the position but make it zero
-	 * size such that we can guarantee to record the reservation. */
-	if (&rg->link == head || t < rg->from) {
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
-		nrg->from = f;
-		nrg->to   = f;
-		INIT_LIST_HEAD(&nrg->link);
-		list_add(&nrg->link, rg->link.prev);
-
-		return t - f;
-	}
-
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-	chg = t - f;
-
-	/* Check for and consume any regions we now overlap with. */
-	list_for_each_entry(rg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			return chg;
-
-		/* We overlap with this area, if it extends further than
-		 * us then we must extend ourselves.  Account for its
-		 * existing reservation. */
-		if (rg->to > t) {
-			chg += rg->to - t;
-			t = rg->to;
-		}
-		chg -= rg->to - rg->from;
-	}
-	return chg;
-}
-
-long region_truncate(struct list_head *head, long end)
-{
-	struct file_region *rg, *trg;
-	long chg = 0;
-
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (end <= rg->to)
-			break;
-	if (&rg->link == head)
-		return 0;
-
-	/* If we are in the middle of a region then adjust it. */
-	if (end > rg->from) {
-		chg = rg->to - end;
-		rg->to = end;
-		rg = list_entry(rg->link.next, typeof(*rg), link);
-	}
-
-	/* Drop any remaining regions. */
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		chg += rg->to - rg->from;
-		list_del(&rg->link);
-		kfree(rg);
-	}
-	return chg;
-}
-
-long region_count(struct list_head *head, long f, long t)
-{
-	struct file_region *rg;
-	long chg = 0;
-
-	/* Locate each segment we overlap with, and count that overlap. */
-	list_for_each_entry(rg, head, link) {
-		int seg_from;
-		int seg_to;
-
-		if (rg->to <= f)
-			continue;
-		if (rg->from >= t)
-			break;
-
-		seg_from = max(rg->from, f);
-		seg_to = min(rg->to, t);
-
-		chg += seg_to - seg_from;
-	}
-
-	return chg;
-}
-
-/*
  * Convert the address within this vma to the page offset within
  * the mapping, in pagecache page units; huge pages here.
  */
@@ -1008,7 +849,7 @@ static long vma_needs_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *reservations = vma_resv_map(vma);
 
-		err = region_chg(&reservations->regions, idx, idx + 1);
+		err = region_chg(&reservations->regions, idx, idx + 1, 0);
 		if (err < 0)
 			return err;
 		return 0;
@@ -1052,7 +893,7 @@ static void vma_commit_reservation(struct hstate *h,
 		struct resv_map *reservations = vma_resv_map(vma);
 
 		/* Mark this page used in the map. */
-		region_add(&reservations->regions, idx, idx + 1);
+		region_add(&reservations->regions, idx, idx + 1, 0);
 	}
 }
 
-- 
1.7.9
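
For reference, a minimal sketch (not part of the patch) of how a caller is
expected to drive the new interface, mirroring the charge/commit pattern of
the hugetlb_page_charge()/hugetlb_commit_page_charge() wrappers above; the
function name and the cgroup charging step in the middle are illustrative
only:

/*
 * Reserve huge pages [from, to) in an inode's region list on behalf of
 * the hugetlb cgroup identified by 'h_cg_data' (carried in the opaque
 * 'data' slot of each file_region).
 */
static long example_reserve_range(struct list_head *regions,
				  long from, long to, unsigned long h_cg_data)
{
	long chg;

	/* pages in [from, to) not already reserved for this cgroup */
	chg = region_chg(regions, from, to, h_cg_data);
	if (chg < 0)
		return chg;	/* -ENOMEM */

	/* ... charge 'chg' huge pages against the cgroup here ... */

	/* commit the reservation, tagged with the owning cgroup */
	region_add(regions, from, to, h_cg_data);
	return chg;
}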


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs
  2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2012-02-10 21:36 ` [RFC PATCH 6/6] hugetlbfs: Switch to new region APIs Aneesh Kumar K.V
@ 2012-02-11 12:37 ` Hillf Danton
  2012-02-12 17:44   ` Aneesh Kumar K.V
  2012-02-14  6:58 ` KAMEZAWA Hiroyuki
  7 siblings, 1 reply; 15+ messages in thread
From: Hillf Danton @ 2012-02-11 12:37 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, Andrea Arcangeli,
	Michal Hocko, Andrew Morton, Johannes Weiner

On Sat, Feb 11, 2012 at 5:36 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> Hi,
>
> This patchset implements a cgroup resource controller for HugeTLB pages.
> It is similar to the existing hugetlb quota support in that the limit is
> enforced at mmap(2) time and not at fault time. HugeTLB quota limit the
> number of huge pages that can be allocated per superblock.
>

Hello Aneesh

Thanks for your work:)

Mind posting the whole patchset on LKML with Andrea, Michal,
Johannes and Andrew also Cc'ed, for more eyes and thoughts?

Good weekend
Hillf


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs
  2012-02-11 12:37 ` [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Hillf Danton
@ 2012-02-12 17:44   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-12 17:44 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, mgorman, kamezawa.hiroyu, Andrea Arcangeli,
	Michal Hocko, Andrew Morton, Johannes Weiner

On Sat, 11 Feb 2012 20:37:23 +0800, Hillf Danton <dhillf@gmail.com> wrote:
> On Sat, Feb 11, 2012 at 5:36 AM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > This patchset implements a cgroup resource controller for HugeTLB pages.
> > It is similar to the existing hugetlb quota support in that the limit is
> > enforced at mmap(2) time and not at fault time. HugeTLB quota limit the
> > number of huge pages that can be allocated per superblock.
> >
> 
> Hello Aneesh
> 
> Thanks for your work:)
> 
> Mind posting the whole patchset on LKML with Andrea, Michal,
> Johannes and Andrew also Cc'ed, for more eyes and thoughts?
> 

Will do in the next iteration 

-aneesh


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs
  2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2012-02-11 12:37 ` [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Hillf Danton
@ 2012-02-14  6:58 ` KAMEZAWA Hiroyuki
  2012-02-17  8:00   ` Aneesh Kumar K.V
  2012-02-17  8:03     ` Aneesh Kumar K.V
  7 siblings, 2 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  6:58 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linux-mm, mgorman, dhillf

On Sat, 11 Feb 2012 03:06:40 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> Hi,
> 
> This patchset implements a cgroup resource controller for HugeTLB pages.
> It is similar to the existing hugetlb quota support in that the limit is
> enforced at mmap(2) time and not at fault time. HugeTLB quota limit the
> number of huge pages that can be allocated per superblock.
> 
> For shared mapping we track the region mapped by a task along with the
> hugetlb cgroup in inode region list. We keep the hugetlb cgroup charged
> even after the task that did mmap(2) exit. The uncharge happens during
> file truncate. For Private mapping we charge and uncharge from the current
> task cgroup.
> 

Hm, could you provide a Documentation/cgroup/hugetlb.txt with this RFC?
It would make clear what your patch does.

I wonder whether this should be under the memory cgroup or not. In the first
design of cgroups, I think one-feature-one-subsystem was considered good.

But in recent discussions, I tend to hear that it's hard to use.
Now, memory cgroup has 

   memory.xxxxx for controlling anon/file
   memory.memsw.xxxx for controlling memory+swap
   memory.kmem.tcp_xxxx for tcp controlling.

How about memory.hugetlb.xxxxx ?


> The current patchset doesn't support cgroup hierarchy. We also don't
> allow task migration across cgroup.

What happens when a user destroys a cgroup which contains alive hugetlb pages ?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 5/6] hugetlbfs: Add controller support for private mapping
  2012-02-10 21:36 ` [RFC PATCH 5/6] hugetlbfs: Add controller support for private mapping Aneesh Kumar K.V
@ 2012-02-17  5:22   ` bill4carson
  2012-02-17  8:05     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 15+ messages in thread
From: bill4carson @ 2012-02-17  5:22 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf



On 2012年02月11日 05:36, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<aneesh.kumar@linux.vnet.ibm.com>
>
> HugeTLB controller is different from a memory controller in that we charge
> controller during mmap() time and not fault time. This make sure userspace
> can fallback to non-hugepage allocation when mmap fails due to controller
> limit.
>
> For private mapping we always charge/uncharge from the current task cgroup.
> Charging happens during mmap(2) and uncharge happens during the
> vm_operations->close when resv_map refcount reaches zero. The uncharge count
> is stored in struct resv_map. For child task after fork the charging happens
> during fault time in alloc_huge_page. We also need to make sure for private
> mapping each vma for hugeTLB mapping have struct resv_map allocated so that we
> can store the uncharge count in resv_map.
>
> Signed-off-by: Aneesh Kumar K.V<aneesh.kumar@linux.vnet.ibm.com>
> ---
>   fs/hugetlbfs/hugetlb_cgroup.c  |   50 ++++++++++++++++++++++++++++++++
>   include/linux/hugetlb.h        |    7 ++++
>   include/linux/hugetlb_cgroup.h |   16 ++++++++++
>   mm/hugetlb.c                   |   62 ++++++++++++++++++++++++++++++++--------
>   4 files changed, 123 insertions(+), 12 deletions(-)
>
> diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
> index c478fb0..f828fb2 100644
> --- a/fs/hugetlbfs/hugetlb_cgroup.c
> +++ b/fs/hugetlbfs/hugetlb_cgroup.c
> @@ -458,3 +458,53 @@ long  hugetlb_truncate_cgroup_charge(struct hstate *h,
>   	}
>   	return chg;
>   }
> +
> +int hugetlb_priv_page_charge(struct resv_map *map, struct hstate *h, long chg)
> +{
> +	long csize;
> +	int idx, ret;
> +	struct hugetlb_cgroup *h_cg;
> +	struct res_counter *fail_res;
> +
> +	/*
> +	 * Get the task cgroup within rcu_readlock and also
> +	 * get cgroup reference to make sure cgroup destroy won't
> +	 * race with page_charge. We don't allow a cgroup destroy
> +	 * when the cgroup have some charge against it
> +	 */
> +	rcu_read_lock();
> +	h_cg = task_hugetlbcgroup(current);
> +	css_get(&h_cg->css);
> +	rcu_read_unlock();
> +
> +	if (hugetlb_cgroup_is_root(h_cg)) {
> +		ret = chg;
> +		goto err_out;
> +	}
> +
> +	csize = chg * huge_page_size(h);
> +	idx = h - hstates;
> +	ret = res_counter_charge(&h_cg->memhuge[idx], csize,&fail_res);
> +	if (!ret) {
> +		map->nr_pages[idx] += chg<<  huge_page_order(h);
> +		ret = chg;
> +	}
> +err_out:
> +	css_put(&h_cg->css);
> +	return ret;
> +}
> +
> +void hugetlb_priv_page_uncharge(struct resv_map *map, int idx, int nr_pages)
> +{
> +	struct hugetlb_cgroup *h_cg;
> +	unsigned long csize = nr_pages * PAGE_SIZE;
> +
> +	rcu_read_lock();
> +	h_cg = task_hugetlbcgroup(current);
> +	if (!hugetlb_cgroup_is_root(h_cg)) {
> +		res_counter_uncharge(&h_cg->memhuge[idx], csize);
> +		map->nr_pages[idx] -= nr_pages;
> +	}
> +	rcu_read_unlock();
> +	return;
> +}
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 4392b6a..e2ba381 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -233,6 +233,12 @@ struct hstate {
>   	char name[HSTATE_NAME_LEN];
>   };
>
> +struct resv_map {
> +	struct kref refs;
> +	int nr_pages[HUGE_MAX_HSTATE];
> +	struct list_head regions;
> +};
> +

Please put resv_map after the HUGE_MAX_HSTATE definition,
otherwise it will break on non-x86 arches, which have no
HUGE_MAX_HSTATE definition.


#ifndef HUGE_MAX_HSTATE
#define HUGE_MAX_HSTATE 1
#endif

+struct resv_map {
+	struct kref refs;
+	int nr_pages[HUGE_MAX_HSTATE];
+	struct list_head regions;
+};




>   struct huge_bootmem_page {
>   	struct list_head list;
>   	struct hstate *hstate;
> @@ -323,6 +329,7 @@ static inline unsigned hstate_index_to_shift(unsigned index)
>
>   #else
>   struct hstate {};
> +struct resv_map {};
>   #define alloc_huge_page_node(h, nid) NULL
>   #define alloc_bootmem_huge_page(h) NULL
>   #define hstate_file(f) NULL
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 3131d62..c3738df 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -32,6 +32,10 @@ extern void hugetlb_page_uncharge(struct list_head *head,
>   extern void hugetlb_commit_page_charge(struct list_head *head, long f, long t);
>   extern long hugetlb_truncate_cgroup_charge(struct hstate *h,
>   					   struct list_head *head, long from);
> +extern int hugetlb_priv_page_charge(struct resv_map *map,
> +				    struct hstate *h, long chg);
> +extern void hugetlb_priv_page_uncharge(struct resv_map *map,
> +				       int idx, int nr_pages);
>   #else
>   static inline long hugetlb_page_charge(struct list_head *head,
>   				       struct hstate *h, long f, long t)
> @@ -58,5 +62,17 @@ static inline long hugetlb_truncate_cgroup_charge(struct hstate *h,
>   {
>   	return region_truncate(head, from);
>   }
> +
> +static inline int hugetlb_priv_page_charge(struct resv_map *map,
> +					   struct hstate *h, long chg)
> +{
> +	return chg;
> +}
> +
> +static inline void hugetlb_priv_page_uncharge(struct resv_map *map,
> +					      int idx, int nr_pages)
> +{
> +	return;
> +}
>   #endif /* CONFIG_CGROUP_HUGETLB_RES_CTLR */
>   #endif
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 102410f..5a91838 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -303,14 +303,9 @@ static void set_vma_private_data(struct vm_area_struct *vma,
>   	vma->vm_private_data = (void *)value;
>   }
>
> -struct resv_map {
> -	struct kref refs;
> -	struct list_head regions;
> -};
> -
>   static struct resv_map *resv_map_alloc(void)
>   {
> -	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
> +	struct resv_map *resv_map = kzalloc(sizeof(*resv_map), GFP_KERNEL);
>   	if (!resv_map)
>   		return NULL;
>
> @@ -322,10 +317,16 @@ static struct resv_map *resv_map_alloc(void)
>
>   static void resv_map_release(struct kref *ref)
>   {
> +	int idx;
>   	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
>
>   	/* Clear out any active regions before we release the map. */
>   	region_truncate(&resv_map->regions, 0);
> +	/* drop the hugetlb cgroup charge */
> +	for (idx = 0; idx<  HUGE_MAX_HSTATE; idx++) {
> +		hugetlb_priv_page_uncharge(resv_map, idx,
> +					   resv_map->nr_pages[idx]);
> +	}
>   	kfree(resv_map);
>   }
>
> @@ -989,9 +990,20 @@ static long vma_needs_reservation(struct hstate *h,
>   		return hugetlb_page_charge(&inode->i_mapping->private_list,
>   					   h, idx, idx + 1);
>   	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> -		return 1;
> -
> +		struct resv_map *resv_map = vma_resv_map(vma);
> +		if (!resv_map) {
> +			/*
> +			 * We didn't allocate resv_map for this vma.
> +			 * Allocate it here.
> +			 */
> +			resv_map = resv_map_alloc();
> +			if (!resv_map)
> +				return -ENOMEM;
> +			set_vma_resv_map(vma, resv_map);
> +		}
> +		return hugetlb_priv_page_charge(resv_map, h, 1);
>   	} else  {
> +		/* We did the priv page charging in mmap call */
>   		long err;
>   		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
>   		struct resv_map *reservations = vma_resv_map(vma);
> @@ -1007,14 +1019,20 @@ static void vma_uncharge_reservation(struct hstate *h,
>   				     struct vm_area_struct *vma,
>   				     unsigned long chg)
>   {
> +	int idx = h - hstates;
>   	struct address_space *mapping = vma->vm_file->f_mapping;
>   	struct inode *inode = mapping->host;
>
>
>   	if (vma->vm_flags&  VM_MAYSHARE) {
>   		return hugetlb_page_uncharge(&inode->i_mapping->private_list,
> -					     h - hstates,
> -					     chg<<  huge_page_order(h));
> +					     idx, chg<<  huge_page_order(h));
> +	} else {
> +		struct resv_map *resv_map = vma_resv_map(vma);
> +
> +		return hugetlb_priv_page_uncharge(resv_map,
> +						  idx,
> +						  chg<<  huge_page_order(h));
>   	}
>   }
>
> @@ -2165,6 +2183,22 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
>   	 */
>   	if (reservations)
>   		kref_get(&reservations->refs);
> +	else if (!(vma->vm_flags&  VM_MAYSHARE)) {
> +		/*
> +		 * for non shared vma we need resv map to track
> +		 * hugetlb cgroup usage. Allocate it here. Charging
> +		 * the cgroup will take place in fault path.
> +		 */
> +		struct resv_map *resv_map = resv_map_alloc();
> +		/*
> +		 * If we fail to allocate resv_map here. We will allocate
> +		 * one when we do alloc_huge_page. So we don't handle
> +		 * ENOMEM here. The function also return void. So there is
> +		 * nothing much we can do.
> +		 */
> +		if (resv_map)
> +			set_vma_resv_map(vma, resv_map);
> +	}
>   }
>
>   static void hugetlb_vm_op_close(struct vm_area_struct *vma)
> @@ -2968,7 +3002,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>   {
>   	long ret, chg;
>   	struct hstate *h = hstate_inode(inode);
> -
> +	struct resv_map *resv_map = NULL;
>   	/*
>   	 * Only apply hugepage reservation if asked. At fault time, an
>   	 * attempt will be made for VM_NORESERVE to allocate a page
> @@ -2987,7 +3021,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>   		chg = hugetlb_page_charge(&inode->i_mapping->private_list,
>   					  h, from, to);
>   	} else {
> -		struct resv_map *resv_map = resv_map_alloc();
> +		resv_map = resv_map_alloc();
>   		if (!resv_map)
>   			return -ENOMEM;
>
> @@ -2995,6 +3029,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>
>   		set_vma_resv_map(vma, resv_map);
>   		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> +		chg = hugetlb_priv_page_charge(resv_map, h, chg);
>   	}
>
>   	if (chg<  0)
> @@ -3033,6 +3068,9 @@ err_quota:
>   	if (!vma || vma->vm_flags&  VM_MAYSHARE)
>   		hugetlb_page_uncharge(&inode->i_mapping->private_list,
>   				      h - hstates, chg<<  huge_page_order(h));
> +	else
> +		hugetlb_priv_page_uncharge(resv_map, h - hstates,
> +					   chg<<  huge_page_order(h));
>   	return ret;
>
>   }

-- 
I am a slow learner
but I will keep trying to fight for my dreams!

--bill


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs
  2012-02-14  6:58 ` KAMEZAWA Hiroyuki
@ 2012-02-17  8:00   ` Aneesh Kumar K.V
  2012-02-17  8:03     ` Aneesh Kumar K.V
  1 sibling, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-17  8:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mgorman, dhillf


Hi Kamezawa,

Sorry for the late response, as I was out of office for the last few days.

On Tue, 14 Feb 2012 15:58:43 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Sat, 11 Feb 2012 03:06:40 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> 
> > Hi,
> > 
> > This patchset implements a cgroup resource controller for HugeTLB pages.
> > It is similar to the existing hugetlb quota support in that the limit is
> > enforced at mmap(2) time and not at fault time. HugeTLB quota limit the
> > number of huge pages that can be allocated per superblock.
> > 
> > For shared mapping we track the region mapped by a task along with the
> > hugetlb cgroup in inode region list. We keep the hugetlb cgroup charged
> > even after the task that did mmap(2) exit. The uncharge happens during
> > file truncate. For Private mapping we charge and uncharge from the current
> > task cgroup.
> > 
> 
> Hm, could you provide a Documentation/cgroup/hugetlb.txt with this RFC?
> It would make clear what your patch does.

Will do in the next iteration.

> 
> I wonder whether this should be under the memory cgroup or not. In the first
> design of cgroups, I think one-feature-one-subsystem was considered good.
> 
> But in recent discussions, I tend to hear that it's hard to use.
> Now, memory cgroup has 
> 
>    memory.xxxxx for controlling anon/file
>    memory.memsw.xxxx for controlling memory+swap
>    memory.kmem.tcp_xxxx for tcp controlling.
> 
> How about memory.hugetlb.xxxxx ?
> 

That is how I did one of the earlier versions of the patch. But there are
a few differences in the way we want to control hugetlb allocation. With
the hugetlb cgroup, we actually want to enable an application to fall back
to using the normal page size if it is crossing the cgroup limit, i.e. we
need to enforce the limit during mmap. memcg tracks cgroup details along
with pages, hence implementing the above gets challenging. Another
difference is that we keep the cgroup charged even after the task exits,
as long as the file is present in hugetlbfs, i.e. if an application did
mmap with MAP_SHARED in hugetlbfs, the file size will be extended to the
requested length arg in mmap. This file will consume pages from the
hugetlb resource until it is truncated. We want to track that resource
usage as part of the hugetlb cgroup.
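
As a rough illustration of that fallback (the hugetlbfs path and the mapping
size below are only examples, not part of the patch), an application can try
the hugetlbfs mapping first and simply retry with normal pages when mmap()
fails:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define LENGTH	(16UL * 1024 * 1024)	/* one 16MB huge page, as an example */

static void *map_with_fallback(void)
{
	void *addr;
	/* example path; any file on a hugetlbfs mount would do */
	int fd = open("/mnt/hugetlbfs/example", O_CREAT | O_RDWR, 0600);

	if (fd >= 0) {
		/* charged against the hugetlb cgroup at mmap() time */
		addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);
		close(fd);
		if (addr != MAP_FAILED)
			return addr;
	}
	/* over the cgroup limit (or no hugetlbfs): use normal pages instead */
	return mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

int main(void)
{
	char *p = map_with_fallback();

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 1;		/* touch the mapping so a page is instantiated */
	munmap(p, LENGTH);
	return 0;
}

With the controller, it is the mmap() on the hugetlbfs file that fails once
the limit is hit, so the fallback can happen before any fault.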

From the interface point of view, what we have in the hugetlb cgroup is
similar to what is in memcg. We end up with files like the ones below:

hugetlb.16GB.limit_in_bytes
hugetlb.16GB.max_usage_in_bytes 
hugetlb.16GB.usage_in_bytes
hugetlb.16MB.limit_in_bytes
hugetlb.16MB.max_usage_in_bytes  
hugetlb.16MB.usage_in_bytes

> 
> > The current patchset doesn't support cgroup hierarchy. We also don't
> > allow task migration across cgroup.
> 
> What happens when a user destroys a cgroup which contains alive hugetlb pages ?
> 
> Thanks,
> -Kame
> 

Thanks
-aneesh


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs
  2012-02-14  6:58 ` KAMEZAWA Hiroyuki
@ 2012-02-17  8:03     ` Aneesh Kumar K.V
  2012-02-17  8:03     ` Aneesh Kumar K.V
  1 sibling, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-17  8:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mgorman, dhillf, LKML, Andrew Morton


Hi Kamezawa,

Sorry for the late response, as I was out of office for the last few days.

On Tue, 14 Feb 2012 15:58:43 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Sat, 11 Feb 2012 03:06:40 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> 
> > Hi,
> > 
> > This patchset implements a cgroup resource controller for HugeTLB pages.
> > It is similar to the existing hugetlb quota support in that the limit is
> > enforced at mmap(2) time and not at fault time. HugeTLB quota limit the
> > number of huge pages that can be allocated per superblock.
> > 
> > For shared mapping we track the region mapped by a task along with the
> > hugetlb cgroup in inode region list. We keep the hugetlb cgroup charged
> > even after the task that did mmap(2) exit. The uncharge happens during
> > file truncate. For Private mapping we charge and uncharge from the current
> > task cgroup.
> > 
> 
> Hm, could you provide a Documentation/cgroup/hugetlb.txt with this RFC?
> It would make clear what your patch does.

Will do in the next iteration.

> 
> I wonder whether this should be under the memory cgroup or not. In the first
> design of cgroups, I think one-feature-one-subsystem was considered good.
> 
> But in recent discussions, I tend to hear that it's hard to use.
> Now, memory cgroup has 
> 
>    memory.xxxxx for controlling anon/file
>    memory.memsw.xxxx for controlling memory+swap
>    memory.kmem.tcp_xxxx for tcp controlling.
> 
> How about memory.hugetlb.xxxxx ?
> 

That is how I did one of the earlier versions of the patch. But there are
a few differences in the way we want to control hugetlb allocation. With
the hugetlb cgroup, we actually want to enable an application to fall back
to using the normal page size if it is crossing the cgroup limit, i.e. we
need to enforce the limit during mmap. memcg tracks cgroup details along
with pages, hence implementing the above gets challenging. Another
difference is that we keep the cgroup charged even after the task exits,
as long as the file is present in hugetlbfs, i.e. if an application did
mmap with MAP_SHARED in hugetlbfs, the file size will be extended to the
requested length arg in mmap. This file will consume pages from the
hugetlb resource until it is truncated. We want to track that resource
usage as part of the hugetlb cgroup.

From the interface point of view, what we have in the hugetlb cgroup is
similar to what is in memcg. We end up with files like the ones below:

hugetlb.16GB.limit_in_bytes
hugetlb.16GB.max_usage_in_bytes 
hugetlb.16GB.usage_in_bytes
hugetlb.16MB.limit_in_bytes
hugetlb.16MB.max_usage_in_bytes  
hugetlb.16MB.usage_in_bytes

> 
> > The current patchset doesn't support cgroup hierarchy. We also don't
> > allow task migration across cgroup.
> 
> What happens when a user destroys a cgroup which contains alive hugetlb pages ?
> 
> Thanks,
> -Kame
> 

Thanks
-aneesh


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 5/6] hugetlbfs: Add controller support for private mapping
  2012-02-17  5:22   ` bill4carson
@ 2012-02-17  8:05     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-17  8:05 UTC (permalink / raw)
  To: bill4carson; +Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf

On Fri, 17 Feb 2012 13:22:44 +0800, bill4carson <bill4carson@gmail.com> wrote:
> 
> 
> On 2012年02月11日 05:36, Aneesh Kumar K.V wrote:
> > From: "Aneesh Kumar K.V"<aneesh.kumar@linux.vnet.ibm.com>
> >
> > HugeTLB controller is different from a memory controller in that we charge
> > controller during mmap() time and not fault time. This make sure userspace
> > can fallback to non-hugepage allocation when mmap fails due to controller
> > limit.
> >
> > For private mapping we always charge/uncharge from the current task cgroup.
> > Charging happens during mmap(2) and uncharge happens during the
> > vm_operations->close when resv_map refcount reaches zero. The uncharge count
> > is stored in struct resv_map. For child task after fork the charging happens
> > during fault time in alloc_huge_page. We also need to make sure for private
> > mapping each vma for hugeTLB mapping have struct resv_map allocated so that we
> > can store the uncharge count in resv_map.
> >
> > Signed-off-by: Aneesh Kumar K.V<aneesh.kumar@linux.vnet.ibm.com>
> > ---
> >   fs/hugetlbfs/hugetlb_cgroup.c  |   50 ++++++++++++++++++++++++++++++++
> >   include/linux/hugetlb.h        |    7 ++++
> >   include/linux/hugetlb_cgroup.h |   16 ++++++++++
> >   mm/hugetlb.c                   |   62 ++++++++++++++++++++++++++++++++--------
> >   4 files changed, 123 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
> > index c478fb0..f828fb2 100644
> > --- a/fs/hugetlbfs/hugetlb_cgroup.c
> > +++ b/fs/hugetlbfs/hugetlb_cgroup.c
> > @@ -458,3 +458,53 @@ long  hugetlb_truncate_cgroup_charge(struct hstate *h,
> >   	}
> >   	return chg;
> >   }
> > +
> > +int hugetlb_priv_page_charge(struct resv_map *map, struct hstate *h, long chg)
> > +{
> > +	long csize;
> > +	int idx, ret;
> > +	struct hugetlb_cgroup *h_cg;
> > +	struct res_counter *fail_res;
> > +
> > +	/*
> > +	 * Get the task cgroup within rcu_readlock and also
> > +	 * get cgroup reference to make sure cgroup destroy won't
> > +	 * race with page_charge. We don't allow a cgroup destroy
> > +	 * when the cgroup have some charge against it
> > +	 */
> > +	rcu_read_lock();
> > +	h_cg = task_hugetlbcgroup(current);
> > +	css_get(&h_cg->css);
> > +	rcu_read_unlock();
> > +
> > +	if (hugetlb_cgroup_is_root(h_cg)) {
> > +		ret = chg;
> > +		goto err_out;
> > +	}
> > +
> > +	csize = chg * huge_page_size(h);
> > +	idx = h - hstates;
> > +	ret = res_counter_charge(&h_cg->memhuge[idx], csize,&fail_res);
> > +	if (!ret) {
> > +		map->nr_pages[idx] += chg<<  huge_page_order(h);
> > +		ret = chg;
> > +	}
> > +err_out:
> > +	css_put(&h_cg->css);
> > +	return ret;
> > +}
> > +
> > +void hugetlb_priv_page_uncharge(struct resv_map *map, int idx, int nr_pages)
> > +{
> > +	struct hugetlb_cgroup *h_cg;
> > +	unsigned long csize = nr_pages * PAGE_SIZE;
> > +
> > +	rcu_read_lock();
> > +	h_cg = task_hugetlbcgroup(current);
> > +	if (!hugetlb_cgroup_is_root(h_cg)) {
> > +		res_counter_uncharge(&h_cg->memhuge[idx], csize);
> > +		map->nr_pages[idx] -= nr_pages;
> > +	}
> > +	rcu_read_unlock();
> > +	return;
> > +}
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 4392b6a..e2ba381 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -233,6 +233,12 @@ struct hstate {
> >   	char name[HSTATE_NAME_LEN];
> >   };
> >
> > +struct resv_map {
> > +	struct kref refs;
> > +	int nr_pages[HUGE_MAX_HSTATE];
> > +	struct list_head regions;
> > +};
> > +
> 
> Please put resv_map after the HUGE_MAX_HSTATE definition,
> otherwise it will break on non-x86 arches, which have no
> HUGE_MAX_HSTATE definition.
> 
> 
> #ifndef HUGE_MAX_HSTATE
> #define HUGE_MAX_HSTATE 1
> #endif
> 
> +struct resv_map {
> +	struct kref refs;
> +	int nr_pages[HUGE_MAX_HSTATE];
> +	struct list_head regions;
> +};
> 
> 
> 
> 

Will do in the next iteration.

Thanks for the review
-aneesh


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-02-17  8:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-10 21:36 [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
2012-02-10 21:36 ` [RFC PATCH 1/6] hugetlb: Add a new hugetlb cgroup Aneesh Kumar K.V
2012-02-10 21:36 ` [RFC PATCH 2/6] hugetlbfs: Add usage and max usage files to " Aneesh Kumar K.V
2012-02-10 21:36 ` [RFC PATCH 3/6] hugetlbfs: Add new region handling functions Aneesh Kumar K.V
2012-02-10 21:36 ` [RFC PATCH 4/6] hugetlbfs: Add controller support for shared mapping Aneesh Kumar K.V
2012-02-10 21:36 ` [RFC PATCH 5/6] hugetlbfs: Add controller support for private mapping Aneesh Kumar K.V
2012-02-17  5:22   ` bill4carson
2012-02-17  8:05     ` Aneesh Kumar K.V
2012-02-10 21:36 ` [RFC PATCH 6/6] hugetlbfs: Switch to new region APIs Aneesh Kumar K.V
2012-02-11 12:37 ` [RFC PATCH 0/6] hugetlbfs: Add cgroup resource controller for hugetlbfs Hillf Danton
2012-02-12 17:44   ` Aneesh Kumar K.V
2012-02-14  6:58 ` KAMEZAWA Hiroyuki
2012-02-17  8:00   ` Aneesh Kumar K.V
2012-02-17  8:03   ` Aneesh Kumar K.V
