* [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs
@ 2012-02-20 11:21 Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 1/9] hugetlbfs: Add new HugeTLB cgroup Aneesh Kumar K.V
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups

Hi,

This patchset implements a cgroup resource controller for HugeTLB pages.
It is similar to the existing hugetlb quota support in that the limit is
enforced at mmap(2) time and not at fault time, but while hugetlb quota
limits the number of huge pages that can be allocated per superblock, this
controller enforces the limit per cgroup.

For shared mappings we track the region mapped by a task along with the
hugetlb cgroup. We keep the hugetlb cgroup charged even after the task
that did the mmap(2) exits; the uncharge happens during truncate. For
private mappings we charge and uncharge the current task's cgroup.

A sample strace output for an application doing malloc via hugectl is given
below. libhugetlbfs falls back to the normal page size if the HugeTLB mmap fails.

open("/mnt/libhugetlbfs.tmp.uhLMgy", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
unlink("/mnt/libhugetlbfs.tmp.uhLMgy")  = 0

.........

mmap(0x20000000000, 50331648, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "libhugetlbfs", 12libhugetlbfs)            = 12
write(2, ": WARNING: New heap segment map" ....
mmap(NULL, 42008576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xfff946c0000
....
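
For reference, the same open/map/fall-back pattern in plain C might look
roughly like the sketch below; the mount point, file name and mapping size
are placeholders, not something the patchset prescribes.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 50331648;	/* 48MB, as in the trace above */
	void *p = MAP_FAILED;
	int fd = open("/mnt/example.tmp", O_RDWR | O_CREAT | O_EXCL, 0600);

	if (fd >= 0) {
		/* the file only needs to live as long as the mapping */
		unlink("/mnt/example.tmp");
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	}
	if (p == MAP_FAILED) {
		/* over the cgroup limit (or no huge pages): use base pages */
		perror("hugetlb mmap failed, falling back");
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	}
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, len);	/* touch the memory */
	return 0;
}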


Changes from RFC post:
* Added support for HugeTLB cgroup hierarchy
* Added support for task migration
* Added documentation patch
* Other bug fixes

-aneesh



* [PATCH -V1 1/9] hugetlbfs: Add new HugeTLB cgroup
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 2/9] hugetlbfs: Add usage and max usage files to hugetlb cgroup Aneesh Kumar K.V
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The hugetlb controller helps control the number of huge pages a cgroup can
allocate. We enforce the limit at mmap() time and NOT at fault time. The
behaviour is similar to hugetlb quota support, but quota enforces the limit
per hugetlb mount point. Each huge page size gets its own limit file, named
after the page size (e.g. 2MB.limit_in_bytes for 2MB pages, as constructed
in hugetlb_cgroup_file_init() below).
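
The per-hstate control files below multiplex the hstate index and the
res_counter attribute into cft->private. A tiny standalone illustration of
that encoding follows (the macros are copied from the patch; the RES_LIMIT
value used here is only a stand-in):

#include <stdio.h>

#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

int main(void)
{
	int res_limit = 2;	/* stand-in for RES_LIMIT from res_counter.h */
	int private = MEMFILE_PRIVATE(1, res_limit);	/* hstate index 1 */

	/* the read/write handlers recover both halves from cft->private */
	printf("hstate idx = %d, attr = %d\n",
	       MEMFILE_TYPE(private), MEMFILE_ATTR(private));
	return 0;
}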

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/Makefile          |    1 +
 fs/hugetlbfs/hugetlb_cgroup.c  |  196 ++++++++++++++++++++++++++++++++++++++++
 include/linux/cgroup_subsys.h  |    6 ++
 include/linux/hugetlb.h        |   12 ++-
 include/linux/hugetlb_cgroup.h |   21 +++++
 init/Kconfig                   |   10 ++
 mm/hugetlb.c                   |   58 ++++++++++++-
 7 files changed, 301 insertions(+), 3 deletions(-)
 create mode 100644 fs/hugetlbfs/hugetlb_cgroup.c
 create mode 100644 include/linux/hugetlb_cgroup.h

diff --git a/fs/hugetlbfs/Makefile b/fs/hugetlbfs/Makefile
index 6adf870..986c778 100644
--- a/fs/hugetlbfs/Makefile
+++ b/fs/hugetlbfs/Makefile
@@ -5,3 +5,4 @@
 obj-$(CONFIG_HUGETLBFS) += hugetlbfs.o
 
 hugetlbfs-objs := inode.o
+hugetlbfs-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
new file mode 100644
index 0000000..b5b3cb8
--- /dev/null
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -0,0 +1,196 @@
+/*
+ *
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/res_counter.h>
+
+/* lifted from mem control */
+#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
+#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val)	((val) & 0xffff)
+
+struct hugetlb_cgroup {
+	struct cgroup_subsys_state css;
+	/*
+	 * the counter to account for hugepages from hugetlb.
+	 */
+	struct res_counter memhuge[HUGE_MAX_HSTATE];
+};
+
+struct cgroup_subsys hugetlb_subsys __read_mostly;
+struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+
+static inline
+struct hugetlb_cgroup *css_to_hugetlbcgroup(struct cgroup_subsys_state *s)
+{
+	return container_of(s, struct hugetlb_cgroup, css);
+}
+
+static inline
+struct hugetlb_cgroup *cgroup_to_hugetlbcgroup(struct cgroup *cgroup)
+{
+	return css_to_hugetlbcgroup(cgroup_subsys_state(cgroup,
+							hugetlb_subsys_id));
+}
+
+static inline
+struct hugetlb_cgroup *task_hugetlbcgroup(struct task_struct *task)
+{
+	return css_to_hugetlbcgroup(task_subsys_state(task, hugetlb_subsys_id));
+}
+
+static inline int hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
+{
+	return (h_cg == root_h_cgroup);
+}
+
+u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft)
+{
+	int name, idx;
+	unsigned long long val;
+	struct hugetlb_cgroup *h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+
+	idx = MEMFILE_TYPE(cft->private);
+	name = MEMFILE_ATTR(cft->private);
+
+	val = res_counter_read_u64(&h_cgroup->memhuge[idx], name);
+	return val;
+}
+
+int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
+			 const char *buffer)
+{
+	int name, ret, idx;
+	unsigned long long val;
+	struct hugetlb_cgroup *h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+
+	/* This function does all the necessary parsing; reuse it */
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return ret;
+
+	idx = MEMFILE_TYPE(cft->private);
+	name = MEMFILE_ATTR(cft->private);
+
+	switch (name) {
+	case RES_LIMIT:
+		ret = res_counter_set_limit(&h_cgroup->memhuge[idx], val);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int hugetlbcgroup_can_attach(struct cgroup_subsys *ss,
+				    struct cgroup *new_cgrp,
+				    struct cgroup_taskset *set)
+{
+	struct hugetlb_cgroup *h_cg;
+	struct task_struct *task = cgroup_taskset_first(set);
+	/*
+	 * Make sure all the tasks in the set are in the root cgroup.
+	 * We only allow moving from the root cgroup to another cgroup.
+	 */
+	while (task != NULL) {
+		rcu_read_lock();
+		h_cg = task_hugetlbcgroup(task);
+		if (!hugetlb_cgroup_is_root(h_cg)) {
+			rcu_read_unlock();
+			return -EOPNOTSUPP;
+		}
+		rcu_read_unlock();
+		task = cgroup_taskset_next(set);
+	}
+	return 0;
+}
+
+/*
+ * called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+hugetlbcgroup_create(struct cgroup_subsys *ss, struct cgroup *cgroup)
+{
+	int idx;
+	struct cgroup *parent_cgroup;
+	struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
+
+	h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
+	if (!h_cgroup)
+		return ERR_PTR(-ENOMEM);
+
+	parent_cgroup = cgroup->parent;
+	if (parent_cgroup) {
+		parent_h_cgroup = cgroup_to_hugetlbcgroup(parent_cgroup);
+		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+			res_counter_init(&h_cgroup->memhuge[idx],
+					 &parent_h_cgroup->memhuge[idx]);
+	} else {
+		root_h_cgroup = h_cgroup;
+		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+			res_counter_init(&h_cgroup->memhuge[idx], NULL);
+	}
+	return &h_cgroup->css;
+}
+
+static int hugetlbcgroup_pre_destroy(struct cgroup_subsys *ss,
+				     struct cgroup *cgroup)
+{
+	u64 val;
+	int idx;
+	struct hugetlb_cgroup *h_cgroup;
+
+	h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+	/*
+	 * We don't allow a cgroup deletion if it has some
+	 * resource charged against it.
+	 */
+	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
+		val = res_counter_read_u64(&h_cgroup->memhuge[idx], RES_USAGE);
+		if (val)
+			return -EBUSY;
+	}
+	return 0;
+}
+
+static void hugetlbcgroup_destroy(struct cgroup_subsys *ss,
+				  struct cgroup *cgroup)
+{
+	struct hugetlb_cgroup *h_cgroup;
+
+	h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+	kfree(h_cgroup);
+}
+
+static int hugetlbcgroup_populate(struct cgroup_subsys *ss,
+				  struct cgroup *cgroup)
+{
+	return register_hugetlb_cgroup_files(ss, cgroup);
+}
+
+struct cgroup_subsys hugetlb_subsys = {
+	.name = "hugetlb",
+	.can_attach = hugetlbcgroup_can_attach,
+	.create     = hugetlbcgroup_create,
+	.pre_destroy = hugetlbcgroup_pre_destroy,
+	.destroy    = hugetlbcgroup_destroy,
+	.populate   = hugetlbcgroup_populate,
+	.subsys_id  = hugetlb_subsys_id,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0bd390c..895923a 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -72,3 +72,9 @@ SUBSYS(net_prio)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+SUBSYS(hugetlb)
+#endif
+
+/* */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9d6c86..2b6b231 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
 #include <linux/mm_types.h>
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
+#include <linux/cgroup.h>
 
 struct ctl_table;
 struct user_struct;
@@ -68,7 +69,8 @@ int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 void hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
-
+int register_hugetlb_cgroup_files(struct cgroup_subsys *ss,
+				  struct cgroup *cgroup);
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline int PageHuge(struct page *page)
@@ -109,7 +111,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
 }
 
 #define hugetlb_change_protection(vma, address, end, newprot)
-
+static inline int register_hugetlb_cgroup_files(struct cgroup_subsys *ss,
+						 struct cgroup *cgroup)
+{
+	return 0;
+}
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"
@@ -220,6 +226,8 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+	/* cgroup control files */
+	struct cftype cgroup_limit_file;
 	char name[HSTATE_NAME_LEN];
 };
 
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
new file mode 100644
index 0000000..2330dd0
--- /dev/null
+++ b/include/linux/hugetlb_cgroup.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#ifndef _LINUX_HUGETLB_CGROUP_H
+#define _LINUX_HUGETLB_CGROUP_H
+
+extern u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft);
+extern int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
+				const char *buffer);
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..78d4961 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -673,6 +673,16 @@ config CGROUP_MEM_RES_CTLR
 	  This config option also selects MM_OWNER config option, which
 	  could in turn add some fork/exit overhead.
 
+config CGROUP_HUGETLB_RES_CTLR
+	bool "HugeTLB Resource Controller for Control Groups"
+	depends on RESOURCE_COUNTERS && HUGETLBFS
+	help
+	  Provides a simple cgroup Resource Controller for HugeTLB pages.
+	  The controller limit is enforced at mmap(2) time, so that
+	  applications can fall back to allocations using a smaller page
+	  size if the cgroup resource limit prevents them from allocating
+	  HugeTLB pages.
+
 config CGROUP_MEM_RES_CTLR_SWAP
 	bool "Memory Resource Controller Swap Extension"
 	depends on CGROUP_MEM_RES_CTLR && SWAP
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f34bd8..f643f72 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -28,6 +28,11 @@
 
 #include <linux/hugetlb.h>
 #include <linux/node.h>
+#include <linux/cgroup.h>
+#include <linux/hugetlb_cgroup.h>
+#include <linux/res_counter.h>
+#include <linux/page_cgroup.h>
+
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1798,6 +1803,57 @@ static int __init hugetlb_init(void)
 }
 module_init(hugetlb_init);
 
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+int register_hugetlb_cgroup_files(struct cgroup_subsys *ss,
+				  struct cgroup *cgroup)
+{
+	int ret = 0;
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		ret = cgroup_add_file(cgroup, ss, &h->cgroup_limit_file);
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
+static char *mem_fmt(char *buf, unsigned long n)
+{
+	if (n >= (1UL << 30))
+		sprintf(buf, "%luGB", n >> 30);
+	else if (n >= (1UL << 20))
+		sprintf(buf, "%luMB", n >> 20);
+	else
+		sprintf(buf, "%luKB", n >> 10);
+	return buf;
+}
+
+static int hugetlb_cgroup_file_init(struct hstate *h, int idx)
+{
+	char buf[32];
+	struct cftype *cft;
+
+	/* format the size */
+	mem_fmt(buf, huge_page_size(h));
+
+	/* Add the limit file */
+	cft = &h->cgroup_limit_file;
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+	cft->read_u64 = hugetlb_cgroup_read;
+	cft->write_string = hugetlb_cgroup_write;
+
+	return 0;
+}
+#else
+static int hugetlb_cgroup_file_init(struct hstate *h, int idx)
+{
+	return 0;
+}
+#endif
+
 /* Should be called on processing a hugepagesz=... option */
 void __init hugetlb_add_hstate(unsigned order)
 {
@@ -1821,7 +1877,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
-
+	hugetlb_cgroup_file_init(h, max_hstate - 1);
 	parsed_hstate = h;
 }
 
-- 
1.7.9



* [PATCH -V1 2/9] hugetlbfs: Add usage and max usage files to hugetlb cgroup
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 1/9] hugetlbfs: Add new HugeTLB cgroup Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 3/9] hugetlbfs: Add new region handling functions Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c  |   12 ++++++++++++
 include/linux/hugetlb.h        |    2 ++
 include/linux/hugetlb_cgroup.h |    1 +
 mm/hugetlb.c                   |   21 +++++++++++++++++++++
 4 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index b5b3cb8..75dbdd8 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -99,6 +99,18 @@ int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
 	return ret;
 }
 
+int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
+{
+	int name, idx;
+	struct hugetlb_cgroup *h_cgroup = cgroup_to_hugetlbcgroup(cgroup);
+
+	idx = MEMFILE_TYPE(event);
+	name = MEMFILE_ATTR(event);
+
+	res_counter_reset_max(&h_cgroup->memhuge[idx]);
+	return 0;
+}
+
 static int hugetlbcgroup_can_attach(struct cgroup_subsys *ss,
 				    struct cgroup *new_cgrp,
 				    struct cgroup_taskset *set)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2b6b231..4392b6a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -228,6 +228,8 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 	/* cgroup control files */
 	struct cftype cgroup_limit_file;
+	struct cftype cgroup_usage_file;
+	struct cftype cgroup_max_usage_file;
 	char name[HSTATE_NAME_LEN];
 };
 
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 2330dd0..11cd6c4 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -18,4 +18,5 @@
 extern u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft);
 extern int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
 				const char *buffer);
+extern int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event);
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f643f72..865b41f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1814,6 +1814,13 @@ int register_hugetlb_cgroup_files(struct cgroup_subsys *ss,
 		ret = cgroup_add_file(cgroup, ss, &h->cgroup_limit_file);
 		if (ret)
 			return ret;
+		ret = cgroup_add_file(cgroup, ss, &h->cgroup_usage_file);
+		if (ret)
+			return ret;
+		ret = cgroup_add_file(cgroup, ss, &h->cgroup_max_usage_file);
+		if (ret)
+			return ret;
+
 	}
 	return ret;
 }
@@ -1845,6 +1852,20 @@ static int hugetlb_cgroup_file_init(struct hstate *h, int idx)
 	cft->read_u64 = hugetlb_cgroup_read;
 	cft->write_string = hugetlb_cgroup_write;
 
+	/* Add the usage file */
+	cft = &h->cgroup_usage_file;
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
+	cft->private  = MEMFILE_PRIVATE(idx, RES_USAGE);
+	cft->read_u64 = hugetlb_cgroup_read;
+
+	/* Add the MAX usage file */
+	cft = &h->cgroup_max_usage_file;
+	snprintf(cft->name, MAX_CFTYPE_NAME,
+		 "%s.max_usage_in_bytes", buf);
+	cft->private  = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+	cft->trigger  = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read;
+
 	return 0;
 }
 #else
-- 
1.7.9



* [PATCH -V1 3/9] hugetlbfs: Add new region handling functions.
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 1/9] hugetlbfs: Add new HugeTLB cgroup Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 2/9] hugetlbfs: Add usage and max usage files to hugetlb cgroup Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 4/9] hugetlbfs: Add controller support for shared mapping Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

These functions take an extra argument and only merge regions if the data
value matches. This helps us build regions with different hugetlb cgroup
values. The last patch in the series merges this with the existing region
code; keeping it separate for now allows us to add cgroup support for shared
and private mappings in separate patches.
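
To make the intended accounting concrete, here is a minimal userspace model
of the semantics (a per-page owner array standing in for the kernel's region
list; the whole program is illustrative and not part of the patch): pages
already reserved by anyone are not charged again, and ownership is
remembered per range so the right cgroup can be uncharged later.

#include <stdio.h>

#define NPAGES 16

/*
 * 0 = unreserved; any other value is an opaque "cgroup" id playing the
 * role of the data argument: reservations by different owners are kept
 * apart instead of being merged.
 */
static int owner[NPAGES];

/*
 * Pages in [f, t) that would be newly charged; pages already reserved
 * (by any owner) are not charged again, mirroring region_chg.
 */
static long charge(long f, long t)
{
	long chg = 0;

	for (long i = f; i < t; i++)
		if (!owner[i])
			chg++;
	return chg;
}

/*
 * Commit the reservation, remembering who paid for each page, the way
 * region_add_with_same keeps per-cgroup regions separate.
 */
static void commit(long f, long t, int data)
{
	for (long i = f; i < t; i++)
		if (!owner[i])
			owner[i] = data;
}

int main(void)
{
	printf("cgroup 1 charged %ld pages\n", charge(0, 10));	/* 10 */
	commit(0, 10, 1);
	printf("cgroup 2 charged %ld pages\n", charge(5, 15));	/* 5: only [10,15) */
	commit(5, 15, 2);
	return 0;
}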

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c |  127 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 127 insertions(+), 0 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index 75dbdd8..9bd2691 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -31,9 +31,136 @@ struct hugetlb_cgroup {
 	struct res_counter memhuge[HUGE_MAX_HSTATE];
 };
 
+struct file_region_with_data {
+	struct list_head link;
+	long from;
+	long to;
+	unsigned long data;
+};
+
 struct cgroup_subsys hugetlb_subsys __read_mostly;
 struct hugetlb_cgroup *root_h_cgroup __read_mostly;
 
+/*
+ * A variant of region_add that only merges regions if the data
+ * values match.
+ */
+static long region_chg_with_same(struct list_head *head,
+				 long f, long t, unsigned long data)
+{
+	long chg = 0;
+	struct file_region_with_data *rg, *nrg, *trg;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+	/*
+	 * If we are below the current region then a new region is required.
+	 * Subtle, allocate a new region at the position but make it zero
+	 * size such that we can guarantee to record the reservation.
+	 */
+	if (&rg->link == head || t < rg->from) {
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (!nrg)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to = f;
+		nrg->data = data;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+		return t - f;
+	}
+	/*
+	 * f rg->from t rg->to
+	 */
+	if (f < rg->from && data != rg->data) {
+		/* we need to allocate a new region */
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (!nrg)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to = f;
+		nrg->data = data;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+	}
+
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+	chg = t - f;
+
+	/* Check for and consume any regions we now overlap with. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > t)
+			return chg;
+		/*
+		 * rg->from f rg->to t
+		 */
+		if (t > rg->to && data != rg->data) {
+			/* we need to allocate a new region */
+			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+			if (!nrg)
+				return -ENOMEM;
+			nrg->from = rg->to;
+			nrg->to  = rg->to;
+			nrg->data = data;
+			INIT_LIST_HEAD(&nrg->link);
+			list_add(&nrg->link, &rg->link);
+		}
+		/*
+		 * update charge
+		 */
+		if (rg->to > t) {
+			chg += rg->to - t;
+			t = rg->to;
+		}
+		chg -= rg->to - rg->from;
+	}
+	return chg;
+}
+
+static void region_add_with_same(struct list_head *head,
+				 long f, long t, unsigned long data)
+{
+	struct file_region_with_data *rg, *nrg, *trg;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+
+		if (rg->from > t)
+			return;
+		if (&rg->link == head)
+			return;
+
+		/*FIXME!! this can possibly delete few regions */
+		/* We need to worry only if we match data */
+		if (rg->data == data) {
+			if (f < rg->from)
+				rg->from = f;
+			if (t > rg->to) {
+				/* if we are the last entry */
+				if (rg->link.next == head) {
+					rg->to = t;
+					break;
+				} else {
+					nrg = list_entry(rg->link.next,
+							 typeof(*nrg), link);
+					rg->to = nrg->from;
+				}
+			}
+		}
+		f = rg->to;
+	}
+}
+
 static inline
 struct hugetlb_cgroup *css_to_hugetlbcgroup(struct cgroup_subsys_state *s)
 {
-- 
1.7.9



* [PATCH -V1 4/9] hugetlbfs: Add controller support for shared mapping
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2012-02-20 11:21 ` [PATCH -V1 3/9] hugetlbfs: Add new region handling functions Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 5/9] hugetlbfs: Add controller support for private mapping Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The HugeTLB controller is different from the memory controller in that we
charge the controller at mmap() time and not at fault time. This makes sure
userspace can fall back to a non-hugepage allocation when mmap() fails due
to the controller limit.

For shared mappings we need to track the hugetlb cgroup along with the range.
If two tasks in two different cgroups map the same area, only the
non-overlapping part should be charged to the second task, hence we track the
cgroup along with the range. For example, if a task in cgroup A maps huge
pages [0, 10) of a file and a task in cgroup B later maps [5, 15), only the
five pages in [10, 15) are charged to B. We always charge during mmap(2) and
uncharge during truncate. The region list is tracked in
inode->i_mapping->private_list.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c  |  114 ++++++++++++++++++++++++++++++++++++++++
 fs/hugetlbfs/inode.c           |    1 +
 include/linux/hugetlb_cgroup.h |   39 ++++++++++++++
 mm/hugetlb.c                   |   84 ++++++++++++++++++++---------
 4 files changed, 212 insertions(+), 26 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index 9bd2691..4806d43 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -17,6 +17,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/res_counter.h>
+#include <linux/list.h>
 
 /* lifted from mem control */
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -333,3 +334,116 @@ struct cgroup_subsys hugetlb_subsys = {
 	.populate   = hugetlbcgroup_populate,
 	.subsys_id  = hugetlb_subsys_id,
 };
+
+long hugetlb_page_charge(struct list_head *head,
+			struct hstate *h, long f, long t)
+{
+	long chg;
+	int ret = 0, idx;
+	unsigned long csize;
+	struct hugetlb_cgroup *h_cg;
+	struct res_counter *fail_res;
+
+	/*
+	 * Get the task cgroup within rcu_read_lock() and also take a
+	 * cgroup reference to make sure a cgroup destroy won't race
+	 * with page_charge. We don't allow a cgroup to be destroyed
+	 * while it has some charge against it.
+	 */
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+	css_get(&h_cg->css);
+	rcu_read_unlock();
+
+	chg = region_chg_with_same(head, f, t, (unsigned long)h_cg);
+	if (chg < 0)
+		goto err_out;
+
+	if (hugetlb_cgroup_is_root(h_cg))
+		goto err_out;
+
+	csize = chg * huge_page_size(h);
+	idx = h - hstates;
+	ret = res_counter_charge(&h_cg->memhuge[idx], csize, &fail_res);
+
+err_out:
+	/* Now that we have charged we can drop cgroup reference */
+	css_put(&h_cg->css);
+	if (!ret)
+		return chg;
+
+	/* We don't worry about region_uncharge */
+	return ret;
+}
+
+void hugetlb_page_uncharge(struct list_head *head, int idx, long nr_pages)
+{
+	struct hugetlb_cgroup *h_cg;
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+
+	if (!hugetlb_cgroup_is_root(h_cg))
+		res_counter_uncharge(&h_cg->memhuge[idx], csize);
+	rcu_read_unlock();
+	/*
+	 * We could ideally remove zero size regions from
+	 * resv map hcg_regions here
+	 */
+	return;
+}
+
+void hugetlb_commit_page_charge(struct list_head *head, long f, long t)
+{
+	struct hugetlb_cgroup *h_cg;
+
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+	region_add_with_same(head, f, t, (unsigned long)h_cg);
+	rcu_read_unlock();
+	return;
+}
+
+long hugetlb_truncate_cgroup(struct hstate *h,
+			     struct list_head *head, long end)
+{
+	long chg = 0, csize;
+	int idx = h - hstates;
+	struct hugetlb_cgroup *h_cg;
+	struct file_region_with_data *rg, *trg;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (end <= rg->to)
+			break;
+	if (&rg->link == head)
+		return 0;
+
+	/* If we are in the middle of a region then adjust it. */
+	if (end > rg->from) {
+		chg = rg->to - end;
+		rg->to = end;
+		h_cg = (struct hugetlb_cgroup *)rg->data;
+		if (!hugetlb_cgroup_is_root(h_cg)) {
+			csize = chg * huge_page_size(h);
+			res_counter_uncharge(&h_cg->memhuge[idx], csize);
+		}
+		rg = list_entry(rg->link.next, typeof(*rg), link);
+	}
+
+	/* Drop any remaining regions. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		chg += rg->to - rg->from;
+		h_cg = (struct hugetlb_cgroup *)rg->data;
+		if (!hugetlb_cgroup_is_root(h_cg)) {
+			csize = (rg->to - rg->from) * huge_page_size(h);
+			res_counter_uncharge(&h_cg->memhuge[idx], csize);
+		}
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return chg;
+}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1e85a7a..2680578 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/magic.h>
 #include <linux/migrate.h>
+#include <linux/hugetlb_cgroup.h>
 
 #include <asm/uaccess.h>
 
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 11cd6c4..9240e99 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -15,8 +15,47 @@
 #ifndef _LINUX_HUGETLB_CGROUP_H
 #define _LINUX_HUGETLB_CGROUP_H
 
+extern long region_add(struct list_head *head, long f, long t);
+extern long region_chg(struct list_head *head, long f, long t);
+extern long region_truncate(struct list_head *head, long end);
+extern long region_count(struct list_head *head, long f, long t);
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
 extern u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft);
 extern int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
 				const char *buffer);
 extern int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event);
+extern long hugetlb_page_charge(struct list_head *head,
+				struct hstate *h, long f, long t);
+extern void hugetlb_page_uncharge(struct list_head *head,
+				  int idx, long nr_pages);
+extern void hugetlb_commit_page_charge(struct list_head *head, long f, long t);
+extern long hugetlb_truncate_cgroup(struct hstate *h,
+				    struct list_head *head, long from);
+#else
+static inline long hugetlb_page_charge(struct list_head *head,
+				       struct hstate *h, long f, long t)
+{
+	return region_chg(head, f, t);
+}
+
+static inline void hugetlb_page_uncharge(struct list_head *head,
+					 int idx, long nr_pages)
+{
+	return;
+}
+
+static inline void hugetlb_commit_page_charge(struct list_head *head,
+					      long f, long t)
+{
+	region_add(head, f, t);
+	return;
+}
+
+static inline long hugetlb_truncate_cgroup(struct hstate *h,
+					   struct list_head *head, long from)
+{
+	return region_truncate(head, from);
+}
+#endif /* CONFIG_CGROUP_HUGETLB_RES_CTLR */
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 865b41f..80ee085 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -78,7 +78,7 @@ struct file_region {
 	long to;
 };
 
-static long region_add(struct list_head *head, long f, long t)
+long region_add(struct list_head *head, long f, long t)
 {
 	struct file_region *rg, *nrg, *trg;
 
@@ -114,7 +114,7 @@ static long region_add(struct list_head *head, long f, long t)
 	return 0;
 }
 
-static long region_chg(struct list_head *head, long f, long t)
+long region_chg(struct list_head *head, long f, long t)
 {
 	struct file_region *rg, *nrg;
 	long chg = 0;
@@ -163,7 +163,7 @@ static long region_chg(struct list_head *head, long f, long t)
 	return chg;
 }
 
-static long region_truncate(struct list_head *head, long end)
+long region_truncate(struct list_head *head, long end)
 {
 	struct file_region *rg, *trg;
 	long chg = 0;
@@ -193,7 +193,7 @@ static long region_truncate(struct list_head *head, long end)
 	return chg;
 }
 
-static long region_count(struct list_head *head, long f, long t)
+long region_count(struct list_head *head, long f, long t)
 {
 	struct file_region *rg;
 	long chg = 0;
@@ -983,11 +983,11 @@ static long vma_needs_reservation(struct hstate *h,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
+
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		return region_chg(&inode->i_mapping->private_list,
-							idx, idx + 1);
-
+		return hugetlb_page_charge(&inode->i_mapping->private_list,
+					   h, idx, idx + 1);
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		return 1;
 
@@ -1002,16 +1002,33 @@ static long vma_needs_reservation(struct hstate *h,
 		return 0;
 	}
 }
+
+static void vma_uncharge_reservation(struct hstate *h,
+				     struct vm_area_struct *vma,
+				     unsigned long chg)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+
+	if (vma->vm_flags & VM_MAYSHARE) {
+		return hugetlb_page_uncharge(&inode->i_mapping->private_list,
+					     h - hstates,
+					     chg << huge_page_order(h));
+	}
+}
+
 static void vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
+
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		region_add(&inode->i_mapping->private_list, idx, idx + 1);
-
+		hugetlb_commit_page_charge(&inode->i_mapping->private_list,
+					   idx, idx + 1);
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *reservations = vma_resv_map(vma);
@@ -1040,9 +1057,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
 		return ERR_PTR(-VM_FAULT_OOM);
-	if (chg)
-		if (hugetlb_get_quota(inode->i_mapping, chg))
+	if (chg) {
+		if (hugetlb_get_quota(inode->i_mapping, chg)) {
+			vma_uncharge_reservation(h, vma, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
+		}
+	}
 
 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
@@ -1051,7 +1071,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (!page) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
-			hugetlb_put_quota(inode->i_mapping, chg);
+			if (chg) {
+				vma_uncharge_reservation(h, vma, chg);
+				hugetlb_put_quota(inode->i_mapping, chg);
+			}
 			return ERR_PTR(-VM_FAULT_SIGBUS);
 		}
 	}
@@ -1059,7 +1082,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	set_page_private(page, (unsigned long) mapping);
 
 	vma_commit_reservation(h, vma, addr);
-
 	return page;
 }
 
@@ -2961,9 +2983,10 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		chg = region_chg(&inode->i_mapping->private_list, from, to);
-	else {
+	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+		chg = hugetlb_page_charge(&inode->i_mapping->private_list,
+					  h, from, to);
+	} else {
 		struct resv_map *resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
@@ -2978,19 +3001,17 @@ int hugetlb_reserve_pages(struct inode *inode,
 		return chg;
 
 	/* There must be enough filesystem quota for the mapping */
-	if (hugetlb_get_quota(inode->i_mapping, chg))
-		return -ENOSPC;
-
+	if (hugetlb_get_quota(inode->i_mapping, chg)) {
+		ret = -ENOSPC;
+		goto err_quota;
+	}
 	/*
 	 * Check enough hugepages are available for the reservation.
 	 * Hand back the quota if there are not
 	 */
 	ret = hugetlb_acct_memory(h, chg);
-	if (ret < 0) {
-		hugetlb_put_quota(inode->i_mapping, chg);
-		return ret;
-	}
-
+	if (ret < 0)
+		goto err_acct_mem;
 	/*
 	 * Account for the reservations made. Shared mappings record regions
 	 * that have reservations as they are shared by multiple VMAs.
@@ -3003,15 +3024,26 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&inode->i_mapping->private_list, from, to);
+		hugetlb_commit_page_charge(&inode->i_mapping->private_list,
+					   from, to);
 	return 0;
+err_acct_mem:
+	hugetlb_put_quota(inode->i_mapping, chg);
+err_quota:
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
+		hugetlb_page_uncharge(&inode->i_mapping->private_list,
+				      h - hstates, chg << huge_page_order(h));
+	return ret;
+
 }
 
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
+	long chg;
 	struct hstate *h = hstate_inode(inode);
-	long chg = region_truncate(&inode->i_mapping->private_list, offset);
 
+	chg = hugetlb_truncate_cgroup(h, &inode->i_mapping->private_list,
+				      offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
1.7.9



* [PATCH -V1 5/9] hugetlbfs: Add controller support for private mapping
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2012-02-20 11:21 ` [PATCH -V1 4/9] hugetlbfs: Add controller support for shared mapping Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 6/9] hugetlbfs: Switch to new region APIs Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The HugeTLB controller is different from the memory controller in that we
charge the controller at mmap() time and not at fault time. This makes sure
userspace can fall back to a non-hugepage allocation when mmap() fails due
to the controller limit.

For private mappings we always charge/uncharge the current task's cgroup.
Charging happens during mmap(2), and uncharging happens in
vm_operations->close when the resv_map refcount reaches zero. The uncharge
count is stored in struct resv_map. For a child task after fork, the charging
happens at fault time in alloc_huge_page. We also need to make sure that for
private mappings every hugeTLB vma has a struct resv_map allocated, so that
we can store the uncharge count in the resv_map.
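
A rough userspace sketch of that fault-time behaviour for a forked child is
below; it uses an anonymous MAP_HUGETLB mapping as a stand-in for a private
hugetlbfs-backed mapping and assumes enough 2MB huge pages are available.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define LEN (4UL << 20)		/* two 2MB huge pages (size assumed) */

int main(void)
{
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 1, LEN);	/* parent: charged when the mapping is set up */

	if (fork() == 0) {
		/*
		 * Child: writing to the private mapping allocates fresh huge
		 * pages, charged to the child's cgroup at fault time via
		 * alloc_huge_page().
		 */
		memset(p, 2, LEN);
		_exit(0);
	}
	wait(NULL);
	return 0;
}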

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c  |   50 ++++++++++++++++++++++++++++++++
 include/linux/hugetlb.h        |    7 ++++
 include/linux/hugetlb_cgroup.h |   16 ++++++++++
 mm/hugetlb.c                   |   62 ++++++++++++++++++++++++++++++++--------
 4 files changed, 123 insertions(+), 12 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index 4806d43..a75661d 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -447,3 +447,53 @@ long hugetlb_truncate_cgroup(struct hstate *h,
 	}
 	return chg;
 }
+
+int hugetlb_priv_page_charge(struct resv_map *map, struct hstate *h, long chg)
+{
+	long csize;
+	int idx, ret;
+	struct hugetlb_cgroup *h_cg;
+	struct res_counter *fail_res;
+
+	/*
+	 * Get the task cgroup within rcu_read_lock() and also take a
+	 * cgroup reference to make sure a cgroup destroy won't race
+	 * with page_charge. We don't allow a cgroup to be destroyed
+	 * while it has some charge against it.
+	 */
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+	css_get(&h_cg->css);
+	rcu_read_unlock();
+
+	if (hugetlb_cgroup_is_root(h_cg)) {
+		ret = chg;
+		goto err_out;
+	}
+
+	csize = chg * huge_page_size(h);
+	idx = h - hstates;
+	ret = res_counter_charge(&h_cg->memhuge[idx], csize, &fail_res);
+	if (!ret) {
+		map->nr_pages[idx] += chg << huge_page_order(h);
+		ret = chg;
+	}
+err_out:
+	css_put(&h_cg->css);
+	return ret;
+}
+
+void hugetlb_priv_page_uncharge(struct resv_map *map, int idx, long nr_pages)
+{
+	struct hugetlb_cgroup *h_cg;
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	rcu_read_lock();
+	h_cg = task_hugetlbcgroup(current);
+	if (!hugetlb_cgroup_is_root(h_cg)) {
+		res_counter_uncharge(&h_cg->memhuge[idx], csize);
+		map->nr_pages[idx] -= nr_pages;
+	}
+	rcu_read_unlock();
+	return;
+}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4392b6a..8576fa0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -253,6 +253,12 @@ struct hstate *size_to_hstate(unsigned long size);
 #define HUGE_MAX_HSTATE 1
 #endif
 
+struct resv_map {
+	struct kref refs;
+	long nr_pages[HUGE_MAX_HSTATE];
+	struct list_head regions;
+};
+
 extern struct hstate hstates[HUGE_MAX_HSTATE];
 extern unsigned int default_hstate_idx;
 
@@ -323,6 +329,7 @@ static inline unsigned hstate_index_to_shift(unsigned index)
 
 #else
 struct hstate {};
+struct resv_map {};
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 9240e99..1af9dd8 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -32,6 +32,10 @@ extern void hugetlb_page_uncharge(struct list_head *head,
 extern void hugetlb_commit_page_charge(struct list_head *head, long f, long t);
 extern long hugetlb_truncate_cgroup(struct hstate *h,
 				    struct list_head *head, long from);
+extern int hugetlb_priv_page_charge(struct resv_map *map,
+				    struct hstate *h, long chg);
+extern void hugetlb_priv_page_uncharge(struct resv_map *map,
+				       int idx, long nr_pages);
 #else
 static inline long hugetlb_page_charge(struct list_head *head,
 				       struct hstate *h, long f, long t)
@@ -57,5 +61,17 @@ static inline long hugetlb_truncate_cgroup(struct hstate *h,
 {
 	return region_truncate(head, from);
 }
+
+static inline int hugetlb_priv_page_charge(struct resv_map *map,
+					   struct hstate *h, long chg)
+{
+	return chg;
+}
+
+static inline void hugetlb_priv_page_uncharge(struct resv_map *map,
+					      int idx, long nr_pages)
+{
+	return;
+}
 #endif /* CONFIG_CGROUP_HUGETLB_RES_CTLR */
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 80ee085..e1a0328 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -303,14 +303,9 @@ static void set_vma_private_data(struct vm_area_struct *vma,
 	vma->vm_private_data = (void *)value;
 }
 
-struct resv_map {
-	struct kref refs;
-	struct list_head regions;
-};
-
 static struct resv_map *resv_map_alloc(void)
 {
-	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
+	struct resv_map *resv_map = kzalloc(sizeof(*resv_map), GFP_KERNEL);
 	if (!resv_map)
 		return NULL;
 
@@ -322,10 +317,16 @@ static struct resv_map *resv_map_alloc(void)
 
 static void resv_map_release(struct kref *ref)
 {
+	int idx;
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
 	region_truncate(&resv_map->regions, 0);
+	/* drop the hugetlb cgroup charge */
+	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
+		hugetlb_priv_page_uncharge(resv_map, idx,
+					   resv_map->nr_pages[idx]);
+	}
 	kfree(resv_map);
 }
 
@@ -989,9 +990,20 @@ static long vma_needs_reservation(struct hstate *h,
 		return hugetlb_page_charge(&inode->i_mapping->private_list,
 					   h, idx, idx + 1);
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
-		return 1;
-
+		struct resv_map *resv_map = vma_resv_map(vma);
+		if (!resv_map) {
+			/*
+			 * We didn't allocate resv_map for this vma.
+			 * Allocate it here.
+			 */
+			resv_map = resv_map_alloc();
+			if (!resv_map)
+				return -ENOMEM;
+			set_vma_resv_map(vma, resv_map);
+		}
+		return hugetlb_priv_page_charge(resv_map, h, 1);
 	} else  {
+		/* We did the priv page charging in mmap call */
 		long err;
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *reservations = vma_resv_map(vma);
@@ -1007,14 +1019,20 @@ static void vma_uncharge_reservation(struct hstate *h,
 				     struct vm_area_struct *vma,
 				     unsigned long chg)
 {
+	int idx = h - hstates;
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		return hugetlb_page_uncharge(&inode->i_mapping->private_list,
-					     h - hstates,
-					     chg << huge_page_order(h));
+					     idx, chg << huge_page_order(h));
+	} else {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		return hugetlb_priv_page_uncharge(resv_map,
+						  idx,
+						  chg << huge_page_order(h));
 	}
 }
 
@@ -2165,6 +2183,22 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 	 */
 	if (reservations)
 		kref_get(&reservations->refs);
+	else if (!(vma->vm_flags & VM_MAYSHARE)) {
+		/*
+		 * for non shared vma we need resv map to track
+		 * hugetlb cgroup usage. Allocate it here. Charging
+		 * the cgroup will take place in fault path.
+		 */
+		struct resv_map *resv_map = resv_map_alloc();
+		/*
+		 * If we fail to allocate a resv_map here, we will allocate
+		 * one when we do alloc_huge_page, so we don't handle
+		 * ENOMEM here. This function also returns void, so there
+		 * is not much we could do anyway.
+		 */
+		if (resv_map)
+			set_vma_resv_map(vma, resv_map);
+	}
 }
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
@@ -2968,7 +3002,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 {
 	long ret, chg;
 	struct hstate *h = hstate_inode(inode);
-
+	struct resv_map *resv_map = NULL;
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
 	 * attempt will be made for VM_NORESERVE to allocate a page
@@ -2987,7 +3021,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 		chg = hugetlb_page_charge(&inode->i_mapping->private_list,
 					  h, from, to);
 	} else {
-		struct resv_map *resv_map = resv_map_alloc();
+		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
 
@@ -2995,6 +3029,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 
 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
+		chg = hugetlb_priv_page_charge(resv_map, h, chg);
 	}
 
 	if (chg < 0)
@@ -3033,6 +3068,9 @@ err_quota:
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		hugetlb_page_uncharge(&inode->i_mapping->private_list,
 				      h - hstates, chg << huge_page_order(h));
+	else
+		hugetlb_priv_page_uncharge(resv_map, h - hstates,
+					   chg << huge_page_order(h));
 	return ret;
 
 }
-- 
1.7.9



* [PATCH -V1 6/9] hugetlbfs: Switch to new region APIs
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2012-02-20 11:21 ` [PATCH -V1 5/9] hugetlbfs: Add controller support for private mapping Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 7/9] hugetlbfs: Add truncate region functions Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Switch to the new region APIs and remove the old region code, which is no
longer used.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/Makefile          |    2 +-
 fs/hugetlbfs/hugetlb_cgroup.c  |  135 +--------------------------
 fs/hugetlbfs/region.c          |  202 ++++++++++++++++++++++++++++++++++++++++
 include/linux/hugetlb_cgroup.h |   17 +++-
 mm/hugetlb.c                   |  163 +--------------------------------
 5 files changed, 222 insertions(+), 297 deletions(-)
 create mode 100644 fs/hugetlbfs/region.c

diff --git a/fs/hugetlbfs/Makefile b/fs/hugetlbfs/Makefile
index 986c778..3c544fe 100644
--- a/fs/hugetlbfs/Makefile
+++ b/fs/hugetlbfs/Makefile
@@ -4,5 +4,5 @@
 
 obj-$(CONFIG_HUGETLBFS) += hugetlbfs.o
 
-hugetlbfs-objs := inode.o
+hugetlbfs-objs := inode.o region.o
 hugetlbfs-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index a75661d..a4c6786 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -18,6 +18,8 @@
 #include <linux/hugetlb.h>
 #include <linux/res_counter.h>
 #include <linux/list.h>
+#include <linux/hugetlb_cgroup.h>
+
 
 /* lifted from mem control */
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -32,136 +34,9 @@ struct hugetlb_cgroup {
 	struct res_counter memhuge[HUGE_MAX_HSTATE];
 };
 
-struct file_region_with_data {
-	struct list_head link;
-	long from;
-	long to;
-	unsigned long data;
-};
-
 struct cgroup_subsys hugetlb_subsys __read_mostly;
 struct hugetlb_cgroup *root_h_cgroup __read_mostly;
 
-/*
- * A variant of region_add that only merges regions if the data
- * values match.
- */
-static long region_chg_with_same(struct list_head *head,
-				 long f, long t, unsigned long data)
-{
-	long chg = 0;
-	struct file_region_with_data *rg, *nrg, *trg;
-
-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-	/*
-	 * If we are below the current region then a new region is required.
-	 * Subtle, allocate a new region at the position but make it zero
-	 * size such that we can guarantee to record the reservation.
-	 */
-	if (&rg->link == head || t < rg->from) {
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
-		nrg->from = f;
-		nrg->to = f;
-		nrg->data = data;
-		INIT_LIST_HEAD(&nrg->link);
-		list_add(&nrg->link, rg->link.prev);
-		return t - f;
-	}
-	/*
-	 * f rg->from t rg->to
-	 */
-	if (f < rg->from && data != rg->data) {
-		/* we need to allocate a new region */
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
-		nrg->from = f;
-		nrg->to = f;
-		nrg->data = data;
-		INIT_LIST_HEAD(&nrg->link);
-		list_add(&nrg->link, rg->link.prev);
-	}
-
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-	chg = t - f;
-
-	/* Check for and consume any regions we now overlap with. */
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			return chg;
-		/*
-		 * rg->from f rg->to t
-		 */
-		if (t > rg->to && data != rg->data) {
-			/* we need to allocate a new region */
-			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-			if (!nrg)
-				return -ENOMEM;
-			nrg->from = rg->to;
-			nrg->to  = rg->to;
-			nrg->data = data;
-			INIT_LIST_HEAD(&nrg->link);
-			list_add(&nrg->link, &rg->link);
-		}
-		/*
-		 * update charge
-		 */
-		if (rg->to > t) {
-			chg += rg->to - t;
-			t = rg->to;
-		}
-		chg -= rg->to - rg->from;
-	}
-	return chg;
-}
-
-static void region_add_with_same(struct list_head *head,
-				 long f, long t, unsigned long data)
-{
-	struct file_region_with_data *rg, *nrg, *trg;
-
-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-
-		if (rg->from > t)
-			return;
-		if (&rg->link == head)
-			return;
-
-		/*FIXME!! this can possibly delete few regions */
-		/* We need to worry only if we match data */
-		if (rg->data == data) {
-			if (f < rg->from)
-				rg->from = f;
-			if (t > rg->to) {
-				/* if we are the last entry */
-				if (rg->link.next == head) {
-					rg->to = t;
-					break;
-				} else {
-					nrg = list_entry(rg->link.next,
-							 typeof(*nrg), link);
-					rg->to = nrg->from;
-				}
-			}
-		}
-		f = rg->to;
-	}
-}
-
 static inline
 struct hugetlb_cgroup *css_to_hugetlbcgroup(struct cgroup_subsys_state *s)
 {
@@ -355,7 +230,7 @@ long hugetlb_page_charge(struct list_head *head,
 	css_get(&h_cg->css);
 	rcu_read_unlock();
 
-	chg = region_chg_with_same(head, f, t, (unsigned long)h_cg);
+	chg = region_chg(head, f, t, (unsigned long)h_cg);
 	if (chg < 0)
 		goto err_out;
 
@@ -400,7 +275,7 @@ void hugetlb_commit_page_charge(struct list_head *head, long f, long t)
 
 	rcu_read_lock();
 	h_cg = task_hugetlbcgroup(current);
-	region_add_with_same(head, f, t, (unsigned long)h_cg);
+	region_add(head, f, t, (unsigned long)h_cg);
 	rcu_read_unlock();
 	return;
 }
@@ -411,7 +286,7 @@ long hugetlb_truncate_cgroup(struct hstate *h,
 	long chg = 0, csize;
 	int idx = h - hstates;
 	struct hugetlb_cgroup *h_cg;
-	struct file_region_with_data *rg, *trg;
+	struct file_region *rg, *trg;
 
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
diff --git a/fs/hugetlbfs/region.c b/fs/hugetlbfs/region.c
new file mode 100644
index 0000000..d2445fb
--- /dev/null
+++ b/fs/hugetlbfs/region.c
@@ -0,0 +1,202 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/list.h>
+#include <linux/hugetlb_cgroup.h>
+
+/*
+ * Region tracking -- allows tracking of reservations and instantiated pages
+ *                    across the pages in a mapping.
+ *
+ * The region data structures are protected by a combination of the mmap_sem
+ * and the hugetlb_instantion_mutex.  To access or modify a region the caller
+ * must either hold the mmap_sem for write, or the mmap_sem for read and
+ * the hugetlb_instantiation mutex:
+ *
+ *	down_write(&mm->mmap_sem);
+ * or
+ *	down_read(&mm->mmap_sem);
+ *	mutex_lock(&hugetlb_instantiation_mutex);
+ */
+
+long region_chg(struct list_head *head, long f, long t, unsigned long data)
+{
+	long chg = 0;
+	struct file_region *rg, *nrg, *trg;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+	/*
+	 * If we are below the current region then a new region is required.
+	 * Subtle, allocate a new region at the position but make it zero
+	 * size such that we can guarantee to record the reservation.
+	 */
+	if (&rg->link == head || t < rg->from) {
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (!nrg)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to = f;
+		nrg->data = data;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+		return t - f;
+	}
+	/*
+	 * f rg->from t rg->to
+	 */
+	if (f < rg->from && data != rg->data) {
+		/* we need to allocate a new region */
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (!nrg)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to = f;
+		nrg->data = data;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+	}
+
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+	chg = t - f;
+
+	/* Check for and consume any regions we now overlap with. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > t)
+			return chg;
+		/*
+		 * rg->from f rg->to t
+		 */
+		if (t > rg->to && data != rg->data) {
+			/* we need to allocate a new region */
+			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+			if (!nrg)
+				return -ENOMEM;
+			nrg->from = rg->to;
+			nrg->to  = rg->to;
+			nrg->data = data;
+			INIT_LIST_HEAD(&nrg->link);
+			list_add(&nrg->link, &rg->link);
+		}
+		/*
+		 * update charge
+		 */
+		if (rg->to > t) {
+			chg += rg->to - t;
+			t = rg->to;
+		}
+		chg -= rg->to - rg->from;
+	}
+	return chg;
+}
+
+void region_add(struct list_head *head, long f, long t, unsigned long data)
+{
+	struct file_region *rg, *nrg, *trg;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+
+		if (rg->from > t)
+			return;
+		if (&rg->link == head)
+			return;
+
+		/*FIXME!! this can possibly delete few regions */
+		/* We need to worry only if we match data */
+		if (rg->data == data) {
+			if (f < rg->from)
+				rg->from = f;
+			if (t > rg->to) {
+				/* if we are the last entry */
+				if (rg->link.next == head) {
+					rg->to = t;
+					break;
+				} else {
+					nrg = list_entry(rg->link.next,
+							 typeof(*nrg), link);
+					rg->to = nrg->from;
+				}
+			}
+		}
+		f = rg->to;
+	}
+}
+
+long region_truncate(struct list_head *head, long end)
+{
+	struct file_region *rg, *trg;
+	long chg = 0;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (end <= rg->to)
+			break;
+	if (&rg->link == head)
+		return 0;
+
+	/* If we are in the middle of a region then adjust it. */
+	if (end > rg->from) {
+		chg = rg->to - end;
+		rg->to = end;
+		rg = list_entry(rg->link.next, typeof(*rg), link);
+	}
+
+	/* Drop any remaining regions. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		chg += rg->to - rg->from;
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return chg;
+}
+
+long region_count(struct list_head *head, long f, long t)
+{
+	struct file_region *rg;
+	long chg = 0;
+
+	/* Locate each segment we overlap with, and count that overlap. */
+	list_for_each_entry(rg, head, link) {
+		int seg_from;
+		int seg_to;
+
+		if (rg->to <= f)
+			continue;
+		if (rg->from >= t)
+			break;
+
+		seg_from = max(rg->from, f);
+		seg_to = min(rg->to, t);
+
+		chg += seg_to - seg_from;
+	}
+
+	return chg;
+}
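
For reference, the caller-side contract of the relocated region API is a
two-phase charge/commit: region_chg() computes how many pages of [from, to)
are not yet covered (pre-allocating a zero-size placeholder so the later
commit cannot fail on memory), and region_add() records the range once the
charge has gone through. A minimal sketch under that assumption --
reserve_range() is a hypothetical helper, not part of this series:

#include <linux/list.h>
#include <linux/hugetlb_cgroup.h>

/* Hypothetical helper, for illustration only. */
static long reserve_range(struct list_head *regions, long from, long to,
			  unsigned long data)
{
	long chg;

	/* Phase 1: how many pages does [from, to) still need? */
	chg = region_chg(regions, from, to, data);
	if (chg < 0)
		return chg;	/* -ENOMEM from the placeholder allocation */

	/* ... charge 'chg' huge pages against the owning cgroup here ... */

	/* Phase 2: commit the range with the same data tag. */
	region_add(regions, from, to, data);
	return chg;
}
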
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 1af9dd8..eaad86b 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -15,8 +15,16 @@
 #ifndef _LINUX_HUGETLB_CGROUP_H
 #define _LINUX_HUGETLB_CGROUP_H
 
-extern long region_add(struct list_head *head, long f, long t);
-extern long region_chg(struct list_head *head, long f, long t);
+struct file_region {
+	long from, to;
+	unsigned long data;
+	struct list_head link;
+};
+
+extern long region_chg(struct list_head *head, long f, long t,
+		       unsigned long data);
+extern void region_add(struct list_head *head, long f, long t,
+		       unsigned long data);
 extern long region_truncate(struct list_head *head, long end);
 extern long region_count(struct list_head *head, long f, long t);
 
@@ -40,7 +48,7 @@ extern void hugetlb_priv_page_uncharge(struct resv_map *map,
 static inline long hugetlb_page_charge(struct list_head *head,
 				       struct hstate *h, long f, long t)
 {
-	return region_chg(head, f, t);
+	return region_chg(head, f, t, 0);
 }
 
 static inline void hugetlb_page_uncharge(struct list_head *head,
@@ -52,8 +60,7 @@ static inline void hugetlb_page_uncharge(struct list_head *head,
 static inline void hugetlb_commit_page_charge(struct list_head *head,
 					      long f, long t)
 {
-	region_add(head, f, t);
-	return;
+	return region_add(head, f, t, 0);
 }
 
 static inline long hugetlb_truncate_cgroup(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e1a0328..08555c6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -59,165 +59,6 @@ static unsigned long __initdata default_hstate_size;
 static DEFINE_SPINLOCK(hugetlb_lock);
 
 /*
- * Region tracking -- allows tracking of reservations and instantiated pages
- *                    across the pages in a mapping.
- *
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantion_mutex.  To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
- *
- *	down_write(&mm->mmap_sem);
- * or
- *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
- */
-struct file_region {
-	struct list_head link;
-	long from;
-	long to;
-};
-
-long region_add(struct list_head *head, long f, long t)
-{
-	struct file_region *rg, *nrg, *trg;
-
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-
-	/* Check for and consume any regions we now overlap with. */
-	nrg = rg;
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			break;
-
-		/* If this area reaches higher then extend our area to
-		 * include it completely.  If this is not the first area
-		 * which we intend to reuse, free it. */
-		if (rg->to > t)
-			t = rg->to;
-		if (rg != nrg) {
-			list_del(&rg->link);
-			kfree(rg);
-		}
-	}
-	nrg->from = f;
-	nrg->to = t;
-	return 0;
-}
-
-long region_chg(struct list_head *head, long f, long t)
-{
-	struct file_region *rg, *nrg;
-	long chg = 0;
-
-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	/* If we are below the current region then a new region is required.
-	 * Subtle, allocate a new region at the position but make it zero
-	 * size such that we can guarantee to record the reservation. */
-	if (&rg->link == head || t < rg->from) {
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
-		nrg->from = f;
-		nrg->to   = f;
-		INIT_LIST_HEAD(&nrg->link);
-		list_add(&nrg->link, rg->link.prev);
-
-		return t - f;
-	}
-
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-	chg = t - f;
-
-	/* Check for and consume any regions we now overlap with. */
-	list_for_each_entry(rg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			return chg;
-
-		/* We overlap with this area, if it extends further than
-		 * us then we must extend ourselves.  Account for its
-		 * existing reservation. */
-		if (rg->to > t) {
-			chg += rg->to - t;
-			t = rg->to;
-		}
-		chg -= rg->to - rg->from;
-	}
-	return chg;
-}
-
-long region_truncate(struct list_head *head, long end)
-{
-	struct file_region *rg, *trg;
-	long chg = 0;
-
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (end <= rg->to)
-			break;
-	if (&rg->link == head)
-		return 0;
-
-	/* If we are in the middle of a region then adjust it. */
-	if (end > rg->from) {
-		chg = rg->to - end;
-		rg->to = end;
-		rg = list_entry(rg->link.next, typeof(*rg), link);
-	}
-
-	/* Drop any remaining regions. */
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		chg += rg->to - rg->from;
-		list_del(&rg->link);
-		kfree(rg);
-	}
-	return chg;
-}
-
-long region_count(struct list_head *head, long f, long t)
-{
-	struct file_region *rg;
-	long chg = 0;
-
-	/* Locate each segment we overlap with, and count that overlap. */
-	list_for_each_entry(rg, head, link) {
-		int seg_from;
-		int seg_to;
-
-		if (rg->to <= f)
-			continue;
-		if (rg->from >= t)
-			break;
-
-		seg_from = max(rg->from, f);
-		seg_to = min(rg->to, t);
-
-		chg += seg_to - seg_from;
-	}
-
-	return chg;
-}
-
-/*
  * Convert the address within this vma to the page offset within
  * the mapping, in pagecache page units; huge pages here.
  */
@@ -1008,7 +849,7 @@ static long vma_needs_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *reservations = vma_resv_map(vma);
 
-		err = region_chg(&reservations->regions, idx, idx + 1);
+		err = region_chg(&reservations->regions, idx, idx + 1, 0);
 		if (err < 0)
 			return err;
 		return 0;
@@ -1052,7 +893,7 @@ static void vma_commit_reservation(struct hstate *h,
 		struct resv_map *reservations = vma_resv_map(vma);
 
 		/* Mark this page used in the map. */
-		region_add(&reservations->regions, idx, idx + 1);
+		region_add(&reservations->regions, idx, idx + 1, 0);
 	}
 }
 
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH -V1 7/9] hugetlbfs: Add truncate region functions
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2012-02-20 11:21 ` [PATCH -V1 6/9] hugetlbfs: Switch to new region APIs Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 8/9] hugetlbfs: Add task migration support Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 9/9] hugetlbfs: Add HugeTLB controller documentation Aneesh Kumar K.V
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This will later be used by the task migration patches.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c  |   84 ++++++++++++++++++++++++++++++++++++++++
 fs/hugetlbfs/region.c          |   58 +++++++++++++++++++++++++++
 include/linux/hugetlb_cgroup.h |   12 +++++-
 3 files changed, 153 insertions(+), 1 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index a4c6786..b8b319b 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -323,6 +323,90 @@ long hugetlb_truncate_cgroup(struct hstate *h,
 	return chg;
 }
 
+long hugetlb_truncate_cgroup_range(struct hstate *h,
+				   struct list_head *head, long from, long to)
+{
+	long chg = 0, csize;
+	int idx = h - hstates;
+	struct hugetlb_cgroup *h_cg;
+	struct file_region *rg, *trg;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (from <= rg->to)
+			break;
+	if (&rg->link == head)
+		return 0;
+
+	/* If we are in the middle of a region then adjust it. */
+	if (from > rg->from) {
+		if (to < rg->to) {
+			struct file_region *nrg;
+			/* rg->from from to rg->to */
+			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+			/*
+			 * If we fail to allocate, return with a 0 charge.
+			 * A later complete truncate will reclaim the
+			 * leftover space.
+			 */
+			if (!nrg)
+				return 0;
+			nrg->from = to;
+			nrg->to = rg->to;
+			nrg->data = rg->data;
+			INIT_LIST_HEAD(&nrg->link);
+			list_add(&nrg->link, &rg->link);
+
+			/* Adjust the rg entry */
+			rg->to = from;
+			chg = to - from;
+			h_cg = (struct hugetlb_cgroup *)rg->data;
+			if (!hugetlb_cgroup_is_root(h_cg)) {
+				csize = chg * huge_page_size(h);
+				res_counter_uncharge(&h_cg->memhuge[idx],
+						     csize);
+			}
+			return chg;
+		}
+		chg = rg->to - from;
+		rg->to = from;
+		h_cg = (struct hugetlb_cgroup *)rg->data;
+		if (!hugetlb_cgroup_is_root(h_cg)) {
+			csize = chg * huge_page_size(h);
+			res_counter_uncharge(&h_cg->memhuge[idx], csize);
+		}
+		rg = list_entry(rg->link.next, typeof(*rg), link);
+	}
+	/* Drop any remaining regions till to */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (rg->from >= to)
+			break;
+		if (&rg->link == head)
+			break;
+		if (rg->to > to) {
+			/* rg->from to rg->to */
+			chg += to - rg->from;
+			/* compute the uncharge size before rg->from moves */
+			csize = (to - rg->from) * huge_page_size(h);
+			rg->from = to;
+			h_cg = (struct hugetlb_cgroup *)rg->data;
+			if (!hugetlb_cgroup_is_root(h_cg)) {
+				res_counter_uncharge(&h_cg->memhuge[idx],
+						     csize);
+			}
+			return chg;
+		}
+		chg += rg->to - rg->from;
+		h_cg = (struct hugetlb_cgroup *)rg->data;
+		if (!hugetlb_cgroup_is_root(h_cg)) {
+			csize = (rg->to - rg->from) * huge_page_size(h);
+			res_counter_uncharge(&h_cg->memhuge[idx], csize);
+		}
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return chg;
+}
+
 int hugetlb_priv_page_charge(struct resv_map *map, struct hstate *h, long chg)
 {
 	long csize;
diff --git a/fs/hugetlbfs/region.c b/fs/hugetlbfs/region.c
index d2445fb..8ac63b0 100644
--- a/fs/hugetlbfs/region.c
+++ b/fs/hugetlbfs/region.c
@@ -200,3 +200,61 @@ long region_count(struct list_head *head, long f, long t)
 
 	return chg;
 }
+
+long region_truncate_range(struct list_head *head, long from, long to)
+{
+	long chg = 0;
+	struct file_region *rg, *trg;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (from <= rg->to)
+			break;
+	if (&rg->link == head)
+		return 0;
+
+	/* If we are in the middle of a region then adjust it. */
+	if (from > rg->from) {
+		if (to < rg->to) {
+			struct file_region *nrg;
+			/* rg->from from to rg->to */
+			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+			/*
+			 * If we fail to allocate, return with a 0 charge.
+			 * A later complete truncate will reclaim the
+			 * leftover space.
+			 */
+			if (!nrg)
+				return 0;
+			nrg->from = to;
+			nrg->to = rg->to;
+			nrg->data = rg->data;
+			INIT_LIST_HEAD(&nrg->link);
+			list_add(&nrg->link, &rg->link);
+
+			/* Adjust the rg entry */
+			rg->to = from;
+			chg = to - from;
+			return chg;
+		}
+		chg = rg->to - from;
+		rg->to = from;
+		rg = list_entry(rg->link.next, typeof(*rg), link);
+	}
+	/* Drop any remaining regions till to */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (rg->from >= to)
+			break;
+		if (&rg->link == head)
+			break;
+		if (rg->to > to) {
+			chg += to - rg->from;
+			rg->from = to;
+			return chg;
+		}
+		chg += rg->to - rg->from;
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return chg;
+}
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index eaad86b..68c1d61 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -27,7 +27,7 @@ extern void region_add(struct list_head *head, long f, long t,
 		       unsigned long data);
 extern long region_truncate(struct list_head *head, long end);
 extern long region_count(struct list_head *head, long f, long t);
-
+extern long region_truncate_range(struct list_head *head, long from, long end);
 #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
 extern u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft);
 extern int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
@@ -40,6 +40,9 @@ extern void hugetlb_page_uncharge(struct list_head *head,
 extern void hugetlb_commit_page_charge(struct list_head *head, long f, long t);
 extern long hugetlb_truncate_cgroup(struct hstate *h,
 				    struct list_head *head, long from);
+extern long  hugetlb_truncate_cgroup_range(struct hstate *h,
+					   struct list_head *head,
+					   long from, long end);
 extern int hugetlb_priv_page_charge(struct resv_map *map,
 				    struct hstate *h, long chg);
 extern void hugetlb_priv_page_uncharge(struct resv_map *map,
@@ -69,6 +72,13 @@ static inline long hugetlb_truncate_cgroup(struct hstate *h,
 	return region_truncate(head, from);
 }
 
+static inline long  hugetlb_truncate_cgroup_range(struct hstate *h,
+						  struct list_head *head,
+						  long from, long end)
+{
+	return region_truncate_range(head, from, end);
+}
+
 static inline int hugetlb_priv_page_charge(struct resv_map *map,
 					   struct hstate *h, long chg)
 {
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH -V1 8/9] hugetlbfs: Add task migration support
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2012-02-20 11:21 ` [PATCH -V1 7/9] hugetlbfs: Add truncate region functions Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  2012-02-20 11:21 ` [PATCH -V1 9/9] hugetlbfs: Add HugeTLB controller documentation Aneesh Kumar K.V
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This patch adds task migration support to the hugetlb cgroup. When a task
migrates, we do not move its charge across hugetlb cgroups.
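
The reason no charge transfer is needed: the charging cgroup is recorded in
each file_region's data field at charge time, so the uncharge at
truncate/unmap time always goes back to the cgroup that was originally
charged, regardless of where the task lives now. A rough sketch of that
uncharge step, mirroring hugetlb_truncate_cgroup_range();
uncharge_one_region() is a hypothetical helper named here only for
illustration:

#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>

/* Hypothetical helper, for illustration only. */
static void uncharge_one_region(struct hstate *h, struct file_region *rg)
{
	int idx = h - hstates;
	struct hugetlb_cgroup *h_cg = (struct hugetlb_cgroup *)rg->data;

	/* The cgroup comes from the region, not from the current task. */
	if (!hugetlb_cgroup_is_root(h_cg))
		res_counter_uncharge(&h_cg->memhuge[idx],
				     (rg->to - rg->from) * huge_page_size(h));
}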

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/hugetlb_cgroup.c  |   74 ----------------------
 fs/hugetlbfs/region.c          |   24 -------
 include/linux/hugetlb.h        |    1 -
 include/linux/hugetlb_cgroup.h |   17 -----
 mm/hugetlb.c                   |  134 +++++++++++++++-------------------------
 5 files changed, 49 insertions(+), 201 deletions(-)

diff --git a/fs/hugetlbfs/hugetlb_cgroup.c b/fs/hugetlbfs/hugetlb_cgroup.c
index b8b319b..44b3d5e 100644
--- a/fs/hugetlbfs/hugetlb_cgroup.c
+++ b/fs/hugetlbfs/hugetlb_cgroup.c
@@ -114,29 +114,6 @@ int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
 	return 0;
 }
 
-static int hugetlbcgroup_can_attach(struct cgroup_subsys *ss,
-				    struct cgroup *new_cgrp,
-				    struct cgroup_taskset *set)
-{
-	struct hugetlb_cgroup *h_cg;
-	struct task_struct *task = cgroup_taskset_first(set);
-	/*
-	 * Make sure all the task in the set are in root cgroup
-	 * We only allow move from root cgroup to other cgroup.
-	 */
-	while (task != NULL) {
-		rcu_read_lock();
-		h_cg = task_hugetlbcgroup(task);
-		if (!hugetlb_cgroup_is_root(h_cg)) {
-			rcu_read_unlock();
-			return -EOPNOTSUPP;
-		}
-		rcu_read_unlock();
-		task = cgroup_taskset_next(set);
-	}
-	return 0;
-}
-
 /*
  * called from kernel/cgroup.c with cgroup_lock() held.
  */
@@ -202,7 +179,6 @@ static int hugetlbcgroup_populate(struct cgroup_subsys *ss,
 
 struct cgroup_subsys hugetlb_subsys = {
 	.name = "hugetlb",
-	.can_attach = hugetlbcgroup_can_attach,
 	.create     = hugetlbcgroup_create,
 	.pre_destroy = hugetlbcgroup_pre_destroy,
 	.destroy    = hugetlbcgroup_destroy,
@@ -406,53 +382,3 @@ long hugetlb_truncate_cgroup_range(struct hstate *h,
 	}
 	return chg;
 }
-
-int hugetlb_priv_page_charge(struct resv_map *map, struct hstate *h, long chg)
-{
-	long csize;
-	int idx, ret;
-	struct hugetlb_cgroup *h_cg;
-	struct res_counter *fail_res;
-
-	/*
-	 * Get the task cgroup within rcu_readlock and also
-	 * get cgroup reference to make sure cgroup destroy won't
-	 * race with page_charge. We don't allow a cgroup destroy
-	 * when the cgroup have some charge against it
-	 */
-	rcu_read_lock();
-	h_cg = task_hugetlbcgroup(current);
-	css_get(&h_cg->css);
-	rcu_read_unlock();
-
-	if (hugetlb_cgroup_is_root(h_cg)) {
-		ret = chg;
-		goto err_out;
-	}
-
-	csize = chg * huge_page_size(h);
-	idx = h - hstates;
-	ret = res_counter_charge(&h_cg->memhuge[idx], csize, &fail_res);
-	if (!ret) {
-		map->nr_pages[idx] += chg << huge_page_order(h);
-		ret = chg;
-	}
-err_out:
-	css_put(&h_cg->css);
-	return ret;
-}
-
-void hugetlb_priv_page_uncharge(struct resv_map *map, int idx, long nr_pages)
-{
-	struct hugetlb_cgroup *h_cg;
-	unsigned long csize = nr_pages * PAGE_SIZE;
-
-	rcu_read_lock();
-	h_cg = task_hugetlbcgroup(current);
-	if (!hugetlb_cgroup_is_root(h_cg)) {
-		res_counter_uncharge(&h_cg->memhuge[idx], csize);
-		map->nr_pages[idx] -= nr_pages;
-	}
-	rcu_read_unlock();
-	return;
-}
diff --git a/fs/hugetlbfs/region.c b/fs/hugetlbfs/region.c
index 8ac63b0..483473f 100644
--- a/fs/hugetlbfs/region.c
+++ b/fs/hugetlbfs/region.c
@@ -177,30 +177,6 @@ long region_truncate(struct list_head *head, long end)
 	return chg;
 }
 
-long region_count(struct list_head *head, long f, long t)
-{
-	struct file_region *rg;
-	long chg = 0;
-
-	/* Locate each segment we overlap with, and count that overlap. */
-	list_for_each_entry(rg, head, link) {
-		int seg_from;
-		int seg_to;
-
-		if (rg->to <= f)
-			continue;
-		if (rg->from >= t)
-			break;
-
-		seg_from = max(rg->from, f);
-		seg_to = min(rg->to, t);
-
-		chg += seg_to - seg_from;
-	}
-
-	return chg;
-}
-
 long region_truncate_range(struct list_head *head, long from, long to)
 {
 	long chg = 0;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8576fa0..226f488 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -255,7 +255,6 @@ struct hstate *size_to_hstate(unsigned long size);
 
 struct resv_map {
 	struct kref refs;
-	long nr_pages[HUGE_MAX_HSTATE];
 	struct list_head regions;
 };
 
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 68c1d61..9d51235 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -26,7 +26,6 @@ extern long region_chg(struct list_head *head, long f, long t,
 extern void region_add(struct list_head *head, long f, long t,
 		       unsigned long data);
 extern long region_truncate(struct list_head *head, long end);
-extern long region_count(struct list_head *head, long f, long t);
 extern long region_truncate_range(struct list_head *head, long from, long end);
 #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
 extern u64 hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft);
@@ -43,10 +42,6 @@ extern long hugetlb_truncate_cgroup(struct hstate *h,
 extern long  hugetlb_truncate_cgroup_range(struct hstate *h,
 					   struct list_head *head,
 					   long from, long end);
-extern int hugetlb_priv_page_charge(struct resv_map *map,
-				    struct hstate *h, long chg);
-extern void hugetlb_priv_page_uncharge(struct resv_map *map,
-				       int idx, long nr_pages);
 #else
 static inline long hugetlb_page_charge(struct list_head *head,
 				       struct hstate *h, long f, long t)
@@ -78,17 +73,5 @@ static inline long  hugetlb_truncate_cgroup_range(struct hstate *h,
 {
 	return region_truncate_range(head, from, end);
 }
-
-static inline int hugetlb_priv_page_charge(struct resv_map *map,
-					   struct hstate *h, long chg)
-{
-	return chg;
-}
-
-static inline void hugetlb_priv_page_uncharge(struct resv_map *map,
-					      int idx, long nr_pages)
-{
-	return;
-}
 #endif /* CONFIG_CGROUP_HUGETLB_RES_CTLR */
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 08555c6..aaed6d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -156,18 +156,15 @@ static struct resv_map *resv_map_alloc(void)
 	return resv_map;
 }
 
-static void resv_map_release(struct kref *ref)
+static void resv_map_release(struct hstate *h, struct resv_map *resv_map)
 {
-	int idx;
-	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
-
-	/* Clear out any active regions before we release the map. */
-	region_truncate(&resv_map->regions, 0);
-	/* drop the hugetlb cgroup charge */
-	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
-		hugetlb_priv_page_uncharge(resv_map, idx,
-					   resv_map->nr_pages[idx]);
-	}
+	/*
+	 * We should not have any regions left here if the memory
+	 * allocation in hugetlb_truncate_cgroup_range() succeeded.
+	 *
+	 * Clear out any active regions before we release the map.
+	 */
+	hugetlb_truncate_cgroup(h, &resv_map->regions, 0);
 	kfree(resv_map);
 }
 
@@ -380,9 +377,7 @@ static void free_huge_page(struct page *page)
 	 */
 	struct hstate *h = page_hstate(page);
 	int nid = page_to_nid(page);
-	struct address_space *mapping;
 
-	mapping = (struct address_space *) page_private(page);
 	set_page_private(page, 0);
 	page->mapping = NULL;
 	BUG_ON(page_count(page));
@@ -398,8 +393,6 @@ static void free_huge_page(struct page *page)
 		enqueue_huge_page(h, page);
 	}
 	spin_unlock(&hugetlb_lock);
-	if (mapping)
-		hugetlb_put_quota(mapping, 1);
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -822,12 +815,12 @@ static void return_unused_surplus_pages(struct hstate *h,
 static long vma_needs_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
+	pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 
 	if (vma->vm_flags & VM_MAYSHARE) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		return hugetlb_page_charge(&inode->i_mapping->private_list,
 					   h, idx, idx + 1);
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
@@ -842,18 +835,13 @@ static long vma_needs_reservation(struct hstate *h,
 				return -ENOMEM;
 			set_vma_resv_map(vma, resv_map);
 		}
-		return hugetlb_priv_page_charge(resv_map, h, 1);
-	} else  {
-		/* We did the priv page charging in mmap call */
-		long err;
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *reservations = vma_resv_map(vma);
-
-		err = region_chg(&reservations->regions, idx, idx + 1, 0);
-		if (err < 0)
-			return err;
-		return 0;
+		return hugetlb_page_charge(&resv_map->regions,
+					   h, idx, idx + 1);
 	}
+	/*
+	 * We did the private page charging in the mmap() call.
+	 */
+	return 0;
 }
 
 static void vma_uncharge_reservation(struct hstate *h,
@@ -861,40 +849,37 @@ static void vma_uncharge_reservation(struct hstate *h,
 				     unsigned long chg)
 {
 	int idx = h - hstates;
+	struct list_head *region_list;
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 
-	if (vma->vm_flags & VM_MAYSHARE) {
-		return hugetlb_page_uncharge(&inode->i_mapping->private_list,
-					     idx, chg << huge_page_order(h));
-	} else {
+	if (vma->vm_flags & VM_MAYSHARE)
+		region_list = &inode->i_mapping->private_list;
+	else {
 		struct resv_map *resv_map = vma_resv_map(vma);
-
-		return hugetlb_priv_page_uncharge(resv_map,
-						  idx,
-						  chg << huge_page_order(h));
+		region_list = &resv_map->regions;
 	}
+	return hugetlb_page_uncharge(region_list,
+				     idx, chg << huge_page_order(h));
 }
 
 static void vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-
+	struct list_head *region_list;
+	pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 	if (vma->vm_flags & VM_MAYSHARE) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		hugetlb_commit_page_charge(&inode->i_mapping->private_list,
-					   idx, idx + 1);
-	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
+		region_list = &inode->i_mapping->private_list;
+	} else  {
 		struct resv_map *reservations = vma_resv_map(vma);
-
-		/* Mark this page used in the map. */
-		region_add(&reservations->regions, idx, idx + 1, 0);
+		region_list = &reservations->regions;
 	}
+	hugetlb_commit_page_charge(region_list, idx, idx + 1);
+	return;
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -937,10 +922,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 			return ERR_PTR(-VM_FAULT_SIGBUS);
 		}
 	}
-
 	set_page_private(page, (unsigned long) mapping);
-
-	vma_commit_reservation(h, vma, addr);
+	if (chg)
+		vma_commit_reservation(h, vma, addr);
 	return page;
 }
 
@@ -2045,20 +2029,19 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
 	struct hstate *h = hstate_vma(vma);
-	struct resv_map *reservations = vma_resv_map(vma);
-	unsigned long reserve;
-	unsigned long start;
-	unsigned long end;
+	struct resv_map *resv_map = vma_resv_map(vma);
+	unsigned long reserve, start, end;
 
-	if (reservations) {
+	if (resv_map) {
 		start = vma_hugecache_offset(h, vma, vma->vm_start);
 		end = vma_hugecache_offset(h, vma, vma->vm_end);
 
-		reserve = (end - start) -
-			region_count(&reservations->regions, start, end);
-
-		kref_put(&reservations->refs, resv_map_release);
-
+		reserve = hugetlb_truncate_cgroup_range(h, &resv_map->regions,
+							start, end);
+		/* open coded kref_put */
+		if (atomic_sub_and_test(1, &resv_map->refs.refcount)) {
+			resv_map_release(h, resv_map);
+		}
 		if (reserve) {
 			hugetlb_acct_memory(h, -reserve);
 			hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
@@ -2842,6 +2825,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 					vm_flags_t vm_flags)
 {
 	long ret, chg;
+	struct list_head *region_list;
 	struct hstate *h = hstate_inode(inode);
 	struct resv_map *resv_map = NULL;
 	/*
@@ -2859,20 +2843,17 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		chg = hugetlb_page_charge(&inode->i_mapping->private_list,
-					  h, from, to);
+		region_list = &inode->i_mapping->private_list;
 	} else {
 		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
 
-		chg = to - from;
-
 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
-		chg = hugetlb_priv_page_charge(resv_map, h, chg);
+		region_list = &resv_map->regions;
 	}
-
+	chg = hugetlb_page_charge(region_list, h, from, to);
 	if (chg < 0)
 		return chg;
 
@@ -2888,32 +2869,15 @@ int hugetlb_reserve_pages(struct inode *inode,
 	ret = hugetlb_acct_memory(h, chg);
 	if (ret < 0)
 		goto err_acct_mem;
-	/*
-	 * Account for the reservations made. Shared mappings record regions
-	 * that have reservations as they are shared by multiple VMAs.
-	 * When the last VMA disappears, the region map says how much
-	 * the reservation was and the page cache tells how much of
-	 * the reservation was consumed. Private mappings are per-VMA and
-	 * only the consumed reservations are tracked. When the VMA
-	 * disappears, the original reservation is the VMA size and the
-	 * consumed reservations are stored in the map. Hence, nothing
-	 * else has to be done for private mappings here
-	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		hugetlb_commit_page_charge(&inode->i_mapping->private_list,
-					   from, to);
+
+	hugetlb_commit_page_charge(region_list, from, to);
 	return 0;
 err_acct_mem:
 	hugetlb_put_quota(inode->i_mapping, chg);
 err_quota:
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		hugetlb_page_uncharge(&inode->i_mapping->private_list,
-				      h - hstates, chg << huge_page_order(h));
-	else
-		hugetlb_priv_page_uncharge(resv_map, h - hstates,
-					   chg << huge_page_order(h));
+	hugetlb_page_uncharge(region_list, h - hstates,
+			      chg << huge_page_order(h));
 	return ret;
-
 }
 
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
@@ -2927,7 +2891,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
 
-	hugetlb_put_quota(inode->i_mapping, (chg - freed));
+	hugetlb_put_quota(inode->i_mapping, chg);
 	hugetlb_acct_memory(h, -(chg - freed));
 }
 
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH -V1 9/9] hugetlbfs: Add HugeTLB controller documentation
  2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
                   ` (7 preceding siblings ...)
  2012-02-20 11:21 ` [PATCH -V1 8/9] hugetlbfs: Add task migration support Aneesh Kumar K.V
@ 2012-02-20 11:21 ` Aneesh Kumar K.V
  8 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-20 11:21 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 Documentation/cgroups/hugetlb.txt |   54 +++++++++++++++++++++++++++++++++++++
 1 files changed, 54 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/hugetlb.txt

diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
new file mode 100644
index 0000000..722aa8e
--- /dev/null
+++ b/Documentation/cgroups/hugetlb.txt
@@ -0,0 +1,54 @@
+HugeTLB controller
+------------------
+
+The HugeTLB controller is used to group tasks using cgroups and to
+limit the HugeTLB pages used by these groups of tasks. The HugeTLB cgroup
+enforces the limit at mmap(2) time. This enables applications to fall back
+to allocations using a smaller page size if the cgroup resource limit
+prevents them from allocating HugeTLB pages.
+
+
+The HugeTLB controller supports hierarchical cgroups and task migration
+across cgroups.
+
+HugeTLB groups can be created by first mounting the cgroup filesystem.
+
+# mount -t cgroup -o hugetlb none /sys/fs/cgroup
+
+With the above step, the initial or the root HugeTLB cgroup becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. The HugeTLB
+cgroup creates separate limit, usage and max_usage files for each supported
+huge page size. An example listing is given below:
+
+hugetlb.16GB.limit_in_bytes
+hugetlb.16GB.max_usage_in_bytes
+hugetlb.16GB.usage_in_bytes
+hugetlb.16MB.limit_in_bytes
+hugetlb.16MB.max_usage_in_bytes
+hugetlb.16MB.usage_in_bytes
+
+/sys/fs/cgroup/hugetlb.<pagesize>.usage_in_bytes gives the HugeTLB usage
+of this group, which is essentially the total size of the HugeTLB pages
+obtained by all the tasks in the system.
+
+A new cgroup can be created under the root HugeTLB cgroup at /sys/fs/cgroup:
+
+# cd /sys/fs/cgroup
+# mkdir g1
+# echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it. The 16MB HugeTLB pages consumed by this bash and its
+children can be read from g1/hugetlb.16MB.usage_in_bytes, and the same usage
+is also accumulated in /sys/fs/cgroup/hugetlb.16MB.usage_in_bytes.
+
+We can limit the usage of 16MB huge pages by a HugeTLB cgroup using
+hugetlb.16MB.limit_in_bytes:
+
+# echo 16M > /sys/fs/cgroup/g1/hugetlb.16MB.limit_in_bytes
+# hugectl  --heap=16M /root/heap
+libhugetlbfs: WARNING: New heap segment map at 0x20000000000 failed: Cannot allocate memory
+# echo -1 > /sys/fs/cgroup/g1/hugetlb.16MB.limit_in_bytes
+# hugectl  --heap=16M /root/heap
+#
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-02-20 11:24 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-20 11:21 [PATCH -V1 0/9] hugetlbfs: Add cgroup resource controller for hugetlbfs Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 1/9] hugetlbfs: Add new HugeTLB cgroup Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 2/9] hugetlbfs: Add usage and max usage files to hugetlb cgroup Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 3/9] hugetlbfs: Add new region handling functions Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 4/9] hugetlbfs: Add controller support for shared mapping Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 5/9] hugetlbfs: Add controller support for private mapping Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 6/9] hugetlbfs: Switch to new region APIs Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 7/9] hugetlbfs: Add truncate region functions Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 8/9] hugetlbfs: Add task migration support Aneesh Kumar K.V
2012-02-20 11:21 ` [PATCH -V1 9/9] hugetlbfs: Add HugeTLB controller documentation Aneesh Kumar K.V
