Linux-kselftest Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
@ 2019-08-08 23:13 Mina Almasry
  2019-08-08 23:13 ` [RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Mina Almasry @ 2019-08-08 23:13 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest

Problem:
Currently tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate hugetlb memory only more than its
hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
but will SIGBUS the task when it attempts to fault the memory in.

We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction regarding this behavior. We'd like to improve this
behavior such that tasks violating the hugetlb_cgroup limits get an error on
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.

Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory.

- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes, the kernel will fail this
reservation.

This proposal is implemented in this patch, with tests to verify
functionality and show the usage.

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
   the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
   duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
   hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies
   the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
   reservation time rather than fault time. Adding a new page_counter seems
   better as userspace could, if it wants, choose to enforce different cgroups
   differently: one via limit_in_bytes, and another via
   reservation_limit_in_bytes. This could be very useful if you're
   transitioning how hugetlb memory is partitioned on your system one
   cgroup at a time, for example. Also, someone may find usage for both
   limit_in_bytes and reservation_limit_in_bytes concurrently, and this
   approach gives them the option to do so.

Caveats:
1. This support is implemented for cgroups-v1. I have not tried
   hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
   This is largely because we use cgroups-v1 for now. If required, I
   can add hugetlb_cgroup support to cgroups v2 in this patch or
   a follow up.
2. Most complicated bit of this patch I believe is: where to store the
   pointer to the hugetlb_cgroup to uncharge at unreservation time?
   Normally the cgroup pointers hang off the struct page. But, with
   hugetlb_cgroup reservations, one task can reserve a specific page and another
   task may fault it in (I believe), so storing the pointer in struct
   page is not appropriate. Proposed approach here is to store the pointer in
   the resv_map. See patch for details.

Signed-off-by: Mina Almasry <almasrymina@google.com>

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Changes in v2:
- Split the patch into a 5 patch series.
- Fixed patch subject.

Mina Almasry (5):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests

 include/linux/hugetlb.h                       |  10 +-
 include/linux/hugetlb_cgroup.h                |  19 +-
 mm/hugetlb.c                                  | 256 ++++++++--
 mm/hugetlb_cgroup.c                           | 153 +++++-
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 438 ++++++++++++++++++
 .../selftests/vm/write_hugetlb_memory.sh      |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++++++++++
 9 files changed, 1087 insertions(+), 68 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.rc1.153.gdeed80330f-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  2019-08-08 23:13 [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
@ 2019-08-08 23:13 ` Mina Almasry
  2019-08-08 23:13 ` [RFC PATCH v2 2/5] hugetlb_cgroup: Add interface for charge/uncharge Mina Almasry
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-08-08 23:13 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest

These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.
---
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb_cgroup.c     | 86 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edfca42783192..6777b3013345d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -340,7 +340,7 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files[5];
+	struct cftype cgroup_files[9];
 #endif
 	char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 68c2f2f3c05b7..708103663988a 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
 	 * the counter to account for hugepages from hugetlb.
 	 */
 	struct page_counter hugepage[HUGE_MAX_HSTATE];
+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
 };

 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -33,6 +37,15 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline
+struct page_counter *get_counter(struct hugetlb_cgroup *h_cg, int idx,
+				 bool reserved)
+{
+	if (reserved)
+		return  &h_cg->reserved_hugepage[idx];
+	return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -256,28 +269,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,

 enum {
 	RES_USAGE,
+	RES_RESERVATION_USAGE,
 	RES_LIMIT,
+	RES_RESERVATION_LIMIT,
 	RES_MAX_USAGE,
+	RES_RESERVATION_MAX_USAGE,
 	RES_FAILCNT,
+	RES_RESERVATION_FAILCNT,
 };

 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
 	struct page_counter *counter;
+	struct page_counter *reserved_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

 	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+	reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

 	switch (MEMFILE_ATTR(cft->private)) {
 	case RES_USAGE:
 		return (u64)page_counter_read(counter) * PAGE_SIZE;
+	case RES_RESERVATION_USAGE:
+		return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
 	case RES_LIMIT:
 		return (u64)counter->max * PAGE_SIZE;
+	case RES_RESERVATION_LIMIT:
+		return (u64)reserved_counter->max * PAGE_SIZE;
 	case RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_RESERVATION_MAX_USAGE:
+		return (u64)reserved_counter->watermark * PAGE_SIZE;
 	case RES_FAILCNT:
 		return counter->failcnt;
+	case RES_RESERVATION_FAILCNT:
+		return reserved_counter->failcnt;
 	default:
 		BUG();
 	}
@@ -291,6 +318,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	int ret, idx;
 	unsigned long nr_pages;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+	bool reserved = false;

 	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
 		return -EINVAL;
@@ -303,10 +331,16 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	idx = MEMFILE_IDX(of_cft(of)->private);
 	nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

+	if (MEMFILE_ATTR(of_cft(of)->private) == RES_RESERVATION_LIMIT) {
+		reserved = true;
+	}
+
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
+	case RES_RESERVATION_LIMIT:
 	case RES_LIMIT:
 		mutex_lock(&hugetlb_limit_mutex);
-		ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+		ret = page_counter_set_max(get_counter(h_cg, idx, reserved),
+					   nr_pages);
 		mutex_unlock(&hugetlb_limit_mutex);
 		break;
 	default:
@@ -320,18 +354,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
 	int ret = 0;
-	struct page_counter *counter;
+	struct page_counter *counter, *reserved_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

 	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+	reserved_counter = &h_cg->reserved_hugepage[
+		MEMFILE_IDX(of_cft(of)->private)];

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
 		page_counter_reset_watermark(counter);
 		break;
+	case RES_RESERVATION_MAX_USAGE:
+		page_counter_reset_watermark(reserved_counter);
+		break;
 	case RES_FAILCNT:
 		counter->failcnt = 0;
 		break;
+	case RES_RESERVATION_FAILCNT:
+		reserved_counter->failcnt = 0;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -357,7 +399,7 @@ static void __init __hugetlb_cgroup_file_init(int idx)
 	struct hstate *h = &hstates[idx];

 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));

 	/* Add the limit file */
 	cft = &h->cgroup_files[0];
@@ -366,28 +408,58 @@ static void __init __hugetlb_cgroup_file_init(int idx)
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 	cft->write = hugetlb_cgroup_write;

-	/* Add the usage file */
+	/* Add the reservation limit file */
 	cft = &h->cgroup_files[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
+		 buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+	cft->write = hugetlb_cgroup_write;
+
+	/* Add the usage file */
+	cft = &h->cgroup_files[2];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the reservation usage file */
+	cft = &h->cgroup_files[3];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
+			buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* Add the MAX usage file */
-	cft = &h->cgroup_files[2];
+	cft = &h->cgroup_files[4];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the MAX reservation usage file */
+	cft = &h->cgroup_files[5];
+	snprintf(cft->name, MAX_CFTYPE_NAME,
+			"%s.reservation_max_usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_MAX_USAGE);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* Add the failcntfile */
-	cft = &h->cgroup_files[3];
+	cft = &h->cgroup_files[6];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
 	cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the reservation failcntfile */
+	cft = &h->cgroup_files[7];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
+	cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files[4];
+	cft = &h->cgroup_files[8];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
--
2.23.0.rc1.153.gdeed80330f-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 2/5] hugetlb_cgroup: Add interface for charge/uncharge
  2019-08-08 23:13 [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
  2019-08-08 23:13 ` [RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
@ 2019-08-08 23:13 ` Mina Almasry
  2019-08-08 23:13 ` [RFC PATCH v2 3/5] hugetlb_cgroup: Add reservation accounting for private mappings Mina Almasry
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-08-08 23:13 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest

Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb
usage or hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.

---
 include/linux/hugetlb_cgroup.h |  8 +++--
 mm/hugetlb.c                   |  3 +-
 mm/hugetlb_cgroup.c            | 63 ++++++++++++++++++++++++++++------
 3 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 063962f6dfc6a..0725f809cd2d9 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -52,7 +52,8 @@ static inline bool hugetlb_cgroup_disabled(void)
 }

 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-					struct hugetlb_cgroup **ptr);
+					struct hugetlb_cgroup **ptr,
+					bool reserved);
 extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 					 struct hugetlb_cgroup *h_cg,
 					 struct page *page);
@@ -60,6 +61,9 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 					 struct page *page);
 extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
 					   struct hugetlb_cgroup *h_cg);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+					    unsigned long nr_pages);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
 				   struct page *newhpage);
@@ -83,7 +87,7 @@ static inline bool hugetlb_cgroup_disabled(void)

 static inline int
 hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-			     struct hugetlb_cgroup **ptr)
+			     struct hugetlb_cgroup **ptr, bool reserved)
 {
 	return 0;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab2..c153bef42e729 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2078,7 +2078,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 			gbl_chg = 1;
 	}

-	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
+	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
+					   false);
 	if (ret)
 		goto out_subpool_put;

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 708103663988a..119176a0b2ec5 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -74,8 +74,10 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 	int idx;

 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-		if (page_counter_read(&h_cg->hugepage[idx]))
+		if (page_counter_read(get_counter(h_cg, idx, true)) ||
+		    page_counter_read(get_counter(h_cg, idx, false))) {
 			return true;
+		}
 	}
 	return false;
 }
@@ -86,18 +88,27 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
 	int idx;

 	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
-		struct page_counter *counter = &h_cgroup->hugepage[idx];
 		struct page_counter *parent = NULL;
+		struct page_counter *reserved_parent = NULL;
 		unsigned long limit;
 		int ret;

-		if (parent_h_cgroup)
-			parent = &parent_h_cgroup->hugepage[idx];
-		page_counter_init(counter, parent);
+		if (parent_h_cgroup) {
+			parent = get_counter(parent_h_cgroup, idx, false);
+			reserved_parent = get_counter(parent_h_cgroup, idx,
+						      true);
+		}
+		page_counter_init(get_counter(h_cgroup, idx, false), parent);
+		page_counter_init(get_counter(h_cgroup, idx, true),
+				  reserved_parent);

 		limit = round_down(PAGE_COUNTER_MAX,
 				   1 << huge_page_order(&hstates[idx]));
-		ret = page_counter_set_max(counter, limit);
+
+		ret = page_counter_set_max(get_counter(
+					h_cgroup, idx, false), limit);
+		ret = page_counter_set_max(get_counter(
+					h_cgroup, idx, true), limit);
 		VM_BUG_ON(ret);
 	}
 }
@@ -127,6 +138,25 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
 	kfree(h_cgroup);
 }

+static void hugetlb_cgroup_move_parent_reservation(int idx,
+						   struct hugetlb_cgroup *h_cg)
+{
+	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
+
+	/* Move the reservation counters. */
+	if (!parent_hugetlb_cgroup(h_cg)) {
+		parent = root_h_cgroup;
+		/* root has no limit */
+		page_counter_charge(
+				&root_h_cgroup->reserved_hugepage[idx],
+				page_counter_read(get_counter(h_cg, idx,
+							      true)));
+	}
+
+	/* Take the pages off the local counter */
+	page_counter_cancel(get_counter(h_cg, idx, true),
+			    page_counter_read(get_counter(h_cg, idx, true)));
+}

 /*
  * Should be called with hugetlb_lock held.
@@ -181,6 +211,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	do {
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
+			hugetlb_cgroup_move_parent_reservation(idx, h_cg);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
 				hugetlb_cgroup_move_parent(idx, h_cg, page);

@@ -192,7 +223,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 }

 int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-				 struct hugetlb_cgroup **ptr)
+				 struct hugetlb_cgroup **ptr, bool reserved)
 {
 	int ret = 0;
 	struct page_counter *counter;
@@ -215,8 +246,10 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
 	}
 	rcu_read_unlock();

-	if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
+	if (!page_counter_try_charge(get_counter(h_cg, idx, reserved),
+				     nr_pages, &counter)) {
 		ret = -ENOMEM;
+	}
 	css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
@@ -250,7 +283,8 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 	if (unlikely(!h_cg))
 		return;
 	set_hugetlb_cgroup(page, NULL);
-	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
+	page_counter_uncharge(get_counter(h_cg, idx, false), nr_pages);
+
 	return;
 }

@@ -263,7 +297,16 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
 	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
 		return;

-	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
+	page_counter_uncharge(get_counter(h_cg, idx, false), nr_pages);
+}
+
+void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+				     unsigned long nr_pages)
+{
+	if (hugetlb_cgroup_disabled() || !p)
+		return;
+
+	page_counter_uncharge(p, nr_pages);
 	return;
 }

--
2.23.0.rc1.153.gdeed80330f-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 3/5] hugetlb_cgroup: Add reservation accounting for private mappings
  2019-08-08 23:13 [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
  2019-08-08 23:13 ` [RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
  2019-08-08 23:13 ` [RFC PATCH v2 2/5] hugetlb_cgroup: Add interface for charge/uncharge Mina Almasry
@ 2019-08-08 23:13 ` Mina Almasry
  2019-08-08 23:13 ` [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings Mina Almasry
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-08-08 23:13 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest

Normally the pointer to the cgroup to uncharge hangs off the struct
page, and gets queried when it's time to free the page. With
hugetlb_cgroup reservations, this is not possible. Because it's possible
for a page to be reserved by one task and actually faulted in by another
task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code patch to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in private mapping, the tasks
hugetlb_cgroup is charged and the hugetlb_cgroup is placed is
resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

---
 include/linux/hugetlb.h        |  8 ++++++
 include/linux/hugetlb_cgroup.h | 11 ++++++++
 mm/hugetlb.c                   | 47 ++++++++++++++++++++++++++++++++--
 mm/hugetlb_cgroup.c            | 12 ---------
 4 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6777b3013345d..90b3c928d16c1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
 	long adds_in_progress;
 	struct list_head region_cache;
 	long region_cache_count;
+ #ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * On private mappings, the counter to uncharge reservations is stored
+	 * here. If these fields are 0, then the mapping is shared.
+	 */
+	struct page_counter *reservation_counter;
+	unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 0725f809cd2d9..1fdde63a4e775 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -25,6 +25,17 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER	2

 #ifdef CONFIG_CGROUP_HUGETLB
+struct hugetlb_cgroup {
+	struct cgroup_subsys_state css;
+	/*
+	 * the counter to account for hugepages from hugetlb.
+	 */
+	struct page_counter hugepage[HUGE_MAX_HSTATE];
+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+};

 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c153bef42e729..235996aef6618 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -711,6 +711,16 @@ struct resv_map *resv_map_alloc(void)
 	INIT_LIST_HEAD(&resv_map->regions);

 	resv_map->adds_in_progress = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * Initialize these to 0. On shared mappings, 0's here indicate these
+	 * fields don't do cgroup accounting. On private mappings, these will be
+	 * re-initialized to the proper values, to indicate that hugetlb cgroup
+	 * reservations are to be un-charged from here.
+	 */
+	resv_map->reservation_counter = NULL;
+	resv_map->pages_per_hpage = 0;
+#endif

 	INIT_LIST_HEAD(&resv_map->region_cache);
 	list_add(&rg->link, &resv_map->region_cache);
@@ -3192,7 +3202,19 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)

 	reserve = (end - start) - region_count(resv, start, end);

-	kref_put(&resv->refs, resv_map_release);
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * Since we check for HPAGE_RESV_OWNER above, this must a private
+	 * mapping, and these values should be none-zero, and should point to
+	 * the hugetlb_cgroup counter to uncharge for this reservation.
+	 */
+	WARN_ON(!resv->reservation_counter);
+	WARN_ON(!resv->pages_per_hpage);
+
+	hugetlb_cgroup_uncharge_counter(
+			resv->reservation_counter,
+			(end - start) * resv->pages_per_hpage);
+#endif

 	if (reserve) {
 		/*
@@ -3202,6 +3224,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
 		hugetlb_acct_memory(h, -gbl_reserve);
 	}
+
+	kref_put(&resv->refs, resv_map_release);
 }

 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4516,6 +4540,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
+	struct hugetlb_cgroup *h_cg;
 	long gbl_reserve;

 	/* This should never happen */
@@ -4549,11 +4574,29 @@ int hugetlb_reserve_pages(struct inode *inode,
 		chg = region_chg(resv_map, from, to);

 	} else {
+		/* Private mapping. */
+		chg = to - from;
+
+		if (hugetlb_cgroup_charge_cgroup(
+					hstate_index(h),
+					chg * pages_per_huge_page(h),
+					&h_cg, true)) {
+			return -ENOMEM;
+		}
+
 		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;

-		chg = to - from;
+#ifdef CONFIG_CGROUP_HUGETLB
+		/*
+		 * Since this branch handles private mappings, we attach the
+		 * counter to uncharge for this reservation off resv_map.
+		 */
+		resv_map->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		resv_map->pages_per_hpage = pages_per_huge_page(h);
+#endif

 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 119176a0b2ec5..06e99ae1fec81 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -19,18 +19,6 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>

-struct hugetlb_cgroup {
-	struct cgroup_subsys_state css;
-	/*
-	 * the counter to account for hugepages from hugetlb.
-	 */
-	struct page_counter hugepage[HUGE_MAX_HSTATE];
-	/*
-	 * the counter to account for hugepage reservations from hugetlb.
-	 */
-	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
-};
-
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
--
2.23.0.rc1.153.gdeed80330f-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings
  2019-08-08 23:13 [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (2 preceding siblings ...)
  2019-08-08 23:13 ` [RFC PATCH v2 3/5] hugetlb_cgroup: Add reservation accounting for private mappings Mina Almasry
@ 2019-08-08 23:13 ` Mina Almasry
  2019-08-13 23:54   ` Mike Kravetz
  2019-08-08 23:13 ` [RFC PATCH v2 5/5] hugetlb_cgroup: Add hugetlb_cgroup reservation tests Mina Almasry
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-08-08 23:13 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest

For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

When a file_region entry is added to the resv_map via region_add, we
also charge the appropriate hugetlb_cgroup and put the pointer to that
in file_region->reservation_counter. This is slightly delicate since we
need to not modify the resv_map until we know that charging the
reservation has succeeded. If charging doesn't succeed, we report the
error to the caller, so that the kernel fails the reservation.

On region_del, which is when the hugetlb memory is unreserved, we delete
the file_region entry in the resv_map, but also uncharge the
file_region->reservation_counter.

---
 mm/hugetlb.c | 208 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 170 insertions(+), 38 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 235996aef6618..d76e3137110ab 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -242,8 +242,72 @@ struct file_region {
 	struct list_head link;
 	long from;
 	long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * On shared mappings, each reserved region appears as a struct
+	 * file_region in resv_map. These fields hold the info needed to
+	 * uncharge each reservation.
+	 */
+	struct page_counter *reservation_counter;
+	unsigned long pages_per_hpage;
+#endif
 };

+/* Must be called with resv->lock held. Calling this with dry_run == true will
+ * count the number of pages added but will not modify the linked list.
+ */
+static long consume_regions_we_overlap_with(struct file_region *rg,
+		struct list_head *head, long f, long *t,
+		struct hugetlb_cgroup *h_cg,
+		struct hstate *h,
+		bool dry_run)
+{
+	long add = 0;
+	struct file_region *trg = NULL, *nrg = NULL;
+
+	/* Consume any regions we now overlap with. */
+	nrg = rg;
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > *t)
+			break;
+
+		/* If this area reaches higher then extend our area to
+		 * include it completely.  If this is not the first area
+		 * which we intend to reuse, free it.
+		 */
+		if (rg->to > *t)
+			*t = rg->to;
+		if (rg != nrg) {
+			/* Decrement return value by the deleted range.
+			 * Another range will span this area so that by
+			 * end of routine add will be >= zero
+			 */
+			add -= (rg->to - rg->from);
+			if (!dry_run) {
+				list_del(&rg->link);
+				kfree(rg);
+			}
+		}
+	}
+
+	add += (nrg->from - f);		/* Added to beginning of region */
+	add += *t - nrg->to;		/* Added to end of region */
+
+	if (!dry_run) {
+		nrg->from = f;
+		nrg->to = *t;
+#ifdef CONFIG_CGROUP_HUGETLB
+		nrg->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		nrg->pages_per_hpage = pages_per_huge_page(h);
+#endif
+	}
+
+	return add;
+}
+
 /*
  * Add the huge page range represented by [f, t) to the reserve
  * map.  In the normal case, existing regions will be expanded
@@ -258,11 +322,13 @@ struct file_region {
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
  */
-static long region_add(struct resv_map *resv, long f, long t)
+static long region_add(struct hstate *h, struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg, *trg;
-	long add = 0;
+	struct file_region *rg, *nrg;
+	long add = 0, newadd = 0;
+	struct hugetlb_cgroup *h_cg = NULL;
+	int ret = 0;

 	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
@@ -277,6 +343,23 @@ static long region_add(struct resv_map *resv, long f, long t)
 	 * from the cache and use it for this range.
 	 */
 	if (&rg->link == head || t < rg->from) {
+#ifdef CONFIG_CGROUP_HUGETLB
+		/*
+		 * If res->reservation_counter is NULL, then it means this is
+		 * a shared mapping, and hugetlb cgroup accounting should be
+		 * done on the file_region entries inside resv_map.
+		 */
+		if (!resv->reservation_counter) {
+			ret = hugetlb_cgroup_charge_cgroup(
+					hstate_index(h),
+					(t - f) * pages_per_huge_page(h),
+					&h_cg, true);
+		}
+
+		if (ret)
+			goto out_locked;
+#endif
+
 		VM_BUG_ON(resv->region_cache_count <= 0);

 		resv->region_cache_count--;
@@ -286,6 +369,15 @@ static long region_add(struct resv_map *resv, long f, long t)

 		nrg->from = f;
 		nrg->to = t;
+
+#ifdef CONFIG_CGROUP_HUGETLB
+		if (h_cg) {
+			nrg->reservation_counter =
+				&h_cg->reserved_hugepage[hstate_index(h)];
+			nrg->pages_per_hpage = pages_per_huge_page(h);
+		}
+#endif
+
 		list_add(&nrg->link, rg->link.prev);

 		add += t - f;
@@ -296,38 +388,37 @@ static long region_add(struct resv_map *resv, long f, long t)
 	if (f > rg->from)
 		f = rg->from;

-	/* Check for and consume any regions we now overlap with. */
-	nrg = rg;
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			break;
-
-		/* If this area reaches higher then extend our area to
-		 * include it completely.  If this is not the first area
-		 * which we intend to reuse, free it. */
-		if (rg->to > t)
-			t = rg->to;
-		if (rg != nrg) {
-			/* Decrement return value by the deleted range.
-			 * Another range will span this area so that by
-			 * end of routine add will be >= zero
-			 */
-			add -= (rg->to - rg->from);
-			list_del(&rg->link);
-			kfree(rg);
-		}
+#ifdef CONFIG_CGROUP_HUGETLB
+	/* Count any regions we now overlap with. */
+	add = consume_regions_we_overlap_with(rg, head, f, &t, NULL, NULL,
+			true);
+
+	if (!resv->reservation_counter) {
+		ret = hugetlb_cgroup_charge_cgroup(
+				hstate_index(h),
+				add * pages_per_huge_page(h),
+				&h_cg, true);
 	}

-	add += (nrg->from - f);		/* Added to beginning of region */
-	nrg->from = f;
-	add += t - nrg->to;		/* Added to end of region */
-	nrg->to = t;
+	if (ret)
+		goto out_locked;
+#endif
+
+	/* Check for and consume any regions we now overlap with. */
+	newadd = consume_regions_we_overlap_with(rg, head, f, &t, h_cg, h,
+			false);
+	/*
+	 * If these aren't equal, then there is a bug with
+	 * consume_regions_we_overlap_with, and we're charging the wrong amount
+	 * of memory.
+	 */
+	WARN_ON(add != newadd);

 out_locked:
 	resv->adds_in_progress--;
 	spin_unlock(&resv->lock);
+	if (ret)
+		return ret;
 	VM_BUG_ON(add < 0);
 	return add;
 }
@@ -487,6 +578,10 @@ static long region_del(struct resv_map *resv, long f, long t)
 	struct file_region *rg, *trg;
 	struct file_region *nrg = NULL;
 	long del = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+	struct page_counter *reservation_counter = NULL;
+	unsigned long pages_per_hpage = 0;
+#endif

 retry:
 	spin_lock(&resv->lock);
@@ -514,6 +609,14 @@ static long region_del(struct resv_map *resv, long f, long t)
 				nrg = list_first_entry(&resv->region_cache,
 							struct file_region,
 							link);
+#ifdef CONFIG_CGROUP_HUGETLB
+				/*
+				 * Save counter information from the deleted
+				 * node, in case we need to do an uncharge.
+				 */
+				reservation_counter = nrg->reservation_counter;
+				pages_per_hpage = nrg->pages_per_hpage;
+#endif
 				list_del(&nrg->link);
 				resv->region_cache_count--;
 			}
@@ -543,6 +646,14 @@ static long region_del(struct resv_map *resv, long f, long t)

 		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
 			del += rg->to - rg->from;
+#ifdef CONFIG_CGROUP_HUGETLB
+			/*
+			 * Save counter information from the deleted node,
+			 * in case we need to do an uncharge.
+			 */
+			reservation_counter = rg->reservation_counter;
+			pages_per_hpage = rg->pages_per_hpage;
+#endif
 			list_del(&rg->link);
 			kfree(rg);
 			continue;
@@ -559,6 +670,19 @@ static long region_del(struct resv_map *resv, long f, long t)

 	spin_unlock(&resv->lock);
 	kfree(nrg);
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * If resv->reservation_counter is NULL, then this is shared
+	 * reservation, and the reserved memory is tracked in the file_struct
+	 * entries inside of resv_map. So we need to uncharge the memory here.
+	 */
+	if (reservation_counter && pages_per_hpage && del > 0 &&
+	    !resv->reservation_counter) {
+		hugetlb_cgroup_uncharge_counter(
+				reservation_counter,
+				del * pages_per_hpage);
+	}
+#endif
 	return del;
 }

@@ -1930,7 +2054,7 @@ static long __vma_reservation_common(struct hstate *h,
 		ret = region_chg(resv, idx, idx + 1);
 		break;
 	case VMA_COMMIT_RESV:
-		ret = region_add(resv, idx, idx + 1);
+		ret = region_add(h, resv, idx, idx + 1);
 		break;
 	case VMA_END_RESV:
 		region_abort(resv, idx, idx + 1);
@@ -1938,7 +2062,7 @@ static long __vma_reservation_common(struct hstate *h,
 		break;
 	case VMA_ADD_RESV:
 		if (vma->vm_flags & VM_MAYSHARE)
-			ret = region_add(resv, idx, idx + 1);
+			ret = region_add(h, resv, idx, idx + 1);
 		else {
 			region_abort(resv, idx, idx + 1);
 			ret = region_del(resv, idx, idx + 1);
@@ -4536,7 +4660,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 					struct vm_area_struct *vma,
 					vm_flags_t vm_flags)
 {
-	long ret, chg;
+	long ret, chg, add;
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
@@ -4624,9 +4748,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 */
 	ret = hugetlb_acct_memory(h, gbl_reserve);
 	if (ret < 0) {
-		/* put back original number of pages, chg */
-		(void)hugepage_subpool_put_pages(spool, chg);
-		goto out_err;
+		goto out_put_pages;
 	}

 	/*
@@ -4641,7 +4763,12 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		long add = region_add(resv_map, from, to);
+		add = region_add(h, resv_map, from, to);
+		if (add < 0) {
+			ret = -ENOMEM;
+			goto out_acct_memory;
+		}
+

 		if (unlikely(chg > add)) {
 			/*
@@ -4659,10 +4786,15 @@ int hugetlb_reserve_pages(struct inode *inode,
 		}
 	}
 	return 0;
+out_acct_memory:
+	hugetlb_acct_memory(h, -gbl_reserve);
+out_put_pages:
+	/* put back original number of pages, chg */
+	(void)hugepage_subpool_put_pages(spool, chg);
 out_err:
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		/* Don't call region_abort if region_chg failed */
-		if (chg >= 0)
+		/* Don't call region_abort if region_chg or region_add failed */
+		if (chg >= 0 && add >= 0)
 			region_abort(resv_map, from, to);
 	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		kref_put(&resv_map->refs, resv_map_release);
--
2.23.0.rc1.153.gdeed80330f-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 5/5] hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  2019-08-08 23:13 [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (3 preceding siblings ...)
  2019-08-08 23:13 ` [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings Mina Almasry
@ 2019-08-08 23:13 ` Mina Almasry
  2019-08-09 17:54 ` [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mike Kravetz
       [not found] ` <20190815035352.14952-1-hdanton@sina.com>
  6 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-08-08 23:13 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest

The tests use both shared and private mapped hugetlb memory, and
monitors the hugetlb usage counter as well as the hugetlb reservation
counter. They test different configurations such as hugetlb memory usage
via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.

---
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 438 ++++++++++++++++++
 .../selftests/vm/write_hugetlb_memory.sh      |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++++++++++
 5 files changed, 717 insertions(+)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34d..d3bed9407773c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 9534dc2bc9295..8d37d5409b52c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs

 TEST_PROGS := run_vmtests

@@ -29,3 +30,6 @@ include ../lib.mk
 $(OUTPUT)/userfaultfd: LDLIBS += -lpthread

 $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+
+# Why does adding $(OUTPUT)/ like above not apply this flag..?
+write_to_hugetlbfs: CFLAGS += -static
diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index 0000000000000..bf0b6dcec9977
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,438 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+      mkdir -p $cgroup_path
+      mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
+fi
+
+cleanup () {
+	echo $$ > $cgroup_path/tasks
+
+	set +e
+	if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+	      kill -2 write_to_hugetlbfs
+	      # Wait for hugetlbfs memory to get depleted.
+	      sleep 0.5
+	fi
+	set -e
+
+	if [[ -e /mnt/huge ]]; then
+	      rm -rf /mnt/huge/*
+	      umount /mnt/huge || echo error
+	      rmdir /mnt/huge
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+	      rmdir $cgroup_path/hugetlb_cgroup_test
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+	      rmdir $cgroup_path/hugetlb_cgroup_test1
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+	      rmdir $cgroup_path/hugetlb_cgroup_test2
+	fi
+	echo 0 > /proc/sys/vm/nr_hugepages
+	echo CLEANUP DONE
+}
+
+cleanup
+
+function expect_equal() {
+      local expected="$1"
+      local actual="$2"
+      local error="$3"
+
+      if [[ "$expected" != "$actual" ]]; then
+	    echo "expected ($expected) != actual ($actual): $3"
+	    cleanup
+	    exit 1
+      fi
+}
+
+function setup_cgroup() {
+      local name="$1"
+      local cgroup_limit="$2"
+      local reservation_limit="$3"
+
+      mkdir $cgroup_path/$name
+
+      echo writing cgroup limit: "$cgroup_limit"
+      echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
+
+      echo writing reseravation limit: "$reservation_limit"
+      echo "$reservation_limit" > \
+	    $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
+}
+
+function write_hugetlbfs_and_get_usage() {
+      local cgroup="$1"
+      local size="$2"
+      local populate="$3"
+      local write="$4"
+      local path="$5"
+      local method="$6"
+      local private="$7"
+      local expect_failure="$8"
+
+      # Function return values.
+      reservation_failed=0
+      oom_killed=0
+      hugetlb_difference=0
+      reserved_difference=0
+
+      local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes
+      local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes
+
+      local hugetlb_before=$(cat $hugetlb_usage)
+      local reserved_before=$(cat $reserved_usage)
+
+      echo
+      echo Starting:
+      echo hugetlb_usage="$hugetlb_before"
+      echo reserved_usage="$reserved_before"
+      echo expect_failure is "$expect_failure"
+
+      set +e
+      if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \
+	    [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
+	    bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+		  "$cgroup"  "$path" "$method" "$private" "-l" &
+
+	    local write_result=$?
+	    # This sleep is to make sure that the script above has had enough
+	    # time to do its thing, since it runs in the background. This may
+	    # cause races...
+	    sleep 0.5
+	    echo write_result is $write_result
+      else
+	    bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+		  "$cgroup"  "$path" "$method" "$private"
+	    local write_result=$?
+      fi
+      set -e
+
+      if [[ "$write_result" == 1 ]]; then
+	    reservation_failed=1
+      fi
+
+      # On linus/master, the above process gets SIGBUS'd on oomkill, with
+      # return code 135. On earlier kernels, it gets actual oomkill, with return
+      # code 137, so just check for both conditions incase we're testing against
+      # an earlier kernel.
+      if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then
+	    oom_killed=1
+      fi
+
+      local hugetlb_after=$(cat $hugetlb_usage)
+      local reserved_after=$(cat $reserved_usage)
+
+      echo After write:
+      echo hugetlb_usage="$hugetlb_after"
+      echo reserved_usage="$reserved_after"
+
+      hugetlb_difference=$(($hugetlb_after - $hugetlb_before))
+      reserved_difference=$(($reserved_after - $reserved_before))
+}
+
+function cleanup_hugetlb_memory() {
+      set +e
+      if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+	    echo kiling write_to_hugetlbfs
+	    killall -2 write_to_hugetlbfs
+	    # Wait for hugetlbfs memory to get depleted.
+	    sleep 0.5
+      fi
+      set -e
+
+      if [[ -e /mnt/huge ]]; then
+	    rm -rf /mnt/huge/*
+	      umount /mnt/huge
+	      rmdir /mnt/huge
+      fi
+}
+
+function run_test() {
+      local size="$1"
+      local populate="$2"
+      local write="$3"
+      local cgroup_limit="$4"
+      local reservation_limit="$5"
+      local nr_hugepages="$6"
+      local method="$7"
+      local private="$8"
+      local expect_failure="$9"
+
+      # Function return values.
+      hugetlb_difference=0
+      reserved_difference=0
+      reservation_failed=0
+      oom_killed=0
+
+      echo nr hugepages = "$nr_hugepages"
+      echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
+
+      setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit"
+
+      mkdir -p /mnt/huge
+      mount -t hugetlbfs \
+	    -o pagesize=2M,size=256M none /mnt/huge
+
+      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \
+	    "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure"
+
+      cleanup_hugetlb_memory
+
+      local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.usage_in_bytes)
+      local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.reservation_usage_in_bytes)
+
+      expect_equal "0" "$final_hugetlb" "final hugetlb is not zero"
+      expect_equal "0" "$final_reservation" "final reservation is not zero"
+}
+
+function run_multiple_cgroup_test() {
+      local size1="$1"
+      local populate1="$2"
+      local write1="$3"
+      local cgroup_limit1="$4"
+      local reservation_limit1="$5"
+
+      local size2="$6"
+      local populate2="$7"
+      local write2="$8"
+      local cgroup_limit2="$9"
+      local reservation_limit2="${10}"
+
+      local nr_hugepages="${11}"
+      local method="${12}"
+      local private="${13}"
+      local expect_failure="${14}"
+
+      # Function return values.
+      hugetlb_difference1=0
+      reserved_difference1=0
+      reservation_failed1=0
+      oom_killed1=0
+
+      hugetlb_difference2=0
+      reserved_difference2=0
+      reservation_failed2=0
+      oom_killed2=0
+
+
+      echo nr hugepages = "$nr_hugepages"
+      echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
+
+      setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1"
+      setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2"
+
+      mkdir -p /mnt/huge
+      mount -t hugetlbfs \
+	    -o pagesize=2M,size=256M none /mnt/huge
+
+      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \
+	    "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \
+	    "$expect_failure"
+
+      hugetlb_difference1=$hugetlb_difference
+      reserved_difference1=$reserved_difference
+      reservation_failed1=$reservation_failed
+      oom_killed1=$oom_killed
+
+      local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.usage_in_bytes
+      local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.reservation_usage_in_bytes
+      local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.usage_in_bytes
+      local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.reservation_usage_in_bytes
+
+      local usage_before_second_write=$(cat $cgroup1_hugetlb_usage)
+      local reservation_usage_before_second_write=$(cat \
+	    $cgroup1_reservation_usage)
+
+      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \
+	    "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \
+	    "$expect_failure"
+
+      hugetlb_difference2=$hugetlb_difference
+      reserved_difference2=$reserved_difference
+      reservation_failed2=$reservation_failed
+      oom_killed2=$oom_killed
+
+      expect_equal "$usage_before_second_write" \
+	    "$(cat $cgroup1_hugetlb_usage)" "Usage changed."
+      expect_equal "$reservation_usage_before_second_write" \
+	    "$(cat $cgroup1_reservation_usage)" "Reservation usage changed."
+
+      cleanup_hugetlb_memory
+
+      local final_hugetlb=$(cat $cgroup1_hugetlb_usage)
+      local final_reservation=$(cat $cgroup1_reservation_usage)
+
+      expect_equal "0" "$final_hugetlb" \
+	    "hugetlbt_cgroup_test1 final hugetlb is not zero"
+      expect_equal "0" "$final_reservation" \
+	    "hugetlbt_cgroup_test1 final reservation is not zero"
+
+      local final_hugetlb=$(cat $cgroup2_hugetlb_usage)
+      local final_reservation=$(cat $cgroup2_reservation_usage)
+
+      expect_equal "0" "$final_hugetlb" \
+	    "hugetlb_cgroup_test2 final hugetlb is not zero"
+      expect_equal "0" "$final_reservation" \
+	    "hugetlb_cgroup_test2 final reservation is not zero"
+}
+
+for private in "" "-r" ; do
+for populate in  "" "-o"; do
+for method in 0 1 2; do
+
+# Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported.
+if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then
+      continue
+fi
+
+# Skip populated shmem tests. Doesn't seem to be supported.
+if [[ "$method" == 2"" ]] && [[ "$populate" == "-o" ]]; then
+      continue
+fi
+
+cleanup
+echo
+echo
+echo
+echo Test normal case.
+echo private=$private, populate=$populate, method=$method
+run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \
+      $((20 * 1024 * 1024)) 10 "$method" "$private" "0"
+
+echo Memory charged to hugtlb=$hugetlb_difference
+echo Memory charged to reservation=$reserved_difference
+
+if [[ "$populate" == "-o" ]]; then
+      expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
+	    "Reserved memory charged to hugetlb cgroup."
+else
+      expect_equal "0" "$hugetlb_difference" \
+	    "Reserved memory charged to hugetlb cgroup."
+fi
+
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
+      "Reserved memory not charged to reservation usage."
+echo 'PASS'
+
+cleanup
+echo
+echo
+echo
+echo Test normal case with write.
+echo private=$private, populate=$populate, method=$method
+run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \
+      $((20 * 1024 * 1024)) 10 "$method" "$private" "0"
+
+echo Memory charged to hugtlb=$hugetlb_difference
+echo Memory charged to reservation=$reserved_difference
+
+expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
+      "Reserved memory charged to hugetlb cgroup."
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
+      "Reserved memory not charged to reservation usage."
+echo 'PASS'
+
+
+cleanup
+echo
+echo
+echo
+echo Test more than reservation case.
+echo private=$private, populate=$populate, method=$method
+run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \
+      "$((5 * 1024 * 1024))" "10" "$method" "$private" "1"
+
+expect_equal "1" "$reservation_failed" "Reservation succeeded."
+echo 'PASS'
+
+cleanup
+
+echo
+echo
+echo
+echo Test more than cgroup limit case.
+echo private=$private, populate=$populate, method=$method
+
+# Not sure if shm memory can be cleaned up when the process gets sigbus'd.
+if [[ "$method" != 2 ]]; then
+      run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \
+	    $((20 * 1024 * 1024)) 10 "$method" "$private" "1"
+
+      expect_equal "1" "$oom_killed" "Not oom killed."
+fi
+echo 'PASS'
+
+cleanup
+
+echo
+echo
+echo
+echo Test normal case, multiple cgroups.
+echo private=$private, populate=$populate, method=$method
+run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \
+      "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
+      "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
+      "$method" "$private" "0"
+
+echo Memory charged to hugtlb1=$hugetlb_difference1
+echo Memory charged to reservation1=$reserved_difference1
+echo Memory charged to hugtlb2=$hugetlb_difference2
+echo Memory charged to reservation2=$reserved_difference2
+
+expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
+      "Incorrect reservations charged to cgroup 1."
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
+      "Incorrect reservation charged to cgroup 2."
+if [[ "$populate" == "-o" ]]; then
+      expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
+	    "Incorrect hugetlb charged to cgroup 1."
+      expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
+	    "Incorrect hugetlb charged to cgroup 2."
+else
+      expect_equal "0" "$hugetlb_difference1" \
+	    "Incorrect hugetlb charged to cgroup 1."
+      expect_equal "0" "$hugetlb_difference2" \
+	    "Incorrect hugetlb charged to cgroup 2."
+fi
+echo 'PASS'
+
+cleanup
+echo
+echo
+echo
+echo Test normal case with write, multiple cgroups.
+echo private=$private, populate=$populate, method=$method
+run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \
+      "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
+      "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
+      "$method" "$private" "0"
+
+echo Memory charged to hugtlb1=$hugetlb_difference1
+echo Memory charged to reservation1=$reserved_difference1
+echo Memory charged to hugtlb2=$hugetlb_difference2
+echo Memory charged to reservation2=$reserved_difference2
+
+expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
+      "Incorrect hugetlb charged to cgroup 1."
+expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
+      "Incorrect reservation charged to cgroup 1."
+expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
+      "Incorrect hugetlb charged to cgroup 2."
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
+      "Incorrected reservation charged to cgroup 2."
+
+echo 'PASS'
+
+done # private
+done # populate
+done # method
+
+umount $cgroup_path
+rmdir $cgroup_path
diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh
new file mode 100644
index 0000000000000..08f5fa5527cfd
--- /dev/null
+++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+size=$1
+populate=$2
+write=$3
+cgroup=$4
+path=$5
+method=$6
+private=$7
+want_sleep=$8
+
+echo "Putting task in cgroup '$cgroup'"
+echo $$ > /dev/cgroup/memory/"$cgroup"/tasks
+
+echo "Method is $method"
+
+set +e
+./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \
+      "$private" "$want_sleep"
diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
new file mode 100644
index 0000000000000..f02a897427a97
--- /dev/null
+++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
@@ -0,0 +1,252 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program reserves and uses hugetlb memory, supporting a bunch of
+ * scenorios needed by the charged_reserved_hugetlb.sh test.
+ */
+
+#include <err.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/shm.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+
+/* Global definitions. */
+enum method {
+	HUGETLBFS,
+	MMAP_MAP_HUGETLB,
+	SHM,
+	MAX_METHOD
+};
+
+
+/* Global variables. */
+static const char *self;
+static char *shmaddr;
+static int shmid;
+
+/*
+ * Show usage and exit.
+ */
+static void exit_usage(void)
+{
+
+	printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
+		"[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
+		"[-o] [-w]\n", self);
+	exit(EXIT_FAILURE);
+}
+
+void sig_handler(int signo)
+{
+	printf("Received %d.\n", signo);
+	if (signo == SIGINT) {
+		printf("Deleting the memory\n");
+		if (shmdt((const void *)shmaddr) != 0) {
+			perror("Detach failure");
+			shmctl(shmid, IPC_RMID, NULL);
+			exit(4);
+		}
+
+		shmctl(shmid, IPC_RMID, NULL);
+		printf("Done deleting the memory\n");
+	}
+	exit(2);
+}
+
+int main(int argc, char **argv)
+{
+	int fd = 0;
+	int key = 0;
+	int *ptr = NULL;
+	int c = 0;
+	int size = 0;
+	char path[256] = "";
+	enum method method = MAX_METHOD;
+	int want_sleep = 0, private = 0;
+	int populate = 0;
+	int write = 0;
+
+	unsigned long i;
+
+
+	if (signal(SIGINT, sig_handler) == SIG_ERR)
+		err(1, "\ncan't catch SIGINT\n");
+
+	/* Parse command-line arguments. */
+	setvbuf(stdout, NULL, _IONBF, 0);
+	self = argv[0];
+
+	while ((c = getopt(argc, argv, "s:p:m:owlr")) != -1) {
+		switch (c) {
+		case 's':
+			size = atoi(optarg);
+			break;
+		case 'p':
+			strncpy(path, optarg, sizeof(path));
+			break;
+		case 'm':
+			if (atoi(optarg) >= MAX_METHOD) {
+				errno = EINVAL;
+				perror("Invalid -m.");
+				exit_usage();
+			}
+			method = atoi(optarg);
+			break;
+		case 'o':
+			populate = 1;
+			break;
+		case 'w':
+			write = 1;
+			break;
+		case 'l':
+			want_sleep = 1;
+			break;
+		case 'r':
+			private = 1;
+			break;
+		default:
+			errno = EINVAL;
+			perror("Invalid arg");
+			exit_usage();
+		}
+	}
+
+	if (strncmp(path, "", sizeof(path)) != 0) {
+		printf("Writing to this path: %s\n", path);
+	} else {
+		errno = EINVAL;
+		perror("path not found");
+		exit_usage();
+	}
+
+	if (size != 0) {
+		printf("Writing this size: %d\n", size);
+	} else {
+		errno = EINVAL;
+		perror("size not found");
+		exit_usage();
+	}
+
+	if (!populate)
+		printf("Not populating.\n");
+	else
+		printf("Populating.\n");
+
+	if (!write)
+		printf("Not writing to memory.\n");
+
+	if (method == MAX_METHOD) {
+		errno = EINVAL;
+		perror("-m Invalid");
+		exit_usage();
+	} else
+		printf("Using method=%d\n", method);
+
+	if (!private)
+		printf("Shared mapping.\n");
+	else
+		printf("Private mapping.\n");
+
+
+	switch (method) {
+	case HUGETLBFS:
+		printf("Allocating using HUGETLBFS.\n");
+		fd = open(path, O_CREAT | O_RDWR, 0777);
+		if (fd == -1)
+			err(1, "Failed to open file.");
+
+		ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			(private ? MAP_PRIVATE : MAP_SHARED) | (populate ?
+				MAP_POPULATE : 0), fd, 0);
+
+		if (ptr == MAP_FAILED) {
+			close(fd);
+			err(1, "Error mapping the file");
+		}
+		break;
+	case MMAP_MAP_HUGETLB:
+		printf("Allocating using MAP_HUGETLB.\n");
+		ptr = mmap(NULL, size,
+		PROT_READ | PROT_WRITE,
+		(private ? (MAP_PRIVATE | MAP_ANONYMOUS) : MAP_SHARED) |
+		MAP_HUGETLB | (populate ?
+			MAP_POPULATE : 0),
+		-1, 0);
+
+		if (ptr == MAP_FAILED)
+			err(1, "mmap");
+
+		printf("Returned address is %p\n", ptr);
+		break;
+	case SHM:
+		printf("Allocating using SHM.\n");
+		shmid = shmget(key, size, SHM_HUGETLB | IPC_CREAT | SHM_R |
+				SHM_W);
+		if (shmid < 0) {
+			shmid = shmget(++key, size, SHM_HUGETLB | IPC_CREAT |
+					SHM_R | SHM_W);
+			if (shmid < 0)
+				err(1, "shmget");
+
+		}
+		printf("shmid: 0x%x, shmget key:%d\n", shmid, key);
+
+		shmaddr = shmat(shmid, NULL, 0);
+		if (shmaddr == (char *)-1) {
+			perror("Shared memory attach failure");
+			shmctl(shmid, IPC_RMID, NULL);
+			exit(2);
+		}
+		printf("shmaddr: %p\n", shmaddr);
+
+		break;
+	default:
+		errno = EINVAL;
+		err(1, "Invalid method.");
+	}
+
+	if (write) {
+		printf("Writing to memory.\n");
+		if (method != SHM) {
+			memset(ptr, 1, size);
+		} else {
+			printf("Starting the writes:\n");
+			for (i = 0; i < size; i++) {
+				shmaddr[i] = (char)(i);
+				if (!(i % (1024 * 1024)))
+					printf(".");
+			}
+			printf("\n");
+
+			printf("Starting the Check...");
+			for (i = 0; i < size; i++)
+				if (shmaddr[i] != (char)i) {
+					printf("\nIndex %lu mismatched\n", i);
+					exit(3);
+				}
+			printf("Done.\n");
+
+
+		}
+	}
+
+	if (want_sleep) {
+		/* Signal to caller that we're done. */
+		printf("DONE\n");
+
+		/* Hold memory until external kill signal is delivered. */
+		while (1)
+			sleep(100);
+	}
+
+	close(fd);
+
+	return 0;
+}
--
2.23.0.rc1.153.gdeed80330f-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
  2019-08-08 23:13 [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (4 preceding siblings ...)
  2019-08-08 23:13 ` [RFC PATCH v2 5/5] hugetlb_cgroup: Add hugetlb_cgroup reservation tests Mina Almasry
@ 2019-08-09 17:54 ` Mike Kravetz
  2019-08-09 19:42   ` Mina Almasry
       [not found] ` <20190815035352.14952-1-hdanton@sina.com>
  6 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2019-08-09 17:54 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, rientjes, shakeelb, gthelen, akpm, khalid.aziz,
	linux-kernel, linux-mm, linux-kselftest, Michal Koutný,
	Aneesh Kumar, cgroups

(+CC  Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar)

On 8/8/19 4:13 PM, Mina Almasry wrote:
> Problem:
> Currently tasks attempting to allocate more hugetlb memory than is available get
> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> However, if a task attempts to allocate hugetlb memory only more than its
> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> but will SIGBUS the task when it attempts to fault the memory in.
> 
> We have developers interested in using hugetlb_cgroups, and they have expressed
> dissatisfaction regarding this behavior. We'd like to improve this
> behavior such that tasks violating the hugetlb_cgroup limits get an error on
> mmap/shmget time, rather than getting SIGBUS'd when they try to fault
> the excess memory in.
> 
> The underlying problem is that today's hugetlb_cgroup accounting happens
> at hugetlb memory *fault* time, rather than at *reservation* time.
> Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> the offending task gets SIGBUS'd.
> 
> Proposed Solution:
> A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
> counter has slightly different semantics than
> hugetlb.xMB.[limit|usage]_in_bytes:
> 
> - While usage_in_bytes tracks all *faulted* hugetlb memory,
> reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
> 
> - If a task attempts to reserve more memory than limit_in_bytes allows,
> the kernel will allow it to do so. But if a task attempts to reserve
> more memory than reservation_limit_in_bytes, the kernel will fail this
> reservation.
> 
> This proposal is implemented in this patch, with tests to verify
> functionality and show the usage.

Thanks for taking on this effort Mina.

Before looking at the details of the code, it might be helpful to discuss
the expected semantics of the proposed reservation limits.

I see you took into account the differences between private and shared
mappings.  This is good, as the reservation behavior is different for each
of these cases.  First let's look at private mappings.

For private mappings, the reservation usage will be the size of the mapping.
This should be fairly simple.  As reservations are consumed in the hugetlbfs
code, reservations in the resv_map are removed.  I see you have a hook into
region_del.  So, the expectation is that as reservations are consumed the
reservation usage will drop for the cgroup.  Correct?
The only tricky thing about private mappings is COW because of fork.  Current
reservation semantics specify that all reservations stay with the parent.
If child faults and can not get page, SIGBUS.  I assume the new reservation
limits will work the same.

I believe tracking reservations for shared mappings can get quite complicated.
The hugetlbfs reservation code around shared mappings 'works' on the basis
that shared mapping reservations are global.  As a result, reservations are
more associated with the inode than with the task making the reservation.
For example, consider a file of size 4 hugetlb pages.
Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
all 4 pages, and 2 additional reservations are taken.  I am not really sure
of the desired semantics here for reservation limits if A and B are in separate
cgroups.  Should B be charged for 4 or 2 reservations?
Also in the example above, after both tasks create their mappings suppose
Task B faults in the first page.  Does the reservation usage of Task A go
down as it originally had the reservation?

It should also be noted that when hugetlbfs reservations are 'consumed' for
shared mappings there are no changes to the resv_map.  Rather the unmap code
compares the contents of the page cache to the resv_map to determine how
many reservations were actually consumed.  I did not look close enough to
determine the code drops reservation usage counts as pages are added to shared
mappings.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
  2019-08-09 17:54 ` [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mike Kravetz
@ 2019-08-09 19:42   ` Mina Almasry
  2019-08-10 18:58     ` Mike Kravetz
  0 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-08-09 19:42 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest,
	Michal Koutný,
	Aneesh Kumar, cgroups

On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> (+CC  Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar)
>
> On 8/8/19 4:13 PM, Mina Almasry wrote:
> > Problem:
> > Currently tasks attempting to allocate more hugetlb memory than is available get
> > a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> > However, if a task attempts to allocate hugetlb memory only more than its
> > hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> > but will SIGBUS the task when it attempts to fault the memory in.
> >
> > We have developers interested in using hugetlb_cgroups, and they have expressed
> > dissatisfaction regarding this behavior. We'd like to improve this
> > behavior such that tasks violating the hugetlb_cgroup limits get an error on
> > mmap/shmget time, rather than getting SIGBUS'd when they try to fault
> > the excess memory in.
> >
> > The underlying problem is that today's hugetlb_cgroup accounting happens
> > at hugetlb memory *fault* time, rather than at *reservation* time.
> > Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> > the offending task gets SIGBUS'd.
> >
> > Proposed Solution:
> > A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
> > counter has slightly different semantics than
> > hugetlb.xMB.[limit|usage]_in_bytes:
> >
> > - While usage_in_bytes tracks all *faulted* hugetlb memory,
> > reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
> >
> > - If a task attempts to reserve more memory than limit_in_bytes allows,
> > the kernel will allow it to do so. But if a task attempts to reserve
> > more memory than reservation_limit_in_bytes, the kernel will fail this
> > reservation.
> >
> > This proposal is implemented in this patch, with tests to verify
> > functionality and show the usage.
>
> Thanks for taking on this effort Mina.
>
No problem! Thanks for reviewing!

> Before looking at the details of the code, it might be helpful to discuss
> the expected semantics of the proposed reservation limits.
>
> I see you took into account the differences between private and shared
> mappings.  This is good, as the reservation behavior is different for each
> of these cases.  First let's look at private mappings.
>
> For private mappings, the reservation usage will be the size of the mapping.
> This should be fairly simple.  As reservations are consumed in the hugetlbfs
> code, reservations in the resv_map are removed.  I see you have a hook into
> region_del.  So, the expectation is that as reservations are consumed the
> reservation usage will drop for the cgroup.  Correct?

I assume by 'reservations are consumed' you mean when a reservation
goes from just 'reserved' to actually in use (as in the task is
writing to the hugetlb page or something). If so, then the answer is
no, that is not correct. When reservations are consumed, the
reservation usage stays the same. I.e. the reservation usage tracks
hugetlb memory (reserved + used) you could say. This is 100% the
intention, as we want to know on mmap time if there is enough 'free'
(that is unreserved and unused) memory left over in the cgroup to
satisfy the mmap call.

The hooks in region_add and region_del are to account shared mappings
only. There is a check in those code blocks that makes sure the code
is only engaged in shared mappings. The commit messages of patches 3/5
and 4/5 go into more details regarding this.

> The only tricky thing about private mappings is COW because of fork.  Current
> reservation semantics specify that all reservations stay with the parent.
> If child faults and can not get page, SIGBUS.  I assume the new reservation
> limits will work the same.
>

Although I did not explicitly try it, yes. It should work the same.
The additional reservation due to the COW will get charged to whatever
cgroup the fork is in. If the task can't get a page it gets SIGBUS'd.
If there is not enough room to charge the cgroup it's in, then the
charge will fail, which I assume will trigger error path that also
leads to SIGBUS.

> I believe tracking reservations for shared mappings can get quite complicated.
> The hugetlbfs reservation code around shared mappings 'works' on the basis
> that shared mapping reservations are global.  As a result, reservations are
> more associated with the inode than with the task making the reservation.

FWIW, I found it not too bad. And my tests at least don't detect an
anomaly around shared mappings. The key I think is that I'm tracking
cgroup to uncharge on the file_region entry inside the resv_map, so we
know who allocated each file_region entry exactly and we can uncharge
them when the entry is region_del'd.

> For example, consider a file of size 4 hugetlb pages.
> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
> all 4 pages, and 2 additional reservations are taken.  I am not really sure
> of the desired semantics here for reservation limits if A and B are in separate
> cgroups.  Should B be charged for 4 or 2 reservations?

Task A's cgroup is charged 2 pages to its reservation usage.
Task B's cgroup is charged 2 pages to its reservation usage.

This is analogous to how shared memory accounting is done for things
like tmpfs, and I see no strong reason right now to deviate. I.e. the
task that made the reservation is charged with it, and others use it
without getting charged.

> Also in the example above, after both tasks create their mappings suppose
> Task B faults in the first page.  Does the reservation usage of Task A go
> down as it originally had the reservation?
>

Reservation usage never goes down when pages are consumed. Yes, I
would have this problem if I was planning to decrement reservation
usage when pages are put into use, but, the goal is to find out if
there is 'free' memory (unreserved + unused) in the cgroup at mmap
time, so we want a counter that tracks (reserved + used).

> It should also be noted that when hugetlbfs reservations are 'consumed' for
> shared mappings there are no changes to the resv_map.  Rather the unmap code
> compares the contents of the page cache to the resv_map to determine how
> many reservations were actually consumed.  I did not look close enough to
> determine the code drops reservation usage counts as pages are added to shared
> mappings.

I think this concern also goes away if reservation usage doesn't go
down when pages are consumed, but let me know if you still have
doubts.

Thanks for taking a look so far!
>
> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
  2019-08-09 19:42   ` Mina Almasry
@ 2019-08-10 18:58     ` Mike Kravetz
  2019-08-10 22:01       ` Mina Almasry
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2019-08-10 18:58 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest,
	Michal Koutný,
	Aneesh Kumar, cgroups

On 8/9/19 12:42 PM, Mina Almasry wrote:
> On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>> On 8/8/19 4:13 PM, Mina Almasry wrote:
>>> Problem:
>>> Currently tasks attempting to allocate more hugetlb memory than is available get
>>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
>>> However, if a task attempts to allocate hugetlb memory only more than its
>>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
>>> but will SIGBUS the task when it attempts to fault the memory in.
<snip>
>> I believe tracking reservations for shared mappings can get quite complicated.
>> The hugetlbfs reservation code around shared mappings 'works' on the basis
>> that shared mapping reservations are global.  As a result, reservations are
>> more associated with the inode than with the task making the reservation.
> 
> FWIW, I found it not too bad. And my tests at least don't detect an
> anomaly around shared mappings. The key I think is that I'm tracking
> cgroup to uncharge on the file_region entry inside the resv_map, so we
> know who allocated each file_region entry exactly and we can uncharge
> them when the entry is region_del'd.
> 
>> For example, consider a file of size 4 hugetlb pages.
>> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
>> all 4 pages, and 2 additional reservations are taken.  I am not really sure
>> of the desired semantics here for reservation limits if A and B are in separate
>> cgroups.  Should B be charged for 4 or 2 reservations?
> 
> Task A's cgroup is charged 2 pages to its reservation usage.
> Task B's cgroup is charged 2 pages to its reservation usage.

OK,
Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
allocation.  The mmap would succeed, but Task B could potentially need to
allocate more than 2 huge pages.  So, when faulting in more than 2 huge
pages B would get a SIGBUS.  Correct?  Or, am I missing something?

Perhaps reservation charge should always be the same as map size/maximum
allocation size?
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
  2019-08-10 18:58     ` Mike Kravetz
@ 2019-08-10 22:01       ` Mina Almasry
  2019-08-13 23:40         ` Mike Kravetz
  0 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-08-10 22:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest,
	Michal Koutný,
	Aneesh Kumar, cgroups

On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/9/19 12:42 PM, Mina Almasry wrote:
> > On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >> On 8/8/19 4:13 PM, Mina Almasry wrote:
> >>> Problem:
> >>> Currently tasks attempting to allocate more hugetlb memory than is available get
> >>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> >>> However, if a task attempts to allocate hugetlb memory only more than its
> >>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> >>> but will SIGBUS the task when it attempts to fault the memory in.
> <snip>
> >> I believe tracking reservations for shared mappings can get quite complicated.
> >> The hugetlbfs reservation code around shared mappings 'works' on the basis
> >> that shared mapping reservations are global.  As a result, reservations are
> >> more associated with the inode than with the task making the reservation.
> >
> > FWIW, I found it not too bad. And my tests at least don't detect an
> > anomaly around shared mappings. The key I think is that I'm tracking
> > cgroup to uncharge on the file_region entry inside the resv_map, so we
> > know who allocated each file_region entry exactly and we can uncharge
> > them when the entry is region_del'd.
> >
> >> For example, consider a file of size 4 hugetlb pages.
> >> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
> >> all 4 pages, and 2 additional reservations are taken.  I am not really sure
> >> of the desired semantics here for reservation limits if A and B are in separate
> >> cgroups.  Should B be charged for 4 or 2 reservations?
> >
> > Task A's cgroup is charged 2 pages to its reservation usage.
> > Task B's cgroup is charged 2 pages to its reservation usage.
>
> OK,
> Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
> allocation.  The mmap would succeed, but Task B could potentially need to
> allocate more than 2 huge pages.  So, when faulting in more than 2 huge
> pages B would get a SIGBUS.  Correct?  Or, am I missing something?
>
> Perhaps reservation charge should always be the same as map size/maximum
> allocation size?

I'm thinking this would work similar to how other shared memory like
tmpfs is accounted for right now. I.e. if a task conducts an operation
that causes memory to be allocated then that task is charged for that
memory, and if another task uses memory that has already been
allocated and charged by another task, then it can use the memory
without being charged.

So in case of hugetlb memory, if a task is mmaping memory that causes
a new reservation to be made, and new entries to be created in the
resv_map for the shared mapping, then that task gets charged. If the
task is mmaping memory that is already reserved or faulted, then it
reserves or faults it without getting charged.

In the example above, in chronological order:
- Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations.
- Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb
reservations because the first 2 are charged already and can be used
without incurring a charge.
- Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults,
since none of the 4 pages are faulted in yet. If the task is only
allowed 2 hugetlb page faults then it will actually get a SIGBUS.
- Task A accesses 4 hugetlb pages, gets charged no faults, since all
the hugetlb faults is charged to Task B.

So, yes, I can see a scenario where userspace still gets SIGBUS'd, but
I think that's fine because:
1. Notice that the SIGBUS is due to the faulting limit, and not the
reservation limit, so we're not regressing the status quo per say.
Folks using the fault limit today understand the SIGBUS risk.
2. the way I expect folks to use this is to use 'reservation limits'
to partition the available hugetlb memory on the machine using it and
forgo using the existing fault limits. Using both at the same time I
think would be a superuser feature for folks that really know what
they are doing, and understand the risk of SIGBUS that comes with
using the existing fault limits.
3. I expect userspace to in general handle this correctly because
there are similar challenges with all shared memory and accounting of
it, even in tmpfs, I think.

I would not like to charge the full reservation to every process that
does the mmap. Think of this, much more common scenario: Task A and B
are supposed to collaborate on a 10 hugetlb pages of data. Task B
should not access any hugetlb memory other than the memory it is
working on with Task A, so:

1. Task A is put in a cgroup with 10 hugetlb pages reservation limit.
2. Task B is put in a cgroup with 0 hugetlb pages of reservation limit.
3. Task A mmaps 10 hugetlb pages of hugetlb memory, and notifies Task
B that it is done.
4. Task B, due to programmer error, tries to mmap hugetlb memory
beyond what Task A set up for it, it gets denied at mmap time by the
cgroup reservation limit.
5. Task B mmaps the same 10 hugetlb pages of memory and starts working
on them. The mmap succeeds because Task B is not charged anything.

If we were charging the full reservation to both Tasks A and B, then
both A and B would have be in cgroups that allow 10 pages of hugetlb
reservations, and the mmap in step 4 would succeed, and we
accidentally overcommitted the amount of hugetlb memory available.

> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
  2019-08-10 22:01       ` Mina Almasry
@ 2019-08-13 23:40         ` Mike Kravetz
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2019-08-13 23:40 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest,
	Michal Koutný,
	Aneesh Kumar, cgroups

On 8/10/19 3:01 PM, Mina Almasry wrote:
> On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 8/9/19 12:42 PM, Mina Almasry wrote:
>>> On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>> On 8/8/19 4:13 PM, Mina Almasry wrote:
>>>>> Problem:
>>>>> Currently tasks attempting to allocate more hugetlb memory than is available get
>>>>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
>>>>> However, if a task attempts to allocate hugetlb memory only more than its
>>>>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
>>>>> but will SIGBUS the task when it attempts to fault the memory in.
>> <snip>
>>>> I believe tracking reservations for shared mappings can get quite complicated.
>>>> The hugetlbfs reservation code around shared mappings 'works' on the basis
>>>> that shared mapping reservations are global.  As a result, reservations are
>>>> more associated with the inode than with the task making the reservation.
>>>
>>> FWIW, I found it not too bad. And my tests at least don't detect an
>>> anomaly around shared mappings. The key I think is that I'm tracking
>>> cgroup to uncharge on the file_region entry inside the resv_map, so we
>>> know who allocated each file_region entry exactly and we can uncharge
>>> them when the entry is region_del'd.
>>>
>>>> For example, consider a file of size 4 hugetlb pages.
>>>> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
>>>> all 4 pages, and 2 additional reservations are taken.  I am not really sure
>>>> of the desired semantics here for reservation limits if A and B are in separate
>>>> cgroups.  Should B be charged for 4 or 2 reservations?
>>>
>>> Task A's cgroup is charged 2 pages to its reservation usage.
>>> Task B's cgroup is charged 2 pages to its reservation usage.
>>
>> OK,
>> Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
>> allocation.  The mmap would succeed, but Task B could potentially need to
>> allocate more than 2 huge pages.  So, when faulting in more than 2 huge
>> pages B would get a SIGBUS.  Correct?  Or, am I missing something?
>>
>> Perhaps reservation charge should always be the same as map size/maximum
>> allocation size?
> 
> I'm thinking this would work similar to how other shared memory like
> tmpfs is accounted for right now. I.e. if a task conducts an operation
> that causes memory to be allocated then that task is charged for that
> memory, and if another task uses memory that has already been
> allocated and charged by another task, then it can use the memory
> without being charged.
> 
> So in case of hugetlb memory, if a task is mmaping memory that causes
> a new reservation to be made, and new entries to be created in the
> resv_map for the shared mapping, then that task gets charged. If the
> task is mmaping memory that is already reserved or faulted, then it
> reserves or faults it without getting charged.
> 
> In the example above, in chronological order:
> - Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations.
> - Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb
> reservations because the first 2 are charged already and can be used
> without incurring a charge.
> - Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults,
> since none of the 4 pages are faulted in yet. If the task is only
> allowed 2 hugetlb page faults then it will actually get a SIGBUS.
> - Task A accesses 4 hugetlb pages, gets charged no faults, since all
> the hugetlb faults is charged to Task B.
> 
> So, yes, I can see a scenario where userspace still gets SIGBUS'd, but
> I think that's fine because:
> 1. Notice that the SIGBUS is due to the faulting limit, and not the
> reservation limit, so we're not regressing the status quo per say.
> Folks using the fault limit today understand the SIGBUS risk.
> 2. the way I expect folks to use this is to use 'reservation limits'
> to partition the available hugetlb memory on the machine using it and
> forgo using the existing fault limits. Using both at the same time I
> think would be a superuser feature for folks that really know what
> they are doing, and understand the risk of SIGBUS that comes with
> using the existing fault limits.
> 3. I expect userspace to in general handle this correctly because
> there are similar challenges with all shared memory and accounting of
> it, even in tmpfs, I think.

Ok, that helps explain your use case.  I agree that it would be difficult
to use both fault and reservation limits together.  Especially in the case
of shared mappings.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings
  2019-08-08 23:13 ` [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings Mina Almasry
@ 2019-08-13 23:54   ` Mike Kravetz
  2019-08-14 16:46     ` Mike Kravetz
  2019-08-15 23:08     ` Mina Almasry
  0 siblings, 2 replies; 19+ messages in thread
From: Mike Kravetz @ 2019-08-13 23:54 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, rientjes, shakeelb, gthelen, akpm, khalid.aziz,
	linux-kernel, linux-mm, linux-kselftest

On 8/8/19 4:13 PM, Mina Almasry wrote:
> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
> in the resv_map entries, in file_region->reservation_counter.
> 
> When a file_region entry is added to the resv_map via region_add, we
> also charge the appropriate hugetlb_cgroup and put the pointer to that
> in file_region->reservation_counter. This is slightly delicate since we
> need to not modify the resv_map until we know that charging the
> reservation has succeeded. If charging doesn't succeed, we report the
> error to the caller, so that the kernel fails the reservation.

I wish we did not need to modify these region_() routines as they are
already difficult to understand.  However, I see no other way with the
desired semantics.

> On region_del, which is when the hugetlb memory is unreserved, we delete
> the file_region entry in the resv_map, but also uncharge the
> file_region->reservation_counter.
> 
> ---
>  mm/hugetlb.c | 208 +++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 170 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 235996aef6618..d76e3137110ab 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -242,8 +242,72 @@ struct file_region {
>  	struct list_head link;
>  	long from;
>  	long to;
> +#ifdef CONFIG_CGROUP_HUGETLB
> +	/*
> +	 * On shared mappings, each reserved region appears as a struct
> +	 * file_region in resv_map. These fields hold the info needed to
> +	 * uncharge each reservation.
> +	 */
> +	struct page_counter *reservation_counter;
> +	unsigned long pages_per_hpage;
> +#endif
>  };
> 
> +/* Must be called with resv->lock held. Calling this with dry_run == true will
> + * count the number of pages added but will not modify the linked list.
> + */
> +static long consume_regions_we_overlap_with(struct file_region *rg,
> +		struct list_head *head, long f, long *t,
> +		struct hugetlb_cgroup *h_cg,
> +		struct hstate *h,
> +		bool dry_run)
> +{
> +	long add = 0;
> +	struct file_region *trg = NULL, *nrg = NULL;
> +
> +	/* Consume any regions we now overlap with. */
> +	nrg = rg;
> +	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> +		if (&rg->link == head)
> +			break;
> +		if (rg->from > *t)
> +			break;
> +
> +		/* If this area reaches higher then extend our area to
> +		 * include it completely.  If this is not the first area
> +		 * which we intend to reuse, free it.
> +		 */
> +		if (rg->to > *t)
> +			*t = rg->to;
> +		if (rg != nrg) {
> +			/* Decrement return value by the deleted range.
> +			 * Another range will span this area so that by
> +			 * end of routine add will be >= zero
> +			 */
> +			add -= (rg->to - rg->from);
> +			if (!dry_run) {
> +				list_del(&rg->link);
> +				kfree(rg);

Is it possible that the region struct we are deleting pointed to
a reservation_counter?  Perhaps even for another cgroup?
Just concerned with the way regions are coalesced that we may be
deleting counters.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings
  2019-08-13 23:54   ` Mike Kravetz
@ 2019-08-14 16:46     ` Mike Kravetz
  2019-08-15 23:04       ` Mina Almasry
  2019-08-15 23:08     ` Mina Almasry
  1 sibling, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2019-08-14 16:46 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, rientjes, shakeelb, gthelen, akpm, khalid.aziz,
	linux-kernel, linux-mm, linux-kselftest

On 8/13/19 4:54 PM, Mike Kravetz wrote:
> On 8/8/19 4:13 PM, Mina Almasry wrote:
>> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
>> in the resv_map entries, in file_region->reservation_counter.
>>
>> When a file_region entry is added to the resv_map via region_add, we
>> also charge the appropriate hugetlb_cgroup and put the pointer to that
>> in file_region->reservation_counter. This is slightly delicate since we
>> need to not modify the resv_map until we know that charging the
>> reservation has succeeded. If charging doesn't succeed, we report the
>> error to the caller, so that the kernel fails the reservation.
> 
> I wish we did not need to modify these region_() routines as they are
> already difficult to understand.  However, I see no other way with the
> desired semantics.
> 

I suspect you have considered this, but what about using the return value
from region_chg() in hugetlb_reserve_pages() to charge reservation limits?
There is a VERY SMALL race where the value could be too large, but that
can be checked and adjusted at region_add time as is done with normal
accounting today.  If the question is, where would we store the information
to uncharge?, then we can hang a structure off the vma.  This would be
similar to what is done for private mappings.  In fact, I would suggest
making them both use a new cgroup reserve structure hanging off the vma.

One issue I see is what to do if a vma is split?  The private mapping case
'should' handle this today, but I would not be surprised if such code is
missing or incorrect.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings
  2019-08-14 16:46     ` Mike Kravetz
@ 2019-08-15 23:04       ` Mina Almasry
  2019-08-16 16:28         ` Mike Kravetz
  0 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-08-15 23:04 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest

On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/13/19 4:54 PM, Mike Kravetz wrote:
> > On 8/8/19 4:13 PM, Mina Almasry wrote:
> >> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
> >> in the resv_map entries, in file_region->reservation_counter.
> >>
> >> When a file_region entry is added to the resv_map via region_add, we
> >> also charge the appropriate hugetlb_cgroup and put the pointer to that
> >> in file_region->reservation_counter. This is slightly delicate since we
> >> need to not modify the resv_map until we know that charging the
> >> reservation has succeeded. If charging doesn't succeed, we report the
> >> error to the caller, so that the kernel fails the reservation.
> >
> > I wish we did not need to modify these region_() routines as they are
> > already difficult to understand.  However, I see no other way with the
> > desired semantics.
> >
>
> I suspect you have considered this, but what about using the return value
> from region_chg() in hugetlb_reserve_pages() to charge reservation limits?
> There is a VERY SMALL race where the value could be too large, but that
> can be checked and adjusted at region_add time as is done with normal
> accounting today.

I have not actually until now; I didn't consider doing stuff with the
resv_map while not holding onto the resv_map->lock. I guess that's the
small race you're talking about. Seems fine to me, but I'm more
worried about hanging off the vma below.

> If the question is, where would we store the information
> to uncharge?, then we can hang a structure off the vma.  This would be
> similar to what is done for private mappings.  In fact, I would suggest
> making them both use a new cgroup reserve structure hanging off the vma.
>

I actually did consider hanging off the info to uncharge off the vma,
but I didn't for a couple of reasons:

1. region_del is called from hugetlb_unreserve_pages, and I don't have
access to the vma there. Maybe there is a way to query the proper vma
I don't know about?
2. hugetlb_reserve_pages seems to be able to conduct a reservation
with a NULL *vma. Not sure what to do in that case.

Is there a way to get around these that I'm missing here?

FWIW I think tracking is better in resv_map since the reservations are
in resv_map themselves. If I do another structure, then for each
reservation there will be an entry in resv_map and an entry in the new
structure and they need to be kept in sync and I need to handle errors
for when they get out of sync.

> One issue I see is what to do if a vma is split?  The private mapping case
> 'should' handle this today, but I would not be surprised if such code is
> missing or incorrect.
>
> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings
  2019-08-13 23:54   ` Mike Kravetz
  2019-08-14 16:46     ` Mike Kravetz
@ 2019-08-15 23:08     ` Mina Almasry
  2019-08-16 16:33       ` Mike Kravetz
  1 sibling, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-08-15 23:08 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest

On Tue, Aug 13, 2019 at 4:54 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/8/19 4:13 PM, Mina Almasry wrote:
> > For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
> > in the resv_map entries, in file_region->reservation_counter.
> >
> > When a file_region entry is added to the resv_map via region_add, we
> > also charge the appropriate hugetlb_cgroup and put the pointer to that
> > in file_region->reservation_counter. This is slightly delicate since we
> > need to not modify the resv_map until we know that charging the
> > reservation has succeeded. If charging doesn't succeed, we report the
> > error to the caller, so that the kernel fails the reservation.
>
> I wish we did not need to modify these region_() routines as they are
> already difficult to understand.  However, I see no other way with the
> desired semantics.
>
> > On region_del, which is when the hugetlb memory is unreserved, we delete
> > the file_region entry in the resv_map, but also uncharge the
> > file_region->reservation_counter.
> >
> > ---
> >  mm/hugetlb.c | 208 +++++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 170 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 235996aef6618..d76e3137110ab 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -242,8 +242,72 @@ struct file_region {
> >       struct list_head link;
> >       long from;
> >       long to;
> > +#ifdef CONFIG_CGROUP_HUGETLB
> > +     /*
> > +      * On shared mappings, each reserved region appears as a struct
> > +      * file_region in resv_map. These fields hold the info needed to
> > +      * uncharge each reservation.
> > +      */
> > +     struct page_counter *reservation_counter;
> > +     unsigned long pages_per_hpage;
> > +#endif
> >  };
> >
> > +/* Must be called with resv->lock held. Calling this with dry_run == true will
> > + * count the number of pages added but will not modify the linked list.
> > + */
> > +static long consume_regions_we_overlap_with(struct file_region *rg,
> > +             struct list_head *head, long f, long *t,
> > +             struct hugetlb_cgroup *h_cg,
> > +             struct hstate *h,
> > +             bool dry_run)
> > +{
> > +     long add = 0;
> > +     struct file_region *trg = NULL, *nrg = NULL;
> > +
> > +     /* Consume any regions we now overlap with. */
> > +     nrg = rg;
> > +     list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> > +             if (&rg->link == head)
> > +                     break;
> > +             if (rg->from > *t)
> > +                     break;
> > +
> > +             /* If this area reaches higher then extend our area to
> > +              * include it completely.  If this is not the first area
> > +              * which we intend to reuse, free it.
> > +              */
> > +             if (rg->to > *t)
> > +                     *t = rg->to;
> > +             if (rg != nrg) {
> > +                     /* Decrement return value by the deleted range.
> > +                      * Another range will span this area so that by
> > +                      * end of routine add will be >= zero
> > +                      */
> > +                     add -= (rg->to - rg->from);
> > +                     if (!dry_run) {
> > +                             list_del(&rg->link);
> > +                             kfree(rg);
>
> Is it possible that the region struct we are deleting pointed to
> a reservation_counter?  Perhaps even for another cgroup?
> Just concerned with the way regions are coalesced that we may be
> deleting counters.
>

Yep, that needs to be handled I think. Thanks for catching!


> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter
       [not found] ` <20190815035352.14952-1-hdanton@sina.com>
@ 2019-08-15 23:21   ` Mina Almasry
  0 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-08-15 23:21 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Mike Kravetz, shuah, David Rientjes, Shakeel Butt, Greg Thelen,
	akpm, khalid.aziz, open list, linux-mm, linux-kselftest

On Wed, Aug 14, 2019 at 8:54 PM Hillf Danton <hdanton@sina.com> wrote:
>
>
> On Thu,  8 Aug 2019 16:13:36 -0700 Mina Almasry wrote:
> >
> > These counters will track hugetlb reservations rather than hugetlb
> > memory faulted in. This patch only adds the counter, following patches
> > add the charging and uncharging of the counter.
> > ---
>
>   !?!
>

Thanks for reviewing. I'm not sure what you're referring to though.
What's wrong here?

> >  include/linux/hugetlb.h |  2 +-
> >  mm/hugetlb_cgroup.c     | 86 +++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 80 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index edfca42783192..6777b3013345d 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -340,7 +340,7 @@ struct hstate {
> >       unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> >  #ifdef CONFIG_CGROUP_HUGETLB
> >       /* cgroup control files */
> > -     struct cftype cgroup_files[5];
> > +     struct cftype cgroup_files[9];
>
> Move that enum in this header file and replace numbers with characters
> to easy both reading and maintaining.
> >  #endif
> >       char name[HSTATE_NAME_LEN];
> >  };
> > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> > index 68c2f2f3c05b7..708103663988a 100644
> > --- a/mm/hugetlb_cgroup.c
> > +++ b/mm/hugetlb_cgroup.c
> > @@ -25,6 +25,10 @@ struct hugetlb_cgroup {
> >        * the counter to account for hugepages from hugetlb.
> >        */
> >       struct page_counter hugepage[HUGE_MAX_HSTATE];
> > +     /*
> > +      * the counter to account for hugepage reservations from hugetlb.
> > +      */
> > +     struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
> >  };
> >
> >  #define MEMFILE_PRIVATE(x, val)      (((x) << 16) | (val))
> > @@ -33,6 +37,15 @@ struct hugetlb_cgroup {
> >
> >  static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
> >
> > +static inline
> > +struct page_counter *get_counter(struct hugetlb_cgroup *h_cg, int idx,
> > +                              bool reserved)
>
> s/get_/hugetlb_cgroup_get_/ to make it not too generic.
> > +{
> > +     if (reserved)
> > +             return  &h_cg->reserved_hugepage[idx];
> > +     return &h_cg->hugepage[idx];
> > +}
> > +
> >  static inline
> >  struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
> >  {
> > @@ -256,28 +269,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> >
> >  enum {
> >       RES_USAGE,
> > +     RES_RESERVATION_USAGE,
> >       RES_LIMIT,
> > +     RES_RESERVATION_LIMIT,
> >       RES_MAX_USAGE,
> > +     RES_RESERVATION_MAX_USAGE,
> >       RES_FAILCNT,
> > +     RES_RESERVATION_FAILCNT,
> >  };
> >
> >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> >                                  struct cftype *cft)
> >  {
> >       struct page_counter *counter;
> > +     struct page_counter *reserved_counter;
> >       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
> >
> >       counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> > +     reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
> >
> >       switch (MEMFILE_ATTR(cft->private)) {
> >       case RES_USAGE:
> >               return (u64)page_counter_read(counter) * PAGE_SIZE;
> > +     case RES_RESERVATION_USAGE:
> > +             return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
> >       case RES_LIMIT:
> >               return (u64)counter->max * PAGE_SIZE;
> > +     case RES_RESERVATION_LIMIT:
> > +             return (u64)reserved_counter->max * PAGE_SIZE;
> >       case RES_MAX_USAGE:
> >               return (u64)counter->watermark * PAGE_SIZE;
> > +     case RES_RESERVATION_MAX_USAGE:
> > +             return (u64)reserved_counter->watermark * PAGE_SIZE;
> >       case RES_FAILCNT:
> >               return counter->failcnt;
> > +     case RES_RESERVATION_FAILCNT:
> > +             return reserved_counter->failcnt;
> >       default:
> >               BUG();
> >       }
> > @@ -291,6 +318,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> >       int ret, idx;
> >       unsigned long nr_pages;
> >       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> > +     bool reserved = false;
> >
> >       if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
> >               return -EINVAL;
> > @@ -303,10 +331,16 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> >       idx = MEMFILE_IDX(of_cft(of)->private);
> >       nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
> >
> > +     if (MEMFILE_ATTR(of_cft(of)->private) == RES_RESERVATION_LIMIT) {
> > +             reserved = true;
> > +     }
> > +
> >       switch (MEMFILE_ATTR(of_cft(of)->private)) {
> > +     case RES_RESERVATION_LIMIT:
>                 reserved = true;
>                 /* fall thru */
>
> >       case RES_LIMIT:
> >               mutex_lock(&hugetlb_limit_mutex);
> > -             ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
> > +             ret = page_counter_set_max(get_counter(h_cg, idx, reserved),
> > +                                        nr_pages);
> >               mutex_unlock(&hugetlb_limit_mutex);
> >               break;
> >       default:
> > @@ -320,18 +354,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
> >                                   char *buf, size_t nbytes, loff_t off)
> >  {
> >       int ret = 0;
> > -     struct page_counter *counter;
> > +     struct page_counter *counter, *reserved_counter;
> >       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> >
> >       counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
> > +     reserved_counter = &h_cg->reserved_hugepage[
> > +             MEMFILE_IDX(of_cft(of)->private)];
> >
> >       switch (MEMFILE_ATTR(of_cft(of)->private)) {
> >       case RES_MAX_USAGE:
> >               page_counter_reset_watermark(counter);
> >               break;
> > +     case RES_RESERVATION_MAX_USAGE:
> > +             page_counter_reset_watermark(reserved_counter);
> > +             break;
> >       case RES_FAILCNT:
> >               counter->failcnt = 0;
> >               break;
> > +     case RES_RESERVATION_FAILCNT:
> > +             reserved_counter->failcnt = 0;
> > +             break;
> >       default:
> >               ret = -EINVAL;
> >               break;
> > @@ -357,7 +399,7 @@ static void __init __hugetlb_cgroup_file_init(int idx)
> >       struct hstate *h = &hstates[idx];
> >
> >       /* format the size */
> > -     mem_fmt(buf, 32, huge_page_size(h));
> > +     mem_fmt(buf, sizeof(buf), huge_page_size(h));
> >
> >       /* Add the limit file */
> >       cft = &h->cgroup_files[0];
> > @@ -366,28 +408,58 @@ static void __init __hugetlb_cgroup_file_init(int idx)
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >       cft->write = hugetlb_cgroup_write;
> >
> > -     /* Add the usage file */
> > +     /* Add the reservation limit file */
> >       cft = &h->cgroup_files[1];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
> > +              buf);
> > +     cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT);
> > +     cft->read_u64 = hugetlb_cgroup_read_u64;
> > +     cft->write = hugetlb_cgroup_write;
> > +
> > +     /* Add the usage file */
> > +     cft = &h->cgroup_files[2];
> >       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
> >       cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >
> > +     /* Add the reservation usage file */
> > +     cft = &h->cgroup_files[3];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
> > +                     buf);
> > +     cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE);
> > +     cft->read_u64 = hugetlb_cgroup_read_u64;
> > +
> >       /* Add the MAX usage file */
> > -     cft = &h->cgroup_files[2];
> > +     cft = &h->cgroup_files[4];
> >       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> >       cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> >       cft->write = hugetlb_cgroup_reset;
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >
> > +     /* Add the MAX reservation usage file */
> > +     cft = &h->cgroup_files[5];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME,
> > +                     "%s.reservation_max_usage_in_bytes", buf);
> > +     cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_MAX_USAGE);
> > +     cft->write = hugetlb_cgroup_reset;
> > +     cft->read_u64 = hugetlb_cgroup_read_u64;
> > +
> >       /* Add the failcntfile */
> > -     cft = &h->cgroup_files[3];
> > +     cft = &h->cgroup_files[6];
> >       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> >       cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> >       cft->write = hugetlb_cgroup_reset;
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >
> > +     /* Add the reservation failcntfile */
> > +     cft = &h->cgroup_files[7];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
> > +     cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> > +     cft->write = hugetlb_cgroup_reset;
> > +     cft->read_u64 = hugetlb_cgroup_read_u64;
> > +
> >       /* NULL terminate the last cft */
> > -     cft = &h->cgroup_files[4];
> > +     cft = &h->cgroup_files[8];
> >       memset(cft, 0, sizeof(*cft));
>
> Replace numbers with characters.
> >
> >       WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
> > --
> > 2.23.0.rc1.153.gdeed80330f-goog
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings
  2019-08-15 23:04       ` Mina Almasry
@ 2019-08-16 16:28         ` Mike Kravetz
  2019-08-16 18:06           ` Mina Almasry
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2019-08-16 16:28 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest

On 8/15/19 4:04 PM, Mina Almasry wrote:
> On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 8/13/19 4:54 PM, Mike Kravetz wrote:
>>> On 8/8/19 4:13 PM, Mina Almasry wrote:
>>>> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
>>>> in the resv_map entries, in file_region->reservation_counter.
>>>>
>>>> When a file_region entry is added to the resv_map via region_add, we
>>>> also charge the appropriate hugetlb_cgroup and put the pointer to that
>>>> in file_region->reservation_counter. This is slightly delicate since we
>>>> need to not modify the resv_map until we know that charging the
>>>> reservation has succeeded. If charging doesn't succeed, we report the
>>>> error to the caller, so that the kernel fails the reservation.
>>>
>>> I wish we did not need to modify these region_() routines as they are
>>> already difficult to understand.  However, I see no other way with the
>>> desired semantics.
>>>
>>
>> I suspect you have considered this, but what about using the return value
>> from region_chg() in hugetlb_reserve_pages() to charge reservation limits?
>> There is a VERY SMALL race where the value could be too large, but that
>> can be checked and adjusted at region_add time as is done with normal
>> accounting today.
> 
> I have not actually until now; I didn't consider doing stuff with the
> resv_map while not holding onto the resv_map->lock. I guess that's the
> small race you're talking about. Seems fine to me, but I'm more
> worried about hanging off the vma below.

This race is already handled for other 'reservation like' things in
hugetlb_reserve_pages.  So, I don't think the race is much of an issue.

>> If the question is, where would we store the information
>> to uncharge?, then we can hang a structure off the vma.  This would be
>> similar to what is done for private mappings.  In fact, I would suggest
>> making them both use a new cgroup reserve structure hanging off the vma.
>>
> 
> I actually did consider hanging off the info to uncharge off the vma,
> but I didn't for a couple of reasons:
> 
> 1. region_del is called from hugetlb_unreserve_pages, and I don't have
> access to the vma there. Maybe there is a way to query the proper vma
> I don't know about?

I am still thinking about closely tying cgroup revervation limits/usage
to existing reservation accounting.  Of most concern (to me) is handling
shared mappings.  Reservations created for shared mappings are more
associated with the inode/file than individual mappings.  For example,
consider a task which mmaps(MAP_SHARED) a hugetlbfs file.  At mmap time
reservations are created based on the size of the mmap.  Now, if the task
unmaps and/or exits the reservations still exist as they are associated
with the file rather than the mapping.

Honesty, I think this existing reservation bevahior is wrong or at least
not desirable.  Because there are outstanding reservations, the number of
reserved huge pages can not be used for other purposes.  It is also very
difficult for a user or admin to determine the source of the reservations.
No one is currently complaining about this behavior.  This proposal just
made me think about it.

Tying cgroup reservation limits/usage to existing reservation accounting
will introduce the same issues there.  We will need to clearly document the
behavior.

> 2. hugetlb_reserve_pages seems to be able to conduct a reservation
> with a NULL *vma. Not sure what to do in that case.
> 
> Is there a way to get around these that I'm missing here?

You are correct.  The !vma case is there for System V shared memory such
as a call to shmget(SHM_HUGETLB).  In this case, reservations for the
entire shared emmory segment are taken at shmget time.

In your model, the caller of shmget who creates the shared memory segment
would get charged for all the reservations.  Users, (those calling shmat)
would not be charged.

> FWIW I think tracking is better in resv_map since the reservations are
> in resv_map themselves. If I do another structure, then for each
> reservation there will be an entry in resv_map and an entry in the new
> structure and they need to be kept in sync and I need to handle errors
> for when they get out of sync.

I think you may be correct.  However, this implies that we will need to
change the way we do reservation in the resv_map for shared mappings.
I will comment on that in reply to patch 4.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings
  2019-08-15 23:08     ` Mina Almasry
@ 2019-08-16 16:33       ` Mike Kravetz
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2019-08-16 16:33 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest

On 8/15/19 4:08 PM, Mina Almasry wrote:
> On Tue, Aug 13, 2019 at 4:54 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>  mm/hugetlb.c | 208 +++++++++++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 170 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 235996aef6618..d76e3137110ab 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -242,8 +242,72 @@ struct file_region {
>>>       struct list_head link;
>>>       long from;
>>>       long to;
>>> +#ifdef CONFIG_CGROUP_HUGETLB
>>> +     /*
>>> +      * On shared mappings, each reserved region appears as a struct
>>> +      * file_region in resv_map. These fields hold the info needed to
>>> +      * uncharge each reservation.
>>> +      */
>>> +     struct page_counter *reservation_counter;
>>> +     unsigned long pages_per_hpage;
>>> +#endif
>>>  };
>>>
>>> +/* Must be called with resv->lock held. Calling this with dry_run == true will
>>> + * count the number of pages added but will not modify the linked list.
>>> + */
>>> +static long consume_regions_we_overlap_with(struct file_region *rg,
>>> +             struct list_head *head, long f, long *t,
>>> +             struct hugetlb_cgroup *h_cg,
>>> +             struct hstate *h,
>>> +             bool dry_run)
>>> +{
>>> +     long add = 0;
>>> +     struct file_region *trg = NULL, *nrg = NULL;
>>> +
>>> +     /* Consume any regions we now overlap with. */
>>> +     nrg = rg;
>>> +     list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
>>> +             if (&rg->link == head)
>>> +                     break;
>>> +             if (rg->from > *t)
>>> +                     break;
>>> +
>>> +             /* If this area reaches higher then extend our area to
>>> +              * include it completely.  If this is not the first area
>>> +              * which we intend to reuse, free it.
>>> +              */
>>> +             if (rg->to > *t)
>>> +                     *t = rg->to;
>>> +             if (rg != nrg) {
>>> +                     /* Decrement return value by the deleted range.
>>> +                      * Another range will span this area so that by
>>> +                      * end of routine add will be >= zero
>>> +                      */
>>> +                     add -= (rg->to - rg->from);
>>> +                     if (!dry_run) {
>>> +                             list_del(&rg->link);
>>> +                             kfree(rg);
>>
>> Is it possible that the region struct we are deleting pointed to
>> a reservation_counter?  Perhaps even for another cgroup?
>> Just concerned with the way regions are coalesced that we may be
>> deleting counters.
>>
> 
> Yep, that needs to be handled I think. Thanks for catching!
> 

I believe that we will no longer be able to coalesce reserv_map entries
for shared mappings.  That is because we need to record who is responsible
for creating reservation entries.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings
  2019-08-16 16:28         ` Mike Kravetz
@ 2019-08-16 18:06           ` Mina Almasry
  0 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-08-16 18:06 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, akpm,
	khalid.aziz, open list, linux-mm, linux-kselftest

On Fri, Aug 16, 2019 at 9:29 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/15/19 4:04 PM, Mina Almasry wrote:
> > On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> On 8/13/19 4:54 PM, Mike Kravetz wrote:
> >>> On 8/8/19 4:13 PM, Mina Almasry wrote:
> >>>> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
> >>>> in the resv_map entries, in file_region->reservation_counter.
> >>>>
> >>>> When a file_region entry is added to the resv_map via region_add, we
> >>>> also charge the appropriate hugetlb_cgroup and put the pointer to that
> >>>> in file_region->reservation_counter. This is slightly delicate since we
> >>>> need to not modify the resv_map until we know that charging the
> >>>> reservation has succeeded. If charging doesn't succeed, we report the
> >>>> error to the caller, so that the kernel fails the reservation.
> >>>
> >>> I wish we did not need to modify these region_() routines as they are
> >>> already difficult to understand.  However, I see no other way with the
> >>> desired semantics.
> >>>
> >>
> >> I suspect you have considered this, but what about using the return value
> >> from region_chg() in hugetlb_reserve_pages() to charge reservation limits?
> >> There is a VERY SMALL race where the value could be too large, but that
> >> can be checked and adjusted at region_add time as is done with normal
> >> accounting today.
> >
> > I have not actually until now; I didn't consider doing stuff with the
> > resv_map while not holding onto the resv_map->lock. I guess that's the
> > small race you're talking about. Seems fine to me, but I'm more
> > worried about hanging off the vma below.
>
> This race is already handled for other 'reservation like' things in
> hugetlb_reserve_pages.  So, I don't think the race is much of an issue.
>
> >> If the question is, where would we store the information
> >> to uncharge?, then we can hang a structure off the vma.  This would be
> >> similar to what is done for private mappings.  In fact, I would suggest
> >> making them both use a new cgroup reserve structure hanging off the vma.
> >>
> >
> > I actually did consider hanging off the info to uncharge off the vma,
> > but I didn't for a couple of reasons:
> >
> > 1. region_del is called from hugetlb_unreserve_pages, and I don't have
> > access to the vma there. Maybe there is a way to query the proper vma
> > I don't know about?
>
> I am still thinking about closely tying cgroup revervation limits/usage
> to existing reservation accounting.  Of most concern (to me) is handling
> shared mappings.  Reservations created for shared mappings are more
> associated with the inode/file than individual mappings.  For example,
> consider a task which mmaps(MAP_SHARED) a hugetlbfs file.  At mmap time
> reservations are created based on the size of the mmap.  Now, if the task
> unmaps and/or exits the reservations still exist as they are associated
> with the file rather than the mapping.
>

I'm aware of this behavior, and IMO it seems fine to me. I believe it
works the same way with tmfs today. I think a task that creates a file
in tmpfs gets charged the memory, and even if the task exits the
memory is still charged to its cgroup, and the memory remains charged
until the tmpfs file is deleted by someone.

Makes sense to me for hugetlb reservations to work the same way. The
memory remains charged until the hugetlbfs file gets deleted. But, if
you think of improvement, I'm happy to oblige :)

> Honesty, I think this existing reservation bevahior is wrong or at least
> not desirable.  Because there are outstanding reservations, the number of
> reserved huge pages can not be used for other purposes.  It is also very
> difficult for a user or admin to determine the source of the reservations.
> No one is currently complaining about this behavior.  This proposal just
> made me think about it.
>
> Tying cgroup reservation limits/usage to existing reservation accounting
> will introduce the same issues there.  We will need to clearly document the
> behavior.
>

Yes, seems we're maybe converging on a solution here, so the next
patchset will include docs for your review.

> > 2. hugetlb_reserve_pages seems to be able to conduct a reservation
> > with a NULL *vma. Not sure what to do in that case.
> >
> > Is there a way to get around these that I'm missing here?
>
> You are correct.  The !vma case is there for System V shared memory such
> as a call to shmget(SHM_HUGETLB).  In this case, reservations for the
> entire shared emmory segment are taken at shmget time.
>
> In your model, the caller of shmget who creates the shared memory segment
> would get charged for all the reservations.  Users, (those calling shmat)
> would not be charged.
>
> > FWIW I think tracking is better in resv_map since the reservations are
> > in resv_map themselves. If I do another structure, then for each
> > reservation there will be an entry in resv_map and an entry in the new
> > structure and they need to be kept in sync and I need to handle errors
> > for when they get out of sync.
>
> I think you may be correct.  However, this implies that we will need to
> change the way we do reservation in the resv_map for shared mappings.
> I will comment on that in reply to patch 4.
>
> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, back to index

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-08 23:13 [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
2019-08-08 23:13 ` [RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
2019-08-08 23:13 ` [RFC PATCH v2 2/5] hugetlb_cgroup: Add interface for charge/uncharge Mina Almasry
2019-08-08 23:13 ` [RFC PATCH v2 3/5] hugetlb_cgroup: Add reservation accounting for private mappings Mina Almasry
2019-08-08 23:13 ` [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings Mina Almasry
2019-08-13 23:54   ` Mike Kravetz
2019-08-14 16:46     ` Mike Kravetz
2019-08-15 23:04       ` Mina Almasry
2019-08-16 16:28         ` Mike Kravetz
2019-08-16 18:06           ` Mina Almasry
2019-08-15 23:08     ` Mina Almasry
2019-08-16 16:33       ` Mike Kravetz
2019-08-08 23:13 ` [RFC PATCH v2 5/5] hugetlb_cgroup: Add hugetlb_cgroup reservation tests Mina Almasry
2019-08-09 17:54 ` [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mike Kravetz
2019-08-09 19:42   ` Mina Almasry
2019-08-10 18:58     ` Mike Kravetz
2019-08-10 22:01       ` Mina Almasry
2019-08-13 23:40         ` Mike Kravetz
     [not found] ` <20190815035352.14952-1-hdanton@sina.com>
2019-08-15 23:21   ` [RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry

Linux-kselftest Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-kselftest/0 linux-kselftest/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-kselftest linux-kselftest/ https://lore.kernel.org/linux-kselftest \
		linux-kselftest@vger.kernel.org linux-kselftest@archiver.kernel.org
	public-inbox-index linux-kselftest


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kselftest


AGPL code for this site: git clone https://public-inbox.org/ public-inbox