Linux-kselftest Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
@ 2019-09-10 23:31 Mina Almasry
  2019-09-10 23:31 ` [PATCH v4 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
                   ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny

Problem:
Currently tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate hugetlb memory only more than its
hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
but will SIGBUS the task when it attempts to fault the memory in.

We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction regarding this behavior. We'd like to improve this
behavior such that tasks violating the hugetlb_cgroup limits get an error on
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.

Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory.

- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes, the kernel will fail this
reservation.

This proposal is implemented in this patch, with tests to verify
functionality and show the usage.

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
   the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
   duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
   hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies
   the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
   reservation time rather than fault time. Adding a new page_counter seems
   better as userspace could, if it wants, choose to enforce different cgroups
   differently: one via limit_in_bytes, and another via
   reservation_limit_in_bytes. This could be very useful if you're
   transitioning how hugetlb memory is partitioned on your system one
   cgroup at a time, for example. Also, someone may find usage for both
   limit_in_bytes and reservation_limit_in_bytes concurrently, and this
   approach gives them the option to do so.

Caveats:
1. This support is implemented for cgroups-v1. I have not tried
   hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
   This is largely because we use cgroups-v1 for now. If required, I
   can add hugetlb_cgroup support to cgroups v2 in this patch or
   a follow up.
2. Most complicated bit of this patch I believe is: where to store the
   pointer to the hugetlb_cgroup to uncharge at unreservation time?
   Normally the cgroup pointers hang off the struct page. But, with
   hugetlb_cgroup reservations, one task can reserve a specific page and another
   task may fault it in (I believe), so storing the pointer in struct
   page is not appropriate. Proposed approach here is to store the pointer in
   the resv_map. See patch for details.

Testing:
- Added tests passing.
- libhugetlbfs tests mostly passing, but some tests have trouble with and
  without this patch series. Seems environment issue rather than code:
  - Overall results:
    ********** TEST SUMMARY
    *                      2M
    *                      32-bit 64-bit
    *     Total testcases:    84      0
    *             Skipped:     0      0
    *                PASS:    66      0
    *                FAIL:    14      0
    *    Killed by signal:     0      0
    *   Bad configuration:     4      0
    *       Expected FAIL:     0      0
    *     Unexpected PASS:     0      0
    *    Test not present:     0      0
    * Strange test result:     0      0
    **********
  - Failing tests:
    - elflink_rw_and_share_test("linkhuge_rw") segfaults with and without this
      patch series.
    - LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc (2M: 32):
      FAIL    Address is not hugepage
    - LD_PRELOAD=libhugetlbfs.so HUGETLB_RESTRICT_EXE=unknown:malloc
      HUGETLB_MORECORE=yes malloc (2M: 32):
      FAIL    Address is not hugepage
    - LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc_manysmall (2M: 32):
      FAIL    Address is not hugepage
    - GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
      HUGETLB_MORECORE=yes heapshrink (2M: 32):
      FAIL    Heap not on hugepages
    - GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
      libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32):
      FAIL    Heap not on hugepages
    - HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL    small_data is not hugepage
    - HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32):
      FAIL    small_data is not hugepage
    - alloc-instantiate-race shared (2M: 32):
      Bad configuration: sched_setaffinity(cpu1): Invalid argument -
      FAIL    Child 1 killed by signal Killed
    - shmoverride_linked (2M: 32):
      FAIL    shmget failed size 2097152 from line 176: Invalid argument
    - HUGETLB_SHM=yes shmoverride_linked (2M: 32):
      FAIL    shmget failed size 2097152 from line 176: Invalid argument
    - shmoverride_linked_static (2M: 32):
      FAIL shmget failed size 2097152 from line 176: Invalid argument
    - HUGETLB_SHM=yes shmoverride_linked_static (2M: 32):
      FAIL shmget failed size 2097152 from line 176: Invalid argument
    - LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32):
      FAIL shmget failed size 2097152 from line 176: Invalid argument
    - LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32):
      FAIL    shmget failed size 2097152 from line 176: Invalid argument

Signed-off-by: Mina Almasry <almasrymina@google.com>

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Changes in v4:
- Split up 'hugetlb_cgroup: add accounting for shared mappings' into 4 patches
  for better isolation and context on the individual changes:
  - hugetlb_cgroup: add accounting for shared mappings
  - hugetlb: disable region_add file_region coalescing
  - hugetlb: remove duplicated code
  - hugetlb: region_chg provides only cache entry
- Fixed resv->adds_in_progress accounting.
- Retained behavior that region_add never fails, in earlier patchsets region_add
  could return failure.
- Fixed libhugetlbfs failure.
- Minor fix to the added tests that was preventing them from running on some
  environments.

Changes in v3:
- Addressed comments of Hillf Danton:
  - Added docs.
  - cgroup_files now uses enum.
  - Various readability improvements.
- Addressed comments of Mike Kravetz.
  - region_* functions no longer coalesce file_region entries in the resv_map.
  - region_add() and region_chg() refactored to make them much easier to
    understand and remove duplicated code so this patch doesn't add too much
    complexity.
  - Refactored common functionality into helpers.

Changes in v2:
- Split the patch into a 5 patch series.
- Fixed patch subject.

Mina Almasry (9):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb: region_chg provides only cache entry
  hugetlb: remove duplicated code
  hugetlb: disable region_add file_region coalescing
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  hugetlb_cgroup: Add hugetlb_cgroup reservation docs

 .../admin-guide/cgroup-v1/hugetlb.rst         |  84 ++-
 include/linux/hugetlb.h                       |  24 +-
 include/linux/hugetlb_cgroup.h                |  24 +-
 mm/hugetlb.c                                  | 516 +++++++++++-------
 mm/hugetlb_cgroup.c                           | 189 +++++--
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 440 +++++++++++++++
 .../selftests/vm/write_hugetlb_memory.sh      |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 +++++++++
 10 files changed, 1304 insertions(+), 252 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-16 23:43   ` shuah
  2019-09-10 23:31 ` [PATCH v4 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations Mina Almasry
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny, Hillf Danton

These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Hillf Danton <hdanton@sina.com>
---
 include/linux/hugetlb.h |  16 +++++-
 mm/hugetlb_cgroup.c     | 111 ++++++++++++++++++++++++++++++----------
 2 files changed, 100 insertions(+), 27 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edfca42783192..128ff1aff1c93 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,20 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

 #ifdef CONFIG_HUGETLB_PAGE

+enum {
+	HUGETLB_RES_USAGE,
+	HUGETLB_RES_RESERVATION_USAGE,
+	HUGETLB_RES_LIMIT,
+	HUGETLB_RES_RESERVATION_LIMIT,
+	HUGETLB_RES_MAX_USAGE,
+	HUGETLB_RES_RESERVATION_MAX_USAGE,
+	HUGETLB_RES_FAILCNT,
+	HUGETLB_RES_RESERVATION_FAILCNT,
+	HUGETLB_RES_NULL,
+	HUGETLB_RES_MAX,
+};
+
+
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
@@ -340,7 +354,7 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files[5];
+	struct cftype cgroup_files[HUGETLB_RES_MAX];
 #endif
 	char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 68c2f2f3c05b7..51a72624bd1ff 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
 	 * the counter to account for hugepages from hugetlb.
 	 */
 	struct page_counter hugepage[HUGE_MAX_HSTATE];
+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
 };

 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -33,6 +37,15 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline
+struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx,
+				 bool reserved)
+{
+	if (reserved)
+		return  &h_cg->reserved_hugepage[idx];
+	return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -254,30 +267,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
 	return;
 }

-enum {
-	RES_USAGE,
-	RES_LIMIT,
-	RES_MAX_USAGE,
-	RES_FAILCNT,
-};
-
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
 	struct page_counter *counter;
+	struct page_counter *reserved_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

 	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+	reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

 	switch (MEMFILE_ATTR(cft->private)) {
-	case RES_USAGE:
+	case HUGETLB_RES_USAGE:
 		return (u64)page_counter_read(counter) * PAGE_SIZE;
-	case RES_LIMIT:
+	case HUGETLB_RES_RESERVATION_USAGE:
+		return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+	case HUGETLB_RES_LIMIT:
 		return (u64)counter->max * PAGE_SIZE;
-	case RES_MAX_USAGE:
+	case HUGETLB_RES_RESERVATION_LIMIT:
+		return (u64)reserved_counter->max * PAGE_SIZE;
+	case HUGETLB_RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
-	case RES_FAILCNT:
+	case HUGETLB_RES_RESERVATION_MAX_USAGE:
+		return (u64)reserved_counter->watermark * PAGE_SIZE;
+	case HUGETLB_RES_FAILCNT:
 		return counter->failcnt;
+	case HUGETLB_RES_RESERVATION_FAILCNT:
+		return reserved_counter->failcnt;
 	default:
 		BUG();
 	}
@@ -291,6 +307,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	int ret, idx;
 	unsigned long nr_pages;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+	bool reserved = false;

 	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
 		return -EINVAL;
@@ -304,9 +321,13 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
-	case RES_LIMIT:
+	case HUGETLB_RES_RESERVATION_LIMIT:
+		reserved = true;
+		/* Fall through. */
+	case HUGETLB_RES_LIMIT:
 		mutex_lock(&hugetlb_limit_mutex);
-		ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+		ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx, reserved),
+					   nr_pages);
 		mutex_unlock(&hugetlb_limit_mutex);
 		break;
 	default:
@@ -320,18 +341,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
 	int ret = 0;
-	struct page_counter *counter;
+	struct page_counter *counter, *reserved_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

 	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+	reserved_counter = &h_cg->reserved_hugepage[
+		MEMFILE_IDX(of_cft(of)->private)];

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
-	case RES_MAX_USAGE:
+	case HUGETLB_RES_MAX_USAGE:
 		page_counter_reset_watermark(counter);
 		break;
-	case RES_FAILCNT:
+	case HUGETLB_RES_RESERVATION_MAX_USAGE:
+		page_counter_reset_watermark(reserved_counter);
+		break;
+	case HUGETLB_RES_FAILCNT:
 		counter->failcnt = 0;
 		break;
+	case HUGETLB_RES_RESERVATION_FAILCNT:
+		reserved_counter->failcnt = 0;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -357,37 +386,67 @@ static void __init __hugetlb_cgroup_file_init(int idx)
 	struct hstate *h = &hstates[idx];

 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));

 	/* Add the limit file */
-	cft = &h->cgroup_files[0];
+	cft = &h->cgroup_files[HUGETLB_RES_LIMIT];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
-	cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_LIMIT);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+	cft->write = hugetlb_cgroup_write;
+
+	/* Add the reservation limit file */
+	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_LIMIT];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
+		 buf);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_LIMIT);
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 	cft->write = hugetlb_cgroup_write;

 	/* Add the usage file */
-	cft = &h->cgroup_files[1];
+	cft = &h->cgroup_files[HUGETLB_RES_USAGE];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
-	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_USAGE);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the reservation usage file */
+	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_USAGE];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
+			buf);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_USAGE);
 	cft->read_u64 = hugetlb_cgroup_read_u64;

 	/* Add the MAX usage file */
-	cft = &h->cgroup_files[2];
+	cft = &h->cgroup_files[HUGETLB_RES_MAX_USAGE];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
-	cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_MAX_USAGE);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the MAX reservation usage file */
+	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_MAX_USAGE];
+	snprintf(cft->name, MAX_CFTYPE_NAME,
+			"%s.reservation_max_usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_MAX_USAGE);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

 	/* Add the failcntfile */
-	cft = &h->cgroup_files[3];
+	cft = &h->cgroup_files[HUGETLB_RES_FAILCNT];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
-	cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->private  = MEMFILE_PRIVATE(idx, HUGETLB_RES_FAILCNT);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the reservation failcntfile */
+	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_FAILCNT];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
+	cft->private  = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_FAILCNT);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files[4];
+	cft = &h->cgroup_files[HUGETLB_RES_NULL];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
  2019-09-10 23:31 ` [PATCH v4 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-17  1:29   ` shuah
  2019-09-10 23:31 ` [PATCH v4 3/9] hugetlb_cgroup: add reservation accounting for private mappings Mina Almasry
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny

Augements hugetlb_cgroup_charge_cgroup to be able to charge hugetlb
usage or hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/linux/hugetlb_cgroup.h | 13 ++++--
 mm/hugetlb.c                   |  6 ++-
 mm/hugetlb_cgroup.c            | 82 +++++++++++++++++++++++++++-------
 3 files changed, 80 insertions(+), 21 deletions(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 063962f6dfc6a..c467715dd8fb8 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -52,14 +52,19 @@ static inline bool hugetlb_cgroup_disabled(void)
 }

 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-					struct hugetlb_cgroup **ptr);
+					struct hugetlb_cgroup **ptr,
+					bool reserved);
 extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 					 struct hugetlb_cgroup *h_cg,
 					 struct page *page);
 extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 					 struct page *page);
 extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-					   struct hugetlb_cgroup *h_cg);
+					   struct hugetlb_cgroup *h_cg,
+					   bool reserved);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+					    unsigned long nr_pages);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
 				   struct page *newhpage);
@@ -83,7 +88,7 @@ static inline bool hugetlb_cgroup_disabled(void)

 static inline int
 hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-			     struct hugetlb_cgroup **ptr)
+			     struct hugetlb_cgroup **ptr, bool reserved)
 {
 	return 0;
 }
@@ -102,7 +107,7 @@ hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)

 static inline void
 hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-			       struct hugetlb_cgroup *h_cg)
+			       struct hugetlb_cgroup *h_cg, bool reserved)
 {
 }

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d7296dd11b83..e975f55aede94 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2078,7 +2078,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 			gbl_chg = 1;
 	}

-	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
+	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
+					   false);
 	if (ret)
 		goto out_subpool_put;

@@ -2126,7 +2127,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	return page;

 out_uncharge_cgroup:
-	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
+	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
+			false);
 out_subpool_put:
 	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 51a72624bd1ff..2ab36a98d834e 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -38,8 +38,8 @@ struct hugetlb_cgroup {
 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

 static inline
-struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx,
-				 bool reserved)
+struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg,
+						int idx, bool reserved)
 {
 	if (reserved)
 		return  &h_cg->reserved_hugepage[idx];
@@ -74,8 +74,12 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 	int idx;

 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-		if (page_counter_read(&h_cg->hugepage[idx]))
+		if (page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx,
+						true)) ||
+		    page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx,
+				    false))) {
 			return true;
+		}
 	}
 	return false;
 }
@@ -86,18 +90,30 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
 	int idx;

 	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
-		struct page_counter *counter = &h_cgroup->hugepage[idx];
 		struct page_counter *parent = NULL;
+		struct page_counter *reserved_parent = NULL;
 		unsigned long limit;
 		int ret;

-		if (parent_h_cgroup)
-			parent = &parent_h_cgroup->hugepage[idx];
-		page_counter_init(counter, parent);
+		if (parent_h_cgroup) {
+			parent = hugetlb_cgroup_get_counter(
+					parent_h_cgroup, idx, false);
+			reserved_parent = hugetlb_cgroup_get_counter(
+					parent_h_cgroup, idx, true);
+		}
+		page_counter_init(hugetlb_cgroup_get_counter(
+					h_cgroup, idx, false), parent);
+		page_counter_init(hugetlb_cgroup_get_counter(
+					h_cgroup, idx, true),
+				  reserved_parent);

 		limit = round_down(PAGE_COUNTER_MAX,
 				   1 << huge_page_order(&hstates[idx]));
-		ret = page_counter_set_max(counter, limit);
+
+		ret = page_counter_set_max(hugetlb_cgroup_get_counter(
+					h_cgroup, idx, false), limit);
+		ret = page_counter_set_max(hugetlb_cgroup_get_counter(
+					h_cgroup, idx, true), limit);
 		VM_BUG_ON(ret);
 	}
 }
@@ -127,6 +143,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
 	kfree(h_cgroup);
 }

+static void hugetlb_cgroup_move_parent_reservation(int idx,
+						   struct hugetlb_cgroup *h_cg)
+{
+	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
+
+	/* Move the reservation counters. */
+	if (!parent_hugetlb_cgroup(h_cg)) {
+		parent = root_h_cgroup;
+		/* root has no limit */
+		page_counter_charge(
+				&root_h_cgroup->reserved_hugepage[idx],
+				page_counter_read(hugetlb_cgroup_get_counter(
+						h_cg, idx, true)));
+	}
+
+	/* Take the pages off the local counter */
+	page_counter_cancel(hugetlb_cgroup_get_counter(h_cg, idx, true),
+			    page_counter_read(hugetlb_cgroup_get_counter(h_cg,
+					    idx, true)));
+}

 /*
  * Should be called with hugetlb_lock held.
@@ -181,6 +217,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	do {
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
+			hugetlb_cgroup_move_parent_reservation(idx, h_cg);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
 				hugetlb_cgroup_move_parent(idx, h_cg, page);

@@ -192,7 +229,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 }

 int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-				 struct hugetlb_cgroup **ptr)
+				 struct hugetlb_cgroup **ptr, bool reserved)
 {
 	int ret = 0;
 	struct page_counter *counter;
@@ -215,8 +252,11 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
 	}
 	rcu_read_unlock();

-	if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
+	if (!page_counter_try_charge(hugetlb_cgroup_get_counter(h_cg, idx,
+								reserved),
+				     nr_pages, &counter)) {
 		ret = -ENOMEM;
+	}
 	css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
@@ -250,12 +290,14 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 	if (unlikely(!h_cg))
 		return;
 	set_hugetlb_cgroup(page, NULL);
-	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
+	page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, false),
+			nr_pages);
+
 	return;
 }

 void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-				    struct hugetlb_cgroup *h_cg)
+				    struct hugetlb_cgroup *h_cg, bool reserved)
 {
 	if (hugetlb_cgroup_disabled() || !h_cg)
 		return;
@@ -263,8 +305,17 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
 	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
 		return;

-	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
-	return;
+	page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved),
+			nr_pages);
+}
+
+void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+				     unsigned long nr_pages)
+{
+	if (hugetlb_cgroup_disabled() || !p)
+		return;
+
+	page_counter_uncharge(p, nr_pages);
 }

 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
@@ -326,7 +377,8 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 		/* Fall through. */
 	case HUGETLB_RES_LIMIT:
 		mutex_lock(&hugetlb_limit_mutex);
-		ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx, reserved),
+		ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx,
+								      reserved),
 					   nr_pages);
 		mutex_unlock(&hugetlb_limit_mutex);
 		break;
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 3/9] hugetlb_cgroup: add reservation accounting for private mappings
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
  2019-09-10 23:31 ` [PATCH v4 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
  2019-09-10 23:31 ` [PATCH v4 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-10 23:31 ` [PATCH v4 4/9] hugetlb: region_chg provides only cache entry Mina Almasry
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny

Normally the pointer to the cgroup to uncharge hangs off the struct
page, and gets queried when it's time to free the page. With
hugetlb_cgroup reservations, this is not possible. Because it's possible
for a page to be reserved by one task and actually faulted in by another
task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code patch to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in private mapping, the tasks
hugetlb_cgroup is charged and the hugetlb_cgroup is placed is
resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/linux/hugetlb.h        |  8 ++++++
 include/linux/hugetlb_cgroup.h | 11 ++++++++
 mm/hugetlb.c                   | 47 ++++++++++++++++++++++++++++++++--
 mm/hugetlb_cgroup.c            | 12 ---------
 4 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 128ff1aff1c93..536cb144cf484 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
 	long adds_in_progress;
 	struct list_head region_cache;
 	long region_cache_count;
+ #ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * On private mappings, the counter to uncharge reservations is stored
+	 * here. If these fields are 0, then the mapping is shared.
+	 */
+	struct page_counter *reservation_counter;
+	unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index c467715dd8fb8..8c6ea58c63c89 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -25,6 +25,17 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER	2

 #ifdef CONFIG_CGROUP_HUGETLB
+struct hugetlb_cgroup {
+	struct cgroup_subsys_state css;
+	/*
+	 * the counter to account for hugepages from hugetlb.
+	 */
+	struct page_counter hugepage[HUGE_MAX_HSTATE];
+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+};

 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e975f55aede94..fbd7c52e17348 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -711,6 +711,16 @@ struct resv_map *resv_map_alloc(void)
 	INIT_LIST_HEAD(&resv_map->regions);

 	resv_map->adds_in_progress = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * Initialize these to 0. On shared mappings, 0's here indicate these
+	 * fields don't do cgroup accounting. On private mappings, these will be
+	 * re-initialized to the proper values, to indicate that hugetlb cgroup
+	 * reservations are to be un-charged from here.
+	 */
+	resv_map->reservation_counter = NULL;
+	resv_map->pages_per_hpage = 0;
+#endif

 	INIT_LIST_HEAD(&resv_map->region_cache);
 	list_add(&rg->link, &resv_map->region_cache);
@@ -3193,7 +3203,19 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)

 	reserve = (end - start) - region_count(resv, start, end);

-	kref_put(&resv->refs, resv_map_release);
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * Since we check for HPAGE_RESV_OWNER above, this must a private
+	 * mapping, and these values should be none-zero, and should point to
+	 * the hugetlb_cgroup counter to uncharge for this reservation.
+	 */
+	WARN_ON(!resv->reservation_counter);
+	WARN_ON(!resv->pages_per_hpage);
+
+	hugetlb_cgroup_uncharge_counter(
+			resv->reservation_counter,
+			(end - start) * resv->pages_per_hpage);
+#endif

 	if (reserve) {
 		/*
@@ -3203,6 +3225,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
 		hugetlb_acct_memory(h, -gbl_reserve);
 	}
+
+	kref_put(&resv->refs, resv_map_release);
 }

 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4536,6 +4560,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
+	struct hugetlb_cgroup *h_cg;
 	long gbl_reserve;

 	/* This should never happen */
@@ -4569,11 +4594,29 @@ int hugetlb_reserve_pages(struct inode *inode,
 		chg = region_chg(resv_map, from, to);

 	} else {
+		/* Private mapping. */
+		chg = to - from;
+
+		if (hugetlb_cgroup_charge_cgroup(
+					hstate_index(h),
+					chg * pages_per_huge_page(h),
+					&h_cg, true)) {
+			return -ENOMEM;
+		}
+
 		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;

-		chg = to - from;
+#ifdef CONFIG_CGROUP_HUGETLB
+		/*
+		 * Since this branch handles private mappings, we attach the
+		 * counter to uncharge for this reservation off resv_map.
+		 */
+		resv_map->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		resv_map->pages_per_hpage = pages_per_huge_page(h);
+#endif

 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 2ab36a98d834e..66ffd32063722 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -19,18 +19,6 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>

-struct hugetlb_cgroup {
-	struct cgroup_subsys_state css;
-	/*
-	 * the counter to account for hugepages from hugetlb.
-	 */
-	struct page_counter hugepage[HUGE_MAX_HSTATE];
-	/*
-	 * the counter to account for hugepage reservations from hugetlb.
-	 */
-	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
-};
-
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 4/9] hugetlb: region_chg provides only cache entry
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (2 preceding siblings ...)
  2019-09-10 23:31 ` [PATCH v4 3/9] hugetlb_cgroup: add reservation accounting for private mappings Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-16 22:17   ` Mike Kravetz
  2019-09-10 23:31 ` [PATCH v4 5/9] hugetlb: remove duplicated code Mina Almasry
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny

Current behavior is that region_chg provides both a cache entry in
resv->region_cache, AND a placeholder entry in resv->regions. region_add
first tries to use the placeholder, and if it finds that the placeholder
has been deleted by a racing region_del call, it uses the cache entry.

This behavior is completely unnecessary and is removed in this patch for
a couple of reasons:

1. region_add needs to either find a cached file_region entry in
   resv->region_cache, or find an entry in resv->regions to expand. It
   does not need both.
2. region_chg adding a placeholder entry in resv->regions opens up
   a possible race with region_del, where region_chg adds a placeholder
   region in resv->regions, and this region is deleted by a racing call
   to region_del during region_chg execution or before region_add is
   called. Removing the race makes the code easier to reason about and
   maintain.

In addition, a follow up patch in this series disables region
coalescing, which would be further complicated if the race with
region_del exists.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 mm/hugetlb.c | 63 +++++++++-------------------------------------------
 1 file changed, 11 insertions(+), 52 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fbd7c52e17348..bea51ae422f63 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -246,14 +246,10 @@ struct file_region {

 /*
  * Add the huge page range represented by [f, t) to the reserve
- * map.  In the normal case, existing regions will be expanded
- * to accommodate the specified range.  Sufficient regions should
- * exist for expansion due to the previous call to region_chg
- * with the same range.  However, it is possible that region_del
- * could have been called after region_chg and modifed the map
- * in such a way that no region exists to be expanded.  In this
- * case, pull a region descriptor from the cache associated with
- * the map and use that for the new range.
+ * map.  Existing regions will be expanded to accommodate the specified
+ * range, or a region will be taken from the cache.  Sufficient regions
+ * must exist in the cache due to the previous call to region_chg with
+ * the same range.
  *
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
@@ -272,9 +268,8 @@ static long region_add(struct resv_map *resv, long f, long t)

 	/*
 	 * If no region exists which can be expanded to include the
-	 * specified range, the list must have been modified by an
-	 * interleving call to region_del().  Pull a region descriptor
-	 * from the cache and use it for this range.
+	 * specified range, pull a region descriptor from the cache
+	 * and use it for this range.
 	 */
 	if (&rg->link == head || t < rg->from) {
 		VM_BUG_ON(resv->region_cache_count <= 0);
@@ -339,15 +334,9 @@ static long region_add(struct resv_map *resv, long f, long t)
  * call to region_add that will actually modify the reserve
  * map to add the specified range [f, t).  region_chg does
  * not change the number of huge pages represented by the
- * map.  However, if the existing regions in the map can not
- * be expanded to represent the new range, a new file_region
- * structure is added to the map as a placeholder.  This is
- * so that the subsequent region_add call will have all the
- * regions it needs and will not fail.
- *
- * Upon entry, region_chg will also examine the cache of region descriptors
- * associated with the map.  If there are not enough descriptors cached, one
- * will be allocated for the in progress add operation.
+ * map.  A new file_region structure is added to the cache
+ * as a placeholder, so that the subsequent region_add
+ * call will have all the regions it needs and will not fail.
  *
  * Returns the number of huge pages that need to be added to the existing
  * reservation map for the range [f, t).  This number is greater or equal to
@@ -357,10 +346,9 @@ static long region_add(struct resv_map *resv, long f, long t)
 static long region_chg(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg = NULL;
+	struct file_region *rg;
 	long chg = 0;

-retry:
 	spin_lock(&resv->lock);
 retry_locked:
 	resv->adds_in_progress++;
@@ -378,10 +366,8 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		spin_unlock(&resv->lock);

 		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
-		if (!trg) {
-			kfree(nrg);
+		if (!trg)
 			return -ENOMEM;
-		}

 		spin_lock(&resv->lock);
 		list_add(&trg->link, &resv->region_cache);
@@ -394,28 +380,6 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		if (f <= rg->to)
 			break;

-	/* If we are below the current region then a new region is required.
-	 * Subtle, allocate a new region at the position but make it zero
-	 * size such that we can guarantee to record the reservation. */
-	if (&rg->link == head || t < rg->from) {
-		if (!nrg) {
-			resv->adds_in_progress--;
-			spin_unlock(&resv->lock);
-			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-			if (!nrg)
-				return -ENOMEM;
-
-			nrg->from = f;
-			nrg->to   = f;
-			INIT_LIST_HEAD(&nrg->link);
-			goto retry;
-		}
-
-		list_add(&nrg->link, rg->link.prev);
-		chg = t - f;
-		goto out_nrg;
-	}
-
 	/* Round our left edge to the current segment if it encloses us. */
 	if (f > rg->from)
 		f = rg->from;
@@ -439,11 +403,6 @@ static long region_chg(struct resv_map *resv, long f, long t)
 	}

 out:
-	spin_unlock(&resv->lock);
-	/*  We already know we raced and no longer need the new region */
-	kfree(nrg);
-	return chg;
-out_nrg:
 	spin_unlock(&resv->lock);
 	return chg;
 }
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 5/9] hugetlb: remove duplicated code
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (3 preceding siblings ...)
  2019-09-10 23:31 ` [PATCH v4 4/9] hugetlb: region_chg provides only cache entry Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-16 22:25   ` Mike Kravetz
  2019-09-10 23:31 ` [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing Mina Almasry
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny

Remove duplicated code between region_chg and region_add, and refactor it into
a common function, add_reservation_in_range. This is mostly done because
there is a follow up change in this series that disables region
coalescing in region_add, and I want to make that change in one place
only. It should improve maintainability anyway on its own.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 mm/hugetlb.c | 116 ++++++++++++++++++++++++---------------------------
 1 file changed, 54 insertions(+), 62 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bea51ae422f63..ce5ed1056fefd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -244,6 +244,57 @@ struct file_region {
 	long to;
 };

+static long add_reservation_in_range(
+		struct resv_map *resv, long f, long t, bool count_only)
+{
+
+	long chg = 0;
+	struct list_head *head = &resv->regions;
+	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+
+	chg = t - f;
+
+	/* Check for and consume any regions we now overlap with. */
+	nrg = rg;
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > t)
+			break;
+
+		/* We overlap with this area, if it extends further than
+		 * us then we must extend ourselves.  Account for its
+		 * existing reservation.
+		 */
+		if (rg->to > t) {
+			chg += rg->to - t;
+			t = rg->to;
+		}
+		chg -= rg->to - rg->from;
+
+		if (!count_only && rg != nrg) {
+			list_del(&rg->link);
+			kfree(rg);
+		}
+	}
+
+	if (!count_only) {
+		nrg->from = f;
+		nrg->to = t;
+	}
+
+	return chg;
+}
+
 /*
  * Add the huge page range represented by [f, t) to the reserve
  * map.  Existing regions will be expanded to accommodate the specified
@@ -257,7 +308,7 @@ struct file_region {
 static long region_add(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg, *trg;
+	struct file_region *rg, *nrg;
 	long add = 0;

 	spin_lock(&resv->lock);
@@ -287,38 +338,7 @@ static long region_add(struct resv_map *resv, long f, long t)
 		goto out_locked;
 	}

-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-
-	/* Check for and consume any regions we now overlap with. */
-	nrg = rg;
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			break;
-
-		/* If this area reaches higher then extend our area to
-		 * include it completely.  If this is not the first area
-		 * which we intend to reuse, free it. */
-		if (rg->to > t)
-			t = rg->to;
-		if (rg != nrg) {
-			/* Decrement return value by the deleted range.
-			 * Another range will span this area so that by
-			 * end of routine add will be >= zero
-			 */
-			add -= (rg->to - rg->from);
-			list_del(&rg->link);
-			kfree(rg);
-		}
-	}
-
-	add += (nrg->from - f);		/* Added to beginning of region */
-	nrg->from = f;
-	add += t - nrg->to;		/* Added to end of region */
-	nrg->to = t;
+	add = add_reservation_in_range(resv, f, t, false);

 out_locked:
 	resv->adds_in_progress--;
@@ -345,8 +365,6 @@ static long region_add(struct resv_map *resv, long f, long t)
  */
 static long region_chg(struct resv_map *resv, long f, long t)
 {
-	struct list_head *head = &resv->regions;
-	struct file_region *rg;
 	long chg = 0;

 	spin_lock(&resv->lock);
@@ -375,34 +393,8 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		goto retry_locked;
 	}

-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-	chg = t - f;
-
-	/* Check for and consume any regions we now overlap with. */
-	list_for_each_entry(rg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
-		if (rg->from > t)
-			goto out;
+	chg = add_reservation_in_range(resv, f, t, true);

-		/* We overlap with this area, if it extends further than
-		 * us then we must extend ourselves.  Account for its
-		 * existing reservation. */
-		if (rg->to > t) {
-			chg += rg->to - t;
-			t = rg->to;
-		}
-		chg -= rg->to - rg->from;
-	}
-
-out:
 	spin_unlock(&resv->lock);
 	return chg;
 }
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (4 preceding siblings ...)
  2019-09-10 23:31 ` [PATCH v4 5/9] hugetlb: remove duplicated code Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-16 23:57   ` Mike Kravetz
  2019-09-10 23:31 ` [PATCH v4 7/9] hugetlb_cgroup: add accounting for shared mappings Mina Almasry
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny

A follow up patch in this series adds hugetlb cgroup uncharge info the
file_region entries in resv->regions. The cgroup uncharge info may
differ for different regions, so they can no longer be coalesced at
region_add time. So, disable region coalescing in region_add in this
patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6].
New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
[5->6].

Special care needs to be taken to handle the resv->adds_in_progress
variable correctly. In the past, only 1 region would be added for every
region_chg and region_add call. But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in region_chg,
or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
region_chg calls add_reservation_in_range() to count the number of regions
needed and allocates those, and that info is passed to region_add and
region_abort to decrement adds_in_progress correctly.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 mm/hugetlb.c | 279 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 167 insertions(+), 112 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ce5ed1056fefd..5eca34d9b753d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -244,55 +244,80 @@ struct file_region {
 	long to;
 };

+/* Helper that removes a struct file_region from the resv_map cache and returns
+ * it for use.
+ */
+static struct file_region *get_file_region_entry_from_cache(
+		struct resv_map *resv, long from, long to);
+
 static long add_reservation_in_range(
-		struct resv_map *resv, long f, long t, bool count_only)
+		struct resv_map *resv,
+		long f, long t,
+		long *regions_needed,
+		bool count_only)
 {
-
-	long chg = 0;
+	long add = 0;
 	struct list_head *head = &resv->regions;
+	long last_accounted_offset = f;
 	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;

-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
+	if (regions_needed)
+		*regions_needed = 0;

-	chg = t - f;
+	/* In this loop, we essentially handle an entry for the range
+	 * last_accounted_offset -> rg->from, at every iteration, with some
+	 * bounds checking.
+	 */
+	list_for_each_entry_safe(rg, trg, head, link) {
+		/* Skip irrelevant regions that start before our range. */
+		if (rg->from < f) {
+			/* If this region ends after the last accounted offset,
+			 * then we need to update last_accounted_offset.
+			 */
+			if (rg->to > last_accounted_offset)
+				last_accounted_offset = rg->to;
+			continue;
+		}

-	/* Check for and consume any regions we now overlap with. */
-	nrg = rg;
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
+		/* When we find a region that starts beyond our range, we've
+		 * finished.
+		 */
 		if (rg->from > t)
 			break;

-		/* We overlap with this area, if it extends further than
-		 * us then we must extend ourselves.  Account for its
-		 * existing reservation.
+		/* Add an entry for last_accounted_offset -> rg->from, and
+		 * update last_accounted_offset.
 		 */
-		if (rg->to > t) {
-			chg += rg->to - t;
-			t = rg->to;
+		if (rg->from > last_accounted_offset) {
+			add += rg->from - last_accounted_offset;
+			if (!count_only) {
+				nrg = get_file_region_entry_from_cache(resv,
+						last_accounted_offset,
+						rg->from);
+				list_add(&nrg->link, rg->link.prev);
+			} else if (regions_needed)
+				*regions_needed += 1;
 		}
-		chg -= rg->to - rg->from;

-		if (!count_only && rg != nrg) {
-			list_del(&rg->link);
-			kfree(rg);
-		}
+		last_accounted_offset = rg->to;
 	}

-	if (!count_only) {
-		nrg->from = f;
-		nrg->to = t;
+	/* Handle the case where our range extends beyond
+	 * last_accounted_offset.
+	 */
+	if (last_accounted_offset < t) {
+		add += t - last_accounted_offset;
+		if (!count_only) {
+			nrg = get_file_region_entry_from_cache(resv,
+					last_accounted_offset, t);
+			list_add(&nrg->link, rg->link.prev);
+		} else if (regions_needed)
+			*regions_needed += 1;
+
+		last_accounted_offset = t;
 	}

-	return chg;
+	return add;
 }

 /*
@@ -302,46 +327,24 @@ static long add_reservation_in_range(
  * must exist in the cache due to the previous call to region_chg with
  * the same range.
  *
+ * regions_needed is the out value provided by a previous
+ * call to region_chg.
+ *
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
  */
-static long region_add(struct resv_map *resv, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t,
+		long regions_needed)
 {
-	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg;
 	long add = 0;

 	spin_lock(&resv->lock);
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;

-	/*
-	 * If no region exists which can be expanded to include the
-	 * specified range, pull a region descriptor from the cache
-	 * and use it for this range.
-	 */
-	if (&rg->link == head || t < rg->from) {
-		VM_BUG_ON(resv->region_cache_count <= 0);
-
-		resv->region_cache_count--;
-		nrg = list_first_entry(&resv->region_cache, struct file_region,
-					link);
-		list_del(&nrg->link);
-
-		nrg->from = f;
-		nrg->to = t;
-		list_add(&nrg->link, rg->link.prev);
-
-		add += t - f;
-		goto out_locked;
-	}
+	VM_BUG_ON(resv->region_cache_count < regions_needed);

-	add = add_reservation_in_range(resv, f, t, false);
+	add = add_reservation_in_range(resv, f, t, NULL, false);
+	resv->adds_in_progress -= regions_needed;

-out_locked:
-	resv->adds_in_progress--;
 	spin_unlock(&resv->lock);
 	VM_BUG_ON(add < 0);
 	return add;
@@ -358,44 +361,54 @@ static long region_add(struct resv_map *resv, long f, long t)
  * as a placeholder, so that the subsequent region_add
  * call will have all the regions it needs and will not fail.
  *
+ * out_regions_needed is the number of regions added to the
+ * resv->region_cache_count.  This value needs to be provided to a follow up
+ * call to region_add or region_abort for proper accounting.
+ *
  * Returns the number of huge pages that need to be added to the existing
  * reservation map for the range [f, t).  This number is greater or equal to
  * zero.  -ENOMEM is returned if a new file_region structure or cache entry
  * is needed and can not be allocated.
  */
-static long region_chg(struct resv_map *resv, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t,
+		long *out_regions_needed)
 {
-	long chg = 0;
+	struct file_region *trg = NULL;
+	long chg = 0, regions_needed = 0;

+retry:
 	spin_lock(&resv->lock);
-retry_locked:
-	resv->adds_in_progress++;
+
+	/* Count how many hugepages in this range are NOT respresented. */
+	chg = add_reservation_in_range(resv, f, t, &regions_needed, true);
+

 	/*
 	 * Check for sufficient descriptors in the cache to accommodate
 	 * the number of in progress add operations.
 	 */
-	if (resv->adds_in_progress > resv->region_cache_count) {
-		struct file_region *trg;
-
-		VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1);
+	if (resv->region_cache_count < regions_needed) {
 		/* Must drop lock to allocate a new descriptor. */
-		resv->adds_in_progress--;
 		spin_unlock(&resv->lock);

-		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
-		if (!trg)
-			return -ENOMEM;
+		while (resv->region_cache_count < regions_needed + 1) {
+			trg = kmalloc(sizeof(*trg), GFP_KERNEL);
+			if (!trg)
+				return -ENOMEM;

-		spin_lock(&resv->lock);
-		list_add(&trg->link, &resv->region_cache);
-		resv->region_cache_count++;
-		goto retry_locked;
+			spin_lock(&resv->lock);
+			list_add(&trg->link, &resv->region_cache);
+			resv->region_cache_count++;
+			spin_unlock(&resv->lock);
+		}
+		goto retry;
 	}

-	chg = add_reservation_in_range(resv, f, t, true);
+	resv->adds_in_progress += regions_needed;

 	spin_unlock(&resv->lock);
+	if (out_regions_needed)
+		*out_regions_needed = regions_needed;
 	return chg;
 }

@@ -404,17 +417,19 @@ static long region_chg(struct resv_map *resv, long f, long t)
  * of the resv_map keeps track of the operations in progress between
  * calls to region_chg and region_add.  Operations are sometimes
  * aborted after the call to region_chg.  In such cases, region_abort
- * is called to decrement the adds_in_progress counter.
+ * is called to decrement the adds_in_progress counter. regions_needed
+ * is the value returned by the region_chg call, it is used to decrement
+ * the adds_in_progress counter.
  *
  * NOTE: The range arguments [f, t) are not needed or used in this
  * routine.  They are kept to make reading the calling code easier as
  * arguments will match the associated region_chg call.
  */
-static void region_abort(struct resv_map *resv, long f, long t)
+static void region_abort(struct resv_map *resv, long f, long t,
+		long regions_needed)
 {
 	spin_lock(&resv->lock);
-	VM_BUG_ON(!resv->region_cache_count);
-	resv->adds_in_progress--;
+	resv->adds_in_progress -= regions_needed;
 	spin_unlock(&resv->lock);
 }

@@ -1865,7 +1880,9 @@ enum vma_resv_mode {
 };
 static long __vma_reservation_common(struct hstate *h,
 				struct vm_area_struct *vma, unsigned long addr,
-				enum vma_resv_mode mode)
+				enum vma_resv_mode mode,
+				long *out_regions_needed,
+				long in_regions_needed)
 {
 	struct resv_map *resv;
 	pgoff_t idx;
@@ -1878,20 +1895,24 @@ static long __vma_reservation_common(struct hstate *h,
 	idx = vma_hugecache_offset(h, vma, addr);
 	switch (mode) {
 	case VMA_NEEDS_RESV:
-		ret = region_chg(resv, idx, idx + 1);
+		VM_BUG_ON(!out_regions_needed);
+		ret = region_chg(resv, idx, idx + 1, out_regions_needed);
 		break;
 	case VMA_COMMIT_RESV:
-		ret = region_add(resv, idx, idx + 1);
+		VM_BUG_ON(in_regions_needed == -1);
+		ret = region_add(resv, idx, idx + 1, in_regions_needed);
 		break;
 	case VMA_END_RESV:
-		region_abort(resv, idx, idx + 1);
+		VM_BUG_ON(in_regions_needed == -1);
+		region_abort(resv, idx, idx + 1, in_regions_needed);
 		ret = 0;
 		break;
 	case VMA_ADD_RESV:
+		VM_BUG_ON(in_regions_needed == -1);
 		if (vma->vm_flags & VM_MAYSHARE)
-			ret = region_add(resv, idx, idx + 1);
+			ret = region_add(resv, idx, idx + 1, in_regions_needed);
 		else {
-			region_abort(resv, idx, idx + 1);
+			region_abort(resv, idx, idx + 1, in_regions_needed);
 			ret = region_del(resv, idx, idx + 1);
 		}
 		break;
@@ -1925,27 +1946,35 @@ static long __vma_reservation_common(struct hstate *h,
 }

 static long vma_needs_reservation(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long addr)
+			struct vm_area_struct *vma, unsigned long addr,
+			long *out_regions_needed)
 {
-	return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV);
+	return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV,
+			out_regions_needed, -1);
 }

 static long vma_commit_reservation(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long addr)
+			struct vm_area_struct *vma, unsigned long addr,
+			long regions_needed)
 {
-	return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV);
+	return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV, NULL,
+			regions_needed);
 }

 static void vma_end_reservation(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long addr)
+			struct vm_area_struct *vma, unsigned long addr,
+			long regions_needed)
 {
-	(void)__vma_reservation_common(h, vma, addr, VMA_END_RESV);
+	(void)__vma_reservation_common(h, vma, addr, VMA_END_RESV, NULL,
+			regions_needed);
 }

 static long vma_add_reservation(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long addr)
+			struct vm_area_struct *vma, unsigned long addr,
+			long regions_needed)
 {
-	return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV);
+	return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV, NULL,
+			regions_needed);
 }

 /*
@@ -1963,8 +1992,10 @@ static void restore_reserve_on_error(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long address,
 			struct page *page)
 {
+	long regions_needed = 0;
 	if (unlikely(PagePrivate(page))) {
-		long rc = vma_needs_reservation(h, vma, address);
+		long rc = vma_needs_reservation(h, vma, address,
+				&regions_needed);

 		if (unlikely(rc < 0)) {
 			/*
@@ -1980,7 +2011,8 @@ static void restore_reserve_on_error(struct hstate *h,
 			 */
 			ClearPagePrivate(page);
 		} else if (rc) {
-			rc = vma_add_reservation(h, vma, address);
+			rc = vma_add_reservation(h, vma, address,
+					regions_needed);
 			if (unlikely(rc < 0))
 				/*
 				 * See above comment about rare out of
@@ -1988,7 +2020,7 @@ static void restore_reserve_on_error(struct hstate *h,
 				 */
 				ClearPagePrivate(page);
 		} else
-			vma_end_reservation(h, vma, address);
+			vma_end_reservation(h, vma, address, regions_needed);
 	}
 }

@@ -2002,6 +2034,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	long gbl_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
+	long regions_needed = 0;

 	idx = hstate_index(h);
 	/*
@@ -2009,7 +2042,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 * has a reservation for the page to be allocated.  A return
 	 * code of zero indicates a reservation exists (no change).
 	 */
-	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
+	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr,
+			&regions_needed);
 	if (map_chg < 0)
 		return ERR_PTR(-ENOMEM);

@@ -2023,7 +2057,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (map_chg || avoid_reserve) {
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
 		if (gbl_chg < 0) {
-			vma_end_reservation(h, vma, addr);
+			vma_end_reservation(h, vma, addr, regions_needed);
 			return ERR_PTR(-ENOSPC);
 		}

@@ -2069,7 +2103,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,

 	set_page_private(page, (unsigned long)spool);

-	map_commit = vma_commit_reservation(h, vma, addr);
+	map_commit = vma_commit_reservation(h, vma, addr, regions_needed);
 	if (unlikely(map_chg > map_commit)) {
 		/*
 		 * The page was added to the reservation map between
@@ -2093,7 +2127,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 out_subpool_put:
 	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
-	vma_end_reservation(h, vma, addr);
+	vma_end_reservation(h, vma, addr, regions_needed);
 	return ERR_PTR(-ENOSPC);
 }

@@ -3778,6 +3812,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page = false;
+	long regions_needed = 0;

 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -3895,12 +3930,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	 * the spinlock.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
-		if (vma_needs_reservation(h, vma, haddr) < 0) {
+		if (vma_needs_reservation(h, vma, haddr, &regions_needed) < 0) {
 			ret = VM_FAULT_OOM;
 			goto backout_unlocked;
 		}
 		/* Just decrements count, does not deallocate */
-		vma_end_reservation(h, vma, haddr);
+		vma_end_reservation(h, vma, haddr, regions_needed);
 	}

 	ptl = huge_pte_lock(h, mm, ptep);
@@ -3990,6 +4025,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct address_space *mapping;
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
+	long regions_needed = 0;

 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
@@ -4044,12 +4080,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * consumed.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
-		if (vma_needs_reservation(h, vma, haddr) < 0) {
+		if (vma_needs_reservation(h, vma, haddr, &regions_needed) < 0) {
 			ret = VM_FAULT_OOM;
 			goto out_mutex;
 		}
 		/* Just decrements count, does not deallocate */
-		vma_end_reservation(h, vma, haddr);
+		vma_end_reservation(h, vma, haddr, regions_needed);

 		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
@@ -4512,7 +4548,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
 	struct hugetlb_cgroup *h_cg;
-	long gbl_reserve;
+	long gbl_reserve, regions_needed = 0;

 	/* This should never happen */
 	if (from > to) {
@@ -4542,7 +4578,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 		 */
 		resv_map = inode_resv_map(inode);

-		chg = region_chg(resv_map, from, to);
+		chg = region_chg(resv_map, from, to, &regions_needed);

 	} else {
 		/* Private mapping. */
@@ -4612,7 +4648,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		long add = region_add(resv_map, from, to);
+		long add = region_add(resv_map, from, to, regions_needed);

 		if (unlikely(chg > add)) {
 			/*
@@ -4634,7 +4670,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		/* Don't call region_abort if region_chg failed */
 		if (chg >= 0)
-			region_abort(resv_map, from, to);
+			region_abort(resv_map, from, to, regions_needed);
 	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		kref_put(&resv_map->refs, resv_map_release);
 	return ret;
@@ -5058,3 +5094,22 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		spin_unlock(&hugetlb_lock);
 	}
 }
+
+static struct file_region *get_file_region_entry_from_cache(
+		struct resv_map *resv, long from, long to)
+{
+	struct file_region *nrg = NULL;
+
+	VM_BUG_ON(resv->region_cache_count <= 0);
+
+	resv->region_cache_count--;
+	nrg = list_first_entry(&resv->region_cache, struct file_region,
+			link);
+	VM_BUG_ON(!nrg);
+	list_del(&nrg->link);
+
+	nrg->from = from;
+	nrg->to = to;
+
+	return nrg;
+}
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 7/9] hugetlb_cgroup: add accounting for shared mappings
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (5 preceding siblings ...)
  2019-09-10 23:31 ` [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-10 23:31 ` [PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests Mina Almasry
  2019-09-10 23:31 ` [PATCH v4 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs Mina Almasry
  8 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny

For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

After a call to region_chg, we charge the approprate hugetlb_cgroup, and if
successful, we pass on the hugetlb_cgroup info to a follow up region_add call.
When a file_region entry is added to the resv_map via region_add, we put the
pointer to that cgroup in file_region->reservation_counter. If charging doesn't
succeed, we report the error to the caller, so that the kernel fails the
reservation.

On region_del, which is when the hugetlb memory is unreserved, we also uncharge
the file_region->reservation_counter.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 mm/hugetlb.c | 147 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 115 insertions(+), 32 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5eca34d9b753d..711690b87dce5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -242,6 +242,15 @@ struct file_region {
 	struct list_head link;
 	long from;
 	long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * On shared mappings, each reserved region appears as a struct
+	 * file_region in resv_map. These fields hold the info needed to
+	 * uncharge each reservation.
+	 */
+	struct page_counter *reservation_counter;
+	unsigned long pages_per_hpage;
+#endif
 };

 /* Helper that removes a struct file_region from the resv_map cache and returns
@@ -250,9 +259,29 @@ struct file_region {
 static struct file_region *get_file_region_entry_from_cache(
 		struct resv_map *resv, long from, long to);

-static long add_reservation_in_range(
-		struct resv_map *resv,
+/* Helper that records hugetlb_cgroup uncharge info. */
+static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
+		struct file_region *nrg, struct hstate *h)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	if (h_cg) {
+		nrg->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		nrg->pages_per_hpage = pages_per_huge_page(h);
+	} else {
+		nrg->reservation_counter = NULL;
+		nrg->pages_per_hpage = 0;
+	}
+#endif
+}
+
+/* Must be called with resv->lock held. Calling this with dry_run == true will
+ * count the number of pages to be added but will not modify the linked list.
+ */
+static long add_reservation_in_range(struct resv_map *resv,
 		long f, long t,
+		struct hugetlb_cgroup *h_cg,
+		struct hstate *h,
 		long *regions_needed,
 		bool count_only)
 {
@@ -294,6 +323,8 @@ static long add_reservation_in_range(
 				nrg = get_file_region_entry_from_cache(resv,
 						last_accounted_offset,
 						rg->from);
+				record_hugetlb_cgroup_uncharge_info(h_cg, nrg,
+						h);
 				list_add(&nrg->link, rg->link.prev);
 			} else if (regions_needed)
 				*regions_needed += 1;
@@ -310,6 +341,7 @@ static long add_reservation_in_range(
 		if (!count_only) {
 			nrg = get_file_region_entry_from_cache(resv,
 					last_accounted_offset, t);
+			record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h);
 			list_add(&nrg->link, rg->link.prev);
 		} else if (regions_needed)
 			*regions_needed += 1;
@@ -317,6 +349,7 @@ static long add_reservation_in_range(
 		last_accounted_offset = t;
 	}

+	VM_BUG_ON(add < 0);
 	return add;
 }

@@ -333,8 +366,8 @@ static long add_reservation_in_range(
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
  */
-static long region_add(struct resv_map *resv, long f, long t,
-		long regions_needed)
+static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg,
+		struct resv_map *resv, long f, long t, long regions_needed)
 {
 	long add = 0;

@@ -342,7 +375,7 @@ static long region_add(struct resv_map *resv, long f, long t,

 	VM_BUG_ON(resv->region_cache_count < regions_needed);

-	add = add_reservation_in_range(resv, f, t, NULL, false);
+	add = add_reservation_in_range(resv, f, t, h_cg, h, NULL, false);
 	resv->adds_in_progress -= regions_needed;

 	spin_unlock(&resv->lock);
@@ -380,7 +413,8 @@ static long region_chg(struct resv_map *resv, long f, long t,
 	spin_lock(&resv->lock);

 	/* Count how many hugepages in this range are NOT respresented. */
-	chg = add_reservation_in_range(resv, f, t, &regions_needed, true);
+	chg = add_reservation_in_range(resv, f, t, NULL, NULL, &regions_needed,
+			true);


 	/*
@@ -433,6 +467,25 @@ static void region_abort(struct resv_map *resv, long f, long t,
 	spin_unlock(&resv->lock);
 }

+static void  uncharge_cgroup_if_shared_mapping(struct resv_map *resv,
+		struct file_region *rg,
+		unsigned long nr_pages)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * If resv->reservation_counter is NULL, then this is shared
+	 * reservation, and the reserved memory is tracked in the file_struct
+	 * entries inside of resv_map. So we need to uncharge the memory here.
+	 */
+	if (rg->reservation_counter && rg->pages_per_hpage && nr_pages > 0 &&
+			!resv->reservation_counter) {
+		hugetlb_cgroup_uncharge_counter(
+				rg->reservation_counter,
+				nr_pages * rg->pages_per_hpage);
+	}
+#endif
+}
+
 /*
  * Delete the specified range [f, t) from the reserve map.  If the
  * t parameter is LONG_MAX, this indicates that ALL regions after f
@@ -453,6 +506,8 @@ static long region_del(struct resv_map *resv, long f, long t)
 	struct file_region *rg, *trg;
 	struct file_region *nrg = NULL;
 	long del = 0;
+	struct page_counter *reservation_counter = NULL;
+	unsigned long pages_per_hpage = 0;

 retry:
 	spin_lock(&resv->lock);
@@ -502,6 +557,9 @@ static long region_del(struct resv_map *resv, long f, long t)
 			/* Original entry is trimmed */
 			rg->to = f;

+			uncharge_cgroup_if_shared_mapping(resv, rg,
+					nrg->to - nrg->from);
+
 			list_add(&nrg->link, &rg->link);
 			nrg = NULL;
 			break;
@@ -509,6 +567,8 @@ static long region_del(struct resv_map *resv, long f, long t)

 		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
 			del += rg->to - rg->from;
+			uncharge_cgroup_if_shared_mapping(resv, rg,
+					rg->to - rg->from);
 			list_del(&rg->link);
 			kfree(rg);
 			continue;
@@ -517,14 +577,20 @@ static long region_del(struct resv_map *resv, long f, long t)
 		if (f <= rg->from) {	/* Trim beginning of region */
 			del += t - rg->from;
 			rg->from = t;
+
+			uncharge_cgroup_if_shared_mapping(resv, rg,
+					t - rg->from);
 		} else {		/* Trim end of region */
 			del += rg->to - f;
 			rg->to = f;
+
+			uncharge_cgroup_if_shared_mapping(resv, rg, rg->to - f);
 		}
 	}

 	spin_unlock(&resv->lock);
 	kfree(nrg);
+
 	return del;
 }

@@ -1900,7 +1966,8 @@ static long __vma_reservation_common(struct hstate *h,
 		break;
 	case VMA_COMMIT_RESV:
 		VM_BUG_ON(in_regions_needed == -1);
-		ret = region_add(resv, idx, idx + 1, in_regions_needed);
+		ret = region_add(NULL, NULL, resv, idx, idx + 1,
+				in_regions_needed);
 		break;
 	case VMA_END_RESV:
 		VM_BUG_ON(in_regions_needed == -1);
@@ -1910,7 +1977,8 @@ static long __vma_reservation_common(struct hstate *h,
 	case VMA_ADD_RESV:
 		VM_BUG_ON(in_regions_needed == -1);
 		if (vma->vm_flags & VM_MAYSHARE)
-			ret = region_add(resv, idx, idx + 1, in_regions_needed);
+			ret = region_add(NULL, NULL, resv, idx, idx + 1,
+					in_regions_needed);
 		else {
 			region_abort(resv, idx, idx + 1, in_regions_needed);
 			ret = region_del(resv, idx, idx + 1);
@@ -4547,7 +4615,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
-	struct hugetlb_cgroup *h_cg;
+	struct hugetlb_cgroup *h_cg = NULL;
 	long gbl_reserve, regions_needed = 0;

 	/* This should never happen */
@@ -4584,27 +4652,10 @@ int hugetlb_reserve_pages(struct inode *inode,
 		/* Private mapping. */
 		chg = to - from;

-		if (hugetlb_cgroup_charge_cgroup(
-					hstate_index(h),
-					chg * pages_per_huge_page(h),
-					&h_cg, true)) {
-			return -ENOMEM;
-		}
-
 		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;

-#ifdef CONFIG_CGROUP_HUGETLB
-		/*
-		 * Since this branch handles private mappings, we attach the
-		 * counter to uncharge for this reservation off resv_map.
-		 */
-		resv_map->reservation_counter =
-			&h_cg->reserved_hugepage[hstate_index(h)];
-		resv_map->pages_per_hpage = pages_per_huge_page(h);
-#endif
-
 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
 	}
@@ -4614,6 +4665,16 @@ int hugetlb_reserve_pages(struct inode *inode,
 		goto out_err;
 	}

+	ret = hugetlb_cgroup_charge_cgroup(
+			hstate_index(h),
+			chg * pages_per_huge_page(h),
+			&h_cg, true);
+
+	if (ret < 0) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
 	/*
 	 * There must be enough pages in the subpool for the mapping. If
 	 * the subpool has a minimum size, there may be some global
@@ -4622,7 +4683,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	gbl_reserve = hugepage_subpool_get_pages(spool, chg);
 	if (gbl_reserve < 0) {
 		ret = -ENOSPC;
-		goto out_err;
+		goto out_uncharge_cgroup;
 	}

 	/*
@@ -4631,9 +4692,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 */
 	ret = hugetlb_acct_memory(h, gbl_reserve);
 	if (ret < 0) {
-		/* put back original number of pages, chg */
-		(void)hugepage_subpool_put_pages(spool, chg);
-		goto out_err;
+		goto out_put_pages;
 	}

 	/*
@@ -4648,7 +4707,8 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		long add = region_add(resv_map, from, to, regions_needed);
+		long add = region_add(h, h_cg, resv_map, from, to,
+				regions_needed);

 		if (unlikely(chg > add)) {
 			/*
@@ -4660,12 +4720,35 @@ int hugetlb_reserve_pages(struct inode *inode,
 			 */
 			long rsv_adjust;

+			hugetlb_cgroup_uncharge_cgroup(
+					hstate_index(h),
+					(chg - add) * pages_per_huge_page(h),
+					h_cg, true);
+
 			rsv_adjust = hugepage_subpool_put_pages(spool,
-								chg - add);
+					chg - add);
 			hugetlb_acct_memory(h, -rsv_adjust);
+
 		}
+	} else {
+#ifdef CONFIG_CGROUP_HUGETLB
+		/*
+		 * Since this branch handles private mappings, we attach the
+		 * counter to uncharge for this reservation off resv_map.
+		 */
+		resv_map->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		resv_map->pages_per_hpage = pages_per_huge_page(h);
+#endif
 	}
 	return 0;
+out_put_pages:
+	/* put back original number of pages, chg */
+	(void)hugepage_subpool_put_pages(spool, chg);
+out_uncharge_cgroup:
+	hugetlb_cgroup_uncharge_cgroup(hstate_index(h),
+			chg * pages_per_huge_page(h),
+			h_cg, true);
 out_err:
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		/* Don't call region_abort if region_chg failed */
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (6 preceding siblings ...)
  2019-09-10 23:31 ` [PATCH v4 7/9] hugetlb_cgroup: add accounting for shared mappings Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-17  1:52   ` shuah
  2019-09-10 23:31 ` [PATCH v4 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs Mina Almasry
  8 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny

The tests use both shared and private mapped hugetlb memory, and
monitors the hugetlb usage counter as well as the hugetlb reservation
counter. They test different configurations such as hugetlb memory usage
via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 440 ++++++++++++++++++
 .../selftests/vm/write_hugetlb_memory.sh      |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++++++++++
 5 files changed, 719 insertions(+)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34d..d3bed9407773c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 9534dc2bc9295..8d37d5409b52c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs

 TEST_PROGS := run_vmtests

@@ -29,3 +30,6 @@ include ../lib.mk
 $(OUTPUT)/userfaultfd: LDLIBS += -lpthread

 $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+
+# Why does adding $(OUTPUT)/ like above not apply this flag..?
+write_to_hugetlbfs: CFLAGS += -static
diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index 0000000000000..09e90e8f6fab4
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,440 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+      mkdir -p $cgroup_path
+      mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
+fi
+
+cleanup () {
+	echo $$ > $cgroup_path/tasks
+
+	set +e
+	if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+	      kill -2 write_to_hugetlbfs
+	      # Wait for hugetlbfs memory to get depleted.
+	      sleep 0.5
+	fi
+	set -e
+
+	if [[ -e /mnt/huge ]]; then
+	      rm -rf /mnt/huge/*
+	      umount /mnt/huge || echo error
+	      rmdir /mnt/huge
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+	      rmdir $cgroup_path/hugetlb_cgroup_test
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+	      rmdir $cgroup_path/hugetlb_cgroup_test1
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+	      rmdir $cgroup_path/hugetlb_cgroup_test2
+	fi
+	echo 0 > /proc/sys/vm/nr_hugepages
+	echo CLEANUP DONE
+}
+
+cleanup
+
+function expect_equal() {
+      local expected="$1"
+      local actual="$2"
+      local error="$3"
+
+      if [[ "$expected" != "$actual" ]]; then
+	    echo "expected ($expected) != actual ($actual): $3"
+	    cleanup
+	    exit 1
+      fi
+}
+
+function setup_cgroup() {
+      local name="$1"
+      local cgroup_limit="$2"
+      local reservation_limit="$3"
+
+      mkdir $cgroup_path/$name
+
+      echo writing cgroup limit: "$cgroup_limit"
+      echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
+
+      echo writing reseravation limit: "$reservation_limit"
+      echo "$reservation_limit" > \
+	    $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
+      echo 0 > $cgroup_path/$name/cpuset.cpus
+      echo 0 > $cgroup_path/$name/cpuset.mems
+}
+
+function write_hugetlbfs_and_get_usage() {
+      local cgroup="$1"
+      local size="$2"
+      local populate="$3"
+      local write="$4"
+      local path="$5"
+      local method="$6"
+      local private="$7"
+      local expect_failure="$8"
+
+      # Function return values.
+      reservation_failed=0
+      oom_killed=0
+      hugetlb_difference=0
+      reserved_difference=0
+
+      local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes
+      local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes
+
+      local hugetlb_before=$(cat $hugetlb_usage)
+      local reserved_before=$(cat $reserved_usage)
+
+      echo
+      echo Starting:
+      echo hugetlb_usage="$hugetlb_before"
+      echo reserved_usage="$reserved_before"
+      echo expect_failure is "$expect_failure"
+
+      set +e
+      if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \
+	    [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
+	    bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+		  "$cgroup"  "$path" "$method" "$private" "-l" &
+
+	    local write_result=$?
+	    # This sleep is to make sure that the script above has had enough
+	    # time to do its thing, since it runs in the background. This may
+	    # cause races...
+	    sleep 0.5
+	    echo write_result is $write_result
+      else
+	    bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+		  "$cgroup"  "$path" "$method" "$private"
+	    local write_result=$?
+      fi
+      set -e
+
+      if [[ "$write_result" == 1 ]]; then
+	    reservation_failed=1
+      fi
+
+      # On linus/master, the above process gets SIGBUS'd on oomkill, with
+      # return code 135. On earlier kernels, it gets actual oomkill, with return
+      # code 137, so just check for both conditions incase we're testing against
+      # an earlier kernel.
+      if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then
+	    oom_killed=1
+      fi
+
+      local hugetlb_after=$(cat $hugetlb_usage)
+      local reserved_after=$(cat $reserved_usage)
+
+      echo After write:
+      echo hugetlb_usage="$hugetlb_after"
+      echo reserved_usage="$reserved_after"
+
+      hugetlb_difference=$(($hugetlb_after - $hugetlb_before))
+      reserved_difference=$(($reserved_after - $reserved_before))
+}
+
+function cleanup_hugetlb_memory() {
+      set +e
+      if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+	    echo kiling write_to_hugetlbfs
+	    killall -2 write_to_hugetlbfs
+	    # Wait for hugetlbfs memory to get depleted.
+	    sleep 0.5
+      fi
+      set -e
+
+      if [[ -e /mnt/huge ]]; then
+	    rm -rf /mnt/huge/*
+	      umount /mnt/huge
+	      rmdir /mnt/huge
+      fi
+}
+
+function run_test() {
+      local size="$1"
+      local populate="$2"
+      local write="$3"
+      local cgroup_limit="$4"
+      local reservation_limit="$5"
+      local nr_hugepages="$6"
+      local method="$7"
+      local private="$8"
+      local expect_failure="$9"
+
+      # Function return values.
+      hugetlb_difference=0
+      reserved_difference=0
+      reservation_failed=0
+      oom_killed=0
+
+      echo nr hugepages = "$nr_hugepages"
+      echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
+
+      setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit"
+
+      mkdir -p /mnt/huge
+      mount -t hugetlbfs \
+	    -o pagesize=2M,size=256M none /mnt/huge
+
+      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \
+	    "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure"
+
+      cleanup_hugetlb_memory
+
+      local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.usage_in_bytes)
+      local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.reservation_usage_in_bytes)
+
+      expect_equal "0" "$final_hugetlb" "final hugetlb is not zero"
+      expect_equal "0" "$final_reservation" "final reservation is not zero"
+}
+
+function run_multiple_cgroup_test() {
+      local size1="$1"
+      local populate1="$2"
+      local write1="$3"
+      local cgroup_limit1="$4"
+      local reservation_limit1="$5"
+
+      local size2="$6"
+      local populate2="$7"
+      local write2="$8"
+      local cgroup_limit2="$9"
+      local reservation_limit2="${10}"
+
+      local nr_hugepages="${11}"
+      local method="${12}"
+      local private="${13}"
+      local expect_failure="${14}"
+
+      # Function return values.
+      hugetlb_difference1=0
+      reserved_difference1=0
+      reservation_failed1=0
+      oom_killed1=0
+
+      hugetlb_difference2=0
+      reserved_difference2=0
+      reservation_failed2=0
+      oom_killed2=0
+
+
+      echo nr hugepages = "$nr_hugepages"
+      echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
+
+      setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1"
+      setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2"
+
+      mkdir -p /mnt/huge
+      mount -t hugetlbfs \
+	    -o pagesize=2M,size=256M none /mnt/huge
+
+      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \
+	    "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \
+	    "$expect_failure"
+
+      hugetlb_difference1=$hugetlb_difference
+      reserved_difference1=$reserved_difference
+      reservation_failed1=$reservation_failed
+      oom_killed1=$oom_killed
+
+      local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.usage_in_bytes
+      local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.reservation_usage_in_bytes
+      local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.usage_in_bytes
+      local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.reservation_usage_in_bytes
+
+      local usage_before_second_write=$(cat $cgroup1_hugetlb_usage)
+      local reservation_usage_before_second_write=$(cat \
+	    $cgroup1_reservation_usage)
+
+      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \
+	    "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \
+	    "$expect_failure"
+
+      hugetlb_difference2=$hugetlb_difference
+      reserved_difference2=$reserved_difference
+      reservation_failed2=$reservation_failed
+      oom_killed2=$oom_killed
+
+      expect_equal "$usage_before_second_write" \
+	    "$(cat $cgroup1_hugetlb_usage)" "Usage changed."
+      expect_equal "$reservation_usage_before_second_write" \
+	    "$(cat $cgroup1_reservation_usage)" "Reservation usage changed."
+
+      cleanup_hugetlb_memory
+
+      local final_hugetlb=$(cat $cgroup1_hugetlb_usage)
+      local final_reservation=$(cat $cgroup1_reservation_usage)
+
+      expect_equal "0" "$final_hugetlb" \
+	    "hugetlbt_cgroup_test1 final hugetlb is not zero"
+      expect_equal "0" "$final_reservation" \
+	    "hugetlbt_cgroup_test1 final reservation is not zero"
+
+      local final_hugetlb=$(cat $cgroup2_hugetlb_usage)
+      local final_reservation=$(cat $cgroup2_reservation_usage)
+
+      expect_equal "0" "$final_hugetlb" \
+	    "hugetlb_cgroup_test2 final hugetlb is not zero"
+      expect_equal "0" "$final_reservation" \
+	    "hugetlb_cgroup_test2 final reservation is not zero"
+}
+
+for private in "" "-r" ; do
+for populate in  "" "-o"; do
+for method in 0 1 2; do
+
+# Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported.
+if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then
+      continue
+fi
+
+# Skip populated shmem tests. Doesn't seem to be supported.
+if [[ "$method" == 2"" ]] && [[ "$populate" == "-o" ]]; then
+      continue
+fi
+
+cleanup
+echo
+echo
+echo
+echo Test normal case.
+echo private=$private, populate=$populate, method=$method
+run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \
+      $((20 * 1024 * 1024)) 10 "$method" "$private" "0"
+
+echo Memory charged to hugtlb=$hugetlb_difference
+echo Memory charged to reservation=$reserved_difference
+
+if [[ "$populate" == "-o" ]]; then
+      expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
+	    "Reserved memory charged to hugetlb cgroup."
+else
+      expect_equal "0" "$hugetlb_difference" \
+	    "Reserved memory charged to hugetlb cgroup."
+fi
+
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
+      "Reserved memory not charged to reservation usage."
+echo 'PASS'
+
+cleanup
+echo
+echo
+echo
+echo Test normal case with write.
+echo private=$private, populate=$populate, method=$method
+run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \
+      $((20 * 1024 * 1024)) 10 "$method" "$private" "0"
+
+echo Memory charged to hugtlb=$hugetlb_difference
+echo Memory charged to reservation=$reserved_difference
+
+expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
+      "Reserved memory charged to hugetlb cgroup."
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
+      "Reserved memory not charged to reservation usage."
+echo 'PASS'
+
+
+cleanup
+echo
+echo
+echo
+echo Test more than reservation case.
+echo private=$private, populate=$populate, method=$method
+run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \
+      "$((5 * 1024 * 1024))" "10" "$method" "$private" "1"
+
+expect_equal "1" "$reservation_failed" "Reservation succeeded."
+echo 'PASS'
+
+cleanup
+
+echo
+echo
+echo
+echo Test more than cgroup limit case.
+echo private=$private, populate=$populate, method=$method
+
+# Not sure if shm memory can be cleaned up when the process gets sigbus'd.
+if [[ "$method" != 2 ]]; then
+      run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \
+	    $((20 * 1024 * 1024)) 10 "$method" "$private" "1"
+
+      expect_equal "1" "$oom_killed" "Not oom killed."
+fi
+echo 'PASS'
+
+cleanup
+
+echo
+echo
+echo
+echo Test normal case, multiple cgroups.
+echo private=$private, populate=$populate, method=$method
+run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \
+      "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
+      "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
+      "$method" "$private" "0"
+
+echo Memory charged to hugtlb1=$hugetlb_difference1
+echo Memory charged to reservation1=$reserved_difference1
+echo Memory charged to hugtlb2=$hugetlb_difference2
+echo Memory charged to reservation2=$reserved_difference2
+
+expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
+      "Incorrect reservations charged to cgroup 1."
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
+      "Incorrect reservation charged to cgroup 2."
+if [[ "$populate" == "-o" ]]; then
+      expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
+	    "Incorrect hugetlb charged to cgroup 1."
+      expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
+	    "Incorrect hugetlb charged to cgroup 2."
+else
+      expect_equal "0" "$hugetlb_difference1" \
+	    "Incorrect hugetlb charged to cgroup 1."
+      expect_equal "0" "$hugetlb_difference2" \
+	    "Incorrect hugetlb charged to cgroup 2."
+fi
+echo 'PASS'
+
+cleanup
+echo
+echo
+echo
+echo Test normal case with write, multiple cgroups.
+echo private=$private, populate=$populate, method=$method
+run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \
+      "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
+      "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
+      "$method" "$private" "0"
+
+echo Memory charged to hugtlb1=$hugetlb_difference1
+echo Memory charged to reservation1=$reserved_difference1
+echo Memory charged to hugtlb2=$hugetlb_difference2
+echo Memory charged to reservation2=$reserved_difference2
+
+expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
+      "Incorrect hugetlb charged to cgroup 1."
+expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
+      "Incorrect reservation charged to cgroup 1."
+expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
+      "Incorrect hugetlb charged to cgroup 2."
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
+      "Incorrected reservation charged to cgroup 2."
+
+echo 'PASS'
+
+done # private
+done # populate
+done # method
+
+umount $cgroup_path
+rmdir $cgroup_path
diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh
new file mode 100644
index 0000000000000..08f5fa5527cfd
--- /dev/null
+++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+size=$1
+populate=$2
+write=$3
+cgroup=$4
+path=$5
+method=$6
+private=$7
+want_sleep=$8
+
+echo "Putting task in cgroup '$cgroup'"
+echo $$ > /dev/cgroup/memory/"$cgroup"/tasks
+
+echo "Method is $method"
+
+set +e
+./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \
+      "$private" "$want_sleep"
diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
new file mode 100644
index 0000000000000..f02a897427a97
--- /dev/null
+++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
@@ -0,0 +1,252 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program reserves and uses hugetlb memory, supporting a bunch of
+ * scenorios needed by the charged_reserved_hugetlb.sh test.
+ */
+
+#include <err.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/shm.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+
+/* Global definitions. */
+enum method {
+	HUGETLBFS,
+	MMAP_MAP_HUGETLB,
+	SHM,
+	MAX_METHOD
+};
+
+
+/* Global variables. */
+static const char *self;
+static char *shmaddr;
+static int shmid;
+
+/*
+ * Show usage and exit.
+ */
+static void exit_usage(void)
+{
+
+	printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
+		"[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
+		"[-o] [-w]\n", self);
+	exit(EXIT_FAILURE);
+}
+
+void sig_handler(int signo)
+{
+	printf("Received %d.\n", signo);
+	if (signo == SIGINT) {
+		printf("Deleting the memory\n");
+		if (shmdt((const void *)shmaddr) != 0) {
+			perror("Detach failure");
+			shmctl(shmid, IPC_RMID, NULL);
+			exit(4);
+		}
+
+		shmctl(shmid, IPC_RMID, NULL);
+		printf("Done deleting the memory\n");
+	}
+	exit(2);
+}
+
+int main(int argc, char **argv)
+{
+	int fd = 0;
+	int key = 0;
+	int *ptr = NULL;
+	int c = 0;
+	int size = 0;
+	char path[256] = "";
+	enum method method = MAX_METHOD;
+	int want_sleep = 0, private = 0;
+	int populate = 0;
+	int write = 0;
+
+	unsigned long i;
+
+
+	if (signal(SIGINT, sig_handler) == SIG_ERR)
+		err(1, "\ncan't catch SIGINT\n");
+
+	/* Parse command-line arguments. */
+	setvbuf(stdout, NULL, _IONBF, 0);
+	self = argv[0];
+
+	while ((c = getopt(argc, argv, "s:p:m:owlr")) != -1) {
+		switch (c) {
+		case 's':
+			size = atoi(optarg);
+			break;
+		case 'p':
+			strncpy(path, optarg, sizeof(path));
+			break;
+		case 'm':
+			if (atoi(optarg) >= MAX_METHOD) {
+				errno = EINVAL;
+				perror("Invalid -m.");
+				exit_usage();
+			}
+			method = atoi(optarg);
+			break;
+		case 'o':
+			populate = 1;
+			break;
+		case 'w':
+			write = 1;
+			break;
+		case 'l':
+			want_sleep = 1;
+			break;
+		case 'r':
+			private = 1;
+			break;
+		default:
+			errno = EINVAL;
+			perror("Invalid arg");
+			exit_usage();
+		}
+	}
+
+	if (strncmp(path, "", sizeof(path)) != 0) {
+		printf("Writing to this path: %s\n", path);
+	} else {
+		errno = EINVAL;
+		perror("path not found");
+		exit_usage();
+	}
+
+	if (size != 0) {
+		printf("Writing this size: %d\n", size);
+	} else {
+		errno = EINVAL;
+		perror("size not found");
+		exit_usage();
+	}
+
+	if (!populate)
+		printf("Not populating.\n");
+	else
+		printf("Populating.\n");
+
+	if (!write)
+		printf("Not writing to memory.\n");
+
+	if (method == MAX_METHOD) {
+		errno = EINVAL;
+		perror("-m Invalid");
+		exit_usage();
+	} else
+		printf("Using method=%d\n", method);
+
+	if (!private)
+		printf("Shared mapping.\n");
+	else
+		printf("Private mapping.\n");
+
+
+	switch (method) {
+	case HUGETLBFS:
+		printf("Allocating using HUGETLBFS.\n");
+		fd = open(path, O_CREAT | O_RDWR, 0777);
+		if (fd == -1)
+			err(1, "Failed to open file.");
+
+		ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			(private ? MAP_PRIVATE : MAP_SHARED) | (populate ?
+				MAP_POPULATE : 0), fd, 0);
+
+		if (ptr == MAP_FAILED) {
+			close(fd);
+			err(1, "Error mapping the file");
+		}
+		break;
+	case MMAP_MAP_HUGETLB:
+		printf("Allocating using MAP_HUGETLB.\n");
+		ptr = mmap(NULL, size,
+		PROT_READ | PROT_WRITE,
+		(private ? (MAP_PRIVATE | MAP_ANONYMOUS) : MAP_SHARED) |
+		MAP_HUGETLB | (populate ?
+			MAP_POPULATE : 0),
+		-1, 0);
+
+		if (ptr == MAP_FAILED)
+			err(1, "mmap");
+
+		printf("Returned address is %p\n", ptr);
+		break;
+	case SHM:
+		printf("Allocating using SHM.\n");
+		shmid = shmget(key, size, SHM_HUGETLB | IPC_CREAT | SHM_R |
+				SHM_W);
+		if (shmid < 0) {
+			shmid = shmget(++key, size, SHM_HUGETLB | IPC_CREAT |
+					SHM_R | SHM_W);
+			if (shmid < 0)
+				err(1, "shmget");
+
+		}
+		printf("shmid: 0x%x, shmget key:%d\n", shmid, key);
+
+		shmaddr = shmat(shmid, NULL, 0);
+		if (shmaddr == (char *)-1) {
+			perror("Shared memory attach failure");
+			shmctl(shmid, IPC_RMID, NULL);
+			exit(2);
+		}
+		printf("shmaddr: %p\n", shmaddr);
+
+		break;
+	default:
+		errno = EINVAL;
+		err(1, "Invalid method.");
+	}
+
+	if (write) {
+		printf("Writing to memory.\n");
+		if (method != SHM) {
+			memset(ptr, 1, size);
+		} else {
+			printf("Starting the writes:\n");
+			for (i = 0; i < size; i++) {
+				shmaddr[i] = (char)(i);
+				if (!(i % (1024 * 1024)))
+					printf(".");
+			}
+			printf("\n");
+
+			printf("Starting the Check...");
+			for (i = 0; i < size; i++)
+				if (shmaddr[i] != (char)i) {
+					printf("\nIndex %lu mismatched\n", i);
+					exit(3);
+				}
+			printf("Done.\n");
+
+
+		}
+	}
+
+	if (want_sleep) {
+		/* Signal to caller that we're done. */
+		printf("DONE\n");
+
+		/* Hold memory until external kill signal is delivered. */
+		while (1)
+			sleep(100);
+	}
+
+	close(fd);
+
+	return 0;
+}
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs
  2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
                   ` (7 preceding siblings ...)
  2019-09-10 23:31 ` [PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests Mina Almasry
@ 2019-09-10 23:31 ` Mina Almasry
  2019-09-17  1:58   ` shuah
  8 siblings, 1 reply; 19+ messages in thread
From: Mina Almasry @ 2019-09-10 23:31 UTC (permalink / raw)
  To: mike.kravetz
  Cc: shuah, almasrymina, rientjes, shakeelb, gthelen, akpm,
	khalid.aziz, linux-kernel, linux-mm, linux-kselftest, cgroups,
	aneesh.kumar, mkoutny, Hillf Danton

Add docs for how to use hugetlb_cgroup reservations, and their behavior.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Hillf Danton <hdanton@sina.com>
---
 .../admin-guide/cgroup-v1/hugetlb.rst         | 84 ++++++++++++++++---
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index a3902aa253a96..cc6eb859fc722 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -2,13 +2,6 @@
 HugeTLB Controller
 ==================

-The HugeTLB controller allows to limit the HugeTLB usage per control group and
-enforces the controller limit during page fault. Since HugeTLB doesn't
-support page reclaim, enforcing the limit at page fault time implies that,
-the application will get SIGBUS signal if it tries to access HugeTLB pages
-beyond its limit. This requires the application to know beforehand how much
-HugeTLB pages it would require for its use.
-
 HugeTLB controller can be created by first mounting the cgroup filesystem.

 # mount -t cgroup -o hugetlb none /sys/fs/cgroup
@@ -28,10 +21,14 @@ process (bash) into it.

 Brief summary of control files::

- hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
- hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
- hugetlb.<hugepagesize>.failcnt		   # show the number of allocation failure due to HugeTLB limit
+ hugetlb.<hugepagesize>.reservation_limit_in_bytes     # set/show limit of "hugepagesize" hugetlb reservations
+ hugetlb.<hugepagesize>.reservation_max_usage_in_bytes # show max "hugepagesize" hugetlb reservations recorded
+ hugetlb.<hugepagesize>.reservation_usage_in_bytes     # show current reservations for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.reservation_failcnt            # show the number of allocation failure due to HugeTLB reservation limit
+ hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb faults
+ hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit

 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
@@ -40,11 +37,76 @@ files include::
   hugetlb.1GB.max_usage_in_bytes
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
+  hugetlb.1GB.reservation_limit_in_bytes
+  hugetlb.1GB.reservation_max_usage_in_bytes
+  hugetlb.1GB.reservation_usage_in_bytes
+  hugetlb.1GB.reservation_failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
+  hugetlb.64KB.reservation_limit_in_bytes
+  hugetlb.64KB.reservation_max_usage_in_bytes
+  hugetlb.64KB.reservation_usage_in_bytes
+  hugetlb.64KB.reservation_failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
+  hugetlb.32MB.reservation_limit_in_bytes
+  hugetlb.32MB.reservation_max_usage_in_bytes
+  hugetlb.32MB.reservation_usage_in_bytes
+  hugetlb.32MB.reservation_failcnt
+
+
+1. Reservation limits
+
+The HugeTLB controller allows to limit the HugeTLB reservations per control
+group and enforces the controller limit at reservation time. Reservation limits
+are superior to Page fault limits (see section 2), since Reservation limits are
+enforced at reservation time, and never causes the application to get SIGBUS
+signal. Instead, if the application is violating its limits, then it gets an
+error on reservation time, i.e. the mmap or shmget return an error.
+
+
+2. Page fault limits
+
+The HugeTLB controller allows to limit the HugeTLB usage (page fault) per
+control group and enforces the controller limit during page fault. Since HugeTLB
+doesn't support page reclaim, enforcing the limit at page fault time implies
+that, the application will get SIGBUS signal if it tries to access HugeTLB
+pages beyond its limit. This requires the application to know beforehand how
+much HugeTLB pages it would require for its use.
+
+
+3. Caveats with shared memory
+
+a. Charging and uncharging:
+
+For shared hugetlb memory, both hugetlb reservation and usage (page faults) are
+charged to the first task that causes the memory to be reserved or faulted,
+and all subsequent uses of this reserved or faulted memory is done without
+charging.
+
+Shared hugetlb memory is only uncharged when it is unreseved or deallocated.
+This is usually when the hugetlbfs file is deleted, and not when the task that
+caused the reservation or fault has exited.
+
+b. Interaction between reservation limit and fault limit.
+
+Generally, it's not recommended to set both of the reservation limit and fault
+limit in a cgroup. For private memory, the fault usage cannot exceed the
+reservation usage, so if you set both, one of those limits will be useless.
+
+For shared memory, a cgroup's fault usage may be greater than its reservation
+usage, so some care needs to be taken. Consider this example:
+
+- Task A reserves 4 pages in a shared hugetlbfs file. Cgroup A will get
+  4 reservations charged to it and no faults charged to it.
+- Task B reserves and faults the same 4 pages as Task A. Cgroup B will get no
+  reservation charge, but will get charged 4 faulted pages. If Cgroup B's limit
+  is less than 4, then Task B will get a SIGBUS.
+
+For the above scenario, it's not recommended for the userspace to set both
+reservation limits and fault limits, but it is still allowed to in case it sees
+some use for it.
--
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 4/9] hugetlb: region_chg provides only cache entry
  2019-09-10 23:31 ` [PATCH v4 4/9] hugetlb: region_chg provides only cache entry Mina Almasry
@ 2019-09-16 22:17   ` Mike Kravetz
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2019-09-16 22:17 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, rientjes, shakeelb, gthelen, akpm, khalid.aziz,
	linux-kernel, linux-mm, linux-kselftest, cgroups, aneesh.kumar,
	mkoutny

On 9/10/19 4:31 PM, Mina Almasry wrote:
> Current behavior is that region_chg provides both a cache entry in
> resv->region_cache, AND a placeholder entry in resv->regions. region_add
> first tries to use the placeholder, and if it finds that the placeholder
> has been deleted by a racing region_del call, it uses the cache entry.
> 
> This behavior is completely unnecessary and is removed in this patch for
> a couple of reasons:
> 
> 1. region_add needs to either find a cached file_region entry in
>    resv->region_cache, or find an entry in resv->regions to expand. It
>    does not need both.
> 2. region_chg adding a placeholder entry in resv->regions opens up
>    a possible race with region_del, where region_chg adds a placeholder
>    region in resv->regions, and this region is deleted by a racing call
>    to region_del during region_chg execution or before region_add is
>    called. Removing the race makes the code easier to reason about and
>    maintain.
> 
> In addition, a follow up patch in this series disables region
> coalescing, which would be further complicated if the race with
> region_del exists.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Thanks.  I like this modification as it does simplify the code and could
be added as a general cleanup independent of the other changes.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> ---
>  mm/hugetlb.c | 63 +++++++++-------------------------------------------
>  1 file changed, 11 insertions(+), 52 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index fbd7c52e17348..bea51ae422f63 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -246,14 +246,10 @@ struct file_region {
> 
>  /*
>   * Add the huge page range represented by [f, t) to the reserve
> - * map.  In the normal case, existing regions will be expanded
> - * to accommodate the specified range.  Sufficient regions should
> - * exist for expansion due to the previous call to region_chg
> - * with the same range.  However, it is possible that region_del
> - * could have been called after region_chg and modifed the map
> - * in such a way that no region exists to be expanded.  In this
> - * case, pull a region descriptor from the cache associated with
> - * the map and use that for the new range.
> + * map.  Existing regions will be expanded to accommodate the specified
> + * range, or a region will be taken from the cache.  Sufficient regions
> + * must exist in the cache due to the previous call to region_chg with
> + * the same range.
>   *
>   * Return the number of new huge pages added to the map.  This
>   * number is greater than or equal to zero.
> @@ -272,9 +268,8 @@ static long region_add(struct resv_map *resv, long f, long t)
> 
>  	/*
>  	 * If no region exists which can be expanded to include the
> -	 * specified range, the list must have been modified by an
> -	 * interleving call to region_del().  Pull a region descriptor
> -	 * from the cache and use it for this range.
> +	 * specified range, pull a region descriptor from the cache
> +	 * and use it for this range.
>  	 */
>  	if (&rg->link == head || t < rg->from) {
>  		VM_BUG_ON(resv->region_cache_count <= 0);
> @@ -339,15 +334,9 @@ static long region_add(struct resv_map *resv, long f, long t)
>   * call to region_add that will actually modify the reserve
>   * map to add the specified range [f, t).  region_chg does
>   * not change the number of huge pages represented by the
> - * map.  However, if the existing regions in the map can not
> - * be expanded to represent the new range, a new file_region
> - * structure is added to the map as a placeholder.  This is
> - * so that the subsequent region_add call will have all the
> - * regions it needs and will not fail.
> - *
> - * Upon entry, region_chg will also examine the cache of region descriptors
> - * associated with the map.  If there are not enough descriptors cached, one
> - * will be allocated for the in progress add operation.
> + * map.  A new file_region structure is added to the cache
> + * as a placeholder, so that the subsequent region_add
> + * call will have all the regions it needs and will not fail.
>   *
>   * Returns the number of huge pages that need to be added to the existing
>   * reservation map for the range [f, t).  This number is greater or equal to
> @@ -357,10 +346,9 @@ static long region_add(struct resv_map *resv, long f, long t)
>  static long region_chg(struct resv_map *resv, long f, long t)
>  {
>  	struct list_head *head = &resv->regions;
> -	struct file_region *rg, *nrg = NULL;
> +	struct file_region *rg;
>  	long chg = 0;
> 
> -retry:
>  	spin_lock(&resv->lock);
>  retry_locked:
>  	resv->adds_in_progress++;
> @@ -378,10 +366,8 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  		spin_unlock(&resv->lock);
> 
>  		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
> -		if (!trg) {
> -			kfree(nrg);
> +		if (!trg)
>  			return -ENOMEM;
> -		}
> 
>  		spin_lock(&resv->lock);
>  		list_add(&trg->link, &resv->region_cache);
> @@ -394,28 +380,6 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  		if (f <= rg->to)
>  			break;
> 
> -	/* If we are below the current region then a new region is required.
> -	 * Subtle, allocate a new region at the position but make it zero
> -	 * size such that we can guarantee to record the reservation. */
> -	if (&rg->link == head || t < rg->from) {
> -		if (!nrg) {
> -			resv->adds_in_progress--;
> -			spin_unlock(&resv->lock);
> -			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> -			if (!nrg)
> -				return -ENOMEM;
> -
> -			nrg->from = f;
> -			nrg->to   = f;
> -			INIT_LIST_HEAD(&nrg->link);
> -			goto retry;
> -		}
> -
> -		list_add(&nrg->link, rg->link.prev);
> -		chg = t - f;
> -		goto out_nrg;
> -	}
> -
>  	/* Round our left edge to the current segment if it encloses us. */
>  	if (f > rg->from)
>  		f = rg->from;
> @@ -439,11 +403,6 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  	}
> 
>  out:
> -	spin_unlock(&resv->lock);
> -	/*  We already know we raced and no longer need the new region */
> -	kfree(nrg);
> -	return chg;
> -out_nrg:
>  	spin_unlock(&resv->lock);
>  	return chg;
>  }
> --
> 2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 5/9] hugetlb: remove duplicated code
  2019-09-10 23:31 ` [PATCH v4 5/9] hugetlb: remove duplicated code Mina Almasry
@ 2019-09-16 22:25   ` Mike Kravetz
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2019-09-16 22:25 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, rientjes, shakeelb, gthelen, akpm, khalid.aziz,
	linux-kernel, linux-mm, linux-kselftest, cgroups, aneesh.kumar,
	mkoutny

On 9/10/19 4:31 PM, Mina Almasry wrote:
> Remove duplicated code between region_chg and region_add, and refactor it into
> a common function, add_reservation_in_range. This is mostly done because
> there is a follow up change in this series that disables region
> coalescing in region_add, and I want to make that change in one place
> only. It should improve maintainability anyway on its own.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Like the previous patch, this is a good improvement indepentent of the
rest of the series.  Thanks!

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> ---
>  mm/hugetlb.c | 116 ++++++++++++++++++++++++---------------------------
>  1 file changed, 54 insertions(+), 62 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bea51ae422f63..ce5ed1056fefd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -244,6 +244,57 @@ struct file_region {
>  	long to;
>  };
> 
> +static long add_reservation_in_range(
> +		struct resv_map *resv, long f, long t, bool count_only)
> +{
> +
> +	long chg = 0;
> +	struct list_head *head = &resv->regions;
> +	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
> +
> +	/* Locate the region we are before or in. */
> +	list_for_each_entry(rg, head, link)
> +		if (f <= rg->to)
> +			break;
> +
> +	/* Round our left edge to the current segment if it encloses us. */
> +	if (f > rg->from)
> +		f = rg->from;
> +
> +	chg = t - f;
> +
> +	/* Check for and consume any regions we now overlap with. */
> +	nrg = rg;
> +	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> +		if (&rg->link == head)
> +			break;
> +		if (rg->from > t)
> +			break;
> +
> +		/* We overlap with this area, if it extends further than
> +		 * us then we must extend ourselves.  Account for its
> +		 * existing reservation.
> +		 */
> +		if (rg->to > t) {
> +			chg += rg->to - t;
> +			t = rg->to;
> +		}
> +		chg -= rg->to - rg->from;
> +
> +		if (!count_only && rg != nrg) {
> +			list_del(&rg->link);
> +			kfree(rg);
> +		}
> +	}
> +
> +	if (!count_only) {
> +		nrg->from = f;
> +		nrg->to = t;
> +	}
> +
> +	return chg;
> +}
> +
>  /*
>   * Add the huge page range represented by [f, t) to the reserve
>   * map.  Existing regions will be expanded to accommodate the specified
> @@ -257,7 +308,7 @@ struct file_region {
>  static long region_add(struct resv_map *resv, long f, long t)
>  {
>  	struct list_head *head = &resv->regions;
> -	struct file_region *rg, *nrg, *trg;
> +	struct file_region *rg, *nrg;
>  	long add = 0;
> 
>  	spin_lock(&resv->lock);
> @@ -287,38 +338,7 @@ static long region_add(struct resv_map *resv, long f, long t)
>  		goto out_locked;
>  	}
> 
> -	/* Round our left edge to the current segment if it encloses us. */
> -	if (f > rg->from)
> -		f = rg->from;
> -
> -	/* Check for and consume any regions we now overlap with. */
> -	nrg = rg;
> -	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> -		if (&rg->link == head)
> -			break;
> -		if (rg->from > t)
> -			break;
> -
> -		/* If this area reaches higher then extend our area to
> -		 * include it completely.  If this is not the first area
> -		 * which we intend to reuse, free it. */
> -		if (rg->to > t)
> -			t = rg->to;
> -		if (rg != nrg) {
> -			/* Decrement return value by the deleted range.
> -			 * Another range will span this area so that by
> -			 * end of routine add will be >= zero
> -			 */
> -			add -= (rg->to - rg->from);
> -			list_del(&rg->link);
> -			kfree(rg);
> -		}
> -	}
> -
> -	add += (nrg->from - f);		/* Added to beginning of region */
> -	nrg->from = f;
> -	add += t - nrg->to;		/* Added to end of region */
> -	nrg->to = t;
> +	add = add_reservation_in_range(resv, f, t, false);
> 
>  out_locked:
>  	resv->adds_in_progress--;
> @@ -345,8 +365,6 @@ static long region_add(struct resv_map *resv, long f, long t)
>   */
>  static long region_chg(struct resv_map *resv, long f, long t)
>  {
> -	struct list_head *head = &resv->regions;
> -	struct file_region *rg;
>  	long chg = 0;
> 
>  	spin_lock(&resv->lock);
> @@ -375,34 +393,8 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  		goto retry_locked;
>  	}
> 
> -	/* Locate the region we are before or in. */
> -	list_for_each_entry(rg, head, link)
> -		if (f <= rg->to)
> -			break;
> -
> -	/* Round our left edge to the current segment if it encloses us. */
> -	if (f > rg->from)
> -		f = rg->from;
> -	chg = t - f;
> -
> -	/* Check for and consume any regions we now overlap with. */
> -	list_for_each_entry(rg, rg->link.prev, link) {
> -		if (&rg->link == head)
> -			break;
> -		if (rg->from > t)
> -			goto out;
> +	chg = add_reservation_in_range(resv, f, t, true);
> 
> -		/* We overlap with this area, if it extends further than
> -		 * us then we must extend ourselves.  Account for its
> -		 * existing reservation. */
> -		if (rg->to > t) {
> -			chg += rg->to - t;
> -			t = rg->to;
> -		}
> -		chg -= rg->to - rg->from;
> -	}
> -
> -out:
>  	spin_unlock(&resv->lock);
>  	return chg;
>  }
> --
> 2.23.0.162.g0b9fbb3734-goog
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  2019-09-10 23:31 ` [PATCH v4 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
@ 2019-09-16 23:43   ` shuah
  0 siblings, 0 replies; 19+ messages in thread
From: shuah @ 2019-09-16 23:43 UTC (permalink / raw)
  To: Mina Almasry, mike.kravetz
  Cc: rientjes, shakeelb, gthelen, akpm, khalid.aziz, linux-kernel,
	linux-mm, linux-kselftest, cgroups, aneesh.kumar, mkoutny,
	Hillf Danton, shuah

On 9/10/19 5:31 PM, Mina Almasry wrote:
> These counters will track hugetlb reservations rather than hugetlb
> memory faulted in. This patch only adds the counter, following patches
> add the charging and uncharging of the counter.
> 

Why are we adding these counters? I see the reasons in the cover letter.
Why not add the details on why this is needed here in this commitlog

Please add more information on why and rephrase the commit log.

> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Acked-by: Hillf Danton <hdanton@sina.com>
> ---
>   include/linux/hugetlb.h |  16 +++++-
>   mm/hugetlb_cgroup.c     | 111 ++++++++++++++++++++++++++++++----------
>   2 files changed, 100 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index edfca42783192..128ff1aff1c93 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -320,6 +320,20 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> 
>   #ifdef CONFIG_HUGETLB_PAGE
> 
> +enum {
> +	HUGETLB_RES_USAGE,
> +	HUGETLB_RES_RESERVATION_USAGE,
> +	HUGETLB_RES_LIMIT,
> +	HUGETLB_RES_RESERVATION_LIMIT,
> +	HUGETLB_RES_MAX_USAGE,
> +	HUGETLB_RES_RESERVATION_MAX_USAGE,
> +	HUGETLB_RES_FAILCNT,
> +	HUGETLB_RES_RESERVATION_FAILCNT,
> +	HUGETLB_RES_NULL,
> +	HUGETLB_RES_MAX,
> +};
> +

Please add information on what these track. As an example
HUGETLB_RES_RESERVATION_LIMIT is so close to HUGETLB_RES_LIMIT

What are we tracking and measuring?

> +
>   #define HSTATE_NAME_LEN 32
>   /* Defines one hugetlb page size */
>   struct hstate {
> @@ -340,7 +354,7 @@ struct hstate {
>   	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
>   #ifdef CONFIG_CGROUP_HUGETLB
>   	/* cgroup control files */
> -	struct cftype cgroup_files[5];
> +	struct cftype cgroup_files[HUGETLB_RES_MAX];
>   #endif
>   	char name[HSTATE_NAME_LEN];
>   };
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 68c2f2f3c05b7..51a72624bd1ff 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -25,6 +25,10 @@ struct hugetlb_cgroup {
>   	 * the counter to account for hugepages from hugetlb.
>   	 */
>   	struct page_counter hugepage[HUGE_MAX_HSTATE];
> +	/*
> +	 * the counter to account for hugepage reservations from hugetlb.
> +	 */
> +	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
>   };
> 
>   #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> @@ -33,6 +37,15 @@ struct hugetlb_cgroup {
> 
>   static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
> 
> +static inline
> +struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx,
> +				 bool reserved)
> +{
> +	if (reserved)
> +		return  &h_cg->reserved_hugepage[idx];
> +	return &h_cg->hugepage[idx];
> +}
> +
>   static inline
>   struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
>   {
> @@ -254,30 +267,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>   	return;
>   }
> 
> -enum {
> -	RES_USAGE,
> -	RES_LIMIT,
> -	RES_MAX_USAGE,
> -	RES_FAILCNT,
> -};
> -
>   static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
>   				   struct cftype *cft)
>   {
>   	struct page_counter *counter;
> +	struct page_counter *reserved_counter;
>   	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
> 
>   	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> +	reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
> 
>   	switch (MEMFILE_ATTR(cft->private)) {
> -	case RES_USAGE:
> +	case HUGETLB_RES_USAGE:
>   		return (u64)page_counter_read(counter) * PAGE_SIZE;
> -	case RES_LIMIT:
> +	case HUGETLB_RES_RESERVATION_USAGE:
> +		return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
> +	case HUGETLB_RES_LIMIT:
>   		return (u64)counter->max * PAGE_SIZE;
> -	case RES_MAX_USAGE:
> +	case HUGETLB_RES_RESERVATION_LIMIT:
> +		return (u64)reserved_counter->max * PAGE_SIZE;
> +	case HUGETLB_RES_MAX_USAGE:
>   		return (u64)counter->watermark * PAGE_SIZE;
> -	case RES_FAILCNT:
> +	case HUGETLB_RES_RESERVATION_MAX_USAGE:
> +		return (u64)reserved_counter->watermark * PAGE_SIZE;
> +	case HUGETLB_RES_FAILCNT:
>   		return counter->failcnt;
> +	case HUGETLB_RES_RESERVATION_FAILCNT:
> +		return reserved_counter->failcnt;
>   	default:
>   		BUG();
>   	}
> @@ -291,6 +307,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>   	int ret, idx;
>   	unsigned long nr_pages;
>   	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> +	bool reserved = false;
> 
>   	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
>   		return -EINVAL;
> @@ -304,9 +321,13 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>   	nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
> 
>   	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> -	case RES_LIMIT:
> +	case HUGETLB_RES_RESERVATION_LIMIT:
> +		reserved = true;
> +		/* Fall through. */
> +	case HUGETLB_RES_LIMIT:
>   		mutex_lock(&hugetlb_limit_mutex);
> -		ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
> +		ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx, reserved),
> +					   nr_pages);
>   		mutex_unlock(&hugetlb_limit_mutex);
>   		break;
>   	default:
> @@ -320,18 +341,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
>   				    char *buf, size_t nbytes, loff_t off)
>   {
>   	int ret = 0;
> -	struct page_counter *counter;
> +	struct page_counter *counter, *reserved_counter;
>   	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> 
>   	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
> +	reserved_counter = &h_cg->reserved_hugepage[
> +		MEMFILE_IDX(of_cft(of)->private)];
> 

Please indent this. It is hard to read.

>   	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> -	case RES_MAX_USAGE:
> +	case HUGETLB_RES_MAX_USAGE:
>   		page_counter_reset_watermark(counter);
>   		break;
> -	case RES_FAILCNT:
> +	case HUGETLB_RES_RESERVATION_MAX_USAGE:
> +		page_counter_reset_watermark(reserved_counter);
> +		break;
> +	case HUGETLB_RES_FAILCNT:
>   		counter->failcnt = 0;
>   		break;
> +	case HUGETLB_RES_RESERVATION_FAILCNT:
> +		reserved_counter->failcnt = 0;
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -357,37 +386,67 @@ static void __init __hugetlb_cgroup_file_init(int idx)
>   	struct hstate *h = &hstates[idx];
> 
>   	/* format the size */
> -	mem_fmt(buf, 32, huge_page_size(h));
> +	mem_fmt(buf, sizeof(buf), huge_page_size(h));
> 
>   	/* Add the limit file */
> -	cft = &h->cgroup_files[0];
> +	cft = &h->cgroup_files[HUGETLB_RES_LIMIT];
>   	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
> -	cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
> +	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_LIMIT);
> +	cft->read_u64 = hugetlb_cgroup_read_u64;
> +	cft->write = hugetlb_cgroup_write;
> +
> +	/* Add the reservation limit file */
> +	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_LIMIT];
> +	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
> +		 buf);
> +	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_LIMIT);
>   	cft->read_u64 = hugetlb_cgroup_read_u64;
>   	cft->write = hugetlb_cgroup_write;
> 
>   	/* Add the usage file */
> -	cft = &h->cgroup_files[1];
> +	cft = &h->cgroup_files[HUGETLB_RES_USAGE];
>   	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
> -	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> +	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_USAGE);
> +	cft->read_u64 = hugetlb_cgroup_read_u64;
> +
> +	/* Add the reservation usage file */
> +	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_USAGE];
> +	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
> +			buf);

Please line buf with cft->name for readability.

> +	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_USAGE);
>   	cft->read_u64 = hugetlb_cgroup_read_u64;
> 
>   	/* Add the MAX usage file */
> -	cft = &h->cgroup_files[2];
> +	cft = &h->cgroup_files[HUGETLB_RES_MAX_USAGE];
>   	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> -	cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> +	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_MAX_USAGE);
> +	cft->write = hugetlb_cgroup_reset;
> +	cft->read_u64 = hugetlb_cgroup_read_u64;
> +
> +	/* Add the MAX reservation usage file */
> +	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_MAX_USAGE];
> +	snprintf(cft->name, MAX_CFTYPE_NAME,
> +			"%s.reservation_max_usage_in_bytes", buf);

Same here.

> +	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_MAX_USAGE);
>   	cft->write = hugetlb_cgroup_reset;
>   	cft->read_u64 = hugetlb_cgroup_read_u64;
> 
>   	/* Add the failcntfile */
> -	cft = &h->cgroup_files[3];
> +	cft = &h->cgroup_files[HUGETLB_RES_FAILCNT];
>   	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> -	cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> +	cft->private  = MEMFILE_PRIVATE(idx, HUGETLB_RES_FAILCNT);
> +	cft->write = hugetlb_cgroup_reset;
> +	cft->read_u64 = hugetlb_cgroup_read_u64;
> +
> +	/* Add the reservation failcntfile */
> +	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_FAILCNT];
> +	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
> +	cft->private  = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_FAILCNT);
>   	cft->write = hugetlb_cgroup_reset;
>   	cft->read_u64 = hugetlb_cgroup_read_u64;
> 
>   	/* NULL terminate the last cft */
> -	cft = &h->cgroup_files[4];
> +	cft = &h->cgroup_files[HUGETLB_RES_NULL];
>   	memset(cft, 0, sizeof(*cft));
> 
>   	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
> --
> 2.23.0.162.g0b9fbb3734-goog
> 

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing
  2019-09-10 23:31 ` [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing Mina Almasry
@ 2019-09-16 23:57   ` Mike Kravetz
  2019-09-17  0:16     ` Mina Almasry
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2019-09-16 23:57 UTC (permalink / raw)
  To: Mina Almasry
  Cc: shuah, rientjes, shakeelb, gthelen, akpm, khalid.aziz,
	linux-kernel, linux-mm, linux-kselftest, cgroups, aneesh.kumar,
	mkoutny

On 9/10/19 4:31 PM, Mina Almasry wrote:
> A follow up patch in this series adds hugetlb cgroup uncharge info the
> file_region entries in resv->regions. The cgroup uncharge info may
> differ for different regions, so they can no longer be coalesced at
> region_add time. So, disable region coalescing in region_add in this
> patch.
> 
> Behavior change:
> 
> Say a resv_map exists like this [0->1], [2->3], and [5->6].
> 
> Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> 
> Old code would generate resv->regions: [0->5], [5->6].
> New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> [5->6].
> 
> Special care needs to be taken to handle the resv->adds_in_progress
> variable correctly. In the past, only 1 region would be added for every
> region_chg and region_add call. But now, each call may add multiple
> regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
> region_chg calls add_reservation_in_range() to count the number of regions
> needed and allocates those, and that info is passed to region_add and
> region_abort to decrement adds_in_progress correctly.

Hate to throw more theoretical examples at you but ...

Consider an existing reserv_map like [3-10]
Then a region_chg/add call comes in region_chg/add(f=0, t=10).
The region_chg is going to return 3 (additional reservations needed), and
also out_regions_needed = 1 as it would want to create a region [0-3].
Correct?
But, there is nothing to prevent another thread from doing a region_del [5-7]
after the region_chg and before region_add.  Correct?
If so, it seems the region_add would need to create two regions, but there
is only one in the cache and we would BUG in get_file_region_entry_from_cache.
Am I reading the code correctly?

The existing code wants to make sure region_add called after region_chg will
never return error.  This is why all needed allocations were done in the
region_chg call, and it was relatively easy to do in existing code when
region_chg would only need one additional region at most.

I'm thinking that we may have to make region_chg allocate the worst case
number of regions (t - f)/2, OR change to the code such that region_add
could return an error.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing
  2019-09-16 23:57   ` Mike Kravetz
@ 2019-09-17  0:16     ` Mina Almasry
  0 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-09-17  0:16 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: shuah, David Rientjes, Shakeel Butt, Greg Thelen, Andrew Morton,
	khalid.aziz, open list, linux-mm, linux-kselftest, cgroups,
	Aneesh Kumar, Michal Koutný

On Mon, Sep 16, 2019 at 4:57 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 9/10/19 4:31 PM, Mina Almasry wrote:
> > A follow up patch in this series adds hugetlb cgroup uncharge info the
> > file_region entries in resv->regions. The cgroup uncharge info may
> > differ for different regions, so they can no longer be coalesced at
> > region_add time. So, disable region coalescing in region_add in this
> > patch.
> >
> > Behavior change:
> >
> > Say a resv_map exists like this [0->1], [2->3], and [5->6].
> >
> > Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> >
> > Old code would generate resv->regions: [0->5], [5->6].
> > New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> > [5->6].
> >
> > Special care needs to be taken to handle the resv->adds_in_progress
> > variable correctly. In the past, only 1 region would be added for every
> > region_chg and region_add call. But now, each call may add multiple
> > regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> > or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
> > region_chg calls add_reservation_in_range() to count the number of regions
> > needed and allocates those, and that info is passed to region_add and
> > region_abort to decrement adds_in_progress correctly.
>
> Hate to throw more theoretical examples at you but ...
>
> Consider an existing reserv_map like [3-10]
> Then a region_chg/add call comes in region_chg/add(f=0, t=10).
> The region_chg is going to return 3 (additional reservations needed), and
> also out_regions_needed = 1 as it would want to create a region [0-3].
> Correct?
> But, there is nothing to prevent another thread from doing a region_del [5-7]
> after the region_chg and before region_add.  Correct?
> If so, it seems the region_add would need to create two regions, but there
> is only one in the cache and we would BUG in get_file_region_entry_from_cache.
> Am I reading the code correctly?
>
> The existing code wants to make sure region_add called after region_chg will
> never return error.  This is why all needed allocations were done in the
> region_chg call, and it was relatively easy to do in existing code when
> region_chg would only need one additional region at most.
>
> I'm thinking that we may have to make region_chg allocate the worst case
> number of regions (t - f)/2, OR change to the code such that region_add
> could return an error.

Yep you are right, I missed reasoning about the region_del punch hole
into the reservations case. Let me consider these 2 options.

> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  2019-09-10 23:31 ` [PATCH v4 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations Mina Almasry
@ 2019-09-17  1:29   ` shuah
  0 siblings, 0 replies; 19+ messages in thread
From: shuah @ 2019-09-17  1:29 UTC (permalink / raw)
  To: Mina Almasry, mike.kravetz
  Cc: rientjes, shakeelb, gthelen, akpm, khalid.aziz, linux-kernel,
	linux-mm, linux-kselftest, cgroups, aneesh.kumar, mkoutny, shuah

On 9/10/19 5:31 PM, Mina Almasry wrote:
> Augements hugetlb_cgroup_charge_cgroup to be able to charge hugetlb
> usage or hugetlb reservation counter.
> 

Augments?

> Adds a new interface to uncharge a hugetlb_cgroup counter via
> hugetlb_cgroup_uncharge_counter.
> 
> Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
> hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
>   include/linux/hugetlb_cgroup.h | 13 ++++--
>   mm/hugetlb.c                   |  6 ++-
>   mm/hugetlb_cgroup.c            | 82 +++++++++++++++++++++++++++-------
>   3 files changed, 80 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 063962f6dfc6a..c467715dd8fb8 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -52,14 +52,19 @@ static inline bool hugetlb_cgroup_disabled(void)
>   }
> 
>   extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
> -					struct hugetlb_cgroup **ptr);
> +					struct hugetlb_cgroup **ptr,
> +					bool reserved);
>   extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>   					 struct hugetlb_cgroup *h_cg,
>   					 struct page *page);
>   extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>   					 struct page *page);
>   extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> -					   struct hugetlb_cgroup *h_cg);
> +					   struct hugetlb_cgroup *h_cg,
> +					   bool reserved);
> +extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
> +					    unsigned long nr_pages);
> +
>   extern void hugetlb_cgroup_file_init(void) __init;
>   extern void hugetlb_cgroup_migrate(struct page *oldhpage,
>   				   struct page *newhpage);
> @@ -83,7 +88,7 @@ static inline bool hugetlb_cgroup_disabled(void)
> 
>   static inline int
>   hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
> -			     struct hugetlb_cgroup **ptr)
> +			     struct hugetlb_cgroup **ptr, bool reserved)
Please line the arguments.

>   {
>   	return 0;
>   }
> @@ -102,7 +107,7 @@ hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)
> 
>   static inline void
>   hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> -			       struct hugetlb_cgroup *h_cg)
> +			       struct hugetlb_cgroup *h_cg, bool reserved)

Same here.

>   {
>   }
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6d7296dd11b83..e975f55aede94 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2078,7 +2078,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>   			gbl_chg = 1;
>   	}
> 
> -	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> +	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
> +					   false);
>   	if (ret)
>   		goto out_subpool_put;
> 
> @@ -2126,7 +2127,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>   	return page;
> 
>   out_uncharge_cgroup:
> -	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
> +	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
> +			false);

Please be consistent with indentation. Line this up like you did above.

>   out_subpool_put:
>   	if (map_chg || avoid_reserve)
>   		hugepage_subpool_put_pages(spool, 1);
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 51a72624bd1ff..2ab36a98d834e 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -38,8 +38,8 @@ struct hugetlb_cgroup {
>   static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
> 
>   static inline
> -struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx,
> -				 bool reserved)
> +struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg,
> +						int idx, bool reserved)

Same here.

>   {
>   	if (reserved)
>   		return  &h_cg->reserved_hugepage[idx];
> @@ -74,8 +74,12 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
>   	int idx;
> 
>   	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
> -		if (page_counter_read(&h_cg->hugepage[idx]))
> +		if (page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx,
> +						true)) ||
> +		    page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx,
> +				    false))) {
>   			return true;
> +		}
>   	}
>   	return false;
>   }
> @@ -86,18 +90,30 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>   	int idx;
> 
>   	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
> -		struct page_counter *counter = &h_cgroup->hugepage[idx];
>   		struct page_counter *parent = NULL;
> +		struct page_counter *reserved_parent = NULL;
>   		unsigned long limit;
>   		int ret;
> 
> -		if (parent_h_cgroup)
> -			parent = &parent_h_cgroup->hugepage[idx];
> -		page_counter_init(counter, parent);
> +		if (parent_h_cgroup) {
> +			parent = hugetlb_cgroup_get_counter(
> +					parent_h_cgroup, idx, false);
> +			reserved_parent = hugetlb_cgroup_get_counter(
> +					parent_h_cgroup, idx, true);
> +		}
> +		page_counter_init(hugetlb_cgroup_get_counter(
> +					h_cgroup, idx, false), parent);
> +		page_counter_init(hugetlb_cgroup_get_counter(
> +					h_cgroup, idx, true),
> +				  reserved_parent);
> 
>   		limit = round_down(PAGE_COUNTER_MAX,
>   				   1 << huge_page_order(&hstates[idx]));
> -		ret = page_counter_set_max(counter, limit);
> +
> +		ret = page_counter_set_max(hugetlb_cgroup_get_counter(
> +					h_cgroup, idx, false), limit);
> +		ret = page_counter_set_max(hugetlb_cgroup_get_counter(
> +					h_cgroup, idx, true), limit);
>   		VM_BUG_ON(ret);
>   	}
>   }
> @@ -127,6 +143,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
>   	kfree(h_cgroup);
>   }
> 
> +static void hugetlb_cgroup_move_parent_reservation(int idx,
> +						   struct hugetlb_cgroup *h_cg)
> +{
> +	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
> +
> +	/* Move the reservation counters. */
> +	if (!parent_hugetlb_cgroup(h_cg)) {
> +		parent = root_h_cgroup;
> +		/* root has no limit */
> +		page_counter_charge(
> +				&root_h_cgroup->reserved_hugepage[idx],
> +				page_counter_read(hugetlb_cgroup_get_counter(
> +						h_cg, idx, true)));
> +	}
> +
> +	/* Take the pages off the local counter */
> +	page_counter_cancel(hugetlb_cgroup_get_counter(h_cg, idx, true),
> +			    page_counter_read(hugetlb_cgroup_get_counter(h_cg,
> +					    idx, true)));
> +}
> 
>   /*
>    * Should be called with hugetlb_lock held.
> @@ -181,6 +217,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
>   	do {
>   		for_each_hstate(h) {
>   			spin_lock(&hugetlb_lock);
> +			hugetlb_cgroup_move_parent_reservation(idx, h_cg);
>   			list_for_each_entry(page, &h->hugepage_activelist, lru)
>   				hugetlb_cgroup_move_parent(idx, h_cg, page);
> 
> @@ -192,7 +229,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
>   }
> 
>   int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
> -				 struct hugetlb_cgroup **ptr)
> +				 struct hugetlb_cgroup **ptr, bool reserved)
>   {
>   	int ret = 0;
>   	struct page_counter *counter;
> @@ -215,8 +252,11 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
>   	}
>   	rcu_read_unlock();
> 
> -	if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
> +	if (!page_counter_try_charge(hugetlb_cgroup_get_counter(h_cg, idx,
> +								reserved),
> +				     nr_pages, &counter)) {
>   		ret = -ENOMEM;
> +	}
>   	css_put(&h_cg->css);
>   done:
>   	*ptr = h_cg;
> @@ -250,12 +290,14 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>   	if (unlikely(!h_cg))
>   		return;
>   	set_hugetlb_cgroup(page, NULL);
> -	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> +	page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, false),
> +			nr_pages);
> +
>   	return;
>   }
> 
>   void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> -				    struct hugetlb_cgroup *h_cg)
> +				    struct hugetlb_cgroup *h_cg, bool reserved)
>   {
>   	if (hugetlb_cgroup_disabled() || !h_cg)
>   		return;
> @@ -263,8 +305,17 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>   	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
>   		return;
> 
> -	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> -	return;
> +	page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved),
> +			nr_pages);
> +}
> +
> +void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
> +				     unsigned long nr_pages)
> +{
> +	if (hugetlb_cgroup_disabled() || !p)
> +		return;
> +
> +	page_counter_uncharge(p, nr_pages);
>   }
> 
>   static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> @@ -326,7 +377,8 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>   		/* Fall through. */
>   	case HUGETLB_RES_LIMIT:
>   		mutex_lock(&hugetlb_limit_mutex);
> -		ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx, reserved),
> +		ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx,
> +								      reserved),
>   					   nr_pages);
>   		mutex_unlock(&hugetlb_limit_mutex);
>   		break;
> --
> 2.23.0.162.g0b9fbb3734-goog
> 

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  2019-09-10 23:31 ` [PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests Mina Almasry
@ 2019-09-17  1:52   ` shuah
  2019-09-19  1:53     ` Mina Almasry
  0 siblings, 1 reply; 19+ messages in thread
From: shuah @ 2019-09-17  1:52 UTC (permalink / raw)
  To: Mina Almasry, mike.kravetz
  Cc: rientjes, shakeelb, gthelen, akpm, khalid.aziz, linux-kernel,
	linux-mm, linux-kselftest, cgroups, aneesh.kumar, mkoutny, shuah

On 9/10/19 5:31 PM, Mina Almasry wrote:
> The tests use both shared and private mapped hugetlb memory, and
> monitors the hugetlb usage counter as well as the hugetlb reservation
> counter. They test different configurations such as hugetlb memory usage
> via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
> MAP_POPULATE.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
>   tools/testing/selftests/vm/.gitignore         |   1 +
>   tools/testing/selftests/vm/Makefile           |   4 +
>   .../selftests/vm/charge_reserved_hugetlb.sh   | 440 ++++++++++++++++++
>   .../selftests/vm/write_hugetlb_memory.sh      |  22 +
>   .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++++++++++
>   5 files changed, 719 insertions(+)
>   create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
>   create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
>   create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c
> 
> diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
> index 31b3c98b6d34d..d3bed9407773c 100644
> --- a/tools/testing/selftests/vm/.gitignore
> +++ b/tools/testing/selftests/vm/.gitignore
> @@ -14,3 +14,4 @@ virtual_address_range
>   gup_benchmark
>   va_128TBswitch
>   map_fixed_noreplace
> +write_to_hugetlbfs
> diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> index 9534dc2bc9295..8d37d5409b52c 100644
> --- a/tools/testing/selftests/vm/Makefile
> +++ b/tools/testing/selftests/vm/Makefile
> @@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
>   TEST_GEN_FILES += userfaultfd
>   TEST_GEN_FILES += va_128TBswitch
>   TEST_GEN_FILES += virtual_address_range
> +TEST_GEN_FILES += write_to_hugetlbfs
> 
>   TEST_PROGS := run_vmtests
> 
> @@ -29,3 +30,6 @@ include ../lib.mk
>   $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
> 
>   $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
> +
> +# Why does adding $(OUTPUT)/ like above not apply this flag..?

Can you verify the following and remove this comment, once you figure
out if you need $(OUTPUT)/
> +write_to_hugetlbfs: CFLAGS += -static

It should. Did you test "make O=" and "KBUILD_OUTPUT" kselftest
use-cases?

> diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
> new file mode 100755
> index 0000000000000..09e90e8f6fab4
> --- /dev/null
> +++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
> @@ -0,0 +1,440 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +
> +set -e
> +
> +cgroup_path=/dev/cgroup/memory
> +if [[ ! -e $cgroup_path ]]; then
> +      mkdir -p $cgroup_path
> +      mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
> +fi
> +

Does this test need root access? If yes, please add root check
and skip the test when a non-root runs the test.

> +cleanup () {
> +	echo $$ > $cgroup_path/tasks
> +
> +	set +e
> +	if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
> +	      kill -2 write_to_hugetlbfs
> +	      # Wait for hugetlbfs memory to get depleted.
> +	      sleep 0.5

This time looks arbitrary. How can you be sure it gets depleted?
Is there another way to check for it.

> +	fi
> +	set -e
> +
> +	if [[ -e /mnt/huge ]]; then
> +	      rm -rf /mnt/huge/*
> +	      umount /mnt/huge || echo error
> +	      rmdir /mnt/huge
> +	fi
> +	if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
> +	      rmdir $cgroup_path/hugetlb_cgroup_test
> +	fi
> +	if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
> +	      rmdir $cgroup_path/hugetlb_cgroup_test1
> +	fi
> +	if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
> +	      rmdir $cgroup_path/hugetlb_cgroup_test2
> +	fi
> +	echo 0 > /proc/sys/vm/nr_hugepages
> +	echo CLEANUP DONE
> +}
> +
> +cleanup
> +
> +function expect_equal() {
> +      local expected="$1"
> +      local actual="$2"
> +      local error="$3"
> +
> +      if [[ "$expected" != "$actual" ]]; then
> +	    echo "expected ($expected) != actual ($actual): $3"
> +	    cleanup
> +	    exit 1
> +      fi
> +}
> +
> +function setup_cgroup() {
> +      local name="$1"
> +      local cgroup_limit="$2"
> +      local reservation_limit="$3"
> +
> +      mkdir $cgroup_path/$name
> +
> +      echo writing cgroup limit: "$cgroup_limit"
> +      echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
> +
> +      echo writing reseravation limit: "$reservation_limit"
> +      echo "$reservation_limit" > \
> +	    $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
> +      echo 0 > $cgroup_path/$name/cpuset.cpus
> +      echo 0 > $cgroup_path/$name/cpuset.mems
> +}
> +
> +function write_hugetlbfs_and_get_usage() {
> +      local cgroup="$1"
> +      local size="$2"
> +      local populate="$3"
> +      local write="$4"
> +      local path="$5"
> +      local method="$6"
> +      local private="$7"
> +      local expect_failure="$8"
> +
> +      # Function return values.
> +      reservation_failed=0
> +      oom_killed=0
> +      hugetlb_difference=0
> +      reserved_difference=0
> +
> +      local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes
> +      local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes
> +
> +      local hugetlb_before=$(cat $hugetlb_usage)
> +      local reserved_before=$(cat $reserved_usage)
> +
> +      echo
> +      echo Starting:
> +      echo hugetlb_usage="$hugetlb_before"
> +      echo reserved_usage="$reserved_before"
> +      echo expect_failure is "$expect_failure"
> +
> +      set +e
> +      if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \
> +	    [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
> +	    bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
> +		  "$cgroup"  "$path" "$method" "$private" "-l" &
> +
> +	    local write_result=$?
> +	    # This sleep is to make sure that the script above has had enough
> +	    # time to do its thing, since it runs in the background. This may
> +	    # cause races...
> +	    sleep 0.5

I am not happy with these arbitrary sleep times, especially coupled with
the comment about races above. :)

> +	    echo write_result is $write_result
> +      else
> +	    bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
> +		  "$cgroup"  "$path" "$method" "$private"
> +	    local write_result=$?
> +      fi
> +      set -e
> +
> +      if [[ "$write_result" == 1 ]]; then
> +	    reservation_failed=1
> +      fi
> +
> +      # On linus/master, the above process gets SIGBUS'd on oomkill, with
> +      # return code 135. On earlier kernels, it gets actual oomkill, with return
> +      # code 137, so just check for both conditions incase we're testing against

in case

> +      # an earlier kernel.
> +      if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then

Please add defines for these return values.

> +	    oom_killed=1
> +      fi
> +
> +      local hugetlb_after=$(cat $hugetlb_usage)
> +      local reserved_after=$(cat $reserved_usage)
> +
> +      echo After write:
> +      echo hugetlb_usage="$hugetlb_after"
> +      echo reserved_usage="$reserved_after"
> +
> +      hugetlb_difference=$(($hugetlb_after - $hugetlb_before))
> +      reserved_difference=$(($reserved_after - $reserved_before))
> +}
> +
> +function cleanup_hugetlb_memory() {
> +      set +e
> +      if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
> +	    echo kiling write_to_hugetlbfs
> +	    killall -2 write_to_hugetlbfs
> +	    # Wait for hugetlbfs memory to get depleted.
> +	    sleep 0.5

Sleep time? Rationale for this number?

> +      fi
> +      set -e
> +
> +      if [[ -e /mnt/huge ]]; then
> +	    rm -rf /mnt/huge/*
> +	      umount /mnt/huge
> +	      rmdir /mnt/huge
> +      fi
> +}
> +
> +function run_test() {
> +      local size="$1"
> +      local populate="$2"
> +      local write="$3"
> +      local cgroup_limit="$4"
> +      local reservation_limit="$5"
> +      local nr_hugepages="$6"
> +      local method="$7"
> +      local private="$8"
> +      local expect_failure="$9"
> +
> +      # Function return values.
> +      hugetlb_difference=0
> +      reserved_difference=0
> +      reservation_failed=0
> +      oom_killed=0
> +
> +      echo nr hugepages = "$nr_hugepages"
> +      echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
> +
> +      setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit"
> +
> +      mkdir -p /mnt/huge
> +      mount -t hugetlbfs \
> +	    -o pagesize=2M,size=256M none /mnt/huge
> +
> +      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \
> +	    "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure"
> +
> +      cleanup_hugetlb_memory
> +
> +      local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.usage_in_bytes)
> +      local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.reservation_usage_in_bytes)
> +
> +      expect_equal "0" "$final_hugetlb" "final hugetlb is not zero"
> +      expect_equal "0" "$final_reservation" "final reservation is not zero"
> +}
> +
> +function run_multiple_cgroup_test() {
> +      local size1="$1"
> +      local populate1="$2"
> +      local write1="$3"
> +      local cgroup_limit1="$4"
> +      local reservation_limit1="$5"
> +
> +      local size2="$6"
> +      local populate2="$7"
> +      local write2="$8"
> +      local cgroup_limit2="$9"
> +      local reservation_limit2="${10}"
> +
> +      local nr_hugepages="${11}"
> +      local method="${12}"
> +      local private="${13}"
> +      local expect_failure="${14}"
> +
> +      # Function return values.
> +      hugetlb_difference1=0
> +      reserved_difference1=0
> +      reservation_failed1=0
> +      oom_killed1=0
> +
> +      hugetlb_difference2=0
> +      reserved_difference2=0
> +      reservation_failed2=0
> +      oom_killed2=0
> +
> +
> +      echo nr hugepages = "$nr_hugepages"
> +      echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
> +
> +      setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1"
> +      setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2"
> +
> +      mkdir -p /mnt/huge
> +      mount -t hugetlbfs \
> +	    -o pagesize=2M,size=256M none /mnt/huge
> +
> +      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \
> +	    "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \
> +	    "$expect_failure"
> +
> +      hugetlb_difference1=$hugetlb_difference
> +      reserved_difference1=$reserved_difference
> +      reservation_failed1=$reservation_failed
> +      oom_killed1=$oom_killed
> +
> +      local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.usage_in_bytes
> +      local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.reservation_usage_in_bytes
> +      local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.usage_in_bytes
> +      local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.reservation_usage_in_bytes
> +
> +      local usage_before_second_write=$(cat $cgroup1_hugetlb_usage)
> +      local reservation_usage_before_second_write=$(cat \
> +	    $cgroup1_reservation_usage)
> +
> +      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \
> +	    "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \
> +	    "$expect_failure"
> +
> +      hugetlb_difference2=$hugetlb_difference
> +      reserved_difference2=$reserved_difference
> +      reservation_failed2=$reservation_failed
> +      oom_killed2=$oom_killed
> +
> +      expect_equal "$usage_before_second_write" \
> +	    "$(cat $cgroup1_hugetlb_usage)" "Usage changed."
> +      expect_equal "$reservation_usage_before_second_write" \
> +	    "$(cat $cgroup1_reservation_usage)" "Reservation usage changed."
> +
> +      cleanup_hugetlb_memory
> +
> +      local final_hugetlb=$(cat $cgroup1_hugetlb_usage)
> +      local final_reservation=$(cat $cgroup1_reservation_usage)
> +
> +      expect_equal "0" "$final_hugetlb" \
> +	    "hugetlbt_cgroup_test1 final hugetlb is not zero"
> +      expect_equal "0" "$final_reservation" \
> +	    "hugetlbt_cgroup_test1 final reservation is not zero"
> +
> +      local final_hugetlb=$(cat $cgroup2_hugetlb_usage)
> +      local final_reservation=$(cat $cgroup2_reservation_usage)
> +
> +      expect_equal "0" "$final_hugetlb" \
> +	    "hugetlb_cgroup_test2 final hugetlb is not zero"
> +      expect_equal "0" "$final_reservation" \
> +	    "hugetlb_cgroup_test2 final reservation is not zero"
> +}
> +
> +for private in "" "-r" ; do
> +for populate in  "" "-o"; do
> +for method in 0 1 2; do
> +
> +# Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported.
> +if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then
> +      continue
> +fi
> +
> +# Skip populated shmem tests. Doesn't seem to be supported.
> +if [[ "$method" == 2"" ]] && [[ "$populate" == "-o" ]]; then
> +      continue
> +fi
> +
> +cleanup
> +echo
> +echo
> +echo
> +echo Test normal case.
> +echo private=$private, populate=$populate, method=$method
> +run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \
> +      $((20 * 1024 * 1024)) 10 "$method" "$private" "0"
> +
> +echo Memory charged to hugtlb=$hugetlb_difference
> +echo Memory charged to reservation=$reserved_difference
> +
> +if [[ "$populate" == "-o" ]]; then
> +      expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
> +	    "Reserved memory charged to hugetlb cgroup."
> +else
> +      expect_equal "0" "$hugetlb_difference" \
> +	    "Reserved memory charged to hugetlb cgroup."
> +fi
> +
> +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
> +      "Reserved memory not charged to reservation usage."
> +echo 'PASS'
> +
> +cleanup
> +echo
> +echo
> +echo
> +echo Test normal case with write.
> +echo private=$private, populate=$populate, method=$method
> +run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \
> +      $((20 * 1024 * 1024)) 10 "$method" "$private" "0"
> +
> +echo Memory charged to hugtlb=$hugetlb_difference
> +echo Memory charged to reservation=$reserved_difference
> +
> +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
> +      "Reserved memory charged to hugetlb cgroup."
> +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
> +      "Reserved memory not charged to reservation usage."
> +echo 'PASS'
> +
> +
> +cleanup
> +echo
> +echo
> +echo
> +echo Test more than reservation case.
> +echo private=$private, populate=$populate, method=$method
> +run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \
> +      "$((5 * 1024 * 1024))" "10" "$method" "$private" "1"
> +
> +expect_equal "1" "$reservation_failed" "Reservation succeeded."
> +echo 'PASS'
> +
> +cleanup
> +
> +echo
> +echo
> +echo
> +echo Test more than cgroup limit case.
> +echo private=$private, populate=$populate, method=$method
> +
> +# Not sure if shm memory can be cleaned up when the process gets sigbus'd.
> +if [[ "$method" != 2 ]]; then
> +      run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \
> +	    $((20 * 1024 * 1024)) 10 "$method" "$private" "1"
> +
> +      expect_equal "1" "$oom_killed" "Not oom killed."
> +fi
> +echo 'PASS'
> +
> +cleanup
> +
> +echo
> +echo
> +echo
> +echo Test normal case, multiple cgroups.
> +echo private=$private, populate=$populate, method=$method
> +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \
> +      "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
> +      "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
> +      "$method" "$private" "0"
> +
> +echo Memory charged to hugtlb1=$hugetlb_difference1
> +echo Memory charged to reservation1=$reserved_difference1
> +echo Memory charged to hugtlb2=$hugetlb_difference2
> +echo Memory charged to reservation2=$reserved_difference2
> +
> +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
> +      "Incorrect reservations charged to cgroup 1."
> +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
> +      "Incorrect reservation charged to cgroup 2."
> +if [[ "$populate" == "-o" ]]; then
> +      expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
> +	    "Incorrect hugetlb charged to cgroup 1."
> +      expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
> +	    "Incorrect hugetlb charged to cgroup 2."
> +else
> +      expect_equal "0" "$hugetlb_difference1" \
> +	    "Incorrect hugetlb charged to cgroup 1."
> +      expect_equal "0" "$hugetlb_difference2" \
> +	    "Incorrect hugetlb charged to cgroup 2."
> +fi
> +echo 'PASS'
> +
> +cleanup
> +echo
> +echo
> +echo
> +echo Test normal case with write, multiple cgroups.
> +echo private=$private, populate=$populate, method=$method
> +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \
> +      "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
> +      "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
> +      "$method" "$private" "0"
> +
> +echo Memory charged to hugtlb1=$hugetlb_difference1
> +echo Memory charged to reservation1=$reserved_difference1
> +echo Memory charged to hugtlb2=$hugetlb_difference2
> +echo Memory charged to reservation2=$reserved_difference2
> +
> +expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
> +      "Incorrect hugetlb charged to cgroup 1."
> +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
> +      "Incorrect reservation charged to cgroup 1."
> +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
> +      "Incorrect hugetlb charged to cgroup 2."
> +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
> +      "Incorrected reservation charged to cgroup 2."
> +
> +echo 'PASS'
> +
> +done # private
> +done # populate
> +done # method
> +
> +umount $cgroup_path
> +rmdir $cgroup_path
> diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh
> new file mode 100644
> index 0000000000000..08f5fa5527cfd
> --- /dev/null
> +++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh
> @@ -0,0 +1,22 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +
> +set -e
> +
> +size=$1
> +populate=$2
> +write=$3
> +cgroup=$4
> +path=$5
> +method=$6
> +private=$7
> +want_sleep=$8
> +
> +echo "Putting task in cgroup '$cgroup'"
> +echo $$ > /dev/cgroup/memory/"$cgroup"/tasks
> +
> +echo "Method is $method"
> +
> +set +e
> +./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \
> +      "$private" "$want_sleep"
> diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> new file mode 100644
> index 0000000000000..f02a897427a97
> --- /dev/null
> +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> @@ -0,0 +1,252 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This program reserves and uses hugetlb memory, supporting a bunch of
> + * scenorios needed by the charged_reserved_hugetlb.sh test.

Spelling?

> + */
> +
> +#include <err.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <fcntl.h>
> +#include <sys/types.h>
> +#include <sys/shm.h>
> +#include <sys/stat.h>
> +#include <sys/mman.h>
> +
> +/* Global definitions. */
> +enum method {
> +	HUGETLBFS,
> +	MMAP_MAP_HUGETLB,
> +	SHM,
> +	MAX_METHOD
> +};
> +
> +
> +/* Global variables. */
> +static const char *self;
> +static char *shmaddr;
> +static int shmid;
> +
> +/*
> + * Show usage and exit.
> + */
> +static void exit_usage(void)
> +{
> +
> +	printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
> +		"[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
> +		"[-o] [-w]\n", self);
> +	exit(EXIT_FAILURE);
> +}
> +
> +void sig_handler(int signo)
> +{
> +	printf("Received %d.\n", signo);
> +	if (signo == SIGINT) {
> +		printf("Deleting the memory\n");
> +		if (shmdt((const void *)shmaddr) != 0) {
> +			perror("Detach failure");
> +			shmctl(shmid, IPC_RMID, NULL);
> +			exit(4);

Is this a skip error code? Please note that kselftest framework
interprets this as a skipped test when returb value is 4.

> +		}
> +
> +		shmctl(shmid, IPC_RMID, NULL);
> +		printf("Done deleting the memory\n");
> +	}
> +	exit(2);

What about this one? What does exit code 2 mean?

> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int fd = 0;
> +	int key = 0;
> +	int *ptr = NULL;
> +	int c = 0;
> +	int size = 0;
> +	char path[256] = "";
> +	enum method method = MAX_METHOD;
> +	int want_sleep = 0, private = 0;
> +	int populate = 0;
> +	int write = 0;
> +
> +	unsigned long i;
> +
> +
> +	if (signal(SIGINT, sig_handler) == SIG_ERR)
> +		err(1, "\ncan't catch SIGINT\n");
> +
> +	/* Parse command-line arguments. */
> +	setvbuf(stdout, NULL, _IONBF, 0);
> +	self = argv[0];
> +
> +	while ((c = getopt(argc, argv, "s:p:m:owlr")) != -1) {
> +		switch (c) {
> +		case 's':
> +			size = atoi(optarg);
> +			break;
> +		case 'p':
> +			strncpy(path, optarg, sizeof(path));
> +			break;
> +		case 'm':
> +			if (atoi(optarg) >= MAX_METHOD) {
> +				errno = EINVAL;
> +				perror("Invalid -m.");
> +				exit_usage();
> +			}
> +			method = atoi(optarg);
> +			break;
> +		case 'o':
> +			populate = 1;
> +			break;
> +		case 'w':
> +			write = 1;
> +			break;
> +		case 'l':
> +			want_sleep = 1;
> +			break;
> +		case 'r':
> +			private = 1;
> +			break;
> +		default:
> +			errno = EINVAL;
> +			perror("Invalid arg");
> +			exit_usage();
> +		}
> +	}
> +
> +	if (strncmp(path, "", sizeof(path)) != 0) {
> +		printf("Writing to this path: %s\n", path);
> +	} else {
> +		errno = EINVAL;
> +		perror("path not found");
> +		exit_usage();
> +	}
> +
> +	if (size != 0) {
> +		printf("Writing this size: %d\n", size);
> +	} else {
> +		errno = EINVAL;
> +		perror("size not found");
> +		exit_usage();
> +	}
> +
> +	if (!populate)
> +		printf("Not populating.\n");
> +	else
> +		printf("Populating.\n");
> +
> +	if (!write)
> +		printf("Not writing to memory.\n");
> +
> +	if (method == MAX_METHOD) {
> +		errno = EINVAL;
> +		perror("-m Invalid");
> +		exit_usage();
> +	} else
> +		printf("Using method=%d\n", method);
> +
> +	if (!private)
> +		printf("Shared mapping.\n");
> +	else
> +		printf("Private mapping.\n");
> +
> +
> +	switch (method) {
> +	case HUGETLBFS:
> +		printf("Allocating using HUGETLBFS.\n");
> +		fd = open(path, O_CREAT | O_RDWR, 0777);
> +		if (fd == -1)
> +			err(1, "Failed to open file.");
> +
> +		ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> +			(private ? MAP_PRIVATE : MAP_SHARED) | (populate ?
> +				MAP_POPULATE : 0), fd, 0);
> +
> +		if (ptr == MAP_FAILED) {
> +			close(fd);
> +			err(1, "Error mapping the file");
> +		}
> +		break;
> +	case MMAP_MAP_HUGETLB:
> +		printf("Allocating using MAP_HUGETLB.\n");
> +		ptr = mmap(NULL, size,
> +		PROT_READ | PROT_WRITE,
> +		(private ? (MAP_PRIVATE | MAP_ANONYMOUS) : MAP_SHARED) |
> +		MAP_HUGETLB | (populate ?
> +			MAP_POPULATE : 0),
> +		-1, 0);
> +
> +		if (ptr == MAP_FAILED)
> +			err(1, "mmap");
> +
> +		printf("Returned address is %p\n", ptr);
> +		break;
> +	case SHM:
> +		printf("Allocating using SHM.\n");
> +		shmid = shmget(key, size, SHM_HUGETLB | IPC_CREAT | SHM_R |
> +				SHM_W);
> +		if (shmid < 0) {
> +			shmid = shmget(++key, size, SHM_HUGETLB | IPC_CREAT |
> +					SHM_R | SHM_W);
> +			if (shmid < 0)
> +				err(1, "shmget");
> +
> +		}
> +		printf("shmid: 0x%x, shmget key:%d\n", shmid, key);
> +
> +		shmaddr = shmat(shmid, NULL, 0);
> +		if (shmaddr == (char *)-1) {
> +			perror("Shared memory attach failure");
> +			shmctl(shmid, IPC_RMID, NULL);
> +			exit(2);
> +		}
> +		printf("shmaddr: %p\n", shmaddr);
> +
> +		break;
> +	default:
> +		errno = EINVAL;
> +		err(1, "Invalid method.");
> +	}
> +
> +	if (write) {
> +		printf("Writing to memory.\n");
> +		if (method != SHM) {
> +			memset(ptr, 1, size);
> +		} else {
> +			printf("Starting the writes:\n");
> +			for (i = 0; i < size; i++) {
> +				shmaddr[i] = (char)(i);
> +				if (!(i % (1024 * 1024)))
> +					printf(".");
> +			}
> +			printf("\n");
> +
> +			printf("Starting the Check...");
> +			for (i = 0; i < size; i++)
> +				if (shmaddr[i] != (char)i) {
> +					printf("\nIndex %lu mismatched\n", i);
> +					exit(3);
> +				}
> +			printf("Done.\n");
> +
> +
> +		}
> +	}
> +
> +	if (want_sleep) {
> +		/* Signal to caller that we're done. */
> +		printf("DONE\n");
> +
> +		/* Hold memory until external kill signal is delivered. */
> +		while (1)
> +			sleep(100);

Please don't add sleeps. This will hold up the kselftest run.

> +	}
> +
> +	close(fd);

Is this close() necessary in all cases? Looks like MMAP_MAP_HUGETLB
is the only case that opens it.

I am not sure if the error legs are correct.

> +
> +	return 0;
> +}
> --
> 2.23.0.162.g0b9fbb3734-goog
> 

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs
  2019-09-10 23:31 ` [PATCH v4 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs Mina Almasry
@ 2019-09-17  1:58   ` shuah
  0 siblings, 0 replies; 19+ messages in thread
From: shuah @ 2019-09-17  1:58 UTC (permalink / raw)
  To: Mina Almasry, mike.kravetz
  Cc: rientjes, shakeelb, gthelen, akpm, khalid.aziz, linux-kernel,
	linux-mm, linux-kselftest, cgroups, aneesh.kumar, mkoutny,
	Hillf Danton, shuah

On 9/10/19 5:31 PM, Mina Almasry wrote:
> Add docs for how to use hugetlb_cgroup reservations, and their behavior.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Acked-by: Hillf Danton <hdanton@sina.com>
> ---
>   .../admin-guide/cgroup-v1/hugetlb.rst         | 84 ++++++++++++++++---
>   1 file changed, 73 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> index a3902aa253a96..cc6eb859fc722 100644
> --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> @@ -2,13 +2,6 @@
>   HugeTLB Controller
>   ==================
> 
> -The HugeTLB controller allows to limit the HugeTLB usage per control group and
> -enforces the controller limit during page fault. Since HugeTLB doesn't
> -support page reclaim, enforcing the limit at page fault time implies that,
> -the application will get SIGBUS signal if it tries to access HugeTLB pages
> -beyond its limit. This requires the application to know beforehand how much
> -HugeTLB pages it would require for its use.
> -
>   HugeTLB controller can be created by first mounting the cgroup filesystem.
> 
>   # mount -t cgroup -o hugetlb none /sys/fs/cgroup
> @@ -28,10 +21,14 @@ process (bash) into it.
> 
>   Brief summary of control files::
> 
> - hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
> - hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
> - hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
> - hugetlb.<hugepagesize>.failcnt		   # show the number of allocation failure due to HugeTLB limit
> + hugetlb.<hugepagesize>.reservation_limit_in_bytes     # set/show limit of "hugepagesize" hugetlb reservations
> + hugetlb.<hugepagesize>.reservation_max_usage_in_bytes # show max "hugepagesize" hugetlb reservations recorded
> + hugetlb.<hugepagesize>.reservation_usage_in_bytes     # show current reservations for "hugepagesize" hugetlb
> + hugetlb.<hugepagesize>.reservation_failcnt            # show the number of allocation failure due to HugeTLB reservation limit
> + hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb faults
> + hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
> + hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
> + hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
> 
>   For a system supporting three hugepage sizes (64k, 32M and 1G), the control
>   files include::
> @@ -40,11 +37,76 @@ files include::
>     hugetlb.1GB.max_usage_in_bytes
>     hugetlb.1GB.usage_in_bytes
>     hugetlb.1GB.failcnt
> +  hugetlb.1GB.reservation_limit_in_bytes
> +  hugetlb.1GB.reservation_max_usage_in_bytes
> +  hugetlb.1GB.reservation_usage_in_bytes
> +  hugetlb.1GB.reservation_failcnt
>     hugetlb.64KB.limit_in_bytes
>     hugetlb.64KB.max_usage_in_bytes
>     hugetlb.64KB.usage_in_bytes
>     hugetlb.64KB.failcnt
> +  hugetlb.64KB.reservation_limit_in_bytes
> +  hugetlb.64KB.reservation_max_usage_in_bytes
> +  hugetlb.64KB.reservation_usage_in_bytes
> +  hugetlb.64KB.reservation_failcnt
>     hugetlb.32MB.limit_in_bytes
>     hugetlb.32MB.max_usage_in_bytes
>     hugetlb.32MB.usage_in_bytes
>     hugetlb.32MB.failcnt
> +  hugetlb.32MB.reservation_limit_in_bytes
> +  hugetlb.32MB.reservation_max_usage_in_bytes
> +  hugetlb.32MB.reservation_usage_in_bytes
> +  hugetlb.32MB.reservation_failcnt
> +
> +
> +1. Reservation limits
> +
> +The HugeTLB controller allows to limit the HugeTLB reservations per control
> +group and enforces the controller limit at reservation time. Reservation limits
> +are superior to Page fault limits (see section 2), since Reservation limits are
> +enforced at reservation time, and never causes the application to get SIGBUS
> +signal. Instead, if the application is violating its limits, then it gets an
> +error on reservation time, i.e. the mmap or shmget return an error.
> +
> +
> +2. Page fault limits
> +
> +The HugeTLB controller allows to limit the HugeTLB usage (page fault) per
> +control group and enforces the controller limit during page fault. Since HugeTLB
> +doesn't support page reclaim, enforcing the limit at page fault time implies
> +that, the application will get SIGBUS signal if it tries to access HugeTLB
> +pages beyond its limit. This requires the application to know beforehand how
> +much HugeTLB pages it would require for its use.
> +
> +
> +3. Caveats with shared memory
> +
> +a. Charging and uncharging:
> +
> +For shared hugetlb memory, both hugetlb reservation and usage (page faults) are
> +charged to the first task that causes the memory to be reserved or faulted,
> +and all subsequent uses of this reserved or faulted memory is done without
> +charging.
> +
> +Shared hugetlb memory is only uncharged when it is unreseved or deallocated.

Spelling?

> +This is usually when the hugetlbfs file is deleted, and not when the task that
> +caused the reservation or fault has exited.
> +
> +b. Interaction between reservation limit and fault limit.
> +
> +Generally, it's not recommended to set both of the reservation limit and fault
> +limit in a cgroup. For private memory, the fault usage cannot exceed the
> +reservation usage, so if you set both, one of those limits will be useless.
> +

Is this enforced? What happens when attempt is made to set fault limit
on a cgroup that has reservation limit and vice versa.

> +For shared memory, a cgroup's fault usage may be greater than its reservation
> +usage, so some care needs to be taken. Consider this example:
> +
> +- Task A reserves 4 pages in a shared hugetlbfs file. Cgroup A will get
> +  4 reservations charged to it and no faults charged to it.
> +- Task B reserves and faults the same 4 pages as Task A. Cgroup B will get no
> +  reservation charge, but will get charged 4 faulted pages. If Cgroup B's limit
> +  is less than 4, then Task B will get a SIGBUS.
> +
> +For the above scenario, it's not recommended for the userspace to set both
> +reservation limits and fault limits, but it is still allowed to in case it sees
> +some use for it.

What would be the scenarios where setting both could be useful? Please 
explain.
> --
> 2.23.0.162.g0b9fbb3734-goog
> 

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  2019-09-17  1:52   ` shuah
@ 2019-09-19  1:53     ` Mina Almasry
  0 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2019-09-19  1:53 UTC (permalink / raw)
  To: shuah
  Cc: Mike Kravetz, David Rientjes, Shakeel Butt, Greg Thelen,
	Andrew Morton, khalid.aziz, open list, linux-mm, linux-kselftest,
	cgroups, Aneesh Kumar, Michal Koutný

On Mon, Sep 16, 2019 at 6:52 PM shuah <shuah@kernel.org> wrote:
>
> On 9/10/19 5:31 PM, Mina Almasry wrote:
> > The tests use both shared and private mapped hugetlb memory, and
> > monitors the hugetlb usage counter as well as the hugetlb reservation
> > counter. They test different configurations such as hugetlb memory usage
> > via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
> > MAP_POPULATE.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > ---
> >   tools/testing/selftests/vm/.gitignore         |   1 +
> >   tools/testing/selftests/vm/Makefile           |   4 +
> >   .../selftests/vm/charge_reserved_hugetlb.sh   | 440 ++++++++++++++++++
> >   .../selftests/vm/write_hugetlb_memory.sh      |  22 +
> >   .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++++++++++
> >   5 files changed, 719 insertions(+)
> >   create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
> >   create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
> >   create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c
> >
> > diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
> > index 31b3c98b6d34d..d3bed9407773c 100644
> > --- a/tools/testing/selftests/vm/.gitignore
> > +++ b/tools/testing/selftests/vm/.gitignore
> > @@ -14,3 +14,4 @@ virtual_address_range
> >   gup_benchmark
> >   va_128TBswitch
> >   map_fixed_noreplace
> > +write_to_hugetlbfs
> > diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> > index 9534dc2bc9295..8d37d5409b52c 100644
> > --- a/tools/testing/selftests/vm/Makefile
> > +++ b/tools/testing/selftests/vm/Makefile
> > @@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
> >   TEST_GEN_FILES += userfaultfd
> >   TEST_GEN_FILES += va_128TBswitch
> >   TEST_GEN_FILES += virtual_address_range
> > +TEST_GEN_FILES += write_to_hugetlbfs
> >
> >   TEST_PROGS := run_vmtests
> >
> > @@ -29,3 +30,6 @@ include ../lib.mk
> >   $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
> >
> >   $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
> > +
> > +# Why does adding $(OUTPUT)/ like above not apply this flag..?
>
> Can you verify the following and remove this comment, once you figure
> out if you need $(OUTPUT)/
> > +write_to_hugetlbfs: CFLAGS += -static
>
> It should. Did you test "make O=" and "KBUILD_OUTPUT" kselftest
> use-cases?
>

Turns out I don't need -static actually.

> > diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
> > new file mode 100755
> > index 0000000000000..09e90e8f6fab4
> > --- /dev/null
> > +++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
> > @@ -0,0 +1,440 @@
> > +#!/bin/sh
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +set -e
> > +
> > +cgroup_path=/dev/cgroup/memory
> > +if [[ ! -e $cgroup_path ]]; then
> > +      mkdir -p $cgroup_path
> > +      mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
> > +fi
> > +
>
> Does this test need root access? If yes, please add root check
> and skip the test when a non-root runs the test.
>
> > +cleanup () {
> > +     echo $$ > $cgroup_path/tasks
> > +
> > +     set +e
> > +     if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
> > +           kill -2 write_to_hugetlbfs
> > +           # Wait for hugetlbfs memory to get depleted.
> > +           sleep 0.5
>
> This time looks arbitrary. How can you be sure it gets depleted?
> Is there another way to check for it.
>
> > +     fi
> > +     set -e
> > +
> > +     if [[ -e /mnt/huge ]]; then
> > +           rm -rf /mnt/huge/*
> > +           umount /mnt/huge || echo error
> > +           rmdir /mnt/huge
> > +     fi
> > +     if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
> > +           rmdir $cgroup_path/hugetlb_cgroup_test
> > +     fi
> > +     if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
> > +           rmdir $cgroup_path/hugetlb_cgroup_test1
> > +     fi
> > +     if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
> > +           rmdir $cgroup_path/hugetlb_cgroup_test2
> > +     fi
> > +     echo 0 > /proc/sys/vm/nr_hugepages
> > +     echo CLEANUP DONE
> > +}
> > +
> > +cleanup
> > +
> > +function expect_equal() {
> > +      local expected="$1"
> > +      local actual="$2"
> > +      local error="$3"
> > +
> > +      if [[ "$expected" != "$actual" ]]; then
> > +         echo "expected ($expected) != actual ($actual): $3"
> > +         cleanup
> > +         exit 1
> > +      fi
> > +}
> > +
> > +function setup_cgroup() {
> > +      local name="$1"
> > +      local cgroup_limit="$2"
> > +      local reservation_limit="$3"
> > +
> > +      mkdir $cgroup_path/$name
> > +
> > +      echo writing cgroup limit: "$cgroup_limit"
> > +      echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
> > +
> > +      echo writing reseravation limit: "$reservation_limit"
> > +      echo "$reservation_limit" > \
> > +         $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
> > +      echo 0 > $cgroup_path/$name/cpuset.cpus
> > +      echo 0 > $cgroup_path/$name/cpuset.mems
> > +}
> > +
> > +function write_hugetlbfs_and_get_usage() {
> > +      local cgroup="$1"
> > +      local size="$2"
> > +      local populate="$3"
> > +      local write="$4"
> > +      local path="$5"
> > +      local method="$6"
> > +      local private="$7"
> > +      local expect_failure="$8"
> > +
> > +      # Function return values.
> > +      reservation_failed=0
> > +      oom_killed=0
> > +      hugetlb_difference=0
> > +      reserved_difference=0
> > +
> > +      local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes
> > +      local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes
> > +
> > +      local hugetlb_before=$(cat $hugetlb_usage)
> > +      local reserved_before=$(cat $reserved_usage)
> > +
> > +      echo
> > +      echo Starting:
> > +      echo hugetlb_usage="$hugetlb_before"
> > +      echo reserved_usage="$reserved_before"
> > +      echo expect_failure is "$expect_failure"
> > +
> > +      set +e
> > +      if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \
> > +         [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
> > +         bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
> > +               "$cgroup"  "$path" "$method" "$private" "-l" &
> > +
> > +         local write_result=$?
> > +         # This sleep is to make sure that the script above has had enough
> > +         # time to do its thing, since it runs in the background. This may
> > +         # cause races...
> > +         sleep 0.5
>
> I am not happy with these arbitrary sleep times, especially coupled with
> the comment about races above. :)
>
> > +         echo write_result is $write_result
> > +      else
> > +         bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
> > +               "$cgroup"  "$path" "$method" "$private"
> > +         local write_result=$?
> > +      fi
> > +      set -e
> > +
> > +      if [[ "$write_result" == 1 ]]; then
> > +         reservation_failed=1
> > +      fi
> > +
> > +      # On linus/master, the above process gets SIGBUS'd on oomkill, with
> > +      # return code 135. On earlier kernels, it gets actual oomkill, with return
> > +      # code 137, so just check for both conditions incase we're testing against
>
> in case
>
> > +      # an earlier kernel.
> > +      if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then
>
> Please add defines for these return values.
>

There is comment that explains this line. Not enough clarity?

> > +         oom_killed=1
> > +      fi
> > +
> > +      local hugetlb_after=$(cat $hugetlb_usage)
> > +      local reserved_after=$(cat $reserved_usage)
> > +
> > +      echo After write:
> > +      echo hugetlb_usage="$hugetlb_after"
> > +      echo reserved_usage="$reserved_after"
> > +
> > +      hugetlb_difference=$(($hugetlb_after - $hugetlb_before))
> > +      reserved_difference=$(($reserved_after - $reserved_before))
> > +}
> > +
> > +function cleanup_hugetlb_memory() {
> > +      set +e
> > +      if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
> > +         echo kiling write_to_hugetlbfs
> > +         killall -2 write_to_hugetlbfs
> > +         # Wait for hugetlbfs memory to get depleted.
> > +         sleep 0.5
>
> Sleep time? Rationale for this number?
>
> > +      fi
> > +      set -e
> > +
> > +      if [[ -e /mnt/huge ]]; then
> > +         rm -rf /mnt/huge/*
> > +           umount /mnt/huge
> > +           rmdir /mnt/huge
> > +      fi
> > +}
> > +
> > +function run_test() {
> > +      local size="$1"
> > +      local populate="$2"
> > +      local write="$3"
> > +      local cgroup_limit="$4"
> > +      local reservation_limit="$5"
> > +      local nr_hugepages="$6"
> > +      local method="$7"
> > +      local private="$8"
> > +      local expect_failure="$9"
> > +
> > +      # Function return values.
> > +      hugetlb_difference=0
> > +      reserved_difference=0
> > +      reservation_failed=0
> > +      oom_killed=0
> > +
> > +      echo nr hugepages = "$nr_hugepages"
> > +      echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
> > +
> > +      setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit"
> > +
> > +      mkdir -p /mnt/huge
> > +      mount -t hugetlbfs \
> > +         -o pagesize=2M,size=256M none /mnt/huge
> > +
> > +      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \
> > +         "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure"
> > +
> > +      cleanup_hugetlb_memory
> > +
> > +      local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.usage_in_bytes)
> > +      local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.reservation_usage_in_bytes)
> > +
> > +      expect_equal "0" "$final_hugetlb" "final hugetlb is not zero"
> > +      expect_equal "0" "$final_reservation" "final reservation is not zero"
> > +}
> > +
> > +function run_multiple_cgroup_test() {
> > +      local size1="$1"
> > +      local populate1="$2"
> > +      local write1="$3"
> > +      local cgroup_limit1="$4"
> > +      local reservation_limit1="$5"
> > +
> > +      local size2="$6"
> > +      local populate2="$7"
> > +      local write2="$8"
> > +      local cgroup_limit2="$9"
> > +      local reservation_limit2="${10}"
> > +
> > +      local nr_hugepages="${11}"
> > +      local method="${12}"
> > +      local private="${13}"
> > +      local expect_failure="${14}"
> > +
> > +      # Function return values.
> > +      hugetlb_difference1=0
> > +      reserved_difference1=0
> > +      reservation_failed1=0
> > +      oom_killed1=0
> > +
> > +      hugetlb_difference2=0
> > +      reserved_difference2=0
> > +      reservation_failed2=0
> > +      oom_killed2=0
> > +
> > +
> > +      echo nr hugepages = "$nr_hugepages"
> > +      echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
> > +
> > +      setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1"
> > +      setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2"
> > +
> > +      mkdir -p /mnt/huge
> > +      mount -t hugetlbfs \
> > +         -o pagesize=2M,size=256M none /mnt/huge
> > +
> > +      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \
> > +         "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \
> > +         "$expect_failure"
> > +
> > +      hugetlb_difference1=$hugetlb_difference
> > +      reserved_difference1=$reserved_difference
> > +      reservation_failed1=$reservation_failed
> > +      oom_killed1=$oom_killed
> > +
> > +      local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.usage_in_bytes
> > +      local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.reservation_usage_in_bytes
> > +      local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.usage_in_bytes
> > +      local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.reservation_usage_in_bytes
> > +
> > +      local usage_before_second_write=$(cat $cgroup1_hugetlb_usage)
> > +      local reservation_usage_before_second_write=$(cat \
> > +         $cgroup1_reservation_usage)
> > +
> > +      write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \
> > +         "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \
> > +         "$expect_failure"
> > +
> > +      hugetlb_difference2=$hugetlb_difference
> > +      reserved_difference2=$reserved_difference
> > +      reservation_failed2=$reservation_failed
> > +      oom_killed2=$oom_killed
> > +
> > +      expect_equal "$usage_before_second_write" \
> > +         "$(cat $cgroup1_hugetlb_usage)" "Usage changed."
> > +      expect_equal "$reservation_usage_before_second_write" \
> > +         "$(cat $cgroup1_reservation_usage)" "Reservation usage changed."
> > +
> > +      cleanup_hugetlb_memory
> > +
> > +      local final_hugetlb=$(cat $cgroup1_hugetlb_usage)
> > +      local final_reservation=$(cat $cgroup1_reservation_usage)
> > +
> > +      expect_equal "0" "$final_hugetlb" \
> > +         "hugetlbt_cgroup_test1 final hugetlb is not zero"
> > +      expect_equal "0" "$final_reservation" \
> > +         "hugetlbt_cgroup_test1 final reservation is not zero"
> > +
> > +      local final_hugetlb=$(cat $cgroup2_hugetlb_usage)
> > +      local final_reservation=$(cat $cgroup2_reservation_usage)
> > +
> > +      expect_equal "0" "$final_hugetlb" \
> > +         "hugetlb_cgroup_test2 final hugetlb is not zero"
> > +      expect_equal "0" "$final_reservation" \
> > +         "hugetlb_cgroup_test2 final reservation is not zero"
> > +}
> > +
> > +for private in "" "-r" ; do
> > +for populate in  "" "-o"; do
> > +for method in 0 1 2; do
> > +
> > +# Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported.
> > +if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then
> > +      continue
> > +fi
> > +
> > +# Skip populated shmem tests. Doesn't seem to be supported.
> > +if [[ "$method" == 2"" ]] && [[ "$populate" == "-o" ]]; then
> > +      continue
> > +fi
> > +
> > +cleanup
> > +echo
> > +echo
> > +echo
> > +echo Test normal case.
> > +echo private=$private, populate=$populate, method=$method
> > +run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \
> > +      $((20 * 1024 * 1024)) 10 "$method" "$private" "0"
> > +
> > +echo Memory charged to hugtlb=$hugetlb_difference
> > +echo Memory charged to reservation=$reserved_difference
> > +
> > +if [[ "$populate" == "-o" ]]; then
> > +      expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
> > +         "Reserved memory charged to hugetlb cgroup."
> > +else
> > +      expect_equal "0" "$hugetlb_difference" \
> > +         "Reserved memory charged to hugetlb cgroup."
> > +fi
> > +
> > +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
> > +      "Reserved memory not charged to reservation usage."
> > +echo 'PASS'
> > +
> > +cleanup
> > +echo
> > +echo
> > +echo
> > +echo Test normal case with write.
> > +echo private=$private, populate=$populate, method=$method
> > +run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \
> > +      $((20 * 1024 * 1024)) 10 "$method" "$private" "0"
> > +
> > +echo Memory charged to hugtlb=$hugetlb_difference
> > +echo Memory charged to reservation=$reserved_difference
> > +
> > +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
> > +      "Reserved memory charged to hugetlb cgroup."
> > +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
> > +      "Reserved memory not charged to reservation usage."
> > +echo 'PASS'
> > +
> > +
> > +cleanup
> > +echo
> > +echo
> > +echo
> > +echo Test more than reservation case.
> > +echo private=$private, populate=$populate, method=$method
> > +run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \
> > +      "$((5 * 1024 * 1024))" "10" "$method" "$private" "1"
> > +
> > +expect_equal "1" "$reservation_failed" "Reservation succeeded."
> > +echo 'PASS'
> > +
> > +cleanup
> > +
> > +echo
> > +echo
> > +echo
> > +echo Test more than cgroup limit case.
> > +echo private=$private, populate=$populate, method=$method
> > +
> > +# Not sure if shm memory can be cleaned up when the process gets sigbus'd.
> > +if [[ "$method" != 2 ]]; then
> > +      run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \
> > +         $((20 * 1024 * 1024)) 10 "$method" "$private" "1"
> > +
> > +      expect_equal "1" "$oom_killed" "Not oom killed."
> > +fi
> > +echo 'PASS'
> > +
> > +cleanup
> > +
> > +echo
> > +echo
> > +echo
> > +echo Test normal case, multiple cgroups.
> > +echo private=$private, populate=$populate, method=$method
> > +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \
> > +      "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
> > +      "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
> > +      "$method" "$private" "0"
> > +
> > +echo Memory charged to hugtlb1=$hugetlb_difference1
> > +echo Memory charged to reservation1=$reserved_difference1
> > +echo Memory charged to hugtlb2=$hugetlb_difference2
> > +echo Memory charged to reservation2=$reserved_difference2
> > +
> > +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
> > +      "Incorrect reservations charged to cgroup 1."
> > +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
> > +      "Incorrect reservation charged to cgroup 2."
> > +if [[ "$populate" == "-o" ]]; then
> > +      expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
> > +         "Incorrect hugetlb charged to cgroup 1."
> > +      expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
> > +         "Incorrect hugetlb charged to cgroup 2."
> > +else
> > +      expect_equal "0" "$hugetlb_difference1" \
> > +         "Incorrect hugetlb charged to cgroup 1."
> > +      expect_equal "0" "$hugetlb_difference2" \
> > +         "Incorrect hugetlb charged to cgroup 2."
> > +fi
> > +echo 'PASS'
> > +
> > +cleanup
> > +echo
> > +echo
> > +echo
> > +echo Test normal case with write, multiple cgroups.
> > +echo private=$private, populate=$populate, method=$method
> > +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \
> > +      "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
> > +      "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
> > +      "$method" "$private" "0"
> > +
> > +echo Memory charged to hugtlb1=$hugetlb_difference1
> > +echo Memory charged to reservation1=$reserved_difference1
> > +echo Memory charged to hugtlb2=$hugetlb_difference2
> > +echo Memory charged to reservation2=$reserved_difference2
> > +
> > +expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
> > +      "Incorrect hugetlb charged to cgroup 1."
> > +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
> > +      "Incorrect reservation charged to cgroup 1."
> > +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
> > +      "Incorrect hugetlb charged to cgroup 2."
> > +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
> > +      "Incorrected reservation charged to cgroup 2."
> > +
> > +echo 'PASS'
> > +
> > +done # private
> > +done # populate
> > +done # method
> > +
> > +umount $cgroup_path
> > +rmdir $cgroup_path
> > diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh
> > new file mode 100644
> > index 0000000000000..08f5fa5527cfd
> > --- /dev/null
> > +++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh
> > @@ -0,0 +1,22 @@
> > +#!/bin/sh
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +set -e
> > +
> > +size=$1
> > +populate=$2
> > +write=$3
> > +cgroup=$4
> > +path=$5
> > +method=$6
> > +private=$7
> > +want_sleep=$8
> > +
> > +echo "Putting task in cgroup '$cgroup'"
> > +echo $$ > /dev/cgroup/memory/"$cgroup"/tasks
> > +
> > +echo "Method is $method"
> > +
> > +set +e
> > +./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \
> > +      "$private" "$want_sleep"
> > diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> > new file mode 100644
> > index 0000000000000..f02a897427a97
> > --- /dev/null
> > +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> > @@ -0,0 +1,252 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * This program reserves and uses hugetlb memory, supporting a bunch of
> > + * scenorios needed by the charged_reserved_hugetlb.sh test.
>
> Spelling?
>
> > + */
> > +
> > +#include <err.h>
> > +#include <errno.h>
> > +#include <signal.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <unistd.h>
> > +#include <fcntl.h>
> > +#include <sys/types.h>
> > +#include <sys/shm.h>
> > +#include <sys/stat.h>
> > +#include <sys/mman.h>
> > +
> > +/* Global definitions. */
> > +enum method {
> > +     HUGETLBFS,
> > +     MMAP_MAP_HUGETLB,
> > +     SHM,
> > +     MAX_METHOD
> > +};
> > +
> > +
> > +/* Global variables. */
> > +static const char *self;
> > +static char *shmaddr;
> > +static int shmid;
> > +
> > +/*
> > + * Show usage and exit.
> > + */
> > +static void exit_usage(void)
> > +{
> > +
> > +     printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
> > +             "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
> > +             "[-o] [-w]\n", self);
> > +     exit(EXIT_FAILURE);
> > +}
> > +
> > +void sig_handler(int signo)
> > +{
> > +     printf("Received %d.\n", signo);
> > +     if (signo == SIGINT) {
> > +             printf("Deleting the memory\n");
> > +             if (shmdt((const void *)shmaddr) != 0) {
> > +                     perror("Detach failure");
> > +                     shmctl(shmid, IPC_RMID, NULL);
> > +                     exit(4);
>
> Is this a skip error code? Please note that kselftest framework
> interprets this as a skipped test when returb value is 4.
>
This is not a kselftest framework binary. It's a tool to be called
from charge_reserved_hugetlb.sh The exit value is just to make all the
exits in this file return different codes so I can debug easier where
the tool exited from. They don't actually mean anything. Is that OK?

> > +             }
> > +
> > +             shmctl(shmid, IPC_RMID, NULL);
> > +             printf("Done deleting the memory\n");
> > +     }
> > +     exit(2);
>
> What about this one? What does exit code 2 mean?
>
> > +}
> > +
> > +int main(int argc, char **argv)
> > +{
> > +     int fd = 0;
> > +     int key = 0;
> > +     int *ptr = NULL;
> > +     int c = 0;
> > +     int size = 0;
> > +     char path[256] = "";
> > +     enum method method = MAX_METHOD;
> > +     int want_sleep = 0, private = 0;
> > +     int populate = 0;
> > +     int write = 0;
> > +
> > +     unsigned long i;
> > +
> > +
> > +     if (signal(SIGINT, sig_handler) == SIG_ERR)
> > +             err(1, "\ncan't catch SIGINT\n");
> > +
> > +     /* Parse command-line arguments. */
> > +     setvbuf(stdout, NULL, _IONBF, 0);
> > +     self = argv[0];
> > +
> > +     while ((c = getopt(argc, argv, "s:p:m:owlr")) != -1) {
> > +             switch (c) {
> > +             case 's':
> > +                     size = atoi(optarg);
> > +                     break;
> > +             case 'p':
> > +                     strncpy(path, optarg, sizeof(path));
> > +                     break;
> > +             case 'm':
> > +                     if (atoi(optarg) >= MAX_METHOD) {
> > +                             errno = EINVAL;
> > +                             perror("Invalid -m.");
> > +                             exit_usage();
> > +                     }
> > +                     method = atoi(optarg);
> > +                     break;
> > +             case 'o':
> > +                     populate = 1;
> > +                     break;
> > +             case 'w':
> > +                     write = 1;
> > +                     break;
> > +             case 'l':
> > +                     want_sleep = 1;
> > +                     break;
> > +             case 'r':
> > +                     private = 1;
> > +                     break;
> > +             default:
> > +                     errno = EINVAL;
> > +                     perror("Invalid arg");
> > +                     exit_usage();
> > +             }
> > +     }
> > +
> > +     if (strncmp(path, "", sizeof(path)) != 0) {
> > +             printf("Writing to this path: %s\n", path);
> > +     } else {
> > +             errno = EINVAL;
> > +             perror("path not found");
> > +             exit_usage();
> > +     }
> > +
> > +     if (size != 0) {
> > +             printf("Writing this size: %d\n", size);
> > +     } else {
> > +             errno = EINVAL;
> > +             perror("size not found");
> > +             exit_usage();
> > +     }
> > +
> > +     if (!populate)
> > +             printf("Not populating.\n");
> > +     else
> > +             printf("Populating.\n");
> > +
> > +     if (!write)
> > +             printf("Not writing to memory.\n");
> > +
> > +     if (method == MAX_METHOD) {
> > +             errno = EINVAL;
> > +             perror("-m Invalid");
> > +             exit_usage();
> > +     } else
> > +             printf("Using method=%d\n", method);
> > +
> > +     if (!private)
> > +             printf("Shared mapping.\n");
> > +     else
> > +             printf("Private mapping.\n");
> > +
> > +
> > +     switch (method) {
> > +     case HUGETLBFS:
> > +             printf("Allocating using HUGETLBFS.\n");
> > +             fd = open(path, O_CREAT | O_RDWR, 0777);
> > +             if (fd == -1)
> > +                     err(1, "Failed to open file.");
> > +
> > +             ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> > +                     (private ? MAP_PRIVATE : MAP_SHARED) | (populate ?
> > +                             MAP_POPULATE : 0), fd, 0);
> > +
> > +             if (ptr == MAP_FAILED) {
> > +                     close(fd);
> > +                     err(1, "Error mapping the file");
> > +             }
> > +             break;
> > +     case MMAP_MAP_HUGETLB:
> > +             printf("Allocating using MAP_HUGETLB.\n");
> > +             ptr = mmap(NULL, size,
> > +             PROT_READ | PROT_WRITE,
> > +             (private ? (MAP_PRIVATE | MAP_ANONYMOUS) : MAP_SHARED) |
> > +             MAP_HUGETLB | (populate ?
> > +                     MAP_POPULATE : 0),
> > +             -1, 0);
> > +
> > +             if (ptr == MAP_FAILED)
> > +                     err(1, "mmap");
> > +
> > +             printf("Returned address is %p\n", ptr);
> > +             break;
> > +     case SHM:
> > +             printf("Allocating using SHM.\n");
> > +             shmid = shmget(key, size, SHM_HUGETLB | IPC_CREAT | SHM_R |
> > +                             SHM_W);
> > +             if (shmid < 0) {
> > +                     shmid = shmget(++key, size, SHM_HUGETLB | IPC_CREAT |
> > +                                     SHM_R | SHM_W);
> > +                     if (shmid < 0)
> > +                             err(1, "shmget");
> > +
> > +             }
> > +             printf("shmid: 0x%x, shmget key:%d\n", shmid, key);
> > +
> > +             shmaddr = shmat(shmid, NULL, 0);
> > +             if (shmaddr == (char *)-1) {
> > +                     perror("Shared memory attach failure");
> > +                     shmctl(shmid, IPC_RMID, NULL);
> > +                     exit(2);
> > +             }
> > +             printf("shmaddr: %p\n", shmaddr);
> > +
> > +             break;
> > +     default:
> > +             errno = EINVAL;
> > +             err(1, "Invalid method.");
> > +     }
> > +
> > +     if (write) {
> > +             printf("Writing to memory.\n");
> > +             if (method != SHM) {
> > +                     memset(ptr, 1, size);
> > +             } else {
> > +                     printf("Starting the writes:\n");
> > +                     for (i = 0; i < size; i++) {
> > +                             shmaddr[i] = (char)(i);
> > +                             if (!(i % (1024 * 1024)))
> > +                                     printf(".");
> > +                     }
> > +                     printf("\n");
> > +
> > +                     printf("Starting the Check...");
> > +                     for (i = 0; i < size; i++)
> > +                             if (shmaddr[i] != (char)i) {
> > +                                     printf("\nIndex %lu mismatched\n", i);
> > +                                     exit(3);
> > +                             }
> > +                     printf("Done.\n");
> > +
> > +
> > +             }
> > +     }
> > +
> > +     if (want_sleep) {
> > +             /* Signal to caller that we're done. */
> > +             printf("DONE\n");
> > +
> > +             /* Hold memory until external kill signal is delivered. */
> > +             while (1)
> > +                     sleep(100);
>
> Please don't add sleeps. This will hold up the kselftest run.

This tool is meant to allocate a bunch of memory and hold it until the
code in ./charge_reserved_hugetlb.sh checks that the usage is
accounted for. charge_reserved_hugetlb.sh runs this tool in the
background, and asks it to hold the memory until it verifies
accounting, then asks it to delete the memory. It should not hold up a
run.

>
> > +     }
> > +
> > +     close(fd);
>
> Is this close() necessary in all cases? Looks like MMAP_MAP_HUGETLB
> is the only case that opens it.
>
> I am not sure if the error legs are correct.
>
> > +
> > +     return 0;
> > +}
> > --
> > 2.23.0.162.g0b9fbb3734-goog
> >
>
> thanks,
> -- Shuah

Thanks for the review, I should be able to address everything else in
the next patchset.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, back to index

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-10 23:31 [PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
2019-09-10 23:31 ` [PATCH v4 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
2019-09-16 23:43   ` shuah
2019-09-10 23:31 ` [PATCH v4 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations Mina Almasry
2019-09-17  1:29   ` shuah
2019-09-10 23:31 ` [PATCH v4 3/9] hugetlb_cgroup: add reservation accounting for private mappings Mina Almasry
2019-09-10 23:31 ` [PATCH v4 4/9] hugetlb: region_chg provides only cache entry Mina Almasry
2019-09-16 22:17   ` Mike Kravetz
2019-09-10 23:31 ` [PATCH v4 5/9] hugetlb: remove duplicated code Mina Almasry
2019-09-16 22:25   ` Mike Kravetz
2019-09-10 23:31 ` [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing Mina Almasry
2019-09-16 23:57   ` Mike Kravetz
2019-09-17  0:16     ` Mina Almasry
2019-09-10 23:31 ` [PATCH v4 7/9] hugetlb_cgroup: add accounting for shared mappings Mina Almasry
2019-09-10 23:31 ` [PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests Mina Almasry
2019-09-17  1:52   ` shuah
2019-09-19  1:53     ` Mina Almasry
2019-09-10 23:31 ` [PATCH v4 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs Mina Almasry
2019-09-17  1:58   ` shuah

Linux-kselftest Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-kselftest/0 linux-kselftest/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-kselftest linux-kselftest/ https://lore.kernel.org/linux-kselftest \
		linux-kselftest@vger.kernel.org
	public-inbox-index linux-kselftest

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kselftest


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git