* [PATCH 0/2] hugetlb memcg accounting
@ 2023-09-26 19:49 Nhat Pham
  2023-09-26 19:49 ` [PATCH 1/2] hugetlb: memcg: account hugetlb-backed memory in memory controller Nhat Pham
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Nhat Pham @ 2023-09-26 19:49 UTC (permalink / raw)
  To: akpm
  Cc: riel, hannes, mhocko, roman.gushchin, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.

Nhat Pham (2):
  hugetlb: memcg: account hugetlb-backed memory in memory controller
  selftests: add a selftest to verify hugetlb usage in memcg

 MAINTAINERS                                   |   2 +
 fs/hugetlbfs/inode.c                          |   2 +-
 include/linux/hugetlb.h                       |   6 +-
 include/linux/memcontrol.h                    |   8 +
 mm/hugetlb.c                                  |  23 +-
 mm/memcontrol.c                               |  40 ++++
 tools/testing/selftests/cgroup/.gitignore     |   1 +
 tools/testing/selftests/cgroup/Makefile       |   2 +
 .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
 9 files changed, 297 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c

-- 
2.34.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] hugetlb: memcg: account hugetlb-backed memory in memory controller
  2023-09-26 19:49 [PATCH 0/2] hugetlb memcg accounting Nhat Pham
@ 2023-09-26 19:49 ` Nhat Pham
  2023-09-26 19:49 ` [PATCH 2/2] selftests: add a selftest to verify hugetlb usage in memcg Nhat Pham
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-09-26 19:49 UTC (permalink / raw)
  To: akpm
  Cc: riel, hannes, mhocko, roman.gushchin, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

This patch rectifies this issue by charging the memcg when the hugetlb
folio is allocated, and uncharging when the folio is freed (analogous to
the hugetlb controller).

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 fs/hugetlbfs/inode.c       |  2 +-
 include/linux/hugetlb.h    |  6 ++++--
 include/linux/memcontrol.h |  8 ++++++++
 mm/hugetlb.c               | 23 ++++++++++++++++------
 mm/memcontrol.c            | 40 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 60fce26ff937..034967319955 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -902,7 +902,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		 * to keep reservation accounting consistent.
 		 */
 		hugetlb_set_vma_policy(&pseudo_vma, inode, index);
-		folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
+		folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0, true);
 		hugetlb_drop_vma_policy(&pseudo_vma);
 		if (IS_ERR(folio)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a30686e649f7..9b73db1605a2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -713,7 +713,8 @@ struct huge_bootmem_page {
 
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
-				unsigned long addr, int avoid_reserve);
+				unsigned long addr, int avoid_reserve,
+				bool restore_reserve_on_memcg_failure);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask);
 struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma,
@@ -1016,7 +1017,8 @@ static inline int isolate_or_dissolve_huge_page(struct page *page,
 
 static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 					   unsigned long addr,
-					   int avoid_reserve)
+					   int avoid_reserve,
+					   bool restore_reserve_on_memcg_failure)
 {
 	return NULL;
 }
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e0cfab58ab71..8094679c99dd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -677,6 +677,8 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
 	return __mem_cgroup_charge(folio, mm, gfp);
 }
 
+int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp);
+
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				  gfp_t gfp, swp_entry_t entry);
 void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
@@ -1251,6 +1253,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
 	return 0;
 }
 
+static inline int mem_cgroup_hugetlb_charge_folio(struct folio *folio,
+		gfp_t gfp)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de220e3ff8be..ff88ea4df11a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
 				     pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					  pages_per_huge_page(h), folio);
+	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
@@ -3004,7 +3005,8 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 }
 
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
-				    unsigned long addr, int avoid_reserve)
+					unsigned long addr, int avoid_reserve,
+					bool restore_reserve_on_memcg_failure)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
@@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 			hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					pages_per_huge_page(h), folio);
 	}
+
+	/* undo allocation if memory controller disallows it. */
+	if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {
+		if (restore_reserve_on_memcg_failure)
+			restore_reserve_on_error(h, vma, addr, folio);
+		folio_put(folio);
+		return ERR_PTR(-ENOMEM);
+	}
+
 	return folio;
 
 out_uncharge_cgroup:
@@ -5179,7 +5190,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				/* Do not use reserve as it's private owned */
-				new_folio = alloc_hugetlb_folio(dst_vma, addr, 1);
+				new_folio = alloc_hugetlb_folio(dst_vma, addr, 1, false);
 				if (IS_ERR(new_folio)) {
 					folio_put(pte_folio);
 					ret = PTR_ERR(new_folio);
@@ -5656,7 +5667,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * be acquired again before returning to the caller, as expected.
 	 */
 	spin_unlock(ptl);
-	new_folio = alloc_hugetlb_folio(vma, haddr, outside_reserve);
+	new_folio = alloc_hugetlb_folio(vma, haddr, outside_reserve, true);
 
 	if (IS_ERR(new_folio)) {
 		/*
@@ -5930,7 +5941,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 							VM_UFFD_MISSING);
 		}
 
-		folio = alloc_hugetlb_folio(vma, haddr, 0);
+		folio = alloc_hugetlb_folio(vma, haddr, 0, true);
 		if (IS_ERR(folio)) {
 			/*
 			 * Returning error will result in faulting task being
@@ -6352,7 +6363,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 			goto out;
 		}
 
-		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
+		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0, true);
 		if (IS_ERR(folio)) {
 			ret = -ENOMEM;
 			goto out;
@@ -6394,7 +6405,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 			goto out;
 		}
 
-		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
+		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0, false);
 		if (IS_ERR(folio)) {
 			folio_put(*foliop);
 			ret = -ENOMEM;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1a322a75172..e7ae63f14120 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7050,6 +7050,46 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
 	return ret;
 }
 
+static struct mem_cgroup *get_mem_cgroup_from_current(void)
+{
+	struct mem_cgroup *memcg;
+
+again:
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (!css_tryget(&memcg->css)) {
+		rcu_read_unlock();
+		goto again;
+	}
+	rcu_read_unlock();
+	return memcg;
+}
+
+/**
+ * mem_cgroup_hugetlb_charge_folio - Charge a newly allocated hugetlb folio.
+ * @folio: folio to charge.
+ * @gfp: reclaim mode
+ *
+ * This function charges an allocated hugetlb folio to the memcg of the
+ * current task.
+ *
+ * Returns 0 on success. Otherwise, an error code is returned.
+ */
+int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp)
+{
+	struct mem_cgroup *memcg;
+	int ret;
+
+	if (mem_cgroup_disabled())
+		return 0;
+
+	memcg = get_mem_cgroup_from_current();
+	ret = charge_memcg(folio, memcg, gfp);
+	mem_cgroup_put(memcg);
+
+	return ret;
+}
+
 /**
  * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
  * @folio: folio to charge.
-- 
2.34.1



* [PATCH 2/2] selftests: add a selftest to verify hugetlb usage in memcg
  2023-09-26 19:49 [PATCH 0/2] hugetlb memcg accounting Nhat Pham
  2023-09-26 19:49 ` [PATCH 1/2] hugetlb: memcg: account hugetlb-backed memory in memory controller Nhat Pham
@ 2023-09-26 19:49 ` Nhat Pham
  2023-09-26 20:41   ` Nhat Pham
  2023-09-26 20:50 ` [PATCH 0/2] hugetlb memcg accounting Frank van der Linden
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Nhat Pham @ 2023-09-26 19:49 UTC (permalink / raw)
  To: akpm
  Cc: riel, hannes, mhocko, roman.gushchin, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

This patch adds a new kselftest to demonstrate and verify the new
hugetlb memcg accounting behavior.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 MAINTAINERS                                   |   2 +
 tools/testing/selftests/cgroup/.gitignore     |   1 +
 tools/testing/selftests/cgroup/Makefile       |   2 +
 .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
 4 files changed, 227 insertions(+)
 create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c

diff --git a/MAINTAINERS b/MAINTAINERS
index bf0f54c24f81..ce9f40bcc2ba 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5269,6 +5269,7 @@ S:	Maintained
 F:	mm/memcontrol.c
 F:	mm/swap_cgroup.c
 F:	tools/testing/selftests/cgroup/memcg_protection.m
+F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c
 F:	tools/testing/selftests/cgroup/test_kmem.c
 F:	tools/testing/selftests/cgroup/test_memcontrol.c
 
@@ -9652,6 +9653,7 @@ F:	include/linux/hugetlb.h
 F:	mm/hugetlb.c
 F:	mm/hugetlb_vmemmap.c
 F:	mm/hugetlb_vmemmap.h
+F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c
 
 HVA ST MEDIA DRIVER
 M:	Jean-Christophe Trotin <jean-christophe.trotin@foss.st.com>
diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index af8c3f30b9c1..2732e0b29271 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -7,4 +7,5 @@ test_kill
 test_cpu
 test_cpuset
 test_zswap
+test_hugetlb_memcg
 wait_inotify
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index c27f05f6ce9b..00b441928909 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -14,6 +14,7 @@ TEST_GEN_PROGS += test_kill
 TEST_GEN_PROGS += test_cpu
 TEST_GEN_PROGS += test_cpuset
 TEST_GEN_PROGS += test_zswap
+TEST_GEN_PROGS += test_hugetlb_memcg
 
 LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h
 
@@ -27,3 +28,4 @@ $(OUTPUT)/test_kill: cgroup_util.c
 $(OUTPUT)/test_cpu: cgroup_util.c
 $(OUTPUT)/test_cpuset: cgroup_util.c
 $(OUTPUT)/test_zswap: cgroup_util.c
+$(OUTPUT)/test_hugetlb_memcg: cgroup_util.c
diff --git a/tools/testing/selftests/cgroup/test_hugetlb_memcg.c b/tools/testing/selftests/cgroup/test_hugetlb_memcg.c
new file mode 100644
index 000000000000..9651f6af6914
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_hugetlb_memcg.c
@@ -0,0 +1,222 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <linux/limits.h>
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <fcntl.h>
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+#define ADDR ((void *)(0x0UL))
+#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
+/* mapping 8 MBs == 4 hugepages */
+#define LENGTH (8UL*1024*1024)
+#define PROTECTION (PROT_READ | PROT_WRITE)
+
+/* borrowed from mm/hmm-tests.c */
+static long get_hugepage_size(void)
+{
+	int fd;
+	char buf[2048];
+	int len;
+	char *p, *q, *path = "/proc/meminfo", *tag = "Hugepagesize:";
+	long val;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		/* Error opening the file */
+		return -1;
+	}
+
+	len = read(fd, buf, sizeof(buf));
+	close(fd);
+	if (len < 0) {
+		/* Error in reading the file */
+		return -1;
+	}
+	if (len == sizeof(buf)) {
+		/* Error file is too large */
+		return -1;
+	}
+	buf[len] = '\0';
+
+	/* Search for a tag if provided */
+	if (tag) {
+		p = strstr(buf, tag);
+		if (!p)
+			return -1; /* looks like the line we want isn't there */
+		p += strlen(tag);
+	} else
+		p = buf;
+
+	val = strtol(p, &q, 0);
+	if (*q != ' ') {
+		/* Error parsing the file */
+		return -1;
+	}
+
+	return val;
+}
+
+static int set_file(const char *path, long value)
+{
+	FILE *file;
+	int ret;
+
+	file = fopen(path, "w");
+	if (!file)
+		return -1;
+	ret = fprintf(file, "%ld\n", value);
+	fclose(file);
+	return ret;
+}
+
+static int set_nr_hugepages(long value)
+{
+	return set_file("/proc/sys/vm/nr_hugepages", value);
+}
+
+static unsigned int check_first(char *addr)
+{
+	return *(unsigned int *)addr;
+}
+
+static void write_data(char *addr)
+{
+	unsigned long i;
+
+	for (i = 0; i < LENGTH; i++)
+		*(addr + i) = (char)i;
+}
+
+static int hugetlb_test_program(const char *cgroup, void *arg)
+{
+	char *test_group = (char *)arg;
+	void *addr;
+	long old_current, expected_current, current;
+	int ret = EXIT_FAILURE;
+
+	old_current = cg_read_long(test_group, "memory.current");
+
+	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
+	if (addr == MAP_FAILED) {
+		ksft_print_msg("fail to mmap.\n");
+		return EXIT_FAILURE;
+	}
+	current = cg_read_long(test_group, "memory.current");
+	if (current - old_current >= MB(2)) {
+		ksft_print_msg("mmap should not increase hugepage usage.\n");
+		goto out_failed_munmap;
+	}
+	old_current = current;
+
+	/* read the first page */
+	check_first(addr);
+	expected_current = old_current + MB(2);
+	current = cg_read_long(test_group, "memory.current");
+	if (!values_close(expected_current, current, 1)) {
+		ksft_print_msg("memory usage should increase by around 2MB.\n");
+		goto out_failed_munmap;
+	}
+
+	/* write to the whole range */
+	write_data(addr);
+	current = cg_read_long(test_group, "memory.current");
+	expected_current = old_current + MB(8);
+	if (!values_close(expected_current, current, 1)) {
+		ksft_print_msg("memory usage should increase by around 8MB.\n");
+		goto out_failed_munmap;
+	}
+
+	/* unmap the whole range */
+	munmap(addr, LENGTH);
+	current = cg_read_long(test_group, "memory.current");
+	expected_current = old_current;
+	if (!values_close(expected_current, current, 1)) {
+		ksft_print_msg("memory usage should go back down.\n");
+		return ret;
+	}
+
+	ret = EXIT_SUCCESS;
+	return ret;
+
+out_failed_munmap:
+	munmap(addr, LENGTH);
+	return ret;
+}
+
+static int test_hugetlb_memcg(char *root)
+{
+	int ret = KSFT_FAIL;
+	char *test_group;
+	long old_current, expected_current, current;
+
+	test_group = cg_name(root, "hugetlb_memcg_test");
+
+	if (!test_group || cg_create(test_group)) {
+		ksft_print_msg("fail to create cgroup.\n");
+		goto out;
+	}
+
+	if (cg_write(test_group, "memory.max", "100M")) {
+		ksft_print_msg("fail to set cgroup memory limit.\n");
+		goto out;
+	}
+
+	/* disable swap */
+	if (cg_write(test_group, "memory.swap.max", "0")) {
+		ksft_print_msg("fail to disable swap.\n");
+		goto out;
+	}
+	old_current = cg_read_long(test_group, "memory.current");
+
+	set_nr_hugepages(20);
+	current = cg_read_long(test_group, "memory.current");
+	expected_current = old_current;
+	if (!values_close(expected_current, current, 10)) {
+		ksft_print_msg(
+			"memory usage should not increase after setting nr_hugepages.\n");
+		goto out;
+	}
+
+	if (!cg_run(test_group, hugetlb_test_program, (void *)test_group))
+		ret = KSFT_PASS;
+out:
+	cg_destroy(test_group);
+	free(test_group);
+	return ret;
+}
+
+int main(int argc, char **argv)
+{
+	char root[PATH_MAX];
+	int ret = EXIT_SUCCESS;
+
+	/* Unit is kB! */
+	if (get_hugepage_size() != 2048) {
+		ksft_print_msg("test_hugetlb_memcg requires 2MB hugepages\n");
+		ksft_test_result_skip("test_hugetlb_memcg\n");
+		return ret;
+	}
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+	switch (test_hugetlb_memcg(root)) {
+	case KSFT_PASS:
+		ksft_test_result_pass("test_hugetlb_memcg\n");
+		break;
+	case KSFT_SKIP:
+		ksft_test_result_skip("test_hugetlb_memcg\n");
+		break;
+	default:
+		ret = EXIT_FAILURE;
+		ksft_test_result_fail("test_hugetlb_memcg\n");
+		break;
+	}
+
+	return ret;
+}
-- 
2.34.1



* Re: [PATCH 2/2] selftests: add a selftest to verify hugetlb usage in memcg
  2023-09-26 19:49 ` [PATCH 2/2] selftests: add a selftest to verify hugetlb usage in memcg Nhat Pham
@ 2023-09-26 20:41   ` Nhat Pham
  0 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-09-26 20:41 UTC (permalink / raw)
  To: akpm
  Cc: riel, hannes, mhocko, roman.gushchin, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

Oops, I forgot to commit this change.
It simply moves the nr_hugepages check inside the test program
run within the cgroup. This should apply cleanly:

diff --git a/tools/testing/selftests/cgroup/test_hugetlb_memcg.c b/tools/testing/selftests/cgroup/test_hugetlb_memcg.c
index 9651f6af6914..436522257cd2 100644
--- a/tools/testing/selftests/cgroup/test_hugetlb_memcg.c
+++ b/tools/testing/selftests/cgroup/test_hugetlb_memcg.c
@@ -100,6 +100,14 @@ static int hugetlb_test_program(const char *cgroup, void *arg)
 	int ret = EXIT_FAILURE;
 
 	old_current = cg_read_long(test_group, "memory.current");
+	set_nr_hugepages(20);
+	current = cg_read_long(test_group, "memory.current");
+	expected_current = old_current;
+	if (!values_close(expected_current, current, 10)) {
+		ksft_print_msg(
+			"memory usage should not increase after setting nr_hugepages.\n");
+		return EXIT_FAILURE;
+	}
 
 	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
 	if (addr == MAP_FAILED) {
@@ -152,10 +160,8 @@ static int test_hugetlb_memcg(char *root)
 {
 	int ret = KSFT_FAIL;
 	char *test_group;
-	long old_current, expected_current, current;
 
 	test_group = cg_name(root, "hugetlb_memcg_test");
-
 	if (!test_group || cg_create(test_group)) {
 		ksft_print_msg("fail to create cgroup.\n");
 		goto out;
@@ -171,16 +177,6 @@ static int test_hugetlb_memcg(char *root)
 		ksft_print_msg("fail to disable swap.\n");
 		goto out;
 	}
-	old_current = cg_read_long(test_group, "memory.current");
-
-	set_nr_hugepages(20);
-	current = cg_read_long(test_group, "memory.current");
-	expected_current = old_current;
-	if (!values_close(expected_current, current, 10)) {
-		ksft_print_msg(
-			"memory usage should not increase after setting nr_hugepages.\n");
-		goto out;
-	}
 
 	if (!cg_run(test_group, hugetlb_test_program, (void *)test_group))
 		ret = KSFT_PASS;


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-26 19:49 [PATCH 0/2] hugetlb memcg accounting Nhat Pham
  2023-09-26 19:49 ` [PATCH 1/2] hugetlb: memcg: account hugetlb-backed memory in memory controller Nhat Pham
  2023-09-26 19:49 ` [PATCH 2/2] selftests: add a selftest to verify hugetlb usage in memcg Nhat Pham
@ 2023-09-26 20:50 ` Frank van der Linden
  2023-09-26 22:14   ` Johannes Weiner
  2023-09-26 23:31   ` Nhat Pham
  2023-09-27 11:21 ` Michal Hocko
  2023-09-28  1:00 ` Nhat Pham
  4 siblings, 2 replies; 20+ messages in thread
From: Frank van der Linden @ 2023-09-26 20:50 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, riel, hannes, mhocko, roman.gushchin, shakeelb,
	muchun.song, tj, lizefan.x, shuah, mike.kravetz, yosryahmed,
	linux-mm, kernel-team, linux-kernel, cgroups

On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.
>
> Nhat Pham (2):
>   hugetlb: memcg: account hugetlb-backed memory in memory controller
>   selftests: add a selftest to verify hugetlb usage in memcg
>
>  MAINTAINERS                                   |   2 +
>  fs/hugetlbfs/inode.c                          |   2 +-
>  include/linux/hugetlb.h                       |   6 +-
>  include/linux/memcontrol.h                    |   8 +
>  mm/hugetlb.c                                  |  23 +-
>  mm/memcontrol.c                               |  40 ++++
>  tools/testing/selftests/cgroup/.gitignore     |   1 +
>  tools/testing/selftests/cgroup/Makefile       |   2 +
>  .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
>  9 files changed, 297 insertions(+), 9 deletions(-)
>  create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
>
> --
> 2.34.1
>

We've had this behavior at Google for a long time, and we're actually
getting rid of it. hugetlb pages are a precious resource that should
be accounted for separately. They are not just any memory, they are
physically contiguous memory, charging them the same as any other
region of the same size ended up not making sense, especially not for
larger hugetlb page sizes.

Additionally, if this behavior is changed just like that, there will
be quite a few workloads that will break badly because they'll hit
their limits immediately - imagine a container that uses 1G hugetlb
pages to back something large (a database, a VM), and 'plain' memory
for control processes.

What do your workloads do? Is it not possible for you to account for
hugetlb pages separately? Sure, it can be annoying to have to deal
with 2 separate totals that you need to take into account, but again,
hugetlb pages are a resource that is best dealt with separately.

- Frank


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-26 20:50 ` [PATCH 0/2] hugetlb memcg accounting Frank van der Linden
@ 2023-09-26 22:14   ` Johannes Weiner
  2023-09-27 12:50     ` Michal Hocko
  2023-09-26 23:31   ` Nhat Pham
  1 sibling, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2023-09-26 22:14 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: Nhat Pham, akpm, riel, mhocko, roman.gushchin, shakeelb,
	muchun.song, tj, lizefan.x, shuah, mike.kravetz, yosryahmed,
	linux-mm, kernel-team, linux-kernel, cgroups

Hi Frank,

On Tue, Sep 26, 2023 at 01:50:10PM -0700, Frank van der Linden wrote:
> On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> >
> > Nhat Pham (2):
> >   hugetlb: memcg: account hugetlb-backed memory in memory controller
> >   selftests: add a selftest to verify hugetlb usage in memcg
> >
> >  MAINTAINERS                                   |   2 +
> >  fs/hugetlbfs/inode.c                          |   2 +-
> >  include/linux/hugetlb.h                       |   6 +-
> >  include/linux/memcontrol.h                    |   8 +
> >  mm/hugetlb.c                                  |  23 +-
> >  mm/memcontrol.c                               |  40 ++++
> >  tools/testing/selftests/cgroup/.gitignore     |   1 +
> >  tools/testing/selftests/cgroup/Makefile       |   2 +
> >  .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
> >  9 files changed, 297 insertions(+), 9 deletions(-)
> >  create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
> >
> > --
> > 2.34.1
> >
> 
> We've had this behavior at Google for a long time, and we're actually
> getting rid of it. hugetlb pages are a precious resource that should
> be accounted for separately. They are not just any memory, they are
> physically contiguous memory, charging them the same as any other
> region of the same size ended up not making sense, especially not for
> larger hugetlb page sizes.

I agree that on one hand they're a limited resource, and some form of
access control makes sense. There is the hugetlb cgroup controller
that allows for tracking and apportioning them per-cgroups.

But on the other hand they're also still just host memory that a
cgroup can consume, which is the domain of memcg.

Those two aren't mutually exclusive. It makes sense to set a limit on
a cgroup's access to hugetlb. It also makes sense that the huge pages
a cgroup IS using count toward its memory limit, where they displace
file cache and anonymous pages under pressure. Or that they're
considered when determining degree of protection from global pressure.

This isn't unlike e.g. kernel memory being special in that it consumes
lowmem and isn't reclaimable. This shows up in total memory, while it
was also tracked and limited separately. (Separate control disappeared
for lack of a good enforcement mechanism - but hugetlb has that.)

The fact that memory consumed by hugetlb is currently not considered
inside memcg (host memory accounting and control) is inconsistent. It
has been quite confusing to our service owners and complicating things
for our containers team.

For example, jobs need to describe their overall memory size in order
to match them to machines and co-locate them. Based on that parameter
the container limits as well as protection (memory.low) from global
pressure is set. Currently, there are ugly hacks in place to subtract
any hugetlb quota from the container config - otherwise the limits and
protection settings would be way too big if a large part of the host
memory consumption isn't a part of it. This has been quite cumbersome
and error prone.

> Additionally, if this behavior is changed just like that, there will
> be quite a few workloads that will break badly because they'll hit
> their limits immediately - imagine a container that uses 1G hugetlb
> pages to back something large (a database, a VM), and 'plain' memory
> for control processes.

I agree with you there. This could break existing setups. We've added
new consumers to memcg in the past without thinking too hard about it,
but hugetlb often makes up a huge portion of a group's overall memory
footprint. And we *do* have those subtraction hacks in place that
would then fail in the other direction.

A cgroup mountflag makes sense for this to ease the transition.


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-26 20:50 ` [PATCH 0/2] hugetlb memcg accounting Frank van der Linden
  2023-09-26 22:14   ` Johannes Weiner
@ 2023-09-26 23:31   ` Nhat Pham
  1 sibling, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-09-26 23:31 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: akpm, riel, hannes, mhocko, roman.gushchin, shakeelb,
	muchun.song, tj, lizefan.x, shuah, mike.kravetz, yosryahmed,
	linux-mm, kernel-team, linux-kernel, cgroups

On Tue, Sep 26, 2023 at 1:50 PM Frank van der Linden <fvdl@google.com> wrote:
>
> On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> >
> > Nhat Pham (2):
> >   hugetlb: memcg: account hugetlb-backed memory in memory controller
> >   selftests: add a selftest to verify hugetlb usage in memcg
> >
> >  MAINTAINERS                                   |   2 +
> >  fs/hugetlbfs/inode.c                          |   2 +-
> >  include/linux/hugetlb.h                       |   6 +-
> >  include/linux/memcontrol.h                    |   8 +
> >  mm/hugetlb.c                                  |  23 +-
> >  mm/memcontrol.c                               |  40 ++++
> >  tools/testing/selftests/cgroup/.gitignore     |   1 +
> >  tools/testing/selftests/cgroup/Makefile       |   2 +
> >  .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
> >  9 files changed, 297 insertions(+), 9 deletions(-)
> >  create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
> >
> > --
> > 2.34.1
> >
>
> We've had this behavior at Google for a long time, and we're actually
> getting rid of it. hugetlb pages are a precious resource that should
> be accounted for separately. They are not just any memory, they are
> physically contiguous memory, charging them the same as any other
> region of the same size ended up not making sense, especially not for
> larger hugetlb page sizes.

I agree hugetlb is a special kind of resource. But as Johannes
pointed out, it is still a form of memory. Semantically, its usage
should be modulated by the memory controller.

We do have the HugeTLB controller for hugetlb-specific
restriction, and where appropriate we definitely should take
advantage of it. But it does not fix the hole we have in memory
usage reporting, as well as (over)protection and reclaim dynamics.
Hence the need for the userspace hack (as Johannes described):
manually adding/subtracting HugeTLB usage where applicable.
This is not only inelegant, but also cumbersome and buggy.

>
> Additionally, if this behavior is changed just like that, there will
> be quite a few workloads that will break badly because they'll hit
> their limits immediately - imagine a container that uses 1G hugetlb
> pages to back something large (a database, a VM), and 'plain' memory
> for control processes.
>
> What do your workloads do? Is it not possible for you to account for
> hugetlb pages separately? Sure, it can be annoying to have to deal
> with 2 separate totals that you need to take into account, but again,
> hugetlb pages are a resource that is best dealt with separately.
>

Johannes beat me to it - he described our use case, and what we
have hacked together to temporarily get around the issue.

A knob/flag to turn on/off this behavior sounds good to me.

> - Frank
Thanks for the comments, Frank!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-26 19:49 [PATCH 0/2] hugetlb memcg accounting Nhat Pham
                   ` (2 preceding siblings ...)
  2023-09-26 20:50 ` [PATCH 0/2] hugetlb memcg accounting Frank van der Linden
@ 2023-09-27 11:21 ` Michal Hocko
  2023-09-27 18:47   ` Johannes Weiner
  2023-09-27 23:33   ` Nhat Pham
  2023-09-28  1:00 ` Nhat Pham
  4 siblings, 2 replies; 20+ messages in thread
From: Michal Hocko @ 2023-09-27 11:21 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, riel, hannes, roman.gushchin, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
> 
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.

The primary reason why hugetlb is living outside of memcg (and the core
MM as well) is that it doesn't really fit the whole scheme, in several
aspects. First and foremost, it is an independently managed resource
with its own pool management, use and lifetime.

There is no notion of memory reclaim and this makes a huge difference
for the pool that might consume a considerable amount of memory. While
this is the case for many kernel allocations as well, they usually do not
consume considerable portions of the accounted memory. This makes it
really tricky to handle limit enforcement gracefully.

Another important aspect comes from the lifetime semantics when a proper
reservations accounting and managing needs to handle mmap time rather
than the usual allocation path. While pages are allocated they do not
belong to anybody and only later at the #PF time (or read for the fs
backed mapping) the ownership is established. That makes it really hard
to manage memory as a whole under the memcg anyway, as a large part of
that pool sits without an ownership yet it cannot be used for any other
purpose.

These and more reasons were behind the earlier decision to have a
dedicated hugetlb controller.

I will also Nack hugetlb pages being accounted by
default. This would break any setups which mix normal and hugetlb memory
with memcg limits applied.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-26 22:14   ` Johannes Weiner
@ 2023-09-27 12:50     ` Michal Hocko
  2023-09-27 16:44       ` Johannes Weiner
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-09-27 12:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Frank van der Linden, Nhat Pham, akpm, riel, roman.gushchin,
	shakeelb, muchun.song, tj, lizefan.x, shuah, mike.kravetz,
	yosryahmed, linux-mm, kernel-team, linux-kernel, cgroups

On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
[...]
> The fact that memory consumed by hugetlb is currently not considered
> inside memcg (host memory accounting and control) is inconsistent. It
> has been quite confusing to our service owners and complicating things
> for our containers team.

I do understand how that is confusing and inconsistent as well. Hugetlb
has been bringing confusion throughout its existence, I am afraid.

As noted in my other reply, though, I am not sure the hugetlb pool can be
reasonably incorporated with sane semantics. Neither the regular
allocation nor the hugetlb reservation/actual use can fall back to the
pool of the other. This makes them 2 different things, each hitting its
own failure cases that require dedicated handling.

Just from top of my head these are cases I do not see easy way out from:
	- hugetlb charge failure has two failure modes - pool empty
	  or memcg limit reached. The former is not recoverable and
	  should fail without any further intervention; the latter might
	  benefit from reclaiming.
	- !hugetlb memory charge failure cannot consider any hugetlb
	  pages - they act as implicit memory.min protection, so it is
	  impossible to manage reclaim protection without
	  knowledge of the hugetlb use.
	- there is no way to control the hugetlb pool distribution by
	  memcg limits. How do we distinguish reservations from actual
	  use?
	- pre-allocated pool is consuming memory without any actual
	  owner until it is actually used and even that has two stages
	  (reserved and really used). This makes it really hard to
	  manage memory as a whole when there is a considerable amount of
	  hugetlb memory preallocated.
I am pretty sure there are many more interesting cases. 
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-27 12:50     ` Michal Hocko
@ 2023-09-27 16:44       ` Johannes Weiner
  2023-09-27 17:22         ` Nhat Pham
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2023-09-27 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Frank van der Linden, Nhat Pham, akpm, riel, roman.gushchin,
	shakeelb, muchun.song, tj, lizefan.x, shuah, mike.kravetz,
	yosryahmed, linux-mm, kernel-team, linux-kernel, cgroups

On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> [...]
> > The fact that memory consumed by hugetlb is currently not considered
> > inside memcg (host memory accounting and control) is inconsistent. It
> > has been quite confusing to our service owners and complicating things
> > for our containers team.
> 
> I do understand how that is confusing and inconsistent as well. Hugetlb
> has been bringing confusion throughout its existence, I am afraid.
> 
> As noted in my other reply, though, I am not sure the hugetlb pool can be
> reasonably incorporated with sane semantics. Neither the regular
> allocation nor the hugetlb reservation/actual use can fall back to the
> pool of the other. This makes them 2 different things, each hitting its
> own failure cases that require dedicated handling.
> 
> Just from top of my head these are cases I do not see easy way out from:
> 	- hugetlb charge failure has two failure modes - pool empty
> 	  or memcg limit reached. The former is not recoverable and
> 	  should fail without any further intervention; the latter might
> 	  benefit from reclaiming.
> 	- !hugetlb memory charge failure cannot consider any hugetlb
> 	  pages - they act as implicit memory.min protection, so it is
> 	  impossible to manage reclaim protection without
> 	  knowledge of the hugetlb use.
> 	- there is no way to control the hugetlb pool distribution by
> 	  memcg limits. How do we distinguish reservations from actual
> 	  use?
> 	- pre-allocated pool is consuming memory without any actual
> 	  owner until it is actually used and even that has two stages
> 	  (reserved and really used). This makes it really hard to
> 	  manage memory as a whole when there is a considerable amount of
> 	  hugetlb memory preallocated.

It's important to distinguish hugetlb access policy from memory use
policy. This patch isn't about hugetlb access, it's about general
memory use.

Hugetlb access policy is a separate domain with separate
answers. Preallocating is a privileged operation, for access control
there is the hugetlb cgroup controller etc.

What's missing is that once you get past the access restrictions and
legitimately get your hands on huge pages, that memory use gets
reflected in memory.current and exerts pressure on *other* memory
inside the group, such as anon or optimistic cache allocations.

Note that hugetlb *can* be allocated on demand. It's unexpected that
when an application optimistically allocates a couple of 2M hugetlb
pages those aren't reflected in its memory.current. The same is true
for hugetlb_cma. If the gigantic pages aren't currently allocated to a
cgroup, that CMA memory can be used for movable memory elsewhere.

The points you and Frank raise are reasons and scenarios where
additional hugetlb access control is necessary - preallocation,
limited availability of 1G pages etc. But they're not reasons against
charging faulted in hugetlb to the memcg *as well*.

My point is we need both. One to manage competition over hugetlb,
because it has unique limitations. The other to manage competition
over host memory which hugetlb is a part of.

Here is a usecase from our fleet.

Imagine a configuration with two 32G containers. The machine is booted
with hugetlb_cma=6G, and each container may or may not use up to 3
gigantic pages, depending on the workload within it. The rest is anon,
cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
3G to enforce hugetlb fairness. But how do you configure memory.max to
keep *overall* consumption, including anon, cache, slab etc. fair?

If used hugetlb is charged, you can just set memory.max=32G regardless
of the workload inside.

Without it, you'd have to constantly poll hugetlb usage and readjust
memory.max!
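As a sketch, the configuration described above could look like this
under cgroup v2 (the cgroup names and the hugetlb.1GB.max interface file
are assumptions for illustration):

```shell
# Two 32G containers on a host booted with hugetlb_cma=6G.
# With hugetlb charged to memcg, one static memory.max per container
# covers anon, cache, slab *and* hugetlb - no polling agent needed.
for ct in container1 container2; do
    mkdir -p "/sys/fs/cgroup/$ct"
    # hugetlb fairness: at most three 1G pages each
    echo 3G  > "/sys/fs/cgroup/$ct/hugetlb.1GB.max"
    # overall host-memory fairness, regardless of the workload mix
    echo 32G > "/sys/fs/cgroup/$ct/memory.max"
done
```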


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-27 16:44       ` Johannes Weiner
@ 2023-09-27 17:22         ` Nhat Pham
  0 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-09-27 17:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Frank van der Linden, akpm, riel, roman.gushchin,
	shakeelb, muchun.song, tj, lizefan.x, shuah, mike.kravetz,
	yosryahmed, linux-mm, kernel-team, linux-kernel, cgroups

On Wed, Sep 27, 2023 at 9:44 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> > [...]
> > > The fact that memory consumed by hugetlb is currently not considered
> > > inside memcg (host memory accounting and control) is inconsistent. It
> > > has been quite confusing to our service owners and complicating things
> > > for our containers team.
> >
> > I do understand how that is confusing and inconsistent as well. Hugetlb
> > has been bringing confusion throughout its existence, I am afraid.
> >
> > As noted in my other reply, though, I am not sure the hugetlb pool can be
> > reasonably incorporated with sane semantics. Neither the regular
> > allocation nor the hugetlb reservation/actual use can fall back to the
> > pool of the other. This makes them 2 different things, each hitting its
> > own failure cases that require dedicated handling.
> >
> > Just from top of my head these are cases I do not see easy way out from:
> >       - hugetlb charge failure has two failure modes - pool empty
> >         or memcg limit reached. The former is not recoverable and
> >         should fail without any further intervention; the latter might
> >         benefit from reclaiming.
> >       - !hugetlb memory charge failure cannot consider any hugetlb
> >         pages - they act as implicit memory.min protection, so it is
> >         impossible to manage reclaim protection without
> >         knowledge of the hugetlb use.
> >       - there is no way to control the hugetlb pool distribution by
> >         memcg limits. How do we distinguish reservations from actual
> >         use?
> >       - pre-allocated pool is consuming memory without any actual
> >         owner until it is actually used and even that has two stages
> >         (reserved and really used). This makes it really hard to
> >         manage memory as a whole when there is a considerable amount of
> >         hugetlb memory preallocated.
>
> It's important to distinguish hugetlb access policy from memory use
> policy. This patch isn't about hugetlb access, it's about general
> memory use.
>
> Hugetlb access policy is a separate domain with separate
> answers. Preallocating is a privileged operation, for access control
> there is the hugetlb cgroup controller etc.
>
> What's missing is that once you get past the access restrictions and
> legitimately get your hands on huge pages, that memory use gets
> reflected in memory.current and exerts pressure on *other* memory
> inside the group, such as anon or optimistic cache allocations.
>
> Note that hugetlb *can* be allocated on demand. It's unexpected that
> when an application optimistically allocates a couple of 2M hugetlb
> pages those aren't reflected in its memory.current. The same is true
> for hugetlb_cma. If the gigantic pages aren't currently allocated to a
> cgroup, that CMA memory can be used for movable memory elsewhere.
>
> The points you and Frank raise are reasons and scenarios where
> additional hugetlb access control is necessary - preallocation,
> limited availability of 1G pages etc. But they're not reasons against
> charging faulted in hugetlb to the memcg *as well*.
>
> My point is we need both. One to manage competition over hugetlb,
> because it has unique limitations. The other to manage competition
> over host memory which hugetlb is a part of.
>
> Here is a usecase from our fleet.
>
> Imagine a configuration with two 32G containers. The machine is booted
> with hugetlb_cma=6G, and each container may or may not use up to 3
> gigantic pages, depending on the workload within it. The rest is anon,
> cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
> 3G to enforce hugetlb fairness. But how do you configure memory.max to
> keep *overall* consumption, including anon, cache, slab etc. fair?
>
> If used hugetlb is charged, you can just set memory.max=32G regardless
> of the workload inside.
>
> Without it, you'd have to constantly poll hugetlb usage and readjust
> memory.max!

Yep, and I'd like to add that this can and has caused issues in
our production system when there is a delay in correcting the
memory limits (low or max). The userspace agent in charge of
correcting these only runs periodically, and between consecutive
runs the system can be in an over- or underprotected state. An
instantaneous charge towards the memory controller would close
this gap.
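The periodic correction our agent performs amounts to something like the
sketch below (the file names, the 1G page size, and the polling period
are illustrative assumptions, not the actual agent code):

```shell
# Re-derive a cgroup's protection from its observed hugetlb usage.
# memory.low has to track hugetlb manually because memcg does not
# charge hugetlb-backed memory today.
adjusted_protection() {
    # $1: base protection for non-hugetlb memory (bytes)
    # $2: current hugetlb usage (bytes)
    echo $(( $1 + $2 ))
}

# The userspace agent would then periodically do roughly:
#   while sleep 60; do
#       huge=$(cat "$CG/hugetlb.1GB.current")
#       adjusted_protection "$BASE_LOW" "$huge" > "$CG/memory.low"
#   done
# Between two iterations the cgroup sits over- or underprotected.
```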

I think we need both a HugeTLB controller and memory controller
accounting.


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-27 11:21 ` Michal Hocko
@ 2023-09-27 18:47   ` Johannes Weiner
  2023-09-27 21:37     ` Roman Gushchin
  2023-10-01 23:27     ` Mike Kravetz
  2023-09-27 23:33   ` Nhat Pham
  1 sibling, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2023-09-27 18:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nhat Pham, akpm, riel, roman.gushchin, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> > 
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> 
> The primary reason why hugetlb is living outside of memcg (and the core
> MM as well) is that it doesn't really fit the whole scheme, in several
> aspects. First and foremost, it is an independently managed resource
> with its own pool management, use and lifetime.

Honestly, the simpler explanation is that few people have used hugetlb
in regular, containerized non-HPC workloads.

Hugetlb has historically been much more special, and it retains a
specialness that warrants e.g. the hugetlb cgroup container. But it
has also made strides with hugetlb_cma, migratability, madvise support
etc. that allows much more on-demand use. It's no longer the case that
you just put a static pool of memory aside during boot and only a few
blessed applications are using it.

For example, we're using hugetlb_cma very broadly with generic
containers. The CMA region is fully usable by movable non-huge stuff
until huge pages are allocated in it. With the hugetlb controller you
can define a maximum number of hugetlb pages that can be used per
container. But what if that container isn't using any? Why shouldn't
it be allowed to use its overall memory allowance for anon and cache
instead?

With hugetlb being more dynamic, it becomes the same problem that we
had with dedicated tcp and kmem pools. It didn't make sense to fail a
random slab allocation when you still have memory headroom or can
reclaim some cache. Nowadays, the same problem applies to hugetlb.

> There is no notion of memory reclaim and this makes a huge difference
> for the pool that might consume a considerable amount of memory. While
> this is the case for many kernel allocations as well, they usually do not
> consume considerable portions of the accounted memory. This makes it
> really tricky to handle limit enforcement gracefully.

I don't think that's true. For some workloads, network buffers can
absolutely dominate. And they work just fine with cgroup limits. It
isn't a problem that they aren't reclaimable themselves, it's just
important that they put pressure on stuff that is.

So that if you use 80% hugetlb, the other memory is forced to stay in
the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
group is still allowed to use the full 100% of its host memory
allowance, without requiring some outside agent continuously
monitoring and adjusting the container limits.

> Another important aspect comes from the lifetime semantics when a proper
> reservations accounting and managing needs to handle mmap time rather
> than the usual allocation path. While pages are allocated they do not
> belong to anybody and only later at the #PF time (or read for the fs
> backed mapping) the ownership is established. That makes it really hard
> to manage memory as a whole under the memcg anyway, as a large part of
> that pool sits without an ownership yet it cannot be used for any other
> purpose.
>
> These and more reasons were behind the earlier decision to have a
> dedicated hugetlb controller.

Yeah, there is still a need for an actual hugetlb controller for the
static use cases (and even for dynamic access to hugetlb_cma).

But you need memcg coverage as well for the more dynamic cases to work
as expected. And having that doesn't really interfere with the static
usecases.

> I will also Nack hugetlb pages being accounted by
> default. This would break any setups which mix normal and hugetlb memory
> with memcg limits applied.

Yes, no disagreement there. I think we're all on the same page this
needs to be opt-in, say with a cgroup mount option.
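A sketch of what such an opt-in could look like (the option name
memory_hugetlb_accounting matches the follow-up v2 of this series; treat
it as illustrative here rather than a settled interface):

```shell
# Opt in to hugetlb memcg charging via a cgroup v2 mount option;
# existing setups that mount without it keep the old behavior.
mount -t cgroup2 -o memory_hugetlb_accounting none /sys/fs/cgroup

# For an already-mounted hierarchy, a remount would flip it on,
# analogous to options like memory_recursiveprot:
mount -o remount,memory_hugetlb_accounting /sys/fs/cgroup
```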


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-27 18:47   ` Johannes Weiner
@ 2023-09-27 21:37     ` Roman Gushchin
  2023-09-28 12:52       ` Johannes Weiner
  2023-10-01 23:27     ` Mike Kravetz
  1 sibling, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2023-09-27 21:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Nhat Pham, akpm, riel, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Wed, Sep 27, 2023 at 02:47:38PM -0400, Johannes Weiner wrote:
> On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > > Currently, hugetlb memory usage is not accounted for in the memory
> > > controller, which could lead to memory overprotection for cgroups with
> > > hugetlb-backed memory. This has been observed in our production system.
> > > 
> > > This patch series rectifies this issue by charging the memcg when the
> > > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > > addition, a new selftest is added to demonstrate and verify this new
> > > behavior.
> > 
> > The primary reason why hugetlb is living outside of memcg (and the core
> > MM as well) is that it doesn't really fit the whole scheme, in several
> > aspects. First and foremost, it is an independently managed resource
> > with its own pool management, use and lifetime.
> 
> Honestly, the simpler explanation is that few people have used hugetlb
> in regular, containerized non-HPC workloads.
> 
> Hugetlb has historically been much more special, and it retains a
> specialness that warrants e.g. the hugetlb cgroup container. But it
> has also made strides with hugetlb_cma, migratability, madvise support
> etc. that allows much more on-demand use. It's no longer the case that
> you just put a static pool of memory aside during boot and only a few
> blessed applications are using it.
> 
> For example, we're using hugetlb_cma very broadly with generic
> containers. The CMA region is fully usable by movable non-huge stuff
> until huge pages are allocated in it. With the hugetlb controller you
> can define a maximum number of hugetlb pages that can be used per
> container. But what if that container isn't using any? Why shouldn't
> it be allowed to use its overall memory allowance for anon and cache
> instead?

Cool, I remember proposing hugetlb memcg stats several years ago, and if
I remember correctly, at that time you were opposing it based on the idea
that huge pages are not a part of the overall memcg flow: they are not
subject to memory pressure, can't be evicted, etc. And THPs were seen
as a long-term replacement. Even though all of the above is true, hugetlb
has its niche and I don't think THPs will realistically replace it any
time soon.

So I'm glad to see this effort (and very supportive) on making hugetlb
more convenient and transparent for an end user.

Thanks!


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-27 11:21 ` Michal Hocko
  2023-09-27 18:47   ` Johannes Weiner
@ 2023-09-27 23:33   ` Nhat Pham
  1 sibling, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-09-27 23:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, riel, hannes, roman.gushchin, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Wed, Sep 27, 2023 at 4:21 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
>
> The primary reason why hugetlb is living outside of memcg (and the core
> MM as well) is that it doesn't really fit the whole scheme, in several
> aspects. First and foremost, it is an independently managed resource
> with its own pool management, use and lifetime.
>
> There is no notion of memory reclaim and this makes a huge difference
> for the pool that might consume a considerable amount of memory. While
> this is the case for many kernel allocations as well, they usually do not
> consume considerable portions of the accounted memory. This makes it
> really tricky to handle limit enforcement gracefully.
>
> Another important aspect comes from the lifetime semantics when a proper
> reservations accounting and managing needs to handle mmap time rather
> than the usual allocation path. While pages are allocated they do not
> belong to anybody and only later at the #PF time (or read for the fs
> backed mapping) the ownership is established. That makes it really hard
> to manage memory as a whole under the memcg anyway, as a large part of
> that pool sits without an ownership yet it cannot be used for any other
> purpose.
>
> These and more reasons were behind the earlier decision to have a
> dedicated hugetlb controller.

While I believe all of these are true, I think they are not reasons not to
have memcg accounting. As everyone has pointed out, memcg
accounting by itself cannot handle all situations - it is not a fix-all.
Other mechanisms, such as the HugeTLB controller, could be the better
solution in these cases, and hugetlb memcg accounting is definitely not
an attempt to infringe upon these control domains.

However, memcg accounting is still necessary for certain memory limits
enforcement to work cleanly and properly - such as the use cases we have
(as Johannes has beautifully described). It will certainly help
administrators simplify their control workflow a lot (assuming we do not
surprise them with this change - a new mount option to opt-in should
help with the transition).

>
> I will also Nack hugetlb pages being accounted by
> default. This would break any setups which mix normal and hugetlb memory
> with memcg limits applied.

Got it! I'll introduce some opt-in mechanisms in the next version. This is
my oversight.


> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-26 19:49 [PATCH 0/2] hugetlb memcg accounting Nhat Pham
                   ` (3 preceding siblings ...)
  2023-09-27 11:21 ` Michal Hocko
@ 2023-09-28  1:00 ` Nhat Pham
  4 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-09-28  1:00 UTC (permalink / raw)
  To: akpm
  Cc: riel, hannes, mhocko, roman.gushchin, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.
>
> Nhat Pham (2):
>   hugetlb: memcg: account hugetlb-backed memory in memory controller
>   selftests: add a selftest to verify hugetlb usage in memcg
>
>  MAINTAINERS                                   |   2 +
>  fs/hugetlbfs/inode.c                          |   2 +-
>  include/linux/hugetlb.h                       |   6 +-
>  include/linux/memcontrol.h                    |   8 +
>  mm/hugetlb.c                                  |  23 +-
>  mm/memcontrol.c                               |  40 ++++
>  tools/testing/selftests/cgroup/.gitignore     |   1 +
>  tools/testing/selftests/cgroup/Makefile       |   2 +
>  .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
>  9 files changed, 297 insertions(+), 9 deletions(-)
>  create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
>
> --
> 2.34.1

Thanks for all the comments and suggestions everyone!
FYI, I have sent out a second version of the patch series with the new
mount flag:

https://lore.kernel.org/lkml/20230928005723.1709119-1-nphamcs@gmail.com/T/#t

Feel free to check it out and discuss it over there too!


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-27 21:37     ` Roman Gushchin
@ 2023-09-28 12:52       ` Johannes Weiner
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2023-09-28 12:52 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Nhat Pham, akpm, riel, shakeelb, muchun.song, tj,
	lizefan.x, shuah, mike.kravetz, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Wed, Sep 27, 2023 at 02:37:47PM -0700, Roman Gushchin wrote:
> On Wed, Sep 27, 2023 at 02:47:38PM -0400, Johannes Weiner wrote:
> > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > > > Currently, hugetlb memory usage is not accounted for in the memory
> > > > controller, which could lead to memory overprotection for cgroups with
> > > > hugetlb-backed memory. This has been observed in our production system.
> > > > 
> > > > This patch series rectifies this issue by charging the memcg when the
> > > > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > > > addition, a new selftest is added to demonstrate and verify this new
> > > > behavior.
> > > 
> > > The primary reason why hugetlb is living outside of memcg (and the core
> > > MM as well) is that it doesn't really fit the whole scheme, in several
> > > aspects. First and foremost, it is an independently managed resource
> > > with its own pool management, use and lifetime.
> > 
> > Honestly, the simpler explanation is that few people have used hugetlb
> > in regular, containerized non-HPC workloads.
> > 
> > Hugetlb has historically been much more special, and it retains a
> > specialness that warrants e.g. the hugetlb cgroup container. But it
> > has also made strides with hugetlb_cma, migratability, madvise support
> > etc. that allows much more on-demand use. It's no longer the case that
> > you just put a static pool of memory aside during boot and only a few
> > blessed applications are using it.
> > 
> > For example, we're using hugetlb_cma very broadly with generic
> > containers. The CMA region is fully usable by movable non-huge stuff
> > until huge pages are allocated in it. With the hugetlb controller you
> > can define a maximum number of hugetlb pages that can be used per
> > container. But what if that container isn't using any? Why shouldn't
> > it be allowed to use its overall memory allowance for anon and cache
> > instead?
> 
> Cool, I remember proposing hugetlb memcg stats several years ago, and if
> I remember correctly, at that time you were opposing it based on the idea
> that huge pages are not a part of the overall memcg flow: they are not
> subject to memory pressure, can't be evicted, etc. And THPs were seen
> as a long-term replacement. Even though all of the above is true, hugetlb
> has its niche and I don't think THPs will realistically replace it any
> time soon.

Yeah, Michal's arguments very much reminded me of my stance then. I
stand corrected.

I'm still hopeful that we can make 2M work transparently. I'd expect
1G to remain in the hugetlb domain for some time to come, but even
those are mostly dynamic now with your hugetlb_cma feature!

> So I'm glad to see this effort (and very supportive) on making hugetlb
> more convenient and transparent for an end user.

Thanks!


* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-09-27 18:47   ` Johannes Weiner
  2023-09-27 21:37     ` Roman Gushchin
@ 2023-10-01 23:27     ` Mike Kravetz
  2023-10-02 14:42       ` Johannes Weiner
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Kravetz @ 2023-10-01 23:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Nhat Pham, akpm, riel, roman.gushchin, shakeelb,
	muchun.song, tj, lizefan.x, shuah, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On 09/27/23 14:47, Johannes Weiner wrote:
> On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> 
> So that if you use 80% hugetlb, the other memory is forced to stay in
> the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> group is still allowed to use the full 100% of its host memory
> allowance, without requiring some outside agent continuously
> monitoring and adjusting the container limits.

Jumping in late here as I was traveling last week.  In addition, I want
to state my limited cgroup knowledge up front.

I was thinking of your scenario above a little differently.  Suppose a
group is up and running at almost 100% memory usage.  However, the majority
of that memory is reclaimable.  Now, someone wants to allocate a 2M hugetlb
page.  There is not 2MB free, but we could easily reclaim 2MB to make room
for the hugetlb page.  I may be missing something, but I do not see how that
is going to happen.  It seems like we would really want that behavior.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-10-01 23:27     ` Mike Kravetz
@ 2023-10-02 14:42       ` Johannes Weiner
  2023-10-02 14:58         ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2023-10-02 14:42 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, Nhat Pham, akpm, riel, roman.gushchin, shakeelb,
	muchun.song, tj, lizefan.x, shuah, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote:
> On 09/27/23 14:47, Johannes Weiner wrote:
> > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > 
> > So that if you use 80% hugetlb, the other memory is forced to stay in
> > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> > group is still allowed to use the full 100% of its host memory
> > allowance, without requiring some outside agent continuously
> > monitoring and adjusting the container limits.
> 
> Jumping in late here as I was traveling last week.  In addition, I want
> to state my limited cgroup knowledge up front.
> 
> I was thinking of your scenario above a little differently.  Suppose a
> group is up and running at almost 100% memory usage.  However, the majority
> of that memory is reclaimable.  Now, someone wants to allocate a 2M hugetlb
> page.  There is not 2MB free, but we could easily reclaim 2MB to make room
> for the hugetlb page.  I may be missing something, but I do not see how that
> is going to happen.  It seems like we would really want that behavior.

But that is actually what it does, no?

alloc_hugetlb_folio
  mem_cgroup_hugetlb_charge_folio
    charge_memcg
      try_charge
        !page_counter_try_charge ?
          !try_to_free_mem_cgroup_pages ?
            mem_cgroup_oom

So it does reclaim when the hugetlb hits the cgroup limit. And if that
fails to make room, it OOMs the cgroup.

Or maybe I'm missing something?
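The call chain above can be illustrated with a toy model. This is a sketch in plain C, not kernel code: the struct and function names (`memcg_model`, `try_charge_sketch`, etc.) are ours, and the retry count is illustrative. It only models the shape of the path quoted above: try the counter, reclaim on failure, retry, and give up (OOM/SIGBUS territory) when reclaim makes no progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the charge path; all names and numbers are ours,
 * not the kernel's.  Pages are the unit throughout. */
struct memcg_model {
	long limit;       /* memory.max, in pages */
	long usage;       /* currently charged pages */
	long reclaimable; /* pages reclaim could free */
};

/* page_counter_try_charge() analogue: succeeds only under the limit. */
static bool counter_try_charge(struct memcg_model *m, long nr_pages)
{
	if (m->usage + nr_pages > m->limit)
		return false;
	m->usage += nr_pages;
	return true;
}

/* try_to_free_mem_cgroup_pages() analogue: free up to nr_pages. */
static long reclaim_pages(struct memcg_model *m, long nr_pages)
{
	long freed = nr_pages < m->reclaimable ? nr_pages : m->reclaimable;

	m->reclaimable -= freed;
	m->usage -= freed;
	return freed;
}

/* try_charge() analogue: charge, reclaim on failure, retry, and fail
 * (OOM/SIGBUS territory) once reclaim stops making progress. */
static bool try_charge_sketch(struct memcg_model *m, long nr_pages)
{
	for (int retries = 0; retries < 5; retries++) {
		if (counter_try_charge(m, nr_pages))
			return true;
		if (!reclaim_pages(m, nr_pages))
			return false; /* no progress: give up */
	}
	return false;
}
```

In Mike's scenario (group near its limit but with plenty of reclaimable memory), an order-9 charge of 512 pages succeeds after one reclaim pass; with nothing reclaimable, it fails.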

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-10-02 14:42       ` Johannes Weiner
@ 2023-10-02 14:58         ` Michal Hocko
  2023-10-02 15:36           ` Johannes Weiner
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-10-02 14:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mike Kravetz, Nhat Pham, akpm, riel, roman.gushchin, shakeelb,
	muchun.song, tj, lizefan.x, shuah, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Mon 02-10-23 10:42:50, Johannes Weiner wrote:
> On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote:
> > On 09/27/23 14:47, Johannes Weiner wrote:
> > > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > > 
> > > So that if you use 80% hugetlb, the other memory is forced to stay in
> > > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> > > group is still allowed to use the full 100% of its host memory
> > > allowance, without requiring some outside agent continuously
> > > monitoring and adjusting the container limits.
> > 
> > Jumping in late here as I was traveling last week.  In addition, I want
> > to state my limited cgroup knowledge up front.
> > 
> > I was thinking of your scenario above a little differently.  Suppose a
> > group is up and running at almost 100% memory usage.  However, the majority
> > of that memory is reclaimable.  Now, someone wants to allocate a 2M hugetlb
> > page.  There is not 2MB free, but we could easily reclaim 2MB to make room
> > for the hugetlb page.  I may be missing something, but I do not see how that
> > is going to happen.  It seems like we would really want that behavior.
> 
> But that is actually what it does, no?
> 
> alloc_hugetlb_folio
>   mem_cgroup_hugetlb_charge_folio
>     charge_memcg
>       try_charge
>         !page_counter_try_charge ?
>           !try_to_free_mem_cgroup_pages ?
>             mem_cgroup_oom
> 
> So it does reclaim when the hugetlb hits the cgroup limit. And if that
> fails to make room, it OOMs the cgroup.
> 
> Or maybe I'm missing something?

I believe that Mike alludes to what I have pointed out in another email:
http://lkml.kernel.org/r/ZRrI90KcRBwVZn/r@dhcp22.suse.cz and a situation
where the hugetlb request results in an actual hugetlb allocation rather
than consumption from the pre-allocated pool. In that case memcg is not
involved, because the charge happens only after the allocation. That,
btw., means that this request could disrupt a different memcg even if
the current one is at its limit or could be reclaimed instead.

Also, there is no OOM, as hugetlb pages are costly requests and we do
not invoke the OOM killer for those.
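The "costly request" distinction can be sketched as follows. This is a minimal model, not the kernel source: the threshold mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER (3), and the function name is ours. The point is only that a 2MB huge page is an order-9 charge, well above the costly threshold, so the memcg OOM killer is skipped and the caller sees a plain allocation failure (SIGBUS for hugetlb faults) instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the costly-order bailout described above; the threshold
 * value mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER, the function
 * name is ours. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Would the memcg OOM killer be considered for a charge of this order? */
static bool memcg_oom_allowed(unsigned int order)
{
	/* Costly (high-order) requests never trigger the OOM killer;
	 * they just fail and let the caller handle it. */
	return order <= PAGE_ALLOC_COSTLY_ORDER;
}
```

So an order-0 charge may invoke the OOM killer, while an order-9 (2MB) hugetlb charge never does.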

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] hugetlb memcg accounting
  2023-10-02 14:58         ` Michal Hocko
@ 2023-10-02 15:36           ` Johannes Weiner
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2023-10-02 15:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Nhat Pham, akpm, riel, roman.gushchin, shakeelb,
	muchun.song, tj, lizefan.x, shuah, yosryahmed, linux-mm,
	kernel-team, linux-kernel, cgroups

On Mon, Oct 02, 2023 at 04:58:21PM +0200, Michal Hocko wrote:
> Also there is not OOM as hugetlb pages are costly requests and we do not
> invoke the oom killer.

Ah good point.

That seems like a policy choice we could make. However, since hugetlb
users are already set up for, and have come to expect, SIGBUS on
physical allocation failure as well as on hugetlb_cgroup limits, we
should have memcg follow the established precedent and leave the OOM
killer out.

I agree that a sentence about this in the changelog makes sense, though.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-10-02 15:36 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-26 19:49 [PATCH 0/2] hugetlb memcg accounting Nhat Pham
2023-09-26 19:49 ` [PATCH 1/2] hugetlb: memcg: account hugetlb-backed memory in memory controller Nhat Pham
2023-09-26 19:49 ` [PATCH 2/2] selftests: add a selftest to verify hugetlb usage in memcg Nhat Pham
2023-09-26 20:41   ` Nhat Pham
2023-09-26 20:50 ` [PATCH 0/2] hugetlb memcg accounting Frank van der Linden
2023-09-26 22:14   ` Johannes Weiner
2023-09-27 12:50     ` Michal Hocko
2023-09-27 16:44       ` Johannes Weiner
2023-09-27 17:22         ` Nhat Pham
2023-09-26 23:31   ` Nhat Pham
2023-09-27 11:21 ` Michal Hocko
2023-09-27 18:47   ` Johannes Weiner
2023-09-27 21:37     ` Roman Gushchin
2023-09-28 12:52       ` Johannes Weiner
2023-10-01 23:27     ` Mike Kravetz
2023-10-02 14:42       ` Johannes Weiner
2023-10-02 14:58         ` Michal Hocko
2023-10-02 15:36           ` Johannes Weiner
2023-09-27 23:33   ` Nhat Pham
2023-09-28  1:00 ` Nhat Pham
