* [PATCH v1] hugetlb: Add hugetlb.*.numa_stat file
@ 2021-10-19 21:54 Mina Almasry
From: Mina Almasry @ 2021-10-19 21:54 UTC
Cc: Mina Almasry, Mike Kravetz, Andrew Morton, Shuah Khan,
Miaohe Lin, Oscar Salvador, Michal Hocko, Muchun Song,
David Rientjes, Jue Wang, Yang Yao, Joanna Li, Cannon Matthews,
linux-mm, linux-kernel
For hugetlb backed jobs/VMs it's critical to understand the numa
information for the memory backing these jobs to deliver optimal
performance.
Currently this can technically be queried from /proc/self/numa_maps, but
there are significant issues with that. Namely:
1. Memory can be mapped or unmapped, and numa_maps only reports mapped
memory.
2. numa_maps is per process and needs to be aggregated across all
processes in the cgroup. For shared memory this is more involved, as
userspace needs to make sure it doesn't double count shared
mappings.
3. I believe querying numa_maps needs to hold the mmap_lock which adds
to the contention on this lock.
For these reasons I propose simply adding a hugetlb.*.numa_stat file,
which shows the NUMA information of the cgroup, similarly to
memory.numa_stat.
On cgroup-v2:
cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
total=2097152 N0=2097152 N1=0
On cgroup-v1:
cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
total=2097152 N0=2097152 N1=0
hierarichal_total=2097152 N0=2097152 N1=0
This patch was tested manually by allocating hugetlb memory and querying
the hugetlb.*.numa_stat file of the cgroup and its parents.
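
The sample output above is a single line of key=value tokens, so a
userspace consumer can parse it with a few string calls. The following
is a minimal sketch, not part of the patch: struct numa_stat,
parse_numa_stat_line() and the MAX_NODES cap are hypothetical names
chosen for illustration. Any key ending in "total" is accepted, so the
same helper handles both the local line and the prefixed hierarchical
line that cgroup-v1 prints.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 64 /* illustrative cap for this sketch, not a kernel constant */

struct numa_stat {
	unsigned long long total;
	unsigned long long node[MAX_NODES];
	int nr_nodes; /* highest node id seen, plus one */
};

/*
 * Parse one line such as "total=2097152 N0=2097152 N1=0".
 * Returns 0 on success and -1 on malformed input.
 */
static int parse_numa_stat_line(const char *line, struct numa_stat *st)
{
	const char *p = line;

	memset(st, 0, sizeof(*st));
	while (*p) {
		unsigned long long val;
		const char *eq;
		size_t toklen, klen;
		char *end;

		p += strspn(p, " \n"); /* skip separators */
		if (!*p)
			break;
		toklen = strcspn(p, " \n");

		/* Every token must look like <key>=<decimal value>. */
		eq = memchr(p, '=', toklen);
		if (!eq || eq == p || eq[1] < '0' || eq[1] > '9')
			return -1;
		klen = (size_t)(eq - p);
		val = strtoull(eq + 1, &end, 10);
		if (end != p + toklen)
			return -1;

		if (p[0] == 'N' && klen > 1) {
			/* Per-node token: N<nid>=<bytes> */
			char *nend;
			long nid = strtol(p + 1, &nend, 10);

			if (nend != eq || nid < 0 || nid >= MAX_NODES)
				return -1;
			st->node[nid] = val;
			if (nid + 1 > st->nr_nodes)
				st->nr_nodes = (int)(nid + 1);
		} else if (klen >= 5 && memcmp(eq - 5, "total", 5) == 0) {
			/* "total" or any "<prefix>total" key */
			st->total = val;
		} else {
			return -1;
		}
		p += toklen;
	}
	return 0;
}
```

A caller would read each line of the numa_stat file (for example with
fgets()) and pass it to parse_numa_stat_line(); on cgroup-v1 the first
line carries the local values and the second the hierarchical ones.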

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jue Wang <juew@google.com>
Cc: Yang Yao <ygyao@google.com>
Cc: Joanna Li <joannali@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
.../admin-guide/cgroup-v1/hugetlb.rst | 4 +
Documentation/admin-guide/cgroup-v2.rst | 7 ++
include/linux/hugetlb.h | 4 +-
include/linux/hugetlb_cgroup.h | 7 ++
mm/hugetlb_cgroup.c | 93 +++++++++++++++++--
.../testing/selftests/vm/write_to_hugetlbfs.c | 9 +-
6 files changed, 113 insertions(+), 11 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index 338f2c7d7a1c..0fa724d82abb 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -29,12 +29,14 @@ Brief summary of control files::
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit
+ hugetlb.<hugepagesize>.numa_stat # show the numa information of the hugetlb memory charged to this cgroup
For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::
hugetlb.1GB.limit_in_bytes
hugetlb.1GB.max_usage_in_bytes
+ hugetlb.1GB.numa_stat
hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt
hugetlb.1GB.rsvd.limit_in_bytes
@@ -43,6 +45,7 @@ files include::
hugetlb.1GB.rsvd.failcnt
hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes
+ hugetlb.64KB.numa_stat
hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt
hugetlb.64KB.rsvd.limit_in_bytes
@@ -51,6 +54,7 @@ files include::
hugetlb.64KB.rsvd.failcnt
hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes
+ hugetlb.32MB.numa_stat
hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt
hugetlb.32MB.rsvd.limit_in_bytes
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4d8c27eca96b..8ba0d6aadd2c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2252,6 +2252,13 @@ HugeTLB Interface Files
are local to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.
+ hugetlb.<hugepagesize>.numa_stat
+ Similar to memory.numa_stat, it shows the numa information of the
+ memory in this cgroup:
+
+ /dev/cgroup/memory/test # cat hugetlb.2MB.numa_stat
+ total=0 N0=0 N1=0
+
Misc
----
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..0445faaa636e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -613,8 +613,8 @@ struct hstate {
#endif
#ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
- struct cftype cgroup_files_dfl[7];
- struct cftype cgroup_files_legacy[9];
+ struct cftype cgroup_files_dfl[8];
+ struct cftype cgroup_files_legacy[10];
#endif
char name[HSTATE_NAME_LEN];
};
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index c137396129db..54ff6ec68ed3 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -36,6 +36,11 @@ enum hugetlb_memory_event {
HUGETLB_NR_MEMORY_EVENTS,
};
+struct hugetlb_cgroup_per_node {
+ /* hugetlb usage in bytes over all hstates. */
+ unsigned long usage[HUGE_MAX_HSTATE];
+};
+
struct hugetlb_cgroup {
struct cgroup_subsys_state css;
@@ -57,6 +62,8 @@ struct hugetlb_cgroup {
/* Handle for "hugetlb.events.local" */
struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
+
+ struct hugetlb_cgroup_per_node *nodeinfo[];
};
static inline struct hugetlb_cgroup *
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 5383023d0cca..0a550954fb5a 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -92,6 +92,7 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
struct hugetlb_cgroup *parent_h_cgroup)
{
int idx;
+ int node;
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
struct page_counter *fault_parent = NULL;
@@ -124,6 +125,15 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
limit);
VM_BUG_ON(ret);
}
+
+ for_each_node(node) {
+ /* Set node_to_alloc to -1 for offline nodes. */
+ int node_to_alloc =
+ node_state(node, N_NORMAL_MEMORY) ? node : -1;
+ h_cgroup->nodeinfo[node] =
+ kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
+ GFP_KERNEL, node_to_alloc);
+ }
}
static struct cgroup_subsys_state *
@@ -132,7 +142,10 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
struct hugetlb_cgroup *h_cgroup;
- h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
+ unsigned int size =
+ sizeof(*h_cgroup) +
+ MAX_NUMNODES * sizeof(struct hugetlb_cgroup_per_node *);
+ h_cgroup = kzalloc(size, GFP_KERNEL);
if (!h_cgroup)
return ERR_PTR(-ENOMEM);
@@ -292,7 +305,9 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
return;
__set_hugetlb_cgroup(page, h_cg, rsvd);
- return;
+ if (!rsvd && h_cg)
+ h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages
+ << PAGE_SHIFT;
}
void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
@@ -331,7 +346,9 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
if (rsvd)
css_put(&h_cg->css);
-
+ else
+ h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages
+ << PAGE_SHIFT;
return;
}
@@ -421,6 +438,56 @@ enum {
RES_RSVD_FAILCNT,
};
+static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
+{
+ int nid;
+ struct cftype *cft = seq_cft(seq);
+ int idx = MEMFILE_IDX(cft->private);
+ bool legacy = MEMFILE_ATTR(cft->private);
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
+ struct cgroup_subsys_state *css;
+ unsigned long usage;
+
+ if (legacy) {
+ /* Add up usage across all nodes for the non-hierarchical total. */
+ usage = 0;
+ for_each_node_state(nid, N_MEMORY)
+ usage += h_cg->nodeinfo[nid]->usage[idx];
+ seq_printf(seq, "total=%lu", usage);
+
+ /* Simply print the per-node usage for the non-hierarchical total. */
+ for_each_node_state(nid, N_MEMORY)
+ seq_printf(seq, " N%d=%lu", nid,
+ h_cg->nodeinfo[nid]->usage[idx]);
+ seq_putc(seq, '\n');
+ }
+
+ /* The hierarchical total is pretty much the value recorded by the
+ * counter, so use that.
+ */
+ seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
+ (u64)page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
+
+ /* For each node, traverse the css tree to obtain the hierarchical
+ * node usage.
+ */
+ for_each_node_state(nid, N_MEMORY) {
+ usage = 0;
+ rcu_read_lock();
+ css_for_each_descendant_pre(css, &h_cg->css) {
+ usage += hugetlb_cgroup_from_css(css)
+ ->nodeinfo[nid]
+ ->usage[idx];
+ }
+ rcu_read_unlock();
+ seq_printf(seq, " N%d=%lu", nid, usage);
+ }
+
+ seq_putc(seq, '\n');
+
+ return 0;
+}
+
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
@@ -654,8 +721,14 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
cft->seq_show = hugetlb_cgroup_read_u64_max;
cft->flags = CFTYPE_NOT_ON_ROOT;
- /* Add the events file */
+ /* Add the numa stat file */
cft = &h->cgroup_files_dfl[4];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+ cft->seq_show = hugetlb_cgroup_read_numa_stat;
+ cft->flags = CFTYPE_NOT_ON_ROOT;
+
+ /* Add the events file */
+ cft = &h->cgroup_files_dfl[5];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
cft->private = MEMFILE_PRIVATE(idx, 0);
cft->seq_show = hugetlb_events_show;
@@ -663,7 +736,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
cft->flags = CFTYPE_NOT_ON_ROOT;
/* Add the events.local file */
- cft = &h->cgroup_files_dfl[5];
+ cft = &h->cgroup_files_dfl[6];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
cft->private = MEMFILE_PRIVATE(idx, 0);
cft->seq_show = hugetlb_events_local_show;
@@ -672,7 +745,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
cft->flags = CFTYPE_NOT_ON_ROOT;
/* NULL terminate the last cft */
- cft = &h->cgroup_files_dfl[6];
+ cft = &h->cgroup_files_dfl[7];
memset(cft, 0, sizeof(*cft));
WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -742,8 +815,14 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
+ /* Add the numa stat file */
+ cft = &h->cgroup_files_legacy[8];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+ cft->private = MEMFILE_PRIVATE(idx, 1);
+ cft->seq_show = hugetlb_cgroup_read_numa_stat;
+
/* NULL terminate the last cft */
- cft = &h->cgroup_files_legacy[8];
+ cft = &h->cgroup_files_legacy[9];
memset(cft, 0, sizeof(*cft));
WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
index 6a2caba19ee1..d2da6315a40c 100644
--- a/tools/testing/selftests/vm/write_to_hugetlbfs.c
+++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
@@ -37,8 +37,8 @@ static int shmid;
static void exit_usage(void)
{
printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
- "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
- "[-o] [-w] [-n]\n",
+ "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l(sleep)] [-r(private)] "
+ "[-o(populate)] [-w(rite)] [-n(o-reserve)]\n",
self);
exit(EXIT_FAILURE);
}
@@ -161,6 +161,11 @@ int main(int argc, char **argv)
else
printf("RESERVE mapping.\n");
+ if (want_sleep)
+ printf("Sleeping\n");
+ else
+ printf("Not sleeping\n");
+
switch (method) {
case HUGETLBFS:
printf("Allocating using HUGETLBFS.\n");
--
2.33.0.1079.g6e70778dc9-goog
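
For readers who want to see the accounting scheme in isolation, here is
a userspace sketch of what the patch does: a per-cgroup array of
per-node structures hung off a flexible array member, byte counters
updated on charge and uncharge, and a hierarchical per-node value
obtained by walking the subtree. All sketch_* names and SKETCH_*
constants are hypothetical, calloc() stands in for kzalloc() and
kzalloc_node(), and a plain child list stands in for the css tree. The
patch as posted does not check the kzalloc_node() result; the sketch
does check its allocations.

```c
#include <stdlib.h>

#define SKETCH_NR_NODES   2  /* hypothetical node count */
#define SKETCH_NR_HSTATES 2  /* hypothetical number of huge page sizes */
#define SKETCH_PAGE_SHIFT 12 /* assumes 4 KiB base pages */

struct sketch_per_node {
	unsigned long usage[SKETCH_NR_HSTATES]; /* bytes, one slot per hstate */
};

struct sketch_cg {
	struct sketch_cg *first_child;
	struct sketch_cg *next_sibling;
	struct sketch_per_node *nodeinfo[]; /* one pointer per node */
};

/* Free one group and its per-node data; does not unlink it from a parent. */
static void sketch_cg_free(struct sketch_cg *cg)
{
	int nid;

	if (!cg)
		return;
	for (nid = 0; nid < SKETCH_NR_NODES; nid++)
		free(cg->nodeinfo[nid]);
	free(cg);
}

/* Allocate the group plus room for the trailing pointer array. */
static struct sketch_cg *sketch_cg_alloc(struct sketch_cg *parent)
{
	struct sketch_cg *cg;
	int nid;

	cg = calloc(1, sizeof(*cg) + SKETCH_NR_NODES * sizeof(cg->nodeinfo[0]));
	if (!cg)
		return NULL;
	for (nid = 0; nid < SKETCH_NR_NODES; nid++) {
		cg->nodeinfo[nid] = calloc(1, sizeof(*cg->nodeinfo[nid]));
		if (!cg->nodeinfo[nid]) {
			sketch_cg_free(cg);
			return NULL;
		}
	}
	if (parent) {
		cg->next_sibling = parent->first_child;
		parent->first_child = cg;
	}
	return cg;
}

/* Charge and uncharge convert a base-page count to bytes, as the patch does. */
static void sketch_charge(struct sketch_cg *cg, int nid, int idx,
			  unsigned long nr_pages)
{
	cg->nodeinfo[nid]->usage[idx] += nr_pages << SKETCH_PAGE_SHIFT;
}

static void sketch_uncharge(struct sketch_cg *cg, int nid, int idx,
			    unsigned long nr_pages)
{
	cg->nodeinfo[nid]->usage[idx] -= nr_pages << SKETCH_PAGE_SHIFT;
}

/* Hierarchical usage of one node: this group plus all of its descendants. */
static unsigned long sketch_hier_usage(const struct sketch_cg *cg, int nid,
				       int idx)
{
	unsigned long sum = cg->nodeinfo[nid]->usage[idx];
	const struct sketch_cg *child;

	for (child = cg->first_child; child; child = child->next_sibling)
		sum += sketch_hier_usage(child, nid, idx);
	return sum;
}
```

Charging 512 base pages (one 2 MB huge page under the 4 KiB assumption)
to a child on node 1 leaves the child's local counter and the parent's
hierarchical value for node 1 both at 2097152 bytes, while the parent's
local counter for that node stays at 0.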
* Re: [PATCH v1] hugetlb: Add hugetlb.*.numa_stat file
From: kernel test robot @ 2021-10-20 18:41 UTC
To: Mina Almasry
Cc: llvm, kbuild-all, Mina Almasry, Mike Kravetz, Andrew Morton,
Linux Memory Management List, Shuah Khan, Miaohe Lin,
Oscar Salvador, Michal Hocko, Muchun Song, David Rientjes
Hi Mina,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on hnaz-mm/master]
[also build test WARNING on tj-cgroup/for-next linus/master v5.15-rc6 next-20211020]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Mina-Almasry/hugetlb-Add-hugetlb-numa_stat-file/20211020-055543
base: https://github.com/hnaz/linux-mm master
config: x86_64-randconfig-a004-20211019 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9660563950aaed54020bfdf0be07e7096a9553e4)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/cdac71b7be0126a6f559110105cb7baff1b6552b
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Mina-Almasry/hugetlb-Add-hugetlb-numa_stat-file/20211020-055543
git checkout cdac71b7be0126a6f559110105cb7baff1b6552b
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
>> mm/hugetlb_cgroup.c:469:6: warning: format specifies type 'unsigned long' but the argument has type 'unsigned long long' [-Wformat]
(u64)page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
vim +469 mm/hugetlb_cgroup.c
440
441 static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
442 {
443 int nid;
444 struct cftype *cft = seq_cft(seq);
445 int idx = MEMFILE_IDX(cft->private);
446 bool legacy = MEMFILE_ATTR(cft->private);
447 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
448 struct cgroup_subsys_state *css;
449 unsigned long usage;
450
451 if (legacy) {
452 /* Add up usage across all nodes for the non-hierarchical total. */
453 usage = 0;
454 for_each_node_state(nid, N_MEMORY)
455 usage += h_cg->nodeinfo[nid]->usage[idx];
456 seq_printf(seq, "total=%lu", usage);
457
458 /* Simply print the per-node usage for the non-hierarchical total. */
459 for_each_node_state(nid, N_MEMORY)
460 seq_printf(seq, " N%d=%lu", nid,
461 h_cg->nodeinfo[nid]->usage[idx]);
462 seq_putc(seq, '\n');
463 }
464
465 /* The hierarchical total is pretty much the value recorded by the
466 * counter, so use that.
467 */
468 seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
> 469 (u64)page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
470
471 /* For each node, transverse the css tree to obtain the hierarichal
472 * node usage.
473 */
474 for_each_node_state(nid, N_MEMORY) {
475 usage = 0;
476 rcu_read_lock();
477 css_for_each_descendant_pre(css, &h_cg->css) {
478 usage += hugetlb_cgroup_from_css(css)
479 ->nodeinfo[nid]
480 ->usage[idx];
481 }
482 rcu_read_unlock();
483 seq_printf(seq, " N%d=%lu", nid, usage);
484 }
485
486 seq_putc(seq, '\n');
487
488 return 0;
489 }
490
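
The warning at line 469 comes from pairing a "%lu" conversion with an
argument cast to u64, which the compiler reports as 'unsigned long
long'. One way to make specifier and argument agree is to print with
"%llu"; casting the product to unsigned long and keeping "%lu" would
also silence it. The userspace sketch below shows the "%llu" variant.
It is an illustration only, not the fix the author adopted:
format_total() and SKETCH_PAGE_SIZE are hypothetical, and snprintf()
stands in for seq_printf().

```c
#include <stdio.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumes 4 KiB base pages */

/*
 * Write "<prefix>total=<bytes>" into buf. The byte count is computed as
 * unsigned long long and printed with %llu, so the format specifier and
 * the argument type match on both 32-bit and 64-bit builds.
 */
static int format_total(char *buf, size_t len, const char *prefix,
			unsigned long nr_pages)
{
	unsigned long long bytes =
		(unsigned long long)nr_pages * SKETCH_PAGE_SIZE;

	return snprintf(buf, len, "%stotal=%llu", prefix, bytes);
}
```

Calling format_total() with an empty prefix and 512 pages writes
"total=2097152", matching the sample output in the patch description.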
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33235 bytes --]