bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task
@ 2023-08-10  8:13 Chuyi Zhou
  2023-08-10  8:13 ` [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task Chuyi Zhou
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-10  8:13 UTC (permalink / raw)
  To: hannes, mhocko, roman.gushchin, ast, daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Chuyi Zhou

Changes
-------

This is v2 of the BPF OOM policy patchset.
v1 : https://lore.kernel.org/lkml/20230804093804.47039-1-zhouchuyi@bytedance.com/
v1 -> v2 changes:

- rename bpf_select_task to bpf_oom_evaluate_task and bypass the
tsk_is_oom_victim (and MMF_OOM_SKIP) logic. (Michal)

- add a new hook to set policy's name, so dump_header() can know
what has been the selection policy when reporting messages. (Michal)

- add a tracepoint when select_bad_process() find nothing. (Alan)

- add a doc to to describe how it is all supposed to work. (Alan)

================

This patchset adds a new interface and use it to select victim when OOM
is invoked. The mainly motivation is the need to customizable OOM victim
selection functionality.

The new interface is a bpf hook plugged in oom_evaluate_task. It takes oc
and current task as parameters and return a result indicating which one is
selected by the attached bpf program.

There are several conserns when designing this interface suggested by
Michal:

1. Hooking into oom_evaluate_task can keep the consistency of global and
memcg OOM interface. Besides, it seems the least disruptive to the existing
oom killer implementation.

2. Userspace can handle a lot on its own and provide the input to the BPF
program to make a decision. Since the oom scope iteration will be
implemented already in the kernel so all the BPF program has to do is to
rank processes or memcgs.

3. The new interface should better bypass the current heuristic rules
(e.g., tsk_is_oom_victim, and MMF_OOM_SKIP) to meet an arbitrary oom
policy's need.

Chuyi Zhou (5):
  mm, oom: Introduce bpf_oom_evaluate_task
  mm: Add policy_name to identify OOM policies
  mm: Add a tracepoint when OOM victim selection is failed
  bpf: Add a OOM policy test
  bpf: Add a BPF OOM policy Doc

 Documentation/bpf/oom.rst                     |  70 +++++++++
 include/linux/oom.h                           |   7 +
 include/trace/events/oom.h                    |  18 +++
 mm/oom_kill.c                                 | 100 +++++++++++--
 .../bpf/prog_tests/test_oom_policy.c          | 140 ++++++++++++++++++
 .../testing/selftests/bpf/progs/oom_policy.c  | 104 +++++++++++++
 6 files changed, 428 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/bpf/oom.rst
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
 create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-10  8:13 [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task Chuyi Zhou
@ 2023-08-10  8:13 ` Chuyi Zhou
  2023-08-17  2:07   ` Alexei Starovoitov
                     ` (2 more replies)
  2023-08-10  8:13 ` [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies Chuyi Zhou
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-10  8:13 UTC (permalink / raw)
  To: hannes, mhocko, roman.gushchin, ast, daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Chuyi Zhou, Michal Hocko

This patch adds a new hook bpf_oom_evaluate_task in oom_evaluate_task. It
takes oc and current iterating task as parameters and returns a result
indicating which one should be selected. We can use it to bypass the
current logic of oom_evaluate_task and implement customized OOM policies
in the attached BPF progams.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 mm/oom_kill.c | 59 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..255c9ef1d808 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@
  *  kernel subsystems and hints as to where to find out what things do.
  */
 
+#include <linux/bpf.h>
 #include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/err.h>
@@ -305,6 +306,27 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
+enum {
+	NO_BPF_POLICY,
+	BPF_EVAL_ABORT,
+	BPF_EVAL_NEXT,
+	BPF_EVAL_SELECT,
+};
+
+__weak noinline int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc)
+{
+	return NO_BPF_POLICY;
+}
+
+BTF_SET8_START(oom_bpf_fmodret_ids)
+BTF_ID_FLAGS(func, bpf_oom_evaluate_task)
+BTF_SET8_END(oom_bpf_fmodret_ids)
+
+static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
+	.owner = THIS_MODULE,
+	.set   = &oom_bpf_fmodret_ids,
+};
+
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
@@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
 		goto next;
 
+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then select it.
+	 */
+	if (oom_task_origin(task)) {
+		points = LONG_MAX;
+		goto select;
+	}
+
+	switch (bpf_oom_evaluate_task(task, oc)) {
+	case BPF_EVAL_ABORT:
+		goto abort; /* abort search process */
+	case BPF_EVAL_NEXT:
+		goto next; /* ignore the task */
+	case BPF_EVAL_SELECT:
+		goto select; /* select the task */
+	default:
+		break; /* No BPF policy */
+	}
+
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -329,15 +371,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
-	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
-	 */
-	if (oom_task_origin(task)) {
-		points = LONG_MAX;
-		goto select;
-	}
-
 	points = oom_badness(task, oc->totalpages);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
@@ -732,10 +765,18 @@ static struct ctl_table vm_oom_kill_table[] = {
 
 static int __init oom_init(void)
 {
+	int err;
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
 #ifdef CONFIG_SYSCTL
 	register_sysctl_init("vm", vm_oom_kill_table);
 #endif
+
+#ifdef CONFIG_BPF_SYSCALL
+	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
+	if (err)
+		pr_warn("error while registering oom fmodret entrypoints: %d", err);
+#endif
+
 	return 0;
 }
 subsys_initcall(oom_init)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies
  2023-08-10  8:13 [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task Chuyi Zhou
  2023-08-10  8:13 ` [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task Chuyi Zhou
@ 2023-08-10  8:13 ` Chuyi Zhou
  2023-08-14 20:51   ` Jonathan Corbet
  2023-08-10  8:13 ` [RFC PATCH v2 3/5] mm: Add a tracepoint when OOM victim selection is failed Chuyi Zhou
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-10  8:13 UTC (permalink / raw)
  To: hannes, mhocko, roman.gushchin, ast, daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Chuyi Zhou

This patch adds a new metadata policy_name in oom_control and report it
in dump_header(), so we can know what has been the selection policy. In
BPF program, we can call kfunc set_oom_policy_name to set the current
user-defined policy name. The in-kernel policy_name is "default".

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 include/linux/oom.h |  7 +++++++
 mm/oom_kill.c       | 42 +++++++++++++++++++++++++++++++++++++++---
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7d0c9c48a0c5..69d0f2ec6ea6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -22,6 +22,10 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+enum {
+	POLICY_NAME_LEN = 16,
+};
+
 /*
  * Details of the page allocation that triggered the oom killer that are used to
  * determine what should be killed.
@@ -52,6 +56,9 @@ struct oom_control {
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
+
+	/* Used to report the policy info. */
+	char policy_name[POLICY_NAME_LEN];
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 255c9ef1d808..3239dcdba4d7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -443,6 +443,35 @@ static int dump_task(struct task_struct *p, void *arg)
 	return 0;
 }
 
+__bpf_kfunc void set_oom_policy_name(struct oom_control *oc, const char *src, size_t sz)
+{
+	memset(oc->policy_name, 0, sizeof(oc->policy_name));
+
+	if (sz > POLICY_NAME_LEN)
+		sz = POLICY_NAME_LEN;
+
+	memcpy(oc->policy_name, src, sz);
+}
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "kfuncs which will be used in BPF programs");
+
+__weak noinline void bpf_set_policy_name(struct oom_control *oc)
+{
+}
+
+__diag_pop();
+
+BTF_SET8_START(bpf_oom_policy_kfunc_ids)
+BTF_ID_FLAGS(func, set_oom_policy_name)
+BTF_SET8_END(bpf_oom_policy_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_oom_policy_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_oom_policy_kfunc_ids,
+};
+
 /**
  * dump_tasks - dump current memory state of all system tasks
  * @oc: pointer to struct oom_control
@@ -484,8 +513,8 @@ static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
 
 static void dump_header(struct oom_control *oc, struct task_struct *p)
 {
-	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
-		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
+	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, policy_name=%s, oom_score_adj=%hd\n",
+		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, oc->policy_name,
 			current->signal->oom_score_adj);
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
@@ -775,8 +804,11 @@ static int __init oom_init(void)
 	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
 	if (err)
 		pr_warn("error while registering oom fmodret entrypoints: %d", err);
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+					&bpf_oom_policy_kfunc_set);
+	if (err)
+		pr_warn("error while registering oom kfunc entrypoints: %d", err);
 #endif
-
 	return 0;
 }
 subsys_initcall(oom_init)
@@ -1196,6 +1228,10 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	set_oom_policy_name(oc, "default", sizeof("default"));
+
+	bpf_set_policy_name(oc);
+
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 3/5] mm: Add a tracepoint when OOM victim selection is failed
  2023-08-10  8:13 [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task Chuyi Zhou
  2023-08-10  8:13 ` [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task Chuyi Zhou
  2023-08-10  8:13 ` [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies Chuyi Zhou
@ 2023-08-10  8:13 ` Chuyi Zhou
  2023-08-16 11:54   ` Alan Maguire
  2023-08-10  8:13 ` [RFC PATCH v2 4/5] bpf: Add a OOM policy test Chuyi Zhou
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-10  8:13 UTC (permalink / raw)
  To: hannes, mhocko, roman.gushchin, ast, daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Chuyi Zhou, Alan Maguire

This patch add a tracepoint to mark the scenario where nothing was
chosen for OOM killer. This would allow BPF programs to catch the fact
that the BPF OOM policy didn't work well.

Suggested-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 include/trace/events/oom.h | 18 ++++++++++++++++++
 mm/oom_kill.c              |  1 +
 2 files changed, 19 insertions(+)

diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index 26a11e4a2c36..b6ae1134229c 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -6,6 +6,7 @@
 #define _TRACE_OOM_H
 #include <linux/tracepoint.h>
 #include <trace/events/mmflags.h>
+#include <linux/oom.h>
 
 TRACE_EVENT(oom_score_adj_update,
 
@@ -151,6 +152,23 @@ TRACE_EVENT(skip_task_reaping,
 	TP_printk("pid=%d", __entry->pid)
 );
 
+TRACE_EVENT(select_bad_process_end,
+
+	TP_PROTO(struct oom_control *oc),
+
+	TP_ARGS(oc),
+
+	TP_STRUCT__entry(
+		__array(char, policy_name, POLICY_NAME_LEN)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->policy_name, oc->policy_name, POLICY_NAME_LEN);
+	),
+
+	TP_printk("policy_name=%s", __entry->policy_name)
+);
+
 #ifdef CONFIG_COMPACTION
 TRACE_EVENT(compact_retry,
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3239dcdba4d7..af40a1b750fa 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1235,6 +1235,7 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		trace_select_bad_process_end(oc);
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 4/5] bpf: Add a OOM policy test
  2023-08-10  8:13 [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task Chuyi Zhou
                   ` (2 preceding siblings ...)
  2023-08-10  8:13 ` [RFC PATCH v2 3/5] mm: Add a tracepoint when OOM victim selection is failed Chuyi Zhou
@ 2023-08-10  8:13 ` Chuyi Zhou
  2023-08-16 11:53   ` Alan Maguire
  2023-09-13  7:55   ` Bixuan Cui
  2023-08-10  8:13 ` [RFC PATCH v2 5/5] bpf: Add a BPF OOM policy Doc Chuyi Zhou
  2023-08-16 15:49 ` [PATCH RFC v2 0/5] mm: Select victim using bpf_oom_evaluate_task Yosry Ahmed
  5 siblings, 2 replies; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-10  8:13 UTC (permalink / raw)
  To: hannes, mhocko, roman.gushchin, ast, daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Chuyi Zhou

This patch adds a test which implements a priority-based policy through
bpf_oom_evaluate_task.

The BPF program, oom_policy.c, compares the cgroup priority of two tasks
and select the lower one. The userspace program test_oom_policy.c
maintains a priority map by using cgroup id as the keys and priority as
the values. We could protect certain cgroups from oom-killer by setting
higher priority.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 .../bpf/prog_tests/test_oom_policy.c          | 140 ++++++++++++++++++
 .../testing/selftests/bpf/progs/oom_policy.c  | 104 +++++++++++++
 2 files changed, 244 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
 create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
new file mode 100644
index 000000000000..bea61ff22603
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/stat.h>
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "oom_policy.skel.h"
+
+static int map_fd;
+static int cg_nr;
+struct {
+	const char *path;
+	int fd;
+	unsigned long long id;
+} cgs[] = {
+	{ "/cg1" },
+	{ "/cg2" },
+};
+
+
+static struct oom_policy *open_load_oom_policy_skel(void)
+{
+	struct oom_policy *skel;
+	int err;
+
+	skel = oom_policy__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return NULL;
+
+	err = oom_policy__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	return skel;
+
+cleanup:
+	oom_policy__destroy(skel);
+	return NULL;
+}
+
+static void run_memory_consume(unsigned long long consume_size, int idx)
+{
+	char *buf;
+
+	join_parent_cgroup(cgs[idx].path);
+	buf = malloc(consume_size);
+	memset(buf, 0, consume_size);
+	sleep(2);
+	exit(0);
+}
+
+static int set_cgroup_prio(unsigned long long cg_id, int prio)
+{
+	int err;
+
+	err = bpf_map_update_elem(map_fd, &cg_id, &prio, BPF_ANY);
+	ASSERT_EQ(err, 0, "update_map");
+	return err;
+}
+
+static int prepare_cgroup_environment(void)
+{
+	int err;
+
+	err = setup_cgroup_environment();
+	if (err)
+		goto clean_cg_env;
+	for (int i = 0; i < cg_nr; i++) {
+		err = cgs[i].fd = create_and_get_cgroup(cgs[i].path);
+		if (!ASSERT_GE(cgs[i].fd, 0, "cg_create"))
+			goto clean_cg_env;
+		cgs[i].id = get_cgroup_id(cgs[i].path);
+	}
+	return 0;
+clean_cg_env:
+	cleanup_cgroup_environment();
+	return err;
+}
+
+void test_oom_policy(void)
+{
+	struct oom_policy *skel;
+	struct bpf_link *link;
+	int err;
+	int victim_pid;
+	unsigned long long victim_cg_id;
+
+	link = NULL;
+	cg_nr = ARRAY_SIZE(cgs);
+
+	skel = open_load_oom_policy_skel();
+	err = oom_policy__attach(skel);
+	if (!ASSERT_OK(err, "oom_policy__attach"))
+		goto cleanup;
+
+	map_fd = bpf_object__find_map_fd_by_name(skel->obj, "cg_map");
+	if (!ASSERT_GE(map_fd, 0, "find map"))
+		goto cleanup;
+
+	err = prepare_cgroup_environment();
+	if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
+		goto cleanup;
+
+	write_cgroup_file("/", "memory.max", "10M");
+
+	/*
+	 * Set higher priority to cg2 and lower to cg1, so we would select
+	 * task under cg1 as victim.(see oom_policy.c)
+	 */
+	set_cgroup_prio(cgs[0].id, 10);
+	set_cgroup_prio(cgs[1].id, 50);
+
+	victim_cg_id = cgs[0].id;
+	victim_pid = fork();
+
+	if (victim_pid == 0)
+		run_memory_consume(1024 * 1024 * 4, 0);
+
+	if (fork() == 0)
+		run_memory_consume(1024 * 1024 * 8, 1);
+
+	while (wait(NULL) > 0)
+		;
+
+	ASSERT_EQ(skel->bss->victim_pid, victim_pid, "victim_pid");
+	ASSERT_EQ(skel->bss->victim_cg_id, victim_cg_id, "victim_cgid");
+	ASSERT_EQ(skel->bss->failed_cnt, 1, "failed_cnt");
+cleanup:
+	bpf_link__destroy(link);
+	oom_policy__destroy(skel);
+	cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/oom_policy.c b/tools/testing/selftests/bpf/progs/oom_policy.c
new file mode 100644
index 000000000000..fc9efc93914e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/oom_policy.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, int);
+	__type(value, int);
+	__uint(max_entries, 24);
+} cg_map SEC(".maps");
+
+unsigned int victim_pid;
+u64 victim_cg_id;
+int failed_cnt;
+
+#define	EOPNOTSUPP	95
+
+enum {
+	NO_BPF_POLICY,
+	BPF_EVAL_ABORT,
+	BPF_EVAL_NEXT,
+	BPF_EVAL_SELECT,
+};
+
+extern void set_oom_policy_name(struct oom_control *oc, const char *buf, size_t sz) __ksym;
+
+static __always_inline u64 task_cgroup_id(struct task_struct *task)
+{
+	struct kernfs_node *node;
+	struct task_group *tg;
+
+	if (!task)
+		return 0;
+
+	tg = task->sched_task_group;
+	node = tg->css.cgroup->kn;
+
+	return node->id;
+}
+
+SEC("fentry/oom_kill_process")
+int BPF_PROG(oom_kill_process_k, struct oom_control *oc, const char *message)
+{
+	struct task_struct *victim = oc->chosen;
+
+	if (victim) {
+		victim_cg_id = task_cgroup_id(victim);
+		victim_pid = victim->pid;
+	}
+
+	return 0;
+}
+
+SEC("fentry/bpf_set_policy_name")
+int BPF_PROG(set_police_name_k, struct oom_control *oc)
+{
+	char name[] = "cg_prio";
+	set_oom_policy_name(oc, name, sizeof(name));
+	return 0;
+}
+
+SEC("tp_btf/select_bad_process_end")
+int BPF_PROG(record_failed, struct oom_control *oc)
+{
+	failed_cnt += 1;
+	return 0;
+}
+
+SEC("fmod_ret/bpf_oom_evaluate_task")
+int BPF_PROG(bpf_oom_evaluate_task, struct task_struct *task, struct oom_control *oc)
+{
+	int chosen_cg_prio, task_cg_prio;
+	u64 chosen_cg_id, task_cg_id;
+	struct task_struct *chosen;
+	int *val;
+
+	if (!failed_cnt)
+		return BPF_EVAL_NEXT;
+
+	chosen = oc->chosen;
+	if (!chosen)
+		return BPF_EVAL_SELECT;
+
+	chosen_cg_id = task_cgroup_id(chosen);
+	task_cg_id = task_cgroup_id(task);
+	chosen_cg_prio = task_cg_prio = 0;
+	val = bpf_map_lookup_elem(&cg_map, &chosen_cg_id);
+	if (val)
+		chosen_cg_prio = *val;
+	val = bpf_map_lookup_elem(&cg_map, &task_cg_id);
+	if (val)
+		task_cg_prio = *val;
+
+	if (chosen_cg_prio > task_cg_prio)
+		return BPF_EVAL_SELECT;
+	if (chosen_cg_prio < task_cg_prio)
+		return BPF_EVAL_NEXT;
+
+	return NO_BPF_POLICY;
+}
+
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 5/5] bpf: Add a BPF OOM policy Doc
  2023-08-10  8:13 [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task Chuyi Zhou
                   ` (3 preceding siblings ...)
  2023-08-10  8:13 ` [RFC PATCH v2 4/5] bpf: Add a OOM policy test Chuyi Zhou
@ 2023-08-10  8:13 ` Chuyi Zhou
  2023-08-16 15:49 ` [PATCH RFC v2 0/5] mm: Select victim using bpf_oom_evaluate_task Yosry Ahmed
  5 siblings, 0 replies; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-10  8:13 UTC (permalink / raw)
  To: hannes, mhocko, roman.gushchin, ast, daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Chuyi Zhou

This patch adds a new doc Documentation/bpf/oom.rst to describe how
BPF OOM policy is supposed to work.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 Documentation/bpf/oom.rst | 70 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 Documentation/bpf/oom.rst

diff --git a/Documentation/bpf/oom.rst b/Documentation/bpf/oom.rst
new file mode 100644
index 000000000000..9bad1fd30d4a
--- /dev/null
+++ b/Documentation/bpf/oom.rst
@@ -0,0 +1,70 @@
+=============
+BPF OOM Policy
+=============
+
+The Out Of Memory Killer (aka OOM Killer) is invoked when the system is
+critically low on memory. The in-kernel implementation is to iterate over
+all tasks in the specific oom domain (all tasks for global and all members
+of memcg tree for hard limit oom) and select a victim based some heuristic
+policy to kill.
+
+Specifically:
+
+1. Begin to iterate tasks using ``oom_evaluate_task()`` and find a valid (killable)
+   victim in iteration N, select it.
+
+2. In iteration N + 1, N + 2..., we compare the current iteration task with the
+   previous selected task, if current is more suitable then select it.
+
+3. finally we get a victim to kill.
+
+However, this does not meet the needs of users in some special scenarios. Using
+the eBPF capabilities, We can implement customized OOM policies to meet needs.
+
+Developer API:
+==================
+
+bpf_oom_evaluate_task
+----------------------
+
+``bpf_oom_evaluate_task`` is a new interface hooking into ``oom_evaluate_task()``
+which is used to bypass the in-kernel selection logic. Users can customize their
+victim selection policy through BPF programs attached to it.
+::
+
+    int bpf_oom_evaluate_task(struct task_struct *task,
+                                struct oom_control *oc);
+
+return value::
+
+    NO_BPF_POLICY     no bpf policy and would fallback to the in-kernel selection
+    BPF_EVAL_ABORT    abort the selection (exit from current selection loop)
+    BPF_EVAL_NEXT     ignore the task
+    BPF_EAVL_SELECT   select the current task
+
+Suppose we want to select a victim based on the specified pid when OOM is
+invoked, we can use the following BPF program::
+
+    SEC("fmod_ret/bpf_oom_evaluate_task")
+    int BPF_PROG(bpf_oom_evaluate_task, struct task_struct *task, struct oom_control *oc)
+    {
+        if (task->pid == target_pid)
+            return BPF_EAVL_SELECT;
+        return BPF_EVAL_NEXT;
+    }
+
+bpf_set_policy_name
+---------------------
+
+``bpf_set_policy_name`` is a interface hooking before the start of victim selection. We can
+set policy's name in the attached program, so dump_header() can identify different policies
+when reporting messages. We can set policy's name through kfunc ``set_oom_policy_name``
+::
+
+    SEC("fentry/bpf_set_policy_name")
+    int BPF_PROG(set_police_name_k, struct oom_control *oc)
+    {
+	    char name[] = "my_policy";
+	    set_oom_policy_name(oc, name, sizeof(name));
+	    return 0;
+    }
\ No newline at end of file
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies
  2023-08-10  8:13 ` [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies Chuyi Zhou
@ 2023-08-14 20:51   ` Jonathan Corbet
  2023-08-15  2:28     ` Chuyi Zhou
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Jonathan Corbet @ 2023-08-14 20:51 UTC (permalink / raw)
  To: Chuyi Zhou, hannes, mhocko, roman.gushchin, ast, daniel, andrii,
	muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Chuyi Zhou

Chuyi Zhou <zhouchuyi@bytedance.com> writes:

> This patch adds a new metadata policy_name in oom_control and report it
> in dump_header(), so we can know what has been the selection policy. In
> BPF program, we can call kfunc set_oom_policy_name to set the current
> user-defined policy name. The in-kernel policy_name is "default".
>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  include/linux/oom.h |  7 +++++++
>  mm/oom_kill.c       | 42 +++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 46 insertions(+), 3 deletions(-)

So I have a possibly dumb question here...

> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 7d0c9c48a0c5..69d0f2ec6ea6 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -22,6 +22,10 @@ enum oom_constraint {
>  	CONSTRAINT_MEMCG,
>  };
>  
> +enum {
> +	POLICY_NAME_LEN = 16,
> +};

We've defined our name length, fine...

>  /*
>   * Details of the page allocation that triggered the oom killer that are used to
>   * determine what should be killed.
> @@ -52,6 +56,9 @@ struct oom_control {
>  
>  	/* Used to print the constraint info. */
>  	enum oom_constraint constraint;
> +
> +	/* Used to report the policy info. */
> +	char policy_name[POLICY_NAME_LEN];
>  };

...that is the length of the array, appended to the structure...

>  extern struct mutex oom_lock;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 255c9ef1d808..3239dcdba4d7 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -443,6 +443,35 @@ static int dump_task(struct task_struct *p, void *arg)
>  	return 0;
>  }
>  
> +__bpf_kfunc void set_oom_policy_name(struct oom_control *oc, const char *src, size_t sz)
> +{
> +	memset(oc->policy_name, 0, sizeof(oc->policy_name));
> +
> +	if (sz > POLICY_NAME_LEN)
> +		sz = POLICY_NAME_LEN;
> +
> +	memcpy(oc->policy_name, src, sz);
> +}

This truncates the name, possibly leaving it without a terminating NUL
character, right?

> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "kfuncs which will be used in BPF programs");
> +
> +__weak noinline void bpf_set_policy_name(struct oom_control *oc)
> +{
> +}
> +
> +__diag_pop();
> +
> +BTF_SET8_START(bpf_oom_policy_kfunc_ids)
> +BTF_ID_FLAGS(func, set_oom_policy_name)
> +BTF_SET8_END(bpf_oom_policy_kfunc_ids)
> +
> +static const struct btf_kfunc_id_set bpf_oom_policy_kfunc_set = {
> +	.owner          = THIS_MODULE,
> +	.set            = &bpf_oom_policy_kfunc_ids,
> +};
> +
>  /**
>   * dump_tasks - dump current memory state of all system tasks
>   * @oc: pointer to struct oom_control
> @@ -484,8 +513,8 @@ static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
>  
>  static void dump_header(struct oom_control *oc, struct task_struct *p)
>  {
> -	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
> -		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
> +	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, policy_name=%s, oom_score_adj=%hd\n",
> +		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, oc->policy_name,

...and if the policy name is unterminated, this print will run off the
end of the structure.

Am I missing something here?

Thanks,

jon


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies
  2023-08-14 20:51   ` Jonathan Corbet
@ 2023-08-15  2:28     ` Chuyi Zhou
  2023-09-14 12:02     ` Bixuan Cui
  2023-09-14 12:04     ` Bixuan Cui
  2 siblings, 0 replies; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-15  2:28 UTC (permalink / raw)
  To: Jonathan Corbet, hannes, mhocko, roman.gushchin, ast, daniel,
	andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu

Hello,

在 2023/8/15 04:51, Jonathan Corbet 写道:
> Chuyi Zhou <zhouchuyi@bytedance.com> writes:
> 
>> This patch adds a new metadata policy_name in oom_control and report it
>> in dump_header(), so we can know what has been the selection policy. In
>> BPF program, we can call kfunc set_oom_policy_name to set the current
>> user-defined policy name. The in-kernel policy_name is "default".
>>
>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>> ---
>>   include/linux/oom.h |  7 +++++++
>>   mm/oom_kill.c       | 42 +++++++++++++++++++++++++++++++++++++++---
>>   2 files changed, 46 insertions(+), 3 deletions(-)
> 
> So I have a possibly dumb question here...
> 
>> diff --git a/include/linux/oom.h b/include/linux/oom.h
>> index 7d0c9c48a0c5..69d0f2ec6ea6 100644
>> --- a/include/linux/oom.h
>> +++ b/include/linux/oom.h
>> @@ -22,6 +22,10 @@ enum oom_constraint {
>>   	CONSTRAINT_MEMCG,
>>   };
>>   
>> +enum {
>> +	POLICY_NAME_LEN = 16,
>> +};
> 
> We've defined our name length, fine...
> 
>>   /*
>>    * Details of the page allocation that triggered the oom killer that are used to
>>    * determine what should be killed.
>> @@ -52,6 +56,9 @@ struct oom_control {
>>   
>>   	/* Used to print the constraint info. */
>>   	enum oom_constraint constraint;
>> +
>> +	/* Used to report the policy info. */
>> +	char policy_name[POLICY_NAME_LEN];
>>   };
> 
> ...that is the length of the array, appended to the structure...
> 
>>   extern struct mutex oom_lock;
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 255c9ef1d808..3239dcdba4d7 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -443,6 +443,35 @@ static int dump_task(struct task_struct *p, void *arg)
>>   	return 0;
>>   }
>>   
>> +__bpf_kfunc void set_oom_policy_name(struct oom_control *oc, const char *src, size_t sz)
>> +{
>> +	memset(oc->policy_name, 0, sizeof(oc->policy_name));
>> +
>> +	if (sz > POLICY_NAME_LEN)
>> +		sz = POLICY_NAME_LEN;
>> +
>> +	memcpy(oc->policy_name, src, sz);
>> +}
> 
> This truncates the name, possibly leaving it without a terminating NUL
> character, right?
> 

Yes, indeed. We should guarantee that the policy_name is terminated with 
a NULL character to avoid some undefined behavior.

Thanks for your helpful review.

------
Chuyi Zhou


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 4/5] bpf: Add a OOM policy test
  2023-08-10  8:13 ` [RFC PATCH v2 4/5] bpf: Add a OOM policy test Chuyi Zhou
@ 2023-08-16 11:53   ` Alan Maguire
  2023-08-16 12:31     ` Chuyi Zhou
  2023-09-13  7:55   ` Bixuan Cui
  1 sibling, 1 reply; 30+ messages in thread
From: Alan Maguire @ 2023-08-16 11:53 UTC (permalink / raw)
  To: Chuyi Zhou, hannes, mhocko, roman.gushchin, ast, daniel, andrii,
	muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu

On 10/08/2023 09:13, Chuyi Zhou wrote:
> This patch adds a test which implements a priority-based policy through
> bpf_oom_evaluate_task.
> 
> The BPF program, oom_policy.c, compares the cgroup priority of two tasks
> and select the lower one. The userspace program test_oom_policy.c
> maintains a priority map by using cgroup id as the keys and priority as
> the values. We could protect certain cgroups from oom-killer by setting
> higher priority.
> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  .../bpf/prog_tests/test_oom_policy.c          | 140 ++++++++++++++++++
>  .../testing/selftests/bpf/progs/oom_policy.c  | 104 +++++++++++++
>  2 files changed, 244 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>  create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
> new file mode 100644
> index 000000000000..bea61ff22603
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
> @@ -0,0 +1,140 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#define _GNU_SOURCE
> +
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include <stdlib.h>
> +#include <signal.h>
> +#include <sys/stat.h>
> +#include <test_progs.h>
> +#include <bpf/btf.h>
> +#include <bpf/bpf.h>
> +
> +#include "cgroup_helpers.h"
> +#include "oom_policy.skel.h"
> +
> +static int map_fd;
> +static int cg_nr;
> +struct {
> +	const char *path;
> +	int fd;
> +	unsigned long long id;
> +} cgs[] = {
> +	{ "/cg1" },
> +	{ "/cg2" },
> +};
> +
> +
> +static struct oom_policy *open_load_oom_policy_skel(void)
> +{
> +	struct oom_policy *skel;
> +	int err;
> +
> +	skel = oom_policy__open();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		return NULL;
> +
> +	err = oom_policy__load(skel);
> +	if (!ASSERT_OK(err, "skel_load"))
> +		goto cleanup;
> +
> +	return skel;
> +
> +cleanup:
> +	oom_policy__destroy(skel);
> +	return NULL;
> +}
> +
> +static void run_memory_consume(unsigned long long consume_size, int idx)
> +{
> +	char *buf;
> +
> +	join_parent_cgroup(cgs[idx].path);
> +	buf = malloc(consume_size);
> +	memset(buf, 0, consume_size);
> +	sleep(2);
> +	exit(0);
> +}
> +
> +static int set_cgroup_prio(unsigned long long cg_id, int prio)
> +{
> +	int err;
> +
> +	err = bpf_map_update_elem(map_fd, &cg_id, &prio, BPF_ANY);
> +	ASSERT_EQ(err, 0, "update_map");
> +	return err;
> +}
> +
> +static int prepare_cgroup_environment(void)
> +{
> +	int err;
> +
> +	err = setup_cgroup_environment();
> +	if (err)
> +		goto clean_cg_env;
> +	for (int i = 0; i < cg_nr; i++) {
> +		err = cgs[i].fd = create_and_get_cgroup(cgs[i].path);
> +		if (!ASSERT_GE(cgs[i].fd, 0, "cg_create"))
> +			goto clean_cg_env;
> +		cgs[i].id = get_cgroup_id(cgs[i].path);
> +	}
> +	return 0;
> +clean_cg_env:
> +	cleanup_cgroup_environment();
> +	return err;
> +}
> +
> +void test_oom_policy(void)
> +{
> +	struct oom_policy *skel;
> +	struct bpf_link *link;
> +	int err;
> +	int victim_pid;
> +	unsigned long long victim_cg_id;
> +
> +	link = NULL;
> +	cg_nr = ARRAY_SIZE(cgs);
> +
> +	skel = open_load_oom_policy_skel();
> +	err = oom_policy__attach(skel);
> +	if (!ASSERT_OK(err, "oom_policy__attach"))
> +		goto cleanup;
> +
> +	map_fd = bpf_object__find_map_fd_by_name(skel->obj, "cg_map");
> +	if (!ASSERT_GE(map_fd, 0, "find map"))
> +		goto cleanup;
> +
> +	err = prepare_cgroup_environment();
> +	if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
> +		goto cleanup;
> +
> +	write_cgroup_file("/", "memory.max", "10M");
> +
> +	/*
> +	 * Set higher priority to cg2 and lower to cg1, so we would select
> +	 * task under cg1 as victim.(see oom_policy.c)
> +	 */
> +	set_cgroup_prio(cgs[0].id, 10);
> +	set_cgroup_prio(cgs[1].id, 50);
> +
> +	victim_cg_id = cgs[0].id;
> +	victim_pid = fork();
> +
> +	if (victim_pid == 0)
> +		run_memory_consume(1024 * 1024 * 4, 0);
> +
> +	if (fork() == 0)
> +		run_memory_consume(1024 * 1024 * 8, 1);
> +
> +	while (wait(NULL) > 0)
> +		;
> +
> +	ASSERT_EQ(skel->bss->victim_pid, victim_pid, "victim_pid");
> +	ASSERT_EQ(skel->bss->victim_cg_id, victim_cg_id, "victim_cgid");
> +	ASSERT_EQ(skel->bss->failed_cnt, 1, "failed_cnt");
> +cleanup:
> +	bpf_link__destroy(link);
> +	oom_policy__destroy(skel);
> +	cleanup_cgroup_environment();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/oom_policy.c b/tools/testing/selftests/bpf/progs/oom_policy.c
> new file mode 100644
> index 000000000000..fc9efc93914e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/oom_policy.c
> @@ -0,0 +1,104 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__type(key, int);
> +	__type(value, int);
> +	__uint(max_entries, 24);
> +} cg_map SEC(".maps");
> +
> +unsigned int victim_pid;
> +u64 victim_cg_id;
> +int failed_cnt;
> +
> +#define	EOPNOTSUPP	95
> +
> +enum {
> +	NO_BPF_POLICY,
> +	BPF_EVAL_ABORT,
> +	BPF_EVAL_NEXT,
> +	BPF_EVAL_SELECT,
> +};

When I built a kernel using this series and tried building the
associated test for that kernel I saw:

progs/oom_policy.c:22:2: error: redefinition of enumerator 'NO_BPF_POLICY'
        NO_BPF_POLICY,
        ^
/home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75894:2:
note: previous definition is here
        NO_BPF_POLICY = 0,
        ^
progs/oom_policy.c:23:2: error: redefinition of enumerator 'BPF_EVAL_ABORT'
        BPF_EVAL_ABORT,
        ^
/home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75895:2:
note: previous definition is here
        BPF_EVAL_ABORT = 1,
        ^
progs/oom_policy.c:24:2: error: redefinition of enumerator 'BPF_EVAL_NEXT'
        BPF_EVAL_NEXT,
        ^
/home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75896:2:
note: previous definition is here
        BPF_EVAL_NEXT = 2,
        ^
progs/oom_policy.c:  CLNG-BPF [test_maps] tailcall_bpf2bpf4.bpf.o
25:2: error: redefinition of enumerator 'BPF_EVAL_SELECT'
        BPF_EVAL_SELECT,
        ^
/home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75897:2:
note: previous definition is here
        BPF_EVAL_SELECT = 3,
        ^
4 errors generated.


So you shouldn't need the enum definition since it already makes it into
vmlinux.h.

I also ran into test failures when I removed the above (and compilation
succeeded):


test_oom_policy:PASS:prepare cgroup env 0 nsec
(cgroup_helpers.c:130: errno: No such file or directory) Opening
/mnt/cgroup-test-work-dir23054//memory.max
set_cgroup_prio:PASS:update_map 0 nsec
set_cgroup_prio:PASS:update_map 0 nsec
test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
expected 23058
test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
expected 68
test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
expected 1
#154     oom_policy:FAIL
Summary: 1/0 PASSED, 0 SKIPPED, 1 FAILED

So it seems that because my system was using the cgroupv1 memory
controller, it could not be used for v2 unless I rebooted with

systemd.unified_cgroup_hierarchy=1

...on the boot commandline. It would be good to note any such
requirements for this test in the selftests/bpf/README.rst.
Might also be worth adding

write_cgroup_file("", "cgroup.subtree_control", "+memory");

...to ensure the memory controller is enabled for the root cgroup.

At that point the test still failed:

set_cgroup_prio:PASS:update_map 0 nsec
test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
expected 12649
test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
expected 9583
test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
expected 1
#154     oom_policy:FAIL
Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
Successfully unloaded bpf_testmod.ko.


Are there other implicit assumptions about configuration that cause this
test to fail perhaps?

Alan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 3/5] mm: Add a tracepoint when OOM victim selection is failed
  2023-08-10  8:13 ` [RFC PATCH v2 3/5] mm: Add a tracepoint when OOM victim selection is failed Chuyi Zhou
@ 2023-08-16 11:54   ` Alan Maguire
  0 siblings, 0 replies; 30+ messages in thread
From: Alan Maguire @ 2023-08-16 11:54 UTC (permalink / raw)
  To: Chuyi Zhou, hannes, mhocko, roman.gushchin, ast, daniel, andrii,
	muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu

On 10/08/2023 09:13, Chuyi Zhou wrote:
> This patch add a tracepoint to mark the scenario where nothing was
> chosen for OOM killer. This would allow BPF programs to catch the fact
> that the BPF OOM policy didn't work well.
> 
> Suggested-by: Alan Maguire <alan.maguire@oracle.com>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  include/trace/events/oom.h | 18 ++++++++++++++++++
>  mm/oom_kill.c              |  1 +
>  2 files changed, 19 insertions(+)
> 
> diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
> index 26a11e4a2c36..b6ae1134229c 100644
> --- a/include/trace/events/oom.h
> +++ b/include/trace/events/oom.h
> @@ -6,6 +6,7 @@
>  #define _TRACE_OOM_H
>  #include <linux/tracepoint.h>
>  #include <trace/events/mmflags.h>
> +#include <linux/oom.h>
>  
>  TRACE_EVENT(oom_score_adj_update,
>  
> @@ -151,6 +152,23 @@ TRACE_EVENT(skip_task_reaping,
>  	TP_printk("pid=%d", __entry->pid)
>  );
>  
> +TRACE_EVENT(select_bad_process_end,
> +

would oom_select_bad_process_fail be a better name here?
"_end" is kind of neutral, whereas "_fail" indicates something
unexpected happened.

> +	TP_PROTO(struct oom_control *oc),
> +
> +	TP_ARGS(oc),
> +
> +	TP_STRUCT__entry(
> +		__array(char, policy_name, POLICY_NAME_LEN)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->policy_name, oc->policy_name, POLICY_NAME_LEN);
> +	),
> +
> +	TP_printk("policy_name=%s", __entry->policy_name)
> +);
> +
>  #ifdef CONFIG_COMPACTION
>  TRACE_EVENT(compact_retry,
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3239dcdba4d7..af40a1b750fa 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1235,6 +1235,7 @@ bool out_of_memory(struct oom_control *oc)
>  	select_bad_process(oc);
>  	/* Found nothing?!?! */
>  	if (!oc->chosen) {
> +		trace_select_bad_process_end(oc);
>  		dump_header(oc, NULL);
>  		pr_warn("Out of memory and no killable processes...\n");
>  		/*

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 4/5] bpf: Add a OOM policy test
  2023-08-16 11:53   ` Alan Maguire
@ 2023-08-16 12:31     ` Chuyi Zhou
  2023-08-16 13:49       ` Alan Maguire
  0 siblings, 1 reply; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-16 12:31 UTC (permalink / raw)
  To: Alan Maguire, hannes, mhocko, roman.gushchin, ast, daniel,
	andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu

Hello,

在 2023/8/16 19:53, Alan Maguire 写道:
> On 10/08/2023 09:13, Chuyi Zhou wrote:
>> This patch adds a test which implements a priority-based policy through
>> bpf_oom_evaluate_task.
>>
>> The BPF program, oom_policy.c, compares the cgroup priority of two tasks
>> and select the lower one. The userspace program test_oom_policy.c
>> maintains a priority map by using cgroup id as the keys and priority as
>> the values. We could protect certain cgroups from oom-killer by setting
>> higher priority.
>>
>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>> ---
>>   .../bpf/prog_tests/test_oom_policy.c          | 140 ++++++++++++++++++
>>   .../testing/selftests/bpf/progs/oom_policy.c  | 104 +++++++++++++
>>   2 files changed, 244 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>> new file mode 100644
>> index 000000000000..bea61ff22603
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>> @@ -0,0 +1,140 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +#define _GNU_SOURCE
>> +
>> +#include <stdio.h>
>> +#include <fcntl.h>
>> +#include <unistd.h>
>> +#include <stdlib.h>
>> +#include <signal.h>
>> +#include <sys/stat.h>
>> +#include <test_progs.h>
>> +#include <bpf/btf.h>
>> +#include <bpf/bpf.h>
>> +
>> +#include "cgroup_helpers.h"
>> +#include "oom_policy.skel.h"
>> +
>> +static int map_fd;
>> +static int cg_nr;
>> +struct {
>> +	const char *path;
>> +	int fd;
>> +	unsigned long long id;
>> +} cgs[] = {
>> +	{ "/cg1" },
>> +	{ "/cg2" },
>> +};
>> +
>> +
>> +static struct oom_policy *open_load_oom_policy_skel(void)
>> +{
>> +	struct oom_policy *skel;
>> +	int err;
>> +
>> +	skel = oom_policy__open();
>> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
>> +		return NULL;
>> +
>> +	err = oom_policy__load(skel);
>> +	if (!ASSERT_OK(err, "skel_load"))
>> +		goto cleanup;
>> +
>> +	return skel;
>> +
>> +cleanup:
>> +	oom_policy__destroy(skel);
>> +	return NULL;
>> +}
>> +
>> +static void run_memory_consume(unsigned long long consume_size, int idx)
>> +{
>> +	char *buf;
>> +
>> +	join_parent_cgroup(cgs[idx].path);
>> +	buf = malloc(consume_size);
>> +	memset(buf, 0, consume_size);
>> +	sleep(2);
>> +	exit(0);
>> +}
>> +
>> +static int set_cgroup_prio(unsigned long long cg_id, int prio)
>> +{
>> +	int err;
>> +
>> +	err = bpf_map_update_elem(map_fd, &cg_id, &prio, BPF_ANY);
>> +	ASSERT_EQ(err, 0, "update_map");
>> +	return err;
>> +}
>> +
>> +static int prepare_cgroup_environment(void)
>> +{
>> +	int err;
>> +
>> +	err = setup_cgroup_environment();
>> +	if (err)
>> +		goto clean_cg_env;
>> +	for (int i = 0; i < cg_nr; i++) {
>> +		err = cgs[i].fd = create_and_get_cgroup(cgs[i].path);
>> +		if (!ASSERT_GE(cgs[i].fd, 0, "cg_create"))
>> +			goto clean_cg_env;
>> +		cgs[i].id = get_cgroup_id(cgs[i].path);
>> +	}
>> +	return 0;
>> +clean_cg_env:
>> +	cleanup_cgroup_environment();
>> +	return err;
>> +}
>> +
>> +void test_oom_policy(void)
>> +{
>> +	struct oom_policy *skel;
>> +	struct bpf_link *link;
>> +	int err;
>> +	int victim_pid;
>> +	unsigned long long victim_cg_id;
>> +
>> +	link = NULL;
>> +	cg_nr = ARRAY_SIZE(cgs);
>> +
>> +	skel = open_load_oom_policy_skel();
>> +	err = oom_policy__attach(skel);
>> +	if (!ASSERT_OK(err, "oom_policy__attach"))
>> +		goto cleanup;
>> +
>> +	map_fd = bpf_object__find_map_fd_by_name(skel->obj, "cg_map");
>> +	if (!ASSERT_GE(map_fd, 0, "find map"))
>> +		goto cleanup;
>> +
>> +	err = prepare_cgroup_environment();
>> +	if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
>> +		goto cleanup;
>> +
>> +	write_cgroup_file("/", "memory.max", "10M");
>> +
>> +	/*
>> +	 * Set higher priority to cg2 and lower to cg1, so we would select
>> +	 * task under cg1 as victim.(see oom_policy.c)
>> +	 */
>> +	set_cgroup_prio(cgs[0].id, 10);
>> +	set_cgroup_prio(cgs[1].id, 50);
>> +
>> +	victim_cg_id = cgs[0].id;
>> +	victim_pid = fork();
>> +
>> +	if (victim_pid == 0)
>> +		run_memory_consume(1024 * 1024 * 4, 0);
>> +
>> +	if (fork() == 0)
>> +		run_memory_consume(1024 * 1024 * 8, 1);
>> +
>> +	while (wait(NULL) > 0)
>> +		;
>> +
>> +	ASSERT_EQ(skel->bss->victim_pid, victim_pid, "victim_pid");
>> +	ASSERT_EQ(skel->bss->victim_cg_id, victim_cg_id, "victim_cgid");
>> +	ASSERT_EQ(skel->bss->failed_cnt, 1, "failed_cnt");
>> +cleanup:
>> +	bpf_link__destroy(link);
>> +	oom_policy__destroy(skel);
>> +	cleanup_cgroup_environment();
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/oom_policy.c b/tools/testing/selftests/bpf/progs/oom_policy.c
>> new file mode 100644
>> index 000000000000..fc9efc93914e
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/oom_policy.c
>> @@ -0,0 +1,104 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +#include <vmlinux.h>
>> +#include <bpf/bpf_tracing.h>
>> +#include <bpf/bpf_helpers.h>
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_HASH);
>> +	__type(key, int);
>> +	__type(value, int);
>> +	__uint(max_entries, 24);
>> +} cg_map SEC(".maps");
>> +
>> +unsigned int victim_pid;
>> +u64 victim_cg_id;
>> +int failed_cnt;
>> +
>> +#define	EOPNOTSUPP	95
>> +
>> +enum {
>> +	NO_BPF_POLICY,
>> +	BPF_EVAL_ABORT,
>> +	BPF_EVAL_NEXT,
>> +	BPF_EVAL_SELECT,
>> +};
> 
> When I built a kernel using this series and tried building the
> associated test for that kernel I saw:
> 
> progs/oom_policy.c:22:2: error: redefinition of enumerator 'NO_BPF_POLICY'
>          NO_BPF_POLICY,
>          ^
> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75894:2:
> note: previous definition is here
>          NO_BPF_POLICY = 0,
>          ^
> progs/oom_policy.c:23:2: error: redefinition of enumerator 'BPF_EVAL_ABORT'
>          BPF_EVAL_ABORT,
>          ^
> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75895:2:
> note: previous definition is here
>          BPF_EVAL_ABORT = 1,
>          ^
> progs/oom_policy.c:24:2: error: redefinition of enumerator 'BPF_EVAL_NEXT'
>          BPF_EVAL_NEXT,
>          ^
> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75896:2:
> note: previous definition is here
>          BPF_EVAL_NEXT = 2,
>          ^
> progs/oom_policy.c:  CLNG-BPF [test_maps] tailcall_bpf2bpf4.bpf.o
> 25:2: error: redefinition of enumerator 'BPF_EVAL_SELECT'
>          BPF_EVAL_SELECT,
>          ^
> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75897:2:
> note: previous definition is here
>          BPF_EVAL_SELECT = 3,
>          ^
> 4 errors generated.
> 
> 
> So you shouldn't need the enum definition since it already makes it into
> vmlinux.h.
> OK. It seems my vmlinux.h doesn't contain these enum...
> I also ran into test failures when I removed the above (and compilation
> succeeded):
> 
> 
> test_oom_policy:PASS:prepare cgroup env 0 nsec
> (cgroup_helpers.c:130: errno: No such file or directory) Opening
> /mnt/cgroup-test-work-dir23054//memory.max
> set_cgroup_prio:PASS:update_map 0 nsec
> set_cgroup_prio:PASS:update_map 0 nsec
> test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
> expected 23058
> test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
> expected 68
> test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
> expected 1
> #154     oom_policy:FAIL
> Summary: 1/0 PASSED, 0 SKIPPED, 1 FAILED
> 
> So it seems that because my system was using the cgroupv1 memory
> controller, it could not be used for v2 unless I rebooted with
> 
> systemd.unified_cgroup_hierarchy=1
> 
> ...on the boot commandline. It would be good to note any such
> requirements for this test in the selftests/bpf/README.rst.
> Might also be worth adding
> 
> write_cgroup_file("", "cgroup.subtree_control", "+memory");
> 
> ...to ensure the memory controller is enabled for the root cgroup.
> 
> At that point the test still failed:
> 
> set_cgroup_prio:PASS:update_map 0 nsec
> test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
> expected 12649
> test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
> expected 9583
> test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
> expected 1
> #154     oom_policy:FAIL
> Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
> Successfully unloaded bpf_testmod.ko.
> 
> 
It seems that OOM is not invoked in your environment(you can check it in 
demsg). If the memcg OOM is invoked by the test, we would record the 
*victim_pid* and *victim_cgid* and they would not be zero. I guess the 
reason is memory_control is not enabled in cgroup 
"/mnt/cgroup-test-work-dir23054/", because I see the error message:
(cgroup_helpers.c:130: errno: No such file or directory) Opening
 > /mnt/cgroup-test-work-dir23054//memory.max

Thanks for your review and test!

> Are there other implicit assumptions about configuration that cause this
> test to fail perhaps?
> 
> Alan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 4/5] bpf: Add a OOM policy test
  2023-08-16 12:31     ` Chuyi Zhou
@ 2023-08-16 13:49       ` Alan Maguire
  2023-08-16 14:34         ` Chuyi Zhou
  0 siblings, 1 reply; 30+ messages in thread
From: Alan Maguire @ 2023-08-16 13:49 UTC (permalink / raw)
  To: Chuyi Zhou, hannes, mhocko, roman.gushchin, ast, daniel, andrii,
	muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu

On 16/08/2023 13:31, Chuyi Zhou wrote:
> Hello,
> 
> 在 2023/8/16 19:53, Alan Maguire 写道:
>> On 10/08/2023 09:13, Chuyi Zhou wrote:
>>> This patch adds a test which implements a priority-based policy through
>>> bpf_oom_evaluate_task.
>>>
>>> The BPF program, oom_policy.c, compares the cgroup priority of two tasks
>>> and select the lower one. The userspace program test_oom_policy.c
>>> maintains a priority map by using cgroup id as the keys and priority as
>>> the values. We could protect certain cgroups from oom-killer by setting
>>> higher priority.
>>>
>>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>>> ---
>>>   .../bpf/prog_tests/test_oom_policy.c          | 140 ++++++++++++++++++
>>>   .../testing/selftests/bpf/progs/oom_policy.c  | 104 +++++++++++++
>>>   2 files changed, 244 insertions(+)
>>>   create mode 100644
>>> tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>   create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c
>>>
>>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>> b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>> new file mode 100644
>>> index 000000000000..bea61ff22603
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>> @@ -0,0 +1,140 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +#define _GNU_SOURCE
>>> +
>>> +#include <stdio.h>
>>> +#include <fcntl.h>
>>> +#include <unistd.h>
>>> +#include <stdlib.h>
>>> +#include <signal.h>
>>> +#include <sys/stat.h>
>>> +#include <test_progs.h>
>>> +#include <bpf/btf.h>
>>> +#include <bpf/bpf.h>
>>> +
>>> +#include "cgroup_helpers.h"
>>> +#include "oom_policy.skel.h"
>>> +
>>> +static int map_fd;
>>> +static int cg_nr;
>>> +struct {
>>> +    const char *path;
>>> +    int fd;
>>> +    unsigned long long id;
>>> +} cgs[] = {
>>> +    { "/cg1" },
>>> +    { "/cg2" },
>>> +};
>>> +
>>> +
>>> +static struct oom_policy *open_load_oom_policy_skel(void)
>>> +{
>>> +    struct oom_policy *skel;
>>> +    int err;
>>> +
>>> +    skel = oom_policy__open();
>>> +    if (!ASSERT_OK_PTR(skel, "skel_open"))
>>> +        return NULL;
>>> +
>>> +    err = oom_policy__load(skel);
>>> +    if (!ASSERT_OK(err, "skel_load"))
>>> +        goto cleanup;
>>> +
>>> +    return skel;
>>> +
>>> +cleanup:
>>> +    oom_policy__destroy(skel);
>>> +    return NULL;
>>> +}
>>> +
>>> +static void run_memory_consume(unsigned long long consume_size, int
>>> idx)
>>> +{
>>> +    char *buf;
>>> +
>>> +    join_parent_cgroup(cgs[idx].path);
>>> +    buf = malloc(consume_size);
>>> +    memset(buf, 0, consume_size);
>>> +    sleep(2);
>>> +    exit(0);
>>> +}
>>> +
>>> +static int set_cgroup_prio(unsigned long long cg_id, int prio)
>>> +{
>>> +    int err;
>>> +
>>> +    err = bpf_map_update_elem(map_fd, &cg_id, &prio, BPF_ANY);
>>> +    ASSERT_EQ(err, 0, "update_map");
>>> +    return err;
>>> +}
>>> +
>>> +static int prepare_cgroup_environment(void)
>>> +{
>>> +    int err;
>>> +
>>> +    err = setup_cgroup_environment();
>>> +    if (err)
>>> +        goto clean_cg_env;
>>> +    for (int i = 0; i < cg_nr; i++) {
>>> +        err = cgs[i].fd = create_and_get_cgroup(cgs[i].path);
>>> +        if (!ASSERT_GE(cgs[i].fd, 0, "cg_create"))
>>> +            goto clean_cg_env;
>>> +        cgs[i].id = get_cgroup_id(cgs[i].path);
>>> +    }
>>> +    return 0;
>>> +clean_cg_env:
>>> +    cleanup_cgroup_environment();
>>> +    return err;
>>> +}
>>> +
>>> +void test_oom_policy(void)
>>> +{
>>> +    struct oom_policy *skel;
>>> +    struct bpf_link *link;
>>> +    int err;
>>> +    int victim_pid;
>>> +    unsigned long long victim_cg_id;
>>> +
>>> +    link = NULL;
>>> +    cg_nr = ARRAY_SIZE(cgs);
>>> +
>>> +    skel = open_load_oom_policy_skel();
>>> +    err = oom_policy__attach(skel);
>>> +    if (!ASSERT_OK(err, "oom_policy__attach"))
>>> +        goto cleanup;
>>> +
>>> +    map_fd = bpf_object__find_map_fd_by_name(skel->obj, "cg_map");
>>> +    if (!ASSERT_GE(map_fd, 0, "find map"))
>>> +        goto cleanup;
>>> +
>>> +    err = prepare_cgroup_environment();
>>> +    if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
>>> +        goto cleanup;
>>> +
>>> +    write_cgroup_file("/", "memory.max", "10M");
>>> +
>>> +    /*
>>> +     * Set higher priority to cg2 and lower to cg1, so we would select
>>> +     * task under cg1 as victim.(see oom_policy.c)
>>> +     */
>>> +    set_cgroup_prio(cgs[0].id, 10);
>>> +    set_cgroup_prio(cgs[1].id, 50);
>>> +
>>> +    victim_cg_id = cgs[0].id;
>>> +    victim_pid = fork();
>>> +
>>> +    if (victim_pid == 0)
>>> +        run_memory_consume(1024 * 1024 * 4, 0);
>>> +
>>> +    if (fork() == 0)
>>> +        run_memory_consume(1024 * 1024 * 8, 1);
>>> +
>>> +    while (wait(NULL) > 0)
>>> +        ;
>>> +
>>> +    ASSERT_EQ(skel->bss->victim_pid, victim_pid, "victim_pid");
>>> +    ASSERT_EQ(skel->bss->victim_cg_id, victim_cg_id, "victim_cgid");
>>> +    ASSERT_EQ(skel->bss->failed_cnt, 1, "failed_cnt");
>>> +cleanup:
>>> +    bpf_link__destroy(link);
>>> +    oom_policy__destroy(skel);
>>> +    cleanup_cgroup_environment();
>>> +}
>>> diff --git a/tools/testing/selftests/bpf/progs/oom_policy.c
>>> b/tools/testing/selftests/bpf/progs/oom_policy.c
>>> new file mode 100644
>>> index 000000000000..fc9efc93914e
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/bpf/progs/oom_policy.c
>>> @@ -0,0 +1,104 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +#include <vmlinux.h>
>>> +#include <bpf/bpf_tracing.h>
>>> +#include <bpf/bpf_helpers.h>
>>> +
>>> +char _license[] SEC("license") = "GPL";
>>> +
>>> +struct {
>>> +    __uint(type, BPF_MAP_TYPE_HASH);
>>> +    __type(key, int);
>>> +    __type(value, int);
>>> +    __uint(max_entries, 24);
>>> +} cg_map SEC(".maps");
>>> +
>>> +unsigned int victim_pid;
>>> +u64 victim_cg_id;
>>> +int failed_cnt;
>>> +
>>> +#define    EOPNOTSUPP    95
>>> +
>>> +enum {
>>> +    NO_BPF_POLICY,
>>> +    BPF_EVAL_ABORT,
>>> +    BPF_EVAL_NEXT,
>>> +    BPF_EVAL_SELECT,
>>> +};
>>
>> When I built a kernel using this series and tried building the
>> associated test for that kernel I saw:
>>
>> progs/oom_policy.c:22:2: error: redefinition of enumerator
>> 'NO_BPF_POLICY'
>>          NO_BPF_POLICY,
>>          ^
>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75894:2:
>> note: previous definition is here
>>          NO_BPF_POLICY = 0,
>>          ^
>> progs/oom_policy.c:23:2: error: redefinition of enumerator
>> 'BPF_EVAL_ABORT'
>>          BPF_EVAL_ABORT,
>>          ^
>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75895:2:
>> note: previous definition is here
>>          BPF_EVAL_ABORT = 1,
>>          ^
>> progs/oom_policy.c:24:2: error: redefinition of enumerator
>> 'BPF_EVAL_NEXT'
>>          BPF_EVAL_NEXT,
>>          ^
>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75896:2:
>> note: previous definition is here
>>          BPF_EVAL_NEXT = 2,
>>          ^
>> progs/oom_policy.c:  CLNG-BPF [test_maps] tailcall_bpf2bpf4.bpf.o
>> 25:2: error: redefinition of enumerator 'BPF_EVAL_SELECT'
>>          BPF_EVAL_SELECT,
>>          ^
>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75897:2:
>> note: previous definition is here
>>          BPF_EVAL_SELECT = 3,
>>          ^
>> 4 errors generated.
>>
>>
>> So you shouldn't need the enum definition since it already makes it into
>> vmlinux.h.
>> OK. It seems my vmlinux.h doesn't contain these enum...
>> I also ran into test failures when I removed the above (and compilation
>> succeeded):
>>
>>
>> test_oom_policy:PASS:prepare cgroup env 0 nsec
>> (cgroup_helpers.c:130: errno: No such file or directory) Opening
>> /mnt/cgroup-test-work-dir23054//memory.max
>> set_cgroup_prio:PASS:update_map 0 nsec
>> set_cgroup_prio:PASS:update_map 0 nsec
>> test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
>> expected 23058
>> test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
>> expected 68
>> test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
>> expected 1
>> #154     oom_policy:FAIL
>> Summary: 1/0 PASSED, 0 SKIPPED, 1 FAILED
>>
>> So it seems that because my system was using the cgroupv1 memory
>> controller, it could not be used for v2 unless I rebooted with
>>
>> systemd.unified_cgroup_hierarchy=1
>>
>> ...on the boot commandline. It would be good to note any such
>> requirements for this test in the selftests/bpf/README.rst.
>> Might also be worth adding
>>
>> write_cgroup_file("", "cgroup.subtree_control", "+memory");
>>
>> ...to ensure the memory controller is enabled for the root cgroup.
>>
>> At that point the test still failed:
>>
>> set_cgroup_prio:PASS:update_map 0 nsec
>> test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
>> expected 12649
>> test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
>> expected 9583
>> test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
>> expected 1
>> #154     oom_policy:FAIL
>> Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
>> Successfully unloaded bpf_testmod.ko.
>>
>>
> It seems that OOM is not invoked in your environment(you can check it in
> demsg). If the memcg OOM is invoked by the test, we would record the
> *victim_pid* and *victim_cgid* and they would not be zero. I guess the
> reason is memory_control is not enabled in cgroup
> "/mnt/cgroup-test-work-dir23054/", because I see the error message:
> (cgroup_helpers.c:130: errno: No such file or directory) Opening
>> /mnt/cgroup-test-work-dir23054//memory.max

Right, but after I set up unified cgroup hierarchy and rebooted, that
message disappeared and cgroup setup succeeded, _but_ the test still
failed with 0 victim_pid/cgid.  I see nothing OOM-related in dmesg, but
the toplevel cgroupv2 cgroup.controllers file contains:

cpuset cpu io memory hugetlb pids rdma

Is there something else that needs to be done to enable OOM scanning?
I see the oom_reaper process:

root          72       2  0 11:30 ?        00:00:00 [oom_reaper]


This test will need to pass BPF CI, so any assumptions about
configuration need to be ironed out. For example, I think you should
probably have

diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
index bea61ff22603..54fdb8a59816 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
@@ -109,6 +109,7 @@ void test_oom_policy(void)
        if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
                goto cleanup;

+       write_cgroup_file("/", "cgroup.subtree_control", "+memory");
        write_cgroup_file("/", "memory.max", "10M");

        /*

...to be safe, since

https://docs.kernel.org/admin-guide/cgroup-v2.html#organizing-processes-and-threads

...says

"No controller is enabled by default. Controllers can be enabled and
disabled by writing to the "cgroup.subtree_control" file:

# echo "+cpu +memory -io" > cgroup.subtree_control

"

Are there any other aspects of configuration like that which might
explain why the test passes for you but fails for me?

Alan

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 4/5] bpf: Add a OOM policy test
  2023-08-16 13:49       ` Alan Maguire
@ 2023-08-16 14:34         ` Chuyi Zhou
  2023-09-28 11:35           ` Charlley Green
  0 siblings, 1 reply; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-16 14:34 UTC (permalink / raw)
  To: Alan Maguire, hannes, mhocko, roman.gushchin, ast, daniel,
	andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu

Hello,

在 2023/8/16 21:49, Alan Maguire 写道:
> On 16/08/2023 13:31, Chuyi Zhou wrote:
>> Hello,
>>
>> 在 2023/8/16 19:53, Alan Maguire 写道:
>>> On 10/08/2023 09:13, Chuyi Zhou wrote:
>>>> This patch adds a test which implements a priority-based policy through
>>>> bpf_oom_evaluate_task.
>>>>
>>>> The BPF program, oom_policy.c, compares the cgroup priority of two tasks
>>>> and select the lower one. The userspace program test_oom_policy.c
>>>> maintains a priority map by using cgroup id as the keys and priority as
>>>> the values. We could protect certain cgroups from oom-killer by setting
>>>> higher priority.
>>>>
>>>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>>>> ---
>>>>    .../bpf/prog_tests/test_oom_policy.c          | 140 ++++++++++++++++++
>>>>    .../testing/selftests/bpf/progs/oom_policy.c  | 104 +++++++++++++
>>>>    2 files changed, 244 insertions(+)
>>>>    create mode 100644
>>>> tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>>    create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c
>>>>
>>>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>> b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>> new file mode 100644
>>>> index 000000000000..bea61ff22603
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>> @@ -0,0 +1,140 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +#define _GNU_SOURCE
>>>> +
>>>> +#include <stdio.h>
>>>> +#include <fcntl.h>
>>>> +#include <unistd.h>
>>>> +#include <stdlib.h>
>>>> +#include <signal.h>
>>>> +#include <sys/stat.h>
>>>> +#include <test_progs.h>
>>>> +#include <bpf/btf.h>
>>>> +#include <bpf/bpf.h>
>>>> +
>>>> +#include "cgroup_helpers.h"
>>>> +#include "oom_policy.skel.h"
>>>> +
>>>> +static int map_fd;
>>>> +static int cg_nr;
>>>> +struct {
>>>> +    const char *path;
>>>> +    int fd;
>>>> +    unsigned long long id;
>>>> +} cgs[] = {
>>>> +    { "/cg1" },
>>>> +    { "/cg2" },
>>>> +};
>>>> +
>>>> +
>>>> +static struct oom_policy *open_load_oom_policy_skel(void)
>>>> +{
>>>> +    struct oom_policy *skel;
>>>> +    int err;
>>>> +
>>>> +    skel = oom_policy__open();
>>>> +    if (!ASSERT_OK_PTR(skel, "skel_open"))
>>>> +        return NULL;
>>>> +
>>>> +    err = oom_policy__load(skel);
>>>> +    if (!ASSERT_OK(err, "skel_load"))
>>>> +        goto cleanup;
>>>> +
>>>> +    return skel;
>>>> +
>>>> +cleanup:
>>>> +    oom_policy__destroy(skel);
>>>> +    return NULL;
>>>> +}
>>>> +
>>>> +static void run_memory_consume(unsigned long long consume_size, int
>>>> idx)
>>>> +{
>>>> +    char *buf;
>>>> +
>>>> +    join_parent_cgroup(cgs[idx].path);
>>>> +    buf = malloc(consume_size);
>>>> +    memset(buf, 0, consume_size);
>>>> +    sleep(2);
>>>> +    exit(0);
>>>> +}
>>>> +
>>>> +static int set_cgroup_prio(unsigned long long cg_id, int prio)
>>>> +{
>>>> +    int err;
>>>> +
>>>> +    err = bpf_map_update_elem(map_fd, &cg_id, &prio, BPF_ANY);
>>>> +    ASSERT_EQ(err, 0, "update_map");
>>>> +    return err;
>>>> +}
>>>> +
>>>> +static int prepare_cgroup_environment(void)
>>>> +{
>>>> +    int err;
>>>> +
>>>> +    err = setup_cgroup_environment();
>>>> +    if (err)
>>>> +        goto clean_cg_env;
>>>> +    for (int i = 0; i < cg_nr; i++) {
>>>> +        err = cgs[i].fd = create_and_get_cgroup(cgs[i].path);
>>>> +        if (!ASSERT_GE(cgs[i].fd, 0, "cg_create"))
>>>> +            goto clean_cg_env;
>>>> +        cgs[i].id = get_cgroup_id(cgs[i].path);
>>>> +    }
>>>> +    return 0;
>>>> +clean_cg_env:
>>>> +    cleanup_cgroup_environment();
>>>> +    return err;
>>>> +}
>>>> +
>>>> +void test_oom_policy(void)
>>>> +{
>>>> +    struct oom_policy *skel;
>>>> +    struct bpf_link *link;
>>>> +    int err;
>>>> +    int victim_pid;
>>>> +    unsigned long long victim_cg_id;
>>>> +
>>>> +    link = NULL;
>>>> +    cg_nr = ARRAY_SIZE(cgs);
>>>> +
>>>> +    skel = open_load_oom_policy_skel();
>>>> +    err = oom_policy__attach(skel);
>>>> +    if (!ASSERT_OK(err, "oom_policy__attach"))
>>>> +        goto cleanup;
>>>> +
>>>> +    map_fd = bpf_object__find_map_fd_by_name(skel->obj, "cg_map");
>>>> +    if (!ASSERT_GE(map_fd, 0, "find map"))
>>>> +        goto cleanup;
>>>> +
>>>> +    err = prepare_cgroup_environment();
>>>> +    if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
>>>> +        goto cleanup;
>>>> +
>>>> +    write_cgroup_file("/", "memory.max", "10M");
>>>> +
>>>> +    /*
>>>> +     * Set higher priority to cg2 and lower to cg1, so we would select
>>>> +     * task under cg1 as victim.(see oom_policy.c)
>>>> +     */
>>>> +    set_cgroup_prio(cgs[0].id, 10);
>>>> +    set_cgroup_prio(cgs[1].id, 50);
>>>> +
>>>> +    victim_cg_id = cgs[0].id;
>>>> +    victim_pid = fork();
>>>> +
>>>> +    if (victim_pid == 0)
>>>> +        run_memory_consume(1024 * 1024 * 4, 0);
>>>> +
>>>> +    if (fork() == 0)
>>>> +        run_memory_consume(1024 * 1024 * 8, 1);
>>>> +
>>>> +    while (wait(NULL) > 0)
>>>> +        ;
>>>> +
>>>> +    ASSERT_EQ(skel->bss->victim_pid, victim_pid, "victim_pid");
>>>> +    ASSERT_EQ(skel->bss->victim_cg_id, victim_cg_id, "victim_cgid");
>>>> +    ASSERT_EQ(skel->bss->failed_cnt, 1, "failed_cnt");
>>>> +cleanup:
>>>> +    bpf_link__destroy(link);
>>>> +    oom_policy__destroy(skel);
>>>> +    cleanup_cgroup_environment();
>>>> +}
>>>> diff --git a/tools/testing/selftests/bpf/progs/oom_policy.c
>>>> b/tools/testing/selftests/bpf/progs/oom_policy.c
>>>> new file mode 100644
>>>> index 000000000000..fc9efc93914e
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/bpf/progs/oom_policy.c
>>>> @@ -0,0 +1,104 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +#include <vmlinux.h>
>>>> +#include <bpf/bpf_tracing.h>
>>>> +#include <bpf/bpf_helpers.h>
>>>> +
>>>> +char _license[] SEC("license") = "GPL";
>>>> +
>>>> +struct {
>>>> +    __uint(type, BPF_MAP_TYPE_HASH);
>>>> +    __type(key, int);
>>>> +    __type(value, int);
>>>> +    __uint(max_entries, 24);
>>>> +} cg_map SEC(".maps");
>>>> +
>>>> +unsigned int victim_pid;
>>>> +u64 victim_cg_id;
>>>> +int failed_cnt;
>>>> +
>>>> +#define    EOPNOTSUPP    95
>>>> +
>>>> +enum {
>>>> +    NO_BPF_POLICY,
>>>> +    BPF_EVAL_ABORT,
>>>> +    BPF_EVAL_NEXT,
>>>> +    BPF_EVAL_SELECT,
>>>> +};
>>>
>>> When I built a kernel using this series and tried building the
>>> associated test for that kernel I saw:
>>>
>>> progs/oom_policy.c:22:2: error: redefinition of enumerator
>>> 'NO_BPF_POLICY'
>>>           NO_BPF_POLICY,
>>>           ^
>>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75894:2:
>>> note: previous definition is here
>>>           NO_BPF_POLICY = 0,
>>>           ^
>>> progs/oom_policy.c:23:2: error: redefinition of enumerator
>>> 'BPF_EVAL_ABORT'
>>>           BPF_EVAL_ABORT,
>>>           ^
>>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75895:2:
>>> note: previous definition is here
>>>           BPF_EVAL_ABORT = 1,
>>>           ^
>>> progs/oom_policy.c:24:2: error: redefinition of enumerator
>>> 'BPF_EVAL_NEXT'
>>>           BPF_EVAL_NEXT,
>>>           ^
>>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75896:2:
>>> note: previous definition is here
>>>           BPF_EVAL_NEXT = 2,
>>>           ^
>>> progs/oom_policy.c:  CLNG-BPF [test_maps] tailcall_bpf2bpf4.bpf.o
>>> 25:2: error: redefinition of enumerator 'BPF_EVAL_SELECT'
>>>           BPF_EVAL_SELECT,
>>>           ^
>>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75897:2:
>>> note: previous definition is here
>>>           BPF_EVAL_SELECT = 3,
>>>           ^
>>> 4 errors generated.
>>>
>>>
>>> So you shouldn't need the enum definition since it already makes it into
>>> vmlinux.h.
>>> OK. It seems my vmlinux.h doesn't contain these enum...
>>> I also ran into test failures when I removed the above (and compilation
>>> succeeded):
>>>
>>>
>>> test_oom_policy:PASS:prepare cgroup env 0 nsec
>>> (cgroup_helpers.c:130: errno: No such file or directory) Opening
>>> /mnt/cgroup-test-work-dir23054//memory.max
>>> set_cgroup_prio:PASS:update_map 0 nsec
>>> set_cgroup_prio:PASS:update_map 0 nsec
>>> test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
>>> expected 23058
>>> test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
>>> expected 68
>>> test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
>>> expected 1
>>> #154     oom_policy:FAIL
>>> Summary: 1/0 PASSED, 0 SKIPPED, 1 FAILED
>>>
>>> So it seems that because my system was using the cgroupv1 memory
>>> controller, it could not be used for v2 unless I rebooted with
>>>
>>> systemd.unified_cgroup_hierarchy=1
>>>
>>> ...on the boot commandline. It would be good to note any such
>>> requirements for this test in the selftests/bpf/README.rst.
>>> Might also be worth adding
>>>
>>> write_cgroup_file("", "cgroup.subtree_control", "+memory");
>>>
>>> ...to ensure the memory controller is enabled for the root cgroup.
>>>
>>> At that point the test still failed:
>>>
>>> set_cgroup_prio:PASS:update_map 0 nsec
>>> test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
>>> expected 12649
>>> test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
>>> expected 9583
>>> test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
>>> expected 1
>>> #154     oom_policy:FAIL
>>> Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
>>> Successfully unloaded bpf_testmod.ko.
>>>
>>>
>> It seems that OOM is not invoked in your environment(you can check it in
>> demsg). If the memcg OOM is invoked by the test, we would record the
>> *victim_pid* and *victim_cgid* and they would not be zero. I guess the
>> reason is memory_control is not enabled in cgroup
>> "/mnt/cgroup-test-work-dir23054/", because I see the error message:
>> (cgroup_helpers.c:130: errno: No such file or directory) Opening
>>> /mnt/cgroup-test-work-dir23054//memory.max
> 
> Right, but after I set up unified cgroup hierarchy and rebooted, that
> message disappeared and cgroup setup succeeded, _but_ the test still
> failed with 0 victim_pid/cgid.  I see nothing OOM-related in dmesg, but
> the toplevel cgroupv2 cgroup.controllers file contains:
> 
> cpuset cpu io memory hugetlb pids rdma
>

Dose the toplevel cgroupv2's *cgroup.subtree_control* looks like that?
/sys/fs/cgroup$ cat cgroup.subtree_control

	cpuset cpu io memory hugetlb pids

This prog test would mkdir a test cgroup dir under the toplevel's 
cgroupv2 and rmdir after the test finishing. In my env, the test cgroup 
path looks like:

/sys/fs/cgroup/cgroup-test-work-dirxxx/

This test would run in cgroup-test-work-dirxxx.

If we want to enable memory controller in cgroup-test-work-dirxxx, we 
should guarantee that /sys/fs/cgroup/cgroup.subtree_control contanins
"memory".


> Is there something else that needs to be done to enable OOM scanning?
> I see the oom_reaper process:
> 
> root          72       2  0 11:30 ?        00:00:00 [oom_reaper]
> 
> 
> This test will need to pass BPF CI, so any assumptions about
> configuration need to be ironed out. For example, I think you should
> probably have
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
> b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
> index bea61ff22603..54fdb8a59816 100644
> --- a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
> @@ -109,6 +109,7 @@ void test_oom_policy(void)
>          if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
>                  goto cleanup;
> 
> +       write_cgroup_file("/", "cgroup.subtree_control", "+memory");
>          write_cgroup_file("/", "memory.max", "10M");

Yes, you are right. We do need something to guarantee that the memory 
controller is enabled in cgroup-test-work-dir.
	write_cgroup_file("/", "cgroup.subtree_control", "+memory");
This code actually dose something like:

echo "+memory" > /sys/fs/cgroup/cgroup-test-work-dir/cgroup.subtree_control

What we need actually is
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control

Thanks!
>
>          /*
> 
> ...to be safe, since
> 
> https://docs.kernel.org/admin-guide/cgroup-v2.html#organizing-processes-and-threads
> 
> ...says
> 
> "No controller is enabled by default. Controllers can be enabled and
> disabled by writing to the "cgroup.subtree_control" file:
> 
> # echo "+cpu +memory -io" > cgroup.subtree_control
> 
> "
> 
> Are there any other aspects of configuration like that which might
> explain why the test passes for you but fails for me?
> 
> Alan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC v2 0/5] mm: Select victim using bpf_oom_evaluate_task
  2023-08-10  8:13 [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task Chuyi Zhou
                   ` (4 preceding siblings ...)
  2023-08-10  8:13 ` [RFC PATCH v2 5/5] bpf: Add a BPF OOM policy Doc Chuyi Zhou
@ 2023-08-16 15:49 ` Yosry Ahmed
  5 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2023-08-16 15:49 UTC (permalink / raw)
  To: zhouchuyi
  Cc: andrii, ast, bpf, daniel, hannes, linux-kernel, mhocko,
	muchun.song, robin.lu, roman.gushchin, wuyun.abel, Yosry Ahmed

> Changes
> -------
>
> This is v2 of the BPF OOM policy patchset.
> v1 : https://lore.kernel.org/lkml/20230804093804.47039-1-zhouchuyi@bytedance.com/
> v1 -> v2 changes:
>
> - rename bpf_select_task to bpf_oom_evaluate_task and bypass the
> tsk_is_oom_victim (and MMF_OOM_SKIP) logic. (Michal)
>
> - add a new hook to set policy's name, so dump_header() can know
> what has been the selection policy when reporting messages. (Michal)
>
> - add a tracepoint when select_bad_process() find nothing. (Alan)
>
> - add a doc to to describe how it is all supposed to work. (Alan)
>
> ================
>
> This patchset adds a new interface and use it to select victim when OOM
> is invoked. The mainly motivation is the need to customizable OOM victim
> selection functionality.
>
> The new interface is a bpf hook plugged in oom_evaluate_task. It takes oc
> and current task as parameters and return a result indicating which one is
> selected by the attached bpf program.
>
> There are several conserns when designing this interface suggested by
> Michal:
>
> 1. Hooking into oom_evaluate_task can keep the consistency of global and
> memcg OOM interface. Besides, it seems the least disruptive to the existing
> oom killer implementation.
>
> 2. Userspace can handle a lot on its own and provide the input to the BPF
> program to make a decision. Since the oom scope iteration will be
> implemented already in the kernel so all the BPF program has to do is to
> rank processes or memcgs.
>
> 3. The new interface should better bypass the current heuristic rules
> (e.g., tsk_is_oom_victim, and MMF_OOM_SKIP) to meet an arbitrary oom
> policy's need.

Can we linux-mm on such changes? I almost missed this series :)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-10  8:13 ` [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task Chuyi Zhou
@ 2023-08-17  2:07   ` Alexei Starovoitov
  2023-08-17  2:51     ` Chuyi Zhou
  2023-08-22 10:39     ` Michal Hocko
  2023-09-13  1:18   ` Bixuan Cui
  2023-09-13 11:24   ` Bixuan Cui
  2 siblings, 2 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2023-08-17  2:07 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	muchun.song, bpf, LKML, wuyun.abel, robin.lu, Michal Hocko

On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>  static int oom_evaluate_task(struct task_struct *task, void *arg)
>  {
>         struct oom_control *oc = arg;
> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>         if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
>                 goto next;
>
> +       /*
> +        * If task is allocating a lot of memory and has been marked to be
> +        * killed first if it triggers an oom, then select it.
> +        */
> +       if (oom_task_origin(task)) {
> +               points = LONG_MAX;
> +               goto select;
> +       }
> +
> +       switch (bpf_oom_evaluate_task(task, oc)) {
> +       case BPF_EVAL_ABORT:
> +               goto abort; /* abort search process */
> +       case BPF_EVAL_NEXT:
> +               goto next; /* ignore the task */
> +       case BPF_EVAL_SELECT:
> +               goto select; /* select the task */
> +       default:
> +               break; /* No BPF policy */
> +       }
> +

I think forcing bpf prog to look at every task is going to be limiting
long term.
It's more flexible to invoke bpf prog from out_of_memory()
and if it doesn't choose a task then fallback to select_bad_process().
I believe that's what Roman was proposing.
bpf can choose to iterate memcg or it might have some side knowledge
that there are processes that can be set as oc->chosen right away,
so it can skip the iteration.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-17  2:07   ` Alexei Starovoitov
@ 2023-08-17  2:51     ` Chuyi Zhou
  2023-08-17  3:22       ` Alexei Starovoitov
  2023-08-22 10:39     ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-17  2:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	muchun.song, bpf, LKML, wuyun.abel, robin.lu, Michal Hocko

Hello,

在 2023/8/17 10:07, Alexei Starovoitov 写道:
> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>>   static int oom_evaluate_task(struct task_struct *task, void *arg)
>>   {
>>          struct oom_control *oc = arg;
>> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>>          if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
>>                  goto next;
>>
>> +       /*
>> +        * If task is allocating a lot of memory and has been marked to be
>> +        * killed first if it triggers an oom, then select it.
>> +        */
>> +       if (oom_task_origin(task)) {
>> +               points = LONG_MAX;
>> +               goto select;
>> +       }
>> +
>> +       switch (bpf_oom_evaluate_task(task, oc)) {
>> +       case BPF_EVAL_ABORT:
>> +               goto abort; /* abort search process */
>> +       case BPF_EVAL_NEXT:
>> +               goto next; /* ignore the task */
>> +       case BPF_EVAL_SELECT:
>> +               goto select; /* select the task */
>> +       default:
>> +               break; /* No BPF policy */
>> +       }
>> +
> 
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

IIUC, We may need some new bpf features if we want to iterating 
tasks/memcg in BPF, sush as:
bpf_for_each_task
bpf_for_each_memcg
bpf_for_each_task_in_memcg
...

It seems we have some work to do first in the BPF side.
Will these iterating features be useful in other BPF scenario except OOM 
Policy?

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-17  2:51     ` Chuyi Zhou
@ 2023-08-17  3:22       ` Alexei Starovoitov
  2023-08-18  3:30         ` Chuyi Zhou
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2023-08-17  3:22 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Muchun Song, bpf, LKML, wuyun.abel, robin.lu, Michal Hocko

On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>
> Hello,
>
> 在 2023/8/17 10:07, Alexei Starovoitov 写道:
> > On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> >>   static int oom_evaluate_task(struct task_struct *task, void *arg)
> >>   {
> >>          struct oom_control *oc = arg;
> >> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >>          if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
> >>                  goto next;
> >>
> >> +       /*
> >> +        * If task is allocating a lot of memory and has been marked to be
> >> +        * killed first if it triggers an oom, then select it.
> >> +        */
> >> +       if (oom_task_origin(task)) {
> >> +               points = LONG_MAX;
> >> +               goto select;
> >> +       }
> >> +
> >> +       switch (bpf_oom_evaluate_task(task, oc)) {
> >> +       case BPF_EVAL_ABORT:
> >> +               goto abort; /* abort search process */
> >> +       case BPF_EVAL_NEXT:
> >> +               goto next; /* ignore the task */
> >> +       case BPF_EVAL_SELECT:
> >> +               goto select; /* select the task */
> >> +       default:
> >> +               break; /* No BPF policy */
> >> +       }
> >> +
> >
> > I think forcing bpf prog to look at every task is going to be limiting
> > long term.
> > It's more flexible to invoke bpf prog from out_of_memory()
> > and if it doesn't choose a task then fallback to select_bad_process().
> > I believe that's what Roman was proposing.
> > bpf can choose to iterate memcg or it might have some side knowledge
> > that there are processes that can be set as oc->chosen right away,
> > so it can skip the iteration.
>
> IIUC, We may need some new bpf features if we want to iterating
> tasks/memcg in BPF, sush as:
> bpf_for_each_task
> bpf_for_each_memcg
> bpf_for_each_task_in_memcg
> ...
>
> It seems we have some work to do first in the BPF side.
> Will these iterating features be useful in other BPF scenario except OOM
> Policy?

Yes.
Use open coded iterators though.
Like example in
https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/

bpf_for_each(task_vma, vma, task, 0) { ... }
will safely iterate vma-s of the task.
Similarly struct css_task_iter can be hidden inside bpf open coded iterator.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-17  3:22       ` Alexei Starovoitov
@ 2023-08-18  3:30         ` Chuyi Zhou
  2023-08-18  4:34           ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Chuyi Zhou @ 2023-08-18  3:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Muchun Song, bpf, LKML, wuyun.abel, robin.lu, Michal Hocko

Hello,
在 2023/8/17 11:22, Alexei Starovoitov 写道:
> On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>>
>> Hello,
>>
>> 在 2023/8/17 10:07, Alexei Starovoitov 写道:
>>> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>>>>    static int oom_evaluate_task(struct task_struct *task, void *arg)
>>>>    {
>>>>           struct oom_control *oc = arg;
>>>> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>>>>           if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
>>>>                   goto next;
>>>>
>>>> +       /*
>>>> +        * If task is allocating a lot of memory and has been marked to be
>>>> +        * killed first if it triggers an oom, then select it.
>>>> +        */
>>>> +       if (oom_task_origin(task)) {
>>>> +               points = LONG_MAX;
>>>> +               goto select;
>>>> +       }
>>>> +
>>>> +       switch (bpf_oom_evaluate_task(task, oc)) {
>>>> +       case BPF_EVAL_ABORT:
>>>> +               goto abort; /* abort search process */
>>>> +       case BPF_EVAL_NEXT:
>>>> +               goto next; /* ignore the task */
>>>> +       case BPF_EVAL_SELECT:
>>>> +               goto select; /* select the task */
>>>> +       default:
>>>> +               break; /* No BPF policy */
>>>> +       }
>>>> +
>>>
>>> I think forcing bpf prog to look at every task is going to be limiting
>>> long term.
>>> It's more flexible to invoke bpf prog from out_of_memory()
>>> and if it doesn't choose a task then fallback to select_bad_process().
>>> I believe that's what Roman was proposing.
>>> bpf can choose to iterate memcg or it might have some side knowledge
>>> that there are processes that can be set as oc->chosen right away,
>>> so it can skip the iteration.
>>
>> IIUC, We may need some new bpf features if we want to iterating
>> tasks/memcg in BPF, sush as:
>> bpf_for_each_task
>> bpf_for_each_memcg
>> bpf_for_each_task_in_memcg
>> ...
>>
>> It seems we have some work to do first in the BPF side.
>> Will these iterating features be useful in other BPF scenario except OOM
>> Policy?
> 
> Yes.
> Use open coded iterators though.
> Like example in
> https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/
> 
> bpf_for_each(task_vma, vma, task, 0) { ... }
> will safely iterate vma-s of the task.
> Similarly struct css_task_iter can be hidden inside bpf open coded iterator.
OK. I think the following APIs whould be useful and I am willing to 
start with these in another bpf-next RFC patchset:

1. bpf_for_each(task). Just like for_each_process(p) in kernel to 
itearing all tasks in the system with rcu_read_lock().

2. bpf_for_each(css_task, task, css). It works like 
css_task_iter_{start, next, end} and would be used to iterating 
tasks/threads under a css.

3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). It works 
like css_next_descendant_{pre, post} to iterating all descendant.

If you have better ideas or any advice, please let me know.
Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-18  3:30         ` Chuyi Zhou
@ 2023-08-18  4:34           ` Alexei Starovoitov
  0 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2023-08-18  4:34 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Muchun Song, bpf, LKML, wuyun.abel, robin.lu, Michal Hocko

On Thu, Aug 17, 2023 at 8:30 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>
> Hello,
> 在 2023/8/17 11:22, Alexei Starovoitov 写道:
> > On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> >>
> >> Hello,
> >>
> >> 在 2023/8/17 10:07, Alexei Starovoitov 写道:
> >>> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> >>>>    static int oom_evaluate_task(struct task_struct *task, void *arg)
> >>>>    {
> >>>>           struct oom_control *oc = arg;
> >>>> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >>>>           if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
> >>>>                   goto next;
> >>>>
> >>>> +       /*
> >>>> +        * If task is allocating a lot of memory and has been marked to be
> >>>> +        * killed first if it triggers an oom, then select it.
> >>>> +        */
> >>>> +       if (oom_task_origin(task)) {
> >>>> +               points = LONG_MAX;
> >>>> +               goto select;
> >>>> +       }
> >>>> +
> >>>> +       switch (bpf_oom_evaluate_task(task, oc)) {
> >>>> +       case BPF_EVAL_ABORT:
> >>>> +               goto abort; /* abort search process */
> >>>> +       case BPF_EVAL_NEXT:
> >>>> +               goto next; /* ignore the task */
> >>>> +       case BPF_EVAL_SELECT:
> >>>> +               goto select; /* select the task */
> >>>> +       default:
> >>>> +               break; /* No BPF policy */
> >>>> +       }
> >>>> +
> >>>
> >>> I think forcing bpf prog to look at every task is going to be limiting
> >>> long term.
> >>> It's more flexible to invoke bpf prog from out_of_memory()
> >>> and if it doesn't choose a task then fallback to select_bad_process().
> >>> I believe that's what Roman was proposing.
> >>> bpf can choose to iterate memcg or it might have some side knowledge
> >>> that there are processes that can be set as oc->chosen right away,
> >>> so it can skip the iteration.
> >>
> >> IIUC, We may need some new bpf features if we want to iterating
> >> tasks/memcg in BPF, sush as:
> >> bpf_for_each_task
> >> bpf_for_each_memcg
> >> bpf_for_each_task_in_memcg
> >> ...
> >>
> >> It seems we have some work to do first in the BPF side.
> >> Will these iterating features be useful in other BPF scenario except OOM
> >> Policy?
> >
> > Yes.
> > Use open coded iterators though.
> > Like example in
> > https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/
> >
> > bpf_for_each(task_vma, vma, task, 0) { ... }
> > will safely iterate vma-s of the task.
> > Similarly struct css_task_iter can be hidden inside bpf open coded iterator.
> OK. I think the following APIs whould be useful and I am willing to
> start with these in another bpf-next RFC patchset:
>
> 1. bpf_for_each(task). Just like for_each_process(p) in kernel to
> itearing all tasks in the system with rcu_read_lock().
>
> 2. bpf_for_each(css_task, task, css). It works like
> css_task_iter_{start, next, end} and would be used to iterating
> tasks/threads under a css.
>
> 3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). It works
> like css_next_descendant_{pre, post} to iterating all descendant.
>
> If you have better ideas or any advice, please let me know.

Sounds great. Such 3 new iterators are unrelated to oom discussion and
can be developed/landed in parallel.
They will be useful in other bpf programs.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-17  2:07   ` Alexei Starovoitov
  2023-08-17  2:51     ` Chuyi Zhou
@ 2023-08-22 10:39     ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2023-08-22 10:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Chuyi Zhou, Johannes Weiner, Roman Gushchin, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, muchun.song, bpf, LKML,
	wuyun.abel, robin.lu

On Wed 16-08-23 19:07:10, Alexei Starovoitov wrote:
> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> >  static int oom_evaluate_task(struct task_struct *task, void *arg)
> >  {
> >         struct oom_control *oc = arg;
> > @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >         if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
> >                 goto next;
> >
> > +       /*
> > +        * If task is allocating a lot of memory and has been marked to be
> > +        * killed first if it triggers an oom, then select it.
> > +        */
> > +       if (oom_task_origin(task)) {
> > +               points = LONG_MAX;
> > +               goto select;
> > +       }
> > +
> > +       switch (bpf_oom_evaluate_task(task, oc)) {
> > +       case BPF_EVAL_ABORT:
> > +               goto abort; /* abort search process */
> > +       case BPF_EVAL_NEXT:
> > +               goto next; /* ignore the task */
> > +       case BPF_EVAL_SELECT:
> > +               goto select; /* select the task */
> > +       default:
> > +               break; /* No BPF policy */
> > +       }
> > +
> 
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

This is certainly possible but I am worried this will lead to a lot of
duplication. There are common tasks that all/most oom victim selection
implementations should do. First of all they should make sure that the
victim is belonging to the oom domain. Arguably it should be also aware
of ongoing oom victim tear down to prevent from overkilling. Proper oom
victim reference counting handling. Most people are not even aware of
those things. Do we really want all those to be re-invented - most
likely incorrectly?

Advantage of reusing oom_evaluate_task is that all that can be avoided.
Iterating over tasks with a pre-defined oom-victim sure sounds like
unnecessary and wasted CPU cycles but if you want to prevent
over-killing this is still necessary. As the oom killer can be invoked
really rapidly (before the victim has a chance to die) I believe this is
a very useful feature.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-10  8:13 ` [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task Chuyi Zhou
  2023-08-17  2:07   ` Alexei Starovoitov
@ 2023-09-13  1:18   ` Bixuan Cui
  2023-09-13  8:00     ` Chuyi Zhou
  2023-09-13 11:24   ` Bixuan Cui
  2 siblings, 1 reply; 30+ messages in thread
From: Bixuan Cui @ 2023-09-13  1:18 UTC (permalink / raw)
  To: Chuyi Zhou, hannes, mhocko, roman.gushchin, ast, daniel, andrii,
	muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Michal Hocko



在 2023/8/10 16:13, Chuyi Zhou 写道:
> +#include <linux/bpf.h> #include <linux/oom.h> #include <linux/mm.h> 
> #include <linux/err.h> @@ -305,6 +306,27 @@ static enum oom_constraint 
> constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } 
> +enum { + NO_BPF_POLICY, + BPF_EVAL_ABORT, + BPF_EVAL_NEXT, + 
> BPF_EVAL_SELECT, +}; +

I saw that tools/testing/selftests/bpf/progs/oom_policy.c is also using 
NO_BPF_POLICY etc. I think
+enum {
+	NO_BPF_POLICY,
+	BPF_EVAL_ABORT,
+	BPF_EVAL_NEXT,
+	BPF_EVAL_SELECT,
+};
+
definitions can be placed in include/linux/oom.h

Thanks
Bixuan Cui

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 4/5] bpf: Add a OOM policy test
  2023-08-10  8:13 ` [RFC PATCH v2 4/5] bpf: Add a OOM policy test Chuyi Zhou
  2023-08-16 11:53   ` Alan Maguire
@ 2023-09-13  7:55   ` Bixuan Cui
  1 sibling, 0 replies; 30+ messages in thread
From: Bixuan Cui @ 2023-09-13  7:55 UTC (permalink / raw)
  To: Chuyi Zhou, hannes, mhocko, roman.gushchin, ast, daniel, andrii,
	muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu



在 2023/8/10 16:13, Chuyi Zhou 写道:
> + +void test_oom_policy(void) +{ + struct oom_policy *skel; + struct 
> bpf_link *link;
'link' doesn't seem to be used.

Thanks
Bixuan Cui

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-09-13  1:18   ` Bixuan Cui
@ 2023-09-13  8:00     ` Chuyi Zhou
  0 siblings, 0 replies; 30+ messages in thread
From: Chuyi Zhou @ 2023-09-13  8:00 UTC (permalink / raw)
  To: Bixuan Cui, hannes, mhocko, roman.gushchin, ast, daniel, andrii,
	muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Michal Hocko

Hello, Bixuan.

在 2023/9/13 09:18, Bixuan Cui 写道:
> 
> 
> 在 2023/8/10 16:13, Chuyi Zhou 写道:
>> +#include <linux/bpf.h> #include <linux/oom.h> #include <linux/mm.h> 
>> #include <linux/err.h> @@ -305,6 +306,27 @@ static enum oom_constraint 
>> constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } 
>> +enum { + NO_BPF_POLICY, + BPF_EVAL_ABORT, + BPF_EVAL_NEXT, + 
>> BPF_EVAL_SELECT, +}; +
> 
> I saw that tools/testing/selftests/bpf/progs/oom_policy.c is also using 
> NO_BPF_POLICY etc. I think
> +enum {
> +    NO_BPF_POLICY,
> +    BPF_EVAL_ABORT,
> +    BPF_EVAL_NEXT,
> +    BPF_EVAL_SELECT,
> +};
> +
> definitions can be placed in include/linux/oom.h
> 

Thanks for your feedback!

Yes, Maybe we should move these enums to a more proper place so that 
they can be generated by BTF and we can take them from vmlinux.h.

> Thanks
> Bixuan Cui

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task
  2023-08-10  8:13 ` [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task Chuyi Zhou
  2023-08-17  2:07   ` Alexei Starovoitov
  2023-09-13  1:18   ` Bixuan Cui
@ 2023-09-13 11:24   ` Bixuan Cui
  2 siblings, 0 replies; 30+ messages in thread
From: Bixuan Cui @ 2023-09-13 11:24 UTC (permalink / raw)
  To: Chuyi Zhou, hannes, mhocko, roman.gushchin, ast, daniel, andrii,
	muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu, Michal Hocko



在 2023/8/10 16:13, Chuyi Zhou 写道:
> + +__weak noinline int bpf_oom_evaluate_task(struct task_struct *task, 
> struct oom_control *oc) +{ + return NO_BPF_POLICY; +} + 
> +BTF_SET8_START(oom_bpf_fmodret_ids) +BTF_ID_FLAGS(func, 
> bpf_oom_evaluate_task) +BTF_SET8_END(oom_bpf_fmodret_ids)
I have a question here, why use __weak? Is there other modules that can 
redefine bpf_oom_evaluate_task? why not use __bpf_kfunc 
(Documentation/bpf/kfuncs.rst) ?

Thanks
Bixuan Cui

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies
  2023-08-14 20:51   ` Jonathan Corbet
  2023-08-15  2:28     ` Chuyi Zhou
@ 2023-09-14 12:02     ` Bixuan Cui
  2023-09-14 12:50       ` [External] " Chuyi Zhou
  2023-09-14 12:04     ` Bixuan Cui
  2 siblings, 1 reply; 30+ messages in thread
From: Bixuan Cui @ 2023-09-14 12:02 UTC (permalink / raw)
  To: Jonathan Corbet, Chuyi Zhou, hannes, mhocko, roman.gushchin, ast,
	daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu



在 2023/8/15 4:51, Jonathan Corbet 写道:
>>   /**
>>    * dump_tasks - dump current memory state of all system tasks
>>    * @oc: pointer to struct oom_control
>> @@ -484,8 +513,8 @@ static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
>>   
>>   static void dump_header(struct oom_control *oc, struct task_struct *p)
>>   {
>> -	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
>> -		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
>> +	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, policy_name=%s, oom_score_adj=%hd\n",
>> +		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, oc->policy_name,
> ...and if the policy name is unterminated, this print will run off the
> end of the structure.
> 
> Am I missing something here?
Perhaps it is inaccurate to use policy name in this way. For example, 
some one use BPF_PROG(bpf_oom_evaluate_task, ...) but do not set the 
policy name through bpf_set_policy_name. In this way, the result is 
still policy name=default, which ultimately leads to error print in the 
dump_header.
I think a better way:

+static const char *const policy_select[] = {
+    "OOM_DEFAULT";
+    "BPF_ABORT",
+    "BPF_NEXT",
+    "BPF_SELECT",
+};

struct oom_control {

  	/* Used to print the constraint info. */
  	enum oom_constraint constraint;
+
+	/* Used to report the policy select. */
+	int policy_select;
  };

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
...

+	switch (bpf_oom_evaluate_task(task, oc)) {
+	case BPF_EVAL_ABORT:
+              oc->policy_select = BPF_EVAL_ABORT;
+		goto abort; /* abort search process */
+	case BPF_EVAL_NEXT:
+              oc->policy_select = BPF_EVAL_NEXT;
+		goto next; /* ignore the task */
+	case BPF_EVAL_SELECT:
+             oc->policy_select = BPF_EVAL_SELECT;
+		goto select; /* select the task */
+	default:
+		break; /* No BPF policy */
+	}

  static void dump_header(struct oom_control *oc, struct task_struct *p)
  {
-	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, 
oom_score_adj=%hd\n",
-		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
+	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, 
policy_name=%s, oom_score_adj=%hd\n",
+		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, 
policy_select[oc->policy_select],
  			current->signal->oom_score_adj);


And all definitions of oc should be added
struct oom_control oc = {
      .select = NO_BPF_POLICY,
}

Delete set_oom_policy_name, it makes no sense for users to set policy 
names. :-)

Thanks
Bixuan Cui






^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies
  2023-08-14 20:51   ` Jonathan Corbet
  2023-08-15  2:28     ` Chuyi Zhou
  2023-09-14 12:02     ` Bixuan Cui
@ 2023-09-14 12:04     ` Bixuan Cui
  2 siblings, 0 replies; 30+ messages in thread
From: Bixuan Cui @ 2023-09-14 12:04 UTC (permalink / raw)
  To: Jonathan Corbet, Chuyi Zhou, hannes, mhocko, roman.gushchin, ast,
	daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu



在 2023/8/15 4:51, Jonathan Corbet 写道:
>>   /**
>>    * dump_tasks - dump current memory state of all system tasks
>>    * @oc: pointer to struct oom_control
>> @@ -484,8 +513,8 @@ static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
>>   
>>   static void dump_header(struct oom_control *oc, struct task_struct *p)
>>   {
>> -	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
>> -		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
>> +	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, policy_name=%s, oom_score_adj=%hd\n",
>> +		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, oc->policy_name,
> ...and if the policy name is unterminated, this print will run off the
> end of the structure.
> 
> Am I missing something here?
Perhaps it is inaccurate to use policy name in this way. For example, 
some one use BPF_PROG(bpf_oom_evaluate_task, ...) but do not set the 
policy name through bpf_set_policy_name. In this way, the result is 
still policy name=default, which ultimately leads to error print in the 
dump_header.
I think a better way:

+static const char *const policy_select[] = {
+    "OOM_DEFAULT";
+    "BPF_ABORT",
+    "BPF_NEXT",
+    "BPF_SELECT",
+};

struct oom_control {

  	/* Used to print the constraint info. */
  	enum oom_constraint constraint;
+
+	/* Used to report the policy select. */
+	int policy_select;
  };

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
...

+	switch (bpf_oom_evaluate_task(task, oc)) {
+	case BPF_EVAL_ABORT:
+              oc->policy_select = BPF_EVAL_ABORT;
+		goto abort; /* abort search process */
+	case BPF_EVAL_NEXT:
+              oc->policy_select = BPF_EVAL_NEXT;
+		goto next; /* ignore the task */
+	case BPF_EVAL_SELECT:
+             oc->policy_select = BPF_EVAL_SELECT;
+		goto select; /* select the task */
+	default:
+		break; /* No BPF policy */
+	}

  static void dump_header(struct oom_control *oc, struct task_struct *p)
  {
-	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, 
oom_score_adj=%hd\n",
-		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
+	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, 
policy_name=%s, oom_score_adj=%hd\n",
+		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, 
policy_select[oc->policy_select],
  			current->signal->oom_score_adj);


And all definitions of oc should be added
struct oom_control oc = {
      .select = NO_BPF_POLICY,
}

Delete set_oom_policy_name, it makes no sense for users to set policy 
names. :-)

Thanks
Bixuan Cui






^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies
  2023-09-14 12:02     ` Bixuan Cui
@ 2023-09-14 12:50       ` Chuyi Zhou
  2023-09-15  2:28         ` Bixuan Cui
  0 siblings, 1 reply; 30+ messages in thread
From: Chuyi Zhou @ 2023-09-14 12:50 UTC (permalink / raw)
  To: Bixuan Cui, Jonathan Corbet, hannes, mhocko, roman.gushchin, ast,
	daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu

Hello,

在 2023/9/14 20:02, Bixuan Cui 写道:
> 
> 
> 在 2023/8/15 4:51, Jonathan Corbet 写道:
>>>   /**
>>>    * dump_tasks - dump current memory state of all system tasks
>>>    * @oc: pointer to struct oom_control
>>> @@ -484,8 +513,8 @@ static void dump_oom_summary(struct oom_control 
>>> *oc, struct task_struct *victim)
>>>   static void dump_header(struct oom_control *oc, struct task_struct *p)
>>>   {
>>> -    pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, 
>>> oom_score_adj=%hd\n",
>>> -        current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
>>> +    pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, 
>>> policy_name=%s, oom_score_adj=%hd\n",
>>> +        current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, 
>>> oc->policy_name,
>> ...and if the policy name is unterminated, this print will run off the
>> end of the structure.
>>
>> Am I missing something here?
> Perhaps it is inaccurate to use policy name in this way. For example, 
> some one use BPF_PROG(bpf_oom_evaluate_task, ...) but do not set the 
> policy name through bpf_set_policy_name. In this way, the result is 
> still policy name=default, which ultimately leads to error print in the 
> dump_header.
> I think a better way:
> 
> +static const char *const policy_select[] = {
> +    "OOM_DEFAULT";
> +    "BPF_ABORT",
> +    "BPF_NEXT",
> +    "BPF_SELECT",
> +};
> 
> struct oom_control {
> 
>       /* Used to print the constraint info. */
>       enum oom_constraint constraint;
> +
> +    /* Used to report the policy select. */
> +    int policy_select;
>   };
> 
> static int oom_evaluate_task(struct task_struct *task, void *arg)
> {
> ...
> 
> +    switch (bpf_oom_evaluate_task(task, oc)) {
> +    case BPF_EVAL_ABORT:
> +              oc->policy_select = BPF_EVAL_ABORT;
> +        goto abort; /* abort search process */
> +    case BPF_EVAL_NEXT:
> +              oc->policy_select = BPF_EVAL_NEXT;
> +        goto next; /* ignore the task */
> +    case BPF_EVAL_SELECT:
> +             oc->policy_select = BPF_EVAL_SELECT;
> +        goto select; /* select the task */
> +    default:
> +        break; /* No BPF policy */
> +    }
> 
>   static void dump_header(struct oom_control *oc, struct task_struct *p)
>   {
> -    pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, 
> oom_score_adj=%hd\n",
> -        current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
> +    pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, 
> policy_name=%s, oom_score_adj=%hd\n",
> +        current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, 
> policy_select[oc->policy_select],
>               current->signal->oom_score_adj);
> 
> 

The policy_name may be different from the previous OOM reporting, even 
though they are using the same policy.

> And all definitions of oc should be added
> struct oom_control oc = {
>       .select = NO_BPF_POLICY,
> }
> 
> Delete set_oom_policy_name, it makes no sense for users to set policy 
> names. :-)
> 

There can be multiple OOM policy in the system at the same time.

If we need to apply different OOM policies to different memcgs based on 
different scenarios, we can use this hook(set_oom_policy_name) to set 
name to identify which policy in invoked at that time.

Just some thoughts.

Thanks.

> Thanks
> Bixuan Cui
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies
  2023-09-14 12:50       ` [External] " Chuyi Zhou
@ 2023-09-15  2:28         ` Bixuan Cui
  2023-09-15  3:31           ` Chuyi Zhou
  0 siblings, 1 reply; 30+ messages in thread
From: Bixuan Cui @ 2023-09-15  2:28 UTC (permalink / raw)
  To: Chuyi Zhou, Jonathan Corbet, hannes, mhocko, roman.gushchin, ast,
	daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu



在 2023/9/14 20:50, Chuyi Zhou 写道:
>>
>> Delete set_oom_policy_name, it makes no sense for users to set policy 
>> names. 🙂
>>
> 
> There can be multiple OOM policy in the system at the same time.
> 
> If we need to apply different OOM policies to different memcgs based on 
> different scenarios, we can use this hook(set_oom_policy_name) to set 
> name to identify which policy in invoked at that time.
> 
> Just some thoughts.
Well, I thought the system would only load one OOM policy(set one policy 
name), which would set the prio of all memcgs.

What you mean is that multiple bpf.c may be loaded at the system, and 
each bpf.c sets the different policy for different memcgs?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies
  2023-09-15  2:28         ` Bixuan Cui
@ 2023-09-15  3:31           ` Chuyi Zhou
  0 siblings, 0 replies; 30+ messages in thread
From: Chuyi Zhou @ 2023-09-15  3:31 UTC (permalink / raw)
  To: Bixuan Cui, Jonathan Corbet, hannes, mhocko, roman.gushchin, ast,
	daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu



在 2023/9/15 10:28, Bixuan Cui 写道:
> 
> 
> 在 2023/9/14 20:50, Chuyi Zhou 写道:
>>>
>>> Delete set_oom_policy_name, it makes no sense for users to set policy 
>>> names. 🙂
>>>
>>
>> There can be multiple OOM policy in the system at the same time.
>>
>> If we need to apply different OOM policies to different memcgs based 
>> on different scenarios, we can use this hook(set_oom_policy_name) to 
>> set name to identify which policy in invoked at that time.
>>
>> Just some thoughts.
> Well, I thought the system would only load one OOM policy(set one policy 
> name), which would set the prio of all memcgs.
> 
> What you mean is that multiple bpf.c may be loaded at the system, and 
> each bpf.c sets the different policy for different memcgs?

No. Not multiple bpf_oompolicy.c but multiple OOM Policy in one BPF Program.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v2 4/5] bpf: Add a OOM policy test
  2023-08-16 14:34         ` Chuyi Zhou
@ 2023-09-28 11:35           ` Charlley Green
  0 siblings, 0 replies; 30+ messages in thread
From: Charlley Green @ 2023-09-28 11:35 UTC (permalink / raw)
  To: Chuyi Zhou, Alan Maguire, hannes, mhocko, roman.gushchin, ast,
	daniel, andrii, muchun.song
  Cc: bpf, linux-kernel, wuyun.abel, robin.lu


在 2023/8/16 22:34, Chuyi Zhou 写道:
> Hello,
>
> 在 2023/8/16 21:49, Alan Maguire 写道:
>> On 16/08/2023 13:31, Chuyi Zhou wrote:
>>> Hello,
>>>
>>> 在 2023/8/16 19:53, Alan Maguire 写道:
>>>> On 10/08/2023 09:13, Chuyi Zhou wrote:
>>>>> This patch adds a test which implements a priority-based policy 
>>>>> through
>>>>> bpf_oom_evaluate_task.
>>>>>
>>>>> The BPF program, oom_policy.c, compares the cgroup priority of two 
>>>>> tasks
>>>>> and select the lower one. The userspace program test_oom_policy.c
>>>>> maintains a priority map by using cgroup id as the keys and 
>>>>> priority as
>>>>> the values. We could protect certain cgroups from oom-killer by 
>>>>> setting
>>>>> higher priority.
>>>>>
>>>>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>>>>> ---
>>>>>    .../bpf/prog_tests/test_oom_policy.c          | 140 
>>>>> ++++++++++++++++++
>>>>>    .../testing/selftests/bpf/progs/oom_policy.c  | 104 +++++++++++++
>>>>>    2 files changed, 244 insertions(+)
>>>>>    create mode 100644
>>>>> tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>>>    create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c
>>>>>
>>>>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>>> b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>>> new file mode 100644
>>>>> index 000000000000..bea61ff22603
>>>>> --- /dev/null
>>>>> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>>>>> @@ -0,0 +1,140 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>>> +#define _GNU_SOURCE
>>>>> +
>>>>> +#include <stdio.h>
>>>>> +#include <fcntl.h>
>>>>> +#include <unistd.h>
>>>>> +#include <stdlib.h>
>>>>> +#include <signal.h>
>>>>> +#include <sys/stat.h>
>>>>> +#include <test_progs.h>
>>>>> +#include <bpf/btf.h>
>>>>> +#include <bpf/bpf.h>
>>>>> +
>>>>> +#include "cgroup_helpers.h"
>>>>> +#include "oom_policy.skel.h"
>>>>> +
>>>>> +static int map_fd;
>>>>> +static int cg_nr;
>>>>> +struct {
>>>>> +    const char *path;
>>>>> +    int fd;
>>>>> +    unsigned long long id;
>>>>> +} cgs[] = {
>>>>> +    { "/cg1" },
>>>>> +    { "/cg2" },
>>>>> +};
>>>>> +
>>>>> +
>>>>> +static struct oom_policy *open_load_oom_policy_skel(void)
>>>>> +{
>>>>> +    struct oom_policy *skel;
>>>>> +    int err;
>>>>> +
>>>>> +    skel = oom_policy__open();
>>>>> +    if (!ASSERT_OK_PTR(skel, "skel_open"))
>>>>> +        return NULL;
>>>>> +
>>>>> +    err = oom_policy__load(skel);
>>>>> +    if (!ASSERT_OK(err, "skel_load"))
>>>>> +        goto cleanup;
>>>>> +
>>>>> +    return skel;
>>>>> +
>>>>> +cleanup:
>>>>> +    oom_policy__destroy(skel);
>>>>> +    return NULL;
>>>>> +}
>>>>> +
>>>>> +static void run_memory_consume(unsigned long long consume_size, int
>>>>> idx)
>>>>> +{
>>>>> +    char *buf;
>>>>> +
>>>>> +    join_parent_cgroup(cgs[idx].path);
>>>>> +    buf = malloc(consume_size);
>>>>> +    memset(buf, 0, consume_size);
>>>>> +    sleep(2);
>>>>> +    exit(0);
>>>>> +}
>>>>> +
>>>>> +static int set_cgroup_prio(unsigned long long cg_id, int prio)
>>>>> +{
>>>>> +    int err;
>>>>> +
>>>>> +    err = bpf_map_update_elem(map_fd, &cg_id, &prio, BPF_ANY);
>>>>> +    ASSERT_EQ(err, 0, "update_map");
>>>>> +    return err;
>>>>> +}
>>>>> +
>>>>> +static int prepare_cgroup_environment(void)
>>>>> +{
>>>>> +    int err;
>>>>> +
>>>>> +    err = setup_cgroup_environment();
>>>>> +    if (err)
>>>>> +        goto clean_cg_env;
>>>>> +    for (int i = 0; i < cg_nr; i++) {
>>>>> +        err = cgs[i].fd = create_and_get_cgroup(cgs[i].path);
>>>>> +        if (!ASSERT_GE(cgs[i].fd, 0, "cg_create"))
>>>>> +            goto clean_cg_env;
>>>>> +        cgs[i].id = get_cgroup_id(cgs[i].path);
>>>>> +    }
>>>>> +    return 0;
>>>>> +clean_cg_env:
>>>>> +    cleanup_cgroup_environment();
>>>>> +    return err;
>>>>> +}
>>>>> +
>>>>> +void test_oom_policy(void)
>>>>> +{
>>>>> +    struct oom_policy *skel;
>>>>> +    struct bpf_link *link;
>>>>> +    int err;
>>>>> +    int victim_pid;
>>>>> +    unsigned long long victim_cg_id;
>>>>> +
>>>>> +    link = NULL;
>>>>> +    cg_nr = ARRAY_SIZE(cgs);
>>>>> +
>>>>> +    skel = open_load_oom_policy_skel();
>>>>> +    err = oom_policy__attach(skel);
>>>>> +    if (!ASSERT_OK(err, "oom_policy__attach"))
>>>>> +        goto cleanup;
>>>>> +
>>>>> +    map_fd = bpf_object__find_map_fd_by_name(skel->obj, "cg_map");
>>>>> +    if (!ASSERT_GE(map_fd, 0, "find map"))
>>>>> +        goto cleanup;
>>>>> +
>>>>> +    err = prepare_cgroup_environment();
>>>>> +    if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
>>>>> +        goto cleanup;
>>>>> +
>>>>> +    write_cgroup_file("/", "memory.max", "10M");
>>>>> +
>>>>> +    /*
>>>>> +     * Set higher priority to cg2 and lower to cg1, so we would 
>>>>> select
>>>>> +     * task under cg1 as victim.(see oom_policy.c)
>>>>> +     */
>>>>> +    set_cgroup_prio(cgs[0].id, 10);
>>>>> +    set_cgroup_prio(cgs[1].id, 50);
>>>>> +
>>>>> +    victim_cg_id = cgs[0].id;
>>>>> +    victim_pid = fork();
>>>>> +
>>>>> +    if (victim_pid == 0)
>>>>> +        run_memory_consume(1024 * 1024 * 4, 0);
>>>>> +
>>>>> +    if (fork() == 0)
>>>>> +        run_memory_consume(1024 * 1024 * 8, 1);
>>>>> +
>>>>> +    while (wait(NULL) > 0)
>>>>> +        ;
>>>>> +
>>>>> +    ASSERT_EQ(skel->bss->victim_pid, victim_pid, "victim_pid");
>>>>> +    ASSERT_EQ(skel->bss->victim_cg_id, victim_cg_id, "victim_cgid");
>>>>> +    ASSERT_EQ(skel->bss->failed_cnt, 1, "failed_cnt");
>>>>> +cleanup:
>>>>> +    bpf_link__destroy(link);
>>>>> +    oom_policy__destroy(skel);
>>>>> +    cleanup_cgroup_environment();
>>>>> +}
>>>>> diff --git a/tools/testing/selftests/bpf/progs/oom_policy.c
>>>>> b/tools/testing/selftests/bpf/progs/oom_policy.c
>>>>> new file mode 100644
>>>>> index 000000000000..fc9efc93914e
>>>>> --- /dev/null
>>>>> +++ b/tools/testing/selftests/bpf/progs/oom_policy.c
>>>>> @@ -0,0 +1,104 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>>> +#include <vmlinux.h>
>>>>> +#include <bpf/bpf_tracing.h>
>>>>> +#include <bpf/bpf_helpers.h>
>>>>> +
>>>>> +char _license[] SEC("license") = "GPL";
>>>>> +
>>>>> +struct {
>>>>> +    __uint(type, BPF_MAP_TYPE_HASH);
>>>>> +    __type(key, int);
>>>>> +    __type(value, int);
>>>>> +    __uint(max_entries, 24);
>>>>> +} cg_map SEC(".maps");
>>>>> +
>>>>> +unsigned int victim_pid;
>>>>> +u64 victim_cg_id;
>>>>> +int failed_cnt;
>>>>> +
>>>>> +#define    EOPNOTSUPP    95
>>>>> +
>>>>> +enum {
>>>>> +    NO_BPF_POLICY,
>>>>> +    BPF_EVAL_ABORT,
>>>>> +    BPF_EVAL_NEXT,
>>>>> +    BPF_EVAL_SELECT,
>>>>> +};
>>>>
>>>> When I built a kernel using this series and tried building the
>>>> associated test for that kernel I saw:
>>>>
>>>> progs/oom_policy.c:22:2: error: redefinition of enumerator
>>>> 'NO_BPF_POLICY'
>>>>           NO_BPF_POLICY,
>>>>           ^
>>>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75894:2: 
>>>>
>>>> note: previous definition is here
>>>>           NO_BPF_POLICY = 0,
>>>>           ^
>>>> progs/oom_policy.c:23:2: error: redefinition of enumerator
>>>> 'BPF_EVAL_ABORT'
>>>>           BPF_EVAL_ABORT,
>>>>           ^
>>>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75895:2: 
>>>>
>>>> note: previous definition is here
>>>>           BPF_EVAL_ABORT = 1,
>>>>           ^
>>>> progs/oom_policy.c:24:2: error: redefinition of enumerator
>>>> 'BPF_EVAL_NEXT'
>>>>           BPF_EVAL_NEXT,
>>>>           ^
>>>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75896:2: 
>>>>
>>>> note: previous definition is here
>>>>           BPF_EVAL_NEXT = 2,
>>>>           ^
>>>> progs/oom_policy.c:  CLNG-BPF [test_maps] tailcall_bpf2bpf4.bpf.o
>>>> 25:2: error: redefinition of enumerator 'BPF_EVAL_SELECT'
>>>>           BPF_EVAL_SELECT,
>>>>           ^
>>>> /home/opc/src/bpf-next/tools/testing/selftests/bpf/tools/include/vmlinux.h:75897:2: 
>>>>
>>>> note: previous definition is here
>>>>           BPF_EVAL_SELECT = 3,
>>>>           ^
>>>> 4 errors generated.
>>>>
>>>>
>>>> So you shouldn't need the enum definition since it already makes it 
>>>> into
>>>> vmlinux.h.
>>>> OK. It seems my vmlinux.h doesn't contain these enum...
>>>> I also ran into test failures when I removed the above (and 
>>>> compilation
>>>> succeeded):
>>>>
>>>>
>>>> test_oom_policy:PASS:prepare cgroup env 0 nsec
>>>> (cgroup_helpers.c:130: errno: No such file or directory) Opening
>>>> /mnt/cgroup-test-work-dir23054//memory.max
>>>> set_cgroup_prio:PASS:update_map 0 nsec
>>>> set_cgroup_prio:PASS:update_map 0 nsec
>>>> test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
>>>> expected 23058
>>>> test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
>>>> expected 68
>>>> test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
>>>> expected 1
>>>> #154     oom_policy:FAIL
>>>> Summary: 1/0 PASSED, 0 SKIPPED, 1 FAILED
>>>>
>>>> So it seems that because my system was using the cgroupv1 memory
>>>> controller, it could not be used for v2 unless I rebooted with
>>>>
>>>> systemd.unified_cgroup_hierarchy=1
>>>>
>>>> ...on the boot commandline. It would be good to note any such
>>>> requirements for this test in the selftests/bpf/README.rst.
>>>> Might also be worth adding
>>>>
>>>> write_cgroup_file("", "cgroup.subtree_control", "+memory");
>>>>
>>>> ...to ensure the memory controller is enabled for the root cgroup.
>>>>
>>>> At that point the test still failed:
>>>>
>>>> set_cgroup_prio:PASS:update_map 0 nsec
>>>> test_oom_policy:FAIL:victim_pid unexpected victim_pid: actual 0 !=
>>>> expected 12649
>>>> test_oom_policy:FAIL:victim_cgid unexpected victim_cgid: actual 0 !=
>>>> expected 9583
>>>> test_oom_policy:FAIL:failed_cnt unexpected failed_cnt: actual 0 !=
>>>> expected 1
>>>> #154     oom_policy:FAIL
>>>> Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
>>>> Successfully unloaded bpf_testmod.ko.
>>>>
>>>>
>>> It seems that OOM is not invoked in your environment(you can check 
>>> it in
>>> demsg). If the memcg OOM is invoked by the test, we would record the
>>> *victim_pid* and *victim_cgid* and they would not be zero. I guess the
>>> reason is memory_control is not enabled in cgroup
>>> "/mnt/cgroup-test-work-dir23054/", because I see the error message:
>>> (cgroup_helpers.c:130: errno: No such file or directory) Opening
>>>> /mnt/cgroup-test-work-dir23054//memory.max
>>
>> Right, but after I set up unified cgroup hierarchy and rebooted, that
>> message disappeared and cgroup setup succeeded, _but_ the test still
>> failed with 0 victim_pid/cgid.  I see nothing OOM-related in dmesg, but
>> the toplevel cgroupv2 cgroup.controllers file contains:
>>
>> cpuset cpu io memory hugetlb pids rdma
>>
>
> Dose the toplevel cgroupv2's *cgroup.subtree_control* looks like that?
> /sys/fs/cgroup$ cat cgroup.subtree_control
>
>     cpuset cpu io memory hugetlb pids
>
> This prog test would mkdir a test cgroup dir under the toplevel's 
> cgroupv2 and rmdir after the test finishing. In my env, the test 
> cgroup path looks like:
>
> /sys/fs/cgroup/cgroup-test-work-dirxxx/
>
> This test would run in cgroup-test-work-dirxxx.
>
> If we want to enable memory controller in cgroup-test-work-dirxxx, we 
> should guarantee that /sys/fs/cgroup/cgroup.subtree_control contanins
> "memory".
>
>
>> Is there something else that needs to be done to enable OOM scanning?
>> I see the oom_reaper process:
>>
>> root          72       2  0 11:30 ?        00:00:00 [oom_reaper]
>>
>>
>> This test will need to pass BPF CI, so any assumptions about
>> configuration need to be ironed out. For example, I think you should
>> probably have
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>> b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>> index bea61ff22603..54fdb8a59816 100644
>> --- a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
>> @@ -109,6 +109,7 @@ void test_oom_policy(void)
>>          if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
>>                  goto cleanup;
>>
>> +       write_cgroup_file("/", "cgroup.subtree_control", "+memory");
>>          write_cgroup_file("/", "memory.max", "10M");
>
> Yes, you are right. We do need something to guarantee that the memory 
> controller is enabled in cgroup-test-work-dir.
>     write_cgroup_file("/", "cgroup.subtree_control", "+memory");
> This code actually dose something like:
>
> echo "+memory" > 
> /sys/fs/cgroup/cgroup-test-work-dir/cgroup.subtree_control
>
> What we need actually is
> echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
>
> Thanks!
Hello!

I confirmed that I have the memory and other necessary options in
cgroup.subtree_control you mentioned.

cat /sys/fs/cgroup/cgroup.subtree_control

cpuset cpu io memory hugetlb pids rdma misc

I almost exactly reproduced the problem Alan encountered.

make => redefinition of enumerator XXX => delete the enum => start the
bpf selftest with "make run_tests"

In the end, three Fails also appeared, with victim_pid, victim_cgid, and
failed_cnt all being 0.

My environment:

- I installed the necessary software for bpf on Ubuntu 22.04.

- I downloaded "linux-source-6.2.0 - Linux kernel source for version
6.2.0 with Ubuntu patches" from apt source.

- I have applied your series of patches on this version of the kernel
source code.

- I compiled and installed this patched kernel and installed its
linux-tools

- Finally I reproduced the above error.

Could you please provide an exact environment in which your success can
be replicated?

Perhaps this requires a more precise environment, including

which version of the kernel source code your patch needs to be based on,

which distribution of operating system,

which necessary software needs to be installed,

and which features need to be enabled.

If you can help solve the problem, I would be very grateful!
>>
>>          /*
>>
>> ...to be safe, since
>>
>> https://docs.kernel.org/admin-guide/cgroup-v2.html#organizing-processes-and-threads 
>>
>>
>> ...says
>>
>> "No controller is enabled by default. Controllers can be enabled and
>> disabled by writing to the "cgroup.subtree_control" file:
>>
>> # echo "+cpu +memory -io" > cgroup.subtree_control
>>
>> "
>>
>> Are there any other aspects of configuration like that which might
>> explain why the test passes for you but fails for me?
>>
>> Alan
>
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-09-28 11:35 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-10  8:13 [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task Chuyi Zhou
2023-08-10  8:13 ` [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task Chuyi Zhou
2023-08-17  2:07   ` Alexei Starovoitov
2023-08-17  2:51     ` Chuyi Zhou
2023-08-17  3:22       ` Alexei Starovoitov
2023-08-18  3:30         ` Chuyi Zhou
2023-08-18  4:34           ` Alexei Starovoitov
2023-08-22 10:39     ` Michal Hocko
2023-09-13  1:18   ` Bixuan Cui
2023-09-13  8:00     ` Chuyi Zhou
2023-09-13 11:24   ` Bixuan Cui
2023-08-10  8:13 ` [RFC PATCH v2 2/5] mm: Add policy_name to identify OOM policies Chuyi Zhou
2023-08-14 20:51   ` Jonathan Corbet
2023-08-15  2:28     ` Chuyi Zhou
2023-09-14 12:02     ` Bixuan Cui
2023-09-14 12:50       ` [External] " Chuyi Zhou
2023-09-15  2:28         ` Bixuan Cui
2023-09-15  3:31           ` Chuyi Zhou
2023-09-14 12:04     ` Bixuan Cui
2023-08-10  8:13 ` [RFC PATCH v2 3/5] mm: Add a tracepoint when OOM victim selection is failed Chuyi Zhou
2023-08-16 11:54   ` Alan Maguire
2023-08-10  8:13 ` [RFC PATCH v2 4/5] bpf: Add a OOM policy test Chuyi Zhou
2023-08-16 11:53   ` Alan Maguire
2023-08-16 12:31     ` Chuyi Zhou
2023-08-16 13:49       ` Alan Maguire
2023-08-16 14:34         ` Chuyi Zhou
2023-09-28 11:35           ` Charlley Green
2023-09-13  7:55   ` Bixuan Cui
2023-08-10  8:13 ` [RFC PATCH v2 5/5] bpf: Add a BPF OOM policy Doc Chuyi Zhou
2023-08-16 15:49 ` [PATCH RFC v2 0/5] mm: Select victim using bpf_oom_evaluate_task Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).