BPF Archive on lore.kernel.org
* [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper
@ 2019-05-22  5:39 Yonghong Song
  2019-05-22  5:39 ` [PATCH bpf-next v2 1/3] " Yonghong Song
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Yonghong Song @ 2019-05-22  5:39 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Peter Zijlstra,
	Yonghong Song

This patch set tries to solve the following specific use case.

Currently, a bpf program can already collect stack traces
through the kernel function get_perf_callchain()
when certain events happen (e.g., a cache miss counter or
cpu clock counter overflows). But such stack traces are
not sufficient for jitted programs, e.g., hhvm (jitted php).
To get the real stack trace, the jit engine's internal data
structures need to be traversed in order to find the real
user functions.

The bpf program itself may not be the best place to traverse
the jit engine, as the traversal logic could be complex and
the jit internals are not a stable interface either.

Instead, hhvm installs a signal handler,
e.g. for SIGALRM, and defines a set of program locations at
which it can dump stack traces. When it receives a signal, it
dumps the stack at the next such program location.

This patch set implements a bpf_send_signal() helper to send
a signal to hhvm in real time, resulting in the intended stack traces.

Patch #1 implements the bpf_send_signal() helper in the kernel.
Patch #2 syncs the uapi header bpf.h to the tools directory.
Patch #3 adds a self test which covers tracepoint
and perf_event bpf programs.

Changelogs:
  RFC v1 => v2:
. the previous version allowed sending a signal to an
  arbitrary pid. This version just sends the signal to the
  current task, to avoid unstable pids and potential races
  between sending signals and task state changes for the pid.

Yonghong Song (3):
  bpf: implement bpf_send_signal() helper
  tools/bpf: sync bpf uapi header bpf.h to tools directory
  tools/bpf: add a selftest for bpf_send_signal() helper

 include/uapi/linux/bpf.h                      |  17 +-
 kernel/trace/bpf_trace.c                      |  67 ++++++
 tools/include/uapi/linux/bpf.h                |  17 +-
 tools/testing/selftests/bpf/Makefile          |   3 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   1 +
 .../bpf/progs/test_send_signal_kern.c         |  51 +++++
 .../selftests/bpf/test_send_signal_user.c     | 212 ++++++++++++++++++
 7 files changed, 365 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_send_signal_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_send_signal_user.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-22  5:39 [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper Yonghong Song
@ 2019-05-22  5:39 ` " Yonghong Song
  2019-05-23 15:41   ` Daniel Borkmann
  2019-05-22  5:39 ` [PATCH bpf-next v2 2/3] tools/bpf: sync bpf uapi header bpf.h to tools directory Yonghong Song
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Yonghong Song @ 2019-05-22  5:39 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Peter Zijlstra,
	Yonghong Song

This patch tries to solve the following specific use case.

Currently, a bpf program can already collect stack traces
through the kernel function get_perf_callchain()
when certain events happen (e.g., a cache miss counter or
cpu clock counter overflows). But such stack traces are
not sufficient for jitted programs, e.g., hhvm (jitted php).
To get the real stack trace, the jit engine's internal data
structures need to be traversed in order to find the real
user functions.

The bpf program itself may not be the best place to traverse
the jit engine, as the traversal logic could be complex and
the jit internals are not a stable interface either.

Instead, hhvm installs a signal handler,
e.g. for SIGALRM, and defines a set of program locations at
which it can dump stack traces. When it receives a signal, it
dumps the stack at the next such program location.

Such a mechanism could be implemented in the following way:
  . a perf ring buffer is created between the bpf program
    and the tracing app.
  . once a particular event happens, the bpf program writes
    to the ring buffer and the tracing app gets notified.
  . the tracing app sends a signal SIGALRM to hhvm.

But this method could introduce large delays, causing skewed
profiling results.

This patch implements a bpf_send_signal() helper to send
a signal to hhvm in real time, resulting in the intended stack traces.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/uapi/linux/bpf.h | 17 +++++++++-
 kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 63e0cf66f01a..68d4470523a0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2672,6 +2672,20 @@ union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf-local-storage cannot be found.
+ *
+ * int bpf_send_signal(u32 sig)
+ *	Description
+ *		Send signal *sig* to the current task.
+ *	Return
+ *		0 on success or successfully queued.
+ *
+ *		**-EBUSY** if work queue under nmi is full.
+ *
+ *		**-EINVAL** if *sig* is invalid.
+ *
+ *		**-EPERM** if no permission to send the *sig*.
+ *
+ *		**-EAGAIN** if bpf program can try again.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2782,7 +2796,8 @@ union bpf_attr {
 	FN(strtol),			\
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
-	FN(sk_storage_delete),
+	FN(sk_storage_delete),		\
+	FN(send_signal),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..f8cd0db7289f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+struct send_signal_irq_work {
+	struct irq_work irq_work;
+	u32 sig;
+};
+
+static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
+
+static void do_bpf_send_signal(struct irq_work *entry)
+{
+	struct send_signal_irq_work *work;
+
+	work = container_of(entry, struct send_signal_irq_work, irq_work);
+	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
+}
+
+BPF_CALL_1(bpf_send_signal, u32, sig)
+{
+	struct send_signal_irq_work *work = NULL;
+
+	/* Similar to bpf_probe_write_user, task needs to be
+	 * in a sound condition and kernel memory access be
+	 * permitted in order to send signal to the current
+	 * task.
+	 */
+	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
+		return -EPERM;
+	if (unlikely(uaccess_kernel()))
+		return -EPERM;
+	if (unlikely(!nmi_uaccess_okay()))
+		return -EPERM;
+
+	if (in_nmi()) {
+		work = this_cpu_ptr(&send_signal_work);
+		if (work->irq_work.flags & IRQ_WORK_BUSY)
+			return -EBUSY;
+
+		work->sig = sig;
+		irq_work_queue(&work->irq_work);
+		return 0;
+	}
+
+	return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
+
+}
+
+static const struct bpf_func_proto bpf_send_signal_proto = {
+	.func		= bpf_send_signal,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -617,6 +669,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
 #endif
+	case BPF_FUNC_send_signal:
+		return &bpf_send_signal_proto;
 	default:
 		return NULL;
 	}
@@ -1343,5 +1397,18 @@ static int __init bpf_event_init(void)
 	return 0;
 }
 
+static int __init send_signal_irq_work_init(void)
+{
+	int cpu;
+	struct send_signal_irq_work *work;
+
+	for_each_possible_cpu(cpu) {
+		work = per_cpu_ptr(&send_signal_work, cpu);
+		init_irq_work(&work->irq_work, do_bpf_send_signal);
+	}
+	return 0;
+}
+
 fs_initcall(bpf_event_init);
+subsys_initcall(send_signal_irq_work_init);
 #endif /* CONFIG_MODULES */
-- 
2.17.1



* [PATCH bpf-next v2 2/3] tools/bpf: sync bpf uapi header bpf.h to tools directory
  2019-05-22  5:39 [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper Yonghong Song
  2019-05-22  5:39 ` [PATCH bpf-next v2 1/3] " Yonghong Song
@ 2019-05-22  5:39 ` Yonghong Song
  2019-05-22  5:39 ` [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper Yonghong Song
  2019-05-22 16:38 ` [PATCH bpf-next v2 0/3] bpf: implement " Stanislav Fomichev
  3 siblings, 0 replies; 19+ messages in thread
From: Yonghong Song @ 2019-05-22  5:39 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Peter Zijlstra,
	Yonghong Song

The bpf uapi header include/uapi/linux/bpf.h is sync'ed
to tools/include/uapi/linux/bpf.h.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/include/uapi/linux/bpf.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 63e0cf66f01a..68d4470523a0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2672,6 +2672,20 @@ union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf-local-storage cannot be found.
+ *
+ * int bpf_send_signal(u32 sig)
+ *	Description
+ *		Send signal *sig* to the current task.
+ *	Return
+ *		0 on success or successfully queued.
+ *
+ *		**-EBUSY** if work queue under nmi is full.
+ *
+ *		**-EINVAL** if *sig* is invalid.
+ *
+ *		**-EPERM** if no permission to send the *sig*.
+ *
+ *		**-EAGAIN** if bpf program can try again.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2782,7 +2796,8 @@ union bpf_attr {
 	FN(strtol),			\
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
-	FN(sk_storage_delete),
+	FN(sk_storage_delete),		\
+	FN(send_signal),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.17.1



* [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper
  2019-05-22  5:39 [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper Yonghong Song
  2019-05-22  5:39 ` [PATCH bpf-next v2 1/3] " Yonghong Song
  2019-05-22  5:39 ` [PATCH bpf-next v2 2/3] tools/bpf: sync bpf uapi header bpf.h to tools directory Yonghong Song
@ 2019-05-22  5:39 ` Yonghong Song
  2019-05-22 18:48   ` Andrii Nakryiko
  2019-05-22 19:10   ` Stanislav Fomichev
  2019-05-22 16:38 ` [PATCH bpf-next v2 0/3] bpf: implement " Stanislav Fomichev
  3 siblings, 2 replies; 19+ messages in thread
From: Yonghong Song @ 2019-05-22  5:39 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Peter Zijlstra,
	Yonghong Song

The test covers both nmi and tracepoint perf events.
  $ ./test_send_signal_user
  test_send_signal (tracepoint): OK
  test_send_signal (perf_event): OK

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   3 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   1 +
 .../bpf/progs/test_send_signal_kern.c         |  51 +++++
 .../selftests/bpf/test_send_signal_user.c     | 212 ++++++++++++++++++
 4 files changed, 266 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_send_signal_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_send_signal_user.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 66f2dca1dee1..5eb6368a96a2 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -23,7 +23,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
 	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
 	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
-	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl
+	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl \
+	test_send_signal_user
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 5f6f9e7aba2a..cb02521b8e58 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -216,6 +216,7 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
 	(void *) BPF_FUNC_sk_storage_get;
 static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
 	(void *)BPF_FUNC_sk_storage_delete;
+static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
new file mode 100644
index 000000000000..45a1a1a2c345
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+#include <linux/bpf.h>
+#include <linux/version.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") info_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u64),
+	.max_entries = 1,
+};
+
+BPF_ANNOTATE_KV_PAIR(info_map, __u32, __u64);
+
+struct bpf_map_def SEC("maps") status_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u64),
+	.max_entries = 1,
+};
+
+BPF_ANNOTATE_KV_PAIR(status_map, __u32, __u64);
+
+SEC("send_signal_demo")
+int bpf_send_signal_test(void *ctx)
+{
+	__u64 *info_val, *status_val;
+	__u32 key = 0, pid, sig;
+	int ret;
+
+	status_val = bpf_map_lookup_elem(&status_map, &key);
+	if (!status_val || *status_val != 0)
+		return 0;
+
+	info_val = bpf_map_lookup_elem(&info_map, &key);
+	if (!info_val || *info_val == 0)
+		return 0;
+
+	sig = *info_val >> 32;
+	pid = *info_val & 0xffffFFFF;
+
+	if ((bpf_get_current_pid_tgid() >> 32) == pid) {
+		ret = bpf_send_signal(sig);
+		if (ret == 0)
+			*status_val = 1;
+	}
+
+	return 0;
+}
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_send_signal_user.c b/tools/testing/selftests/bpf/test_send_signal_user.c
new file mode 100644
index 000000000000..0bd0f7674860
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_send_signal_user.c
@@ -0,0 +1,212 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <signal.h>
+#include <syscall.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <linux/perf_event.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "bpf_rlimit.h"
+
+#define CHECK(condition, tag, format...) ({				\
+	int __ret = !!(condition);					\
+	if (__ret) {							\
+		printf("%s(%d):FAIL:%s ", __func__, __LINE__, tag);	\
+		printf(format);						\
+		printf("\n");						\
+	}								\
+	__ret;								\
+})
+
+static volatile int signal_received = 0;
+
+static void sigusr1_handler(int signum)
+{
+	signal_received++;
+}
+
+static void test_common(struct perf_event_attr *attr, int prog_type,
+			const char *test_name)
+{
+	int pmu_fd, prog_fd, info_map_fd, status_map_fd;
+	const char *file = "./test_send_signal_kern.o";
+	struct bpf_object *obj = NULL;
+	int pipe_c2p[2], pipe_p2c[2];
+	char buf[256];
+	int err = 0;
+	u32 key = 0;
+	pid_t pid;
+	u64 val;
+
+	if (CHECK(pipe(pipe_c2p), test_name,
+		  "pipe pipe_c2p error: %s", strerror(errno)))
+		return;
+
+	if (CHECK(pipe(pipe_p2c), test_name,
+		  "pipe pipe_p2c error: %s", strerror(errno))) {
+		close(pipe_c2p[0]);
+		close(pipe_c2p[1]);
+		return;
+	}
+
+	pid = fork();
+	if (CHECK(pid < 0, test_name, "fork error: %s", strerror(errno))) {
+		close(pipe_c2p[0]);
+		close(pipe_c2p[1]);
+		close(pipe_p2c[0]);
+		close(pipe_p2c[1]);
+		return;
+	}
+
+	if (pid == 0) {
+		/* install signal handler and notify parent */
+		signal(SIGUSR1, sigusr1_handler);
+
+		close(pipe_c2p[0]); /* close read */
+		close(pipe_p2c[1]); /* close write */
+
+		/* notify parent signal handler is installed */
+		write(pipe_c2p[1], buf, 1);
+
+		/* wait until parent has enabled bpf program to send_signal */
+		read(pipe_p2c[0], buf, 1);
+
+		/* wait a little for signal handler */
+		sleep(1);
+
+		if (signal_received)
+			write(pipe_c2p[1], "2", 1);
+		else
+			write(pipe_c2p[1], "0", 1);
+
+		/* wait for parent notification and exit */
+		read(pipe_p2c[0], buf, 1);
+
+		close(pipe_c2p[1]);
+		close(pipe_p2c[0]);
+		exit(0);
+	}
+
+	close(pipe_c2p[1]); /* close write */
+	close(pipe_p2c[0]); /* close read */
+
+	err = bpf_prog_load(file, prog_type, &obj, &prog_fd);
+	if (CHECK(err < 0, test_name, "bpf_prog_load error: %s",
+		  strerror(errno)))
+		goto prog_load_failure;
+
+	pmu_fd = syscall(__NR_perf_event_open, attr, pid, -1,
+			 -1 /* group id */, 0 /* flags */);
+	if (CHECK(pmu_fd < 0, test_name, "perf_event_open error: %s",
+		  strerror(errno)))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err < 0, test_name, "ioctl perf_event_ioc_enable error: %s",
+		  strerror(errno)))
+		goto disable_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err < 0, test_name, "ioctl perf_event_ioc_set_bpf error: %s",
+		  strerror(errno)))
+		goto disable_pmu;
+
+	info_map_fd = bpf_object__find_map_fd_by_name(obj, "info_map");
+	if (CHECK(info_map_fd < 0, test_name, "find map %s error", "info_map"))
+		goto disable_pmu;
+
+	status_map_fd = bpf_object__find_map_fd_by_name(obj, "status_map");
+	if (CHECK(status_map_fd < 0, test_name, "find map %s error", "status_map"))
+		goto disable_pmu;
+
+	/* wait until child signal handler installed */
+	read(pipe_c2p[0], buf, 1);
+
+	/* trigger the bpf send_signal */
+	key = 0;
+	val = (((u64)(SIGUSR1)) << 32) | pid;
+	bpf_map_update_elem(info_map_fd, &key, &val, 0);
+
+	/* notify child that bpf program can send_signal now */
+	write(pipe_p2c[1], buf, 1);
+
+	/* wait for result */
+	read(pipe_c2p[0], buf, 1);
+
+	if (buf[0] == '2')
+		printf("test_send_signal (%s): OK\n", test_name);
+	else
+		printf("test_send_signal (%s): FAIL\n", test_name);
+
+	/* notify child safe to exit */
+	write(pipe_p2c[1], buf, 1);
+
+disable_pmu:
+	close(pmu_fd);
+close_prog:
+	bpf_object__close(obj);
+prog_load_failure:
+	close(pipe_c2p[0]);
+	close(pipe_p2c[1]);
+	wait(NULL);
+}
+
+static void test_tracepoint(void)
+{
+	struct perf_event_attr attr = {
+		.type = PERF_TYPE_TRACEPOINT,
+		.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN,
+		.sample_period = 1,
+		.wakeup_events = 1,
+	};
+	int bytes, efd;
+	char buf[256];
+
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/syscalls/sys_enter_nanosleep/id");
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "tracepoint",
+		  "open syscalls/sys_enter_nanosleep/id failure: %s",
+		  strerror(errno)))
+		return;
+
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "tracepoint",
+		  "read syscalls/sys_enter_nanosleep/id failure: %s",
+		  strerror(errno)))
+		return;
+
+	attr.config = strtol(buf, NULL, 0);
+
+	test_common(&attr, BPF_PROG_TYPE_TRACEPOINT, "tracepoint");
+}
+
+static void test_nmi_perf_event(void)
+{
+	struct perf_event_attr attr = {
+		.sample_freq = 50,
+		.freq = 1,
+		.type = PERF_TYPE_HARDWARE,
+		.config = PERF_COUNT_HW_CPU_CYCLES,
+	};
+
+	test_common(&attr, BPF_PROG_TYPE_PERF_EVENT, "perf_event");
+}
+
+int main(void)
+{
+	test_tracepoint();
+	test_nmi_perf_event();
+	return 0;
+}
-- 
2.17.1



* Re: [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper
  2019-05-22  5:39 [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper Yonghong Song
                   ` (2 preceding siblings ...)
  2019-05-22  5:39 ` [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper Yonghong Song
@ 2019-05-22 16:38 ` " Stanislav Fomichev
  2019-05-22 16:43   ` Alexei Starovoitov
  3 siblings, 1 reply; 19+ messages in thread
From: Stanislav Fomichev @ 2019-05-22 16:38 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Peter Zijlstra

On 05/21, Yonghong Song wrote:
> This patch tries to solve the following specific use case.
> 
> Currently, bpf program can already collect stack traces
> through kernel function get_perf_callchain()
> when certain events happens (e.g., cache miss counter or
> cpu clock counter overflows). But such stack traces are
> not enough for jitted programs, e.g., hhvm (jited php).
> To get real stack trace, jit engine internal data structures
> need to be traversed in order to get the real user functions.
> 
> bpf program itself may not be the best place to traverse
> the jit engine as the traversing logic could be complex and
> it is not a stable interface either.
> 
> Instead, hhvm implements a signal handler,
> e.g. for SIGALARM, and a set of program locations which
> it can dump stack traces. When it receives a signal, it will
> dump the stack in next such program location.
> 

[..]
> This patch implements bpf_send_signal() helper to send
> a signal to hhvm in real time, resulting in intended stack traces.
Series looks good. One minor nit/question: maybe rename bpf_send_signal
to something like bpf_send_signal_to_current/bpf_current_send_signal/etc?
bpf_send_signal is too generic now that you send the signal
to the current process..

> Patch #1 implemented the bpf_send_helper() in the kernel,
> Patch #2 synced uapi header bpf.h to tools directory.
> Patch #3 added a self test which covers tracepoint
> and perf_event bpf programs.
> 
> Changelogs:
>   RFC v1 => v2:
>     . previous version allows to send signal to an arbitrary
>       pid. This version just sends the signal to current
>       task to avoid unstable pid and potential races between
>       sending signals and task state changes for the pid.
> 
> Yonghong Song (3):
>   bpf: implement bpf_send_signal() helper
>   tools/bpf: sync bpf uapi header bpf.h to tools directory
>   tools/bpf: add a selftest for bpf_send_signal() helper
> 
>  include/uapi/linux/bpf.h                      |  17 +-
>  kernel/trace/bpf_trace.c                      |  67 ++++++
>  tools/include/uapi/linux/bpf.h                |  17 +-
>  tools/testing/selftests/bpf/Makefile          |   3 +-
>  tools/testing/selftests/bpf/bpf_helpers.h     |   1 +
>  .../bpf/progs/test_send_signal_kern.c         |  51 +++++
>  .../selftests/bpf/test_send_signal_user.c     | 212 ++++++++++++++++++
>  7 files changed, 365 insertions(+), 3 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_send_signal_user.c
> 
> -- 
> 2.17.1
> 


* Re: [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper
  2019-05-22 16:38 ` [PATCH bpf-next v2 0/3] bpf: implement " Stanislav Fomichev
@ 2019-05-22 16:43   ` Alexei Starovoitov
  2019-05-22 17:11     ` Stanislav Fomichev
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2019-05-22 16:43 UTC (permalink / raw)
  To: Stanislav Fomichev, Yonghong Song
  Cc: bpf, netdev, Daniel Borkmann, Kernel Team, Peter Zijlstra

On 5/22/19 9:38 AM, Stanislav Fomichev wrote:
> On 05/21, Yonghong Song wrote:
>> This patch tries to solve the following specific use case.
>>
>> Currently, bpf program can already collect stack traces
>> through kernel function get_perf_callchain()
>> when certain events happens (e.g., cache miss counter or
>> cpu clock counter overflows). But such stack traces are
>> not enough for jitted programs, e.g., hhvm (jited php).
>> To get real stack trace, jit engine internal data structures
>> need to be traversed in order to get the real user functions.
>>
>> bpf program itself may not be the best place to traverse
>> the jit engine as the traversing logic could be complex and
>> it is not a stable interface either.
>>
>> Instead, hhvm implements a signal handler,
>> e.g. for SIGALARM, and a set of program locations which
>> it can dump stack traces. When it receives a signal, it will
>> dump the stack in next such program location.
>>
> 
> [..]
>> This patch implements bpf_send_signal() helper to send
>> a signal to hhvm in real time, resulting in intended stack traces.
> Series looks good. One minor nit/question: maybe rename bpf_send_signal
> to something like bpf_send_signal_to_current/bpf_current_send_signal/etc?
> bpf_send_signal is too generic now that you send the signal
> to the current process..

bpf_send_signal_to_current was Yonghong's original name
and I asked him to rename it to bpf_send_signal :)
I don't see bpf sending signals to arbitrary processes
in the near future.


* Re: [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper
  2019-05-22 16:43   ` Alexei Starovoitov
@ 2019-05-22 17:11     ` Stanislav Fomichev
  0 siblings, 0 replies; 19+ messages in thread
From: Stanislav Fomichev @ 2019-05-22 17:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, bpf, netdev, Daniel Borkmann, Kernel Team, Peter Zijlstra

On 05/22, Alexei Starovoitov wrote:
> On 5/22/19 9:38 AM, Stanislav Fomichev wrote:
> > On 05/21, Yonghong Song wrote:
> >> This patch tries to solve the following specific use case.
> >>
> >> Currently, bpf program can already collect stack traces
> >> through kernel function get_perf_callchain()
> >> when certain events happens (e.g., cache miss counter or
> >> cpu clock counter overflows). But such stack traces are
> >> not enough for jitted programs, e.g., hhvm (jited php).
> >> To get real stack trace, jit engine internal data structures
> >> need to be traversed in order to get the real user functions.
> >>
> >> bpf program itself may not be the best place to traverse
> >> the jit engine as the traversing logic could be complex and
> >> it is not a stable interface either.
> >>
> >> Instead, hhvm implements a signal handler,
> >> e.g. for SIGALARM, and a set of program locations which
> >> it can dump stack traces. When it receives a signal, it will
> >> dump the stack in next such program location.
> >>
> > 
> > [..]
> >> This patch implements bpf_send_signal() helper to send
> >> a signal to hhvm in real time, resulting in intended stack traces.
> > Series looks good. One minor nit/question: maybe rename bpf_send_signal
> > to something like bpf_send_signal_to_current/bpf_current_send_signal/etc?
> > bpf_send_signal is too generic now that you send the signal
> > to the current process..
> 
> bpf_send_signal_to_current was Yonghong's original name
> and I asked him to rename it to bpf_send_signal :)
> I don't see bpf sending signals to arbitrary processes
> in the near future.
:) I was thinking that it might be a bit more clear if we communicate
target process in the name, but I don't feel strongly about it.

So it's not about the future. OTOH, who knows, I remember when people
said there would not be any locking/loops in bpf. And here we are :)


* Re: [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper
  2019-05-22  5:39 ` [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper Yonghong Song
@ 2019-05-22 18:48   ` Andrii Nakryiko
  2019-05-22 19:38     ` Yonghong Song
  2019-05-22 19:10   ` Stanislav Fomichev
  1 sibling, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2019-05-22 18:48 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Peter Zijlstra

On Tue, May 21, 2019 at 10:40 PM Yonghong Song <yhs@fb.com> wrote:
>
> The test covered both nmi and tracepoint perf events.
>   $ ./test_send_signal_user
>   test_send_signal (tracepoint): OK
>   test_send_signal (perf_event): OK
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/testing/selftests/bpf/Makefile          |   3 +-
>  tools/testing/selftests/bpf/bpf_helpers.h     |   1 +
>  .../bpf/progs/test_send_signal_kern.c         |  51 +++++
>  .../selftests/bpf/test_send_signal_user.c     | 212 ++++++++++++++++++
>  4 files changed, 266 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_send_signal_user.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 66f2dca1dee1..5eb6368a96a2 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -23,7 +23,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>         test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
>         test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
>         test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
> -       test_netcnt test_tcpnotify_user test_sock_fields test_sysctl
> +       test_netcnt test_tcpnotify_user test_sock_fields test_sysctl \
> +       test_send_signal_user
>
>  BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>  TEST_GEN_FILES = $(BPF_OBJ_FILES)
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index 5f6f9e7aba2a..cb02521b8e58 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -216,6 +216,7 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
>         (void *) BPF_FUNC_sk_storage_get;
>  static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
>         (void *)BPF_FUNC_sk_storage_delete;
> +static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
>
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
> new file mode 100644
> index 000000000000..45a1a1a2c345
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2019 Facebook
> +#include <linux/bpf.h>
> +#include <linux/version.h>
> +#include "bpf_helpers.h"
> +
> +struct bpf_map_def SEC("maps") info_map = {
> +       .type = BPF_MAP_TYPE_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u64),
> +       .max_entries = 1,
> +};
> +
> +BPF_ANNOTATE_KV_PAIR(info_map, __u32, __u64);
> +
> +struct bpf_map_def SEC("maps") status_map = {
> +       .type = BPF_MAP_TYPE_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u64),
> +       .max_entries = 1,
> +};
> +
> +BPF_ANNOTATE_KV_PAIR(status_map, __u32, __u64);
> +
> +SEC("send_signal_demo")
> +int bpf_send_signal_test(void *ctx)
> +{
> +       __u64 *info_val, *status_val;
> +       __u32 key = 0, pid, sig;
> +       int ret;
> +
> +       status_val = bpf_map_lookup_elem(&status_map, &key);
> +       if (!status_val || *status_val != 0)
> +               return 0;
> +
> +       info_val = bpf_map_lookup_elem(&info_map, &key);
> +       if (!info_val || *info_val == 0)
> +               return 0;
> +
> +       sig = *info_val >> 32;
> +       pid = *info_val & 0xffffFFFF;
> +
> +       if ((bpf_get_current_pid_tgid() >> 32) == pid) {
> +               ret = bpf_send_signal(sig);
> +               if (ret == 0)
> +                       *status_val = 1;
> +       }
> +
> +       return 0;
> +}
> +char __license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/test_send_signal_user.c b/tools/testing/selftests/bpf/test_send_signal_user.c
> new file mode 100644
> index 000000000000..0bd0f7674860
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_send_signal_user.c
> @@ -0,0 +1,212 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <syscall.h>
> +#include <sys/ioctl.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <sys/wait.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +
> +#include <linux/perf_event.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include "bpf_rlimit.h"
> +
> +#define CHECK(condition, tag, format...) ({                            \
> +       int __ret = !!(condition);                                      \
> +       if (__ret) {                                                    \
> +               printf("%s(%d):FAIL:%s ", __func__, __LINE__, tag);     \
> +               printf(format);                                         \
> +               printf("\n");                                           \
> +       }                                                               \
> +       __ret;                                                          \
> +})
> +
> +static volatile int signal_received = 0;
> +
> +static void sigusr1_handler(int signum)
> +{
> +       signal_received++;
> +}
> +
> +static void test_common(struct perf_event_attr *attr, int prog_type,
> +                       const char *test_name)
> +{
> +       int pmu_fd, prog_fd, info_map_fd, status_map_fd;
> +       const char *file = "./test_send_signal_kern.o";
> +       struct bpf_object *obj = NULL;
> +       int pipe_c2p[2], pipe_p2c[2];
> +       char buf[256];
> +       int err = 0;
> +       u32 key = 0;
> +       pid_t pid;
> +       u64 val;
> +
> +       if (CHECK(pipe(pipe_c2p), test_name,
> +                 "pipe pipe_c2p error: %s", strerror(errno)))
> +               return;
> +
> +       if (CHECK(pipe(pipe_p2c), test_name,
> +                 "pipe pipe_p2c error: %s", strerror(errno))) {
> +               close(pipe_c2p[0]);
> +               close(pipe_c2p[1]);
> +               return;
> +       }
> +
> +       pid = fork();
> +       if (CHECK(pid < 0, test_name, "fork error: %s", strerror(errno))) {
> +               close(pipe_c2p[0]);
> +               close(pipe_c2p[1]);
> +               close(pipe_p2c[0]);
> +               close(pipe_p2c[1]);
> +               return;
> +       }
> +
> +       if (pid == 0) {
> +               /* install signal handler and notify parent */
> +               signal(SIGUSR1, sigusr1_handler);
> +
> +               close(pipe_c2p[0]); /* close read */
> +               close(pipe_p2c[1]); /* close write */
> +
> +               /* notify parent signal handler is installed */
> +               write(pipe_c2p[1], buf, 1);
> +
> +               /* make sure parent enabled bpf program to send_signal */
> +               read(pipe_p2c[0], buf, 1);
> +
> +               /* wait a little for signal handler */
> +               sleep(1);
> +
> +               if (signal_received)
> +                       write(pipe_c2p[1], "2", 1);
> +               else
> +                       write(pipe_c2p[1], "0", 1);
> +
> +               /* wait for parent notification and exit */
> +               read(pipe_p2c[0], buf, 1);
> +
> +               close(pipe_c2p[1]);
> +               close(pipe_p2c[0]);
> +               exit(0);
> +       }
> +
> +       close(pipe_c2p[1]); /* close write */
> +       close(pipe_p2c[0]); /* close read */
> +
> +       err = bpf_prog_load(file, prog_type, &obj, &prog_fd);
> +       if (CHECK(err < 0, test_name, "bpf_prog_load error: %s",
> +                 strerror(errno)))
> +               goto prog_load_failure;
> +
> +       pmu_fd = syscall(__NR_perf_event_open, attr, pid, -1,
> +                        -1 /* group id */, 0 /* flags */);
> +       if (CHECK(pmu_fd < 0, test_name, "perf_event_open error: %s",
> +                 strerror(errno)))
> +               goto close_prog;
> +
> +       err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +       if (CHECK(err < 0, test_name, "ioctl perf_event_ioc_enable error: %s",
> +                 strerror(errno)))
> +               goto disable_pmu;
> +
> +       err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +       if (CHECK(err < 0, test_name, "ioctl perf_event_ioc_set_bpf error: %s",
> +                 strerror(errno)))
> +               goto disable_pmu;
> +
> +       info_map_fd = bpf_object__find_map_fd_by_name(obj, "info_map");
> +       if (CHECK(info_map_fd < 0, test_name, "find map %s error", "info_map"))
> +               goto disable_pmu;
> +
> +       status_map_fd = bpf_object__find_map_fd_by_name(obj, "status_map");
> +       if (CHECK(status_map_fd < 0, test_name, "find map %s error", "status_map"))
> +               goto disable_pmu;
> +
> +       /* wait until child signal handler installed */
> +       read(pipe_c2p[0], buf, 1);
> +
> +       /* trigger the bpf send_signal */
> +       key = 0;
> +       val = (((u64)(SIGUSR1)) << 32) | pid;
> +       bpf_map_update_elem(info_map_fd, &key, &val, 0);
> +
> +       /* notify child that bpf program can send_signal now */
> +       write(pipe_p2c[1], buf, 1);
> +
> +       /* wait for result */
> +       read(pipe_c2p[0], buf, 1);
> +
> +       if (buf[0] == '2')
> +               printf("test_send_signal (%s): OK\n", test_name);
> +       else
> +               printf("test_send_signal (%s): FAIL\n", test_name);
> +
> +       /* notify child safe to exit */
> +       write(pipe_p2c[1], buf, 1);
> +
> +disable_pmu:
> +       close(pmu_fd);
> +close_prog:
> +       bpf_object__close(obj);
> +prog_load_failure:
> +       close(pipe_c2p[0]);
> +       close(pipe_p2c[1]);
> +       wait(NULL);
> +}
> +
> +static void test_tracepoint(void)
> +{
> +       struct perf_event_attr attr = {
> +               .type = PERF_TYPE_TRACEPOINT,
> +               .sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN,
> +               .sample_period = 1,
> +               .wakeup_events = 1,
> +       };
> +       int bytes, efd;
> +       char buf[256];
> +
> +       snprintf(buf, sizeof(buf),
> +                "/sys/kernel/debug/tracing/events/syscalls/sys_enter_nanosleep/id");
> +       efd = open(buf, O_RDONLY, 0);
> +       if (CHECK(efd < 0, "tracepoint",
> +                 "open syscalls/sys_enter_nanosleep/id failure: %s",
> +                 strerror(errno)))
> +               return;
> +
> +       bytes = read(efd, buf, sizeof(buf));
> +       close(efd);
> +       if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "tracepoint",
> +                 "read syscalls/sys_enter_nanosleep/id failure: %s",
> +                 strerror(errno)))
> +               return;
> +
> +       attr.config = strtol(buf, NULL, 0);
> +
> +       test_common(&attr, BPF_PROG_TYPE_TRACEPOINT, "tracepoint");
> +}
> +
> +static void test_nmi_perf_event(void)
> +{
> +       struct perf_event_attr attr = {
> +               .sample_freq = 50,
> +               .freq = 1,
> +               .type = PERF_TYPE_HARDWARE,
> +               .config = PERF_COUNT_HW_CPU_CYCLES,
> +       };
> +
> +       test_common(&attr, BPF_PROG_TYPE_PERF_EVENT, "perf_event");
> +}
> +
> +int main(void)
> +{
> +       test_tracepoint();
> +       test_nmi_perf_event();

Tests should probably propagate failure up to main() and return exit
code != 0, if any of the tests failed.

> +       return 0;
> +}
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper
  2019-05-22  5:39 ` [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper Yonghong Song
  2019-05-22 18:48   ` Andrii Nakryiko
@ 2019-05-22 19:10   ` Stanislav Fomichev
  2019-05-22 19:44     ` Yonghong Song
  1 sibling, 1 reply; 19+ messages in thread
From: Stanislav Fomichev @ 2019-05-22 19:10 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Peter Zijlstra

On 05/21, Yonghong Song wrote:
> The test covered both nmi and tracepoint perf events.
>   $ ./test_send_signal_user
>   test_send_signal (tracepoint): OK
>   test_send_signal (perf_event): OK
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/testing/selftests/bpf/Makefile          |   3 +-
>  tools/testing/selftests/bpf/bpf_helpers.h     |   1 +
>  .../bpf/progs/test_send_signal_kern.c         |  51 +++++
>  .../selftests/bpf/test_send_signal_user.c     | 212 ++++++++++++++++++
>  4 files changed, 266 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_send_signal_user.c
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 66f2dca1dee1..5eb6368a96a2 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -23,7 +23,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>  	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
>  	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
>  	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
> -	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl
> +	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl \
> +	test_send_signal_user
>  
>  BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>  TEST_GEN_FILES = $(BPF_OBJ_FILES)
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index 5f6f9e7aba2a..cb02521b8e58 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -216,6 +216,7 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
>  	(void *) BPF_FUNC_sk_storage_get;
>  static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
>  	(void *)BPF_FUNC_sk_storage_delete;
> +static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
>  
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
> new file mode 100644
> index 000000000000..45a1a1a2c345
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2019 Facebook
> +#include <linux/bpf.h>
> +#include <linux/version.h>
> +#include "bpf_helpers.h"
> +
> +struct bpf_map_def SEC("maps") info_map = {
> +	.type = BPF_MAP_TYPE_ARRAY,
> +	.key_size = sizeof(__u32),
> +	.value_size = sizeof(__u64),
> +	.max_entries = 1,
> +};
> +
> +BPF_ANNOTATE_KV_PAIR(info_map, __u32, __u64);
> +
> +struct bpf_map_def SEC("maps") status_map = {
> +	.type = BPF_MAP_TYPE_ARRAY,
> +	.key_size = sizeof(__u32),
> +	.value_size = sizeof(__u64),
> +	.max_entries = 1,
> +};
> +
> +BPF_ANNOTATE_KV_PAIR(status_map, __u32, __u64);
> +
> +SEC("send_signal_demo")
> +int bpf_send_signal_test(void *ctx)
> +{
> +	__u64 *info_val, *status_val;
> +	__u32 key = 0, pid, sig;
> +	int ret;
> +
> +	status_val = bpf_map_lookup_elem(&status_map, &key);
> +	if (!status_val || *status_val != 0)
> +		return 0;
> +
> +	info_val = bpf_map_lookup_elem(&info_map, &key);
> +	if (!info_val || *info_val == 0)
> +		return 0;
> +
> +	sig = *info_val >> 32;
> +	pid = *info_val & 0xffffFFFF;
> +
> +	if ((bpf_get_current_pid_tgid() >> 32) == pid) {
> +		ret = bpf_send_signal(sig);
> +		if (ret == 0)
> +			*status_val = 1;
> +	}
> +
> +	return 0;
> +}
> +char __license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/test_send_signal_user.c b/tools/testing/selftests/bpf/test_send_signal_user.c
> new file mode 100644
> index 000000000000..0bd0f7674860
> --- /dev/null
[..]
> +++ b/tools/testing/selftests/bpf/test_send_signal_user.c
Any reason you didn't put it under bpf/prog_tests?
That way you don't need to define your own CHECK macro and
care about the includes.

> @@ -0,0 +1,212 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <syscall.h>
> +#include <sys/ioctl.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <sys/wait.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +
> +#include <linux/perf_event.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include "bpf_rlimit.h"
> +
> +#define CHECK(condition, tag, format...) ({				\
> +	int __ret = !!(condition);					\
> +	if (__ret) {							\
> +		printf("%s(%d):FAIL:%s ", __func__, __LINE__, tag);	\
> +		printf(format);						\
> +		printf("\n");						\
> +	}								\
> +	__ret;								\
> +})
> +
> +static volatile int signal_received = 0;
> +
> +static void sigusr1_handler(int signum)
> +{
> +	signal_received++;
> +}
> +
> +static void test_common(struct perf_event_attr *attr, int prog_type,
> +			const char *test_name)
> +{
> +	int pmu_fd, prog_fd, info_map_fd, status_map_fd;
> +	const char *file = "./test_send_signal_kern.o";
> +	struct bpf_object *obj = NULL;
> +	int pipe_c2p[2], pipe_p2c[2];
> +	char buf[256];
> +	int err = 0;
> +	u32 key = 0;
> +	pid_t pid;
> +	u64 val;
> +
> +	if (CHECK(pipe(pipe_c2p), test_name,
> +		  "pipe pipe_c2p error: %s", strerror(errno)))
> +		return;
> +
> +	if (CHECK(pipe(pipe_p2c), test_name,
> +		  "pipe pipe_p2c error: %s", strerror(errno))) {
> +		close(pipe_c2p[0]);
> +		close(pipe_c2p[1]);
> +		return;
> +	}
> +
> +	pid = fork();
> +	if (CHECK(pid < 0, test_name, "fork error: %s", strerror(errno))) {
> +		close(pipe_c2p[0]);
> +		close(pipe_c2p[1]);
> +		close(pipe_p2c[0]);
> +		close(pipe_p2c[1]);
> +		return;
> +	}
> +
> +	if (pid == 0) {
> +		/* install signal handler and notify parent */
> +		signal(SIGUSR1, sigusr1_handler);
> +
> +		close(pipe_c2p[0]); /* close read */
> +		close(pipe_p2c[1]); /* close write */
> +
> +		/* notify parent signal handler is installed */
> +		write(pipe_c2p[1], buf, 1);
> +
> +		/* make sure parent enabled bpf program to send_signal */
> +		read(pipe_p2c[0], buf, 1);
> +
> +		/* wait a little for signal handler */
> +		sleep(1);
> +
> +		if (signal_received)
> +			write(pipe_c2p[1], "2", 1);
> +		else
> +			write(pipe_c2p[1], "0", 1);
> +
> +		/* wait for parent notification and exit */
> +		read(pipe_p2c[0], buf, 1);
> +
> +		close(pipe_c2p[1]);
> +		close(pipe_p2c[0]);
> +		exit(0);
> +	}
> +
> +	close(pipe_c2p[1]); /* close write */
> +	close(pipe_p2c[0]); /* close read */
> +
> +	err = bpf_prog_load(file, prog_type, &obj, &prog_fd);
> +	if (CHECK(err < 0, test_name, "bpf_prog_load error: %s",
> +		  strerror(errno)))
> +		goto prog_load_failure;
> +
> +	pmu_fd = syscall(__NR_perf_event_open, attr, pid, -1,
> +			 -1 /* group id */, 0 /* flags */);
> +	if (CHECK(pmu_fd < 0, test_name, "perf_event_open error: %s",
> +		  strerror(errno)))
> +		goto close_prog;
> +
> +	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +	if (CHECK(err < 0, test_name, "ioctl perf_event_ioc_enable error: %s",
> +		  strerror(errno)))
> +		goto disable_pmu;
> +
> +	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +	if (CHECK(err < 0, test_name, "ioctl perf_event_ioc_set_bpf error: %s",
> +		  strerror(errno)))
> +		goto disable_pmu;
> +
> +	info_map_fd = bpf_object__find_map_fd_by_name(obj, "info_map");
> +	if (CHECK(info_map_fd < 0, test_name, "find map %s error", "info_map"))
> +		goto disable_pmu;
> +
> +	status_map_fd = bpf_object__find_map_fd_by_name(obj, "status_map");
> +	if (CHECK(status_map_fd < 0, test_name, "find map %s error", "status_map"))
> +		goto disable_pmu;
> +
> +	/* wait until child signal handler installed */
> +	read(pipe_c2p[0], buf, 1);
> +
> +	/* trigger the bpf send_signal */
> +	key = 0;
> +	val = (((u64)(SIGUSR1)) << 32) | pid;
> +	bpf_map_update_elem(info_map_fd, &key, &val, 0);
> +
> +	/* notify child that bpf program can send_signal now */
> +	write(pipe_p2c[1], buf, 1);
> +
> +	/* wait for result */
> +	read(pipe_c2p[0], buf, 1);
> +
> +	if (buf[0] == '2')
> +		printf("test_send_signal (%s): OK\n", test_name);
> +	else
> +		printf("test_send_signal (%s): FAIL\n", test_name);
> +
> +	/* notify child safe to exit */
> +	write(pipe_p2c[1], buf, 1);
> +
> +disable_pmu:
> +	close(pmu_fd);
> +close_prog:
> +	bpf_object__close(obj);
> +prog_load_failure:
> +	close(pipe_c2p[0]);
> +	close(pipe_p2c[1]);
> +	wait(NULL);
> +}
> +
> +static void test_tracepoint(void)
> +{
> +	struct perf_event_attr attr = {
> +		.type = PERF_TYPE_TRACEPOINT,
> +		.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN,
> +		.sample_period = 1,
> +		.wakeup_events = 1,
> +	};
> +	int bytes, efd;
> +	char buf[256];
> +
> +	snprintf(buf, sizeof(buf),
> +		 "/sys/kernel/debug/tracing/events/syscalls/sys_enter_nanosleep/id");
> +	efd = open(buf, O_RDONLY, 0);
> +	if (CHECK(efd < 0, "tracepoint",
> +		  "open syscalls/sys_enter_nanosleep/id failure: %s",
> +		  strerror(errno)))
> +		return;
> +
> +	bytes = read(efd, buf, sizeof(buf));
> +	close(efd);
> +	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "tracepoint",
> +		  "read syscalls/sys_enter_nanosleep/id failure: %s",
> +		  strerror(errno)))
> +		return;
> +
> +	attr.config = strtol(buf, NULL, 0);
> +
> +	test_common(&attr, BPF_PROG_TYPE_TRACEPOINT, "tracepoint");
> +}
> +
> +static void test_nmi_perf_event(void)
> +{
> +	struct perf_event_attr attr = {
> +		.sample_freq = 50,
> +		.freq = 1,
> +		.type = PERF_TYPE_HARDWARE,
> +		.config = PERF_COUNT_HW_CPU_CYCLES,
> +	};
> +
> +	test_common(&attr, BPF_PROG_TYPE_PERF_EVENT, "perf_event");
> +}
> +
> +int main(void)
> +{
> +	test_tracepoint();
> +	test_nmi_perf_event();
> +	return 0;
> +}
> -- 
> 2.17.1
> 


* Re: [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper
  2019-05-22 18:48   ` Andrii Nakryiko
@ 2019-05-22 19:38     ` Yonghong Song
  0 siblings, 0 replies; 19+ messages in thread
From: Yonghong Song @ 2019-05-22 19:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Peter Zijlstra



On 5/22/19 11:48 AM, Andrii Nakryiko wrote:
> On Tue, May 21, 2019 at 10:40 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> The test covered both nmi and tracepoint perf events.
>>    $ ./test_send_signal_user
>>    test_send_signal (tracepoint): OK
>>    test_send_signal (perf_event): OK
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   tools/testing/selftests/bpf/Makefile          |   3 +-
>>   tools/testing/selftests/bpf/bpf_helpers.h     |   1 +
>>   .../bpf/progs/test_send_signal_kern.c         |  51 +++++
>>   .../selftests/bpf/test_send_signal_user.c     | 212 ++++++++++++++++++
>>   4 files changed, 266 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>>   create mode 100644 tools/testing/selftests/bpf/test_send_signal_user.c
>>
>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
>> index 66f2dca1dee1..5eb6368a96a2 100644
>> --- a/tools/testing/selftests/bpf/Makefile
>> +++ b/tools/testing/selftests/bpf/Makefile
>> @@ -23,7 +23,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>>          test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
>>          test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
>>          test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
>> -       test_netcnt test_tcpnotify_user test_sock_fields test_sysctl
>> +       test_netcnt test_tcpnotify_user test_sock_fields test_sysctl \
>> +       test_send_signal_user
>>
>>   BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>>   TEST_GEN_FILES = $(BPF_OBJ_FILES)
>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>> index 5f6f9e7aba2a..cb02521b8e58 100644
>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
[...]
>> +
>> +int main(void)
>> +{
>> +       test_tracepoint();
>> +       test_nmi_perf_event();
> 
> Tests should probably propagate failure up to main() and return exit
> code != 0, if any of the tests failed.

Good catch! Will fix it in the next revision.

> 
>> +       return 0;
>> +}
>> --
>> 2.17.1
>>


* Re: [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper
  2019-05-22 19:10   ` Stanislav Fomichev
@ 2019-05-22 19:44     ` Yonghong Song
  0 siblings, 0 replies; 19+ messages in thread
From: Yonghong Song @ 2019-05-22 19:44 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Peter Zijlstra



On 5/22/19 12:10 PM, Stanislav Fomichev wrote:
> On 05/21, Yonghong Song wrote:
>> The test covered both nmi and tracepoint perf events.
>>    $ ./test_send_signal_user
>>    test_send_signal (tracepoint): OK
>>    test_send_signal (perf_event): OK
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   tools/testing/selftests/bpf/Makefile          |   3 +-
>>   tools/testing/selftests/bpf/bpf_helpers.h     |   1 +
>>   .../bpf/progs/test_send_signal_kern.c         |  51 +++++
>>   .../selftests/bpf/test_send_signal_user.c     | 212 ++++++++++++++++++
>>   4 files changed, 266 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>>   create mode 100644 tools/testing/selftests/bpf/test_send_signal_user.c
>>
>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
>> index 66f2dca1dee1..5eb6368a96a2 100644
>> --- a/tools/testing/selftests/bpf/Makefile
>> +++ b/tools/testing/selftests/bpf/Makefile
>> @@ -23,7 +23,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>>   	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
>>   	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
>>   	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
>> -	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl
>> +	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl \
>> +	test_send_signal_user
>>   
>>   BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>>   TEST_GEN_FILES = $(BPF_OBJ_FILES)
>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>> index 5f6f9e7aba2a..cb02521b8e58 100644
>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>> @@ -216,6 +216,7 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
>>   	(void *) BPF_FUNC_sk_storage_get;
>>   static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
>>   	(void *)BPF_FUNC_sk_storage_delete;
>> +static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
>>   
>>   /* llvm builtin functions that eBPF C program may use to
>>    * emit BPF_LD_ABS and BPF_LD_IND instructions
>> diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>> new file mode 100644
>> index 000000000000..45a1a1a2c345
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>> @@ -0,0 +1,51 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (c) 2019 Facebook
>> +#include <linux/bpf.h>
>> +#include <linux/version.h>
>> +#include "bpf_helpers.h"
>> +
>> +struct bpf_map_def SEC("maps") info_map = {
>> +	.type = BPF_MAP_TYPE_ARRAY,
>> +	.key_size = sizeof(__u32),
>> +	.value_size = sizeof(__u64),
>> +	.max_entries = 1,
>> +};
>> +
>> +BPF_ANNOTATE_KV_PAIR(info_map, __u32, __u64);
>> +
>> +struct bpf_map_def SEC("maps") status_map = {
>> +	.type = BPF_MAP_TYPE_ARRAY,
>> +	.key_size = sizeof(__u32),
>> +	.value_size = sizeof(__u64),
>> +	.max_entries = 1,
>> +};
>> +
>> +BPF_ANNOTATE_KV_PAIR(status_map, __u32, __u64);
>> +
>> +SEC("send_signal_demo")
>> +int bpf_send_signal_test(void *ctx)
>> +{
>> +	__u64 *info_val, *status_val;
>> +	__u32 key = 0, pid, sig;
>> +	int ret;
>> +
>> +	status_val = bpf_map_lookup_elem(&status_map, &key);
>> +	if (!status_val || *status_val != 0)
>> +		return 0;
>> +
>> +	info_val = bpf_map_lookup_elem(&info_map, &key);
>> +	if (!info_val || *info_val == 0)
>> +		return 0;
>> +
>> +	sig = *info_val >> 32;
>> +	pid = *info_val & 0xffffFFFF;
>> +
>> +	if ((bpf_get_current_pid_tgid() >> 32) == pid) {
>> +		ret = bpf_send_signal(sig);
>> +		if (ret == 0)
>> +			*status_val = 1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +char __license[] SEC("license") = "GPL";
>> diff --git a/tools/testing/selftests/bpf/test_send_signal_user.c b/tools/testing/selftests/bpf/test_send_signal_user.c
>> new file mode 100644
>> index 000000000000..0bd0f7674860
>> --- /dev/null
> [..]
>> +++ b/tools/testing/selftests/bpf/test_send_signal_user.c
> Any reason you didn't put it under bpf/prog_tests?

The only reason I put it as a standalone test is
that the program receives signals, and I wanted to
minimize the potential impact on other tests.

But it might be okay, as the signal is sent to the
child process.

> That way you don't need to define your own CHECK macro and
> care about the includes.

Right. I will try to use the bpf/prog_tests infrastructure
in the next revision.


* Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-22  5:39 ` [PATCH bpf-next v2 1/3] " Yonghong Song
@ 2019-05-23 15:41   ` Daniel Borkmann
  2019-05-23 15:58     ` Yonghong Song
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2019-05-23 15:41 UTC (permalink / raw)
  To: Yonghong Song, bpf, netdev
  Cc: Alexei Starovoitov, kernel-team, Peter Zijlstra

On 05/22/2019 07:39 AM, Yonghong Song wrote:
> This patch tries to solve the following specific use case.
> 
> Currently, bpf program can already collect stack traces
> through kernel function get_perf_callchain()
> when certain events happen (e.g., cache miss counter or
> cpu clock counter overflows). But such stack traces are
> not enough for jitted programs, e.g., hhvm (jited php).
> To get real stack trace, jit engine internal data structures
> need to be traversed in order to get the real user functions.
> 
> bpf program itself may not be the best place to traverse
> the jit engine as the traversing logic could be complex and
> it is not a stable interface either.
> 
> Instead, hhvm implements a signal handler,
> e.g. for SIGALARM, and a set of program locations at which
> it can dump stack traces. When it receives a signal, it will
> dump the stack at the next such program location.
> 
> Such a mechanism can be implemented in the following way:
>   . a perf ring buffer is created between bpf program
>     and tracing app.
>   . once a particular event happens, bpf program writes
>     to the ring buffer and the tracing app gets notified.
>   . the tracing app sends a signal SIGALARM to the hhvm.
> 
> But this method could have large delays and causing profiling
> results skewed.
> 
> This patch implements bpf_send_signal() helper to send
> a signal to hhvm in real time, resulting in intended stack traces.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/uapi/linux/bpf.h | 17 +++++++++-
>  kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 63e0cf66f01a..68d4470523a0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2672,6 +2672,20 @@ union bpf_attr {
>   *		0 on success.
>   *
>   *		**-ENOENT** if the bpf-local-storage cannot be found.
> + *
> + * int bpf_send_signal(u32 sig)
> + *	Description
> + *		Send signal *sig* to the current task.
> + *	Return
> + *		0 on success or successfully queued.
> + *
> + *		**-EBUSY** if work queue under nmi is full.
> + *
> + *		**-EINVAL** if *sig* is invalid.
> + *
> + *		**-EPERM** if no permission to send the *sig*.
> + *
> + *		**-EAGAIN** if bpf program can try again.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -2782,7 +2796,8 @@ union bpf_attr {
>  	FN(strtol),			\
>  	FN(strtoul),			\
>  	FN(sk_storage_get),		\
> -	FN(sk_storage_delete),
> +	FN(sk_storage_delete),		\
> +	FN(send_signal),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f92d6ad5e080..f8cd0db7289f 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>  	.arg3_type	= ARG_ANYTHING,
>  };
>  
> +struct send_signal_irq_work {
> +	struct irq_work irq_work;
> +	u32 sig;
> +};
> +
> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
> +
> +static void do_bpf_send_signal(struct irq_work *entry)
> +{
> +	struct send_signal_irq_work *work;
> +
> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
> +}
> +
> +BPF_CALL_1(bpf_send_signal, u32, sig)
> +{
> +	struct send_signal_irq_work *work = NULL;
> +
> +	/* Similar to bpf_probe_write_user, task needs to be
> +	 * in a sound condition and kernel memory access be
> +	 * permitted in order to send signal to the current
> +	 * task.
> +	 */
> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
> +		return -EPERM;
> +	if (unlikely(uaccess_kernel()))
> +		return -EPERM;
> +	if (unlikely(!nmi_uaccess_okay()))
> +		return -EPERM;
> +
> +	if (in_nmi()) {

Hm, a bit confused: can't this only be done out of process context in
general, since only there does current point to e.g. hhvm? I'm probably
missing something. Could you elaborate?

> +		work = this_cpu_ptr(&send_signal_work);
> +		if (work->irq_work.flags & IRQ_WORK_BUSY)
> +			return -EBUSY;
> +
> +		work->sig = sig;
> +		irq_work_queue(&work->irq_work);
> +		return 0;
> +	}
> +
> +	return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
> +

Nit: extra newline slipped in

> +}
> +
> +static const struct bpf_func_proto bpf_send_signal_proto = {
> +	.func		= bpf_send_signal,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_ANYTHING,
> +};
> +
>  static const struct bpf_func_proto *
>  tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  {
> @@ -617,6 +669,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  	case BPF_FUNC_get_current_cgroup_id:
>  		return &bpf_get_current_cgroup_id_proto;
>  #endif
> +	case BPF_FUNC_send_signal:
> +		return &bpf_send_signal_proto;
>  	default:
>  		return NULL;
>  	}
> @@ -1343,5 +1397,18 @@ static int __init bpf_event_init(void)
>  	return 0;
>  }
>  
> +static int __init send_signal_irq_work_init(void)
> +{
> +	int cpu;
> +	struct send_signal_irq_work *work;
> +
> +	for_each_possible_cpu(cpu) {
> +		work = per_cpu_ptr(&send_signal_work, cpu);
> +		init_irq_work(&work->irq_work, do_bpf_send_signal);
> +	}
> +	return 0;
> +}
> +
>  fs_initcall(bpf_event_init);
> +subsys_initcall(send_signal_irq_work_init);
>  #endif /* CONFIG_MODULES */
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-23 15:41   ` Daniel Borkmann
@ 2019-05-23 15:58     ` Yonghong Song
  2019-05-23 16:28       ` Daniel Borkmann
  0 siblings, 1 reply; 19+ messages in thread
From: Yonghong Song @ 2019-05-23 15:58 UTC (permalink / raw)
  To: Daniel Borkmann, bpf, netdev
  Cc: Alexei Starovoitov, Kernel Team, Peter Zijlstra



On 5/23/19 8:41 AM, Daniel Borkmann wrote:
> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>> This patch tries to solve the following specific use case.
>>
>> Currently, bpf program can already collect stack traces
>> through kernel function get_perf_callchain()
>> when certain events happens (e.g., cache miss counter or
>> cpu clock counter overflows). But such stack traces are
>> not enough for jitted programs, e.g., hhvm (jited php).
>> To get real stack trace, jit engine internal data structures
>> need to be traversed in order to get the real user functions.
>>
>> bpf program itself may not be the best place to traverse
>> the jit engine as the traversing logic could be complex and
>> it is not a stable interface either.
>>
>> Instead, hhvm implements a signal handler,
>> e.g. for SIGALARM, and a set of program locations which
>> it can dump stack traces. When it receives a signal, it will
>> dump the stack in next such program location.
>>
>> Such a mechanism can be implemented in the following way:
>>    . a perf ring buffer is created between bpf program
>>      and tracing app.
>>    . once a particular event happens, bpf program writes
>>      to the ring buffer and the tracing app gets notified.
>>    . the tracing app sends a signal SIGALARM to the hhvm.
>>
>> But this method could have large delays and causing profiling
>> results skewed.
>>
>> This patch implements bpf_send_signal() helper to send
>> a signal to hhvm in real time, resulting in intended stack traces.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/uapi/linux/bpf.h | 17 +++++++++-
>>   kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 83 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 63e0cf66f01a..68d4470523a0 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>    *		0 on success.
>>    *
>>    *		**-ENOENT** if the bpf-local-storage cannot be found.
>> + *
>> + * int bpf_send_signal(u32 sig)
>> + *	Description
>> + *		Send signal *sig* to the current task.
>> + *	Return
>> + *		0 on success or successfully queued.
>> + *
>> + *		**-EBUSY** if work queue under nmi is full.
>> + *
>> + *		**-EINVAL** if *sig* is invalid.
>> + *
>> + *		**-EPERM** if no permission to send the *sig*.
>> + *
>> + *		**-EAGAIN** if bpf program can try again.
>>    */
>>   #define __BPF_FUNC_MAPPER(FN)		\
>>   	FN(unspec),			\
>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>   	FN(strtol),			\
>>   	FN(strtoul),			\
>>   	FN(sk_storage_get),		\
>> -	FN(sk_storage_delete),
>> +	FN(sk_storage_delete),		\
>> +	FN(send_signal),
>>   
>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>    * function eBPF program intends to call
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index f92d6ad5e080..f8cd0db7289f 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>>   	.arg3_type	= ARG_ANYTHING,
>>   };
>>   
>> +struct send_signal_irq_work {
>> +	struct irq_work irq_work;
>> +	u32 sig;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>> +
>> +static void do_bpf_send_signal(struct irq_work *entry)
>> +{
>> +	struct send_signal_irq_work *work;
>> +
>> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
>> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>> +}
>> +
>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>> +{
>> +	struct send_signal_irq_work *work = NULL;
>> +
>> +	/* Similar to bpf_probe_write_user, task needs to be
>> +	 * in a sound condition and kernel memory access be
>> +	 * permitted in order to send signal to the current
>> +	 * task.
>> +	 */
>> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
>> +		return -EPERM;
>> +	if (unlikely(uaccess_kernel()))
>> +		return -EPERM;
>> +	if (unlikely(!nmi_uaccess_okay()))
>> +		return -EPERM;
>> +
>> +	if (in_nmi()) {
> 
> Hm, bit confused, can't this only be done out of process context in
> general since only there current points to e.g. hhvm? I'm probably
> missing something. Could you elaborate?

That is true. If we are in nmi, we are out of process context; in nmi
context we use an irq_work here since group_send_sig_info() takes a
spinlock internally. The bpf program (e.g., a perf_event program) needs
to check that it is running with the right current task (e.g., by pid)
before calling this helper.

Does this address your question?

> 
>> +		work = this_cpu_ptr(&send_signal_work);
>> +		if (work->irq_work.flags & IRQ_WORK_BUSY)
>> +			return -EBUSY;
>> +
>> +		work->sig = sig;
>> +		irq_work_queue(&work->irq_work);
>> +		return 0;
>> +	}
>> +
>> +	return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>> +
> 
> Nit: extra newline slipped in
Thanks. Will remove this in the next revision.
> 
>> +}
>> +
>> +static const struct bpf_func_proto bpf_send_signal_proto = {
>> +	.func		= bpf_send_signal,
>> +	.gpl_only	= false,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_ANYTHING,
>> +};
>> +
>>   static const struct bpf_func_proto *
>>   tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>   {
>> @@ -617,6 +669,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>   	case BPF_FUNC_get_current_cgroup_id:
>>   		return &bpf_get_current_cgroup_id_proto;
>>   #endif
>> +	case BPF_FUNC_send_signal:
>> +		return &bpf_send_signal_proto;
>>   	default:
>>   		return NULL;
>>   	}
>> @@ -1343,5 +1397,18 @@ static int __init bpf_event_init(void)
>>   	return 0;
>>   }
>>   
>> +static int __init send_signal_irq_work_init(void)
>> +{
>> +	int cpu;
>> +	struct send_signal_irq_work *work;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		work = per_cpu_ptr(&send_signal_work, cpu);
>> +		init_irq_work(&work->irq_work, do_bpf_send_signal);
>> +	}
>> +	return 0;
>> +}
>> +
>>   fs_initcall(bpf_event_init);
>> +subsys_initcall(send_signal_irq_work_init);
>>   #endif /* CONFIG_MODULES */
>>
> 


* Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-23 15:58     ` Yonghong Song
@ 2019-05-23 16:28       ` Daniel Borkmann
  2019-05-23 21:07         ` Yonghong Song
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2019-05-23 16:28 UTC (permalink / raw)
  To: Yonghong Song, bpf, netdev
  Cc: Alexei Starovoitov, Kernel Team, Peter Zijlstra

On 05/23/2019 05:58 PM, Yonghong Song wrote:
> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>> This patch tries to solve the following specific use case.
>>>
>>> Currently, bpf program can already collect stack traces
>>> through kernel function get_perf_callchain()
>>> when certain events happens (e.g., cache miss counter or
>>> cpu clock counter overflows). But such stack traces are
>>> not enough for jitted programs, e.g., hhvm (jited php).
>>> To get real stack trace, jit engine internal data structures
>>> need to be traversed in order to get the real user functions.
>>>
>>> bpf program itself may not be the best place to traverse
>>> the jit engine as the traversing logic could be complex and
>>> it is not a stable interface either.
>>>
>>> Instead, hhvm implements a signal handler,
>>> e.g. for SIGALARM, and a set of program locations which
>>> it can dump stack traces. When it receives a signal, it will
>>> dump the stack in next such program location.
>>>
>>> Such a mechanism can be implemented in the following way:
>>>    . a perf ring buffer is created between bpf program
>>>      and tracing app.
>>>    . once a particular event happens, bpf program writes
>>>      to the ring buffer and the tracing app gets notified.
>>>    . the tracing app sends a signal SIGALARM to the hhvm.
>>>
>>> But this method could have large delays and causing profiling
>>> results skewed.
>>>
>>> This patch implements bpf_send_signal() helper to send
>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>
>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>> ---
>>>   include/uapi/linux/bpf.h | 17 +++++++++-
>>>   kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
>>>   2 files changed, 83 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 63e0cf66f01a..68d4470523a0 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>    *		0 on success.
>>>    *
>>>    *		**-ENOENT** if the bpf-local-storage cannot be found.
>>> + *
>>> + * int bpf_send_signal(u32 sig)
>>> + *	Description
>>> + *		Send signal *sig* to the current task.
>>> + *	Return
>>> + *		0 on success or successfully queued.
>>> + *
>>> + *		**-EBUSY** if work queue under nmi is full.
>>> + *
>>> + *		**-EINVAL** if *sig* is invalid.
>>> + *
>>> + *		**-EPERM** if no permission to send the *sig*.
>>> + *
>>> + *		**-EAGAIN** if bpf program can try again.
>>>    */
>>>   #define __BPF_FUNC_MAPPER(FN)		\
>>>   	FN(unspec),			\
>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>   	FN(strtol),			\
>>>   	FN(strtoul),			\
>>>   	FN(sk_storage_get),		\
>>> -	FN(sk_storage_delete),
>>> +	FN(sk_storage_delete),		\
>>> +	FN(send_signal),
>>>   
>>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>    * function eBPF program intends to call
>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>> index f92d6ad5e080..f8cd0db7289f 100644
>>> --- a/kernel/trace/bpf_trace.c
>>> +++ b/kernel/trace/bpf_trace.c
>>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>>>   	.arg3_type	= ARG_ANYTHING,
>>>   };
>>>   
>>> +struct send_signal_irq_work {
>>> +	struct irq_work irq_work;
>>> +	u32 sig;
>>> +};
>>> +
>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>> +
>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>> +{
>>> +	struct send_signal_irq_work *work;
>>> +
>>> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
>>> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>> +}
>>> +
>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>> +{
>>> +	struct send_signal_irq_work *work = NULL;
>>> +
>>> +	/* Similar to bpf_probe_write_user, task needs to be
>>> +	 * in a sound condition and kernel memory access be
>>> +	 * permitted in order to send signal to the current
>>> +	 * task.
>>> +	 */
>>> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
>>> +		return -EPERM;
>>> +	if (unlikely(uaccess_kernel()))
>>> +		return -EPERM;
>>> +	if (unlikely(!nmi_uaccess_okay()))
>>> +		return -EPERM;
>>> +
>>> +	if (in_nmi()) {
>>
>> Hm, bit confused, can't this only be done out of process context in
>> general since only there current points to e.g. hhvm? I'm probably
>> missing something. Could you elaborate?
> 
> That is true. If in nmi, it is out of process context and in nmi 
> context, we use an irq_work here since group_send_sig_info() has
> spinlock inside. The bpf program (e.g., a perf_event program) needs to 
> check it is with right current (e.g., by pid) before calling
> this helper.
> 
> Does this address your question?

Hm, but how is it guaranteed that 'current' inside the callback is still
the very same task you intend to send the signal to?

What happens if you're in softirq and send SIGKILL to yourself? Is this
ignored/handled gracefully in such case?

I think a more elaborate comment in the code would definitely help.

Btw, you probably need to wrap it under #ifdef CONFIG_IRQ_WORK.

>>> +		work = this_cpu_ptr(&send_signal_work);
>>> +		if (work->irq_work.flags & IRQ_WORK_BUSY)
>>> +			return -EBUSY;
>>> +
>>> +		work->sig = sig;
>>> +		irq_work_queue(&work->irq_work);
>>> +		return 0;
>>> +	}
>>> +
>>> +	return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>> +
>>
>> Nit: extra newline slipped in
> Thanks. Will remove this in the next revision.
>>
>>> +}
>>> +
>>> +static const struct bpf_func_proto bpf_send_signal_proto = {
>>> +	.func		= bpf_send_signal,
>>> +	.gpl_only	= false,
>>> +	.ret_type	= RET_INTEGER,
>>> +	.arg1_type	= ARG_ANYTHING,
>>> +};
>>> +
>>>   static const struct bpf_func_proto *
>>>   tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>>   {
>>> @@ -617,6 +669,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>>   	case BPF_FUNC_get_current_cgroup_id:
>>>   		return &bpf_get_current_cgroup_id_proto;
>>>   #endif
>>> +	case BPF_FUNC_send_signal:
>>> +		return &bpf_send_signal_proto;
>>>   	default:
>>>   		return NULL;
>>>   	}
>>> @@ -1343,5 +1397,18 @@ static int __init bpf_event_init(void)
>>>   	return 0;
>>>   }
>>>   
>>> +static int __init send_signal_irq_work_init(void)
>>> +{
>>> +	int cpu;
>>> +	struct send_signal_irq_work *work;
>>> +
>>> +	for_each_possible_cpu(cpu) {
>>> +		work = per_cpu_ptr(&send_signal_work, cpu);
>>> +		init_irq_work(&work->irq_work, do_bpf_send_signal);
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>>   fs_initcall(bpf_event_init);
>>> +subsys_initcall(send_signal_irq_work_init);
>>>   #endif /* CONFIG_MODULES */
>>>
>>



* Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-23 16:28       ` Daniel Borkmann
@ 2019-05-23 21:07         ` Yonghong Song
  2019-05-23 21:30           ` Yonghong Song
  0 siblings, 1 reply; 19+ messages in thread
From: Yonghong Song @ 2019-05-23 21:07 UTC (permalink / raw)
  To: Daniel Borkmann, bpf, netdev
  Cc: Alexei Starovoitov, Kernel Team, Peter Zijlstra



On 5/23/19 9:28 AM, Daniel Borkmann wrote:
> On 05/23/2019 05:58 PM, Yonghong Song wrote:
>> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>>> This patch tries to solve the following specific use case.
>>>>
>>>> Currently, bpf program can already collect stack traces
>>>> through kernel function get_perf_callchain()
>>>> when certain events happens (e.g., cache miss counter or
>>>> cpu clock counter overflows). But such stack traces are
>>>> not enough for jitted programs, e.g., hhvm (jited php).
>>>> To get real stack trace, jit engine internal data structures
>>>> need to be traversed in order to get the real user functions.
>>>>
>>>> bpf program itself may not be the best place to traverse
>>>> the jit engine as the traversing logic could be complex and
>>>> it is not a stable interface either.
>>>>
>>>> Instead, hhvm implements a signal handler,
>>>> e.g. for SIGALARM, and a set of program locations which
>>>> it can dump stack traces. When it receives a signal, it will
>>>> dump the stack in next such program location.
>>>>
>>>> Such a mechanism can be implemented in the following way:
>>>>     . a perf ring buffer is created between bpf program
>>>>       and tracing app.
>>>>     . once a particular event happens, bpf program writes
>>>>       to the ring buffer and the tracing app gets notified.
>>>>     . the tracing app sends a signal SIGALARM to the hhvm.
>>>>
>>>> But this method could have large delays and causing profiling
>>>> results skewed.
>>>>
>>>> This patch implements bpf_send_signal() helper to send
>>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>>
>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>> ---
>>>>    include/uapi/linux/bpf.h | 17 +++++++++-
>>>>    kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 83 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>     *		0 on success.
>>>>     *
>>>>     *		**-ENOENT** if the bpf-local-storage cannot be found.
>>>> + *
>>>> + * int bpf_send_signal(u32 sig)
>>>> + *	Description
>>>> + *		Send signal *sig* to the current task.
>>>> + *	Return
>>>> + *		0 on success or successfully queued.
>>>> + *
>>>> + *		**-EBUSY** if work queue under nmi is full.
>>>> + *
>>>> + *		**-EINVAL** if *sig* is invalid.
>>>> + *
>>>> + *		**-EPERM** if no permission to send the *sig*.
>>>> + *
>>>> + *		**-EAGAIN** if bpf program can try again.
>>>>     */
>>>>    #define __BPF_FUNC_MAPPER(FN)		\
>>>>    	FN(unspec),			\
>>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>>    	FN(strtol),			\
>>>>    	FN(strtoul),			\
>>>>    	FN(sk_storage_get),		\
>>>> -	FN(sk_storage_delete),
>>>> +	FN(sk_storage_delete),		\
>>>> +	FN(send_signal),
>>>>    
>>>>    /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>>     * function eBPF program intends to call
>>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>>> index f92d6ad5e080..f8cd0db7289f 100644
>>>> --- a/kernel/trace/bpf_trace.c
>>>> +++ b/kernel/trace/bpf_trace.c
>>>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>>>>    	.arg3_type	= ARG_ANYTHING,
>>>>    };
>>>>    
>>>> +struct send_signal_irq_work {
>>>> +	struct irq_work irq_work;
>>>> +	u32 sig;
>>>> +};
>>>> +
>>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>>> +
>>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>>> +{
>>>> +	struct send_signal_irq_work *work;
>>>> +
>>>> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
>>>> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>>> +}
>>>> +
>>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>>> +{
>>>> +	struct send_signal_irq_work *work = NULL;
>>>> +
>>>> +	/* Similar to bpf_probe_write_user, task needs to be
>>>> +	 * in a sound condition and kernel memory access be
>>>> +	 * permitted in order to send signal to the current
>>>> +	 * task.
>>>> +	 */
>>>> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
>>>> +		return -EPERM;
>>>> +	if (unlikely(uaccess_kernel()))
>>>> +		return -EPERM;
>>>> +	if (unlikely(!nmi_uaccess_okay()))
>>>> +		return -EPERM;
>>>> +
>>>> +	if (in_nmi()) {
>>>
>>> Hm, bit confused, can't this only be done out of process context in
>>> general since only there current points to e.g. hhvm? I'm probably
>>> missing something. Could you elaborate?
>>
>> That is true. If in nmi, it is out of process context and in nmi
>> context, we use an irq_work here since group_send_sig_info() has
>> spinlock inside. The bpf program (e.g., a perf_event program) needs to
>> check it is with right current (e.g., by pid) before calling
>> this helper.
>>
>> Does this address your question?

Thanks, Daniel. These are really good questions which I did not
fully think through when using irq_work.

> 
> Hm, but how is it guaranteed that 'current' inside the callback is still
> the very same you intend to send the signal to?

I went through the irq_work infrastructure. It looks like "current" may
change. irq_work is registered as an interrupt on x86. After the nmi,
some lower-priority interrupts get a chance to execute, including
irq_work, but other interrupts such as timer_interrupt and
reschedule_interrupt may change "current". Since we are still in
interrupt context, though, the task should not be destroyed, so the
task structure pointer is still valid.

I will pass the "current" task struct pointer to irq_work as well. This
is similar to what we did in stackmap.c:
   work->sem = &current->mm->mmap_sem;
   irq_work_queue(&work->irq_work);
At irq_work_run() time, the previous "current" from the nmi should
still be valid.

> 
> What happens if you're in softirq and send SIGKILL to yourself? Is this
> ignored/handled gracefully in such case?

It is not ignored; it is handled gracefully in this case. I tried my
example with SIGKILL. The call stack looks like the following:

[   24.211943]  bpf_send_signal+0x9/0xb0
[   24.212427]  bpf_prog_fec6e7cc664d5b91_bpf_send_signal_test+0x228/0x1000
[   24.213249]  ? bpf_overflow_handler+0xb7/0x180
[   24.213853]  ? __perf_event_overflow+0x51/0xe0
[   24.214385]  ? perf_swevent_hrtimer+0x10a/0x160
[   24.214965]  ? __update_load_avg_cfs_rq+0x1a9/0x1c0
[   24.215609]  ? task_tick_fair+0x50/0x690
[   24.216104]  ? run_timer_softirq+0x208/0x490
[   24.216637]  ? timerqueue_del+0x1e/0x40
[   24.217111]  ? task_clock_event_del+0x10/0x10
[   24.217658]  ? __hrtimer_run_queues+0x10d/0x2c0
[   24.218217]  ? hrtimer_interrupt+0x122/0x270
[   24.218765]  ? rcu_irq_enter+0x31/0x110
[   24.219223]  ? smp_apic_timer_interrupt+0x67/0x160
[   24.219842]  ? apic_timer_interrupt+0xf/0x20
[   24.220383]  </IRQ>
[   24.220655]  ? event_sched_out.isra.108+0x150/0x150
[   24.221271]  ? smp_call_function_single+0xdc/0x100
[   24.221898]  ? perf_event_sysfs_show+0x20/0x20
[   24.222469]  ? cpu_function_call+0x42/0x60
[   24.222982]  ? cpu_clock_event_read+0x10/0x10
[   24.223525]  ? event_function_call+0xe6/0xf0
[   24.224053]  ? event_sched_out.isra.108+0x150/0x150
[   24.224657]  ? perf_remove_from_context+0x20/0x70
[   24.225262]  ? perf_event_release_kernel+0x106/0x2e0
[   24.225896]  ? perf_release+0xc/0x10
[   24.226347]  ? __fput+0xc9/0x230
[   24.226767]  ? task_work_run+0x83/0xb0
[   24.227243]  ? do_exit+0x300/0xc50
[   24.227674]  ? syscall_trace_enter+0x1c9/0x2d0
[   24.228223]  ? do_group_exit+0x39/0xb0
[   24.228695]  ? __x64_sys_exit_group+0x14/0x20
[   24.229270]  ? do_syscall_64+0x49/0x130
[   24.229762]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

I see the task is killed, other processes are not impacted, and there
is no kernel crash/warning.

> 
> I think some more elaborate comment in the code would definitely be help.

Definitely will add some comments.

> 
> Btw, you probably need to wrap it under #ifdef CONFIG_IRQ_WORK.

I will check this. stackmap.c uses irq_work as well and does not guard
it with CONFIG_IRQ_WORK. Maybe we are missing it there as well.

> 
>>>> +		work = this_cpu_ptr(&send_signal_work);
>>>> +		if (work->irq_work.flags & IRQ_WORK_BUSY)
>>>> +			return -EBUSY;
>>>> +
>>>> +		work->sig = sig;
>>>> +		irq_work_queue(&work->irq_work);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>>> +
>>>
>>> Nit: extra newline slipped in
>> Thanks. Will remove this in the next revision.
>>>
>>>> +}
>>>> +
>>>> +static const struct bpf_func_proto bpf_send_signal_proto = {
>>>> +	.func		= bpf_send_signal,
>>>> +	.gpl_only	= false,
>>>> +	.ret_type	= RET_INTEGER,
>>>> +	.arg1_type	= ARG_ANYTHING,
>>>> +};
>>>> +
>>>>    static const struct bpf_func_proto *
>>>>    tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>>>    {
>>>> @@ -617,6 +669,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>>>    	case BPF_FUNC_get_current_cgroup_id:
>>>>    		return &bpf_get_current_cgroup_id_proto;
>>>>    #endif
>>>> +	case BPF_FUNC_send_signal:
>>>> +		return &bpf_send_signal_proto;
>>>>    	default:
>>>>    		return NULL;
>>>>    	}
>>>> @@ -1343,5 +1397,18 @@ static int __init bpf_event_init(void)
>>>>    	return 0;
>>>>    }
>>>>    
>>>> +static int __init send_signal_irq_work_init(void)
>>>> +{
>>>> +	int cpu;
>>>> +	struct send_signal_irq_work *work;
>>>> +
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		work = per_cpu_ptr(&send_signal_work, cpu);
>>>> +		init_irq_work(&work->irq_work, do_bpf_send_signal);
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>>    fs_initcall(bpf_event_init);
>>>> +subsys_initcall(send_signal_irq_work_init);
>>>>    #endif /* CONFIG_MODULES */
>>>>
>>>
> 


* Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-23 21:07         ` Yonghong Song
@ 2019-05-23 21:30           ` Yonghong Song
  2019-05-23 23:08             ` Daniel Borkmann
  0 siblings, 1 reply; 19+ messages in thread
From: Yonghong Song @ 2019-05-23 21:30 UTC (permalink / raw)
  To: Daniel Borkmann, bpf, netdev
  Cc: Alexei Starovoitov, Kernel Team, Peter Zijlstra



On 5/23/19 2:07 PM, Yonghong Song wrote:
> 
> 
> On 5/23/19 9:28 AM, Daniel Borkmann wrote:
>> On 05/23/2019 05:58 PM, Yonghong Song wrote:
>>> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>>>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>>>> This patch tries to solve the following specific use case.
>>>>>
>>>>> Currently, bpf program can already collect stack traces
>>>>> through kernel function get_perf_callchain()
>>>>> when certain events happens (e.g., cache miss counter or
>>>>> cpu clock counter overflows). But such stack traces are
>>>>> not enough for jitted programs, e.g., hhvm (jited php).
>>>>> To get real stack trace, jit engine internal data structures
>>>>> need to be traversed in order to get the real user functions.
>>>>>
>>>>> bpf program itself may not be the best place to traverse
>>>>> the jit engine as the traversing logic could be complex and
>>>>> it is not a stable interface either.
>>>>>
>>>>> Instead, hhvm implements a signal handler,
>>>>> e.g. for SIGALARM, and a set of program locations which
>>>>> it can dump stack traces. When it receives a signal, it will
>>>>> dump the stack in next such program location.
>>>>>
>>>>> Such a mechanism can be implemented in the following way:
>>>>>      . a perf ring buffer is created between bpf program
>>>>>        and tracing app.
>>>>>      . once a particular event happens, bpf program writes
>>>>>        to the ring buffer and the tracing app gets notified.
>>>>>      . the tracing app sends a signal SIGALARM to the hhvm.
>>>>>
>>>>> But this method could have large delays and causing profiling
>>>>> results skewed.
>>>>>
>>>>> This patch implements bpf_send_signal() helper to send
>>>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>>>
>>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>> ---
>>>>>     include/uapi/linux/bpf.h | 17 +++++++++-
>>>>>     kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
>>>>>     2 files changed, 83 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>>> --- a/include/uapi/linux/bpf.h
>>>>> +++ b/include/uapi/linux/bpf.h
>>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>>      *		0 on success.
>>>>>      *
>>>>>      *		**-ENOENT** if the bpf-local-storage cannot be found.
>>>>> + *
>>>>> + * int bpf_send_signal(u32 sig)
>>>>> + *	Description
>>>>> + *		Send signal *sig* to the current task.
>>>>> + *	Return
>>>>> + *		0 on success or successfully queued.
>>>>> + *
>>>>> + *		**-EBUSY** if work queue under nmi is full.
>>>>> + *
>>>>> + *		**-EINVAL** if *sig* is invalid.
>>>>> + *
>>>>> + *		**-EPERM** if no permission to send the *sig*.
>>>>> + *
>>>>> + *		**-EAGAIN** if bpf program can try again.
>>>>>      */
>>>>>     #define __BPF_FUNC_MAPPER(FN)		\
>>>>>     	FN(unspec),			\
>>>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>>>     	FN(strtol),			\
>>>>>     	FN(strtoul),			\
>>>>>     	FN(sk_storage_get),		\
>>>>> -	FN(sk_storage_delete),
>>>>> +	FN(sk_storage_delete),		\
>>>>> +	FN(send_signal),
>>>>>     
>>>>>     /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>>>      * function eBPF program intends to call
>>>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>>>> index f92d6ad5e080..f8cd0db7289f 100644
>>>>> --- a/kernel/trace/bpf_trace.c
>>>>> +++ b/kernel/trace/bpf_trace.c
>>>>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>>>>>     	.arg3_type	= ARG_ANYTHING,
>>>>>     };
>>>>>     
>>>>> +struct send_signal_irq_work {
>>>>> +	struct irq_work irq_work;
>>>>> +	u32 sig;
>>>>> +};
>>>>> +
>>>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>>>> +
>>>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>>>> +{
>>>>> +	struct send_signal_irq_work *work;
>>>>> +
>>>>> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
>>>>> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>>>> +}
>>>>> +
>>>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>>>> +{
>>>>> +	struct send_signal_irq_work *work = NULL;
>>>>> +
>>>>> +	/* Similar to bpf_probe_write_user, task needs to be
>>>>> +	 * in a sound condition and kernel memory access be
>>>>> +	 * permitted in order to send signal to the current
>>>>> +	 * task.
>>>>> +	 */
>>>>> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
>>>>> +		return -EPERM;
>>>>> +	if (unlikely(uaccess_kernel()))
>>>>> +		return -EPERM;
>>>>> +	if (unlikely(!nmi_uaccess_okay()))
>>>>> +		return -EPERM;
>>>>> +
>>>>> +	if (in_nmi()) {
>>>>
>>>> Hm, bit confused, can't this only be done out of process context in
>>>> general since only there current points to e.g. hhvm? I'm probably
>>>> missing something. Could you elaborate?
>>>
>>> That is true. If in nmi, it is out of process context and in nmi
>>> context, we use an irq_work here since group_send_sig_info() has
>>> spinlock inside. The bpf program (e.g., a perf_event program) needs to
>>> check it is with right current (e.g., by pid) before calling
>>> this helper.
>>>
>>> Does this address your question?
> 
> Thanks, Daniel. The below are really good questions which I did not
> really think through with irq_work.
> 
>>
>> Hm, but how is it guaranteed that 'current' inside the callback is still
>> the very same you intend to send the signal to?
> 
> I went through irq_work infrastructure. It looks that "current" may
> change. irq_work is registered as an interrupt on x86.
> After nmi, some lower priority
> interrupts get chances to execute including irq_work. But there are some
> other interrupts like timer_interrupt and reschedule_interrupt may
> change "current". But since we are still in interrupt context, task
> should not be destroyed, so the task structure pointer is still valid.
> 
> I will pass "current" task struct pointer to irq_work as well. This
> is similar to what we did in stackmap.c:
>     work->sem = &current->mm->mmap_sem;
>     irq_work_queue(&work->irq_work);
> At irq_work_run() time, the previous "current" in nmi should still be
> valid.
> 
>>
>> What happens if you're in softirq and send SIGKILL to yourself? Is this
>> ignored/handled gracefully in such case?
> 
> It is not ignored. It handled gracefully in this case. I tried my
> example to send SIGKILL. The call stack looks below.
> 
> [   24.211943]  bpf_send_signal+0x9/0xb0
> [   24.212427]  bpf_prog_fec6e7cc664d5b91_bpf_send_signal_test+0x228/0x1000
> [   24.213249]  ? bpf_overflow_handler+0xb7/0x180
> [   24.213853]  ? __perf_event_overflow+0x51/0xe0
> [   24.214385]  ? perf_swevent_hrtimer+0x10a/0x160
> [   24.214965]  ? __update_load_avg_cfs_rq+0x1a9/0x1c0
> [   24.215609]  ? task_tick_fair+0x50/0x690
> [   24.216104]  ? run_timer_softirq+0x208/0x490
> [   24.216637]  ? timerqueue_del+0x1e/0x40
> [   24.217111]  ? task_clock_event_del+0x10/0x10
> [   24.217658]  ? __hrtimer_run_queues+0x10d/0x2c0
> [   24.218217]  ? hrtimer_interrupt+0x122/0x270
> [   24.218765]  ? rcu_irq_enter+0x31/0x110
> [   24.219223]  ? smp_apic_timer_interrupt+0x67/0x160
> [   24.219842]  ? apic_timer_interrupt+0xf/0x20
> [   24.220383]  </IRQ>
> [   24.220655]  ? event_sched_out.isra.108+0x150/0x150
> [   24.221271]  ? smp_call_function_single+0xdc/0x100
> [   24.221898]  ? perf_event_sysfs_show+0x20/0x20
> [   24.222469]  ? cpu_function_call+0x42/0x60
> [   24.222982]  ? cpu_clock_event_read+0x10/0x10
> [   24.223525]  ? event_function_call+0xe6/0xf0
> [   24.224053]  ? event_sched_out.isra.108+0x150/0x150
> [   24.224657]  ? perf_remove_from_context+0x20/0x70
> [   24.225262]  ? perf_event_release_kernel+0x106/0x2e0
> [   24.225896]  ? perf_release+0xc/0x10
> [   24.226347]  ? __fput+0xc9/0x230
> [   24.226767]  ? task_work_run+0x83/0xb0
> [   24.227243]  ? do_exit+0x300/0xc50
> [   24.227674]  ? syscall_trace_enter+0x1c9/0x2d0
> [   24.228223]  ? do_group_exit+0x39/0xb0
> [   24.228695]  ? __x64_sys_exit_group+0x14/0x20
> [   24.229270]  ? do_syscall_64+0x49/0x130
> [   24.229762]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> I see the task is killed, other processes are not impacted, and there
> is no kernel crash/warning.
> 
>>
>> I think a more elaborate comment in the code would definitely help.
> 
> Definitely will add some comments.
> 
>>
>> Btw, you probably need to wrap it under #ifdef CONFIG_IRQ_WORK.
> 
> I will check this. stackmap.c uses irq_work as well and does not have
> CONFIG_IRQ_WORK. Maybe we are missing it there as well.

Looks like we do not need CONFIG_IRQ_WORK.

We have:
obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o

config BPF_EVENTS
         depends on BPF_SYSCALL
         depends on (KPROBE_EVENTS || UPROBE_EVENTS) && PERF_EVENTS

config PERF_EVENTS
         bool "Kernel performance events and counters"
         default y if PROFILING
         depends on HAVE_PERF_EVENTS
         select IRQ_WORK
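For reference, the fix discussed earlier in the thread -- stashing the
nmi-time "current" task pointer in the per-cpu irq_work so the callback
does not depend on whatever "current" happens to be at irq_work_run()
time -- could look roughly like the sketch below. This is a hedged,
non-compilable sketch, not the posted patch; the `task` field name is
an assumption.

```
/* Sketch only: carry the task captured at queue time. */
struct send_signal_irq_work {
	struct irq_work irq_work;
	struct task_struct *task;	/* "current" at bpf_send_signal() time */
	u32 sig;
};

static void do_bpf_send_signal(struct irq_work *entry)
{
	struct send_signal_irq_work *work;

	work = container_of(entry, struct send_signal_irq_work, irq_work);
	/* Use the stashed task, not current, which may have changed. */
	group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, PIDTYPE_TGID);
}

/* ... and in bpf_send_signal(), on the in_nmi() path: */
	work = this_cpu_ptr(&send_signal_work);
	if (work->irq_work.flags & IRQ_WORK_BUSY)
		return -EBUSY;
	work->task = current;
	work->sig = sig;
	irq_work_queue(&work->irq_work);
```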
> 
>>
>>>>> +		work = this_cpu_ptr(&send_signal_work);
>>>>> +		if (work->irq_work.flags & IRQ_WORK_BUSY)
>>>>> +			return -EBUSY;
>>>>> +
>>>>> +		work->sig = sig;
>>>>> +		irq_work_queue(&work->irq_work);
>>>>> +		return 0;
>>>>> +	}
>>>>> +
>>>>> +	return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>>>> +
>>>>
>>>> Nit: extra newline slipped in
>>> Thanks. Will remove this in the next revision.
>>>>
>>>>> +}
>>>>> +
>>>>> +static const struct bpf_func_proto bpf_send_signal_proto = {
>>>>> +	.func		= bpf_send_signal,
>>>>> +	.gpl_only	= false,
>>>>> +	.ret_type	= RET_INTEGER,
>>>>> +	.arg1_type	= ARG_ANYTHING,
>>>>> +};
>>>>> +
>>>>>     static const struct bpf_func_proto *
>>>>>     tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>>>>     {
>>>>> @@ -617,6 +669,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>>>>     	case BPF_FUNC_get_current_cgroup_id:
>>>>>     		return &bpf_get_current_cgroup_id_proto;
>>>>>     #endif
>>>>> +	case BPF_FUNC_send_signal:
>>>>> +		return &bpf_send_signal_proto;
>>>>>     	default:
>>>>>     		return NULL;
>>>>>     	}
>>>>> @@ -1343,5 +1397,18 @@ static int __init bpf_event_init(void)
>>>>>     	return 0;
>>>>>     }
>>>>>     
>>>>> +static int __init send_signal_irq_work_init(void)
>>>>> +{
>>>>> +	int cpu;
>>>>> +	struct send_signal_irq_work *work;
>>>>> +
>>>>> +	for_each_possible_cpu(cpu) {
>>>>> +		work = per_cpu_ptr(&send_signal_work, cpu);
>>>>> +		init_irq_work(&work->irq_work, do_bpf_send_signal);
>>>>> +	}
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>>     fs_initcall(bpf_event_init);
>>>>> +subsys_initcall(send_signal_irq_work_init);
>>>>>     #endif /* CONFIG_MODULES */
>>>>>
>>>>
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-23 21:30           ` Yonghong Song
@ 2019-05-23 23:08             ` Daniel Borkmann
  2019-05-23 23:54               ` Yonghong Song
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2019-05-23 23:08 UTC (permalink / raw)
  To: Yonghong Song, bpf, netdev
  Cc: Alexei Starovoitov, Kernel Team, Peter Zijlstra

On 05/23/2019 11:30 PM, Yonghong Song wrote:
> On 5/23/19 2:07 PM, Yonghong Song wrote:
>> On 5/23/19 9:28 AM, Daniel Borkmann wrote:
>>> On 05/23/2019 05:58 PM, Yonghong Song wrote:
>>>> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>>>>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>>>>> This patch tries to solve the following specific use case.
>>>>>>
>>>>>> Currently, bpf program can already collect stack traces
>>>>>> through kernel function get_perf_callchain()
>>>>>> when certain events happens (e.g., cache miss counter or
>>>>>> cpu clock counter overflows). But such stack traces are
>>>>>> not enough for jitted programs, e.g., hhvm (jited php).
>>>>>> To get real stack trace, jit engine internal data structures
>>>>>> need to be traversed in order to get the real user functions.
>>>>>>
>>>>>> bpf program itself may not be the best place to traverse
>>>>>> the jit engine as the traversing logic could be complex and
>>>>>> it is not a stable interface either.
>>>>>>
>>>>>> Instead, hhvm implements a signal handler,
>>>>>> e.g. for SIGALARM, and a set of program locations which
>>>>>> it can dump stack traces. When it receives a signal, it will
>>>>>> dump the stack in next such program location.
>>>>>>
>>>>>> Such a mechanism can be implemented in the following way:
>>>>>>      . a perf ring buffer is created between bpf program
>>>>>>        and tracing app.
>>>>>>      . once a particular event happens, bpf program writes
>>>>>>        to the ring buffer and the tracing app gets notified.
>>>>>>      . the tracing app sends a signal SIGALARM to the hhvm.
>>>>>>
>>>>>> But this method could have large delays and causing profiling
>>>>>> results skewed.
>>>>>>
>>>>>> This patch implements bpf_send_signal() helper to send
>>>>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>>>>
>>>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>>> ---
>>>>>>     include/uapi/linux/bpf.h | 17 +++++++++-
>>>>>>     kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
>>>>>>     2 files changed, 83 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>>>      *		0 on success.
>>>>>>      *
>>>>>>      *		**-ENOENT** if the bpf-local-storage cannot be found.
>>>>>> + *
>>>>>> + * int bpf_send_signal(u32 sig)
>>>>>> + *	Description
>>>>>> + *		Send signal *sig* to the current task.
>>>>>> + *	Return
>>>>>> + *		0 on success or successfully queued.
>>>>>> + *
>>>>>> + *		**-EBUSY** if work queue under nmi is full.
>>>>>> + *
>>>>>> + *		**-EINVAL** if *sig* is invalid.
>>>>>> + *
>>>>>> + *		**-EPERM** if no permission to send the *sig*.
>>>>>> + *
>>>>>> + *		**-EAGAIN** if bpf program can try again.
>>>>>>      */
>>>>>>     #define __BPF_FUNC_MAPPER(FN)		\
>>>>>>     	FN(unspec),			\
>>>>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>>>>     	FN(strtol),			\
>>>>>>     	FN(strtoul),			\
>>>>>>     	FN(sk_storage_get),		\
>>>>>> -	FN(sk_storage_delete),
>>>>>> +	FN(sk_storage_delete),		\
>>>>>> +	FN(send_signal),
>>>>>>     
>>>>>>     /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>>>>      * function eBPF program intends to call
>>>>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>>>>> index f92d6ad5e080..f8cd0db7289f 100644
>>>>>> --- a/kernel/trace/bpf_trace.c
>>>>>> +++ b/kernel/trace/bpf_trace.c
>>>>>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>>>>>>     	.arg3_type	= ARG_ANYTHING,
>>>>>>     };
>>>>>>     
>>>>>> +struct send_signal_irq_work {
>>>>>> +	struct irq_work irq_work;
>>>>>> +	u32 sig;
>>>>>> +};
>>>>>> +
>>>>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>>>>> +
>>>>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>>>>> +{
>>>>>> +	struct send_signal_irq_work *work;
>>>>>> +
>>>>>> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
>>>>>> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>>>>> +}
>>>>>> +
>>>>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>>>>> +{
>>>>>> +	struct send_signal_irq_work *work = NULL;
>>>>>> +
>>>>>> +	/* Similar to bpf_probe_write_user, task needs to be
>>>>>> +	 * in a sound condition and kernel memory access be
>>>>>> +	 * permitted in order to send signal to the current
>>>>>> +	 * task.
>>>>>> +	 */
>>>>>> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
>>>>>> +		return -EPERM;
>>>>>> +	if (unlikely(uaccess_kernel()))
>>>>>> +		return -EPERM;
>>>>>> +	if (unlikely(!nmi_uaccess_okay()))
>>>>>> +		return -EPERM;
>>>>>> +
>>>>>> +	if (in_nmi()) {
>>>>>
>>>>> Hm, bit confused, can't this only be done out of process context in
>>>>> general since only there current points to e.g. hhvm? I'm probably
>>>>> missing something. Could you elaborate?
>>>>
>>>> That is true. If in nmi, we are out of process context, and in nmi
>>>> context we use an irq_work here since group_send_sig_info() has a
>>>> spinlock inside. The bpf program (e.g., a perf_event program) needs
>>>> to check that it is running with the right current (e.g., by pid)
>>>> before calling this helper.
>>>>
>>>> Does this address your question?
>>
>> Thanks, Daniel. The questions below are really good ones, which I did
>> not really think through with irq_work.
>>
>>> Hm, but how is it guaranteed that 'current' inside the callback is still
>>> the very same you intend to send the signal to?
>>
>> I went through the irq_work infrastructure. It looks like "current"
>> may change. irq_work is registered as an interrupt on x86. After the
>> nmi, some lower priority interrupts get a chance to execute, including
>> irq_work. But some other interrupts, like timer_interrupt and
>> reschedule_interrupt, may change "current". Since we are still in
>> interrupt context, the task should not be destroyed, so the task
>> structure pointer is still valid.
>>
>> I will pass the "current" task struct pointer to irq_work as well.
>> This is similar to what we did in stackmap.c:
>>     work->sem = &current->mm->mmap_sem;
>>     irq_work_queue(&work->irq_work);
>> At irq_work_run() time, the previous "current" from the nmi should
>> still be valid.
>>
>>> What happens if you're in softirq and send SIGKILL to yourself? Is this
>>> ignored/handled gracefully in such case?
>>
>> It is not ignored. It is handled gracefully in this case. I tried my
>> example sending SIGKILL. The call stack is shown below.
>>
>> [   24.211943]  bpf_send_signal+0x9/0xb0
>> [   24.212427]  bpf_prog_fec6e7cc664d5b91_bpf_send_signal_test+0x228/0x1000
>> [   24.213249]  ? bpf_overflow_handler+0xb7/0x180
>> [   24.213853]  ? __perf_event_overflow+0x51/0xe0
>> [   24.214385]  ? perf_swevent_hrtimer+0x10a/0x160
>> [   24.214965]  ? __update_load_avg_cfs_rq+0x1a9/0x1c0
>> [   24.215609]  ? task_tick_fair+0x50/0x690
>> [   24.216104]  ? run_timer_softirq+0x208/0x490
>> [   24.216637]  ? timerqueue_del+0x1e/0x40
>> [   24.217111]  ? task_clock_event_del+0x10/0x10
>> [   24.217658]  ? __hrtimer_run_queues+0x10d/0x2c0
>> [   24.218217]  ? hrtimer_interrupt+0x122/0x270
>> [   24.218765]  ? rcu_irq_enter+0x31/0x110
>> [   24.219223]  ? smp_apic_timer_interrupt+0x67/0x160
>> [   24.219842]  ? apic_timer_interrupt+0xf/0x20
>> [   24.220383]  </IRQ>
>> [   24.220655]  ? event_sched_out.isra.108+0x150/0x150
>> [   24.221271]  ? smp_call_function_single+0xdc/0x100
>> [   24.221898]  ? perf_event_sysfs_show+0x20/0x20
>> [   24.222469]  ? cpu_function_call+0x42/0x60
>> [   24.222982]  ? cpu_clock_event_read+0x10/0x10
>> [   24.223525]  ? event_function_call+0xe6/0xf0
>> [   24.224053]  ? event_sched_out.isra.108+0x150/0x150
>> [   24.224657]  ? perf_remove_from_context+0x20/0x70
>> [   24.225262]  ? perf_event_release_kernel+0x106/0x2e0
>> [   24.225896]  ? perf_release+0xc/0x10
>> [   24.226347]  ? __fput+0xc9/0x230
>> [   24.226767]  ? task_work_run+0x83/0xb0
>> [   24.227243]  ? do_exit+0x300/0xc50
>> [   24.227674]  ? syscall_trace_enter+0x1c9/0x2d0
>> [   24.228223]  ? do_group_exit+0x39/0xb0
>> [   24.228695]  ? __x64_sys_exit_group+0x14/0x20
>> [   24.229270]  ? do_syscall_64+0x49/0x130
>> [   24.229762]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> I see the task is killed, other processes are not impacted, and there
>> is no kernel crash/warning.

Hm, but I rather meant the case where we're in_serving_softirq(), e.g.
when processing packets on rx, and you send a signal to yourself.
Shouldn't we bail out from the helper in such a situation as well?
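A minimal form of the bail-out suggested here would be one more early
check in bpf_send_signal(), next to the existing PF_KTHREAD/uaccess
guards. This is a hypothetical sketch, not part of the posted patch,
and the errno choice is an assumption:

```
	/* Hypothetical extra guard: while serving a softirq, "current"
	 * may be an unrelated task that merely got interrupted, so
	 * refuse to signal it.
	 */
	if (unlikely(in_serving_softirq()))
		return -EPERM;
```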

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-23 23:08             ` Daniel Borkmann
@ 2019-05-23 23:54               ` Yonghong Song
  2019-05-24 21:32                 ` Daniel Borkmann
  0 siblings, 1 reply; 19+ messages in thread
From: Yonghong Song @ 2019-05-23 23:54 UTC (permalink / raw)
  To: Daniel Borkmann, bpf, netdev
  Cc: Alexei Starovoitov, Kernel Team, Peter Zijlstra



On 5/23/19 4:08 PM, Daniel Borkmann wrote:
> On 05/23/2019 11:30 PM, Yonghong Song wrote:
>> On 5/23/19 2:07 PM, Yonghong Song wrote:
>>> On 5/23/19 9:28 AM, Daniel Borkmann wrote:
>>>> On 05/23/2019 05:58 PM, Yonghong Song wrote:
>>>>> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>>>>>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>>>>>> This patch tries to solve the following specific use case.
>>>>>>>
>>>>>>> Currently, bpf program can already collect stack traces
>>>>>>> through kernel function get_perf_callchain()
>>>>>>> when certain events happens (e.g., cache miss counter or
>>>>>>> cpu clock counter overflows). But such stack traces are
>>>>>>> not enough for jitted programs, e.g., hhvm (jited php).
>>>>>>> To get real stack trace, jit engine internal data structures
>>>>>>> need to be traversed in order to get the real user functions.
>>>>>>>
>>>>>>> bpf program itself may not be the best place to traverse
>>>>>>> the jit engine as the traversing logic could be complex and
>>>>>>> it is not a stable interface either.
>>>>>>>
>>>>>>> Instead, hhvm implements a signal handler,
>>>>>>> e.g. for SIGALARM, and a set of program locations which
>>>>>>> it can dump stack traces. When it receives a signal, it will
>>>>>>> dump the stack in next such program location.
>>>>>>>
>>>>>>> Such a mechanism can be implemented in the following way:
>>>>>>>       . a perf ring buffer is created between bpf program
>>>>>>>         and tracing app.
>>>>>>>       . once a particular event happens, bpf program writes
>>>>>>>         to the ring buffer and the tracing app gets notified.
>>>>>>>       . the tracing app sends a signal SIGALARM to the hhvm.
>>>>>>>
>>>>>>> But this method could have large delays and causing profiling
>>>>>>> results skewed.
>>>>>>>
>>>>>>> This patch implements bpf_send_signal() helper to send
>>>>>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>>>>>
>>>>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>>>> ---
>>>>>>>      include/uapi/linux/bpf.h | 17 +++++++++-
>>>>>>>      kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
>>>>>>>      2 files changed, 83 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>>>>       *		0 on success.
>>>>>>>       *
>>>>>>>       *		**-ENOENT** if the bpf-local-storage cannot be found.
>>>>>>> + *
>>>>>>> + * int bpf_send_signal(u32 sig)
>>>>>>> + *	Description
>>>>>>> + *		Send signal *sig* to the current task.
>>>>>>> + *	Return
>>>>>>> + *		0 on success or successfully queued.
>>>>>>> + *
>>>>>>> + *		**-EBUSY** if work queue under nmi is full.
>>>>>>> + *
>>>>>>> + *		**-EINVAL** if *sig* is invalid.
>>>>>>> + *
>>>>>>> + *		**-EPERM** if no permission to send the *sig*.
>>>>>>> + *
>>>>>>> + *		**-EAGAIN** if bpf program can try again.
>>>>>>>       */
>>>>>>>      #define __BPF_FUNC_MAPPER(FN)		\
>>>>>>>      	FN(unspec),			\
>>>>>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>>>>>      	FN(strtol),			\
>>>>>>>      	FN(strtoul),			\
>>>>>>>      	FN(sk_storage_get),		\
>>>>>>> -	FN(sk_storage_delete),
>>>>>>> +	FN(sk_storage_delete),		\
>>>>>>> +	FN(send_signal),
>>>>>>>      
>>>>>>>      /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>>>>>       * function eBPF program intends to call
>>>>>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>>>>>> index f92d6ad5e080..f8cd0db7289f 100644
>>>>>>> --- a/kernel/trace/bpf_trace.c
>>>>>>> +++ b/kernel/trace/bpf_trace.c
>>>>>>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>>>>>>>      	.arg3_type	= ARG_ANYTHING,
>>>>>>>      };
>>>>>>>      
>>>>>>> +struct send_signal_irq_work {
>>>>>>> +	struct irq_work irq_work;
>>>>>>> +	u32 sig;
>>>>>>> +};
>>>>>>> +
>>>>>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>>>>>> +
>>>>>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>>>>>> +{
>>>>>>> +	struct send_signal_irq_work *work;
>>>>>>> +
>>>>>>> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
>>>>>>> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>>>>>> +}
>>>>>>> +
>>>>>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>>>>>> +{
>>>>>>> +	struct send_signal_irq_work *work = NULL;
>>>>>>> +
>>>>>>> +	/* Similar to bpf_probe_write_user, task needs to be
>>>>>>> +	 * in a sound condition and kernel memory access be
>>>>>>> +	 * permitted in order to send signal to the current
>>>>>>> +	 * task.
>>>>>>> +	 */
>>>>>>> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
>>>>>>> +		return -EPERM;
>>>>>>> +	if (unlikely(uaccess_kernel()))
>>>>>>> +		return -EPERM;
>>>>>>> +	if (unlikely(!nmi_uaccess_okay()))
>>>>>>> +		return -EPERM;
>>>>>>> +
>>>>>>> +	if (in_nmi()) {
>>>>>>
>>>>>> Hm, bit confused, can't this only be done out of process context in
>>>>>> general since only there current points to e.g. hhvm? I'm probably
>>>>>> missing something. Could you elaborate?
>>>>>
>>>>> That is true. If in nmi, we are out of process context, and in nmi
>>>>> context we use an irq_work here since group_send_sig_info() has a
>>>>> spinlock inside. The bpf program (e.g., a perf_event program) needs
>>>>> to check that it is running with the right current (e.g., by pid)
>>>>> before calling this helper.
>>>>>
>>>>> Does this address your question?
>>>
>>> Thanks, Daniel. The questions below are really good ones, which I
>>> did not really think through with irq_work.
>>>
>>>> Hm, but how is it guaranteed that 'current' inside the callback is still
>>>> the very same you intend to send the signal to?
>>>
>>> I went through the irq_work infrastructure. It looks like "current"
>>> may change. irq_work is registered as an interrupt on x86. After the
>>> nmi, some lower priority interrupts get a chance to execute,
>>> including irq_work. But some other interrupts, like timer_interrupt
>>> and reschedule_interrupt, may change "current". Since we are still
>>> in interrupt context, the task should not be destroyed, so the task
>>> structure pointer is still valid.
>>>
>>> I will pass the "current" task struct pointer to irq_work as well.
>>> This is similar to what we did in stackmap.c:
>>>      work->sem = &current->mm->mmap_sem;
>>>      irq_work_queue(&work->irq_work);
>>> At irq_work_run() time, the previous "current" from the nmi should
>>> still be valid.
>>>
>>>> What happens if you're in softirq and send SIGKILL to yourself? Is this
>>>> ignored/handled gracefully in such case?
>>>
>>> It is not ignored. It is handled gracefully in this case. I tried my
>>> example sending SIGKILL. The call stack is shown below.
>>>
>>> [   24.211943]  bpf_send_signal+0x9/0xb0
>>> [   24.212427]  bpf_prog_fec6e7cc664d5b91_bpf_send_signal_test+0x228/0x1000
>>> [   24.213249]  ? bpf_overflow_handler+0xb7/0x180
>>> [   24.213853]  ? __perf_event_overflow+0x51/0xe0
>>> [   24.214385]  ? perf_swevent_hrtimer+0x10a/0x160
>>> [   24.214965]  ? __update_load_avg_cfs_rq+0x1a9/0x1c0
>>> [   24.215609]  ? task_tick_fair+0x50/0x690
>>> [   24.216104]  ? run_timer_softirq+0x208/0x490
>>> [   24.216637]  ? timerqueue_del+0x1e/0x40
>>> [   24.217111]  ? task_clock_event_del+0x10/0x10
>>> [   24.217658]  ? __hrtimer_run_queues+0x10d/0x2c0
>>> [   24.218217]  ? hrtimer_interrupt+0x122/0x270
>>> [   24.218765]  ? rcu_irq_enter+0x31/0x110
>>> [   24.219223]  ? smp_apic_timer_interrupt+0x67/0x160
>>> [   24.219842]  ? apic_timer_interrupt+0xf/0x20
>>> [   24.220383]  </IRQ>
>>> [   24.220655]  ? event_sched_out.isra.108+0x150/0x150
>>> [   24.221271]  ? smp_call_function_single+0xdc/0x100
>>> [   24.221898]  ? perf_event_sysfs_show+0x20/0x20
>>> [   24.222469]  ? cpu_function_call+0x42/0x60
>>> [   24.222982]  ? cpu_clock_event_read+0x10/0x10
>>> [   24.223525]  ? event_function_call+0xe6/0xf0
>>> [   24.224053]  ? event_sched_out.isra.108+0x150/0x150
>>> [   24.224657]  ? perf_remove_from_context+0x20/0x70
>>> [   24.225262]  ? perf_event_release_kernel+0x106/0x2e0
>>> [   24.225896]  ? perf_release+0xc/0x10
>>> [   24.226347]  ? __fput+0xc9/0x230
>>> [   24.226767]  ? task_work_run+0x83/0xb0
>>> [   24.227243]  ? do_exit+0x300/0xc50
>>> [   24.227674]  ? syscall_trace_enter+0x1c9/0x2d0
>>> [   24.228223]  ? do_group_exit+0x39/0xb0
>>> [   24.228695]  ? __x64_sys_exit_group+0x14/0x20
>>> [   24.229270]  ? do_syscall_64+0x49/0x130
>>> [   24.229762]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> I see the task is killed, other processes are not impacted, and
>>> there is no kernel crash/warning.
> 
>> Hm, but I rather meant the case where we're in_serving_softirq(), e.g.
>> when processing packets on rx, and you send a signal to yourself. Shouldn't
>> we bail out from the helper in such a situation as well?

Just want to clarify: are you concerned with safety or correctness?

For safety: if we do send a signal here, we may wreck the system?

For correctness: you mean the information we use to send a signal
to the process is not quite right if in_serving_softirq()? E.g.,
the performance counter overflow may be caused by the softirq rather
than by the process itself? So in this case, we should only send a
signal to the process when in process context, or in nmi (not serving
softirq)?

If it is about correctness, do you think we should add a "flags"
parameter to the bpf_send_signal() helper such that:
    . default: do not check in_serving_softirq()
    . bit0: if set, bail out if in_serving_softirq()
    . other bits: reserved
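The flag layout proposed above could be sketched as follows. All names
here are hypothetical: the posted v2 helper takes no flags argument,
and adding one would turn it into BPF_CALL_2.

```
/* Hypothetical opt-in flag: bail out when serving a softirq. */
#define BPF_F_SIGNAL_NO_SOFTIRQ		(1ULL << 0)	/* bit0 */

BPF_CALL_2(bpf_send_signal, u32, sig, u64, flags)
{
	if (unlikely(flags & ~BPF_F_SIGNAL_NO_SOFTIRQ))
		return -EINVAL;		/* reserved bits must be zero */

	if ((flags & BPF_F_SIGNAL_NO_SOFTIRQ) && in_serving_softirq())
		return -EPERM;

	/* ... existing checks and process/nmi paths as in the patch ... */
}
```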

> 
> Thanks,
> Daniel
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper
  2019-05-23 23:54               ` Yonghong Song
@ 2019-05-24 21:32                 ` Daniel Borkmann
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Borkmann @ 2019-05-24 21:32 UTC (permalink / raw)
  To: Yonghong Song, bpf, netdev
  Cc: Alexei Starovoitov, Kernel Team, Peter Zijlstra

On 05/24/2019 01:54 AM, Yonghong Song wrote:
> 
> 
> On 5/23/19 4:08 PM, Daniel Borkmann wrote:
>> On 05/23/2019 11:30 PM, Yonghong Song wrote:
>>> On 5/23/19 2:07 PM, Yonghong Song wrote:
>>>> On 5/23/19 9:28 AM, Daniel Borkmann wrote:
>>>>> On 05/23/2019 05:58 PM, Yonghong Song wrote:
>>>>>> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>>>>>>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>>>>>>> This patch tries to solve the following specific use case.
>>>>>>>>
>>>>>>>> Currently, bpf program can already collect stack traces
>>>>>>>> through kernel function get_perf_callchain()
>>>>>>>> when certain events happens (e.g., cache miss counter or
>>>>>>>> cpu clock counter overflows). But such stack traces are
>>>>>>>> not enough for jitted programs, e.g., hhvm (jited php).
>>>>>>>> To get real stack trace, jit engine internal data structures
>>>>>>>> need to be traversed in order to get the real user functions.
>>>>>>>>
>>>>>>>> bpf program itself may not be the best place to traverse
>>>>>>>> the jit engine as the traversing logic could be complex and
>>>>>>>> it is not a stable interface either.
>>>>>>>>
>>>>>>>> Instead, hhvm implements a signal handler,
>>>>>>>> e.g. for SIGALARM, and a set of program locations which
>>>>>>>> it can dump stack traces. When it receives a signal, it will
>>>>>>>> dump the stack in next such program location.
>>>>>>>>
>>>>>>>> Such a mechanism can be implemented in the following way:
>>>>>>>>       . a perf ring buffer is created between bpf program
>>>>>>>>         and tracing app.
>>>>>>>>       . once a particular event happens, bpf program writes
>>>>>>>>         to the ring buffer and the tracing app gets notified.
>>>>>>>>       . the tracing app sends a signal SIGALARM to the hhvm.
>>>>>>>>
>>>>>>>> But this method could have large delays and causing profiling
>>>>>>>> results skewed.
>>>>>>>>
>>>>>>>> This patch implements bpf_send_signal() helper to send
>>>>>>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>>>>>>
>>>>>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>>>>> ---
>>>>>>>>      include/uapi/linux/bpf.h | 17 +++++++++-
>>>>>>>>      kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      2 files changed, 83 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>>>>>       *		0 on success.
>>>>>>>>       *
>>>>>>>>       *		**-ENOENT** if the bpf-local-storage cannot be found.
>>>>>>>> + *
>>>>>>>> + * int bpf_send_signal(u32 sig)
>>>>>>>> + *	Description
>>>>>>>> + *		Send signal *sig* to the current task.
>>>>>>>> + *	Return
>>>>>>>> + *		0 on success or successfully queued.
>>>>>>>> + *
>>>>>>>> + *		**-EBUSY** if work queue under nmi is full.
>>>>>>>> + *
>>>>>>>> + *		**-EINVAL** if *sig* is invalid.
>>>>>>>> + *
>>>>>>>> + *		**-EPERM** if no permission to send the *sig*.
>>>>>>>> + *
>>>>>>>> + *		**-EAGAIN** if bpf program can try again.
>>>>>>>>       */
>>>>>>>>      #define __BPF_FUNC_MAPPER(FN)		\
>>>>>>>>      	FN(unspec),			\
>>>>>>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>>>>>>      	FN(strtol),			\
>>>>>>>>      	FN(strtoul),			\
>>>>>>>>      	FN(sk_storage_get),		\
>>>>>>>> -	FN(sk_storage_delete),
>>>>>>>> +	FN(sk_storage_delete),		\
>>>>>>>> +	FN(send_signal),
>>>>>>>>      
>>>>>>>>      /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>>>>>>       * function eBPF program intends to call
>>>>>>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>>>>>>> index f92d6ad5e080..f8cd0db7289f 100644
>>>>>>>> --- a/kernel/trace/bpf_trace.c
>>>>>>>> +++ b/kernel/trace/bpf_trace.c
>>>>>>>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>>>>>>>>      	.arg3_type	= ARG_ANYTHING,
>>>>>>>>      };
>>>>>>>>      
>>>>>>>> +struct send_signal_irq_work {
>>>>>>>> +	struct irq_work irq_work;
>>>>>>>> +	u32 sig;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>>>>>>> +
>>>>>>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>>>>>>> +{
>>>>>>>> +	struct send_signal_irq_work *work;
>>>>>>>> +
>>>>>>>> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
>>>>>>>> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>>>>>>> +{
>>>>>>>> +	struct send_signal_irq_work *work = NULL;
>>>>>>>> +
>>>>>>>> +	/* Similar to bpf_probe_write_user, task needs to be
>>>>>>>> +	 * in a sound condition and kernel memory access be
>>>>>>>> +	 * permitted in order to send signal to the current
>>>>>>>> +	 * task.
>>>>>>>> +	 */
>>>>>>>> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
>>>>>>>> +		return -EPERM;
>>>>>>>> +	if (unlikely(uaccess_kernel()))
>>>>>>>> +		return -EPERM;
>>>>>>>> +	if (unlikely(!nmi_uaccess_okay()))
>>>>>>>> +		return -EPERM;
>>>>>>>> +
>>>>>>>> +	if (in_nmi()) {
>>>>>>>
>>>>>>> Hm, bit confused, can't this only be done out of process context in
>>>>>>> general since only there current points to e.g. hhvm? I'm probably
>>>>>>> missing something. Could you elaborate?
>>>>>>
>>>>>> That is true. If in nmi, we are out of process context, and in
>>>>>> nmi context we use an irq_work here since group_send_sig_info()
>>>>>> has a spinlock inside. The bpf program (e.g., a perf_event
>>>>>> program) needs to check that it is running with the right current
>>>>>> (e.g., by pid) before calling this helper.
>>>>>>
>>>>>> Does this address your question?
>>>>
>>>> Thanks, Daniel. The questions below are really good ones, which I
>>>> did not really think through with irq_work.
>>>>
>>>>> Hm, but how is it guaranteed that 'current' inside the callback is still
>>>>> the very same you intend to send the signal to?
>>>>
>>>> I went through the irq_work infrastructure. It looks like "current"
>>>> may change. irq_work is registered as an interrupt on x86. After
>>>> the nmi, some lower priority interrupts get a chance to execute,
>>>> including irq_work. But some other interrupts, like timer_interrupt
>>>> and reschedule_interrupt, may change "current". Since we are still
>>>> in interrupt context, the task should not be destroyed, so the task
>>>> structure pointer is still valid.
>>>>
>>>> I will pass the "current" task struct pointer to irq_work as well.
>>>> This is similar to what we did in stackmap.c:
>>>>      work->sem = &current->mm->mmap_sem;
>>>>      irq_work_queue(&work->irq_work);
>>>> At irq_work_run() time, the previous "current" from the nmi should
>>>> still be valid.
>>>>
>>>>> What happens if you're in softirq and send SIGKILL to yourself? Is this
>>>>> ignored/handled gracefully in such case?
>>>>
>>>> It is not ignored. It is handled gracefully in this case. I tried
>>>> my example sending SIGKILL. The call stack is shown below.
>>>>
>>>> [   24.211943]  bpf_send_signal+0x9/0xb0
>>>> [   24.212427]  bpf_prog_fec6e7cc664d5b91_bpf_send_signal_test+0x228/0x1000
>>>> [   24.213249]  ? bpf_overflow_handler+0xb7/0x180
>>>> [   24.213853]  ? __perf_event_overflow+0x51/0xe0
>>>> [   24.214385]  ? perf_swevent_hrtimer+0x10a/0x160
>>>> [   24.214965]  ? __update_load_avg_cfs_rq+0x1a9/0x1c0
>>>> [   24.215609]  ? task_tick_fair+0x50/0x690
>>>> [   24.216104]  ? run_timer_softirq+0x208/0x490
>>>> [   24.216637]  ? timerqueue_del+0x1e/0x40
>>>> [   24.217111]  ? task_clock_event_del+0x10/0x10
>>>> [   24.217658]  ? __hrtimer_run_queues+0x10d/0x2c0
>>>> [   24.218217]  ? hrtimer_interrupt+0x122/0x270
>>>> [   24.218765]  ? rcu_irq_enter+0x31/0x110
>>>> [   24.219223]  ? smp_apic_timer_interrupt+0x67/0x160
>>>> [   24.219842]  ? apic_timer_interrupt+0xf/0x20
>>>> [   24.220383]  </IRQ>
>>>> [   24.220655]  ? event_sched_out.isra.108+0x150/0x150
>>>> [   24.221271]  ? smp_call_function_single+0xdc/0x100
>>>> [   24.221898]  ? perf_event_sysfs_show+0x20/0x20
>>>> [   24.222469]  ? cpu_function_call+0x42/0x60
>>>> [   24.222982]  ? cpu_clock_event_read+0x10/0x10
>>>> [   24.223525]  ? event_function_call+0xe6/0xf0
>>>> [   24.224053]  ? event_sched_out.isra.108+0x150/0x150
>>>> [   24.224657]  ? perf_remove_from_context+0x20/0x70
>>>> [   24.225262]  ? perf_event_release_kernel+0x106/0x2e0
>>>> [   24.225896]  ? perf_release+0xc/0x10
>>>> [   24.226347]  ? __fput+0xc9/0x230
>>>> [   24.226767]  ? task_work_run+0x83/0xb0
>>>> [   24.227243]  ? do_exit+0x300/0xc50
>>>> [   24.227674]  ? syscall_trace_enter+0x1c9/0x2d0
>>>> [   24.228223]  ? do_group_exit+0x39/0xb0
>>>> [   24.228695]  ? __x64_sys_exit_group+0x14/0x20
>>>> [   24.229270]  ? do_syscall_64+0x49/0x130
>>>> [   24.229762]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>
>>>> I see the task is killed, other processes are not impacted, and
>>>> there is no kernel crash/warning.
>>
>> Hm, but I rather meant the case where we're in_serving_softirq(),
>> e.g. when processing packets on rx, and you send a signal to yourself.
>> Shouldn't we bail out from the helper in such a situation as well?
> 
> Just want to clarify: are you concerned with safety or correctness?
> 
> For safety: if we do send a signal here, could we wreck the system?
> 
> For correctness: do you mean the information we use to send a signal
> to the process is not quite right if in_serving_softirq()? E.g., the
> performance counter overflow may be caused by the softirq rather than
> by the process itself? In that case, should we only send a signal to
> the process when in process context or in nmi (not while serving
> softirq)?
> 
> If it is about correctness, do you think we should add a "flags"
> parameter to the bpf_send_signal() helper such that:
>     . default: do not check in_serving_softirq()
>     . bit0: if set, bail out if in_serving_softirq()
>     . other bits: reserved

Scratch my thought, we do bail out in the case of PF_KTHREAD, so this
should be okay. I was thinking in terms of both: not wrecking the
system / messing with kthreads, and correctness.



Thread overview: 19+ messages
2019-05-22  5:39 [PATCH bpf-next v2 0/3] bpf: implement bpf_send_signal() helper Yonghong Song
2019-05-22  5:39 ` [PATCH bpf-next v2 1/3] " Yonghong Song
2019-05-23 15:41   ` Daniel Borkmann
2019-05-23 15:58     ` Yonghong Song
2019-05-23 16:28       ` Daniel Borkmann
2019-05-23 21:07         ` Yonghong Song
2019-05-23 21:30           ` Yonghong Song
2019-05-23 23:08             ` Daniel Borkmann
2019-05-23 23:54               ` Yonghong Song
2019-05-24 21:32                 ` Daniel Borkmann
2019-05-22  5:39 ` [PATCH bpf-next v2 2/3] tools/bpf: sync bpf uapi header bpf.h to tools directory Yonghong Song
2019-05-22  5:39 ` [PATCH bpf-next v2 3/3] tools/bpf: add a selftest for bpf_send_signal() helper Yonghong Song
2019-05-22 18:48   ` Andrii Nakryiko
2019-05-22 19:38     ` Yonghong Song
2019-05-22 19:10   ` Stanislav Fomichev
2019-05-22 19:44     ` Yonghong Song
2019-05-22 16:38 ` [PATCH bpf-next v2 0/3] bpf: implement " Stanislav Fomichev
2019-05-22 16:43   ` Alexei Starovoitov
2019-05-22 17:11     ` Stanislav Fomichev
